# Part 2 - Zero-shot summaries

In this part we will use Hugging Face's high-level Pipeline API to create summaries with a pre-trained model. There are three main steps involved when you pass some text to a pipeline:

1) The text is preprocessed into a format the model can understand.

2) The preprocessed inputs are passed to the model.

3) The predictions of the model are post-processed, so you can make sense of them.
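
The three steps above can be sketched by hand for a single text. This is a minimal illustration, not the pipeline's actual internals: `summarize_manually` is a hypothetical helper, and it assumes `tokenizer` and `model` are a Hugging Face tokenizer and sequence-to-sequence model loaded for the same checkpoint, using PyTorch tensors.

```python
def summarize_manually(text, tokenizer, model, min_length=5, max_length=20):
    """Run the three pipeline steps by hand for a single text (illustrative helper)."""
    # 1) Preprocess: turn the raw string into token ids (PyTorch tensors),
    #    truncating inputs that are longer than the tokenizer's maximum length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # 2) Model: generate the token ids of the summary.
    output_ids = model.generate(**inputs, min_length=min_length, max_length=max_length)

    # 3) Post-process: decode the generated ids back into readable text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Assuming the objects are loaded with `AutoTokenizer.from_pretrained(name)` and `AutoModelForSeq2SeqLM.from_pretrained(name)` from `transformers`, calling `summarize_manually(some_text, tokenizer, model)` should behave much like the pipeline call, although the pipeline may apply additional defaults.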

In [None]:
from transformers import pipeline
summarizer = pipeline("summarization")

The next cell shows which model the pipeline uses by default. The same information can be found in the source code for pipelines: https://huggingface.co/transformers/_modules/transformers/pipelines.html

In [None]:
# The model name (or path) is stored on the model's config object
summarizer.model.config._name_or_path

The default model for the summarisation task is https://huggingface.co/sshleifer/distilbart-cnn-12-6, which has been specifically trained on two datasets: https://huggingface.co/datasets/cnn_dailymail and https://huggingface.co/datasets/xsum. We will keep using this model, but a different one can be selected by passing its name to `pipeline`, as in the commented-out line below. All models trained for summarisation are listed here: https://huggingface.co/models?pipeline_tag=summarization&sort=downloads

In [None]:
# summarizer = pipeline("summarization", model='facebook/bart-large-cnn')

In [None]:
import pandas as pd
df_test = pd.read_csv('data/test.csv')
ref_summaries = list(df_test['summary'])
texts = list(df_test['text'])

Testing the pipeline with an abstract from the test dataset

In [None]:
texts[0]

In [None]:
summarizer(texts[0], max_length=20)

Running the pipeline over all 2,000 examples. This should take around 50 minutes, so we print a counter every 100 examples to keep track of the progress.

In [None]:
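candidate_summaries = []

for i, text in enumerate(texts):
    # Print a progress marker every 100 examples
    if i % 100 == 0:
        print(i)
    # The pipeline returns a list with one dict per input text
    candidate = summarizer(text, min_length=5, max_length=20)
    candidate_summaries.append(candidate[0]['summary_text'])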
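
The pipeline also accepts a list of texts, so the loop above could be run in batches. The sketch below shows one way to do that. `batched` and `summarize_in_batches` are hypothetical helpers, `summarize_fn` stands for any callable that maps a list of texts to a list of `{'summary_text': ...}` dicts, and the batch size of 8 is an arbitrary assumption. Whether batching is actually faster depends on the hardware.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def summarize_in_batches(texts, summarize_fn, batch_size=8):
    """Summarise `texts` batch by batch, printing progress after each batch."""
    summaries = []
    for batch in batched(texts, batch_size):
        # summarize_fn is expected to return one dict per input text
        outputs = summarize_fn(batch)
        summaries.extend(out["summary_text"] for out in outputs)
        print(f"{len(summaries)}/{len(texts)} done")
    return summaries
```

With the pipeline from this notebook, a call might look like `summarize_in_batches(texts, lambda batch: summarizer(batch, min_length=5, max_length=20, truncation=True))`, where `truncation=True` asks the pipeline to cut off inputs that exceed the model's maximum input length.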

Saving the candidate summaries in case we want to investigate further.

In [None]:
# The context manager closes the file even if writing fails part-way
with open("summaries/zero-shot-summaries.txt", "w") as file:
    for s in candidate_summaries:
        file.write(s + "\n")
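
The one-summary-per-line format only round-trips cleanly if no summary contains a line break. The sketch below is one possible way to guard against that and to read the file back later. `save_summaries` and `load_summaries` are hypothetical helpers, and collapsing whitespace and creating the output directory are assumptions, not part of the original cell.

```python
from pathlib import Path


def save_summaries(summaries, path):
    """Write one summary per line, collapsing any internal whitespace runs."""
    path = Path(path)
    # Create the target directory if it does not exist yet
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for s in summaries:
            # split()/join removes newlines so each summary stays on one line
            f.write(" ".join(s.split()) + "\n")


def load_summaries(path):
    """Read the summaries back as a list of strings, one per line."""
    with Path(path).open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

For example, `save_summaries(candidate_summaries, "summaries/zero-shot-summaries.txt")` followed by `load_summaries("summaries/zero-shot-summaries.txt")` should return a list with the same number of entries as `candidate_summaries`.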

In [None]:
candidate_summaries[:5]

Calculating the ROUGE scores

In [None]:
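# Note: `load_metric` was deprecated in later `datasets` releases in favour of
# the separate `evaluate` library; this cell assumes a version that still provides it.
from datasets import load_metric
metric = load_metric("rouge")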

In [None]:
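def calc_rouge_scores(candidates, references):
    """Return the ROUGE F-measures as percentages, rounded to one decimal."""
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    # `mid` is the central estimate of each aggregated score
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result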

In [None]:
calc_rouge_scores(candidate_summaries, ref_summaries)
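
To build some intuition for what these scores measure, here is a simplified ROUGE-1 F-measure for a single candidate and reference pair. It is only a sketch: `rouge1_f` is a hypothetical helper that uses lower-casing and whitespace tokenisation, with no stemming and no aggregation over examples, so its values will not match the metric used above exactly.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """Simplified ROUGE-1 F-measure based on unigram overlap."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0

    # Count unigrams shared by both texts (clipped to the smaller count)
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(cand_tokens)  # share of candidate words found in the reference
    recall = overlap / len(ref_tokens)      # share of reference words found in the candidate
    return 2 * precision * recall / (precision + recall)
```

For instance, the candidate "the cat" against the reference "the cat sat on the mat" has precision 1 and recall 1/3, which gives an F-measure of 0.5. This also shows how very short summaries, such as those produced with `max_length=20`, can be penalised on recall.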