# Part 2 - Zero-shot summaries

In this part we will use Hugging Face's high-level Pipeline API to create summaries with a pre-trained model. There are three main steps involved when you pass some text to a pipeline:

1) The text is preprocessed into a format the model can understand.

2) The preprocessed inputs are passed to the model.

3) The predictions of the model are post-processed, so you can make sense of them.
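
The three steps above can be sketched with a deliberately tiny stand-in. This is not how the Hugging Face pipeline is implemented: the vocabulary, the `toy_model` function, and all other names below are hypothetical and exist only to show where preprocessing, the model call, and post-processing sit relative to each other.

```python
# Toy illustration of the three pipeline stages; NOT the Hugging Face implementation.
# The vocabulary and the "model" below are hypothetical stand-ins.

def preprocess(text, vocab):
    """Step 1: turn raw text into integer token ids (unknown words map to 0)."""
    return [vocab.get(word, 0) for word in text.lower().split()]

def toy_model(token_ids, max_length):
    """Step 2: a stand-in 'model' that simply keeps the first max_length ids."""
    return token_ids[:max_length]

def postprocess(token_ids, vocab):
    """Step 3: map predicted ids back to a readable string."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word.get(idx, "<unk>") for idx in token_ids)

def toy_pipeline(text, vocab, max_length=3):
    """Chain the three steps and mimic the list-of-dicts output format."""
    ids = preprocess(text, vocab)
    predicted = toy_model(ids, max_length)
    return [{"summary_text": postprocess(predicted, vocab)}]

vocab = {"<unk>": 0, "light": 1, "scattering": 2, "in": 3, "fluids": 4}
result = toy_pipeline("Light scattering in fluids", vocab, max_length=2)
print(result)
```

A real summarization model generates new token ids rather than copying the first few, but the flow from text to ids to text has the same shape.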

In [53]:
from transformers import pipeline
summarizer = pipeline("summarization")

The following line shows which model the pipeline uses by default:

In [54]:
summarizer.model.config._name_or_path

'sshleifer/distilbart-cnn-12-6'

In [55]:
import pandas as pd
df_test = pd.read_csv('data/test.csv')

In [56]:
ref_summaries = list(df_test['summary'])
texts = list(df_test['text'])

Testing the pipeline with an abstract from the test dataset:

In [None]:
texts[0]

In [59]:
summarizer(texts[0], max_length=20)

[{'summary_text': ' Threefold $X$ has a unique anticanonical section which is a Jacob'}]

Note that the summary stops mid-word: `max_length` caps the number of generated tokens, not the number of words or sentences.

Running the pipeline over all 2,000 examples. Because this takes a while, we print a counter every 50 examples to track progress.

In [60]:
candidate_summaries = []

for i, text in enumerate(texts):
    # Print a progress counter every 50 examples.
    if i % 50 == 0:
        print(i)
    # Further generation settings (e.g. num_beams) can be passed as keyword arguments.
    candidate = summarizer(text, min_length=5, max_length=20)
    candidate_summaries.append(candidate[0]['summary_text'])

0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1250
1300
1350
1400
1450
1500
1550
1600
1650
1700
1750
1800
1850
1900
1950
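
The loop above sends one text at a time. As a minimal sketch of an alternative, the helper below processes the texts in chunks and reports progress after each chunk. `chunked`, `summarize_all`, and `summarize_batch` are hypothetical names; `summarize_batch` stands for any callable that takes a list of texts and returns one dict with a `summary_text` key per text, which is assumed here to match what the pipeline returns for a list input.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def summarize_all(texts, summarize_batch, batch_size=8):
    """Apply summarize_batch chunk by chunk and collect the summary strings.

    summarize_batch is a hypothetical callable: list of texts in,
    list of {'summary_text': ...} dicts out (one per text).
    """
    summaries = []
    for batch in chunked(texts, batch_size):
        outputs = summarize_batch(batch)
        summaries.extend(out["summary_text"] for out in outputs)
        # Progress report: how many texts are done so far.
        print(f"{len(summaries)}/{len(texts)}")
    return summaries

# Stand-in summarizer for demonstration: keeps the first 10 characters.
def fake_summarizer(batch):
    return [{"summary_text": text[:10]} for text in batch]

demo = summarize_all(["a" * 20, "b" * 5, "c" * 12], fake_summarizer, batch_size=2)
```

With the real pipeline, `summarize_batch` could be something like `lambda batch: summarizer(batch, min_length=5, max_length=20)`, provided the list-in, list-out assumption holds for the installed version.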


Saving the candidate summaries in case we want to investigate further.

In [61]:
# Write one summary per line; the context manager closes the file automatically.
with open("zero-shot-summaries.txt", "w", encoding="utf-8") as file:
    for s in candidate_summaries:
        file.write(s + "\n")
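
One caveat with a one-summary-per-line text file is that a summary containing a line break would be split across two lines when read back. A possible workaround, sketched below, stores each summary as a JSON string on its own line. The function names and the `.jsonl` file name are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile

def save_summaries(summaries, path):
    """Write each summary as one JSON-encoded string per line."""
    with open(path, "w", encoding="utf-8") as f:
        for s in summaries:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(s) + "\n")

def load_summaries(path):
    """Read the summaries back in their original order."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Round-trip check on made-up example data in a temporary directory.
example = [" First summary", "Second\nsummary with a line break"]
path = os.path.join(tempfile.mkdtemp(), "zero-shot-summaries.jsonl")
save_summaries(example, path)
restored = load_summaries(path)
```

The round trip preserves leading spaces and embedded line breaks, so the restored list can be compared directly with the reference summaries later.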

In [62]:
candidate_summaries[:5]

[' Threefold $X$ has a unique anticanonical section which is a Jacob',
 ' New affine algebras A_{\\hbar,\\eta}(\\',
 ' Deep Learning (DL) components within a Case-Based Reasoning (CBR)',
 ' An innovative mechanism allows multifold reconfiguration of mechanical rotation of semiconductor nanoent',
 ' Light scattering by inhomogeneities in the index of refraction of a fluid can']

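Computing ROUGE scores to compare the candidate summaries with the reference summaries.

In [1]:
# Requires the datasets and rouge_score packages (pip install datasets rouge_score).
# Note: load_metric is deprecated in newer datasets releases in favor of the evaluate library.
from datasets import load_metric
metric = load_metric("rouge")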

In [64]:
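def calc_rouge_scores(candidates, references):
    """Return the mid F-measure of each ROUGE variant as a percentage."""
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    # Keep only the mid (point-estimate) F-measure and scale it to 0-100.
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result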

In [65]:
calc_rouge_scores(candidate_summaries, ref_summaries)

{'rouge1': 30.3, 'rouge2': 14.0, 'rougeL': 26.1, 'rougeLsum': 26.1}
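
To make the numbers above easier to interpret, here is a minimal sketch of how a ROUGE-1 F-measure can be computed by hand for a single candidate and reference pair. It is a simplification and not the metric implementation used above: it assumes lowercase whitespace tokenization, applies no stemming, and does no aggregation over a dataset. The function name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F-measure between a candidate and a reference string."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # share of candidate words matched
    recall = overlap / sum(ref_counts.values())      # share of reference words covered
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the cat sat", "the cat sat on the mat")
print(round(score * 100, 1))  # precision 3/3, recall 3/6 -> 66.7
```

Short candidates tend to score high on precision and low on recall, which is worth keeping in mind given the `max_length=20` setting used for the summaries above.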