# Zero-shot summaries

In this part we will use Hugging Face's `pipeline` API to create summaries with a pre-trained model. Passing text to a pipeline involves three main steps:

1) The text is preprocessed into a format the model can understand.

2) The preprocessed inputs are passed to the model.

3) The predictions of the model are post-processed, so you can make sense of them.
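
The three steps above can be sketched by hand. This is a minimal sketch, not how `pipeline` is implemented internally: it assumes a tokenizer and a sequence-to-sequence model with the usual Hugging Face interfaces (`tokenizer(...)`, `model.generate(...)`, `tokenizer.decode(...)`), and the function name `summarize_manually` is made up for illustration.

```python
def summarize_manually(text, tokenizer, model, max_length=60):
    """Hypothetical helper mirroring the three pipeline steps."""
    # 1) Preprocess: turn raw text into model inputs (token IDs, attention mask).
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    # 2) Model: generate the token IDs of the summary.
    output_ids = model.generate(**inputs, max_length=max_length)
    # 3) Post-process: decode the token IDs back into a readable string.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (assumes `transformers` and PyTorch are installed; downloads weights):
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
# model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
# print(summarize_manually("Some long text ...", tokenizer, model))
```

The pipeline wraps these calls, along with device placement and batching, behind a single function call.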

In [22]:
!pip install transformers
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
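
The warning above names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in the first cell of the notebook, before `transformers` is imported and before any `!pip` command forks the process:

```python
import os

# Choose a value explicitly, as the warning message suggests.
# "false" disables tokenizer parallelism; "true" would keep it enabled.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```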




The following cell shows which model the pipeline has loaded. The default model for each task is listed in the source code for pipelines: https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/__init__.py

In [23]:
summarizer.model.config._name_or_path

'facebook/bart-large-cnn'

If no model is specified, the summarisation pipeline defaults to https://huggingface.co/sshleifer/distilbart-cnn-12-6, which has been trained on two datasets: https://huggingface.co/datasets/cnn_dailymail and https://huggingface.co/datasets/xsum. In this notebook we use facebook/bart-large-cnn instead, selected by passing the `model` argument as shown in the cell below. All models trained for summarisation are listed here: https://huggingface.co/models?pipeline_tag=summarization&sort=downloads


In [5]:
# summarizer = pipeline("summarization", model='facebook/bart-large-cnn')

In [7]:
# from datasets import load_dataset

# dataset = load_dataset("bazzhangz/sumdataset")

In [24]:
import pandas as pd
df_test = pd.read_csv('./Arxiv_Preprocessing/test.csv')
ref_summaries = list(df_test['summary'])
texts = list(df_test['text'])
print("Hi")

Hi


In [25]:
texts = texts[:2000]

In [26]:
ref_summaries = ref_summaries[:2000]

In [27]:
len(texts)

2000

In [28]:
len(ref_summaries)

2000

Testing the pipeline with an abstract from the test dataset

In [29]:
texts[0]


'  This review paper presents the results, which cover the study of current problems of approximation theory in abstract linear spaces. Such research has been actively developed since the 2000s, based on the ideas and approaches initiated in the articles by Stepanets. In particular, the review contains results concerning the best, best $n$-term approximations and widths of some functional compacts in the spaces ${\\mathcal S}^p$. Direct and inverse approximation theorems are also formulated in these spaces. '

In [30]:
ref_summaries[0]

'Problems of approximation theory in abstract linear spaces'

In [13]:
summarizer(texts[0], max_length=60)

[{'summary_text': 'This review paper presents the results, which cover the study of current problems of approximation theory in abstract linear spaces. Such research has been actively developed since the 2000s, based on the ideas and approaches initiated in the articles by Stepanets. In particular, the review contains results concerning the'}]

We now run the pipeline over all 2,000 examples. Because this takes a while, we print a counter every 100 examples to keep track of progress. The full run should take around 50 minutes.

In [31]:
candidate_summaries = []

for i, text in enumerate(texts):
    if i % 100 == 0:
        print(i)
    #print(text)
    candidate = summarizer(text, min_length=5, max_length=20, truncation=False)
    candidate_summaries.append(candidate[0]['summary_text'])

0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
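
The loop above calls the pipeline once per text. Pipelines also accept a list of inputs together with a `batch_size` argument, which may be faster on a GPU. The helper below is a sketch under that assumption; `summarize_in_batches` and `chunk_size` are hypothetical names, and it has not been timed against the loop above.

```python
def summarize_in_batches(summarizer, texts, chunk_size=100, batch_size=8, **generate_kwargs):
    """Hypothetical helper: summarize `texts` chunk by chunk, printing progress."""
    summaries = []
    for start in range(0, len(texts), chunk_size):
        print(start)  # progress counter, as in the loop above
        chunk = texts[start:start + chunk_size]
        results = summarizer(chunk, batch_size=batch_size, **generate_kwargs)
        for result in results:
            # Depending on the library version, each item may be a dict
            # or a one-element list of dicts.
            if isinstance(result, list):
                result = result[0]
            summaries.append(result["summary_text"])
    return summaries


# Example usage with the objects defined earlier:
# candidate_summaries = summarize_in_batches(summarizer, texts, min_length=5, max_length=20)
```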


In [32]:
print("Hi")

Hi


Saving the candidate summaries in case we want to investigate further.

In [33]:
!pip install pandas
df = pd.DataFrame(candidate_summaries, columns=["Predictions"])
df.to_csv("./output_Bart_Arxiv_20_False.csv")


In [None]:
# file = open("summaries/zero-shot-summaries.txt", "w")
# for s in candidate_summaries:
#     file.write(s + "\n")
# file.close()
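
The commented-out cell above writes one summary per line. A slightly more defensive sketch uses a context manager so the file is always closed, and flattens any newline inside a summary so the one-line-per-summary layout survives. `write_summaries` is a hypothetical helper name.

```python
from pathlib import Path


def write_summaries(summaries, path):
    """Hypothetical helper: write one summary per line to `path`."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    with path.open("w", encoding="utf-8") as f:
        for summary in summaries:
            # Collapse all whitespace runs (including newlines) to single spaces.
            f.write(" ".join(summary.split()) + "\n")


# Example usage:
# write_summaries(candidate_summaries, "summaries/zero-shot-summaries.txt")
```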

In [34]:
candidate_summaries[:10]

['This review paper presents the results, which cover the study of current problems of approximation theory',
 'In this talk I will describe the deep influence Planck had on the development of statistical',
 'The paper deals with the solution of Shevrin ans Sapir problem. Infinite fin',
 'Ecodriving guidance includes courses or suggestions for human drivers to improve driving behaviour.',
 'Weyl fermions are in the fundamental representation or the two index antisymm',
 'We study an analogue of the classical Bianchi-Darboux transformation for L-',
 'Spectral clustering is a popular algorithm that clusters points using the eigenvalues and',
 'We study the azimuthal structure of the stellar disks of 18 face-on',
 'The ABC effect is found to be very modest - if present at all, which might',
 'The performance of all subsystems of the CMS muon detector has been studied. The']

Calculating the ROUGE scores

In [35]:
!pip install datasets
#!pip install flake8-noqa
!pip install rouge-score
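# Note: `load_metric` was deprecated in later `datasets` releases in favour of the
# separate `evaluate` library; this cell assumes a version that still provides it.
from datasets import load_metric
metric = load_metric("rouge")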

In [36]:
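def calc_rouge_scores(candidates, references):
    """Return the mid-point ROUGE F-measures as percentages, rounded to one decimal."""
    result = metric.compute(predictions=candidates, references=references, use_stemmer=True)
    # Each value is an aggregate with low/mid/high estimates; keep the mid F-measure.
    result = {key: round(value.mid.fmeasure * 100, 1) for key, value in result.items()}
    return result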

In [37]:
calc_rouge_scores(candidate_summaries, ref_summaries)

{'rouge1': 29.9, 'rouge2': 13.8, 'rougeL': 26.1, 'rougeLsum': 26.1}
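
To make the numbers above easier to interpret, the sketch below computes a simplified ROUGE-N F-measure for a single candidate/reference pair. It is an illustration only: it lowercases and splits on whitespace, applies no stemming, and does no aggregation over a dataset, so it will not reproduce the `rouge_score` values exactly. The function name `rouge_n_f1` is hypothetical.

```python
from collections import Counter


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F-measure between two strings (no stemming)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())  # share of candidate n-grams that match
    recall = overlap / sum(ref.values())      # share of reference n-grams that are covered
    return 2 * precision * recall / (precision + recall)


print(round(rouge_n_f1("the cat sat", "the cat", n=1), 3))  # 0.8
print(round(rouge_n_f1("the cat sat", "the cat", n=2), 3))  # 0.667
```

In the first call, two of the three candidate unigrams match (precision 2/3) and both reference unigrams are covered (recall 1), giving an F-measure of 0.8.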