# Comparing Text Summarization Pipelines

---

Objective: In this notebook, we will load pre-trained summarization models from the transformers library provided by Hugging Face. The goal is to get you familiar with loading large language models for any use case.

## Installing Dependencies
----
To get started, we need to install the datasets, transformers, and sentencepiece libraries in our instance. We use pip, the Python package installer, to do so.

In [None]:
!pip install datasets==2.0.0
!pip install transformers
!pip install sentencepiece
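
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is our own and is not part of any of the installed packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the three packages installed in the cell above.
for name, ver in installed_versions(["datasets", "transformers", "sentencepiece"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If datasets does not report 2.0.0, the pinned install above most likely did not take effect in the running kernel.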

## Loading the Dataset
---
For this notebook, we will be using the CNN/DailyMail Dataset. The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

### Loading the Dataset

---
We need to load the 'cnn_dailymail' dataset from the datasets library.

In [None]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", version="3.0.0")




### Visualizing a sample

---
Let's look at a sample from our dataset to get a feel for its contents. We will print the first 500 characters of the article, followed by its summary.

In [None]:
sample = dataset["train"][0]

#Printing the article
print(f"""
Article (excerpt of 500 characters, total length: {len(sample["article"])}):
""")
print(sample["article"][:500])

#Printing the summary
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])


Article (excerpt of 500 characters, total length: 9396):

It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of che

Summary (length: 294):
Syrian official: Obama climbed to the top of the tree, "doesn't know how to get down"
Obama sends a letter to the heads of the House and Senate .
Obama to seek congressional approval on military action against Syria .
Aim is to determine whether CW were used, not by whom, says U.N. spokesman .
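
The two lengths printed above show how strongly the highlights compress the article: 294 of 9396 characters, or roughly 3%. As a small sketch, the ratio can be computed for any record with "article" and "highlights" fields. The helper name `compression_ratio` and the toy record are our own illustrations; in the notebook you would pass `dataset["train"][0]` instead.

```python
def compression_ratio(record):
    """Return len(highlights) / len(article), measured in characters."""
    article, highlights = record["article"], record["highlights"]
    if not article:
        raise ValueError("article is empty")
    return len(highlights) / len(article)


# Stand-in record with made-up text: 1000 article characters, 50 summary characters.
toy_record = {"article": "word " * 200, "highlights": "word " * 10}
print(f"{compression_ratio(toy_record):.3f}")  # 0.050
```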


## Text Summarization Pipelines
---
In this section, we will load the state-of-the-art models that we want to compare.

We will run inference on a sample text from the dataset, namely the first 2,000 characters of the second training article, which we store in the sample_text variable.

We'll collect the generated summaries of each model in a dictionary called summaries.

In [None]:
sample_text = dataset["train"][1]["article"][:2000]
summaries = {}

### Summarization Baseline
---
As a baseline, we will use the sent_tokenize function from the NLTK library. sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which has already been trained and can therefore recognize which characters and punctuation marks begin and end a sentence. We will store the baseline's output under the "baseline" key of summaries.


In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True
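
To see why a trained sentence tokenizer is useful, compare it with a naive rule that splits after every period, exclamation mark, or question mark followed by whitespace. This is only an illustrative sketch: `naive_sent_split` is a hypothetical helper of our own and not part of NLTK.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (no abbreviation handling)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Dr. Smith arrived at 9 a.m. on Sunday. He left early."
print(naive_sent_split(text))
# ['Dr.', 'Smith arrived at 9 a.m.', 'on Sunday.', 'He left early.']
```

The naive rule breaks the first sentence at "Dr." and "a.m.". Punkt is designed to reduce this kind of error by learning which tokens behave like abbreviations.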

We will define a function that returns a three-sentence summary of the input text by taking the first three sentences produced by sent_tokenize.

In [None]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

In [None]:
summaries["baseline"] = three_sentence_summary(sample_text)

### Pegasus
---
The Pegasus model was proposed in PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Google. Pegasus’ pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
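
The gap-sentence idea can be illustrated with a toy example: pick the sentence that overlaps most with the rest of the document, replace it with a mask token, and use the removed sentence as the generation target. This is a simplified sketch and not the actual Pegasus pretraining code. The regex sentence splitter, the unigram-set F1 score used as an importance measure, and the `MASK` string are all assumptions made for illustration.

```python
import re

MASK = "<mask_1>"  # placeholder token chosen for this sketch


def split_sentences(text):
    """Crude regex splitter, used here only to keep the sketch dependency-free."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def unigram_f1(candidate, reference):
    """F1 over the sets of lowercased whitespace tokens in both texts."""
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(c), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(document):
    """Mask the sentence most similar to the rest; return (masked_input, target)."""
    sentences = split_sentences(document)
    if not sentences:
        raise ValueError("document contains no sentences")
    scores = [
        unigram_f1(sent, " ".join(sentences[:i] + sentences[i + 1:]))
        for i, sent in enumerate(sentences)
    ]
    best = max(range(len(sentences)), key=scores.__getitem__)
    masked = sentences.copy()
    masked[best] = MASK
    return " ".join(masked), sentences[best]


doc = ("Bolt won gold in Moscow. The weather was mild. "
       "Bolt won the relay gold for Jamaica in Moscow.")
masked_input, target = gap_sentence_example(doc)
print(masked_input)  # Bolt won gold in Moscow. The weather was mild. <mask_1>
print(target)        # Bolt won the relay gold for Jamaica in Moscow.
```

A model trained on such pairs has to generate the missing sentence from its context, which is close to what a summarizer does at inference time.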

We will load the "pegasus_summarizer" model published by the Hugging Face user tuner007.

In [None]:
from transformers import pipeline, set_seed

set_seed(42) #random seed for reproducibility
pipe = pipeline("summarization", model="tuner007/pegasus_summarizer")
pegasus_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(pegasus_query, max_length=400, clean_up_tokenization_spaces=True)
summaries["pegasus"] = "\n".join(
    sent_tokenize(pipe_out[0]["summary_text"]))

### BART
---
The BART model was proposed in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.

In [None]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))
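
The ROUGE gains quoted in the BART description refer to overlap-based scores between a generated summary and a reference summary. As a rough sketch of the idea, ROUGE-1 can be approximated as an F1 score over shared unigrams. The function below is our own simplification: it uses lowercased whitespace tokens and skips the stemming and tokenization rules that full ROUGE implementations apply, so its numbers will not match official scores.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Approximate ROUGE-1 F1: clipped unigram overlap between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each token counted at most min(count) times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up strings for illustration, loosely based on the sample highlights.
reference = "bolt wins third gold of world championship"
candidate = "bolt won his third gold at the world championships"
print(f"{rouge1_f1(candidate, reference):.2f}")  # 0.50
```

Four tokens are shared (bolt, third, gold, world), giving a precision of 4/9 and a recall of 4/7, which combine to an F1 of 0.5.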

## Comparing Different Summaries

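Now, we will compare the summaries generated by each model. The block labeled ORIGINAL TEXT in the output shows the reference highlights from the dataset, and SUMMARY LENGTH is measured in characters.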

In [None]:
print("ORIGINAL TEXT:")
print(dataset["train"][1]["highlights"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("SUMMARY LENGTH: ", len(summaries[model_name]))
    print("")

ORIGINAL TEXT:
Usain Bolt wins third gold of world championship .
Anchors Jamaica to 4x100m relay victory .
Eighth gold at the championships for Bolt .
Jamaica double up in women's 4x100m relay .

BASELINE
(CNN) -- Usain Bolt rounded off the world championships Sunday by claiming his third gold in Moscow as he anchored Jamaica to victory in the men's 4x100m relay.
The fastest man in the world charged clear of United States rival Justin Gatlin as the Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds.
The U.S finished second in 37.56 seconds with Canada taking the bronze after Britain were disqualified for a faulty handover.
SUMMARY LENGTH:  473

PEGASUS
Jamaica's Usain Bolt won his third gold medal at the world championships on Sunday as he anchored his team to victory in the men's 4x100m relay.
The Jamaican quartet of Nesta Carter, Kemar Bailey-Cole, Nickel Ashmeade and Bolt won in 37.36 seconds.
The US finished second in 37.56 seconds w