# Comparing Text Summarization Pipelines

---

Objective: In this notebook, we will load pre-trained summarization models from the Transformers library provided by Hugging Face. The goal is to get you familiar with loading large language models for any use case.

## Installing Dependencies
----
To get started, we need to install the datasets and transformers libraries in our instance. We use the pip package manager to do so.

In [None]:
!pip install datasets==2.0.0
!pip install transformers

## Loading the Dataset
---
For this notebook, we will use the CNN/DailyMail dataset, an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading comprehension and abstractive question answering.

### Loading the Dataset

---
We need to load version 3.0.0 of the 'cnn_dailymail' dataset using the datasets library.

In [None]:
from datasets import load_dataset

dataset = load_dataset("---", version="----") #add the dataset name, then the dataset version

### Visualizing a sample

---
Let's print a sample from our dataset to see what it looks like. We will display the first 500 characters of the article, followed by its summary.

In [None]:
sample = dataset["train"][-] #enter the index of the first element

#Printing the article
print(f"""
Article (excerpt of 500 characters, total length: {len(sample["article"])}):
""")
print(sample["article"][---]) #enter the slice that takes the first 500 characters of a string

#Printing the summary
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])

## Text Summarization Pipelines
---
In this section, we will load the state-of-the-art models that we want to compare.

We will run inference on a sample text from the dataset, which we store in the sample_text variable.

We'll collect the generated summaries of each model in a dictionary called summaries.

In [None]:
sample_text = dataset["train"][1]["article"][:2000]
summaries = {}

### Summarization Baseline
---
As a baseline, we will use the sent_tokenize function from the NLTK library. sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which has already been trained and therefore knows which characters and punctuation mark the beginning and end of a sentence. We will store the baseline's output in summaries under the key "baseline".


In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

We will define a function that returns a three-sentence summary of the input text by taking the first three sentences that sent_tokenize produces.

In [None]:
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

In [None]:
summaries["baseline"] = three_sentence_summary(sample_text)
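
To see what this "first three sentences" baseline does without downloading anything, here is a minimal sketch on a toy paragraph. It swaps in a naive regular-expression splitter in place of sent_tokenize, so it is only a rough stand-in: the names naive_sent_split, lead_n_summary, and toy_text are hypothetical and not part of the notebook or of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace.

    A rough stand-in for sent_tokenize, used here only for illustration.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def lead_n_summary(text, n=3):
    """Return the first n sentences, one per line."""
    return "\n".join(naive_sent_split(text)[:n])

# Toy input (made up for this example, not taken from the dataset)
toy_text = (
    "The council met on Monday. It approved the new budget. "
    "Two members voted against it. The next meeting is in May."
)
toy_summary = lead_n_summary(toy_text)
print(toy_summary)
```

A splitter this simple would break on abbreviations such as "Dr. Smith", which is one reason the notebook relies on the trained Punkt tokenizer instead.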

### Pegasus
---
The Pegasus model was proposed by Google researchers in the paper PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Pegasus’ pretraining task is intentionally similar to summarization: important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
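
The gap-sentence idea can be illustrated with a small, dependency-free sketch. It builds one (input, target) pair from a toy document by masking the sentence that shares the most words with the rest of the document. The word-overlap score is a simplified assumption standing in for the importance measure used in the paper, and the MASK string, the helper names, and the toy sentences are all hypothetical.

```python
import re

MASK = "<mask_1>"  # placeholder string for a masked sentence (assumption for this sketch)

def words(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def gap_sentence_example(sentences, n_gaps=1):
    """Mask the n_gaps sentences that overlap most with the rest of the document."""
    scores = []
    for i, sent in enumerate(sentences):
        rest = set()
        for j, other in enumerate(sentences):
            if j != i:
                rest |= words(other)
        sent_words = words(sent)
        # Fraction of this sentence's words that also appear elsewhere in the document
        scores.append(len(sent_words & rest) / max(len(sent_words), 1))
    # Pick the highest-scoring sentences, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_gaps])
    model_input = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return model_input, target

# Toy document (made up for this example)
toy_doc = [
    "The storm hit the coast on Friday.",
    "Officials said the storm closed roads along the coast.",
    "Schools will reopen next week.",
]
model_input, target = gap_sentence_example(toy_doc)
print("INPUT: ", model_input)
print("TARGET:", target)
```

During pretraining, the model would see the masked input and learn to generate the target, which is why the task resembles summarization.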

We will import the "pegasus_summarizer" model from the Hugging Face user tuner007.

In [None]:
#hide_output
from transformers import pipeline, set_seed

set_seed(42) #random seed for reproducibility
pipe = pipeline("summarization", model="tuner007/---") #enter model name
#the summarization pipeline takes the raw article and returns its output under "summary_text"
pipe_out = pipe(sample_text, max_length=512, clean_up_tokenization_spaces=True)
summaries["pegasus"] = "\n".join(
    sent_tokenize(pipe_out[0]["summary_text"]))

### BART
---
The BART model was proposed in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. According to the paper, it matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. We can use the checkpoint fine-tuned for summarization, which is called "facebook/bart-large-cnn".

In [None]:
pipe = pipeline("---", model="---")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

## Comparing Different Summaries

Now, we will compare the summaries generated by each model against the reference summary, which is stored in the dataset's highlights field.

In [None]:
print("REFERENCE SUMMARY:")
print(dataset["train"][1]["highlights"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("SUMMARY LENGTH (characters):", len(summaries[model_name]))
    print("")
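
Reading the summaries side by side is subjective, so a rough numeric check can help. The sketch below computes a unigram-overlap F1 score between a candidate and a reference text. It is a simplified stand-in in the spirit of ROUGE-1, not the official ROUGE implementation, and the function names and toy strings are hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_f1(candidate, reference):
    """F1 over unigram counts shared by candidate and reference."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy data (made up for this example)
toy_reference = "the council approved the new budget"
toy_summaries = {
    "a": "the council approved the budget",
    "b": "schools reopen next week",
}
scores = {name: unigram_f1(text, toy_reference) for name, text in toy_summaries.items()}
for name, score in scores.items():
    print(name.upper(), round(score, 3))
```

In the notebook, the same function could be applied to each entry of summaries with dataset["train"][1]["highlights"] as the reference. A single example gives only a rough indication, so scores like these are best averaged over many articles.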