# Summarization 

This chapter explains the summarization task, first by using the `pipeline` API with pretrained models and then by fine-tuning an encoder-decoder model. It explores two datasets: one from the news domain (CNN/Daily Mail) and one of short dialogues (SAMSum).

* **Extractive vs. abstractive**: In extractive summarization, the model selects sentences from the input text and combines them into a summary. In abstractive summarization, the model generates new sentences.
* **Teacher forcing**: A training strategy for sequence-to-sequence models. In addition to the encoder output, the decoder receives input tokens that consist of the labels shifted by one position, so at each step it predicts the next token given the correct previous tokens.

* **Gradient accumulation**: Instead of computing the gradients for a full batch in one step, the trainer processes several smaller batches and accumulates their gradients before updating the weights. This is useful when memory only allows a very small batch, for example a batch size of 1.
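The label shift behind teacher forcing can be sketched in plain Python. This is only an illustration: `shift_right` is a hypothetical helper, and the token IDs (start token 0, padding 0, end-of-sequence 1) are assumed values, not taken from a real tokenizer.

```python
def shift_right(labels, decoder_start_token_id, pad_token_id, ignore_index=-100):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + list(labels[:-1])
    # Positions excluded from the loss (-100) are not valid token IDs,
    # so they are replaced by the padding token in the decoder input.
    return [pad_token_id if token == ignore_index else token for token in shifted]


# Hypothetical label IDs: three content tokens, end-of-sequence (1), two ignored positions.
labels = [37, 912, 5, 1, -100, -100]
decoder_input_ids = shift_right(labels, decoder_start_token_id=0, pad_token_id=0)

# At every step the decoder sees the correct previous token and must predict the label.
for decoder_input, label in zip(decoder_input_ids, labels):
    print(f"decoder input: {decoder_input:>4}  ->  label to predict: {label}")
```

Because the decoder input at step `t` is the label from step `t - 1`, the model never has to condition on its own (possibly wrong) predictions during training.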

In [None]:
!pip install transformers 
!pip install datasets
!pip install sentencepiece

In [None]:
!pip install sacrebleu
!pip install rouge_score

In [None]:
!pip install py7zr

## CNN/Daily Mail Dataset

The CNN/Daily Mail dataset consists of around 300,000 pairs of news articles and their summaries, which are written as bullet points. The summaries are *abstractive*.


In [1]:
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", version="3.0.0")
print(f"Features: {dataset['train'].column_names}")




Features: ['article', 'highlights', 'id']


In [2]:
# An example from train set 
sample = dataset["train"][1]
print(f"""
Article (excerpt of 500 characters, total length: {len(sample["article"])}):
""")
print(sample["article"][:500])
print(f'\nSummary (length: {len(sample["highlights"])}):')
print(sample["highlights"])


Article (excerpt of 500 characters, total length: 4051):

Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most s

Summary (length: 281):
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .


## Summarization Pipeline 

Transformer models limit the size of their input, typically to around 1,000 tokens or fewer. For now, we truncate each article to its first 2,000 characters so that every model receives the same input.

In [3]:
sample_text = dataset["train"][1]["article"][:2000]
# We'll collect the generated summaries of each model in a dictionary
summaries = {}

In [4]:
# Sentence tokenization using NLTK 
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

string = "The U.S. are a country. The U.N. is an organization."
sent_tokenize(string)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['The U.S. are a country.', 'The U.N. is an organization.']

## Baseline 

The simplest baseline for summarization is to use the first few sentences of the article as the summary.

In [5]:
# Baseline: choosing 3 sentences 
def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])

In [6]:
summaries["baseline"] = three_sentence_summary(sample_text)

## GPT-2

With GPT-2, we can generate a summary by appending "TL;DR" to the end of the input text.

In [10]:
from transformers import pipeline, set_seed


In [7]:
# Summarization pipeline with GPT-2 
set_seed(42)
pipe = pipeline("text-generation", model="gpt2")
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)
summaries["gpt2"] = "\n".join(
    sent_tokenize(pipe_out[0]["generated_text"][len(gpt2_query) :]))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## T5 

T5 is pre-trained on multiple NLP tasks that are all cast as text-to-text problems. For summarization, the input has the form "summarize: <ARTICLE>"; the `summarization` pipeline used below adds this prefix for us.

In [8]:
pipe = pipeline("summarization", model="t5-small")
pipe_out = pipe(sample_text)
summaries["t5"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))



## BART 

BART is an encoder-decoder model pre-trained to reconstruct corrupted inputs. 

In [9]:
pipe = pipeline("summarization", model="ainize/bart-base-cnn")
pipe_out = pipe(sample_text)
summaries["bart"] = "\n".join(sent_tokenize(pipe_out[0]["summary_text"]))

## PEGASUS 

PEGASUS is an encoder-decoder model that is pre-trained to predict masked sentences in multi-sentence texts.

In [8]:
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n")
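PEGASUS marks line breaks with a literal `<n>` token, which is why the cell above replaces ` .<n>` instead of calling `sent_tokenize`. A minimal sketch of that post-processing, using a made-up string rather than real model output:

```python
# Hypothetical raw string in the shape PEGASUS produces: sentences separated by " .<n>".
raw_summary = "First point .<n>Second point .<n>Third point ."

# Same replacement as in the cell above: " .<n>" becomes a period plus a newline.
formatted = raw_summary.replace(" .<n>", ".\n")
print(formatted)
```

Note that the last sentence keeps its leading space before the period, because it is not followed by `<n>`.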

## Comparing models 

Note that GPT-2 has not been trained on this dataset at all, T5 has been fine-tuned on this task among others, and BART and PEGASUS have been fine-tuned exclusively on this task.

In [11]:
print("GROUND TRUTH")
print(dataset["train"][1]["highlights"])
print("")

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name])
    print("")

GROUND TRUTH
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .

BASELINE
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.
Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.
MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."

GPT2

-1.
The 9th degree refers to a prisoner who is "forgotten" because of his illness

-2.
As this sentence makes no sense, it probably doesn't hold much of a weight.
Most inmates on the 9th cannot handle an indefinite te

# Measuring the quality of generated text 

* **BLEU**: Instead of looking only at single words, BLEU compares n-grams. It is a precision-based metric: we count how many n-grams of the generated text also occur in the reference and divide that count by the number of n-grams in the generated text.

To limit the effect of repeated words, an n-gram is counted at most as many times as it occurs in the reference.

Precision alone favors short generations, so a brevity penalty is applied to compensate.

Usually the BLEU-4 score is reported, where 4 is the maximum n-gram size.

In [4]:
from datasets import load_metric

bleu_metric = load_metric("sacrebleu")


In [16]:
import pandas as pd
import numpy as np
prediction = "the the the the the the"
references = ["the cat is on the mat"]
print("Prediction : ", prediction)
print("Reference : ", references)
bleu_metric.add(
    prediction=prediction, reference=references)
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

Prediction :  the the the the the the
Reference :  ['the cat is on the mat']


| | Value |
|---|---|
| score | 0.0 |
| counts | [2, 0, 0, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [33.33, 0.0, 0.0, 0.0] |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 6 |


In [17]:
# another example 
prediction = "the cat is on mat"
references = ["the cat is on the mat"]
print("Prediction : ", prediction)
print("Reference : ", references)
bleu_metric.add(
    prediction=prediction, reference=references)
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])

Prediction :  the cat is on mat
Reference :  ['the cat is on the mat']


| | Value |
|---|---|
| score | 57.893007 |
| counts | [5, 3, 2, 1] |
| totals | [5, 4, 3, 2] |
| precisions | [100.0, 75.0, 66.67, 50.0] |
| bp | 0.818731 |
| sys_len | 5 |
| ref_len | 6 |
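The `counts`, `totals`, `bp`, and `score` values reported for the two examples can be reproduced by hand. The sketch below is a simplified BLEU for a single reference with whitespace tokenization and no smoothing; `simple_bleu` and its helpers are illustrative names, and the real `sacrebleu` implementation handles tokenization, smoothing, and multiple references with more care.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def clipped_counts(pred_tokens, ref_tokens, n):
    """Count n-gram matches, clipping each n-gram to its frequency in the reference."""
    pred_counts = Counter(ngrams(pred_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matches, sum(pred_counts.values())


def simple_bleu(prediction, reference, max_n=4):
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0, [], 0.0
    stats = [clipped_counts(pred, ref, n) for n in range(1, max_n + 1)]
    # Brevity penalty: only predictions shorter than the reference are penalized.
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    # Without smoothing, a single zero precision makes the geometric mean zero.
    if any(matches == 0 for matches, _ in stats):
        return 0.0, stats, bp
    log_mean = sum(math.log(matches / total) for matches, total in stats) / max_n
    return 100 * bp * math.exp(log_mean), stats, bp


reference = "the cat is on the mat"
for prediction in ["the the the the the the", "the cat is on mat"]:
    score, stats, bp = simple_bleu(prediction, reference)
    print(f"{prediction!r}: score={score:.6f}, (matches, total)={stats}, bp={bp:.6f}")
```

The first prediction shows the clipping at work: "the" appears six times in the prediction but is counted only twice, because that is how often it occurs in the reference.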


* **ROUGE**: This metric was designed specifically for summarization, where we favor recall over precision. Like BLEU, it counts the n-grams shared by the reference and the generated text. The difference is that ROUGE-N divides this count by the number of n-grams in the reference rather than in the generation.

A separate score, ROUGE-L, measures the longest common subsequence (LCS). The LCS length is normalized by the length of the reference and by the length of the generated text, and the two normalized values are combined into an F-score.
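To make the two definitions concrete, here is a minimal sketch of ROUGE-N recall and a ROUGE-L F-score on whitespace tokens. The function names are illustrative, and equal weighting of precision and recall is assumed; the `rouge_score` package used below additionally normalizes the text and reports precision, recall, and F-measure for every variant.

```python
from collections import Counter


def rouge_n_recall(prediction, reference, n=1):
    """Share of reference n-grams that also appear in the prediction."""
    pred, ref = prediction.split(), reference.split()
    pred_counts = Counter(tuple(pred[i : i + n]) for i in range(len(pred) - n + 1))
    ref_counts = Counter(tuple(ref[i : i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, pred_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, token_a in enumerate(a, start=1):
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_f1(prediction, reference):
    """F-score built from the LCS normalized by prediction and reference length."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


prediction = "the cat is on mat"
reference = "the cat is on the mat"
print("ROUGE-1 recall:", rouge_n_recall(prediction, reference, n=1))
print("ROUGE-2 recall:", rouge_n_recall(prediction, reference, n=2))
print("ROUGE-L F1:    ", rouge_l_f1(prediction, reference))
```

For this pair, five of the six reference unigrams and three of the five reference bigrams are recovered, and the LCS is the whole prediction (five tokens).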

In [5]:
rouge_metric = load_metric("rouge")

Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

In [21]:
# evaluate models' results 
reference = dataset["train"][1]["highlights"]
records = []
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries:
    rouge_metric.add(prediction=summaries[model_name], reference=reference)
    score = rouge_metric.compute()
    rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
    records.append(rouge_dict)
pd.DataFrame.from_records(records, index=summaries.keys())

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.365079 | 0.145161 | 0.206349 | 0.285714 |
| gpt2 | 0.275229 | 0.018692 | 0.146789 | 0.275229 |
| t5 | 0.195652 | 0.022222 | 0.108696 | 0.173913 |
| bart | 0.255814 | 0.095238 | 0.162791 | 0.255814 |
| pegasus | 0.323232 | 0.206186 | 0.282828 | 0.323232 |


## Evaluating PEGASUS on the CNN/Daily Mail Dataset



In [22]:
# Evaluate function
def evaluate_summaries_baseline(dataset, 
                                metric,
                                column_text="article",
                                column_summary="highlights"):
    summaries = [three_sentence_summary(text) for text in dataset[column_text]]
    metric.add_batch(predictions=summaries,
                     references=dataset[column_summary])
    score = metric.compute()
    return score

In [23]:
# Instead of using all samples from test,
# we will sample 1K 
test_sampled = dataset["test"].shuffle(seed=42).select(range(1000))

score = evaluate_summaries_baseline(test_sampled, rouge_metric)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame.from_dict(rouge_dict, orient="index", columns=["baseline"]).T

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| baseline | 0.389276 | 0.171296 | 0.245061 | 0.354239 |


In [6]:
from tqdm import tqdm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def evaluate_summaries_pegasus(dataset, metric, model, tokenizer,
                               batch_size=16, device=device,
                               column_text="article",
                               column_summary="highlights"):
    #form batches for the dataset  
    article_batches = list(chunks(dataset[column_text], batch_size))
    target_batches = list(chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):
        # each batch is tokenized 
        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")
        # feed the tokenized batches to the generation function  
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)
        # decode the final summaries 
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        # compute ROUGE metric for each batch 
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    score = metric.compute()
    return score
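To see how the `chunks` helper splits a dataset column into batches, the sketch below repeats its definition and applies it to a small made-up list. The last batch is simply shorter when the length is not a multiple of the batch size.

```python
import math


def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


# Hypothetical stand-in for dataset["article"].
articles = [f"article {i}" for i in range(5)]
batches = list(chunks(articles, batch_size=2))
print(batches)

# The number of batches is ceil(n / batch_size),
# e.g. the 1,000 sampled test articles with batch_size=8.
print(math.ceil(1000 / 8))
```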


In [17]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
score = evaluate_summaries_pegasus(test_sampled, rouge_metric,
                                   model, tokenizer, batch_size=8)
rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=["pegasus"])

# Training a summarization model 

In this section, we build a text summarization model using the SAMSum dataset, which contains short dialogues along with their summaries.


In [7]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)


In [None]:
from datasets import load_dataset

dataset_samsum = load_dataset("samsum")
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print("\nDialogue:")
print(dataset_samsum["test"][0]["dialogue"])
print("\nSummary:")
print(dataset_samsum["test"][0]["summary"])

In [11]:
# Running PEGASUS on SAMSum example 
from transformers import pipeline, set_seed

pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail")

pipe_out = pipe(dataset_samsum["test"][0]["dialogue"])
print("Generated Summary:")
print(pipe_out[0]["summary_text"].replace(" .<n>", ".\n"))
print("Reference :")
print(dataset_samsum["test"][0]["summary"])

Your max_length is set to 128, but you input_length is only 122. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Generated Summary:
Amanda: Ask Larry Amanda: He called her last time we were at the park together.
Hannah: I'd rather you texted him.
Amanda: Just text him .
Reference :
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


In [None]:
import pandas as pd

# Defined in an earlier cell; repeated here so this cell also runs in a fresh session.
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

score = evaluate_summaries_pegasus(dataset_samsum["test"], rouge_metric, model,
                                   tokenizer, column_text="dialogue",
                                   column_summary="summary", batch_size=8)

rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=["pegasus"])

100%|██████████| 103/103 [18:13<00:00, 10.62s/it]

## Looking at the distribution of lengths in the dataset

In [None]:
# Token-length distributions of the dialogues and summaries
import matplotlib.pyplot as plt 

d_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["dialogue"]]
s_len = [len(tokenizer.encode(s)) for s in dataset_samsum["train"]["summary"]]

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)
axes[0].hist(d_len, bins=20, color="C0", edgecolor="C0")
axes[0].set_title("Dialogue Token Length")
axes[0].set_xlabel("Length")
axes[0].set_ylabel("Count")
axes[1].hist(s_len, bins=20, color="C0", edgecolor="C0")
axes[1].set_title("Summary Token Length")
axes[1].set_xlabel("Length")
plt.tight_layout()
plt.show()


## Fine-tuning PEGASUS 

In [None]:
# Tokenization function 
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch["dialogue"], max_length=1024,
                                truncation=True)

    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch["summary"], max_length=128,
                                     truncation=True)

    return {"input_ids": input_encodings["input_ids"],
            "attention_mask": input_encodings["attention_mask"],
            "labels": target_encodings["input_ids"]}

dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features,
                                       batched=True)
columns = ["input_ids", "labels", "attention_mask"]
dataset_samsum_pt.set_format(type="torch", columns=columns)

In [None]:
from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
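The tokenization function above does not pad, so the collator has to bring every batch to a common length. `DataCollatorForSeq2Seq` pads each batch to its longest sequence and pads the labels with -100, so those positions are ignored by the loss. The sketch below mimics that idea on plain lists; `pad_batch`, the padding ID 0, and right-side padding are assumptions for illustration, and the real collator also returns tensors and can prepare the decoder inputs when it is given the model.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad inputs and labels to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Label padding uses -100 so that the loss skips these positions.
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


# Two hypothetical tokenized examples of different lengths.
features = [
    {"input_ids": [11, 12, 13, 14, 1], "labels": [21, 22, 1]},
    {"input_ids": [15, 16, 1], "labels": [23, 1]},
]
batch = pad_batch(features)
for name, rows in batch.items():
    print(name, rows)
```

Padding per batch rather than to a fixed `max_length` keeps the tensors as small as the longest example in each batch allows.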

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='pegasus-samsum', 
    num_train_epochs=1,
    warmup_steps=500,
    per_device_train_batch_size=1, 
    per_device_eval_batch_size=1,
    weight_decay=0.01, 
    logging_steps=10, 
    push_to_hub=False,
    evaluation_strategy='steps', 
    eval_steps=500, 
    save_steps=1e6,
    gradient_accumulation_steps=16)

trainer = Trainer(model=model, 
                  args=training_args,
                  tokenizer=tokenizer, 
                  data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["train"],
                  eval_dataset=dataset_samsum_pt["validation"])
trainer.train()
score = evaluate_summaries_pegasus(
    dataset_samsum["test"], rouge_metric, trainer.model, tokenizer,
    batch_size=2, column_text="dialogue", column_summary="summary")

rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)
pd.DataFrame(rouge_dict, index=["pegasus"])
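The training arguments above combine `per_device_train_batch_size=1` with `gradient_accumulation_steps=16`, so the weights are updated once every 16 examples. The sketch below shows why this matches a larger batch, using a toy one-parameter model with a mean squared error loss. All names and numbers are made up for illustration, and it assumes that each micro-batch gradient is scaled by the number of accumulation steps.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data and an initial weight (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Gradient computed on the full batch in one step.
full_batch_grad = grad_mse(w, xs, ys)

# Same gradient built from micro-batches of size 1.
accumulation_steps = len(xs)
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    micro_grad = grad_mse(w, [x], [y])
    # Scale each micro-batch so that the sum is an average over all examples.
    accumulated_grad += micro_grad / accumulation_steps

# A single optimizer step is taken only after all micro-batches were seen.
learning_rate = 0.01
w_new = w - learning_rate * accumulated_grad

print(full_batch_grad, accumulated_grad, w_new)
```

Only one example has to fit in memory at a time, at the cost of more forward and backward passes per weight update.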

## Generating summaries with the fine-tuned model

In [None]:
gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}
sample_text = dataset_samsum["test"][0]["dialogue"]
reference = dataset_samsum["test"][0]["summary"]
pipe = pipeline("summarization", model="transformersbook/pegasus-samsum")

print("Dialogue:")
print(sample_text)
print("\nReference Summary:")
print(reference)
print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

In [None]:
# Try with a custom input 
custom_dialogue = """\
Thom: Hi guys, have you heard of transformers?
Lewis: Yes, I used them recently!
Leandro: Indeed, there is a great library by Hugging Face.
Thom: I know, I helped build it ;)
Lewis: Cool, maybe we should write a book about it. What do you think?
Leandro: Great idea, how hard can it be?!
Thom: I am in!
Lewis: Awesome, let's do it together!
"""
print(pipe(custom_dialogue, **gen_kwargs)[0]["summary_text"])