In this Colab we:
1. Import CNN/DailyMail (news article, summary) pairs.
2. Summarize the first example (about Harry Potter):
    - baseline = first 3 sentences of the article
    - GPT-2 = append "TL;DR" and generate the next tokens
    

In [None]:
# !pip install transformers datasets rouge_score sacrebleu evaluate py7zr pynvml xformers sentencepiece accelerate

In [2]:
import os
import json
import pandas as pd
import torch

import transformers
import evaluate
from datasets import load_dataset

import nltk
nltk.download('punkt')

device = 'cuda' if torch.cuda.is_available() else 'cpu'

cnn_dataset = load_dataset("cnn_dailymail", version="3.0.0")
cnn_dataset['train'].column_names

summaries = {}

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nikit\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Found cached dataset cnn_dailymail (C:/Users/nikit/.cache/huggingface/datasets/cnn_dailymail/default/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)



In [2]:
sample = cnn_dataset['train'][0]
print(sample['article'][29:211], '...')
print('- - - - -')
print(sample['highlights'])

Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel ...
- - - - -
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .


In [3]:
# Baseline = first 3 sentences
def three_sentence_summary(text):
    return " ".join(nltk.sent_tokenize(text)[:3])
summaries['baseline'] = three_sentence_summary(sample['article'][:1000])

# GPT-2 (not trained to summarize, but we can give it a shot by appending "TL;DR")
if 'COLAB_GPU' in os.environ:
    gpt2_query = sample['article']  + "\nTL;DR:\n"
    gpt2_pipe = transformers.pipeline("text-generation", model="gpt2-xl")
    gpt2_out = gpt2_pipe(gpt2_query, max_length=1024, clean_up_tokenization_spaces=True)
    summaries['gpt2'] = "\n".join(nltk.sent_tokenize(gpt2_out[0]["generated_text"][len(gpt2_query) :]))

    # T5 fine-tuned on Summarization (CNN/DailyMail included)
    t5_pipe = transformers.pipeline("summarization", model="t5-large")
    t5_out = t5_pipe(sample['article'])
    summaries['t5'] = "\n".join(nltk.sent_tokenize(t5_out[0]["summary_text"]))

    # BART exclusively fine-tuned on CNN/DailyMail
    bart_pipe = transformers.pipeline("summarization", model="facebook/bart-large-cnn")
    bart_out = bart_pipe(sample['article'])
    summaries['bart'] = "\n".join(nltk.sent_tokenize(bart_out[0]["summary_text"]))

    
    # PEGASUS exclusively fine-tuned on CNN/DailyMail
    pegasus_pipe = transformers.pipeline("summarization", model="google/pegasus-cnn_dailymail")
    pegasus_out = pegasus_pipe(sample['article'])
    summaries['pegasus'] = "".join(nltk.sent_tokenize(pegasus_out[0]["summary_text"])).replace(" .<n>", ".\n") 
    
else: # import results
    with open('files/harry0_summaries_gpt2_t5_bart_pegasus.json', 'r') as file:
        summaries = json.load(file)
 

print("GROUND TRUTH")
print(sample['highlights'], '\n')    

for model_name in summaries:
    print(model_name.upper())
    print(summaries[model_name], '\n')


GROUND TRUTH
Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund . 

BASELINE
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix"To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. 

GPT2
Rudyard Kipling's youngest son is coming of age in "Harry Potter and the Order of the Phoenix."Exp

# Let's compute BLEU and also ROUGE

First on the Harry Potter sample, then on 1,000 sampled CNN/DailyMail test articles.
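
Before using the library metrics, it may help to see what ROUGE-N measures. The following is a minimal sketch, assuming lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge` metric from `evaluate` exactly. The names `ngrams` and `rouge_n` and the toy sentences are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N F1 score based on clipped n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # each n-gram counts at most as often as in both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())  # share of predicted n-grams found in the reference
    recall = overlap / sum(ref.values())      # share of reference n-grams recovered
    return 2 * precision * recall / (precision + recall)


# Toy sentences (made up for illustration)
reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(prediction, reference, n=1))
print("ROUGE-2:", rouge_n(prediction, reference, n=2))
```

In this toy pair, 5 of 6 unigrams and 3 of 5 bigrams match, which is why ROUGE-2 is lower than ROUGE-1: a single changed word breaks two bigrams.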

In [4]:
# Evaluate Bleu & Rouge on "Harry Potter"

bleu_metric = evaluate.load("sacrebleu")
rouge_metric = evaluate.load("rouge")

records = []

rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

for model_name in summaries:
    predictions = [summaries[model_name]]
    references = [sample['highlights']]

    rouge_score = rouge_metric.compute(predictions=predictions, references=references)
    bleu_score = bleu_metric.compute(predictions=predictions, references=references)

    record_dict = {**rouge_score, 'sacre_bleu' : bleu_score['score']}

    records.append(record_dict)

print('Rouge & Bleu for SINGLE "Harry Potter" example by MODEL:')
pd.DataFrame.from_records(records, index=summaries.keys())

Rouge & Bleu for SINGLE "Harry Potter" example by MODEL:


            rouge1    rouge2    rougeL  rougeLsum  sacre_bleu
baseline  0.335484  0.248366  0.296774   0.335484   13.078777
gpt2      0.174757  0.039604  0.135922   0.174757    1.473461
t5        0.285714  0.213592  0.247619   0.266667   14.509719
bart      0.613636  0.372093  0.545455   0.568182   28.544473
pegasus        0.8  0.692308       0.8        0.8   64.657462


In [5]:
# Evaluate BASELINE Rouge on 1k CNN/DailyMail
cnn_test_sampled = cnn_dataset['test'].shuffle(seed=42).select(range(1000))

rouge_metric = evaluate.load("rouge")
bleu_metric = evaluate.load("sacrebleu")

def evaluate_summaries_baseline(dataset, metric,
                                column_text="article",
                                column_summary="highlights"):
    summaries = [three_sentence_summary(text) for text in dataset[column_text]]
    # Pass flat lists: one prediction string and one reference string per article
    metric.add_batch(predictions=summaries,
                     references=dataset[column_summary])
    score = metric.compute()
    return score

if 'COLAB_GPU' in os.environ:
    rouge_score = evaluate_summaries_baseline(cnn_test_sampled, rouge_metric)
    bleu_score = evaluate_summaries_baseline(cnn_test_sampled, bleu_metric)

    metrics = rouge_score
    metrics['sacre_bleu'] = bleu_score['score']
else:
    with open('files/baseline@cnn.json', 'r') as file:
        metrics = json.load(file)

print('\033[1m' + '\nRouge & Bleu for 1k-CNN DATASET for MODEL=BASELINE:' + '\033[0m')
pd.DataFrame.from_dict(metrics, orient="index", columns=["baseline"]).T




Rouge & Bleu for 1k-CNN DATASET for MODEL=BASELINE:


            rouge1    rouge2    rougeL  rougeLsum
baseline  0.396061  0.173995  0.245815   0.361158


In [6]:
# Evaluate PEGASUS Rouge on CNN/DailyMail
import os, json, pandas
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tqdm import tqdm
import evaluate


def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def evaluate_summaries_pegasus(dataset, metric, model, tokenizer,
                               batch_size=16, device='cuda',
                               column_text="article",
                               column_summary="highlights"):
    article_batches = list(chunks(dataset[column_text], batch_size))
    target_batches = list(chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024,  truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=8, max_length=128)

        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    score = metric.compute()
    return score



if 'COLAB_GPU' in os.environ:
    cnn_test_sampled = cnn_dataset['test'].shuffle(seed=42).select(range(1000))
    rouge_metric = evaluate.load("rouge")

    model_ckpt = "google/pegasus-cnn_dailymail"
    pegasus_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    pegasus_model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
    score = evaluate_summaries_pegasus(cnn_test_sampled, rouge_metric, pegasus_model, pegasus_tokenizer, batch_size=8)
else:
    with open('files/cnn_pegasus@cnn.json', 'r') as file:
        score = json.load(file)

# published paper results: 
# R1 - 0.439, R2 - 0.212, RL - 0.407

print('\033[1m' + '\nRouge & Bleu for 1k-CNN DATASET for MODEL=PEGASUS FT@CNN:' + '\033[0m')
pandas.DataFrame(score, index=['pegasus'])


Rouge & Bleu for 1k-CNN DATASET for MODEL=PEGASUS FT@CNN:


           rouge1    rouge2    rougeL  rougeLsum
pegasus  0.434365  0.216317  0.312109    0.37413


# Fine-tune Pegasus on SAMSum
Now consider summarization on a different dataset: dialogues (SAMSum).
- The summaries should be more abstractive and written from a third-person perspective.

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "google/pegasus-cnn_dailymail"
pegasus_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
pegasus_model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

dataset_samsum = load_dataset('samsum')
print('Dataset size: train:14,732 - test:819 - validation:818')

samsum_sample = dataset_samsum['test'][0]
print('Dialogue:')
print(samsum_sample['dialogue'])
print('\nSummary:')
print(samsum_sample['summary'])

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Found cached dataset samsum (C:/Users/nikit/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)



Dataset size: train:14,732 - test:819 - validation:818
Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


In [None]:
# Let's zero-shot PEGASUS on Hannah example
pegasus_pipe = transformers.pipeline("summarization", model="google/pegasus-cnn_dailymail")
pegasus_out = pegasus_pipe(dataset_samsum['test'][0]['dialogue'])
print('Pegasus summary:')
print(pegasus_out[0]['summary_text'].replace(' .<n>', '.\n'))

- Zero-shot, the model tries to summarize by extracting key sentences from the dialogue.
- That works for CNN/DailyMail news articles, but SAMSum calls for more abstractive summaries.
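
One rough way to quantify "extractive versus abstractive" is the share of summary n-grams that never appear in the source. The sketch below assumes whitespace tokenization. The helper name `novel_ngram_ratio` is hypothetical, and the toy strings are made up for illustration; they are not model output.

```python
def novel_ngram_ratio(summary, source, n=2):
    """Share of summary n-grams that never occur in the source text."""
    def grams(text):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    summary_grams = grams(summary)
    if not summary_grams:
        return 0.0
    source_grams = set(grams(source))
    novel = sum(1 for gram in summary_grams if gram not in source_grams)
    return novel / len(summary_grams)


# Toy example (made up for illustration)
source = "anna: are you coming tonight ? ben: sorry , i have to work late ."
copied = "sorry , i have to work late ."             # lifted verbatim from the source
rewritten = "ben cannot join anna because of work ."  # paraphrased, third person

print("copied:   ", novel_ngram_ratio(copied, source))
print("rewritten:", novel_ngram_ratio(rewritten, source))
```

A ratio near 0 suggests the summary is mostly copied from the input, while a ratio near 1 suggests it was largely rephrased.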

Let's compute the zero-shot **Rouge** of PEGASUS on the whole SAMSum test set

In [10]:
# Zero-shot PEGASUS on the SAMSum test set (819 examples | 27 GB VRAM required | ~10 min A100 GPU)

if 'COLAB_GPU' in os.environ:
    model_ckpt = "google/pegasus-cnn_dailymail"
    pegasus_tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    pegasus_model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

    rouge_metric = evaluate.load("rouge")

    score = evaluate_summaries_pegasus(dataset=dataset_samsum['test'],
                                        metric=rouge_metric,
                                        model=pegasus_model,
                                        tokenizer=pegasus_tokenizer,
                                        column_text='dialogue',
                                        column_summary='summary',
                                        batch_size=8)
else:
    with open('files/cnn_pegasus@samsum.json', 'r') as file:
        score = json.load(file)


# HF book: R1 - 0.296, R2 - 0.088, RL - 0.230, RLsum - 0.230
print('\033[1m' + '\nRouge on DATASET=819.SAMSum for MODEL=PEGASUS:' + '\033[0m')
pd.DataFrame(score, index=["pegasus"])


Rouge on DATASET=819.SAMSum for MODEL=PEGASUS:


           rouge1    rouge2   rougeL  rougeLsum
pegasus  0.296198  0.087426  0.22933   0.229199


In [11]:
# Prepare data for fine-tuning

def cast_dataset_to_features(batch):

    input_encodings = pegasus_tokenizer(batch['dialogue'], max_length=1024, truncation=True)
    # `text_target` tokenizes the summaries as decoder targets
    # (it replaces the deprecated `as_target_tokenizer()` context manager)
    target_encodings = pegasus_tokenizer(text_target=batch['summary'], max_length=128, truncation=True)
    
    return {'input_ids' : input_encodings['input_ids'],
            'attention_mask' : input_encodings['attention_mask'],
            'labels' : target_encodings['input_ids']}

dataset_samsum_pt = dataset_samsum.map(cast_dataset_to_features, batched=True)

columns = ['input_ids', 'labels', 'attention_mask']
dataset_samsum_pt.set_format(type='torch', columns=columns)


Data collator:
- stack all tensors from batch
- prepare decoder targets (teacher forcing)
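
The two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the actual `DataCollatorForSeq2Seq` code, which works on tensors and delegates the shift to the model. The function name `collate_labels` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def collate_labels(label_seqs, pad_token_id, decoder_start_token_id):
    """Pad label sequences and derive teacher-forcing decoder inputs."""
    max_len = max(len(seq) for seq in label_seqs)
    labels, decoder_input_ids = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        # Padded label positions get IGNORE_INDEX so they do not contribute to the loss.
        labels.append(list(seq) + [IGNORE_INDEX] * n_pad)
        # Shift right: at step t the decoder is fed the ground-truth token from step t-1.
        padded = list(seq) + [pad_token_id] * n_pad
        decoder_input_ids.append([decoder_start_token_id] + padded[:-1])
    return labels, decoder_input_ids


# Toy token ids (hypothetical); 1 plays the role of the end-of-sequence token
labels, decoder_input_ids = collate_labels(
    [[5, 6, 7, 1], [8, 1]], pad_token_id=0, decoder_start_token_id=0)
print(labels)
print(decoder_input_ids)
```

Each decoder input row is its label row shifted one position to the right, so the model always predicts the next target token from the true previous ones rather than from its own earlier predictions.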

In [15]:
from transformers import DataCollatorForSeq2Seq
from transformers import TrainingArguments, Trainer
from huggingface_hub import notebook_login

if 'COLAB_GPU' in os.environ:
    notebook_login()  # paste your Hugging Face access token when prompted; never commit it

    seq2seq_data_collator = DataCollatorForSeq2Seq(pegasus_tokenizer, model=pegasus_model)

    training_args = TrainingArguments(
        output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
        per_device_train_batch_size=1, per_device_eval_batch_size=1,
        # Accumulate gradients over 16 single-example batches per optimizer step
        gradient_accumulation_steps=16,
        weight_decay=0.01, logging_steps=10, push_to_hub=True,
        evaluation_strategy='steps', eval_steps=500, save_steps=1e6)

    trainer = Trainer(model=pegasus_model, args=training_args,
                    tokenizer=pegasus_tokenizer, data_collator=seq2seq_data_collator,
                    train_dataset=dataset_samsum_pt["train"],
                    eval_dataset=dataset_samsum_pt["validation"])
    
    trainer.train() # T4 GPU 15GB ~ 50 min

    score = evaluate_summaries_pegasus(dataset_samsum['test'], rouge_metric, trainer.model, pegasus_tokenizer,
    batch_size=2, column_text='dialogue', column_summary='summary')

    score = pd.DataFrame(score, index=['pegasus'])
else:
    with open('files/samsum_pegasus@samsum.json', 'r') as file:
        score = json.load(file)
print("Pegasus(FT:Samsum) evaluated on SamSum['test']:")        
score

Pegasus(FT:Samsum) evaluated on SamSum['test']:


{'rouge1': 0.4267732831551504,
 'rouge2': 0.19817278181827536,
 'rougeL': 0.3425978145713223,
 'rougeLsum': 0.3426185352735921}

## Let's generate Dialogue summaries

In [4]:
# Generate Hannah dialogue summary:
from transformers import pipeline

gen_kwargs = {"length_penalty": 0.8, "num_beams":8, "max_length": 128}
sample_text = dataset_samsum["test"][0]["dialogue"]
pipe = pipeline("summarization", model="nikitakapitan/pegasus-samsum")

pipe(sample_text, **gen_kwargs)[0]["summary_text"]

Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


"Amanda can't find Betty's number. Larry called Betty last time they were at the park together. Hannah wants Amanda to text Larry. Amanda will text Larry."

In [5]:
custom_dialogue = """\
Thom: Hi guys, have you heard of transformers?
Lewis: Yes, I used them recently!
Leandro: Indeed, there is a great library by Hugging Face.
Thom: I know, I helped build it ;)
Lewis: Cool, maybe we should write a book about it. What do you think?
Leandro: Great idea, how hard can it be?!
Thom: I am in!
Lewis: Awesome, let's do it together!
"""

pipe(custom_dialogue, **gen_kwargs)[0]["summary_text"]

Your max_length is set to 128, but your input_length is only 91. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)


'Thom, Lewis and Leandro are going to write a book about transformers. Thom helped build a library by Hugging Face. Lewis and Leandro are going to do it together.'