# Testing Pegasus Summarization on BBC Sports

## About PEGASUS

PEGASUS, short for Pre-training with Extracted Gap-sentences for Abstractive Summarization, is a summarization model that the Google Brain team released in the last week of December 2019 and that was state of the art at the time. This notebook focuses on generating summaries for our own text with the pre-trained model.

**References Used**

- https://signal.onepointltd.com/post/102ghb9/exploring-pegasus-a-new-text-summarization-nlp-model
- https://huggingface.co/transformers/model_doc/pegasus.html#usage-example
- https://github.com/google-research/pegasus

## Import Libraries and Settings

In [2]:
!pip install transformers==4.2.0

Collecting transformers==4.2.0
  Downloading transformers-4.2.0-py3-none-any.whl (1.8 MB)

In [3]:
!pip install torch

You should consider upgrading via the '/opt/python/envs/default/bin/python -m pip install --upgrade pip' command.


In [4]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.95-cp38-cp38-manylinux2014_x86_64.whl (1.2 MB)

In [5]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.6.2-py3-none-any.whl (2.7 MB)

In [6]:
import numpy as np
import pandas as pd
import seaborn as sns
import os
import io

# conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
import torch

# conda install -c conda-forge python-dotenv
# from dotenv import load_dotenv

# conda install -c anaconda sqlalchemy
# from sqlalchemy import create_engine

# conda install -c conda-forge transformers
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from transformers import pipeline
from transformers import AutoModelWithLMHead, AutoTokenizer

In [7]:
# Read BBC Sports
bbc_sports = pd.read_csv("/data/workspace_files/bbc_sports.csv")
print(bbc_sports.shape)
bbc_sports.head()

(737, 4)


Unnamed: 0.1,Unnamed: 0,category,titles,contents
0,0,athletics,Claxton hunting first major medal,British hurdler Sarah Claxton is confident she...
1,1,athletics,O'Sullivan could run in Worlds,Sonia O'Sullivan has indicated that she would ...
2,2,athletics,Greene sets sights on world title,Maurice Greene aims to wipe out the pain of lo...
3,3,athletics,IAAF launches fight against drugs,The IAAF - athletics' world governing body - h...
4,4,athletics,"Dibaba breaks 5,000m world record",Ethiopia's Tirunesh Dibaba set a new world rec...


## Abstractive Summarization With Pegasus on `bbc_sports`

In [8]:
# Generating pegasus summary
pegasus_summaries = np.array([])

In [9]:
# Choosing a model: "Pegasus-XSUM"
model_name = 'google/pegasus-xsum'
model_name

'google/pegasus-xsum'

In [10]:
# Select the PyTorch device (GPU if available, otherwise CPU)
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch_device

'cpu'

In [11]:
# Set Tokenizer based on model above
tokenizer = PegasusTokenizer.from_pretrained(model_name)
tokenizer

PreTrainedTokenizer(name_or_path='google/pegasus-xsum', vocab_size=96103, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'mask_token': '<mask_2>', 'additional_special_tokens': ['<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<unk_60>', '<unk_61>', '<unk_62>

In [12]:
# Set the Pegasus Model
# This line will run for ~60s
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
model

PegasusForConditionalGeneration(
  (model): PegasusModel(
    (shared): Embedding(96103, 1024, padding_idx=0)
    (encoder): PegasusEncoder(
      (embed_tokens): Embedding(96103, 1024, padding_idx=0)
      (embed_positions): PegasusSinusoidalPositionalEmbedding(512, 1024)
      (layers): ModuleList(
        (0): PegasusEncoderLayer(
          (self_attn): PegasusAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementw

**NOTE: The following line will run for a long time (t=). If you need to re-generate the CSV with the summaries, re-run this line. Otherwise, it is better to skip it and simply re-import the previously-generated CSV.**
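
One way to make the skip-or-regenerate choice explicit is a small guard that loads the saved CSV when it exists and only falls back to the slow generation step otherwise. The sketch below is a minimal illustration; `load_or_generate` and the `run_pegasus_over_dataframe` wrapper named in the comment are hypothetical and not part of this notebook.

```python
import os


def load_or_generate(path, load_fn, generate_fn):
    """Return load_fn(path) if the file exists, otherwise generate_fn()."""
    if os.path.exists(path):
        return load_fn(path)
    return generate_fn()


# Hypothetical usage in this notebook (not executed here):
# bbc_sports = load_or_generate(
#     "/data/workspace_files/bbc_sports_pegasus_summarized.csv",
#     pd.read_csv,
#     run_pegasus_over_dataframe,  # assumed wrapper around the loop below
# )
```

Passing the loader and generator as callables keeps the guard independent of pandas and of the model, so it can be tried out without loading either.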

In [13]:
# Loop through the texts to generate the summaries
for txt in bbc_sports["contents"]:
    # prepare_seq2seq_batch expects a list of source texts
    batch = tokenizer.prepare_seq2seq_batch(
        [txt],
        truncation=True,
        padding='longest',
        return_tensors="pt"
    ).to(torch_device)

    # Generate token ids and decode them back to text
    tgt_text = tokenizer.batch_decode(
        model.generate(**batch),
        skip_special_tokens=True
    )

    # Append the single summary produced for this article
    pegasus_summaries = np.append(pegasus_summaries, tgt_text[0])


# Runtime Total: BBC-Sports (25) = 
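
The loop above calls `model.generate` once per article. Since the tokenizer call already pads to the longest sequence, a possible speed-up is to pass several articles at a time. The helper below sketches only the batching part; `chunked` and the batch size of 8 are illustrative assumptions, and a workable size depends on available memory.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage with the objects defined above (not executed here):
# texts = bbc_sports["contents"].tolist()
# summaries = []
# for group in chunked(texts, 8):
#     batch = tokenizer.prepare_seq2seq_batch(
#         group, truncation=True, padding="longest", return_tensors="pt"
#     ).to(torch_device)
#     summaries.extend(
#         tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)
#     )
```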

In [14]:
# Check list of summaries
pegasus_summaries

array(['All images are copyrighted.', 'All images are copyrighted.',
       '"I believe if I was in the middle of the race I would have been able to react to people that came ahead of me."',
       "Athletics' world governing body has met anti-doping officials, coaches and athletes to co-ordinate the fight against drugs in sport. The IAAF - athletics' world governing body - has met anti-doping officials, coaches and athletes to co-ordinate the fight against drugs in sport.",
       "Ethiopia's Tirunesh Dibaba set a new world record in winning the women's 5,000m at the Boston Indoor Games.",
       'Two world records were set at the Diamond League meeting in France.',
       "Ireland's former world cross country champion will return to defend her Great Ireland Run title next month.",
       'All images are copyrighted.',
       "Britain's Jason Gardener shook off an upset stomach to win the 60m at Sunday's Leipzig International meeting.",
       "Four of Britain's best sprinters will co

In [15]:
# bbc_sports["summary_pegasus"] = bbc_sports["contents"].map(generate_pegasus_summary)
bbc_sports["summary_pegasus"] = pegasus_summaries
bbc_sports

Unnamed: 0.1,Unnamed: 0,category,titles,contents,summary_pegasus
0,0,athletics,Claxton hunting first major medal,British hurdler Sarah Claxton is confident she...,All images are copyrighted.
1,1,athletics,O'Sullivan could run in Worlds,Sonia O'Sullivan has indicated that she would ...,All images are copyrighted.
2,2,athletics,Greene sets sights on world title,Maurice Greene aims to wipe out the pain of lo...,"""I believe if I was in the middle of the race ..."
3,3,athletics,IAAF launches fight against drugs,The IAAF - athletics' world governing body - h...,Athletics' world governing body has met anti-d...
4,4,athletics,"Dibaba breaks 5,000m world record",Ethiopia's Tirunesh Dibaba set a new world rec...,Ethiopia's Tirunesh Dibaba set a new world rec...
...,...,...,...,...,...
732,732,tennis,Agassi into second round in Dubai,Fourth seed Andre Agassi beat Radek Stepanek 6...,Andre Agassi beat Radek Stepanek 6-4 7-5 in th...
733,733,tennis,Mauresmo fights back to win title,World number two Amelie Mauresmo came from a s...,All images are copyrighted.
734,734,tennis,Federer wins title in Rotterdam,World number one Roger Federer won the World I...,All images are copyrighted.
735,735,tennis,GB players warned over security,Britain's Davis Cup players have been warned n...,All images are copyrighted.


In [16]:
# Export result to CSV
# index=False avoids writing the row index as an extra "Unnamed: 0" column
bbc_sports.to_csv("/data/workspace_files/bbc_sports_pegasus_summarized.csv", index=False)

## Measuring Performance

- The summaries are grammatical and look plausible at first glance, but they can carry wrong information. For example, several sports articles above received the summary "All images are copyrighted."
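
A quick way to surface such degenerate outputs is to count how often the same summary text is repeated across different articles. The following is a minimal sketch; `repeated_summaries` is a hypothetical helper, and the sample strings are copied from the outputs shown earlier.

```python
from collections import Counter


def repeated_summaries(summaries, min_count=2):
    """Return (summary, count) pairs for summaries occurring at least min_count times."""
    counts = Counter(s.strip() for s in summaries)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]


# A few of the outputs shown earlier in this notebook
sample = [
    "All images are copyrighted.",
    "All images are copyrighted.",
    "Two world records were set at the Diamond League meeting in France.",
    "All images are copyrighted.",
]
print(repeated_summaries(sample))
# [('All images are copyrighted.', 3)]

# Hypothetical usage on the full column:
# repeated_summaries(bbc_sports["summary_pegasus"].tolist())
```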

In [0]:
!pip install nltk

In [0]:
bbc_sports = pd.read_csv("/data/workspace_files/bbc_sports_pegasus_summarized.csv")

In [0]:
import nltk
import re
from nltk.corpus import stopwords
import itertools
import collections

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [0]:
def remove_punctuation(txt):
    """Remove URLs and punctuation from a text string.

    Parameters
    ----------
    txt : string
        A text string from which to strip URLs and punctuation.

    Returns
    -------
    The same txt string with URLs and punctuation removed and
    runs of whitespace collapsed to single spaces.
    """

    # Raw string avoids invalid-escape warnings for \w, \/ and \S
    return " ".join(re.sub(r"([^0-9A-Za-z \t])|(\w+:\/\/\S+)", "", txt).split())

In [0]:
def clean_text(txt):
    """Removes punctuation, changes to lowercase, removes
        stopwords and returns the unique remaining words.

    Parameters
    ----------
    txt : iterable of strings
        The text strings that you want to clean.

    Returns
    -------
    List of the unique words found across all strings
    """

    tmp = [remove_punctuation(t) for t in txt]
    tmp = [t.lower().split() for t in tmp]

    tmp = [[w for w in t if w not in stop_words]
              for t in tmp]

    tmp = list(itertools.chain(*tmp))
    tmp = collections.Counter(tmp)

    return list(tmp.keys())

### How Much Shorter Are The Summaries On Average?

- Calculate the mean ratio of summary length to original length
- Plot the two length distributions next to each other
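
The first step could be sketched as follows, measuring length in whitespace-separated words. This is an assumption: characters or tokenizer tokens would be equally valid units, and `length_ratios` is a hypothetical helper.

```python
from statistics import mean


def length_ratios(originals, summaries):
    """Word-count ratio summary/original for each pair, skipping empty originals."""
    ratios = []
    for original, summary in zip(originals, summaries):
        n_original = len(original.split())
        if n_original:
            ratios.append(len(summary.split()) / n_original)
    return ratios


# Toy example: both summaries are half as long as their originals
ratios = length_ratios(
    ["one two three four", "five six seven eight nine ten"],
    ["one two", "five six seven"],
)
print(mean(ratios))  # 0.5
```

On the real data the inputs would be `bbc_sports['contents']` and `bbc_sports['summary_pegasus']`, and the two lists of word counts could then be drawn as side-by-side histograms.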

### Does It Produce Correct Spelling Of Outputs?

Let's create a spellchecker

In [0]:
summary_spell_check_words = clean_text(set(list(bbc_sports['summary_pegasus'])))
input_text_check_words = clean_text(set(list(bbc_sports['contents'])))

In [0]:
from spellchecker import SpellChecker
import seaborn as sns

In [0]:
spell = SpellChecker()

# find those words that may be misspelled
summary_misspelled = spell.unknown(summary_spell_check_words)
original_misspelled = spell.unknown(input_text_check_words)
len(summary_misspelled), len(original_misspelled)

In [0]:
f"Summary misspelled ratio: {len(summary_misspelled) / len(summary_spell_check_words)}"

In [0]:
f"Original misspelled ratio: {len(original_misspelled) / len(input_text_check_words)}"

In [0]:
misspelling_df = pd.DataFrame({
    'misspellings': ["summary_misspelled"] * len(summary_misspelled)
                    + ["input_misspelled"] * len(original_misspelled)
})

# Name the column explicitly instead of relying on positional arguments
sns.countplot(x='misspellings', data=misspelling_df)

### Generate HTML For Visual Proofreading

In [0]:
table_data = ""
# Iterate over DataFrame rows (a Series has no iterrows) and use the actual column names
for i, d in bbc_sports.iterrows():
    table_data += f"<tr><td>{d['contents']}</td><td>{d['summary_pegasus']}</td></tr>"

In [0]:
simple_visual_check = f"""
<html>
<body>
<table>
    <tr><th>Original</th><th>Summary</th></tr>
    {table_data}
</table>
</body>
</html>
"""

In [0]:
with open('/data/workspace_files/visual_check.html', 'w') as f:
    f.write(simple_visual_check)
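
Article text can contain characters such as `<` or `&` that a browser would interpret as markup. A hedged variant of the row-building step escapes both cells with the standard library's `html.escape`; `build_rows` is a hypothetical helper, not something this notebook defines.

```python
import html


def build_rows(pairs):
    """Build <tr> rows from (original, summary) pairs, escaping HTML special characters."""
    return "".join(
        f"<tr><td>{html.escape(str(original))}</td>"
        f"<td>{html.escape(str(summary))}</td></tr>"
        for original, summary in pairs
    )


print(build_rows([("a < b", "x & y")]))
# <tr><td>a &lt; b</td><td>x &amp; y</td></tr>

# Hypothetical usage with the DataFrame from this notebook:
# table_data = build_rows(zip(bbc_sports["contents"], bbc_sports["summary_pegasus"]))
```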

## Additional Training

In [0]:
in_df = pd.read_csv("/data/workspace_files/bbc_sports_pegasus_summarized.csv")

In [0]:
# Train Test Split
train_pct = 0.6
test_pct = 0.2

In [0]:
in_df = in_df.sample(len(in_df), random_state=20)

In [0]:
train_sub = int(len(in_df) * train_pct)
test_sub = int(len(in_df) * test_pct) + train_sub
train_df = in_df[0:train_sub]
test_df = in_df[train_sub:test_sub]
val_df = in_df[test_sub:]
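
As a worked check of the slicing above, the sizes of the three subsets can be computed directly. The sketch assumes the CSV still holds the 737 rows reported when the data was first loaded; `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_rows, train_pct, test_pct):
    """Mirror the slicing above: train, then test, with the remainder as validation."""
    n_train = int(n_rows * train_pct)
    n_test = int(n_rows * test_pct)
    n_val = n_rows - n_train - n_test
    return n_train, n_test, n_val


print(split_sizes(737, 0.6, 0.2))  # (442, 147, 148)
```

The validation set receives whatever is left after truncation, which is why it ends up one row larger than the test set.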

In [0]:
train_texts = list(train_df['contents'])
test_texts = list(test_df['contents'])
val_texts = list(val_df['contents'])

In [0]:
# Target summaries. Ideally these are correct reference summaries; here the Pegasus outputs generated above are used
train_decode = list(train_df['summary_pegasus'])
test_decode = list(test_df['summary_pegasus'])
val_decode = list(val_df['summary_pegasus'])

In [0]:
import transformers
import torch

min_length = 15
max_length = 40

In [0]:
# Setup model
model_name = 'google/pegasus-xsum'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = transformers.PegasusTokenizer.from_pretrained(model_name)

model = transformers.PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
in_text = [in_df['contents'].iloc[3]]
batch = tokenizer.prepare_seq2seq_batch(in_text, truncation=True, padding='longest', return_tensors="pt").to(torch_device)

translated = model.generate(min_length=min_length, max_length=max_length, **batch)
tgt_text0 = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(tgt_text0)

In [0]:
# Tokenize
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

train_labels = tokenizer(train_decode, truncation=True, padding=True)
val_labels = tokenizer(val_decode, truncation=True, padding=True)
test_labels = tokenizer(test_decode, truncation=True, padding=True)
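
Because the label sequences are padded to a common length, the loss would also be computed on pad positions unless they are masked. A common convention with PyTorch-based sequence-to-sequence training is to replace pad ids in the labels with -100, the default `ignore_index` of the cross-entropy loss. The sketch below shows the idea on plain lists; `mask_padding` is a hypothetical helper, and the pad id of 0 in the toy example is an assumption based on the `padding_idx=0` shown in the model printout.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id):
    """Replace pad token ids in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy example with pad id 0
print(mask_padding([[5, 9, 1, 0, 0], [7, 1, 0, 0, 0]], pad_token_id=0))
# [[5, 9, 1, -100, -100], [7, 1, -100, -100, -100]]

# Hypothetical usage with the encodings above:
# train_labels["input_ids"] = mask_padding(train_labels["input_ids"], tokenizer.pad_token_id)
```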

In [0]:
# Setup dataset objects
class Summary_dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])  # torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # Number of examples, not the number of keys in the encodings
        return len(self.encodings['input_ids'])

In [0]:
train_dataset = Summary_dataset(train_encodings, train_labels)
val_dataset = Summary_dataset(val_encodings, val_labels)
test_dataset = Summary_dataset(test_encodings, test_labels)

**Training**

In [0]:
from transformers import Trainer, TrainingArguments

In [0]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1000,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)
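
To put these settings in perspective, the number of optimizer steps they imply can be estimated with a little arithmetic. The sketch assumes a single device, no gradient accumulation, and 442 training examples (60% of 737 rows); `training_steps` is a hypothetical helper.

```python
import math


def training_steps(n_examples, batch_size, epochs):
    """Optimizer steps per epoch and in total, assuming one device and no gradient accumulation."""
    per_epoch = math.ceil(n_examples / batch_size)
    return per_epoch, per_epoch * epochs


per_epoch, total = training_steps(n_examples=442, batch_size=16, epochs=1000)
print(per_epoch, total)  # 28 28000
```

Under these assumptions the 500 warmup steps span roughly the first 18 epochs, and the full run covers 28,000 steps.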

In [0]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

In [0]:
trainer.train()

In [0]:
# Check results
in_text = [in_df['contents'].iloc[3]]
batch = tokenizer.prepare_seq2seq_batch(in_text, truncation=True, padding='longest', return_tensors="pt").to(torch_device)

translated = model.generate(min_length=min_length, max_length=max_length, **batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(tgt_text)