## **Abstractive Text Summarization using Pegasus Large for BBC XSum**

***Note***: This notebook was run with Project ID: level-facility-347712, Zone: us-central1-b, and instance: colab-2-vm. On other hardware, use a lower batch size and a smaller training dataset, as indicated in the code comments. This code uses the pretrained transformer from HuggingFace: https://github.com/patrickvonplaten/notebooks/blob/master/RoBERTaShared_for_BBC_XSum.ipynb






### **Data Preprocessing and Augmentation**


In [2]:
%%capture
!pip install datasets==1.0.2
!pip install transformers
!pip install sentencepiece
!pip install nlpaug

import datasets
import transformers

In [3]:
from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
train_data = datasets.load_dataset("xsum", split="train")
val_data = datasets.load_dataset("xsum", split="validation[:10%]")

Using custom data configuration default
Reusing dataset xsum (/root/.cache/huggingface/datasets/xsum/default/1.1.0/128741c17b7a4c939dbf844a75a5e83deadd07deaf4b2eda2056ed8eebdb03ae)
Using custom data configuration default
Reusing dataset xsum (/root/.cache/huggingface/datasets/xsum/default/1.1.0/128741c17b7a4c939dbf844a75a5e83deadd07deaf4b2eda2056ed8eebdb03ae)


In [5]:
# Data augmentation via back-translation (English -> German -> English).
# Comment out this cell to skip augmentation.
!pip install sacremoses
from datasets.arrow_dataset import concatenate_datasets

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
import pandas as pd 

from nlpaug.util import Action

aug_bt = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de',
    to_model_name='facebook/wmt19-de-en'
)

train_data_original = train_data
size = 1  #increase for full augmentation
train_data_for_aug = train_data_original.select(range(size))
print(f"Sample for augmentation:\n{train_data_for_aug}")
train_data_sums = []
train_data_docs = []

for i in range(size):
  train_data_docs.append(aug_bt.augment(train_data_for_aug[i]['document']))
  train_data_sums.append(aug_bt.augment(train_data_for_aug[i]['summary']))

print("Augmentation Samples (Summaries):")
for i in range(size):
  print(f"Before augmentation: {train_data_for_aug[i]['summary']}\nAfter Augmentation: {train_data_sums[i]}\n")

df = pd.DataFrame({'document':train_data_docs, 'summary':train_data_sums})

train_data_aug_ds = datasets.Dataset.from_pandas(df)
# Forward slashes keep the path portable (a backslash is not a path separator on Linux)
train_data_aug_ds.save_to_disk("datasets/xsum_aug")
train_data_aug = datasets.Dataset.load_from_disk("datasets/xsum_aug")

train_data = concatenate_datasets([train_data_aug, train_data_original])
print(train_data)

Sample for augmentation:
Dataset(features: {'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None)}, num_rows: 1)
Augmentation Samples (Summaries):
Before augmentation: Sony has told owners of older models of its PlayStation 3 console to stop using the machine because of a problem with the PlayStation Network.

After Augmentation: Sony has asked owners of older models of its PlayStation 3 console to stop using the machine because of a problem with the PlayStation Network.

Dataset(features: {'document': Value(dtype='string', id=None), 'summary': Value(dtype='string', id=None)}, num_rows: 204018)


In [3]:
batch_size=8  # 4 for low GPU resources
encoder_max_length=512
decoder_max_length=64

def process_data_to_model_inputs(batch):                                                               
    # PegasusTokenizer appends the EOS token (</s>) to each sequence; it defines no BOS token
    inputs = tokenizer(batch["document"], padding="max_length", truncation=True, max_length=encoder_max_length)
    outputs = tokenizer(batch["summary"], padding="max_length", truncation=True, max_length=decoder_max_length)
                                                                                                        
    batch["input_ids"] = inputs.input_ids                                                               
    batch["attention_mask"] = inputs.attention_mask                                                     
    batch["decoder_input_ids"] = outputs.input_ids                                                      
    batch["labels"] = outputs.input_ids.copy()                                                                                                                                    
    batch["labels"] = [                                                                                 
        [-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]
    ]                     
    batch["decoder_attention_mask"] = outputs.attention_mask                                                                              
                                                                                                         
    return batch  

train_data = train_data.select(range(128))  #change if using GPU

train_data = train_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["document", "summary"],
)
train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

val_data = val_data.map(
    process_data_to_model_inputs, 
    batched=True, 
    batch_size=batch_size, 
    remove_columns=["document", "summary"],
)
val_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids", "decoder_attention_mask", "labels"],
)

  0%|          | 0/16 [00:00<?, ?ba/s]

  0%|          | 0/142 [00:00<?, ?ba/s]
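
The preprocessing cell pads each summary to a fixed length and then replaces padding positions in the labels with -100 so the loss skips them. A minimal pure-Python sketch of that idea follows. The token IDs, `pad_id`, and the helper names `pad_or_truncate` and `mask_padding` are hypothetical, and the simple slice used here does not preserve a trailing EOS token the way a real tokenizer does when it truncates.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, or right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def mask_padding(label_ids, pad_id):
    """Replace padding positions with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if token == pad_id else token for token in label_ids]


# Hypothetical token IDs for one short summary (pad_id=0 is an assumption)
pad_id = 0
summary_ids = [42, 17, 99, 1]

decoder_input_ids = pad_or_truncate(summary_ids, max_length=6, pad_id=pad_id)
labels = mask_padding(decoder_input_ids, pad_id)

print(decoder_input_ids)  # [42, 17, 99, 1, 0, 0]
print(labels)             # [42, 17, 99, 1, -100, -100]
```

The decoder inputs keep the real pad ID because the model still needs valid token IDs as input; only the labels are masked.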

### **Encoder-Decoder Model & Word Embedding Implementation**

In [14]:
from transformers import EncoderDecoderModel

pegasusModel = EncoderDecoderModel.from_encoder_decoder_pretrained("google/pegasus-large", "google/pegasus-large", tie_encoder_decoder=True)

Downloading:   0%|          | 0.00/2.12G [00:00<?, ?B/s]

In [5]:
import torch
import torch.nn as nn
import torch.autograd as autograd
import torch.optim as optim
import torch.nn.functional as F


class Word2Vec(nn.Module):

    def __init__(self, embedding_size, vocab_size):
        super(Word2Vec, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocab_size)
        
    def forward(self, context_word):
        emb = self.embeddings(context_word)
        hidden = self.linear(emb)
        out = F.log_softmax(hidden, dim=-1)  # normalize over the vocabulary dimension
        return out

In [6]:
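# Pegasus defines no BOS token (tokenizer.bos_token_id is None); its decoder starts from the pad token.
pegasusModel.config.decoder_start_token_id = tokenizer.pad_token_id
pegasusModel.config.eos_token_id = tokenizer.eos_token_id
pegasusModel.config.pad_token_id = tokenizer.pad_token_id

# NOTE: these two assignments do not call the setter methods, so the pretrained
# embedding layers used by the model are left unchanged.
# Keyword arguments avoid swapping the two sizes (Word2Vec takes embedding_size first).
pegasusModel.set_input_embeddings = Word2Vec(embedding_size=1024, vocab_size=50265)
pegasusModel.set_output_embeddings = Word2Vec(embedding_size=1024, vocab_size=50265)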
pegasusModel.config.max_length = 64
pegasusModel.config.early_stopping = True
pegasusModel.config.no_repeat_ngram_size = 3
pegasusModel.config.length_penalty = 2.0
pegasusModel.config.num_beams = 4
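
To make the effect of `length_penalty = 2.0` more concrete, here is a simplified sketch of how beam search can rank finished hypotheses by a length-normalized score. It assumes the score is the sum of token log-probabilities divided by `length ** length_penalty`, which is my understanding of the normalization used in transformers' beam search; the library's implementation has more details. The two hypotheses and the helper name `beam_score` are made up for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized hypothesis score (higher is better)."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished beams: a short one and a longer, less probable one
short = {"sum_logprobs": -4.0, "length": 4}
long = {"sum_logprobs": -6.0, "length": 8}

winners = {}
for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(short["sum_logprobs"], short["length"], penalty)
    long_score = beam_score(long["sum_logprobs"], long["length"], penalty)
    winners[penalty] = "long" if long_score > short_score else "short"
    print(f"length_penalty={penalty}: short={short_score}, long={long_score}")

print(winners)  # {0.0: 'short', 1.0: 'long', 2.0: 'long'}
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score toward zero, so under this formula values above 1.0 favor longer summaries.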

### **Fine-Tuning Warm-Started Encoder-Decoder Models**

In [7]:
%%capture
!pip install git-python==1.0.3
!pip install sacrebleu==1.4.12
!pip install rouge_score
from transformers import TrainingArguments, Seq2SeqTrainer
from dataclasses import dataclass, field
from typing import Optional

The cell below extends TrainingArguments with additional parameters that are compatible with the Seq2SeqTrainer. The arguments are taken from https://github.com/patrickvonplaten/transformers/blob/make_seq2seq_trainer_self_contained/examples/seq2seq/finetune_trainer.py.

In [8]:
@dataclass
class Seq2SeqTrainingArguments(TrainingArguments):
    label_smoothing: Optional[float] = field(
        default=0.0, metadata={"help": "The label smoothing epsilon to apply (if not zero)."}
    )
    sortish_sampler: bool = field(default=False, metadata={"help": "Whether to use SortishSampler or not."})
    predict_with_generate: bool = field(
        default=False, metadata={"help": "Whether to use generate to calculate generative metrics (ROUGE, BLEU)."}
    )
    adafactor: bool = field(default=False, metadata={"help": "whether to use adafactor"})
    encoder_layerdrop: Optional[float] = field(
        default=None, metadata={"help": "Encoder layer dropout probability. Goes into model.config."}
    )
    decoder_layerdrop: Optional[float] = field(
        default=None, metadata={"help": "Decoder layer dropout probability. Goes into model.config."}
    )
    dropout: Optional[float] = field(default=None, metadata={"help": "Dropout probability. Goes into model.config."})
    attention_dropout: Optional[float] = field(
        default=None, metadata={"help": "Attention dropout probability. Goes into model.config."}
    )
    lr_scheduler: Optional[str] = field(
        default="linear", metadata={"help": "Which lr scheduler to use."}
    )
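
The pattern used above (subclassing a dataclass and declaring extra fields with `field(default=..., metadata=...)`) can be tried without installing transformers. In this sketch `BaseArguments` is a hypothetical stand-in for `TrainingArguments`, and `ExtendedArguments` carries only two of the extra fields; it illustrates the mechanism, not the library's actual class.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class BaseArguments:
    """Hypothetical stand-in for TrainingArguments."""
    output_dir: str = "./"
    num_train_epochs: float = 3.0


@dataclass
class ExtendedArguments(BaseArguments):
    """Inherits every base field and appends new ones with defaults and help text."""
    label_smoothing: Optional[float] = field(
        default=0.0, metadata={"help": "The label smoothing epsilon to apply (if not zero)."}
    )
    predict_with_generate: bool = field(
        default=False, metadata={"help": "Whether to use generate to calculate generative metrics."}
    )


args = ExtendedArguments(output_dir="./out", predict_with_generate=True)

# The metadata dict is available through dataclasses.fields(), e.g. for building CLI help
help_texts = {f.name: f.metadata.get("help", "") for f in fields(args)}

print(args.num_train_epochs)       # 3.0 (inherited default)
print(args.predict_with_generate)  # True (overridden)
print(help_texts["label_smoothing"])
```

Since every inherited field already has a default, the subclass fields must also declare defaults, which is why each one goes through `field(default=...)`.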

Definition of the evaluation metric (ROUGE-2).

In [9]:
rouge = datasets.load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # all unnecessary tokens are removed
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
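
For intuition about what the three reported numbers mean, the following is a simplified sketch of ROUGE-2 for a single prediction/reference pair: bigram overlap turned into precision, recall, and F-measure. It assumes lowercase whitespace tokenization and does no stemming or aggregation over examples, so it is not the `rouge` metric's implementation and its values can differ from it. The function names and sentences are illustrative.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs after lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, f-measure) based on clipped bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # Counter intersection clips repeated bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


p, r, f = rouge2("the cat sat on a mat", "the cat sat on the mat")
print(round(p, 4), round(r, 4), round(f, 4))  # 0.6 0.6 0.6
```

Here both sentences contain five bigrams and share three of them, so precision and recall are both 3/5.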

Start Training.

In [10]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,
    do_train=True,
    do_eval=True,
    logging_steps=1,
    save_steps=16, 
    eval_steps=4, 
    warmup_steps=1,
    num_train_epochs=5, #change for full training
    overwrite_output_dir=True,
    save_total_limit=3,
)

trainer = Seq2SeqTrainer(
    model=pegasusModel,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1    | 12.1916       |
| 2    | 11.668        |
| 3    | 9.728         |
| 4    | 9.6763        |
| 5    | 8.2878        |
| 6    | 8.6899        |
| 7    | 8.351         |
| 8    | 8.0879        |
| 9    | 7.9471        |
| 10   | 7.9518        |




TrainOutput(global_step=80, training_loss=7.655910170078277, metrics={'train_runtime': 130.919, 'train_samples_per_second': 4.889, 'train_steps_per_second': 0.611, 'total_flos': 427820189614080.0, 'train_loss': 7.655910170078277, 'epoch': 5.0})
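
The `global_step=80` in the output (and therefore the `checkpoint-80` directory name) follows from the settings used above. A short arithmetic check, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook):

```python
import math

num_examples = 128         # train_data.select(range(128))
per_device_batch_size = 8  # batch_size
num_epochs = 5             # num_train_epochs
train_runtime = 130.919    # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(total_steps / train_runtime, 3)

print(steps_per_epoch)     # 16
print(total_steps)         # 80
print(samples_per_second)  # 4.889
print(steps_per_second)    # 0.611
```

If the training subset size, batch size, or epoch count is changed, the final checkpoint number changes accordingly, so the path passed to `from_pretrained` in the evaluation cell needs to be updated to match.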

### **Evaluation**


In [11]:
import datasets
from transformers import EncoderDecoderModel

model = EncoderDecoderModel.from_pretrained("./checkpoint-80")  #CHANGE to -# where # is the number of steps performed in the training above
model.to("cuda")

test_data = datasets.load_dataset("xsum", split="test")

# only use 128 test examples for the demo
test_data = test_data.select(range(128))

batch_size = 16 

In [12]:
def generate_summary(batch):
    # PegasusTokenizer appends the EOS token (</s>) to each sequence; it defines no BOS token
    inputs = tokenizer(batch["document"], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")
    
    print("Inputs are:")
    print(batch["document"])
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    

    # all special tokens will be removed
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    batch["pred"] = output_str
    print("Outputs are:")
    print(output_str)
    
    return batch

results = test_data.map(generate_summary, batched=True, batch_size=batch_size, remove_columns=["document"])

pred_str = results["pred"]
label_str = results["summary"]

rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

print(rouge_output)

Inputs are:
['The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning \'Oh I think you\'re nominated\'", said Dappy."And I was like \'Oh yeah, which one?\' And now we\'ve got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it\'s best of us to come down and mingle with everyone and say hello to the cameras. And now we find we\'ve got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn\'t be too disappointed if they didn\'t win this time around."At the end of the day we\'re grateful to be where we are in our careers."If it don\'t happen then it don\'t happen - live to fight another day and k

  0%|          | 0/8 [00:00<?, ?ba/s]
