# Fine-tuning Text Summarization Notebook: Hugging Face T5 Transformer & XSum Dataset

In [4]:
from os.path import join, isfile
from os import listdir
import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from rouge_score import rouge_scorer
import torch
from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup, AdamW
from transformers import T5Tokenizer, T5ForConditionalGeneration

**FYI:** With pandas 1.3.4 (`pd.__version__ == '1.3.4'`), this notebook raises `AttributeError: 'functools.partial' object has no attribute '__name__'` (see https://github.com/pandas-dev/pandas/issues/42748). Downgrading with `pip install pandas==1.2.5` removes the error.

## Loading data from Hugging Face 

In [10]:
from datasets import load_dataset
datasets = load_dataset('xsum')
# datasets = load_dataset('samsum')

df = datasets['train'].to_pandas()

Using custom data configuration default
Reusing dataset xsum (C:\Users\megra\.cache\huggingface\datasets\xsum\default\1.2.0\32c23220eadddb1149b16ed2e9430a05293768cfffbdfd151058697d4c11f934)



In [68]:
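# Keep only the first two columns (article and reference summary), then
# downsample to 5,000 rows so the tutorial trains faster
df = df.iloc[:, :2].copy()
df = df.sample(n=5000, random_state=1)
df.columns = ['ctext', 'text']  # ctext = full article, text = reference summary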
print(df.shape)
df.head()

(5000, 2)


| | ctext | text |
|---|---|---|
| 12567 | Theresa May has said she will form a governme... | The Democratic Unionists look set to be the po... |
| 49255 | The Saffrons face Carlow in the Division Two A... | Antrim hurlers aim to end an awful two years f... |
| 166249 | Seoul claims Beijing is retaliating economical... | South Korea has appealed to the World Trade Or... |
| 167890 | Bolasie's impressive start to the season was c... | Crystal Palace winger Yannick Bolasie says he ... |
| 2914 | This entitles her to access to health care, ed... | A top South African court has cleared the way ... |


In [107]:
df['ctext'].values[3]

'Bolasie\'s impressive start to the season was cut short when he picked up a calf injury in December, with Palace tailing off in his absence.\nBut the DR Congo international, with eight goals in 50 appearances in the Premier League, is hoping to learn lessons from Togo\'s Adebayor, citing the striker\'s 96 goals in England\'s top flight - as a remarkable feat.\n"I am looking forward to learning from Adebayor, I believe he will guide myself and our other players into goal scoring positions to best help Palace score goals and win games, " Bolasie told BBC Sport.\n"When you have a proven striker like him who knows the league very well and with a remarkable scoring record then you have a dangerous asset who can hurt your opponents and grab the goals.\n"Players like him are the ones I\'ve got to look at and if I can add more goals to my game then it will only improve me and the team" Bolasie added.\nAt 26, Bolasie knows he still has plenty to learn about the game and hopes any time he spend

In [109]:
df['text'].values[3]

'Crystal Palace winger Yannick Bolasie says he is keen to learn from new signing Emmanuel Adebayor in a bid to add more goals to his game.'

## Model building 

In [70]:
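class T5Finetuner(pl.LightningModule):
    '''
    LightningModule that fine-tunes T5 for summarization.

    Expects a DataFrame with a `ctext` column (source article) and a
    `text` column (reference summary).
    '''

    def __init__(self, df: pd.DataFrame):
        super().__init__()
        self.save_hyperparameters()  # stores the init arguments (here: df) in checkpoints
        self.source_len = 512   # max tokens kept from each article
        self.summ_len = 200     # max tokens kept from each summary
        self.lr = .0001
        self.bs = 8             # batch size
        self.num_workers = 8    # DataLoader worker processes
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
        self.tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
        self.data = df
        self.scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        self.output = 'temp/'   # root folder used by save_core_model
        self.name = 'test'      # run name used by save_core_model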

    def encode_text(self, context, text):
        ctext = str(context) # context text 
        ctext = ' '.join(ctext.split())
        text = str(text) # summarized text
        text = ' '.join(text.split())
        source = self.tokenizer.batch_encode_plus([ctext], 
                                                max_length= self.source_len, 
                                                truncation=True,
                                                padding='max_length',
                                                return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], 
                                                max_length= self.summ_len,
                                                truncation=True,
                                                padding='max_length',
                                                return_tensors='pt')
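        y = target['input_ids']
        # Teacher forcing: decoder inputs are the target without its last token,
        # labels are the target without its first token
        # (so the first summary token is never used as a label)
        target_id = y[:, :-1].contiguous()
        target_label = y[:, 1:].clone().detach()
        # Padding positions are set to -100 so the loss ignores them
        target_label[y[:, 1:] == self.tokenizer.pad_token_id] = -100
        return source['input_ids'], source['attention_mask'], target_id, target_label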
    
    def prepare_data(self):
        source_ids, source_masks, target_ids, target_labels = [], [], [], [] 
        for _, row in self.data.iterrows():
            source_id, source_mask, target_id, target_label = self.encode_text(row.ctext, row.text)
            source_ids.append(source_id)
            source_masks.append(source_mask)
            target_ids.append(target_id)
            target_labels.append(target_label)

        # Transforming lists into tensors
        source_ids = torch.cat(source_ids, dim=0)
        source_masks = torch.cat(source_masks, dim=0)
        target_ids = torch.cat(target_ids, dim=0)
        target_labels = torch.cat(target_labels, dim=0)
        # Splitting data into standard train, val, and test sets 
        data = TensorDataset(source_ids, source_masks, target_ids, target_labels)
        train_size, val_size = int(0.8 * len(data)), int(0.1 * len(data))
        test_size = len(data) - (train_size + val_size)
        self.train_dat, self.val_dat, self.test_dat = \
            random_split(data, [train_size, val_size, test_size])
    
    def forward(self, batch, batch_idx):
        source_ids, source_mask, target_ids, target_labels = batch[:4]
        return self.model(
            input_ids = source_ids, 
            attention_mask = source_mask, 
            decoder_input_ids=target_ids, 
            labels=target_labels
        )
        
    def training_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        self.log('train loss', loss, prog_bar = True, logger = True)
        return {'loss': loss, 'log': {'train_loss': loss}}

    def validation_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        self.log('valid loss', loss, prog_bar = True, logger = True)
        return {'loss': loss}

    def validation_epoch_end(self, outputs): 
        loss = sum([o['loss'] for o in outputs]) / len(outputs)
        out = {'val_loss': loss}
        return {**out, 'log': out}

    def test_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        self.log('test loss', loss, prog_bar = True, logger = True)
        return {'loss': loss}

    def test_epoch_end(self, outputs):
        loss = sum([o['loss'] for o in outputs]) / len(outputs)
        out = {'test_loss': loss}
        return {**out, 'log': out}
    
    def train_dataloader(self):
        return DataLoader(
            self.train_dat,
            batch_size=self.bs,
            num_workers=self.num_workers, 
            sampler=RandomSampler(self.train_dat)
        )

    def val_dataloader(self):
        return DataLoader(
            self.val_dat, 
            batch_size=self.bs, 
            num_workers=self.num_workers,
            sampler=SequentialSampler(self.val_dat)
        )

    def test_dataloader(self):
        return DataLoader(
            self.test_dat, 
            batch_size=self.bs, 
            num_workers=self.num_workers,
            sampler=SequentialSampler(self.test_dat)
        )    

    def configure_optimizers(self):
        optimizer = AdamW(self.model.parameters(), lr=self.lr, eps=1e-4)
        return {'optimizer': optimizer}
    
    def save_core_model(self):
        store_path = join(self.output, self.name, 'core')
        self.model.save_pretrained(store_path)
        self.tokenizer.save_pretrained(store_path)
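
The shift performed in `encode_text` is easier to see on a toy sequence. The sketch below uses plain lists instead of tensors, and the token ids 71 to 73 are made up; only the pad id (0) and EOS id (1) follow T5's vocabulary. `shift_t5_convention` is an alternative shown for comparison (it mirrors what `T5ForConditionalGeneration` does internally when only `labels` are passed), not what this notebook uses. With the notebook's shift the first summary token never appears as a label, which may explain why the generated summary further down starts mid-sentence.

```python
PAD_ID = 0      # T5's pad token id, also used as the decoder start token
IGNORE = -100   # label value skipped by the cross-entropy loss


def shift_as_in_encode_text(y):
    """Mirror encode_text: inputs drop the last token, labels drop the first."""
    decoder_inputs = y[:-1]
    labels = [IGNORE if tok == PAD_ID else tok for tok in y[1:]]
    return decoder_inputs, labels


def shift_t5_convention(y):
    """Alternative: prepend the pad id as start token and keep every target token as a label."""
    decoder_inputs = [PAD_ID] + y[:-1]
    labels = [IGNORE if tok == PAD_ID else tok for tok in y]
    return decoder_inputs, labels


# Toy target: three summary tokens, EOS (id 1), then two padding positions
y = [71, 72, 73, 1, 0, 0]

print(shift_as_in_encode_text(y))
# ([71, 72, 73, 1, 0], [72, 73, 1, -100, -100])
print(shift_t5_convention(y))
# ([0, 71, 72, 73, 1, 0], [71, 72, 73, 1, -100, -100])
```

In the first variant, token 71 is fed to the decoder but is never a prediction target; in the second, every real token (including 71 and EOS) is.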

In [71]:
class MetricsCallback(pl.Callback):
    def __init__(self):
        super().__init__()
        self.metrics = []

    def on_validation_end(self, trainer, pl_module):
        self.metrics.append(trainer.callback_metrics)

## Selecting model name 

In [72]:
#######################
MODEL_NAME = 't5-base'
#######################
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5Finetuner(df = df)
num_epochs = 3
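
For a rough sense of how much training these settings imply, the split and batching arithmetic can be worked out by hand. This is a minimal sketch that reuses the 80/10/10 arithmetic from `prepare_data` and assumes one optimizer step per batch (no gradient accumulation, a single GPU); the helper names `split_sizes` and `optimizer_steps` are illustrative and not part of the notebook.

```python
import math


def split_sizes(n_rows, train_frac=0.8, val_frac=0.1):
    """Same arithmetic as prepare_data: the test set takes whatever is left."""
    train_size = int(train_frac * n_rows)
    val_size = int(val_frac * n_rows)
    test_size = n_rows - (train_size + val_size)
    return train_size, val_size, test_size


def optimizer_steps(train_size, batch_size, epochs):
    """Batches per epoch (a final partial batch counts) times the number of epochs."""
    return math.ceil(train_size / batch_size) * epochs


train_size, val_size, test_size = split_sizes(5000)
print(train_size, val_size, test_size)    # 4000 500 500
print(optimizer_steps(train_size, 8, 3))  # 1500
```

With 5,000 sampled rows, a batch size of 8, and 3 epochs, that is 500 batches per epoch and 1,500 optimizer steps in total.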

## Loading TensorBoard for logging

In [73]:
%load_ext tensorboard
%tensorboard --logdir ./lightning_logs --host localhost --port 4000

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 4000 (pid 50276), started 1 day, 1:05:16 ago. (Use '!kill 50276' to kill it.)

In [74]:
checkpoint_callback = ModelCheckpoint(
    dirpath = 'checkpoints', 
    filename = 'best-checkpoint', 
    save_top_k = 1, 
    verbose = True, 
    monitor = 'valid loss',  # must match the metric name logged in validation_step
    mode = 'min', 
)

logger = TensorBoardLogger('lightning_logs', name = 'custom_summary_from_xsum_data') 

trainer = pl.Trainer(
    logger = logger, 
    callbacks = [checkpoint_callback],  # callback instances are passed via `callbacks`
    max_epochs = num_epochs, 
    gpus = 1, 
)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
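
`save_top_k = 1` together with `mode = 'min'` asks the checkpoint callback to keep only the checkpoint with the lowest monitored value. The selection rule can be sketched as follows; this illustrates the idea only and is not Lightning's implementation, the helper name `best_epoch` is hypothetical, and the loss values are invented rather than taken from this run.

```python
def best_epoch(monitored_values, mode="min"):
    """Return (epoch_index, value) of the best monitored value; ties keep the earliest epoch."""
    pick = min if mode == "min" else max
    idx = pick(range(len(monitored_values)), key=lambda i: monitored_values[i])
    return idx, monitored_values[idx]


# Hypothetical per-epoch validation losses (not results from this notebook)
losses = [2.31, 2.05, 2.12]
print(best_epoch(losses))  # (1, 2.05)
```

Here the checkpoint from the second epoch (index 1) would be the one kept, even though training continued for another epoch.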


In [75]:
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)



## Loading the best model from training

In [41]:
trained_model = T5Finetuner.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)

In [42]:
trained_model.freeze()

## Inference 

In [120]:
class Inference():
    @staticmethod
    def get_example(index: int):
        # Return the (article, reference summary) pair at the given row position
        full_text = df['ctext'].values[index]
        summary = df['text'].values[index]
        return full_text, summary

    @staticmethod
    def summarize(trained_model, tokenizer, text): 
        # Tokenize the article as in training: truncate or pad to 512 tokens
        text_encoding = tokenizer(
            text,
            max_length = 512, 
            padding = 'max_length', 
            truncation = True, 
            return_attention_mask = True, 
            return_tensors = 'pt'
        )
        # Beam search with the underlying fine-tuned Hugging Face model
        generated_ids = trained_model.model.generate(
            input_ids = text_encoding['input_ids'], 
            attention_mask = text_encoding['attention_mask'], 
            max_length = 200,
            num_beams = 2,
            repetition_penalty = 2.5,
            length_penalty = 1.0,
            # early_stopping = True
        )
        # Decode each generated sequence back to text
        preds = [
            tokenizer.decode(gen_id, 
                skip_special_tokens = True, 
                clean_up_tokenization_spaces = True)
            for gen_id in generated_ids
        ]
        return "".join(preds)

        
example_index = 4

example_text, example_summary = Inference.get_example(example_index)
print('~~~ Original text: \n\n', example_text, '\n\n\n ~~~ Summary: \n\n', example_summary)

~~~ Original text: 

 This entitles her to access to health care, education and other welfare services which she had been denied.
As her parents have been out of Cuba for some time, the girl had been unable to claim Cuban citizenship and she had been effectively left "stateless".
This test case will affect other children in such legal limbo.
The case has been going through the South African courts for several years, and the Supreme Court of Appeal's decision came after the government challenged a ruling brought by a lower court.
The BBC's Karen Allen in Johannesburg says the home affairs ministry had argued that granting the girl a South African birth certificate would open the floodgates to new applications.
The court's judgement is a reaffirmation of existing laws in South Africa which give citizenship to stateless children.
The Supreme Court of Appeal gave the government 18 months to get its house in order and put in place a mechanism for processing similar claims.
Lawyers say the i

In [121]:
# Generate a prediction with Inference.summarize, using the fine-tuned T5Finetuner
prediction = Inference.summarize(trained_model, tokenizer, example_text)
prediction

'granted Cuban citizenship to a 10-year-old girl born in South Africa.'

In [122]:
# Score the prediction against the reference summary with RougeScorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scorer.score(example_summary, prediction)

{'rouge1': Score(precision=0.7692307692307693, recall=0.4, fmeasure=0.5263157894736842),
 'rouge2': Score(precision=0.25, recall=0.125, fmeasure=0.16666666666666666),
 'rougeL': Score(precision=0.46153846153846156, recall=0.24, fmeasure=0.3157894736842105)}
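
To read these numbers: ROUGE-1 precision is the share of predicted unigrams that also occur in the reference, and recall is the share of reference unigrams that the prediction covers. A precision of about 0.77 with a recall of 0.40 therefore means most predicted words are in the reference, but the prediction covers well under half of it. The sketch below is a simplified stand-in for illustration, assuming lowercased whitespace tokens and no stemming; `rouge_score` tokenizes differently and stems when `use_stemmer=True`, so its values will not match exactly. The two sentences are invented.

```python
from collections import Counter


def rouge1(reference: str, prediction: str):
    """Unigram-overlap precision, recall and F1 (lowercased whitespace tokens, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the court granted the girl citizenship"
prediction = "the court granted citizenship"
precision, recall, f1 = rouge1(reference, prediction)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 1.0 0.667 0.8
```

All four predicted words occur in the six-word reference, so precision is 1.0 while recall is 4/6.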