# Finetuning Text Summarization Notebook

## Hugging Face T5 Transformer Fine-Tuned on Custom Rotten Tomatoes Data


- The custom Rotten Tomatoes dataset was built by collecting hundreds of the top movies from IMDb, then
- scraping each movie's Rotten Tomatoes page for:
    - **the critic consensus summary, used as the target (`y_target`)**
    - **all critic reviews, combined into a single input (`x_train`)**

In [1]:
from os.path import join, isfile
from os import listdir
import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning.callbacks import ModelCheckpoint
from rouge_score import rouge_scorer
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset, random_split
from transformers import get_linear_schedule_with_warmup, AdamW
from transformers import T5Tokenizer, T5ForConditionalGeneration

## Loading the scraped Rotten Tomatoes data

In [2]:
df = pd.read_pickle('../ingestion/rotten_tomatoes_training_df.pkl')
print(df.shape)
df.head()

(2152, 6)


| | movie_key | reviews | url | critic_summary | user_summary | movie |
|---|---|---|---|---|---|---|
| 0 | a_quiet_place_part_ii | [Really tense throughout the movie, the sequen... | https://www.rottentomatoes.com/m/a_quiet_place... | A nerve-wracking continuation of its predecess... | Almost as scary and intense as the original, A... | A Quiet Place Part II |
| 1 | being_the_ricardos | [Nicole Kidman transforms herself into Lucile ... | https://www.rottentomatoes.com/m/being_the_ric... | Being the Ricardos can't hope to truly capture... | It'll probably mean most to viewers who grew u... | Being the Ricardos |
| 2 | belle_2022 | [Very beautiful movie, masterpiece 🎆🌌🌠🎇 but de... | https://www.rottentomatoes.com/m/belle_2022 | A remarkable story brought to life with dazzli... | Beautiful to watch as well as listen to, Belle... | Belle |
| 3 | bo_burnham_inside | [This special is not for me but it is definetl... | https://www.rottentomatoes.com/m/bo_burnham_in... | A claustrophobic masterclass in comedy and int... | A brilliant blend of sharp humor and emotional... | Inside |
| 4 | candyman_2021 | [There are some really interesting ideas in th... | https://www.rottentomatoes.com/m/candyman_2021 | Candyman takes an incisive, visually thrilling... | The 2021 Candyman may not be as scary as the o... | Candyman |


In [3]:
def truncate_reviews(reviews: list, max_reviews: int = 250) -> str:
    '''
    Keep at most the first `max_reviews` reviews and join them into one string.
    '''
    return ' '.join(str(r) for r in reviews[:max_reviews])
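
`truncate_reviews` slices the list, so its limit of 250 counts reviews rather than characters. If a character budget is what is wanted, a minimal sketch could look like the following. The name `truncate_to_chars` and the `max_chars` parameter are hypothetical and not part of the notebook.

```python
def truncate_to_chars(reviews, max_chars=2000):
    """Join whole reviews until adding the next one would exceed max_chars."""
    kept, used = [], 0
    for review in reviews:
        review = str(review)
        # Count one extra character for the joining space after the first review
        extra = len(review) + (1 if kept else 0)
        if used + extra > max_chars:
            break
        kept.append(review)
        used += extra
    return ' '.join(kept)


sample = ['Tense and clever.', 'A worthy sequel.', 'Overlong but fun.']
print(truncate_to_chars(sample, max_chars=35))
# Tense and clever. A worthy sequel.
```

Cutting on review boundaries keeps every retained review intact; the tokenizer still truncates to its own token limit afterwards.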

In [4]:
df['text'] = df['reviews'].apply(truncate_reviews)

In [5]:
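df = df[['critic_summary', 'text']]
# 'ctext' holds the critic consensus (short); 'text' holds the joined reviews (long).
# encode_text() below uses 'ctext' as the encoder source and 'text' as the decoder target.
df.columns = ['ctext', 'text']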
print(df.shape)
df.head()

(2152, 2)


| | ctext | text |
|---|---|---|
| 0 | A nerve-wracking continuation of its predecess... | Really tense throughout the movie, the sequenc... |
| 1 | Being the Ricardos can't hope to truly capture... | Nicole Kidman transforms herself into Lucile B... |
| 2 | A remarkable story brought to life with dazzli... | Very beautiful movie, masterpiece 🎆🌌🌠🎇 but def... |
| 3 | A claustrophobic masterclass in comedy and int... | This special is not for me but it is definetly... |
| 4 | Candyman takes an incisive, visually thrilling... | There are some really interesting ideas in thi... |


In [6]:
df['ctext'].values[234]

"Driven by Al Pacino and Robin Williams' performances, Insomnia is a smart and riveting psychological drama."

In [23]:
df['text'].values[234]

'Watching Nolan\'s final pre-Batman outing reveals a subtle finessing of his M.O. - not just thematically, but visually. Dormer has a weariness that Pacino wears perfectly, always finding some new depth to his exhaustion and despair without ever being a sleepy presence on screen. A deceptively run-of-the-mill cop thriller based round an ingenious psychological theme. ...evocative imagery, a compelling story, and one of Pacino\'s best performances of the 21st century. Insomnia is not so much about the murder mystery as it is about Will\'s internal struggle with what\'s right and what\'s possibly okay. Who allowed these performances, or maybe even encouraged them? Christopher Nolan, that\'s who. He was so intent on dolloping pizazz onto this story that he didn\'t notice the visual syrup was drowning a six-inch stack of toaster waffles. In the world of Christopher Nolan, memory is still as treacherous as nitroglycerin. Insomnia proves that Memento wasn\'t a fluke: Nolan is a major talent 

## Model building 

In [8]:
class T5Finetuner(pl.LightningModule):
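    '''
    LightningModule that fine-tunes a pretrained T5 model for summarization.
    Tokenizes the DataFrame in prepare_data(), splits it 80/10/10 into
    train/val/test sets, and logs the loss at each step.
    '''

    def __init__(self, df: pd.DataFrame, bs: int = 8):
        super().__init__()
        self.save_hyperparameters()
        self.source_len = 512   # max encoder input length, in tokens
        self.summ_len = 200     # max decoder target length, in tokens
        self.lr = .0001
        self.bs = 8             # NOTE: fixed at 8; the `bs` argument is not used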
        self.num_workers = 8
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
        self.tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
        self.data = df
        self.scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        self.output = 'temp/'
        self.name = 'test'

    def encode_text(self, context, text):
        # Normalize whitespace in the encoder source (context) and decoder target (text)
        ctext = ' '.join(str(context).split())
        text = ' '.join(str(text).split())
        source = self.tokenizer.batch_encode_plus([ctext],
                                                max_length=self.source_len,
                                                truncation=True,
                                                padding='max_length',
                                                return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text],
                                                max_length=self.summ_len,
                                                truncation=True,
                                                padding='max_length',
                                                return_tensors='pt')
        y = target['input_ids']
        # Shift by one position: the decoder is fed y[:-1] and predicts y[1:]
        target_id = y[:, :-1].contiguous()
        target_label = y[:, 1:].clone().detach()
        # Set padding positions to -100 so the loss ignores them
        target_label[y[:, 1:] == self.tokenizer.pad_token_id] = -100
        return source['input_ids'], source['attention_mask'], target_id, target_label
    
    def prepare_data(self):
        source_ids, source_masks, target_ids, target_labels = [], [], [], [] 
        for _, row in self.data.iterrows():
            source_id, source_mask, target_id, target_label = self.encode_text(row.ctext, row.text)
            source_ids.append(source_id)
            source_masks.append(source_mask)
            target_ids.append(target_id)
            target_labels.append(target_label)

        # Transforming lists into tensors
        source_ids = torch.cat(source_ids, dim=0)
        source_masks = torch.cat(source_masks, dim=0)
        target_ids = torch.cat(target_ids, dim=0)
        target_labels = torch.cat(target_labels, dim=0)
        # Splitting data into standard train, val, and test sets 
        data = TensorDataset(source_ids, source_masks, target_ids, target_labels)
        train_size, val_size = int(0.8 * len(data)), int(0.1 * len(data))
        test_size = len(data) - (train_size + val_size)
        self.train_dat, self.val_dat, self.test_dat = \
            random_split(data, [train_size, val_size, test_size])
    
    def forward(self, batch, batch_idx):
        source_ids, source_mask, target_ids, target_labels = batch[:4]
        return self.model(
            input_ids = source_ids, 
            attention_mask = source_mask, 
            decoder_input_ids=target_ids, 
            labels=target_labels
        )
        
    def training_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        self.log('train_loss', loss, prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        # Logged under 'val_loss', the key that ModelCheckpoint(monitor='val_loss') looks up;
        # self.log averages it over the epoch, so no validation_epoch_end hook is needed
        self.log('val_loss', loss, prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        loss = self(batch, batch_idx)[0]
        self.log('test_loss', loss, prog_bar=True, logger=True)
        return loss
    
    def train_dataloader(self):
        return DataLoader(
            self.train_dat,
            batch_size=self.bs,
            num_workers=self.num_workers, 
            sampler=RandomSampler(self.train_dat)
        )

    def val_dataloader(self):
        return DataLoader(
            self.val_dat, 
            batch_size=self.bs, 
            num_workers=self.num_workers,
            sampler=SequentialSampler(self.val_dat)
        )

    def test_dataloader(self):
        return DataLoader(
            self.test_dat, 
            batch_size=self.bs, 
            num_workers=self.num_workers,
            sampler=SequentialSampler(self.test_dat)
        )    

    def configure_optimizers(self):
        optimizer = AdamW(self.model.parameters(), lr=self.lr, eps=1e-4)
        return {'optimizer': optimizer}
    
    def save_core_model(self):
        store_path = join(self.output, self.name, 'core')
        self.model.save_pretrained(store_path)
        self.tokenizer.save_pretrained(store_path)
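
The target handling in `encode_text` is easier to follow on a toy sequence. The sketch below reproduces the same shift-and-mask step with plain Python lists instead of tensors. The token ids 71 and 1712 are made up for illustration, the helper name `shift_and_mask` is hypothetical, and the pad id of 0 and end-of-sequence id of 1 are the T5 tokenizer's values.

```python
PAD_ID = 0           # T5 pad token id
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def shift_and_mask(target_ids, pad_id=PAD_ID):
    """Build decoder inputs and labels from one padded target sequence."""
    # The decoder is fed every token except the last one...
    decoder_input = target_ids[:-1]
    # ...and is trained to predict the sequence shifted left by one,
    # with padding positions replaced so they do not contribute to the loss.
    labels = [IGNORE_INDEX if tok == pad_id else tok for tok in target_ids[1:]]
    return decoder_input, labels


padded = [71, 1712, 1, 0, 0]  # two made-up tokens, end-of-sequence (1), then padding
decoder_input, labels = shift_and_mask(padded)
print(decoder_input)  # [71, 1712, 1, 0]
print(labels)         # [1712, 1, -100, -100]
```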

In [9]:
class MetricsCallback(pl.Callback):
    def __init__(self):
        super().__init__()
        self.metrics = []

    def on_validation_end(self, trainer, pl_module):
        self.metrics.append(trainer.callback_metrics)

## Selecting model name 

In [10]:
#######################
MODEL_NAME = 't5-base'
#######################
num_epochs = 20
batch_size = 5
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5Finetuner(df = df, bs = batch_size)

## Loading TensorBoard for logging

In [11]:
%load_ext tensorboard
%tensorboard --logdir ./lightning_logs --host localhost --port 7000

In [12]:
experiment_name = 'fine_tuning_text_summarizer_rt_v_0_2'

checkpoint_callback = ModelCheckpoint(
    dirpath = 'checkpoints', 
    filename = 'best-checkpoint', 
    save_top_k = 1, 
    verbose = True, 
    monitor = 'val_loss', 
    mode= 'min', 
)

logger = TensorBoardLogger('lightning_logs', name  = experiment_name) 

trainer = pl.Trainer(
    logger = logger, 
    checkpoint_callback = checkpoint_callback, 
    max_epochs = num_epochs, 
    gpus = 1)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [13]:
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


## Loading the best model from training

In [14]:
trained_model = T5Finetuner.load_from_checkpoint(trainer.checkpoint_callback.best_model_path)

In [28]:
import joblib
experiment_name = 'fine_tuning_text_summarizer_rt_v_0_2'
model_path = trainer.checkpoint_callback.best_model_path
joblib.dump(model_path, f'model_path_{experiment_name}.txt')  # NOTE: this can go in a config for the inference .py files
model_path

'lightning_logs\\fine_tuning_text_summarizer_rt_v_0_2\\version_0\\checkpoints\\epoch=19-step=4319.ckpt'
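
The `step=4319` in the checkpoint name can be checked against the split sizes in `prepare_data`. The arithmetic below is a sketch using values taken from the notebook: 2152 rows, the 80/10/10 split, 20 epochs, and the batch size of 8 hard-coded as `self.bs`. It gives 4320 optimizer steps, which is consistent with a final step index of 4319 if steps are counted from zero. With `batch_size = 5` the count would instead be 345 steps per epoch and 6900 in total, which suggests the run used 8.

```python
import math

n_rows = 2152    # df.shape[0] printed earlier in the notebook
batch_size = 8   # value hard-coded as self.bs in T5Finetuner
epochs = 20      # num_epochs

# Same split arithmetic as prepare_data()
train_size = int(0.8 * n_rows)
val_size = int(0.1 * n_rows)
test_size = n_rows - (train_size + val_size)

# The last partial batch still counts as a step
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs

print(train_size, val_size, test_size)  # 1721 215 216
print(steps_per_epoch, total_steps)     # 216 4320
```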

In [16]:
trained_model.freeze()

## Inference 

In [20]:
class Inference():
    @staticmethod
    def get_example(index: int):
        full_text = df['ctext'].values[index]
        summary = df['text'].values[index]
        return full_text, summary

    @staticmethod
    def summarize(trained_model, tokenizer, text):
        # Tokenize the input, padding or truncating to 512 tokens
        text_encoding = tokenizer(
            text,
            max_length = 512,
            padding = 'max_length',
            truncation = True,
            return_attention_mask = True,
            return_tensors = 'pt'
        )
        # Beam search with a repetition penalty; output capped at 50 tokens
        generated_ids = trained_model.model.generate(
            input_ids = text_encoding['input_ids'],
            attention_mask = text_encoding['attention_mask'],
            max_length = 50,
            num_beams = 2,
            repetition_penalty = 2.5,
            length_penalty = 1.0,
            # early_stopping = True
        )
        preds = [
            tokenizer.decode(gen_id,
                skip_special_tokens = True,
                clean_up_tokenization_spaces = True)
            for gen_id in generated_ids
        ]
        return "".join(preds)
        
example_index = 234

example_text, example_summary = Inference.get_example(example_index)
print('~~~ Original text: \n\n', example_text, '\n\n\n ~~~ Summary: \n\n', example_summary)

~~~ Original text: 

 Driven by Al Pacino and Robin Williams' performances, Insomnia is a smart and riveting psychological drama. 


 ~~~ Summary: 

 Watching Nolan's final pre-Batman outing reveals a subtle finessing of his M.O. - not just thematically, but visually. Dormer has a weariness that Pacino wears perfectly, always finding some new depth to his exhaustion and despair without ever being a sleepy presence on screen. A deceptively run-of-the-mill cop thriller based round an ingenious psychological theme. ...evocative imagery, a compelling story, and one of Pacino's best performances of the 21st century. Insomnia is not so much about the murder mystery as it is about Will's internal struggle with what's right and what's possibly okay. Who allowed these performances, or maybe even encouraged them? Christopher Nolan, that's who. He was so intent on dolloping pizazz onto this story that he didn't notice the visual syrup was drowning a six-inch stack of toaster waffles. In the world

In [21]:
# Generate a prediction with Inference.summarize
prediction = Inference.summarize(trained_model, tokenizer, example_text)
prediction

"Insomnia is a film that's not only entertaining, but also deeply moving. It's a movie about the power of love and loss in our lives. A powerful psychological thriller with a lot of heart."

In [22]:
# Use RougeScorer to score the prediction against example_summary
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scorer.score(example_summary, prediction)

{'rouge1': Score(precision=0.9722222222222222, recall=0.010051694428489374, fmeasure=0.019897669130187607),
 'rouge2': Score(precision=0.42857142857142855, recall=0.0043091065785693765, fmeasure=0.008532423208191127),
 'rougeL': Score(precision=0.7222222222222222, recall=0.007466973004020678, fmeasure=0.014781125639567938)}
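
The pattern of very high precision and very low recall is what ROUGE produces when the reference is far longer than the prediction: here the first argument to `scorer.score` is `example_summary`, which holds the long joined reviews. The sketch below is a simplified ROUGE-1 that assumes lowercase whitespace tokenization and no stemming, so it will not reproduce `rouge_score` exactly. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference, prediction):
    """Unigram-overlap precision, recall and F-measure (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f


p, r, f = rouge1('the cat sat on the mat', 'the cat sat')
print(p, r, round(f, 4))  # 1.0 0.5 0.6667

# The reported rouge1 F-measure is the harmonic mean of the reported precision and recall
reported_p, reported_r = 0.9722222222222222, 0.010051694428489374
reported_f = 2 * reported_p * reported_r / (reported_p + reported_r)
print(round(reported_f, 6))  # 0.019898
```

Every word of the short prediction appears in the longer reference, so precision is 1.0, while recall is limited by the reference length. The same effect, at a much larger length ratio, drives the scores above.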