# Fine Tuning Transformer for Summary Generation

<a id='section01'></a>
### Preparing Environment and Importing Libraries

In this step we install the necessary libraries and then import the libraries and modules needed to run our script. 
We will be installing:
* transformers
* wandb

Libraries imported are:
* Pandas
* Pytorch
* Pytorch Utils for Dataset and Dataloader
* Transformers
* T5 Model and Tokenizer

After that we will prepare the device for CUDA execution. This configuration is needed if you want to leverage an onboard GPU. First we check the GPU available to us using the `nvidia-smi` command, and then we define our device.

In [1]:
!pip install transformers -q

# Code for TPU packages install
# !curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
# !python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev


In [2]:
# Importing stock libraries
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing the T5 modules from huggingface/transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [3]:
# Checking out the GPU we have access to. This output is from the google colab version. 
!nvidia-smi

Wed Jun  7 14:52:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

# Preparing for TPU usage
# import torch_xla
# import torch_xla.core.xla_model as xm
# device = xm.xla_device()

<a id='section02'></a>
### Preparing the Dataset for data processing: Class

We will start with the creation of the Dataset class. This defines how the text is pre-processed before it is sent to the neural network. The dataset is then used by the Dataloader, which feeds the data in batches to the neural network for training and processing. 
The Dataloader and Dataset will be used inside the `main()`.
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html).

#### *CustomDataset* Dataset Class
- This class is defined to accept the Dataframe as input and generate tokenized output that is used by the **T5** model for training. 
- We are using the **T5** tokenizer to tokenize the data in the `text` and `ctext` columns of the dataframe. 
- The tokenizer uses the `batch_encode_plus` method to perform tokenization and generate the necessary outputs, namely `source_ids` and `source_mask` from the article text (`ctext`), and `target_ids` and `target_mask` from the summary text (`text`).
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/t5.html#t5tokenizer)
- The *CustomDataset* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training. 
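
To make the padding and attention mask idea concrete, here is a minimal plain-Python sketch of what fixed-length tokenization produces. It does not call the real tokenizer: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the pad id of `0` is an assumption that should be checked against `tokenizer.pad_token_id`. The real tokenizer also handles special tokens such as the end-of-sequence token, which this sketch ignores.

```python
# Hypothetical helper: mimics fixed-length padding/truncation on a list of token ids.
PAD_ID = 0  # assumption: pad token id, verify with tokenizer.pad_token_id

def pad_or_truncate(token_ids, max_len, pad_id=PAD_ID):
    """Return (ids, mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])   # truncate if the sequence is too long
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_len - len(ids)
    ids += [pad_id] * padding         # pad on the right up to max_len
    mask += [0] * padding             # 0 marks a padding position
    return ids, mask

short_ids, short_mask = pad_or_truncate([21, 7, 1], max_len=5)
long_ids, long_mask = pad_or_truncate([21, 7, 9, 4, 13, 1], max_len=5)
print(short_ids, short_mask)  # [21, 7, 1, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [21, 7, 9, 4, 13] [1, 1, 1, 1, 1]
```

Because every example ends up with the same length, the examples can later be stacked into a single batch tensor.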

#### Dataloader: Called inside the `main()`
- Dataloader is used to create the training and validation dataloaders, which load data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- This control is achieved using parameters such as `batch_size` and `max_len`.
- Training and Validation dataloaders are used in the training and validation part of the flow respectively.
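
As a rough illustration of what `batch_size` controls, the sketch below splits a list into consecutive batches in plain Python. `iter_batches` is a hypothetical stand-in, not the PyTorch `DataLoader`, which additionally shuffles, collates examples into tensors and can load data in worker processes.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

examples = list(range(10))  # stand-in for 10 tokenized articles
batches = list(iter_batches(examples, batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Only one batch needs to be held by the model at a time, which is why a smaller `batch_size` lowers memory use at the cost of more steps per epoch.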

In [6]:
# Creating a custom dataset for reading the dataframe and loading it into the dataloader to pass it to the neural network at a later stage for finetuning the model and to prepare it for predictions
class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, source_len, summ_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len  # use the length passed in by the caller (MAX_LEN in main())
        self.summ_len = summ_len
        self.text = self.data.text
        self.ctext = self.data.ctext

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        ctext = str(self.ctext[index])
        ctext = ' '.join(ctext.split())

        text = str(self.text[index])
        text = ' '.join(text.split())

        # padding='max_length' with truncation=True replaces the deprecated pad_to_max_length=True
        source = self.tokenizer.batch_encode_plus([ctext], max_length=self.source_len, padding='max_length', truncation=True, return_tensors='pt')
        target = self.tokenizer.batch_encode_plus([text], max_length=self.summ_len, padding='max_length', truncation=True, return_tensors='pt')

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long), 
            'source_mask': source_mask.to(dtype=torch.long), 
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }

<a id='section03'></a>
### Fine Tuning the Model: Function

Here we define a training function that trains the model on the training dataset created above for a specified number of epochs. One epoch is one complete pass of the training data through the network. 

This function is called in the `main()`

Following events happen in this function to fine tune the neural network:
- The epoch, tokenizer, model, device details, training dataloader and optimizer are passed to `train()` when it is called from `main()`.
- The dataloader passes data to the model based on the batch size.
- `lm_labels` are calculated from the `target_ids`; `source_ids` and `source_mask` are also extracted.
- The first element of the model output gives the loss for the forward pass. 
- Loss value is used to optimize the weights of the neurons in the network.
- After every 10 steps the loss value can be logged to the wandb service. This log is then used to generate graphs for analysis, such as [these](https://app.wandb.ai/abhimishra-91/transformers_tutorials_summarization?workspace=user-abhimishra-91). The `train()` shown in this notebook does not include the logging call.
- After every 500 steps the loss value is printed in the console.
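
The label preparation can be sketched on a single sequence in plain Python. This mirrors the slicing done in `train()` but uses lists instead of tensors. The token ids are made up, `shift_targets` is a hypothetical helper, and the pad id of `0` is an assumption. The value `-100` is the default index that PyTorch's cross-entropy loss ignores, so padded positions do not contribute to the loss.

```python
PAD_ID = 0           # assumption: pad token id, verify with tokenizer.pad_token_id
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def shift_targets(target_ids, pad_id=PAD_ID):
    """Split one padded target sequence into decoder inputs and labels."""
    decoder_input_ids = target_ids[:-1]  # everything except the last token
    # labels are the same sequence shifted left by one, with padding masked out
    labels = [IGNORE_INDEX if t == pad_id else t for t in target_ids[1:]]
    return decoder_input_ids, labels

decoder_input_ids, labels = shift_targets([55, 89, 34, 1, 0, 0])
print(decoder_input_ids)  # [55, 89, 34, 1, 0]
print(labels)             # [89, 34, 1, -100, -100]
```

At each position the decoder sees the tokens up to that point and is trained to predict the next one.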

In [7]:
# Creating the training function. This will be called in the main function. It is run depending on the epoch value.
# The model is put into train mode, then we enumerate over the training loader and pass each batch to the network.
def train(epoch, tokenizer, model, device, loader, optimizer):
    model.train()
    print('||| Training Model |||')
    for step, data in tqdm(enumerate(loader, 0)):
        y = data['target_ids'].to(device, dtype = torch.long)
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
        ids = data['source_ids'].to(device, dtype = torch.long)
        mask = data['source_mask'].to(device, dtype = torch.long)

        # `labels` replaces the older `lm_labels` keyword, which newer transformers releases no longer accept
        outputs = model(input_ids = ids, attention_mask = mask, decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]
        
        if step%500==0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # xm.optimizer_step(optimizer)
        # xm.mark_step()

<a id='section04'></a>
### Validating the Model Performance: Function

During the validation stage we pass the unseen data (the validation dataset), the trained model, the tokenizer and the device details to the function to perform the validation run. This step generates new summaries for data that the model has not seen during the training session. 

This function is called in the `main()`

This unseen data is the 20% of `news_summary.csv` which was separated during the Dataset creation stage. 
During the validation stage the weights of the model are not updated. We use the `generate` method to produce new text for the summary. 

The `generate` method uses beam search decoding, a sequence generation method for models with a language model (LM) head. 

The generated text and the original summary are decoded from tokens to text and returned to the `main()`.
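
To give a feel for what `num_beams` changes, here is a minimal beam search over a hand-written toy distribution in plain Python. Everything in it is an assumption for illustration: `TOY_MODEL` is a made-up probability table and `beam_search` is a simplified sketch, not the implementation behind `model.generate`, which also applies length penalty, repetition penalty and early stopping.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the prefix so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, num_beams, max_length):
    """Keep the num_beams highest-scoring prefixes at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_length):
        candidates = []
        for prefix, score in beams:
            for token, prob in model.get(prefix, {}).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        if not candidates:
            break  # no prefix can be extended any further
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]  # best (sequence, log-probability) pair

greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, max_length=2)
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, max_length=2)
print(greedy_seq, round(math.exp(greedy_score), 2))  # ('a', 'x') 0.3
print(beam_seq, round(math.exp(beam_score), 2))      # ('b', 'x') 0.36
```

With a single beam the search commits to the locally best first token `a`. With two beams it keeps `b` alive as well and finds the sequence with the higher overall probability.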

In [8]:
def validate(epoch, tokenizer, model, device, loader):
    model.eval()
    predictions = []
    actuals = []
    print('||| Model Validation |||')
    with torch.no_grad():
        for _, data in tqdm(enumerate(loader, 0)):
            y = data['target_ids'].to(device, dtype = torch.long)
            ids = data['source_ids'].to(device, dtype = torch.long)
            mask = data['source_mask'].to(device, dtype = torch.long)

            generated_ids = model.generate(
                input_ids = ids,
                attention_mask = mask, 
                max_length=100, 
                num_beams=2,
                temperature=1.0,
                repetition_penalty=2.5, 
                length_penalty=1.2, 
                early_stopping=True
            )
            
            preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]
            target = [tokenizer.decode(t, skip_special_tokens=True, clean_up_tokenization_spaces=True) for t in y]
            
            if _%100==0:
                print(f'Completed {_}')

            predictions.extend(preds)
            actuals.extend(target)
    return predictions, actuals

<a id='section05'></a>
### Main Function

The `main()`, as the name suggests, is the central location that executes all the functions and flows created above in the notebook. The following steps are executed in the `main()`:


<a id='section502'></a>
#### Importing and Pre-Processing the domain data

We will be working with the data and preparing it for fine tuning purposes. 
*Assuming that the `news_summary.csv` is already downloaded in your `data` folder*

* The file is imported as a dataframe and given headers as per the documentation.
* The file is cleaned to remove the unwanted columns.
* The string `summarize: ` is prepended to the main article column. This is done because **T5** used similar formatting for its summarization dataset. 
* The final Dataframe will be something like this:

|text|ctext|
|--|--|
|summary-1|summarize: article 1|
|summary-2|summarize: article 2|
|summary-3|summarize: article 3|

* Top 5 rows of the dataframe are printed on the console.

<a id='section503'></a>
#### Creation of Dataset and Dataloader

* The updated dataframe is divided in an 80-20 ratio for training and validation. 
* Both dataframes are passed to the `CustomDataset` class for tokenization of the news articles and their summaries.
* The tokenization is done using the length parameters passed to the class.
* Train and validation parameters are defined and passed to the PyTorch `DataLoader` construct to create the `train` and `validation` data loaders.
* These dataloaders will be passed to `train()` and `validate()` respectively for training and validation action.
* The shape of datasets is printed in the console.
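
The property that matters in the 80-20 split is that no row ends up in both parts. The plain-Python sketch below shows one way to get that property by shuffling row indices with a fixed seed. `split_indices` is a hypothetical helper and is not the pandas-based code used in `main()`.

```python
import random

def split_indices(n_rows, train_frac, seed):
    """Shuffle row indices reproducibly and cut them into two disjoint parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed gives the same split every run
    cut = int(round(n_rows * train_frac))
    return sorted(indices[:cut]), sorted(indices[cut:])

train_idx, val_idx = split_indices(n_rows=10, train_frac=0.8, seed=42)
print(len(train_idx), len(val_idx))   # 8 2
print(set(train_idx) & set(val_idx))  # set()
```

An empty intersection confirms that the validation rows are unseen during training, which is what makes the validation summaries a fair check of the fine-tuned model.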


<a id='section504'></a>
#### Neural Network and Optimizer

* In this stage we define the model and optimizer that will be used for training and to update the weights of the network. 
* We are using the `t5-small` transformer model (set through `T5MODEL`) for our project. You can read about the `T5 model` and its features above. 
* We use the `T5ForConditionalGeneration.from_pretrained(T5MODEL)` command to define our model. `T5ForConditionalGeneration` adds a Language Model head to our `T5 model`. The Language Model head allows us to generate text based on the training of the `T5 model`.
* We are using the `Adam` optimizer for our project. This has been a standard for all our tutorials and can be changed to see how different optimizers perform with different learning rates. 
* There is also scope for doing more with the optimizer, such as decay and momentum, to dynamically update the learning rate and other parameters. All those concepts have been kept out of scope for these tutorials. 


<a id='section505'></a>
#### Training Model

* We call the `train()` with all the necessary parameters.
* Loss at every 500th step is printed on the console.


<a id='section506'></a>
#### Validation and generation of Summary

* After the training is completed, the validation step is initiated.
* As defined in the validation function, the model weights are not updated. We use the fine tuned model to generate new summaries based on the article text.
* An output is printed on the console giving a count of how many steps are complete after every 100th step. 
* The original summaries and generated summaries are collected into two lists and returned to the main function. 
* Both lists are used to create the final dataframe with 2 columns, **Generated Text** and **Actual Text**.
* The dataframe is saved as a csv file in the local drive.
* A qualitative analysis can be done with the Dataframe. 

In [None]:
# df = pd.read_csv('/kaggle/input/news-summary/news_summary.csv', 'latin-1', engine='python')
# df.sample(10)

In [None]:
# print(df.loc[1]['author,date,headlines,read_more,text,ctext'])

In [None]:
# import time

# # Example loop
# for i in tqdm(range(200)):
#     # Perform some task
#     time.sleep(0.5)

In [None]:
def main():
    TRAIN_BATCH_SIZE = 4    # input batch size for training (default: 64)
    VALID_BATCH_SIZE = 2    # input batch size for testing (default: 1000)
    TRAIN_EPOCHS = 4        # number of epochs to train (default: 10)
    VAL_EPOCHS = 2
    LEARNING_RATE = 1e-4    # learning rate (default: 0.01)
    SEED = 42               # random seed (default: 42)
    MAX_LEN = 300
    SUMMARY_LEN = 100
    T5MODEL = "t5-small"

    # Set random seeds and deterministic pytorch for reproducibility
    torch.manual_seed(SEED) # pytorch random seed
    np.random.seed(SEED) # numpy random seed
    torch.backends.cudnn.deterministic = True

    # tokenizer for encoding the text
    tokenizer = T5Tokenizer.from_pretrained(T5MODEL)

    # Importing and Pre-Processing the domain data
    # Selecting the needed columns only. 
    # Adding the 'summarize: ' prefix in front of the article text. This formats the dataset similar to how the T5 model was trained for the summarization task. 
    df = pd.read_csv('../input/news-summary/news_summary.csv', encoding='latin-1')
    df = df[['text','ctext']]
    df.ctext = 'summarize: ' + df.ctext # specify task for t5 transformer (summarization)
    print(df.sample(5))
    
    # Creation of Dataset and Dataloader
    # Defining the train size. So 80% of the data will be used for training and the rest will be used for validation. 
    train_size = 0.8
    train_dataset=df.sample(frac=train_size, random_state = SEED).reset_index(drop=True)
    val_dataset=df.drop(train_dataset.index).reset_index(drop=True)

    print("FULL Dataset: {}".format(df.shape))
    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("TEST Dataset: {}".format(val_dataset.shape))

    # Creating the Training and Validation dataset for further creation of Dataloader
    training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)
    val_set = CustomDataset(val_dataset, tokenizer, MAX_LEN, SUMMARY_LEN)

    # Defining the parameters for creation of dataloaders
    train_params = {
        'batch_size': TRAIN_BATCH_SIZE,
        'shuffle': True,
        'num_workers': 0
    }

    val_params = {
        'batch_size': VALID_BATCH_SIZE,
        'shuffle': False,
        'num_workers': 0
    }

    # Creation of Dataloaders for training and validation. These will be used below in the training and validation stages for the model.
    training_loader = DataLoader(training_set, **train_params)
    val_loader = DataLoader(val_set, **val_params)

    # Defining the model. We are using the model named in T5MODEL, with a Language model layer on top for generation of Summary. 
    # Further this model is sent to device (GPU/TPU) for using the hardware.
    model = T5ForConditionalGeneration.from_pretrained(T5MODEL)
    model = model.to(device)

    # Defining the optimizer that will be used to tune the weights of the network in the training session. 
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

    # Training loop
    # takes approx 4 mins per epoch, 452 iterations (batch size 8)
    print('||| Initiating Fine-Tuning |||')
    for epoch in tqdm(range(TRAIN_EPOCHS)):
        train(epoch, tokenizer, model, device, training_loader, optimizer)

    # takes approx 5 mins per 100 iterations 
    # Validation loop and saving the resulting file with predictions and actuals in a dataframe.
    # Saving the dataframe as predictions.csv
    print('Now generating summaries on our fine tuned model for the validation dataset and saving it in a dataframe')
    for epoch in tqdm(range(VAL_EPOCHS)):
        predictions, actuals = validate(epoch, tokenizer, model, device, val_loader)
        final_df = pd.DataFrame({'Generated Text':predictions,'Actual Text':actuals})
        final_df.to_csv('predictions.csv')
        print('Output Files generated for review')
        
    ######################################
    # add code to save the model here
#     torch.save(NewsSummaryModel().state_dict(), 'NewsSummaryModel.pth')
    # Save the trained model and tokenizer
    print('Saving Model, Tokenizer, Weights')
    basepath = '/kaggle/working'
    model.save_pretrained(basepath)
    tokenizer.save_pretrained(basepath)
#     model.save_weights(basepath)

if __name__ == '__main__':
    main()

In [9]:
# T5MODEL = "t5-small"
# tokenizer = T5Tokenizer.from_pretrained(T5MODEL)
# model = T5ForConditionalGeneration.from_pretrained(T5MODEL)
# model = model.to(device)

# basepath = '/kaggle/working'
# model.save_pretrained(basepath)
# tokenizer.save_pretrained(basepath)
# # model.save_weights(basepath)

('/kaggle/working/spiece.model',
 '/kaggle/working/special_tokens_map.json',
 '/kaggle/working/added_tokens.json')

In [10]:
pd.set_option("display.max_colwidth", 200)
preds = pd.read_csv('/kaggle/working/predictions.csv')
preds.sample(10)

Unnamed: 0.1,Unnamed: 0,Generated Text,Actual Text
323,323,"State Bank of India has ruled out a spike in bad loans following the mega merger that catampulted the country's largest lender into top 50 globally with close to USD 500 billion balancesheet. ""Inc...","State Bank of India Chairperson Arundhati Bhattacharya has said that after the merger with associate banks, the lender has completed a ""mass mission"". ""I don't think anybody in the world would hav..."
32,32,"<extra_id_0>anas, he said. He added that the government has been able to provide support and support for the development of new technologies in the country. ""It'll be very important,"" he added on ...","India has become the world's fifth-largest military spender, witnessing a growth of around 8.5% in expenditure in 2016, a report has claimed. According to the figures released by the Stockholm Int..."
435,435,"the central government has allocated 965 Megahertz spectrum through auction in October, 2016 to various telecom service providers for access services. This will enable the telecom service provider...","Referring to National Association of Software and Services Companies (NASSCOM) and Akamai, the Telecom Ministry on Monday said that Internet users in India will reach 730 million by 2020. Minister..."
863,863,"Saina Nehwal, who was appointed 'Athletes Commission' last year to represent the panel in Badminton World Federation (BWF), would join the Bwf Athlete Commission as representative of IOC AC. She w...","Indian shuttler Saina Nehwal, who became a member of the International Olympic Committee's Athletes' Commission (IOC AC) last year, would be representing the panel in the Badminton World Federatio..."
247,247,"Sunita Kaushik, who was handpicked by the party' CM Manoj Tiwari in an affildvit that she has assets worth 1.35 crore including three residential houses in west Delhi. Kaushi is not an income-tax ...","A BJP candidate for the upcoming MCD polls in Delhi Sunita Kaushik, touted as a slum dweller and the face of the city's urban poor, turned out to be a crorepati. In her election affidavit, Kaushik..."
702,702,former Pakistan leg spinner Danish Kaneria has appealed to the cricket authorities to have his case revisiting by an inquiry tribunal set-up for probe into allegations against Sharjeels Khan and K...,"Former Pakistan leg-spinner Danish Kaneria, serving a life ban for spot-fixing, has appealed to the cricket authorities to have his case revisited by the inquiry tribunal set-up to probe into the ..."
844,844,"the president of Poland's Supreme Court has urged judges to fight for every inch in justice as the rightwing government pushed for changes that criticis say would make judicial independence a ""pur...","The President of Poland's Supreme Court Malgorzata Gersdorf has urged the country's judges to ""fight every inch"" for justice as the ruling party plans to ""democratise"" the way judges are appointed..."
129,129,Delhi metro has ordered 516 coaches (86 trains) to run only on these two lines and cannot be integrated with other existing lines. The trains will run only between Mumbai-Kalkaji Mandir and Botani...,The Delhi Metro Rail Corporation (DMRC) is set to get its first 'driverless' trains on tracks between Noida and Kalkaji in June this year. These trains will run only on two metro lines? Pink (Muku...
381,381,"Congress general secretary Rahul Gandhi on Tuesday met Tamil Nadu farmers protesting at Delhi's Jantar Mantar for over two weeks. ""Neither the government nor PM Modi listen to them (Tamil Nadu far...",Congress Vice-President Rahul Gandhi on Friday met Tamil Nadu farmers protesting at Delhi's Jantar Mantar for over two weeks and accused Prime Minister Narendra Modi of neglecting the farmers whil...
473,473,"students of Sharda Vidyan Mandir in Porbandar, Gujarat were caught in obscene videos after the teacher showed them obscene videos and passed lews comments. Not only did she dance and talked in an ...","A female teacher from Sharda Vidya Mandir in Porbandar, Gujarat, allegedly showed obscene videos to students and danced half-naked in front of them after taking them inside a room. While she threa..."


# Saving Data

In [11]:
import zipfile
import os
from IPython.display import FileLink

def zip_dir(directory = os.curdir, file_name = 'directory.zip'):
    """
    Zip all the files in a directory.
    
    Parameters
    ----------
    directory: str
        directory that needs to be zipped, default is the current working directory
        
    file_name: str
        the name of the zipped file (including .zip), default is 'directory.zip'
        
    Returns
    -------
    A hyperlink (FileLink) that can be used to download the zip file
    """
    os.chdir(directory)
    # The context manager closes the archive, so it is fully written before the link is returned
    with zipfile.ZipFile(file_name, mode='w') as zip_ref:
        for folder, _, files in os.walk(directory):
            for file in files:
                # Skip the archive itself
                if file_name in file:
                    continue
                zip_ref.write(os.path.join(folder, file))

    return FileLink(file_name)

In [12]:
zip_dir()

# Loading and Using Model

In [11]:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the saved model and tokenizer from a specific path
load_directory = '/kaggle/working/'
model = T5ForConditionalGeneration.from_pretrained(load_directory)
tokenizer = T5Tokenizer.from_pretrained(load_directory)

In [21]:
# Example input text
input_text = ""

# Tokenize the input text
input_ids = tokenizer.encode(input_text, truncation=True, padding='longest', return_tensors='pt')

# Generate the summary
summary_ids = model.generate(input_ids, max_length=2048, num_beams=2, early_stopping=True)
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("||| Input text |||\n", input_text)
print("\n||| Generated summary ||| \n", summary_text)

||| Input text |||
 After much wait, the first UDAN flight took off from Shimla today after being flagged off by Prime Minister Narendra Modi.The flight will be operated by Alliance Air, the regional arm of Air India. PM Narendra Modi handed over boarding passes to some of passengers travelling via the first UDAN flight at the Shimla airport.Tomorrow PM @narendramodi will flag off the first UDAN flight under the Regional Connectivity Scheme, on Shimla-Delhi sector.Air India yesterday opened bookings for the first launch flight from Shimla to Delhi with all inclusive fares starting at Rs2,036.THE GREAT 'UDAN'The UDAN (Ude Desh ka Aam Naagrik) scheme seeks to make flying more affordable for the common people, holding a plan to connect over 45 unserved and under-served airports.Under UDAN, 50 per cent of the seats on each flight would have a cap of Rs 2,500 per seat/hour. The government has also extended subsidy in the form of viability gap funding to the operators flying on these routes.