# Fine Tuning Transformer for Question Answering

### Introduction

In this tutorial we will be fine tuning a transformer model for the **Question Answering** problem.

#### Flow of the notebook

The notebook is divided into separate sections to provide an organized walkthrough of the process used. This process can be modified for individual use cases. The sections are:

1. [Importing Python Libraries and preparing the environment](#section01)
2. [Importing and Pre-Processing the domain data](#section02)
3. [Preparing the Dataset](#section03)
4. [Preparing the DataLoader and DataCollator](#section04)
5. [Creating the Neural Network for Fine Tuning](#section05)
6. [Fine Tuning the Model](#section06)
7. [Validating the Model Performance](#section07)
8. [Saving the model and artifacts for Inference in Future](#section08)

#### Data

 - Data:
	 - We are using the CoQA dataset as shared via Hugging Face: [COQA](https://huggingface.co/datasets/coqa)
	 - It can be accessed through the Hugging Face Datasets API
	 - Each row of the data has the following columns:
		 - Source
		 - `story` > Context of the Question
		 - `questions` > Several questions about the context
		 - `answers` > Answers aligned with questions

 - Language Model Used:
	 - T5 is used for this project.
	 - [Research Paper](https://arxiv.org/pdf/1910.10683.pdf)
     - [Documentation for python](https://huggingface.co/docs/transformers/model_doc/t5)

---

 - Hardware Requirements:
	 - Python 3.6 and above
	 - Pytorch, Transformers and All the stock Python ML Libraries
	 - GPU enabled setup


 - Script Objective:
	 - The objective of this script is to fine tune T5 to answer a question given a context.

<a id='section01'></a>
### Importing Python Libraries and preparing the environment

At this step we will be importing the libraries and modules needed to run our script. Libraries are:
* Pandas
* Pytorch
* Pytorch Utils for Dataset and Dataloader
* Transformers and Datasets (from huggingface)
* T5 Model and Tokenizer

After that we will prepare the device for GPU execution. This configuration is needed if you want to use the onboard GPU.

*I have included the code for TPU configuration, but commented it out. If you plan to use the TPU, please comment out the GPU code and uncomment the TPU lines to install the packages and define the device.*

In [1]:
# Installing the transformers library and the additional libraries used in this notebook

!pip install -q transformers datasets

# Code for TPU packages install
# !curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
# !python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev



In [1]:
# Importing stock ml libraries
import numpy as np
import pandas as pd
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset
from torch import nn
import os
# Allocator options must be set before the first CUDA call. Assigning the variable
# twice keeps only the last value, so a single setting is used here.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:21'


# Preparing for TPU usage
# import torch_xla
# import torch_xla.core.xla_model as xm
# device = xm.xla_device()



In [2]:
# # Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device

'cuda'

<a id='section02'></a>
### Importing and Pre-Processing the domain data

We will be working with the data and preparing for fine tuning purposes. We use the `load_dataset` function from the datasets library to import the COQA train and validation split.

In [3]:
train_dataset = load_dataset('coqa', split="train")
valid_dataset = load_dataset('coqa', split="validation")

<a id='section03'></a>
### Preparing the Dataset

We will start by defining a few key variables that will be used later during the training/fine tuning stage.
This is followed by the creation of the CustomDataset class, which defines how the text is split and preprocessed before it is sent to the neural network.
Dataset and DataLoader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and DataLoader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html)

#### *CustomDataset* Dataset Class
- This class accepts the `dataset`, the `at_random` flag and `max_len` as input and generates the tokenized inputs and targets that the T5 model uses for training.

In [4]:
# @title Customize your key variables here
# Sections of config

# Defining some key variables that will be used later on in the training
MAX_LEN = 0 # @param {type:"integer"}
TRAIN_BATCH_SIZE = 1 # @param {type:"integer"}
VALID_BATCH_SIZE = 1 # @param {type:"integer"}
EPOCHS = 1 # @param {type:"integer"}
LEARNING_RATE = 1e-5 # @param {type:"number"}

#### First coding exercise
Define the key CustomDataset `__getitem__` as follows:

1.  The story is the context for the transformer, but it lacks a question.
2.  Select a random question from the questions set of the row. If `self.at_random` is `False` select the first question.
3.  Append the question to the context.
4.  Assign the output text as the corresponding answer to the selected question.
---
Note: Later we will preprocess the texts within a batch!
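The four steps above can be sketched on a single row, before any tokenization happens. This is a minimal illustration and not the notebook's reference solution: `build_qa_pair` and the sample row are hypothetical, and the row only mimics the fields (`story`, `questions`, `answers['input_text']`) that the dataset class reads.

```python
import random


def build_qa_pair(row, at_random=False, rng=random):
    """Return the encoder and decoder texts for one question of a row.

    `row` is assumed to look like a COQA record: a `story` string, a list of
    `questions`, and an `answers` dict whose `input_text` list is aligned
    with the questions.
    """
    questions = row["questions"]
    answers = row["answers"]["input_text"]
    # Step 2: pick a random question, or the first one when at_random is False
    position = rng.randrange(len(questions)) if at_random else 0
    # Step 3: append the question to the context, separated by a space
    encoder_text = " ".join((row["story"] + " " + questions[position]).split())
    # Step 4: the output text is the answer aligned with that question
    decoder_text = " ".join(str(answers[position]).split())
    return {"encoder_text": encoder_text, "decoder_text": decoder_text}


# Hypothetical row, hand-written for illustration only
sample_row = {
    "story": "Ana has a red   bike. She rides it to school.",
    "questions": ["What color is the bike?", "Where does she ride it?"],
    "answers": {"input_text": ["red", "to school"]},
}

pair = build_qa_pair(sample_row)
print(pair["encoder_text"])
print(pair["decoder_text"])
```

Passing a seeded `random.Random` instance as `rng` makes the random choice reproducible, which is convenient when debugging.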


In [5]:
tokenizer = AutoTokenizer.from_pretrained('t5-base')

In [6]:
import random

class CustomDataset(Dataset):
    def __init__(self, dataset, at_random=False, max_len=512):
        self.dataset = dataset
        self.at_random = at_random
        self.tokenizer = AutoTokenizer.from_pretrained('t5-base')
        self.max_len = max_len

    def __len__(self):
        return len(self.dataset)

    def _encode(self, text):
        # Collapse repeated whitespace, then tokenize to a fixed length of max_len
        return self.tokenizer(
            " ".join(text.split()),
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True
        )

    def __getitem__(self, index):
        data_row = self.dataset[index]
        questions = data_row['questions']
        answers = data_row['answers']['input_text']
        # Select a random question/answer pair, or the first pair if at_random is False
        position = random.randrange(len(questions)) if self.at_random else 0
        # Append the question to the story, separated by a space
        inputs = self._encode(data_row['story'] + " " + questions[position])
        outputs = self._encode(str(answers[position]))

        encoder_ids = inputs['input_ids']
        encoder_mask = inputs['attention_mask']
        # The targets are the tokenized answer. T5 starts decoding from the pad token,
        # so the decoder input is the target sequence shifted one position to the right.
        decoder_outs = outputs['input_ids']
        decoder_outs_mask = outputs['attention_mask']
        decoder_ids = [self.tokenizer.pad_token_id] + decoder_outs[:-1]
        decoder_mask = [1] + decoder_outs_mask[:-1]

        return {
            'encoder_ids': torch.tensor(encoder_ids, dtype=torch.long),
            'encoder_mask': torch.tensor(encoder_mask, dtype=torch.long),
            'decoder_input_ids': torch.tensor(decoder_ids, dtype=torch.long),
            'decoder_input_mask': torch.tensor(decoder_mask, dtype=torch.long),
            # Token ids must be integers; the mask is boolean so it can index the logits
            'targets': torch.tensor(decoder_outs, dtype=torch.long),
            'target_mask': torch.tensor(decoder_outs_mask, dtype=torch.bool)
        }

<a id='section04'></a>
### Preparing the DataLoader and DataCollator

The DataLoader groups items of the CustomDataset into batches. We may use a collator function to describe exactly how the batch is built. The DataCollator is another custom class that will describe the batch.
Inside it we will extract the data from each row, tokenize it and build the batch with the input ids and mask as well as the output ids and mask.

### The DataCollator
We define this as an object that contains the tokenizer among other relevant parameters for the text batching. It will receive several rows of dictionaries of {`encoder_text`, `decoder_text`}. Using these pairs we are meant to build a dictionary with {`encoder_ids`, `encoder_mask`, `decoder_input_ids`, `decoder_input_mask`, `targets`}.

#### Second coding exercise
Define the key DataCollator `__call__` as follows:

1.  Tokenize the `encoder_text` and `decoder_text`
2.  Displace the `decoder_input_ids` and `decoder_input_mask` to create the `targets`
3.  Return the dictionary!

The training dataset and validation dataset are already defined, complete the following code.

When the datasets are split, declare the `training_set` and `testing_set` variables with your `CustomDataset` data class.
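One possible shape for the collator described above is sketched here. It is an illustration under stated assumptions rather than the expected solution: `QACollator`, its `encode` argument and the toy whitespace vocabulary are hypothetical, and the collator returns plain Python lists so that it runs without downloading a tokenizer. It also returns a `target_mask`, which the list above does not mention but which the loss needs in order to skip padding.

```python
class QACollator:
    """Turn rows of {encoder_text, decoder_text} into one padded batch."""

    def __init__(self, encode, pad_id=0, max_len=512):
        self.encode = encode    # callable: text -> list of token ids ending in EOS
        self.pad_id = pad_id    # also used as the decoder start token, as T5 does
        self.max_len = max_len

    def _pad(self, sequences):
        # Truncate, then right-pad every sequence to the longest one in the batch
        sequences = [seq[: self.max_len] for seq in sequences]
        width = max(len(seq) for seq in sequences)
        ids = [seq + [self.pad_id] * (width - len(seq)) for seq in sequences]
        mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
        return ids, mask

    def __call__(self, rows):
        # 1. Tokenize the encoder and decoder texts
        encoder_ids, encoder_mask = self._pad([self.encode(r["encoder_text"]) for r in rows])
        targets, target_mask = self._pad([self.encode(r["decoder_text"]) for r in rows])
        # 2. Shift the targets one step to the right to obtain the decoder inputs
        decoder_input_ids = [[self.pad_id] + seq[:-1] for seq in targets]
        decoder_input_mask = [[1] + m[:-1] for m in target_mask]
        # 3. Return the dictionary
        return {
            "encoder_ids": encoder_ids,
            "encoder_mask": encoder_mask,
            "decoder_input_ids": decoder_input_ids,
            "decoder_input_mask": decoder_input_mask,
            "targets": targets,
            "target_mask": target_mask,
        }


# Hypothetical whitespace tokenizer that grows its vocabulary on the fly
vocab = {"<pad>": 0, "</s>": 1}

def toy_encode(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()] + [vocab["</s>"]]


rows = [
    {"encoder_text": "the bike is red what color", "decoder_text": "red"},
    {"encoder_text": "she rides where", "decoder_text": "to school"},
]
batch = QACollator(toy_encode)(rows)
print(batch["decoder_input_ids"])
print(batch["targets"])
```

In the notebook, `encode` would wrap the T5 tokenizer, the lists would be converted with `torch.tensor`, and the collator object would be handed to `DataLoader` through its `collate_fn` argument. Padding to the longest sequence in the batch, instead of always padding to `max_len`, keeps short batches small.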

In [7]:
# Creating the dataset and dataloader for the neural network
training_set = CustomDataset(train_dataset,True)
testing_set = CustomDataset(valid_dataset,True)

train_params = {'batch_size': 10,
                'shuffle': True,
                'num_workers': 0,
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0,
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id='section05'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We will be creating a neural network with the `AutoModel`.
 - This network will have the `T5` model, followed by a `Dropout` and a `Linear` layer. They are added for the purpose of **Regularization** and **Classification** respectively.

#### Loss Function and Optimizer
 - The Loss is defined in the next cell as `loss_fn`.
 - The loss function used is Categorical Cross Entropy, which is implemented as [CrossEntropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) in PyTorch. It is computed only over the target positions selected by the mask, so padding does not contribute.
 - `Optimizer` is defined in the next cell.
 - `Optimizer` is used to update the weights of the neural network to improve its performance.
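To make the masking concrete, here is a small sketch of cross entropy restricted to non-padding positions. The function name `masked_cross_entropy`, the tensor shapes and the random inputs are assumptions chosen for illustration; only `torch.nn.functional.cross_entropy` and boolean indexing are standard PyTorch.

```python
import torch
import torch.nn.functional as F


def masked_cross_entropy(logits, targets, mask):
    """Cross entropy over the positions where mask is 1.

    logits:  (batch, seq_len, vocab) scores from the linear layer
    targets: (batch, seq_len) token ids
    mask:    (batch, seq_len) 1 for real tokens, 0 for padding
    """
    keep = mask.bool()
    # Boolean indexing flattens the kept positions: (n_kept, vocab) and (n_kept,)
    return F.cross_entropy(logits[keep], targets[keep])


torch.manual_seed(0)
logits = torch.randn(2, 4, 5)                        # toy batch, vocabulary of 5
targets = torch.tensor([[1, 2, 0, 0], [3, 4, 1, 0]])
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 0]])

loss = masked_cross_entropy(logits, targets, mask)

# Changing the scores at padded positions leaves the loss untouched
noisy = logits.clone()
noisy[0, 2:, 0] += 100.0
print(torch.allclose(loss, masked_cross_entropy(noisy, targets, mask)))
```

The mask has to be boolean when it is used as an index. An integer tensor of zeros and ones would be interpreted as row indices instead of a selection.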

#### Further Reading
- You can refer to my [Pytorch Tutorials](https://github.com/abhimishra91/pytorch-tutorials) to get an intuition of Loss Function and Optimizer.
- [Pytorch Documentation for Loss Function](https://pytorch.org/docs/stable/nn.html#loss-functions)
- [Pytorch Documentation for Optimizer](https://pytorch.org/docs/stable/optim.html)
- Refer to the links provided at the top of the notebook to read more about the `T5` model.

#### Third coding exercise
Initialize the `T5Class` with three layers: the T5 transformer (`AutoModel.from_pretrained('t5-base')`), a dropout (`nn.Dropout`) and a dense layer (`nn.Linear`). Once the model is declared, you have to code the forward pass, detailing how the neural modules are connected.

In [15]:
# Creating the customized model by adding a dropout and a dense layer on top of T5 to get the final output for the model.


class T5Class(torch.nn.Module):
    def __init__(self, vocabulary):
        super(T5Class, self).__init__()
        self.dropout = 0.1
        self.hidden_embd = 768
        self.output_layer = vocabulary
        # Declare the layers here
        self.l1 = AutoModel.from_pretrained('t5-base')
        self.l2 = torch.nn.Dropout(self.dropout)
        self.l3 = torch.nn.Linear(self.hidden_embd, self.output_layer)

    def forward(self, enc_ids, enc_mask, dec_ids, dec_mask):
        # Use the transformer, then the dropout and the linear in that order.
        output_1= self.l1(input_ids=enc_ids,attention_mask=enc_mask,decoder_input_ids=dec_ids,decoder_attention_mask=dec_mask).last_hidden_state
        output_2 = self.l2(output_1)
        output = self.l3(output_2)
        return output

model = T5Class(vocabulary=tokenizer.vocab_size)
model.to('cpu')


T5Class(
  (l1): T5Model(
    (shared): Embedding(32128, 768)
    (encoder): T5Stack(
      (embed_tokens): Embedding(32128, 768)
      (block): ModuleList(
        (0): T5Block(
          (layer): ModuleList(
            (0): T5LayerSelfAttention(
              (SelfAttention): T5Attention(
                (q): Linear(in_features=768, out_features=768, bias=False)
                (k): Linear(in_features=768, out_features=768, bias=False)
                (v): Linear(in_features=768, out_features=768, bias=False)
                (o): Linear(in_features=768, out_features=768, bias=False)
                (relative_attention_bias): Embedding(32, 12)
              )
              (layer_norm): T5LayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (1): T5LayerFF(
              (DenseReluDense): T5DenseActDense(
                (wi): Linear(in_features=768, out_features=3072, bias=False)
                (wo): Linear(in_features=3072, out_features=768, 

In [16]:
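def loss_fn(outputs, targets, mask):
    # Score only the non-padding target positions; a boolean mask is required for indexing
    mask = mask.bool()
    return nn.CrossEntropyLoss()(outputs[mask], targets[mask])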

In [17]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

<a id='section06'></a>
### Fine Tuning the Model

Here we define a training function that trains the model on the training dataset created above for the specified number of epochs (`EPOCHS`). An epoch is one complete pass of the training data through the network.

The following events happen in this function to fine tune the neural network:
- The dataloader passes data to the model based on the batch size.
- The output from the model and the target tokens are compared to calculate the loss.
- Loss value is used to optimize the weights of the neurons in the network.
- After every 5000 steps the loss value is printed in the console.

#### Last coding exercise
Now you have to code the training setup:

1.   Zero-out the gradients with `optimizer.zero_grad()`
2.   Get a batch of data (`encoder_ids`, `encoder_mask`, `decoder_input_ids`, `decoder_input_mask`, `targets`) and move it to gpu with `.to`
3.   Compute outputs
4.   Compute the loss using `loss_fn` earlier declared.
5.   Make a backward pass with `loss.backward()`
6.   Make the optimizer move forward with `optimizer.step()`



In [18]:
def train(epoch):
    model.train()
    # Keep every tensor on the same device as the model
    model_device = next(model.parameters()).device
    for index, data in enumerate(training_loader):
        optimizer.zero_grad()
        encoder_ids = data['encoder_ids'].to(model_device, dtype=torch.long)
        encoder_mask = data['encoder_mask'].to(model_device, dtype=torch.long)
        decoder_input_ids = data['decoder_input_ids'].to(model_device, dtype=torch.long)
        decoder_input_mask = data['decoder_input_mask'].to(model_device, dtype=torch.long)
        targets = data['targets'].to(model_device, dtype=torch.long)
        target_mask = data['target_mask'].to(model_device, dtype=torch.bool)
        output = model(encoder_ids, encoder_mask, decoder_input_ids, decoder_input_mask)
        loss = loss_fn(output, targets, target_mask)
        if index % 5000 == 0:
            print(f'Epoch: {epoch}, Step: {index}, Loss: {loss.item()}')
        loss.backward()
        optimizer.step()

In [None]:
for epoch in range(EPOCHS):
    train(epoch)

0


<a id='section07'></a>
### Validating the Model

During the validation stage we pass the unseen data (Testing Dataset) to the model. This step determines how well the model performs on unseen data.

To get a measure of our model's performance we are using the following metrics.
- Accuracy Score
- F1 Micro
- F1 Macro

**Extract the data and compute the outputs as you did on the training step!**

In [None]:
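def validation(epoch):
    model.eval()
    model_device = next(model.parameters()).device
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            encoder_ids = data['encoder_ids'].to(model_device, dtype=torch.long)
            encoder_mask = data['encoder_mask'].to(model_device, dtype=torch.long)
            decoder_input_ids = data['decoder_input_ids'].to(model_device, dtype=torch.long)
            decoder_input_mask = data['decoder_input_mask'].to(model_device, dtype=torch.long)
            targets = data['targets'].to(model_device, dtype=torch.long)
            target_mask = data['target_mask'].to(model_device, dtype=torch.bool)
            outputs = model(encoder_ids, encoder_mask, decoder_input_ids, decoder_input_mask)
            # Keep the most likely token at each position, then drop the padding positions
            predictions = outputs.argmax(dim=-1)
            fin_targets.extend(targets[target_mask].cpu().tolist())
            fin_outputs.extend(predictions[target_mask].cpu().tolist())
    return fin_outputs, fin_targets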

Compute accuracy and f1_score here.

In [None]:
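for epoch in range(EPOCHS):
    outputs, targets = validation(epoch)
    # Token-level scores over the non-padding target positions
    accuracy = metrics.accuracy_score(targets, outputs)
    f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
    f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
    print(f"Accuracy Score = {accuracy}")
    print(f"F1 Score (Micro) = {f1_score_micro}")
    print(f"F1 Score (Macro) = {f1_score_macro}")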

<a id='section08'></a>
### Saving the Trained Model Artifacts for inference

This is the final step in the process of fine tuning the model.

The model and its vocabulary are saved locally. These files are then used in the future to run inference on new questions and contexts.

Please remember that a trained neural network is only useful when used in actual inference after its training.
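A minimal sketch of saving and restoring the learned weights follows. It uses a tiny `nn.Linear` as a stand-in for the fine-tuned network and a temporary directory, and the file name `qa_model.bin` is hypothetical. In the notebook the same calls would be applied to `model`, and the vocabulary could be stored next to the weights with the tokenizer's `save_pretrained` method.

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in for the fine-tuned network; in the notebook this would be `model`
toy_model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as output_dir:
    weights_path = os.path.join(output_dir, "qa_model.bin")  # hypothetical file name
    # Save only the learned parameters, not the whole Python object
    torch.save(toy_model.state_dict(), weights_path)

    # Later: rebuild the same architecture and load the parameters back
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(weights_path))
    restored.eval()  # switch off dropout before running inference

# Every restored tensor should match the original exactly
same = all(
    torch.equal(a, b)
    for a, b in zip(toy_model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

Saving the `state_dict` instead of the pickled module means the class definition (here `T5Class`) must be available when the weights are loaded, but the file stays independent of the notebook's code layout.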

Below you can try to generate an answer with the fine-tuned model!
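One simple way to generate an answer is greedy decoding: start the decoder from the start token, ask the model for the scores of the next token, keep the best one, and repeat until the end-of-sequence token appears. The sketch below shows only that loop. `greedy_decode`, `next_token_scores` and the toy scorer are hypothetical; the default ids assume the T5 tokenizer, where the pad token (id 0) starts the decoder and `</s>` (id 1) ends the sequence.

```python
def greedy_decode(next_token_scores, start_id=0, eos_id=1, max_new_tokens=20):
    """Generate token ids one at a time, always taking the highest score.

    `next_token_scores` is a hypothetical callable that maps the decoder ids
    produced so far to a list of scores, one per vocabulary entry. With the
    model above it would run a forward pass and return the scores of the
    last decoder position.
    """
    decoder_ids = [start_id]            # T5 starts decoding from the pad token
    for _ in range(max_new_tokens):
        scores = next_token_scores(decoder_ids)
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == eos_id:              # stop once end-of-sequence is predicted
            break
        decoder_ids.append(best)
    return decoder_ids[1:]              # drop the start token


# Toy scorer for illustration: it "answers" with ids 5, 7 and then EOS
script = [5, 7, 1]

def toy_scores(decoder_ids, vocab_size=10):
    wanted = script[len(decoder_ids) - 1]
    return [1.0 if token == wanted else 0.0 for token in range(vocab_size)]


print(greedy_decode(toy_scores))  # prints [5, 7]
```

With the real model, the generated ids would be turned back into text with `tokenizer.decode(ids, skip_special_tokens=True)`.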