# Problem Statement
## Fine-Tuning T5 for Question Answering using HuggingFace Transformers
This project fine-tunes T5, a Hugging Face Transformers model, to generate answers from a given context and question, as the basis for a context-aware Q&A chatbot.


Topic and code reference: https://analyticsindiamag.com/guide-to-question-answering-system-with-t5-transformer/

# About the T5 Transformer (Hugging Face)

T5 is based on the Transformer architecture, a type of neural network designed to process sequential input efficiently. It consists of an encoder and a decoder, each built from a stack of layers.

Each layer in the encoder and decoder combines attention mechanisms with feedforward networks. The attention mechanisms let the model focus on different parts of the input sequence at different times, while the feedforward networks transform each position's representation using learned weights and biases.

Within a sequence, T5 uses "self-attention": each element of the sequence can attend to every other element. This lets the model capture relationships between words and phrases in the input, which is important for many NLP tasks.
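
To make the idea of self-attention concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention in which the queries, keys, and values are all the raw input vectors. The toy vectors and function names are made up for illustration, and T5's actual layers add learned projections, multiple heads, and other details that are not shown.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        # Scaled dot product of this position against every position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of all input vectors.
    outputs = [
        [sum(w * vec[d] for w, vec in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Three toy 2-dimensional "token" vectors (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_weights, attn_outputs = self_attention(tokens)
```

Each row of `attn_weights` says how much one position draws from every position in the sequence, including itself.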

In addition to the encoder and decoder, T5 includes a "language model head," which predicts the next token in a sequence given the previous tokens. This is important for tasks like translation and text generation, where the model needs to generate coherent and natural-sounding output.
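
As a rough illustration of what the language model head produces, the sketch below turns one score (logit) per vocabulary entry into a probability distribution and picks the most likely next token. The four-word vocabulary and its logits are hypothetical; a real T5 vocabulary has tens of thousands of entries.

```python
import math

def next_token_distribution(logits):
    """Softmax over vocabulary logits: one probability per token."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for a four-entry vocabulary.
toy_logits = {"fraud": 2.0, "card": 0.5, "detection": 1.0, "</s>": -1.0}

probs = next_token_distribution(toy_logits)
greedy_choice = max(probs, key=probs.get)  # greedy decoding: take the top token
```

Greedy selection is only one decoding strategy; the prediction cell later in this notebook uses beam search instead.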

Overall, T5 is a large neural network that was pre-trained on a massive text dataset. Because it casts every task as text-to-text, the same model can be fine-tuned for a wide range of NLP tasks, including question answering.

# Import necessary libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


import torch

from transformers import T5Tokenizer
from transformers import T5ForConditionalGeneration
from torch.optim import AdamW  # the AdamW bundled with transformers is deprecated

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

pl.seed_everything(100)

import warnings
warnings.filterwarnings("ignore")

Global seed set to 100


# Import Dataset

In [2]:
df = pd.read_csv("D:/GeorgeBrownSubjects/DeepLearningMaths/Project2/Data/SQuAD_csv.csv")
df.columns

Index(['Unnamed: 0', 'context', 'question', 'id', 'answer_start', 'text'], dtype='object')

In [3]:
df = df[['context','question', 'text']]

In [4]:
print("Number of records: ", df.shape[0])

Number of records:  86821


In [5]:
df["context"] = df["context"].str.lower()
df["question"] = df["question"].str.lower()
df["text"] = df["text"].str.lower()

df.head()

| | context | question | text |
|---|---|---|---|
| 0 | beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b... | when did beyonce start becoming popular? | in the late 1990s |
| 1 | beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b... | what areas did beyonce compete in when she was... | singing and dancing |
| 2 | beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b... | when did beyonce leave destiny's child and bec... | 2003 |
| 3 | beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b... | in what city and state did beyonce grow up? | houston, texas |
| 4 | beyoncé giselle knowles-carter (/biːˈjɒnseɪ/ b... | in which decade did beyonce become famous? | late 1990s |


In [6]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
INPUT_MAX_LEN = 512 # Input length
OUT_MAX_LEN = 128 # Output Length
TRAIN_BATCH_SIZE = 8 # Training Batch Size
VALID_BATCH_SIZE = 2 # Validation Batch Size
EPOCHS = 5 # Number of training epochs

In [7]:
MODEL_NAME = "t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME, model_max_length= INPUT_MAX_LEN)

In [8]:
print("eos_token: {} and id: {}".format(tokenizer.eos_token, tokenizer.eos_token_id)) # End-of-sequence token (eos_token)
print("unk_token: {} and id: {}".format(tokenizer.unk_token, tokenizer.unk_token_id)) # Unknown token (unk_token)
print("pad_token: {} and id: {}".format(tokenizer.pad_token, tokenizer.pad_token_id)) # Padding token (pad_token)

eos_token: </s> and id: 1
unk_token: <unk> and id: 2
pad_token: <pad> and id: 0


# Dataset Preparation

<hr>

In [9]:
class T5Dataset(torch.utils.data.Dataset):

    def __init__(self, context, question, target):
        self.context = context
        self.question = question
        self.target = target
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN

    def __len__(self):
        return len(self.context)

    def __getitem__(self, item):
        context = str(self.context[item])
        context = " ".join(context.split())

        question = str(self.question[item])
        question = " ".join(question.split())

        target = str(self.target[item])
        target = " ".join(target.split())
        
        
        inputs_encoding = self.tokenizer(
            context,
            question,
            add_special_tokens=True,
            max_length=self.input_max_len,
            padding = 'max_length',
            truncation='only_first',
            return_attention_mask=True,
            return_tensors="pt"
        )
        

        output_encoding = self.tokenizer(
            target,
            None,
            add_special_tokens=True,
            max_length=self.out_max_len,
            padding = 'max_length',
            truncation= True,
            return_attention_mask=True,
            return_tensors="pt"
        )


        inputs_ids = inputs_encoding["input_ids"].flatten()
        attention_mask = inputs_encoding["attention_mask"].flatten()
        labels = output_encoding["input_ids"]

        # Replace pad token ids (0 for T5) with -100 so the loss ignores padded positions
        labels[labels == self.tokenizer.pad_token_id] = -100

        labels = labels.flatten()

        out = {
            "context": context,
            "question": question,
            "answer": target,
            "inputs_ids": inputs_ids,
            "attention_mask": attention_mask,
            "targets": labels
        }


        return out    
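
The dataset class above encodes the context and question as a pair with `truncation='only_first'`, so when the pair is too long only the context is shortened and the question is kept whole. The following is a simplified sketch of that policy on plain token lists. The helper names are hypothetical, and the count of two special tokens (one end-of-sequence marker per segment) is an assumption; the real tokenizer handles these details internally.

```python
def truncate_only_first(first, second, max_len, num_special=2):
    """Trim the first sequence until the pair fits in max_len tokens.

    num_special is the number of special tokens added to the pair
    (assumed to be 2 here: one end-of-sequence marker per segment).
    """
    budget = max_len - len(second) - num_special
    if budget < 0:
        raise ValueError("second sequence alone exceeds max_len")
    return first[:budget], second

def pad_to_length(tokens, max_len, pad="<pad>"):
    """Right-pad a token list to exactly max_len entries."""
    return tokens + [pad] * (max_len - len(tokens))

context_tokens = ["tok%d" % i for i in range(10)]  # stand-in for a long context
question_tokens = ["when", "is", "it", "used", "?"]

kept_context, kept_question = truncate_only_first(
    context_tokens, question_tokens, max_len=12
)
packed = pad_to_length(
    kept_context + ["</s>"] + kept_question + ["</s>"], max_len=12
)

# A short pair is not truncated, only padded.
short_context, short_question = truncate_only_first(["a"], ["b"], max_len=12)
short_packed = pad_to_length(
    short_context + ["</s>"] + short_question + ["</s>"], max_len=12
)
```

Truncating only the context means an answer that sits near the end of a very long passage can be cut off, which is worth keeping in mind when choosing `INPUT_MAX_LEN`.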

# DataLoader



In [10]:
class T5DatasetModule(pl.LightningDataModule):

    def __init__(self, df_train, df_valid):
        super().__init__()
        self.df_train = df_train
        self.df_valid = df_valid
        self.tokenizer = tokenizer
        self.input_max_len = INPUT_MAX_LEN
        self.out_max_len = OUT_MAX_LEN


    def setup(self, stage=None):

        self.train_dataset = T5Dataset(
        context=self.df_train.context.values,
        question=self.df_train.question.values,
        target=self.df_train.text.values
        )

        self.valid_dataset = T5Dataset(
        context=self.df_valid.context.values,
        question=self.df_valid.question.values,
        target=self.df_valid.text.values
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
         self.train_dataset,
         batch_size= TRAIN_BATCH_SIZE,
         shuffle=True, 
         num_workers=4
        )


    def val_dataloader(self):
        return torch.utils.data.DataLoader(
         self.valid_dataset,
         batch_size= VALID_BATCH_SIZE,
         num_workers=1
        )

# Model Building





In [12]:
class T5Model(pl.LightningModule):
    
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

    def forward(self, input_ids, attention_mask, labels=None):

        output = self.model(
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )

        return output.loss, output.logits


    def training_step(self, batch, batch_idx):

        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels= batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)

        
        self.log("train_loss", loss, prog_bar=True, logger=True)

        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["inputs_ids"]
        attention_mask = batch["attention_mask"]
        labels= batch["targets"]
        loss, outputs = self(input_ids, attention_mask, labels)

        self.log("val_loss", loss, prog_bar=True, logger=True)
        
        return loss


    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=0.0001)
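
The loss returned by `forward` is a cross-entropy over the label tokens, and positions whose label is `-100` are skipped, which is why the dataset replaces padding ids with that value. Below is a small pure-Python sketch of such a masked mean cross-entropy. The toy logits and labels are invented, and the function is an illustration of the idea rather than the library's implementation.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    losses = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padded position: contributes nothing to the loss
        peak = max(row)
        # log of the softmax normaliser, computed stably
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)

# Toy example: 3 positions, vocabulary of 3 tokens; the last position is padding.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 5.0]]
toy_labels = [0, 1, IGNORE_INDEX]
loss = masked_cross_entropy(toy_logits, toy_labels)

# Changing the logits at the padded position leaves the loss unchanged.
other_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [9.0, -3.0, 1.0]]
loss_other = masked_cross_entropy(other_logits, toy_labels)
```

Without the mask, the model would also be trained to predict padding, and short answers padded out to `OUT_MAX_LEN` would dominate the average.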

# Model Training


In [12]:
def run():
    
    df_train, df_valid = train_test_split(
        df[0:10000], test_size=0.2, random_state=101
    )
    
    df_train = df_train.fillna("none")
    df_valid = df_valid.fillna("none")
    
    df_train['context'] = df_train['context'].apply(lambda x: " ".join(x.split()))
    df_valid['context'] = df_valid['context'].apply(lambda x: " ".join(x.split()))
    
    df_train['text'] = df_train['text'].apply(lambda x: " ".join(x.split()))
    df_valid['text'] = df_valid['text'].apply(lambda x: " ".join(x.split()))
    
    df_train['question'] = df_train['question'].apply(lambda x: " ".join(x.split()))
    df_valid['question'] = df_valid['question'].apply(lambda x: " ".join(x.split()))

   
    df_train = df_train.reset_index(drop=True)
    df_valid = df_valid.reset_index(drop=True)
    
    dataModule = T5DatasetModule(df_train, df_valid)
    dataModule.setup()

    device = DEVICE
    models = T5Model()
    models.to(device)

    checkpoint_callback  = ModelCheckpoint(
        dirpath="/kaggle/working",
        filename="best_checkpoint",
        save_top_k=2,
        verbose=True,
        monitor="val_loss",
        mode="min"
    )

    trainer = pl.Trainer(
        callbacks=[checkpoint_callback],
        max_epochs=EPOCHS,
        accelerator="gpu",
        devices=1
    )

    trainer.fit(models, dataModule)

run()


Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]
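
To get a feel for how much work this training run involves, the arithmetic below derives the batch counts from the settings used in this notebook (10,000 rows, a 20% validation split, batch sizes 8 and 2, 5 epochs). It is a back-of-the-envelope sketch that assumes the last partial batch is kept and that every training batch triggers one optimizer update (no gradient accumulation); the helper names are hypothetical.

```python
import math

def split_sizes(n_rows, valid_percent):
    """Row counts for a train/validation split, rounding the validation share up."""
    n_valid = (n_rows * valid_percent + 99) // 100  # integer ceiling division
    return n_rows - n_valid, n_valid

def batches_per_epoch(n_rows, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_rows / batch_size)

n_train, n_valid = split_sizes(10_000, 20)
train_steps = batches_per_epoch(n_train, 8)  # TRAIN_BATCH_SIZE
valid_steps = batches_per_epoch(n_valid, 2)  # VALID_BATCH_SIZE
total_updates = train_steps * 5              # EPOCHS

print(n_train, n_valid, train_steps, valid_steps, total_updates)
# prints: 8000 2000 1000 1000 5000
```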

# Model Prediction

<hr>

In [13]:
train_model = T5Model.load_from_checkpoint("/kaggle/working/best_checkpoint-v1.ckpt")

train_model.freeze()

def generate_question(context, question):
    # Despite its name, this function generates the answer to `question` from `context`.

    inputs_encoding =  tokenizer(
        context,
        question,
        add_special_tokens=True,
        max_length= INPUT_MAX_LEN,
        padding = 'max_length',
        truncation='only_first',
        return_attention_mask=True,
        return_tensors="pt"
        )

    
    generate_ids = train_model.model.generate(
        input_ids = inputs_encoding["input_ids"],
        attention_mask = inputs_encoding["attention_mask"],
        max_length = OUT_MAX_LEN,  # cap on the generated answer length
        num_beams = 4,
        num_return_sequences = 1,
        no_repeat_ngram_size=2,
        early_stopping=True,
        )

    preds = [
        tokenizer.decode(gen_id,
        skip_special_tokens=True, 
        clean_up_tokenization_spaces=True)
        for gen_id in generate_ids
    ]

    return "".join(preds)
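
The `no_repeat_ngram_size=2` setting in the generation call prevents any two-token sequence from appearing twice in the output. As a rough sketch of the idea (not the library's code), the hypothetical helper below lists the tokens that would be blocked at the next step because they would complete an n-gram that has already been generated.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < ngram_size:
        return set()  # too short to contain any complete n-gram yet
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

# "credit card" was already produced, so after a second "credit"
# the token "card" may not follow again.
blocked = banned_next_tokens(["credit", "card", "fraud", "credit"], 2)
nothing_blocked = banned_next_tokens(["credit", "card"], 2)
```

During decoding, blocked tokens are given no chance of being selected, which curbs the repetitive loops that beam search can otherwise fall into.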



## Prediction

In [14]:
context = "Clustering groups of similar cases, for example, \
can find similar patients, or can be used for customer segmentation in the \
banking field. Association technique is used for finding items or events that \
often co-occur, for example, grocery items that are usually bought together \
by a particular customer. Anomaly detection is used to discover abnormal \
and unusual cases, for example, it is used for credit card fraud detection."

que = "what is the example of Anomaly detection?"

print(generate_question(context, que))


credit card fraud detection


In [16]:
context = "Classification is used when your target is categorical, while regression is used when your target variable \
is continuous. Both classification and regression belong to the category of supervised machine learning algorithms."

que = "When is classification used?"

print(generate_question(context, que))

when your target is categorical


# Results

The fine-tuned model answers both example questions correctly from the supplied context.
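
The two spot checks above are qualitative. One common way to quantify answer quality on question answering data is SQuAD-style exact match and token-level F1, sketched below in pure Python. This notebook does not compute these metrics; the normalisation rules follow the usual SQuAD convention, and the example prediction and reference strings are hypothetical.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """True when the normalised strings are identical."""
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalisation."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction/reference pairs.
em = exact_match("credit card fraud detection", "Credit card fraud detection.")
f1 = token_f1("fraud detection", "credit card fraud detection")
```

Averaging both scores over the held-out validation split would give a more reliable picture than a handful of hand-written examples.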

# Resources

Data: https://www.kaggle.com/datasets/rtatman/questionanswer-dataset

T5 resource: https://huggingface.co/transformers/v3.0.2/model_doc/t5.html

Code and topic reference from this youtube video: https://www.youtube.com/watch?v=r6XY80Z9eSA