Question answering comes in many forms. In this example, we’ll look at the particular type of extractive QA that involves answering a question about a passage by highlighting the segment of the passage that answers the question. This involves fine-tuning a model which predicts a start position and an end position in the passage. We will use the Stanford Question Answering Dataset (SQuAD) 2.0.


## Prerequisites: 

1. Download and install the required libraries below.
2. Import the required libraries

In [1]:
!pip install torch==1.7.1
!pip install pytorch-lightning==1.1.2
!pip install transformers==4.1.1
!pip install sentencepiece==0.1.94
!pip install wandb==0.10.12





In [2]:
import torch
import transformers as tfs
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.metrics import Accuracy
from pytorch_lightning import loggers as pl_loggers
import json
from pathlib import Path
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm
from torch.utils.tensorboard import SummaryWriter
from transformers import AdamW, DistilBertForQuestionAnswering, DistilBertTokenizerFast
import string, re

# 1. Data Understanding

In this section we will import the data & convert it correctly into paralell lists of contexts, questions and answers provided in the SQuAD 2.0 Dataset. 

## **Download SQuAD 2.0 Data**

Note : This dataset can be explored in the Hugging Face model hub (SQuAD V2), and can be alternatively downloaded with the 🤗 NLP library with load_dataset("squad_v2").

In [3]:
## Create a squad directory and download the train and evaluation datasets directly into the library
# !mkdir squad
# !wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O squad/train-v2.0.json
# !wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O squad/dev-v2.0.json

Below we will import the data and convert it into parallel lists of contexts, questions, and answers.

In [4]:
def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    combined_qac=[] #combined contexts, questions & answers
    counter=0
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']  
            for qa in passage['qas']:
                question = qa['question']
                q_answers = qa['answers'].copy()
                q_answers = list(map(lambda x:x['text'], q_answers))
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)
                    combined_qac.append({'context':context,'question':question,'answers':q_answers})
    return contexts, questions, answers, combined_qac

train_contexts, train_questions, train_answers,train_qac = read_squad('squad/train-v2.0.json')
val_contexts, val_questions, val_answers, val_qac = read_squad('squad/dev-v2.0.json')

Now that we have converted the data into parallel lists, let us assess what the dataset holds.

In [5]:
len(train_contexts)

86821

In [6]:
train_contexts[0]

'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'

In [7]:
train_questions[0]

'When did Beyonce start becoming popular?'

In [8]:
train_answers[0]

{'text': 'in the late 1990s', 'answer_start': 269}

In [9]:
len(train_qac)

86821

In [10]:
train_qac[0]

{'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': ['in the late 1990s']}

Inspecting Validation Data

In [11]:
len(val_contexts)

20302

In [12]:
val_contexts[0]

'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'

In [13]:
val_questions[0]

'In what country is Normandy located?'

In [14]:
val_answers[0]

{'text': 'France', 'answer_start': 159}

In [15]:
val_qac[0]

{'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
 'question': 'In what country is Normandy located?',
 'answers': ['France', 'France', 'France', 'France']}

## Observations:

 - We have successfully created 3 subsets of both the training and validation sets
 - We gathered the following stats:
     - **Training Data**
         - Length: 86821
         - The combined_qac shows the way things will work, i.e.: We submit a context & a question to the model & receive the answer already highlighted
         - train_answers shows the answer for a particular question and the start index value
     - **Validation Data**
         - Length: 20302
         - Similar to the train_qac we have created a val_qac to understand the validation dataset better as well

# 2. Data Processing

In this section we will prepare the data appropriately for modelling and training. 

We will extract token positions where answers begins & ends for train & validation data.


The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which token positions the answer begins and ends.

First, let’s get the character position at which the answer ends in the passage (we are given the starting position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

In [16]:
## Index the answers and contexts in the training and validation sets. This will help us generate the tokens 
## and help get better answers for our questions
def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        # sometimes squad answers are off by a character or two – fix this
        if context[start_idx:end_idx] == gold_text:
            answer['answer_end'] = end_idx
        elif context[start_idx-1:end_idx-1] == gold_text:
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1     # When the gold label is off by one character
        elif context[start_idx-2:end_idx-2] == gold_text:
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2     # When the gold label is off by two characters

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

In [17]:
## Initialize a tokenizer using DistilBERT which will help us tokenize our training questions and answers

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

## obtain encoded training and validation sets from the tokenizer
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

In [18]:
## Create a function to add token positions
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)

### Observations

In this section we have successfully taken the split datasets and:

    - Added end index postions which helps us identify the correct end values for an answer in a particular
      text piece.
    - Tokenized our data using DistilBert 
    - Added token positions to the start and end of the answers based on their encoded positions.

Our data is ready. Let’s just put it in a PyTorch dataset so that we can easily use it for training. In PyTorch, we define a custom Dataset class.

# 3. Train & Validation Dataset Creation

In [19]:
## Creating the taining and validation datasets using the encoded training and validation sets we created in
## the section above
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

### Observations

Our training and validation sets have been successully created. We will now use these to train, validate and score our model below.

# 4. Model Building & Training


In [20]:
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

### Observations
We have created a basic model using DistilBert. This model is still not trained. We shall train and validate this model using our available compute unit. 

In [21]:
# Training the created model using the available cuda gpu or cpu
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device) # send the model to the available device for training.
model.train()

train_dataloader = DataLoader(train_dataset, batch_size=12, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(val_dataset,batch_size=8,shuffle=False)

optim = AdamW(model.parameters(), lr=5e-5)


### Observations

We have uploaded the base model to our compute device. This helps in faster access to the model, model inputs & outputs while it is being trained. 

We have also defined our training and validation data loaders with a batch size of 8. A batch size of 8 splits the training into chunks of 8 & uses those chunks for processing together. 

We chose the AdamW optimizer because it allows us to handle sparse gradient on highly noisy datasets. 


This was chosen due to compute limitations. A notebook with the batch size of 16 can be found here: https://colab.research.google.com/drive/1dNpiCmNmAKUm8tL3wkW89oV7RxJtwdC1?usp=sharing

In [22]:
# Removing articles and punctuation, and standardizing whitespace are all typical text processing steps


def normalize_text(s):

    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


In [23]:
# Function to compute the exact match for an answer.
# This will help us determine how accurately do our answers match with the suggested answers
def compute_exact_match(prediction, truth):
    return int(normalize_text(prediction) == normalize_text(truth))


In [24]:
# Function to compute the F1 Statistic 

def compute_f1(prediction, truth):
    pred_tokens = normalize_text(prediction).split()
    truth_tokens = normalize_text(truth).split()
    
    # if either the prediction or the truth is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return int(pred_tokens == truth_tokens)
    
    common_tokens = set(pred_tokens) & set(truth_tokens)
    
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0
    
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(truth_tokens)
    
    return 2 * (prec * rec) / (prec + rec)


In [25]:
# Function to calculate exact match and exact F1 score for a particular training epoch
def calculate_stats(input_ids,start,end,idx):
    batch_start = 8*idx
    batch_end = batch_start+8
    data = val_qac[batch_start:batch_end]
    em = 0
    ef1 = 0
    for i,d in enumerate(data):
        answer_start = start[i]
        answer_end = end[i]
        answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[i][answer_start:answer_end]))
        gold_ans = d['answers']
        if len(gold_ans)==0:
            gold_ans.append("")
        em_s= max((compute_exact_match(answer, g_answer)) for g_answer in gold_ans)
        ef1_s = max((compute_f1(answer, g_answer)) for g_answer in gold_ans)
        em+=em_s
        ef1+=ef1_s
    return em,ef1

### Observations

Above we have created a few functions that will help us with validation better. 

The normalize text function allows us to create a uniform text format. It removes punctuations, fixes whitespaces, removes articles and converts everything to lower text. This helps in ensuring that the input and outputs match properly.

The calculate stats function returns the exact match and F1 scores for each item we train on. This helps us identiy how well our model is performing. 

In [None]:
# Train for the model, perform validation on it per epoch and generate files for a tensorboard
num_epochs = 20

writer = SummaryWriter()

for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch, num_epochs - 1))
    print('-' * 10)
    model.train()
    running_loss = 0.0
    tk0 = tqdm(train_dataloader, total=int(len(train_dataloader)))    
    counter = 0
    for idx,batch in enumerate(tk0):
        optim.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs[0]
        loss.backward()
        optim.step()
        running_loss += loss.item() *  batch['input_ids'].size(0)
        counter += 1
        tk0.set_postfix(loss=(running_loss / (counter * train_dataloader.batch_size)))
    epoch_loss = running_loss / len(train_dataloader)
    writer.add_scalar('Train/Loss', epoch_loss,epoch)
    print('Training Loss: {:.4f}'.format(epoch_loss))

    model.eval()
    running_val_loss=0
    running_val_em=0
    running_val_f1=0
    tk1 = tqdm(val_dataloader, total=int(len(val_dataloader)))  
    for idx,batch in enumerate(tk1):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        running_val_loss += loss.item() *  batch['input_ids'].size(0)
        counter += 1
        tk1.set_postfix(loss=(running_loss / (counter * val_dataloader.batch_size)))
        answer_start = torch.argmax(outputs['start_logits'], dim=1)  
        answer_end = torch.argmax(outputs['end_logits'], dim=1) + 1 
        em_score, f1_score = calculate_stats(input_ids,answer_start,answer_end,idx)
        running_val_em += em_score
        running_val_f1 += f1_score
    l = len(val_qac)
    epoch_v_loss = running_val_loss /l
    epoch_v_em = running_val_em/l
    epoch_val_f1 = running_val_f1/l
    writer.add_scalar('Val/Loss', epoch_v_loss,epoch)
    writer.add_scalar('Val/EM', epoch_v_em,epoch)
    writer.add_scalar('Val/F1', epoch_val_f1,epoch)
    print('Val Loss: {:.4f}, EM: {:.4f}, F1: {:.4f} '.format(epoch_v_loss,epoch_v_em,epoch_val_f1))  

Epoch 0/19
----------


HBox(children=(FloatProgress(value=0.0, max=7236.0), HTML(value='')))

### Observations

After running our model training and evaluation for 10 epochs with a batch size of 8 we observe:

 - The training loss dropped from 12.0271 to 1.8473 in the final epoch. This means that training got better with time.
 - The validation loss dropped from 2.0035 to 0.0196. This means that our validation also improved with time. 
 - There is a very large difference between the training loss and the validation loss. This highlights that there is underfitting happening. Possible ways to improve this are increase the batch size to 16 and run for a higher number of, epochs to suggest a few. 
 - Our exact match score, EM, suggests that our model is able to get perfect matches, with an F1 score = 1, for 64.21% of our validation set. This is in line with our understanding of the validation dataset that the set has a lot of 'no answer' type questions. 
 - Our best results on the validation set came in the 2nd epoch. The problem with this is that the training loss for that epoch was very high at 7.8%.
 - We see a consistent decline in the F1 score after the 3rd epoch. The only anamoly is the 8th epoch where the score jumps suddenly. 
 
 Overall our model has performed well considering the limitations in computing that we have faced.

In [None]:
# We save our model so that it can be reused later

torch.save(model,'./model.pt')

In [None]:
# Generate a Tensorboard

%load_ext tensorboard
%tensorboard --logdir runs

### Observations

We have created a Tensorboard to map the loss and accuracy across the various epochs that the model has trained at.


We will now run some examples to see how our model is performing & is it responding correctly to our questions.


# 5. Running The Model

We will now test the model on some contexts and questions to see if we are getting the correct answers

In [None]:
test_context = """The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."""


test_question = """Who was the Norse leader?"""

test_answer =  "Rollo"


In [None]:
def question_answer(question, context, model):
    inputs = tokenizer(question,context, return_tensors='pt')

    input_ids = inputs['input_ids'].to(device)

    attention_mask = inputs['attention_mask'].to(device)
    inputs.to(device)
    start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]

    all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
    answer = tokenizer.convert_tokens_to_ids(answer.split())
    answer = tokenizer.decode(answer)
    return answer

In [None]:
question_answer(test_question, test_context, model)

In [None]:
## Checking the response is the same as we had in the validation set. 
question_answer(val_questions[0], val_contexts[0], model)


### Observations

We can see that our model is performing correctly. As an aditional step below we will load the model again and try to predict the answers for the same questions as we did above. 

In [None]:
import torch
from transformers import DistilBertTokenizerFast

model_loaded = torch.load('./model.pt')
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
test_context = """The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries."""


test_question = """Who was the Norse leader?"""

test_answer =  "Rollo"


In [None]:
def question_answer(question, context, model):
    inputs = tokenizer(question,context, return_tensors='pt')

    input_ids = inputs['input_ids'].to(device)

    attention_mask = inputs['attention_mask'].to(device)
    inputs.to(device)
    start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]

    all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
    answer = tokenizer.convert_tokens_to_ids(answer.split())
    answer = tokenizer.decode(answer)
    return answer

In [None]:
question_answer(test_question, test_context, model_loaded)

In [None]:
## we will now take some text at random from Wikipedia and test our model. This excerpt can be found at:
## https://en.wikipedia.org/wiki/Long_short-term_memory under the Idea heading.
context = """In theory, classic (or "vanilla") RNNs can keep track of arbitrary long-term dependencies in the input sequences. The problem with vanilla RNNs is computational (or practical) in nature: when training a vanilla RNN using back-propagation, the gradients which are back-propagated can "vanish" (that is, they can tend to zero) or "explode" (that is, they can tend to infinity), because of the computations involved in the process, which use finite-precision numbers. RNNs using LSTM units partially solve the vanishing gradient problem, because LSTM units allow gradients to also flow unchanged. However, LSTM networks can still suffer from the exploding gradient problem."""
question = """What problem can LSTM suffer from?"""
answer = """exploding gradient problem"""

In [None]:
question_answer(question, context, model_loaded)

# Final Observations

Through this project we have shown the process to process the SQuAD 2.0 dataset and how to use it for building a Question Answering System.

We have observed that to be able to successfully create a model that has good accuracy, we need to:
- **Prepocess Data** : We need to correctly separate the data into question, answers & contexts so that our model can correctly identify these fields. We need to insert index values to identify the start and end of a context and answer to be able to successfully generate tokens. We then encoded our data and tokenized it, preparing it for modelling. 
- **Model Creation** : We then built a base model using the pre-trained DistilBert and trained it using PyTorch.
- **Training & Validation**: We trained the model and evaluated it using our SQuAD dataset, over 10 epochs with a batch size of 8. Some observations from that are:

     - The training loss dropped from 12.0271 to 1.8473 in the final epoch. This means that training got better with time.
     - The validation loss dropped from 2.0035 to 0.0196. This means that our validation also improved with time. 
     - There is a very large difference between the training loss and the validation loss. This highlights that there is underfitting happening. Possible ways to improve this are increase the batch size to 16 and run for a higher number of, epochs to suggest a few. 
     - Our exact match score, EM, suggests that our model is able to get perfect matches, with an F1 score = 1, for 64.21% of our validation set. This is in line with our understanding of the validation dataset that the set has a lot of 'no answer' type questions. 
     - Our best results on the validation set came in the 2nd epoch. The problem with this is that the training loss for that epoch was very high at 7.8%.
     - We see a consistent decline in the F1 score after the 3rd epoch. The only anamoly is the 8th epoch where the score jumps suddenly. 

     Overall our model has performed well considering the limitations in computing that we have faced.
     
     We also ran a separate model in Google Colab with a batch size of 16 for 10 epochs. Link to colab: 
     https://colab.research.google.com/drive/1dNpiCmNmAKUm8tL3wkW89oV7RxJtwdC1?usp=sharing
     
- **Testing The Model**: Finally we tested our model using data from the training set as well as a random excerpt from Wiki pedia. In both cases our model is performing well. 

# END OF FILE