# 2023 CITS4012 Assignment

# Readme

#### **Group 31**
- Abhishek Anand (23598144)
- Shaikh Enamul Haque (23440037)
- Jia Min Ho (23337561)


In this project, we implement a WikiQA question answering (QA) framework using sequence models and several NLP features. The framework reads documents and answers questions about them.


**Loading and Evaluating the Model**

To load the provided models and evaluate them on the test data, run the commands below.

Before doing so, execute all the code blocks in the Dataset Processing and QA Model Implementation sections.

The Dataset Processing section handles data wrangling. The QA Model Implementation section builds the word vector representations needed by the loaded document and question models.

The commands below print the precision, recall and F1 score achieved by the model on the test dataset.

```python
model_document = torch.load('document_model.pt')
model_document.eval()

question_model = torch.load('question_model.pt')
question_model.eval()

precision, recall, f1 = evaluate_model(model_document, question_model)
```

# 1. Dataset Processing

This section sets up the necessary dependencies, downloads the training and test files from Google Drive, and loads them into pandas DataFrames. These files ('WikiQA-train.tsv' and 'WikiQA-test.tsv') contain the sentences, questions and answer labels used in the rest of the notebook.

In [None]:
# Code to download file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

import pandas as pd
import re
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Read the training file
id = '1SXoGbD9WZHwhpqR-cBw7-8_7Ri06nIb6'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('WikiQA-train.tsv')  
wiki_train_df = pd.read_table('WikiQA-train.tsv')

# Read the test file
id = '1TwuDSxlcAFDnTRpF-GRvqRXoR_UsJznH'
downloaded = drive.CreateFile({'id':id}) 
downloaded.GetContentFile('WikiQA-test.tsv')  
wiki_test_df = pd.read_table('WikiQA-test.tsv')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Data Wrangling

This step processes the question and sentence data from the DataFrame, tokenizes the sentences, and labels each word according to its position relative to the answer. The processed data is stored in separate lists for the training and test sets. We define four labels: "BA" (Before Answer), "A" (Answer), "AA" (After Answer) and "PAD".
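The labelling rule can be sketched on a toy document. This is a minimal illustration, assuming already-tokenized sentences; `label_document` and the sample sentences are hypothetical and are not part of the notebook.

```python
# Label order mirrors token_labels in the notebook: 0 = [BA], 1 = [A], 2 = [AA], 3 = [PAD]
token_labels = ['[BA]', '[A]', '[AA]', '[PAD]']

def label_document(sentences, answer_flags):
    """Return (words, label_indices) for one document.

    sentences    -- list of tokenized sentences (lists of words)
    answer_flags -- list of 0/1 flags, where 1 marks an answer sentence
    """
    words, labels = [], []
    answer_covered = False
    for sentence, flag in zip(sentences, answer_flags):
        if flag == 1:
            tag = '[A]'
            answer_covered = True
        elif answer_covered:
            tag = '[AA]'
        else:
            tag = '[BA]'
        for word in sentence:
            words.append(word)
            labels.append(token_labels.index(tag))
    return words, labels

# Hypothetical toy document: the second sentence is the answer
toy_sentences = [['Perth', 'is', 'a', 'city'],
                 ['It', 'is', 'in', 'Australia'],
                 ['It', 'is', 'sunny']]
toy_flags = [0, 1, 0]
toy_words, toy_labels = label_document(toy_sentences, toy_flags)
print(toy_labels)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
```

Every word inherits the label of its sentence, so the model is trained to tag whole answer sentences rather than individual answer spans.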

In [None]:
# BA - Before Answer, A - Answer, AA - After Answer
token_labels = ['[BA]', '[A]', '[AA]', '[PAD]']

# Build the Training and Test Question, Document and Document Labels
def construct_questions_documents_labels(wiki_df, is_test_data):
  unique_question_ids = wiki_df['QuestionID'].unique()

  questions = []
  documents = []
  document_sentences = []
  document_labels = []
  document_answers = []

  for one_question_id in unique_question_ids:
    matching_indices = wiki_df.index[wiki_df['QuestionID'] == one_question_id].tolist()
    
    question = wiki_df.loc[matching_indices[0]]['Question']
    # Remove all punctuations
    question = re.sub(r'[^\w\s]','', question)

    one_document = []
    one_document_sentences = ''
    one_document_labels = []
    one_document_answer = []
    answer_label = ''
    answer_covered = False

    for matching_index in matching_indices:
      one_sentence = wiki_df.loc[matching_index]['Sentence']

      # If the sentence does not end with a dot, then add the dot to it.
      if one_sentence.endswith('.'):
        one_document_sentences += one_sentence + " "
      else:
        one_document_sentences += one_sentence + ". "
      
      # Get the words of the sentence after removing all punctuations
      one_sentence_words = word_tokenize(re.sub(r'[^\w\s]','', one_sentence))

      # Form the document labels - before answer will be BA, answer will be A, after answer will be AA, and PAD
      answer_label = wiki_df.loc[matching_index]['Label']

      if answer_label == 1:
        one_document_answer.append(one_sentence)
        answer_covered = True
        # Every word of an answer sentence gets the [A] label
        for oneWord in one_sentence_words:
          one_document.append(oneWord)
          one_document_labels.append(token_labels.index('[A]'))
      else:
        if answer_covered == True:
          # Answer is covered, so we put all tags as After Answer
          for oneWord in one_sentence_words:
            one_document.append(oneWord)
            one_document_labels.append(token_labels.index('[AA]'))
        else:
          for oneWord in one_sentence_words:
            one_document.append(oneWord)
            one_document_labels.append(token_labels.index('[BA]'))

    # For training data, add only those documents which have answer. But for test, add all the documents to the list
    if answer_covered or is_test_data:
      questions.append(question)
      documents.append(one_document)
      document_sentences.append(one_document_sentences)
      document_labels.append(one_document_labels)
      document_answers.append(one_document_answer)

  return questions, documents, document_sentences, document_labels, document_answers


# train_questions - List of questions. Each element is the question string with punctuation removed.
# train_documents - List of documents. Each element is a list of the words in the document.
# train_document_sentences - List of documents. Each element is a string of all sentences in the document.
# train_document_labels - List of document labels. Each element is a list of labels for each word in the document.
# train_document_answers - List of answers for the questions. This will be used only during model evaluation to compute precision, recall etc.
train_questions, train_documents, train_document_sentences, train_document_labels, train_document_answers = construct_questions_documents_labels(wiki_train_df, False)
test_questions, test_documents, test_document_sentences, test_document_labels, test_document_answers = construct_questions_documents_labels(wiki_test_df, True)

Compute the average document length of the training set and the maximum question length. Then count how many test documents exceed the average training document length. Finally, record the total number of training questions.
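As a small illustration of these length statistics, the sketch below uses hypothetical label lists (only their lengths matter) in place of the real training and test data.

```python
# Hypothetical per-document label lists; only the lengths are relevant here
toy_train_labels = [[0] * 4, [0] * 10, [0] * 7]
toy_test_labels = [[0] * 5, [0] * 8, [0] * 12]

# Average training length, using integer division as the notebook does
avg_length = sum(len(labels) for labels in toy_train_labels) // len(toy_train_labels)

# Test documents that are longer than the training average
long_test = [labels for labels in toy_test_labels if len(labels) > avg_length]

print(avg_length)      # 7
print(len(long_test))  # 2
```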

In [None]:
# Get the average document length from training set
max_document_length = sum(len(sublist) for sublist in train_document_labels) // len(train_document_labels)

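# Get the maximum question length. Each question is a string, so this counts characters
# and over-estimates the number of tokens; it only serves as an upper bound for padding.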
max_question_length = len(max(train_questions, key=len))

# Check how many test documents are greater than max_document_length
long_test_documents = [ one_test_document for one_test_document in test_document_labels if len(one_test_document) > max_document_length ]
print('There are ' + str(len(long_test_documents)) + ' test documents longer than the average length in training documents. We will let their prediction be affected rather than increase the length of encoder.')

no_of_train_questions = len(train_questions)


There are 249 test documents longer than the average length in training documents. We will let their prediction be affected rather than increase the length of encoder.


# 2. QA Model Implementation

### Word Embedding: FastText

The training and test documents are merged into one corpus (all_documents). A FastText model is trained on this corpus using the skip-gram architecture, with parameters such as vector size, window size and minimum count set explicitly.

In [None]:
from gensim.models import FastText, Word2Vec

# We use 50 dimension word vector representation
word_vector_size = 50

# To ensure all words have their vector representations, we add train and test documents and use that as corpus
all_documents = train_documents + test_documents

# Now we initialize and train FastText with Skip Gram architecture (sg=1)
# We are using FastText to ensure that POS tags and NER tags also have their vector representations
ft_sg_model = FastText(all_documents, vector_size=word_vector_size, window=5, min_count=2, workers=2, sg=1)


### Feature Extraction

#### TF-IDF 

**GetTDIDF** computes the TF-IDF score of every word in every document. For each document it builds a dictionary whose keys are the words and whose values are their TF-IDF scores; this dictionary is then stored in an outer dictionary with the document ID as the key.

The function is then used to compute the TF-IDF scores for the training and test documents.
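The scoring rule used here is term frequency times a smoothed inverse document frequency, `tf * (log(N / (df + 1)) + 1)`. The sketch below applies it to a hypothetical toy corpus of token lists; `tf_idf` and `toy_docs` are illustrative names, not the notebook's own function.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: score} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word
    df = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * (math.log(n_docs / (df[word] + 1)) + 1)
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus
toy_docs = [['cat', 'sat', 'cat'], ['dog', 'sat'], ['dog', 'ran'], ['bird', 'flew']]
toy_scores = tf_idf(toy_docs)

# 'cat' is frequent in document 0 and rare in the corpus, so it outranks 'sat'
print(toy_scores[0]['cat'] > toy_scores[0]['sat'])  # True
```

Because of the `df + 1` term, a word that occurs in every document gets an IDF slightly below 1 rather than exactly 1.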

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
import numpy as np
from collections import Counter
import math


def GetTDIDF(document_sentences):
  # Calculate Document Frequency first - Number of documents in which a word is present
  DF = {}
  for one_document in document_sentences:
    one_document = re.sub(r'[^\w\s]','', one_document)
    word_list = word_tokenize(one_document)
    
    words_in_lower_case = [t.lower() for t in word_list]
    for one_word in np.unique(words_in_lower_case):
      DF[one_word] = DF.get(one_word, 0) + 1

  # Calculate TF-IDF for each word in each document
  doc_id = 0
  total_num_of_documents = len(document_sentences)
  # Dictionary of format { doc_id: { 'word1': tfidf1, 'word2': tfidf2 }}
  tf_idf_documents = {}

  for one_document in document_sentences:
    one_document = re.sub(r'[^\w\s]','', one_document)
    word_list = word_tokenize(one_document)

    words_in_lower_case = [t.lower() for t in word_list]
    # Initialise counter for the doc
    counter = Counter(words_in_lower_case)
    
    # Calculate total number of words in the doc
    total_num_of_words = len(words_in_lower_case)

    tf_idf_document = {}
    # Get each unique word in the doc
    for one_word in np.unique(words_in_lower_case):
      
      # Calculate Term Frequency 
      tf = counter[one_word]/total_num_of_words
          
      # Calculate Document Frequency
      df = DF[one_word]

      # Calculate Inverse Document Frequency
      idf = math.log(total_num_of_documents/(df+1))+1

      # Calculate TF-IDF
      tf_idf_document[one_word] = tf*idf

    # Store the document in dictionary
    tf_idf_documents[doc_id] = tf_idf_document
    doc_id += 1
  
  return tf_idf_documents


# Get the TF-IDF for training and test sets
train_tf_idf_documents = GetTDIDF(train_document_sentences)
test_tf_idf_documents = GetTDIDF(test_document_sentences)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### POS Tag

The **GetPOSTags** function uses NLTK's "averaged_perceptron_tagger" for POS tagging. For each document it builds a dictionary whose keys are the words and whose values are their POS tags; this dictionary is then stored in the pos_tags_documents dictionary with the document ID as the key.

The function is used to get the POS tags for both the training and test documents.
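The shape of the resulting dictionary can be seen in the sketch below. The (word, tag) pairs are hypothetical stand-ins for tagger output, so the example runs without NLTK.

```python
# Hypothetical (word, tag) pairs standing in for the output of a POS tagger
tagged_docs = [
    [('I', 'PRP'), ('book', 'VBP'), ('a', 'DT'), ('book', 'NN')],
    [('Perth', 'NNP'), ('is', 'VBZ'), ('sunny', 'JJ')],
]

pos_tags_documents = {}
for doc_id, tagged in enumerate(tagged_docs):
    # A later occurrence of the same word overwrites the earlier tag
    pos_tags_documents[doc_id] = {word: tag for word, tag in tagged}

print(pos_tags_documents[0]['book'])  # NN
```

Since the dictionary is keyed by word, a word that appears with different tags in one document keeps only the tag of its last occurrence.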

In [None]:
# Import the POS tagger
from nltk.tag import pos_tag
# download the dependency and resource as required
nltk.download('averaged_perceptron_tagger')

def GetPOSTags(document_sentences):
  # Get their POS tag and store in a dictionary
  # Dictionary of format { doc_id: { 'word1': 'tag1', 'word2': 'tag2' }}
  pos_tags_documents = {}
  doc_id = 0

  for one_document in document_sentences:
    one_document = re.sub(r'[^\w\s]','', one_document)
    word_list = word_tokenize(one_document)
    pos_tags = pos_tag(word_list)

    pos_tags_document = {}
    for word, word_pos_tag in pos_tags:
      pos_tags_document[word] = word_pos_tag

    pos_tags_documents[doc_id] = pos_tags_document
    doc_id += 1
  return pos_tags_documents


# Get the POS tags for training and test sets
train_pos_tags_documents = GetPOSTags(train_document_sentences)
test_pos_tags_documents = GetPOSTags(test_document_sentences)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


#### NER Tag

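**GetNERTags** follows the same pattern as GetPOSTags: it uses spaCy's en_core_web_sm model to obtain a named-entity tag for each word and stores the tags in one dictionary per document. Words outside any entity get the tag 'O', and an entity tag replaces an earlier 'O' recorded for the same word.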
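The rule for repeated words (keep the first entity tag, but let an entity tag replace an earlier 'O') can be sketched as follows. `merge_ner_tags` and the token list are hypothetical; the pairs stand in for spaCy's `token.text` and `token.ent_type_`.

```python
def merge_ner_tags(tokens):
    """tokens: list of (text, entity_type) pairs; '' means no entity."""
    tags = {}
    for text, ent_type in tokens:
        tag = ent_type if ent_type else 'O'
        # Keep the first tag seen, but let any tag replace an earlier 'O'
        if text not in tags or tags[text] == 'O':
            tags[text] = tag
    return tags

# Hypothetical tokens: the same word appears untagged, then as two entity types
toy_tokens = [('Washington', ''), ('visited', ''),
              ('Washington', 'GPE'), ('Washington', 'PERSON')]
toy_tags = merge_ner_tags(toy_tokens)
print(toy_tags)  # {'Washington': 'GPE', 'visited': 'O'}
```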

In [None]:
import spacy
import en_core_web_sm

# loading pre-trained model of NER
nlp = en_core_web_sm.load()

def GetNERTags(document_sentences):

  # Get their Named Entity tag and store in dictionary
  # Dictionary of format { doc_id: { 'word1': 'tag1', 'word2': 'tag2' }}
  ner_tags_documents = {}
  doc_id = 0

  for one_document in document_sentences:
    one_document = re.sub(r'[^\w\s]','', one_document)
    ner_one_document = nlp(one_document)

    ner_tags_document = {}
    for token in ner_one_document:
      if token.ent_type_ == '':
        ner_tag = 'O'
      else:
        ner_tag = token.ent_type_
      
      if token.text in ner_tags_document and ner_tags_document[token.text] == 'O':
        ner_tags_document[token.text] = ner_tag
      
      if token.text not in ner_tags_document:
        ner_tags_document[token.text] = ner_tag

    ner_tags_documents[doc_id] = ner_tags_document
    doc_id += 1
  
  return ner_tags_documents


# Get the NER Tags for training and test sets
train_ner_tags_documents = GetNERTags(train_document_sentences)
test_ner_tags_documents = GetNERTags(test_document_sentences)


We define functions to encode and pad the textual data using word vectors from the FastText model trained above. The document encoder optionally appends TF-IDF values, POS tag vectors and NER tag vectors, and both encoders pad documents and questions with zero vectors up to a given length.
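The padding step on its own can be sketched as below. `pad_vectors` and the 3-dimensional vectors are hypothetical; the real encoders work on FastText vectors of size 50 or more.

```python
def pad_vectors(vectors, max_length, vector_size):
    """Append zero vectors until the sequence holds max_length entries.

    Sequences already longer than max_length are returned unchanged,
    which matches the notebook's choice not to truncate.
    """
    padded = [list(v) for v in vectors]
    while len(padded) < max_length:
        padded.append([0.0] * vector_size)
    return padded

# Hypothetical 3-dimensional word vectors for a two-word document
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
toy_padded = pad_vectors(toy_vectors, max_length=4, vector_size=3)
print(len(toy_padded))  # 4
print(toy_padded[-1])   # [0.0, 0.0, 0.0]
```

The pad vector must have the same length as an encoded token, which is why the document encoder is called with the full document vector size rather than the plain word vector size.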

In [None]:
# Form the vector for each word in document (pad to match the max length)
def encode_pad_documents(documents, use_tf_idf, tf_idf_documents, use_pos_tags, pos_tags_documents, use_ner_tags, ner_tags_documents, max_length, word_vector_size):
  doc_id = 0
  encoded_documents = []

  for one_document in documents:
    encoded_one_document = []
    one_document = re.sub(r'[^\w\s]','', one_document)
    word_list = word_tokenize(one_document)

    word_counter = 0
    for one_word in word_list:
      encoded_one_word = []
      #word_vec = word_rep.get_vector(word_rep.get_vector(one_word))
      encoded_one_word.extend(ft_sg_model.wv[one_word])

      # Check if TF-IDF should be added to word vector
      if use_tf_idf:
        word_tf_idf = tf_idf_documents[doc_id][one_word.lower()]
        encoded_one_word.append(word_tf_idf)

      # Check if POS Tags  should be added to word vector
      if use_pos_tags:
        word_pos_tag = pos_tags_documents[doc_id][one_word]
        #encoded_one_word.extend(word_rep.get_vector(word_pos_tag))
        encoded_one_word.extend(ft_sg_model.wv[word_pos_tag])
      
      # Check if NER Tags  should be added to word vector
      if use_ner_tags:
        try:
          word_ner_tag = ner_tags_documents[doc_id][one_word]
        except KeyError:
          word_ner_tag = 'O'
        #encoded_one_word.extend(word_rep.get_vector(word_ner_tag))
        encoded_one_word.extend(ft_sg_model.wv[word_ner_tag])

      encoded_one_document.append(encoded_one_word)
      word_counter += 1

    # Pad it with zero vector
    while word_counter < max_length:
      pad_vector = [0] * word_vector_size
      encoded_one_document.append(pad_vector)
      word_counter += 1

    # Add encoded document to the list
    encoded_documents.append(encoded_one_document)
    doc_id += 1

  return encoded_documents


def pad_document_labels(document_labels, max_length):
  padded_document_labels = []

  for one_document_label in document_labels:
    label_counter = len(one_document_label)
    
    # Pad it with index of [PAD] token
    if label_counter < max_length:
      pad_vector = [token_labels.index('[PAD]')] * (max_length - label_counter)
      one_document_label.extend(pad_vector)
    padded_document_labels.append(one_document_label)

  return padded_document_labels


def encode_pad_questions(questions, max_length, word_vector_size):
  question_id = 0
  encoded_questions = []

  for one_question in questions:
    encoded_one_question = []
    one_question = re.sub(r'[^\w\s]','', one_question)
    word_list = word_tokenize(one_question)

    word_counter = 0
    for one_word in word_list:
      encoded_one_word = []
      word_vec = ft_sg_model.wv[one_word]
      #word_vec = word_rep.get_vector(one_word)
      encoded_one_word.extend(word_vec)

      encoded_one_question.append(encoded_one_word)
      word_counter += 1

    # Pad it with zero vector
    while word_counter < max_length:
      pad_vector = [0] * word_vector_size
      encoded_one_question.append(pad_vector)
      word_counter += 1

    # Add encoded question to the list
    encoded_questions.append(encoded_one_question)
    question_id += 1

  return encoded_questions


### BiLSTMForQuestion

Here, we define a BiLSTM (Bidirectional LSTM) model for question processing. The model takes an input with dimensions input_dim and applies a bidirectional LSTM layer. The output includes the LSTM outputs and the concatenated last hidden states from the forward and backward LSTMs.

In [None]:
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BiLSTMForQuestion(nn.Module):
    def __init__(self, input_dim, hidden_dim):

      super(BiLSTMForQuestion, self).__init__()

      self.input_dim = input_dim
      self.hidden_dim = hidden_dim
      self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, input):
      out, (h_n, _) = self.lstm(input)
      
      # Concatenate the last hidden states of the forward and backward LSTM
      hidden_out = torch.cat((h_n[0, :], h_n[1, :]), dim=0)
      
      return out, hidden_out


### BiLSTMForDocument

Here, we define a BiLSTM (Bidirectional LSTM) model for document processing. The model takes an input with dimensions input_dim and applies a bidirectional LSTM layer. It also includes an attention mechanism that calculates attention weights based on the specified method (e.g., Dot Product, Scaled Dot Product, or Cosine Similarity) between the document tokens and the summary of a question. The final attention output is passed through a fully-connected layer to produce the model's output.
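For a single document token, the three scoring methods reduce to a scalar score that rescales the token before a residual addition. The pure-Python sketch below illustrates that arithmetic; `dot` and `attend` are hypothetical helpers, not the PyTorch implementation, and the cosine branch assumes non-zero vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(token, question_summary, method="Dot Product"):
    """Score one document token against the question summary, then add
    the score-weighted token back onto the token (residual form)."""
    if method == "Dot Product":
        score = dot(token, question_summary)
    elif method == "Scaled Dot Product":
        score = dot(token, question_summary) / math.sqrt(len(question_summary))
    elif method == "Cosine Similarity":
        norm = math.sqrt(dot(token, token)) * math.sqrt(dot(question_summary, question_summary))
        score = dot(token, question_summary) / norm
    else:
        raise ValueError("Unknown attention method: " + method)
    return [score * x + x for x in token]

# Hypothetical 4-dimensional token and question summary
toy_token = [1.0, 0.0, 2.0, 0.0]
toy_summary = [1.0, 1.0, 1.0, 1.0]
print(attend(toy_token, toy_summary))                        # [4.0, 0.0, 8.0, 0.0]
print(attend(toy_token, toy_summary, "Scaled Dot Product"))  # [2.5, 0.0, 5.0, 0.0]
```

In the model the token and summary are the 2 * hidden_dim BiLSTM outputs, and the result is passed to the fully-connected layer.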

In [None]:
class BiLSTMForDocument(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers = 1, attention_method = "Dot Product"):

      super(BiLSTMForDocument, self).__init__()

      self.input_dim = input_dim
      self.hidden_dim = hidden_dim
      self.output_dim = output_dim
      self.attention_method = attention_method
      self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, batch_first=True, bidirectional=True)
      self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def cal_attention(self, document_token, question_summary):
      
      if self.attention_method == "Dot Product":
        # document_token of shape [1, 1, 100] & question_summary of shape [1, 100] give attention weights of shape [1, 1, 1]
        attn_weights = torch.bmm(document_token, question_summary.T.unsqueeze(0))
        attn_output = torch.bmm(attn_weights, document_token)
        final_output = torch.add(attn_output[0], document_token[0])
      
      elif self.attention_method == "Scaled Dot Product":
        attn_weights = torch.bmm(document_token, question_summary.T.unsqueeze(0)) / np.sqrt(question_summary.shape[1])
        attn_output = torch.bmm(attn_weights, document_token)
        final_output = torch.add(attn_output[0], document_token[0])
      
      elif self.attention_method == "Cosine Similarity":
        attn_weights = torch.nn.functional.cosine_similarity(document_token.view(-1, 2 * self.hidden_dim, 1), question_summary.T.unsqueeze(0), dim=1)
        attn_output = torch.bmm(attn_weights.view(1, 1, -1), document_token)
        final_output = torch.add(attn_output[0], document_token[0])

      return final_output

    def forward(self, x, question_summary):
      out, (h_n, _) = self.lstm(x)
      attention_output = self.cal_attention(out, question_summary)
      
      # Pass the final attention output through the fully-connected layer
      out = F.softmax(self.fc(attention_output[0]), dim=0)

      return out


The **asMinutes** function converts a time in seconds to a string of minutes and seconds. The **timeSince** function computes the time elapsed since a given start time (since) and returns a string with the elapsed time and the estimated remaining time based on the current progress (percent).
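The remaining-time estimate assumes a constant speed: the total time is the elapsed time divided by the fraction completed. The sketch below is a deterministic variant that takes the elapsed seconds as an argument instead of reading the clock; `as_minutes` and `elapsed_and_remaining` are hypothetical names.

```python
import math

def as_minutes(seconds):
    minutes = math.floor(seconds / 60)
    return '%dm %ds' % (minutes, seconds - minutes * 60)

def elapsed_and_remaining(elapsed, fraction_done):
    """elapsed: seconds spent so far; fraction_done: progress in (0, 1]."""
    estimated_total = elapsed / fraction_done
    remaining = estimated_total - elapsed
    return '%s (- %s)' % (as_minutes(elapsed), as_minutes(remaining))

# 60 seconds spent on 250 of 1000 iterations
print(elapsed_and_remaining(60, 250 / 1000))  # 1m 0s (- 3m 0s)
```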



In [None]:
import time
import math

# Helper functions for training
def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

### Model Training

The **train** function performs training on the given document and question tensors using the provided models, optimizers, and criterion. It calculates the loss based on the predicted document labels and the actual document labels. The gradients are computed and the models are updated using the optimizer.

The **trainIters** function iterates over a specified number of training iterations. It randomly selects a document, question, and document labels from the training data. It calls the train function to perform the training on the selected data. It keeps track of the total loss for printing and plotting purposes. It prints the average loss every print_every iterations and appends the average loss to plot_losses every plot_every iterations.

In [None]:
import random

def train(document_tensor, question_tensor, document_labels_tensor, model_document, model_question, model_document_optimizer, model_question_optimizer, criterion):

    document_length = document_tensor.size(0)
    question_length = question_tensor.size(0)
    document_labels_length = document_labels_tensor.size(0)

    loss = 0      
    model_document_optimizer.zero_grad()
    model_question_optimizer.zero_grad()


    # Get the question summary
    _, question_summary = model_question(question_tensor)
    
    for i in range(document_length):
      # Process each token of the document
      document_label_output = model_document(document_tensor[i].view(1, 1, -1), question_summary.view(1, 2 * model_question.hidden_dim))

      # Compare the predicted token with actual token
      loss += criterion(document_label_output.view(1, -1), document_labels_tensor[i].view(1))
    
    loss.backward()

    model_question_optimizer.step()
    model_document_optimizer.step()

    return loss.item() / document_labels_length


def trainIters(model_document, model_question, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every
    random.seed(1234)

    model_document_optimizer = optim.Adam(model_document.parameters(), lr=learning_rate)
    model_question_optimizer = optim.Adam(model_question.parameters(), lr=learning_rate)
    
    criterion = nn.CrossEntropyLoss()

    for iter in range(1, n_iters + 1):
        random_choice_ix = random.choice(range(no_of_train_questions)) # Get a random index within the scope of input data
        document_index_r = encoded_train_documents[random_choice_ix]
        question_index_r = encoded_train_questions[random_choice_ix]
        document_labels_index_r = encoded_train_document_labels[random_choice_ix]
        
        document_tensor = torch.FloatTensor(document_index_r).to(device)
        question_tensor = torch.FloatTensor(question_index_r).to(device) 
        document_labels_tensor = torch.LongTensor(document_labels_index_r).to(device)

        loss = train(document_tensor, question_tensor, document_labels_tensor, model_document, model_question, model_document_optimizer, model_question_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0


### Model Prediction and Evaluation

**predict_answer**: Predicts the answer for a given document and question using the trained models. It returns, for each document token, the predicted probability of the answer label '[A]' (referred to in the code as the Start-Of-Answer, SOA, tag).

**evaluate_model**: Evaluates the performance of the document and question models on a test set. It predicts answers for each test question using the predict_answer function and compares them to the actual answers. The function calculates precision, recall, and F1 score to assess the model's performance.
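The three metrics are computed from per-question 0/1 arrays. The notebook relies on scikit-learn for this; the sketch below spells out the same definitions in plain Python on hypothetical arrays, with `binary_metrics` as an illustrative helper.

```python
def binary_metrics(actual, predicted):
    """Precision, recall and F1 for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical arrays: 1 in toy_actual means the question has an answer,
# 1 in toy_predicted means the question was counted as a predicted positive
toy_actual = [1, 1, 0, 1, 0]
toy_predicted = [1, 0, 1, 1, 1]
precision, recall, f1 = binary_metrics(toy_actual, toy_predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.5 0.667 0.571
```

In this toy case the two questions without an answer are both counted as predicted positives, which lowers precision while leaving recall unaffected.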

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score, f1_score


def predict_answer(model_document, model_question, encoded_document, encoded_question, document):
  with torch.no_grad():
    document_tensor = torch.FloatTensor(encoded_document).to(device)
    question_tensor = torch.FloatTensor(encoded_question).to(device)
    
    _, question_summary = model_question(question_tensor)

    predicted_soa_probabilities = []

    # We need to predict only for the actual length of the document and not for the padded tokens
    document_actual_length = len(document)
    for i in range(document_actual_length):
      document_label_output = model_document(document_tensor[i].view(1, 1, -1), question_summary.view(1, 2 * model_question.hidden_dim))
      
      # Extract the probability of the [A] (answer) label, called SOA here, to find the word with maximum probability
      predicted_soa_probabilities.append(document_label_output[token_labels.index('[A]')].item())
      

  return predicted_soa_probabilities


def evaluate_model(model_document, model_question, print_first_5 = False):

  # Evaluate Model on Test set
  no_of_questions = len(encoded_test_questions)
  #no_of_questions = 50
  actual_answer_array = [ 0 ] * no_of_questions
  predicted_answer_array = [ 0 ] * no_of_questions
  for i in range(no_of_questions):
    predicted_soa_probabilities = predict_answer(model_document, model_question, encoded_test_documents[i], encoded_test_questions[i], test_documents[i])

    soa_index = predicted_soa_probabilities.index(max(predicted_soa_probabilities))
    soa_word = test_documents[i][soa_index]
    
    # There is an actual answer for the question
    if len(test_document_answers[i]) > 0:
      actual_answer_array[i] = 1

    sentences_in_document = sent_tokenize(test_document_sentences[i])
    answer_for_document = []
    for one_sentence in sentences_in_document:
      clean_sentence = re.sub(r'[^\w\s]','', one_sentence)
      if soa_word in word_tokenize(clean_sentence):
        answer_for_document.append(one_sentence)
    
    # Predicted Answer matches Actual Answer OR There was no answer for the question but we still predicted answer
    if len(set(answer_for_document).intersection(test_document_answers[i])) > 0 or actual_answer_array[i] == 0:
      predicted_answer_array[i] = 1
    
    # Print the actual and predicted answers for the first 5 questions
    if print_first_5 and i < 5:
      print("Actual Answer:", test_document_answers[i])
      print("Predicted Answer:", answer_for_document)
      print("\n")


  precision = precision_score(actual_answer_array, predicted_answer_array)
  recall = recall_score(actual_answer_array, predicted_answer_array)
  f1 = f1_score(actual_answer_array, predicted_answer_array)

  print("\nBelow is the performance metrics of this model on test data:")
  print("Precision:", precision)
  print("Recall:", recall)
  print("F1 score:", f1)

  return precision, recall, f1


# 3. Model Testing

### 3.1. Input Embedding Ablation Study




This embedding ablation study evaluates the performance of a model using only FastText word embeddings without additional features such as TF-IDF, POS tags, and NER tags. The training and evaluation processes are conducted, and the resulting precision, recall, and F1 scores are recorded in a DataFrame called **performance_df**.

In [None]:
# We will create a dataframe that will capture performance metrics of all variants we try
performance_df = pd.DataFrame(columns=['Model Type', 'Model Description', 'Precision Score', 'Recall Score', 'F1 Score'])

# We will first test with only FastText embedding
use_tf_idf = False
use_pos_tags = False
use_ner_tags = False
document_word_vector_size = word_vector_size + (word_vector_size * use_pos_tags) + (word_vector_size * use_ner_tags) + use_tf_idf

# encoded_train_documents - List of all documents - Each element is a list of size max_document_length. Each element of this list is a vector of size [word_vector_size * X + 1] - X can be 1,2 or 3
# padded_train_document_labels - Similar to train_document_labels. Only padded to match the max_document_length
# encoded_train_questions - List of all questions - Each element is a list of size max_question_length. Each element of this list is a vector of size [word_vector_size]

encoded_train_documents = encode_pad_documents(train_document_sentences, 
                                               use_tf_idf, train_tf_idf_documents, 
                                               use_pos_tags, train_pos_tags_documents, 
                                               use_ner_tags, train_ner_tags_documents, 
                                               max_document_length, document_word_vector_size)

encoded_train_document_labels = pad_document_labels(train_document_labels, max_document_length)
encoded_train_questions = encode_pad_questions(train_questions, max_question_length, word_vector_size)

encoded_test_documents = encode_pad_documents(test_document_sentences, 
                                              use_tf_idf, test_tf_idf_documents, 
                                              use_pos_tags, test_pos_tags_documents, 
                                              use_ner_tags, test_ner_tags_documents, 
                                              max_document_length, document_word_vector_size)

encoded_test_questions = encode_pad_questions(test_questions, max_question_length, word_vector_size)


# Hidden dimension of the Bi-LSTM layer
hidden_size = 50

# Create the model
model_question1 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document1 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

no_of_epoch = 1000
learning_rate_val = 0.001

# Train the model on training data
trainIters(model_document1, model_question1, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document1.eval()
model_question1.eval()
precision, recall, f1 = evaluate_model(model_document1, model_question1)

performance_df.loc[0] = ['Input Embedding Ablation Study', 'FastText only. No TFIDF, POS and NER tags', precision, recall, f1 ]

0m 46s (- 7m 0s) (100 10%) 1.2448
1m 33s (- 6m 13s) (200 20%) 1.0818
2m 18s (- 5m 22s) (300 30%) 0.9802
3m 8s (- 4m 42s) (400 40%) 0.9775
3m 50s (- 3m 50s) (500 50%) 0.9903
4m 37s (- 3m 5s) (600 60%) 0.9926
5m 24s (- 2m 18s) (700 70%) 1.0275
6m 13s (- 1m 33s) (800 80%) 0.9608
6m 59s (- 0m 46s) (900 90%) 0.9763
7m 46s (- 0m 0s) (1000 100%) 1.0016

Below is the performance metrics of this model on test data:
Precision: 0.17584745762711865
Recall: 0.34439834024896265
F1 score: 0.23281907433380086
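
As a quick sanity check, the F1 score printed above can be recomputed from the printed precision and recall. This assumes `evaluate_model` reports F1 as the harmonic mean of the two (its implementation is not shown in this section); the helper name `f1_from` is illustrative only.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the output of the FastText-only run above.
precision = 0.17584745762711865
recall = 0.34439834024896265
f1 = f1_from(precision, recall)
print(round(f1, 6))
```

The result agrees with the reported F1 score, which suggests the three printed numbers are mutually consistent.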


Next, we add TF-IDF and POS tag embeddings to the FastText word embeddings. After retraining on the re-encoded training data, we evaluate the model on the test data and record the precision, recall, and F1 score in the **performance_df** DataFrame. This variant extends the input embedding ablation study by giving the document representation more contextual information.
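
To make the size bookkeeping concrete, the sketch below shows how the per-token vector width grows as features are switched on. It assumes `word_vector_size` is 50 (the LSTM input size printed for the saved FastText-only model), that POS and NER tags are each embedded to the same width as the word vector, and that TF-IDF contributes a single scalar, which is what the `document_word_vector_size` formula in the cells implies. The function name is illustrative.

```python
def document_vector_size(word_vector_size: int,
                         use_tf_idf: bool,
                         use_pos_tags: bool,
                         use_ner_tags: bool) -> int:
    """Width of one token vector after concatenating the enabled features."""
    size = word_vector_size            # FastText word embedding
    if use_pos_tags:
        size += word_vector_size       # POS tag embedding (assumed same width)
    if use_ner_tags:
        size += word_vector_size       # NER tag embedding (assumed same width)
    if use_tf_idf:
        size += 1                      # TF-IDF is a single scalar per token
    return size

# Flag order: (use_tf_idf, use_pos_tags, use_ner_tags), one tuple per ablation variant.
for flags in [(False, False, False), (True, True, False), (True, True, True)]:
    print(flags, document_vector_size(50, *flags))
```

Under these assumptions the three input embedding variants use token vectors of width 50, 101, and 151.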

In [None]:
# We will now test with TF-IDF and POS tag embeddings added (no NER tags)
use_tf_idf = True
use_pos_tags = True
use_ner_tags = False
document_word_vector_size = word_vector_size + (word_vector_size * use_pos_tags) + (word_vector_size * use_ner_tags) + use_tf_idf

encoded_train_documents = encode_pad_documents(train_document_sentences, 
                                               use_tf_idf, train_tf_idf_documents, 
                                               use_pos_tags, train_pos_tags_documents, 
                                               use_ner_tags, train_ner_tags_documents, 
                                               max_document_length, document_word_vector_size)

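# Labels and questions do not depend on the document features, so the encodings from the first cell are reused
#encoded_train_document_labels = pad_document_labels(train_document_labels, max_document_length)
#encoded_train_questions = encode_pad_questions(train_questions, max_question_length, word_vector_size)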

encoded_test_documents = encode_pad_documents(test_document_sentences, 
                                              use_tf_idf, test_tf_idf_documents, 
                                              use_pos_tags, test_pos_tags_documents, 
                                              use_ner_tags, test_ner_tags_documents, 
                                              max_document_length, document_word_vector_size)

#encoded_test_questions = encode_pad_questions(test_questions, max_question_length, word_vector_size)


# Hidden dimension of the Bi-LSTM layer
hidden_size = 50

# Create the model
model_question2 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document2 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

# Train the model on training data
trainIters(model_document2, model_question2, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document2.eval()
model_question2.eval()
precision, recall, f1 = evaluate_model(model_document2, model_question2)

performance_df.loc[1] = ['Input Embedding Ablation Study', 'FastText with TFIDF and POS tags. No NER tag', precision, recall, f1 ]

0m 52s (- 7m 53s) (100 10%) 1.2280
1m 43s (- 6m 52s) (200 20%) 1.0574
2m 37s (- 6m 8s) (300 30%) 0.9793
3m 34s (- 5m 21s) (400 40%) 0.9768
4m 24s (- 4m 24s) (500 50%) 0.9899
5m 20s (- 3m 33s) (600 60%) 0.9920
6m 13s (- 2m 40s) (700 70%) 1.0274
7m 12s (- 1m 48s) (800 80%) 0.9608
8m 7s (- 0m 54s) (900 90%) 0.9759
9m 0s (- 0m 0s) (1000 100%) 1.0015

Below is the performance metrics of this model on test data:
Precision: 0.1670235546038544
Recall: 0.3236514522821577
F1 score: 0.22033898305084748


We further extend the input embedding ablation study by adding TF-IDF, POS tag, and NER tag embeddings together. We encode and pad the training and test documents with these features and train a model on the re-encoded training data. The model is then evaluated on the test data, and the results are recorded in the **performance_df** DataFrame. This variant explores the impact of combining several types of contextual information in the document representation.

In [None]:
# We will now test with all 3 feature embeddings
use_tf_idf = True
use_pos_tags = True
use_ner_tags = True
document_word_vector_size = word_vector_size + (word_vector_size * use_pos_tags) + (word_vector_size * use_ner_tags) + use_tf_idf

encoded_train_documents = encode_pad_documents(train_document_sentences, 
                                               use_tf_idf, train_tf_idf_documents, 
                                               use_pos_tags, train_pos_tags_documents, 
                                               use_ner_tags, train_ner_tags_documents, 
                                               max_document_length, document_word_vector_size)

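# Labels and questions do not depend on the document features, so the encodings from the first cell are reused
#encoded_train_document_labels = pad_document_labels(train_document_labels, max_document_length)
#encoded_train_questions = encode_pad_questions(train_questions, max_question_length, word_vector_size)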

encoded_test_documents = encode_pad_documents(test_document_sentences, 
                                              use_tf_idf, test_tf_idf_documents, 
                                              use_pos_tags, test_pos_tags_documents, 
                                              use_ner_tags, test_ner_tags_documents, 
                                              max_document_length, document_word_vector_size)

#encoded_test_questions = encode_pad_questions(test_questions, max_question_length, word_vector_size)


# Hidden dimension of the Bi-LSTM layer
hidden_size = 50

# Create the model
model_question3 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document3 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

# Train the model on training data
trainIters(model_document3, model_question3, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document3.eval()
model_question3.eval()
precision, recall, f1 = evaluate_model(model_document3, model_question3)

performance_df.loc[2] = ['Input Embedding Ablation Study', 'FastText with TFIDF, POS and NER tags', precision, recall, f1 ]

0m 57s (- 8m 38s) (100 10%) 1.2133
1m 50s (- 7m 21s) (200 20%) 1.0522
2m 46s (- 6m 28s) (300 30%) 0.9786
3m 44s (- 5m 36s) (400 40%) 0.9760
4m 35s (- 4m 35s) (500 50%) 0.9894
5m 28s (- 3m 39s) (600 60%) 0.9905
6m 21s (- 2m 43s) (700 70%) 1.0274
7m 18s (- 1m 49s) (800 80%) 0.9602
8m 10s (- 0m 54s) (900 90%) 0.9746
9m 3s (- 0m 0s) (1000 100%) 1.0005

Below is the performance metrics of this model on test data:
Precision: 0.1487964989059081
Recall: 0.2821576763485477
F1 score: 0.19484240687679086


### 3.2. Attention Ablation Study

We continue with the best performing input embedding variant, which is FastText embeddings only. We set the flags for TF-IDF, POS tags, and NER tags to **False** and encode and pad the training and test documents using only FastText embeddings. The model consists of a BiLSTM for question processing and a BiLSTM with an attention mechanism for document processing. The attention mechanism used here is **dot product attention**. The model is trained on the training data and evaluated on the test data, and the performance metrics are recorded in the performance_df DataFrame. This variant explores the impact of using dot product attention in the document processing stage.
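
For readers unfamiliar with the mechanism, here is a minimal pure-Python sketch of dot product attention between one question vector and a sequence of document token states. It illustrates the general technique only: the attention code inside `BiLSTMForDocument` is not shown in this section and may differ, and all names and numbers below are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dot_product_attention(query, states):
    """Weight each state by softmax(query . state) and return the weighted sum."""
    weights = softmax([dot(query, s) for s in states])
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return weights, context

# Toy question vector and three document token states (illustrative numbers).
query = [1.0, 0.0]
states = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, context = dot_product_attention(query, states)
print(weights)
print(context)
```

The first state points in the same direction as the query, so it receives the largest weight and dominates the context vector.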

In [None]:
# We will move ahead with only FastText embedding as it was the best performing input embedding variant tried above
use_tf_idf = False
use_pos_tags = False
use_ner_tags = False
document_word_vector_size = word_vector_size + (word_vector_size * use_pos_tags) + (word_vector_size * use_ner_tags) + use_tf_idf

encoded_train_documents = encode_pad_documents(train_document_sentences, 
                                               use_tf_idf, train_tf_idf_documents, 
                                               use_pos_tags, train_pos_tags_documents, 
                                               use_ner_tags, train_ner_tags_documents, 
                                               max_document_length, document_word_vector_size)

encoded_train_document_labels = pad_document_labels(train_document_labels, max_document_length)
encoded_train_questions = encode_pad_questions(train_questions, max_question_length, word_vector_size)

encoded_test_documents = encode_pad_documents(test_document_sentences, 
                                              use_tf_idf, test_tf_idf_documents, 
                                              use_pos_tags, test_pos_tags_documents, 
                                              use_ner_tags, test_ner_tags_documents, 
                                              max_document_length, document_word_vector_size)

encoded_test_questions = encode_pad_questions(test_questions, max_question_length, word_vector_size)


# Create the model
model_question4 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document4 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels), num_layers = 1, attention_method = "Dot Product").to(device)

# Train the model on training data - using dot product for attention
trainIters(model_document4, model_question4, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document4.eval()
model_question4.eval()
precision, recall, f1 = evaluate_model(model_document4, model_question4)

performance_df.loc[3] = ['Attention Ablation Study', 'FastText only. Dot Product Attention', precision, recall, f1 ]

0m 46s (- 7m 2s) (100 10%) 1.2277
1m 31s (- 6m 7s) (200 20%) 1.0611
2m 21s (- 5m 29s) (300 30%) 0.9798
3m 11s (- 4m 46s) (400 40%) 0.9774
3m 54s (- 3m 54s) (500 50%) 0.9902
4m 42s (- 3m 8s) (600 60%) 0.9926
5m 28s (- 2m 20s) (700 70%) 1.0275
6m 19s (- 1m 34s) (800 80%) 0.9608
7m 5s (- 0m 47s) (900 90%) 0.9763
7m 51s (- 0m 0s) (1000 100%) 1.0016

Below is the performance metrics of this model on test data:
Precision: 0.1723404255319149
Recall: 0.3360995850622407
F1 score: 0.2278481012658228


We continue with the same word embeddings and create a model with a BiLSTM for question processing and a BiLSTM with scaled dot product attention for document processing. The model is trained and evaluated on the test data, and the performance metrics are recorded in performance_df. This variant investigates the impact of using **scaled dot product attention**.
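
Under the standard definition, scaled dot product attention differs from plain dot product attention only in dividing each score by the square root of the vector dimension before the softmax, which gives a less peaked weight distribution. The sketch below compares the two on toy data; it is not taken from the project's model code, and the names and numbers are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, states, scaled):
    """Softmax over query-state dot products, optionally divided by sqrt(dim)."""
    dim = len(query)
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    if scaled:
        scores = [score / math.sqrt(dim) for score in scores]
    return softmax(scores)

# Toy 4-dimensional query and three token states (illustrative numbers).
query = [1.0, 1.0, 1.0, 1.0]
states = [[2.0, 2.0, 2.0, 2.0], [0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
plain_weights = attention_weights(query, states, scaled=False)
scaled_weights = attention_weights(query, states, scaled=True)
print(plain_weights)
print(scaled_weights)
```

With scaling, the largest weight is smaller and more attention mass is left for the other tokens.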

In [None]:
# We can use the previous encoded variables as we won't change word embeddings now

# Create the model
model_question5 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document5 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels), num_layers = 1, attention_method = "Scaled Dot Product").to(device)

# Train the model on training data - using scaled dot product for attention
trainIters(model_document5, model_question5, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document5.eval()
model_question5.eval()
precision, recall, f1 = evaluate_model(model_document5, model_question5)

performance_df.loc[4] = ['Attention Ablation Study', 'FastText only. Scaled Dot Product Attention', precision, recall, f1 ]

0m 49s (- 7m 27s) (100 10%) 1.2618
1m 37s (- 6m 28s) (200 20%) 1.1771
2m 24s (- 5m 38s) (300 30%) 1.0007
3m 16s (- 4m 55s) (400 40%) 0.9803
4m 2s (- 4m 2s) (500 50%) 0.9917
4m 52s (- 3m 15s) (600 60%) 0.9938
5m 42s (- 2m 26s) (700 70%) 1.0282
6m 36s (- 1m 39s) (800 80%) 0.9614
7m 25s (- 0m 49s) (900 90%) 0.9772
8m 14s (- 0m 0s) (1000 100%) 1.0026

Below is the performance metrics of this model on test data:
Precision: 0.16523605150214593
Recall: 0.31950207468879666
F1 score: 0.21782178217821782


We use the previous encoded variables and create a model with a BiLSTM for question processing and a BiLSTM with cosine similarity attention for document processing. The model is trained and evaluated on the test data, and the performance metrics are recorded in performance_df. This variant explores the impact of using **cosine similarity attention**.
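
Cosine similarity scores depend only on the angle between the question vector and each token state, not on their magnitudes. The sketch below shows this on toy data. It illustrates the general idea and is not the project's implementation; applying a softmax to the similarities is an assumption, and all names and numbers are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def cosine_attention_weights(query, states):
    """Softmax over cosine similarities between the query and each state."""
    scores = [cosine_similarity(query, s) for s in states]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two states share the query's direction but differ in length; one is orthogonal.
query = [1.0, 0.0]
states = [[2.0, 0.0], [10.0, 0.0], [0.0, 3.0]]
weights = cosine_attention_weights(query, states)
print(weights)
```

The first two states receive equal weight even though one is five times longer, whereas plain dot product attention would favor the longer one.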

In [None]:
# We can use the previous encoded variables as we won't change word embeddings now

# Create the model
model_question6 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document6 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels), num_layers = 1, attention_method = "Cosine Similarity").to(device)

# Train the model on training data - using cosine for attention
trainIters(model_document6, model_question6, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document6.eval()
model_question6.eval()
precision, recall, f1 = evaluate_model(model_document6, model_question6)

performance_df.loc[5] = ['Attention Ablation Study', 'FastText only. Cosine Similarity Attention', precision, recall, f1 ]

0m 58s (- 8m 44s) (100 10%) 1.2460
1m 53s (- 7m 34s) (200 20%) 1.1351
2m 50s (- 6m 38s) (300 30%) 0.9918
3m 51s (- 5m 47s) (400 40%) 0.9808
4m 45s (- 4m 45s) (500 50%) 0.9921
5m 43s (- 3m 48s) (600 60%) 0.9940
6m 41s (- 2m 52s) (700 70%) 1.0283
7m 43s (- 1m 55s) (800 80%) 0.9616
8m 41s (- 0m 57s) (900 90%) 0.9773
9m 38s (- 0m 0s) (1000 100%) 1.0026

Below is the performance metrics of this model on test data:
Precision: 0.1670235546038544
Recall: 0.3236514522821577
F1 score: 0.22033898305084748


### 3.3. Hyper Parameter Testing

We create a model with a BiLSTM for question processing and a BiLSTM for document processing. The model is trained for **500 epochs with a learning rate of 0.01**. It is then evaluated on the test data, and the performance metrics are recorded in **performance_df**. This variant and the ones that follow explore the impact of the number of epochs and the learning rate.
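
The cells in this section repeat the same create, train, and evaluate steps with different settings. One way to organize such a sweep is a loop over configurations, sketched below. `run_experiment` is a hypothetical stand-in for building the models and calling `trainIters` and `evaluate_model`; the dummy implementation returns placeholder zeros, not real results.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Tuple[float, float, float]  # (precision, recall, f1)

def sweep(configs: List[Tuple[int, float]],
          run_experiment: Callable[[int, float], Metrics]) -> List[Dict[str, object]]:
    """Run one experiment per (epochs, learning_rate) pair and collect table rows."""
    rows = []
    for epochs, learning_rate in configs:
        precision, recall, f1 = run_experiment(epochs, learning_rate)
        rows.append({
            "description": f"FastText only. {epochs} Epochs, {learning_rate} Learning Rate",
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return rows

# The (epochs, learning rate) pairs tried in this section.
configs = [(500, 0.01), (1000, 0.001), (1000, 0.1), (5000, 0.1), (10000, 0.1)]

def dummy_run(epochs: int, learning_rate: float) -> Metrics:
    # Placeholder only: a real version would build both models, train them,
    # and evaluate on the test data. These zeros are NOT measured results.
    return (0.0, 0.0, 0.0)

rows = sweep(configs, dummy_run)
for row in rows:
    print(row["description"])
```

Keeping the settings in one list makes it easier to see which combinations were tried and avoids accidentally carrying over `no_of_epoch` or `learning_rate_val` from an earlier cell.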

In [None]:
# We can use the previous encoded variables as we won't change word embeddings now

# Create the model
model_question7 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document7 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

no_of_epoch = 500
learning_rate_val = 0.01
# Train the model on training data
trainIters(model_document7, model_question7, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document7.eval()
model_question7.eval()
precision, recall, f1 = evaluate_model(model_document7, model_question7)

performance_df.loc[6] = ['Hyper Parameter Testing', 'FastText only. 500 Epochs, 0.01 Learning Rate', precision, recall, f1 ]

0m 50s (- 3m 23s) (100 20%) 1.0620
1m 40s (- 2m 30s) (200 40%) 1.0262
2m 28s (- 1m 39s) (300 60%) 0.9779
3m 21s (- 0m 50s) (400 80%) 0.9759
4m 7s (- 0m 0s) (500 100%) 0.9890

Below is the performance metrics of this model on test data:
Precision: 0.1979381443298969
Recall: 0.3983402489626556
F1 score: 0.2644628099173554


Training with **1000 epochs and a learning rate of 0.001** (FastText word representation only, dot product attention).

In [None]:
# Create the model
model_question8 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document8 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

no_of_epoch = 1000
learning_rate_val = 0.001
# Train the model on training data
trainIters(model_document8, model_question8, no_of_epoch, print_every=100, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document8.eval()
model_question8.eval()
precision, recall, f1 = evaluate_model(model_document8, model_question8)

performance_df.loc[7] = ['Hyper Parameter Testing', 'FastText only. 1000 Epochs, 0.001 Learning Rate', precision, recall, f1 ]

0m 48s (- 7m 19s) (100 10%) 1.2402
1m 35s (- 6m 20s) (200 20%) 1.0606
2m 22s (- 5m 33s) (300 30%) 0.9798
3m 13s (- 4m 50s) (400 40%) 0.9774
3m 59s (- 3m 59s) (500 50%) 0.9902
4m 47s (- 3m 11s) (600 60%) 0.9926
5m 36s (- 2m 24s) (700 70%) 1.0275
6m 28s (- 1m 37s) (800 80%) 0.9608
7m 16s (- 0m 48s) (900 90%) 0.9763
8m 3s (- 0m 0s) (1000 100%) 1.0016

Below is the performance metrics of this model on test data:
Precision: 0.18277310924369747
Recall: 0.36099585062240663
F1 score: 0.24267782426778245


Training with **1000 epochs and a learning rate of 0.1** (FastText word representation only, dot product attention).

In [None]:
# Create the model
model_question9 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document9 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

no_of_epoch = 1000
learning_rate_val = 0.1
# Train the model on training data
trainIters(model_document9, model_question9, no_of_epoch, print_every=1000, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document9.eval()
model_question9.eval()
precision, recall, f1 = evaluate_model(model_document9, model_question9)

performance_df.loc[8] = ['Hyper Parameter Testing', 'FastText only. 1000 Epochs, 0.1 Learning Rate', precision, recall, f1 ]

8m 19s (- 0m 0s) (1000 100%) 0.9970

Below is the performance metrics of this model on test data:
Precision: 0.25621414913957935
Recall: 0.5560165975103735
F1 score: 0.35078534031413616


Training with **5000 epochs and a learning rate of 0.1** (FastText word representation only, dot product attention).

In [None]:
# Create the model
model_question10 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document10 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

no_of_epoch = 5000
learning_rate_val = 0.1
# Train the model on training data
trainIters(model_document10, model_question10, no_of_epoch, print_every=1000, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document10.eval()
model_question10.eval()
precision, recall, f1 = evaluate_model(model_document10, model_question10)

performance_df.loc[9] = ['Hyper Parameter Testing', 'FastText only. 5000 Epochs, 0.1 Learning Rate', precision, recall, f1 ]

7m 58s (- 31m 55s) (1000 20%) 0.9967
15m 58s (- 23m 57s) (2000 40%) 1.0025
23m 56s (- 15m 57s) (3000 60%) 1.0110
31m 46s (- 7m 56s) (4000 80%) 0.9879
39m 42s (- 0m 0s) (5000 100%) 0.9939

Below is the performance metrics of this model on test data:
Precision: 0.2646502835538752
Recall: 0.5809128630705395
F1 score: 0.3636363636363636


Training with **10000 epochs and a learning rate of 0.1** (FastText word representation only, dot product attention).

In [None]:
# Create the model
model_question11 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document11 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels)).to(device)

no_of_epoch = 10000
learning_rate_val = 0.1
# Train the model on training data
trainIters(model_document11, model_question11, no_of_epoch, print_every=1000, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document11.eval()
model_question11.eval()
precision, recall, f1 = evaluate_model(model_document11, model_question11)

performance_df.loc[10] = ['Hyper Parameter Testing', 'FastText only. 10000 Epochs, 0.1 Learning Rate', precision, recall, f1 ]

8m 7s (- 73m 5s) (1000 10%) 1.1147
16m 7s (- 64m 30s) (2000 20%) 1.2572
24m 9s (- 56m 22s) (3000 30%) 1.2620
32m 4s (- 48m 6s) (4000 40%) 1.2628
40m 9s (- 40m 9s) (5000 50%) 1.2548
48m 22s (- 32m 14s) (6000 60%) 1.2657
56m 35s (- 24m 15s) (7000 70%) 1.2731
64m 32s (- 16m 8s) (8000 80%) 1.2753
72m 40s (- 8m 4s) (9000 90%) 1.2606
80m 39s (- 0m 0s) (10000 100%) 1.2677

Below is the performance metrics of this model on test data:
Precision: 0.2660377358490566
Recall: 0.5850622406639004
F1 score: 0.36575875486381326


We also checked the performance of a document model built with **5** BiLSTM layers, but it was comparable to the single-layer model trained for 10000 epochs with a learning rate of 0.1. We therefore decided to keep the single-layer model, which avoids the extra computation that the additional BiLSTM layers require.
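
To give a rough sense of the extra cost, the sketch below counts trainable parameters in a bidirectional LSTM stack, using the parameter layout of PyTorch's `nn.LSTM` (four gates, with separate input and hidden bias vectors). It assumes `num_layers` is passed straight through to `nn.LSTM` with input size 50 and hidden size 50, as printed for the saved single-layer model. The function name is illustrative, and the final linear layer is ignored.

```python
def bilstm_param_count(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Trainable parameters of a stacked bidirectional LSTM (nn.LSTM layout)."""
    total = 0
    for layer in range(num_layers):
        # Layers after the first receive the concatenated forward/backward outputs.
        layer_input = input_size if layer == 0 else 2 * hidden_size
        per_direction = 4 * (layer_input * hidden_size     # weight_ih
                             + hidden_size * hidden_size   # weight_hh
                             + 2 * hidden_size)            # bias_ih + bias_hh
        total += 2 * per_direction                         # forward + backward
    return total

single = bilstm_param_count(50, 50, 1)
stacked = bilstm_param_count(50, 50, 5)
print(single, stacked, round(stacked / single, 2))
```

Under these assumptions the 5-layer stack has roughly seven times as many LSTM parameters as the single-layer one, which is consistent with preferring the smaller model when the scores are comparable.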

In [None]:
# Create the model with 5 BiLSTM layers
model_question12 = BiLSTMForQuestion(word_vector_size, hidden_size).to(device)
model_document12 = BiLSTMForDocument(document_word_vector_size, hidden_size, len(token_labels), num_layers = 5).to(device)

no_of_epoch = 5000
learning_rate_val = 0.1
# Train the model on training data
trainIters(model_document12, model_question12, no_of_epoch, print_every=1000, learning_rate=learning_rate_val)

# Evaluate the model on test data
model_document12.eval()
model_question12.eval()
precision, recall, f1 = evaluate_model(model_document12, model_question12)
performance_df.loc[11] = ['Number of Layers - 5', 'FastText only. 5000 Epochs, 0.1 Learning Rate', precision, recall, f1 ]

**Performance Evaluation Table of all 12 Model Variants**

We now tabulate the precision, recall, and F1 score of all 12 model variants.

In [None]:
from tabulate import tabulate

# Print the DataFrame using tabulate
print(tabulate(performance_df, headers='keys', tablefmt='psql'))

+----+--------------------------------+-------------------------------------------------+-------------------+----------------+------------+
|    | Model Type                     | Model Description                               |   Precision Score |   Recall Score |   F1 Score |
|----+--------------------------------+-------------------------------------------------+-------------------+----------------+------------|
|  0 | Input Embedding Ablation Study | FastText only. No TFIDF, POS and NER tags       |          0.175847 |       0.344398 |   0.232819 |
|  1 | Input Embedding Ablation Study | FastText with TFIDF and POS tags. No NER tag    |          0.167024 |       0.323651 |   0.220339 |
|  2 | Input Embedding Ablation Study | FastText with TFIDF, POS and NER tags           |          0.148796 |       0.282158 |   0.194842 |
|  3 | Attention Ablation Study       | FastText only. Dot Product Attention            |          0.17234  |       0.3361   |   0.227848 |
|  4 | Attention Abl

### Save the best model

The 11th model variant (index 10 in `performance_df`: FastText word representation only, dot product attention, 10000 epochs, learning rate 0.1) is the best performing model, so we save it.
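
The choice can be reproduced by picking the variant with the highest F1 score. The sketch below uses the F1 values printed in the cells above, rounded to six decimals; the 5-layer variant is left out because its output is not shown in this section. Variable names are illustrative.

```python
# F1 scores keyed by performance_df index, copied from the printed outputs above.
f1_by_variant = {
    0: 0.232819,   # FastText only
    1: 0.220339,   # FastText + TF-IDF + POS
    2: 0.194842,   # FastText + TF-IDF + POS + NER
    3: 0.227848,   # dot product attention
    4: 0.217822,   # scaled dot product attention
    5: 0.220339,   # cosine similarity attention
    6: 0.264463,   # 500 epochs, lr 0.01
    7: 0.242678,   # 1000 epochs, lr 0.001
    8: 0.350785,   # 1000 epochs, lr 0.1
    9: 0.363636,   # 5000 epochs, lr 0.1
    10: 0.365759,  # 10000 epochs, lr 0.1
}

# Index of the variant with the highest F1 score.
best_index = max(f1_by_variant, key=f1_by_variant.get)
print(best_index, f1_by_variant[best_index])
```

Index 10 corresponds to `model_document11` and `model_question11`, the pair saved in the next cell.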

In [None]:
# Save the best performing model
torch.save(model_document11,'document_model.pt')
torch.save(model_question11,'question_model.pt')

Loading the **Document** model.

In [None]:
model_document = torch.load('document_model.pt')
model_document.eval()


BiLSTMForDocument(
  (lstm): LSTM(50, 50, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=100, out_features=4, bias=True)
)

Loading the **Question** model.

In [None]:
question_model = torch.load('question_model.pt')
question_model.eval()

BiLSTMForQuestion(
  (lstm): LSTM(50, 50, batch_first=True, bidirectional=True)
)

Testing our best model on the entire test data.

In [None]:
precision, recall, f1 = evaluate_model(model_document, question_model)


Below is the performance metrics of this model on test data:
Precision: 0.2660377358490566
Recall: 0.5850622406639004
F1 score: 0.36575875486381326
