# Textual Question Answering for the SQuAD dataset

The goal of this notebook is to build a BERT-based model that returns __an answer__, given a user question and a passage that contains the answer to the question. We use the SQuAD 2.0 dataset and start from the `bert-base-uncased` model, which we fine-tune and then evaluate on the dataset.

## Import Libraries

In [1]:
!pip3 install transformers

from transformers import BertTokenizerFast

from transformers import BertForQuestionAnswering
from torch.utils.data import DataLoader
from transformers import AdamW
import tqdm.notebook as tq

import torch
# for json parsing
import json
# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
# For large and multi-dimensional arrays
import numpy as np
# For basic cleaning and data preprocessing 
import re
# For data manipulation and analysis
import pandas as pd
import nltk
nltk.download('punkt')
# fancy progress bar
from tqdm import tqdm 
# time measurements
import time
# some math
import math

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## The SQuAD 2.0 Dataset

The SQuAD dataset contains several passages, questions about them, and their respective answers. It is stored in JSON files, so we have to parse them and store them in a pandas dataframe.

The useful information that we want to extract from the dataset is the following:
 - Each __question__ posed
 - Each __answer__ to the question (there might be more than one)
 - The __label__ of each answer, i.e. its start and end character positions in the context
 - Whether the question is __impossible__ to answer
 - Each __context__, i.e. the paragraph that answers the questions
 - The __full text__ in the way that BERT wants it

The SQuAD 2.0 dataset contains questions that cannot be answered. In that case, we store an empty string as the answer and `(0, 0)` as the label, in order to encode this information for training.

The function used in order to parse the SQuAD dataset is the following:

In [2]:
# function to parse the squad dataset
def parse_squad(filename):
    # dict -> df
    data = {'id': [], 'question': [], 'answer': [], 'context': [], 'label': [], 'impossible_answer': []}
    # open the file
    with open(filename) as f:
        # and store its json values
        dataset = json.load(f)
        # for each article
        for article in dataset['data']:
            # for each paragraph
            for par in article['paragraphs']:
                # for each QA
                for qa in par['qas']:
                    # for each one of the gold answers
                    if (qa['is_impossible']):
                        data['id'].append(qa['id'])
                        data['question'].append(qa['question'])
                        data['answer'].append("")
                        data['context'].append(par['context'])
                        data['label'].append((0,0))
                        data['impossible_answer'].append(qa['is_impossible'])

                    else:
                        for ans in qa['answers']:
                            # keep id, question, answer and context
                            data['id'].append(qa['id'])
                            data['question'].append(qa['question'])
                            data['answer'].append(ans['text'])
                            data['context'].append(par['context'])
                            # we want to store the labels, aka the start and the end of the answer
                            qstart = ans['answer_start']
                            qend = qstart + len (ans['text'])
                            data['label'].append((qstart, qend))
                            # does the answer exist?
                            data['impossible_answer'].append(qa['is_impossible'])
    # create the df and return it
    return pd.DataFrame(data)
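
To make the nested layout concrete, here is a minimal sketch that writes a toy file in the same `data -> paragraphs -> qas -> answers` structure and flattens it the way `parse_squad` does. The article, the ids and the helper name `flatten` are made up for illustration, and the rows are plain tuples instead of a dataframe.

```python
import json
import os
import tempfile

# Toy file in the SQuAD 2.0 layout; all content here is invented for illustration.
toy = {
    "data": [{
        "title": "Toy article",
        "paragraphs": [{
            "context": "BERT was released by Google in 2018.",
            "qas": [
                {"id": "q1", "question": "Who released BERT?",
                 "is_impossible": False,
                 "answers": [{"text": "Google", "answer_start": 21}]},
                {"id": "q2", "question": "Who released GPT?",
                 "is_impossible": True, "answers": []},
            ],
        }],
    }]
}


def flatten(path):
    """Condensed, hypothetical variant of parse_squad: (id, answer, label) rows."""
    rows = []
    with open(path) as f:
        dataset = json.load(f)
    for article in dataset["data"]:
        for par in article["paragraphs"]:
            for qa in par["qas"]:
                if qa["is_impossible"]:
                    # unanswerable question: empty answer, (0, 0) label
                    rows.append((qa["id"], "", (0, 0)))
                    continue
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    # the end offset is exclusive: context[start:end] == answer text
                    rows.append((qa["id"], ans["text"], (start, start + len(ans["text"]))))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json")
    with open(path, "w") as f:
        json.dump(toy, f)
    rows = flatten(path)

for row in rows:
    print(row)
```

Slicing the context with a label, e.g. `context[21:27]`, gives back the answer text, which is a handy sanity check on the character offsets.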


Loading both datasets at the same time exhausts the available memory. Thus, we load only the training dataframe here, in order to fine-tune the model; the validation set is loaded later, for the evaluation.

In [3]:
train_df = parse_squad("/content/drive/MyDrive/tn2/train-v2.0.json")
# valid_df = parse_squad("/content/drive/MyDrive/tn2/dev-v2.0.json")

Let's take a look...

In [4]:
train_df.head()

## Using pre-trained BERT models

As mentioned in the instructions given, we are going to use the `bert-base-uncased` pre-trained model. We are going to feed it with our data, fine-tune it and then test it.

### Creating the tokens

The first order of business is to create the tokens to pass to the model. In order to do this, we must load the tokenizer for our model, and pass each row of our dataframe through it. The BERT models accept inputs that are in the form:

                                `[CLS] + Question + [SEP] + passage + [SEP]`
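
As a rough illustration of this layout, the sketch below assembles the sequence by hand. It is a toy: it splits on whitespace instead of using WordPiece, and `build_bert_input` is a hypothetical helper, not part of the `transformers` API. The segment ids follow the usual BERT convention of 0 for the question part and 1 for the passage part.

```python
def build_bert_input(question, passage):
    """Toy illustration of [CLS] question [SEP] passage [SEP] with whitespace tokens."""
    q_tokens = question.lower().split()
    p_tokens = passage.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the passage and the final [SEP]
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_bert_input("Who released BERT?", "Google released BERT.")
print(tokens)
print(token_type_ids)
```

The real tokenizer additionally maps tokens to vocabulary ids, pads or truncates to a common length, and returns an attention mask.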

Load the tokenizer from the model we are going to use, the `bert-base-uncased`.

In [5]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

Create the tokens for the training dataset using the tokenizer that we have loaded.

In [6]:
# tokenize the datasets
train_encoding = tokenizer(list(train_df['question'].values), list(train_df['context'].values), truncation=True, padding=True)

Next up, in order to correctly predict the location of the answers, we must add the labels to our encoded tokens. The labels currently point to characters, so we must convert them to point to tokens. We are going to use the `char_to_token` method of the encoding.

In [7]:
def add_labels(encoding, df):
    # list to append the labels
    token_labels = []
    
    for i in range(len(df)):
        # start position of the answer
        start = encoding.char_to_token(i, df.iloc[i]['label'][0], sequence_index=1)
        # if it is None, set it to maximum length
        if start is None:
            start = tokenizer.model_max_length
        # end position of the answer
        end = encoding.char_to_token(i, df.iloc[i]['label'][1], sequence_index=1)
        # if it is None, estimate it from the start and the number of words in the answer
        if end is None:
            end = start + len(df.iloc[i]['answer'].split()) 
        # append to our list
        token_labels.append((start, end))
    # new column to the tokens
    encoding.update({'labels': token_labels})
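
The idea behind the character-to-token conversion can be sketched with plain offsets. This is a simplified stand-in, assuming whitespace tokens and a hypothetical `char_to_token` helper written from scratch; the real method works on WordPiece offsets and also handles the question/context split.

```python
def char_to_token(offsets, char_index):
    """Return the index of the token whose [start, end) span covers char_index, else None."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


context = "BERT was released by Google in 2018."

# build (start, end) character offsets for whitespace-separated tokens
offsets = []
pos = 0
for word in context.split():
    start = context.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

# the answer "Google" has the character label (21, 27)
print(char_to_token(offsets, 21))  # token that contains the first answer character
print(char_to_token(offsets, 27))  # the exclusive end lands on a space, so no token covers it
print(char_to_token(offsets, 26))  # the last answer character is inside the same token
```

In this toy version the exclusive end offset falls on whitespace and maps to no token, which shows why a fallback for a `None` end position is needed.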

Apply the function to our dataframe.

In [8]:
add_labels(train_encoding, train_df)

We are going to use a batch size of `8` for the training. When I tested bigger batch sizes, the program ran out of memory, so we keep it small.

In [9]:
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encoding)

In [10]:
# create the dataloaders
train_data = DataLoader(train_dataset, batch_size=8, shuffle=True)

### Loading the model

As we mentioned, we are going to use the `bert-base-uncased` model. This model is pre-trained, but it is not optimized for our task: the encoder weights are loaded from the checkpoint, while the question-answering head on top is newly initialized. We aim to improve both by fine-tuning the model.

For this purpose, we need an __optimizer__ and a __loss function__ in order to further train the model. No separate loss object is defined below, because `BertForQuestionAnswering` computes the loss itself when it is given the start and end positions. The learning rate that we are going to use is $10^{-5}$. I experimented with different hyperparameters, and came to the conclusion that this one produces the best results.
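
To my understanding, the loss used for extractive question answering is the mean of two cross-entropy terms, one over the start logits and one over the end logits. The sketch below reproduces that computation in plain Python for a single example; the logits are made up, and `cross_entropy` and `span_loss` are illustrative helpers rather than library functions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target position."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def span_loss(start_logits, end_logits, start, end):
    """Mean of the start and end cross-entropy terms."""
    return (cross_entropy(start_logits, start) + cross_entropy(end_logits, end)) / 2


# made-up logits over a 4-token sequence
start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [0.0, 0.2, 1.5, 0.1]

good = span_loss(start_logits, end_logits, 1, 2)  # labels agree with the highest logits
bad = span_loss(start_logits, end_logits, 3, 0)   # labels disagree with the highest logits
print(good, bad)
```

The loss is lower when the labelled positions coincide with the highest logits, which is what the optimizer pushes the model towards.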

In [11]:
# define our model
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
# use GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# learning rate
lr = 0.00001
# optimizer
optimizer = AdamW(model.parameters(), lr=lr)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

### Further training the model

In order to achieve our goal, we feed the whole training dataset to the model, so as to train it on these specific questions. We do so the traditional way: we turn on training mode and train with the hyperparameters defined next.

In [None]:
# transfer model to GPU for faster calculations
model.to(device)
# enable training mode
model.train()

# define the number of epochs
epochs = 2

# average loss of every 100 batches, used for the plot
loss_list = []
# for each epoch
for epoch in range(epochs):
    # running sum of the loss (named so that it does not shadow the built-in sum)
    running_loss = 0.0
    # for each batch in our data loader
    for i, batch in tq.tqdm(enumerate(train_data), total=len(train_data), position=0, leave=True):
        # reset the gradients
        optimizer.zero_grad()
        
        # get the info from the data loader
        ids = batch['input_ids'].to(device)
        mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['labels'].to(device)
        
        # forward pass; the model computes the loss from the start and end positions
        res = model(ids, attention_mask=mask, start_positions=labels[:,0], end_positions=labels[:,1], token_type_ids=token_type_ids)
        
        # the first element of the output is the loss
        loss = res[0]
        # every 100 batches, store the average loss and reset the running sum
        if (i % 100 == 0 and i > 0):
            loss_list.append(running_loss / 100)
            running_loss = 0.0

        running_loss += loss.item()
        # backpropagate
        loss.backward()
        # update the weights
        optimizer.step()
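
The loss bookkeeping in the loop boils down to averaging over fixed windows of batches. A minimal standalone sketch of that idea, with made-up loss values and a hypothetical `windowed_means` helper:

```python
def windowed_means(losses, window=100):
    """Average the losses over consecutive windows; an incomplete last window is dropped."""
    means = []
    running = 0.0
    for i, loss in enumerate(losses, start=1):
        running += loss
        if i % window == 0:
            means.append(running / window)
            running = 0.0
    return means


# made-up per-batch losses, averaged in windows of 2
print(windowed_means([4.0, 2.0, 3.0, 1.0, 5.0], window=2))  # [3.0, 2.0]
```

Averaging smooths the noisy per-batch values, so each point of the resulting curve stands for one window of batches rather than a single batch.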

#### Plotting the Loss Metric

Let's now plot the loss curve of our model, in order to observe how training proceeded.

In [None]:
plt.plot(loss_list)
plt.xlabel("Batch (x100)", fontsize=15)
plt.ylabel("Loss", rotation=0, fontsize=15)
plt.show()

As expected, the loss declines as the training proceeds.

We are going to save our model, because training takes an extremely long time and we want to keep its trained state.

In [12]:
torch.save(model.state_dict(), "/content/drive/MyDrive/tn2/mondelo")

Load a previously saved model. My final model can be found [here](https://drive.google.com/file/d/1hqxek4ZbZavYMbtXSvpjzeAZZ73Ukakd/view?usp=sharing).

In [13]:
model.load_state_dict(torch.load("/content/drive/MyDrive/tn2/mondelo"))

<All keys matched successfully>

Now, we can delete the training data in order to save some space in our memory.

In [None]:
del train_encoding
del train_data

### Evaluating the validation dataset

Finally, we are going to use the validation dataset in order to observe the difference made by fine-tuning.

In [15]:
# Function for producing the answer string (from the lectures, altered)
def answer(input_ids, tokenizer, ans_start, ans_end):
    # if starting and ending tokens are the same, that means that there is no answer to the question
    if (ans_start == ans_end):
        return ''
    # convert the tokens to strings
    token_ids = input_ids.tolist()
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    # avoid an overflow
    if (ans_start >= len(tokens)):
        return ''
    # create the answer by appending the tokens
    answer = tokens[ans_start]
    for token_index in range(ans_start + 1, ans_end + 1):
        if tokens[token_index][0:2] == '##':
            answer += tokens[token_index][2:]
        else:
            answer += ' ' + tokens[token_index]
    return answer
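
The token-joining step can be tried in isolation. In the sketch below, continuation pieces (those starting with `##`) are glued to the previous piece and all other tokens are separated by a space; the token list is made up and `join_wordpieces` is an illustrative helper, not a tokenizer method.

```python
def join_wordpieces(tokens):
    """Rebuild a string from WordPiece-style tokens."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            # continuation piece: append without a space and without the marker
            text += token[2:]
        elif text:
            text += " " + token
        else:
            text = token
    return text


print(join_wordpieces(["em", "##bed", "##ding", "##s", "of", "words"]))  # embeddings of words
```

Note that the result is lower-cased and may space punctuation differently from the original passage, which the normalization in the evaluation script partly compensates for.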

Now we are going to define the evaluation function that we will use.

In [16]:
def evaluate(model, valid_data):
    # evaluation mode
    model.eval()
    model.to(device)

    # dictionary for the answers
    answers = {}

    # do not update the gradients
    with torch.no_grad():
        for i, batch in tq.tqdm(enumerate(valid_data), total= len(valid_data), position=0, leave=True):

            # get the info from the data loader
            id = batch['input_ids'].to(device)
            mask = batch['attention_mask'].to(device)
            token_type = batch['token_type_ids'].to(device)

            # predict the label by forwarding the data to the model
            res = model(id, attention_mask=mask, token_type_ids=token_type)
            
            # argmax to get the starting and ending positions
            ans_start = torch.argmax(res.start_logits).item()
            ans_end = torch.argmax(res.end_logits).item()
            
            # append the answer to the dict
            answers[valid_df.iloc[i]['id']] = answer(id[0], tokenizer, ans_start, ans_end)

    return answers
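
The span selection used above is a plain argmax over each set of logits, combined with the convention that equal start and end positions mean "no answer". A small sketch with made-up logits, where `pick_span` is a hypothetical helper:

```python
def pick_span(start_logits, end_logits):
    """Pick the highest-scoring start and end positions independently."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(len(end_logits)), key=end_logits.__getitem__)
    return start, end


# made-up logits: the model points at tokens 2..3
answerable = pick_span([0.1, 0.2, 3.0, 0.5], [0.0, 0.1, 0.4, 2.5])
# made-up logits: both heads point at position 0, read as "no answer"
unanswerable = pick_span([5.0, 0.1, 0.2], [4.0, 0.3, 0.1])

print(answerable, answerable[0] != answerable[1])
print(unanswerable, unanswerable[0] != unanswerable[1])
```

Because the two positions are chosen independently, nothing prevents the end from landing before the start; scoring start/end pairs jointly would be one possible refinement, which is not what this notebook does.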
 

In [17]:
answers = evaluate(model, valid_data)

### Altered evaluation script

We are going to use the evaluation script provided by SQuAD, but we are going to alter it a bit in order to suit our needs.

In [18]:
import collections
# needed by the plotting helpers below (os.path, os.makedirs)
import os
import string
def make_qid_to_has_ans(dataset):
  qid_to_has_ans = {}
  for article in dataset:
    for p in article['paragraphs']:
      for qa in p['qas']:
        qid_to_has_ans[qa['id']] = bool(qa['answers'])
  return qid_to_has_ans

def normalize_answer(s):
  """Lower text and remove punctuation, articles and extra whitespace."""
  def remove_articles(text):
    regex = re.compile(r'\b(a|an|the)\b', re.UNICODE)
    return re.sub(regex, ' ', text)
  def white_space_fix(text):
    return ' '.join(text.split())
  def remove_punc(text):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)
  def lower(text):
    return text.lower()
  return white_space_fix(remove_articles(remove_punc(lower(s))))

def get_tokens(s):
  if not s: return []
  return normalize_answer(s).split()

def compute_exact(a_gold, a_pred):
  return int(normalize_answer(a_gold) == normalize_answer(a_pred))

def compute_f1(a_gold, a_pred):
  gold_toks = get_tokens(a_gold)
  pred_toks = get_tokens(a_pred)
  common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
  num_same = sum(common.values())
  if len(gold_toks) == 0 or len(pred_toks) == 0:
    # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
    return int(gold_toks == pred_toks)
  if num_same == 0:
    return 0
  precision = 1.0 * num_same / len(pred_toks)
  recall = 1.0 * num_same / len(gold_toks)
  f1 = (2 * precision * recall) / (precision + recall)
  return f1

def get_raw_scores(dataset, preds):
  exact_scores = {}
  f1_scores = {}
  for article in dataset:
    for p in article['paragraphs']:
      for qa in p['qas']:
        qid = qa['id']
        gold_answers = [a['text'] for a in qa['answers']
                        if normalize_answer(a['text'])]
        if not gold_answers:
          # For unanswerable questions, only correct answer is empty string
          gold_answers = ['']
        if qid not in preds:
          print('Missing prediction for %s' % qid)
          continue
        a_pred = preds[qid]
        # Take max over all gold answers
        exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers)
        f1_scores[qid] = max(compute_f1(a, a_pred) for a in gold_answers)
  return exact_scores, f1_scores

def apply_no_ans_threshold(scores, na_probs, qid_to_has_ans, na_prob_thresh):
  new_scores = {}
  for qid, s in scores.items():
    pred_na = na_probs[qid] > na_prob_thresh
    if pred_na:
      new_scores[qid] = float(not qid_to_has_ans[qid])
    else:
      new_scores[qid] = s
  return new_scores

def make_eval_dict(exact_scores, f1_scores, qid_list=None):
  if not qid_list:
    total = len(exact_scores)
    return collections.OrderedDict([
        ('exact', 100.0 * sum(exact_scores.values()) / total),
        ('f1', 100.0 * sum(f1_scores.values()) / total),
        ('total', total),
    ])
  else:
    total = len(qid_list)
    return collections.OrderedDict([
        ('exact', 100.0 * sum(exact_scores[k] for k in qid_list) / total),
        ('f1', 100.0 * sum(f1_scores[k] for k in qid_list) / total),
        ('total', total),
    ])

def merge_eval(main_eval, new_eval, prefix):
  for k in new_eval:
    main_eval['%s_%s' % (prefix, k)] = new_eval[k]

def plot_pr_curve(precisions, recalls, out_image, title):
  plt.step(recalls, precisions, color='b', alpha=0.2, where='post')
  plt.fill_between(recalls, precisions, step='post', alpha=0.2, color='b')
  plt.xlabel('Recall')
  plt.ylabel('Precision')
  plt.xlim([0.0, 1.05])
  plt.ylim([0.0, 1.05])
  plt.title(title)
  plt.savefig(out_image)
  plt.clf()

def make_precision_recall_eval(scores, na_probs, num_true_pos, qid_to_has_ans,
                               out_image=None, title=None):
  qid_list = sorted(na_probs, key=lambda k: na_probs[k])
  true_pos = 0.0
  cur_p = 1.0
  cur_r = 0.0
  precisions = [1.0]
  recalls = [0.0]
  avg_prec = 0.0
  for i, qid in enumerate(qid_list):
    if qid_to_has_ans[qid]:
      true_pos += scores[qid]
    cur_p = true_pos / float(i+1)
    cur_r = true_pos / float(num_true_pos)
    if i == len(qid_list) - 1 or na_probs[qid] != na_probs[qid_list[i+1]]:
      # i.e., if we can put a threshold after this point
      avg_prec += cur_p * (cur_r - recalls[-1])
      precisions.append(cur_p)
      recalls.append(cur_r)
  if out_image:
    plot_pr_curve(precisions, recalls, out_image, title)
  return {'ap': 100.0 * avg_prec}

def run_precision_recall_analysis(main_eval, exact_raw, f1_raw, na_probs, 
                                  qid_to_has_ans, out_image_dir):
  if out_image_dir and not os.path.exists(out_image_dir):
    os.makedirs(out_image_dir)
  num_true_pos = sum(1 for v in qid_to_has_ans.values() if v)
  if num_true_pos == 0:
    return
  pr_exact = make_precision_recall_eval(
      exact_raw, na_probs, num_true_pos, qid_to_has_ans,
      out_image=os.path.join(out_image_dir, 'pr_exact.png'),
      title='Precision-Recall curve for Exact Match score')
  pr_f1 = make_precision_recall_eval(
      f1_raw, na_probs, num_true_pos, qid_to_has_ans,
      out_image=os.path.join(out_image_dir, 'pr_f1.png'),
      title='Precision-Recall curve for F1 score')
  oracle_scores = {k: float(v) for k, v in qid_to_has_ans.items()}
  pr_oracle = make_precision_recall_eval(
      oracle_scores, na_probs, num_true_pos, qid_to_has_ans,
      out_image=os.path.join(out_image_dir, 'pr_oracle.png'),
      title='Oracle Precision-Recall curve (binary task of HasAns vs. NoAns)')
  merge_eval(main_eval, pr_exact, 'pr_exact')
  merge_eval(main_eval, pr_f1, 'pr_f1')
  merge_eval(main_eval, pr_oracle, 'pr_oracle')

def histogram_na_prob(na_probs, qid_list, image_dir, name):
  if not qid_list:
    return
  x = [na_probs[k] for k in qid_list]
  weights = np.ones_like(x) / float(len(x))
  plt.hist(x, weights=weights, bins=20, range=(0.0, 1.0))
  plt.xlabel('Model probability of no-answer')
  plt.ylabel('Proportion of dataset')
  plt.title('Histogram of no-answer probability: %s' % name)
  plt.savefig(os.path.join(image_dir, 'na_prob_hist_%s.png' % name))
  plt.clf()

def find_best_thresh(preds, scores, na_probs, qid_to_has_ans):
  num_no_ans = sum(1 for k in qid_to_has_ans if not qid_to_has_ans[k])
  cur_score = num_no_ans
  best_score = cur_score
  best_thresh = 0.0
  qid_list = sorted(na_probs, key=lambda k: na_probs[k])
  for i, qid in enumerate(qid_list):
    if qid not in scores: continue
    if qid_to_has_ans[qid]:
      diff = scores[qid]
    else:
      if preds[qid]:
        diff = -1
      else:
        diff = 0
    cur_score += diff
    if cur_score > best_score:
      best_score = cur_score
      best_thresh = na_probs[qid]
  return 100.0 * best_score / len(scores), best_thresh

def find_all_best_thresh(main_eval, preds, exact_raw, f1_raw, na_probs, qid_to_has_ans):
  best_exact, exact_thresh = find_best_thresh(preds, exact_raw, na_probs, qid_to_has_ans)
  best_f1, f1_thresh = find_best_thresh(preds, f1_raw, na_probs, qid_to_has_ans)
  main_eval['best_exact'] = best_exact
  main_eval['best_exact_thresh'] = exact_thresh
  main_eval['best_f1'] = best_f1
  main_eval['best_f1_thresh'] = f1_thresh

def compute_metrics(dataset, preds):
  
  na_probs = {k: 0.0 for k in preds}
  qid_to_has_ans = make_qid_to_has_ans(dataset)  # maps qid to True/False
  has_ans_qids = [k for k, v in qid_to_has_ans.items() if v]
  no_ans_qids = [k for k, v in qid_to_has_ans.items() if not v]
  exact_raw, f1_raw = get_raw_scores(dataset, preds)
  exact_thresh = apply_no_ans_threshold(exact_raw, na_probs, qid_to_has_ans, 0)
                                        # OPTS.na_prob_thresh)
  f1_thresh = apply_no_ans_threshold(f1_raw, na_probs, qid_to_has_ans, 0)
                                    #  OPTS.na_prob_thresh)
  out_eval = make_eval_dict(exact_thresh, f1_thresh)
  if has_ans_qids:
    has_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=has_ans_qids)
    merge_eval(out_eval, has_ans_eval, 'HasAns')
  if no_ans_qids:
    no_ans_eval = make_eval_dict(exact_thresh, f1_thresh, qid_list=no_ans_qids)
    merge_eval(out_eval, no_ans_eval, 'NoAns')
    
    
  return out_eval
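
A worked example may help to see how the two metrics behave. The sketch below is a condensed re-implementation of the normalization and the token-level F1 from the script above, with shortened helper names (`normalize`, `f1`) and made-up strings.

```python
import collections
import re
import string


def normalize(text):
    """Lower-case, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1(gold, pred):
    """Token-level F1 between a gold answer and a prediction."""
    gold_toks = normalize(gold).split()
    pred_toks = normalize(pred).split()
    if not gold_toks or not pred_toks:
        # if either side is empty, the score is 1 only when both are empty
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


gold = "the Eiffel Tower"
pred = "Eiffel Tower in Paris"
print(normalize(gold) == normalize(pred))  # exact match fails
print(f1(gold, pred))                      # partial credit from the token overlap
```

Here the prediction shares 2 tokens with the gold answer, so precision is 2/4, recall is 2/2 and F1 is 2/3, while the exact-match score is 0.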


### Evaluation Metrics

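First up, we must process the validation dataset. Note that `evaluate` relies on the `valid_df` and `valid_data` objects created here, so this cell has to be executed before `evaluate` is called.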

In [19]:
# location of the validation dataset
valid_file = "/content/drive/MyDrive/tn2/dev-v2.0.json"

# open the dataset file
with open(valid_file) as f:
    # and store its json values
    dataset = json.load(f)

valid_df = parse_squad(valid_file)
# tokenization and addition of labels
valid_encoding = tokenizer(list(valid_df['question'].values), list(valid_df['context'].values), truncation=True, padding=True)
add_labels(valid_encoding, valid_df)

# creation of the data loader
valid_dataset = SquadDataset(valid_encoding)
valid_data = DataLoader(valid_dataset, batch_size=1, shuffle=False)

Compute the metrics based on the validation script.

In [20]:
result = compute_metrics(dataset['data'], answers)

print("Exact matches: ", result['exact'])
print("F1 score: ", result['f1'])

Exact matches:  49.288301187568436
F1 score:  61.47234198490996


### Comparison without fine-tuned model

Finally, we are going to see what difference our fine-tuning made to the predictions. We re-load the model without fine-tuning and run the validation dataset through it.

In [21]:
base_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

answers = evaluate(base_model, valid_data)

result = compute_metrics(dataset['data'], answers)

print("Exact matches: ", result['exact'])
print("F1 score: ", result['f1'])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a



Exact matches:  0.6569527499368315
F1 score:  3.3973115115380033


As we can see, although our scores are not extremely high, the fine-tuned model clearly beats the predictions of the model without fine-tuning.