# qa-nlp
Question answering neural model based on the SQuAD dataset using XLNet.

Authors:
- Lorenzo Mario Amorosa
- Andrea Espis
- Mattia Orlandi
- Giacomo Pinardi

## 0. Environment setup
Import the required libraries, fix random seed and set GPU backend.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Numeric and data manipulation tools
import pandas as pd
import numpy as np
import random

# Deep learning framework
import torch
import torch.optim as optim

# Other tools
from tqdm.notebook import tqdm
from time import time
import json
from itertools import zip_longest

# Custom modules
from transformers import XLNetTokenizerFast
from model.xlnet_squad import XLNetForQuestionAnswering
from utils.squad_utils import squad_loss
from utils.xlnet_train_utils import training_loop, evaluate

# Type hint
from typing import Sequence, List, Tuple, Callable, Optional, Dict, Union

In [3]:
# Set seed for reproducibility
def fix_random(seed: int):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

fix_random(42)

In [4]:
# Use GPU acceleration if possible
if torch.cuda.is_available():
    DEVICE = torch.device('cuda:0')
    print("Using device:", DEVICE)
else:
    raise Exception('Switch to runtime GPU, otherwise the code won\'t work properly')

Using device: cuda:0


In [5]:
# Lambda for computing the mean of a list
mean: Callable[[Sequence[float]], float] = lambda l: sum(l) / len(l)

# Lambda for transforming a list of tuples into a tuple of lists
to_tuple_of_lists: Callable[[List[Tuple]], Tuple[List]] = lambda list_of_tuples: tuple(map(list, zip(*list_of_tuples)))

# Lambda for transforming a tuple of lists into a list of tuples
to_list_of_tuples: Callable[[Tuple[List]], List[Tuple]] = lambda tuple_of_lists: list(zip(*tuple_of_lists))

# Lambda for iterating with batches (if the length of the sequences does not match with the batch size, tuples of empty lists are appended)
batch_iteration: Callable[[List[Tuple]], zip] = lambda data, batch_size: zip_longest(*[iter(data)] * batch_size, fillvalue=([], [], []))

## 1. Dataset preparation

In [6]:
"""
json structure:

data []
|---title
|---paragraphs []
|   |---context
|   |---qas []
|   |   |---answers []
|   |   |   |---answer_start
|   |   |   |---text
|   |   |---question
|   |   |---id
version

"""

filename = 'training_set.json'

with open(filename, 'r') as f:
    raw_data = f.readlines()[0]

parsed_data = json.loads(raw_data)['data']

context_list = []
context_index = -1
paragraph_index = -1

dataset = {'paragraph_index': [], 'context_index': [], 'question': [], 'id': [], 'answer_start': [], 'answer_end': [], 'answer_text': []}

for i in range(len(parsed_data)):
    paragraph_index += 1
    for j in range(len(parsed_data[i]['paragraphs'])):
        context_list.append(parsed_data[i]['paragraphs'][j]['context'])
        context_index += 1

        for k in range(len(parsed_data[i]['paragraphs'][j]['qas'])):
            question = parsed_data[i]['paragraphs'][j]['qas'][k]['question']
            id = parsed_data[i]['paragraphs'][j]['qas'][k]['id']

            for l in range(len(parsed_data[i]['paragraphs'][j]['qas'][k]['answers'])): 
                answer_start = parsed_data[i]['paragraphs'][j]['qas'][k]['answers'][l]['answer_start']
                answer_text = parsed_data[i]['paragraphs'][j]['qas'][k]['answers'][l]['text']

                answer_end = answer_start + len(answer_text)

                dataset['paragraph_index'].append(paragraph_index)
                dataset['context_index'].append(context_index)
                dataset['question'].append(question)
                dataset['id'].append(id)
                dataset['answer_start'].append(answer_start)
                dataset['answer_end'].append(answer_end)
                dataset['answer_text'].append(answer_text)

df = pd.DataFrame.from_dict(dataset)

df.head()

Unnamed: 0,paragraph_index,context_index,question,id,answer_start,answer_end,answer_text
0,0,0,To whom did the Virgin Mary allegedly appear i...,5733be284776f41900661182,515,541,Saint Bernadette Soubirous
1,0,0,What is in front of the Notre Dame Main Building?,5733be284776f4190066117f,188,213,a copper statue of Christ
2,0,0,The Basilica of the Sacred heart at Notre Dame...,5733be284776f41900661180,279,296,the Main Building
3,0,0,What is the Grotto at Notre Dame?,5733be284776f41900661181,381,420,a Marian place of prayer and reflection
4,0,0,What sits on top of the Main Building at Notre...,5733be284776f4190066117e,92,126,a golden statue of the Virgin Mary


In [7]:
# Some examples of contexts and questions:
for i in range(0, 1000, 100):
    print('Context: ', context_list[df['context_index'][i]])
    print('Question:', df['question'][i], "\n")

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 

Context:  One of the main driving forces in the growth of the University was its football team, the Notre Dame Fighting Irish. Knute Rockne became head coach in 1918. Under Rockne, the Irish would post a record

In [8]:
# Define split ratios
test_ratio = 0.2
val_ratio = 0.2

# Build array of paragraphs indexes and shuffle them
paragraph_indexes = df['paragraph_index'].unique()
np.random.shuffle(paragraph_indexes)
n_samples = len(paragraph_indexes)

# Reserve indexes for test set
test_size = int(test_ratio * n_samples)
train_val_size = n_samples - test_size
test_indexes = paragraph_indexes[-test_size:]
# Reserve indexes for validation set
val_size = int(val_ratio * train_val_size)
train_size = train_val_size - val_size
val_indexes = paragraph_indexes[-(test_size + val_size):-test_size]
# Reserve indexes for training set
train_indexes = paragraph_indexes[:train_size]

assert train_size == len(train_indexes), 'Something went wrong with train set slicing'
assert val_size == len(val_indexes), 'Something went wrong with val set slicing'
assert test_size == len(test_indexes), 'Something went wrong with test set slicing'

print('Number of train paragraphs:', train_size)
print('Number of validation paragraphs:', val_size)
print('Number of test paragraphs:', test_size)

# Split dataframe
df_train = df[np.in1d(df['paragraph_index'], train_indexes)].reset_index(drop=True)
df_val = df[np.in1d(df['paragraph_index'], val_indexes)].reset_index(drop=True)
df_test = df[np.in1d(df['paragraph_index'], test_indexes)].reset_index(drop=True)

print('\nNumber of train samples:', len(df_train))
print('Number of validation samples:', len(df_val))
print('Number of test samples:', len(df_test))

# Obtaining list of ids for accurate performance evaluation
id_train = df_train['id'].to_list()
id_val = df_val['id'].to_list()
id_test = df_test['id'].to_list()

Number of train paragraphs: 284
Number of validation paragraphs: 70
Number of test paragraphs: 88

Number of train samples: 57451
Number of validation samples: 12921
Number of test samples: 17227


## 2. Tokenization

In [9]:
tokenizer = XLNetTokenizerFast.from_pretrained('xlnet-base-cased')

In [10]:
def align_labels(df: pd.DataFrame, context_list: List[str]):   
    # Retrieve contexts
    contexts = df['context_index'].apply(lambda x: context_list[x])
    # Get indexes to start_end characters
    y_char = [(start, end) for start, end in zip(df['answer_start'].tolist(), df['answer_end'].tolist())]
    # Tokenize context to obtain correspondence
    offset_mapping = [tokenizer.encode_plus(context, return_offsets_mapping=True, truncation=True)['offset_mapping'] for context in contexts]
    # Convert indexes s.t. they point to start/end tokens
    y_aligned = []
    for offset, (char_start, char_end) in zip(offset_mapping, y_char):
        token_start, token_end = 0, 0
        for i, span in enumerate(offset):
            if span[0] <= char_start <= span[1]:
                token_start = i
            if span[0] <= char_end <= span[1]:
                token_end = i
                break
        y_aligned.append((token_start, token_end))
    
    return contexts.tolist(), df['question'].tolist(), y_aligned

train_data = to_list_of_tuples(align_labels(df_train, context_list))
val_data = to_list_of_tuples(align_labels(df_val, context_list))
test_data = to_list_of_tuples(align_labels(df_test, context_list))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [11]:
shuffled_train_data = random.Random(42).sample(train_data, len(train_data))

for idx in range(10):
    # Get sample
    context, question, (token_start, token_end) = shuffled_train_data[idx]
    # Tokenize question and context
    input_ids = tokenizer(question, context, truncation='only_second')['input_ids']
    # Reconvert ids to tokens
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # Find context offset
    offset = input_ids.index(tokenizer.sep_token_id)
    # Update start and end labels
    token_start += offset
    token_end += offset
    # Extract answer's tokens
    answer_tokens = tokens[token_start:token_end + 1]
    # Format answer
    answer = ''.join(answer_tokens).replace('▁', ' ').strip()

    print(question, ':', answer)

Where do traditional mandolin orchestras remain popular?  : in Japan and
When was the first overseas deployment of the Canadian Military? : the Second Boer
In which generation did iPod start providing compatibility with USB? : The third
What transformed business organization in 1890? : a managerial
Why are the Mollusca and Annelida considered to be close relatives? : the common presence of trochophore
What was Boardwalk Hall formerly known as? : "Historic Atlantic City Convention
Who is Yueju traditionally performed by? : by actresses
How many endemic species of fungi have been found? : including
What year was the 73rd AES Convention? : In
About how many hotels does St. Barts have? : about


## 3. Training


In [None]:
# Model creation
model_xlnet = XLNetForQuestionAnswering('xlnet-base-cased').to(DEVICE)

In [None]:
EP = 30
BS = 8

optimizer = optim.Adam(filter(lambda p: p.requires_grad, model_xlnet.parameters()))
criterion = squad_loss

history = training_loop(model=model_xlnet,
                        train_data=train_data,
                        optimizer=optimizer,
                        epochs=EP,
                        batch_size=BS,
                        criterion=criterion,
                        tokenizer=tokenizer,
                        val_data=val_data,
                        early_stopping=True,
                        patience=15,
                        checkpoint_path='xlnet.pt')
# Save the model
torch.save(model_xlnet, "xlnet.pt")
# Plot the history
plot_history(history)

## 4. Test
Assess the model's performance on the test set.

In [None]:
test_loss, _, _, test_exact_score, test_f1_score = evaluate(model_xlnet, test_data, BS, squad_loss, tokenizer, verbose=True)