# **0. Preliminary Settings**

First, we clone the repository to get access to the code and to use its utility functions inside the notebook. The `src` folder is then added to the system path so that its modules can be imported.

Then, we use the utility functions in the `src` folder to get the ***train*** and ***validation*** splits. We discard the ***test*** split here because it is reserved for evaluating the best model only.

In [1]:
!git clone https://github.com/giuluck/Gangster-SQuAD

import sys
sys.path.append('Gangster-SQuAD/src')

from dataset import get_dataframes
train_df, val_df, _ = get_dataframes('Gangster-SQuAD/data/training_set.json')

Cloning into 'Gangster-SQuAD'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 30 (delta 8), reused 24 (delta 5), pack-reused 0
Unpacking objects: 100% (30/30), done.


In [2]:
## TODO: remove
train_df = train_df.iloc[:1000]
val_df = val_df.iloc[:100]

# **1. Preprocessing**

We use **SpaCy** to tokenize both the paragraphs and the questions and then extract the *contextual embedding* of each token. We also slightly modify the default tokenizer so that it splits consecutive punctuation characters and handles the unusual dash characters that appear in the dataset.



In [3]:
import re
import string
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# escape the characters so that ']', '\' and '-' are taken literally inside the character class
split_chars = re.escape(string.punctuation + '––')                 # weird dashes added
inf = list(nlp.Defaults.infixes) + [f'(?<=.)[{split_chars}](?=.)'] # always split by punctuation chars
infix_re = compile_infix_regex(tuple(inf))

nlp.tokenizer = Tokenizer(
  nlp.vocab,
  prefix_search=nlp.tokenizer.prefix_search,
  suffix_search=nlp.tokenizer.suffix_search,
  infix_finditer=infix_re.finditer,
  token_match=nlp.tokenizer.token_match,
  rules=nlp.Defaults.tokenizer_exceptions
)

for token in nlp("SpaCy is cooler than NLTK, ain't it?"):
  print(f'{token.text:7}--> {len(token.tensor)} features')

SpaCy  --> 96 features
is     --> 96 features
cooler --> 96 features
than   --> 96 features
NLTK   --> 96 features
,      --> 96 features
ai     --> 96 features
n't    --> 96 features
it     --> 96 features
?      --> 96 features
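
To see what the extra infix rule matches, the same kind of pattern can be tried with plain `re`, outside SpaCy. This is only a sketch: `split_on_infixes` is a hypothetical helper written for this illustration, it uses a single en dash as the extra character, and the real tokenizer also applies prefix, suffix and exception rules that are ignored here.

```python
import re
import string

# A punctuation character with at least one character on each side of it.
# re.escape keeps ']', '\' and '-' literal inside the character class.
split_chars = re.escape(string.punctuation + '–')
infix_pattern = re.compile(f'(?<=.)[{split_chars}](?=.)')

def split_on_infixes(chunk):
    """Split a whitespace-free chunk around every infix match (illustration only)."""
    pieces, last = [], 0
    for match in infix_pattern.finditer(chunk):
        if match.start() > last:
            pieces.append(chunk[last:match.start()])
        pieces.append(match.group())
        last = match.end()
    if last < len(chunk):
        pieces.append(chunk[last:])
    return pieces

print(split_on_infixes('1990–1995'))   # ['1990', '–', '1995']
print(split_on_infixes('well-known'))  # ['well', '-', 'known']
print(split_on_infixes('a..b'))        # ['a', '.', '.', 'b']
```

The lookbehind and lookahead mean that a punctuation character at the very start or end of a chunk is not an infix match; in SpaCy those positions are left to the prefix and suffix rules.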


### ***1.1. Tokenization***

We process each sentence using *SpaCy*.

- **Questions** are (almost) unique in the dataset, so we process them individually and get the matrix of *contextual embeddings* for each of them. This matrix is then appended to the respective record in both the *train* and the *validation* dataset.

- **Contexts**, instead, are repeated across records. Since tokenizing a sentence is computationally heavy, we avoid processing the same context multiple times by storing the unique contexts in a set and processing each of them once.
  * Unlike for *questions*, for *contexts* we retrieve not only the matrix of *contextual embeddings* but also the list of char *offsets*, which is used to retrieve the answer in the initial context given the two token boundaries predicted by the neural model.
  * As both of these pieces of data are heavy, we do not store them in the dataframe itself but in a list containing both the *embeddings* and the *offsets* of each *context*. The dataframe only gets a column with the respective index, so that memory is not wasted on storing the same information for the same *context* multiple times.

> Please note that this procedure should take around **15 minutes**.
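
The `get_offsets` helper used for the contexts comes from the repository's `preprocessing` module, and its source is not shown in this notebook. As a rough idea of what such a helper might compute, here is a minimal sketch. It assumes that each token string keeps its trailing whitespace, so that concatenating the tokens gives back the original text; `sketch_offsets` is a hypothetical name and not the repository's function.

```python
def sketch_offsets(tokens):
    """Return the start character of each token.

    Assumes the tokens (with their trailing whitespace) concatenate
    back to the original text. Hypothetical stand-in, not `get_offsets`.
    """
    offsets, position = [], 0
    for token in tokens:
        offsets.append(position)   # this token starts where the previous one ended
        position += len(token)
    return offsets

tokens = ['SpaCy ', 'is ', 'cool', '!']
text = ''.join(tokens)
offsets = sketch_offsets(tokens)
print(offsets)  # [0, 6, 9, 13]

# Each offset points at the first character of its token in the text.
for token, offset in zip(tokens, offsets):
    assert text[offset:offset + len(token)] == token
```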

In [4]:
def process_sentences(sentences, extract_features=lambda doc: doc.tensor, log_rate=1000):
  output = []
  for i, sentence in enumerate(sentences):
    if log_rate > 0 and i % log_rate == 0:
      format_chars = len(str(len(sentences) - 1))
      print(f'Sentence {i:{format_chars}}/{len(sentences)}')
    doc = nlp(sentence)
    features = extract_features(doc)
    output.append(features)
  return output

In [5]:
train_df['qst_embedding'] = process_sentences(train_df['question'])

Sentence   0/1000


In [6]:
val_df['qst_embedding'] = process_sentences(val_df['question'])

Sentence  0/100


In [7]:
import pandas as pd
from preprocessing import get_offsets

contexts = pd.concat((train_df['context'], val_df['context']))
contexts = { ctx: idx for idx, ctx in enumerate(set(contexts)) }

train_df['ctx_index'] = [contexts[ctx] for ctx in train_df['context']]
val_df['ctx_index'] = [contexts[ctx] for ctx in val_df['context']]

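# token.text_with_ws keeps the trailing whitespace of each token (token.string was removed in spaCy v3)
contexts = process_sentences(list(contexts.keys()), extract_features=lambda ctx: {
  'ctx_embedding': ctx.tensor,
  'offsets': get_offsets([token.text_with_ws for token in ctx])
})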

Sentence   0/140


### ***1.2. Boundaries Computation***

Once the contexts have been tokenized, we can compute the token boundaries (the start and end token indices) that identify each answer inside its context. These boundaries are used for training and validation, as they are the **outputs** of the neural model.
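
The mapping from a character span to token boundaries could look roughly like the sketch below. It is not the repository's `compute_boundaries`: `sketch_boundaries` is a hypothetical name, the example sentence and its offsets are made up, and treating both returned indices as inclusive is an assumption of this sketch.

```python
from bisect import bisect_right

def sketch_boundaries(offsets, start_char, answer_length):
    """Map a character span to (start_idx, end_idx) token indices.

    `offsets` holds the start character of each token, in increasing order.
    Both returned indices are inclusive (an assumption of this sketch).
    """
    end_char = start_char + max(answer_length, 1) - 1  # last character of the answer
    start_idx = bisect_right(offsets, start_char) - 1  # last token starting at or before start_char
    end_idx = bisect_right(offsets, end_char) - 1      # last token starting at or before end_char
    return start_idx, end_idx

# Token start offsets for the made-up context 'The sixth album was released in 2003.'
offsets = [0, 4, 10, 16, 20, 29, 32, 36]

print(sketch_boundaries(offsets, 4, 3))    # (1, 1): 'six' lies inside the token 'sixth'
print(sketch_boundaries(offsets, 10, 18))  # (2, 4): 'album was released'
```

Because the boundaries are token indices, an answer that covers only part of a token is widened to the whole token.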

In [8]:
from preprocessing import compute_boundaries

def add_boundaries(df):
  start_indices = []
  end_indices = []
  for _, record in df.iterrows():
    ctx = contexts[record['ctx_index']]
    start_idx, end_idx = compute_boundaries(ctx['offsets'], record['start'], len(record['answer']))
    start_indices.append(start_idx)
    end_indices.append(end_idx)
  df['start_idx'] = start_indices
  df['end_idx'] = end_indices

add_boundaries(train_df)
add_boundaries(val_df)

train_df[['start_idx', 'end_idx']]

| id | start_idx | end_idx |
|---|---|---|
| 5733be284776f41900661182 | 102 | 105 |
| 5733be284776f4190066117f | 37 | 42 |
| 5733be284776f41900661180 | 57 | 60 |
| 5733be284776f41900661181 | 76 | 83 |
| 5733be284776f4190066117e | 17 | 24 |
| ... | ... | ... |
| 56d4f71e2ccc5a1400d833aa | 11 | 12 |
| 56bed4553aeaaa14008c94e5 | 1 | 3 |
| 56bed4553aeaaa14008c94e7 | 31 | 34 |
| 56bed4553aeaaa14008c94e8 | 55 | 56 |


### ***1.3. Correctness Check***

Once the dataframe is ready, we check whether our tokenization is good enough to let us retrieve the correct answers from the text.

Answers are retrieved by:
1. getting the two `start` and `end` (token) boundaries that should be computed by the model
2. converting them into a `start_char` and an `end_char` pair of indices, which represent the boundaries in the original context, using the `offsets` list
3. selecting the correct portion of the `context` using these two (char) boundaries and stripping the obtained substring

Some of the retrieved answers do not match the dataset answers exactly. This happens because some dataset answers are substrings or variations of words in the text (for example, the answer *six* where the text contains *sixth*), so they cannot be recovered from whole tokens.
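
The three steps above can be sketched as follows. This is not the repository's `retrieve_answer`: `sketch_retrieve` is a hypothetical name, the context is invented for illustration, and the token boundaries are assumed to be inclusive.

```python
def sketch_retrieve(start_idx, end_idx, offsets, context):
    """Return the text covered by tokens start_idx..end_idx (both inclusive)."""
    start_char = offsets[start_idx]
    # The span ends where the next token starts, or at the end of the context.
    end_char = offsets[end_idx + 1] if end_idx + 1 < len(offsets) else len(context)
    return context[start_char:end_char].strip()

context = 'The sixth album was released in 2003.'
offsets = [0, 4, 10, 16, 20, 29, 32, 36]   # start character of each token

# An annotated answer 'six' lies inside token 1 ('sixth'), so the best
# token-level span is (1, 1) and the retrieved text is the whole token.
print(sketch_retrieve(1, 1, offsets, context))  # sixth
print(sketch_retrieve(2, 4, offsets, context))  # album was released
```

The first call shows the kind of mismatch that the check reports: the retrieved text is the full token, while the annotated answer is only part of it.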

In [9]:
from preprocessing import retrieve_answer, check_correctness

def retrieving_procedure(rec):
  ctx = contexts[rec['ctx_index']]
  return retrieve_answer(rec['start_idx'], rec['end_idx'], ctx['offsets'], rec['context'])

check_correctness(pd.concat((train_df, val_df)), retrieving_procedure)

| id | answer | normalized answer | retrieved | normalized retrieved |
|---|---|---|---|---|
| 56bf7e603aeaaa14008c9681 | split with Luckett and Rober | split with luckett and rober | split with Luckett and Roberson | split with luckett and roberson |
| 56be973d3aeaaa14008c9123 | six | six | sixth | sixth |


### ***1.4. Dataset Creation***

We can now prepare the dataset using **Torch**'s data utilities.

* The `Data` class extends *Torch's Dataset* and retrieves input and output data from the dataframe lazily, one record at a time.
* The `DataLoader` is then used to create mini-batches from the dataset and, via a custom collate function, to pad the sequences in each mini-batch to a common length.

In [23]:
import torch
from torch.utils.data import Dataset, DataLoader

class Data(Dataset):
  def __init__(self, dataframe):
    self.dataframe = dataframe
        
  def __getitem__(self, index):
    rec = self.dataframe.iloc[index]
    ctx = contexts[rec['ctx_index']]
    input_ctx = torch.tensor(ctx['ctx_embedding'])
    input_qst = torch.tensor(rec['qst_embedding'])
    output_start = torch.tensor(rec['start_idx'])
    output_end = torch.tensor(rec['end_idx'])
    return (input_ctx, input_qst), (output_start, output_end)
  
  def __len__(self):
    return len(self.dataframe)

train_data = Data(train_df)
val_data = Data(val_df)

input, output = train_data[0]
print('Input:')
print('  > context:', input[0].shape)
print('  > question:', input[1].shape)
print('Output:')
print('  > start:', output[0].shape)
print('  > end:', output[1].shape)

Input:
  > context: torch.Size([142, 96])
  > question: torch.Size([14, 96])
Output:
  > start: torch.Size([])
  > end: torch.Size([])


In [24]:
from torch.nn.utils.rnn import pad_sequence

def get_loader(data, batch_size=32):
  def extract_batch(batch):
    # pad contexts and questions to the longest sequence in the mini-batch
    input_ctxs = pad_sequence([ic for (ic, _), _ in batch], batch_first=True)
    input_qsts = pad_sequence([iq for (_, iq), _ in batch], batch_first=True)
    # stack the scalar boundaries into 1-d tensors of shape (batch_size,)
    output_starts = torch.stack([start for _, (start, _) in batch])
    output_ends = torch.stack([end for _, (_, end) in batch])
    return (input_ctxs, input_qsts), (output_starts, output_ends)
  return DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=extract_batch)

for input, output in get_loader(train_data):
  print('Input:')
  print('  > context:', input[0].shape)
  print('  > question:', input[1].shape)
  print('Output:')
  print('  > start:', output[0].shape)
  print('  > end:', output[1].shape)
  break

Input:
  > context: torch.Size([32, 404, 96])
  > question: torch.Size([32, 19, 96])
Output:
  > start: torch.Size([32])
  > end: torch.Size([32])


# **2. Neural Model**