# **0. Preliminary Settings**

First, we clone the repository to access the code and use its utility functions inside the notebook. The `src` folder is then added to the system path so that its modules can be imported.

Then, we use the utility functions in the `src` folder to get the ***train*** and ***validation*** splits. We discard the ***test*** split, as it is reserved for evaluating the best model only.

In [None]:
%%capture
!pip install transformers wandb pytorch-lightning

In [None]:
!git clone https://github.com/giuluck/Gangster-SQuAD

import sys
sys.path.append('Gangster-SQuAD')
sys.path.append('Gangster-SQuAD/src')
sys.path.append('Gangster-SQuAD/src/models')

from dataframe import get_dataframes
train_df, val_df, _ = get_dataframes('Gangster-SQuAD/data/training_set.json')

In [None]:
## TODO: remove
# train_df = train_df.iloc[:1000]
# val_df = val_df.iloc[:100]

# **1. Preprocessing**

We use **HuggingFace** to start from a pretrained **DistilBERT** model with its own vocabulary and tokenizer.

**HuggingFace** provides a pretrained *DistilBertTokenizer*, which is quite slow, and a faster *BertWordPieceTokenizer*. In order to exploit both of them, we initially load the pretrained tokenizer, store its data, and use that data to build the faster tokenizer.

Once the tokenizer has been built, we use it to process every record in the dataframe and build the dataset used for training and testing. This dataset is composed of:
- a **list of tokens** structured in this way $$[\mbox{CLS}, \mbox{ctx_tok}_0, ..., \mbox{ctx_tok}_i, ..., \mbox{ctx_tok}_n, \mbox{SEP}, \mbox{qst_tok}_0, ..., \mbox{qst_tok}_j, ..., \mbox{qst_tok}_m, \mbox{SEP}]$$ which will be used as input for the *DistilBERT* model
- a **start** and an **end** integer value representing the indices of the boundary tokens that identify the answer in the text, which will be used as outputs for the *DistilBERT* model
- the **original context** and a **list of indices** representing the offsets, expressed in number of *chars* and not in number of *tokens*, which will be used to retrieve the original part of text in the context given the two outputs
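
As a rough illustration of the structure listed above, the following sketch assembles one record from a context, a question and an answer. It is not the repository's `process_dataframe`: the whitespace tokenizer, the `build_record` helper and the on-the-fly vocabulary are hypothetical stand-ins, and it assumes that the boundary indices are counted over the context tokens only.

```python
CLS_ID, SEP_ID = 101, 102  # same special ids passed to ModelInfo in this notebook


def toy_tokenize(text):
    """Whitespace tokenizer returning tokens and their (start, end) char offsets."""
    tokens, offsets, pos = [], [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(word.lower())
        offsets.append((start, end))
        pos = end
    return tokens, offsets


def token_id(token, vocab):
    """Hypothetical vocabulary: every unseen token gets the next free id."""
    return vocab.setdefault(token, 1000 + len(vocab))


def build_record(context, question, answer_start, answer_text, vocab):
    """Assemble input ids, token boundaries and char offsets for one record."""
    ctx_tokens, offsets = toy_tokenize(context)
    qst_tokens, _ = toy_tokenize(question)
    ctx_ids = [token_id(t, vocab) for t in ctx_tokens]
    qst_ids = [token_id(t, vocab) for t in qst_tokens]
    input_ids = [CLS_ID] + ctx_ids + [SEP_ID] + qst_ids + [SEP_ID]
    # Context tokens whose char span overlaps the answer's char span
    answer_end = answer_start + len(answer_text)
    hits = [i for i, (s, e) in enumerate(offsets)
            if s < answer_end and e > answer_start]
    return {
        'input_ids': input_ids,
        'start_token': hits[0],
        'end_token': hits[-1],
        'offsets': offsets,
    }


context = 'The Eiffel Tower is in Paris, France.'
record = build_record(context, 'Where is the Eiffel Tower?',
                      context.index('Paris'), 'Paris', vocab={})
print(record['start_token'], record['end_token'], record['offsets'][5])
```

Because the toy tokenizer keeps punctuation attached to words, the token covering the answer spans `Paris,` rather than `Paris`; a WordPiece tokenizer splits more finely, but the same kind of mismatch can still occur.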

In [None]:
from models import ModelInfo

model_info = ModelInfo('distilbert-base-uncased', embedding_dim=768, max_length=512, cls_token=101, sep_token=102)

In [None]:
from transformers import DistilBertTokenizer
from tokenizers import BertWordPieceTokenizer

DistilBertTokenizer.from_pretrained(model_info.pretrained_model).save_pretrained('slow_tokenizer/')
tokenizer = BertWordPieceTokenizer('slow_tokenizer/vocab.txt', lowercase=True)

In [None]:
import pandas as pd
from preprocessing import compute_boundaries
from dataframe import process_dataframe

train_df = process_dataframe(train_df, tokenizer)
val_df = process_dataframe(val_df, tokenizer)

train_df[['ctx_ids', 'qst_ids', 'start token', 'end token', 'offsets']]

> **DistilBERT** can handle sequences of at most 512 tokens, which in our case include one *\[CLS\]* token and two *\[SEP\]* tokens. Thus, we exclude from the training dataset the records whose context and question together exceed this limit. Validation records can still be longer, so we will crop them later, during dataset creation.

In [None]:
excluded_train_ids = set([
  id for (id, ctx_ids), qst_ids in zip(train_df['ctx_ids'].items(), train_df['qst_ids'])
     if len(ctx_ids) + len(qst_ids) > model_info.max_length - 3
])

train_df = train_df[~train_df.index.isin(excluded_train_ids)]
train_df

> Once the dataframe is ready, we check whether our tokenization is good enough to let us retrieve the correct answers from the text.
>
> Answers are retrieved by:
> 1. getting the two `start` and `end` (token) boundaries that should be computed by the model
> 2. converting them into a `start_char` and an `end_char` pair of indices, which represent the boundaries in the original context, using the `offsets` list
> 3. selecting the correct portion of the `context` using these two (char) boundaries and stripping the obtained substring
>
> Some of the retrieved answers do not match exactly, because the answers given in the dataset may contain substrings or variations of the words that appear in the text.
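
The three steps above can be sketched as follows. This is only an illustration, not the repository's `retrieve_answer`: the name `retrieve_answer_sketch` is hypothetical, and the code assumes that `offsets` holds one `(start_char, end_char)` pair per context token.

```python
def retrieve_answer_sketch(start_token, end_token, offsets, context):
    """Map (token) boundaries back to a substring of the original context."""
    start_char = offsets[start_token][0]  # first char of the start token
    end_char = offsets[end_token][1]      # char right after the end token
    return context[start_char:end_char].strip()


context = 'The Eiffel Tower is in Paris, France.'
# Assumed char offsets of a whitespace tokenization of the context
offsets = [(0, 3), (4, 10), (11, 16), (17, 19), (20, 22), (23, 29), (30, 37)]

single = retrieve_answer_sketch(5, 5, offsets, context)
multi = retrieve_answer_sketch(1, 2, offsets, context)
print(repr(single), repr(multi))
```

With these offsets, the single-token span includes the trailing comma, which is the kind of mismatch with the dataset answer mentioned above.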

In [None]:
from preprocessing import retrieve_answer
from dataframe import check_correctness

def retrieving_procedure(rec):
  return retrieve_answer(rec['start token'], rec['end token'], rec['offsets'], rec['context'])

check_correctness(pd.concat((train_df, val_df)), retrieving_procedure)

# **2. Dataset Creation**

We can now prepare the dataset using **Torch** utils for data managing.

* The `SquadDataset` class extends *Torch's Dataset* and lets us fetch input and output data from the dataframe lazily
> Note that we add the *masks* tensor, initially a tensor of ones, which *DistilBERT* uses to identify which tokens must be considered and which must be discarded. Indeed, when we pad the sequences, we concatenate *zeros* to this *masks* tensor to mark the padding tokens.

* The `DataLoader`, then, is used to create mini-batches from the dataset and, via the custom function, to pad these mini-batches.
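
To make the role of the *masks* concrete, here is a minimal sketch of how a mini-batch could be padded. Plain Python lists stand in for tensors, the `pad_batch` helper is hypothetical, and `pad_id=0` is an assumption; the padding actually used in the repository may differ.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    ids, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding token to be ignored
        masks.append([1] * len(seq) + [0] * n_pad)
    return ids, masks


batch_ids, batch_masks = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(batch_ids)
print(batch_masks)
```

In a real pipeline, a function of this kind would typically convert the lists to tensors and be passed to the `DataLoader` through its `collate_fn` argument.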

In [None]:
from dataset import SquadDataset

train_data = SquadDataset(train_df, model_info)
val_data = SquadDataset(val_df, model_info)

input, output = train_data[0]
print('TRAIN DATA')
print('Input:', input.shape)
print('  > ids:', input[0].shape)
print('  > types:', input[1].shape)
print('Output:', output.shape)
print('  > start:', output[0].shape)
print('  > end:', output[1].shape)

print()

input, output = val_data[0]
print('VAL DATA')
print('Input:', input.shape)
print('  > ids:', input[0].shape)
print('  > types:', input[1].shape)
print('Output:', output.shape)
print('  > start:', output[0].shape)
print('  > end:', output[1].shape)

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_data, batch_size=16, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_data, batch_size=16, num_workers=4, pin_memory=True)

for input, output in train_loader:
  print('Input:', input.shape)
  print('  > ids:', input[:, 0].shape)
  print('  > types:', input[:, 1].shape)
  print('Output:', output.shape)
  print('  > start:', output[:, 0].shape)
  print('  > end:', output[:, 1].shape)
  break

# **3. Neural Model**

The base model relies only on the `last_hidden_state`, which is passed through two linear layers and a softmax function to obtain `prob_start` and `prob_end`. This choice is motivated by the fact that the last representation is the most informative one. However, it is an approximation, since it excludes a priori the additional information contained in the previous `hidden_states`. With this extended model "we aim to combine hidden states from previous time steps by weighting them to generate an overall representation that may be able to capture additional semantic and positional patterns", according to Takeuchi et al. In essence, the last `k` `hidden_states` are concatenated along a new axis, so the output of the model becomes $[\mbox{batch_size}, \mbox{sequence_length}, \mbox{embedding_dimension}, \mbox{k}]$; it is then passed through a linear layer which weights the `hidden_states` and outputs a tensor of size $[\mbox{batch_size}, \mbox{sequence_length}, \mbox{embedding_dimension}]$.
We can refer to this model as **DistilBERT with dot product attention over Hidden Layers and Highway**.
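
The weighting of the hidden layers can be illustrated for a single token with the sketch below. It is not the `DistilBertWHL` implementation: plain lists replace tensors, and the function name, the example weights and the scalar bias are assumptions chosen only to show how a linear layer over the last axis collapses `k` vectors into one.

```python
def combine_hidden_states(hidden_states, weights, bias=0.0):
    """Weighted sum of k hidden vectors (one per layer) for a single token.

    hidden_states: k lists of length embedding_dimension
    weights:       k scalars, one per layer (learned in a real model)
    """
    dim = len(hidden_states[0])
    return [
        sum(w * layer[i] for w, layer in zip(weights, hidden_states)) + bias
        for i in range(dim)
    ]


# k = 3 hidden layers, embedding_dimension = 2
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
combined = combine_hidden_states(hidden, weights=[0.5, 0.25, 0.25])
print(combined)
```

Applying the same operation to every token of every sequence reduces the $[\mbox{batch_size}, \mbox{sequence_length}, \mbox{embedding_dimension}, \mbox{k}]$ tensor to $[\mbox{batch_size}, \mbox{sequence_length}, \mbox{embedding_dimension}]$.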

> The **Highway** network can be used before computing the logits and is expected to improve performance, because it can filter out irrelevant information. The highway model computes:

*   $x_{\mbox{proj}} = \mbox{ReLU}(W_{\mbox{proj}} x + b_{\mbox{proj}})$
*   $x_{\mbox{gate}} = \sigma(W_{\mbox{gate}} x + b_{\mbox{gate}})$

> It returns: $x_{\mbox{highway}} = x_{\mbox{gate}} \odot x_{\mbox{proj}} + (1 - x_{\mbox{gate}}) \odot x$
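
The highway computation can be sketched for a single vector as follows. This is a dependency-free illustration rather than the layer used in the repository: the function names and the example weights are assumptions, and a real implementation would use learned `torch` parameters.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def highway(x, w_proj, b_proj, w_gate, b_gate):
    """Return x_gate * x_proj + (1 - x_gate) * x, element-wise."""
    # x_proj = ReLU(W_proj x + b_proj)
    proj = [max(0.0, p + b) for p, b in zip(matvec(w_proj, x), b_proj)]
    # x_gate = sigmoid(W_gate x + b_gate)
    gate = [1.0 / (1.0 + math.exp(-(g + b)))
            for g, b in zip(matvec(w_gate, x), b_gate)]
    return [g * p + (1.0 - g) * xi for g, p, xi in zip(gate, proj, x)]


x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Gate fixed at 0.5: the output averages ReLU(x) and x
half_open = highway(x, identity, [0.0, 0.0], zeros, [0.0, 0.0])
# Gate almost closed: the input passes through nearly unchanged
closed = highway(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
print(half_open, closed)
```

When the gate is close to zero the layer behaves like an identity mapping, and when it is close to one it behaves like a plain ReLU projection, which is how the gate can suppress or pass information per dimension.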

In [None]:
from models import DistilBertWHL

model_name = 'distilbert_whl'
model = DistilBertWHL()

W&B keeps track of the metrics and the hyperparameters. Log in to the service, then start training the model.

In [None]:
import wandb
from pytorch_lightning.loggers import WandbLogger

use_wandb = True

if use_wandb:
  # Access to W&B
  wandb.login()
  # Create the logger
  wandb_logger = WandbLogger(project="SQuAD", entity="lomb", name=model_name)
else:
  wandb_logger = None

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from metrics import MetricsCallback

epochs = 6

metrics_callback = MetricsCallback(train_df, train_data, val_df, val_data)

checkpoint_callback = ModelCheckpoint(
    monitor='val/f1_score',
    filename=model_name,
    save_top_k=1,
    mode='max'
)

trainer = pl.Trainer(
  logger=wandb_logger,
  max_epochs=epochs, precision=16,
  gpus=-1, distributed_backend='ddp',
  checkpoint_callback=True,
  callbacks=[metrics_callback, checkpoint_callback]
)

trainer.fit(model, train_loader)

In [None]:
if use_wandb:
  # Run this to stop the synchronization with the previous run
  wandb.finish()

In [None]:
import torch
# NOTE: `path` must be set to the directory that contains the saved checkpoint
model = DistilBertWHL.load_from_checkpoint(path + model_name + '.ckpt')
torch.save(model.state_dict(), path + model_name + '.pt')