# **0. Preliminary Settings**

First, we clone the repository to get access to the code and to use its utility functions inside the notebook. The `src` folder is then added to the system path so that its modules can be imported.

Then, we use the utility functions in the `src` folder to get the ***train*** and ***validation*** splits, while we discard the ***test*** split, as it will be used only to evaluate the best model.

In [None]:
!git clone https://github.com/giuluck/Gangster-SQuAD

import sys
sys.path.append('Gangster-SQuAD/src')

from dataset import get_dataframes
train_df, val_df, _ = get_dataframes('Gangster-SQuAD/data/training_set.json')

In [None]:
## TODO: remove
train_df = train_df.iloc[:]
val_df = val_df.iloc[:]

# **1. Preprocessing**

We use **HuggingFace** to start from a pretrained **DistilBERT** model with its own vocabulary and tokenizer.

**HuggingFace** provides a pretrained *DistilBertTokenizer*, which is quite slow, and a faster *BertWordPieceTokenizer*. In order to exploit both of them, we initially load the pretrained tokenizer, store its data, and use that data to build the faster tokenizer.

Once the tokenizer has been built, we use it to process every record in the dataframe in order to build the dataset used for training and testing purposes. This dataset will be composed of:
- a **list of tokens** structured in this way $$[\mbox{CLS}, \mbox{ctx_tok}_0, ..., \mbox{ctx_tok}_i, ..., \mbox{ctx_tok}_n, \mbox{SEP}, \mbox{qst_tok}_0, ..., \mbox{qst_tok}_j, ..., \mbox{qst_tok}_m, \mbox{SEP}]$$ which will be used as input for the *DistilBERT* model
- a **start** and an **end** integer value representing the indices of the boundary tokens that identify the answer in the text, which will be used as outputs for the *DistilBERT* model
- the **original context** and a **list of indices** representing the offsets, expressed in number of *chars* and not in number of *tokens*, which will be used to retrieve the original part of text in the context given the two outputs
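
The layout of the token list described above can be sketched with plain Python lists. This is only an illustration: `build_input_ids` is a hypothetical helper, not part of the repository, and the ids `101` and `102` are the *\[CLS\]* and *\[SEP\]* ids that this notebook uses for `distilbert-base-uncased`.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] ids used in this notebook

def build_input_ids(ctx_ids, qst_ids, max_length=512):
    """Assemble [CLS] ctx [SEP] qst [SEP], cropping the context if needed.

    Hypothetical helper, written only to illustrate the layout.
    """
    # the question keeps its trailing [SEP]
    qst_part = qst_ids[:max_length - 3] + [SEP_ID]
    # room left for the context once [CLS], the first [SEP] and the question are counted
    room_for_ctx = max_length - len(qst_part) - 2
    return [CLS_ID] + ctx_ids[:room_for_ctx] + [SEP_ID] + qst_part

# toy ids and a tiny maximal length, so that the cropping is visible
example = build_input_ids([11, 12, 13, 14], [21, 22], max_length=8)
print(example)
```

With a maximal length of 8, the last context token is cropped so that the question and the three special tokens still fit.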

In [None]:
%%capture
!pip install transformers

In [None]:
class ModelInfo():
  def __init__(self, pretrained_model, embedding_dim, max_length, cls_token, sep_token):
    self.pretrained_model = pretrained_model
    self.embedding_dim = embedding_dim
    self.max_length = max_length
    self.cls_token = cls_token
    self.sep_token = sep_token

model_info = ModelInfo('distilbert-base-uncased', embedding_dim=768, max_length=512, cls_token=101, sep_token=102)

In [None]:
from transformers import DistilBertTokenizer
from tokenizers import BertWordPieceTokenizer

DistilBertTokenizer.from_pretrained(model_info.pretrained_model).save_pretrained('slow_tokenizer/')
tokenizer = BertWordPieceTokenizer('slow_tokenizer/vocab.txt', lowercase=True)

In [None]:
import pandas as pd
from preprocessing import compute_boundaries

def process_dataframe(df):
  def process_record(record):
    # both context and question get tokenized
    ctx_tokens = tokenizer.encode(record['context'])
    ctx_ids = ctx_tokens.ids[1:-1]                       # [CLS] and [SEP] tokens are discarded
    qst_tokens = tokenizer.encode(record['question'])
    qst_ids = qst_tokens.ids[1:-1]                       # [CLS] and [SEP] tokens are discarded
    # take all the context start chars then add a final index for the last character
    offsets = [s for s, _ in ctx_tokens.offsets[:-1]] + [len(record['context'])]
    # token boundaries to be used during training are computed
    start_token, end_token = compute_boundaries(offsets, record['start'], len(record['answer']))
    # input, output and utility data are returned to form the dataset
    return [ctx_ids, qst_ids, start_token, end_token, offsets]

  processed_df = pd.DataFrame(
    [[id] + process_record(record) for id, record in df.iterrows()],
    columns = ['id', 'ctx_ids', 'qst_ids', 'start token', 'end token', 'offsets']
  ).set_index(['id'])
  return processed_df.join(df)

train_df = process_dataframe(train_df)
val_df = process_dataframe(val_df)

train_df[['ctx_ids', 'qst_ids', 'start token', 'end token', 'offsets']]

> **DistilBERT** can manage sequences of 512 tokens at most, including one *\[CLS\]* token and two *\[SEP\]* tokens in our case. Thus, we exclude from the training dataset those samples whose context and question together exceed 509 tokens. Validation samples can still be longer, so we will crop them later, during dataset creation.

In [None]:
excluded_train_ids = set([
  id for (id, ctx_ids), qst_ids in zip(train_df['ctx_ids'].items(), train_df['qst_ids'])
     if len(ctx_ids) + len(qst_ids) > model_info.max_length - 3
])

train_df = train_df[~train_df.index.isin(excluded_train_ids)]
train_df

> Once the dataframe is ready, we check whether our tokenization is good enough to let us retrieve the correct answers from the text.
>
> Answers are retrieved by:
> 1. getting the two `start` and `end` (token) boundaries that should be computed by the model
> 2. converting them into a `start_char` and an `end_char` pair of indices, which represent the boundaries in the original context, using the `offsets` list
> 3. selecting the correct portion of the `context` using these two (char) boundaries and stripping the obtained substring
>
> Some of the answers are not correct, but this is due to the fact that the answers given in the dataset contain substrings or variations of the words which are present in the text.
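
The three retrieval steps can be sketched as follows. This is a minimal illustration and not the actual `retrieve_answer` from the `preprocessing` module, whose implementation is not shown here. It assumes that `offsets[i]` is the start char of the token at index `i` (index 0 being the *\[CLS\]* token), that the last entry is the length of the context, and that the end token is inclusive.

```python
def retrieve_answer_sketch(start_token, end_token, offsets, context):
    """Hypothetical re-implementation of the retrieval steps, for illustration only."""
    # step 2: token boundaries are converted into char boundaries
    start_char = offsets[start_token]
    end_char = offsets[end_token + 1]  # assumption: the end token is inclusive
    # step 3: the substring is selected and stripped
    return context[start_char:end_char].strip()

context = "The cat sat on the mat"
# index 0 is [CLS], then one start char per word, then the context length
offsets = [0, 0, 4, 8, 12, 15, 19, len(context)]

print(retrieve_answer_sketch(2, 3, offsets, context))
print(retrieve_answer_sketch(6, 6, offsets, context))
```

Stripping is needed because the char span of a token extends up to the start of the next one, so it may include a trailing space.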

In [None]:
from preprocessing import retrieve_answer, check_correctness

def retrieving_procedure(rec):
  return retrieve_answer(rec['start token'], rec['end token'], rec['offsets'], rec['context'])

check_correctness(pd.concat((train_df, val_df)), retrieving_procedure)

# **2. Dataset Creation**

We can now prepare the dataset using the **Torch** data utilities.

* The `SquadDataset` class extends *Torch's Dataset* and allows us to get input and output data from the dataframe in a lazy way
> Note that we add the *masks* tensor, initially a tensor of ones, which is used by *DistilBERT* to identify which tokens have to be considered and which ones have to be discarded. Indeed, when we pad the sequences, we concatenate some *zeros* to this *masks* tensor to represent the padding tokens.

* The `DataLoader` is then used to create mini-batches from the dataset. Since every sample is already padded to the maximal length inside the dataset, the samples of a mini-batch can be stacked directly.
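
The relation between padding and the *masks* tensor can be illustrated with plain lists. `pad_with_mask` is a hypothetical helper, and the padding id `0` is an assumption that mirrors the tensor of zeros used for padding in this notebook.

```python
def pad_with_mask(input_ids, max_length, pad_id=0):
    """Pad a list of ids to max_length and build the matching attention mask.

    Hypothetical helper; it assumes len(input_ids) <= max_length.
    """
    n_pad = max_length - len(input_ids)
    ids = input_ids + [pad_id] * n_pad
    # ones mark real tokens, zeros mark padding tokens
    mask = [1] * len(input_ids) + [0] * n_pad
    return ids, mask

ids, mask = pad_with_mask([101, 11, 12, 102, 21, 102], max_length=8)
print(ids)
print(mask)
```

Both lists always have the same length, which is what allows them to be stacked into a single input tensor of shape `[2, max_length]`.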

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class SquadDataset(Dataset):
  def __init__(self, dataframe, model_info):
    self.dataframe = dataframe
    self.max_len = model_info.max_length
    self.cls_tok = model_info.cls_token
    self.sep_tok = model_info.sep_token
        
  def __getitem__(self, index):
    rec = self.dataframe.iloc[index]
    # retrieving paragraph and question tokens, limiting them to the maximal length
    qst_ids = rec['qst_ids'][:self.max_len-3] + [self.sep_tok]
    ctx_ids = [self.cls_tok] + rec['ctx_ids'][:self.max_len-len(qst_ids)-2] + [self.sep_tok]
    len_ids = len(ctx_ids) + len(qst_ids)
    # contexts and questions are used to build the input tensor
    ctx_ids = torch.tensor(ctx_ids)
    qst_ids = torch.tensor(qst_ids)
    input_ids = torch.cat((ctx_ids, qst_ids))
    input_masks = torch.ones_like(input_ids)
    input_tensor = torch.stack((input_ids, input_masks), dim=0)
    # the input tensor is padded to length 512
    pad_tensor = torch.zeros((2, self.max_len - len_ids), dtype=torch.long)
    input_tensor = torch.cat((input_tensor, pad_tensor), dim=1)
    # an output tensor containing the two outputs is created as well
    output_tensor = torch.tensor([rec['start token'], rec['end token']])
    return input_tensor, output_tensor
  
  def __len__(self):
    return len(self.dataframe)

train_data = SquadDataset(train_df, model_info)
val_data = SquadDataset(val_df, model_info)

input, output = train_data[0]
print('TRAIN DATA')
print('Input:', input.shape)
print('  > ids:', input[0].shape)
print('  > masks:', input[1].shape)
print('Output:', output.shape)
print('  > start:', output[0].shape)
print('  > end:', output[1].shape)

print()

input, output = val_data[0]
print('VAL DATA')
print('Input:', input.shape)
print('  > ids:', input[0].shape)
print('  > masks:', input[1].shape)
print('Output:', output.shape)
print('  > start:', output[0].shape)
print('  > end:', output[1].shape)

In [None]:
train_loader = DataLoader(train_data, batch_size=16, shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_data, batch_size=16, num_workers=4, pin_memory=True)

for input, output in train_loader:
  print('Input:', input.shape)
  print('  > ids:', input[:, 0].shape)
  print('  > masks:', input[:, 1].shape)
  print('Output:', output.shape)
  print('  > start:', output[:, 0].shape)
  print('  > end:', output[:, 1].shape)
  break

# **3. Neural Model**

**DistilBERT** returns the `last_hidden_state`, which has shape $[\mbox{batch_size}, \mbox{sequence_length}, \mbox{embedding_dimension}]$. In this model, we simply pass those values to *two fully-connected layers*, which return the *logits* related to each token. The model is trained via **Categorical Cross-Entropy**, assigning to the `start_loss` a weight that is **three times higher** than that of the `end_loss` (`alpha = 0.75` against `1 - alpha = 0.25`), while the outputs of the model are the two values for the *start* and *end* token respectively.
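
The weighted loss can be sketched on a single sample with plain Python. This is an illustration under the assumption that the total loss is `alpha * loss_start + (1 - alpha) * loss_end` with `alpha = 0.75`; the helpers `cross_entropy` and `weighted_loss` are hypothetical names and work on lists of logits rather than on tensors.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample, via a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def weighted_loss(start_logits, end_logits, start_target, end_target, alpha=0.75):
    """Convex combination of the two losses; alpha weighs the start loss."""
    loss_start = cross_entropy(start_logits, start_target)
    loss_end = cross_entropy(end_logits, end_target)
    return alpha * loss_start + (1 - alpha) * loss_end

# toy logits over three tokens
loss = weighted_loss([2.0, 0.5, 0.1], [0.1, 0.2, 3.0], start_target=0, end_target=2)
print(loss)
```

With `alpha = 0.75`, the ratio between the two weights is `0.75 / 0.25 = 3`, which is where the factor of three comes from.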

In [None]:
%%capture
!pip install pytorch-lightning

> In the `forward` method, we return the indices for both the *start* and the *end* token. In particular, in this model we **ignore all the logit values for the end indices prior to the retrieved start indices**; in this way, we force the model to predict an end index that is greater than or equal to the start one. Finally, we clip them to the maximal context length because, if the model gave a wrong token and returned a value greater than the length of the `offsets` list, an error would occur.
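
The constrained decoding described above can be sketched on a single sequence with plain lists. `decode_span` is a hypothetical helper, and `sep_index` is assumed to be the position of the first *\[SEP\]* token, which marks the end of the context.

```python
NEG_INF = float('-inf')

def decode_span(start_logits, end_logits, sep_index):
    """Pick start and end indices so that start <= end <= sep_index."""
    # positions after the first [SEP] can never be answer boundaries
    start = [v if i <= sep_index else NEG_INF for i, v in enumerate(start_logits)]
    end = [v if i <= sep_index else NEG_INF for i, v in enumerate(end_logits)]
    start_idx = max(range(len(start)), key=start.__getitem__)
    # end logits before the chosen start are ignored
    end = [v if i >= start_idx else NEG_INF for i, v in enumerate(end)]
    end_idx = max(range(len(end)), key=end.__getitem__)
    return start_idx, end_idx

start_logits = [0.1, 2.0, 0.3, 0.2, 5.0]
end_logits = [3.0, 0.5, 1.0, 0.2, 9.0]
span = decode_span(start_logits, end_logits, sep_index=3)
print(span)
```

In this toy case, the highest logits sit at position 4, which lies after the context, and the best remaining end logit sits at position 0, which lies before the start; both are masked out, so the predicted span is always valid.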

In [None]:
from torch import nn
from transformers import DistilBertModel
import pytorch_lightning as pl

class DistilBertCategoricalWithOutputKnowledge(pl.LightningModule):
  def __init__(self, model_info=model_info, alpha=0.75):
    super().__init__()
    self.alpha = alpha
    self.encoder = DistilBertModel.from_pretrained(model_info.pretrained_model)
    self.start_fc = nn.Linear(model_info.embedding_dim, 1)
    self.end_fc = nn.Linear(model_info.embedding_dim, 1)
    self.criterion = nn.CrossEntropyLoss()

  def _logits(self, x):
    x = self.encoder(input_ids=x[:, 0], attention_mask=x[:, 1])
    x = x["last_hidden_state"]
    start = self.start_fc(x).squeeze(dim=2)
    end = self.end_fc(x).squeeze(dim=2)
    return start, end

  def forward(self, x):
    # retrieve logits
    start, end = self._logits(x)
    # mask values after context lengths to -inf
    tensor_length = start.shape[1]
    _, contexts_length = (x[:, 0] == model_info.sep_token).max(dim=1)
    for i, ctx_len in enumerate(contexts_length):
      mask = torch.arange(tensor_length, device=self.device) > ctx_len
      start[i, mask] = -float('inf')
      end[i, mask] = -float('inf')
    # retrieve start indices
    _, start_indices = start.max(dim=1)
    # mask end tensor to assign -inf value to logits before the start idx
    for i, start_idx in enumerate(start_indices):
      mask = torch.arange(tensor_length, device=self.device) < start_idx
      end[i, mask] = -float('inf')
    # retrieve end indices and clip to maximal length
    _, end_indices = end.max(dim=1)
    return start_indices, end_indices

  def training_step(self, batch, batch_idx):
    x, y = batch
    pred_start, pred_end = self._logits(x)
    loss_start = self.criterion(pred_start, y[:, 0])
    loss_end = self.criterion(pred_end, y[:, 1])
    self.log('loss_start', loss_start, prog_bar=True)
    self.log('loss_end', loss_end, prog_bar=True)
    loss = self.alpha * loss_start + (1 - self.alpha) * loss_end
    return loss

  def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-5)
    return optimizer

model = DistilBertCategoricalWithOutputKnowledge()

In [None]:
trainer = pl.Trainer(
  max_epochs=4, precision=16,
  gpus=-1, distributed_backend='ddp',
  logger=False, checkpoint_callback=False
)

trainer.fit(model, train_loader)

# **4. Results**

We now use the model to get the two boundaries for the answer. Then, we append these boundaries as two new columns in the *validation dataset* and use them to retrieve the textual answer inside the context paragraph. Finally, we get the dataset of wrong answers in order to inspect it, and from that we compute the **validation score** of the model as the ratio of correct answers to the total number of validation samples.

In [None]:
starts, ends = [], []
num_batches = len(val_loader)

# pytorch lightning does not move the model to GPU during evaluation
# so we need to do that by hand for both the model and the inputs;
# we also switch to evaluation mode so that dropout is disabled
model = model.cuda().eval()

print('Starting Evaluation...')
# gradients are not needed at inference time
with torch.no_grad():
  for idx, (input, _) in enumerate(val_loader):
    if (idx + 1) % 100 == 0:
      print(f'Batch {idx+1:{len(str(num_batches))}}/{num_batches}')
    s, e = model(input.cuda())
    starts.append(s)
    ends.append(e)

val_df['pred_start'] = [s.item() for ss in starts for s in ss]
val_df['pred_end'] = [e.item() for ee in ends for e in ee]

In [None]:
def retrieving_procedure(rec):
  return retrieve_answer(rec['pred_start'], rec['pred_end'], rec['offsets'], rec['context'])

val_wrong = check_correctness(val_df, retrieving_procedure)
val_wrong

In [None]:
print('Validation Score:', 1 - len(val_wrong) / len(val_df))