> **BERT BASE UNCASED**
>
> ---
>
> In this notebook we develop a neural model starting from a pretrained **BERT** model. We tokenize `paragraphs` and `questions` using the default tokenizer, then use the pretrained model as the first layer of the neural network.

# **0. Preliminary Settings**

First, we clone the repository to access its code and utility functions from the notebook. The `src` folder is then added to the system path so that its modules can be imported.

Then, we use the utility functions in the `src` folder to get the ***train*** and ***validation*** splits. The ***test*** split is discarded here, as it is reserved for evaluating the best model only.

In [None]:
!git clone https://github.com/giuluck/Gangster-SQuAD

import sys
sys.path.append('Gangster-SQuAD/src')

from dataset import get_dataframes
train_df, val_df, _ = get_dataframes('Gangster-SQuAD/data/training_set.json')

In [None]:
## TODO: remove
train_df = train_df.iloc[:1000]
val_df = val_df.iloc[:100]

# **1. Preprocessing**

We use **HuggingFace** to start from a pretrained **BERT** model with its own vocabulary and tokenizer.

In [None]:
%%capture
!pip install transformers

### ***1.1. Tokenization***

**HuggingFace** provides a pretrained *BertTokenizer*, which is quite slow, and a faster *BertWordPieceTokenizer*. To combine the pretrained vocabulary with the faster implementation, we first load the pretrained tokenizer, save its data to disk, and then use the saved vocabulary to build the faster tokenizer.

Once the tokenizer has been built, we use it to process every record in the dataframe and build the dataset used for training and testing. This dataset is composed of:
- a **list of tokens** structured in this way $$[\mbox{CLS}, \mbox{ctx_tok}_0, ..., \mbox{ctx_tok}_i, ..., \mbox{ctx_tok}_n, \mbox{SEP}, \mbox{qst_tok}_0, ..., \mbox{qst_tok}_j, ..., \mbox{qst_tok}_m, \mbox{SEP}]$$ which will be used as input for the *BERT* model, with the respective list of **type ids** (*0* for the context, *1* for the question)
- a **start** and an **end** integer value representing the indices of the boundary tokens that identify the answer in the text, which will be used as outputs for the *BERT* model
- the **original context** and a **list of indices** representing the offsets, expressed in number of *chars* and not in number of *tokens*, which will be used to retrieve the original part of text in the context given the two outputs
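
The mapping from a character span to token boundaries can be sketched in plain Python. This is only an illustration under stated assumptions: `char_span_to_token_span` is a hypothetical helper, not the repository's `compute_boundaries`, and it assumes that `offsets` contains the start character of each token (with `[CLS]` reported at position *0*) followed by the context length.

```python
from bisect import bisect_left, bisect_right

def char_span_to_token_span(offsets, answer_start, answer_length):
    """Map a character span to inclusive token boundaries (hypothetical helper).

    `offsets` holds the start char of every token (index 0 is [CLS], assumed
    to start at 0) plus one final entry equal to the context length.
    """
    answer_end = answer_start + answer_length  # exclusive char index
    # last token whose start char is <= the answer start
    start_token = bisect_right(offsets, answer_start) - 1
    # last token whose start char is < the answer end
    end_token = bisect_left(offsets, answer_end) - 1
    return start_token, end_token

context = 'the cat sat'
# [CLS] -> 0, 'the' -> 0, 'cat' -> 4, 'sat' -> 8, then len(context) = 11
offsets = [0, 0, 4, 8, 11]
span = char_span_to_token_span(offsets, answer_start=4, answer_length=3)
print(span)  # (2, 2)
```

Because `[CLS]` and the first context token share the same start character, taking the *last* token whose offset does not exceed the answer start keeps the boundaries aligned with positions in the input sequence.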

In [None]:
from transformers import BertTokenizer
from tokenizers import BertWordPieceTokenizer

pretrained_model = 'bert-base-uncased'

BertTokenizer.from_pretrained(pretrained_model).save_pretrained('slow_tokenizer/')
tokenizer = BertWordPieceTokenizer('slow_tokenizer/vocab.txt', lowercase=True)

In [None]:
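# encoding '[PAD]' alone is expected to yield the ids of [CLS], [PAD] and [SEP], in this order
special_tokens = { token: token_id for token, token_id in zip(['[CLS]', '[PAD]', '[SEP]'], tokenizer.encode('[PAD]').ids) }
special_tokens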

In [None]:
import pandas as pd
from preprocessing import compute_boundaries

def process_dataframe(df):
  def process_record(record):
    # both context and question get tokenized
    tokens = tokenizer.encode(record['context'], record['question'])
    # take all the context start chars then add a final index for the last character
    sep_index = tokens.ids.index(special_tokens['[SEP]'])
    offsets = [s for s, _ in tokens.offsets[:sep_index]] + [len(record['context'])]
    # token boundaries to be used during training are computed
    start_token, end_token = compute_boundaries(offsets, record['start'], len(record['answer']))
    # input, output and utility data are returned to form the dataset
    return [tokens.ids, tokens.type_ids, start_token, end_token, offsets]

  processed_df = pd.DataFrame(
    [[id] + process_record(record) for id, record in df.iterrows()],
    columns = ['id', 'ids', 'types', 'start token', 'end token', 'offsets']
  ).set_index(['id'])
  return processed_df.join(df)

train_df = process_dataframe(train_df)
val_df = process_dataframe(val_df)

train_df[['ids', 'types', 'start token', 'end token', 'offsets']]

### ***1.2. Correctness Check***

Once the dataframe is ready, we check whether our tokenization is good enough to let us retrieve the correct answers from the text.

Answers are retrieved by:
1. getting the two `start` and `end` (token) boundaries that should be computed by the model
2. converting them into a `start_char` and an `end_char` pair of indices, which represent the boundaries in the original context, using the `offsets` list
3. selecting the correct portion of the `context` using these two (char) boundaries and stripping the obtained substring
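
The three steps above can be sketched as follows. This is a minimal illustration, not the repository's `retrieve_answer`: `retrieve_span` is a hypothetical name, and the sketch assumes inclusive token boundaries and an `offsets` list that ends with the context length.

```python
def retrieve_span(start_token, end_token, offsets, context):
    """Recover the answer text from token boundaries (hypothetical helper)."""
    # step 2: token boundaries -> char boundaries; the end char is the
    # start of the token that follows the last answer token
    start_char = offsets[start_token]
    end_char = offsets[end_token + 1]
    # step 3: slice the context and strip the trailing whitespace
    return context[start_char:end_char].strip()

context = 'the cat sat'
# [CLS] -> 0, 'the' -> 0, 'cat' -> 4, 'sat' -> 8, then len(context) = 11
offsets = [0, 0, 4, 8, 11]
answer = retrieve_span(2, 2, offsets, context)
print(answer)  # cat
```

The slice `context[4:8]` is `'cat '`, which is why the final `strip()` is needed.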

Some of the retrieved answers do not match the expected ones. This happens because the answers given in the dataset sometimes contain substrings or variations of words that are present in the text, so they cannot be reproduced exactly from token boundaries.

In [None]:
from preprocessing import retrieve_answer, check_correctness

def retrieving_procedure(rec):
  return retrieve_answer(rec['start token'], rec['end token'], rec['offsets'], rec['context'])

check_correctness(pd.concat((train_df, val_df)), retrieving_procedure)

### ***1.3. Dataset Creation***

We can now prepare the dataset using the **Torch** data management utilities.

* The `SquadDataset` class extends *Torch's Dataset* and lets us fetch input and output data from the dataframe lazily
> Note that we add the *masks* tensor, initially a tensor of ones, which *BERT* uses to tell which tokens must be considered and which ones must be ignored. When we pad the sequences, *zeros* are appended to this *masks* tensor to mark the padding tokens.

* The `DataLoader` is then used to create mini-batches from the dataset and, via a custom `collate_fn`, to pad the sequences of each mini-batch to the same length.
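
The effect of padding on ids and masks can be illustrated with plain Python lists. This is only a sketch of the idea: `pad_batch` is a hypothetical helper, the token ids in the middle of each sequence are arbitrary, and the actual notebook relies on Torch's `pad_sequence`, which also pads with *0* by default.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

# two tokenized records of different lengths (101 = [CLS], 102 = [SEP])
ids = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
# before padding, every token is a real token, so the masks are all ones
masks = [[1] * len(seq) for seq in ids]

padded_ids = pad_batch(ids)
padded_masks = pad_batch(masks)
print(padded_ids)    # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(padded_masks)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding the ids with *0* matches the `[PAD]` id of the `bert-base-uncased` vocabulary, while the zeros appended to the masks tell the model to ignore those positions.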

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class SquadDataset(Dataset):
  def __init__(self, dataframe):
    self.dataframe = dataframe
        
  def __getitem__(self, index):
    rec = self.dataframe.iloc[index]
    input_ids = torch.tensor(rec['ids'])
    input_types = torch.tensor(rec['types'])
    input_masks = torch.ones_like(input_ids)
    output_start = torch.tensor(rec['start token'])
    output_end = torch.tensor(rec['end token'])
    return (input_ids, input_types, input_masks), (output_start, output_end)
  
  def __len__(self):
    return len(self.dataframe)

train_data = SquadDataset(train_df)
val_data = SquadDataset(val_df)

input, output = train_data[0]
print('Input:')
print('  > ids:', input[0].shape)
print('  > types:', input[1].shape)
print('  > masks:', input[2].shape)
print('Output:')
print('  > start:', output[0].shape)
print('  > end:', output[1].shape)

In [None]:
from torch.nn.utils.rnn import pad_sequence

def get_loader(data, batch_size=32):
  def extract_batch(batch):
    input_ids = pad_sequence([ii for (ii, _, _), _ in batch], batch_first=True)
    input_types = pad_sequence([it for (_, it, _), _ in batch], batch_first=True)
    input_masks = pad_sequence([im for (_, _, im), _ in batch], batch_first=True)
    output_starts = torch.tensor([os for _, (os, _) in batch])
    output_ends = torch.tensor([oe for _, (_, oe) in batch])
    return (input_ids, input_types, input_masks), (output_starts, output_ends)
  return DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=extract_batch)

for input, output in get_loader(train_data):
  print('Input:')
  print('  > ids:', input[0].shape)
  print('  > types:', input[1].shape)
  print('  > masks:', input[2].shape)
  print('Output:')
  print('  > start:', output[0].shape)
  print('  > end:', output[1].shape)
  break

# **2. Neural Model**

**BERT** is a language model and we will use it as an encoder to produce contextual embeddings for our tokens.

> The model actually returns a dictionary-like object with *two outputs*. One is the `last_hidden_state`, which has shape $[\mbox{batch_size}, \mbox{sequence_length}, \mbox{embedding_dimension}]$, while the other is the `pooler_output`, which has shape $[\mbox{batch_size}, \mbox{embedding_dimension}]$. Since our task requires one prediction per token to locate the answer boundaries, we rely on the first output only.
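
The way the model defined in the next cell turns `last_hidden_state` into boundary predictions can be written compactly. The notation is introduced here for illustration only: $h_i$ is the embedding of the $i$-th token, $(w_s, b_s)$ and $(w_e, b_e)$ are the parameters of the two linear layers, and $y_s$, $y_e$ are the target start and end tokens.

$$s_i = w_s^\top h_i + b_s, \qquad e_i = w_e^\top h_i + b_e$$

$$p^{\mbox{start}}_i = \frac{\exp(s_i)}{\sum_{k} \exp(s_k)}, \qquad p^{\mbox{end}}_i = \frac{\exp(e_i)}{\sum_{k} \exp(e_k)}$$

$$\mathcal{L} = -\alpha \log p^{\mbox{start}}_{y_s} - (1 - \alpha) \log p^{\mbox{end}}_{y_e}$$

Here the sums run over all the positions $k$ of the (padded) sequence, $\alpha$ weighs the two terms, and the loss is averaged over the records of the mini-batch. During training the softmax is not applied explicitly, since the cross-entropy criterion operates directly on the logits $s_i$ and $e_i$.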

In [None]:
%%capture
!pip install pytorch-lightning

In [None]:
import pytorch_lightning as pl
from transformers import BertModel

import torch.nn as nn


class BertBaseline(pl.LightningModule):
    def __init__(self,
                 pretrained_model = 'bert-base-uncased',
                 alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.encoder = BertModel.from_pretrained(pretrained_model)
        self.start_fc = nn.Linear(768, 1)
        self.end_fc = nn.Linear(768, 1)
        self.softmax = nn.Softmax(dim=1)
        self.criterion = nn.CrossEntropyLoss()

    def _logits(self, x):
        # x is (input_ids, token_type_ids, attention_mask), in the order produced by the data loader
        x = self.encoder(input_ids=x[0], token_type_ids=x[1], attention_mask=x[2])
        x = x["last_hidden_state"]
        start = self.start_fc(x).squeeze(dim=2)
        end = self.end_fc(x).squeeze(dim=2)
        return start, end

    def forward(self, x):
        start, end = self._logits(x)
        prob_start = self.softmax(start)
        prob_end = self.softmax(end)
        return prob_start, prob_end

    def training_step(self, batch, batch_idx):
        x, (y_start, y_end) = batch
        pred_start, pred_end = self._logits(x)
        loss_start = self.criterion(pred_start, y_start)
        loss_end = self.criterion(pred_end, y_end)
        loss = self.alpha * loss_start + (1 - self.alpha) * loss_end
        return loss

    def validation_step(self, batch, batch_idx):
        raise NotImplementedError()

    def test_step(self, batch, batch_idx):
        raise NotImplementedError()

    def configure_optimizers(self):
        # note: fine-tuning BERT is usually done with much smaller learning rates (e.g. 2e-5 to 5e-5)
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

In [None]:
trainer = pl.Trainer(max_epochs=10, gpus=1)
bert_baseline = BertBaseline()

trainer.fit(bert_baseline, get_loader(train_data, batch_size=16))