> **SPACY**
>
> ---
>
> In this notebook we develop a neural model leveraging **SpaCy**. We preprocess `paragraphs` and `questions` with the English language model to tokenize the sentences and retrieve their contextual embeddings. These embeddings are then used as inputs to train the model.

# **0. Preliminary Settings**

First, we clone the repository to get access to the code and its utility functions. The `src` folder is then added to the system path so that its modules can be imported inside the notebook.

Then, we use the utility functions in the `src` folder to get the ***train*** and ***validation*** splits. We discard the ***test*** split, as it is reserved for evaluating the best model only.

In [None]:
!git clone https://github.com/giuluck/Gangster-SQuAD

import sys
sys.path.append('Gangster-SQuAD/src')

from dataset import get_dataframes
train_df, val_df, _ = get_dataframes('Gangster-SQuAD/data/training_set.json')

In [None]:
## TODO: remove
train_df = train_df.iloc[:1000]
val_df = val_df.iloc[:100]

# **1. Preprocessing**

We use **SpaCy** to tokenize both the paragraphs and the questions and then extract the *contextual embedding* of each token. We also slightly modify the default tokenizer so that it splits consecutive punctuation characters and handles the unusual dashes found in the dataset.



In [None]:
%%capture
!python -m spacy download en_core_web_md
!python -m spacy link en_core_web_md en_core_web_md

In [None]:
import re
import string
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load('en_core_web_md')

split_chars = re.escape(string.punctuation + '––')                 # weird dashes added, escaped to stay literal inside [...]
inf = list(nlp.Defaults.infixes) + [f'(?<=.)[{split_chars}](?=.)'] # always split by punctuation chars
infix_re = compile_infix_regex(tuple(inf))

nlp.tokenizer = Tokenizer(
  nlp.vocab,
  prefix_search=nlp.tokenizer.prefix_search,
  suffix_search=nlp.tokenizer.suffix_search,
  infix_finditer=infix_re.finditer,
  token_match=nlp.tokenizer.token_match,
  rules=nlp.Defaults.tokenizer_exceptions
)

for token in nlp("SpaCy is cooler than NLTK, ain't it?"):
  print(f'{token.text_with_ws:7}--> {len(token.tensor)} features')
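
To see what the extra infix rule does on its own, the sketch below applies the same lookbehind/lookahead pattern with Python's `re` module, without SpaCy. The helper name `split_positions` is hypothetical, the character class is escaped with `re.escape` so that characters such as `]` or `\` stay literal, and a single en dash is added purely for illustration.

```python
import re
import string

# Punctuation plus an en dash, escaped so every character is literal inside [...]
SPLIT_CHARS = re.escape(string.punctuation + '–')

# A split character counts as an infix only when it has a character on both sides
INFIX_RE = re.compile(f'(?<=.)[{SPLIT_CHARS}](?=.)')

def split_positions(text):
    """Return the indices at which the infix rule would split `text` (hypothetical helper)."""
    return [match.start() for match in INFIX_RE.finditer(text)]

print(split_positions('state-of-the-art'))  # [5, 8, 12]
print(split_positions('(a,b)'))             # [2]
```

Punctuation at the very start or end of a string is not matched by this pattern, since those positions are left to the tokenizer's prefix and suffix rules.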

### ***1.1. Tokenization***

We process each sentence using *SpaCy*.

- **Questions** are (almost) unique in the dataset, so we process them individually and compute the matrix of *contextual embeddings* for each of them. This matrix is then appended to the respective record in both the *train* and the *validation* dataset.

- **Contexts**, instead, are repeated across records. Since tokenizing a sentence is computationally heavy, we avoid processing the same context multiple times by storing the unique contexts in a set and processing each of them once.
  * unlike *questions*, for *contexts* we retrieve not only the matrix of *contextual embeddings* but also the list of char *offsets*, which is used to retrieve the answer from the original context given the two token boundaries predicted by the neural model.
  * as both of these are heavy, we do not store them in the dataframe itself but in a list holding the *embeddings* and the *offsets* of each *context*. The dataframe only gets a column with the respective index, so the information related to a *context* is not stored in memory once per record.

> Please note that this procedure should take around **20 minutes**
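
The context deduplication described above boils down to mapping every distinct string to an integer index and storing that index in each record. A minimal sketch, using a hypothetical `build_index` helper and toy strings rather than the actual dataset:

```python
def build_index(items):
    """Map each distinct item to an integer index, in order of first appearance."""
    index = {}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index

records = ['ctx A', 'ctx B', 'ctx A', 'ctx C', 'ctx B']
index = build_index(records)

ctx_index = [index[ctx] for ctx in records]  # one small integer per record
unique_contexts = list(index)                # each context is processed only once

print(ctx_index)        # [0, 1, 0, 2, 1]
print(unique_contexts)  # ['ctx A', 'ctx B', 'ctx C']
```

Assigning indices in order of first appearance, rather than by iterating over a `set`, keeps the mapping reproducible across runs.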

In [None]:
import numpy as np

def process_sentences(sentences, extract_features, log_rate=1000):
  output = []
  for i, sentence in enumerate(sentences):
    if log_rate > 0 and i % log_rate == 0:
      format_chars = len(str(len(sentences) - 1))
      print(f'Sentence {i:{format_chars}}/{len(sentences)}')
    doc = nlp(sentence)
    features = extract_features(doc)
    output.append(features)
  return output

def extract_embeddings(doc):
  return np.array([np.concatenate((tok.vector, tok.tensor)) for tok in doc])

extract_embeddings(nlp("SpaCy is cooler than NLTK, ain't it?")).shape

In [None]:
train_df['qst_embedding'] = process_sentences(train_df['question'], extract_features=extract_embeddings)

In [None]:
val_df['qst_embedding'] = process_sentences(val_df['question'], extract_features=extract_embeddings)

In [None]:
import pandas as pd
from preprocessing import get_offsets

contexts = pd.concat((train_df['context'], val_df['context']))
contexts = { ctx: idx for idx, ctx in enumerate(set(contexts)) }

train_df['ctx_index'] = [contexts[ctx] for ctx in train_df['context']]
val_df['ctx_index'] = [contexts[ctx] for ctx in val_df['context']]

contexts = process_sentences(list(contexts.keys()), extract_features=lambda ctx: {
  'ctx_embedding': extract_embeddings(ctx),
  'offsets': get_offsets([token.text_with_ws for token in ctx])
})
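
The `get_offsets` helper comes from the repository's `preprocessing` module and its implementation is not shown in this notebook. As an illustration of the idea only, the hypothetical `token_offsets` function below assumes that each token string keeps its trailing whitespace, so that the start offset of a token is the total length of the tokens before it.

```python
def token_offsets(tokens):
    """Start character of each token, assuming the tokens concatenate back to the text."""
    offsets, position = [], 0
    for tok in tokens:
        offsets.append(position)  # this token starts where the previous one ended
        position += len(tok)
    return offsets

tokens = ['Hello ', 'world', '!']  # trailing whitespace kept on each token
print(token_offsets(tokens))       # [0, 6, 11]
```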

### ***1.2. Boundaries Computation***

Once the contexts have been tokenized, we compute the token boundaries that identify the answer inside each context. These boundaries are the **outputs** of the neural model, so they are used as targets for both training and validation.

In [None]:
from preprocessing import compute_boundaries

def add_boundaries(df):
  start_indices = []
  end_indices = []
  for _, record in df.iterrows():
    ctx = contexts[record['ctx_index']]
    start_idx, end_idx = compute_boundaries(ctx['offsets'], record['start'], len(record['answer']))
    start_indices.append(start_idx)
    end_indices.append(end_idx)
  df['start_idx'] = start_indices
  df['end_idx'] = end_indices

add_boundaries(train_df)
add_boundaries(val_df)

train_df[['start_idx', 'end_idx']]
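
The repository's `compute_boundaries` is imported as-is, so its exact logic is not visible here. One plausible way to map a character span onto token indices, given sorted token start offsets, is a binary search. The `char_span_to_token_span` name and the inclusive end index are assumptions of this sketch.

```python
from bisect import bisect_right

def char_span_to_token_span(offsets, start_char, answer_len):
    """Indices of the tokens containing the first and last character of the answer."""
    end_char = start_char + answer_len - 1  # position of the last answer character
    # The containing token is the last one whose start offset is <= the character
    start_idx = bisect_right(offsets, start_char) - 1
    end_idx = bisect_right(offsets, end_char) - 1
    return start_idx, end_idx

offsets = [0, 6, 11]  # token starts for 'Hello world!'
print(char_span_to_token_span(offsets, 6, 5))   # 'world'  -> (1, 1)
print(char_span_to_token_span(offsets, 7, 5))   # 'orld!'  -> (1, 2)
```

The second call shows an answer that starts mid-token: the returned span widens to whole tokens, so the recovered text can differ from the annotated answer.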

### ***1.3. Correctness Check***

Once the dataframe is ready, we check whether our tokenization is good enough to let us retrieve the correct answers from the text.

Answers are retrieved by:
1. getting the two `start` and `end` (token) boundaries that should be computed by the model
2. converting them into a `start_char` and an `end_char` pair of indices, which represent the boundaries in the original context, using the `offsets` list
3. selecting the correct portion of the `context` using these two (char) boundaries and stripping the obtained substring

Some of the retrieved answers do not match the expected ones, but this is because the answers given in the dataset contain substrings or variations of the words that are present in the text.
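
The three retrieval steps can be sketched as follows. This is not the repository's `retrieve_answer`, only a hypothetical `span_to_text` that assumes `offsets` holds the start character of each token and that `end_idx` is inclusive.

```python
def span_to_text(start_idx, end_idx, offsets, context):
    """Recover the answer text from an inclusive token span."""
    start_char = offsets[start_idx]
    # The span ends where the token after `end_idx` begins, or at the end of the text
    if end_idx + 1 < len(offsets):
        end_char = offsets[end_idx + 1]
    else:
        end_char = len(context)
    # Strip the whitespace that trails the last selected token
    return context[start_char:end_char].strip()

context = 'Hello world!'
offsets = [0, 6, 11]
print(span_to_text(0, 0, offsets, context))  # Hello
print(span_to_text(1, 2, offsets, context))  # world!
```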

In [None]:
from preprocessing import retrieve_answer, check_correctness

def retrieving_procedure(rec):
  ctx = contexts[rec['ctx_index']]
  return retrieve_answer(rec['start_idx'], rec['end_idx'], ctx['offsets'], rec['context'])

check_correctness(pd.concat((train_df, val_df)), retrieving_procedure)

### ***1.4. Dataset Creation***

We can now prepare the dataset using the data utilities provided by **Torch**.

* The `SquadDataset` class extends *Torch's Dataset* and lets us retrieve input and output data from the dataframe lazily
* The `DataLoader` is then used to create mini-batches from the dataset and, via the custom `collate_fn` function, to pack these padded mini-batches.
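
The layout of a single input sequence is context tokens, one all-zero separator row, question tokens, then zero padding up to a fixed length. The framework-free sketch below uses plain lists instead of tensors to show that layout. `build_input` is a hypothetical helper and the vectors are toy values.

```python
def build_input(ctx, qst, max_len):
    """Concatenate context, separator, question and padding rows."""
    dim = len(ctx[0])                  # embedding size, taken from the first context row
    length = len(ctx) + 1 + len(qst)   # unpadded length, +1 for the separator
    if length > max_len:
        raise ValueError('sequence is longer than max_len')
    sep = [[0.0] * dim]
    pad = [[0.0] * dim for _ in range(max_len - length)]
    return ctx + sep + qst + pad, length

ctx = [[1.0, 1.0] for _ in range(3)]   # 3 context tokens
qst = [[2.0, 2.0] for _ in range(2)]   # 2 question tokens
rows, length = build_input(ctx, qst, max_len=8)

print(len(rows), length)  # 8 6
```

Returning the unpadded length alongside the rows is what later allows the padded positions to be ignored when the batch is packed.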

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class SquadDataset(Dataset):
  def __init__(self, dataframe):
    ctx_lengths = dataframe['ctx_index'].map(lambda idx: len(contexts[idx]['ctx_embedding']))
    qst_lengths = dataframe['qst_embedding'].map(lambda emb: len(emb))
    self.max_len = max([cl + ql for cl, ql in zip(ctx_lengths, qst_lengths)]) + 1 # +1 for the sep token
    self.dataframe = dataframe
        
  def __getitem__(self, index):
    rec = self.dataframe.iloc[index]
    ctx = contexts[rec['ctx_index']]
    input_ctx = torch.tensor(ctx['ctx_embedding'])
    input_qst = torch.tensor(rec['qst_embedding'])
    input_sep = torch.zeros((1, input_ctx.shape[1]))
    input_len = len(input_ctx) + len(input_qst) + 1
    input_pad = torch.zeros((self.max_len - input_len, input_ctx.shape[1]))
    input = torch.cat((input_ctx, input_sep, input_qst, input_pad))
    output = torch.tensor([rec['start_idx'], rec['end_idx']])
    return (input, input_len), output
  
  def __len__(self):
    return len(self.dataframe)

train_data = SquadDataset(train_df)
val_data = SquadDataset(val_df)

(input, _), output = train_data[0]
print('Input:', input.shape)
print('Output:', output.shape)

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence

def collate_fn(samples):
  inputs = torch.stack([input for (input, _), _ in samples])
  lengths = torch.tensor([length for (_, length), _ in samples])
  outputs = torch.stack([output for _, output in samples])
  return pack_padded_sequence(inputs, lengths, batch_first=True, enforce_sorted=False), outputs

train_loader = DataLoader(train_data, batch_size=64, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_data, batch_size=64, shuffle=False, collate_fn=collate_fn)

# **2. Neural Models**

In [None]:
%%capture
!pip install pytorch-lightning

In [None]:
from torch import nn
from torch.nn.utils.rnn import pad_packed_sequence
import pytorch_lightning as pl

class SpacyCategorical(pl.LightningModule):
    def __init__(self, embedding_dim=396, hidden_dim=128, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.start_fc = nn.Linear(2 * hidden_dim, 1)
        self.end_fc = nn.Linear(2 * hidden_dim, 1)
        self.softmax = nn.Softmax(dim=1)
        self.criterion = nn.CrossEntropyLoss()

    def _logits(self, x):
        x, _ = self.rnn(x)
        x, _ = pad_packed_sequence(x, batch_first=True)
        start = self.start_fc(x).squeeze(dim=2)
        end = self.end_fc(x).squeeze(dim=2)
        return start, end

    def forward(self, x):
        start, end = self._logits(x)
        prob_start = self.softmax(start)
        prob_end = self.softmax(end)
        return prob_start, prob_end
    
    def _step(self, batch, batch_idx):
        x, y = batch
        pred_start, pred_end = self._logits(x)
        loss_start = self.criterion(pred_start, y[:, 0])
        loss_end = self.criterion(pred_end, y[:, 1])
        return loss_start, loss_end

    def training_step(self, batch, batch_idx):
        loss_start, loss_end = self._step(batch, batch_idx)
        loss = self.alpha * loss_start + (1 - self.alpha) * loss_end
        self.log('loss_start', loss_start, prog_bar=True)
        self.log('loss_end', loss_end, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        loss_start, loss_end = self._step(batch, batch_idx)
        loss = self.alpha * loss_start + (1 - self.alpha) * loss_end
        self.log('val_loss_start', loss_start, prog_bar=True)
        self.log('val_loss_end', loss_end, prog_bar=True)
        self.log('val_loss', loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-5)
        return optimizer

spacy_categorical = SpacyCategorical()
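
The objective used in `training_step` is a weighted combination of two cross-entropy terms, one over start positions and one over end positions. The plain-Python sketch below reproduces that computation for a single example. `cross_entropy` and `boundary_loss` are hypothetical names, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def boundary_loss(start_logits, end_logits, start_idx, end_idx, alpha=0.5):
    """alpha * start loss + (1 - alpha) * end loss, for one example."""
    loss_start = cross_entropy(start_logits, start_idx)
    loss_end = cross_entropy(end_logits, end_idx)
    return alpha * loss_start + (1 - alpha) * loss_end

# With uniform logits over 4 positions, each term equals log(4)
uniform = [0.0, 0.0, 0.0, 0.0]
print(boundary_loss(uniform, uniform, start_idx=1, end_idx=2))
```

With `alpha=0.5` both boundaries contribute equally. Moving `alpha` toward 1 would weight the start position more heavily.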

In [None]:
from pytorch_lightning.callbacks import EarlyStopping

trainer = pl.Trainer(
  max_epochs=10, gpus=-1, progress_bar_refresh_rate=20, logger=False, checkpoint_callback=False,
  callbacks=[EarlyStopping(monitor='val_loss', mode='min', patience=5)]
)

trainer.fit(spacy_categorical, train_loader, val_loader)

In [None]:
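spacy_categorical.eval()                 # switch the model to inference mode
starts, ends = [], []
with torch.no_grad():                    # gradients are not needed for prediction
  for input, output in val_loader:
    s, e = spacy_categorical(input)
    _, s = s.max(dim=1)                  # index of the most probable start token
    _, e = e.max(dim=1)                  # index of the most probable end token
    starts.append(s.cpu().numpy())
    ends.append(e.cpu().numpy())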

val_df['pred_start'] = [s for ss in starts for s in ss]
val_df['pred_end'] = [e for ee in ends for e in ee]
val_df

In [None]:
def retrieving_procedure(rec):
  ctx = contexts[rec['ctx_index']]
  max_len = len(ctx['offsets']) - 1
  pred_start = min(rec['pred_start'], max_len)
  pred_end = min(rec['pred_end'], max_len)
  return retrieve_answer(pred_start, pred_end, ctx['offsets'], rec['context'])

wrong_answers = check_correctness(val_df, retrieving_procedure)
wrong_answers