## Question Answering with BiLSTM and Attention

In this notebook, we build a deep learning architecture for question answering (QA). To make full use of the available context, we use BiLSTM units. Instead of one-hot encoding words, we use GloVe embeddings so that lexical semantics are available to the model. We follow the architecture shown in the figure below, copied from our [textbook](https://web.stanford.edu/~jurafsky/slp3/25.pdf) (Fig 25.7). We differ in the representation of the context: we use only GloVe embeddings, while the architecture in the textbook uses an enriched representation with POS tags, NER tags, etc.

![QA-Deep-Learning-Architecture](img/arch.png)

In [None]:
from pathlib import Path

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

import arch_utils
import utils
from layers import *

### Sections of the Notebook
1. [Loading the Dataset](#load)
2. [Tokenizing and Encoding](#tokenize_encode)
3. [Embeddings](#embeddings)
4. [The Model and Training](#model)<br>
    4.1 [Custom Loss and Accuracy](#loss_acc)
5. [Exercises](#exercises)

<a id="load"></a>
### 1. Loading the Dataset

We will use the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/). This dataset is available in TensorFlow, and we will load it with the TensorFlow Datasets ([tfds](https://www.tensorflow.org/datasets/api_docs/python/tfds)) loader. We will extract the context, the question text, the answer text and the starting position of the answer. Later in the notebook, we will process these into a form that our architecture can use.

In [None]:
CWD = Path.cwd().as_posix()

# Load data
squad_data, info = tfds.load("squad", data_dir=CWD, with_info=True)

# Get training and validation splits
squad_train = squad_data["train"]
squad_validation = squad_data["validation"]
print(info.features)

# Get context, question and answer text, plus the start position of the answer
# in the context (a character offset from the beginning of the context)
context_tr, question_tr, answer_text_tr, answer_start_tr = utils.split_info(squad_train)
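
`utils.split_info` belongs to the homework bundle and is not shown in this notebook. As a rough sketch of what such a helper might do, the function below (a hypothetical stand-in named `split_info_sketch`) walks over records shaped like the `squad` features and collects the four fields, keeping only the first listed answer. The toy record and the decision to decode bytes to `str` are assumptions made for illustration.

```python
def _to_str(value):
    # tfds typically yields byte strings; decode them if needed.
    return value.decode("utf-8") if isinstance(value, bytes) else value


def split_info_sketch(records):
    """Collect context, question, first answer text and its start offset.

    Hypothetical stand-in for utils.split_info; the real helper may differ.
    """
    contexts, questions, answer_texts, answer_starts = [], [], [], []
    for rec in records:
        contexts.append(_to_str(rec["context"]))
        questions.append(_to_str(rec["question"]))
        # Answers are stored as parallel lists; keep the first one only.
        answer_texts.append(_to_str(rec["answers"]["text"][0]))
        answer_starts.append(int(rec["answers"]["answer_start"][0]))
    return contexts, questions, answer_texts, answer_starts


toy_records = [
    {
        "context": b"The cat sat on the mat.",
        "question": b"Where did the cat sit?",
        "answers": {"text": [b"on the mat"], "answer_start": [12]},
    }
]
ctx, qs, ans, starts = split_info_sketch(toy_records)

# The start offset is a character index into the context.
print(ctx[0][starts[0] : starts[0] + len(ans[0])])
```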

### Example

We can look at an example from the training set, e.g. the example at index 5.

In [None]:
i = 5
start_char_idx = answer_start_tr[i]
print("Context: ".upper(), context_tr[i], "\n")
print("Context until answer: ".upper(), context_tr[i][0:start_char_idx-1], "\n")


print("Question: ".upper(), question_tr[i], "\n")
print("Answer: ".upper(), answer_text_tr[i])

<a id='tokenize_encode'></a>
### 2. Tokenizing and Encoding

We use the [TensorFlow Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) to map words to integers, as in the previous homeworks.

We post-pad the resulting context and question vectors so that all contexts share one length and all questions share another.
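
To make post-padding concrete, the pure-Python sketch below mimics what `pad_sequences(..., padding="post")` does for sequences that are no longer than `maxlen`: zeros are appended until every row has the same length. The helper name `post_pad` is illustrative only, and truncation of longer sequences is deliberately left out.

```python
def post_pad(sequences, maxlen, value=0):
    """Append `value` to each sequence until it has length `maxlen`."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            raise ValueError("this sketch does not handle truncation")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded


encoded = [[4, 7, 2], [9], [3, 5]]
longest = max(map(len, encoded))
print(post_pad(encoded, longest))
```

Index 0 is safe to use for padding because the Keras `Tokenizer` assigns word indices starting at 1.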

In [None]:
padding_type = "post"

# mark out of vocabulary words
oov_token = "<OOV>"

# initialize the tokenizer
tokenizer = Tokenizer(oov_token=oov_token)

# tokenize texts
tokenizer.fit_on_texts(context_tr)

# mapping between words and integers in the vocabulary
word_index = tokenizer.word_index

# size of the vocabulary
num_words = len(word_index.keys())

print("{:50s}: {}".format("Total number of words", num_words))

In [None]:
# encode contexts as integer sequences and post-pad them
sequences = tokenizer.texts_to_sequences(context_tr)
context_len = max(map(len, sequences))
print("{:50s}: {}".format("Max length of a context vector", context_len))
context_padded = pad_sequences(sequences, maxlen=context_len, padding=padding_type)

# encode questions as integer sequences and post-pad them
sequences = tokenizer.texts_to_sequences(question_tr)
question_len = max(map(len, sequences))
print("{:50s}: {}".format("Max length of a question vector", question_len))
question_padded = pad_sequences(sequences, maxlen=question_len, padding=padding_type)

# encode answers as integer sequences (no padding)
answer_token = tokenizer.texts_to_sequences(answer_text_tr)

We only keep the examples whose tokenized context contains the exact answer. The example shown above will be excluded. Because some of the words in the context are removed during tokenizing, the word position computed from the raw text can be off, so we search for the answer near the given start position. We set the search window to 10 tokens before and after that position.

Here, we also form the labels for the dataset as (answer_start, answer_end) pairs, using the length of the given answer text to find the end. Each label is then stored as two concatenated one-hot vectors of length `context_len`: the first has a 1 at the start of the answer and the second has a 1 at the end of the answer.
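
The search described above can be illustrated on a toy example. The sketch below converts a character offset into an approximate word index, then scans a window around it for the answer tokens. Lowercased word lists stand in for the integer sequences, and `find_answer_span` is a hypothetical helper name, so this illustrates the idea and is not the notebook's exact procedure.

```python
def find_answer_span(context_tokens, answer_tokens, approx_start, window=10):
    """Return (start, end) token indices of the answer near approx_start, or None."""
    n = len(answer_tokens)
    lo = max(0, approx_start - window)
    hi = min(len(context_tokens) - n, approx_start + window)
    for j in range(lo, hi + 1):
        if context_tokens[j : j + n] == answer_tokens:
            return j, j + n - 1  # inclusive end index
    return None


context = "In 1066 , William the Conqueror won the Battle of Hastings ."
answer = "Battle of Hastings"
char_start = context.index(answer)

# Word-level estimate from the raw text; punctuation still counts as a word here.
approx = len(context[:char_start].split())

# A tokenizer that drops punctuation shifts positions, hence the window search.
tokens = [w.lower() for w in context.split() if w.isalnum()]
span = find_answer_span(tokens, answer.lower().split(), approx)
print(approx, span)
```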

In [None]:
y_train_tup = []  # (start, end) token positions, end inclusive
selected = []  # keep only examples whose context contains the exact answer
WINDOW = 10

for i in range(len(answer_start_tr)):
    start_char_idx = answer_start_tr[i]
    # approximate word position of the answer, estimated from the raw text;
    # max(..., 0) keeps an offset of 0 from turning into the slice [0:-1]
    prefix = context_tr[i][0 : max(start_char_idx - 1, 0)]
    start = len(prefix.replace("-", " ").split()) + 1
    answer_len = len(answer_text_tr[i].replace("-", " ").split())

    # search near the estimate; clamp at 0 so negative indices never wrap around
    for j in range(max(start - WINDOW, 0), start + WINDOW):
        if np.array_equal(context_padded[i][j : j + answer_len], answer_token[i]):
            start = j
            end = j + answer_len - 1
            y_train_tup.append((start, end))
            selected.append(i)
            break

context_padded_clean = context_padded[selected]
question_padded_clean = question_padded[selected]
answer_text_clean = answer_text_tr[selected]

num_train_data = context_padded_clean.shape[0]
print("{:50s}: {}".format("Number of training samples after cleaning", num_train_data))

In [None]:
y_train = []
for i in range(len(context_padded_clean)):
    s = np.zeros(context_len, dtype="float32")
    e = np.zeros(context_len, dtype="float32")

    s[y_train_tup[i][0]] = 1
    e[y_train_tup[i][1]] = 1

    y_train.append(np.concatenate((s, e)))
y_train = np.array(y_train)
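
As a quick sanity check on this label layout, the sketch below builds one label for a toy context length and recovers the span again with `argmax` on each half. The values of `toy_len`, `start` and `end` are made up for illustration.

```python
import numpy as np

toy_len = 8          # stands in for context_len
start, end = 2, 4    # inclusive token positions of the answer

s = np.zeros(toy_len, dtype="float32")
e = np.zeros(toy_len, dtype="float32")
s[start] = 1
e[end] = 1
label = np.concatenate((s, e))  # shape: (2 * toy_len,)

# First half marks the start position, second half marks the end position.
decoded_start = int(np.argmax(label[:toy_len]))
decoded_end = int(np.argmax(label[toy_len:]))
print(label.shape, decoded_start, decoded_end)
```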

<a id='embeddings'></a>
### 3. Embeddings

We will use the 50-dimensional GloVe embeddings. Below are the embedding vectors for some tokens, e.g., "the", "a", ".", etc.

In [None]:
!head glove.6B.50d.txt | cat -n

In [None]:
emb_dim = 50 # glove.6B.50d
embeddings_mat = arch_utils.load_embeddings("glove.6B.50d.txt", num_words, emb_dim, word_index)
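
`arch_utils.load_embeddings` is part of the homework bundle and is not shown here. The sketch below is one plausible way to build such a matrix, assuming the standard GloVe text format (a token followed by its space-separated vector components on each line). The function name `load_embeddings_sketch`, the extra row for the padding index 0, and the choice to leave missing words as zero vectors are all assumptions, not the bundle's actual behaviour.

```python
import io

import numpy as np


def load_embeddings_sketch(lines, word_index, emb_dim):
    """Build a (len(word_index) + 1, emb_dim) matrix from GloVe-style lines."""
    # Row 0 stays all zeros for the padding index (an assumption of this sketch).
    matrix = np.zeros((len(word_index) + 1, emb_dim), dtype="float32")
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        row = word_index.get(word)
        if row is not None and len(values) == emb_dim:
            matrix[row] = [float(v) for v in values]
    return matrix


# Tiny in-memory stand-in for glove.6B.50d.txt, with 3-dimensional vectors.
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy_index = {"<OOV>": 1, "the": 2, "cat": 3, "zyzzyva": 4}
emb = load_embeddings_sketch(fake_glove, toy_index, emb_dim=3)
print(emb.shape)
```

Words that do not appear in the GloVe file (here the made-up entry "zyzzyva") keep a zero vector in this sketch; other initializations, such as small random values, are equally reasonable.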

<a id='model'></a>
### 4. The Model and Training

Our model has two inputs: the context vectors and the question vectors. We define an input layer for each. Each input is fed through an Embedding layer followed by a BiLSTM layer, and each Embedding+BiLSTM group has its own parameters. We implement the attention mechanism in the BiLinear layer, which computes the similarity between the context and the question.

We will use the functional Keras API this time.

In [None]:
units = 128
context_input = Input(shape=(context_len,))
context_emb = Embedding(num_words, embeddings_mat, emb_dim)(context_input)
context_lstm = BiLSTM(units)(context_emb)

question_input = Input(shape=(question_len,))
question_emb = Embedding(num_words, embeddings_mat, emb_dim)(question_input)
question_lstm = BiLSTM(units)(question_emb)

y_prob = BiLinear_Layer(2 * units, question_len)(context_lstm, question_lstm)

model = Model(inputs=[context_input, question_input], outputs=y_prob)
model.summary()
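
`BiLinear_Layer` comes from the local `layers` module and is not shown in this notebook. As a rough NumPy sketch of the bilinear attention idea, each context state `p_i` is scored against a single question vector `q` through a weight matrix `W` (the score is `p_i · W · q`), and a softmax over context positions turns the scores into start (or end) probabilities. The random weights, the toy shapes, the plain mean used to summarize the question, and the concatenated output layout are assumptions made for illustration; the actual layer may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
context_len, question_len, hidden = 6, 4, 8  # hidden plays the role of 2 * units

context_states = rng.normal(size=(context_len, hidden))    # one row per context token
question_states = rng.normal(size=(question_len, hidden))  # one row per question token
W_start = rng.normal(size=(hidden, hidden))  # random stand-ins for learned weights
W_end = rng.normal(size=(hidden, hidden))


def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()


# Summarize the question as one vector (a plain mean here, as an assumption).
q = question_states.mean(axis=0)

# Bilinear score for every context position, then normalize over positions.
p_start = softmax(context_states @ W_start @ q)
p_end = softmax(context_states @ W_end @ q)

# Concatenate so the output has the same layout as the (start, end) labels.
y_prob_sketch = np.concatenate((p_start, p_end))
print(y_prob_sketch.shape, p_start.sum(), p_end.sum())
```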

<a id='loss_acc'></a>
### 4.1 Custom Loss and Accuracy
We will use a custom loss and a custom accuracy to evaluate our model, because we are predicting both the start and the end of the answer in the context. Specifically, for each position in the context, our model predicts the probability that it is the start of the answer and the probability that it is the end of the answer. To summarize both predictions in a single number, the custom loss is the sum of the two binary cross-entropies. For the accuracy, we report the proportion of exact matches for the start and end positions.
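
`arch_utils.loss` and `arch_utils.Custom_Accuracy` are defined in the homework bundle and are not shown here. The NumPy sketch below illustrates the idea under the assumption that labels and predictions are laid out as two concatenated halves (start, then end): it computes a binary cross-entropy for each half, adds them, and counts a prediction as correct only when both the start and the end `argmax` match. The function names, the clipping constant, and the "both must match" reading of exact match are illustrative assumptions.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the last axis."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    terms = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return -np.mean(terms, axis=-1)


def span_loss(y_true, y_pred):
    """Sum of the start-half and end-half cross-entropies, one value per example."""
    half = y_true.shape[-1] // 2
    start_loss = binary_crossentropy(y_true[:, :half], y_pred[:, :half])
    end_loss = binary_crossentropy(y_true[:, half:], y_pred[:, half:])
    return start_loss + end_loss


def exact_match(y_true, y_pred):
    """Fraction of examples whose predicted start and end both match."""
    half = y_true.shape[-1] // 2
    start_ok = y_true[:, :half].argmax(axis=1) == y_pred[:, :half].argmax(axis=1)
    end_ok = y_true[:, half:].argmax(axis=1) == y_pred[:, half:].argmax(axis=1)
    return float(np.mean(start_ok & end_ok))


# Two toy examples with a context length of 3 (so each row has 6 entries).
y_true_toy = np.array([[0, 1, 0, 0, 0, 1],    # start=1, end=2
                       [1, 0, 0, 0, 1, 0]],   # start=0, end=1
                      dtype="float32")
y_pred_toy = np.array([[0.1, 0.8, 0.1, 0.2, 0.1, 0.7],   # start and end correct
                       [0.6, 0.3, 0.1, 0.5, 0.4, 0.1]],  # end is wrong
                      dtype="float32")
print(span_loss(y_true_toy, y_pred_toy), exact_match(y_true_toy, y_pred_toy))
```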

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(
    optimizer=optimizer,
    loss=arch_utils.loss,
    metrics=[arch_utils.Custom_Accuracy(num_train_data)],
)

Training takes a long time, so here we show only the first two epochs. For the batch submission, we expect you to train for longer; details are in the [Exercises](#exercises) section.

In [None]:
context_padded_jupyter = context_padded_clean[:1000]
question_padded_jupyter = question_padded_clean[:1000]
y_train_jupyter = y_train[:1000]

# model.load_weights("epochs_1000")
init_epoch = 0
num_epochs = 2
batch_size = 128

# early stopping will depend on the validation loss
# patience parameter determines how many epochs with no improvement
# in validation loss will be tolerated
# before training is terminated.
earlystopping = EarlyStopping(monitor="val_loss", patience=2)

filepath = "epochs_{epoch:03d}"
checkpoint = ModelCheckpoint(filepath, save_weights_only=True)

callbacks = [earlystopping, checkpoint]

history = model.fit(
    x=[context_padded_jupyter, question_padded_jupyter],
    y=y_train_jupyter,
    # keep 10% of the training data for validation
    validation_split=0.1,
    initial_epoch=init_epoch,
    epochs=num_epochs,
    callbacks=callbacks,
    verbose=2,  # Logs once per epoch.
    batch_size=batch_size,
    # Our neural network will be trained
    # with stochastic (mini-batch) gradient descent.
    # It is important that we shuffle our input.
    shuffle=True,  # set to True by default
)

# Print training history
history = history.history
print(
    "\nValidation accuracy: {acc}, loss: {loss}".format(
        acc=history["val_custom_accuracy"][-1], loss=history["val_loss"][-1]
    )
)
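
The patience rule described in the comments above can be sketched in a few lines of plain Python. This is a simplified model of the idea (it ignores options such as `min_delta` and `restore_best_weights`), the function name is hypothetical, and the loss values are made up.

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


made_up_losses = [0.90, 0.80, 0.82, 0.81, 0.70]
stop_epoch = epochs_until_stop(made_up_losses, patience=2)
print(stop_epoch)
```

With only two epochs and `patience=2`, as in the cell above, early stopping is not expected to trigger; it matters for the longer batch runs.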

<a id='exercises'></a>
### 5. Exercises

###### 1. In the homework bundle, we provide the weights trained up to epoch 1500. Submit a batch job to train it for another 1000 epochs.
###### 2. With the model trained for 2500 epochs, report the accuracy. About 20% of the training data is set aside as a test set.

<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

Please contact Haluk Dogan (<a href="mailto:hdogan@vivaldi.net">hdogan@vivaldi.net</a>) for further questions or inquiries.