# NLP Project: Evaluation Optimization

## Part B: Question Answering

In this part, we will develop a question-answering system. 


Make sure to monitor your Azure budget carefully, should you choose to use it.

In [4]:
import re, os, json, pickle, ast, time, random, requests
import pandas as pd
import numpy as np
import scipy
import scipy.sparse as sp

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords", quiet = True)
nltk.download("wordnet", quiet = True)
nltk.download("averaged_perceptron_tagger", quiet = True)
nltk.download("punkt", quiet = True)

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from tqdm import tqdm
tqdm.pandas()

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

GLOBAL_SEED = 1
 
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

GLOBAL_WORKER_ID = None
def _init_fn(worker_id):
    global GLOBAL_WORKER_ID
    GLOBAL_WORKER_ID = worker_id
    set_seed(GLOBAL_SEED + worker_id)

set_seed(GLOBAL_SEED)

In [6]:
def check_equal(actual, expected):
    assert actual == expected, actual

def check_approx(actual, expected):
    assert np.allclose(actual, expected), actual

## Part A: Project Overview
We will train our NLP models on the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/), a reading comprehension dataset with more than 100,000 questions. SQuAD was one of the first such datasets with a public leaderboard, which helped it attract a large amount of research and publicity. Questions in the dataset can be answered from the context that accompanies them, without requiring any domain-specific knowledge; thus they belong to the class of *single-hop* question answering problems.

Here's an example of what our model will do: given a *question*,
```
When did Beyonce start becoming popular?
```
and a block of text containing the answer, called the *context*,
```
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
```
our model will be able to extract the answer from this context, which is
```
in the late 1990s
```

We will start by evaluating some simple models that only identify the sentence which contains the answer (i.e., the *answer sentence*) within the context. 

First we load the dataset -- note that the following cell should have the tag `excluded_from_script`, since the autograder will use a different dataset.

In [7]:
df_squad = pd.read_csv(
    "cleaned_squad_data.csv",
    dtype = { 
        "question" : str,
        "context_paragraph" : str,
        "answer" : str,
        "answer_start" : int,
        "answer_end" : int,
        "answer_sent_index" : int,
        "tokenized_context" : str
    },
    converters = {"context_sentences" : ast.literal_eval}
)

df_squad.head(5)

| | question | context_paragraph | answer | answer_start | answer_end | context_sentences | answer_sent_index |
|---|---|---|---|---|---|---|---|
| 0 | When did Beyonce start becoming popular? | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | in the late 1990s | 269 | 286 | [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... | 1 |
| 1 | What areas did Beyonce compete in when she was... | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | singing and dancing | 207 | 226 | [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... | 1 |
| 2 | When did Beyonce leave Destiny's Child and bec... | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | 2003 | 526 | 530 | [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... | 3 |
| 3 | In what city and state did Beyonce grow up? | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | Houston, Texas | 166 | 180 | [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... | 1 |
| 4 | In which decade did Beyonce become famous? | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | late 1990s | 276 | 286 | [Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ ... | 1 |


Since the cell contents are truncated, let's print out and examine one row in detail:

In [8]:
df_squad.iloc[3].to_dict()

{'question': 'In what city and state did Beyonce  grow up? ',
 'context_paragraph': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'answer': 'Houston, Texas',
 'answer_start': 166,
 'answer_end': 180,
 'context_sentences': ['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, re

Now we can get a better understanding of what each column means:
* `question` is the question text.
* `context_paragraph` is the paragraph of text that contains the answer, which our model extracts from.
* `answer` is the ground-truth answer to the given question.
* `answer_start` is the index of the first character of `answer` within `context_paragraph`, and `answer_end` is the index just past its last character (i.e., `answer_end` is exclusive). In other words, `context_paragraph[answer_start:answer_end]` yields `answer`.
* `context_sentences` is the list of sentences in `context_paragraph`.
* `answer_sent_index` is the ground-truth index of the answer sentence within `context_sentences` (indexing starts from 0).
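
The half-open convention for `answer_start` and `answer_end` can be checked with a short sketch. The paragraph below is a shortened, made-up stand-in for a real `context_paragraph`, so its offsets differ from those in the dataset.

```python
# Shortened context for illustration only; offsets differ from the real dataset row.
context_paragraph = "Born and raised in Houston, Texas, she performed in various competitions."
answer = "Houston, Texas"

answer_start = context_paragraph.find(answer)   # index of the first character of the answer
answer_end = answer_start + len(answer)         # one past the last character (exclusive)

# Slicing with a half-open range recovers the answer exactly.
assert context_paragraph[answer_start:answer_end] == answer
print(answer_start, answer_end)  # 19 33
```

Because the end index is exclusive, `answer_end - answer_start` equals `len(answer)`.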

There are several techniques to build a question-answering model from this dataset, which we will introduce in the following sections.

## Part B: Unsupervised Models
In this section, we will implement three unsupervised learning models to identify the sentence that contains the answer to a given question. Here "unsupervised" means that we will not make use of the ground truth answer provided in the dataset (i.e., the columns `answer`, `answer_start`, and `answer_end`). Instead, the sentence identification will be based only on some pre-defined heuristics.

### Question 7: Jaccard Overlap
The Jaccard overlap of two given sets $A$ and $B$ measures the similarity between them and is defined as

\begin{equation}
J(A, B) = \begin{cases}
    \frac{|A \cap B|}{|A \cup B|} & \text{ if $A \ne \emptyset$ or $B \ne \emptyset$ } \\
    1 & \text { otherwise }
\end{cases}
\end{equation}

Given a question and a list of context sentences, we can identify the answer sentence using Jaccard overlap as follows:
1. Construct the set of words that are in the input question; we will call this set $Q$.
1. Construct the sets of words that are in each context sentence; we will call these sets $S_1, S_2, \ldots, S_m$. Here $S_i$ is the set of words in the $i$-th context sentence.
1. Return the index of the context sentence whose Jaccard overlap with the input question is largest; this is our predicted answer sentence: 

$$\hat y = \underset{1 \le i \le m}{\operatorname{argmax}} J(Q, S_i).$$
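
The three steps above can be sketched on a toy example. This is only an illustration: the sentences are made up, and whitespace splitting (`str.split`) stands in for `nltk.word_tokenize` so that the snippet has no dependencies.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def predict_sentence(question, sentences, tokenize=str.split):
    """Return (0-based index, overlap) of the sentence most similar to the question."""
    q = set(tokenize(question))
    values = [jaccard(q, set(tokenize(s))) for s in sentences]
    best = max(values)
    # list.index returns the first match, so ties go to the smallest index
    return values.index(best), best

# Made-up sentences for illustration
sentences = ["the cat sat on the mat", "a dog barked at the cat", "birds fly south"]
index, value = predict_sentence("where did the cat sit", sentences)
print(index, value)  # 0 0.25
```

Here the question shares two tokens (`the`, `cat`) with the first sentence out of eight distinct tokens in their union, giving an overlap of 2/8.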

Implement the function `get_jaccard_prediction` that performs the above steps on the dataset `df_squad`. For every row, it stores the predicted answer sentence index in a new column `"jaccard_prediction"`, and the corresponding largest Jaccard overlap value in a new column `"jaccard_value"`.

**Notes**:
* Our math notations use 1-based indexing, but in your implementation the indexes start from 0. In other words, if the first sentence in the context paragraph is the predicted answer sentence, you should return 0.
* If multiple context sentences have the same (largest) Jaccard overlap with the question, return the smallest sentence index.
* To build the set of words from a sentence, you should first tokenize the sentence with `nltk.word_tokenize`, and then turn the resulting list of tokens into a set.
* You do not need to perform any rounding on the Jaccard overlap values.
* Refer to the [Pandas primer](https://nbviewer.jupyter.org/url/clouddatascience.blob.core.windows.net/primers/pandas-primer/pandas_primer.ipynb) on how to vectorize a row-wise dataframe operation.

In [6]:
def get_jaccard_prediction(df_squad):
    """
    Identify the answer sentence as one that has the largest Jaccard overlap with the input question.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        
    returns:
        pd.DataFrame : the input dataframe with two additional columns, "jaccard_prediction" and "jaccard_value"
    """
    def jaccard(a, b):
        # By definition, J(A, B) = 1 when both sets are empty; this also avoids a division by zero
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    # Token set of each question, and list of token sets for each row's context sentences
    q_sets = df_squad['question'].apply(lambda q: set(nltk.word_tokenize(q)))
    c_sets = df_squad['context_sentences'].apply(lambda sents: [set(nltk.word_tokenize(s)) for s in sents])

    # Jaccard overlap between the question and every context sentence in the same row
    jaccard_values = pd.Series(
        [[jaccard(q, c) for c in cs] for q, cs in zip(q_sets, c_sets)],
        index = df_squad.index
    )

    df_squad['jaccard_value'] = jaccard_values.apply(max)
    # list.index returns the first occurrence, so ties resolve to the smallest sentence index
    df_squad['jaccard_prediction'] = jaccard_values.apply(lambda x: x.index(max(x)))
    return df_squad

In [17]:
def test_get_jaccard_prediction():
    """Test on the first 10 rows"""
    df_jaccard = get_jaccard_prediction(df_squad.head(10).copy())
    
    check_equal(df_jaccard.shape, (10, 9))
    
    check_approx(df_jaccard["jaccard_value"].tolist(),[
        0.0, 0.046511627906976744, 0.15, 0.03225806451612903, 0.02040816326530612,
        0.18421052631578946, 0.10869565217391304, 0.12, 0.05263157894736842, 0.10256410256410256
    ])
    
    check_equal(df_jaccard["jaccard_prediction"].tolist(), [0, 1, 1, 0, 3, 1, 3, 2, 1, 1])
    
    jaccard_accuracy = (df_jaccard["jaccard_prediction"] == df_jaccard["answer_sent_index"]).values.mean()
    check_equal(jaccard_accuracy, 0.6)
    
    
    """Test on the entire dataset"""
    df_jaccard = get_jaccard_prediction(df_squad.copy())
    
    check_equal(df_jaccard.shape, (86821, 9))
    
    check_approx(df_jaccard.tail(10)["jaccard_value"].tolist(), [
        0.11428571428571428, 0.17073170731707318, 0.08333333333333333, 0.10526315789473684, 0.125,
        0.10714285714285714, 0.04, 0.07407407407407407, 0.07142857142857142, 0.08333333333333333
    ])
    
    check_equal(df_jaccard.tail(10)["jaccard_prediction"].tolist(), [0, 0, 0, 4, 4, 1, 1, 1, 1, 1])
    
    jaccard_accuracy = (df_jaccard["jaccard_prediction"] == df_jaccard["answer_sent_index"]).values.mean()
    check_approx(jaccard_accuracy, 0.7001992605475634)
    print("All tests passed!")
    
test_get_jaccard_prediction()

All tests passed!


Essentially, the Jaccard technique is saying "the context sentence that is most similar to the question contains the answer." We see that even such a simple heuristic can achieve 70% accuracy, which is not bad at all. This performance may improve a bit if we take the time to preprocess the questions and context sentences (e.g., remove stopwords, lemmatize tokens) before computing the Jaccard overlap.
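
To see why preprocessing might help, here is a contrived sketch in which removing stopwords changes the predicted sentence. The stopword list, question, and sentences are all hypothetical and chosen to make the effect visible; it says nothing about the size of the gain on SQuAD.

```python
STOPWORDS = {"what", "is", "the", "of"}  # tiny hypothetical list, not NLTK's

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def best_sentence(question, sentences, stopwords=frozenset()):
    """Index of the sentence with the largest Jaccard overlap after removing stopwords."""
    q = set(question.lower().split()) - stopwords
    values = [jaccard(q, set(s.lower().split()) - stopwords) for s in sentences]
    return values.index(max(values))

question = "what is the name of the river"
sentences = [
    "the point of the game is what matters",  # shares only function words with the question
    "the river is called the Nile",           # shares the content word "river"
]

raw_index = best_sentence(question, sentences)               # function words dominate
clean_index = best_sentence(question, sentences, STOPWORDS)  # content words decide
print(raw_index, clean_index)  # 0 1
```

Without filtering, the first sentence wins purely on shared function words; after filtering, only the shared content word `river` remains and the second sentence is selected.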

### Question 8: TF-IDF Vectors
Instead of Jaccard overlap, we can employ other measures of similarity, such as the Euclidean distance:

$$d_{\text{euclidean}}(a, b) = \|a - b\|_2 = \sqrt{\sum_{i=1}^k (a_i - b_i)^2}.$$

To compute this distance, we first need to convert each question and context sentence into a vector. Recall from earlier projects that one way to do so is building a TF-IDF model, which transforms each string into a vector $v \in \mathbb{R}^k$ where:
* $k$ is the number of unique tokens in the entire dataset.
* The $i$-th element is the frequency of token $i$ in the string, multiplied by its IDF value.

Given a question, a list of context sentences, and a trained TF-IDF model, we can identify the answer sentence as follows:
1. Transform the question into a vector $u$ using the TF-IDF model.
1. Transform each context sentence $s_i$ into a vector $v_i$ using the TF-IDF model.
1. Return the index of the context sentence whose Euclidean distance to the input question is smallest in the TF-IDF space:

$$\hat y = \underset{1 \le i \le m}{\operatorname{argmin}} d_{\text{euclidean}}(u, v_i).$$
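
One property is worth keeping in mind when reading the distance values. By default, scikit-learn's `TfidfVectorizer` L2-normalizes each output row (`norm='l2'`), so every non-empty TF-IDF vector has unit length. For unit vectors, $\|u - v\|_2^2 = 2 - 2\,u \cdot v$, so minimizing the Euclidean distance is the same as maximizing the cosine similarity, and two vectors with no terms in common are exactly $\sqrt{2} \approx 1.4142$ apart. The sketch below checks this identity on made-up vectors using only the standard library.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up non-negative weight vectors, normalized to unit length
u = l2_normalize([1.0, 2.0, 0.0, 0.0])
v_overlap = l2_normalize([0.0, 1.0, 3.0, 0.0])   # shares one term with u
v_disjoint = l2_normalize([0.0, 0.0, 0.0, 5.0])  # shares no term with u

for v in (v_overlap, v_disjoint):
    # Squared distance between unit vectors equals 2 - 2 * (cosine similarity)
    assert math.isclose(euclidean(u, v) ** 2, 2 - 2 * dot(u, v))

print(euclidean(u, v_disjoint))  # approximately 1.4142, i.e. sqrt(2)
```

This would explain why a value close to 1.4142 can appear as a smallest distance: it corresponds to a question whose vector shares no vocabulary terms with any sentence in its context.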

Implement the function `get_tfidf_prediction` that performs the above steps on the dataset `df_squad`. For every row, it stores the predicted answer sentence index in a new column `"tfidf_prediction"`, and the corresponding smallest Euclidean distance value in a new column `"distance_value"`.

**Notes**:
* Since the input `tfidf_vectorizer` is already trained, you don't need to fit it on anything. Instead, just call `.transform` on the appropriate question / context sentence.
* Because TF-IDF transformation outputs sparse matrices, you should only use `scipy.sparse` methods to operate on them. Using standard NumPy/SciPy methods may lead to dimension issues or implicit conversion of the sparse matrices to dense.
* If multiple context sentences have the same (smallest) distance value from the question, return the smallest sentence index.
* You may find that using `df.apply(<your_custom_function>, axis = 1)` to process every row is quite slow, due to the complexity of the operations involved. To work around this issue, try to think about how to completely vectorize this function, using only built-in Pandas and NumPy/Scipy or Sklearn operations.
* You may find the Pandas method `.explode()` helpful. Note that there are duplicate questions in the dataset (the same question may apply to different context paragraphs), so be careful when performing groupby on the exploded dataset.
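
As a minimal sketch of the `.explode()` hint, the example below groups on the original row index rather than on the question text, which sidesteps the duplicate-question issue. The dataframe, the column names `row_id`, `sent_index`, `prediction`, and the distance values are all made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "question": ["q1", "q1", "q2"],  # "q1" appears twice on purpose
    "context_sentences": [["a", "b"], ["c", "d", "e"], ["f"]],
})

# explode() repeats the original index for every sentence, so the index identifies
# the source row even when the question text is duplicated.
exploded = df.explode("context_sentences").rename_axis("row_id").reset_index()
exploded["sent_index"] = exploded.groupby("row_id").cumcount()

# Hypothetical distances, one per (row, sentence) pair
exploded["distance"] = [0.9, 0.4, 0.7, 0.2, 0.2, 0.5]

# idxmin returns the first label attaining the minimum, so ties go to the smaller sentence index
best = exploded.loc[exploded.groupby("row_id")["distance"].idxmin()].set_index("row_id")
df["prediction"] = best["sent_index"]
df["distance_value"] = best["distance"]
print(df[["question", "prediction", "distance_value"]])
```

Grouping on `question` alone would merge the first two rows into one group here, which is the pitfall the note above warns about.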

In [7]:
def get_tfidf_prediction(df_squad, tfidf_vectorizer):
    """
    Identify the answer sentence as one whose TF-IDF representation has minimal distance to that of the question.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        tfidf_vectorizer (sklearn.feature_extraction.text.TfidfVectorizer) :
            the TF-IDF model to transform questions and sentences
        
    returns:
        pd.DataFrame : the input dataframe with two additional columns, "tfidf_prediction" and "distance_value"

    refs:
        https://stackoverflow.com/questions/59223095/pandas-explode-index
        https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.norm.html
    """
    group_keys = ['question', 'context_paragraph', 'answer']

    ## One row per context sentence; 'c_sent_index' is the 0-based position of the sentence in its paragraph
    df_squad_exploded = df_squad.explode('context_sentences').reset_index().rename(columns={'index' : 'c_sent_index'})
    df_squad_exploded['c_sent_index'] = df_squad_exploded.groupby('c_sent_index').cumcount()

    ## Euclidean distance between each question and context sentence, computed on sparse matrices.
    ## Transforming the exploded question column repeats some work but keeps the rows aligned.
    q_vectors = tfidf_vectorizer.transform(df_squad_exploded['question'])
    c_vectors = tfidf_vectorizer.transform(df_squad_exploded['context_sentences'])
    df_squad_exploded['distances'] = sp.linalg.norm(q_vectors - c_vectors, axis = 1)

    ## Minimum distance per question/context/answer
    min_distances_df = df_squad_exploded.groupby(group_keys).agg(
        {'distances': 'min'}).reset_index().rename(columns={'distances': 'distance_value'})

    ## Among the sentences attaining the minimum distance, keep the smallest sentence index
    df_squad_exploded = df_squad_exploded.merge(min_distances_df, on = group_keys)
    is_min_dist = df_squad_exploded['distances'] == df_squad_exploded['distance_value']
    min_distances_with_idx_df = df_squad_exploded[is_min_dist].groupby(group_keys).agg(
        {'distance_value': 'min', 'c_sent_index': 'min'}).reset_index().rename(columns={'c_sent_index': 'tfidf_prediction'})

    ## Merge back to original DF
    df_squad = df_squad.merge(min_distances_with_idx_df, on = group_keys, how = 'left')
    
    return df_squad

Below is an example trained TF-IDF vectorizer that we can use. Since fitting it on the entire dataset takes a while, we will initialize and fit it in the global namespace.

In [8]:
tfidf_vectorizer = TfidfVectorizer(
    tokenizer = nltk.word_tokenize,
    stop_words = stopwords.words('english'),
    ngram_range = (1,2),
    max_df = 1.0,
    min_df = 10
)
tfidf_vectorizer.fit(df_squad["context_paragraph"].unique())



TfidfVectorizer(min_df=10, ngram_range=(1, 2),
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function word_tokenize at 0x0000015EA84E09D0>)

In [38]:
def test_get_tfidf_prediction():
    """Test on the first 10 rows"""
    df_tf_idf = get_tfidf_prediction(df_squad.head(10).copy(), tfidf_vectorizer)
    
    check_equal(df_tf_idf.shape, (10, 9))
    
    check_approx(df_tf_idf["distance_value"].tolist(), [
        1.4142135623730951, 1.4142135623730951, 1.118805210307003, 1.414213562373095,
        1.414213562373095, 1.0991965705062294, 1.2823059131390453, 1.1299491754496973,
        1.3217983153692023, 1.1364153055563095
    ])
    
    """Test on the whole dataset"""
    df_tf_idf = get_tfidf_prediction(df_squad.copy(), tfidf_vectorizer)
    
    check_equal(df_tf_idf.shape, (86821, 9))
    
    check_approx(df_tf_idf.tail(10)["distance_value"].tolist(), [
        1.2205482532513834, 1.1011579092263701, 1.23935316383152, 1.2848863907388077,
        1.1456258185112482, 1.2657059224645815, 1.4142135623730951, 1.2768968376533885,
        1.2666294346869091, 1.4142135623730951
    ])
    
    tfidf_accuracy = (df_tf_idf["tfidf_prediction"] == df_tf_idf["answer_sent_index"]).values.mean()
    assert 0.68 <= round(tfidf_accuracy, 2) <= 0.69, tfidf_accuracy
    print("All tests passed!")
    
%time test_get_tfidf_prediction()

All tests passed!
CPU times: total: 3min 30s
Wall time: 3min 33s


If your implementation is sufficiently optimized, you should expect the local test to finish in about 5 minutes or less on a `Standard_DC2 V3` Compute. If your code runs for more than 7 minutes, you should try to improve its efficiency.

We see an accuracy of about 0.68, so TF-IDF performs about the same as (in fact slightly below) the baseline Jaccard model, which reached 0.70.

Up until now we have used language models that rely only on word frequencies, without considering the meaning of the words themselves. Now we will address this shortcoming by considering the *word embedding* of each word in our corpus.  Roughly speaking, a word embedding is a vector representation of that word in some space $\mathbb{R}^k$. This representation differs from the TF-IDF transformation in two important ways:

1. If two words have similar meanings in some sense, their embeddings should be close in Euclidean distance. For example, we may expect the embedding of `"Pittsburgh"` to be closer to that of `"Chicago"` than to that of `"Pikachu"`, because the first two are city names while the third is a Pokémon.
1. The dimensionality of the vector $k$ is fixed and typically much smaller than the vocabulary size.

While these features sound promising, constructing word embeddings requires a very large amount of data (you need to see `"Pittsburgh"` and `"Chicago"` appear together in overlapping context enough times for the model to learn that they are similar). The algorithms to train word embedding models are unfortunately beyond the scope of this course, as they involve many machine learning theories we haven't covered.

That said, there are many powerful pre-trained models that we can use. These models have been trained on huge amounts of data and encode a lot of information about the meanings of the words. The first pre-trained model we will use is one from the [SentenceTransformer library](https://github.com/UKPLab/sentence-transformers#getting-started) called `'paraphrase-MiniLM-L3-v2'`. Once loaded, it can encode a collection of strings, yielding a matrix where each row is the vector embedding of one string.

Run the following cell to see the model in action. Note that the first time you do so, it may take some time to download the model. Also note that the embedding matrix is a dense NumPy array, rather than a sparse SciPy matrix like the one TF-IDF outputs.

In [9]:
sent_transformer = SentenceTransformer('paraphrase-MiniLM-L3-v2')
embedding = sent_transformer.encode([
    'This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.'
])
print(embedding.shape)
print(embedding)

(3, 384)
[[-0.3278151  -0.275003    0.30813015 ...  0.10130047  0.196514
  -0.288206  ]
 [-0.09794946 -0.04687394  0.33406436 ... -0.06500659  0.07290423
   0.12609741]
 [ 0.16880377 -0.2113928   0.47756457 ...  0.13764471  0.81041163
   0.17771332]]


### Question 9: Sent2vec Encoders
Here we will follow roughly the same procedure as in the previous question: first convert the questions and context sentences into vectors using `sent_transformer`, then identify the context sentence whose vector representation is closest in Euclidean distance to that of the input question. However, one important caveat is that encoding the questions and context sentences with `sent_transformer` takes a lot longer (as much as 10 times longer) than with `tfidf_vectorizer`, because this encoding is actually the forward pass through a pre-trained neural network. To address this issue, we recommend the following outline for your implementation:

1. Construct a mapping from each unique question / context sentence to its vector encoding.
1. Use this mapping to compute the vector representation of every question / context sentence in the dataset.
1. Compute the Euclidean distances between the questions and context sentences to identify the answer sentence index for each question.

The key idea is that there are some duplicate questions and many duplicate context sentences (since several questions share the same context paragraph), so you want to store the encoding of each unique question / context sentence to avoid recomputing it several times. This same idea also applies to Question 8, although repeated computations are not as big of an issue there since TF-IDF transformations are fast.
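
The caching idea can be sketched without loading any model. Below, `fake_encode` is a hypothetical stand-in for an expensive encoder (it is not part of the SentenceTransformer library), and `build_embedding_map` is an illustrative helper name; the point is only that each distinct string is encoded once, in a single batch.

```python
calls = []  # records every batch passed to the encoder

def fake_encode(texts):
    """Hypothetical stand-in for an expensive encoder: returns one vector per input string."""
    calls.append(list(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]

def build_embedding_map(texts, encode):
    """Encode each distinct string once, in a single batch, and return {string: vector}."""
    unique_texts = list(dict.fromkeys(texts))  # de-duplicate, preserving first-seen order
    vectors = encode(unique_texts)
    return dict(zip(unique_texts, vectors))

sentences = ["a b", "c", "a b", "a b", "c"]
embedding_map = build_embedding_map(sentences, fake_encode)

# Every occurrence is now a cheap dictionary lookup instead of a new encoder call.
vectors = [embedding_map[s] for s in sentences]
print(len(calls), len(calls[0]), len(vectors))  # 1 2 5
```

Five sentences are served by one encoder call over two unique strings; with a neural encoder, the saving grows with the amount of duplication in the data.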

Implement the function `get_sent2vec_prediction` that performs the above steps on the dataset `df_squad`. For every row, it stores the predicted answer sentence index in a new column `"sent2vec_prediction"`, and the corresponding smallest Euclidean distance value in a new column `"distance_value"`.

**Notes**:
* When you work with only the values of a Series (e.g., a dataframe column) and do not care about its name or index, calling the `.values` attribute to convert it to a NumPy array may provide some speed-up.
* If multiple context sentences have the same (smallest) distance value from the question, return the smallest sentence index.
* You should encode the entire set of unique questions / context sentences at once, rather than encoding each of them individually in a loop. Due to the encoder's internal implementation, you will get different embedding results if you encode each question / context sentence individually.
* Be careful when dealing with nested NumPy arrays. If you get a NumPy array that contains other NumPy arrays, you can use `np.stack` to turn it into a regular multi-dimensional array (otherwise it will be treated as a 1D array of pointers to other arrays).
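
The nested-array note can be illustrated with a small sketch. The vectors are made up; the object array mimics what you get from the `.values` of a dataframe column whose cells each hold an embedding.

```python
import numpy as np

# An object array whose elements are themselves arrays
nested = np.empty(3, dtype=object)
nested[0] = np.array([1.0, 0.0])
nested[1] = np.array([0.0, 1.0])
nested[2] = np.array([3.0, 4.0])
print(nested.shape)  # (3,)

# np.stack joins the inner arrays along a new first axis
matrix = np.stack(nested)
print(matrix.shape)  # (3, 2)

# With a regular 2D array, all row-wise Euclidean distances come from a single call
query = np.array([0.0, 0.0])
distances = np.linalg.norm(matrix - query, axis=1)
print(distances)     # [1. 1. 5.]

# np.argmin returns the first minimum, so ties go to the smallest index
print(int(np.argmin(distances)))  # 0
```

Computing the norm over the stacked matrix avoids a Python-level loop over rows, which is usually where most of the time goes in a row-wise `apply`.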

In [16]:
def get_sent2vec_prediction(df_squad, encoder):
    """
    Identify the answer sentence as one whose Sent2vec representation has minimal distance to that of the question.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        encoder (SentenceTransformer) :
            the Sent2vec encoder used to transform questions and sentences into their word embeddings
        
    returns: Tuple(question_embeddings, context_embeddings, df_sent2vec)
        question_embeddings (Dict[str, np.ndarray]) :
            a mapping between each unique question and its Sent2vec embedding
        context_embeddings (Dict[str, np.ndarray]) :
            a mapping between each unique context sentence and its Sent2vec embedding
        df_sent2vec (pd.DataFrame) :
            the input dataframe with two additional columns, "sent2vec_prediction" and "distance_value"
    """
    group_keys = ['question', 'context_paragraph', 'answer']

    ## Encode each unique question once, in a single batch
    questions = list(df_squad['question'].unique())
    q_vectors = encoder.encode(questions)
    question_embeddings = dict(zip(questions, q_vectors))
    
    ## One row per context sentence; 'c_sent_index' is the 0-based position of the sentence in its paragraph
    df_squad_exploded = df_squad.explode('context_sentences').reset_index().rename(columns={'index' : 'c_sent_index'})
    df_squad_exploded['c_sent_index'] = df_squad_exploded.groupby('c_sent_index').cumcount()

    ## Encode each unique context sentence once, in a single batch
    context_sentences = list(df_squad_exploded['context_sentences'].unique())
    c_sent_vectors = encoder.encode(context_sentences)
    context_embeddings = dict(zip(context_sentences, c_sent_vectors))

    ## Look up the embedding of every question / sentence, then compute all distances in one call
    q_matrix = np.stack([question_embeddings[q] for q in df_squad_exploded['question'].values])
    c_matrix = np.stack([context_embeddings[s] for s in df_squad_exploded['context_sentences'].values])
    df_squad_exploded['distances'] = np.linalg.norm(q_matrix - c_matrix, axis = 1).astype(float)
    
    ## Minimum distance per question/context/answer
    min_distances_df = df_squad_exploded.groupby(group_keys).agg(
        {'distances': 'min'}).reset_index().rename(columns={'distances': 'distance_value'})
    
    ## Among the sentences attaining the minimum distance, keep the smallest sentence index
    df_squad_exploded = df_squad_exploded.merge(min_distances_df, on = group_keys)
    is_min_dist = df_squad_exploded['distances'] == df_squad_exploded['distance_value']
    min_distances_with_idx_df = df_squad_exploded[is_min_dist].groupby(group_keys).agg(
        {'distance_value': 'min', 'c_sent_index': 'min'}).reset_index().rename(columns={'c_sent_index': 'sent2vec_prediction'})

    ## Merge back to original DF
    df_sent2vec = df_squad.merge(min_distances_with_idx_df, on = group_keys, how = 'left')
    
    return question_embeddings, context_embeddings, df_sent2vec

In [None]:
def test_get_sent2vec_prediction():
    """Test on the first 10 rows"""
    question_embeddings_map, context_embeddings_map, df_sent2vec = \
        get_sent2vec_prediction(df_squad.head(10).copy(), sent_transformer)
    
    question = 'When did Beyoncé rise to fame?'
    check_approx(
        question_embeddings_map[question][:10],
       [ 0.36320758,  0.18506967,  0.12518793, -0.19348083, -0.09297507,  0.06189617,
  0.08755877,  0.20996222, -0.21528678, -0.08113602]
    )
    
    question = 'In what city and state did Beyonce  grow up? '
    check_approx(
        question_embeddings_map[question][:10],
        [ 0.40505934,  0.02651624,  0.21662246, -0.14586538,  0.32407546, -0.3062249,
  0.10030355, 0.07935077, -0.4575257,  -0.23452175]
    )
    
    context = "Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child."
    check_approx(
        context_embeddings_map[context][:10],
        [ 0.1807038 , -0.01752569 , 0.15747224 , 0.24041094 ,-0.24826567,  0.10811564,
  0.0700346 , -0.0671778 , -0.02708031 , 0.0787482 ]
    )
    
    context = "Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time."
    check_approx(
        context_embeddings_map[context][:10],
        [ 0.26072213, -0.26842636,  0.05217046, -0.19852023, -0.02323298, -0.08129264,
  0.35738084 , 0.23436342, -0.07554979,  0.14573847]
    )
    
    check_approx(
        df_sent2vec["distance_value"].to_list(),
        [4.481517791748047, 4.241846561431885, 3.57340407371521, 4.166721820831299, 4.1860246658325195, 3.7791800498962402, 4.352610111236572, 4.356138229370117, 4.38789701461792, 4.027286529541016]
    )
    
    """Test on the full dataset"""
    question_embeddings_map, context_embeddings_map, df_sent2vec = \
        get_sent2vec_prediction(df_squad.copy(), sent_transformer)
    
    question = 'What is KMC an initialism of?'
    check_approx(
        question_embeddings_map[question][:10],
        [ 0.20711717,  0.34029195, -0.41746157,  0.05796434, -0.01994354, -0.07176114,
  0.04191927, -0.21609439,  0.11874234,  0.24417552]
    )
    
    question = 'In what year did Kathmandu create its initial international relationship?'
    check_approx(
        question_embeddings_map[question][:10],
        [-0.15354869,  0.0844518,  -0.02738676, -0.17458615, -0.259946,   -0.16857868,
 -0.16149712,  0.0111943,  -0.2832642,  -0.09731326]
    )
    
    context = "KMC's first international relationship was established in 1975 with the city of Eugene, Oregon, United States."
    check_approx(
        context_embeddings_map[context][:10],
        [-0.03115498, -0.12283143,  0.02119236, -0.05328825,  0.16859488,  0.00662427,
    -0.17930554, -0.12675692,  0.05834598,  0.2040695 ]
    )
    
    context = 'It was established in 1972 and started to impart medical education from 1978.'
    check_approx(
        context_embeddings_map[context][:10],
        [-0.01332329,  0.16401196,  0.18984318, -0.29964197, -0.06480616,  0.18221042,
 -0.15938419,  0.15202712, -0.06736959,  0.0288473 ]
    )
    check_approx(
        df_sent2vec["distance_value"].tail(10).to_list(),
        [3.17356538772583, 2.3664536476135254, 3.969325542449951, 3.6290855407714844, 3.3139748573303223, 3.7325403690338135, 4.797183990478516, 4.05021333694458, 3.853412389755249, 4.265787124633789]
    )
    
    sent2vec_accuracy = (df_sent2vec["sent2vec_prediction"] == df_sent2vec["answer_sent_index"]).mean()
    check_equal(round(sent2vec_accuracy, 2), 0.74)
    print("All tests passed!")
    
    print("Saving the embedding to pickle files for later use ...")
    with open("question_embeddings_map.pkl", "wb") as f1, open("context_embeddings_map.pkl", "wb") as f2:
        pickle.dump(question_embeddings_map, f1)
        pickle.dump(context_embeddings_map, f2)
    print("Done!")
        
    
test_get_sent2vec_prediction()

If your implementation is sufficiently optimized, the local test should finish in about 15 minutes or less on a `Standard_DS2 V3` CPU Compute.

## Part C: Supervised Models
So far we have explored unsupervised methods for answer extraction, which involve dividing the questions and contexts into tokens and projecting those tokens into a common representation space. You may have noticed that their performance wasn't particularly great, because we didn't perform any training on the dataset; instead, we only used predefined heuristics and pre-trained models. From this point on, we move to the supervised learning domain, where we make use of the ground-truth answers and build models that learn from them.

### Question 10: Preparing dataset for supervised learning
We will first consider a binary classification setting, where we are given a question and a context sentence, and need to predict whether this context sentence contains the answer. In this setting, we can get several training data points from each row in the original dataset `df_squad`. In particular, if a row in `df_squad` looks like the following:

|question|context_sentences|answer_sent_index|
|---|---|---|
|`q`|`[s0, s1, s2, s3]`|2|

then it contributes four data points:

|question|context_sentence|is_answer_sent|
|---|---|---|
|`q`|`s0`|0|
|`q`|`s1`|0|
|`q`|`s2`|1|
|`q`|`s3`|0|

More generally, a row in `df_squad` where the `context_sentences` list has `n` sentences will be transformed into `n` rows, one for each context sentence. Among these new rows, only the row at index `answer_sent_index` gets assigned the label 1, while the others get the label 0.
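
As a minimal sketch of this transformation on the toy row above (the dataframe `df_toy` and its contents are made up for illustration and are not taken from SQuAD), one possible approach combines `DataFrame.explode` with a per-row running counter. It assumes the input dataframe has unique row labels:

```python
import pandas as pd

# Hypothetical toy data: one question with four context sentences, answer at index 2.
df_toy = pd.DataFrame({
    "question": ["q"],
    "context_sentences": [["s0", "s1", "s2", "s3"]],
    "answer_sent_index": [2],
})

# One row per context sentence; the original row label is repeated for each sentence.
df_long = df_toy.explode("context_sentences").rename(
    columns={"context_sentences": "context_sentence"}
)

# Position of each sentence within its original row (0, 1, 2, ...).
sent_pos = df_long.groupby(level=0).cumcount()

# Label 1 only where the position matches the answer index.
df_long["is_answer_sent"] = (
    sent_pos.to_numpy() == df_long["answer_sent_index"].to_numpy()
).astype(int)

df_long = df_long[["question", "context_sentence", "is_answer_sent"]].reset_index(drop=True)
print(df_long)
```

Because `explode` emits the sentences of each row in list order, the original row ordering is preserved without any extra sorting.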

Implement the function `build_data_for_classification` that turns the original dataset `df_squad` into a dataframe with 3 columns -- `question`, `context_sentence` and `is_answer_sent` -- using the procedure specified above.

**Notes**:
* Keep in mind that the new dataframe has a column named `context_sentence`, **not** `context_sentences`.
* You should preserve the original row ordering in the input dataframe.

In [11]:
def build_data_for_classification(df_squad):
    """
    Convert the SQuAD dataset into a format where every row contains one question, one context sentence,
    and a flag that indicates whether the context sentence contains the answer.
    
    args:
        df_squad (pd.DataFrame) : a copy of the SQuAD dataset
        
    returns:
        pd.DataFrame : a new dataframe with 3 columns: question, context_sentence, is_answer_sent
    """
    # Explode the sentence lists so that each context sentence gets its own row;
    # after reset_index, the 'index' column identifies the originating row.
    df_squad_exploded = (
        df_squad.explode('context_sentences')
        .reset_index()
        .rename(columns={'index': 'c_sent_index', 'context_sentences': 'context_sentence'})
    )
    # Replace the originating-row identifier with the sentence's position within that row (0, 1, 2, ...)
    df_squad_exploded['c_sent_index'] = df_squad_exploded.groupby('c_sent_index').cumcount()
    # Label 1 only for the sentence whose position matches answer_sent_index
    df_squad_exploded['is_answer_sent'] = (
        df_squad_exploded['c_sent_index'] == df_squad_exploded['answer_sent_index']
    ).astype(int)
    df_squad_exploded = df_squad_exploded[['question', 'context_sentence', 'is_answer_sent']]
    return df_squad_exploded

In [12]:
def test_build_data_for_classification():
    """Test on 10 random rows"""
    df_sample = df_squad.sample(n = 10, random_state = 0).copy()
    df_formatted = build_data_for_classification(df_sample)
    
    assert df_formatted.shape == (48, 3), df_formatted.shape
    
    assert sorted(df_formatted.columns) == ['context_sentence', 'is_answer_sent', 'question']
    
    assert df_formatted['is_answer_sent'].tolist() == [
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 1, 0, 1, 0
    ], df_formatted['is_answer_sent'].tolist()
    
    # check that question orderings are preserved
    assert (df_formatted["question"].unique() == df_sample["question"].unique()).all()
    
    """Test on the full dataset"""
    df_formatted = build_data_for_classification(df_squad.copy())
    
    assert df_formatted.shape == (443259, 3), df_formatted.shape
    
    assert df_formatted['is_answer_sent'].sample(n = 40, random_state = 200).tolist() == [
        1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 1
    ], df_formatted['is_answer_sent'].sample(n = 40, random_state = 200).tolist()
    
    assert (df_formatted["question"].unique() == df_squad["question"].unique()).all()
    print("All tests passed!")
    
test_build_data_for_classification()

All tests passed!


Before building our supervised learning models, we will load the embeddings you created in Question 9 and set up the train set and test set.

In [9]:
question_embeddings_map = pd.read_pickle("question_embeddings_map.pkl")
context_embeddings_map = pd.read_pickle("context_embeddings_map.pkl")

In [13]:
df_squad_train, df_squad_test = train_test_split(df_squad, train_size = 0.8, random_state = 0)
df_train_formatted = build_data_for_classification(df_squad_train.reset_index())
df_test_formatted = build_data_for_classification(df_squad_test.reset_index())

### Question 11: Logistic Regression
Having set up the dataset for binary classification, we can now train a logistic regression model. While the binary labels are already set up, we still need to construct the feature vectors as follows. For every question `q` and context sentence `s`:
* Convert the question to its sentence embedding $x_q \in \mathbb{R}^k$, using the question embedding map from Question 9.
* Convert the context sentence to its sentence embedding $x_s \in \mathbb{R}^l$, using the context embedding map from Question 9.
* Concatenate these two vectors to get the input vector to the logistic regression model:

$$x_{q,s} = (x_{q1} \quad x_{q2} \quad \ldots \quad x_{qk} \quad x_{s1} \quad x_{s2} \quad \ldots \quad x_{sl})^\top \in \mathbb{R}^{k+l}.$$
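
A small sketch of this concatenation, using made-up embeddings with $k = 3$ and $l = 2$ (the maps `q_map` and `s_map` and the dataframe `df_pairs` are hypothetical stand-ins for the real embedding maps and formatted dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical embedding maps with k = 3 and l = 2.
q_map = {"q": np.array([1.0, 2.0, 3.0])}
s_map = {"s0": np.array([0.1, 0.2]), "s1": np.array([0.3, 0.4])}

df_pairs = pd.DataFrame({"question": ["q", "q"], "context_sentence": ["s0", "s1"]})

# Look up each embedding, then stack the per-row 1D arrays into 2D matrices.
Q = np.stack(df_pairs["question"].map(lambda q: q_map[q]).to_list())          # shape (2, 3)
S = np.stack(df_pairs["context_sentence"].map(lambda s: s_map[s]).to_list())  # shape (2, 2)

# Each row of X is x_{q,s}: the question embedding followed by the sentence embedding.
X = np.hstack((Q, S))  # shape (2, 5)
print(X.shape)
```

Stacking first and concatenating once with `np.hstack` avoids building each row in a Python loop, which matters when the formatted dataset has several hundred thousand rows.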

Implement the function `get_lr_prediction` that performs the following steps:

1. Construct the feature vector for each row of the train set `df_train_formatted`, using the above formula.
1. Use an Sklearn `StandardScaler` (with default parameters) to fit and transform the feature matrix of the train set `df_train_formatted`.
1. Train an Sklearn `LogisticRegression` model on the train set `df_train_formatted`.
1. Use this model to perform prediction on the test set `df_test_formatted`.
1. Return the trained LR model and its accuracy on the test set (i.e., the number of correct predictions divided by the test set size).

**Notes**:
* When creating a `LogisticRegression` model, you should set `random_state` to the input `seed` and `max_iter` to 1000. You do not need to specify any other parameter.
* Make sure you also standardize the feature matrix built from the test set before inputting it to the LR model for prediction.
* Similar to Question 9, if you get a NumPy array that contains other NumPy arrays, you can use `np.stack` to turn it into a normal multi-dimensional array (otherwise it will be treated as a 1D array of pointers to other arrays).
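
The standardization pattern from the notes above can be sketched on synthetic data as follows. The random matrices are stand-ins for the real feature matrices, and this sketch assumes the scaler fitted on the train set is reused for the test set, which is one common way to standardize test features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the train/test feature matrices and binary labels.
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_train = (X_train[:, 0] > 5.0).astype(int)
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 4))
y_test = (X_test[:, 0] > 5.0).astype(int)

# Fit the scaler on the train set only, then apply the same transform to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

model = LogisticRegression(random_state=0, max_iter=1000).fit(X_train_std, y_train)
accuracy = model.score(X_test_std, y_test)  # fraction of correct test predictions
print(accuracy)
```

Reusing the train-set mean and variance keeps the test features on the same scale the model was trained on, and avoids using test-set statistics during preprocessing.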

In [20]:
def get_lr_prediction(df_train_formatted, df_test_formatted, question_embeddings_map, context_embeddings_map, seed = 0):
    """
    Train and evaluate the performance of a binary logistic regression model to predict
    whether a context sentence contains the answer to a given question.
    
    args:
        df_train_formatted (pd.DataFrame) : the train set dataframe with 3 columns:
            question, context_sentence, is_answer_sent
        df_test_formatted (pd.DataFrame) : the test set dataframe with 3 columns:
            question, context_sentence, is_answer_sent
        question_embeddings_map (dict[str, np.ndarray]) : a mapping from question to sentence embedding
        context_embeddings_map (dict[str, np.ndarray]) : a mapping from context sentence to sentence embedding
        seed (int) : the random seed passed to LogisticRegression as random_state
        
    return: Tuple(trained_model, accuracy)
        trained_model (sklearn.linear_model.LogisticRegression) : the LR model trained on the train set
        accuracy (float) : the accuracy score of the trained model on the test set
    """
    def build_feature_matrix(df, question_embeddings_map, context_embeddings_map):
        # Look up the embedding of every question / context sentence and stack them into 2D arrays
        question_features = np.stack(df['question'].progress_map(lambda x: question_embeddings_map[x]))
        context_features = np.stack(df['context_sentence'].progress_map(lambda x: context_embeddings_map[x]))
        # Concatenate column-wise: each row is [question embedding, context embedding]
        return np.hstack((question_features, context_features))
    
    # Fit the scaler on the train features only, then train the LR model on the standardized features
    X_train = build_feature_matrix(df_train_formatted, question_embeddings_map, context_embeddings_map)
    y_train = df_train_formatted['is_answer_sent'].values
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(random_state = seed, max_iter = 1000).fit(scaler.transform(X_train), y_train)

    # Standardize the test features with the scaler fitted on the train set, then score
    X_test = build_feature_matrix(df_test_formatted, question_embeddings_map, context_embeddings_map)
    y_test = df_test_formatted['is_answer_sent'].values
    accuracy = model.score(scaler.transform(X_test), y_test)
    
    return model, accuracy

In [None]:
def test_get_lr_prediction():
    "Test on the first 1000 rows of the dataset"
    df_squad_train_mini, df_squad_test_mini = train_test_split(df_squad.head(1000), train_size = 0.8, random_state = 0)
    df_train_formatted_mini = build_data_for_classification(df_squad_train_mini.reset_index())
    df_test_formatted_mini = build_data_for_classification(df_squad_test_mini.reset_index())
    lr_mini, acc_mini = get_lr_prediction(
        df_train_formatted_mini, df_test_formatted_mini,
        question_embeddings_map, context_embeddings_map
    )
    assert lr_mini.coef_.flatten()[:10].round(2).tolist() == [
        0.07, -0.01, -0.03, -0.03, -0.01, 0.01, -0.03, 0.0, 0.04, -0.04
    ]
    assert lr_mini.intercept_.round(2)[0] == -2.73
    assert round(acc_mini, 2) ==  0.81
    
    """Test on the entire dataset"""
    lr, acc = get_lr_prediction(df_train_formatted, df_test_formatted, question_embeddings_map, context_embeddings_map)
    assert lr.coef_.flatten()[:10].round(2).tolist() == [
        -0.04, 0.0, 0.01, 0.0, -0.0, 0.0, -0.01, -0.0, -0.03, 0.01
    ]
    assert lr.intercept_.round(2)[0] == -1.54
    assert round(acc, 2) == 0.80
    
    print("All tests passed!")
    
test_get_lr_prediction()

We obtain about 80% accuracy in this binary classification task, which is not too bad! However, this accuracy cannot be compared directly with those from previous questions, because it is computed per (question, context sentence) pair in `df_test_formatted` rather than per question. Keep in mind also that the labels are imbalanced: each question contributes exactly one positive row, so a classifier that always predicts 0 is already correct on every negative row. If we were only interested in whether the ground-truth answer sentences are correctly detected, we would evaluate the accuracy on the original test set `df_squad_test` instead of the formatted one.
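
One possible way to carry out such a per-question evaluation is sketched below, assuming that each question's predicted answer is the context sentence with the highest predicted probability. The dataframe `df_scores`, its `question_id` and `prob_answer` columns, and the probability values are all hypothetical; in practice the probabilities could come from `predict_proba` on the trained model.

```python
import pandas as pd

# Hypothetical per-sentence scores for two questions.
df_scores = pd.DataFrame({
    "question_id":    [0, 0, 0, 1, 1],
    "sent_index":     [0, 1, 2, 0, 1],
    "is_answer_sent": [0, 0, 1, 1, 0],
    "prob_answer":    [0.10, 0.30, 0.60, 0.20, 0.40],
})

# Rows holding the highest-scoring sentence within each question.
best_rows = df_scores.loc[df_scores.groupby("question_id")["prob_answer"].idxmax()]

# A question counts as correct when its top-scoring sentence is the true answer sentence.
question_accuracy = best_rows["is_answer_sent"].mean()

# Row-level accuracy at a 0.5 threshold, for comparison.
row_pred = (df_scores["prob_answer"] >= 0.5).astype(int)
row_accuracy = (row_pred == df_scores["is_answer_sent"]).mean()

print(question_accuracy, row_accuracy)
```

In this toy example the row-level accuracy is 0.8 while only one of the two questions is answered correctly, which illustrates why the two numbers should not be compared directly.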