# L4: Word embeddings

In this lab you will explore word embeddings. A **word embedding** is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of their distributional properties. You will use word embeddings to find similar words, and evaluate their usefulness in an inference task.

You will use the word vectors that come with [spaCy](http://spacy.io). Note that you will need the &lsquo;large&rsquo; English language model; the &lsquo;small&rsquo; model that you used in previous labs does not include proper word vectors.

In [2]:
import spacy

nlp = spacy.load("en_core_web_lg")

Every word in the model&rsquo;s vocabulary comes with a 300-dimensional vector, represented as a NumPy array. The following code cell shows how to access the vector for the word *cheese*:

In [3]:
nlp.vocab["cheese"].vector

array([-5.5252e-01,  1.8894e-01,  6.8737e-01, -1.9789e-01,  7.0575e-02,
        1.0075e+00,  5.1789e-02, -1.5603e-01,  3.1941e-01,  1.1702e+00,
       -4.7248e-01,  4.2867e-01, -4.2025e-01,  2.4803e-01,  6.8194e-01,
       -6.7488e-01,  9.2401e-02,  1.3089e+00, -3.6278e-02,  2.0098e-01,
        7.6005e-01, -6.6718e-02, -7.7794e-02,  2.3844e-01, -2.4351e-01,
       -5.4164e-01, -3.3540e-01,  2.9805e-01,  3.5269e-01, -8.0594e-01,
       -4.3611e-01,  6.1535e-01,  3.4212e-01, -3.3603e-01,  3.3282e-01,
        3.8065e-01,  5.7427e-02,  9.9918e-02,  1.2525e-01,  1.1039e+00,
        3.6678e-02,  3.0490e-01, -1.4942e-01,  3.2912e-01,  2.3300e-01,
        4.3395e-01,  1.5666e-01,  2.2778e-01, -2.5830e-02,  2.4334e-01,
       -5.8136e-02, -1.3486e-01,  2.4521e-01, -3.3459e-01,  4.2839e-01,
       -4.8181e-01,  1.3403e-01,  2.6049e-01,  8.9933e-02, -9.3770e-02,
        3.7672e-01, -2.9558e-02,  4.3841e-01,  6.1212e-01, -2.5720e-01,
       -7.8506e-01,  2.3880e-01,  1.3399e-01, -7.9315e-02,  7.05

## Problem 1: Finding similar words

Your first task is to use the word embeddings to find similar words. More specifically, we ask you to write a function `most_similar` that takes a vector $x$ and returns a list with the 10 most similar entries in spaCy&rsquo;s vocabulary, with similarity being defined by cosine.

**Tip:** spaCy already has a [`most_similar`](https://spacy.io/api/vectors#most_similar) method that you can wrap.
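
As a reminder, the cosine similarity of two vectors is their dot product divided by the product of their norms. The following sketch computes it with NumPy on toy vectors; `cosine_similarity` is an illustrative helper and not part of spaCy.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (undefined if either is the zero vector)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, so similarity is 1
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a, so similarity is 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because the norm appears in the denominator, a zero vector has no defined cosine similarity, which is why entries without a proper vector need special care.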

In [4]:
# TODO: Enter your implementation of `most_similar` here
from scipy.spatial.distance import cosine

# Suppress warnings: spaCy's most_similar() emits a warning on every call.
# The warning is caused by the zero vector of the <OOV> token.
# Note: This issue is fixed in spaCy 2.2.2 (https://github.com/explosion/spaCy/issues/3412)
import warnings
warnings.filterwarnings('ignore')

def most_similar(query_vector, n=10):
    # Request n+1 neighbours: the <OOV> token (zero vector, similarity NaN) is dropped,
    # so asking for n neighbours would yield only n-1 results.
    similar_keys = nlp.vocab.vectors.most_similar(query_vector.reshape([1, -1]), n=n+1)[0][0]
    # Map the hash keys back to lexemes (avoid shadowing the built-in `hash`).
    similar_words = [nlp.vocab[key] for key in similar_keys]
    return similar_words

Test your implementation by running the following code cell, which will print the 10 most similar words for the word *cheese*:

In [5]:
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import pandas as pd
import seaborn as sns
from bokeh.palettes import inferno, plasma, viridis

In [6]:
print(" ".join(w.text for w in most_similar(nlp.vocab["cheese"].vector)))

cheese CHEESE Cheese Cheddar CHEDDAR cheddar Bacon BACON bacon Cheeses


You should get the following output:

Once you have a working implementation of `most_similar`, use it to think about in what sense the returned words really are &lsquo;similar&rsquo; to the cue word. Try to find examples where the cue word and at least one of the words returned by `most_similar` are in the following semantic relations:

1. synonymy (exchangeable meanings)
2. antonymy (opposite meanings)
3. hyperonymy/hyponymy (more general/more specific meanings)

Document your examples in the code cell below.

After some experimentation, we found the following examples for the semantic relations listed above.

#### Synonymy - Words with exchangeable meaning

In [7]:
def print_similar_words(query_words):
    for word in query_words:
        print("{} : {}".format(word, " | ".join(w.text for w in most_similar(nlp.vocab[word].vector))))
        print("------------------------------------------------------------------")

In [8]:
# Synonymy
print_similar_words(["nice", "sadness", "beautiful"])

nice : Nice | NIce | nice | nICE | NICE | Good | gOOD | GOOD | good | GOod
------------------------------------------------------------------
sadness : SADNESS | sadness | Sadness | Sorrow | SORROW | sorrow | DESPAIR | despair | Despair | GRIEF
------------------------------------------------------------------
beautiful : beautiful | BEAUTIFUL | Beautiful | GORGEOUS | Gorgeous | gorgeous | lovely | Lovely | LOVELY | Stunning
------------------------------------------------------------------


#### Antonymy - Words with opposite meaning

In [9]:
# Antonymy
print_similar_words(["positive", "advantages", "pros"])

positive : positive | Positive | POSITIVE | negative | NEGATIVE | Negative | POSTIVE | postive | Postive | positivity
------------------------------------------------------------------
advantages : ADVANTAGES | advantages | Advantages | Disadvantages | DISADVANTAGES | disadvantages | drawbacks | DRAWBACKS | Drawbacks | Advantage
------------------------------------------------------------------
pros : PROs | Pros | pros | PROS | CONs | CONS | Cons | cons | PRO | prO
------------------------------------------------------------------


#### Hyperonymy - The query word is a general class; returned words are specific types of it

In [10]:
# Hyperonymy
print_similar_words(["wool", "metal", "literature", "footwear"])

wool : wool | WOOL | Wool | cashmere | Cashmere | CASHMERE | merino | MERINO | Merino | woolen
------------------------------------------------------------------
metal : Metal | METAL | metal | Steel | steel | STEEL | Aluminum | ALUMINUM | aluminum | IRON
------------------------------------------------------------------
literature : LITERATURE | Literature | literature | LITERARY | Literary | literary | poetry | Poetry | POETRY | scholarly
------------------------------------------------------------------
footwear : Footwear | footwear | FOOTWEAR | Shoes | shoes | SHOES | shoe | Shoe | SHOE | Sneakers
------------------------------------------------------------------


#### Hyponymy - The query word is a specific type; returned words include its general class

In [11]:
# Hyponymy
print_similar_words(["shoe", "arachnophobia", "elk"])

shoe : shoe | SHOE | Shoe | shoes | SHOES | Shoes | Footwear | footwear | FOOTWEAR | SNEAKER
------------------------------------------------------------------
arachnophobia : Arachnophobia | arachnophobia | acrophobia | ACROPHOBIA | Acrophobia | agoraphobia | Agoraphobia | Phobia | phobia | PHOBIA
------------------------------------------------------------------
elk : Elk | elk | ELK | deer | Deer | DEER | Moose | MOOSE | moose | Bison
------------------------------------------------------------------


## Problem 2: Plotting similar words

Your next task is to visualize the word embedding space by a plot. To do so, you will have to reduce the dimensionality of the space from 300 to 2&nbsp;dimensions. One suitable algorithm for this is [T-distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) (TSNE), which is implemented in scikit-learn&rsquo;s [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) class.

Write a function `plot_most_similar` that takes a list of words (lexemes) and does the following:

1. For each word in the list, find the most similar words (lexemes) in the spaCy vocabulary.
2. Compute the TSNE transformation of the corresponding vectors to 2&nbsp;dimensions.
3. Produce a scatter plot of the transformed vectors, with the vectors as points and the corresponding word forms as labels.
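
The dimensionality reduction in step 2 can be sketched in isolation as follows. This is a minimal example on random stand-in data (not spaCy vectors); the values chosen for `perplexity`, `init` and `random_state` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for word vectors: 30 random points in 300 dimensions.
vectors = rng.normal(size=(30, 300))

# The perplexity must be smaller than the number of points.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
points_2d = tsne.fit_transform(vectors)

print(points_2d.shape)  # (30, 2)
```

Each row of `points_2d` holds the x and y coordinates of one word and can be passed directly to a scatter plot. Without a fixed `random_state`, the layout changes from run to run.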

In [12]:
# TODO: Write code here to plot the most similar words
def plot_most_similar(target):
    
    unique_ref = [ref.text for ref in target]
    similar_lexemes = [most_similar(nlp.vocab[x].vector) for x in unique_ref]
    array_lex = np.array([w.vector for list_lex in similar_lexemes for w in list_lex])
    ref_words = np.concatenate([np.repeat(ref, len(similar_lexemes[0])) for ref in unique_ref])
    df_plot = pd.DataFrame(TSNE(n_components=2).fit_transform(array_lex))
    df_plot[2] = [x.text for list_lex in similar_lexemes for x in list_lex]
    df_plot.columns = ["Dim1", "Dim2", "Word"]
    df_plot[3] = np.concatenate([np.repeat(x, len(similar_lexemes[0])) for x in viridis(len(similar_lexemes))])
    df_plot[4] = ["Coordinates: (" + str(round(df_plot["Dim1"][i], 4)) + "; " + str(round(df_plot["Dim2"][i], 4))
                  + ")<br>Word: " + df_plot["Word"][i] + "<br>Reference: " + ref_words[i] 
                  for i in range(df_plot.shape[0])]
    df_plot.columns = ["Dim1", "Dim2", "Word", "Color", "Hover"]
    
    
    # Plot with plotly
    fig = go.Figure()

    for i in range(len(unique_ref)):
        df_temp = df_plot[(i*len(similar_lexemes[0])):(i*len(similar_lexemes[0])+len(similar_lexemes[0]))]
        
        fig.add_trace(go.Scatter(
            x=df_temp["Dim1"],
            y=df_temp["Dim2"],
            mode="markers+text",
            text=df_temp["Word"],
            textposition="bottom center",
            textfont=dict(
                family="sans serif",
                size=14,
                color="steelblue"
            ),
            hovertext = df_temp["Hover"],
            hoverinfo="text",
            marker=dict(
                size=16,
                cmax=39,
                cmin=0,
                color=df_temp["Color"],
                colorscale="Viridis"
            ),
            name=unique_ref[i],
            showlegend=True
        ))

    fig.update_xaxes(zeroline=True, zerolinewidth=1, zerolinecolor='black')
    fig.update_yaxes(zeroline=True, zerolinewidth=1, zerolinecolor='black')
    fig.update_layout(
        title = "2D TSNE of the sample words and their top 10 most similar",
        xaxis_title="Dimension 1",
        yaxis_title="Dimension 2",
        font=dict(
            family="Courier New, monospace",
            size=18,
            color="#7f7f7f"
        )
    )

    fig.show()

Test your code by running the following cell:

In [13]:
plot_most_similar(nlp.vocab[w] for w in ["cheese", "goat", "sweden", "university", "computer"])

## Problem 3: Analogies

In a **word analogy task** you are given three words $x$, $y$, $z$ and have to predict a word $w$ that has the same semantic relation to $z$ as $y$ has to $x$. One example is *man*, *woman*, *brother*, the expected answer being *sister* (the semantic relation is *male*/*female*).

[Mikolov et al. (2013)](http://www.aclweb.org/anthology/N13-1090) have shown that word analogy tasks can be solved by adding and subtracting word vectors in a word embedding: the vector for *sister* is the closest vector (in terms of cosine distance) to the vector *brother* $-$ *man* $+$ *woman*. Your next task is to write a function `analogy` that takes in three words (say *brother*, *man*, *woman*) and predicts the word that completes the analogy (in this case, *sister*).
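
The vector arithmetic can be illustrated on a toy vocabulary. The three-dimensional vectors below are hand-picked for the example (with axes loosely meaning male, female, sibling) and are not taken from spaCy; `nearest` is a hypothetical helper.

```python
import numpy as np

# Toy "embeddings" with axes (male, female, sibling); illustrative values only.
toy = {
    "man":     np.array([1.0, 0.0, 0.0]),
    "woman":   np.array([0.0, 1.0, 0.0]),
    "brother": np.array([1.0, 0.0, 1.0]),
    "sister":  np.array([0.0, 1.0, 1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(target):
    """Return the toy word whose vector has the highest cosine similarity to target."""
    return max(toy, key=lambda w: cosine(toy[w], target))

target = toy["brother"] - toy["man"] + toy["woman"]
print(nearest(target))  # sister
```

Subtracting *man* removes the "male" component of *brother*, and adding *woman* puts the "female" component in its place, which lands exactly on the toy vector for *sister*.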

In [14]:
# TODO: Enter code here to solve the analogy problem
def analogy(x, y, z):

    # The arguments are passed as (x, y, z) = (brother, man, woman), so the target
    # vector from the task description, brother - man + woman, is x - y + z.
    # (y - x + z would compute man - brother + woman, which is not the intended
    # offset and returns words similar to z.)
    w_ideal = x.vector - y.vector + z.vector

    # This exact vector is unlikely to exist in the vocabulary, so we look up the
    # entry with the highest cosine similarity to it.
    w = most_similar(w_ideal, n=1)[0]

    return w

def print_analogy(x, y, z, expected = False):
    if bool(expected):
        print("Analogy for: ", x, " - ", y, " + ", z, " (expected ", expected, ")", sep = "")
    else:
        print("Analogy for:", x, "-", y, "+", z)
    print(analogy(nlp.vocab[x], nlp.vocab[y], nlp.vocab[z]).text)
    print("--------------------------------------------------------------------------------")

Test your code by running the following code. You should get *sister*.

In [15]:
print_analogy("brother", "man", "woman", expected = "sister")

Analogy for: brother - man + woman (expected sister)
sister
--------------------------------------------------------------------------------


You should also be able to get the following:

* *Stockholm* $-$ *Sweden* $+$ *Germany* $=$ *Berlin*
* *Swedish* $-$ *Sweden* $+$ *France* $=$ *French*
* *better* $-$ *good* $+$ *bad* $=$ *worse*
* *walked* $-$ *walk* $+$ *take* $=$ *took*

Experiment with other examples to see whether you get the expected output. Provide three examples of analogies for which the model produces the &lsquo;correct&rsquo; answer, and three examples on which the model &lsquo;failed&rsquo;. Based on your theoretical understanding of word embeddings, do you have a hypothesis as to why the model succeeds/fails in completing the analogy? Discuss this question in a short text.

#### Given analogies

In [16]:
print_analogy("Stockholm", "Sweden", "Germany")
print_analogy("Swedish", "Sweden", "France")
print_analogy("better", "good", "bad")
print_analogy("walked", "walk", "take")

Analogy for: Stockholm - Sweden + Germany
Berlin
--------------------------------------------------------------------------------
Analogy for: Swedish - Sweden + France
french
--------------------------------------------------------------------------------
Analogy for: better - good + bad
Worse
--------------------------------------------------------------------------------
Analogy for: walked - walk + take
took
--------------------------------------------------------------------------------


#### Experimental analogies - Successful examples

In [17]:
print_analogy("Sweden", "Swedish", "Spanish", expected = "Spain")
print_analogy("Swedish", "Sweden", "Italy", expected = "Italian")
print_analogy("feel", "felt", "made", expected = "make")

Analogy for: Sweden - Swedish + Spanish (expected Spain)
Spain
--------------------------------------------------------------------------------
Analogy for: Swedish - Sweden + Italy (expected Italian)
italian
--------------------------------------------------------------------------------
Analogy for: feel - felt + made (expected make)
MAKE
--------------------------------------------------------------------------------


#### Experimental analogies - Failed examples

In [18]:
print_analogy("Sweden", "Swedish", "Norway", expected = "Norwegian")
print_analogy("mocassin", "shoe", "trousers", expected = "jeans or similar")
print_analogy("hand", "arm", "leg", expected = "feet/foot")

Analogy for: Sweden - Swedish + Norway (expected Norwegian)
Norway
--------------------------------------------------------------------------------
Analogy for: mocassin - shoe + trousers (expected jeans or similar)
trousers
--------------------------------------------------------------------------------
Analogy for: hand - arm + leg (expected feet/foot)
leg
--------------------------------------------------------------------------------


**ANSWER:**  
As we can see, the effectiveness of the word embeddings on this task is limited: only some of the examples work, and success does not depend on the semantic field alone (we have both successful and failing examples among countries and nationalities, for instance).  
The failing examples show another important limitation of the approach: it fails on analogies without a unique answer (as in the trousers example) and on analogies whose answer has several commonly used variants (like feet/foot). The successful examples that we were able to find all come from the semantic fields already used in this lab.  
We hypothesize that one reason for this lack of precision is that the vectors *x* and *y* can largely cancel each other out, because the two are often very similar. The result then stays close to *z*, and indeed we often get back *z* itself. Another point to take into account is that our implementation does not exclude the words *x*, *y* and *z* themselves from the pool of candidate answers.
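
The effect of not excluding the input words can be reproduced on a toy vocabulary. The vectors below are hand-picked so that *hand* and *arm* nearly cancel out; they are not spaCy vectors, and `toy_analogy` with its `exclude_inputs` flag is a hypothetical helper.

```python
import numpy as np

# Hand-picked toy vectors (not spaCy's): "hand" differs from "arm" only slightly.
toy = {
    "arm":   np.array([1.0, 0.0, 0.0]),
    "hand":  np.array([1.0, 0.0, 0.1]),
    "leg":   np.array([0.0, 1.0, 0.0]),
    "foot":  np.array([0.0, 1.0, 0.5]),
    "table": np.array([0.5, 0.5, -1.0]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_analogy(x, y, z, exclude_inputs):
    """Nearest toy word to x - y + z, optionally skipping x, y and z themselves."""
    target = toy[x] - toy[y] + toy[z]
    candidates = [w for w in toy if not (exclude_inputs and w in (x, y, z))]
    return max(candidates, key=lambda w: cosine(toy[w], target))

print(toy_analogy("hand", "arm", "leg", exclude_inputs=False))  # leg
print(toy_analogy("hand", "arm", "leg", exclude_inputs=True))   # foot
```

Because *hand* $-$ *arm* is a very short vector, the target barely moves away from *leg*, so *leg* is its own nearest neighbour unless the input words are removed from the candidates.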

## Natural language inference dataset

In the second part of this lab, you will be evaluating the usefulness of word embeddings in the context of a natural language inference task. The data for this part is the [SNLI corpus](https://nlp.stanford.edu/projects/snli/), a collection of 570k human-written English image caption pairs manually labeled with the labels *Entailment*, *Contradiction*, and *Neutral*. Consider the following sentence pair as an example:

* Sentence 1: A soccer game with multiple males playing.
* Sentence 2: Some men are playing a sport.

This pair is labeled with *Entailment*, because sentence&nbsp;2 is logically entailed (implied) by sentence&nbsp;1 – if sentence&nbsp;1 is true, then sentence&nbsp;2 is true, too. The following sentence pair, on the other hand, is labeled with *Contradiction*, because both sentences cannot be true at the same time.

* Sentence 1: A black race car starts up in front of a crowd of people.
* Sentence 2: A man is driving down a lonely road.

For detailed information about the corpus, refer to [Bowman et al. (2015)](https://www.aclweb.org/anthology/D15-1075/). For this lab, we load the training portion and the development portion of the dataset.

**Note:** Because the SNLI corpus is rather big, we initially only load a small portion (25,000 samples) of the training data. Once you have working code for Problems&nbsp;4–6, you should set the flag `final_evaluation` to `True` and re-run all cells with the full dataset.

In [25]:
import bz2
import pandas as pd

final_evaluation = True    # TODO: Set to True for the final evaluation!

with bz2.open("../input/db-text-mining-lab4/train.jsonl.bz2", 'rt') as source:
    if final_evaluation:
        df_train = pd.read_json(source, lines=True)
    else:
        df_train = pd.read_json(source, lines=True)[:25000]
    print("Number of sentence pairs in the training data:", len(df_train))

with bz2.open("../input/db-text-mining-lab4/dev.jsonl.bz2", 'rt') as source:
    df_dev = pd.read_json(source, lines=True)
    print("Number of sentence pairs in the development data:", len(df_dev))

Number of sentence pairs in the training data: 549367
Number of sentence pairs in the development data: 9842


When you inspect the data frames, you will see that we have preprocessed the sentences and separated tokens by spaces. In the columns `tags1` and `tags2`, we have added the part-of-speech tags for every token (as predicted by spaCy), also separated by spaces.

In [26]:
df_train.head()

Unnamed: 0,gold_label,sentence1,tags1,sentence2,tags2
0,neutral,A person on a horse jumps over a broken down a...,DET NOUN ADP DET NOUN VERB ADP DET ADJ ADP NOU...,A person is training his horse for a competiti...,DET NOUN AUX VERB PRON NOUN ADP DET NOUN PUNCT
1,contradiction,A person on a horse jumps over a broken down a...,DET NOUN ADP DET NOUN VERB ADP DET ADJ ADP NOU...,"A person is at a diner , ordering an omelette .",DET NOUN AUX ADP DET NOUN PUNCT VERB DET NOUN ...
2,entailment,A person on a horse jumps over a broken down a...,DET NOUN ADP DET NOUN VERB ADP DET ADJ ADP NOU...,"A person is outdoors , on a horse .",DET NOUN AUX ADV PUNCT ADP DET NOUN PUNCT
3,neutral,Children smiling and waving at camera,NOUN VERB CCONJ VERB ADP NOUN,They are smiling at their parents,PRON AUX VERB ADP PRON NOUN
4,entailment,Children smiling and waving at camera,NOUN VERB CCONJ VERB ADP NOUN,There are children present,PRON AUX NOUN ADJ


## Problem 4: Two simple baselines

Your first task is to establish two simple baselines for the natural language inference task.

### Random baseline

One drawback with the Most Frequent Class (MFC) baseline is that it does not yield well-defined precision and recall values for all classes. Here we therefore ask you to implement a classifier that generates *random* predictions, where the probability of a class is determined by its relative frequency in the training data. This functionality is provided by scikit-learn&rsquo;s [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html). Write code to evaluate the performance of this classifier on the development data.
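
To make the idea concrete, here is a minimal NumPy sketch of such a frequency-weighted random classifier on toy labels. The helper names are hypothetical; in scikit-learn the same behaviour corresponds to `DummyClassifier(strategy="stratified")`.

```python
from collections import Counter

import numpy as np

def fit_class_distribution(labels):
    """Return the sorted class names and their relative frequencies."""
    counts = Counter(labels)
    classes = sorted(counts)
    probs = np.array([counts[c] for c in classes], dtype=float)
    return classes, probs / probs.sum()

def predict_random(classes, probs, n, seed=0):
    """Draw n labels at random, each class with its training frequency."""
    rng = np.random.default_rng(seed)
    return rng.choice(classes, size=n, p=probs)

train_labels = ["entailment"] * 50 + ["neutral"] * 30 + ["contradiction"] * 20
classes, probs = fit_class_distribution(train_labels)
preds = predict_random(classes, probs, n=10_000)

# If the evaluation data has the same class distribution, the expected accuracy
# is the sum of the squared class probabilities.
expected_accuracy = float(np.sum(probs ** 2))
print(classes, probs, expected_accuracy)
```

With three equally frequent classes, as is roughly the case in SNLI, the expected accuracy of this baseline is about one third.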

In [27]:
# Imports
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import clone
from sklearn.base import BaseEstimator, TransformerMixin
import itertools
from sklearn.neural_network import MLPClassifier
import joblib  # sklearn.externals.joblib is no longer available in current scikit-learn





In [28]:
# Function to get predictions from given model for train and test data
def get_predictions(model, datasets):
    return (model.predict(dataset) for dataset in datasets)

In [29]:
# TODO: Enter code here to implement the random baseline. Print the classification report.
np.random.seed(10)
# strategy="stratified" draws random labels according to the class frequencies in the training data
random_baseline = DummyClassifier(strategy="stratified").fit(df_train.loc[:, df_train.columns != "gold_label"],
                                                             df_train["gold_label"])
train_preds, dev_preds = get_predictions(random_baseline, [df_train.loc[:, df_train.columns != "gold_label"],
                                                           df_dev.loc[:, df_dev.columns != "gold_label"]])

In [30]:
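# Note: classification_report expects (y_true, y_pred). The arguments are passed in the
# opposite order here, so precision and recall are swapped in the table below and
# "support" counts predicted rather than gold labels.
pd.DataFrame(classification_report(train_preds, df_train["gold_label"], output_dict=True))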

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.334418,0.333101,0.332494,0.333338,0.333338,0.33334
recall,0.333029,0.333951,0.333036,0.333338,0.333338,0.333338
f1-score,0.333722,0.333525,0.332765,0.333338,0.333337,0.333339
support,183951.0,182949.0,182467.0,0.333338,549367.0,549367.0


In [31]:
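# Note: as above, the arguments are in (y_pred, y_true) order, so precision and recall are swapped.
pd.DataFrame(classification_report(dev_preds, df_dev["gold_label"], output_dict=True))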

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.350824,0.343346,0.340649,0.34495,0.34494,0.344969
recall,0.347642,0.347839,0.339286,0.34495,0.344922,0.34495
f1-score,0.349226,0.345578,0.339966,0.34495,0.344923,0.344952
support,3308.0,3286.0,3248.0,0.34495,9842.0,9842.0


For the DummyClassifier, we got a training accuracy of 33.3% and a development accuracy of 34.5%.
This makes sense, as we have 3 classes and the data is almost evenly distributed across them.
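
When reading these reports, keep in mind that `classification_report` expects the gold labels first and the predictions second; passing them the other way round interchanges precision and recall. The following hand computation on toy labels shows the effect (`precision_recall` is an illustrative helper, not a scikit-learn function).

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from their definitions."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    n_predicted = sum(p == label for p in y_pred)
    n_actual = sum(t == label for t in y_true)
    return true_pos / n_predicted, true_pos / n_actual

gold = ["a", "a", "a", "b", "b", "b"]
pred = ["a", "a", "a", "a", "b", "b"]

print(precision_recall(gold, pred, "a"))  # (0.75, 1.0)
print(precision_recall(pred, gold, "a"))  # (1.0, 0.75)
```

Accuracy is unaffected by the argument order, but the per-class precision, recall and support values are.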

### One-sided baseline

A second obvious baseline for the inference task is to predict the class label of a sentence pair based on the text of only one of the two sentences, just as in a standard document classification task. Put together a simple [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) + [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline that implements this idea, train it, and evaluate it on the development data. Is it better to base predictions on sentence&nbsp;1 or sentence&nbsp;2?

In [32]:
# TODO: Enter code here to implement the one-sentence baselines. Print the classification reports.
cnt_vectorizer = CountVectorizer(ngram_range=(1,2), min_df=4, max_df=0.9)
logreg = LogisticRegression(solver="saga", multi_class="ovr", max_iter=50)

logcnt_pipeline = Pipeline([("cnt_vectorizer", cnt_vectorizer), ("logreg", logreg)])

np.random.seed(10)
logcnt_sent1_model = clone(logcnt_pipeline).fit(df_train["sentence1"], df_train["gold_label"])

np.random.seed(10)
logcnt_sent2_model = clone(logcnt_pipeline).fit(df_train["sentence2"], df_train["gold_label"])

In [33]:
# Get predictions for train and test data from model with sentence1
train_preds_1, test_preds_1 = get_predictions(logcnt_sent1_model, [df_train["sentence1"], df_dev["sentence1"]])

# Get predictions for train and test data from model with sentence2
train_preds_2, test_preds_2 = get_predictions(logcnt_sent2_model, [df_train["sentence2"], df_dev["sentence2"]])

#### CountVectorizer + LogisticRegression on *sentence1*

In [34]:
pd.DataFrame(classification_report(df_train["gold_label"], train_preds_1, output_dict=True))

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.337047,0.338963,0.338638,0.338346,0.338216,0.338216
recall,0.259287,0.362209,0.393639,0.338346,0.338378,0.338346
f1-score,0.293097,0.350201,0.364073,0.338346,0.33579,0.335774
support,183187.0,183416.0,182764.0,0.338346,549367.0,549367.0


In [35]:
pd.DataFrame(classification_report(df_dev["gold_label"], test_preds_1, output_dict=True))

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.33437,0.341063,0.32887,0.334688,0.334768,0.334826
recall,0.262355,0.362271,0.379598,0.334688,0.334741,0.334688
f1-score,0.294017,0.351347,0.352418,0.334688,0.332594,0.332605
support,3278.0,3329.0,3235.0,0.334688,9842.0,9842.0


#### CountVectorizer + LogisticRegression on *sentence2*

In [36]:
pd.DataFrame(classification_report(df_train["gold_label"], train_preds_2, output_dict=True))

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.729953,0.701692,0.744365,0.724097,0.725337,0.725312
recall,0.711863,0.767398,0.692905,0.724097,0.724055,0.724097
f1-score,0.720795,0.733075,0.717713,0.724097,0.723861,0.72387
support,183187.0,183416.0,182764.0,0.724097,549367.0,549367.0


In [37]:
pd.DataFrame(classification_report(df_dev["gold_label"], test_preds_2, output_dict=True))

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.664024,0.673609,0.68106,0.672729,0.672898,0.672866
recall,0.66443,0.709222,0.643586,0.672729,0.672412,0.672729
f1-score,0.664227,0.690957,0.661793,0.672729,0.672326,0.672468
support,3278.0,3329.0,3235.0,0.672729,9842.0,9842.0


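Comparing the development accuracies obtained with *sentence1* (0.33) and *sentence2* (0.67), we see that basing the predictions on *sentence2* gives clearly better results, while the *sentence1* model is no better than the random baseline. This is to be expected, since the same *sentence1* appears in several pairs with different labels (see the first rows of the training data above). So, we choose the model using *sentence2*.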
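
The weak result for *sentence1* follows from how the corpus is built: each premise is typically paired with one hypothesis per label. A small sketch with made-up premise identifiers shows that no classifier looking only at the premise can then do better than one third.

```python
from collections import Counter, defaultdict

# Toy data with placeholder premises: every premise occurs once with each label.
toy_pairs = [
    ("premise A", "entailment"), ("premise A", "neutral"), ("premise A", "contradiction"),
    ("premise B", "entailment"), ("premise B", "neutral"), ("premise B", "contradiction"),
]

labels_by_premise = defaultdict(Counter)
for premise, label in toy_pairs:
    labels_by_premise[premise][label] += 1

# The best premise-only classifier predicts the most frequent label of each premise.
best_correct = sum(max(counts.values()) for counts in labels_by_premise.values())
upper_bound = best_correct / len(toy_pairs)
print(upper_bound)
```

The hypothesis (*sentence2*), in contrast, is written specifically for one label, so its wording carries information about that label.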

## Problem 5: A classifier based on manually engineered features

[Bowman et al., 2015](https://www.aclweb.org/anthology/D15-1075/) evaluate a classifier that uses (among others) **cross-unigram features**. This term is used to refer to pairs of unigrams $(w_1, w_2)$ such that $w_1$ occurs in sentence&nbsp;1, $w_2$ occurs in sentence&nbsp;2, and both have been assigned the same part-of-speech tag.

Your next task is to implement the cross-unigram classifier. To this end, the next cell contains skeleton code for a transformer that you can use as the first component in a classification pipeline. This transformer converts each row of the SNLI data frame into a space-separated string consisting of

* the standard unigrams (of sentence&nbsp;1 or sentence&nbsp;2 – this depends on your results in Problem&nbsp;4)
* the cross-unigrams, as defined above.

The space-separated string forms a new &lsquo;document&rsquo; that can be passed to a vectorizer in exactly the same way as a standard sentence in Problem&nbsp;4.
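
As an illustration of the definition, the following sketch extracts the cross-unigrams for one sentence pair from the training data shown earlier. The function name and the `w1_w2` naming scheme are assumptions made for this example.

```python
from itertools import product

def cross_unigrams(tokens1, tags1, tokens2, tags2):
    """Pair every token of sentence 1 with every token of sentence 2 that has the same tag."""
    return [
        f"{w1}_{w2}"
        for (w1, t1), (w2, t2) in product(zip(tokens1, tags1), zip(tokens2, tags2))
        if t1 == t2
    ]

tokens1 = "Children smiling and waving at camera".split()
tags1 = "NOUN VERB CCONJ VERB ADP NOUN".split()
tokens2 = "There are children present".split()
tags2 = "PRON AUX NOUN ADJ".split()

print(cross_unigrams(tokens1, tags1, tokens2, tags2))
# ['Children_children', 'camera_children']
```

Joining each pair into a single token lets a standard vectorizer treat the pair as one feature.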

We chose to use the standard unigrams of *sentence2*, since it led to the better development accuracy in Problem&nbsp;4.

In [38]:
class CrossUnigramsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    # Vectorize a sentence-tag-sentence-tag quadruple.
    def _transform(self, sentence1, tags1, sentence2, tags2):
        # Tokenize sentence and tag (also removing stop words and non-alpha words and their tags)
        with nlp.disable_pipes("tagger", "parser", "ner"):
            words_tags1 = [(w.text, t) for w, t in zip(nlp(sentence1), tags1.split()) if w.is_alpha and not w.is_stop]
            words_tags2 = [(w.text, t) for w, t in zip(nlp(sentence2), tags2.split()) if w.is_alpha and not w.is_stop]
        
        # Filter out tags that do not match
        cross_unigrams = [wt1[0] + "_" + wt2[0] for wt1,wt2 in 
                          itertools.product(words_tags1, words_tags2) if wt1[1]==wt2[1]]
        
        # Combine standard unigrams and cross unigrams
        return " ".join([w for w, t in words_tags2] + cross_unigrams)

    def transform(self, X):
        return [self._transform(row[0], row[1], row[2], row[3]) for i,row in X.iterrows()]

Once you have an implementation of the transformer, extend the pipeline that you built for Problem&nbsp;4, train it, and evaluate it on the development data.

In [39]:
# TODO: Enter code here to implement the cross-unigrams classifier. Print the classification report.
cnt_cu_vectorizer = CountVectorizer(ngram_range=(1,1))
logreg_cu = LogisticRegression(solver="saga", multi_class="ovr", max_iter=50)
cross_unigrams = CrossUnigramsTransformer()

np.random.seed(10)
logcnt_cu_pipeline = Pipeline([("cross_unigrams", cross_unigrams), ("cnt_vectorizer", cnt_cu_vectorizer), 
                               ("logreg", logreg_cu)])
logcnt_cu_model = clone(logcnt_cu_pipeline).fit(df_train.loc[:, df_train.columns != "gold_label"], df_train["gold_label"])

In [40]:
train_cu_preds, dev_cu_preds = get_predictions(logcnt_cu_model, [df_train.loc[:, df_train.columns != "gold_label"],
                                                                 df_dev.loc[:, df_dev.columns != "gold_label"]])

In [41]:
pd.DataFrame(classification_report(df_train["gold_label"], train_cu_preds, output_dict=True))

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.841323,0.758726,0.82631,0.805116,0.808786,0.808752
recall,0.806602,0.874051,0.734444,0.805116,0.805033,0.805116
f1-score,0.823597,0.812316,0.777674,0.805116,0.804529,0.804553
support,183187.0,183416.0,182764.0,0.805116,549367.0,549367.0


In [42]:
pd.DataFrame(classification_report(df_dev["gold_label"], dev_cu_preds, output_dict=True))

Unnamed: 0,contradiction,entailment,neutral,accuracy,macro avg,weighted avg
precision,0.756434,0.700966,0.728688,0.726885,0.728696,0.728552
recall,0.735204,0.806248,0.636785,0.726885,0.726079,0.726885
f1-score,0.745668,0.74993,0.679644,0.726885,0.725081,0.725408
support,3278.0,3329.0,3235.0,0.726885,9842.0,9842.0


Using a CountVectorizer + LogisticRegression model with features obtained from sentence2 and the cross-unigrams, we obtained a development accuracy of 72.7%, which is an improvement of approximately 5.4 percentage points over the previous model using only sentence2 features (67.3%).

## Problem 6: A classifier based on word embeddings

Your last task in this lab is to build a classifier for the natural language inference task that uses word embeddings. More specifically, we ask you to implement a vectorizer that represents each sentence as the sum of its word vectors – a representation known as the **continuous bag-of-words**. Thus, given that spaCy&rsquo;s word vectors have 300 dimensions, each sentence will be transformed into a 300-dimensional vector. To represent a sentence pair, the vectorizer should concatenate the vectors for the individual sentences; this yields a 600-dimensional vector. This vector can then be passed to a classifier.

The next code cell contains skeleton code for the vectorizer. You will have to implement two methods: one that maps a single sentence to a vector (of length 300), and one that maps a sentence pair to a vector (of length 600).
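
The representation itself can be sketched with a toy lookup table in place of spaCy's vectors. The three-dimensional vectors and the helper names below are assumptions for the illustration; with real 300-dimensional vectors the pair vector would have length 600.

```python
import numpy as np

DIM = 3  # toy dimensionality; spaCy's vectors have 300 dimensions
toy_vectors = {
    "men":   np.array([1.0, 0.0, 2.0]),
    "play":  np.array([0.0, 1.0, 1.0]),
    "sport": np.array([0.5, 0.5, 0.0]),
}

def sentence_vector(tokens):
    """Sum of the word vectors; a zero vector if no token has a vector."""
    vectors = [toy_vectors[t] for t in tokens if t in toy_vectors]
    return np.sum(vectors, axis=0) if vectors else np.zeros(DIM)

def pair_vector(tokens1, tokens2):
    """Concatenation of the two sentence vectors."""
    return np.concatenate([sentence_vector(tokens1), sentence_vector(tokens2)])

v = pair_vector(["men", "play"], ["play", "sport"])
print(v)        # [1.  1.  3.  0.5 1.5 1. ]
print(v.shape)  # (6,)
```

Returning a zero vector for an empty sentence keeps every pair vector the same length, which the classifier requires.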

In [43]:
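import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class PairedSentenceVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        # Nothing to learn: the word vectors are pre-trained
        return self

    # Vectorize a single sentence as the sum of its word vectors (shape (1, 300))
    def _transform1(self, sentence):
        with nlp.disable_pipes("tagger", "parser", "ner"):
            return np.sum(np.array([w.vector for w in nlp(sentence)]), axis=0, keepdims=True)

    # Vectorize a pair of sentences by concatenating their vectors (shape (1, 600))
    def _transform2(self, sentence1, sentence2):
        return np.hstack([self._transform1(sentence1), self._transform1(sentence2)])

    # Stack one row per sentence pair; the first two columns of X hold the sentences
    def transform(self, X):
        return np.concatenate([self._transform2(row.iloc[0], row.iloc[1]) for _, row in X.iterrows()])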
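
The sum-and-concatenate representation can be illustrated without spaCy. The following is only a sketch: the 3-dimensional `TOY_VECTORS` table and the helper names `sentence_vector` and `pair_vector` are hypothetical stand-ins for spaCy's 300-dimensional vectors and the vectorizer methods.

```python
import numpy as np

# Hypothetical 3-dimensional toy embeddings (assumption: spaCy's real vectors have 300 dimensions)
TOY_VECTORS = {
    "a": np.array([0.1, 0.0, 0.2]),
    "dog": np.array([0.5, 0.3, 0.0]),
    "runs": np.array([0.0, 0.4, 0.1]),
    "sleeps": np.array([0.2, 0.1, 0.6]),
}
DIM = 3


def sentence_vector(sentence):
    """Sum the vectors of all tokens; unknown tokens contribute zeros."""
    total = np.zeros(DIM)
    for token in sentence.lower().split():
        total += TOY_VECTORS.get(token, np.zeros(DIM))
    return total


def pair_vector(sentence1, sentence2):
    """Concatenate the two sentence vectors into a single feature vector."""
    return np.concatenate([sentence_vector(sentence1), sentence_vector(sentence2)])


features = pair_vector("A dog runs", "A dog sleeps")
print(features.shape)  # (6,)
```

Because the representation is a sum, word order is lost: `sentence_vector("dog a runs")` gives the same vector as `sentence_vector("a dog runs")`.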

Once you have a working implementation, build a pipeline consisting of the new vectorizer and a [multi-layer perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). This more powerful (compared to logistic regression) classifier is called for here because we do not specify features by hand (as we did in Problem&nbsp;5), but want to let the model learn a good representation of the data by itself. Use 3&nbsp;hidden layers, each with size 300. It suffices to train the classifier for 8&nbsp;iterations (epochs).
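
As a rough sketch of the configuration described above (3 hidden layers of size 300, 8 epochs), the following trains an `MLPClassifier` on random 600-dimensional data. The names `X_toy`, `y_toy` and `toy_pipeline` are hypothetical, and the random matrix is only a stand-in for the output of the vectorizer, which would normally be the first pipeline step.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Assumption: random features stand in for the 600-dimensional sentence-pair vectors
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 600))
y_toy = rng.choice(["contradiction", "entailment", "neutral"], size=300)

toy_pipeline = Pipeline([
    # In the lab, the paired-sentence vectorizer would come before this step
    ("mlp", MLPClassifier(hidden_layer_sizes=(300, 300, 300), max_iter=8, random_state=0)),
])

with warnings.catch_warnings():
    # Stopping after 8 epochs usually triggers a ConvergenceWarning; that is expected here
    warnings.simplefilter("ignore", ConvergenceWarning)
    toy_pipeline.fit(X_toy, y_toy)

toy_preds = toy_pipeline.predict(X_toy)
print(toy_preds[:5])
```

With the default `adam` solver, `max_iter` counts epochs (passes over the training data), which is why `max_iter=8` corresponds to the 8 iterations mentioned above.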

In [44]:
# TODO: Enter code here to implement the word embeddings classifier. Print the classification report.
np.random.seed(123)
psent_vectorizer = PairedSentenceVectorizer()
mlp_classifier = MLPClassifier(hidden_layer_sizes=(200, 100), activation="logistic", early_stopping=True, validation_fraction=0.1,
                               learning_rate="invscaling", batch_size=32, verbose=3)

mlp_pipeline = Pipeline([("psent_vectorizer", psent_vectorizer), ("mlp_classifier", mlp_classifier)])
# mlp_model = clone(mlp_pipeline).fit(df_train.loc[:, ["sentence1", "sentence2"]], df_train["gold_label"])
mlp_model = joblib.load('../input/db-text-mining-lab4/mlp_model_full_3x300.pkl')

FileNotFoundError: [Errno 2] No such file or directory: '../input/mlp_model_full_3x300.pkl'

In [None]:
train_mlp_preds, dev_mlp_preds = get_predictions(mlp_model, [df_train.loc[:, ["sentence1", "sentence2"]],
                                                             df_dev.loc[:, ["sentence1", "sentence2"]] ])

In [None]:
pd.DataFrame(classification_report(df_train["gold_label"], train_mlp_preds, output_dict=True))

In [None]:
pd.DataFrame(classification_report(df_dev["gold_label"], dev_mlp_preds, output_dict=True))

## Final evaluation

Once you have working code for all problems, re-run the code for Problems&nbsp;4–6 with the full training data. What are your results? How do they differ from the results that you obtained for the smaller training data? How do you interpret this? Summarize your findings in a short text.

**Problem 4:**

Dummy Classifier:    
Small dataset: Training accuracy = 0.33376, Test accuracy = 0.333774    
Full dataset: Training accuracy = 0.333338, Test accuracy = 0.34495


The Dummy Classifier performs almost identically whether it is trained on the small or the full dataset. An accuracy of approximately 0.33 makes sense, as it matches the probability of guessing the correct class at random when there are 3 roughly balanced classes.
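
The chance level can be checked with a small simulation. This is only a sketch: it assumes uniformly random guesses over three balanced labels, which is not necessarily the strategy used by the Dummy Classifier in this lab.

```python
import numpy as np

rng = np.random.default_rng(123)
labels = np.array(["contradiction", "entailment", "neutral"])
n = 30_000

# Assumption: balanced gold labels and uniformly random guesses
gold = rng.choice(labels, size=n)
guesses = rng.choice(labels, size=n)

chance_accuracy = float(np.mean(gold == guesses))
print(round(chance_accuracy, 3))  # should be close to 1/3
```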


One-sided baseline with sentence1 (Logistic Regression Classifier):    
Small dataset: Training accuracy = 0.33864, Test accuracy = 0.331335    
Full dataset: Training accuracy = 0.338346, Test accuracy = 0.334688


One-sided baseline with sentence2 (Logistic Regression Classifier):    
Small dataset: Training accuracy = 0.745, Test accuracy = 0.602825    
Full dataset: Training accuracy = 0.724097, Test accuracy = 0.672729


The one-sided baseline using sentence 1 performs at the same level as the Dummy Classifier on both the small and the full dataset. This probably means that sentence 1 alone does not contain any features that help in this classification task. However, we get much better accuracy when we use sentence 2, and the accuracy improves further when we train on the full dataset rather than the small one. This suggests that sentence 2 on its own contains features that help in this classification task, and that more data leads to higher accuracy, probably because the model encounters more varied words and contexts and therefore generalizes better. This is supported by the results for the full dataset: the training accuracy is lower, the test accuracy is much higher, and the gap between the two is smaller.


**Problem 5:**

Cross-unigrams + sentence2 Logistic Regression Classifier:    
Small dataset: Training accuracy = 0.87944, Test accuracy = 0.624975    
Full dataset: Training accuracy = 0.805116, Test accuracy = 0.726885


Adding cross-unigrams improves the accuracy further compared to using only sentence 2. Using the full dataset also increases the test accuracy by almost 0.1 compared to the small dataset. This probably means that the cross-unigrams add features that help in this classification task. It makes sense that considering words from both sentences leads to better accuracy, because such features can represent the relationship between the sentences better. The model also seems to generalize better when trained on the full dataset: the training accuracy is lower, the test accuracy is much higher, and the gap between the two is smaller.


**Problem 6:**

MLPClassifier with hidden units = (200, 100):    
Small dataset: Training accuracy = 0.73712, Test accuracy = 0.643873    
Full dataset: Training accuracy = 0.769373, Test accuracy = 0.742532


We get a further increase of roughly 2 percentage points in test accuracy on both the small and the full dataset. The model also generalizes better than before on the small dataset, and the full dataset leads to even better accuracy and generalization. This makes sense because the word vectors let us encode the meaning of a sentence to a higher degree. By concatenating the vectors of the 2 sentences into one feature vector, we are able to compare the meanings of the 2 sentences at a more semantic level. Overall, the MLPClassifier model provides the best test accuracy, at around 74%.
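
The generalization claims above can be made concrete by computing the gap between training and test accuracy from the figures reported in this summary. This is a small arithmetic sketch; the dictionary layout and the name `gaps` are just for illustration.

```python
# (training accuracy, test accuracy) as reported above for each model and dataset size
results = {
    "sentence2 baseline": {"small": (0.745, 0.602825), "full": (0.724097, 0.672729)},
    "cross-unigrams + sentence2": {"small": (0.87944, 0.624975), "full": (0.805116, 0.726885)},
    "MLP on word embeddings": {"small": (0.73712, 0.643873), "full": (0.769373, 0.742532)},
}

# Gap = training accuracy minus test accuracy; a smaller gap indicates better generalization
gaps = {
    model: {size: round(train - test, 4) for size, (train, test) in sizes.items()}
    for model, sizes in results.items()
}

for model, by_size in gaps.items():
    print(f"{model}: small gap = {by_size['small']}, full gap = {by_size['full']}")
```

For every model the gap is smaller on the full dataset than on the small one, and the word-embedding model has the smallest gap on both.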


<div class="alert alert-info">
    Please read the section ‘General information’ on the ‘Labs’ page of the course website before submitting this notebook!
</div>