# Homework and bake-off: word-level entailment with neural networks

In [None]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Data](#Data)
  1. [Edge disjoint](#Edge-disjoint)
  1. [Word disjoint](#Word-disjoint)
1. [Baseline](#Baseline)
  1. [Representing words: vector_func](#Representing-words:-vector_func)
  1. [Combining words into inputs: vector_combo_func](#Combining-words-into-inputs:-vector_combo_func)
  1. [Classifier model](#Classifier-model)
  1. [Baseline results](#Baseline-results)
1. [Homework questions](#Homework-questions)
  1. [Hypothesis-only baseline [2 points]](#Hypothesis-only-baseline-[2-points])
  1. [Alternatives to concatenation [2 points]](#Alternatives-to-concatenation-[2-points])
  1. [A deeper network [2 points]](#A-deeper-network-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

The general problem is word-level natural language inference.

Training examples are pairs of words $(w_{L}, w_{R}), y$ with $y = 1$ if $w_{L}$ entails $w_{R}$, otherwise $0$.

The homework questions below ask you to define baseline models for this and develop your own system for entry in the bake-off, which will take place on a held-out test-set distributed at the start of the bake-off. (Thus, all the data you have available for development is available for training your final system before the bake-off begins.)

<img src="fig/wordentail-diagram.png" width=600 alt="wordentail-diagram.png" />

## Set-up

See [the first notebook in this unit](nli_01_task_and_data.ipynb) for set-up instructions.

In [2]:
from collections import defaultdict
import json
import numpy as np
import os
import pandas as pd
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import nli
import utils

In [3]:
DATA_HOME = 'data'

NLIDATA_HOME = os.path.join(DATA_HOME, 'nlidata')

wordentail_filename = os.path.join(
    NLIDATA_HOME, 'nli_wordentail_bakeoff_data.json')

GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

## Data

I've processed the data into two different train/test splits, in an effort to put some pressure on our models to actually learn these semantic relations, as opposed to exploiting regularities in the sample.

* `edge_disjoint`: The `train` and `dev` __edge__ sets are disjoint, but many __words__ appear in both `train` and `dev`.
* `word_disjoint`: The `train` and `dev` __vocabularies are disjoint__, and thus the edges are disjoint as well.

These are very different problems. For `word_disjoint`, there is real pressure on the model to learn abstract relationships, as opposed to memorizing properties of individual words.
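To make the two notions of disjointness concrete, here is a minimal sketch on made-up examples. The helpers `edge_set` and `vocab_set` are hypothetical names used only for illustration; they are not part of `nli`, and the overlap functions in `nli` may be implemented differently.

```python
# Toy data in the same [[w_left, w_right], label] format as the real splits.
toy_train = [[['herring', 'animal'], 1], [['sweater', 'stroke'], 0]]
toy_dev = [[['herring', 'fish'], 1], [['puppy', 'dog'], 1]]

def edge_set(examples):
    """Set of (left, right) word pairs in a list of examples."""
    return {tuple(pair) for pair, _ in examples}

def vocab_set(examples):
    """Set of all words appearing in a list of examples."""
    return {w for pair, _ in examples for w in pair}

# Edges (word pairs) shared by the two portions.
edge_overlap = edge_set(toy_train) & edge_set(toy_dev)

# Individual words shared by the two portions.
vocab_overlap = vocab_set(toy_train) & vocab_set(toy_dev)

print(len(edge_overlap), len(vocab_overlap))
```

This toy pair of portions is edge-disjoint but not word-disjoint: no pair is repeated, yet `'herring'` occurs on both sides, so a model could in principle memorize properties of that word.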

In [4]:
with open(wordentail_filename) as f:
    wordentail_data = json.load(f)

The outer keys are the splits plus a list giving the vocabulary for the entire dataset:

In [5]:
wordentail_data.keys()

dict_keys(['edge_disjoint', 'vocab', 'word_disjoint'])

### Edge disjoint

In [5]:
wordentail_data['edge_disjoint'].keys()

dict_keys(['dev', 'train'])

This is what the split looks like; all of the splits have this same format:

In [6]:
wordentail_data['edge_disjoint']['dev'][: 5]

[[['sweater', 'stroke'], 0],
 [['constipation', 'hypovolemia'], 0],
 [['disease', 'inflammation'], 0],
 [['herring', 'animal'], 1],
 [['cauliflower', 'outlook'], 0]]

Let's test to make sure no edges are shared between `train` and `dev`:

In [7]:
nli.get_edge_overlap_size(wordentail_data, 'edge_disjoint')

0

As we expect, a *lot* of vocabulary items are shared between `train` and `dev`:

In [8]:
nli.get_vocab_overlap_size(wordentail_data, 'edge_disjoint')

2916

This is a large percentage of the entire vocab:

In [9]:
len(wordentail_data['vocab'])

8470

Here's the distribution of labels in the `train` set. It's highly imbalanced, which will pose a challenge for learning. (I'll go ahead and reveal that the `dev` set is similarly distributed.)

In [10]:
def label_distribution(split):
    return pd.DataFrame(wordentail_data[split]['train'])[1].value_counts()

In [11]:
label_distribution('edge_disjoint')

0    14650
1     2745
Name: 1, dtype: int64
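To see why this imbalance matters, the following sketch works out how a trivial classifier that always predicts `0` would score on data with these label counts. The counts are copied from the output above; the always-`0` classifier is a hypothetical reference point, not something defined in this notebook.

```python
# Label counts copied from the `edge_disjoint` train output above.
counts = {0: 14650, 1: 2745}
total = sum(counts.values())

# Fraction of examples that are positive (entailment).
positive_rate = counts[1] / total

# A classifier that always predicts 0 gets every negative right and
# every positive wrong.
precision_0 = counts[0] / total   # every prediction is 0
recall_0 = 1.0                    # every true 0 is recovered
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 0.0                        # class 1 is never predicted

accuracy = counts[0] / total
macro_f1 = (f1_0 + f1_1) / 2

print(round(positive_rate, 3), round(accuracy, 3), round(macro_f1, 3))
```

Accuracy looks respectable for this degenerate classifier, while macro-averaged F1 does not, which is one reason to prefer a macro-averaged metric when the classes are this skewed.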

### Word disjoint

In [12]:
wordentail_data['word_disjoint'].keys()

dict_keys(['dev', 'train'])

In the `word_disjoint` split, no __words__ are shared between `train` and `dev`:

In [13]:
nli.get_vocab_overlap_size(wordentail_data, 'word_disjoint')

0

Because no words are shared between `train` and `dev`, no edges are either:

In [14]:
nli.get_edge_overlap_size(wordentail_data, 'word_disjoint')

0

The label distribution is similar to that of `edge_disjoint`, though the overall number of examples is a bit smaller:

In [15]:
label_distribution('word_disjoint')

0    7199
1    1349
Name: 1, dtype: int64

## Baseline

Even in deep learning, __feature representation is vital and requires care!__ For our task, feature representation has two parts: representing the individual words and combining those representations into a single network input.

### Representing words: vector_func

Let's consider two baseline word representation methods:

1. Random vectors (as returned by `utils.randvec`).
1. 50-dimensional GloVe representations.

In [9]:
def randvec(w, n=50, lower=-1.0, upper=1.0):
    """Returns a random vector of length `n`. `w` is ignored."""
    return utils.randvec(n=n, lower=lower, upper=upper)

In [17]:
# Any of the files in glove.6B will work here:

glove_dim = 50

glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE = utils.glove2dict(glove_src)

def glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE.get(w, randvec(w, n=glove_dim))

### Combining words into inputs: vector_combo_func

Here we decide how to combine the two word vectors into a single representation. In more detail, where `u` is a vector representation of the left word and `v` is a vector representation of the right word, we need a function `vector_combo_func` such that `vector_combo_func(u, v)` returns a new input vector `z` of dimension `m`. A simple example is concatenation:

In [18]:
def vec_concatenate(u, v):
    """Concatenate np.array instances `u` and `v` into a new np.array"""
    return np.concatenate((u, v))

`vector_combo_func` could instead be vector average, vector difference, etc. (even combinations of those); there's lots of space for experimentation here, and [homework question 2](#Alternatives-to-concatenation-[2-points]) below pushes you to do some exploration.
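As a rough illustration of what such alternatives can look like, here is a sketch of two combination functions. Both names (`vec_average` and `vec_concat_and_product`) are hypothetical, and the element-wise product is an assumption chosen for the example rather than something this notebook prescribes.

```python
import numpy as np

def vec_average(u, v):
    """Element-wise mean of `u` and `v`; the output keeps their dimension."""
    return (u + v) / 2.0

def vec_concat_and_product(u, v):
    """Concatenate `u`, `v`, and their element-wise product, giving a
    vector three times as long as each input."""
    return np.concatenate((u, v, u * v))

u = np.array([1.0, 2.0])
v = np.array([3.0, -2.0])

avg = vec_average(u, v)
combo = vec_concat_and_product(u, v)

print(avg.shape, combo.shape)
```

Note that the choice of combination function fixes the input dimension `m` of the network: averaging keeps it at the word-vector dimension, whereas concatenation-style functions multiply it.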

### Classifier model

For a baseline model, I chose `TorchShallowNeuralClassifier`:

In [19]:
net = TorchShallowNeuralClassifier(hidden_dim=50, max_iter=100)

### Baseline results

The following puts the above pieces together, using `vector_func=glove_vec`, since `vector_func=randvec` seems so hopelessly misguided for `word_disjoint`!

In [20]:
word_disjoint_experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=net, 
    vector_func=glove_vec,
    vector_combo_func=vec_concatenate)

Finished epoch 100 of 100; error is 0.022493046708405018

              precision    recall  f1-score   support

           0      0.923     0.929     0.926      1910
           1      0.400     0.377     0.388       239

   micro avg      0.868     0.868     0.868      2149
   macro avg      0.661     0.653     0.657      2149
weighted avg      0.864     0.868     0.866      2149



In [24]:
word_disjoint_experiment.keys()

dict_keys(['model', 'train_data', 'assess_data', 'macro-F1', 'vector_func', 'vector_combo_func'])
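The way `vector_func` and `vector_combo_func` fit together can be sketched as follows. This is a hypothetical reconstruction of the featurization step that an experiment function of this kind needs to perform; the function name `featurize` and the toy vectors are assumptions, and the actual code in `nli` may differ.

```python
import numpy as np

def featurize(data, vector_func, vector_combo_func):
    """Turn [[w_left, w_right], label] examples into (X, y) arrays."""
    X, y = [], []
    for (w_left, w_right), label in data:
        # Represent each word, then combine the pair into one input vector.
        u = vector_func(w_left)
        v = vector_func(w_right)
        X.append(vector_combo_func(u, v))
        y.append(label)
    return np.array(X), np.array(y)

# Made-up 2-dimensional "embeddings", standing in for GloVe.
toy_vectors = {
    'herring': np.array([1.0, 0.0]),
    'animal': np.array([0.5, 0.5]),
    'sweater': np.array([0.0, 1.0]),
    'stroke': np.array([-1.0, 0.0])}

toy_data = [[['herring', 'animal'], 1], [['sweater', 'stroke'], 0]]

X, y = featurize(
    toy_data,
    vector_func=toy_vectors.get,
    vector_combo_func=lambda u, v: np.concatenate((u, v)))

print(X.shape, y.shape)
```

Each row of `X` is one combined input vector, so any model with a `fit`/`predict` interface can be trained on `(X, y)` regardless of which word representation or combination function produced it.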

## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Hypothesis-only baseline [2 points]

During our discussion of SNLI and MultiNLI, we noted that a number of research teams have shown that hypothesis-only baselines for NLI tasks can be remarkably robust. This question asks you to explore briefly how this baseline affects the 'edge_disjoint' and 'word_disjoint' versions of our task.

For this problem, submit two functions:

1. A `vector_combo_func` function called `hypothesis_only` that simply throws away the premise, using the unmodified hypothesis (second) vector as its representation of the example.

1. A function called `run_hypothesis_only_evaluation` that does the following:
    1. Loops over the two conditions 'word_disjoint' and 'edge_disjoint' and the two `vector_combo_func` values `vec_concatenate` and `hypothesis_only`, calling `nli.wordentail_experiment` to train on the condition's 'train' portion and assess on its 'dev' portion, with `glove_vec` as the `vector_func`. So that the results are consistent, use an `sklearn.linear_model.LogisticRegression` with default parameters as the model.
    1. Returns a `dict` mapping `(condition_name, function_name)` pairs to the 'macro-F1' score for that pair, as returned by the call to `nli.wordentail_experiment`. (Tip: you can get the `str` name of your function `hypothesis_only` with `hypothesis_only.__name__`.)
    
The test functions `test_hypothesis_only` and `test_run_hypothesis_only_evaluation` will help ensure that your functions have the desired logic.

In [55]:
from sklearn.linear_model import LogisticRegression

##### YOUR CODE HERE
def hypothesis_only(premise, hypothesis):
    return hypothesis

    ##### YOUR CODE HERE

def run_hypothesis_only_evaluation():
    ##### YOUR CODE HERE
    datasets = ['word_disjoint', 'edge_disjoint']
    vector_functions = [vec_concatenate, hypothesis_only]
    evaluations = {}
    for dset in datasets:
        for func in vector_functions:
            print((dset, func.__name__))
            experiment = nli.wordentail_experiment(
                train_data=wordentail_data[dset]['train'],
                assess_data=wordentail_data[dset]['dev'], 
                model=LogisticRegression(), 
                vector_func=glove_vec,
                vector_combo_func=func)
            evaluations[(dset, func.__name__)] = experiment['macro-F1']
                
    return evaluations

In [56]:
evals = run_hypothesis_only_evaluation()

('word_disjoint', 'vec_concatenate')




              precision    recall  f1-score   support

           0      0.902     0.981     0.940      1910
           1      0.486     0.146     0.225       239

   micro avg      0.888     0.888     0.888      2149
   macro avg      0.694     0.564     0.582      2149
weighted avg      0.856     0.888     0.860      2149

('word_disjoint', 'hypothesis_only')




              precision    recall  f1-score   support

           0      0.892     0.989     0.938      1910
           1      0.323     0.042     0.074       239

   micro avg      0.884     0.884     0.884      2149
   macro avg      0.607     0.515     0.506      2149
weighted avg      0.829     0.884     0.842      2149

('edge_disjoint', 'vec_concatenate')




              precision    recall  f1-score   support

           0      0.875     0.971     0.920      7376
           1      0.579     0.226     0.325      1321

   micro avg      0.857     0.857     0.857      8697
   macro avg      0.727     0.598     0.622      8697
weighted avg      0.830     0.857     0.830      8697

('edge_disjoint', 'hypothesis_only')




              precision    recall  f1-score   support

           0      0.871     0.975     0.920      7376
           1      0.584     0.197     0.294      1321

   micro avg      0.857     0.857     0.857      8697
   macro avg      0.728     0.586     0.607      8697
weighted avg      0.828     0.857     0.825      8697



In [28]:
def test_hypothesis_only(hypothesis_only):
    v = hypothesis_only(1, 2)
    assert v == 2   

In [29]:
test_hypothesis_only(hypothesis_only)

In [33]:
def test_run_hypothesis_only_evaluation(run_hypothesis_only_evaluation):
    results = run_hypothesis_only_evaluation()
    assert ('word_disjoint', 'vec_concatenate') in results, \
        "The return value of `run_hypothesis_only_evaluation` does not have the intended kind of keys"
    assert isinstance(results[('word_disjoint', 'vec_concatenate')], float), \
        "The values of the `run_hypothesis_only_evaluation` result should be floats"

In [52]:
test_run_hypothesis_only_evaluation(run_hypothesis_only_evaluation)

('word_disjoint', 'vec_concatenate')


Finished epoch 100 of 100; error is 0.022795399883762002

              precision    recall  f1-score   support

           0      0.924     0.935     0.929      1910
           1      0.427     0.389     0.407       239

   micro avg      0.874     0.874     0.874      2149
   macro avg      0.675     0.662     0.668      2149
weighted avg      0.869     0.874     0.871      2149

('word_disjoint', 'hypothesis_only')


Finished epoch 100 of 100; error is 1.5376498103141785

              precision    recall  f1-score   support

           0      0.908     0.947     0.927      1910
           1      0.357     0.234     0.283       239

   micro avg      0.868     0.868     0.868      2149
   macro avg      0.632     0.591     0.605      2149
weighted avg      0.847     0.868     0.856      2149

('edge_disjoint', 'vec_concatenate')


Finished epoch 100 of 100; error is 0.06305615417659283

              precision    recall  f1-score   support

           0      0.921     0.931     0.926      7376
           1      0.591     0.555     0.572      1321

   micro avg      0.874     0.874     0.874      8697
   macro avg      0.756     0.743     0.749      8697
weighted avg      0.871     0.874     0.872      8697

('edge_disjoint', 'hypothesis_only')


Finished epoch 100 of 100; error is 3.2835095673799515

              precision    recall  f1-score   support

           0      0.909     0.951     0.930      7376
           1      0.632     0.472     0.540      1321

   micro avg      0.878     0.878     0.878      8697
   macro avg      0.771     0.711     0.735      8697
weighted avg      0.867     0.878     0.871      8697



### Alternatives to concatenation [2 points]

We've so far just used vector concatenation to represent the premise and hypothesis words. This question asks you to explore two simple alternatives:

1. Write a function `vec_diff` that, for a given pair of vector inputs `u` and `v`, returns the element-wise difference between `u` and `v`.

1. Write a function `vec_max` that, for a given pair of vector inputs `u` and `v`, returns the element-wise max values between `u` and `v`.

You needn't include your uses of `nli.wordentail_experiment` with these functions, but we assume you'll be curious to see how they do!

In [57]:
def vec_diff(u, v):
    ##### YOUR CODE HERE
    return np.subtract(u, v)


    
def vec_max(u, v):
    ##### YOUR CODE HERE
    return np.maximum(u, v)



In [58]:
def test_vec_diff(vec_diff):
    u = np.array([10.2, 8.1])
    v = np.array([1.2, -7.1])
    result = vec_diff(u, v)
    expected = np.array([9.0, 15.2])
    assert np.array_equal(result, expected), \
        "Expected {}; got {}".format(expected, result)

In [59]:
test_vec_diff(vec_diff)

In [60]:
def test_vec_max(vec_max):
    u = np.array([1.2,  8.1])
    v = np.array([10.2, -7.1])
    result = vec_max(u, v)
    expected = np.array([10.2, 8.1])
    assert np.array_equal(result, expected), \
        "Expected {}; got {}".format(expected, result)

In [61]:
test_vec_max(vec_max)

In [62]:
def run_hypothesis_only_evaluation_multiple_functions():
    ##### YOUR CODE HERE
    datasets = ['word_disjoint', 'edge_disjoint']
    vector_functions = [vec_concatenate, hypothesis_only, vec_diff, vec_max]
    evaluations = {}
    for dset in datasets:
        for func in vector_functions:
            print((dset, func.__name__))
            experiment = nli.wordentail_experiment(
                train_data=wordentail_data[dset]['train'],
                assess_data=wordentail_data[dset]['dev'], 
                model=LogisticRegression(), 
                vector_func=glove_vec,
                vector_combo_func=func)
            evaluations[(dset, func.__name__)] = experiment['macro-F1']
                
    return evaluations

run_hypothesis_only_evaluation_multiple_functions()

('word_disjoint', 'vec_concatenate')




              precision    recall  f1-score   support

           0      0.902     0.979     0.938      1910
           1      0.461     0.146     0.222       239

   micro avg      0.886     0.886     0.886      2149
   macro avg      0.681     0.562     0.580      2149
weighted avg      0.853     0.886     0.859      2149

('word_disjoint', 'hypothesis_only')




              precision    recall  f1-score   support

           0      0.893     0.989     0.939      1910
           1      0.382     0.054     0.095       239

   micro avg      0.885     0.885     0.885      2149
   macro avg      0.638     0.522     0.517      2149
weighted avg      0.836     0.885     0.845      2149

('word_disjoint', 'vec_diff')




              precision    recall  f1-score   support

           0      0.893     0.987     0.938      1910
           1      0.368     0.059     0.101       239

   micro avg      0.884     0.884     0.884      2149
   macro avg      0.631     0.523     0.520      2149
weighted avg      0.835     0.884     0.845      2149

('word_disjoint', 'vec_max')




              precision    recall  f1-score   support

           0      0.898     0.986     0.940      1910
           1      0.480     0.100     0.166       239

   micro avg      0.888     0.888     0.888      2149
   macro avg      0.689     0.543     0.553      2149
weighted avg      0.851     0.888     0.854      2149

('edge_disjoint', 'vec_concatenate')




              precision    recall  f1-score   support

           0      0.875     0.970     0.920      7376
           1      0.575     0.229     0.327      1321

   micro avg      0.857     0.857     0.857      8697
   macro avg      0.725     0.599     0.624      8697
weighted avg      0.830     0.857     0.830      8697

('edge_disjoint', 'hypothesis_only')




              precision    recall  f1-score   support

           0      0.872     0.975     0.920      7376
           1      0.584     0.200     0.298      1321

   micro avg      0.857     0.857     0.857      8697
   macro avg      0.728     0.587     0.609      8697
weighted avg      0.828     0.857     0.826      8697

('edge_disjoint', 'vec_diff')




              precision    recall  f1-score   support

           0      0.859     0.989     0.919      7376
           1      0.602     0.092     0.159      1321

   micro avg      0.853     0.853     0.853      8697
   macro avg      0.730     0.540     0.539      8697
weighted avg      0.820     0.853     0.804      8697

('edge_disjoint', 'vec_max')




              precision    recall  f1-score   support

           0      0.860     0.981     0.917      7376
           1      0.512     0.110     0.181      1321

   micro avg      0.849     0.849     0.849      8697
   macro avg      0.686     0.546     0.549      8697
weighted avg      0.807     0.849     0.805      8697



{('word_disjoint', 'vec_concatenate'): 0.5803553993360672,
 ('word_disjoint', 'hypothesis_only'): 0.5169358178053831,
 ('word_disjoint', 'vec_diff'): 0.5195790690930377,
 ('word_disjoint', 'vec_max'): 0.5529876117835217,
 ('edge_disjoint', 'vec_concatenate'): 0.623656138011501,
 ('edge_disjoint', 'hypothesis_only'): 0.6090499675531149,
 ('edge_disjoint', 'vec_diff'): 0.5391780763850621,
 ('edge_disjoint', 'vec_max'): 0.5487903894475338}

### A deeper network [2 points]

It is very easy to subclass `TorchShallowNeuralClassifier` if all you want to do is change the network graph: all you have to do is write a new `define_graph`. If your graph has new arguments that the user might want to set, then you should also redefine `__init__` so that these values are accepted and set as attributes.

For this question, please subclass `TorchShallowNeuralClassifier` so that it defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
r_{1} &= \textbf{Bernoulli}(1 - \textbf{dropout\_prob}, n) \\
d_{1} &= r_1 * h_{1} \\
h_{2} &= f(d_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

Here, $r_{1}$ and $d_{1}$ define a dropout layer: $r_{1}$ is a random binary vector of dimension $n$, where the probability of a value being $1$ is given by $1 - \textbf{dropout\_prob}$. $r_{1}$ is multiplied element-wise by our first hidden representation, thereby zeroing out some of the values. The result is fed to the user's activation function $f$, and the result of that is fed through another linear layer to produce $h_{3}$. (Inside `TorchShallowNeuralClassifier`, $h_{3}$ is the basis for a softmax classifier, so no activation function is applied to it.)
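The equations above can be traced step by step in plain NumPy. This is only an illustrative sketch, not the required PyTorch solution: the function name `forward`, the toy dimensions, and the random weights are all assumptions made for the example.

```python
import numpy as np

def forward(x, W1, b1, W2, b2, dropout_prob, rng, f=np.tanh):
    """Forward pass for one input `x`, following the equations above."""
    h1 = x @ W1 + b1
    # r1: binary mask; each entry is 1 with probability 1 - dropout_prob.
    r1 = rng.binomial(1, 1.0 - dropout_prob, size=h1.shape)
    d1 = r1 * h1          # zero out the masked hidden units
    h2 = f(d1)            # activation applied after dropout
    h3 = h2 @ W2 + b2     # logits; no activation here
    return h3, r1

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input_dim = 4
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)   # hidden_dim = 3
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)   # n_classes = 2

h3, r1 = forward(x, W1, b1, W2, b2, dropout_prob=0.5, rng=rng)

# With dropout_prob=0 the mask is all ones, so the graph reduces to
# the shallow network.
h3_nodrop, r1_nodrop = forward(x, W1, b1, W2, b2, dropout_prob=0.0, rng=rng)

print(r1, h3.shape)
```

One difference worth keeping in mind: during training, PyTorch's `nn.Dropout` also rescales the surviving values by $1 / (1 - p)$, and in evaluation mode it passes its input through unchanged. The sketch omits both behaviors to stay close to the equations.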

For your implementation, please use `nn.Sequential`, `nn.Linear`, and `nn.Dropout` to define the required layers.

For comparison, using this notation, `TorchShallowNeuralClassifier` defines the following graph:

$$\begin{align}
h_{1} &= xW_{1} + b_{1} \\
h_{2} &= f(h_{1}) \\
h_{3} &= h_{2}W_{2} + b_{2}
\end{align}$$

The following code starts this sub-class for you, so that you can concentrate on `define_graph`. Be sure to make use of `self.dropout_prob`.

For this problem, submit just your completed  `TorchDeepNeuralClassifier`. You needn't evaluate it, though we assume you will be keen to do that!

You can use `test_TorchDeepNeuralClassifier` to ensure that your network has the intended structure.

In [6]:
import torch.nn as nn

class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self, dropout_prob=0.7, **kwargs):
        self.dropout_prob = dropout_prob
        super().__init__(**kwargs)
    
    def define_graph(self):
        """Complete this method!
        
        Returns
        -------
        an `nn.Module` instance, which can be a free-standing class you 
        write yourself, as in `torch_rnn_classifier`, or the output of 
        `nn.Sequential`, as in `torch_shallow_neural_classifier`.
        
        """
        ##### YOUR CODE HERE
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Dropout(self.dropout_prob),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes_))

    

##### YOUR CODE HERE    




In [79]:
def test_TorchDeepNeuralClassifier(TorchDeepNeuralClassifier):
    dropout_prob = 0.55
    assert hasattr(TorchDeepNeuralClassifier(), "dropout_prob"), \
        "TorchDeepNeuralClassifier must have an attribute `dropout_prob`."
    try:
        inst = TorchDeepNeuralClassifier(dropout_prob=dropout_prob)
    except TypeError:
        raise TypeError("TorchDeepNeuralClassifier must allow the user "
                        "to set `dropout_prob` on initialization")
    inst.input_dim = 10
    inst.n_classes_ = 5
    graph = inst.define_graph()
    assert len(graph) == 4, \
        "The graph should have 4 layers; yours has {}".format(len(graph))    
    expected = {
        0: 'Linear',
        1: 'Dropout',
        2: 'Tanh',
        3: 'Linear'}
    for i, label in expected.items():
        name = graph[i].__class__.__name__
        assert label in name, \
            "The {} layer of the graph should be a {} layer; yours is {}".format(i, label, name)
    assert graph[1].p == dropout_prob, \
        "The user's value for `dropout_prob` should be the value of `p` for the Dropout layer."

In [80]:
test_TorchDeepNeuralClassifier(TorchDeepNeuralClassifier)

In [113]:
def run_hypothesis_only_evaluation_multiple_functions():
    ##### YOUR CODE HERE
    datasets = ['word_disjoint', 'edge_disjoint']
    vector_functions = [vec_concatenate, hypothesis_only, vec_diff, vec_max]
    evaluations = {}
    for dset in datasets:
        for func in vector_functions:
            print((dset, func.__name__))
            experiment = nli.wordentail_experiment(
                train_data=wordentail_data[dset]['train'],
                assess_data=wordentail_data[dset]['dev'], 
                model=TorchDeepNeuralClassifier(), 
                vector_func=glove_vec,
                vector_combo_func=func)
            evaluations[(dset, func.__name__)] = experiment['macro-F1']
                
    return evaluations

run_hypothesis_only_evaluation_multiple_functions()

('word_disjoint', 'vec_concatenate')


Finished epoch 100 of 100; error is 1.0344147235155106

              precision    recall  f1-score   support

           0      0.916     0.976     0.945      1910
           1      0.602     0.285     0.386       239

   micro avg      0.899     0.899     0.899      2149
   macro avg      0.759     0.630     0.666      2149
weighted avg      0.881     0.899     0.883      2149

('word_disjoint', 'hypothesis_only')


Finished epoch 100 of 100; error is 2.2804210633039474

              precision    recall  f1-score   support

           0      0.899     0.988     0.941      1910
           1      0.542     0.109     0.181       239

   micro avg      0.891     0.891     0.891      2149
   macro avg      0.720     0.549     0.561      2149
weighted avg      0.859     0.891     0.857      2149

('word_disjoint', 'vec_diff')


Finished epoch 100 of 100; error is 1.9885770380496979

              precision    recall  f1-score   support

           0      0.915     0.943     0.929      1910
           1      0.397     0.297     0.340       239

   micro avg      0.872     0.872     0.872      2149
   macro avg      0.656     0.620     0.634      2149
weighted avg      0.857     0.872     0.863      2149

('word_disjoint', 'vec_max')


Finished epoch 100 of 100; error is 2.2771204859018326

              precision    recall  f1-score   support

           0      0.903     0.977     0.939      1910
           1      0.469     0.159     0.238       239

   micro avg      0.886     0.886     0.886      2149
   macro avg      0.686     0.568     0.588      2149
weighted avg      0.855     0.886     0.861      2149

('edge_disjoint', 'vec_concatenate')


Finished epoch 100 of 100; error is 2.8542336970567703

              precision    recall  f1-score   support

           0      0.911     0.967     0.939      7376
           1      0.723     0.474     0.572      1321

   micro avg      0.892     0.892     0.892      8697
   macro avg      0.817     0.721     0.755      8697
weighted avg      0.883     0.892     0.883      8697

('edge_disjoint', 'hypothesis_only')


Finished epoch 100 of 100; error is 4.529859960079193

              precision    recall  f1-score   support

           0      0.901     0.973     0.936      7376
           1      0.728     0.403     0.519      1321

   micro avg      0.887     0.887     0.887      8697
   macro avg      0.815     0.688     0.727      8697
weighted avg      0.875     0.887     0.872      8697

('edge_disjoint', 'vec_diff')


Finished epoch 100 of 100; error is 4.731611818075185

              precision    recall  f1-score   support

           0      0.891     0.957     0.923      7376
           1      0.593     0.346     0.437      1321

   micro avg      0.865     0.865     0.865      8697
   macro avg      0.742     0.652     0.680      8697
weighted avg      0.846     0.865     0.849      8697

('edge_disjoint', 'vec_max')


Finished epoch 100 of 100; error is 4.723342582583427

              precision    recall  f1-score   support

           0      0.883     0.968     0.924      7376
           1      0.615     0.284     0.388      1321

   micro avg      0.864     0.864     0.864      8697
   macro avg      0.749     0.626     0.656      8697
weighted avg      0.842     0.864     0.842      8697



{('word_disjoint', 'vec_concatenate'): 0.665812330092614,
 ('word_disjoint', 'hypothesis_only'): 0.5612978942055689,
 ('word_disjoint', 'vec_diff'): 0.6342894490208651,
 ('word_disjoint', 'vec_max'): 0.5880813222724988,
 ('edge_disjoint', 'vec_concatenate'): 0.7554944328847185,
 ('edge_disjoint', 'hypothesis_only'): 0.7274513699333226,
 ('edge_disjoint', 'vec_diff'): 0.6799595423985118,
 ('edge_disjoint', 'vec_max'): 0.6560119639427279}

### Your original system [3 points]

This is a simple dataset, but our focus on the 'word_disjoint' condition ensures that it's a challenging one, and there are lots of modeling strategies one might adopt. 

You are free to do whatever you like. We require only that your system differ in some way from those defined in the preceding questions. They don't have to be completely different, though. For example, you might want to stick with the model but represent examples differently, or the reverse.

Keep in mind that, for the bake-off evaluation, the 'edge_disjoint' portions of the data are off limits. You can, though, train on the combination of the 'word_disjoint' 'train' and 'dev' portions. You are free to use different pretrained word vectors and the like. Please do not introduce additional entailment datasets into your training data, though.
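Since each portion is a plain list of examples, combining 'train' and 'dev' for final training can be as simple as list concatenation. The sketch below uses a made-up dictionary, `toy_word_disjoint`, as a stand-in for `wordentail_data['word_disjoint']`.

```python
# Toy stand-in for wordentail_data['word_disjoint'].
toy_word_disjoint = {
    'train': [[['herring', 'animal'], 1], [['sweater', 'stroke'], 0]],
    'dev': [[['puppy', 'dog'], 1]]}

# Concatenating the two lists leaves the originals unchanged.
full_train = toy_word_disjoint['train'] + toy_word_disjoint['dev']

print(len(full_train))
```

Once 'dev' has been folded into the training data it can no longer serve as an honest assessment set, so this step is best left until model selection is finished.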

Please embed your code in this notebook so that we can rerun it.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [126]:
glove_dim = 300

glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE = utils.glove2dict(glove_src)

def glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE.get(w, randvec(w, n=glove_dim))

def glove_diff(premise, hypothesis):
    distance = np.subtract(premise, hypothesis)
    premise_distance = np.concatenate((premise, distance))
    premise_distance_hypothesis = np.concatenate((premise_distance, hypothesis))
    return premise_distance_hypothesis
    
def run_hypothesis_only_evaluation_multiple_functions():
    ##### YOUR CODE HERE
    datasets = ['word_disjoint', 'edge_disjoint']
    vector_functions = [glove_diff, vec_concatenate, hypothesis_only, vec_diff, vec_max]
    evaluations = {}
    for dset in datasets:
        for func in vector_functions:
            print((dset, func.__name__))
            experiment = nli.wordentail_experiment(
                train_data=wordentail_data[dset]['train'],
                assess_data=wordentail_data[dset]['dev'], 
                model=TorchDeepNeuralClassifier(dropout_prob=0.15), 
                vector_func=glove_vec,
                vector_combo_func=func)
            evaluations[(dset, func.__name__)] = experiment['macro-F1']
                
    return evaluations

run_hypothesis_only_evaluation_multiple_functions()

Finished epoch 1 of 100; error is 3.69259774684906

('word_disjoint', 'glove_diff')


Finished epoch 100 of 100; error is 0.11436278140172362

              precision    recall  f1-score   support

           0      0.921     0.946     0.933      1910
           1      0.447     0.351     0.393       239

   micro avg      0.879     0.879     0.879      2149
   macro avg      0.684     0.649     0.663      2149
weighted avg      0.868     0.879     0.873      2149

('word_disjoint', 'vec_concatenate')


Finished epoch 100 of 100; error is 0.10753945540636778

              precision    recall  f1-score   support

           0      0.920     0.923     0.921      1910
           1      0.366     0.356     0.361       239

   micro avg      0.860     0.860     0.860      2149
   macro avg      0.643     0.639     0.641      2149
weighted avg      0.858     0.860     0.859      2149

('word_disjoint', 'hypothesis_only')


Finished epoch 100 of 100; error is 1.6349783837795258

              precision    recall  f1-score   support

           0      0.901     0.966     0.933      1910
           1      0.363     0.155     0.217       239

   micro avg      0.876     0.876     0.876      2149
   macro avg      0.632     0.560     0.575      2149
weighted avg      0.841     0.876     0.853      2149

('word_disjoint', 'vec_diff')


Finished epoch 100 of 100; error is 0.15161572117358446

              precision    recall  f1-score   support

           0      0.913     0.889     0.901      1910
           1      0.266     0.322     0.292       239

   micro avg      0.826     0.826     0.826      2149
   macro avg      0.590     0.606     0.596      2149
weighted avg      0.841     0.826     0.833      2149

('word_disjoint', 'vec_max')


Finished epoch 100 of 100; error is 1.0280232951045036

              precision    recall  f1-score   support

           0      0.904     0.966     0.934      1910
           1      0.407     0.184     0.254       239

   micro avg      0.879     0.879     0.879      2149
   macro avg      0.656     0.575     0.594      2149
weighted avg      0.849     0.879     0.859      2149

('edge_disjoint', 'glove_diff')


Finished epoch 100 of 100; error is 0.36626478377729654

              precision    recall  f1-score   support

           0      0.924     0.943     0.934      7376
           1      0.642     0.568     0.603      1321

   micro avg      0.886     0.886     0.886      8697
   macro avg      0.783     0.756     0.768      8697
weighted avg      0.881     0.886     0.883      8697

('edge_disjoint', 'vec_concatenate')


Finished epoch 100 of 100; error is 0.3725319569930434

              precision    recall  f1-score   support

           0      0.918     0.953     0.935      7376
           1      0.669     0.525     0.588      1321

   micro avg      0.888     0.888     0.888      8697
   macro avg      0.793     0.739     0.762      8697
weighted avg      0.880     0.888     0.883      8697

('edge_disjoint', 'hypothesis_only')


Finished epoch 100 of 100; error is 3.3843596279621124

              precision    recall  f1-score   support

           0      0.910     0.957     0.933      7376
           1      0.662     0.469     0.549      1321

   micro avg      0.883     0.883     0.883      8697
   macro avg      0.786     0.713     0.741      8697
weighted avg      0.872     0.883     0.874      8697

('edge_disjoint', 'vec_diff')


Finished epoch 100 of 100; error is 0.7115450352430344

              precision    recall  f1-score   support

           0      0.904     0.948     0.926      7376
           1      0.602     0.438     0.507      1321

   micro avg      0.871     0.871     0.871      8697
   macro avg      0.753     0.693     0.716      8697
weighted avg      0.858     0.871     0.862      8697

('edge_disjoint', 'vec_max')


Finished epoch 100 of 100; error is 1.8269821777939796

              precision    recall  f1-score   support

           0      0.907     0.939     0.923      7376
           1      0.577     0.463     0.513      1321

   micro avg      0.867     0.867     0.867      8697
   macro avg      0.742     0.701     0.718      8697
weighted avg      0.857     0.867     0.861      8697



{('word_disjoint', 'glove_diff'): 0.6632674235911422,
 ('word_disjoint', 'vec_concatenate'): 0.6411412485984875,
 ('word_disjoint', 'hypothesis_only'): 0.5747667187663273,
 ('word_disjoint', 'vec_diff'): 0.5962312113174182,
 ('word_disjoint', 'vec_max'): 0.5940246404623789,
 ('edge_disjoint', 'glove_diff'): 0.7681490473548221,
 ('edge_disjoint', 'vec_concatenate'): 0.7617307939242727,
 ('edge_disjoint', 'hypothesis_only'): 0.7409508381755636,
 ('edge_disjoint', 'vec_diff'): 0.7161780203629226,
 ('edge_disjoint', 'vec_max'): 0.718158682148762}
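
The dictionary above is easier to compare once the scores are grouped by split and sorted. The following is a minimal sketch, not part of the original submission: the values are copied from the output above, and `rank_by_dataset` is a hypothetical helper introduced only for this illustration.

```python
from collections import defaultdict

# Macro-F1 values copied verbatim from the dictionary printed above.
evaluations = {
    ('word_disjoint', 'glove_diff'): 0.6632674235911422,
    ('word_disjoint', 'vec_concatenate'): 0.6411412485984875,
    ('word_disjoint', 'hypothesis_only'): 0.5747667187663273,
    ('word_disjoint', 'vec_diff'): 0.5962312113174182,
    ('word_disjoint', 'vec_max'): 0.5940246404623789,
    ('edge_disjoint', 'glove_diff'): 0.7681490473548221,
    ('edge_disjoint', 'vec_concatenate'): 0.7617307939242727,
    ('edge_disjoint', 'hypothesis_only'): 0.7409508381755636,
    ('edge_disjoint', 'vec_diff'): 0.7161780203629226,
    ('edge_disjoint', 'vec_max'): 0.718158682148762,
}

def rank_by_dataset(results):
    """Hypothetical helper: group scores by dataset, best score first."""
    grouped = defaultdict(list)
    for (dset, func_name), score in results.items():
        grouped[dset].append((func_name, score))
    return {
        dset: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for dset, pairs in grouped.items()
    }

rankings = rank_by_dataset(evaluations)
for dset, pairs in rankings.items():
    print(dset)
    for func_name, score in pairs:
        print(f"  {func_name:<16} {score:.3f}")
```

On both splits `glove_diff` ranks first and `vec_concatenate` second. Each score comes from a single training run, so small gaps should be read with caution: the separate `glove_diff` run on `word_disjoint` in the system-description cell scores 0.649 rather than 0.663.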

In [10]:
# Enter your system description in this cell.
# Please do not remove this comment.
'''
My system uses larger (300-dimensional) GloVe representations, an input formed by
concatenating the premise vector, the difference between the premise and hypothesis
vectors, and the hypothesis vector, and a simple neural model with a tuned dropout rate.
The rationale is that larger representations could better capture the relation between
premise and hypothesis. This is aided by subtracting the hypothesis vector from the
premise vector (to indicate the distance between them) and by a relatively shallow
network with a hand-tuned dropout rate.

Bake-off code from the cell above, restated to ensure everything runs smoothly:
'''
glove_dim = 300

glove_src = os.path.join(GLOVE_HOME, 'glove.6B.{}d.txt'.format(glove_dim))

# Creates a dict mapping strings (words) to GloVe vectors:
GLOVE = utils.glove2dict(glove_src)

def glove_vec(w):    
    """Return `w`'s GloVe representation if available, else return 
    a random vector."""
    return GLOVE.get(w, randvec(w, n=glove_dim))

def glove_diff(premise, hypothesis):
    distance = np.subtract(premise, hypothesis)
    premise_distance = np.concatenate((premise, distance))
    premise_distance_hypothesis = np.concatenate((premise_distance, hypothesis))
    return premise_distance_hypothesis

experiment = nli.wordentail_experiment(
    train_data=wordentail_data['word_disjoint']['train'],
    assess_data=wordentail_data['word_disjoint']['dev'], 
    model=TorchDeepNeuralClassifier(dropout_prob=0.15), 
    vector_func=glove_vec,
    vector_combo_func=glove_diff)


Finished epoch 100 of 100; error is 0.1038165153004229

              precision    recall  f1-score   support

           0      0.920     0.930     0.925      1910
           1      0.390     0.356     0.372       239

   micro avg      0.866     0.866     0.866      2149
   macro avg      0.655     0.643     0.649      2149
weighted avg      0.861     0.866     0.864      2149
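
To make the input representation concrete, here is a small sketch of what `glove_diff` produces. The 4-dimensional vectors are made-up stand-ins for GloVe vectors, and the single `np.concatenate` call is an equivalent rewrite of the two nested calls in the cell above, not the submitted code.

```python
import numpy as np

def glove_diff(premise, hypothesis):
    """Concatenate premise, (premise - hypothesis), and hypothesis."""
    distance = np.subtract(premise, hypothesis)
    # One call over three arrays gives the same result as nesting two calls.
    return np.concatenate((premise, distance, hypothesis))

# Toy 4-dimensional vectors with made-up values (assumption for illustration).
premise = np.array([1.0, 2.0, 3.0, 4.0])
hypothesis = np.array([0.5, 2.0, 1.0, 0.0])

combined = glove_diff(premise, hypothesis)
print(combined.shape)  # (12,)
print(combined)
```

The output is three times as long as each word vector, so with `glove_dim = 300` the classifier receives 900-dimensional inputs.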



## Bake-off [1 point]

The goal of the bake-off is to achieve the highest macro-average F1 score on __word_disjoint__, on a test set that we will make available at the start of the bake-off. The announcement will go out on the discussion forum. To enter, you'll be asked to run `nli.bake_off_evaluation` on the output of your chosen `nli.wordentail_experiment` run. 

The cells below this one constitute your bake-off entry.

The rules described in the [Your original system](#Your-original-system-[3-points]) homework question are also in effect for the bake-off.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.

In [12]:
# Enter your bake-off assessment code into this cell. 
# Please do not remove this comment.
##### YOUR CODE HERE
test_data_filename = os.path.join(
    NLIDATA_HOME,
    "bakeoff-wordentail-data",
    "nli_wordentail_bakeoff_data-test.json")

nli.bake_off_evaluation(
    experiment,
    test_data_filename)




              precision    recall  f1-score   support

           0      0.864     0.839     0.852      1767
           1      0.429     0.478     0.452       446

   micro avg      0.766     0.766     0.766      2213
   macro avg      0.646     0.658     0.652      2213
weighted avg      0.776     0.766     0.771      2213
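
The macro-average F1 used for the bake-off is the unweighted mean of the per-class F1 scores, which is why it sits well below the weighted average when the minority class is hard. A quick check, using only the per-class F1 and support values from the report above:

```python
# Per-class F1 and support, copied from the classification report above.
per_class = {
    0: {"f1": 0.852, "support": 1767},
    1: {"f1": 0.452, "support": 446},
}

f1_scores = [row["f1"] for row in per_class.values()]
supports = [row["support"] for row in per_class.values()]

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

Rounded to three places, both values agree with the `macro avg` and `weighted avg` rows of the report (0.652 and 0.771).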



In [None]:
# On an otherwise blank line in this cell, please enter
# your macro-avg f1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.

##### YOUR CODE HERE
0.652

