# Performing Natural Language Inference with the SNLI Dataset

The Stanford Natural Language Inference (SNLI) dataset is a recently released corpus of sentence pairs labelled with inferential relationships. The first sentence in each pair can entail, contradict, or be neutral with respect to the second sentence, and our goal is to build models that learn to predict this relationship for novel sentence pairs.
The dataset consists of a training set of ~500,000 labelled sentence pairs, along with development and test sets that each contain ~10,000 labelled sentence pairs. More information about the dataset can be found here.

In the rest of this notebook, we'll compare a number of methods for learning to label the sentence pairs in this dataset. As a simple baseline, we'll first build distributed bag-of-words representations for each sentence, and then use these sentence representations as input to a multilayer perceptron that predicts a class label.

## 1. Bag-of-Words with Random Embeddings

First, we'll use sentence representations that are a sum of initially random word embeddings that are learned over the course of training. To start, we'll load the dataset and build a vocabulary. Then, we'll extract the labelled sentence pairs for the training and development sets:

In [27]:
import numpy as np

from pysem.corpora import SNLI
from pysem.utils.ml import MultiLayerPerceptron
from pysem.utils.vsa import normalize

snli = SNLI(path='/Users/peterblouw/corpora/snli_1.0/')
snli.build_vocab()

snli.extractor = snli.get_xy_pairs
train_data = [pair for pair in snli.train_data if pair.label != '-']
dev_data = [pair for pair in snli.dev_data if pair.label != '-']
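
The filter above drops pairs whose label is `'-'`, which SNLI uses for pairs where the annotators did not reach a consensus label. As a minimal sketch of what that filter does, the snippet below uses a hypothetical `Pair` namedtuple as a stand-in for the objects returned by `pysem` (the real class may differ), with made-up sentences:

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the pair objects produced by pysem's SNLI class
Pair = namedtuple('Pair', ['sentence1', 'sentence2', 'label'])

# Made-up examples, one per label plus one pair without a gold label
toy_pairs = [
    Pair('A man is playing a guitar.', 'A person is making music.', 'entailment'),
    Pair('A man is playing a guitar.', 'The man is asleep.', 'contradiction'),
    Pair('A man is playing a guitar.', 'The man is on a stage.', 'neutral'),
    Pair('A dog runs in a field.', 'An animal is outside.', '-'),
]

# Drop pairs without a gold label, mirroring the list comprehensions above
labelled = [pair for pair in toy_pairs if pair.label != '-']
label_counts = Counter(pair.label for pair in labelled)

print(len(toy_pairs), len(labelled))
print(sorted(label_counts.items()))
```

Only the three pairs with a usable label survive, so the classifier never sees a fourth "unknown" class.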

Next, we'll use scikit-learn to make a count vectorizer that converts sentences into binary vectors. These vectors will be used to select the corresponding word embeddings from an initially random embedding matrix. We use 300-dimensional embeddings for consistency with the pretrained embeddings used in the next example.

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

dim = 300
vectorizer = CountVectorizer(binary=True)
vectorizer.fit(snli.vocab)

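# One column per vocabulary entry; vocabulary_ is available in all scikit-learn versions,
# whereas get_feature_names() has been removed from recent releases
embedding_matrix = np.random.normal(loc=0, scale=1/np.sqrt(dim),
                                    size=(dim, len(vectorizer.vocabulary_)))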
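
To see why multiplying the embedding matrix by a binary indicator vector yields a bag-of-words sentence representation, here is a small NumPy-only sketch. The vocabulary, the `indicator` helper, and the dimensions are illustrative assumptions, not part of `pysem` or scikit-learn; note also that `CountVectorizer`'s default tokenizer ignores single-character tokens, which this toy helper does not.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ['a', 'dog', 'man', 'runs', 'sleeps']  # toy vocabulary (illustrative)
word_to_index = {word: i for i, word in enumerate(vocab)}
toy_dim = 4

# One column per vocabulary word, laid out like embedding_matrix above
toy_embeddings = rng.normal(loc=0, scale=1 / np.sqrt(toy_dim),
                            size=(toy_dim, len(vocab)))

def indicator(sentence):
    """Binary vector with a 1 for every vocabulary word present in the sentence."""
    vec = np.zeros(len(vocab))
    for word in sentence.lower().split():
        if word in word_to_index:
            vec[word_to_index[word]] = 1
    return vec

x = indicator('a dog runs')
sentence_vector = toy_embeddings @ x

# The product equals the sum of the columns for 'a', 'dog' and 'runs'
manual_sum = sum(toy_embeddings[:, word_to_index[w]] for w in ['a', 'dog', 'runs'])
print(np.allclose(sentence_vector, manual_sum))
```

Because the indicators are binary, a word that appears several times in a sentence contributes its embedding only once.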

With an MLP for label prediction, it's now possible to learn a set of embeddings and classifier parameters on the training data, and then see how well they generalize to the development data.

In [35]:
import random

classifier = MultiLayerPerceptron(di=dim*2, dh=dim, do=3)

minibatch_size = 100
iters = 1000
rate = 0.1

# Train using randomly selected minibatches: 1000 iterations x 100 samples = 100,000 samples,
# or roughly a fifth of one epoch over the ~500,000 training pairs
for _ in range(iters):
    minibatch = random.sample(train_data, minibatch_size)

    s1s = [sample.sentence1 for sample in minibatch]
    s2s = [sample.sentence2 for sample in minibatch]
    
    s1_indicators = vectorizer.transform(s1s).toarray().T
    s2_indicators = vectorizer.transform(s2s).toarray().T
    
    s1_embeddings = np.dot(embedding_matrix, s1_indicators)
    s2_embeddings = np.dot(embedding_matrix, s2_indicators)

    # concatenate the sentence representations for input to classifier
    xs = np.vstack((s1_embeddings, s2_embeddings))
    ys = snli.binarize([sample.label for sample in minibatch])

    # update the classifier parameters
    classifier.train(xs, ys, rate=rate)
    
    # update the embedding matrix
    s1s_grad = np.dot(classifier.yi_grad[:dim], s1_indicators.T) 
    s2s_grad = np.dot(classifier.yi_grad[dim:], s2_indicators.T)
    embedding_grad = (s1s_grad + s2s_grad) / minibatch_size
    embedding_matrix -= rate * embedding_grad
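
The embedding update above follows from the chain rule: since each sentence representation is `embedding_matrix` times an indicator matrix, the gradient with respect to the embeddings is the gradient at the classifier input times the transposed indicators. The sketch below checks this numerically on a toy problem. It assumes that `classifier.yi_grad` holds the loss gradient with respect to the classifier input, which the code above implies but does not state, and it substitutes a simple squared-error loss for the MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_dim, vocab_size, batch = 3, 5, 4

E = rng.normal(size=(toy_dim, vocab_size))                      # embedding matrix
X = rng.integers(0, 2, size=(vocab_size, batch)).astype(float)  # binary indicators
T = rng.normal(size=(toy_dim, batch))                           # arbitrary targets

def loss(E):
    # Toy loss standing in for the classifier: squared error on the summed embeddings
    return 0.5 * np.sum((E @ X - T) ** 2)

input_grad = E @ X - T            # dL/d(E @ X), the role played by yi_grad[:dim]
analytic_grad = input_grad @ X.T  # chain rule: dL/dE

# Central finite differences, perturbing one entry of E at a time
numeric_grad = np.zeros_like(E)
eps = 1e-6
for i in range(toy_dim):
    for j in range(vocab_size):
        step = np.zeros_like(E)
        step[i, j] = eps
        numeric_grad[i, j] = (loss(E + step) - loss(E - step)) / (2 * eps)

max_error = np.max(np.abs(analytic_grad - numeric_grad))
print(max_error)
```

The analytic and numerical gradients agree up to floating-point error, which is the same product used for `s1s_grad` and `s2s_grad` in the training loop.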

After training, the accuracy on the training set and the development set can be computed as follows:

In [43]:
from itertools import islice

def compute_accuracy(data):
    n_correct = 0
    n_total = 0
    batchsize = 5000
    data = iter(data)  # ensure islice consumes the data, even if a list is passed in
        
    while True:
        batch = list(islice(data, batchsize))
        n_total += len(batch)
        if len(batch) == 0:
            break

        s1s = [sample.sentence1 for sample in batch]
        s2s = [sample.sentence2 for sample in batch]
    
        s1_indicators = vectorizer.transform(s1s).toarray().T
        s2_indicators = vectorizer.transform(s2s).toarray().T

        s1_embeddings = np.dot(embedding_matrix, s1_indicators)
        s2_embeddings = np.dot(embedding_matrix, s2_indicators)
    
        xs = np.vstack((s1_embeddings, s2_embeddings))
        ys = snli.binarize([sample.label for sample in batch])
        
        predictions = classifier.predict(xs)
        n_correct += sum(np.equal(predictions, np.argmax(ys, axis=0)))
        
    print('Accuracy: ', n_correct / n_total)

compute_accuracy((i for i in dev_data))

5000
4842
0
Accuracy:  0.358057305426
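
The accuracy computation compares predicted class indices against the argmax of one-hot label columns. As a rough sketch of that comparison, the snippet below uses a hypothetical `one_hot` helper in place of `snli.binarize`; the label ordering and the predictions are made up for illustration, and the actual `pysem` implementation may differ.

```python
import numpy as np

LABELS = ['entailment', 'neutral', 'contradiction']  # assumed ordering, for illustration

def one_hot(labels):
    """Return a (n_classes, n_samples) array with a single 1 per column."""
    ys = np.zeros((len(LABELS), len(labels)))
    for column, label in enumerate(labels):
        ys[LABELS.index(label), column] = 1
    return ys

gold = ['entailment', 'contradiction', 'neutral', 'neutral']
ys = one_hot(gold)

predictions = np.array([0, 2, 0, 1])  # made-up class indices from a classifier
targets = np.argmax(ys, axis=0)       # recover class indices from the one-hot columns

n_correct = int(np.sum(predictions == targets))
accuracy = n_correct / len(gold)
print(accuracy)  # 0.75
```

Three of the four made-up predictions match their targets, giving an accuracy of 0.75.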
