# Natural Language Inference using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

In this lab we'll work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a sentence P, if H definitely describe something true given P then it is an **entailment**. If H describe something that's *maybe* true given P, it's **neutral**, and if H describe something that's definitely *false* given P it's a **contradiction**. 

Name: **MAX BOHOLM**

# 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/). We prepared a "simplified" version, with only the relevant columns [here](https://gubox.box.com/s/idd9b9cfbks4dnhznps0gjgbnrzsvfs4).

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

Like in the previous lab, we'll use torchtext to build a dataloader. You can essentially do the same thing as you did in the last lab, but with our new dataset. **[1 mark]**

In [1]:
import torch
import torch.nn as nn
import torchtext
import torch.nn.functional as F

device = torch.device('cuda:0')
#device = torch.device('cpu')

In [2]:
batch_size = 16
epochs = 3
my_dimensions = 100
learning_rate = 0.001

**Note:** From time to time (or rather most of the time), there is a memory problem on MLTGPU, resulting in the following error:

    RuntimeError: CUDA error: out of memory

I have had runs of the code all the way through with (if I remember correctly): the complete trainingset, `my_dimensions = 100`, `batch_size = 16`, and `epochs = 3`, resulting in an accuracy of about 48%. 

**We can ignore this function**, but `mini_me()` is useful when developing the model with a smaller data set. 

    import random

    def mini_me(cutoff, directory="snli-data", original="train.csv", out="mini_me.csv"):
        with open(f"{directory}/{original}", mode="r") as f:
            data=[x.split("\t") for x in f.read().split("\n")]

        random.shuffle(data)

        with open(f"{directory}/{out}", mode="w") as f:
            f.write("\n".join(["\t".join(x) for x in data][:cutoff]))

    mini_me(cutoff=1000)

**Note:** In order to avoid 

    AttributeError: 'Example' object has no attribute 'context'

when iterating over the train and test data from the `dataloader` I have removed empty lines in the `train.csv` and `test.csv` files. I could have implemented some code here to do that from the Jupyter Notebook file, but I have not doe that.

In [4]:
#from torchtext.legacy.data import Field, BucketIterator, Iterator, TabularDataset # Needed for running this on my laptop
from torchtext.data import Field, BucketIterator, Iterator, TabularDataset

def dataloader(directory="snli-data",
               train_file="train.csv",
               #train_file="mini_me.csv",
               test_file="test.csv",
               batch=batch_size):
    
    whitespacer = lambda x: x.split(' ') #from: https://canvas.gu.se/files/4597768/download?download_frd=1
  
    SENTENCE = Field(tokenize   = whitespacer,
                    lower       = True,
                    batch_first = True,
                    init_token  = "<start>", 
                    eos_token   = "<end>"
                   ) 
    
    LABEL = Field(batch_first = True)    
    
    my_fields = [("premise", SENTENCE),
                 ("hypothesis", SENTENCE),
                 ("label", LABEL)]
    
    train, test = TabularDataset.splits(path   = directory,
                                        train  = train_file,
                                        test   = test_file,
                                        format = 'csv',
                                        fields = my_fields,
                                        csv_reader_params = {'delimiter':'\t',
                                                             'quotechar':'¤'}) 
                                        #"¤" not in data
    SENTENCE.build_vocab(train) 
    LABEL.build_vocab(train)

    train_iter, test_iter = BucketIterator.splits((train, test),
                                                  batch_size        = batch,
                                                  sort_within_batch = True,
                                                  sort_key          = lambda x: len(x.premise),
                                                  shuffle           = True,
                                                  device            = device)

    return train_iter, test_iter, SENTENCE.vocab, LABEL.vocab

# 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMS)
    2) Perform max over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship 

### Creating a representation of a sentence

Let's first consider step 2 where we perform max/mean pooling. There is a function in pytorch for this, but we'll implement it from scratch. 

Let's consider the general case, what we want to do for these methods is apply some function $f$ along dimension $i$, and we want to do this for all $i$'s. As an example we consider the matrix S with size ``(N, D)`` where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement both the max pooling method. When performing max-pooling, $max$ will be the function which selects a _maximum_ value from a vector and $x$ is the output, thus for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = max(s_{1d}, s_{2d}, ..., s_{nd})
\end{equation}


This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the words representations in the sentence. 

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)`` then performs max pooling and returns the result (the output should be of size: ```(batch_size, dimensions)```). [**4 Marks**]

In [5]:
def pooling(input_tensor):
    output_tensor=torch.max(input_tensor, dim=1).values #`keepdim=False` by default, which gives us squeezed output
    return output_tensor

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here, what we do is concatenating P, H, P times H, and the absolute value of P minus H, then return the result.

Implement the function. **[2 marks]**

In [6]:
def combine_premise_and_hypothesis(premise, hypothesis):
    
    abs_difference=torch.abs(premise - hypothesis)
    multiplied=premise*hypothesis
    
    output=torch.cat((premise, hypothesis, abs_difference, multiplied), dim=1)
    
    return output

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionaly, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling. 

Implement the model [**6 marks**]

In [7]:
class SNLIModel(nn.Module):
    def __init__(self, voc_size, n_dimensions, n_labels):
        super(SNLIModel, self).__init__()
        self.embeddings = nn.Embedding(voc_size, n_dimensions)
        self.rnn = nn.LSTM(n_dimensions, n_dimensions, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(n_dimensions*2*4, n_labels)
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, premise, hypothesis):
        p = self.embeddings(premise)
        h = self.embeddings(hypothesis)
        
        seq_p, *_ = self.rnn(p)
        seq_h, *_ = self.rnn(h)
        
        p_pooled = pooling(seq_p)
        h_pooled = pooling(seq_h)
        
        ph_representation = combine_premise_and_hypothesis(p_pooled, h_pooled)
        
        drop = self.dropout(ph_representation)
        
        predictions = self.classifier(drop)
        
        return predictions

# 3. Training and testing

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[2 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [9]:
#Training...

import torch.optim as optim

train_iter, test_iter, vocab, labels = dataloader()

model = SNLIModel(voc_size=len(vocab),
                  n_dimensions=my_dimensions,
                  n_labels=len(labels))

model.to(device)

loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

total_loss = 0
for e in range(epochs):
    for i, batch in enumerate(train_iter):
        #print(len(batch))
        p = batch.premise
        h = batch.hypothesis
        label = batch.label
        
        output = model(p, h)
        #print(output)
        #print(label)
        
        loss = loss_function(output, label.squeeze())
        
        # For a batch sized 1, there needs to do some transformations 
        # so that the loss function does not complain
        # Since, the solution below is suboptimal, it is
        # commented out.
        # if len(batch) == 1:  
        #     loss = loss_function(output, label.squeeze(1))
        # else:
        #     loss = loss_function(output, label.squeeze()) 
        
        #Note: code below adopted from previous assignment
        total_loss += loss.item()
        print(total_loss/(i+1), end='\r') 
        loss.backward() # compute gradients
        optimizer.step() # update parameters
        optimizer.zero_grad # reset gradients
        
        #break
    print()

17.848468930197434
51.987904647312946
92.962872322101166


In [10]:
#Testing ...
correct_set = []
correct_per_relation = {label:[] for label in [labels.itos[x] for x in range(len(labels))]}
model.eval() #evaluation mode

for i, batch in enumerate(test_iter):
    print(f"{round((i/len(test_iter))*100, 3)} %", end="\r")
    p = batch.premise
    h = batch.hypothesis
    label = batch.label
    
    output = model(p, h)
    
    my_probs = F.softmax(output, dim=1)
    index_of_top_prob = torch.max(my_probs, dim=1).indices
    predicted_label = [labels.itos[x] for x in index_of_top_prob]
    if len(batch) == 1:
        true_label = [labels.itos[x] for x in label.squeeze(1)]
    else:
        true_label = [labels.itos[x] for x in label.squeeze()]
    
    for prediction, truth in zip(predicted_label, true_label):
        if prediction == truth:
            correct_set.append(1)
            correct_per_relation[truth].append(1)
        else:
            correct_set.append(0)
            correct_per_relation[truth].append(0)

accuracy = sum(correct_set) / len(correct_set)

accuracy_per_relation = {label:0 for label in correct_per_relation.keys()}

for label in correct_per_relation.keys():
    if len(correct_per_relation[label]) == 0:
        accuracy_per_relation[label] = "NA"
    else:
        mean = sum(correct_per_relation[label]) / len(correct_per_relation[label])
        accuracy_per_relation[label] = mean
    
print(f"Accuracy: {accuracy}")
print()
print("Relation\tAccuracy")

for relation in ["contradiction", "neutral", "entailment"]:
    print("{}\t{}".format(relation, accuracy_per_relation[relation])) 


Accuracy: 0.3986

Relation	Accuracy
contradiction	0.1742354031510658
neutral	0.1186703945324635
entailment	0.9026128266033254


Suggest a _baseline_ that we can compare our model against **[2 marks]**

**Your answer should go here**

*Naive Model* (Baseline): for every instance predict the most common relation (label) of the data (*Lmax*). 

Naive Model would get an accuacy of *count*(*Lmax*) / N. With three labels in balanced test and training samples where the labels are equally common, as the present ones, accuracy of Naive Model would be about 1/3. 

In [11]:
#A baseline
with open("snli-data/train.csv", mode="r") as f:
    data = [x.split("\t") for x in f.read().split("\n")]

counter = {}
for x in data:
    relation = x[-1]
    if relation in counter:
        counter[relation]+=1
    else:
        counter[relation]=1

#print(counter)
baseline = max(counter.values()) / len(data)
print(baseline)

0.3333914990766188


Suggest some ways (other than using a baseline) in which we can analyse the models performance **[4 marks]**.

**Your answer should go here**

A number of variations for evaluation of NLI models other than the simple train-and-test procedure implemented here have been discussed in the litterature (Talman et al, 2019 has been the main inspiration for this list).

1. The model can be trained and tested on various datasets othe than SNLI, e.g. *MultiNLI* and *SciTail*.
2. The model could be trained on one dataset, but evaluated on another. Annotation artifacts is a known problem of present datasets. Cross-dataset evaluation is a procedure to address this problem.
3. Detailed error analysis of what inference labels are handled best/worst in various datasets.
4. Linguisitc features analysis, including breaking tests or "stress tests" (Naik et al. 2018), using datasets (e.g. Breaking NLI) which has been designed to incorporate linguitic features known to be challenging for NLI, such as word overlap, negation, and antonymy. 
5. Transfer experiments. Given that the NLI task is designed so that sentence representations are treated separately, the sentence embeddings can be used in a in transfer learning in a downstream task. Thus, our model can be evaluated based on the preformance of the sentence representations it produces in tasks over and above NLI, e.g. in SentEval dataset. 
    


Suggest some ways to improve the model **[3 marks]**.

**Your answer should go here**

1. Conneau et al. (also Talman et al.) achieves high performances by implementing hierarchical structure of with several layers of LSTMs.
2. Talman et al. implements an iterative refinement architecture, where the input is "reconsidered" at deeper layers of LSTMs. They found this hierarchical structure to out outperform other hirearchical layouts (incl. stacked LSTMs layers).
3. Using pre-trained word embeddings is a possible venue for improvement.


#### Additional references ...

Naik A. et al 2018. "Stress test evaluation for natural language inference", *Proceedings of the 27th International Conference on Computational Linguistics*, pp. 2340-2353. 

Talman A. et al 2019. "Sentence embeddings in NLI with iterative refinement encoders", *Natural Language Engineering*, 25: 467-482

### Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.