# Demonstration and error analysis
In this notebook I demonstrate how the models I trained can be used to make predictions on novel premise-hypothesis pairs. Further, I present my evaluation results and carry out an error analysis, both for the NLI task and for the sentence embeddings.

## Dependencies
* [CheckList](https://github.com/marcotcr/checklist)
* PyTorch
* pandas
* NumPy
* json (Python standard library)
* copy (Python standard library)

Further, the model checkpoint should be downloaded from [this link](https://drive.google.com/drive/folders/18EWKTYv4CsF8mxgE7K4Ym6zHtqR6w6fF?usp=sharing) and placed in the `model_checkpoints` folder.

In [9]:
import pandas as pd
import json
import models
import torch 
import config
import mutils

import checklist
from checklist.editor import Editor
import numpy as np
import copy

device = "cuda" if torch.cuda.is_available() else "cpu"
seed = config.seed

## Performance scores
Here I present the performance scores of the models.

In [2]:
with open('../eval_results/final_model_results.json') as json_file:
    nli_dict = json.load(json_file)

In [3]:
results_table = pd.DataFrame.from_dict(nli_dict, orient='index')
results_table = results_table[['dev_acc', 'test_acc', 'micro', 'macro']]

Performance scores similar to Table 3 in the paper:

In [4]:
results_table

            dev_acc  test_acc  micro  macro
base          65.57     65.25  79.16  77.75
lstm          79.45     78.81  76.98  76.25
bilstm        78.51     78.55  79.63  78.88
bilstmpool    82.98     82.79  81.37  80.50


The performance scores for the LSTM and the BiLSTM with max pooling correspond to the results of Conneau et al. (2017) within a 3% margin for the NLI task and a 6% margin for SentEval. As in Conneau et al., the BiLSTM with max pooling performs best, both on NLI and on SentEval. Interestingly, however, the base model outperforms the LSTM on SentEval and performs roughly on par with the BiLSTM there, while its performance on the NLI task is much lower.
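
The `micro` and `macro` columns are not defined in this notebook. Assuming they follow the convention of Conneau et al. (2017), macro is the unweighted mean of the per-task dev accuracies and micro weights each task by its number of dev samples. The sketch below illustrates the difference; the task names, accuracies and sizes are made up and are not values from my evaluation.

```python
# Hypothetical per-task dev accuracies (%) and dev-set sizes (illustration only).
dev_acc = {"task_a": 80.0, "task_b": 90.0}
dev_size = {"task_a": 1000, "task_b": 3000}


def macro_average(acc):
    """Unweighted mean over tasks: every task counts equally."""
    return sum(acc.values()) / len(acc)


def micro_average(acc, size):
    """Mean over tasks, weighted by the number of dev samples of each task."""
    total = sum(size[task] for task in acc)
    return sum(acc[task] * size[task] for task in acc) / total


macro = macro_average(dev_acc)
micro = micro_average(dev_acc, dev_size)
print(f"macro = {macro:.2f}, micro = {micro:.2f}")
```

Under this assumption, a model that does well on the larger tasks gets a higher micro than macro score, which is one way the two columns can rank models differently.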

Performance scores for the individual SentEval tasks:

In [5]:
with open('eval_results/final_task_results.json') as json_file:
    sent_eval_scores = json.load(json_file)
        
sent_eval_results = pd.DataFrame.from_dict(sent_eval_scores, orient='index')
sent_eval_results

               MR     CR   SUBJ   MPQA   SST2   TREC  SICKEntailment   MRPC
base        74.88  78.46  90.11  84.85  78.21  67.31            80.8  71.20
lstm        72.13  77.73  85.96  84.84  76.26  61.45            82.2  73.11
bilstm      72.40  79.26  89.52  85.06  78.67  73.06            83.4  72.55
bilstmpool  75.76  81.78  91.57  85.64  79.93  75.26            84.4  74.31


The base model performs particularly well on SUBJ (subjectivity status) and MPQA (opinion polarity). All models score relatively low on TREC (question-type classification), particularly the LSTM. All models achieve satisfactory results on the SUBJ task.

BiLSTMPool performs the best on all tasks.
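
The comparisons above can be checked mechanically. The following sketch copies the scores from the table above into a plain dictionary (so it does not depend on the JSON file) and reports the best model per task, as well as the tasks on which the base model beats the LSTM.

```python
# Scores copied from the SentEval table above.
tasks = ["MR", "CR", "SUBJ", "MPQA", "SST2", "TREC", "SICKEntailment", "MRPC"]
scores = {
    "base":       [74.88, 78.46, 90.11, 84.85, 78.21, 67.31, 80.8, 71.2],
    "lstm":       [72.13, 77.73, 85.96, 84.84, 76.26, 61.45, 82.2, 73.11],
    "bilstm":     [72.4, 79.26, 89.52, 85.06, 78.67, 73.06, 83.4, 72.55],
    "bilstmpool": [75.76, 81.78, 91.57, 85.64, 79.93, 75.26, 84.4, 74.31],
}

# Best-scoring model for each task.
best_per_task = {}
for i, task in enumerate(tasks):
    best_per_task[task] = max(scores, key=lambda model: scores[model][i])

# Tasks on which the base model scores higher than the LSTM.
base_beats_lstm = [
    task for i, task in enumerate(tasks)
    if scores["base"][i] > scores["lstm"][i]
]

for task in tasks:
    print(f"{task:<15} best: {best_per_task[task]}")
print("base > lstm on:", ", ".join(base_beats_lstm))
```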

## Demonstration
Here I will demonstrate how to make predictions for a new hypothesis-premise pair with one of the models.
### Load models
Load all trained models and the GloVe embeddings of the tokens in the training data.

In [10]:
base_nli, _ = mutils.load_model('base', '../model_checkpoints/base_model_final', None)
lstm_nli, lstm_lstm = mutils.load_model('lstm', '../model_checkpoints/lstm_nli_model_final', '../model_checkpoints/lstm_lstm_model_final')
bilstm_nli, bilstm_lstm = mutils.load_model('bilstm', '../model_checkpoints/bilstm_nli_model_final', '../model_checkpoints/bilstm_lstm_model_final')
bilstmpool_nli, bilstmpool_lstm = mutils.load_model('bilstmpool', '../model_checkpoints/bilstmpool_nli_model_final', '../model_checkpoints/bilstmpool_lstm_model_final')

load bilstm encoder
load bilstm maxpool encoder


In [11]:
with open('train_word_dict.json') as json_file:
    word_dict = json.load(json_file)

embedding_model = mutils.load_embeddings(word_dict)

In [16]:
prediction_translate = {0 : 'entailment', 1: 'neutral', 2: 'contradiction'}

In [None]:
# select model name
model = 'bilstm'
# select corresponding nli model
nli = bilstm_nli
#define premise (s1) and hypothesis (s2)
s1 = 'the dog was hungry'
s2 = 'the cat was hungry'

In [19]:
# make prediction
prediction = mutils.predict(model, embedding_model, nli, s1, s2)
print(f"Prediction for {model} model for \nPremise: {s1} \nHypothesis: {s2}\n") 
print(f"Prediction = {prediction} : {prediction_translate[prediction]}")

Prediction for bilstm model for 
Premise: the dog was hungry 
Hypothesis: the cat was hungry

Prediction = 2 : contradiction


Here, the BiLSTM model makes an incorrect prediction: I would consider the relation between this premise and hypothesis to be neutral.

## Error analysis for NLI

In this section, I try out various difficult NLI test cases. To investigate whether the models make systematic mistakes, I create 100 example sentences per test case using [CheckList](https://github.com/marcotcr/checklist), and report the performance of each model.
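
To make the setup concrete, here is a rough sketch of what filling a template and drawing a reproducible sample amounts to. It is a pure-Python stand-in, not CheckList's implementation: `expand_template`, the slot values and the seed are hypothetical and chosen only for illustration.

```python
import itertools
import random


def expand_template(template, **slots):
    """Fill every combination of slot values into the template string."""
    names = list(slots)
    filled = []
    for values in itertools.product(*(slots[name] for name in names)):
        filled.append(template.format(**dict(zip(names, values))))
    return filled


# Hypothetical slot values; CheckList ships its own, much larger lexicons.
pairs = expand_template(
    "{first_name} is from {country}; {first_name} is not from {country}",
    first_name=["Scott", "Alan", "Hannah"],
    country=["Poland", "Bulgaria"],
)

rng = random.Random(42)        # fixed seed so the sample is reproducible
sample = rng.sample(pairs, 4)  # sampling without replacement: no duplicates

for pair in sample:
    premise, hypothesis = (part.strip() for part in pair.split(";"))
    print(f"premise={premise!r}  hypothesis={hypothesis!r}")
```

One design difference worth noting: `np.random.choice`, as used in the cells below, samples with replacement by default, so a test set of 100 sentences may contain duplicates, whereas `rng.sample` in this sketch never repeats a pair.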

In [30]:
label_dict = {0: 0, 1:0, 2:0}

#create a dict that keeps track of predictions per model
results_dict = {'base': copy.deepcopy(label_dict),
               'lstm' : copy.deepcopy(label_dict),
               'bilstm': copy.deepcopy(label_dict),
               'bilstmpool': copy.deepcopy(label_dict)}

model_tuples = [('base', base_nli),('lstm', lstm_nli), ('bilstm', bilstm_nli), ('bilstmpool', bilstmpool_nli)]

def predict_all_models(result_dict, s1, s2):
    """
    Makes predictions on a premise-hypothesis pair for all four models, stores predictions in results dict
    """
    for model, nli in model_tuples:
        prediction = mutils.predict(model, embedding_model, nli, s1, s2)
        result_dict[model][prediction] += 1
    return result_dict

def print_performances(result_dict, nsamples, correct_label):
    """
    Prints out a table with the performance scores on the testcase per model
    """
    performances = {}
    for model, _ in model_tuples:
        model_scores = copy.deepcopy(result_dict[model])
        model_scores['correct'] = str(round(model_scores[correct_label] / nsamples * 100,1)) + '%'
        performances[model] = model_scores
    performance_table = pd.DataFrame.from_dict(performances, orient='index')
    print(performance_table)

def eval_test(eval_sents, nsamples, correct_label):
    """
    Evaluates all models on a testcase of {nsamples} evaluation sentences;
    prints out a table with per-model performance scores
    """
    result_dict = copy.deepcopy(results_dict)
    for n, sent in enumerate(eval_sents):
        try:
            s1, s2 = sent.split(';')
        except ValueError:
            # skip sentences that do not split into exactly one premise and one hypothesis
            nsamples = nsamples - 1
            continue
        if n == 0:
            print(f"{nsamples} samples of the following structure where the correct label is " +
                  f"{correct_label} ({prediction_translate[correct_label]}):")
            print(f"Premise: '{s1}' \nHypothesis '{s2}'")
        result_dict = predict_all_models(result_dict, s1, s2)
    print_performances(result_dict, nsamples, correct_label)
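
The evaluation loop above can be tried out without loading any checkpoints. The sketch below mirrors its logic (split on `;`, tally the predicted labels, compute the percentage of correct predictions) with a hypothetical rule-based `stub_predict` in place of the trained models; the function names and example sentences are illustrative only.

```python
from collections import Counter


def stub_predict(premise, hypothesis):
    """Hypothetical stand-in for a trained model: 'not' means contradiction (2), else entailment (0)."""
    return 2 if "not" in hypothesis.split() else 0


def evaluate(pairs, correct_label, predict=stub_predict):
    """Tally predictions over 'premise; hypothesis' strings; return (tally, accuracy in %)."""
    tally = Counter()
    for pair in pairs:
        parts = pair.split(";")
        if len(parts) != 2:  # skip malformed pairs instead of counting them
            continue
        premise, hypothesis = (part.strip() for part in parts)
        tally[predict(premise, hypothesis)] += 1
    n_valid = sum(tally.values())
    accuracy = round(tally[correct_label] / n_valid * 100, 1) if n_valid else 0.0
    return tally, accuracy


pairs = [
    "Scott is from Poland; Scott is not from Poland",
    "Alan is from Bulgaria; Alan is not from Bulgaria",
    "Hannah is from Spain; Hannah is from Spain",  # the stub predicts entailment here
    "a malformed line without a separator",        # skipped
]
tally, accuracy = evaluate(pairs, correct_label=2)
print(dict(tally), f"{accuracy}%")
```

Dividing by the number of valid pairs, rather than by a count fixed in advance, keeps the percentage meaningful when some sentences are skipped.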

### Negation
In this test case, I investigate whether the models recognize a simple negation as a contradiction.

In [23]:
nsamples = 100
editor = Editor()

In [36]:
negation = editor.template('{first_name} is from {country}; {first_name} is not from {country} ')

np.random.seed(seed) 
negation_sents = np.random.choice(negation.data, nsamples)
correct_label = 2 #contradiction

In [37]:
eval_test(negation_sents, nsamples, correct_label)

100 samples of the following structure where the correct label is 2 (contradiction):
Premise: 'Scott is from Poland' 
Hypothesis ' Scott is not from Poland '
             0   1   2 correct
base        43  56   1    1.0%
lstm         1   0  99   99.0%
bilstm       1   0  99   99.0%
bilstmpool   3   0  97   97.0%


The base model is unable to handle simple negation: it predicts either entailment or neutral, so the mean of the token embeddings does not appear to capture contradictions. The LSTM-based models do handle these cases correctly.
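
A toy calculation illustrates why averaging may hide a negation. The 3-dimensional vectors below are made up (they are not GloVe embeddings), so this only demonstrates the mechanism, not the actual behaviour of the base model: adding one token to a five-token sentence shifts the mean only slightly.

```python
import math

# Made-up 3-d vectors standing in for word embeddings (illustration only).
toy_vectors = {
    "scott":  [1.0, 0.0, 0.2],
    "is":     [0.1, 0.3, 0.0],
    "not":    [-0.2, 0.1, 0.4],
    "from":   [0.0, 0.5, 0.1],
    "poland": [0.6, 0.2, 0.9],
}


def mean_embedding(sentence):
    """Average the toy vectors of all whitespace-separated tokens."""
    vectors = [toy_vectors[token] for token in sentence.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


premise = mean_embedding("Scott is from Poland")
hypothesis = mean_embedding("Scott is not from Poland")
similarity = cosine(premise, hypothesis)
print(f"cosine similarity = {similarity:.3f}")
```

With these toy vectors the two sentence representations remain nearly parallel, which is consistent with a classifier on top of them treating the pair as related rather than contradictory.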

### Negation in one part of the sentence, entailment in the relevant part
In the following sentences, there is a negation in the first clause of the premise but *not* in the second clause, and the hypothesis is entailed by the second clause. A model that recognizes negation could struggle here if it overestimates the scope of the negation.

In [39]:
negation_entailment = editor.template('{first_name} does not live in {city} but they are from {country}; {first_name} is from {country} ')
np.random.seed(seed) 
negation_entailment_sents = np.random.choice(negation_entailment.data, nsamples)
correct_label = 0 #entailment

eval_test(negation_entailment_sents, nsamples, correct_label)

100 samples of the following structure where the correct label is 0 (entailment):
Premise: 'Alan does not live in Anaheim but they are from Bulgaria' 
Hypothesis ' Alan is from Bulgaria '
             0   1   2 correct
base         1  56  43    1.0%
lstm        85   0  15   85.0%
bilstm      20   7  73   20.0%
bilstmpool  97   0   3   97.0%


As in the previous test, the base model cannot do this task correctly. Interestingly, however, here it predicts neutral or contradiction, so it does appear to pick up on the negation (or it fails to recognize the similarity between the premise and the hypothesis).

Even more interestingly, the BiLSTM model performs very poorly on this task, predicting contradiction for 73 of the 100 samples, while the other LSTM-based models perform well.



### Agent-patient disambiguation in active/passive sentences
In the following two tests, I investigate whether the models can correctly distinguish between agents and patients across active and passive constructions. In the first test, the patient is the object in the active premise, while it is the subject in the passive hypothesis: here the models should predict entailment.

In the second test, the agent is the subject in the active premise, but this same person is also the subject in the passive hypothesis, making it the patient. Here, the models should thus predict neutral or contradiction, but *not* entailment.


In [41]:
# a list of verbs to use in the test cases
passive_verbs = ['kissed', 'killed', 'hurt', 'touched', 'ignored', 'silenced', 'hit', 'greeted']
english_firstname = editor.lexicons.female_from.United_Kingdom + editor.lexicons.male_from.United_Kingdom

In [42]:
active_passive = editor.template('{first_name} {verb} {first}; {first} was {verb} by {first_name}', first=english_firstname, verb=passive_verbs)

np.random.seed(seed) 
active_passive_sents = np.random.choice(active_passive.data, nsamples)

correct_label = 0 #entailment

eval_test(active_passive_sents, nsamples, correct_label)

100 samples of the following structure where the correct label is 0 (entailment):
Premise: 'Christopher hit Hannah' 
Hypothesis ' Hannah was hit by Christopher'
              0   1  2 correct
base        100   0  0  100.0%
lstm        100   0  0  100.0%
bilstm      100   0  0  100.0%
bilstmpool   71  21  8   71.0%


In [43]:
active_passive = editor.template('{first_name} {verb} {first}; {first_name} was {verb} by {first}', first=english_firstname, verb=passive_verbs)

np.random.seed(seed) 
active_passive_sents = np.random.choice(active_passive.data, nsamples)

correct_label = 1 #neutral

eval_test(active_passive_sents, nsamples, correct_label)

100 samples of the following structure where the correct label is 1 (neutral):
Premise: 'Christopher hit Hannah' 
Hypothesis ' Christopher was hit by Hannah'
              0  1  2 correct
base        100  0  0    0.0%
lstm        100  0  0    0.0%
bilstm      100  0  0    0.0%
bilstmpool  100  0  0    0.0%


This test pair gives some interesting results. Firstly, while the models perform well on the first test (the BiLSTMPool model less so, at 71%), they all score zero on the second test: even though the agent and the patient are swapped between the two hypotheses, every model predicts entailment for all 100 samples. The models thus all fail consistently at agent-patient disambiguation.

Secondly, the BiLSTMPool model obtains the lowest score on the first test case, which is interesting since it is the best model overall.
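
For the base model, identical predictions on both tests are unavoidable if it indeed averages token embeddings, as discussed in the negation test: the two hypotheses contain exactly the same words, and a mean is insensitive to word order. The sketch below demonstrates this with a made-up `toy_vector` function (a hypothetical stand-in for an embedding lookup). It says nothing about why the LSTM-based models fail.

```python
from collections import Counter

hypothesis_entailed = "Hannah was hit by Christopher"
hypothesis_swapped = "Christopher was hit by Hannah"


def bag_of_words(sentence):
    """Multiset of lower-cased tokens, ignoring word order."""
    return Counter(sentence.lower().split())


def toy_vector(token):
    """Deterministic made-up 3-d vector per token (illustration only)."""
    return [len(token), sum(map(ord, token)) % 7, token.count("h")]


def mean_embedding(sentence):
    """Average the toy vectors of all tokens in the sentence."""
    vectors = [toy_vector(token) for token in sentence.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


same_bag = bag_of_words(hypothesis_entailed) == bag_of_words(hypothesis_swapped)
same_mean = mean_embedding(hypothesis_entailed) == mean_embedding(hypothesis_swapped)
print(f"same bag of words: {same_bag}, same mean embedding: {same_mean}")
```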

### Long premise - short hypothesis

In this test I evaluate whether the models can extract the relevant information from a longer premise.

In [44]:
verbs = ['hit', 'kicked', 'stopped', 'touched', 'missed', 'smashed']
# lists of sentence fillers to increase the distance between the agent and the predicate
precedents = ['nearly falling down', 'missing the past three games', 'celebrating a perfect streak', 'suffering from a knee injury', 'appearing so fit']

In [45]:
long_distance = editor.template("{first_name}, after {filler}, finnaly {verb} the ball; {first_name} {verb} the ball", verb=verbs, filler=precedents)

np.random.seed(seed) 
long_d_sents = np.random.choice(long_distance.data, nsamples)

correct_label = 0 #entailment

eval_test(long_d_sents, nsamples, correct_label)

100 samples of the following structure where the correct label is 0 (entailment):
Premise: 'Edward, after suffering from a knee injury, finnaly stopped the ball' 
Hypothesis ' Edward stopped the ball'
              0   1   2 correct
base         22  49  29   22.0%
lstm         98   0   2   98.0%
bilstm       96   4   0   96.0%
bilstmpool  100   0   0  100.0%


These results show that all models except the base model perform satisfactorily on this test.

### Synonym / hypernym recognition

In this final test, I evaluate whether the models recognize synonyms and some hypernyms correctly, and thus predict entailment for the test cases.

In [46]:
synonym_words = [('author', 'writer'), ('surgeon', 'doctor'), ('server', 'waiter'), ('chef','cook'), 
                 ('educator','teacher'), ('professor','academic'), ('person','human'), ('actor','performer'),
                 ('musician', 'artist'), ('hairdresser', 'hairstylist')]

In [49]:
synonyms = editor.template("{first_name} is {a:occupation[0]}; {first_name} {a:occupation[1]}", occupation=synonym_words, filler=precedents)

np.random.seed(seed) 
synonym_sents = np.random.choice(synonyms.data, nsamples)

correct_label = 0 #entailment

eval_test(synonym_sents, nsamples, correct_label)


100 samples of the following structure where the correct label is 0 (entailment):
Premise: 'Tim is a person' 
Hypothesis ' Tim a human'
             0   1   2 correct
base        18  73   9   18.0%
lstm        55  36   9   55.0%
bilstm      72  12  16   72.0%
bilstmpool  69  18  13   69.0%


This task appears to be rather difficult: none of the models scores above 72%. The best performance is obtained by the BiLSTM model and the worst by the base model. For all models except the BiLSTM, most mistakes are neutral predictions. This could indicate that the relations between the synonyms and hypernyms/hyponyms are not well represented in the sentence embeddings, causing the models to consider the sentences unrelated. Note also that the hypothesis template lacks the verb "is" (as in 'Tim a human'), which may itself contribute to the errors.
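
To see which kind of mistake each model makes, the prediction counts from the table above can be broken down per model. The counts are copied by hand into a dictionary, so this sketch runs without the models.

```python
# Prediction counts copied from the synonym/hypernym table above
# (label ids: 0 = entailment, 1 = neutral, 2 = contradiction).
counts = {
    "base":       {0: 18, 1: 73, 2: 9},
    "lstm":       {0: 55, 1: 36, 2: 9},
    "bilstm":     {0: 72, 1: 12, 2: 16},
    "bilstmpool": {0: 69, 1: 18, 2: 13},
}
correct_label = 0
label_names = {0: "entailment", 1: "neutral", 2: "contradiction"}

dominant_error = {}
for model, tally in counts.items():
    # keep only the wrong labels and pick the most frequent one
    errors = {label: n for label, n in tally.items() if label != correct_label}
    dominant_error[model] = max(errors, key=errors.get)
    n_errors = sum(errors.values())
    print(f"{model:<10} errors={n_errors:>3}  "
          f"most frequent: {label_names[dominant_error[model]]}")
```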

# Sentence embeddings evaluation

Finally, I will investigate a few things:
* cosine similarity of sentences containing synonyms/hypernyms
* visualizations of token importance for some sentences, following Conneau et al.
* visualization of sentence embeddings for positive versus negative sentiment

In [12]:
df = pd.read_csv('http://bit.ly/dataset-sst2', 
                 nrows=100, sep='\t', names=['text', 'label'])

In [13]:
df['label'] = df['label'].replace({0: 'negative', 1: 'positive'})

In [14]:
model = 'bilstm'
e = mutils.encode(model, embedding_model, df['text'], device, bilstm_lstm)

AttributeError: module 'mutils' has no attribute 'encode'

In [15]:
dir(mutils)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'config',
 'create_batches',
 'device',
 'get_UVY',
 'get_batch',
 'get_sent_embedding',
 'get_word_embedding',
 'load_embeddings',
 'load_model',
 'load_word_dict',
 'models',
 'nn',
 'np',
 'output_dict',
 'pad',
 'pad_singlebatch',
 'predict',
 'preprocess_sentence_data',
 'torch',
 'word_tokenize']
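
When hunting for the right helper, the `dir` listing can be narrowed down to public callables, and a missing attribute can be detected before the call fails. The sketch below uses the standard-library `json` module as a stand-in, since `mutils` is specific to this project; `public_callables` is a hypothetical helper name.

```python
import json as stand_in  # stand-in module; in this notebook it would be `mutils`


def public_callables(module):
    """Sorted names in `module` that are callable and do not start with an underscore."""
    return sorted(
        name for name in dir(module)
        if not name.startswith("_") and callable(getattr(module, name))
    )


names = public_callables(stand_in)
print(names)

# Look the helper up defensively instead of letting the call raise AttributeError.
helper = getattr(stand_in, "encode", None)
if helper is None:
    candidates = [name for name in names if "encod" in name.lower()]
    print("no `encode` attribute; similarly named callables:", candidates)
```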