In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## SICK entailment task data

The [SemEval](http://alt.qcri.org/semeval2016/) data are obtained from the `datasets-sts` repo: https://github.com/brmson/dataset-sts

`pysts` (included as git submodule in this repo) can be used to load SICK task data.

In [2]:
import sys
sys.path.append('./dataset-sts/')

In [3]:
import pysts
from pysts.loader import load_sts

In [4]:
sickA, sickB, score = pysts.loader.load_sick2014('dataset-sts/data/sts/sick2014/SICK_train.txt',
                                                 mode='entailment')

In [5]:
sickA[0]

['A',
 'group',
 'of',
 'kids',
 'is',
 'playing',
 'in',
 'a',
 'yard',
 'and',
 'an',
 'old',
 'man',
 'is',
 'standing',
 'in',
 'the',
 'background']

In [6]:
sickB[0]

['A',
 'group',
 'of',
 'boys',
 'in',
 'a',
 'yard',
 'is',
 'playing',
 'and',
 'a',
 'man',
 'is',
 'standing',
 'in',
 'the',
 'background']

In [7]:
score[0]

array([0, 1, 0])

In [8]:
score

array([[0, 1, 0],
       [0, 1, 0],
       [0, 0, 1],
       ..., 
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0]])

## GloVe pre-trained word vectors 

GloVe - Global Vectors for Word Representation (https://nlp.stanford.edu/projects/glove/). Pre-trained word vectors have been downloaded (we use the 300-dimensional vectors trained on the 840 billion token Common Crawl corpus: http://nlp.stanford.edu/data/glove.840B.300d.zip), and converted to a dictionary for further usage:
    
    import pandas as pd
    import zipfile
    
    z = zipfile.ZipFile("./glove.840B.300d.zip")
    glove = pd.read_csv(z.open('glove.840B.300d.txt'), sep=" ", quoting=3, header=None, index_col=0)
    glove2 = {key: val.values for key, val in glove.T.items()}
    
    import pickle
    with open('glove.840B.300d.pkl', 'wb') as output:
        pickle.dump(glove2, output)
        
See the [GloVe_pretrained_vectors.ipynb](GloVe_pretrained_vectors.ipynb) notebook for the actual code.

In [9]:
import pickle

In [10]:
with open('glove.840B.300d.pkl', 'rb') as pkl:
    glove = pickle.load(pkl)

## Sentence embedding

See the [sts_tasks.ipynb](sts_tasks.ipynb) notebook for an exploration of the different ways to obtain the sentence embedding. The resulting code is put in a sklearn-like transformer in [wordembeddings.py](files/wordembeddings.py), which we use here:

In [11]:
from wordembeddings import EmbeddingVectorizer

In [12]:
emb = EmbeddingVectorizer(word_vectors=glove, weighted=True, R=True)

In [13]:
Vs0 = emb.fit_transform(sickA)
Vs1 = emb.fit_transform(sickB)

## Using the word/sentence embeddings in learned models: Predicting the entailment

### Keras model (from paper)

Here, we implement using Keras a model similar to the one described in Wieting *et al.*, 2016 (https://arxiv.org/pdf/1511.08198.pdf, code in https://github.com/jwieting/iclr2016)

In [14]:
from keras.models import Sequential, Model
from keras.layers import Dense, Input, concatenate

Using TensorFlow backend.


In [15]:
g1_dot_g2 = Vs0 * Vs1
g1_abs_g2 = np.abs(Vs0 - Vs1)

In [16]:
wv_dim = Vs0.shape[1]

In [17]:
lin_dot = Input(shape=(wv_dim,), name='lin_dot')
lin_abs = Input(shape=(wv_dim,), name='lin_abs')

l_sum = concatenate([lin_dot, lin_abs])
l_sigmoid = Dense(50, activation='sigmoid')(l_sum)
l_softmax = Dense(3, activation='softmax')(l_sigmoid)

model = Model(inputs=[lin_dot, lin_abs], outputs=l_softmax)

In [18]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [19]:
y = score

In [20]:
model.fit([g1_dot_g2, g1_abs_g2], y, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f1a6c012be0>

In [21]:
predicted = model.predict([g1_dot_g2, g1_abs_g2])

In [22]:
predicted

array([[ 0.22848211,  0.44269061,  0.32882735],
       [ 0.16841087,  0.52811384,  0.30347526],
       [ 0.06455892,  0.40623131,  0.52920973],
       ..., 
       [ 0.00150757,  0.96586132,  0.03263115],
       [ 0.00262515,  0.98294824,  0.0144267 ],
       [ 0.00213707,  0.99459004,  0.0032729 ]], dtype=float32)

In [23]:
from sklearn import metrics

In [24]:
metrics.accuracy_score(score.argmax(axis=1), predicted.argmax(axis=1))

0.77044444444444449

In [25]:
metrics.confusion_matrix(score.argmax(axis=1), predicted.argmax(axis=1))

array([[ 481,  122,   62],
       [  40, 2282,  214],
       [ 137,  458,  704]])

#### Validate on the test data

In [26]:
sick_testA, sick_testB, score_test = \
    pysts.loader.load_sick2014('dataset-sts/data/sts/sick2014/SICK_test_annotated.txt',
                               mode='entailment')

In [27]:
Vs0_test = emb.fit_transform(sick_testA)
Vs1_test = emb.fit_transform(sick_testB)

Supervised:

In [28]:
g1_dot_g2_test = Vs0_test * Vs1_test
g1_abs_g2_test = np.abs(Vs0_test - Vs1_test)

In [29]:
predicted = model.predict([g1_dot_g2_test, g1_abs_g2_test])

In [30]:
metrics.accuracy_score(score_test.argmax(axis=1), predicted.argmax(axis=1))

0.73878627968337729

In [31]:
metrics.confusion_matrix(score_test.argmax(axis=1), predicted.argmax(axis=1))

array([[ 460,  147,  113],
       [  40, 2455,  298],
       [ 157,  532,  725]])

### Neural Network (MLPClassifier) with sklearn

In [32]:
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

In [33]:
X = np.concatenate([Vs0 * Vs1, np.abs(Vs0, Vs1)], axis=1)

In [34]:
X.shape

(4500, 600)

Converting the binary sore matrix to 1D array of labels:

In [35]:
y = score.argmax(axis=1)

Pipeline consists of scaling the word embeddings, and then a simple neural network:

In [36]:
scaler = StandardScaler()  

In [37]:
X = scaler.fit_transform(X)  

In [38]:
mlp = MLPClassifier(activation='logistic', solver='adam', hidden_layer_sizes=20)

In [39]:
mlp.fit(X, y)



MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=20, learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [40]:
predicted = mlp.predict(X)

In [41]:
predicted

array([1, 1, 2, ..., 1, 1, 1])

In [42]:
metrics.accuracy_score(y, predicted)

0.97844444444444445

In [43]:
metrics.confusion_matrix(y, predicted)

array([[ 620,   26,   19],
       [  17, 2504,   15],
       [   9,   11, 1279]])

#### Validate on the test data:

In [44]:
X_test = np.concatenate([Vs0_test * Vs1_test, np.abs(Vs0_test, Vs1_test)], axis=1)

In [45]:
X_test = scaler.transform(X_test)

In [46]:
predicted_test = mlp.predict(X_test)

In [47]:
metrics.accuracy_score(score_test.argmax(axis=1), predicted_test)

0.59853866450172521

So the MLP overfits, and the performance on the validation data is much worse compared to the Keras model.

## Comparison to count-based model (no word embeddings) with sklearn

In [48]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier

Converting the already tokenized documents back to full text, so the sklearn transformers can handle it:

In [49]:
X_raw1 = [' '.join(s) for s in sickA]
X_raw2 = [' '.join(s) for s in sickB]

In [50]:
dim = len(X_raw1)
dim

4500

Combining them to do a combined count vectorizer:

In [51]:
X_raw = X_raw1 + X_raw2

In [52]:
tfidf = TfidfVectorizer()

In [53]:
X_tfidf = tfidf.fit_transform(X_raw)

In [54]:
X_tfidf

<9000x2165 sparse matrix of type '<class 'numpy.float64'>'
	with 71126 stored elements in Compressed Sparse Row format>

In [55]:
X_tfidf1 = X_tfidf[:dim, :]
X_tfidf2 = X_tfidf[dim:, :]

Reducing the dimensionality to the same 300D as with the word embeddings:

In [56]:
lsa1 = make_pipeline(TruncatedSVD(n_components=300), Normalizer(copy=False))
lsa2 = make_pipeline(TruncatedSVD(n_components=300), Normalizer(copy=False))

In [57]:
X1 = lsa1.fit_transform(X_tfidf1)
X2 = lsa2.fit_transform(X_tfidf2)

In [58]:
X1

array([[  4.51585033e-01,  -4.04499744e-02,  -3.67662025e-02, ...,
          6.20010535e-02,   3.57434222e-02,   5.94623341e-02],
       [  5.58522088e-01,  -2.12649517e-03,  -1.55138327e-02, ...,
          9.55973489e-04,  -9.65932921e-03,   1.30330303e-02],
       [  4.12860784e-01,   2.28397376e-02,  -1.73158900e-01, ...,
         -2.46892130e-02,  -1.41916809e-02,  -2.04767027e-02],
       ..., 
       [  5.87403690e-01,   3.78857109e-01,  -2.92038629e-01, ...,
          1.29298148e-02,   5.16820624e-03,  -9.78083652e-03],
       [  2.83587589e-01,  -1.28032454e-01,  -3.28391739e-04, ...,
          7.87505370e-03,  -2.47142500e-02,   4.16246030e-03],
       [  1.06934681e-01,  -1.31687775e-01,  -1.15858566e-01, ...,
         -6.39035428e-02,  -1.27108413e-03,   9.00039303e-03]])

Concatenate both into a single feature matrix (this time I do not use multiplication and absolute difference of both):

In [59]:
X = np.concatenate([X1, X2], axis=1)

Using a similar neural network:

In [60]:
mlp = MLPClassifier(activation='logistic', solver='adam', hidden_layer_sizes=20)

In [61]:
mlp.fit(X, y)



MLPClassifier(activation='logistic', alpha=0.0001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=20, learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [62]:
predicted = mlp.predict(X)

In [63]:
predicted

array([1, 1, 2, ..., 1, 1, 1])

In [64]:
metrics.accuracy_score(y, predicted)

0.72022222222222221

In [65]:
metrics.confusion_matrix(y, predicted)

array([[ 494,  149,   22],
       [ 146, 2056,  334],
       [   9,  599,  691]])

#### Validate on the test data:

In [66]:
X_raw1_test = [' '.join(s) for s in sick_testA]
X_raw2_test = [' '.join(s) for s in sick_testB]

In [67]:
dim_test = len(X_raw1_test)
dim_test

4927

In [68]:
X_raw_test = X_raw1_test + X_raw2_test

In [69]:
X_tfidf_test = tfidf.transform(X_raw_test)

In [70]:
X_tfidf_test

<9854x2165 sparse matrix of type '<class 'numpy.float64'>'
	with 77558 stored elements in Compressed Sparse Row format>

In [71]:
X_tfidf1_test = X_tfidf_test[:dim_test, :]
X_tfidf2_test = X_tfidf_test[dim_test:, :]

In [72]:
X1_test = lsa1.transform(X_tfidf1_test)
X2_test = lsa2.transform(X_tfidf2_test)

In [73]:
X_test = np.concatenate([X1_test, X2_test], axis=1)

In [74]:
predicted_test = mlp.predict(X_test)

In [75]:
metrics.accuracy_score(score_test.argmax(axis=1), predicted_test)

0.57783641160949872

So the TFIFF - SVD approach actually gives similar results compared to the WordEmbedding.