# Using Sent2Vec via Gensim

This tutorial shows how to use the Sent2Vec model in Gensim. We will train sentence-embedding models, save and load them, and perform similarity operations. The notebook also compares the Gensim implementation with the [original C++ implementation](https://github.com/epfml/sent2vec), Gensim's Doc2Vec and Gensim's FastText. All the evaluation scripts used in the notebook can be found [here](https://gist.github.com/prerna135/9b5eb55054d29c1495460b75fc061c6b).

# What is Sent2Vec?

Sent2Vec produces numerical representations (features) for short texts or sentences, which can then be used as input to any machine learning task. Think of it as an unsupervised version of FastText, and an extension of word2vec (CBOW) to sentences. The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. According to the [paper](https://arxiv.org/abs/1703.02507), the algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks and even beats supervised models on many of them, which highlights the robustness of the produced sentence embeddings.

The sentence embedding is defined as the average of the source word embeddings of its constituent words. The model is further augmented by learning source embeddings not only for unigrams but also for the n-grams present in each sentence, and averaging the n-gram embeddings along with the word embeddings.
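The averaging described above can be sketched with a toy lookup table. This is only an illustration in Python 3: the embedding values, the `toy_embeddings` table and the helper names are made up, and the real model also hashes n-grams into buckets, which is omitted here.

```python
# Toy illustration: a sentence vector as the average of unigram and bigram embeddings.
# All names and values below are hypothetical, chosen only for this example.

def word_ngrams(tokens, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n as strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def sentence_vector(tokens, embeddings, dim, max_n=2):
    """Average the embeddings of every known unigram and n-gram in the sentence."""
    vectors = [embeddings[g] for g in word_ngrams(tokens, max_n) if g in embeddings]
    if not vectors:
        # No known unigram or n-gram: fall back to a zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


toy_embeddings = {
    "blue": [1.0, 0.0],
    "sky": [0.0, 1.0],
    "blue sky": [2.0, 2.0],
}
vec = sentence_vector(["blue", "sky"], toy_embeddings, dim=2)
print(vec)  # [1.0, 1.0]
```

Here the bigram "blue sky" contributes to the average in exactly the same way as the two unigrams.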

# Training models

For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model. All models are trained with the same hyperparameters for evaluation purposes.

In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.sent2vec import Sent2Vec as s2v
from gensim.models.fasttext import FastText as ft
from gensim.utils import tokenize
import scipy
import re
from numpy import dot
from gensim import matutils
import time
import numpy as np
import tensorflow as tf
import random
import eval_sick
import eval_classification
import eval_trec
import smart_open

Using TensorFlow backend.


In [2]:
# Prepare training data
data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
lee_train_file = data_dir + 'lee_background.cor'
lee_data = []
with open(lee_train_file) as f1, open("./input.txt",'w') as f2:
    for line in f1:
        if line not in ['\n', '\r\n']:
            line = re.split(r'\.|\?|\n', line.strip())
            for sentence in line:
                if len(sentence) > 1:
                    sentence = tokenize(sentence)
                    lee_data.append(list(sentence))
                    f2.write(' '.join(lee_data[-1]) + '\n')
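To make the splitting step above easier to follow, here is a small standalone sketch in Python 3 that applies the same regular expression to a made-up line. The `simple_tokenize` helper is a hypothetical stand-in for `gensim.utils.tokenize` and only keeps runs of ASCII letters, so its behaviour may differ from Gensim's tokenizer.

```python
import re


def split_into_sentences(line):
    """Split a line on '.', '?' and newlines, keeping fragments longer than one character."""
    fragments = re.split(r'\.|\?|\n', line.strip())
    return [fragment for fragment in fragments if len(fragment) > 1]


def simple_tokenize(sentence):
    """Hypothetical stand-in tokenizer: runs of ASCII letters only."""
    return re.findall(r'[A-Za-z]+', sentence)


# Made-up input line, not taken from the Lee corpus.
line = "Fires are burning. Is the highway closed? Yes.\n"
sentences = [simple_tokenize(s) for s in split_into_sentences(line)]
for tokens in sentences:
    print(tokens)
```

Each resulting token list corresponds to one entry of `lee_data` and one line of `input.txt` in the cell above.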

In [3]:
# Print sample training data
for sentence in lee_data[:5]:
    print sentence,'\n'

[u'Hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'Southern', u'Highlands', u'of', u'New', u'South', u'Wales', u'as', u'strong', u'winds', u'today', u'pushed', u'a', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'Hill', u'Top'] 

[u'A', u'new', u'blaze', u'near', u'Goulburn', u'south', u'west', u'of', u'Sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'Hume', u'Highway'] 

[u'At', u'about', u'pm', u'AEDT', u'a', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'a', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'Blue', u'Mountains', u'forced', u'authorities', u'to', u'make', u'a', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'Hill', u'Top', u'in', u'the', u'New', u'South', u'Wales', u'southern', u'highlands'] 

[u'An', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'Mittago

# Using the Gensim implementation of Sent2Vec

In [4]:
# Train new sent2vec model
sent2vec_model = s2v(lee_data, vector_size=100, epochs=20, seed=42)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 0.06 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 1307, tokens read: 60302
INFO:gensim.models.sent2vec:Training...
INFO:gensim.models.sent2vec:Begin epoch 0 :
INFO:gensim.models.sent2vec:Progress: 3.96, lr: 0.1921, loss: 3.7907
INFO:gensim.models.sent2vec:Begin epoch 1 :
INFO:gensim.models.sent2vec:Progress: 7.93, lr: 0.1841, loss: 3.6546
INFO:gensim.models.sent2vec:Begin epoch 2 :
INFO:gensim.models.sent2vec:Progress: 11.90, lr: 0.1762, loss: 3.5554
INFO:gensim.models.sent2vec:Begin epoch 3 :
INFO:gensim.models.sent2vec:Progress: 15.86, lr: 0.1683, loss: 3.4607
INFO:gensim.models.sent2vec:Begin epoch 4 :
INFO:gensim.models.sent2vec:Progress: 19.83, lr: 0.1603, loss: 3.3682
INFO:gensim.models.sent2vec:Begin epoch 5 :
INFO:gensim.models.sent2vec:Progress: 23.80, lr: 0.1524, loss: 3.2797
INFO:gensim.models.sent2vec:Begin epoch 6 :
INFO:gensim.models.sent2vec:Progress: 27.76, lr
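The learning rates printed in the log above are consistent with a linear decay from the initial rate of 0.2 towards zero as training progresses. The sketch below reproduces that arithmetic; it is an observation about the logged numbers, not a statement about how the implementation schedules updates internally, and `linear_lr` is a hypothetical helper.

```python
def linear_lr(alpha, progress_percent):
    """Learning rate under linear decay, given progress in percent (0 to 100)."""
    return alpha * (1.0 - progress_percent / 100.0)


# Progress values copied from the training log above.
logged_progress = (3.96, 7.93, 11.90, 15.86)
rates = [linear_lr(0.2, p) for p in logged_progress]
for progress, rate in zip(logged_progress, rates):
    print("progress %.2f%% -> lr %.4f" % (progress, rate))
```

Rounded to four decimals, these come out as 0.1921, 0.1841, 0.1762 and 0.1683, matching the `lr` values in the log.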

# Training hyperparameters

Sent2Vec supports the following parameters:

 - vector_size: Size of embeddings to be learnt (Default 100)
 - alpha: Initial learning rate (Default 0.2)
 - min_count: Ignore words with number of occurrences below this (Default 5)
 - loss: Training objective. Allowed values: `ns` (Default `ns`)
 - neg: Number of negative words to sample, for `ns` (Default 10)
 - epochs: Number of epochs (Default 5)
 - bucket: Number of hash buckets for vocabulary (Default 2000000)
 - lr_update_rate: Change the rate of updates for the learning rate (Default 100)
 - t: Sampling threshold (Default 0.0001)
 - dropoutk: Number of ngrams dropped when training a sent2vec model (Default 2)
 - word_ngrams: Max length of word ngram (Default 2)
 - minn: Min length of char ngrams (Default 3)
 - maxn: Max length of char ngrams (Default 6)
 - seed: Random seed for reproducibility (Default 42)
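Most of these parameters appear to have a direct counterpart among the command-line flags of the original C++ tool, which is trained later in this notebook. The mapping below is inferred from the matching names and values only, so treat it as an assumption; `PARAM_TO_FLAG` and `build_cli_args` are hypothetical helpers, and the sketch relies on Python 3.7+ dictionary ordering.

```python
# Hypothetical mapping from the Python parameter names listed above to the
# flags of the original C++ command line (inferred from matching names).
PARAM_TO_FLAG = {
    "min_count": "-minCount",
    "vector_size": "-dim",
    "epochs": "-epoch",
    "alpha": "-lr",
    "word_ngrams": "-wordNgrams",
    "loss": "-loss",
    "neg": "-neg",
    "t": "-t",
    "dropoutk": "-dropoutK",
    "bucket": "-bucket",
}


def build_cli_args(params):
    """Turn a dict of Python-style parameters into a flat list of CLI arguments."""
    args = []
    for name, value in params.items():
        args.extend([PARAM_TO_FLAG[name], str(value)])
    return args


# Same hyperparameter values as used for the comparison in this notebook.
params = {
    "min_count": 5, "vector_size": 100, "epochs": 20, "alpha": 0.2,
    "word_ngrams": 2, "loss": "ns", "neg": 10, "t": 0.0001,
    "dropoutk": 2, "bucket": 2000000,
}
cli = " ".join(build_cli_args(params))
print(cli)
```

Keeping the two parameter sets aligned this way makes it easier to check that both implementations are trained with the same settings.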

In [5]:
# Print sentence vector
sent2vec_model.sentence_vectors(['This', 'is', 'an', 'awesome', 'gift'])

array([ -2.47282963e-01,   1.87189350e-01,  -3.21987474e-02,
         7.91178968e-02,   3.39226979e-01,  -4.05825705e-01,
         8.00900208e-01,   1.35698618e-01,  -4.38281501e-02,
         7.56798528e-01,  -3.80137642e-01,  -2.22912740e-01,
        -8.74431924e-02,   5.80761992e-02,   4.83582117e-01,
         1.01573390e-01,   7.24145461e-01,   2.94534907e-01,
         1.56936339e-01,  -1.77839273e-01,  -1.38541675e-01,
         2.78566131e-01,  -7.17920556e-01,  -2.89763518e-01,
         1.50396665e-01,  -1.23929553e-01,   6.27710213e-01,
         2.74299500e-01,   3.58840180e-01,   4.12131903e-02,
        -1.07678619e-01,  -1.87951293e-01,  -5.40631978e-01,
        -1.87393133e-01,   6.79594841e-01,   8.36914707e-01,
         1.57756821e-02,  -6.81864588e-01,   3.33469583e-01,
         9.79331323e-01,  -1.11638314e-03,   8.25131676e-01,
         8.70815820e-01,  -3.52632190e-01,   2.81376163e-01,
         3.82966643e-01,   3.37886963e-01,  -4.67265898e-01,
         7.07851315e-01,

In [6]:
# Print cosine similarity between two sentences
print sent2vec_model.similarity(['The', 'sky', 'is', 'blue'], ['I', 'am', 'going', 'to', 'a', 'party'])
print sent2vec_model.similarity(['This', 'is', 'an', 'awesome', 'gift'], ['This', 'present', 'is', 'great'])

0.116041437211
0.783033600606
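For reference, cosine similarity between two vectors is their dot product divided by the product of their norms. The pure-Python sketch below shows that computation on made-up two-dimensional vectors; it illustrates the measure itself and is not the code path used by `similarity` above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0 (identical direction)
```

Values close to 1 indicate sentences whose embeddings point in nearly the same direction, as with the second pair above.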


# Saving and loading models

Models can be saved and loaded via the `save` and `load` methods.

In [7]:
# Save trained sent2vec model
sent2vec_model.save('s2v1')

INFO:gensim.utils:saving Sent2Vec object under s2v1, separately None
INFO:gensim.utils:storing np array 'wi' to s2v1.wi.npy
INFO:gensim.utils:saved s2v1


In [2]:
# Load pretrained sent2vec model
loaded_model = s2v.load('s2v1')

INFO:gensim.utils:loading Sent2Vec object from s2v1
INFO:gensim.utils:loading wi from s2v1.wi.npy with mmap=None
INFO:gensim.utils:loaded s2v1


# Unsupervised similarity evaluation

Unsupervised evaluation of the learnt sentence embeddings is performed using sentence cosine similarity on the [SICK 2014](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools) dataset. These similarity scores are compared to the gold-standard human judgements using [Pearson’s correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) scores. The SICK dataset consists of about 10,000 sentence pairs along with relatedness scores of the pairs. We use the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts).
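As a reminder of what the correlation score measures, here is a minimal pure-Python sketch of Pearson's correlation between predicted and gold relatedness scores. The score lists are made up for illustration and are not SICK data; the evaluation script itself may compute the statistic differently (for example via SciPy).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores: gold human judgements and model predictions.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]
r = pearson(gold, predicted)
print("Pearson r: %.3f" % r)
```

Because the made-up predictions are a constant shift of the gold scores, the correlation is 1; real model outputs will score lower.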

In [9]:
eval_sick.evaluate(loaded_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...


  lrmodel.add(Dense(input_dim=ninputs, output_dim=nclass))


Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
1s - loss: 1.4723 - val_loss: 1.3952
Epoch 2/10
0s - loss: 1.4011 - val_loss: 1.3571
Epoch 3/10
0s - loss: 1.3717 - val_loss: 1.3356
Epoch 4/10
0s - loss: 1.3534 - val_loss: 1.3214
Epoch 5/10
0s - loss: 1.3402 - val_loss: 1.3111
Epoch 6/10
0s - loss: 1.3302 - val_loss: 1.3033
Epoch 7/10
0s - loss: 1.3220 - val_loss: 1.2974
Epoch 8/10
0s - loss: 1.3153 - val_loss: 1.2927
Epoch 9/10
0s - loss: 1.3096 - val_loss: 1.2889
Epoch 10/10
0s - loss: 1.3047 - val_loss: 1.2859
0.470984774503
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.3003 - val_loss: 1.2834
Epoch 2/10
0s - loss: 1.2964 - val_loss: 1.2814
Epoch 3/10
0s - loss: 1.2929 - val_loss: 1.2796
Epoch 4/10
0s - loss: 1.2898 - val_loss: 1.2782
Epoch 5/10
0s - loss: 1.2869 - val_loss: 1.2770
Epoch 6/10
0s - loss: 1.2842 - val_loss: 1.2759
Epoch 7/10
0s - loss: 1.2817 - val_loss: 1.2750
Epoch 8/10
0s - loss: 1.2794 - val_loss: 1.2742
Epoch 9/10
0s 

array([ 3.11393216,  3.21367784,  3.32681751, ...,  3.31225647,
        2.57795017,  3.68078184])

# Downstream Supervised Evaluation

Sentence embeddings are also evaluated on several supervised classification tasks: movie review sentiment (MR) (Pang & Lee, 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004) and question type classification (TREC) (Voorhees, 2002). To classify, we use the code provided by [(Kiros et al., 2015)](https://github.com/ryankiros/skip-thoughts). Sent2Vec embeddings are inferred from input sentences and directly fed to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the [MR and SUBJ](https://www.cs.cornell.edu/people/pabo/movie-review-data/) datasets. For these two datasets, nested cross-validation is used to tune the L2 penalty. For the [TREC dataset](http://cogcomp.cs.illinois.edu/Data/QA/QC/), the accuracy is computed on the test set.
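The nested cross-validation procedure can be sketched as follows: an outer loop holds out one fold for testing, and an inner loop over the remaining data picks the penalty value with the best mean validation score. This is a simplified illustration in plain Python, not the evaluation script; `toy_score` is a hypothetical stand-in for training a logistic regression classifier and measuring its accuracy, and real data would typically be shuffled first.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for contiguous folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def select_penalty(train_indices, candidates, score_fn, n_inner_folds):
    """Return the candidate with the highest mean inner-fold validation score."""
    best, best_score = None, float("-inf")
    for c in candidates:
        scores = []
        for inner_train, inner_val in kfold_indices(len(train_indices), n_inner_folds):
            tr = [train_indices[i] for i in inner_train]
            va = [train_indices[i] for i in inner_val]
            scores.append(score_fn(c, tr, va))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = c, mean_score
    return best


def toy_score(c, train_idx, val_idx):
    """Hypothetical score that peaks at c == 4; replaces real model training."""
    return 1.0 - abs(c - 4) / 10.0


chosen = []
for outer_train, outer_test in kfold_indices(20, 10):
    chosen.append(select_penalty(outer_train, [1, 2, 4, 8, 16], toy_score, n_inner_folds=10))
print(chosen)
```

In a real run, the selected penalty would then be used to train on the full outer training split and to score the held-out outer fold.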

In [10]:
eval_classification.eval_nested_kfold(model=loaded_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
(1, 0.78222222222222226)
(1, 0.76333333333333331)
(1, 0.77333333333333332)
(1, 0.76444444444444448)
(1, 0.73111111111111116)
(1, 0.77666666666666662)
(1, 0.75888888888888884)
(1, 0.77555555555555555)
(1, 0.75222222222222224)
(1, 0.79888888888888887)
(2, 0.78000000000000003)
(2, 0.76333333333333331)
(2, 0.77111111111111108)
(2, 0.76333333333333331)
(2, 0.73333333333333328)
(2, 0.78111111111111109)
(2, 0.76000000000000001)
(2, 0.77333333333333332)
(2, 0.75555555555555554)
(2, 0.79666666666666663)
(4, 0.77777777777777779)
(4, 0.76111111111111107)
(4, 0.77111111111111108)
(4, 0.76222222222222225)
(4, 0.73555555555555552)
(4, 0.78000000000000003)
(4, 0.75555555555555554)
(4, 0.77222222222222225)
(4, 0.75666666666666671)
(4, 0.80000000000000004)
(8, 0.77666666666666662)
(8, 0.76000000000000001)
(8, 0.77222222222222225)
(8, 0.76444444444444448)
(8, 0.73777777777777775)
(8, 0.78111111111111109)
(8, 0.75555555555555554)
(8, 0.77111111111111108)
(8, 0.75666666666666

[0.77100000000000002,
 0.77200000000000002,
 0.76800000000000002,
 0.77900000000000003,
 0.751,
 0.753,
 0.77200000000000002,
 0.77300000000000002,
 0.753,
 0.79300000000000004]
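A single summary number is easier to compare across models than ten fold scores. Assuming the list above holds the test accuracy of each outer fold, its mean and spread can be computed with the standard library (Python 3):

```python
import statistics

# Fold accuracies copied from the output above (gensim Sent2Vec, SUBJ).
fold_accuracies = [0.771, 0.772, 0.768, 0.779, 0.751,
                   0.753, 0.772, 0.773, 0.753, 0.793]

mean_accuracy = statistics.mean(fold_accuracies)
std_accuracy = statistics.pstdev(fold_accuracies)  # population standard deviation
print("mean accuracy: %.4f" % mean_accuracy)  # mean accuracy: 0.7685
print("std deviation: %.4f" % std_accuracy)
```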

In [11]:
eval_classification.eval_nested_kfold(model=loaded_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
(1, 0.5708333333333333)
(1, 0.59166666666666667)
(1, 0.5541666666666667)
(1, 0.58437499999999998)
(1, 0.59062499999999996)
(1, 0.57872784150156409)
(1, 0.6193952033368092)
(1, 0.59332638164754958)
(1, 0.58185610010427524)
(1, 0.5714285714285714)
(2, 0.57187500000000002)
(2, 0.59166666666666667)
(2, 0.55833333333333335)
(2, 0.58854166666666663)
(2, 0.58854166666666663)
(2, 0.57455683003128255)
(2, 0.61522419186652766)
(2, 0.59541188738269035)
(2, 0.57664233576642332)
(2, 0.56934306569343063)
(4, 0.57187500000000002)
(4, 0.59062499999999996)
(4, 0.55833333333333335)
(4, 0.58854166666666663)
(4, 0.58854166666666663)
(4, 0.57455683003128255)
(4, 0.61522419186652766)
(4, 0.59645464025026074)
(4, 0.57872784150156409)
(4, 0.56412930135557871)
(8, 0.57187500000000002)
(8, 0.58958333333333335)
(8, 0.5552083333333333)
(8, 0.58958333333333335)
(8, 0.5864583333333333)
(8, 0.57559958289885294)
(8, 0.61522419186652766)
(8, 0.59749739311783112)
(8, 0.57664233576642332)
(

[0.58856607310215558,
 0.56982193064667297,
 0.59099437148217637,
 0.56472795497185746,
 0.59099437148217637,
 0.59568480300187621,
 0.58818011257035652,
 0.59287054409005624,
 0.60506566604127576,
 0.57035647279549717]

In [3]:
eval_trec.evaluate(model=loaded_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.634


# Evaluation of the original C++ implementation of Sent2Vec

To build and train the C++ implementation of sent2vec, use the following commands. They produce object files for all the classes as well as the main binary, which the training command below invokes as `./fasttext`.

In [15]:
! git clone https://github.com/epfml/sent2vec.git
% cd sent2vec
! make

/Users/prerna135/Documents/GitHub/gensim/sent2vec


In [16]:
# Train model using the original C++ implementation of sent2vec
start_time = time.time()
! ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 20 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.0001 -dropoutK 2 -bucket 2000000
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

Read 0M words
Number of words:  1837
Number of labels: 0
Progress: 100.0%  words/sec/thread: 27315  lr: 0.000000  loss: 3.016871  eta: 0h0m


Total training time: 25.7511711121 seconds


In [3]:
eval_sick.evaluate(seed=42, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...


  lrmodel.add(Dense(input_dim=ninputs, output_dim=nclass))


Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
1s - loss: 1.4797 - val_loss: 1.4236
Epoch 2/10
0s - loss: 1.4212 - val_loss: 1.3924
Epoch 3/10
0s - loss: 1.3991 - val_loss: 1.3769
Epoch 4/10
0s - loss: 1.3861 - val_loss: 1.3675
Epoch 5/10
0s - loss: 1.3772 - val_loss: 1.3610
Epoch 6/10
0s - loss: 1.3702 - val_loss: 1.3563
Epoch 7/10
0s - loss: 1.3645 - val_loss: 1.3527
Epoch 8/10
0s - loss: 1.3597 - val_loss: 1.3500
Epoch 9/10
0s - loss: 1.3554 - val_loss: 1.3479
Epoch 10/10
0s - loss: 1.3517 - val_loss: 1.3462
0.411652301513
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.3484 - val_loss: 1.3449
Epoch 2/10
0s - loss: 1.3454 - val_loss: 1.3438
Epoch 3/10
0s - loss: 1.3428 - val_loss: 1.3429
Epoch 4/10
0s - loss: 1.3403 - val_loss: 1.3421
Epoch 5/10
0s - loss: 1.3381 - val_loss: 1.3414
Epoch 6/10
0s - loss: 1.3360 - val_loss: 1.3408
Epoch 7/10
0s - loss: 1.3341 - val_loss: 1.3403
Epoch 8/10
0s - loss: 1.3323 - val_loss: 1.3398
Epoch 9/10
0s 

array([ 2.95893544,  3.07742165,  3.3711923 , ...,  3.1989737 ,
        2.3646093 ,  2.89993937])

In [3]:
eval_classification.eval_nested_kfold(name='SUBJ', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
(1, 0.74888888888888894)
(1, 0.74222222222222223)
(1, 0.73888888888888893)
(1, 0.74222222222222223)
(1, 0.71777777777777774)
(1, 0.73222222222222222)
(1, 0.73555555555555552)
(1, 0.73444444444444446)
(1, 0.71999999999999997)
(1, 0.76555555555555554)
(2, 0.76000000000000001)
(2, 0.75666666666666671)
(2, 0.74777777777777776)
(2, 0.75111111111111106)
(2, 0.72555555555555551)
(2, 0.74444444444444446)
(2, 0.74111111111111116)
(2, 0.73999999999999999)
(2, 0.73111111111111116)
(2, 0.77777777777777779)
(4, 0.78111111111111109)
(4, 0.76777777777777778)
(4, 0.75444444444444447)
(4, 0.76444444444444448)
(4, 0.74555555555555553)
(4, 0.74555555555555553)
(4, 0.74888888888888894)
(4, 0.75555555555555554)
(4, 0.73444444444444446)
(4, 0.78111111111111109)
(8, 0.7844444444444445)
(8, 0.77333333333333332)
(8, 0.75777777777777777)
(8, 0.76444444444444448)
(8, 0.74222222222222223)
(8, 0.75777777777777777)
(8, 0.75222222222222224)
(8, 0.76444444444444448)
(8, 0.738888888888888

[0.78600000000000003,
 0.79900000000000004,
 0.78500000000000003,
 0.77800000000000002,
 0.78100000000000003,
 0.76200000000000001,
 0.78100000000000003,
 0.78900000000000003,
 0.76900000000000002,
 0.80100000000000005]
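To put the two SUBJ runs side by side, the sketch below averages the fold accuracies reported for the Gensim implementation earlier in this notebook and for the original C++ implementation above. It assumes each list holds per-fold test accuracies; with a corpus this small, the gap should be read as indicative only.

```python
import statistics

# Fold accuracies copied from the two SUBJ outputs in this notebook.
gensim_subj = [0.771, 0.772, 0.768, 0.779, 0.751,
               0.753, 0.772, 0.773, 0.753, 0.793]
original_subj = [0.786, 0.799, 0.785, 0.778, 0.781,
                 0.762, 0.781, 0.789, 0.769, 0.801]

gensim_mean = statistics.mean(gensim_subj)
original_mean = statistics.mean(original_subj)
print("gensim sent2vec:   %.4f" % gensim_mean)    # 0.7685
print("original sent2vec: %.4f" % original_mean)  # 0.7831
print("difference:        %.4f" % (original_mean - gensim_mean))
```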

In [4]:
eval_classification.eval_nested_kfold(name='MR', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
(1, 0.55000000000000004)
(1, 0.54374999999999996)
(1, 0.54166666666666663)
(1, 0.58333333333333337)
(1, 0.55833333333333335)
(1, 0.56725755995828986)
(1, 0.59436913451511997)
(1, 0.57977059436913447)
(1, 0.57872784150156409)
(1, 0.59019812304483832)
(2, 0.5552083333333333)
(2, 0.54479166666666667)
(2, 0.53749999999999998)
(2, 0.58958333333333335)
(2, 0.55729166666666663)
(2, 0.56100104275286755)
(2, 0.59332638164754958)
(2, 0.57455683003128255)
(2, 0.58706986444212717)
(2, 0.60166840458811266)
(4, 0.55625000000000002)
(4, 0.546875)
(4, 0.54479166666666667)
(4, 0.59375)
(4, 0.55833333333333335)
(4, 0.55474452554744524)
(4, 0.59436913451511997)
(4, 0.58185610010427524)
(4, 0.59019812304483832)
(4, 0.58811261730969755)
(8, 0.56562500000000004)
(8, 0.5541666666666667)
(8, 0.54270833333333335)
(8, 0.58229166666666665)
(8, 0.56354166666666672)
(8, 0.56204379562043794)
(8, 0.60271115745568304)
(8, 0.59019812304483832)
(8, 0.59019812304483832)
(8, 0.59228362877997

[0.56888472352389874,
 0.59793814432989689,
 0.56472795497185746,
 0.57598499061913699,
 0.57973733583489684,
 0.59849906191369606,
 0.60506566604127576,
 0.61913696060037526,
 0.6097560975609756,
 0.59662288930581608]

In [4]:
eval_trec.evaluate(evalcv=False, evaltest=True, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.594


# Evaluation of Doc2Vec

In [10]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [13]:
train_corpus = list(read_corpus(lee_train_file))

In [14]:
# Doc2Vec model1 with PV-DM and sum of context word vectors
doc2vec_model1 = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model1.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights
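The vocabulary log above reports that `min_count=5` retains 1750 of the 6981 word types. The effect of such a frequency cut-off can be sketched on a tiny made-up corpus with `collections.Counter`; this illustrates the idea only and is not Gensim's vocabulary-building code, and `prune_vocab` is a hypothetical helper.

```python
from collections import Counter


def prune_vocab(tokenized_docs, min_count):
    """Count tokens and keep only those occurring at least min_count times."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    kept = {word: count for word, count in counts.items() if count >= min_count}
    return counts, kept


# Made-up documents, not taken from the Lee corpus.
docs = [["fire", "fire", "wind"], ["fire", "rain", "wind"], ["smoke"]]
counts, kept = prune_vocab(docs, min_count=2)

retained_types = len(kept)
retained_tokens = sum(kept.values())
print("retained %d of %d word types" % (retained_types, len(counts)))
print("retained %d of %d tokens" % (retained_tokens, sum(counts.values())))
```

As in the log, rare word types make up most of the vocabulary but a much smaller share of the running text, so pruning removes many types while keeping most tokens.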


In [15]:
# Doc2Vec model2 with PV-DBOW and sum of context word vectors
doc2vec_model2 = gensim.models.doc2vec.Doc2Vec(dm=0, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model2.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [16]:
# Doc2Vec model3 with PV-DM and mean of context word vectors
doc2vec_model3 = gensim.models.doc2vec.Doc2Vec(dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model3.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [17]:
# Doc2Vec model4 with PV-DBOW and mean of context word vectors
doc2vec_model4 = gensim.models.doc2vec.Doc2Vec(dm=0, dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model4.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [18]:
%time doc2vec_model1.train(train_corpus, total_examples=doc2vec_model1.corpus_count, epochs=doc2vec_model1.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 42.53% examples, 307160 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724854 effective words) took 1.8s, 403292 effective words/s
CPU times: user 3.34 s, sys: 235 ms, total: 3.57 s
Wall time: 1.81 s


724854

In [19]:
%time doc2vec_model2.train(train_corpus, total_examples=doc2vec_model2.corpus_count, epochs=doc2vec_model2.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 73.33% examples, 523637 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724863 effective words) took 1.4s, 535931 effective words/s
CPU times: user 2.92 s, sys: 138 ms, total: 3.06 s
Wall time: 1.36 s


724863

In [20]:
%time doc2vec_model3.train(train_corpus, total_examples=doc2vec_model3.corpus_count, epochs=doc2vec_model3.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 55.98% examples, 403436 words/s, in_qsize 6, out_qsize 1
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724763 effective words) took 1.6s, 442325 effective words/s
CPU times: user 3.57 s, sys: 234 ms, total: 3.8 s
Wall time: 1.65 s


724763

In [21]:
%time doc2vec_model4.train(train_corpus, total_examples=doc2vec_model4.corpus_count, epochs=doc2vec_model4.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 88.33% examples, 635615 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724928 effective words) took 1.1s, 639466 effective words/s
CPU times: user 2.8 s, sys: 126 ms, total: 2.92 s
Wall time: 1.14 s


724928

In [23]:
eval_sick.evaluate(doc2vec_model1, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...


  lrmodel.add(Dense(input_dim=ninputs, output_dim=nclass))


Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4719 - val_loss: 1.4461
Epoch 2/10
0s - loss: 1.4426 - val_loss: 1.4399
Epoch 3/10
0s - loss: 1.4361 - val_loss: 1.4349
Epoch 4/10
0s - loss: 1.4303 - val_loss: 1.4306
Epoch 5/10
0s - loss: 1.4250 - val_loss: 1.4268
Epoch 6/10
0s - loss: 1.4202 - val_loss: 1.4235
Epoch 7/10
0s - loss: 1.4158 - val_loss: 1.4206
Epoch 8/10
0s - loss: 1.4117 - val_loss: 1.4180
Epoch 9/10
0s - loss: 1.4079 - val_loss: 1.4156
Epoch 10/10
0s - loss: 1.4044 - val_loss: 1.4135
0.178486854685
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4011 - val_loss: 1.4116
Epoch 2/10
0s - loss: 1.3980 - val_loss: 1.4098
Epoch 3/10
0s - loss: 1.3951 - val_loss: 1.4082
Epoch 4/10
0s - loss: 1.3924 - val_loss: 1.4067
Epoch 5/10
0s - loss: 1.3898 - val_loss: 1.4054
Epoch 6/10
0s - loss: 1.3873 - val_loss: 1.4041
Epoch 7/10
0s - loss: 1.3850 - val_loss: 1.4030
Epoch 8/10
0s - loss: 1.3828 - val_loss: 1.4019
Epoch 9/10
0s 

array([ 3.2508122 ,  3.89464389,  3.20437309, ...,  3.67293952,
        3.36935709,  3.19288288])

In [26]:
eval_classification.eval_nested_kfold(model=doc2vec_model1, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.65000000000000002)
(1, 0.64777777777777779)
(1, 0.65555555555555556)
(1, 0.66888888888888887)
(1, 0.64333333333333331)
(1, 0.69444444444444442)
(1, 0.66111111111111109)
(1, 0.66222222222222227)
(1, 0.6744444444444444)
(1, 0.68666666666666665)
(2, 0.65000000000000002)
(2, 0.64777777777777779)
(2, 0.65777777777777779)
(2, 0.6677777777777778)
(2, 0.64333333333333331)
(2, 0.69222222222222218)
(2, 0.66111111111111109)
(2, 0.66111111111111109)
(2, 0.6744444444444444)
(2, 0.68666666666666665)
(4, 0.65000000000000002)
(4, 0.64666666666666661)
(4, 0.65777777777777779)
(4, 0.66666666666666663)
(4, 0.64222222222222225)
(4, 0.69222222222222218)
(4, 0.66111111111111109)
(4, 0.65888888888888886)
(4, 0.6744444444444444)
(4, 0.68444444444444441)
(8, 0.65000000000000002)
(8, 0.64555555555555555)
(8, 0.65777777777777779)
(8, 0.66666666666666663)
(8, 0.64111111111111108)
(8, 0.69222222222222218)
(8, 0.66111111111111109)
(8, 0.65888888888888886)
(8, 0.6744444444444444)


[0.66600000000000004,
 0.64600000000000002,
 0.66300000000000003,
 0.65500000000000003,
 0.67300000000000004,
 0.65000000000000002,
 0.67400000000000004,
 0.66500000000000004,
 0.69199999999999995,
 0.68300000000000005]

In [27]:
eval_classification.eval_nested_kfold(model=doc2vec_model1, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.52604166666666663)
(1, 0.56562500000000004)
(1, 0.53645833333333337)
(1, 0.57291666666666663)
(1, 0.546875)
(1, 0.53910323253388948)
(1, 0.56934306569343063)
(1, 0.5578727841501564)
(1, 0.55161626694473409)
(1, 0.56830031282586024)
(2, 0.5239583333333333)
(2, 0.56874999999999998)
(2, 0.53645833333333337)
(2, 0.57499999999999996)
(2, 0.54583333333333328)
(2, 0.53910323253388948)
(2, 0.57038581856100101)
(2, 0.55683003128258601)
(2, 0.54953076120959332)
(2, 0.56830031282586024)
(4, 0.5239583333333333)
(4, 0.56874999999999998)
(4, 0.53645833333333337)
(4, 0.57499999999999996)
(4, 0.546875)
(4, 0.53806047966631909)
(4, 0.56934306569343063)
(4, 0.55683003128258601)
(4, 0.54953076120959332)
(4, 0.56725755995828986)
(8, 0.5239583333333333)
(8, 0.56874999999999998)
(8, 0.53749999999999998)
(8, 0.57499999999999996)
(8, 0.546875)
(8, 0.53701772679874871)
(8, 0.56830031282586024)
(8, 0.55683003128258601)
(8, 0.54848800834202294)
(8, 0.56725755995828986)
(16, 0.

[0.5267104029990628,
 0.52764761012183692,
 0.55534709193245779,
 0.56660412757973733,
 0.55534709193245779,
 0.55253283302063794,
 0.57692307692307687,
 0.55159474671669795,
 0.56285178236397748,
 0.56566604127579734]

In [28]:
eval_sick.evaluate(doc2vec_model2, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4817 - val_loss: 1.4438
Epoch 2/10
0s - loss: 1.4426 - val_loss: 1.4364
Epoch 3/10
0s - loss: 1.4371 - val_loss: 1.4318
Epoch 4/10
0s - loss: 1.4319 - val_loss: 1.4276
Epoch 5/10
0s - loss: 1.4269 - val_loss: 1.4236
Epoch 6/10
0s - loss: 1.4222 - val_loss: 1.4199
Epoch 7/10
0s - loss: 1.4177 - val_loss: 1.4164
Epoch 8/10
0s - loss: 1.4134 - val_loss: 1.4132
Epoch 9/10
0s - loss: 1.4094 - val_loss: 1.4101
Epoch 10/10
0s - loss: 1.4056 - val_loss: 1.4072
0.282045039903
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4020 - val_loss: 1.4045
Epoch 2/10
0s - loss: 1.3986 - val_loss: 1.4020
Epoch 3/10
0s - loss: 1.3954 - val_loss: 1.3996
Epoch 4/10
0s - loss: 1.3923 - val_loss: 1.3974
Epoch 5/10
0s - loss: 1.3894 - val_

array([ 3.31670414,  3.39893406,  3.41949492, ...,  3.20750578,
        3.07233875,  2.83874244])

In [25]:
eval_classification.eval_nested_kfold(model=doc2vec_model2, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.74222222222222223)
(1, 0.72111111111111115)
(1, 0.72222222222222221)
(1, 0.73444444444444446)
(1, 0.71777777777777774)
(1, 0.75)
(1, 0.72444444444444445)
(1, 0.72333333333333338)
(1, 0.73777777777777775)
(1, 0.74444444444444446)
(2, 0.74222222222222223)
(2, 0.72444444444444445)
(2, 0.72111111111111115)
(2, 0.73222222222222222)
(2, 0.71777777777777774)
(2, 0.74888888888888894)
(2, 0.72333333333333338)
(2, 0.71888888888888891)
(2, 0.73888888888888893)
(2, 0.74555555555555553)
(4, 0.74222222222222223)
(4, 0.71999999999999997)
(4, 0.72222222222222221)
(4, 0.72999999999999998)
(4, 0.71888888888888891)
(4, 0.7466666666666667)
(4, 0.72333333333333338)
(4, 0.72111111111111115)
(4, 0.73666666666666669)
(4, 0.74333333333333329)
(8, 0.73999999999999999)
(8, 0.71999999999999997)
(8, 0.72111111111111115)
(8, 0.72888888888888892)
(8, 0.72111111111111115)
(8, 0.74444444444444446)
(8, 0.72555555555555551)
(8, 0.72222222222222221)
(8, 0.73555555555555552)
(8, 0.74444

[0.748,
 0.72899999999999998,
 0.73099999999999998,
 0.74099999999999999,
 0.72599999999999998,
 0.73199999999999998,
 0.747,
 0.72899999999999998,
 0.73099999999999998,
 0.745]

In [29]:
eval_classification.eval_nested_kfold(model=doc2vec_model2, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.55833333333333335)
(1, 0.57916666666666672)
(1, 0.57187500000000002)
(1, 0.56874999999999998)
(1, 0.59895833333333337)
(1, 0.5849843587069864)
(1, 0.56830031282586024)
(1, 0.56621480709071947)
(1, 0.5714285714285714)
(1, 0.61835245046923881)
(2, 0.5625)
(2, 0.57708333333333328)
(2, 0.5697916666666667)
(2, 0.56770833333333337)
(2, 0.59791666666666665)
(2, 0.58602711157455678)
(2, 0.56830031282586024)
(2, 0.56830031282586024)
(2, 0.57247132429614178)
(2, 0.61730969760166843)
(4, 0.56041666666666667)
(4, 0.57708333333333328)
(4, 0.56770833333333337)
(4, 0.5697916666666667)
(4, 0.59687500000000004)
(4, 0.5849843587069864)
(4, 0.56517205422314909)
(4, 0.57038581856100101)
(4, 0.57351407716371217)
(4, 0.61626694473409804)
(8, 0.56145833333333328)
(8, 0.57708333333333328)
(8, 0.56770833333333337)
(8, 0.5697916666666667)
(8, 0.59687500000000004)
(8, 0.5849843587069864)
(8, 0.56517205422314909)
(8, 0.57038581856100101)
(8, 0.57351407716371217)
(8, 0.615224191

[0.5538894095595126,
 0.55857544517338331,
 0.57598499061913699,
 0.56003752345215763,
 0.56566604127579734,
 0.57598499061913699,
 0.60037523452157604,
 0.57129455909943716,
 0.5684803001876173,
 0.6163227016885553]

In [30]:
eval_sick.evaluate(doc2vec_model3, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4752 - val_loss: 1.4428
Epoch 2/10
0s - loss: 1.4431 - val_loss: 1.4382
Epoch 3/10
0s - loss: 1.4372 - val_loss: 1.4346
Epoch 4/10
0s - loss: 1.4316 - val_loss: 1.4315
Epoch 5/10
0s - loss: 1.4264 - val_loss: 1.4287
Epoch 6/10
0s - loss: 1.4216 - val_loss: 1.4262
Epoch 7/10
0s - loss: 1.4172 - val_loss: 1.4239
Epoch 8/10
0s - loss: 1.4131 - val_loss: 1.4219
Epoch 9/10
0s - loss: 1.4092 - val_loss: 1.4201
Epoch 10/10
0s - loss: 1.4057 - val_loss: 1.4185
0.125690524435
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4023 - val_loss: 1.4170
Epoch 2/10
0s - loss: 1.3992 - val_loss: 1.4156
Epoch 3/10
0s - loss: 1.3962 - val_loss: 1.4144
Epoch 4/10
0s - loss: 1.3935 - val_loss: 1.4133
Epoch 5/10
0s - loss: 1.3909 - val_

array([ 3.11481246,  3.40361393,  3.05005667, ...,  3.47234805,
        3.43841542,  3.51948238])

In [31]:
eval_classification.eval_nested_kfold(model=doc2vec_model3, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.65555555555555556)
(1, 0.66444444444444439)
(1, 0.66666666666666663)
(1, 0.66333333333333333)
(1, 0.66666666666666663)
(1, 0.67666666666666664)
(1, 0.68555555555555558)
(1, 0.6744444444444444)
(1, 0.68000000000000005)
(1, 0.69111111111111112)
(2, 0.6544444444444445)
(2, 0.6677777777777778)
(2, 0.66666666666666663)
(2, 0.66333333333333333)
(2, 0.66666666666666663)
(2, 0.6744444444444444)
(2, 0.68444444444444441)
(2, 0.67777777777777781)
(2, 0.68111111111111111)
(2, 0.69333333333333336)
(4, 0.65222222222222226)
(4, 0.6677777777777778)
(4, 0.66666666666666663)
(4, 0.66444444444444439)
(4, 0.66666666666666663)
(4, 0.6744444444444444)
(4, 0.68444444444444441)
(4, 0.67777777777777781)
(4, 0.68111111111111111)
(4, 0.68999999999999995)
(8, 0.65222222222222226)
(8, 0.66555555555555557)
(8, 0.66555555555555557)
(8, 0.66444444444444439)
(8, 0.66555555555555557)
(8, 0.6744444444444444)
(8, 0.68444444444444441)
(8, 0.67777777777777781)
(8, 0.68111111111111111)
(8

[0.68000000000000005,
 0.66400000000000003,
 0.67200000000000004,
 0.66800000000000004,
 0.67000000000000004,
 0.67000000000000004,
 0.68200000000000005,
 0.67300000000000004,
 0.68200000000000005,
 0.68500000000000005]

In [32]:
eval_classification.eval_nested_kfold(model=doc2vec_model3, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.5854166666666667)
(1, 0.54270833333333335)
(1, 0.54374999999999996)
(1, 0.55208333333333337)
(1, 0.5708333333333333)
(1, 0.56308654848800832)
(1, 0.54640250260688217)
(1, 0.55578727841501563)
(1, 0.57351407716371217)
(1, 0.59124087591240881)
(2, 0.58437499999999998)
(2, 0.54374999999999996)
(2, 0.54062500000000002)
(2, 0.54895833333333333)
(2, 0.5697916666666667)
(2, 0.56204379562043794)
(2, 0.54640250260688217)
(2, 0.55474452554744524)
(2, 0.57247132429614178)
(2, 0.59124087591240881)
(4, 0.58437499999999998)
(4, 0.54270833333333335)
(4, 0.54062500000000002)
(4, 0.55000000000000004)
(4, 0.5708333333333333)
(4, 0.56308654848800832)
(4, 0.54640250260688217)
(4, 0.55474452554744524)
(4, 0.57247132429614178)
(4, 0.59124087591240881)
(8, 0.58437499999999998)
(8, 0.54270833333333335)
(8, 0.54062500000000002)
(8, 0.54895833333333333)
(8, 0.5708333333333333)
(8, 0.56412930135557871)
(8, 0.54640250260688217)
(8, 0.55474452554744524)
(8, 0.5714285714285714)
(

[0.54732895970009376,
 0.57263355201499533,
 0.52814258911819889,
 0.56941838649155718,
 0.53470919324577859,
 0.56660412757973733,
 0.57129455909943716,
 0.5337711069418386,
 0.56378986866791747,
 0.60225140712945591]

In [33]:
eval_sick.evaluate(doc2vec_model4, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4934 - val_loss: 1.4400
Epoch 2/10
0s - loss: 1.4419 - val_loss: 1.4308
Epoch 3/10
0s - loss: 1.4361 - val_loss: 1.4254
Epoch 4/10
0s - loss: 1.4307 - val_loss: 1.4206
Epoch 5/10
0s - loss: 1.4254 - val_loss: 1.4161
Epoch 6/10
0s - loss: 1.4205 - val_loss: 1.4118
Epoch 7/10
0s - loss: 1.4158 - val_loss: 1.4078
Epoch 8/10
0s - loss: 1.4113 - val_loss: 1.4041
Epoch 9/10
0s - loss: 1.4071 - val_loss: 1.4007
Epoch 10/10
0s - loss: 1.4031 - val_loss: 1.3974
0.341862730526
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.3993 - val_loss: 1.3944
Epoch 2/10
0s - loss: 1.3957 - val_loss: 1.3915
Epoch 3/10
0s - loss: 1.3923 - val_loss: 1.3889
Epoch 4/10
0s - loss: 1.3891 - val_loss: 1.3864
Epoch 5/10
0s - loss: 1.3860 - val_

array([ 3.19522402,  3.93637581,  3.09101247, ...,  3.04093465,
        3.18775981,  3.0320669 ])

In [34]:
eval_classification.eval_nested_kfold(model=doc2vec_model4, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.72666666666666668)
(1, 0.70666666666666667)
(1, 0.72777777777777775)
(1, 0.73888888888888893)
(1, 0.68888888888888888)
(1, 0.72888888888888892)
(1, 0.7122222222222222)
(1, 0.71777777777777774)
(1, 0.73111111111111116)
(1, 0.73111111111111116)
(2, 0.72666666666666668)
(2, 0.70666666666666667)
(2, 0.72666666666666668)
(2, 0.73777777777777775)
(2, 0.68999999999999995)
(2, 0.72777777777777775)
(2, 0.71111111111111114)
(2, 0.71666666666666667)
(2, 0.72999999999999998)
(2, 0.72888888888888892)
(4, 0.72666666666666668)
(4, 0.7088888888888889)
(4, 0.72777777777777775)
(4, 0.73888888888888893)
(4, 0.69111111111111112)
(4, 0.72666666666666668)
(4, 0.70777777777777773)
(4, 0.7155555555555555)
(4, 0.72888888888888892)
(4, 0.73111111111111116)
(8, 0.72666666666666668)
(8, 0.70999999999999996)
(8, 0.72666666666666668)
(8, 0.73999999999999999)
(8, 0.69111111111111112)
(8, 0.72777777777777775)
(8, 0.70666666666666667)
(8, 0.71444444444444444)
(8, 0.72888888888888892

[0.748,
 0.72599999999999998,
 0.71699999999999997,
 0.73099999999999998,
 0.71299999999999997,
 0.70799999999999996,
 0.71599999999999997,
 0.73699999999999999,
 0.71399999999999997,
 0.73099999999999998]

In [35]:
eval_classification.eval_nested_kfold(model=doc2vec_model4, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.56666666666666665)
(1, 0.60416666666666663)
(1, 0.546875)
(1, 0.56354166666666672)
(1, 0.5708333333333333)
(1, 0.59436913451511997)
(1, 0.58915537017726793)
(1, 0.57455683003128255)
(1, 0.58915537017726793)
(1, 0.61001042752867574)
(2, 0.56562500000000004)
(2, 0.60624999999999996)
(2, 0.546875)
(2, 0.56562500000000004)
(2, 0.57395833333333335)
(2, 0.59541188738269035)
(2, 0.59124087591240881)
(2, 0.57351407716371217)
(2, 0.58706986444212717)
(2, 0.61313868613138689)
(4, 0.56458333333333333)
(4, 0.60312500000000002)
(4, 0.546875)
(4, 0.56666666666666665)
(4, 0.57291666666666663)
(4, 0.59749739311783112)
(4, 0.59124087591240881)
(4, 0.57351407716371217)
(4, 0.58602711157455678)
(4, 0.60896767466110535)
(8, 0.56458333333333333)
(8, 0.60416666666666663)
(8, 0.54791666666666672)
(8, 0.5697916666666667)
(8, 0.57291666666666663)
(8, 0.59645464025026074)
(8, 0.58811261730969755)
(8, 0.5714285714285714)
(8, 0.58602711157455678)
(8, 0.60896767466110535)
(16, 0

[0.57544517338331769,
 0.56794751640112462,
 0.60037523452157604,
 0.55816135084427765,
 0.5544090056285178,
 0.58536585365853655,
 0.59756097560975607,
 0.58442776735459667,
 0.57223264540337715,
 0.60131332082551592]

In [36]:
eval_trec.evaluate(doc2vec_model1, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.37


In [37]:
eval_trec.evaluate(doc2vec_model2, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.426


In [38]:
eval_trec.evaluate(doc2vec_model3, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.382


In [39]:
eval_trec.evaluate(doc2vec_model4, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.414


# Evaluation of sentence vectors obtained from averaging FastText word vectors
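
As a minimal sketch of what "averaging word vectors" means here, the snippet below builds a sentence vector as the element-wise mean of its word vectors. The helper `average_word_vectors` and the toy vectors are hypothetical and are not taken from the evaluation scripts. How those scripts treat out-of-vocabulary words is not shown in this notebook; this sketch simply skips them.

```python
def average_word_vectors(tokens, word_vectors, dim):
    """Return the element-wise mean of the vectors of the known tokens.

    `word_vectors` is any mapping from token to a list of floats
    (a hypothetical stand-in for a trained model's word vectors).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known token: fall back to a zero vector (an assumption of this sketch).
        return [0.0] * dim
    # zip(*known) groups the i-th component of every word vector together.
    return [sum(column) / float(len(known)) for column in zip(*known)]


# Toy 3-dimensional "word vectors", made up for illustration only.
toy_vectors = {
    "the": [0.0, 2.0, 4.0],
    "cat": [1.0, 0.0, 0.0],
    "sat": [2.0, 1.0, 2.0],
}

sentence_vector = average_word_vectors("the cat sat".split(), toy_vectors, dim=3)
print(sentence_vector)  # [1.0, 1.0, 2.0]
```

With real models one would more likely stack the vectors in a NumPy array and call `mean(axis=0)`; plain lists are used here only to keep the sketch dependency-free.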

In [54]:
# Train a FastText model on the Lee corpus.
lee_data = LineSentence(lee_train_file)
fasttext_model = ft(size=100, alpha=0.2, negative=10, max_vocab_size=30000000, seed=42, iter=20)
fasttext_model.build_vocab(lee_data)
# Time only the call to train(); vocabulary building is excluded.
start_time = time.time()
fasttext_model.train(lee_data, total_examples=fasttext_model.corpus_count, epochs=fasttext_model.iter)
print("\n\nTotal training time: %s seconds" % (time.time() - start_time))

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 10781 word types from a corpus of 59890 raw words and 300 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
INFO:gensim.models.word2vec:min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 10781 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 45 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
INFO:gensim.models.word2vec:estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.fasttext:Total number of ngrams is 17006
INFO:gens

In [57]:
fasttext_model.save('ft1')

INFO:gensim.utils:saving FastText object under ft1, separately None
INFO:gensim.utils:not storing attribute syn0norm
INFO:gensim.utils:not storing attribute syn0_ngrams_norm
INFO:gensim.utils:not storing attribute syn0_vocab_norm
INFO:gensim.utils:saved ft1


In [40]:
ft_loaded_model = ft.load('ft1')

INFO:gensim.utils:loading FastText object from ft1
INFO:gensim.utils:loading wv recursively from ft1.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:setting ignored attribute syn0_ngrams_norm to None
INFO:gensim.utils:setting ignored attribute syn0_vocab_norm to None
INFO:gensim.utils:loaded ft1


In [41]:
eval_sick.evaluate(ft_loaded_model, seed=42, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.7577 - val_loss: 1.4547
Epoch 2/10
0s - loss: 1.4562 - val_loss: 1.3914
Epoch 3/10
0s - loss: 1.4006 - val_loss: 1.3606
Epoch 4/10
0s - loss: 1.3675 - val_loss: 1.3430
Epoch 5/10
0s - loss: 1.3449 - val_loss: 1.3319
Epoch 6/10
0s - loss: 1.3282 - val_loss: 1.3245
Epoch 7/10
0s - loss: 1.3152 - val_loss: 1.3192
Epoch 8/10
0s - loss: 1.3048 - val_loss: 1.3152
Epoch 9/10
0s - loss: 1.2962 - val_loss: 1.3122
Epoch 10/10
0s - loss: 1.2889 - val_loss: 1.3099
0.49614948157
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.2826 - val_loss: 1.3081
Epoch 2/10
0s - loss: 1.2772 - val_loss: 1.3066
Epoch 3/10
0s - loss: 1.2724 - val_loss: 1.3055
Epoch 4/10
0s - loss: 1.2682 - val_loss: 1.3046
Epoch 5/10
0s - loss: 1.2644 - val_l

array([ 2.83366221,  3.42188593,  3.40384959, ...,  3.39955676,
        3.13025376,  2.85702541])
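
The SICK results in this notebook are summarised as Pearson, Spearman and MSE values. As a rough illustration of what those three numbers measure, the sketch below computes them for a handful of made-up scores, assuming the evaluation compares predicted relatedness scores against gold scores. The helper functions and the toy data are hypothetical; this is not the code used by `eval_sick`.

```python
import math


def mean(values):
    return sum(values) / float(len(values))


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def mse(xs, ys):
    """Mean squared error between two equally long score lists."""
    return mean([(x - y) ** 2 for x, y in zip(xs, ys)])


# Toy gold and predicted relatedness scores (invented for illustration).
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 1.0, 3.5, 4.0, 4.5]

print("Pearson:  %.4f" % pearson(predicted, gold))
print("Spearman: %.4f" % spearman(predicted, gold))
print("MSE:      %.4f" % mse(predicted, gold))
```

Higher Pearson and Spearman values and a lower MSE indicate predictions that track the gold scores more closely. In practice, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` provide the two correlations directly.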

In [42]:
eval_classification.eval_nested_kfold(model=ft_loaded_model, name='SUBJ', use_nb=False, model_name='fasttext')

Computing sentence vectors...
(1, 0.80777777777777782)
(1, 0.80222222222222217)
(1, 0.7877777777777778)
(1, 0.81888888888888889)
(1, 0.80111111111111111)
(1, 0.7911111111111111)
(1, 0.81888888888888889)
(1, 0.79666666666666663)
(1, 0.78888888888888886)
(1, 0.82222222222222219)
(2, 0.80777777777777782)
(2, 0.80222222222222217)
(2, 0.7877777777777778)
(2, 0.81666666666666665)
(2, 0.80333333333333334)
(2, 0.79000000000000004)
(2, 0.81888888888888889)
(2, 0.79666666666666663)
(2, 0.78888888888888886)
(2, 0.82111111111111112)
(4, 0.80666666666666664)
(4, 0.80222222222222217)
(4, 0.79000000000000004)
(4, 0.81666666666666665)
(4, 0.80333333333333334)
(4, 0.79000000000000004)
(4, 0.81888888888888889)
(4, 0.79777777777777781)
(4, 0.78888888888888886)
(4, 0.81999999999999995)
(8, 0.80666666666666664)
(8, 0.80222222222222217)
(8, 0.79000000000000004)
(8, 0.81555555555555559)
(8, 0.80333333333333334)
(8, 0.79000000000000004)
(8, 0.81888888888888889)
(8, 0.79777777777777781)
(8, 0.78888888888888886

[0.82299999999999995,
 0.80500000000000005,
 0.79700000000000004,
 0.80600000000000005,
 0.80800000000000005,
 0.78600000000000003,
 0.80500000000000005,
 0.80800000000000005,
 0.78800000000000003,
 0.81899999999999995]

In [44]:
eval_classification.eval_nested_kfold(model=ft_loaded_model, name='MR', use_nb=False, model_name='fasttext')

Computing sentence vectors...
(1, 0.609375)
(1, 0.57395833333333335)
(1, 0.60520833333333335)
(1, 0.62083333333333335)
(1, 0.56666666666666665)
(1, 0.64025026068821689)
(1, 0.63503649635036497)
(1, 0.61626694473409804)
(1, 0.59124087591240881)
(1, 0.61209593326381651)
(2, 0.61041666666666672)
(2, 0.57291666666666663)
(2, 0.60520833333333335)
(2, 0.62187499999999996)
(2, 0.56458333333333333)
(2, 0.64337851929092804)
(2, 0.6329509906152242)
(2, 0.61626694473409804)
(2, 0.59124087591240881)
(2, 0.61105318039624612)
(4, 0.61041666666666672)
(4, 0.57291666666666663)
(4, 0.60416666666666663)
(4, 0.62187499999999996)
(4, 0.56562500000000004)
(4, 0.64337851929092804)
(4, 0.6329509906152242)
(4, 0.61626694473409804)
(4, 0.59124087591240881)
(4, 0.61105318039624612)
(8, 0.61041666666666672)
(8, 0.57291666666666663)
(8, 0.60312500000000002)
(8, 0.62187499999999996)
(8, 0.56458333333333333)
(8, 0.64337851929092804)
(8, 0.6329509906152242)
(8, 0.61626694473409804)
(8, 0.59124087591240881)
(8, 0.611

[0.61480787253983127,
 0.60449859418931584,
 0.59193245778611636,
 0.61726078799249529,
 0.59005628517823638,
 0.60694183864915574,
 0.64446529080675419,
 0.61257035647279545,
 0.60412757973733588,
 0.60694183864915574]

In [45]:
eval_trec.evaluate(ft_loaded_model, evalcv=False, evaltest=True, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.62


# Evaluation Results

| S.No. | Model Name                                | Training Time (in seconds) | Pearson/Spearman/MSE on SICK | Mean SUBJ | Mean MR | TREC |
|-------|-------------------------------------------|----------------------------|------------------------------|-----------|---------|------|
| 1.    | Gensim Sent2Vec                           | 319.09                     | 0.48/0.49/0.78               | 0.76      | 0.58    | 0.63 |
| 2.    | Original Sent2Vec                         | 25.75                      | 0.42/0.43/0.82               | 0.78      | 0.59    | 0.59 |
| 3.    | PV-DM with sum of context word vectors    | 3.57                       | 0.27/0.27/0.94               | 0.66      | 0.55    | 0.37 |
| 4.    | PV-DM with mean of context word vectors   | 3.8                        | 0.28/0.28/0.93               | 0.67      | 0.55    | 0.38 |
| 5.    | PV-DBOW with sum of context word vectors  | 3.06                       | 0.36/0.35/0.88               | 0.73      | 0.57    | 0.42 |
| 6.    | PV-DBOW with mean of context word vectors | 2.92                       | 0.34/0.34/0.89               | 0.72      | 0.57    | 0.41 |
| 7.    | Mean of Gensim FastText word vectors      | 1540.17                    | 0.49/0.49/0.76               | 0.80      | 0.60    | 0.62 |
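
The "Mean SUBJ" and "Mean MR" columns can be read as averages over the ten per-fold accuracies that `eval_nested_kfold` returns. That reading is an assumption, since the notebook does not show how the table was compiled. A minimal sketch, using the ten SUBJ scores printed for the FastText model above:

```python
# Per-fold SUBJ accuracies as printed by eval_nested_kfold for the FastText model.
subj_fold_accuracies = [
    0.823, 0.805, 0.797, 0.806, 0.808,
    0.786, 0.805, 0.808, 0.788, 0.819,
]

# Plain arithmetic mean over the ten outer folds.
mean_subj = sum(subj_fold_accuracies) / float(len(subj_fold_accuracies))
print("Mean SUBJ accuracy: %.4f" % mean_subj)
```

The result is about 0.80, which agrees with the FastText row of the table.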