# Using Sent2Vec via Gensim

This tutorial shows how to use the sent2vec model in Gensim: training sentence-embedding models, saving and loading them, and performing similarity operations. The notebook also compares the Gensim implementation with the [original C++ implementation](https://github.com/epfml/sent2vec), Gensim's Doc2Vec and Gensim's FastText. All the evaluation scripts used in the notebook can be found [here](https://gist.github.com/prerna135/9b5eb55054d29c1495460b75fc061c6b).

In [None]:
# Download evaluation scripts
! wget https://gist.github.com/prerna135/9b5eb55054d29c1495460b75fc061c6b/raw/7a970c7480474516ee41ba7a60d37092819822ea/eval_classification.py
! wget https://gist.github.com/prerna135/9b5eb55054d29c1495460b75fc061c6b/raw/7a970c7480474516ee41ba7a60d37092819822ea/eval_sick.py
! wget https://gist.github.com/prerna135/9b5eb55054d29c1495460b75fc061c6b/raw/7a970c7480474516ee41ba7a60d37092819822ea/eval_trec.py
! wget https://gist.github.com/prerna135/9b5eb55054d29c1495460b75fc061c6b/raw/7a970c7480474516ee41ba7a60d37092819822ea/dataset_handler.py

# What is Sent2Vec?

Sent2Vec produces numerical representations (features) for short texts or sentences, which can then be used as input to any machine learning task. Think of it as an unsupervised version of FastText, and an extension of word2vec (CBOW) to sentences. The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. According to the [paper](https://arxiv.org/abs/1703.02507), it outperforms state-of-the-art unsupervised models on most benchmark tasks and even beats supervised models on many of them, which highlights the robustness of the produced sentence embeddings.

The sentence embedding is defined as the average of the source word embeddings of its constituent words. The model is further augmented by learning source embeddings not only for unigrams but also for the n-grams present in each sentence, and averaging the n-gram embeddings along with the word embeddings.
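The averaging described above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the function name `sentence_embedding` and the toy lookup tables are hypothetical, only bigrams are considered, and n-grams are looked up by their text rather than hashed into buckets as a real implementation would do.

```python
import numpy as np


def sentence_embedding(tokens, word_vectors, ngram_vectors, dim):
    """Average the unigram and bigram source embeddings of one sentence."""
    # Build the bigrams of the sentence, e.g. ['new', 'york'] -> ['new york'].
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    # Collect every embedding we know about; unknown items are skipped.
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    vectors += [ngram_vectors[b] for b in bigrams if b in ngram_vectors]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy 2-dimensional embeddings, chosen by hand for illustration only.
word_vectors = {'new': np.array([1.0, 0.0]), 'york': np.array([0.0, 1.0])}
ngram_vectors = {'new york': np.array([2.0, 2.0])}

embedding = sentence_embedding(['new', 'york'], word_vectors, ngram_vectors, dim=2)
print(embedding)
```

Here the sentence vector is the mean of three source embeddings: two unigrams and one bigram.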

# Training models

For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model. All models are trained with the same hyperparameters for evaluation purposes.

In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.sent2vec import Sent2Vec
from gensim.models.fasttext import FastText
from gensim.utils import tokenize
import re
import time
import numpy as np
import eval_sick
import eval_classification
import eval_trec
import smart_open

Using TensorFlow backend.


# Prepare training data

In [2]:
data_dir = '{}'.format(os.sep).join([gensim.__path__[0], 'test', 'test_data']) + os.sep
lee_train_file = data_dir + 'lee_background.cor'
lee_data = []
with open(lee_train_file) as f1, open("./input.txt",'w') as f2:
    for line in f1:
        if line not in ['\n', '\r\n']:
            line = re.split(r'\.|\?|\n', line.strip())
            for sentence in line:
                if len(sentence) > 1:
                    sentence = tokenize(sentence)
                    lee_data.append(list(sentence))
                    f2.write(' '.join(lee_data[-1]) + '\n')

In [3]:
# Print sample training data
for sentence in lee_data[:5]:
    print sentence,'\n'

[u'Hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'Southern', u'Highlands', u'of', u'New', u'South', u'Wales', u'as', u'strong', u'winds', u'today', u'pushed', u'a', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'Hill', u'Top'] 

[u'A', u'new', u'blaze', u'near', u'Goulburn', u'south', u'west', u'of', u'Sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'Hume', u'Highway'] 

[u'At', u'about', u'pm', u'AEDT', u'a', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'a', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'Blue', u'Mountains', u'forced', u'authorities', u'to', u'make', u'a', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'Hill', u'Top', u'in', u'the', u'New', u'South', u'Wales', u'southern', u'highlands'] 

[u'An', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'Mittago

# Using gensim implementation of sent2vec

In [4]:
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

In [5]:
# Train new sent2vec model
% time sent2vec_model = Sent2Vec(lee_data, size=100, epochs=20, seed=42, workers=4)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 0.06 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 1531, tokens read: 60302
INFO:gensim.models.sent2vec:training model with 4 workers on 1531 vocabulary and 100 features
INFO:gensim.models.sent2vec:PROGRESS: at 8.28% words, 79298 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 17.39% words, 81742 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 27.32% words, 86469 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 38.92% words, 92702 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 48.86% words, 92947 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 59.62% words, 94653 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 71.22% words, 95012 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 82.82% words, 97058 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 91.92% words, 95278 words/s
INFO:gensim.models.sent2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.m

# Training hyperparameters

Sent2Vec supports the following parameters:

- sentences : Iterable of tokenized sentences, each a list of tokens. For larger corpora (like the Toronto corpus), consider an iterable that streams the sentences directly from disk/network.
- size : Dimensionality of the feature vectors. Default 100
- lr : Initial learning rate. Default 0.2
- seed : Seed for the random number generator, for reproducibility. Default 42
- min_count : Ignore all words with total frequency lower than this. Default 5
- max_vocab_size : Limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Default is 30000000.
- t : Threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).
- loss_type : Loss function to use. Default is 'ns' (negative sampling).
- neg : Specifies how many "noise words" should be drawn (usually between 5-20). Default is 10.
- epochs : Number of iterations (epochs) over the corpus. Default is 5.
- lr_update_rate : Change the rate of updates for the learning rate. Default is 100.
- word_ngrams : Max length of word ngram. Default is 2.
- bucket : Number of hash buckets for vocabulary. Default is 2000000.
- minn : Min length of char ngrams. Default is 3.
- maxn : Max length of char ngrams. Default is 6.
- dropoutk : Number of ngrams dropped when training a sent2vec model. Default is 2.
- batch_words : Target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
- workers : Use this many worker threads to train the model (=faster training with multicore machines). Default is 3.
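For the `sentences` parameter, a streaming iterable can be sketched as follows. The class name `SentenceStream` is hypothetical, and whether Sent2Vec accepts such an object in place of a list is an assumption based on the parameter description above. The file is expected to hold one whitespace-tokenized sentence per line, like the input.txt written earlier.

```python
import os
import tempfile


class SentenceStream(object):
    """Yield one tokenized sentence per line, reading lazily from disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so the stream can be
        # iterated once per training epoch without holding it in memory.
        with open(self.path) as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Demo on a throwaway file (contents are illustrative only).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Hundreds of people\n\nA new blaze\n')
    demo_path = tmp.name

stream = SentenceStream(demo_path)
first_pass = list(stream)
second_pass = list(stream)  # a second pass yields the same sentences
os.remove(demo_path)
print(first_pass)
```

Gensim's `LineSentence`, imported at the top of this notebook, follows the same one-sentence-per-line convention.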

In [6]:
# Print sentence vector
sent2vec_model.sentence_vectors(['This', 'is', 'an', 'awesome', 'gift'])

array([ 0.59427662,  0.45323249,  0.50265895,  0.4995847 ,  0.34286592,
        0.50089866,  0.61660693,  0.66634651,  0.49990433,  0.64750528,
        0.52456549,  0.57985228,  0.37021146,  0.60212792,  0.43293218,
        0.41538005,  0.69427354,  0.65900928,  0.70971879,  0.42145939,
        0.56558073,  0.61602969,  0.28584392,  0.5740314 ,  0.58066796,
        0.44665115,  0.5402053 ,  0.61129872,  0.52463576,  0.48204397,
        0.51067452,  0.52571809,  0.461759  ,  0.45117879,  0.4421997 ,
        0.54885694,  0.63369278,  0.46641079,  0.19771745,  0.60596168,
        0.76673974,  0.45278565,  0.65774093,  0.21516863,  0.29132966,
        0.69450265,  0.41453836,  0.58605764,  0.37342349,  0.33855072,
        0.42545663,  0.4449666 ,  0.37623366,  0.41264582,  0.80266036,
        0.45889603,  0.4402174 ,  0.65778087,  0.42700231,  0.5813403 ,
        0.31794493,  0.37576896,  0.48747489,  0.45650574,  0.39796694,
        0.45184011,  0.52442752,  0.34692425,  0.41412414,  0.46

In [7]:
#print cosine similarity between two sentences
sent2vec_model.similarity(['This', 'is', 'an', 'awesome', 'gift'], ['This', 'present', 'is', 'great'])

0.96255669226718132
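For reference, the cosine similarity of two vectors can be computed directly with NumPy. This is a minimal sketch of the standard formula, not the code behind the `similarity` method; the function name `cosine_similarity` is chosen here for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 0.0
    return float(np.dot(u, v) / denom)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
print(same_direction)
print(orthogonal)
```

Note that any two vectors whose components are all positive have a positive cosine, so a high score between such vectors should be interpreted with some care.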

# Saving and loading models

Models can be saved and loaded via the load and save methods.

In [8]:
# Save trained sent2vec model
sent2vec_model.save('s2v1')

INFO:gensim.utils:saving Sent2Vec object under s2v1, separately None
INFO:gensim.utils:storing np array 'wi' to s2v1.wi.npy
INFO:gensim.utils:saved s2v1


In [9]:
# Load pretrained sent2vec model
loaded_model = Sent2Vec.load('s2v1')

INFO:gensim.utils:loading Sent2Vec object from s2v1
INFO:gensim.utils:loading wi from s2v1.wi.npy with mmap=None
INFO:gensim.utils:loaded s2v1


# Unsupervised similarity evaluation

Unsupervised evaluation of the learnt sentence embeddings is performed using sentence cosine similarity on the [SICK 2014](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools) dataset. These similarity scores are compared to the gold-standard human judgements using [Pearson’s correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) scores. The SICK dataset consists of about 10,000 sentence pairs along with relatedness scores of the pairs. We use the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts).
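The evaluation script reports Pearson correlation, Spearman correlation and mean squared error. As a rough sketch of what these numbers mean, they can be computed with NumPy as shown here. The helper names and the toy scores are made up for illustration, and the rank helper assumes there are no tied values.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def rank(values):
    """Ranks starting at 1; assumes no ties in the input."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def mse(x, y):
    """Mean squared error between predictions and gold scores."""
    return float(np.mean((np.asarray(x) - np.asarray(y)) ** 2))


# Toy relatedness scores on a 1-5 scale (illustrative, not SICK data).
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.0, 4.5, 4.0]

print(pearson(gold, predicted))
print(spearman(gold, predicted))
print(mse(gold, predicted))
```

Pearson measures linear agreement with the human scores, while Spearman only looks at whether the pairs are ranked in the same order.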

In [10]:
eval_sick.evaluate(loaded_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...


  lrmodel.add(Dense(input_dim=ninputs, output_dim=nclass))


Training...
Dev Pearson: 0.422729545743
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.431216972239
Test Spearman: 0.430858902788
Test MSE: 0.830003650035


array([ 3.34222598,  3.36105921,  3.34240926, ...,  3.06700658,
        2.60424432,  3.44811346])

# Downstream Supervised Evaluation

Sentence embeddings are evaluated on several supervised classification tasks: movie review sentiment (MR) (Pang & Lee, 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004) and question type classification (TREC) (Voorhees, 2002). To classify, we use the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts). Sent2Vec embeddings are inferred from input sentences and fed directly to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the [MR and SUBJ](https://www.cs.cornell.edu/people/pabo/movie-review-data/) datasets; for those datasets, nested cross-validation is used to tune the L2 penalty. For the [TREC dataset](http://cogcomp.cs.illinois.edu/Data/QA/QC/), accuracy is computed on the test set.
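The 10-fold procedure can be sketched with NumPy alone. To keep the example dependency-free, it swaps the logistic regression classifier for a simple nearest-centroid classifier and leaves out the nested loop that tunes the L2 penalty, so it illustrates the fold structure only, not the evaluation script itself. All function names and the synthetic data are assumptions made for this sketch.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, seed=42):
    """Shuffle the sample indices and split them into n_folds parts."""
    rng = np.random.RandomState(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Stand-in classifier: assign each test vector to the closest class mean."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    distances = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[distances.argmin(axis=1)]
    return float(np.mean(predictions == test_y))


def cross_validate(features, labels, n_folds=10):
    """Return one accuracy score per fold."""
    folds = kfold_indices(len(labels), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        scores.append(nearest_centroid_accuracy(
            features[train_idx], labels[train_idx],
            features[test_idx], labels[test_idx]))
    return scores


# Synthetic, well-separated "sentence vectors" for two classes.
rng = np.random.RandomState(0)
features = np.vstack([rng.normal(-2.0, 1.0, size=(100, 5)),
                      rng.normal(2.0, 1.0, size=(100, 5))])
labels = np.array([0] * 100 + [1] * 100)

scores = cross_validate(features, labels)
mean_accuracy = float(np.mean(scores))
print(scores)
print(mean_accuracy)
```

The list of per-fold scores has the same shape as the lists printed by the evaluation cells; averaging it gives a single accuracy figure per model.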

In [11]:
eval_classification.eval_nested_kfold(model=loaded_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.70099999999999996]
[0.70099999999999996, 0.71799999999999997]
[0.70099999999999996, 0.71799999999999997, 0.71199999999999997]
[0.70099999999999996, 0.71799999999999997, 0.71199999999999997, 0.72799999999999998]
[0.70099999999999996, 0.71799999999999997, 0.71199999999999997, 0.72799999999999998, 0.72299999999999998]
[0.70099999999999996, 0.71799999999999997, 0.71199999999999997, 0.72799999999999998, 0.72299999999999998, 0.72399999999999998]
[0.70099999999999996, 0.71799999999999997, 0.71199999999999997, 0.72799999999999998, 0.72299999999999998, 0.72399999999999998, 0.70199999999999996]
[0.70099999999999996, 0.71799999999999997, 0.71199999999999997, 0.72799999999999998, 0.72299999999999998, 0.72399999999999998, 0.70199999999999996, 0.72399999999999998]
[0.70099999999999996, 0.71799999999999997, 0.71199999999999997, 0.72799999999999998, 0.72299999999999998, 0.72399999999999998, 0.70199999999999996, 0.72399999999999998, 0.70099999999999996]
[0.7009999999999

[0.70099999999999996,
 0.71799999999999997,
 0.71199999999999997,
 0.72799999999999998,
 0.72299999999999998,
 0.72399999999999998,
 0.70199999999999996,
 0.72399999999999998,
 0.70099999999999996,
 0.72999999999999998]

In [12]:
eval_classification.eval_nested_kfold(model=loaded_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.57357075913776945]
[0.57357075913776945, 0.55576382380506095]
[0.57357075913776945, 0.55576382380506095, 0.57973733583489684]
[0.57357075913776945, 0.55576382380506095, 0.57973733583489684, 0.57035647279549717]
[0.57357075913776945, 0.55576382380506095, 0.57973733583489684, 0.57035647279549717, 0.53658536585365857]
[0.57357075913776945, 0.55576382380506095, 0.57973733583489684, 0.57035647279549717, 0.53658536585365857, 0.54596622889305813]
[0.57357075913776945, 0.55576382380506095, 0.57973733583489684, 0.57035647279549717, 0.53658536585365857, 0.54596622889305813, 0.60787992495309573]
[0.57357075913776945, 0.55576382380506095, 0.57973733583489684, 0.57035647279549717, 0.53658536585365857, 0.54596622889305813, 0.60787992495309573, 0.6163227016885553]
[0.57357075913776945, 0.55576382380506095, 0.57973733583489684, 0.57035647279549717, 0.53658536585365857, 0.54596622889305813, 0.60787992495309573, 0.6163227016885553, 0.60037523452157604]
[0.573570759137769

[0.57357075913776945,
 0.55576382380506095,
 0.57973733583489684,
 0.57035647279549717,
 0.53658536585365857,
 0.54596622889305813,
 0.60787992495309573,
 0.6163227016885553,
 0.60037523452157604,
 0.58348968105065668]

In [13]:
eval_trec.evaluate(model=loaded_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.586


# Evaluation of original c++ implementation of sent2vec

To build and train the original C++ implementation of sent2vec, use the following commands. Running make produces object files for all the classes as well as the main binary, fasttext (the executable invoked in the training cell).

In [15]:
! git clone https://github.com/epfml/sent2vec.git
% cd sent2vec
! make

/Users/prerna135/Documents/GitHub/gensim/sent2vec


In [5]:
# Train model using original c++ implementation of sent2vec
! time ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 20 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 4 -t 0.0001 -dropoutK 2 -bucket 2000000

Read 0M words
Number of words:  1837
Number of labels: 0
Progress: 100.0%  words/sec/thread: 92875  lr: 0.000000  loss: 3.107460  eta: 0h0m

real	0m13.490s
user	0m16.215s
sys	0m2.013s


In [7]:
% cd ..
eval_sick.evaluate(seed=42, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...


  lrmodel.add(Dense(input_dim=ninputs, output_dim=nclass))


Training...
Dev Pearson: 0.419326924957
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.415730692175
Test Spearman: 0.422953082413
Test MSE: 0.842070843724


array([ 3.00608191,  3.19365489,  3.4313475 , ...,  3.29148397,
        2.62857256,  3.05053893])

In [8]:
eval_classification.eval_nested_kfold(name='SUBJ', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.78700000000000003]
[0.78700000000000003, 0.79900000000000004]
[0.78700000000000003, 0.79900000000000004, 0.79000000000000004]
[0.78700000000000003, 0.79900000000000004, 0.79000000000000004, 0.78200000000000003]
[0.78700000000000003, 0.79900000000000004, 0.79000000000000004, 0.78200000000000003, 0.78300000000000003]
[0.78700000000000003, 0.79900000000000004, 0.79000000000000004, 0.78200000000000003, 0.78300000000000003, 0.76200000000000001]
[0.78700000000000003, 0.79900000000000004, 0.79000000000000004, 0.78200000000000003, 0.78300000000000003, 0.76200000000000001, 0.78600000000000003]
[0.78700000000000003, 0.79900000000000004, 0.79000000000000004, 0.78200000000000003, 0.78300000000000003, 0.76200000000000001, 0.78600000000000003, 0.79300000000000004]
[0.78700000000000003, 0.79900000000000004, 0.79000000000000004, 0.78200000000000003, 0.78300000000000003, 0.76200000000000001, 0.78600000000000003, 0.79300000000000004, 0.76100000000000001]
[0.7870000000000

[0.78700000000000003,
 0.79900000000000004,
 0.79000000000000004,
 0.78200000000000003,
 0.78300000000000003,
 0.76200000000000001,
 0.78600000000000003,
 0.79300000000000004,
 0.76100000000000001,
 0.79200000000000004]

In [9]:
eval_classification.eval_nested_kfold(name='MR', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.57825679475164016]
[0.57825679475164016, 0.584817244611059]
[0.57825679475164016, 0.584817244611059, 0.5684803001876173]
[0.57825679475164016, 0.584817244611059, 0.5684803001876173, 0.56378986866791747]
[0.57825679475164016, 0.584817244611059, 0.5684803001876173, 0.56378986866791747, 0.57786116322701686]
[0.57825679475164016, 0.584817244611059, 0.5684803001876173, 0.56378986866791747, 0.57786116322701686, 0.59099437148217637]
[0.57825679475164016, 0.584817244611059, 0.5684803001876173, 0.56378986866791747, 0.57786116322701686, 0.59099437148217637, 0.60694183864915574]
[0.57825679475164016, 0.584817244611059, 0.5684803001876173, 0.56378986866791747, 0.57786116322701686, 0.59099437148217637, 0.60694183864915574, 0.61538461538461542]
[0.57825679475164016, 0.584817244611059, 0.5684803001876173, 0.56378986866791747, 0.57786116322701686, 0.59099437148217637, 0.60694183864915574, 0.61538461538461542, 0.61163227016885557]
[0.57825679475164016, 0.584817244611059

[0.57825679475164016,
 0.584817244611059,
 0.5684803001876173,
 0.56378986866791747,
 0.57786116322701686,
 0.59099437148217637,
 0.60694183864915574,
 0.61538461538461542,
 0.61163227016885557,
 0.59474671669793622]

In [10]:
eval_trec.evaluate(evalcv=False, evaltest=True, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.6


# Evaluation of Doc2Vec

In [13]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [14]:
train_corpus = list(read_corpus(lee_train_file))

In [15]:
# Doc2Vec model1 with PV-DM and sum of context word vectors
doc2vec_model1 = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model1.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [16]:
# Doc2Vec model2 with PV-DBOW and sum of context word vectors
doc2vec_model2 = gensim.models.doc2vec.Doc2Vec(dm=0, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model2.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [17]:
# Doc2Vec model3 with PV-DM and mean of context word vectors
doc2vec_model3 = gensim.models.doc2vec.Doc2Vec(dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model3.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [18]:
# Doc2Vec model4 with PV-DBOW and mean of context word vectors
doc2vec_model4 = gensim.models.doc2vec.Doc2Vec(dm=0, dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model4.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [19]:
%time doc2vec_model1.train(train_corpus, total_examples=doc2vec_model1.corpus_count, epochs=doc2vec_model1.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 62.53% examples, 451598 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724888 effective words) took 1.6s, 459770 effective words/s
CPU times: user 3.73 s, sys: 230 ms, total: 3.96 s
Wall time: 1.59 s


724888

In [20]:
%time doc2vec_model2.train(train_corpus, total_examples=doc2vec_model2.corpus_count, epochs=doc2vec_model2.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 80.98% examples, 579813 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724799 effective words) took 1.2s, 586018 effective words/s
CPU times: user 3.07 s, sys: 126 ms, total: 3.2 s
Wall time: 1.24 s


724799

In [21]:
%time doc2vec_model3.train(train_corpus, total_examples=doc2vec_model3.corpus_count, epochs=doc2vec_model3.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 36.78% examples, 262306 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 86.73% examples, 311389 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724835 effective words) took 2.2s, 326296 effective words/s
CPU times: user 3.69 s, sys: 163 ms, total: 3.86 s
Wall time: 2.23 s


724835

In [22]:
%time doc2vec_model4.train(train_corpus, total_examples=doc2vec_model4.corpus_count, epochs=doc2vec_model4.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 69.20% examples, 500867 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724340 effective words) took 1.4s, 527263 effective words/s
CPU times: user 3.2 s, sys: 132 ms, total: 3.33 s
Wall time: 1.38 s


724340

In [23]:
eval_sick.evaluate(doc2vec_model1, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.286438142989
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.269150921565
Test Spearman: 0.262674319422
Test MSE: 0.94450382038


array([ 3.33818458,  3.32924404,  3.22118222, ...,  3.62762797,
        3.44393652,  3.15340341])

In [24]:
eval_classification.eval_nested_kfold(model=doc2vec_model1, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.67400000000000004]
[0.67400000000000004, 0.67000000000000004]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005, 0.67900000000000005]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005, 0.67900000000000005, 0.66800000000000004]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005, 0.67900000000000005, 0.66800000000000004, 0.67100000000000004]
[0.6740000000000

[0.67400000000000004,
 0.67000000000000004,
 0.65600000000000003,
 0.64000000000000001,
 0.66800000000000004,
 0.68000000000000005,
 0.67900000000000005,
 0.66800000000000004,
 0.67100000000000004,
 0.69199999999999995]

In [25]:
eval_classification.eval_nested_kfold(model=doc2vec_model1, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.5135895032802249]
[0.5135895032802249, 0.53045923149015928]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173, 0.57879924953095685]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173, 0.57879924953095685, 0.5412757973733584]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173, 0.57879924953095685, 0.5412757973733584, 0.5684803001876173]
[0.5135895032802249, 0.530459231

[0.5135895032802249,
 0.53045923149015928,
 0.54878048780487809,
 0.54409005628517826,
 0.55816135084427765,
 0.5684803001876173,
 0.57879924953095685,
 0.5412757973733584,
 0.5684803001876173,
 0.5478424015009381]

In [26]:
eval_sick.evaluate(doc2vec_model2, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.39693084071
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.332600557652
Test Spearman: 0.331574006828
Test MSE: 0.905109955103


array([ 3.57159891,  3.35031843,  3.5568133 , ...,  3.06880136,
        3.49490334,  3.19048271])

In [27]:
eval_classification.eval_nested_kfold(model=doc2vec_model2, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.75]
[0.75, 0.71699999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997, 0.71099999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997, 0.71099999999999997, 0.71199999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997, 0.7109999999999999

[0.75,
 0.71699999999999997,
 0.70499999999999996,
 0.73999999999999999,
 0.72299999999999998,
 0.72099999999999997,
 0.71599999999999997,
 0.71099999999999997,
 0.71199999999999997,
 0.73499999999999999]

In [28]:
eval_classification.eval_nested_kfold(model=doc2vec_model2, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.57638238050609181]
[0.57638238050609181, 0.57357075913776945]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778, 0.59849906191369606]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778, 0.59849906191369606, 0.56754221388367732]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778, 0.59849906191369606, 0.56754221388367732, 0.56566604127579734]
[0.5763823805060

[0.57638238050609181,
 0.57357075913776945,
 0.59474671669793622,
 0.55253283302063794,
 0.57317073170731703,
 0.55628517823639778,
 0.59849906191369606,
 0.56754221388367732,
 0.56566604127579734,
 0.59568480300187621]

In [29]:
eval_sick.evaluate(doc2vec_model3, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.218506672244
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.268180823329
Test Spearman: 0.267794536402
Test MSE: 0.946454984438


array([ 3.55381104,  3.65387629,  3.32172886, ...,  3.50422149,
        3.73455072,  3.27572114])

In [30]:
eval_classification.eval_nested_kfold(model=doc2vec_model3, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.64700000000000002]
[0.64700000000000002, 0.65500000000000003]
[0.64700000000000002, 0.65500000000000003, 0.623]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0.69399999999999995]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0.69399999999999995, 0.66500000000000004]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0.69399999999999995, 0.66500000000000004, 0.66000000000000003]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0

[0.64700000000000002,
 0.65500000000000003,
 0.623,
 0.66400000000000003,
 0.65100000000000002,
 0.67300000000000004,
 0.69399999999999995,
 0.66500000000000004,
 0.66000000000000003,
 0.67800000000000005]

In [31]:
eval_classification.eval_nested_kfold(model=doc2vec_model3, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.50140581068416124]
[0.50140581068416124, 0.53701968134957823]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747, 0.56754221388367732]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747, 0.56754221388367732, 0.55816135084427765]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747, 0.56754221388367732, 0.55816135084427765, 0.53752345215759845]
[0.50140581068416124, 

[0.50140581068416124,
 0.53701968134957823,
 0.54690431519699811,
 0.5478424015009381,
 0.57035647279549717,
 0.56378986866791747,
 0.56754221388367732,
 0.55816135084427765,
 0.53752345215759845,
 0.58724202626641653]

In [32]:
eval_sick.evaluate(doc2vec_model4, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.409921757155
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.341916171513
Test Spearman: 0.340134184451
Test MSE: 0.898672297686


array([ 3.4285679 ,  3.71794175,  2.89571583, ...,  3.20624222,
        3.11745406,  2.71925664])

In [33]:
eval_classification.eval_nested_kfold(model=doc2vec_model4, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.751]
[0.751, 0.70699999999999996]
[0.751, 0.70699999999999996, 0.71999999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997, 0.71399999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997, 0.71399999999999997, 0.72099999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997, 0.713999

[0.751,
 0.70699999999999996,
 0.71999999999999997,
 0.73299999999999998,
 0.73899999999999999,
 0.69999999999999996,
 0.71599999999999997,
 0.71399999999999997,
 0.72099999999999997,
 0.73099999999999998]

In [34]:
eval_classification.eval_nested_kfold(model=doc2vec_model4, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.57919400187441428]
[0.57919400187441428, 0.56419868791002814]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716, 0.59474671669793622]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716, 0.59474671669793622, 0.5684803001876173]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716, 0.59474671669793622, 0.5684803001876173, 0.575046904315197]
[0.57919400187441428, 0.564

[0.57919400187441428,
 0.56419868791002814,
 0.6031894934333959,
 0.57410881801125702,
 0.56285178236397748,
 0.57129455909943716,
 0.59474671669793622,
 0.5684803001876173,
 0.575046904315197,
 0.61444652908067543]

In [35]:
eval_trec.evaluate(doc2vec_model1, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.398


In [36]:
eval_trec.evaluate(doc2vec_model2, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.422


In [37]:
eval_trec.evaluate(doc2vec_model3, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.378


In [38]:
eval_trec.evaluate(doc2vec_model4, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.404


# Evaluation of sentence vectors obtained from averaging FastText word vectors
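
As a rough illustration of what "averaging word vectors" means here, the sketch below builds a sentence vector as the element-wise mean of its word vectors. The `average_word_vectors` helper and the toy vectors are hypothetical and are not taken from the evaluation scripts. Skipping unknown tokens is a simplification: a real FastText model can also compose vectors for out-of-vocabulary words from character n-grams.

```python
def average_word_vectors(tokens, word_vectors, dim):
    """Return the element-wise mean of the vectors of the in-vocabulary tokens."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]


# Toy 3-dimensional vectors, for illustration only.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

sentence_vector = average_word_vectors(["good", "movie", "unseen"], toy_vectors, dim=3)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```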

In [5]:
lee_data = LineSentence(lee_train_file)
fasttext_model = FastText(size=100, alpha=0.2, negative=10, max_vocab_size=30000000, seed=42, iter=20)
fasttext_model.build_vocab(lee_data)
%time fasttext_model.train(lee_data, total_examples=fasttext_model.corpus_count, epochs=fasttext_model.iter)

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 10781 word types from a corpus of 59890 raw words and 300 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
INFO:gensim.models.word2vec:min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 10781 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 45 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
INFO:gensim.models.word2vec:estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.fasttext:Total number of ngrams is 17006
INFO:gens

INFO:gensim.models.word2vec:PROGRESS: at 66.82% examples, 377 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 67.55% examples, 380 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 68.37% examples, 380 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 69.23% examples, 377 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 70.10% examples, 380 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 71.00% examples, 381 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 71.78% examples, 378 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 72.53% examples, 380 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 73.33% examples, 382 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 74.18% examples, 379 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 75.02% ex

In [57]:
fasttext_model.save('ft1')

INFO:gensim.utils:saving FastText object under ft1, separately None
INFO:gensim.utils:not storing attribute syn0norm
INFO:gensim.utils:not storing attribute syn0_ngrams_norm
INFO:gensim.utils:not storing attribute syn0_vocab_norm
INFO:gensim.utils:saved ft1


In [39]:
ft_loaded_model = ft.load('ft1')

INFO:gensim.utils:loading FastText object from ft1
INFO:gensim.utils:loading wv recursively from ft1.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:setting ignored attribute syn0_ngrams_norm to None
INFO:gensim.utils:setting ignored attribute syn0_vocab_norm to None
INFO:gensim.utils:loaded ft1


In [40]:
eval_sick.evaluate(ft_loaded_model, seed=42, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.508652314637
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.502745225838
Test Spearman: 0.498571233172
Test MSE: 0.762192068723


array([ 2.79715686,  3.30016953,  3.30776085, ...,  3.46108352,
        3.05130163,  2.48509623])

In [41]:
eval_classification.eval_nested_kfold(model=ft_loaded_model, name='SUBJ', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.82299999999999995]
[0.82299999999999995, 0.80500000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003, 0.80500000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003, 0.80500000000000005, 0.80800000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003, 0.80500000000000005, 0.80800000000000005, 0.78800000000000003]
[0.8229999999999

[0.82299999999999995,
 0.80500000000000005,
 0.79700000000000004,
 0.80600000000000005,
 0.80800000000000005,
 0.78600000000000003,
 0.80500000000000005,
 0.80800000000000005,
 0.78800000000000003,
 0.81899999999999995]

In [42]:
eval_classification.eval_nested_kfold(model=ft_loaded_model, name='MR', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.61480787253983127]
[0.61480787253983127, 0.60449859418931584]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574, 0.64446529080675419]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574, 0.64446529080675419, 0.61257035647279545]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574, 0.64446529080675419, 0.61257035647279545, 0.60412757973733588]
[0.6148078725398

[0.61480787253983127,
 0.60449859418931584,
 0.59193245778611636,
 0.61819887429643527,
 0.59005628517823638,
 0.60694183864915574,
 0.64446529080675419,
 0.61257035647279545,
 0.60412757973733588,
 0.60694183864915574]

In [43]:
eval_trec.evaluate(ft_loaded_model, evalcv=False, evaltest=True, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.62


# Evaluation Results

| S.No. | Model Name                                | Total Execution Time (in seconds) | Pearson/Spearman/MSE on SICK | Mean SUBJ | Mean MR | TREC |
|-------|-------------------------------------------|-----------------------------------|------------------------------|-----------|---------|------|
| 1.    | Gensim Sent2Vec                           | 27.5                              | 0.43/0.43/0.83               | 0.71      | 0.57    | 0.58 |
| 2.    | Original Sent2Vec                         | 13.4                              | 0.41/0.42/0.84               | 0.78      | 0.58    | 0.60 |
| 3.    | PV-DM with sum of context word vectors    | 1.59                              | 0.27/0.27/0.94               | 0.66      | 0.55    | 0.37 |
| 4.    | PV-DM with mean of context word vectors   | 2.23                              | 0.28/0.28/0.93               | 0.67      | 0.55    | 0.38 |
| 5.    | PV-DBOW with sum of context word vectors  | 1.24                              | 0.36/0.35/0.88               | 0.73      | 0.57    | 0.42 |
| 6.    | PV-DBOW with mean of context word vectors | 1.38                              | 0.34/0.34/0.89               | 0.72      | 0.57    | 0.41 |
| 7.    | Mean of Gensim FastText word vectors      | 1686                              | 0.49/0.49/0.76               | 0.80      | 0.60    | 0.62 |
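
The "Mean SUBJ" and "Mean MR" columns summarise the ten fold accuracies printed by `eval_nested_kfold`. Assuming the summary is a plain arithmetic mean (an assumption, since the aggregation step is not shown in the notebook), it can be reproduced as below. The input is the final list from the FastText SUBJ run, rounded to three decimals.

```python
# Fold accuracies printed for the FastText SUBJ run (rounded to 3 decimals).
subj_fold_accuracies = [0.823, 0.805, 0.797, 0.806, 0.808,
                        0.786, 0.805, 0.808, 0.788, 0.819]

# Assumed aggregation: arithmetic mean over the ten folds.
mean_accuracy = sum(subj_fold_accuracies) / len(subj_fold_accuracies)
print("%.2f" % mean_accuracy)  # 0.80
```

The result agrees with the 0.80 reported for "Mean SUBJ" in row 7.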

# Evaluation on sample of Toronto Book Corpus

# Prepare training data

In [3]:
toronto_data = []
lines = 0
with open('./books_in_sentences/books_large_p1.txt') as f1, open("./input.txt",'w') as f2:
    for line in f1:
        if np.random.random() > 0.5:
            if lines >= 100000:
                break
            lines += 1
            if line not in ['\n', '\r\n']:
                line = re.split(r'\.|\?|\n', line.strip())
                for sentence in line:
                    if len(sentence) > 1:
                        sentence = tokenize(sentence)
                        toronto_data.append(list(sentence))
                        f2.write(' '.join(toronto_data[-1]) + '\n')
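
To make the splitting step in the cell above concrete, the sketch below isolates it in a hypothetical helper, `split_into_sentences`. It applies the same regular expression and the same `len(...) > 1` filter. The extra `strip()` on each piece is added here only for readable output. In the notebook, tokenization is handled separately by `tokenize`.

```python
import re


def split_into_sentences(line):
    """Split a raw line on '.', '?' and newlines, dropping fragments of length <= 1."""
    pieces = re.split(r'\.|\?|\n', line.strip())
    return [piece.strip() for piece in pieces if len(piece) > 1]


print(split_into_sentences("It was late. Was she home? Yes.\n"))
# ['It was late', 'Was she home', 'Yes']
```

Note that this pattern does not split on `!` and treats every `.` as a sentence boundary, including those in abbreviations.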

In [4]:
# Print sample training data
for sentence in toronto_data[:5]:
    print sentence,'\n'

[u'its', u'a', u'place', u'where', u'your', u'parents', u'wouldnt', u'even', u'care', u'if', u'you', u'stayed', u'out', u'late', u'biking', u'with', u'your', u'friends'] 

[u'only', u'because', u'everyone', u'felt', u'so', u'safe', u'so', u'comfy'] 

[u'they', u'dont', u'know', u'the', u'half', u'of', u'it'] 

[u'but', u'i', u'do'] 

[u'i', u'know', u'it', u'all', u'and', u'starlings', u'is', u'not', u'the', u'place', u'where', u'you', u'want', u'to', u'be', u'after', u'dark'] 



In [5]:
# Train new sent2vec model on part of the Toronto Book Corpus (100,000 sentences)
%time sent2vec_toronto_model = Sent2Vec(toronto_data, size=100, epochs=5, seed=42, workers=4)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 1.00 M words
INFO:gensim.models.sent2vec:Read 1.35 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 11565, tokens read: 1346199
INFO:gensim.models.sent2vec:training model with 4 workers on 11565 vocabulary and 100 features
INFO:gensim.models.sent2vec:PROGRESS: at 1.19% words, 71718 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 3.41% words, 103980 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 5.20% words, 107437 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 7.12% words, 108416 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 9.05% words, 110556 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 11.28% words, 113706 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 13.50% words, 115893 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 15.58% words, 118216 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 17.66% words, 117794 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 20.0

In [6]:
sent2vec_toronto_model.save('s2v2')

INFO:gensim.utils:saving Sent2Vec object under s2v2, separately None
INFO:gensim.utils:storing np array 'wi' to s2v2.wi.npy
INFO:gensim.utils:saved s2v2


In [7]:
eval_sick.evaluate(sent2vec_toronto_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.427122900262
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.480262181097
Test Spearman: 0.487383241191
Test MSE: 0.783650957419


array([ 3.19838024,  3.28760001,  3.33666651, ...,  3.66994808,
        3.31433722,  3.50956535])

In [8]:
eval_trec.evaluate(sent2vec_toronto_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.51


In [9]:
eval_classification.eval_nested_kfold(model=sent2vec_toronto_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.59231490159325206]
[0.59231490159325206, 0.57638238050609181]
[0.59231490159325206, 0.57638238050609181, 0.58630393996247654]
[0.59231490159325206, 0.57638238050609181, 0.58630393996247654, 0.58630393996247654]
[0.59231490159325206, 0.57638238050609181, 0.58630393996247654, 0.58630393996247654, 0.575046904315197]
[0.59231490159325206, 0.57638238050609181, 0.58630393996247654, 0.58630393996247654, 0.575046904315197, 0.59193245778611636]
[0.59231490159325206, 0.57638238050609181, 0.58630393996247654, 0.58630393996247654, 0.575046904315197, 0.59193245778611636, 0.59005628517823638]
[0.59231490159325206, 0.57638238050609181, 0.58630393996247654, 0.58630393996247654, 0.575046904315197, 0.59193245778611636, 0.59005628517823638, 0.58161350844277671]
[0.59231490159325206, 0.57638238050609181, 0.58630393996247654, 0.58630393996247654, 0.575046904315197, 0.59193245778611636, 0.59005628517823638, 0.58161350844277671, 0.60131332082551592]
[0.59231490159325206, 0.57

[0.59231490159325206,
 0.57638238050609181,
 0.58630393996247654,
 0.58630393996247654,
 0.575046904315197,
 0.59193245778611636,
 0.59005628517823638,
 0.58161350844277671,
 0.60131332082551592,
 0.55628517823639778]

In [10]:
eval_classification.eval_nested_kfold(model=sent2vec_toronto_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.70699999999999996]
[0.70699999999999996, 0.69899999999999995]
[0.70699999999999996, 0.69899999999999995, 0.69299999999999995]
[0.70699999999999996, 0.69899999999999995, 0.69299999999999995, 0.69499999999999995]
[0.70699999999999996, 0.69899999999999995, 0.69299999999999995, 0.69499999999999995, 0.67500000000000004]
[0.70699999999999996, 0.69899999999999995, 0.69299999999999995, 0.69499999999999995, 0.67500000000000004, 0.69099999999999995]
[0.70699999999999996, 0.69899999999999995, 0.69299999999999995, 0.69499999999999995, 0.67500000000000004, 0.69099999999999995, 0.67500000000000004]
[0.70699999999999996, 0.69899999999999995, 0.69299999999999995, 0.69499999999999995, 0.67500000000000004, 0.69099999999999995, 0.67500000000000004, 0.68799999999999994]
[0.70699999999999996, 0.69899999999999995, 0.69299999999999995, 0.69499999999999995, 0.67500000000000004, 0.69099999999999995, 0.67500000000000004, 0.68799999999999994, 0.67400000000000004]
[0.7069999999999

[0.70699999999999996,
 0.69899999999999995,
 0.69299999999999995,
 0.69499999999999995,
 0.67500000000000004,
 0.69099999999999995,
 0.67500000000000004,
 0.68799999999999994,
 0.67400000000000004,
 0.70499999999999996]

In [11]:
# Train model using the original C++ implementation of sent2vec
% cd sent2vec
! time ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 5 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 4 -t 0.0001 -dropoutK 2 -bucket 2000000

/Users/prerna135/Documents/GitHub/gensim/sent2vec
Read 1M words
Number of words:  12938
Number of labels: 0
Progress: 100.0%  words/sec/thread: 138014  lr: 0.000000  loss: 2.647149  eta: 0h0m

real	0m32.302s
user	0m54.028s
s

In [12]:
% cd ..
eval_sick.evaluate(seed=42, model_name='original_sent2vec')

/Users/prerna135/Documents/GitHub/gensim
Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.467496919577
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.519848031764
Test Spearman: 0.509933660616
Test MSE: 0.743931587436


array([ 2.88400938,  3.2521763 ,  3.12047375, ...,  2.98194377,
        2.9174632 ,  3.46705309])

In [13]:
eval_classification.eval_nested_kfold(name='SUBJ', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.82199999999999995]
[0.82199999999999995, 0.82099999999999995]
[0.82199999999999995, 0.82099999999999995, 0.81100000000000005]
[0.82199999999999995, 0.82099999999999995, 0.81100000000000005, 0.79500000000000004]
[0.82199999999999995, 0.82099999999999995, 0.81100000000000005, 0.79500000000000004, 0.81499999999999995]
[0.82199999999999995, 0.82099999999999995, 0.81100000000000005, 0.79500000000000004, 0.81499999999999995, 0.81000000000000005]
[0.82199999999999995, 0.82099999999999995, 0.81100000000000005, 0.79500000000000004, 0.81499999999999995, 0.81000000000000005, 0.78200000000000003]
[0.82199999999999995, 0.82099999999999995, 0.81100000000000005, 0.79500000000000004, 0.81499999999999995, 0.81000000000000005, 0.78200000000000003, 0.82999999999999996]
[0.82199999999999995, 0.82099999999999995, 0.81100000000000005, 0.79500000000000004, 0.81499999999999995, 0.81000000000000005, 0.78200000000000003, 0.82999999999999996, 0.81699999999999995]
[0.8219999999999

[0.82199999999999995,
 0.82099999999999995,
 0.81100000000000005,
 0.79500000000000004,
 0.81499999999999995,
 0.81000000000000005,
 0.78200000000000003,
 0.82999999999999996,
 0.81699999999999995,
 0.82499999999999996]

In [14]:
eval_classification.eval_nested_kfold(name='MR', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.60356138706654172]
[0.60356138706654172, 0.61761949390815374]
[0.60356138706654172, 0.61761949390815374, 0.63414634146341464]
[0.60356138706654172, 0.61761949390815374, 0.63414634146341464, 0.60694183864915574]
[0.60356138706654172, 0.61761949390815374, 0.63414634146341464, 0.60694183864915574, 0.62851782363977482]
[0.60356138706654172, 0.61761949390815374, 0.63414634146341464, 0.60694183864915574, 0.62851782363977482, 0.61538461538461542]
[0.60356138706654172, 0.61761949390815374, 0.63414634146341464, 0.60694183864915574, 0.62851782363977482, 0.61538461538461542, 0.6163227016885553]
[0.60356138706654172, 0.61761949390815374, 0.63414634146341464, 0.60694183864915574, 0.62851782363977482, 0.61538461538461542, 0.6163227016885553, 0.62945590994371481]
[0.60356138706654172, 0.61761949390815374, 0.63414634146341464, 0.60694183864915574, 0.62851782363977482, 0.61538461538461542, 0.6163227016885553, 0.62945590994371481, 0.62288930581613511]
[0.6035613870665417

[0.60356138706654172,
 0.61761949390815374,
 0.63414634146341464,
 0.60694183864915574,
 0.62851782363977482,
 0.61538461538461542,
 0.6163227016885553,
 0.62945590994371481,
 0.62288930581613511,
 0.6369606003752345]

In [15]:
eval_trec.evaluate(evalcv=False, evaltest=True, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.556


In [2]:
def read_toronto_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if i >= 100000:
                break
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
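
The generator above yields either plain token lists or tagged documents, where each tag is the line index. The sketch below mimics that behaviour without gensim. The `TaggedDocument` namedtuple is a stand-in with the same two fields (`words`, `tags`), `line.lower().split()` is a crude substitute for `simple_preprocess`, and `read_corpus` with its `limit` argument is a hypothetical name used only for this illustration.

```python
import io
from collections import namedtuple

# Stand-in with the same field names as gensim's TaggedDocument.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def read_corpus(stream, limit, tokens_only=False):
    """Yield at most `limit` documents from an iterable of text lines."""
    for i, line in enumerate(stream):
        if i >= limit:
            break
        tokens = line.lower().split()  # simplified tokenization (assumption)
        if tokens_only:
            yield tokens
        else:
            # Training documents carry their line index as the only tag.
            yield TaggedDocument(tokens, [i])


sample = io.StringIO(u"but i do\nthey dont know the half of it\nnever read\n")
docs = list(read_corpus(sample, limit=2))
print(docs[0])
print(len(docs))  # 2
```

Because the limit is checked before a line is processed, the third line of the sample is never tokenized.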

In [4]:
train_corpus = list(read_toronto_corpus('./books_in_sentences/books_large_p1.txt'))

In [5]:
# Doc2Vec model with PV-DBOW and sum of context word vectors
doc2vec_model = gensim.models.doc2vec.Doc2Vec(dm=0, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #10000, processed 113217 words (786978/s), 7839 word types, 10000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #20000, processed 256086 words (1046949/s), 14477 word types, 20000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #30000, processed 377626 words (913805/s), 17118 word types, 30000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #40000, processed 481481 words (966669/s), 19727 word types, 40000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #50000, processed 594311 words (931086/s), 21187 word types, 50000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #60000, processed 755296 words (1243627/s), 24414 word types, 60000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #70000, processed 987300 words (1505903/s), 27989 word types, 70000 

In [6]:
%time doc2vec_model.train(train_corpus, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 13214 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 0.64% examples, 129682 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 1.38% examples, 138924 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 2.16% examples, 139614 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 2.98% examples, 149130 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 3.72% examples, 167098 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 4.53% examples, 166746 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 5.40% examples, 166633 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 6.22% examples, 167030 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 7.06% examples, 164868 words/s, in_qs

INFO:gensim.models.word2vec:PROGRESS: at 63.16% examples, 164573 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 63.86% examples, 165334 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 64.62% examples, 165141 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 65.58% examples, 165449 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 66.42% examples, 165536 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 67.24% examples, 165277 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 67.96% examples, 165198 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 68.68% examples, 166000 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 69.45% examples, 165950 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 70.15% examples, 165664 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.w

22759350

In [7]:
eval_sick.evaluate(doc2vec_model, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...


Dev Pearson: 0.129711484713
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.153188257519
Test Spearman: 0.133704859493
Test MSE: 0.99598130755


array([ 3.44692591,  3.56224901,  3.71145853, ...,  3.49521685,
        3.63893017,  3.43376413])
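The three SICK numbers reported above (Pearson, Spearman, MSE) compare predicted relatedness scores against gold scores. The following is a minimal pure-Python sketch of how such metrics can be computed. It is not taken from `eval_sick.py`, whose implementation may differ; the helper names and the toy `gold`/`pred` values are hypothetical and only serve as illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mse(pred, gold):
    """Mean squared error between predictions and gold scores."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


# Hypothetical toy scores, not taken from the SICK dataset.
gold = [1.0, 2.5, 3.0, 4.2, 5.0]
pred = [3.4, 3.5, 3.6, 3.5, 3.7]

print("Pearson:  %.3f" % pearson(pred, gold))
print("Spearman: %.3f" % spearman(pred, gold))
print("MSE:      %.3f" % mse(pred, gold))
```

Higher Pearson and Spearman values and a lower MSE indicate predictions that track the gold scores more closely.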

In [8]:
eval_classification.eval_nested_kfold(model=doc2vec_model, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.67600000000000005]
[0.67600000000000005, 0.68300000000000005]
[0.67600000000000005, 0.68300000000000005, 0.69799999999999995]
[0.67600000000000005, 0.68300000000000005, 0.69799999999999995, 0.66300000000000003]
[0.67600000000000005, 0.68300000000000005, 0.69799999999999995, 0.66300000000000003, 0.65400000000000003]
[0.67600000000000005, 0.68300000000000005, 0.69799999999999995, 0.66300000000000003, 0.65400000000000003, 0.65700000000000003]
[0.67600000000000005, 0.68300000000000005, 0.69799999999999995, 0.66300000000000003, 0.65400000000000003, 0.65700000000000003, 0.66600000000000004]
[0.67600000000000005, 0.68300000000000005, 0.69799999999999995, 0.66300000000000003, 0.65400000000000003, 0.65700000000000003, 0.66600000000000004, 0.66800000000000004]
[0.67600000000000005, 0.68300000000000005, 0.69799999999999995, 0.66300000000000003, 0.65400000000000003, 0.65700000000000003, 0.66600000000000004, 0.66800000000000004, 0.68600000000000005]
[0.6760000000000

[0.67600000000000005,
 0.68300000000000005,
 0.69799999999999995,
 0.66300000000000003,
 0.65400000000000003,
 0.65700000000000003,
 0.66600000000000004,
 0.66800000000000004,
 0.68600000000000005,
 0.68100000000000005]

In [9]:
eval_classification.eval_nested_kfold(model=doc2vec_model, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.55482661668228683]
[0.55482661668228683, 0.549203373945642]
[0.55482661668228683, 0.549203373945642, 0.56566604127579734]
[0.55482661668228683, 0.549203373945642, 0.56566604127579734, 0.55534709193245779]
[0.55482661668228683, 0.549203373945642, 0.56566604127579734, 0.55534709193245779, 0.54596622889305813]
[0.55482661668228683, 0.549203373945642, 0.56566604127579734, 0.55534709193245779, 0.54596622889305813, 0.57973733583489684]
[0.55482661668228683, 0.549203373945642, 0.56566604127579734, 0.55534709193245779, 0.54596622889305813, 0.57973733583489684, 0.55159474671669795]
[0.55482661668228683, 0.549203373945642, 0.56566604127579734, 0.55534709193245779, 0.54596622889305813, 0.57973733583489684, 0.55159474671669795, 0.57317073170731703]
[0.55482661668228683, 0.549203373945642, 0.56566604127579734, 0.55534709193245779, 0.54596622889305813, 0.57973733583489684, 0.55159474671669795, 0.57317073170731703, 0.57129455909943716]
[0.55482661668228683, 0.54920337

[0.55482661668228683,
 0.549203373945642,
 0.56566604127579734,
 0.55534709193245779,
 0.54596622889305813,
 0.57973733583489684,
 0.55159474671669795,
 0.57317073170731703,
 0.57129455909943716,
 0.56566604127579734]
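Each `eval_nested_kfold` call above returns one accuracy per fold. To compare models with a single number per dataset, the fold scores can be averaged. The sketch below does this for the doc2vec fold scores printed above; that the summary figure should be the plain mean over folds is an assumption, not something the evaluation script states.

```python
from statistics import mean

# Fold accuracies copied from the doc2vec outputs above (SUBJ rounded to 3 decimals).
subj_folds = [0.676, 0.683, 0.698, 0.663, 0.654,
              0.657, 0.666, 0.668, 0.686, 0.681]
mr_folds = [0.55482661668228683, 0.549203373945642, 0.56566604127579734,
            0.55534709193245779, 0.54596622889305813, 0.57973733583489684,
            0.55159474671669795, 0.57317073170731703, 0.57129455909943716,
            0.56566604127579734]

# Assumption: one summary number per dataset is the plain mean over folds.
subj_mean = mean(subj_folds)
mr_mean = mean(mr_folds)

print("SUBJ mean accuracy: %.2f" % subj_mean)
print("MR mean accuracy:   %.2f" % mr_mean)
```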

In [10]:
eval_trec.evaluate(doc2vec_model, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.29


**NOTE:** This suggests that more data improves sent2vec results: with only 5 epochs, the above model achieves results similar to those of the model trained for 20 epochs on the much smaller Lee corpus.

| S.No. | Model              | Total Execution Time (in seconds) | Pearson/Spearman/MSE on SICK | MR   | SUBJ | TREC |
|-------|--------------------|-----------------------------------|------------------------------|------|------|------|
| 1.    | Gensim Sent2Vec    | 75                                | 0.48/0.48/0.78               | 0.58 | 0.69 | 0.51 |
| 2.    | Original Sent2Vec  | 32.3                              | 0.51/0.50/0.74               | 0.62 | 0.81 | 0.55 |
| 3.    | Doc2Vec DBOW (sum) | 136                               | 0.15/0.13/0.99               | 0.56 | 0.67 | 0.29 |
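
To make the execution times in the table easier to compare, the short sketch below expresses each one relative to the original C++ implementation. The times are copied from the table; the choice of baseline and the variable names are illustrative only.

```python
# Total execution times in seconds, copied from the table above.
times = {
    "Gensim Sent2Vec": 75.0,
    "Original Sent2Vec": 32.3,
    "Doc2Vec DBOW (sum)": 136.0,
}

# Illustrative choice: use the original C++ implementation as the baseline.
baseline = times["Original Sent2Vec"]
ratios = {name: t / baseline for name, t in times.items()}

# Print models from fastest to slowest, with time relative to the baseline.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print("%-20s %.2fx" % (name, ratio))
```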