# Using Sent2Vec via Gensim

This tutorial shows how to use the Sent2Vec model in Gensim. We'll train sentence-embedding models, save and load them, and perform similarity operations. The notebook also compares the Gensim implementation with the [original C++ implementation](https://github.com/epfml/sent2vec), Gensim's Doc2Vec and Gensim's FastText. All the evaluation scripts used in the notebook can be found [here](https://gist.github.com/prerna135/9b5eb55054d29c1495460b75fc061c6b).

# What is Sent2Vec?

Sent2Vec delivers numerical representations (features) for short texts or sentences, which can be used as input to any downstream machine learning task. Think of it as an unsupervised version of FastText, and an extension of word2vec (CBOW) to sentences. The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. According to the [paper](https://arxiv.org/abs/1703.02507), the algorithm outperforms the state-of-the-art unsupervised models on most benchmark tasks, and on many tasks even beats supervised models, which highlights the robustness of the produced sentence embeddings. See the paper for more details.

The sentence embedding is defined as the average of the source word embeddings of its constituent words. The model is further augmented by learning source embeddings not only for unigrams but also for the n-grams present in each sentence, and averaging the n-gram embeddings along with the word embeddings.
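The averaging described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Gensim or C++ implementation: the `toy_embeddings` table holds made-up values, the helper names are hypothetical, and n-grams are looked up by their text directly rather than through the hash buckets that the `bucket` hyperparameter suggests.

```python
def word_ngrams(tokens, n):
    """Return contiguous word n-grams of length n, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_embedding(tokens, embeddings, max_n=2):
    """Average the source embeddings of all known unigrams and n-grams."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(tokens, n))
    # Features without an embedding are simply skipped in this sketch.
    vectors = [embeddings[f] for f in features if f in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy lookup table (hypothetical values, not learned embeddings).
toy_embeddings = {
    "the": [1.0, 0.0],
    "sky": [0.0, 1.0],
    "the sky": [2.0, 2.0],
}

vec = sentence_embedding(["the", "sky"], toy_embeddings)
print(vec)
```

Here the sentence vector is the mean of three source vectors: the two unigrams and the single bigram.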

# Training models

For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model. All models are trained with the same hyperparameters for evaluation purposes.

In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.sent2vec import Sent2Vec as s2v
from gensim.models.fasttext import FastText as ft
from gensim.utils import tokenize
import scipy
import re
from numpy import dot
from gensim import matutils
import time
import numpy as np
import tensorflow as tf
import random
import eval_sick
import eval_classification
import eval_trec
import smart_open

Using TensorFlow backend.


In [2]:
# Prepare training data
data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep
lee_train_file = data_dir + 'lee_background.cor'
lee_data = []
with open(lee_train_file) as f1, open("./input.txt",'w') as f2:
    for line in f1:
        if line not in ['\n', '\r\n']:
            line = re.split(r'\.|\?|\n', line.strip())
            for sentence in line:
                if len(sentence) > 1:
                    sentence = tokenize(sentence)
                    lee_data.append(list(sentence))
                    f2.write(' '.join(lee_data[-1]) + '\n')
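The preparation step above splits each line into sentences on `.`, `?` and newlines, drops fragments of one character or less, and tokenizes what remains. A standalone sketch of the same idea follows. `simple_tokenize` is a hypothetical stand-in for `gensim.utils.tokenize` that keeps alphabetic runs only; the real function may differ in details.

```python
import re


def split_sentences(paragraph):
    """Split on '.', '?' and newlines, dropping fragments of length <= 1."""
    parts = re.split(r'\.|\?|\n', paragraph.strip())
    return [p for p in parts if len(p) > 1]


def simple_tokenize(sentence):
    """Hypothetical stand-in tokenizer: keep runs of alphabetic characters."""
    return re.findall(r'[^\W\d_]+', sentence, flags=re.UNICODE)


text = "A new blaze has started. Is the highway closed? Yes"
sentences = [simple_tokenize(s) for s in split_sentences(text)]
for tokens in sentences:
    print(tokens)
```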

In [3]:
# Print sample training data
for sentence in lee_data[:5]:
    print sentence,'\n'

[u'Hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'Southern', u'Highlands', u'of', u'New', u'South', u'Wales', u'as', u'strong', u'winds', u'today', u'pushed', u'a', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'Hill', u'Top'] 

[u'A', u'new', u'blaze', u'near', u'Goulburn', u'south', u'west', u'of', u'Sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'Hume', u'Highway'] 

[u'At', u'about', u'pm', u'AEDT', u'a', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'a', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'Blue', u'Mountains', u'forced', u'authorities', u'to', u'make', u'a', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'Hill', u'Top', u'in', u'the', u'New', u'South', u'Wales', u'southern', u'highlands'] 

[u'An', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'Mittago

# Using the Gensim implementation of Sent2Vec

In [4]:
# Train new sent2vec model
sent2vec_model = s2v(lee_data, vector_size=100, epochs=20, seed=42)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 0.06 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 1307, tokens read: 60302
INFO:gensim.models.sent2vec:Training...
INFO:gensim.models.sent2vec:Begin epoch 0 :
INFO:gensim.models.sent2vec:Progress: 3.96, lr: 0.1921, loss: 3.7907
INFO:gensim.models.sent2vec:Begin epoch 1 :
INFO:gensim.models.sent2vec:Progress: 7.93, lr: 0.1841, loss: 3.6546
INFO:gensim.models.sent2vec:Begin epoch 2 :
INFO:gensim.models.sent2vec:Progress: 11.90, lr: 0.1762, loss: 3.5554
INFO:gensim.models.sent2vec:Begin epoch 3 :
INFO:gensim.models.sent2vec:Progress: 15.86, lr: 0.1683, loss: 3.4607
INFO:gensim.models.sent2vec:Begin epoch 4 :
INFO:gensim.models.sent2vec:Progress: 19.83, lr: 0.1603, loss: 3.3682
INFO:gensim.models.sent2vec:Begin epoch 5 :
INFO:gensim.models.sent2vec:Progress: 23.80, lr: 0.1524, loss: 3.2797
INFO:gensim.models.sent2vec:Begin epoch 6 :
INFO:gensim.models.sent2vec:Progress: 27.76, lr

# Training hyperparameters

Sent2Vec supports the following parameters:

 - vector_size: Size of embeddings to be learnt (Default 100)
 - alpha: Initial learning rate (Default 0.2)
 - min_count: Ignore words with number of occurrences below this (Default 5)
 - loss: Training objective. Allowed values: `ns` (Default `ns`)
 - neg: Number of negative words to sample, for `ns` (Default 10)
 - epochs: Number of epochs (Default 5)
 - bucket: Number of hash buckets for vocabulary (Default 2000000)
 - lr_update_rate: Change the rate of updates for the learning rate (Default 100)
 - t: Sampling threshold (Default 0.0001)
 - dropoutk: Number of n-grams dropped when training a sent2vec model (Default 2)
 - word_ngrams: Maximum length of word n-grams (Default 2)
 - minn: Minimum length of character n-grams (Default 3)
 - maxn: Maximum length of character n-grams (Default 6)
 - seed: Random seed for reproducibility (Default 42)
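The `lr` values in the training log above look consistent with a linear decay from the initial rate `alpha` towards zero as training progresses. The sketch below reproduces the logged numbers under that assumption. It is inferred from the log only, not taken from the implementation, and `linear_lr` is a hypothetical helper name.

```python
def linear_lr(alpha, progress):
    """Learning rate after a given fraction of training (0.0 to 1.0),
    assuming a linear decay from alpha to zero."""
    return alpha * (1.0 - progress)


# Progress percentages taken from the first epochs of the training log.
for pct in (3.96, 7.93, 11.90, 15.86):
    print("progress %.2f%% -> lr %.4f" % (pct, linear_lr(0.2, pct / 100.0)))
```

With `alpha=0.2`, these four progress values give 0.1921, 0.1841, 0.1762 and 0.1683, matching the logged rates.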

In [5]:
# Print sentence vector
sent2vec_model.sentence_vectors(['This', 'is', 'an', 'awesome', 'gift'])

array([ -2.47282963e-01,   1.87189350e-01,  -3.21987474e-02,
         7.91178968e-02,   3.39226979e-01,  -4.05825705e-01,
         8.00900208e-01,   1.35698618e-01,  -4.38281501e-02,
         7.56798528e-01,  -3.80137642e-01,  -2.22912740e-01,
        -8.74431924e-02,   5.80761992e-02,   4.83582117e-01,
         1.01573390e-01,   7.24145461e-01,   2.94534907e-01,
         1.56936339e-01,  -1.77839273e-01,  -1.38541675e-01,
         2.78566131e-01,  -7.17920556e-01,  -2.89763518e-01,
         1.50396665e-01,  -1.23929553e-01,   6.27710213e-01,
         2.74299500e-01,   3.58840180e-01,   4.12131903e-02,
        -1.07678619e-01,  -1.87951293e-01,  -5.40631978e-01,
        -1.87393133e-01,   6.79594841e-01,   8.36914707e-01,
         1.57756821e-02,  -6.81864588e-01,   3.33469583e-01,
         9.79331323e-01,  -1.11638314e-03,   8.25131676e-01,
         8.70815820e-01,  -3.52632190e-01,   2.81376163e-01,
         3.82966643e-01,   3.37886963e-01,  -4.67265898e-01,
         7.07851315e-01,

In [6]:
# Print cosine similarity between two sentences
print sent2vec_model.similarity(['The', 'sky', 'is', 'blue'], ['I', 'am', 'going', 'to', 'a', 'party'])
print sent2vec_model.similarity(['This', 'is', 'an', 'awesome', 'gift'], ['This', 'present', 'is', 'great'])

0.116041437211
0.783033600606
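For reference, cosine similarity between two vectors is their dot product divided by the product of their norms. The sketch below uses plain Python lists and toy vectors. It assumes, as the cell comment states, that `similarity` compares the two sentence vectors by cosine similarity; it does not reproduce the model's own code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0. This matches the pattern in the output above, where the two paraphrased sentences score much higher than the unrelated pair.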


# Saving and loading models

Models can be saved and loaded via the `save` and `load` methods.

In [7]:
# Save trained sent2vec model
sent2vec_model.save('s2v1')

INFO:gensim.utils:saving Sent2Vec object under s2v1, separately None
INFO:gensim.utils:storing np array 'wi' to s2v1.wi.npy
INFO:gensim.utils:saved s2v1


In [4]:
# Load pretrained sent2vec model
loaded_model = s2v.load('s2v1')

INFO:gensim.utils:loading Sent2Vec object from s2v1
INFO:gensim.utils:loading wi from s2v1.wi.npy with mmap=None
INFO:gensim.utils:loaded s2v1


# Unsupervised similarity evaluation

Unsupervised evaluation of the learnt sentence embeddings is performed using sentence cosine similarity on the [SICK 2014](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools) dataset. These similarity scores are compared to the gold-standard human judgements using [Pearson’s correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) scores. The SICK dataset consists of about 10,000 sentence pairs along with relatedness scores for the pairs. We use the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts).
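As a reminder of what the reported metrics measure, here is a minimal pure-Python sketch of Pearson's correlation and mean squared error between predicted and gold relatedness scores. The `gold` and `predicted` lists are hypothetical toy values, and the evaluation script itself may compute these quantities differently (for example with SciPy).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_squared_error(xs, ys):
    """Average squared difference between paired values."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / float(len(xs))


# Hypothetical relatedness scores on a 1-5 scale.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 1.9, 3.2, 3.8, 4.6]

print("Pearson: %.4f" % pearson(gold, predicted))
print("MSE: %.4f" % mean_squared_error(gold, predicted))
```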

In [5]:
eval_sick.evaluate(loaded_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...



Training...
Dev Pearson: 0.488022039971
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.480271787239
Test Spearman: 0.489470899414
Test MSE: 0.786314446477


array([ 3.09487504,  3.26262388,  3.3487608 , ...,  3.32659516,
        2.54281916,  3.60840255])

# Downstream Supervised Evaluation

Sentence embeddings are also evaluated on several supervised classification tasks: movie review sentiment (MR) (Pang & Lee, 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004) and question type classification (TREC) (Voorhees, 2002). To classify, we use the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts). Sent2Vec embeddings are inferred from input sentences and fed directly to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the [MR and SUBJ](https://www.cs.cornell.edu/people/pabo/movie-review-data/) datasets, with nested cross-validation used to tune the L2 penalty. For the [TREC dataset](http://cogcomp.cs.illinois.edu/Data/QA/QC/), accuracy is computed on the test set.
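The 10-fold protocol can be pictured as follows: the data is shuffled once, cut into ten folds, and each fold serves as the test set exactly once. The sketch below only generates the index splits. The classifier is omitted, `k_fold_indices` is a hypothetical helper, and the evaluation script may build its folds differently. For nested cross-validation, the same splitting would be repeated on each training portion to choose the L2 penalty.

```python
import random


def k_fold_indices(n_samples, k=10, seed=42):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_id, fold in enumerate(folds)
                     if f_id != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(25, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```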

In [6]:
eval_classification.eval_nested_kfold(model=loaded_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.77100000000000002]
[0.77100000000000002, 0.77200000000000002]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002, 0.77900000000000003]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002, 0.77900000000000003, 0.751]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002, 0.77900000000000003, 0.751, 0.753]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002, 0.77900000000000003, 0.751, 0.753, 0.77200000000000002]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002, 0.77900000000000003, 0.751, 0.753, 0.77200000000000002, 0.77300000000000002]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002, 0.77900000000000003, 0.751, 0.753, 0.77200000000000002, 0.77300000000000002, 0.753]
[0.77100000000000002, 0.77200000000000002, 0.76800000000000002, 0.77900000000000003, 0.751, 0.753, 0.77200000000000002, 0.77300000000000002, 0.753, 0.793000

[0.77100000000000002,
 0.77200000000000002,
 0.76800000000000002,
 0.77900000000000003,
 0.751,
 0.753,
 0.77200000000000002,
 0.77300000000000002,
 0.753,
 0.79300000000000004]
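The per-fold accuracies are easier to compare across models once they are reduced to a single number. A small sketch that averages the ten SUBJ fold accuracies listed above, with values copied from the output and rounded to three decimals:

```python
import math

# Fold accuracies for SUBJ, copied from the output above.
subj_fold_accuracies = [0.771, 0.772, 0.768, 0.779, 0.751,
                        0.753, 0.772, 0.773, 0.753, 0.793]


def mean_and_std(values):
    """Return the mean and the population standard deviation."""
    n = float(len(values))
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance)


mean_acc, std_acc = mean_and_std(subj_fold_accuracies)
print("SUBJ mean accuracy: %.4f (std %.4f)" % (mean_acc, std_acc))
```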

In [7]:
eval_classification.eval_nested_kfold(model=loaded_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.58856607310215558]
[0.58856607310215558, 0.56982193064667297]
[0.58856607310215558, 0.56982193064667297, 0.59099437148217637]
[0.58856607310215558, 0.56982193064667297, 0.59099437148217637, 0.56378986866791747]
[0.58856607310215558, 0.56982193064667297, 0.59099437148217637, 0.56378986866791747, 0.59099437148217637]
[0.58856607310215558, 0.56982193064667297, 0.59099437148217637, 0.56378986866791747, 0.59099437148217637, 0.59568480300187621]
[0.58856607310215558, 0.56982193064667297, 0.59099437148217637, 0.56378986866791747, 0.59099437148217637, 0.59568480300187621, 0.58818011257035652]
[0.58856607310215558, 0.56982193064667297, 0.59099437148217637, 0.56378986866791747, 0.59099437148217637, 0.59568480300187621, 0.58818011257035652, 0.59287054409005624]
[0.58856607310215558, 0.56982193064667297, 0.59099437148217637, 0.56378986866791747, 0.59099437148217637, 0.59568480300187621, 0.58818011257035652, 0.59287054409005624, 0.60506566604127576]
[0.5885660731021

[0.58856607310215558,
 0.56982193064667297,
 0.59099437148217637,
 0.56378986866791747,
 0.59099437148217637,
 0.59568480300187621,
 0.58818011257035652,
 0.59287054409005624,
 0.60506566604127576,
 0.57035647279549717]

In [8]:
eval_trec.evaluate(model=loaded_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.634


# Evaluation of the original C++ implementation of sent2vec

To build and train the C++ implementation of sent2vec, use the following commands. This will produce object files for all the classes as well as the main binary, which the training cell below invokes as `./fasttext`.

In [15]:
! git clone https://github.com/epfml/sent2vec.git
% cd sent2vec
! make

/Users/prerna135/Documents/GitHub/gensim/sent2vec


In [16]:
# Train model using original c++ implementation of sent2vec
start_time = time.time()
! ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 20 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.0001 -dropoutK 2 -bucket 2000000
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

Read 0M words
Number of words:  1837
Number of labels: 0
Progress: 100.0%  words/sec/thread: 27315  lr: 0.000000  loss: 3.016871  eta: 0h0m


Total training time: 25.7511711121 seconds


In [9]:
eval_sick.evaluate(seed=42, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.417315994108
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.420326770285
Test Spearman: 0.425649516516
Test MSE: 0.838276461109


array([ 2.98017314,  3.09968709,  3.31417498, ...,  3.26684362,
        2.37790687,  2.85882165])

In [10]:
eval_classification.eval_nested_kfold(name='SUBJ', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.78600000000000003]
[0.78600000000000003, 0.79900000000000004]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77700000000000002]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77700000000000002, 0.78100000000000003]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77700000000000002, 0.78100000000000003, 0.76200000000000001]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77700000000000002, 0.78100000000000003, 0.76200000000000001, 0.78200000000000003]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77700000000000002, 0.78100000000000003, 0.76200000000000001, 0.78200000000000003, 0.79000000000000004]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77700000000000002, 0.78100000000000003, 0.76200000000000001, 0.78200000000000003, 0.79000000000000004, 0.76900000000000002]
[0.7860000000000

[0.78600000000000003,
 0.79900000000000004,
 0.78500000000000003,
 0.77700000000000002,
 0.78100000000000003,
 0.76200000000000001,
 0.78200000000000003,
 0.79000000000000004,
 0.76900000000000002,
 0.80100000000000005]

In [11]:
eval_classification.eval_nested_kfold(name='MR', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.56794751640112462]
[0.56794751640112462, 0.59700093720712277]
[0.56794751640112462, 0.59700093720712277, 0.56472795497185746]
[0.56794751640112462, 0.59700093720712277, 0.56472795497185746, 0.57786116322701686]
[0.56794751640112462, 0.59700093720712277, 0.56472795497185746, 0.57786116322701686, 0.58067542213883683]
[0.56794751640112462, 0.59700093720712277, 0.56472795497185746, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606]
[0.56794751640112462, 0.59700093720712277, 0.56472795497185746, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606, 0.60506566604127576]
[0.56794751640112462, 0.59700093720712277, 0.56472795497185746, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606, 0.60506566604127576, 0.61819887429643527]
[0.56794751640112462, 0.59700093720712277, 0.56472795497185746, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606, 0.60506566604127576, 0.61819887429643527, 0.61069418386491559]
[0.5679475164011

[0.56794751640112462,
 0.59700093720712277,
 0.56472795497185746,
 0.57786116322701686,
 0.58067542213883683,
 0.59849906191369606,
 0.60506566604127576,
 0.61819887429643527,
 0.61069418386491559,
 0.59662288930581608]

In [12]:
eval_trec.evaluate(evalcv=False, evaltest=True, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.594


# Evaluation of Doc2Vec

In [13]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [14]:
train_corpus = list(read_corpus(lee_train_file))

In [15]:
# Doc2Vec model1 with PV-DM and sum of context word vectors
doc2vec_model1 = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model1.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [16]:
# Doc2Vec model2 with PV-DBOW and sum of context word vectors
doc2vec_model2 = gensim.models.doc2vec.Doc2Vec(dm=0, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model2.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [17]:
# Doc2Vec model3 with PV-DM and mean of context word vectors
doc2vec_model3 = gensim.models.doc2vec.Doc2Vec(dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model3.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [18]:
# Doc2Vec model4 with PV-DBOW and mean of context word vectors
doc2vec_model4 = gensim.models.doc2vec.Doc2Vec(dm=0, dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model4.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights
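The cell comments above distinguish between the "sum" and the "mean" of context word vectors. The toy sketch below shows only the arithmetic difference between the two ways of combining vectors. `combine_context` is a hypothetical helper and the vectors are made-up values; how Doc2Vec applies `dm_mean` internally is not shown here.

```python
def combine_context(vectors, use_mean):
    """Combine context vectors by element-wise sum or element-wise mean."""
    dim = len(vectors[0])
    summed = [sum(v[i] for v in vectors) for i in range(dim)]
    if use_mean:
        return [x / float(len(vectors)) for x in summed]
    return summed


# Three hypothetical 2-dimensional context vectors.
context = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(combine_context(context, use_mean=False))
print(combine_context(context, use_mean=True))
```

The mean keeps the combined vector on the same scale regardless of how many context vectors are present, while the sum grows with the size of the context.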


In [19]:
%time doc2vec_model1.train(train_corpus, total_examples=doc2vec_model1.corpus_count, epochs=doc2vec_model1.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 62.53% examples, 451598 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724888 effective words) took 1.6s, 459770 effective words/s
CPU times: user 3.73 s, sys: 230 ms, total: 3.96 s
Wall time: 1.59 s


724888

In [20]:
%time doc2vec_model2.train(train_corpus, total_examples=doc2vec_model2.corpus_count, epochs=doc2vec_model2.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 80.98% examples, 579813 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724799 effective words) took 1.2s, 586018 effective words/s
CPU times: user 3.07 s, sys: 126 ms, total: 3.2 s
Wall time: 1.24 s


724799

In [21]:
%time doc2vec_model3.train(train_corpus, total_examples=doc2vec_model3.corpus_count, epochs=doc2vec_model3.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 36.78% examples, 262306 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:PROGRESS: at 86.73% examples, 311389 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724835 effective words) took 2.2s, 326296 effective words/s
CPU times: user 3.69 s, sys: 163 ms, total: 3.86 s
Wall time: 2.23 s


724835

In [22]:
%time doc2vec_model4.train(train_corpus, total_examples=doc2vec_model4.corpus_count, epochs=doc2vec_model4.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 69.20% examples, 500867 words/s, in_qsize 5, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724340 effective words) took 1.4s, 527263 effective words/s
CPU times: user 3.2 s, sys: 132 ms, total: 3.33 s
Wall time: 1.38 s


724340

In [23]:
eval_sick.evaluate(doc2vec_model1, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.286438142989
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.269150921565
Test Spearman: 0.262674319422
Test MSE: 0.94450382038


array([ 3.33818458,  3.32924404,  3.22118222, ...,  3.62762797,
        3.44393652,  3.15340341])

In [24]:
eval_classification.eval_nested_kfold(model=doc2vec_model1, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.67400000000000004]
[0.67400000000000004, 0.67000000000000004]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005, 0.67900000000000005]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005, 0.67900000000000005, 0.66800000000000004]
[0.67400000000000004, 0.67000000000000004, 0.65600000000000003, 0.64000000000000001, 0.66800000000000004, 0.68000000000000005, 0.67900000000000005, 0.66800000000000004, 0.67100000000000004]
[0.6740000000000

[0.67400000000000004,
 0.67000000000000004,
 0.65600000000000003,
 0.64000000000000001,
 0.66800000000000004,
 0.68000000000000005,
 0.67900000000000005,
 0.66800000000000004,
 0.67100000000000004,
 0.69199999999999995]

In [25]:
eval_classification.eval_nested_kfold(model=doc2vec_model1, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.5135895032802249]
[0.5135895032802249, 0.53045923149015928]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173, 0.57879924953095685]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173, 0.57879924953095685, 0.5412757973733584]
[0.5135895032802249, 0.53045923149015928, 0.54878048780487809, 0.54409005628517826, 0.55816135084427765, 0.5684803001876173, 0.57879924953095685, 0.5412757973733584, 0.5684803001876173]
[0.5135895032802249, 0.530459231

[0.5135895032802249,
 0.53045923149015928,
 0.54878048780487809,
 0.54409005628517826,
 0.55816135084427765,
 0.5684803001876173,
 0.57879924953095685,
 0.5412757973733584,
 0.5684803001876173,
 0.5478424015009381]

In [26]:
eval_sick.evaluate(doc2vec_model2, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.39693084071
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.332600557652
Test Spearman: 0.331574006828
Test MSE: 0.905109955103


array([ 3.57159891,  3.35031843,  3.5568133 , ...,  3.06880136,
        3.49490334,  3.19048271])

In [27]:
eval_classification.eval_nested_kfold(model=doc2vec_model2, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.75]
[0.75, 0.71699999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997, 0.71099999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997, 0.71099999999999997, 0.71199999999999997]
[0.75, 0.71699999999999997, 0.70499999999999996, 0.73999999999999999, 0.72299999999999998, 0.72099999999999997, 0.71599999999999997, 0.7109999999999999

[0.75,
 0.71699999999999997,
 0.70499999999999996,
 0.73999999999999999,
 0.72299999999999998,
 0.72099999999999997,
 0.71599999999999997,
 0.71099999999999997,
 0.71199999999999997,
 0.73499999999999999]

In [28]:
eval_classification.eval_nested_kfold(model=doc2vec_model2, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.57638238050609181]
[0.57638238050609181, 0.57357075913776945]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778, 0.59849906191369606]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778, 0.59849906191369606, 0.56754221388367732]
[0.57638238050609181, 0.57357075913776945, 0.59474671669793622, 0.55253283302063794, 0.57317073170731703, 0.55628517823639778, 0.59849906191369606, 0.56754221388367732, 0.56566604127579734]
[0.5763823805060

[0.57638238050609181,
 0.57357075913776945,
 0.59474671669793622,
 0.55253283302063794,
 0.57317073170731703,
 0.55628517823639778,
 0.59849906191369606,
 0.56754221388367732,
 0.56566604127579734,
 0.59568480300187621]
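
Each printed list above grows by one accuracy per fold, and the final list is the return value. The sketch below shows how such a running list of per-fold accuracies can be produced. It is a simplified, non-nested illustration and not the evaluation script used in this notebook: the nearest-centroid classifier, the helper name `kfold_scores` and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def kfold_scores(X, y, k=10, seed=42):
    """Return one held-out accuracy per fold, printing the running list."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Stand-in classifier (assumption): one centroid per class,
        # each test vector is assigned to the nearest centroid.
        classes = np.unique(y[train_idx])
        centroids = np.array(
            [X[train_idx][y[train_idx] == c].mean(axis=0) for c in classes]
        )
        diffs = X[test_idx][:, None, :] - centroids[None, :, :]
        pred = classes[(diffs ** 2).sum(axis=2).argmin(axis=1)]
        scores.append(float((pred == y[test_idx]).mean()))
        print(scores)
    return scores

# Synthetic "sentence vectors": two well-separated Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 5)), rng.normal(2.0, 1.0, (100, 5))])
y = np.array([0] * 100 + [1] * 100)
scores = kfold_scores(X, y)
```

In a nested variant, each training split would be divided again to choose hyperparameters before scoring the held-out fold.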

In [29]:
eval_sick.evaluate(doc2vec_model3, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.218506672244
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.268180823329
Test Spearman: 0.267794536402
Test MSE: 0.946454984438


array([ 3.55381104,  3.65387629,  3.32172886, ...,  3.50422149,
        3.73455072,  3.27572114])
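
The three SICK numbers compare predicted relatedness scores with the gold scores: Pearson correlation, Spearman rank correlation and mean squared error. The sketch below computes all three with NumPy on made-up values. The helper names and the toy data are assumptions for illustration, and the rank helper ignores ties (`scipy.stats.pearsonr` and `scipy.stats.spearmanr` are the more usual route).

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

def ranks(a):
    """Ranks 1..n of the values in `a`; assumes there are no ties."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

# Toy gold and predicted scores (not taken from SICK).
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.5, 2.0, 4.5, 4.0]

p, s, m = pearson(gold, pred), spearman(gold, pred), mse(gold, pred)
print('Pearson: %.4f  Spearman: %.4f  MSE: %.4f' % (p, s, m))
```

Higher Pearson and Spearman values and a lower MSE indicate predictions closer to the gold scores.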

In [30]:
eval_classification.eval_nested_kfold(model=doc2vec_model3, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.64700000000000002]
[0.64700000000000002, 0.65500000000000003]
[0.64700000000000002, 0.65500000000000003, 0.623]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0.69399999999999995]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0.69399999999999995, 0.66500000000000004]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0.69399999999999995, 0.66500000000000004, 0.66000000000000003]
[0.64700000000000002, 0.65500000000000003, 0.623, 0.66400000000000003, 0.65100000000000002, 0.67300000000000004, 0

[0.64700000000000002,
 0.65500000000000003,
 0.623,
 0.66400000000000003,
 0.65100000000000002,
 0.67300000000000004,
 0.69399999999999995,
 0.66500000000000004,
 0.66000000000000003,
 0.67800000000000005]

In [31]:
eval_classification.eval_nested_kfold(model=doc2vec_model3, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.50140581068416124]
[0.50140581068416124, 0.53701968134957823]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747, 0.56754221388367732]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747, 0.56754221388367732, 0.55816135084427765]
[0.50140581068416124, 0.53701968134957823, 0.54690431519699811, 0.5478424015009381, 0.57035647279549717, 0.56378986866791747, 0.56754221388367732, 0.55816135084427765, 0.53752345215759845]
[0.50140581068416124, 

[0.50140581068416124,
 0.53701968134957823,
 0.54690431519699811,
 0.5478424015009381,
 0.57035647279549717,
 0.56378986866791747,
 0.56754221388367732,
 0.55816135084427765,
 0.53752345215759845,
 0.58724202626641653]

In [32]:
eval_sick.evaluate(doc2vec_model4, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.409921757155
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.341916171513
Test Spearman: 0.340134184451
Test MSE: 0.898672297686


array([ 3.4285679 ,  3.71794175,  2.89571583, ...,  3.20624222,
        3.11745406,  2.71925664])

In [33]:
eval_classification.eval_nested_kfold(model=doc2vec_model4, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.751]
[0.751, 0.70699999999999996]
[0.751, 0.70699999999999996, 0.71999999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997, 0.71399999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997, 0.71399999999999997, 0.72099999999999997]
[0.751, 0.70699999999999996, 0.71999999999999997, 0.73299999999999998, 0.73899999999999999, 0.69999999999999996, 0.71599999999999997, 0.713999

[0.751,
 0.70699999999999996,
 0.71999999999999997,
 0.73299999999999998,
 0.73899999999999999,
 0.69999999999999996,
 0.71599999999999997,
 0.71399999999999997,
 0.72099999999999997,
 0.73099999999999998]

In [34]:
eval_classification.eval_nested_kfold(model=doc2vec_model4, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.57919400187441428]
[0.57919400187441428, 0.56419868791002814]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716, 0.59474671669793622]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716, 0.59474671669793622, 0.5684803001876173]
[0.57919400187441428, 0.56419868791002814, 0.6031894934333959, 0.57410881801125702, 0.56285178236397748, 0.57129455909943716, 0.59474671669793622, 0.5684803001876173, 0.575046904315197]
[0.57919400187441428, 0.564

[0.57919400187441428,
 0.56419868791002814,
 0.6031894934333959,
 0.57410881801125702,
 0.56285178236397748,
 0.57129455909943716,
 0.59474671669793622,
 0.5684803001876173,
 0.575046904315197,
 0.61444652908067543]

In [35]:
eval_trec.evaluate(doc2vec_model1, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.398


In [36]:
eval_trec.evaluate(doc2vec_model2, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.422


In [37]:
eval_trec.evaluate(doc2vec_model3, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.378


In [38]:
eval_trec.evaluate(doc2vec_model4, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.404


# Evaluation of sentence vectors obtained from averaging FastText word vectors
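
In this baseline a sentence vector is the mean of the vectors of the words in the sentence. The sketch below shows the idea with a plain dictionary standing in for the trained model. The function name, the toy vectors and the choice to skip unknown words are assumptions for illustration, and the evaluation scripts may handle out-of-vocabulary words differently.

```python
import numpy as np

def average_sentence_vector(tokens, word_vectors, dim):
    """Mean of the vectors of the tokens found in `word_vectors`."""
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    if not found:
        # No known word: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Toy 3-dimensional word vectors (hypothetical values).
toy_vectors = {
    'small': np.array([1.0, 0.0, 2.0]),
    'quiet': np.array([3.0, 2.0, 0.0]),
    'town': np.array([2.0, 4.0, 1.0]),
}

vec = average_sentence_vector(['a', 'small', 'quiet', 'town'], toy_vectors, dim=3)
empty = average_sentence_vector(['unknown'], toy_vectors, dim=3)
print(vec)
```

Here `'a'` is not in the toy vocabulary, so only the three known words contribute to the mean.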

In [54]:
lee_data = LineSentence(lee_train_file)
fasttext_model = ft(size=100, alpha=0.2, negative=10, max_vocab_size=30000000, seed=42, iter=20)
fasttext_model.build_vocab(lee_data)
start_time = time.time()
fasttext_model.train(lee_data, total_examples=fasttext_model.corpus_count, epochs=fasttext_model.iter)
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 10781 word types from a corpus of 59890 raw words and 300 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
INFO:gensim.models.word2vec:min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 10781 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 45 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
INFO:gensim.models.word2vec:estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.fasttext:Total number of ngrams is 17006
INFO:gens

In [57]:
fasttext_model.save('ft1')

INFO:gensim.utils:saving FastText object under ft1, separately None
INFO:gensim.utils:not storing attribute syn0norm
INFO:gensim.utils:not storing attribute syn0_ngrams_norm
INFO:gensim.utils:not storing attribute syn0_vocab_norm
INFO:gensim.utils:saved ft1


In [39]:
ft_loaded_model = ft.load('ft1')

INFO:gensim.utils:loading FastText object from ft1
INFO:gensim.utils:loading wv recursively from ft1.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:setting ignored attribute syn0_ngrams_norm to None
INFO:gensim.utils:setting ignored attribute syn0_vocab_norm to None
INFO:gensim.utils:loaded ft1


In [40]:
eval_sick.evaluate(ft_loaded_model, seed=42, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.508652314637
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.502745225838
Test Spearman: 0.498571233172
Test MSE: 0.762192068723


array([ 2.79715686,  3.30016953,  3.30776085, ...,  3.46108352,
        3.05130163,  2.48509623])

In [41]:
eval_classification.eval_nested_kfold(model=ft_loaded_model, name='SUBJ', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.82299999999999995]
[0.82299999999999995, 0.80500000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003, 0.80500000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003, 0.80500000000000005, 0.80800000000000005]
[0.82299999999999995, 0.80500000000000005, 0.79700000000000004, 0.80600000000000005, 0.80800000000000005, 0.78600000000000003, 0.80500000000000005, 0.80800000000000005, 0.78800000000000003]
[0.8229999999999

[0.82299999999999995,
 0.80500000000000005,
 0.79700000000000004,
 0.80600000000000005,
 0.80800000000000005,
 0.78600000000000003,
 0.80500000000000005,
 0.80800000000000005,
 0.78800000000000003,
 0.81899999999999995]

In [42]:
eval_classification.eval_nested_kfold(model=ft_loaded_model, name='MR', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.61480787253983127]
[0.61480787253983127, 0.60449859418931584]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574, 0.64446529080675419]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574, 0.64446529080675419, 0.61257035647279545]
[0.61480787253983127, 0.60449859418931584, 0.59193245778611636, 0.61819887429643527, 0.59005628517823638, 0.60694183864915574, 0.64446529080675419, 0.61257035647279545, 0.60412757973733588]
[0.6148078725398

[0.61480787253983127,
 0.60449859418931584,
 0.59193245778611636,
 0.61819887429643527,
 0.59005628517823638,
 0.60694183864915574,
 0.64446529080675419,
 0.61257035647279545,
 0.60412757973733588,
 0.60694183864915574]

In [43]:
eval_trec.evaluate(ft_loaded_model, evalcv=False, evaltest=True, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.62


# Evaluation Results

| S.No. | Model Name                                | Training Time (in seconds) | Pearson/Spearman/MSE on SICK | Mean SUBJ | Mean MR | TREC |
|-------|-------------------------------------------|----------------------------|------------------------------|-----------|---------|------|
| 1.    | Gensim Sent2Vec                           | 319.09                     | 0.48/0.49/0.78               | 0.76      | 0.58    | 0.63 |
| 2.    | Original Sent2Vec                         | 25.75                      | 0.42/0.43/0.82               | 0.78      | 0.59    | 0.59 |
| 3.    | PV-DM with sum of context word vectors    | 3.57                       | 0.27/0.27/0.94               | 0.66      | 0.55    | 0.37 |
| 4.    | PV-DM with mean of context word vectors   | 3.8                        | 0.28/0.28/0.93               | 0.67      | 0.55    | 0.38 |
| 5.    | PV-DBOW with sum of context word vectors  | 3.06                       | 0.36/0.35/0.88               | 0.73      | 0.57    | 0.42 |
| 6.    | PV-DBOW with mean of context word vectors | 2.92                       | 0.34/0.34/0.89               | 0.72      | 0.57    | 0.41 |
| 7.    | Mean of gensim fasttext word vectors      | 1540.17                    | 0.49/0.49/0.76               | 0.80      | 0.60    | 0.62 |
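
The Mean SUBJ and Mean MR columns presumably average the ten per-fold accuracies returned by `eval_nested_kfold`. As a worked example, the snippet below averages the SUBJ fold accuracies printed for the FastText model in cell 41. The variable names are chosen for this example only.

```python
# Per-fold SUBJ accuracies of the FastText model, copied from the output above.
subj_fasttext = [0.823, 0.805, 0.797, 0.806, 0.808,
                 0.786, 0.805, 0.808, 0.788, 0.819]

mean_subj = sum(subj_fasttext) / len(subj_fasttext)
print('Mean SUBJ: %.2f' % mean_subj)
```

This gives 0.80, matching the Mean SUBJ entry in row 7 of the table.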

# Evaluation on sample of Toronto Book Corpus

In [2]:
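# Prepare training data: sample roughly half of the lines in the corpus file,
# stop after 100,000 sampled lines, and write one tokenized sentence per line
# to input.txt for the original C++ implementation.
import numpy as np

toronto_data = []
lines = 0
with open('./books_in_sentences/books_large_p1.txt') as f1, open("./input.txt", 'w') as f2:
    for line in f1:
        if np.random.random() > 0.5:
            if lines >= 100000:
                break
            lines += 1
            if line not in ['\n', '\r\n']:
                # Split the line on full stops, question marks and newlines
                line = re.split(r'\.|\?|\n', line.strip())
                for sentence in line:
                    if len(sentence) > 1:
                        sentence = tokenize(sentence)
                        toronto_data.append(list(sentence))
                        f2.write(' '.join(toronto_data[-1]) + '\n')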
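
To see what the splitting step does to a single line, the sketch below applies the same regular expression to a sample string. It is self-contained, so a simple alphabetic-token regex stands in for `gensim.utils.tokenize`. That stand-in, the helper name `split_and_tokenize` and the sample text are assumptions for illustration.

```python
import re

def split_and_tokenize(line):
    """Split a line into sentences on '.', '?' and newlines, then tokenize."""
    sentences = []
    for piece in re.split(r'\.|\?|\n', line.strip()):
        if len(piece) > 1:
            # Stand-in tokenizer (assumption): keep runs of letters only.
            tokens = re.findall(r'[^\W\d_]+', piece, flags=re.UNICODE)
            sentences.append(tokens)
    return sentences

sample = "its a small , quiet town . only because everyone felt so safe ?\n"
result = split_and_tokenize(sample)
for tokens in result:
    print(' '.join(tokens))
```

Pieces of length 0 or 1, such as the empty string left after the final question mark, are dropped by the `len(piece) > 1` check.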

In [3]:
# Print sample training data
for sentence in toronto_data[:5]:
    print sentence,'\n'

[u'isbn', u'isbn', u'for', u'my', u'family', u'who', u'encouraged', u'me', u'to', u'never', u'stop', u'fighting', u'for', u'my', u'dreams', u'chapter', u'summer', u'vacations', u'supposed', u'to', u'be', u'fun', u'right'] 

[u'starlings', u'new', u'york', u'is', u'not', u'the', u'place', u'youd', u'expect', u'much', u'to', u'happen'] 

[u'its', u'a', u'small', u'quiet', u'town', u'the', u'kind', u'where', u'everyone', u'knows', u'your', u'name'] 

[u'only', u'because', u'everyone', u'felt', u'so', u'safe', u'so', u'comfy'] 

[u'they', u'dont', u'know', u'the', u'half', u'of', u'it'] 



In [4]:
# Train new sent2vec model on part of the Toronto Book Corpus (100,000 sentences)
sent2vec_toronto_model = s2v(toronto_data, vector_size=100, epochs=5, seed=42)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 1.00 M words
INFO:gensim.models.sent2vec:Read 1.35 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 11552, tokens read: 1348104
INFO:gensim.models.sent2vec:Training...
INFO:gensim.models.sent2vec:Begin epoch 0 :
INFO:gensim.models.sent2vec:Progress: 19.33, lr: 0.1613, loss: 2.9068
INFO:gensim.models.sent2vec:Begin epoch 1 :
INFO:gensim.models.sent2vec:Progress: 38.66, lr: 0.1227, loss: 2.7171
INFO:gensim.models.sent2vec:Begin epoch 2 :
INFO:gensim.models.sent2vec:Progress: 57.98, lr: 0.0840, loss: 2.5908
INFO:gensim.models.sent2vec:Begin epoch 3 :
INFO:gensim.models.sent2vec:Progress: 77.31, lr: 0.0454, loss: 2.4977
INFO:gensim.models.sent2vec:Begin epoch 4 :
INFO:gensim.models.sent2vec:Progress: 96.64, lr: 0.0067, loss: 2.4345
INFO:gensim.models.sent2vec:Total training time: 1414.88167095 seconds


In [5]:
sent2vec_toronto_model.save('s2v2')

INFO:gensim.utils:saving Sent2Vec object under s2v2, separately None
INFO:gensim.utils:storing np array 'wi' to s2v2.wi.npy
INFO:gensim.utils:saved s2v2


In [6]:
eval_sick.evaluate(sent2vec_toronto_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...


Training...
Dev Pearson: 0.478225382288
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.527323452228
Test Spearman: 0.520549235358
Test MSE: 0.739274124711


array([ 2.92524702,  2.96355812,  3.1027343 , ...,  3.37268152,
        2.13288476,  3.2024786 ])

In [14]:
eval_trec.evaluate(sent2vec_toronto_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.572


In [8]:
eval_classification.eval_nested_kfold(model=sent2vec_toronto_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.60637300843486408]
[0.60637300843486408, 0.60918462980318655]
[0.60637300843486408, 0.60918462980318655, 0.61913696060037526]
[0.60637300843486408, 0.60918462980318655, 0.61913696060037526, 0.60694183864915574]
[0.60637300843486408, 0.60918462980318655, 0.61913696060037526, 0.60694183864915574, 0.61069418386491559]
[0.60637300843486408, 0.60918462980318655, 0.61913696060037526, 0.60694183864915574, 0.61069418386491559, 0.61538461538461542]
[0.60637300843486408, 0.60918462980318655, 0.61913696060037526, 0.60694183864915574, 0.61069418386491559, 0.61538461538461542, 0.62288930581613511]
[0.60637300843486408, 0.60918462980318655, 0.61913696060037526, 0.60694183864915574, 0.61069418386491559, 0.61538461538461542, 0.62288930581613511, 0.61913696060037526]
[0.60637300843486408, 0.60918462980318655, 0.61913696060037526, 0.60694183864915574, 0.61069418386491559, 0.61538461538461542, 0.62288930581613511, 0.61913696060037526, 0.61257035647279545]
[0.6063730084348

[0.60637300843486408,
 0.60918462980318655,
 0.61913696060037526,
 0.60694183864915574,
 0.61069418386491559,
 0.61538461538461542,
 0.62288930581613511,
 0.61913696060037526,
 0.61257035647279545,
 0.60694183864915574]

In [9]:
eval_classification.eval_nested_kfold(model=sent2vec_toronto_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.79500000000000004]
[0.79500000000000004, 0.80500000000000005]
[0.79500000000000004, 0.80500000000000005, 0.77900000000000003]
[0.79500000000000004, 0.80500000000000005, 0.77900000000000003, 0.78700000000000003]
[0.79500000000000004, 0.80500000000000005, 0.77900000000000003, 0.78700000000000003, 0.77200000000000002]
[0.79500000000000004, 0.80500000000000005, 0.77900000000000003, 0.78700000000000003, 0.77200000000000002, 0.78100000000000003]
[0.79500000000000004, 0.80500000000000005, 0.77900000000000003, 0.78700000000000003, 0.77200000000000002, 0.78100000000000003, 0.77300000000000002]
[0.79500000000000004, 0.80500000000000005, 0.77900000000000003, 0.78700000000000003, 0.77200000000000002, 0.78100000000000003, 0.77300000000000002, 0.80500000000000005]
[0.79500000000000004, 0.80500000000000005, 0.77900000000000003, 0.78700000000000003, 0.77200000000000002, 0.78100000000000003, 0.77300000000000002, 0.80500000000000005, 0.79900000000000004]
[0.7950000000000

[0.79500000000000004,
 0.80500000000000005,
 0.77900000000000003,
 0.78700000000000003,
 0.77200000000000002,
 0.78100000000000003,
 0.77300000000000002,
 0.80500000000000005,
 0.79900000000000004,
 0.80900000000000005]

In [10]:
# Train a model using the original C++ implementation of sent2vec
%cd sent2vec
start_time = time.time()
!./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 5 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.0001 -dropoutK 2 -bucket 2000000
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

/Users/prerna135/Documents/GitHub/gensim/sent2vec
Read 1M words
Number of words:  12990
Number of labels: 0
Progress: 100.0%  words/sec/thread: 85591  lr: 0.000000  loss: 2.670269  eta: 0h0m


Total training time: 32.5075762272 seconds


In [13]:
eval_sick.evaluate(seed=42, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Dev Pearson: 0.417587833758
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.419756504215
Test Spearman: 0.425938719051
Test MSE: 0.838645111858


array([ 2.97449634,  3.08862349,  3.33243425, ...,  3.27892888,
        2.38570633,  2.86452046])

In [15]:
eval_classification.eval_nested_kfold(name='SUBJ', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.78600000000000003]
[0.78600000000000003, 0.79900000000000004]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77800000000000002]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77800000000000002, 0.78100000000000003]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77800000000000002, 0.78100000000000003, 0.76200000000000001]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77800000000000002, 0.78100000000000003, 0.76200000000000001, 0.78200000000000003]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77800000000000002, 0.78100000000000003, 0.76200000000000001, 0.78200000000000003, 0.78900000000000003]
[0.78600000000000003, 0.79900000000000004, 0.78500000000000003, 0.77800000000000002, 0.78100000000000003, 0.76200000000000001, 0.78200000000000003, 0.78900000000000003, 0.76900000000000002]
[0.7860000000000

[0.78600000000000003,
 0.79900000000000004,
 0.78500000000000003,
 0.77800000000000002,
 0.78100000000000003,
 0.76200000000000001,
 0.78200000000000003,
 0.78900000000000003,
 0.76900000000000002,
 0.80100000000000005]

In [16]:
eval_classification.eval_nested_kfold(name='MR', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.5670103092783505]
[0.5670103092783505, 0.59700093720712277]
[0.5670103092783505, 0.59700093720712277, 0.56378986866791747]
[0.5670103092783505, 0.59700093720712277, 0.56378986866791747, 0.57786116322701686]
[0.5670103092783505, 0.59700093720712277, 0.56378986866791747, 0.57786116322701686, 0.58067542213883683]
[0.5670103092783505, 0.59700093720712277, 0.56378986866791747, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606]
[0.5670103092783505, 0.59700093720712277, 0.56378986866791747, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606, 0.60506566604127576]
[0.5670103092783505, 0.59700093720712277, 0.56378986866791747, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606, 0.60506566604127576, 0.62007504690431525]
[0.5670103092783505, 0.59700093720712277, 0.56378986866791747, 0.57786116322701686, 0.58067542213883683, 0.59849906191369606, 0.60506566604127576, 0.62007504690431525, 0.60881801125703561]
[0.5670103092783505, 0.59

[0.5670103092783505,
 0.59700093720712277,
 0.56378986866791747,
 0.57786116322701686,
 0.58067542213883683,
 0.59849906191369606,
 0.60506566604127576,
 0.62007504690431525,
 0.60881801125703561,
 0.59756097560975607]

In [17]:
eval_trec.evaluate(evalcv=False, evaltest=True, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.594


**NOTE:** These results suggest that more training data helps: the models above, trained for only 5 epochs on the Toronto Book Corpus sample, achieve results similar to those of the models trained for 20 epochs on the much smaller Lee corpus.

| S.No. | Model             | Training Time (in seconds) | Pearson/Spearman/MSE on SICK | MR   | SUBJ | TREC |
|-------|-------------------|----------------------------|------------------------------|------|------|------|
| 1.    | Gensim Sent2Vec   | 1414.88                    | 0.52/0.52/0.73               | 0.61 | 0.79 | 0.57 |
| 2.    | Original Sent2Vec | 32.5                       | 0.41/0.42/0.83               | 0.59 | 0.78 | 0.59 |
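
To put the training times from the two results tables side by side, the snippet below computes how many times longer the Gensim implementation took than the original one on each corpus. The numbers are copied from the tables, while the dictionary layout and variable names are chosen for this example.

```python
# (Gensim Sent2Vec seconds, original Sent2Vec seconds), from the results tables.
timings = {
    'Lee corpus': (319.09, 25.75),
    'Toronto sample': (1414.88, 32.5),
}

ratios = {}
for corpus, (gensim_s, original_s) in sorted(timings.items()):
    ratios[corpus] = gensim_s / original_s
    print('%s: Gensim took %.1fx as long' % (corpus, ratios[corpus]))
```

The original implementation was run with `-thread 20` on the Toronto sample, so these ratios reflect the settings used here as well as the implementations themselves.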