# Using Sent2Vec via Gensim

This tutorial shows how to use the sent2vec model in Gensim. We'll use the sent2vec module to train sentence-embedding models, save and load them, and perform similarity operations. The notebook also compares the Gensim implementation with the [original C++ implementation](https://github.com/epfml/sent2vec).

# What is Sent2Vec?

Sent2Vec delivers numerical representations (features) for short texts or sentences, which can be used as input to any downstream machine learning task. Think of it as an unsupervised version of FastText, and an extension of word2vec (CBOW) to sentences. The method uses a simple but efficient unsupervised objective to train distributed representations of sentences. According to the authors, the algorithm outperforms state-of-the-art unsupervised models on most benchmark tasks and even beats supervised models on many tasks, highlighting the robustness of the produced sentence embeddings. See the [paper](https://arxiv.org/abs/1703.02507) for more details.

The sentence embedding is defined as the average of the source word embeddings of its constituent words. The model is further augmented by learning source embeddings not only for unigrams but also for the n-grams present in each sentence, and averaging the n-gram embeddings along with those of the words.
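As a rough illustration of this averaging, here is a minimal sketch over a made-up embedding table. The vectors, the `ngrams` and `embed_sentence` helpers, and the `a_b` key format for bigrams are assumptions made for this example and are not part of gensim's API.

```python
def ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list, joined with '_'."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def embed_sentence(tokens, table, max_n=2):
    """Average the embeddings of every 1..max_n-gram that is present in `table`."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(g for g in ngrams(tokens, n) if g in table)
    if not features:
        return None  # no known unigram or n-gram in this sentence
    dim = len(table[features[0]])
    total = [0.0] * dim
    for feature in features:
        for k, value in enumerate(table[feature]):
            total[k] += value
    return [value / len(features) for value in total]


# Hypothetical 2-dimensional source embeddings, for illustration only.
toy_table = {
    'sky': [1.0, 0.0],
    'is': [0.0, 1.0],
    'blue': [1.0, 1.0],
    'sky_is': [2.0, 2.0],
}

sentence_vector = embed_sentence(['sky', 'is', 'blue'], toy_table)
print(sentence_vector)
```

A plain dictionary is used here for readability; the `bucket` parameter listed further down suggests that the actual model keeps n-grams in hash buckets instead.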

In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.sent2vec import Sent2Vec as s2v
from gensim.models.fasttext import FastText as ft
from gensim.utils import tokenize
import scipy.stats
import re
from numpy import dot
from gensim import matutils
import time
import numpy as np
import tensorflow as tf
import random
import eval_sick
import eval_classification
import eval_trec
import smart_open

Using TensorFlow backend.


In [2]:
# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)

# Training models

For the following examples, we'll use the Lee Corpus (which you already have if you've installed gensim) for training our model. All models are trained with the same hyperparameters for evaluation purposes.

In [3]:
# Prepare training data
data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep
lee_train_file = data_dir + 'lee_background.cor'
lee_data = []
with open(lee_train_file) as f1, open("./input.txt",'w') as f2:
    for line in f1:
        if line not in ['\n', '\r\n']:
            line = re.split(r'\.|\?|\n', line.strip())
            for sentence in line:
                if len(sentence) > 1:
                    sentence = tokenize(sentence)
                    lee_data.append(list(sentence))
                    f2.write(' '.join(lee_data[-1]) + '\n')
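The preprocessing in the cell above can be tried on a single string without gensim. The sketch below uses the same splitting regex; the `re.findall` pattern is only an approximation of `gensim.utils.tokenize` (runs of letters, digits dropped), and `split_into_token_lists` is a hypothetical helper name.

```python
import re


def split_into_token_lists(text):
    """Split text on '.', '?' and newlines, then keep alphabetic tokens only."""
    sentences = []
    for chunk in re.split(r'\.|\?|\n', text.strip()):
        if len(chunk) > 1:
            # Stand-in for gensim.utils.tokenize: runs of letters only.
            tokens = re.findall(r'[^\W\d_]+', chunk)
            if tokens:
                sentences.append(tokens)
    return sentences


sample = "A new blaze has forced the closure of the Hume Highway. Is it open at 8pm?"
token_lists = split_into_token_lists(sample)
for tokens in token_lists:
    print(' '.join(tokens))
```

Dropping digits in this way is why a token such as `pm` can appear on its own in the sample training data.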

In [4]:
# Print sample training data
for sentence in lee_data[:5]:
    print sentence,'\n'

[u'Hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'Southern', u'Highlands', u'of', u'New', u'South', u'Wales', u'as', u'strong', u'winds', u'today', u'pushed', u'a', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'Hill', u'Top'] 

[u'A', u'new', u'blaze', u'near', u'Goulburn', u'south', u'west', u'of', u'Sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'Hume', u'Highway'] 

[u'At', u'about', u'pm', u'AEDT', u'a', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'a', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'Blue', u'Mountains', u'forced', u'authorities', u'to', u'make', u'a', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'Hill', u'Top', u'in', u'the', u'New', u'South', u'Wales', u'southern', u'highlands'] 

[u'An', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'Mittago

# Using gensim implementation of sent2vec

In [5]:
# Train new sent2vec model
sent2vec_model = s2v(vector_size=100, epochs=20, seed=42)
sent2vec_model.train(lee_data)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 0.06 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 1307, tokens read: 60302
INFO:gensim.models.sent2vec:Training...
INFO:gensim.models.sent2vec:Begin epoch 0 :
INFO:gensim.models.sent2vec:Progress: 3.96, lr: 0.1921, loss: 3.7888
INFO:gensim.models.sent2vec:Begin epoch 1 :
INFO:gensim.models.sent2vec:Progress: 7.93, lr: 0.1841, loss: 3.6516
INFO:gensim.models.sent2vec:Begin epoch 2 :
INFO:gensim.models.sent2vec:Progress: 11.90, lr: 0.1762, loss: 3.5507
INFO:gensim.models.sent2vec:Begin epoch 3 :
INFO:gensim.models.sent2vec:Progress: 15.86, lr: 0.1683, loss: 3.4571
INFO:gensim.models.sent2vec:Begin epoch 4 :
INFO:gensim.models.sent2vec:Progress: 19.83, lr: 0.1603, loss: 3.3651
INFO:gensim.models.sent2vec:Begin epoch 5 :
INFO:gensim.models.sent2vec:Progress: 23.80, lr: 0.1524, loss: 3.2775
INFO:gensim.models.sent2vec:Begin epoch 6 :
INFO:gensim.models.sent2vec:Progress: 27.76, lr

# Training hyperparameters

Sent2Vec supports the following parameters:

 - vector_size: Size of embeddings to be learnt (Default 100)
 - alpha: Initial learning rate (Default 0.2)
 - min_count: Ignore words with number of occurrences below this (Default 5)
 - loss: Training objective. Allowed values: `ns` (Default `ns`)
 - neg: Number of negative words to sample, for `ns` (Default 10)
 - epochs: Number of epochs (Default 5)
 - bucket: Number of hash buckets for vocabulary (Default 2000000)
 - lr_update_rate: Change the rate of updates for the learning rate (Default 100)
 - t: Sampling threshold (Default 0.0001)
 - dropoutk: Number of ngrams dropped when training a sent2vec model (Default 2)
 - word_ngrams: Max length of word ngram (Default 2)
 - minn: Min length of char ngrams (Default 3)
 - maxn: Max length of char ngrams (Default 6)
 - seed: Random seed for reproducibility (Default 42)
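The `lr` values in the training log shown earlier are consistent with a learning rate that decays linearly from `alpha` as training progresses, i.e. `lr = alpha * (1 - progress)`. This is inferred from the logged numbers rather than taken from the implementation, and `decayed_learning_rate` is a hypothetical helper name.

```python
def decayed_learning_rate(alpha, progress_percent):
    """Linear decay: the rate falls from `alpha` to 0 as progress goes 0 -> 100%."""
    return alpha * (1.0 - progress_percent / 100.0)


# (progress %, logged lr) pairs copied from the training log above.
logged = [(3.96, 0.1921), (7.93, 0.1841), (11.90, 0.1762), (15.86, 0.1683)]

for progress, lr in logged:
    print(progress, round(decayed_learning_rate(0.2, progress), 4), lr)
```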

In [6]:
# Print sentence vector
sent2vec_model.sentence_vectors(['This', 'is', 'an', 'awesome', 'gift'])

array([ 0.26883082,  0.23762724,  0.52956072, -0.30249289, -0.06541158,
       -0.19586215, -0.21639706, -0.2870642 , -0.16640631, -0.16300409,
       -0.72889602, -0.00188518,  0.29943442,  0.2204167 ,  0.518335  ,
        0.15313007,  0.44969054,  0.34413915, -0.24833839,  0.57970606,
        0.23409248,  0.56129456, -0.68225917,  0.80357712, -0.12340111,
        0.52688657,  0.26677058,  0.20029608,  0.30819539,  0.52107787,
       -0.09311091,  0.79241742,  0.36930686,  0.55054623, -0.6026444 ,
        0.62004812, -0.11500538,  0.42416913,  0.34123594, -0.35301269,
        0.97941591, -0.08514785, -0.59108816,  0.32378288,  0.24572779,
        0.32549348, -0.52341544,  0.39753549,  0.65115151,  0.36858409,
        1.0796831 ,  0.65557207, -0.67688278,  0.62000768,  0.16206132,
        1.06445599,  0.14487612,  0.76555147,  0.06491144,  0.23573901,
       -0.7069076 ,  0.15132169,  0.25780877, -0.42521453,  0.19709011,
       -0.36970163, -0.1419971 , -0.08766063, -0.03583936,  0.99

In [7]:
# Print cosine similarity between two sentences
print sent2vec_model.similarity(['The', 'sky', 'is', 'blue'], ['I', 'am', 'going', 'to', 'a', 'party'])
print sent2vec_model.similarity(['This', 'is', 'an', 'awesome', 'gift'], ['This', 'present', 'is', 'great'])

0.0974793450342
0.760685172277
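For reference, cosine similarity between two vectors can be written in a few lines of plain Python. This is a generic sketch of the measure, not the code used by the model; returning 0.0 for zero vectors is a convention chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
print(orthogonal, parallel)
```

Values close to 1 indicate vectors pointing in nearly the same direction, which is why the two "gift" sentences above score much higher than the unrelated pair.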


# Saving and loading models

Models can be saved and loaded via the `save` and `load` methods.

In [8]:
# Save trained sent2vec model
sent2vec_model.save('s2v1')

INFO:gensim.utils:saving Sent2Vec object under s2v1, separately None
INFO:gensim.utils:storing np array 'wi' to s2v1.wi.npy
INFO:gensim.utils:saved s2v1


In [9]:
# Load pretrained sent2vec model
loaded_model = s2v.load('s2v1')

INFO:gensim.utils:loading Sent2Vec object from s2v1
INFO:gensim.utils:loading wi from s2v1.wi.npy with mmap=None
INFO:gensim.utils:loaded s2v1


# Unsupervised similarity evaluation

Unsupervised evaluation of the learnt sentence embeddings is performed using sentence cosine similarity on the [SICK 2014](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools) dataset. These similarity scores are compared to the gold-standard human judgements using [Pearson’s correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) scores. The SICK dataset consists of about 10,000 sentence pairs along with relatedness scores of the pairs. We use the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts).
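To make the metric concrete, here is a small standard-library sketch of Pearson's correlation. The notebook itself relies on `scipy.stats.pearsonr`; the `pearson_r` helper and the numbers below are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up cosine similarities and made-up gold relatedness scores.
predicted = [0.9, 0.4, 0.7, 0.2]
gold = [4.5, 3.2, 4.7, 1.0]
print(round(pearson_r(predicted, gold), 3))
```

A score near 1 means the model's similarities rise and fall with the human judgements; a score near 0 means there is no linear relationship.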

In [10]:
eval_sick.evaluate(loaded_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...


Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4667 - val_loss: 1.4017
Epoch 2/10
0s - loss: 1.3982 - val_loss: 1.3841
Epoch 3/10
0s - loss: 1.3834 - val_loss: 1.3784
Epoch 4/10
0s - loss: 1.3772 - val_loss: 1.3759
Epoch 5/10
0s - loss: 1.3736 - val_loss: 1.3743
Epoch 6/10
0s - loss: 1.3710 - val_loss: 1.3730
Epoch 7/10
0s - loss: 1.3689 - val_loss: 1.3720
Epoch 8/10
0s - loss: 1.3670 - val_loss: 1.3711
Epoch 9/10
0s - loss: 1.3654 - val_loss: 1.3703
Epoch 10/10
0s - loss: 1.3639 - val_loss: 1.3695
0.328580048381
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.3625 - val_loss: 1.3689
Epoch 2/10
0s - loss: 1.3612 - val_loss: 1.3683
Epoch 3/10
0s - loss: 1.3600 - val_loss: 1.3677
Epoch 4/10
0s - loss: 1.3589 - val_loss: 1.3672
Epoch 5/10
0s - loss: 1.3578 - val_loss: 1.3668
Epoch 6/10
0s - loss: 1.3568 - val_loss: 1.3663
Epoch 7/10
0s - loss: 1.3558 - val_loss: 1.3660
Epoch 8/10
0s - loss: 1.3549 - val_loss: 1.3656
Epoch 9/10
0s 

array([ 3.28358151,  3.49121839,  3.64299456, ...,  2.98211957,
        3.57306631,  3.50961446])

# Comparison with original C++ implementation of sent2vec

The comparison is done on the SICK 2014 dataset. Unlike the example above, the models are evaluated directly on the cosine similarity of their sentence embeddings, without training a logistic regression classifier on top of them.

In [11]:
# Prepare evaluation data
train_sick = []
test_sick = []
trial_sick = []
with open("./SICK/SICK.txt") as f, open('./train_sick.txt', 'w') as f1, open('./test_sick.txt', 'w') as f2, open('./trial_sick.txt', 'w') as f3:
    for line in f:
        tokens = line.strip().split('\t')
        if tokens[0].isdigit():
            if tokens[11] == 'TRAIN':
                train_sick.append((tokens[1], tokens[2], tokens[4]))
                f1.write(tokens[1] + '\n' + tokens[2] + '\n')
            elif tokens[11] == 'TEST':
                test_sick.append((tokens[1], tokens[2], tokens[4]))
                f2.write(tokens[1] + '\n' + tokens[2] + '\n')
            else:
                trial_sick.append((tokens[1], tokens[2], tokens[4]))
                f3.write(tokens[1] + '\n' + tokens[2] + '\n')
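The column handling in the cell above can be checked on a few synthetic rows. The sketch below reuses only the column indices from that cell (1 and 2 for the sentences, 4 for the score, 11 for the split); the rows, the `make_row` and `split_pairs` helpers and the `-` placeholders are invented for illustration.

```python
from collections import defaultdict


def make_row(pair_id, sent1, sent2, score, split):
    """Build a synthetic 12-column, tab-separated row ('-' marks unused columns)."""
    fields = [pair_id, sent1, sent2, '-', score] + ['-'] * 6 + [split]
    return '\t'.join(fields)


def split_pairs(lines):
    """Group (sentence1, sentence2, score) triples by the value in column 11."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.strip().split('\t')
        if not fields[0].isdigit():
            continue  # skip the header row
        groups[fields[11]].append((fields[1], fields[2], float(fields[4])))
    return groups


rows = [
    make_row('id', 'first', 'second', 'score', 'split'),  # header-like row
    make_row('1', 'A dog runs', 'A dog is running', '4.8', 'TRAIN'),
    make_row('2', 'A man sings', 'A cat sleeps', '1.2', 'TEST'),
    make_row('3', 'Kids play', 'Children are playing', '4.5', 'TRIAL'),
]

groups = split_pairs(rows)
for name in sorted(groups):
    print(name, groups[name])
```

Converting the score to `float` at parse time, as done here, is a design choice for the sketch; the notebook keeps it as a string and converts it later.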

In [12]:
# Print sample evaluation data
# Evaluation data is of the form: (sentence1, sentence2, similarity_score)
for example in train_sick[:5]:
    print example

('A group of kids is playing in a yard and an old man is standing in the background', 'A group of boys in a yard is playing and a man is standing in the background', '4.5')
('A group of children is playing in the house and there is no man standing in the background', 'A group of kids is playing in a yard and an old man is standing in the background', '3.2')
('The young boys are playing outdoors and the man is smiling nearby', 'The kids are playing outdoors near a man with a smile', '4.7')
('The kids are playing outdoors near a man with a smile', 'A group of kids is playing in a yard and an old man is standing in the background', '3.4')
('The young boys are playing outdoors and the man is smiling nearby', 'A group of kids is playing in a yard and an old man is standing in the background', '3.7')


In [13]:
# Calculate Pearson correlation score for the gensim implementation of sent2vec
def pearson_score_gensim(input_, loaded_model):
    input_cosine = []
    input_sick = []
    for example in input_:
        input_cosine.append(loaded_model.similarity(example[0], example[1]))
        input_sick.append(float(example[2]))
    return scipy.stats.pearsonr(input_cosine, input_sick), input_sick, input_cosine

In [14]:
train_score, train_sick_score, train_cosine = pearson_score_gensim(train_sick, loaded_model)
test_score, test_sick_score, test_cosine = pearson_score_gensim(test_sick, loaded_model)
trial_score, trial_sick_score, trial_cosine = pearson_score_gensim(trial_sick, loaded_model)
print train_score[0], test_score[0], trial_score[0]

0.292517510081 0.28036206529 0.246082734294


To build and train the C++ implementation of sent2vec, use the following commands. They produce object files for all the classes as well as the main binary, `fasttext`, which the cells below invoke.

In [15]:
! git clone https://github.com/epfml/sent2vec.git
% cd sent2vec
! make

/Users/prerna135/Documents/GitHub/gensim/sent2vec


In [16]:
# Train model using the original C++ implementation of sent2vec
start_time = time.time()
! ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 20 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.0001 -dropoutK 2 -bucket 2000000
print "\n\nTotal training time: %s seconds" % (time.time() - start_time)

Read 0M words
Number of words:  1837
Number of labels: 0
Progress: 100.0%  words/sec/thread: 27315  lr: 0.000000  loss: 3.016871  eta: 0h0m


Total training time: 25.7511711121 seconds


In [17]:
# Get sentence vectors from trained sent2vec model
! ./fasttext print-sentence-vectors my_model.bin < ../train_sick.txt > train_output.txt
! ./fasttext print-sentence-vectors my_model.bin < ../test_sick.txt > test_output.txt
! ./fasttext print-sentence-vectors my_model.bin < ../trial_sick.txt > trial_output.txt

In [18]:
def similarity(sent1, sent2):
    return dot(matutils.unitvec(sent1), matutils.unitvec(sent2))

In [19]:
# Calculate Pearson correlation score for the original C++ implementation of sent2vec
def pearson_score_original(filename, input_sick):
    input_cosine = []
    with open(filename) as f:
        input_ = []
        for i, line in enumerate(f):
            line = line.strip().split()
            input_.append([float(j) for j in line])
            if i % 2 != 0:
                input_cosine.append(similarity(np.array(input_[i]), np.array(input_[i-1])))
    return scipy.stats.pearsonr(input_cosine, input_sick), input_cosine

In [20]:
train_score, train_cosine = pearson_score_original('train_output.txt', train_sick_score)
test_score, test_cosine = pearson_score_original('test_output.txt', test_sick_score)
trial_score, trial_cosine = pearson_score_original('trial_output.txt', trial_sick_score)
print train_score[0], test_score[0], trial_score[0]

0.316368465644 0.300564114583 0.326315285004


# Evaluation Results

| S.No. | Model Name                              | Training time (in seconds) | Pearson's r on SICK test set    |
|-------|-----------------------------------------|----------------------------|---------------------------------|
| 1.    | Gensim implementation of sent2vec       | 309.93                     | 0.28                            |
| 2.    | Original C++ implementation of sent2vec | 25.75                      | 0.30                            |
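To put the training times in proportion, the ratio can be computed directly from the table. The numbers are copied from the table; note that the C++ run above was started with `-thread 20`, so the two timings are not necessarily like for like.

```python
gensim_seconds = 309.93  # training time of the gensim implementation (from the table)
cpp_seconds = 25.75      # training time of the C++ implementation (from the table)

slowdown = gensim_seconds / cpp_seconds
print("gensim / C++ training time: %.1fx" % slowdown)
```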

# Downstream Supervised Evaluation

Sentence embeddings are evaluated on several supervised classification tasks. We evaluate classification of movie review sentiment (MR) (Pang & Lee, 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004) and question type classification (TREC) (Voorhees, 2002). To classify, we use the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts). Sent2Vec embeddings are inferred from input sentences and fed directly to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the [MR and SUBJ](https://www.cs.cornell.edu/people/pabo/movie-review-data/) datasets. For these two datasets, nested cross-validation is used to tune the L2 penalty. For the [TREC dataset](http://cogcomp.cs.illinois.edu/Data/QA/QC/), the accuracy is computed on the test set.
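The nested cross-validation scheme can be sketched in plain Python as follows. This is not the skip-thoughts evaluation code: `kfold_indices`, `nested_kfold` and the dummy scorer are hypothetical, and the candidate values 1, 2, 4, 8 are chosen only because they resemble the first element of the tuples logged below, which appears to be the penalty setting being tried.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def nested_kfold(n_samples, outer_k, inner_k, candidates, score_fn):
    """For each outer fold, pick the candidate with the best mean inner-fold score.

    `score_fn(candidate, train_idx, test_idx)` is a user-supplied callable that
    trains on `train_idx` and returns an accuracy measured on `test_idx`.
    """
    outer_scores = []
    for outer_train, outer_test in kfold_indices(n_samples, outer_k):
        best_candidate, best_mean = None, float('-inf')
        for candidate in candidates:
            inner = []
            for tr, te in kfold_indices(len(outer_train), inner_k):
                inner.append(score_fn(candidate,
                                      [outer_train[i] for i in tr],
                                      [outer_train[i] for i in te]))
            mean = sum(inner) / float(len(inner))
            if mean > best_mean:
                best_candidate, best_mean = candidate, mean
        # The held-out outer fold is scored only once, with the tuned candidate.
        outer_scores.append(score_fn(best_candidate, outer_train, outer_test))
    return outer_scores


def dummy_score(c, train_idx, test_idx):
    """Dummy scorer for illustration: pretends that c = 4 always works best."""
    return 1.0 / (1.0 + abs(c - 4))


scores = nested_kfold(20, outer_k=5, inner_k=3, candidates=[1, 2, 4, 8],
                      score_fn=dummy_score)
print(scores)
```

The point of the inner loop is that the penalty is tuned without ever looking at the outer test fold, so the reported accuracies are not biased by the tuning.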

In [21]:
% cd ..

/Users/prerna135/Documents/GitHub/gensim


In [22]:
eval_classification.eval_nested_kfold(loaded_model, 'SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
(1, 0.58111111111111113)
(1, 0.56222222222222218)
(1, 0.59111111111111114)
(1, 0.57777777777777772)
(1, 0.58555555555555561)
(1, 0.58777777777777773)
(1, 0.55333333333333334)
(1, 0.54888888888888887)
(1, 0.58111111111111113)
(1, 0.53111111111111109)
(2, 0.58111111111111113)
(2, 0.56444444444444442)
(2, 0.58777777777777773)
(2, 0.57666666666666666)
(2, 0.58555555555555561)
(2, 0.57999999999999996)
(2, 0.55000000000000004)
(2, 0.55555555555555558)
(2, 0.5755555555555556)
(2, 0.52666666666666662)
(4, 0.58444444444444443)
(4, 0.57111111111111112)
(4, 0.58555555555555561)
(4, 0.57333333333333336)
(4, 0.58777777777777773)
(4, 0.5822222222222222)
(4, 0.55111111111111111)
(4, 0.55666666666666664)
(4, 0.57666666666666666)
(4, 0.52222222222222225)
(8, 0.58666666666666667)
(8, 0.56777777777777783)
(8, 0.5822222222222222)
(8, 0.57111111111111112)
(8, 0.5822222222222222)
(8, 0.58444444444444443)
(8, 0.54888888888888887)
(8, 0.55222222222222217)
(8, 0.5822222222222222)


[0.54500000000000004,
 0.57799999999999996,
 0.58799999999999997,
 0.57999999999999996,
 0.57099999999999995,
 0.59599999999999997,
 0.56799999999999995,
 0.55600000000000005,
 0.57199999999999995,
 0.53000000000000003]

In [23]:
eval_classification.eval_nested_kfold(loaded_model, 'MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
(1, 0.546875)
(1, 0.57291666666666663)
(1, 0.55833333333333335)
(1, 0.51979166666666665)
(1, 0.55833333333333335)
(1, 0.57664233576642332)
(1, 0.55474452554744524)
(1, 0.54744525547445255)
(1, 0.58394160583941601)
(1, 0.55265901981230448)
(2, 0.55000000000000004)
(2, 0.56770833333333337)
(2, 0.55625000000000002)
(2, 0.5229166666666667)
(2, 0.55937499999999996)
(2, 0.57455683003128255)
(2, 0.55683003128258601)
(2, 0.54640250260688217)
(2, 0.58289885297184563)
(2, 0.55161626694473409)
(4, 0.55104166666666665)
(4, 0.5697916666666667)
(4, 0.55833333333333335)
(4, 0.51979166666666665)
(4, 0.55937499999999996)
(4, 0.5776850886339937)
(4, 0.55578727841501563)
(4, 0.54223149113660063)
(4, 0.58394160583941601)
(4, 0.54953076120959332)
(8, 0.55104166666666665)
(8, 0.56874999999999998)
(8, 0.55729166666666663)
(8, 0.52083333333333337)
(8, 0.55729166666666663)
(8, 0.5776850886339937)
(8, 0.55891553701772678)
(8, 0.54014598540145986)
(8, 0.5849843587069864)
(8, 0.54744

[0.51921274601686973,
 0.55295220243673848,
 0.56378986866791747,
 0.53283302063789872,
 0.54221388367729828,
 0.57129455909943716,
 0.55909943714821764,
 0.57035647279549717,
 0.575046904315197,
 0.56191369606003749]

In [25]:
eval_trec.evaluate(loaded_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.456


# Evaluation of Doc2Vec

In [26]:
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])

In [27]:
train_corpus = list(read_corpus(lee_train_file))

In [28]:
# Doc2Vec model1 with PV-DM and sum of context word vectors
doc2vec_model1 = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model1.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [29]:
# Doc2Vec model2 with PV-DBOW (dm=0); dm_mean is not set here and only affects PV-DM
doc2vec_model2 = gensim.models.doc2vec.Doc2Vec(dm=0, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model2.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [30]:
# Doc2Vec model3 with PV-DM and mean of context word vectors
doc2vec_model3 = gensim.models.doc2vec.Doc2Vec(dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model3.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [31]:
# Doc2Vec model4 with PV-DBOW (dm=0); dm_mean=1 is set, but it only affects PV-DM
doc2vec_model4 = gensim.models.doc2vec.Doc2Vec(dm=0, dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
doc2vec_model4.build_vocab(train_corpus)

INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting layer weights


In [32]:
%time doc2vec_model1.train(train_corpus, total_examples=doc2vec_model1.corpus_count, epochs=doc2vec_model1.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 67.53% examples, 485065 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724805 effective words) took 1.5s, 495613 effective words/s
CPU times: user 3.41 s, sys: 231 ms, total: 3.64 s
Wall time: 1.47 s


724805

In [33]:
%time doc2vec_model2.train(train_corpus, total_examples=doc2vec_model2.corpus_count, epochs=doc2vec_model2.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 90.85% examples, 647487 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724867 effective words) took 1.1s, 648357 effective words/s
CPU times: user 2.73 s, sys: 122 ms, total: 2.85 s
Wall time: 1.13 s


724867

In [34]:
%time doc2vec_model3.train(train_corpus, total_examples=doc2vec_model3.corpus_count, epochs=doc2vec_model3.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 68.33% examples, 488054 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724929 effective words) took 1.5s, 481754 effective words/s
CPU times: user 3.42 s, sys: 235 ms, total: 3.65 s
Wall time: 1.51 s


724929

In [35]:
%time doc2vec_model4.train(train_corpus, total_examples=doc2vec_model4.corpus_count, epochs=doc2vec_model4.iter)

INFO:gensim.models.word2vec:training model with 3 workers on 1750 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=10 window=5
INFO:gensim.models.word2vec:PROGRESS: at 88.33% examples, 633312 words/s, in_qsize 6, out_qsize 0
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 1163040 raw words (724371 effective words) took 1.1s, 633528 effective words/s
CPU times: user 2.73 s, sys: 123 ms, total: 2.85 s
Wall time: 1.15 s


724371

In [37]:
print similarity(doc2vec_model1.infer_vector(['The', 'sky', 'is', 'blue']), doc2vec_model1.infer_vector(['This', 'present', 'is', 'great']))
print similarity(doc2vec_model2.infer_vector(['The', 'sky', 'is', 'blue']), doc2vec_model2.infer_vector(['This', 'present', 'is', 'great']))
print similarity(doc2vec_model3.infer_vector(['The', 'sky', 'is', 'blue']), doc2vec_model3.infer_vector(['This', 'present', 'is', 'great']))
print similarity(doc2vec_model4.infer_vector(['The', 'sky', 'is', 'blue']), doc2vec_model4.infer_vector(['This', 'present', 'is', 'great']))

0.676251668769
0.617256937593
0.559471642215
0.536031209903


In [38]:
eval_sick.evaluate(doc2vec_model1, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.5700 - val_loss: 1.5372
Epoch 2/10
0s - loss: 1.5130 - val_loss: 1.4934
Epoch 3/10
0s - loss: 1.4788 - val_loss: 1.4672
Epoch 4/10
0s - loss: 1.4589 - val_loss: 1.4519
Epoch 5/10
0s - loss: 1.4478 - val_loss: 1.4429
Epoch 6/10
0s - loss: 1.4416 - val_loss: 1.4377
Epoch 7/10
0s - loss: 1.4383 - val_loss: 1.4345
Epoch 8/10
0s - loss: 1.4365 - val_loss: 1.4325
Epoch 9/10
0s - loss: 1.4355 - val_loss: 1.4311
Epoch 10/10
0s - loss: 1.4350 - val_loss: 1.4302
-0.00506274265731
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4347 - val_loss: 1.4296
Epoch 2/10
0s - loss: 1.4345 - val_loss: 1.4291
Epoch 3/10
0s - loss: 1.4344 - val_loss: 1.4288
Epoch 4/10
0s - loss: 1.4343 - val_loss: 1.4286
Epoch 5/10
0s - loss: 1.4343 - v

array([ 3.4873876 ,  3.5041741 ,  3.49764264, ...,  3.49422558,
        3.48878673,  3.50415545])

In [39]:
eval_classification.eval_nested_kfold(doc2vec_model1, 'SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.46999999999999997)
(1, 0.47999999999999998)
(1, 0.49888888888888888)
(1, 0.46444444444444444)
(1, 0.46333333333333332)
(1, 0.49111111111111111)
(1, 0.48555555555555557)
(1, 0.4622222222222222)
(1, 0.48777777777777775)
(1, 0.47333333333333333)
(2, 0.50111111111111106)
(2, 0.48333333333333334)
(2, 0.51777777777777778)
(2, 0.47111111111111109)
(2, 0.4622222222222222)
(2, 0.48888888888888887)
(2, 0.47555555555555556)
(2, 0.46777777777777779)
(2, 0.49555555555555558)
(2, 0.46777777777777779)
(4, 0.53000000000000003)
(4, 0.4777777777777778)
(4, 0.52111111111111108)
(4, 0.49111111111111111)
(4, 0.45444444444444443)
(4, 0.51000000000000001)
(4, 0.46666666666666667)
(4, 0.47555555555555556)
(4, 0.48444444444444446)
(4, 0.4811111111111111)
(8, 0.51444444444444448)
(8, 0.4777777777777778)
(8, 0.53000000000000003)
(8, 0.5033333333333333)
(8, 0.47555555555555556)
(8, 0.49666666666666665)
(8, 0.46444444444444444)
(8, 0.50666666666666671)
(8, 0.47888888888888886)
(

[0.48399999999999999,
 0.51200000000000001,
 0.503,
 0.50900000000000001,
 0.52800000000000002,
 0.501,
 0.48999999999999999,
 0.49399999999999999,
 0.48299999999999998,
 0.50800000000000001]

In [40]:
eval_classification.eval_nested_kfold(doc2vec_model1, 'MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.49687500000000001)
(1, 0.49375000000000002)
(1, 0.515625)
(1, 0.50416666666666665)
(1, 0.48020833333333335)
(1, 0.49009384775808135)
(1, 0.49426485922836289)
(1, 0.46089676746611052)
(1, 0.48175182481751827)
(1, 0.48383733055265904)
(2, 0.49791666666666667)
(2, 0.49375000000000002)
(2, 0.51458333333333328)
(2, 0.5072916666666667)
(2, 0.48020833333333335)
(2, 0.50260688216892602)
(2, 0.50052137643378525)
(2, 0.46089676746611052)
(2, 0.48279457768508866)
(2, 0.48592283628779981)
(4, 0.5)
(4, 0.49166666666666664)
(4, 0.51666666666666672)
(4, 0.51979166666666665)
(4, 0.48541666666666666)
(4, 0.50573514077163717)
(4, 0.51303441084462986)
(4, 0.45255474452554745)
(4, 0.48070907194994789)
(4, 0.45881126173096975)
(8, 0.49687500000000001)
(8, 0.48958333333333331)
(8, 0.51666666666666672)
(8, 0.515625)
(8, 0.48749999999999999)
(8, 0.50156412930135563)
(8, 0.51824817518248179)
(8, 0.44942648592283629)
(8, 0.48175182481751827)
(8, 0.47862356621480712)
(16, 0.49

[0.51921274601686973,
 0.48266166822867856,
 0.51125703564727953,
 0.50844277673545968,
 0.50375234521575984,
 0.46622889305816134,
 0.50187617260787998,
 0.4896810506566604,
 0.49155722326454032,
 0.47467166979362102]

In [41]:
eval_sick.evaluate(doc2vec_model2, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.5704 - val_loss: 1.5373
Epoch 2/10
0s - loss: 1.5132 - val_loss: 1.4935
Epoch 3/10
0s - loss: 1.4789 - val_loss: 1.4673
Epoch 4/10
0s - loss: 1.4590 - val_loss: 1.4520
Epoch 5/10
0s - loss: 1.4478 - val_loss: 1.4431
Epoch 6/10
0s - loss: 1.4416 - val_loss: 1.4378
Epoch 7/10
0s - loss: 1.4383 - val_loss: 1.4346
Epoch 8/10
0s - loss: 1.4365 - val_loss: 1.4326
Epoch 9/10
0s - loss: 1.4356 - val_loss: 1.4313
Epoch 10/10
0s - loss: 1.4350 - val_loss: 1.4304
-0.00891048137255
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4347 - val_loss: 1.4298
Epoch 2/10
0s - loss: 1.4346 - val_loss: 1.4293
Epoch 3/10
0s - loss: 1.4345 - val_loss: 1.4290
Epoch 4/10
0s - loss: 1.4344 - val_loss: 1.4288
Epoch 5/10
0s - loss: 1.4343 - v

array([ 3.4870036 ,  3.50262909,  3.49707235, ...,  3.49304635,
        3.48765391,  3.50640745])

In [42]:
eval_classification.eval_nested_kfold(doc2vec_model2, 'SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.46999999999999997)
(1, 0.47999999999999998)
(1, 0.49888888888888888)
(1, 0.46444444444444444)
(1, 0.46333333333333332)
(1, 0.49111111111111111)
(1, 0.48555555555555557)
(1, 0.4622222222222222)
(1, 0.48777777777777775)
(1, 0.47333333333333333)
(2, 0.50111111111111106)
(2, 0.48333333333333334)
(2, 0.51777777777777778)
(2, 0.47111111111111109)
(2, 0.4622222222222222)
(2, 0.48888888888888887)
(2, 0.47555555555555556)
(2, 0.46777777777777779)
(2, 0.49555555555555558)
(2, 0.46777777777777779)
(4, 0.53000000000000003)
(4, 0.4777777777777778)
(4, 0.52111111111111108)
(4, 0.49111111111111111)
(4, 0.45444444444444443)
(4, 0.51000000000000001)
(4, 0.46666666666666667)
(4, 0.47555555555555556)
(4, 0.48444444444444446)
(4, 0.4811111111111111)
(8, 0.51444444444444448)
(8, 0.4777777777777778)
(8, 0.53000000000000003)
(8, 0.5033333333333333)
(8, 0.47555555555555556)
(8, 0.49666666666666665)
(8, 0.46444444444444444)
(8, 0.50666666666666671)
(8, 0.47888888888888886)
(

[0.48399999999999999,
 0.51200000000000001,
 0.503,
 0.50900000000000001,
 0.52800000000000002,
 0.501,
 0.48999999999999999,
 0.49399999999999999,
 0.48299999999999998,
 0.50800000000000001]

In [43]:
eval_classification.eval_nested_kfold(doc2vec_model2, 'MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.49687500000000001)
(1, 0.49375000000000002)
(1, 0.515625)
(1, 0.50416666666666665)
(1, 0.48020833333333335)
(1, 0.49009384775808135)
(1, 0.49426485922836289)
(1, 0.46089676746611052)
(1, 0.48175182481751827)
(1, 0.48383733055265904)
(2, 0.49791666666666667)
(2, 0.49375000000000002)
(2, 0.51458333333333328)
(2, 0.5072916666666667)
(2, 0.48020833333333335)
(2, 0.50260688216892602)
(2, 0.50052137643378525)
(2, 0.46089676746611052)
(2, 0.48279457768508866)
(2, 0.48592283628779981)
(4, 0.5)
(4, 0.49166666666666664)
(4, 0.51666666666666672)
(4, 0.51979166666666665)
(4, 0.48541666666666666)
(4, 0.50573514077163717)
(4, 0.51303441084462986)
(4, 0.45255474452554745)
(4, 0.48070907194994789)
(4, 0.45881126173096975)
(8, 0.49687500000000001)
(8, 0.48958333333333331)
(8, 0.51666666666666672)
(8, 0.515625)
(8, 0.48749999999999999)
(8, 0.50156412930135563)
(8, 0.51824817518248179)
(8, 0.44942648592283629)
(8, 0.48175182481751827)
(8, 0.47862356621480712)
(16, 0.49

[0.51921274601686973,
 0.48266166822867856,
 0.51125703564727953,
 0.50844277673545968,
 0.50375234521575984,
 0.46622889305816134,
 0.50187617260787998,
 0.4896810506566604,
 0.49155722326454032,
 0.47467166979362102]

In [44]:
eval_sick.evaluate(doc2vec_model3, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.5717 - val_loss: 1.5381
Epoch 2/10
0s - loss: 1.5141 - val_loss: 1.4940
Epoch 3/10
0s - loss: 1.4795 - val_loss: 1.4675
Epoch 4/10
0s - loss: 1.4594 - val_loss: 1.4520
Epoch 5/10
0s - loss: 1.4480 - val_loss: 1.4430
Epoch 6/10
0s - loss: 1.4418 - val_loss: 1.4377
Epoch 7/10
0s - loss: 1.4384 - val_loss: 1.4345
Epoch 8/10
0s - loss: 1.4366 - val_loss: 1.4324
Epoch 9/10
0s - loss: 1.4356 - val_loss: 1.4311
Epoch 10/10
0s - loss: 1.4350 - val_loss: 1.4302
-0.00473051039803
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4347 - val_loss: 1.4295
Epoch 2/10
0s - loss: 1.4346 - val_loss: 1.4291
Epoch 3/10
0s - loss: 1.4344 - val_loss: 1.4288
Epoch 4/10
0s - loss: 1.4344 - val_loss: 1.4285
Epoch 5/10
0s - loss: 1.4343 - v

array([ 3.48656098,  3.50112534,  3.49743778, ...,  3.49291068,
        3.48845585,  3.50360691])

In [45]:
eval_classification.eval_nested_kfold(doc2vec_model3, 'SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.46999999999999997)
(1, 0.47999999999999998)
(1, 0.49888888888888888)
(1, 0.46444444444444444)
(1, 0.46333333333333332)
(1, 0.49111111111111111)
(1, 0.48555555555555557)
(1, 0.4622222222222222)
(1, 0.48777777777777775)
(1, 0.47333333333333333)
(2, 0.50111111111111106)
(2, 0.48333333333333334)
(2, 0.51777777777777778)
(2, 0.47111111111111109)
(2, 0.4622222222222222)
(2, 0.48888888888888887)
(2, 0.47555555555555556)
(2, 0.46777777777777779)
(2, 0.49555555555555558)
(2, 0.46777777777777779)
(4, 0.53000000000000003)
(4, 0.4777777777777778)
(4, 0.52111111111111108)
(4, 0.49111111111111111)
(4, 0.45444444444444443)
(4, 0.51000000000000001)
(4, 0.46666666666666667)
(4, 0.47555555555555556)
(4, 0.48444444444444446)
(4, 0.4811111111111111)
(8, 0.51444444444444448)
(8, 0.4777777777777778)
(8, 0.53000000000000003)
(8, 0.5033333333333333)
(8, 0.47555555555555556)
(8, 0.49666666666666665)
(8, 0.46444444444444444)
(8, 0.50666666666666671)
(8, 0.47888888888888886)
(

[0.48399999999999999,
 0.51200000000000001,
 0.503,
 0.50900000000000001,
 0.52800000000000002,
 0.501,
 0.48999999999999999,
 0.49399999999999999,
 0.48299999999999998,
 0.50800000000000001]

In [46]:
eval_classification.eval_nested_kfold(doc2vec_model3, 'MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.49687500000000001)
(1, 0.49375000000000002)
(1, 0.515625)
(1, 0.50416666666666665)
(1, 0.48020833333333335)
(1, 0.49009384775808135)
(1, 0.49426485922836289)
(1, 0.46089676746611052)
(1, 0.48175182481751827)
(1, 0.48383733055265904)
(2, 0.49791666666666667)
(2, 0.49375000000000002)
(2, 0.51458333333333328)
(2, 0.5072916666666667)
(2, 0.48020833333333335)
(2, 0.50260688216892602)
(2, 0.50052137643378525)
(2, 0.46089676746611052)
(2, 0.48279457768508866)
(2, 0.48592283628779981)
(4, 0.5)
(4, 0.49166666666666664)
(4, 0.51666666666666672)
(4, 0.51979166666666665)
(4, 0.48541666666666666)
(4, 0.50573514077163717)
(4, 0.51303441084462986)
(4, 0.45255474452554745)
(4, 0.48070907194994789)
(4, 0.45881126173096975)
(8, 0.49687500000000001)
(8, 0.48958333333333331)
(8, 0.51666666666666672)
(8, 0.515625)
(8, 0.48749999999999999)
(8, 0.50156412930135563)
(8, 0.51824817518248179)
(8, 0.44942648592283629)
(8, 0.48175182481751827)
(8, 0.47862356621480712)
(16, 0.49

[0.51921274601686973,
 0.48266166822867856,
 0.51125703564727953,
 0.50844277673545968,
 0.50375234521575984,
 0.46622889305816134,
 0.50187617260787998,
 0.4896810506566604,
 0.49155722326454032,
 0.47467166979362102]

In [47]:
eval_sick.evaluate(doc2vec_model4, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.5717 - val_loss: 1.5383
Epoch 2/10
0s - loss: 1.5141 - val_loss: 1.4941
Epoch 3/10
0s - loss: 1.4795 - val_loss: 1.4676
Epoch 4/10
0s - loss: 1.4594 - val_loss: 1.4521
Epoch 5/10
0s - loss: 1.4480 - val_loss: 1.4431
Epoch 6/10
0s - loss: 1.4417 - val_loss: 1.4377
Epoch 7/10
0s - loss: 1.4384 - val_loss: 1.4345
Epoch 8/10
0s - loss: 1.4365 - val_loss: 1.4325
Epoch 9/10
0s - loss: 1.4356 - val_loss: 1.4312
Epoch 10/10
0s - loss: 1.4350 - val_loss: 1.4303
0.000596625258101
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4347 - val_loss: 1.4296
Epoch 2/10
0s - loss: 1.4345 - val_loss: 1.4292
Epoch 3/10
0s - loss: 1.4344 - val_loss: 1.4289
Epoch 4/10
0s - loss: 1.4344 - val_loss: 1.4286
Epoch 5/10
0s - loss: 1.4343 - v

array([ 3.48697639,  3.50321374,  3.50148225, ...,  3.49416015,
        3.48810926,  3.50739142])

In [48]:
eval_classification.eval_nested_kfold(doc2vec_model4, 'SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.46999999999999997)
(1, 0.47999999999999998)
(1, 0.49888888888888888)
(1, 0.46444444444444444)
(1, 0.46333333333333332)
(1, 0.49111111111111111)
(1, 0.48555555555555557)
(1, 0.4622222222222222)
(1, 0.48777777777777775)
(1, 0.47333333333333333)
(2, 0.50111111111111106)
(2, 0.48333333333333334)
(2, 0.51777777777777778)
(2, 0.47111111111111109)
(2, 0.4622222222222222)
(2, 0.48888888888888887)
(2, 0.47555555555555556)
(2, 0.46777777777777779)
(2, 0.49555555555555558)
(2, 0.46777777777777779)
(4, 0.53000000000000003)
(4, 0.4777777777777778)
(4, 0.52111111111111108)
(4, 0.49111111111111111)
(4, 0.45444444444444443)
(4, 0.51000000000000001)
(4, 0.46666666666666667)
(4, 0.47555555555555556)
(4, 0.48444444444444446)
(4, 0.4811111111111111)
(8, 0.51444444444444448)
(8, 0.4777777777777778)
(8, 0.53000000000000003)
(8, 0.5033333333333333)
(8, 0.47555555555555556)
(8, 0.49666666666666665)
(8, 0.46444444444444444)
(8, 0.50666666666666671)
(8, 0.47888888888888886)
(

[0.48399999999999999,
 0.51200000000000001,
 0.503,
 0.50900000000000001,
 0.52800000000000002,
 0.501,
 0.48999999999999999,
 0.49399999999999999,
 0.48299999999999998,
 0.50800000000000001]

In [49]:
eval_classification.eval_nested_kfold(doc2vec_model4, 'MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(1, 0.49687500000000001)
(1, 0.49375000000000002)
(1, 0.515625)
(1, 0.50416666666666665)
(1, 0.48020833333333335)
(1, 0.49009384775808135)
(1, 0.49426485922836289)
(1, 0.46089676746611052)
(1, 0.48175182481751827)
(1, 0.48383733055265904)
(2, 0.49791666666666667)
(2, 0.49375000000000002)
(2, 0.51458333333333328)
(2, 0.5072916666666667)
(2, 0.48020833333333335)
(2, 0.50260688216892602)
(2, 0.50052137643378525)
(2, 0.46089676746611052)
(2, 0.48279457768508866)
(2, 0.48592283628779981)
(4, 0.5)
(4, 0.49166666666666664)
(4, 0.51666666666666672)
(4, 0.51979166666666665)
(4, 0.48541666666666666)
(4, 0.50573514077163717)
(4, 0.51303441084462986)
(4, 0.45255474452554745)
(4, 0.48070907194994789)
(4, 0.45881126173096975)
(8, 0.49687500000000001)
(8, 0.48958333333333331)
(8, 0.51666666666666672)
(8, 0.515625)
(8, 0.48749999999999999)
(8, 0.50156412930135563)
(8, 0.51824817518248179)
(8, 0.44942648592283629)
(8, 0.48175182481751827)
(8, 0.47862356621480712)
(16, 0.49

[0.51921274601686973,
 0.48266166822867856,
 0.51125703564727953,
 0.50844277673545968,
 0.50375234521575984,
 0.46622889305816134,
 0.50187617260787998,
 0.4896810506566604,
 0.49155722326454032,
 0.47467166979362102]

In [50]:
eval_trec.evaluate(doc2vec_model1, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.198


In [51]:
eval_trec.evaluate(doc2vec_model2, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.198


In [52]:
eval_trec.evaluate(doc2vec_model3, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.198


In [53]:
eval_trec.evaluate(doc2vec_model4, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.198


# Evaluation of sentence vectors obtained from averaging FastText word vectors
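
To make "averaging word vectors" concrete, here is a minimal sketch of the idea. It is not the code used by the evaluation scripts: the `toy_vectors` table and the `average_sentence_vector` helper are hypothetical names introduced only for illustration, and a trained model would supply the real vectors.

```python
# Toy word-vector table; the words and values are made up for illustration.
toy_vectors = {
    "the": [0.0, 2.0, 4.0],
    "cat": [1.0, 0.0, 2.0],
    "sat": [2.0, 1.0, 0.0],
}


def average_sentence_vector(tokens, vectors, dim):
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size.
        return [0.0] * dim
    # zip(*known) walks the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]


# "down" is not in the toy table, so only three vectors are averaged.
sentence_vector = average_sentence_vector("the cat sat down".split(), toy_vectors, dim=3)
print(sentence_vector)
```

Skipping unknown tokens is a simplification made for this sketch; FastText can also compose vectors for out-of-vocabulary words from character n-grams.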

In [54]:
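# Stream the Lee corpus: one sentence per line, tokens separated by whitespace
lee_data = LineSentence(lee_train_file)
# Same hyperparameters as the other models in this comparison
fasttext_model = ft(size=100, alpha=0.2, negative=10, max_vocab_size=30000000, seed=42, iter=20)
fasttext_model.build_vocab(lee_data)
# Time only the training step, not the vocabulary building
start_time = time.time()
fasttext_model.train(lee_data, total_examples=fasttext_model.corpus_count, epochs=fasttext_model.iter)
print("\n\nTotal training time: %s seconds" % (time.time() - start_time))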

INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 10781 word types from a corpus of 59890 raw words and 300 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
INFO:gensim.models.word2vec:min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 10781 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 45 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
INFO:gensim.models.word2vec:estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.fasttext:Total number of ngrams is 17006
INFO:gens

In [57]:
fasttext_model.save('ft1')

INFO:gensim.utils:saving FastText object under ft1, separately None
INFO:gensim.utils:not storing attribute syn0norm
INFO:gensim.utils:not storing attribute syn0_ngrams_norm
INFO:gensim.utils:not storing attribute syn0_vocab_norm
INFO:gensim.utils:saved ft1


In [3]:
ft_loaded_model = ft.load('ft1')

INFO:gensim.utils:loading FastText object from ft1
INFO:gensim.utils:loading wv recursively from ft1.wv.* with mmap=None
INFO:gensim.utils:setting ignored attribute syn0norm to None
INFO:gensim.utils:setting ignored attribute syn0_ngrams_norm to None
INFO:gensim.utils:setting ignored attribute syn0_vocab_norm to None
INFO:gensim.utils:loaded ft1


In [5]:
eval_sick.evaluate(ft_loaded_model, seed=42, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...


Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 2.0968 - val_loss: 1.4364
Epoch 2/10
0s - loss: 1.4732 - val_loss: 1.4281
Epoch 3/10
0s - loss: 1.4556 - val_loss: 1.4310
Epoch 4/10
0s - loss: 1.4531 - val_loss: 1.4348
Epoch 5/10
0s - loss: 1.4528 - val_loss: 1.4370
Epoch 6/10
0s - loss: 1.4526 - val_loss: 1.4386
Epoch 7/10
0s - loss: 1.4523 - val_loss: 1.4400
Epoch 8/10
0s - loss: 1.4521 - val_loss: 1.4412
Epoch 9/10
0s - loss: 1.4519 - val_loss: 1.4423
Epoch 10/10
0s - loss: 1.4517 - val_loss: 1.4432
0.108153174068
Train on 4500 samples, validate on 500 samples
Epoch 1/10
0s - loss: 1.4515 - val_loss: 1.4439
Epoch 2/10
0s - loss: 1.4514 - val_loss: 1.4446
Epoch 3/10
0s - loss: 1.4512 - val_loss: 1.4452
Epoch 4/10
0s - loss: 1.4511 - val_loss: 1.4456
Epoch 5/10
0s - loss: 1.4510 - val_loss: 1.4460
Epoch 6/10
0s - loss: 1.4509 - val_loss: 1.4464
Epoch 7/10
0s - loss: 1.4508 - val_loss: 1.4467
Epoch 8/10
0s - loss: 1.4507 - val_loss: 1.4469
Epoch 9/10
0s - loss: 1.45

array([ 3.41832314,  3.41832314,  3.35509975, ...,  3.43674037,
        3.54149695,  3.44111353])
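
The results table at the end of this notebook summarises SICK performance with Pearson, Spearman and MSE. As a rough, self-contained illustration of two of these metrics, the sketch below computes Pearson correlation and mean squared error on made-up scores. The helpers `pearson_r` and `mean_squared_error` are hypothetical names written for this example; they are not the code inside `eval_sick`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def mean_squared_error(xs, ys):
    """Average of the squared differences between paired values."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up gold relatedness scores and predictions, for illustration only.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 4.0, 6.0, 8.0, 10.0]

r = pearson_r(gold, predicted)
mse = mean_squared_error(gold, predicted)
print("Pearson r: %.3f, MSE: %.3f" % (r, mse))
```

In this toy case the predictions correlate perfectly with the gold scores while the MSE is still large, which is one reason to read the correlation and error figures together.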

In [6]:
eval_classification.eval_nested_kfold(ft_loaded_model, 'SUBJ', use_nb=False, model_name='fasttext')

Computing sentence vectors...
(1, 0.56111111111111112)
(1, 0.56444444444444442)
(1, 0.54555555555555557)
(1, 0.58555555555555561)
(1, 0.54666666666666663)
(1, 0.55000000000000004)
(1, 0.56666666666666665)
(1, 0.54333333333333333)
(1, 0.57222222222222219)
(1, 0.53666666666666663)
(2, 0.56111111111111112)
(2, 0.56333333333333335)
(2, 0.54555555555555557)
(2, 0.58555555555555561)
(2, 0.54666666666666663)
(2, 0.55000000000000004)
(2, 0.56666666666666665)
(2, 0.54333333333333333)
(2, 0.57222222222222219)
(2, 0.53666666666666663)
(4, 0.56111111111111112)
(4, 0.56333333333333335)
(4, 0.54555555555555557)
(4, 0.58555555555555561)
(4, 0.54666666666666663)
(4, 0.55000000000000004)
(4, 0.56666666666666665)
(4, 0.54333333333333333)
(4, 0.57222222222222219)
(4, 0.53666666666666663)
(8, 0.56111111111111112)
(8, 0.56333333333333335)
(8, 0.54555555555555557)
(8, 0.58555555555555561)
(8, 0.54666666666666663)
(8, 0.55000000000000004)
(8, 0.56666666666666665)
(8, 0.54333333333333333)
(8, 0.57222222222222

[0.54500000000000004,
 0.55900000000000005,
 0.55700000000000005,
 0.54900000000000004,
 0.57399999999999995,
 0.56100000000000005,
 0.56599999999999995,
 0.54000000000000004,
 0.56799999999999995,
 0.54200000000000004]

In [8]:
eval_classification.eval_nested_kfold(ft_loaded_model, 'MR', use_nb=False, model_name='fasttext')

Computing sentence vectors...
(1, 0.53020833333333328)
(1, 0.51249999999999996)
(1, 0.51770833333333333)
(1, 0.50937500000000002)
(1, 0.51041666666666663)
(1, 0.51199165797705948)
(1, 0.52763295099061525)
(1, 0.50573514077163717)
(1, 0.49635036496350365)
(1, 0.53284671532846717)
(2, 0.53020833333333328)
(2, 0.51249999999999996)
(2, 0.51770833333333333)
(2, 0.50937500000000002)
(2, 0.51041666666666663)
(2, 0.51199165797705948)
(2, 0.52763295099061525)
(2, 0.50573514077163717)
(2, 0.49635036496350365)
(2, 0.53284671532846717)
(4, 0.53020833333333328)
(4, 0.51249999999999996)
(4, 0.51770833333333333)
(4, 0.50937500000000002)
(4, 0.51041666666666663)
(4, 0.51199165797705948)
(4, 0.52763295099061525)
(4, 0.50573514077163717)
(4, 0.49635036496350365)
(4, 0.53284671532846717)
(8, 0.53020833333333328)
(8, 0.51249999999999996)
(8, 0.51770833333333333)
(8, 0.50937500000000002)
(8, 0.51041666666666663)
(8, 0.51199165797705948)
(8, 0.52763295099061525)
(8, 0.50573514077163717)
(8, 0.49635036496350

[0.49765698219306465,
 0.52389878163074044,
 0.51969981238273921,
 0.5206378986866792,
 0.50844277673545968,
 0.49530956848030017,
 0.52814258911819889,
 0.51782363977485923,
 0.50093808630393999,
 0.52908067542213888]

In [9]:
eval_trec.evaluate(ft_loaded_model, evalcv=False, evaltest=True, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.268


# Evaluation Results

| S.No. | Model Name                                | Training time (in seconds) | Pearson/Spearman/MSE on SICK test  | Mean SUBJ score | Mean MR score | TREC  |
|-------|-------------------------------------------|----------------------------|------------------------------------|-----------------|---------------|-------|
| 1.    | Gensim implementation of sent2vec         | 309.93                     | 0.39/0.37/0.86                     | 0.56            | 0.55          | 0.456 |
| 2.    | PV-DM with sum of context word vectors    | 3.64                       | 0.01/0.008/1.01                    | 0.50            | 0.49          | 0.198 |
| 3.    | PV-DM with mean of context word vectors   | 3.65                       | 0.01/0.007/1.01                    | 0.50            | 0.49          | 0.198 |
| 4.    | PV-DBOW with sum of context word vectors  | 2.85                       | 0.01/0.01/1.01                     | 0.50            | 0.49          | 0.198 |
| 5.    | PV-DBOW with mean of context word vectors | 2.85                       | 0.008/0.007/1.01                   | 0.50            | 0.49          | 0.198 |
| 6.    | Mean of FastText word vectors             | 1540.17                    | 0.17/0.16/1.00                     | 0.55            | 0.51          | 0.268 |
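
The "Mean SUBJ score" and "Mean MR score" columns can be recomputed from the lists printed by `eval_nested_kfold`, assuming each list holds one accuracy per outer fold (an assumption about the helper script, which is not shown in this notebook). A small sketch using the SUBJ list printed for the doc2vec models:

```python
# Scores copied from the SUBJ output of the doc2vec models above, shortened
# to three decimals. Treating them as one accuracy per outer fold is an
# assumption about what eval_nested_kfold returns.
subj_fold_scores = [0.484, 0.512, 0.503, 0.509, 0.528,
                    0.501, 0.490, 0.494, 0.483, 0.508]

# Plain arithmetic mean over the folds.
mean_subj = sum(subj_fold_scores) / len(subj_fold_scores)
print("Mean SUBJ score: %.2f" % mean_subj)
```

The same calculation applies to the MR lists and to the lists printed for the other models.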