## Introduction
It's general tendency of Users to construct document vectors from word vectors, as we have a really huge number of high-quality word-vectors (from Stanford, Google, Numberbatch, etc).

This is an attempt to compare different models with our mixture methods. We proposed different methods to make sentence vector from word vectors.
1. Average of all word-vectors
2. Average with weights (like TfIdf or some custom weights from a user)
3. [Smooth inverse frequency](https://openreview.net/pdf?id=SyK00v5xx)

In [1]:
from gensim.models import Word2Vec, KeyedVectors, TfidfModel
from gensim.parsing.preprocessing import STOPWORDS
from scipy.sparse.linalg import svds
from scipy.spatial.distance import cosine
import numpy as np

## Using pre-tained word2vec model

Download the pretrained model form [here](https://github.com/RaRe-Technologies/gensim-data/releases/tag/glove-wiki-gigaword-200) or use the code given below

In [None]:
import gensim.downloader as api
word2vec_model = api.load("glove-wiki-gigaword-200")

If we already have the wiki pretrained model then we can start without downloading.

In [2]:
word2vec_model = KeyedVectors.load_word2vec_format("glove-wiki-gigaword-200.gz")

In [3]:
np.shape(word2vec_model.syn0)

  """Entry point for launching an IPython kernel.


(400000, 200)

For doc2vec and sent2vec we'll train the model with [IMDB Dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

IMDB Dataset is fairly small sized dataset of 100,000 movie-reviews.

## Preprocessing with dataset

In [4]:
import locale
import glob
import os.path
import requests
import tarfile
import sys
import codecs
import smart_open

dirname = 'aclImdb'
filename = 'aclImdb_v1.tar.gz'
locale.setlocale(locale.LC_ALL, 'C')

if sys.version > '3':
    control_chars = [chr(0x85)]
else:
    control_chars = [unichr(0x85)]

# Convert text to lower-case and strip punctuation/symbols from words
def normalize_text(text):
    norm_text = text.lower()
    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')
    # Pad punctuation with spaces on both sides
    for char in ['.', '"', ',', '(', ')', '!', '?', ';', ':']:
        norm_text = norm_text.replace(char, ' ' + char + ' ')
    return norm_text

import time
start = time.clock()

if not os.path.isfile('aclImdb/alldata-id.txt'):
    if not os.path.isdir(dirname):
        if not os.path.isfile(filename):
            # Download IMDB archive
            print("Downloading IMDB archive...")
            url = u'http://ai.stanford.edu/~amaas/data/sentiment/' + filename
            r = requests.get(url)
            with open(filename, 'wb') as f:
                f.write(r.content)
        tar = tarfile.open(filename, mode='r')
        tar.extractall()
        tar.close()

    # Concatenate and normalize test/train data
    print("Cleaning up dataset...")
    folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg', 'train/unsup']
    alldata = u''
    for fol in folders:
        temp = u''
        output = fol.replace('/', '-') + '.txt'
        # Is there a better pattern to use?
        txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
        for txt in txt_files:
            with smart_open.smart_open(txt, "rb") as t:
                t_clean = t.read().decode("utf-8")
                for c in control_chars:
                    t_clean = t_clean.replace(c, ' ')
                temp += t_clean
            temp += "\n"
        temp_norm = normalize_text(temp)
        with smart_open.smart_open(os.path.join(dirname, output), "wb") as n:
            n.write(temp_norm.encode("utf-8"))
        alldata += temp_norm

    with smart_open.smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:
        for idx, line in enumerate(alldata.splitlines()):
            num_line = u"_*{0} {1}\n".format(idx, line)
            f.write(num_line.encode("utf-8"))

end = time.clock()
print ("Total running time: ", end-start)

('Total running time: ', 0.001635999999990645)


In [2]:
import os.path
assert os.path.isfile("aclImdb/alldata-id.txt"), "alldata-id.txt unavailable"

In [66]:
# -*- coding: utf-8 -*-
import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple

SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

alldocs = []  # Will hold all docs in original order
with open('aclImdb/alldata-id.txt') as alldata:
    for line_no, line in enumerate(alldata):
        tokens = gensim.utils.to_unicode(line).split()
        words = tokens[1:]
        tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
        split = ['train', 'test', 'extra', 'extra'][line_no//25000]  # 25k train, 25k test, 25k extra
        sentiment = [1.0, 0.0, 1.0, 0.0, None, None, None, None][line_no//12500] # [12.5K pos, 12.5K neg]*2 then unknown
        alldocs.append(SentimentDocument(words, tags, split, sentiment))

train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
doc_list = alldocs[:]  # For reshuffling per pass

print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))

100000 docs: 25000 train-sentiment, 25000 test-sentiment


## Training Doc2Vec

In [3]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

doc2vec_model = Doc2Vec(dm=1, dm_concat=1, vector_size=200, window=5, negative=5, hs=0, min_count=2, workers=cores)
doc2vec_model.build_vocab(alldocs)
models_by_name = OrderedDict((str(model), model) for model in [doc2vec_model])

In [8]:
import numpy as np
import statsmodels.api as sm
from random import sample

# For timing
from contextlib import contextmanager
from timeit import default_timer
import time 

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start
    
def logistic_predictor_from_data(train_targets, train_regressors):
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Report error rate on test_doc sentiments, using supplied model and train_docs"""

    train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if infer:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_docs]
    test_regressors = sm.add_constant(test_regressors)
    
    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)

  from pandas.core import datetools


In [9]:
from collections import defaultdict
best_error = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [None]:
from random import shuffle
import datetime

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)  # Shuffling gets best results
    
    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)
            duration = '%.1f' % elapsed()
            
        # Evaluate
        eval_duration = ''
        with elapsed_timer() as eval_elapsed:
            err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)
        eval_duration = '%.1f' % eval_elapsed()
        best_indicator = ' '
        if err <= best_error[name]:
            best_error[name] = err
            best_indicator = '*' 
        print("%s%f : %i passes : %s %ss %ss" % (best_indicator, err, epoch + 1, name, duration, eval_duration))

        if ((epoch + 1) % 5) == 0 or epoch == 0:
            eval_duration = ''
            with elapsed_timer() as eval_elapsed:
                infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)
            eval_duration = '%.1f' % eval_elapsed()
            best_indicator = ' '
            if infer_err < best_error[name + '_inferred']:
                best_error[name + '_inferred'] = infer_err
                best_indicator = '*'
            print("%s%f : %i passes : %s %ss %ss" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))

    print('Completed pass %i at alpha %f' % (epoch + 1, alpha))
    alpha -= alpha_delta
    
print("END %s" % str(datetime.datetime.now()))

In [None]:
# Print best error rates achieved
print("Err rate Model")
for rate, name in sorted((rate, name) for name, rate in best_error.items()):
    print("%f %s" % (rate, name))

In [None]:
doc2vec_model.save('doc2vec_model')

In [50]:
from gensim.models import Doc2Vec
doc2vec_model = Doc2Vec.load('doc2vec_model')

## Training Sent2Vec model

In [29]:
from gensim import utils
import smart_open

all_docs = []
with smart_open.smart_open('aclImdb/alldata-id.txt') as alldata:
    for line_no, line in enumerate(alldata):
        tokens = line.split()
        remove = ['(', ')', '"','?','!']
        tokens = tokens[1:]
        tokens = [word for word in tokens if word not in STOPWORDS]
        tokens = [word for word in tokens if word not in remove]
        tokens = [word for word in tokens if word.isalpha()]
        all_docs.append(tokens)
print(all_docs[0])
temp = u""
with smart_open.smart_open('sent2vec.txt', 'w') as f:
    for review in all_docs:
        for item in review:
            f.write("%s " % item)
        f.write("\n")

['bizarre', 'horror', 'movie', 'filled', 'famous', 'faces', 'stolen', 'cristina', 'raines', 'later', 'flamingo', 'road', 'pretty', 'somewhat', 'unstable', 'model', 'gummy', 'smile', 'slated', 'pay', 'attempted', 'suicides', 'guarding', 'gateway', 'hell', 'scenes', 'raines', 'modeling', 'captured', 'mood', 'music', 'perfect', 'deborah', 'raffin', 'charming', 'pal', 'raines', 'moves', 'creepy', 'brooklyn', 'heights', 'brownstone', 'inhabited', 'blind', 'priest', 'floor', 'things', 'start', 'cooking', 'neighbors', 'including', 'fantastically', 'wicked', 'burgess', 'meredith', 'kinky', 'couple', 'sylvia', 'miles', 'beverly', 'diabolical', 'lot', 'eli', 'wallach', 'great', 'fun', 'wily', 'police', 'detective', 'movie', 'nearly', 'baby', 'exorcist', 'combination', 'based', 'jeffrey', 'konvitz', 'sentinel', 'entertainingly', 'spooky', 'shocks', 'brought', 'director', 'michael', 'winner', 'mounts', 'thoughtfully', 'downbeat', 'ending', 'skill']


In [32]:
! ../sent2vec-master/./fasttext sent2vec -input sent2vec.txt -output my_model -dropoutK 0 -dim 200 -epoch 9 -lr 0.2 -thread 10 -bucket 100000

Read 9M words
Number of words:  50771
Number of labels: 0
Progress: 73.8%  words/sec/thread: 34629  lr: 0.052383  loss: 2.594971  eta: 0h1m 3m m %  words/sec/thread: 12652  lr: 0.199283  loss: 4.122703  eta: 0h11m 0.4%  words/sec/thread: 13513  lr: 0.199201  loss: 4.043160  eta: 0h10m 9m 0.5%  words/sec/thread: 15865  lr: 0.198932  loss: 3.867796  eta: 0h9m 0h7m 0.9%  words/sec/thread: 20135  lr: 0.198257  loss: 3.626870  eta: 0h7m m 1.1%  words/sec/thread: 22442  lr: 0.197702  loss: 3.534271  eta: 0h6m h6m 0.196779  loss: 3.430077  eta: 0h5m 1.6%  words/sec/thread: 25071  lr: 0.196734  loss: 3.428211  eta: 0h5m 1.7%  words/sec/thread: 25485  lr: 0.196520  loss: 3.410576  eta: 0h5m   eta: 0h5m 5m 2.5%  words/sec/thread: 27728  lr: 0.195015  loss: 3.319271  eta: 0h5m 0h5m m 29899  lr: 0.192352  loss: 3.221656  eta: 0h4m %  words/sec/thread: 29973  lr: 0.192191  loss: 3.216617  eta: 0h4m 4.1%  words/sec/thread: 30220  lr: 0.191860  loss: 3.208455  eta: 0h4m 4.9%  words/sec/thread: 30792 

In [19]:
import sent2vec
sent2vec_model = sent2vec.Sent2vecModel()
sent2vec_model.load_model('my_model.bin')

## Training Classic LSI model

In [8]:
from gensim.corpora import Dictionary, MmCorpus

dictionary = Dictionary(line.split() for line in open('sent2vec.txt'))
print dictionary
corpus = [dictionary.doc2bow(text) for text in sentences]
MmCorpus.serialize('lsi_model.mm', corpus) 

Dictionary(131056 unique tokens: [u'fawn', u'tsukino', u'woode', u'nunnery', u'sonja']...)


In [9]:
from gensim.models import LsiModel
lsi_model = LsiModel(corpus,id2word=dictionary)
lsi_corpus = lsi_model[corpus]

In [13]:
print(lsi_model.print_topics(2))
print(lsi_corpus)

[(0, u'-0.447*"infinity" + -0.447*"war" + -0.447*"eagerly" + -0.447*"waiting" + -0.447*"avengers" + 0.000*"baby" + -0.000*"attempted" + -0.000*"day" + -0.000*"completely" + -0.000*"different"'), (1, u'-0.640*"dog" + -0.396*"sample" + -0.396*"cat" + -0.396*"sentence" + -0.245*"day" + -0.245*"cute" + 0.000*"baby" + -0.000*"skipper" + 0.000*"attempted" + -0.000*"absolute"')]
<gensim.interfaces.TransformedCorpus object at 0x103c44150>


## Implementation of mixture methods

In [7]:
def simple_average(sent):
    sents_emd = []
    for s in sent:
        sent_emd = []
        for w in s:
            if w in word2vec_model:
                sent_emd.append(word2vec_model[w])
        sent_emd_ar = np.array(sent_emd)
        sum_ = sent_emd_ar.sum(axis=0)
        result = sum_/np.sqrt((sum_**2).sum())
        sents_emd.append(result)
    return sents_emd

In [114]:
def tf_idf(sent):
    word_counter = {}
    total_count = 0
    no_of_sentences = 0
    for s in sent:
        for w in s:
            if w in word_counter:
                word_counter[w] = word_counter[w] + 1
            else:
                word_counter[w] = 1
        total_count = total_count + len(s)
        no_of_sentences = no_of_sentences +  1
    sents_emd = []
    for s in sent:
        sent_emd = []
        for word in s:
            tf = word_counter[word]/float(len(s))
            idf = np.log(no_of_sentences/float(1+ word_counter[word]))
            try:
                emd = tf*idf*word2vec_model[word]
                sent_emd.append(emd)
            except:
                continue
        sent_emd = np.array(sent_emd)
        sum_ = sent_emd.sum(axis=0)
        result = sum_/np.sqrt((sum_**2).sum())
        sents_emd.append(result)
    return sents_emd

Or we could use the TFIDF API from gensim

In [6]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary


def tf_idf_v2(sent):
    dct = Dictionary(sent)
    corpus = [dct.doc2bow(line) for line in sent]
    tf_idf_model = TfidfModel(corpus)
    vector = tf_idf_model[corpus]
    d = {dct.get(id): value for doc in vector for id, value in doc}
    sents_emd = []
    no_of_sent = sum(1 for i in sent)
    for i in xrange(no_of_sent):
        sent_emd = []
        for j in xrange(len(sent[i])):
            word = sent[i][j]
            if word in word2vec_model:
                emd = d[word]*word2vec_model[word]
                sent_emd.append(emd)
        sent_emd_np = np.array(sent_emd)
        sum_ = sent_emd_np.sum(axis=0)
        result = sum_/np.sqrt((sum_**2).sum())
        sents_emd.append(result)
    
    return sents_emd

In [176]:
s1_s = "this is a sample sentence with cat and dog"
s1 = s1_s.lower().split()
s1 = [w for w in s1 if w not in STOPWORDS]
s2_s = "there was a time when computers were very expensive"
s2 = s2_s.lower().split()
s2 = [w for w in s2 if w not in STOPWORDS]
s3_s = "one more day with cute dog"
s3 = s3_s.lower().split()
s3 = [w for w in s3 if w not in STOPWORDS]
s4_s = "eagerly waiting for Avengers Infinity War"
s4 = s4_s.lower().split()
s4 = [w for w in s4 if w not in STOPWORDS]
s5_s = "this is a completely different"
s5 = s5_s.lower().split()
s5 = [w for w in s5 if w not in STOPWORDS]
sentences = [s1, s2, s3, s4, s5]
sentences_s = [s1_s, s2_s, s3_s, s4_s, s5_s,]
print(sentences, sentences_s)

([['sample', 'sentence', 'cat', 'dog'], ['time', 'computers', 'expensive'], ['day', 'cute', 'dog'], ['eagerly', 'waiting', 'avengers', 'infinity', 'war'], ['completely', 'different']], ['this is a sample sentence with cat and dog', 'there was a time when computers were very expensive', 'one more day with cute dog', 'eagerly waiting for Avengers Infinity War', 'this is a completely different'])


In [5]:
def smooth_inverse_frequency(sent, a=0.001):
    word_counter = {}
    sentences = []
    total_count = 0
    no_of_sentences = 0
    for s in sent:
        for w in s:
            if w in word_counter:
                word_counter[w] = word_counter[w] + 1
            else:
                word_counter[w] = 1
        total_count = total_count + len(s)
        no_of_sentences = no_of_sentences + 1
    sents_emd = []
    for s in sent:
        sent_emd = []
        for word in s:
            if word in word2vec_model:
                emd = (a/(a + (word_counter[word]/total_count)))*word2vec_model[word]
                sent_emd.append(emd)
        sum_ = np.array(sent_emd).sum(axis=0)
        sentence_emd = sum_/float(no_of_sentences)
        sents_emd.append(sentence_emd)
    [_, _, u]  = np.array(svds(sents_emd, k=1))
    new_sents_emd = []
    for s in sents_emd:
        s = s - s.dot(u*u.transpose())
        new_sents_emd.append(s)
    return new_sents_emd

In [211]:
sentences_emd1 = smooth_inverse_frequency(sentences)
sentences_emd2 = tf_idf_v2(sentences)
sentences_emd3 = simple_average(sentences)
print np.shape(sentences_emd1), np.shape(sentences_emd2), np.shape(sentences_emd3)

(5, 200) (5, 200) (5, 200)


Testing with cosine distance

In [178]:
d1 = cosine(sentences_emd1[0],sentences_emd1[2])
d2 = cosine(sentences_emd2[0],sentences_emd2[2])
d3 = cosine(sentences_emd3[0],sentences_emd3[2])
print("SIF: {} tfIdf: {} SimAvg: {}".format(d1, d2, d3))
d4 = cosine(sentences_emd1[1],sentences_emd1[3])
d5 = cosine(sentences_emd2[1],sentences_emd2[3])
d6 = cosine(sentences_emd3[1],sentences_emd3[3])
print("SIF: {} tfIdf: {} SimAvg: {}".format(d4, d5, d6))

SIF: 0.30360096693 tfIdf: 0.384991586208 SimAvg: 0.297450304031
SIF: 0.552763521671 tfIdf: 0.569366067648 SimAvg: 0.569366067648


In [179]:
doc_d1 = doc2vec_model.infer_vector(s1)
doc_d2 = doc2vec_model.infer_vector(s3)
print("doc2vec for s1 and s3: {}".format(cosine(doc_d1,doc_d2)))
doc_d3 = doc2vec_model.infer_vector(s1)
doc_d4 = doc2vec_model.infer_vector(s4)
print("doc2vec for s1 and s4: {}".format(cosine(doc_d3,doc_d4)))

doc2vec for s1 and s3: 0.448433697224
doc2vec for s1 and s4: 1.06828606129


In [180]:
embs_sent2vec = sent2vec_model.embed_sentences(sentences_s)
print("sent2vec for s1 and s3 {}".format(cosine(embs_sent2vec[0],embs_sent2vec[2])))
print("sent2vec for s1 and s4 {}".format(cosine(embs_sent2vec[0],embs_sent2vec[3])))

sent2vec for s1 and s3 0.374465703964
sent2vec for s1 and s4 0.9011329934


## Unsupervised Evaluation
Unsupervised evaluation of the the learnt sentence embeddings is performed using the sentence cosine similarity, on the SICK 2014 datasets. These similarity scores are compared to the gold-standard human judgements using Pearson’s correlation scores. The [SICK dataset](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools) consists of about 10,000 sentence pairs along with relatedness scores of the pairs. We use the code provided by Kiros et al., 2015.

Download dataset from [here](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools)

In [221]:
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error as mse

import os


def load_data(loc='./data/'):
    """
    Load the SICK dataset
    """
    testA, testB = [],[]
    testS = []
    with open(os.path.join(loc, 'SICK_test_annotated.txt'), 'rb') as f:
        for line in f:
            text = line.strip().split('\t')
            testA.append(text[1])
            testB.append(text[2])
            testS.append(text[3])
    testS = [float(s) for s in testS[1:]]

    return [testA[1:], testB[1:]], testS

def evaluate_sick(model, model_name, evaltest=1):
    test, scores = load_data()
    if evaltest:
            print 'Computing test sentence vectors...'
            if model_name == 'sent2vec':
                testA = np.array(model.embed_sentences(test[0]))
                testB = np.array(model.embed_sentences(test[1]))
            elif model_name == 'doc2vec':
                testA = np.array([model.infer_vector(example.split(' ')) for example in test[0]])
                testB = np.array([model.infer_vector(example.split(' ')) for example in test[1]])
            elif model_name == 'word2vec_sif':
                testA = smooth_inverse_frequency([example.split(' ') for example in test[0]])
                testB = smooth_inverse_frequency([example.split(' ') for example in test[1]])
            elif model_name == 'word2vec_tfidf':
                testA = tf_idf_v2([example.split(' ') for example in test[0]])
                testB = tf_idf_v2([example.split(' ') for example in test[1]])
            else:
                testA = simple_average([example.split(' ') for example in test[0]])
                testB = simple_average([example.split(' ') for example in test[1]])

            print 'Computing feature combinations...'
            result = []
            for i in range(len(testA)):
                result.append(5.0*(1 - cosine(testA[i],testB[i])))
#             print result

            print 'Evaluating...'
            pr = pearsonr(result, scores)[0]
            print 'Test Pearson: ' + str(pr)
            sr = spearmanr(result, scores)[0]
            print 'Test Spearman: ' + str(sr)
            se = mse(result, scores)
            print 'Test MSE: ' + str(se)

In [226]:
evaluate_sick(word2vec_model,'word2vec') # simple average

Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.6036488695603748
Test Spearman: 0.517107076620976
Test MSE: 1.886883490561824


In [227]:
evaluate_sick(word2vec_model,'word2vec_sif') # smooth inverse frequency

Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.6640034477509451
Test Spearman: 0.5635887757792902
Test MSE: 1.3158386583403208


In [228]:
evaluate_sick(word2vec_model,'word2vec_tfidf') # tfidf

Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.6386344221803884
Test Spearman: 0.5186580421530436
Test MSE: 0.977400413016066


In [229]:
evaluate_sick(doc2vec_model,'doc2vec')

Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.35673098444911305
Test Spearman: 0.34845272300340857
Test MSE: 4.510147300151664


In [230]:
evaluate_sick(sent2vec_model,'sent2vec')

Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.5258587190142471
Test Spearman: 0.4542118616338372
Test MSE: 1.6227105838354625


## Supervised  evaluation
Sentence embeddings are evaluated for various supervised classification tasks. We evaluate classification of movie review sentiment (MR) (Pang & Lee, 2005), subjectivity classification (SUBJ)(Pang & Lee, 2004) and question type classification (TREC) (Voorhees, 2002). Models embeddings are inferred from input sentences and directly fed to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the MR and SUBJ datasets. For those datasets nested cross-validation is used to tune the L2 penalty. For the TREC dataset, the accuracy is computed on the test set.

Download the datasets from here -
1. [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/)
2. [MR and SUBJ](https://www.cs.cornell.edu/people/pabo/movie-review-data/)

In [231]:
# -*- coding: utf-8 -*-
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error as mse
from sklearn.utils import shuffle

import os


def load_data(loc='./trec_data/'):
    """
    Load the TREC question-type dataset
    """
    train, test = [], []
    with open(os.path.join(loc, 'train_5500.label'), 'rb') as f:
        for line in f:
            train.append(line.decode("ascii", "replace").strip())
    with open(os.path.join(loc, 'TREC_10.label'), 'rb') as f:
        for line in f:
            test.append(line.decode("ascii", "replace").strip())
    return train, test

def prepare_data(text):
    """
    Prepare data
    """
    labels = [t.split()[0] for t in text]
    labels = [l.split(':')[0] for l in labels]
    X = [t.split()[1:] for t in text]
    X = [' '.join(t) for t in X]
    return X, labels

def prepare_labels(labels):
    """
    Process labels to numerical values
    """
    d = {}
    count = 0
    setlabels = set(labels)
    for w in setlabels:
        d[w] = count
        count += 1
    idxlabels = np.array([d[w] for w in labels])
    return idxlabels

def eval_trec(model, model_name, evaltest=1):
    traintext, testtext = load_data()
    train, train_labels = prepare_data(traintext)
    test, test_labels = prepare_data(testtext)
    train_labels = prepare_labels(train_labels)
    test_labels = prepare_labels(test_labels)
    train, train_labels = shuffle(train, train_labels, random_state=1234)

    print 'Computing train sentence vectors...'
    if model_name == 'sent2vec':
        trainA = np.array(model.embed_sentences(train))
    elif model_name == 'doc2vec':
        trainA = np.array([model.infer_vector(example.split(' ')) for example in train])
    elif model_name == 'word2vec_sif':
        trainA = smooth_inverse_frequency([example.split(' ') for example in train])
    elif model_name == 'word2vec_tfidf':
        trainA = tf_idf_v2([example.split(' ') for example in train])
    else:
        trainA = simple_average([example.split(' ') for example in train])
    
    if evaltest:
        print 'Computing test sentence vectors...'
        if model_name == 'sent2vec':
            testA = np.array(model.embed_sentences(test))
        elif model_name == 'doc2vec':
            testA = np.array([model.infer_vector(example.split(' ')) for example in test])
        elif model_name == 'word2vec_sif':
            testA = smooth_inverse_frequency([example.split(' ') for example in test])
        elif model_name == 'word2vec_tfidf':
            testA = tf_idf_v2([example.split(' ') for example in test])
        else:
            testA = simple_average([example.split(' ') for example in test])

        print 'Evaluating...'
#         print np.shape(testA), np.shape(test_labels)
        clf = LogisticRegression(C=128)
        clf.fit(trainA, train_labels)
        yhat = clf.predict(testA)
#         print np.shape(yhat)
        print 'Test accuracy: ' + str(clf.score(testA, test_labels))

In [234]:
eval_trec(word2vec_model,'word2vec')

Computing train sentence vectors...
Computing test sentence vectors...
Evaluating...
Test accuracy: 0.688


In [235]:
eval_trec(word2vec_model,'word2vec_tfidf')

Computing train sentence vectors...
Computing test sentence vectors...
Evaluating...
Test accuracy: 0.684


In [236]:
eval_trec(word2vec_model,'word2vec_sif')

Computing train sentence vectors...
Computing test sentence vectors...
Evaluating...
Test accuracy: 0.484


In [232]:
eval_trec(sent2vec_model,'sent2vec')

Computing train sentence vectors...
Computing test sentence vectors...
Evaluating...
Test accuracy: 0.642


In [233]:
eval_trec(doc2vec_model,'doc2vec')

Computing train sentence vectors...
Computing test sentence vectors...
Evaluating...
Test accuracy: 0.542


In [40]:
# Experiment scripts for binary classification benchmarks (e.g. MR, SUBJ)

import numpy as np
import sys
import nbsvm
import os.path
import numpy as np
from numpy.random import RandomState
import subprocess
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold


def eval_nested_kfold(name, model=None, model_name='sent2vec', loc='./classification_data/', k=10, seed=42, use_nb=False):
    """
    Evaluate features with nested K-fold cross validation
    Outer loop: Held-out evaluation
    Inner loop: Hyperparameter tuning

    Datasets can be found at http://nlp.stanford.edu/~sidaw/home/projects:nbsvm
    Options for name are 'MR', 'CR', 'SUBJ' and 'MPQA'
    """
    # Load the dataset and extract features
    z, features = load_data_2(name, model, loc=loc, seed=seed, model_name=model_name)
    print np.shape(features)

    scan = [2**t for t in range(0,9,1)]
    npts = len(z['text'])
    kf = KFold(npts, n_folds=k, random_state=seed)
    scores = []
    for train, test in kf:

        # Split data
        print train
        X_train = features[train]
        y_train = z['labels'][train]
        X_test = features[test]
        y_test = z['labels'][test]

        Xraw = [z['text'][i] for i in train]
        Xraw_test = [z['text'][i] for i in test]

        scanscores = []
        for s in scan:

            # Inner KFold
            innerkf = KFold(len(X_train), n_folds=k, random_state=seed+1)
            innerscores = []
            for innertrain, innertest in innerkf:
        
                # Split data
                X_innertrain = X_train[innertrain]
                y_innertrain = y_train[innertrain]
                X_innertest = X_train[innertest]
                y_innertest = y_train[innertest]

                Xraw_innertrain = [Xraw[i] for i in innertrain]
                Xraw_innertest = [Xraw[i] for i in innertest]

                # NB (if applicable)
                if use_nb:
                    NBtrain, NBtest = compute_nb(Xraw_innertrain, y_innertrain, Xraw_innertest)
                    X_innertrain = hstack((X_innertrain, NBtrain))
                    X_innertest = hstack((X_innertest, NBtest))

                # Train classifier
                clf = LogisticRegression(C=s)
                clf.fit(X_innertrain, y_innertrain)
                acc = clf.score(X_innertest, y_innertest)
                innerscores.append(acc)
                print (s, acc)

            # Append mean score
            scanscores.append(np.mean(innerscores))

        # Get the index of the best score
        s_ind = np.argmax(scanscores)
        s = scan[s_ind]
        print scanscores
        print s
 
        # NB (if applicable)
        if use_nb:
            NBtrain, NBtest = compute_nb(Xraw, y_train, Xraw_test)
            X_train = hstack((X_train, NBtrain))
            X_test = hstack((X_test, NBtest))
       
        # Train classifier
        clf = LogisticRegression(C=s)
        clf.fit(X_train, y_train)

        # Evaluate
        acc = clf.score(X_test, y_test)
        scores.append(acc)
        print scores

    return scores

def compute_labels(pos, neg):
    """
    Construct list of labels
    """
    labels = np.zeros(len(pos) + len(neg))
    labels[:len(pos)] = 1.0
    labels[len(pos):] = 0.0
    return labels

def shuffle_data(X, L, seed=42):
    """
    Shuffle the data
    """
    prng = RandomState(seed)
    inds = np.arange(len(X))
    prng.shuffle(inds)
    X = [X[i] for i in inds]
    L = L[inds]
    return (X, L)    

def compute_nb(X, y, Z):
    """
    Compute NB features
    """
    labels = [int(t) for t in y]
    ptrain = [X[i] for i in range(len(labels)) if labels[i] == 0]
    ntrain = [X[i] for i in range(len(labels)) if labels[i] == 1]
    poscounts = nbsvm.build_dict(ptrain, [1,2])
    negcounts = nbsvm.build_dict(ntrain, [1,2])
    dic, r = nbsvm.compute_ratio(poscounts, negcounts)
    trainX = nbsvm.process_text(X, dic, r, [1,2])
    devX = nbsvm.process_text(Z, dic, r, [1,2])
    return trainX, devX

def load_rt(loc='./classification_data/'):
    """
    Load the MR dataset
    """
    pos, neg = [], []
    with open(os.path.join(loc, 'rt-polarity.pos'), 'rb') as f:
        for line in f:
            pos.append(line.decode('latin-1').strip())
    with open(os.path.join(loc, 'rt-polarity.neg'), 'rb') as f:
        for line in f:
            neg.append(line.decode('latin-1').strip())
    return pos, neg


def load_subj(loc='./classification_data/'):
    """
    Load the SUBJ dataset
    """
    pos, neg = [], []
    with open(os.path.join(loc, 'plot.tok.gt9.5000'), 'rb') as f:
        for line in f:
            pos.append(line.decode('latin-1').strip())
    with open(os.path.join(loc, 'quote.tok.gt9.5000'), 'rb') as f:
        for line in f:
            neg.append(line.decode('latin-1').strip())
    return pos, neg

def load_data_2(name, model=None, model_name='sent2vec', loc='./classification_data/', seed=42):
    """
    Load one of MR, CR, SUBJ or MPQA
    """
    z = {}
    if name == 'MR':
        pos, neg = load_rt(loc=loc)
    elif name == 'SUBJ':
        pos, neg = load_subj(loc=loc)
    elif name == 'CR':
        pos, neg = load_cr(loc=loc)
    elif name == 'MPQA':
        pos, neg = load_mpqa(loc=loc)

    labels = compute_labels(pos, neg)
    text, labels = shuffle_data(pos+neg, labels, seed=seed)
    z['text'] = text
    z['labels'] = labels
    print 'Computing sentence vectors...'
    if model_name == 'sent2vec':
        testA = np.array(model.embed_sentences(text))
    elif model_name == 'doc2vec':
        testA = np.array([model.infer_vector(example.split(' ')) for example in text])
    elif model_name == 'word2vec_sif':
        testA = np.array(smooth_inverse_frequency([example.split(' ') for example in text]))
    elif model_name == 'word2vec_tfidf':
        testA = np.array(tf_idf_v2([example.split(' ') for example in text]))
    else:
        testA = np.array(simple_average([example.split(' ') for example in text]))

    return z, testA

In [42]:
eval_nested_kfold(model=sent2vec_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
(10000, 200)
[1000 1001 1002 ... 9997 9998 9999]
(1, 0.8733333333333333)
(1, 0.8911111111111111)
(1, 0.8722222222222222)
(1, 0.8666666666666667)
(1, 0.8544444444444445)
(1, 0.8711111111111111)
(1, 0.8677777777777778)
(1, 0.8933333333333333)
(1, 0.8633333333333333)
(1, 0.8677777777777778)
(2, 0.8733333333333333)
(2, 0.8911111111111111)
(2, 0.8722222222222222)
(2, 0.8655555555555555)
(2, 0.8533333333333334)
(2, 0.8711111111111111)
(2, 0.8677777777777778)
(2, 0.8922222222222222)
(2, 0.8633333333333333)
(2, 0.8677777777777778)
(4, 0.8733333333333333)
(4, 0.8911111111111111)
(4, 0.8711111111111111)
(4, 0.8655555555555555)
(4, 0.8522222222222222)
(4, 0.87)
(4, 0.8677777777777778)
(4, 0.8922222222222222)
(4, 0.8633333333333333)
(4, 0.8677777777777778)
(8, 0.8733333333333333)
(8, 0.8911111111111111)
(8, 0.8688888888888889)
(8, 0.8655555555555555)
(8, 0.8533333333333334)
(8, 0.87)
(8, 0.8677777777777778)
(8, 0.8911111111111111)
(8, 0.8622222222222222)
(8, 0.8677777

(64, 0.8844444444444445)
(64, 0.86)
(64, 0.8588888888888889)
(128, 0.8788888888888889)
(128, 0.88)
(128, 0.8877777777777778)
(128, 0.87)
(128, 0.8511111111111112)
(128, 0.8744444444444445)
(128, 0.8711111111111111)
(128, 0.8855555555555555)
(128, 0.86)
(128, 0.8588888888888889)
(256, 0.8788888888888889)
(256, 0.88)
(256, 0.8877777777777778)
(256, 0.87)
(256, 0.8511111111111112)
(256, 0.8744444444444445)
(256, 0.8711111111111111)
(256, 0.8855555555555555)
(256, 0.86)
(256, 0.8588888888888889)
[0.8721111111111111, 0.8716666666666667, 0.8715555555555555, 0.8717777777777778, 0.8717777777777778, 0.8717777777777778, 0.8716666666666667, 0.8717777777777778, 0.8717777777777778]
1
[0.881, 0.878, 0.89, 0.866]
[   0    1    2 ... 9997 9998 9999]
(1, 0.8822222222222222)
(1, 0.8677777777777778)
(1, 0.8955555555555555)
(1, 0.8544444444444445)
(1, 0.86)
(1, 0.8733333333333333)
(1, 0.8733333333333333)
(1, 0.8888888888888888)
(1, 0.86)
(1, 0.8655555555555555)
(2, 0.8811111111111111)
(2, 0.86888888888888

(8, 0.8655555555555555)
(8, 0.8666666666666667)
(16, 0.8811111111111111)
(16, 0.8777777777777778)
(16, 0.8866666666666667)
(16, 0.8722222222222222)
(16, 0.86)
(16, 0.8577777777777778)
(16, 0.87)
(16, 0.8688888888888889)
(16, 0.8655555555555555)
(16, 0.8666666666666667)
(32, 0.8811111111111111)
(32, 0.8777777777777778)
(32, 0.8866666666666667)
(32, 0.8722222222222222)
(32, 0.86)
(32, 0.8577777777777778)
(32, 0.87)
(32, 0.8688888888888889)
(32, 0.8655555555555555)
(32, 0.8666666666666667)
(64, 0.8811111111111111)
(64, 0.8777777777777778)
(64, 0.8866666666666667)
(64, 0.8722222222222222)
(64, 0.86)
(64, 0.8577777777777778)
(64, 0.87)
(64, 0.8688888888888889)
(64, 0.8655555555555555)
(64, 0.8666666666666667)
(128, 0.8811111111111111)
(128, 0.8777777777777778)
(128, 0.8866666666666667)
(128, 0.8722222222222222)
(128, 0.86)
(128, 0.8577777777777778)
(128, 0.87)
(128, 0.8688888888888889)
(128, 0.8655555555555555)
(128, 0.8666666666666667)
(256, 0.8811111111111111)
(256, 0.8777777777777778)
(2

[0.881, 0.878, 0.89, 0.866, 0.858, 0.852, 0.88, 0.887, 0.863, 0.871]

In [52]:
eval_nested_kfold(model=doc2vec_model, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(10000, 200)
[1000 1001 1002 ... 9997 9998 9999]
(1, 0.8022222222222222)
(1, 0.7833333333333333)
(1, 0.7722222222222223)
(1, 0.7766666666666666)
(1, 0.7833333333333333)
(1, 0.8155555555555556)
(1, 0.7877777777777778)
(1, 0.7733333333333333)
(1, 0.8033333333333333)
(1, 0.7888888888888889)
(2, 0.8022222222222222)
(2, 0.7788888888888889)
(2, 0.7688888888888888)
(2, 0.7766666666666666)
(2, 0.7855555555555556)
(2, 0.8122222222222222)
(2, 0.7866666666666666)
(2, 0.7733333333333333)
(2, 0.8044444444444444)
(2, 0.7877777777777778)
(4, 0.8033333333333333)
(4, 0.7788888888888889)
(4, 0.7666666666666667)
(4, 0.7766666666666666)
(4, 0.7855555555555556)
(4, 0.8111111111111111)
(4, 0.7866666666666666)
(4, 0.7711111111111111)
(4, 0.8044444444444444)
(4, 0.7888888888888889)
(8, 0.8033333333333333)
(8, 0.7788888888888889)
(8, 0.7666666666666667)
(8, 0.7755555555555556)
(8, 0.7855555555555556)
(8, 0.8122222222222222)
(8, 0.7866666666666666)
(8, 0.7711111111111111)
(8, 0.804

(32, 0.7833333333333333)
(32, 0.7722222222222223)
(32, 0.7833333333333333)
(32, 0.81)
(32, 0.7866666666666666)
(32, 0.7677777777777778)
(32, 0.7955555555555556)
(32, 0.7822222222222223)
(64, 0.78)
(64, 0.7944444444444444)
(64, 0.7833333333333333)
(64, 0.7722222222222223)
(64, 0.7833333333333333)
(64, 0.81)
(64, 0.7866666666666666)
(64, 0.7677777777777778)
(64, 0.7955555555555556)
(64, 0.7822222222222223)
(128, 0.78)
(128, 0.7944444444444444)
(128, 0.7833333333333333)
(128, 0.7722222222222223)
(128, 0.7833333333333333)
(128, 0.81)
(128, 0.7866666666666666)
(128, 0.7677777777777778)
(128, 0.7955555555555556)
(128, 0.7822222222222223)
(256, 0.78)
(256, 0.7944444444444444)
(256, 0.7833333333333333)
(256, 0.7722222222222223)
(256, 0.7833333333333333)
(256, 0.8088888888888889)
(256, 0.7866666666666666)
(256, 0.7677777777777778)
(256, 0.7955555555555556)
(256, 0.7822222222222223)
[0.7865555555555555, 0.7862222222222222, 0.7857777777777778, 0.7855555555555556, 0.7854444444444445, 0.78555555555

(4, 0.7822222222222223)
(4, 0.8033333333333333)
(4, 0.7866666666666666)
(4, 0.7666666666666667)
(4, 0.7777777777777778)
(4, 0.7766666666666666)
(4, 0.8122222222222222)
(4, 0.7833333333333333)
(4, 0.7911111111111111)
(4, 0.7877777777777778)
(8, 0.7822222222222223)
(8, 0.8011111111111111)
(8, 0.7866666666666666)
(8, 0.7666666666666667)
(8, 0.7755555555555556)
(8, 0.7766666666666666)
(8, 0.8122222222222222)
(8, 0.7833333333333333)
(8, 0.7911111111111111)
(8, 0.7877777777777778)
(16, 0.7822222222222223)
(16, 0.8022222222222222)
(16, 0.7866666666666666)
(16, 0.7666666666666667)
(16, 0.7766666666666666)
(16, 0.7766666666666666)
(16, 0.8122222222222222)
(16, 0.7833333333333333)
(16, 0.7911111111111111)
(16, 0.7888888888888889)
(32, 0.7822222222222223)
(32, 0.8022222222222222)
(32, 0.7855555555555556)
(32, 0.7666666666666667)
(32, 0.7766666666666666)
(32, 0.7766666666666666)
(32, 0.8122222222222222)
(32, 0.7833333333333333)
(32, 0.7911111111111111)
(32, 0.7888888888888889)
(64, 0.7822222222222

[0.779, 0.799, 0.769, 0.78, 0.774, 0.797, 0.792, 0.789, 0.78, 0.79]

In [41]:
eval_nested_kfold(model=word2vec_model, name='SUBJ', use_nb=False, model_name='word2vec')

Computing sentence vectors...
(10000, 200)
[1000 1001 1002 ... 9997 9998 9999]
(1, 0.9066666666666666)
(1, 0.8988888888888888)
(1, 0.8922222222222222)
(1, 0.8955555555555555)
(1, 0.8944444444444445)
(1, 0.9066666666666666)
(1, 0.9022222222222223)
(1, 0.9133333333333333)
(1, 0.9055555555555556)
(1, 0.9088888888888889)
(2, 0.9088888888888889)
(2, 0.9011111111111111)
(2, 0.8922222222222222)
(2, 0.8944444444444445)
(2, 0.8977777777777778)
(2, 0.9066666666666666)
(2, 0.9066666666666666)
(2, 0.9122222222222223)
(2, 0.9088888888888889)
(2, 0.91)
(4, 0.91)
(4, 0.9044444444444445)
(4, 0.8922222222222222)
(4, 0.8955555555555555)
(4, 0.9)
(4, 0.9077777777777778)
(4, 0.9122222222222223)
(4, 0.9122222222222223)
(4, 0.9155555555555556)
(4, 0.9155555555555556)
(8, 0.9088888888888889)
(8, 0.9033333333333333)
(8, 0.8966666666666666)
(8, 0.8933333333333333)
(8, 0.9044444444444445)
(8, 0.9144444444444444)
(8, 0.9088888888888889)
(8, 0.9111111111111111)
(8, 0.9122222222222223)
(8, 0.9155555555555556)
(16,

(32, 0.9122222222222223)
(32, 0.9)
(32, 0.9055555555555556)
(32, 0.9088888888888889)
(32, 0.9155555555555556)
(32, 0.91)
(32, 0.9144444444444444)
(32, 0.91)
(32, 0.91)
(64, 0.9088888888888889)
(64, 0.9133333333333333)
(64, 0.9044444444444445)
(64, 0.9033333333333333)
(64, 0.9033333333333333)
(64, 0.9111111111111111)
(64, 0.9077777777777778)
(64, 0.9111111111111111)
(64, 0.9111111111111111)
(64, 0.9077777777777778)
(128, 0.9133333333333333)
(128, 0.9122222222222223)
(128, 0.9077777777777778)
(128, 0.9033333333333333)
(128, 0.9022222222222223)
(128, 0.9111111111111111)
(128, 0.91)
(128, 0.9122222222222223)
(128, 0.9155555555555556)
(128, 0.9077777777777778)
(256, 0.9144444444444444)
(256, 0.91)
(256, 0.9066666666666666)
(256, 0.9)
(256, 0.9022222222222223)
(256, 0.9122222222222223)
(256, 0.9066666666666666)
(256, 0.9133333333333333)
(256, 0.9133333333333333)
(256, 0.9077777777777778)
[0.9036666666666667, 0.9071111111111112, 0.9083333333333332, 0.9092222222222223, 0.9091111111111111, 0.90

(1, 0.9)
(1, 0.9066666666666666)
(1, 0.8911111111111111)
(1, 0.9077777777777778)
(1, 0.9055555555555556)
(2, 0.9033333333333333)
(2, 0.9088888888888889)
(2, 0.9111111111111111)
(2, 0.8966666666666666)
(2, 0.8933333333333333)
(2, 0.9044444444444445)
(2, 0.9066666666666666)
(2, 0.8944444444444445)
(2, 0.91)
(2, 0.9088888888888889)
(4, 0.9033333333333333)
(4, 0.9111111111111111)
(4, 0.9077777777777778)
(4, 0.8922222222222222)
(4, 0.8955555555555555)
(4, 0.9044444444444445)
(4, 0.9122222222222223)
(4, 0.8977777777777778)
(4, 0.9133333333333333)
(4, 0.9133333333333333)
(8, 0.9111111111111111)
(8, 0.91)
(8, 0.9055555555555556)
(8, 0.8988888888888888)
(8, 0.8944444444444445)
(8, 0.9033333333333333)
(8, 0.9133333333333333)
(8, 0.8955555555555555)
(8, 0.91)
(8, 0.9122222222222223)
(16, 0.9088888888888889)
(16, 0.9077777777777778)
(16, 0.9077777777777778)
(16, 0.9033333333333333)
(16, 0.8955555555555555)
(16, 0.9088888888888889)
(16, 0.9122222222222223)
(16, 0.8944444444444445)
(16, 0.9111111111

[0.91, 0.912, 0.904, 0.898, 0.904, 0.905, 0.904, 0.925, 0.906, 0.911]

In [43]:
eval_nested_kfold(model=word2vec_model, name='SUBJ', use_nb=False, model_name='word2vec_sif')

Computing sentence vectors...
(10000, 200)
[1000 1001 1002 ... 9997 9998 9999]
(1, 0.4822222222222222)
(1, 0.4911111111111111)
(1, 0.52)
(1, 0.4633333333333333)
(1, 0.4633333333333333)
(1, 0.5077777777777778)
(1, 0.4922222222222222)
(1, 0.4622222222222222)
(1, 0.5322222222222223)
(1, 0.47333333333333333)
(2, 0.5666666666666667)
(2, 0.5911111111111111)
(2, 0.6366666666666667)
(2, 0.4777777777777778)
(2, 0.4633333333333333)
(2, 0.6377777777777778)
(2, 0.5244444444444445)
(2, 0.4622222222222222)
(2, 0.6688888888888889)
(2, 0.47333333333333333)
(4, 0.7144444444444444)
(4, 0.7477777777777778)
(4, 0.7622222222222222)
(4, 0.57)
(4, 0.4688888888888889)
(4, 0.7988888888888889)
(4, 0.6666666666666666)
(4, 0.4633333333333333)
(4, 0.7911111111111111)
(4, 0.5177777777777778)
(8, 0.8144444444444444)
(8, 0.8255555555555556)
(8, 0.8177777777777778)
(8, 0.7166666666666667)
(8, 0.5922222222222222)
(8, 0.8622222222222222)
(8, 0.8233333333333334)
(8, 0.5533333333333333)
(8, 0.8333333333333334)
(8, 0.66333

(16, 0.7766666666666666)
(16, 0.8733333333333333)
(32, 0.8722222222222222)
(32, 0.8188888888888889)
(32, 0.8466666666666667)
(32, 0.7988888888888889)
(32, 0.8722222222222222)
(32, 0.8666666666666667)
(32, 0.8611111111111112)
(32, 0.8955555555555555)
(32, 0.8311111111111111)
(32, 0.8733333333333333)
(64, 0.8755555555555555)
(64, 0.86)
(64, 0.8733333333333333)
(64, 0.8388888888888889)
(64, 0.8688888888888889)
(64, 0.8733333333333333)
(64, 0.8711111111111111)
(64, 0.8933333333333333)
(64, 0.8544444444444445)
(64, 0.8755555555555555)
(128, 0.8777777777777778)
(128, 0.8622222222222222)
(128, 0.87)
(128, 0.8533333333333334)
(128, 0.8711111111111111)
(128, 0.8755555555555555)
(128, 0.8755555555555555)
(128, 0.8933333333333333)
(128, 0.8711111111111111)
(128, 0.8766666666666667)
(256, 0.8777777777777778)
(256, 0.8722222222222222)
(256, 0.8755555555555555)
(256, 0.8577777777777778)
(256, 0.8777777777777778)
(256, 0.8855555555555555)
(256, 0.8755555555555555)
(256, 0.8966666666666666)
(256, 0.88

(1, 0.48777777777777775)
(1, 0.5388888888888889)
(1, 0.5188888888888888)
(1, 0.5022222222222222)
(1, 0.47555555555555556)
(1, 0.4855555555555556)
(1, 0.4855555555555556)
(1, 0.49)
(1, 0.5177777777777778)
(1, 0.47333333333333333)
(2, 0.48777777777777775)
(2, 0.6577777777777778)
(2, 0.5355555555555556)
(2, 0.5022222222222222)
(2, 0.6011111111111112)
(2, 0.4855555555555556)
(2, 0.4855555555555556)
(2, 0.49)
(2, 0.5255555555555556)
(2, 0.47333333333333333)
(4, 0.49)
(4, 0.7855555555555556)
(4, 0.6477777777777778)
(4, 0.5244444444444445)
(4, 0.74)
(4, 0.4866666666666667)
(4, 0.4866666666666667)
(4, 0.4911111111111111)
(4, 0.6544444444444445)
(4, 0.47333333333333333)
(8, 0.5633333333333334)
(8, 0.8633333333333333)
(8, 0.8033333333333333)
(8, 0.6855555555555556)
(8, 0.8022222222222222)
(8, 0.5677777777777778)
(8, 0.5411111111111111)
(8, 0.5788888888888889)
(8, 0.8133333333333334)
(8, 0.5255555555555556)
(16, 0.7411111111111112)
(16, 0.87)
(16, 0.8722222222222222)
(16, 0.83)
(16, 0.83333333333

[0.875, 0.882, 0.877, 0.866, 0.858, 0.882, 0.877, 0.904, 0.884, 0.876]

In [44]:
eval_nested_kfold(model=word2vec_model, name='SUBJ', use_nb=False, model_name='word2vec_tfidf')

Computing sentence vectors...
(10000, 200)
[1000 1001 1002 ... 9997 9998 9999]
(1, 0.8955555555555555)
(1, 0.8955555555555555)
(1, 0.8955555555555555)
(1, 0.8955555555555555)
(1, 0.8933333333333333)
(1, 0.9077777777777778)
(1, 0.9055555555555556)
(1, 0.8955555555555555)
(1, 0.8988888888888888)
(1, 0.9055555555555556)
(2, 0.8955555555555555)
(2, 0.8977777777777778)
(2, 0.8911111111111111)
(2, 0.8944444444444445)
(2, 0.9033333333333333)
(2, 0.9055555555555556)
(2, 0.9044444444444445)
(2, 0.8966666666666666)
(2, 0.8977777777777778)
(2, 0.9022222222222223)
(4, 0.8944444444444445)
(4, 0.8988888888888888)
(4, 0.8888888888888888)
(4, 0.8944444444444445)
(4, 0.9011111111111111)
(4, 0.9066666666666666)
(4, 0.9)
(4, 0.8955555555555555)
(4, 0.9)
(4, 0.9011111111111111)
(8, 0.8955555555555555)
(8, 0.8977777777777778)
(8, 0.8911111111111111)
(8, 0.8922222222222222)
(8, 0.8977777777777778)
(8, 0.9111111111111111)
(8, 0.8988888888888888)
(8, 0.8944444444444445)
(8, 0.9022222222222223)
(8, 0.9)
(16, 0

(32, 0.8966666666666666)
(32, 0.8955555555555555)
(32, 0.8888888888888888)
(32, 0.9022222222222223)
(32, 0.9055555555555556)
(32, 0.9)
(32, 0.8955555555555555)
(32, 0.8966666666666666)
(32, 0.8922222222222222)
(64, 0.9066666666666666)
(64, 0.8966666666666666)
(64, 0.8955555555555555)
(64, 0.8866666666666667)
(64, 0.9033333333333333)
(64, 0.9044444444444445)
(64, 0.9011111111111111)
(64, 0.8933333333333333)
(64, 0.8988888888888888)
(64, 0.8922222222222222)
(128, 0.9077777777777778)
(128, 0.8955555555555555)
(128, 0.8944444444444445)
(128, 0.8888888888888888)
(128, 0.9)
(128, 0.9055555555555556)
(128, 0.9022222222222223)
(128, 0.8933333333333333)
(128, 0.9)
(128, 0.8944444444444445)
(256, 0.9088888888888889)
(256, 0.8977777777777778)
(256, 0.8944444444444445)
(256, 0.89)
(256, 0.9011111111111111)
(256, 0.9077777777777778)
(256, 0.9066666666666666)
(256, 0.8922222222222222)
(256, 0.8977777777777778)
(256, 0.8944444444444445)
[0.9023333333333333, 0.9001111111111111, 0.8992222222222223, 0.8

(1, 0.8977777777777778)
(1, 0.8911111111111111)
(1, 0.8888888888888888)
(1, 0.9011111111111111)
(1, 0.9044444444444445)
(1, 0.8877777777777778)
(1, 0.9)
(1, 0.9055555555555556)
(2, 0.9055555555555556)
(2, 0.8944444444444445)
(2, 0.8966666666666666)
(2, 0.8888888888888888)
(2, 0.8866666666666667)
(2, 0.9011111111111111)
(2, 0.9022222222222223)
(2, 0.8888888888888888)
(2, 0.9033333333333333)
(2, 0.9)
(4, 0.9077777777777778)
(4, 0.8955555555555555)
(4, 0.9)
(4, 0.8866666666666667)
(4, 0.89)
(4, 0.9011111111111111)
(4, 0.9011111111111111)
(4, 0.8811111111111111)
(4, 0.9)
(4, 0.9)
(8, 0.9111111111111111)
(8, 0.8944444444444445)
(8, 0.8944444444444445)
(8, 0.8833333333333333)
(8, 0.8911111111111111)
(8, 0.8966666666666666)
(8, 0.9066666666666666)
(8, 0.8811111111111111)
(8, 0.8988888888888888)
(8, 0.9022222222222223)
(16, 0.9111111111111111)
(16, 0.8911111111111111)
(16, 0.8922222222222222)
(16, 0.8877777777777778)
(16, 0.8866666666666667)
(16, 0.89)
(16, 0.9055555555555556)
(16, 0.877777777

[0.901, 0.893, 0.9, 0.893, 0.895, 0.899, 0.899, 0.913, 0.891, 0.901]

In [45]:
eval_nested_kfold(model=word2vec_model, name='MR', use_nb=False, model_name='word2vec')

Computing sentence vectors...
(10662, 200)
[ 1067  1068  1069 ... 10659 10660 10661]
(1, 0.7458333333333333)
(1, 0.7635416666666667)
(1, 0.725)
(1, 0.7697916666666667)
(1, 0.7322916666666667)
(1, 0.7570385818561001)
(1, 0.7664233576642335)
(1, 0.7424400417101147)
(1, 0.7205422314911366)
(1, 0.7174139728884255)
(2, 0.7395833333333334)
(2, 0.771875)
(2, 0.7322916666666667)
(2, 0.765625)
(2, 0.740625)
(2, 0.7549530761209593)
(2, 0.7674661105318039)
(2, 0.7413972888425443)
(2, 0.7299270072992701)
(2, 0.7267987486965589)
(4, 0.7395833333333334)
(4, 0.7760416666666666)
(4, 0.73125)
(4, 0.7666666666666667)
(4, 0.7458333333333333)
(4, 0.7539103232533889)
(4, 0.7591240875912408)
(4, 0.7413972888425443)
(4, 0.7278415015641293)
(4, 0.7309697601668405)
(8, 0.740625)
(8, 0.7760416666666666)
(8, 0.7364583333333333)
(8, 0.76875)
(8, 0.7458333333333333)
(8, 0.7601668404588112)
(8, 0.7570385818561001)
(8, 0.7382690302398331)
(8, 0.7361835245046924)
(8, 0.7320125130344108)
(16, 0.7427083333333333)
(16, 

(32, 0.7570385818561001)
(32, 0.7434827945776851)
(32, 0.7278415015641293)
(32, 0.7340980187695516)
(64, 0.73125)
(64, 0.7322916666666667)
(64, 0.7510416666666667)
(64, 0.7395833333333334)
(64, 0.7489583333333333)
(64, 0.75625)
(64, 0.7601668404588112)
(64, 0.7393117831074035)
(64, 0.7278415015641293)
(64, 0.7330552659019812)
(128, 0.7375)
(128, 0.7302083333333333)
(128, 0.7552083333333334)
(128, 0.7385416666666667)
(128, 0.7458333333333333)
(128, 0.7583333333333333)
(128, 0.7653806047966631)
(128, 0.7382690302398331)
(128, 0.7299270072992701)
(128, 0.7330552659019812)
(256, 0.7354166666666667)
(256, 0.7322916666666667)
(256, 0.7614583333333333)
(256, 0.7364583333333333)
(256, 0.746875)
(256, 0.7552083333333334)
(256, 0.7664233576642335)
(256, 0.7382690302398331)
(256, 0.7320125130344108)
(256, 0.7309697601668405)
[0.7422867787626, 0.7432248218630517, 0.743432829336114, 0.7434333724365658, 0.742287430483142, 0.7428085896767467, 0.7419750391032326, 0.7432256908237749, 0.7435382994438651

(1, 0.7354166666666667)
(1, 0.7333333333333333)
(1, 0.7541666666666667)
(1, 0.734375)
(1, 0.759375)
(1, 0.7322916666666667)
(1, 0.7632950990615224)
(1, 0.7664233576642335)
(1, 0.7205422314911366)
(1, 0.7247132429614181)
(2, 0.734375)
(2, 0.7395833333333334)
(2, 0.7666666666666667)
(2, 0.7385416666666667)
(2, 0.753125)
(2, 0.7385416666666667)
(2, 0.7643378519290928)
(2, 0.7591240875912408)
(2, 0.7278415015641293)
(2, 0.7257559958289885)
(4, 0.734375)
(4, 0.73125)
(4, 0.7635416666666667)
(4, 0.7416666666666667)
(4, 0.7572916666666667)
(4, 0.7447916666666666)
(4, 0.7664233576642335)
(4, 0.7476538060479666)
(4, 0.7226277372262774)
(4, 0.7278415015641293)
(8, 0.7333333333333333)
(8, 0.7322916666666667)
(8, 0.7645833333333333)
(8, 0.74375)
(8, 0.7583333333333333)
(8, 0.746875)
(8, 0.7643378519290928)
(8, 0.7507820646506778)
(8, 0.7205422314911366)
(8, 0.7288842544316997)
(16, 0.7364583333333333)
(16, 0.7322916666666667)
(16, 0.7635416666666667)
(16, 0.7479166666666667)
(16, 0.759375)
(16, 0.

[0.7244611059044048,
 0.7403936269915652,
 0.7579737335834896,
 0.7504690431519699,
 0.7485928705440901,
 0.7532833020637899,
 0.7579737335834896,
 0.7439024390243902,
 0.7485928705440901,
 0.7279549718574109]

In [46]:
eval_nested_kfold(model=word2vec_model, name='MR', use_nb=False, model_name='word2vec_sif')

Computing sentence vectors...
(10662, 200)
[ 1067  1068  1069 ... 10659 10660 10661]
(1, 0.6208333333333333)
(1, 0.5104166666666666)
(1, 0.521875)
(1, 0.4791666666666667)
(1, 0.48020833333333335)
(1, 0.5099061522419187)
(1, 0.5005213764337852)
(1, 0.4608967674661105)
(1, 0.5557872784150156)
(1, 0.48383733055265904)
(2, 0.65)
(2, 0.5104166666666666)
(2, 0.6145833333333334)
(2, 0.484375)
(2, 0.48020833333333335)
(2, 0.5099061522419187)
(2, 0.5005213764337852)
(2, 0.4608967674661105)
(2, 0.6047966631908238)
(2, 0.48383733055265904)
(4, 0.6770833333333334)
(4, 0.540625)
(4, 0.6458333333333334)
(4, 0.5375)
(4, 0.48020833333333335)
(4, 0.5130344108446299)
(4, 0.5015641293013556)
(4, 0.4608967674661105)
(4, 0.6287799791449427)
(4, 0.48383733055265904)
(8, 0.6802083333333333)
(8, 0.5989583333333334)
(8, 0.6479166666666667)
(8, 0.6354166666666666)
(8, 0.48020833333333335)
(8, 0.5849843587069864)
(8, 0.5067778936392076)
(8, 0.4608967674661105)
(8, 0.6517205422314911)
(8, 0.48383733055265904)
(16

(16, 0.5119916579770595)
(16, 0.49113660062565173)
(16, 0.6746611053180396)
(32, 0.678125)
(32, 0.521875)
(32, 0.5885416666666666)
(32, 0.5572916666666666)
(32, 0.665625)
(32, 0.5916666666666667)
(32, 0.6319082377476538)
(32, 0.5922836287799792)
(32, 0.5411887382690302)
(32, 0.6777893639207507)
(64, 0.68125)
(64, 0.6125)
(64, 0.6604166666666667)
(64, 0.640625)
(64, 0.6604166666666667)
(64, 0.659375)
(64, 0.6809176225234619)
(64, 0.6496350364963503)
(64, 0.6068821689259646)
(64, 0.6652763295099061)
(128, 0.6822916666666666)
(128, 0.64375)
(128, 0.684375)
(128, 0.6708333333333333)
(128, 0.6645833333333333)
(128, 0.6958333333333333)
(128, 0.6903023983315955)
(128, 0.6684045881126173)
(128, 0.6465067778936392)
(128, 0.6631908237747653)
(256, 0.6833333333333333)
(256, 0.675)
(256, 0.703125)
(256, 0.6822916666666666)
(256, 0.665625)
(256, 0.709375)
(256, 0.6955161626694474)
(256, 0.6923879040667362)
(256, 0.6600625651720542)
(256, 0.6673618352450469)
[0.4945793143899896, 0.49874619829683703,

(1, 0.478125)
(1, 0.503125)
(1, 0.4875)
(1, 0.4791666666666667)
(1, 0.49270833333333336)
(1, 0.4942648592283629)
(1, 0.47340980187695514)
(1, 0.4848800834202294)
(1, 0.48383733055265904)
(2, 0.48333333333333334)
(2, 0.478125)
(2, 0.575)
(2, 0.4875)
(2, 0.4791666666666667)
(2, 0.49270833333333336)
(2, 0.4942648592283629)
(2, 0.47340980187695514)
(2, 0.4848800834202294)
(2, 0.48383733055265904)
(4, 0.48333333333333334)
(4, 0.478125)
(4, 0.6520833333333333)
(4, 0.5125)
(4, 0.48020833333333335)
(4, 0.49166666666666664)
(4, 0.4942648592283629)
(4, 0.47340980187695514)
(4, 0.4869655891553702)
(4, 0.48383733055265904)
(8, 0.4864583333333333)
(8, 0.4875)
(8, 0.6802083333333333)
(8, 0.6083333333333333)
(8, 0.490625)
(8, 0.50625)
(8, 0.5078206465067779)
(8, 0.47340980187695514)
(8, 0.5401459854014599)
(8, 0.48905109489051096)
(16, 0.5135416666666667)
(16, 0.55625)
(16, 0.684375)
(16, 0.6510416666666666)
(16, 0.5625)
(16, 0.5520833333333334)
(16, 0.5933263816475496)
(16, 0.47758081334723673)
(16,

[0.6757263355201499,
 0.6860356138706654,
 0.6960600375234521,
 0.6613508442776735,
 0.6857410881801126,
 0.6791744840525328,
 0.7035647279549718,
 0.6913696060037523,
 0.6726078799249531,
 0.6575984990619137]

In [47]:
eval_nested_kfold(model=word2vec_model, name='MR', use_nb=False, model_name='word2vec_tfidf')

Computing sentence vectors...
(10662, 200)
[ 1067  1068  1069 ... 10659 10660 10661]
(1, 0.7489583333333333)
(1, 0.7697916666666667)
(1, 0.7447916666666666)
(1, 0.759375)
(1, 0.7260416666666667)
(1, 0.7664233576642335)
(1, 0.7664233576642335)
(1, 0.7413972888425443)
(1, 0.7236704900938478)
(1, 0.7247132429614181)
(2, 0.7510416666666667)
(2, 0.7677083333333333)
(2, 0.7458333333333333)
(2, 0.7583333333333333)
(2, 0.7333333333333333)
(2, 0.7716371220020855)
(2, 0.7685088633993743)
(2, 0.7403545359749739)
(2, 0.7247132429614181)
(2, 0.7236704900938478)
(4, 0.75)
(4, 0.771875)
(4, 0.7427083333333333)
(4, 0.759375)
(4, 0.7333333333333333)
(4, 0.7695516162669447)
(4, 0.7705943691345151)
(4, 0.7372262773722628)
(4, 0.7267987486965589)
(4, 0.7267987486965589)
(8, 0.746875)
(8, 0.771875)
(8, 0.7427083333333333)
(8, 0.759375)
(8, 0.7375)
(8, 0.7705943691345151)
(8, 0.7716371220020855)
(8, 0.7361835245046924)
(8, 0.7236704900938478)
(8, 0.7278415015641293)
(16, 0.7447916666666666)
(16, 0.76875)
(1

(32, 0.765625)
(32, 0.762252346193952)
(32, 0.735140771637122)
(32, 0.7174139728884255)
(32, 0.7278415015641293)
(64, 0.7479166666666667)
(64, 0.7322916666666667)
(64, 0.7604166666666666)
(64, 0.7427083333333333)
(64, 0.740625)
(64, 0.7604166666666666)
(64, 0.7685088633993743)
(64, 0.7382690302398331)
(64, 0.7174139728884255)
(64, 0.7278415015641293)
(128, 0.746875)
(128, 0.73125)
(128, 0.7583333333333333)
(128, 0.7427083333333333)
(128, 0.734375)
(128, 0.7614583333333333)
(128, 0.7674661105318039)
(128, 0.7382690302398331)
(128, 0.7184567257559958)
(128, 0.7278415015641293)
(256, 0.7458333333333333)
(256, 0.7302083333333333)
(256, 0.7583333333333333)
(256, 0.7395833333333334)
(256, 0.734375)
(256, 0.7614583333333333)
(256, 0.7705943691345151)
(256, 0.7361835245046924)
(256, 0.721584984358707)
(256, 0.7309697601668405)
[0.7418680483142162, 0.7431191345151199, 0.7423898592283628, 0.7414513816475495, 0.7408262730274592, 0.7425981925616962, 0.7436408368091763, 0.7427033368091762, 0.742912

(1, 0.7427083333333333)
(1, 0.7385416666666667)
(1, 0.7697916666666667)
(1, 0.7458333333333333)
(1, 0.7541666666666667)
(1, 0.7333333333333333)
(1, 0.7632950990615224)
(1, 0.7601668404588112)
(1, 0.7205422314911366)
(1, 0.7278415015641293)
(2, 0.7447916666666666)
(2, 0.7322916666666667)
(2, 0.7708333333333334)
(2, 0.7520833333333333)
(2, 0.7614583333333333)
(2, 0.7333333333333333)
(2, 0.7632950990615224)
(2, 0.7570385818561001)
(2, 0.7247132429614181)
(2, 0.7267987486965589)
(4, 0.746875)
(4, 0.73125)
(4, 0.76875)
(4, 0.7489583333333333)
(4, 0.7614583333333333)
(4, 0.7364583333333333)
(4, 0.7612095933263816)
(4, 0.7591240875912408)
(4, 0.7278415015641293)
(4, 0.7288842544316997)
(8, 0.7479166666666667)
(8, 0.7333333333333333)
(8, 0.76875)
(8, 0.7489583333333333)
(8, 0.7583333333333333)
(8, 0.7375)
(8, 0.7632950990615224)
(8, 0.7580813347236705)
(8, 0.7267987486965589)
(8, 0.7278415015641293)
(16, 0.75)
(16, 0.7364583333333333)
(16, 0.7677083333333333)
(16, 0.7479166666666667)
(16, 0.76

[0.7385192127460168,
 0.7441424554826617,
 0.7589118198874296,
 0.7589118198874296,
 0.7429643527204502,
 0.7532833020637899,
 0.7701688555347092,
 0.7382739212007504,
 0.7373358348968105,
 0.7288930581613509]

In [51]:
eval_nested_kfold(model=doc2vec_model, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
(10662, 200)
[ 1067  1068  1069 ... 10659 10660 10661]
(1, 0.68125)
(1, 0.6864583333333333)
(1, 0.715625)
(1, 0.659375)
(1, 0.6760416666666667)
(1, 0.7038581856100105)
(1, 0.6809176225234619)
(1, 0.6652763295099061)
(1, 0.6809176225234619)
(1, 0.6809176225234619)
(2, 0.6802083333333333)
(2, 0.6854166666666667)
(2, 0.7166666666666667)
(2, 0.6583333333333333)
(2, 0.678125)
(2, 0.7049009384775808)
(2, 0.6819603753910324)
(2, 0.6684045881126173)
(2, 0.6798748696558915)
(2, 0.6798748696558915)
(4, 0.6802083333333333)
(4, 0.684375)
(4, 0.7166666666666667)
(4, 0.6583333333333333)
(4, 0.6791666666666667)
(4, 0.7049009384775808)
(4, 0.6819603753910324)
(4, 0.6684045881126173)
(4, 0.6798748696558915)
(4, 0.6777893639207507)
(8, 0.6802083333333333)
(8, 0.684375)
(8, 0.7177083333333333)
(8, 0.6583333333333333)
(8, 0.678125)
(8, 0.7049009384775808)
(8, 0.6840458811261731)
(8, 0.6684045881126173)
(8, 0.6798748696558915)
(8, 0.6777893639207507)
(16, 0.68125)
(16, 0.68437

(16, 0.7)
(16, 0.6830031282586028)
(16, 0.6725755995828988)
(16, 0.6798748696558915)
(16, 0.6798748696558915)
(32, 0.675)
(32, 0.6708333333333333)
(32, 0.69375)
(32, 0.659375)
(32, 0.6885416666666667)
(32, 0.7)
(32, 0.6830031282586028)
(32, 0.6725755995828988)
(32, 0.6798748696558915)
(32, 0.6798748696558915)
(64, 0.675)
(64, 0.6708333333333333)
(64, 0.69375)
(64, 0.659375)
(64, 0.6885416666666667)
(64, 0.7)
(64, 0.6830031282586028)
(64, 0.6725755995828988)
(64, 0.6809176225234619)
(64, 0.6798748696558915)
(128, 0.675)
(128, 0.6708333333333333)
(128, 0.69375)
(128, 0.659375)
(128, 0.6885416666666667)
(128, 0.7)
(128, 0.6830031282586028)
(128, 0.6725755995828988)
(128, 0.6809176225234619)
(128, 0.6798748696558915)
(256, 0.675)
(256, 0.6708333333333333)
(256, 0.69375)
(256, 0.659375)
(256, 0.6885416666666667)
(256, 0.7)
(256, 0.6830031282586028)
(256, 0.6725755995828988)
(256, 0.6809176225234619)
(256, 0.6798748696558915)
[0.6805953467153284, 0.6806997306221759, 0.6802828467153283, 0.680

[0.6738519212746017, 0.6804123711340206, 0.6932457786116323, 0.7157598499061913, 0.650093808630394, 0.6913696060037523, 0.7035647279549718]
[    0     1     2 ... 10659 10660 10661]
(1, 0.675)
(1, 0.6708333333333333)
(1, 0.7020833333333333)
(1, 0.7)
(1, 0.6708333333333333)
(1, 0.684375)
(1, 0.694473409801877)
(1, 0.6840458811261731)
(1, 0.6715328467153284)
(1, 0.6840458811261731)
(2, 0.675)
(2, 0.6697916666666667)
(2, 0.7020833333333333)
(2, 0.7020833333333333)
(2, 0.6697916666666667)
(2, 0.684375)
(2, 0.694473409801877)
(2, 0.6840458811261731)
(2, 0.67570385818561)
(2, 0.6830031282586028)
(4, 0.675)
(4, 0.671875)
(4, 0.7020833333333333)
(4, 0.6989583333333333)
(4, 0.66875)
(4, 0.6854166666666667)
(4, 0.694473409801877)
(4, 0.6840458811261731)
(4, 0.6767466110531803)
(4, 0.6840458811261731)
(8, 0.6760416666666667)
(8, 0.671875)
(8, 0.7020833333333333)
(8, 0.6979166666666666)
(8, 0.6697916666666667)
(8, 0.684375)
(8, 0.694473409801877)
(8, 0.6840458811261731)
(8, 0.6777893639207507)
(8,

[0.6738519212746017,
 0.6804123711340206,
 0.6932457786116323,
 0.7157598499061913,
 0.650093808630394,
 0.6913696060037523,
 0.7035647279549718,
 0.6707317073170732,
 0.6772983114446529,
 0.6838649155722326]

In [49]:
eval_nested_kfold(model=sent2vec_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
(10662, 200)
[ 1067  1068  1069 ... 10659 10660 10661]
(1, 0.75625)
(1, 0.7614583333333333)
(1, 0.7739583333333333)
(1, 0.765625)
(1, 0.76875)
(1, 0.7518248175182481)
(1, 0.7778936392075079)
(1, 0.7528675703858185)
(1, 0.7424400417101147)
(1, 0.7205422314911366)
(2, 0.75625)
(2, 0.7614583333333333)
(2, 0.7739583333333333)
(2, 0.765625)
(2, 0.76875)
(2, 0.7497393117831074)
(2, 0.7778936392075079)
(2, 0.7528675703858185)
(2, 0.7424400417101147)
(2, 0.7194994786235662)
(4, 0.75625)
(4, 0.7614583333333333)
(4, 0.7739583333333333)
(4, 0.765625)
(4, 0.76875)
(4, 0.748696558915537)
(4, 0.7778936392075079)
(4, 0.7539103232533889)
(4, 0.7424400417101147)
(4, 0.7194994786235662)
(8, 0.75625)
(8, 0.7604166666666666)
(8, 0.7739583333333333)
(8, 0.7666666666666667)
(8, 0.7677083333333333)
(8, 0.748696558915537)
(8, 0.7778936392075079)
(8, 0.7539103232533889)
(8, 0.7424400417101147)
(8, 0.7194994786235662)
(16, 0.75625)
(16, 0.759375)
(16, 0.7739583333333333)
(16, 0.766

(32, 0.7768508863399375)
(32, 0.7591240875912408)
(32, 0.735140771637122)
(32, 0.7247132429614181)
(64, 0.7395833333333334)
(64, 0.7427083333333333)
(64, 0.7677083333333333)
(64, 0.76875)
(64, 0.7697916666666667)
(64, 0.74375)
(64, 0.7768508863399375)
(64, 0.7591240875912408)
(64, 0.735140771637122)
(64, 0.7247132429614181)
(128, 0.7395833333333334)
(128, 0.7427083333333333)
(128, 0.7677083333333333)
(128, 0.76875)
(128, 0.7697916666666667)
(128, 0.74375)
(128, 0.7768508863399375)
(128, 0.7591240875912408)
(128, 0.735140771637122)
(128, 0.7247132429614181)
(256, 0.7395833333333334)
(256, 0.7427083333333333)
(256, 0.7677083333333333)
(256, 0.76875)
(256, 0.7697916666666667)
(256, 0.74375)
(256, 0.7768508863399375)
(256, 0.7591240875912408)
(256, 0.735140771637122)
(256, 0.7247132429614181)
[0.7530207247132429, 0.7527080074730621, 0.7524995655196385, 0.7526037321863052, 0.7527078988529718, 0.7528120655196385, 0.7528120655196385, 0.7528120655196385, 0.7528120655196385]
1
[0.74320524835988

[0.7432052483598875, 0.7422680412371134, 0.7645403377110694, 0.7626641651031895, 0.7729831144465291, 0.7523452157598499, 0.7673545966228893]
[    0     1     2 ... 10659 10660 10661]
(1, 0.746875)
(1, 0.7395833333333334)
(1, 0.765625)
(1, 0.765625)
(1, 0.7604166666666666)
(1, 0.7729166666666667)
(1, 0.7413972888425443)
(1, 0.7768508863399375)
(1, 0.7382690302398331)
(1, 0.7278415015641293)
(2, 0.746875)
(2, 0.7385416666666667)
(2, 0.765625)
(2, 0.765625)
(2, 0.7604166666666666)
(2, 0.771875)
(2, 0.7413972888425443)
(2, 0.7758081334723671)
(2, 0.7393117831074035)
(2, 0.7288842544316997)
(4, 0.746875)
(4, 0.7375)
(4, 0.7645833333333333)
(4, 0.765625)
(4, 0.7604166666666666)
(4, 0.7708333333333334)
(4, 0.7403545359749739)
(4, 0.7747653806047967)
(4, 0.7393117831074035)
(4, 0.7299270072992701)
(8, 0.7479166666666667)
(8, 0.7375)
(8, 0.7645833333333333)
(8, 0.7645833333333333)
(8, 0.7604166666666666)
(8, 0.7708333333333334)
(8, 0.7403545359749739)
(8, 0.7747653806047967)
(8, 0.7393117831074

[0.7432052483598875,
 0.7422680412371134,
 0.7645403377110694,
 0.7626641651031895,
 0.7729831144465291,
 0.7523452157598499,
 0.7673545966228893,
 0.7514071294559099,
 0.7476547842401501,
 0.726078799249531]

## Evaluation Result
Higer means better with an exception in MSE.

| S.No. | Model Name                 | Pearson | Spearman | MSE( lower is better ) | Mean SUBJ | Mean MR | TREC |
|-------|----------------------------|---------|----------|------------------------|-----------|---------|------|
| 1.    | Doc2Vec                          | 0.35 | 0.34 | 4.54       | 0.797     | 0.715   | 0.542|
| 2.    | Sent2Vec                         | 0.52 | 0.45 | 1.62       | 0.890     | 0.772   | 0.642|
| 3.    | Word2Vec with simple average     | 0.60 | 0.51 | 1.88       | 0.925     | 0.757   | 0.688|
| 4.    | Word2Vec with TF-IDF             | 0.63 | 0.51 | 0.97       | 0.913     | 0.758   | 0.684|
| 5.    | word2Vec with SIF                | 0.59 | 0.50 | 1.71       | 0.904     | 0.703   | 0.484|
| 6.    | LSI model                        | TO DO| TO DO| TO DO      | TO DO     | TO DO   | TO DO|