This notebook evaluates Gensim's sent2vec implementation and compares it with the [original C++ implementation of sent2vec](https://github.com/epfml/sent2vec), Gensim's Doc2Vec and Gensim's FastText. The evaluation scripts used in this notebook are based on the code provided by [Kiros et al., 2015](https://github.com/ryankiros/skip-thoughts).

In [None]:
# Additional installations required to run the notebook
! pip install scipy
! pip install scikit-learn
! pip install keras
! pip install tensorflow

# Download datasets for evaluation

In [None]:
# Download SICK data
# Unzip the following files and place their contents in a folder named SICK2014
! wget http://alt.qcri.org/semeval2014/task1/data/uploads/sick_train.zip
! wget http://alt.qcri.org/semeval2014/task1/data/uploads/sick_trial.zip
! wget http://alt.qcri.org/semeval2014/task1/data/uploads/sick_test_annotated.zip

In [None]:
# Download SUBJ and MR data
# Extract the following archives and place their contents in a folder named classification_data
! wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz
! wget https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

In [None]:
# Download TREC data
# Place the following files (plain text, no extraction needed) in a folder named trec_data
! wget http://cogcomp.org/Data/QA/QC/train_5500.label
! wget http://cogcomp.org/Data/QA/QC/TREC_10.label
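
The download cells above leave the archives in the working directory. A minimal sketch of the extraction step follows, using only the standard library. The helper name `extract_flat` is hypothetical, and flattening nested archive paths into the target folder is an assumption about the layout the loaders further down expect.

```python
import os
import shutil
import tarfile
import zipfile


def extract_flat(archive_path, dest_dir):
    """Copy every regular file in a .zip or .tar.gz archive into dest_dir,
    dropping any directory prefix stored inside the archive."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    written = []
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                name = os.path.basename(info.filename)
                if not name:  # directory entry, nothing to copy
                    continue
                target = os.path.join(dest_dir, name)
                with zf.open(info) as src, open(target, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
                written.append(target)
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as tf:
            for member in tf.getmembers():
                if not member.isfile():
                    continue
                target = os.path.join(dest_dir, os.path.basename(member.name))
                src = tf.extractfile(member)
                with open(target, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
                src.close()
                written.append(target)
    else:
        raise ValueError('Unsupported archive: %s' % archive_path)
    return written
```

For example, `extract_flat('sick_train.zip', 'SICK2014')` or `extract_flat('rt-polaritydata.tar.gz', 'classification_data')`. Files with the same base name in different archive subdirectories would overwrite each other, so check the returned list if in doubt.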

# Add necessary imports

In [1]:
import gensim
import os
from gensim.models.word2vec import LineSentence
from gensim.models.sent2vec import Sent2Vec
from gensim.models.fasttext import FastText
from gensim.utils import tokenize as gensim_tokenize
import re
import time
import numpy as np
import smart_open
import logging
import sys
from collections import Counter
from scipy.sparse import lil_matrix
from scipy.sparse import csr_matrix
import os.path
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
from sklearn.utils import shuffle
import subprocess
from sklearn.metrics import mean_squared_error as mse
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from scipy.sparse import hstack
from numpy.random import RandomState
logging.basicConfig(stream=sys.stdout, level=logging.INFO)

Using TensorFlow backend.


# Prepare Evaluation Scripts

In [2]:
# Naive-Bayes features
# Derived from https://github.com/mesnilgr/nbsvm
def tokenize(sentence, grams):
    words = sentence.split()
    tokens = []
    for gram in grams:
        for i in range(len(words) - gram + 1):
            tokens += ["_*_".join(words[i:i+gram])]
    return tokens


def build_dict(X, grams):
    dic = Counter()
    for sentence in X:
        dic.update(tokenize(sentence, grams))
    return dic


def compute_ratio(poscounts, negcounts, alpha=1):
    alltokens = list(set(poscounts) | set(negcounts))
    dic = dict((t, i) for i, t in enumerate(alltokens))
    d = len(dic)
    p, q = np.ones(d) * alpha , np.ones(d) * alpha
    for t in alltokens:
        p[dic[t]] += poscounts[t]
        q[dic[t]] += negcounts[t]
    p /= abs(p).sum()
    q /= abs(q).sum()
    r = np.log(p/q)
    return dic, r


def process_text(text, dic, r, grams):
    """
    Return sparse feature matrix
    """
    X = lil_matrix((len(text), len(dic)))
    for i, l in enumerate(text):
        tokens = tokenize(l, grams)
        indexes = []
        for t in tokens:
            try:
                indexes += [dic[t]]
            except KeyError:
                pass
        indexes = list(set(indexes))
        indexes.sort()
        for j in indexes:
            X[i,j] = r[j]
    return csr_matrix(X)
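
To make the log-count ratio concrete, here is a small self-contained sketch on a toy corpus. It mirrors `tokenize`, `build_dict` and `compute_ratio` above, but returns a plain dictionary and uses only the standard library. The function names `ngrams` and `log_count_ratio` and the toy sentences are illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(sentence, grams):
    """Return word n-grams joined with '_*_', as tokenize() does above."""
    words = sentence.split()
    return ['_*_'.join(words[i:i + n])
            for n in grams
            for i in range(len(words) - n + 1)]


def log_count_ratio(pos_sentences, neg_sentences, grams=(1, 2), alpha=1.0):
    """Smoothed log-count ratio r[t] = log(p[t] / q[t]) for every token t."""
    pos, neg = Counter(), Counter()
    for s in pos_sentences:
        pos.update(ngrams(s, grams))
    for s in neg_sentences:
        neg.update(ngrams(s, grams))
    vocab = sorted(set(pos) | set(neg))
    # Add-alpha smoothing, then normalise each count vector to sum to 1
    p = dict((t, alpha + pos[t]) for t in vocab)
    q = dict((t, alpha + neg[t]) for t in vocab)
    p_total, q_total = sum(p.values()), sum(q.values())
    return dict((t, math.log((p[t] / p_total) / (q[t] / q_total))) for t in vocab)


ratios = log_count_ratio(['good film', 'good plot'], ['bad film'])
# Tokens seen mostly in the first class get a positive ratio
print(sorted(ratios, key=ratios.get, reverse=True)[:2])
```

Tokens that occur mostly in the first class receive a positive ratio and tokens that occur mostly in the second class a negative one; `process_text` then writes these ratios into the sparse feature matrix.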

In [3]:
def ft_sent_vec(model, sentence):
    sent_vec = np.zeros(model.vector_size)
    cnt = 0
    for word in sentence:
        if word in model.wv.vocab:
            sent_vec += model[word]
            cnt += 1
    if cnt > 0:
        sent_vec *= (1.0 / cnt)
    return sent_vec


def original_sent2vec(sentences):
    with smart_open.smart_open('./input.txt', 'w') as f:
        for sentence in sentences:
            f.write(sentence + '\n')
    with smart_open.smart_open('./output.txt', 'w') as f1, smart_open.smart_open('./input.txt') as f2:
        p = subprocess.Popen('./sent2vec/fasttext print-sentence-vectors ./sent2vec/my_model.bin', shell=True, stdin=f2, stdout=f1, stderr=f1)
        p.communicate()
    with smart_open.smart_open('./output.txt') as f:
        input_ = []
        for line in f:
            line = line.strip().split()
            input_.append([float(j) for j in line])
    return np.array(input_)


def sent2vec_vectors(model, examples):
    ans = []
    for example in examples:
        res = model[example.split(' ')]
        ans.append(res)
    return np.array(ans)
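
`ft_sent_vec` represents a sentence as the average of the vectors of its in-vocabulary words. The sketch below shows the same idea with a hypothetical two-dimensional lookup table in place of a trained model; the names `average_vector` and `toy_vectors` are assumptions made for illustration.

```python
def average_vector(word_vectors, tokens, dim):
    """Average the vectors of the tokens found in word_vectors.

    Out-of-vocabulary tokens are skipped; if no token is known, a zero
    vector is returned, matching the behaviour of ft_sent_vec above.
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count:
        total = [t / count for t in total]
    return total


toy_vectors = {'storm': [1.0, 0.0], 'fire': [0.0, 1.0]}
print(average_vector(toy_vectors, 'storm and fire'.split(), dim=2))  # [0.5, 0.5]
```

The word `and` is not in the toy table, so it is ignored and the result is the mean of the two known vectors.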

## Evaluation code for SICK data set

In [4]:
def evaluate_sick(model=None, model_name='sent2vec', evaltest=True, loc='./SICK2014', seed=42):
    """
    Run experiment
    """
    print 'Preparing data...'
    train, dev, test, scores = load_sick_data(loc)
    train[0], train[1], scores[0] = shuffle(train[0], train[1], scores[0], random_state=seed)
    
    print 'Computing training sentence vectors...'
    if model_name == 'sent2vec':
        trainA = sent2vec_vectors(model, train[0])
        trainB = sent2vec_vectors(model, train[1])
    elif model_name == 'doc2vec':
        trainA = np.array([model.infer_vector(example.split(' ')) for example in train[0]])
        trainB = np.array([model.infer_vector(example.split(' ')) for example in train[1]])
    elif model_name == 'original_sent2vec':
        trainA = original_sent2vec(train[0])
        trainB = original_sent2vec(train[1])
    else:
        trainA = np.array([ft_sent_vec(model, example.split(' ')) for example in train[0]])
        trainB = np.array([ft_sent_vec(model, example.split(' ')) for example in train[1]])
    
    print 'Computing development sentence vectors...'
    if model_name == 'sent2vec':
        devA = sent2vec_vectors(model, dev[0])
        devB = sent2vec_vectors(model, dev[1])
    elif model_name == 'doc2vec':
        devA = np.array([model.infer_vector(example.split(' ')) for example in dev[0]])
        devB = np.array([model.infer_vector(example.split(' ')) for example in dev[1]])
    elif model_name == 'original_sent2vec':
        devA = original_sent2vec(dev[0])
        devB = original_sent2vec(dev[1])
    else:
        devA = np.array([ft_sent_vec(model, example.split(' ')) for example in dev[0]])
        devB = np.array([ft_sent_vec(model, example.split(' ')) for example in dev[1]])

    print 'Computing feature combinations...'
    trainF = np.c_[np.abs(trainA - trainB), trainA * trainB]
    devF = np.c_[np.abs(devA - devB), devA * devB]

    print 'Encoding labels...'
    trainY = encode_labels(scores[0])
    devY = encode_labels(scores[1])

    print 'Compiling model...'
    lrmodel = prepare_model(ninputs=trainF.shape[1])

    print 'Training...'
    bestlrmodel = train_model(lrmodel, trainF, trainY, devF, devY, scores[1])

    if evaltest:
        print 'Computing test sentence vectors...'
        if model_name == 'sent2vec':
            testA = sent2vec_vectors(model, test[0])
            testB = sent2vec_vectors(model, test[1])
        elif model_name == 'doc2vec':
            testA = np.array([model.infer_vector(example.split(' ')) for example in test[0]])
            testB = np.array([model.infer_vector(example.split(' ')) for example in test[1]])
        elif model_name == 'original_sent2vec':
            testA = original_sent2vec(test[0])
            testB = original_sent2vec(test[1])
        else:
            testA = np.array([ft_sent_vec(model, example.split(' ')) for example in test[0]])
            testB = np.array([ft_sent_vec(model, example.split(' ')) for example in test[1]])

        print 'Computing feature combinations...'
        testF = np.c_[np.abs(testA - testB), testA * testB]

        print 'Evaluating...'
        r = np.arange(1,6)
        yhat = np.dot(bestlrmodel.predict_proba(testF, verbose=0), r)
        pr = pearsonr(yhat, scores[2])[0]
        sr = spearmanr(yhat, scores[2])[0]
        se = mse(yhat, scores[2])
        print 'Test Pearson: ' + str(pr)
        print 'Test Spearman: ' + str(sr)
        print 'Test MSE: ' + str(se)

        return yhat


def prepare_model(ninputs=9600, nclass=5):
    """
    Set up and compile the model architecture (Logistic regression)
    """
    lrmodel = Sequential()
    lrmodel.add(Dense(input_dim=ninputs, output_dim=nclass))
    lrmodel.add(Activation('softmax'))
    lrmodel.compile(loss='categorical_crossentropy', optimizer='adam')
    return lrmodel


def train_model(lrmodel, X, Y, devX, devY, devscores):
    """
    Train model, using pearsonr on dev for early stopping
    """
    done = False
    best = -1.0
    r = np.arange(1,6)
    prev_val_loss = 1000000007
    
    while not done:
        # After each call to fit(), check Pearson on the development set
        train_history = lrmodel.fit(X, Y, verbose=0, shuffle=False, validation_data=(devX, devY))
        val_loss = round(np.min(train_history.history['val_loss']), 4)
        yhat = np.dot(lrmodel.predict_proba(devX, verbose=0), r)
        score = pearsonr(yhat, devscores)[0]
        if score > best and val_loss < prev_val_loss:
            # print score
            best = score
            bestlrmodel = prepare_model(ninputs=X.shape[1])
            bestlrmodel.set_weights(lrmodel.get_weights())
            prev_val_loss = val_loss
        else:
            done = True

    yhat = np.dot(bestlrmodel.predict_proba(devX, verbose=0), r)
    score = pearsonr(yhat, devscores)[0]
    print 'Dev Pearson: ' + str(score)
    return bestlrmodel
    

def encode_labels(labels, nclass=5):
    """
    Label encoding from Tree LSTM paper (Tai, Socher, Manning)
    """
    Y = np.zeros((len(labels), nclass)).astype('float32')
    for j, y in enumerate(labels):
        for i in range(nclass):
            if i+1 == np.floor(y) + 1:
                Y[j,i] = y - np.floor(y)
            if i+1 == np.floor(y):
                Y[j,i] = np.floor(y) - y + 1
    return Y


def load_sick_data(loc='./data/'):
    """
    Load the SICK semantic-relatedness dataset
    """
    trainA, trainB, devA, devB, testA, testB = [],[],[],[],[],[]
    trainS, devS, testS = [],[],[]

    with smart_open.smart_open(os.path.join(loc, 'SICK_train.txt'), 'rb') as f:
        for line in f:
            text = line.strip().split('\t')
            trainA.append(text[1])
            trainB.append(text[2])
            trainS.append(text[3])
    with smart_open.smart_open(os.path.join(loc, 'SICK_trial.txt'), 'rb') as f:
        for line in f:
            text = line.strip().split('\t')
            devA.append(text[1])
            devB.append(text[2])
            devS.append(text[3])
    with smart_open.smart_open(os.path.join(loc, 'SICK_test_annotated.txt'), 'rb') as f:
        for line in f:
            text = line.strip().split('\t')
            testA.append(text[1])
            testB.append(text[2])
            testS.append(text[3])

    trainS = [float(s) for s in trainS[1:]]
    devS = [float(s) for s in devS[1:]]
    testS = [float(s) for s in testS[1:]]

    return [trainA[1:], trainB[1:]], [devA[1:], devB[1:]], [testA[1:], testB[1:]], [trainS, devS, testS]
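
`encode_labels` spreads a real-valued relatedness score over its two neighbouring integer classes, and `evaluate_sick` later recovers a score as the dot product of the predicted class probabilities with `[1, 2, 3, 4, 5]`. A minimal standard-library sketch of that round trip follows; the helper names `encode_score` and `expected_score` are hypothetical.

```python
import math


def encode_score(y, nclass=5):
    """Spread a score y in [1, nclass] over the classes floor(y) and floor(y) + 1."""
    target = [0.0] * nclass
    low = int(math.floor(y))
    frac = y - low
    target[low - 1] = 1.0 - frac      # weight on class floor(y)
    if low < nclass:
        target[low] = frac            # weight on class floor(y) + 1
    return target


def expected_score(probs):
    """Expected class value: sum of k * P(class k) for k = 1..nclass."""
    return sum((k + 1) * p for k, p in enumerate(probs))


encoded = encode_score(3.6)
print(encoded)                  # weight split between classes 3 and 4
print(expected_score(encoded))  # recovers the score up to floating-point rounding
```

Because the two weights sum to one and sit on adjacent classes, the expected class value of an encoded target equals the original score, which is why the same dot product can turn softmax outputs back into relatedness predictions.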

## Evaluation code for CLASSIFICATION (MR/SUBJ) data set

In [5]:
def load_classification_data(name, model=None, model_name='sent2vec', loc='./classification_data/', seed=42):
    """
    Load one of MR or SUBJ
    """
    z = {}
    if name == 'MR':
        pos, neg = load_rt(loc=loc)
    elif name == 'SUBJ':
        pos, neg = load_subj(loc=loc)
    else:
        raise ValueError("name must be 'MR' or 'SUBJ', got %r" % name)

    labels = compute_labels(pos, neg)
    text, labels = shuffle_data(pos+neg, labels, seed=seed)
    text = [texts.encode('utf-8') for texts in text]
    z['text'] = text
    z['labels'] = labels
    print 'Computing sentence vectors...'
    if model_name == 'sent2vec':
        features = sent2vec_vectors(model, text)
    elif model_name == 'doc2vec':
        features = np.array([model.infer_vector(example.split(' ')) for example in text])
    elif model_name == 'original_sent2vec':
        features = original_sent2vec(text)
    else:
        features = np.array([ft_sent_vec(model, example.split(' ')) for example in text])
    return z, features


def load_rt(loc='./classification_data/'):
    """
    Load the MR dataset
    """
    pos, neg = [], []
    with smart_open.smart_open(os.path.join(loc, 'rt-polarity.pos'), 'rb') as f:
        for line in f:
            pos.append(line.decode('latin-1').strip())
    with smart_open.smart_open(os.path.join(loc, 'rt-polarity.neg'), 'rb') as f:
        for line in f:
            neg.append(line.decode('latin-1').strip())
    return pos, neg


def load_subj(loc='./classification_data/'):
    """
    Load the SUBJ dataset
    """
    pos, neg = [], []
    with smart_open.smart_open(os.path.join(loc, 'plot.tok.gt9.5000'), 'rb') as f:
        for line in f:
            pos.append(line.decode('latin-1').strip())
    with smart_open.smart_open(os.path.join(loc, 'quote.tok.gt9.5000'), 'rb') as f:
        for line in f:
            neg.append(line.decode('latin-1').strip())
    return pos, neg


def compute_labels(pos, neg):
    """
    Construct list of labels
    """
    labels = np.zeros(len(pos) + len(neg))
    labels[:len(pos)] = 1.0
    labels[len(pos):] = 0.0
    return labels


def shuffle_data(X, L, seed=42):
    """
    Shuffle the data
    """
    prng = RandomState(seed)
    inds = np.arange(len(X))
    prng.shuffle(inds)
    X = [X[i] for i in inds]
    L = L[inds]
    return (X, L)
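
`shuffle_data` applies one seeded permutation to both the texts and the labels so that each sentence keeps its label. A minimal sketch of the same pattern with the standard library is shown below. It uses `random.Random` rather than NumPy's `RandomState`, so it will not reproduce the notebook's exact ordering, and the name `shuffle_together` is an assumption.

```python
import random


def shuffle_together(items, labels, seed=42):
    """Apply one seeded permutation to both lists so pairs stay aligned."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    return [items[i] for i in order], [labels[i] for i in order]


texts = ['a moving story', 'a dull plot', 'sharp and funny']
labels = [1.0, 0.0, 1.0]
shuffled_texts, shuffled_labels = shuffle_together(texts, labels)
print(list(zip(shuffled_texts, shuffled_labels)))
```

Fixing the seed makes the folds used in cross-validation repeatable across runs.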

In [6]:
def eval_nested_kfold(name, model=None, model_name='sent2vec', loc='./classification_data/', k=10, seed=42, use_nb=False):
    """
    Evaluate features with nested K-fold cross validation
    Outer loop: Held-out evaluation
    Inner loop: Hyperparameter tuning

    Datasets can be found at http://nlp.stanford.edu/~sidaw/home/projects:nbsvm
    Options for name are 'MR' and 'SUBJ'
    """
    # Load the dataset and extract features
    z, features = load_classification_data(name, model, loc=loc, seed=seed, model_name=model_name)

    scan = [2**t for t in range(0,9,1)]
    npts = len(z['text'])
    kf = KFold(npts, n_folds=k, random_state=seed)
    scores = []
    for train, test in kf:

        # Split data
        X_train = features[train]
        y_train = z['labels'][train]
        X_test = features[test]
        y_test = z['labels'][test]

        Xraw = [z['text'][i] for i in train]
        Xraw_test = [z['text'][i] for i in test]

        scanscores = []
        for s in scan:

            # Inner KFold
            innerkf = KFold(len(X_train), n_folds=k, random_state=seed+1)
            innerscores = []
            for innertrain, innertest in innerkf:
        
                # Split data
                X_innertrain = X_train[innertrain]
                y_innertrain = y_train[innertrain]
                X_innertest = X_train[innertest]
                y_innertest = y_train[innertest]

                Xraw_innertrain = [Xraw[i] for i in innertrain]
                Xraw_innertest = [Xraw[i] for i in innertest]

                # NB (if applicable)
                if use_nb:
                    NBtrain, NBtest = compute_nb(Xraw_innertrain, y_innertrain, Xraw_innertest)
                    X_innertrain = hstack((X_innertrain, NBtrain))
                    X_innertest = hstack((X_innertest, NBtest))

                # Train classifier
                clf = LogisticRegression(C=s)
                clf.fit(X_innertrain, y_innertrain)
                acc = clf.score(X_innertest, y_innertest)
                innerscores.append(acc)
                # print (s, acc)

            # Append mean score
            scanscores.append(np.mean(innerscores))

        # Get the index of the best score
        s_ind = np.argmax(scanscores)
        s = scan[s_ind]
        # print scanscores
        # print s
 
        # NB (if applicable)
        if use_nb:
            NBtrain, NBtest = compute_nb(Xraw, y_train, Xraw_test)
            X_train = hstack((X_train, NBtrain))
            X_test = hstack((X_test, NBtest))
       
        # Train classifier
        clf = LogisticRegression(C=s)
        clf.fit(X_train, y_train)

        # Evaluate
        acc = clf.score(X_test, y_test)
        scores.append(acc)
        print scores

    return scores


def compute_nb(X, y, Z):
    """
    Compute NB features
    """
    labels = [int(t) for t in y]
    ptrain = [X[i] for i in range(len(labels)) if labels[i] == 0]
    ntrain = [X[i] for i in range(len(labels)) if labels[i] == 1]
    poscounts = build_dict(ptrain, [1,2])
    negcounts = build_dict(ntrain, [1,2])
    dic, r = compute_ratio(poscounts, negcounts)
    trainX = process_text(X, dic, r, [1,2])
    devX = process_text(Z, dic, r, [1,2])
    return trainX, devX
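
The control flow of `eval_nested_kfold` can be hard to follow through the feature handling. The sketch below isolates it: an outer loop holds out one fold for evaluation while an inner loop over the remaining data picks the hyperparameter. It is an illustration only; `kfold_indices`, `nested_kfold` and the `fit_and_score` callback are hypothetical names, and contiguous, unshuffled folds are an assumption of this sketch.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train, test) index lists."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def nested_kfold(n, k, candidates, fit_and_score):
    """Return one held-out score per outer fold.

    fit_and_score(train_idx, test_idx, c) is a caller-supplied callback that
    trains with hyperparameter c and returns an accuracy.
    """
    outer_scores = []
    for train, test in kfold_indices(n, k):
        mean_scores = []
        for c in candidates:
            # Inner folds index into the outer training set only
            inner = [fit_and_score([train[i] for i in tr], [train[i] for i in te], c)
                     for tr, te in kfold_indices(len(train), k)]
            mean_scores.append(sum(inner) / float(len(inner)))
        best_c = candidates[mean_scores.index(max(mean_scores))]
        outer_scores.append(fit_and_score(train, test, best_c))
    return outer_scores


# Toy callback whose score peaks at c == 4, standing in for a real classifier
toy_score = lambda train_idx, test_idx, c: 1.0 / (1 + abs(c - 4))
print(nested_kfold(10, 2, [1, 4, 16], toy_score))
```

The held-out fold never influences the choice of `c`, which is what keeps the reported accuracy an honest estimate.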

## Evaluation code for TREC data set

In [7]:
def evaluate_trec(model=None, model_name='sent2vec', k=10, seed=1234, evalcv=True, evaltest=True, loc='./trec_data/'):
    """
    Run experiment
    k: number of CV folds
    evalcv: whether to select C by cross-validation on the training set
    evaltest: whether to evaluate on the test set
    """
    print 'Preparing data...'
    traintext, testtext = load_trec_data(loc)
    train, train_labels = prepare_data(traintext)
    test, test_labels = prepare_data(testtext)
    train_labels = prepare_labels(train_labels)
    test_labels = prepare_labels(test_labels)
    train, train_labels = shuffle(train, train_labels, random_state=seed)

    print 'Computing training sentence vectors...'
    if model_name == 'sent2vec':
        trainF = sent2vec_vectors(model, train)
    elif model_name == 'doc2vec':
        trainF = np.array([model.infer_vector(example.split(' ')) for example in train])
    elif model_name == 'original_sent2vec':
        trainF = original_sent2vec(train)
    else:
        trainF = np.array([ft_sent_vec(model, example.split(' ')) for example in train])
    
    if evalcv:
        print 'Running cross-validation...'
        interval = [2**t for t in range(0,9,1)]     # coarse-grained
        C = eval_kfold(trainF, train_labels, k=k, scan=interval, seed=seed)

    if evaltest:
        if not evalcv:
            C = 128     # Best parameter found from CV

        print 'Computing testing sentence vectors...'
        if model_name == 'sent2vec':
            testF = sent2vec_vectors(model, test)
        elif model_name == 'doc2vec':
            testF = np.array([model.infer_vector(example.split(' ')) for example in test])
        elif model_name == 'original_sent2vec':
            testF = original_sent2vec(test)
        else:
            testF = np.array([ft_sent_vec(model, example.split(' ')) for example in test])

        print 'Evaluating...'
        clf = LogisticRegression(C=C)
        clf.fit(trainF, train_labels)
        clf.predict(testF)
        print 'Test accuracy: ' + str(clf.score(testF, test_labels))


def load_trec_data(loc='./trec_data/'):
    """
    Load the TREC question-type dataset
    """
    train, test = [], []
    with smart_open.smart_open(os.path.join(loc, 'train_5500.label'), 'rb') as f:
        for line in f:
            train.append(line.strip())
    with smart_open.smart_open(os.path.join(loc, 'TREC_10.label'), 'rb') as f:
        for line in f:
            test.append(line.strip())
    return train, test


def prepare_data(text):
    """
    Prepare data
    """
    labels = [t.split()[0] for t in text]
    labels = [l.split(':')[0] for l in labels]
    X = [t.split()[1:] for t in text]
    X = [' '.join(t) for t in X]
    return X, labels


def prepare_labels(labels):
    """
    Process labels to numerical values
    """
    d = {}
    count = 0
    # Sort the labels so that the label-to-index mapping is deterministic
    setlabels = sorted(set(labels))
    for w in setlabels:
        d[w] = count
        count += 1
    idxlabels = np.array([d[w] for w in labels])
    return idxlabels


def eval_kfold(features, labels, k=10, scan=[2**t for t in range(0,9,1)], seed=1234):
    """
    Perform k-fold cross validation
    """
    npts = len(features)
    kf = KFold(npts, n_folds=k, random_state=seed)
    scores = []

    for s in scan:

        scanscores = []

        for train, test in kf:

            # Split data
            X_train = features[train]
            y_train = labels[train]
            X_test = features[test]
            y_test = labels[test]

            # Train classifier
            clf = LogisticRegression(C=s)
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            scanscores.append(score)
            print (s, score)

        # Append mean score
        scores.append(np.mean(scanscores))
        print scores

    # Get the index of the best score
    s_ind = np.argmax(scores)
    s = scan[s_ind]
    print (s_ind, s)
    return s
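
Each line of the TREC files starts with a `COARSE:fine` label followed by the question text, and `prepare_data` keeps only the coarse part. The sketch below shows that parsing step and a label-to-index mapping on made-up input. The helper names are hypothetical, and sorting the label set is a choice made here so that the mapping is reproducible.

```python
def parse_trec_line(line):
    """Split 'COARSE:fine question text' into (coarse_label, question)."""
    tokens = line.split()
    coarse = tokens[0].split(':')[0]
    return coarse, ' '.join(tokens[1:])


def index_labels(labels):
    """Map each distinct label to an integer id, in sorted label order."""
    mapping = dict((label, i) for i, label in enumerate(sorted(set(labels))))
    return [mapping[label] for label in labels], mapping


label, question = parse_trec_line('DESC:manner How did serfdom develop ?')
print(label)     # DESC
print(question)  # How did serfdom develop ?
print(index_labels(['HUM', 'DESC', 'HUM'])[0])  # [1, 0, 1]
```

Using the same mapping for the training and test labels matters, because the classifier only ever sees the integer ids.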

# Evaluation on Lee Corpus

First, we train our models on the Lee Background Corpus included in gensim. This corpus contains 300 documents selected from the Australian Broadcasting Corporation’s news mail service, which provides text e-mails of headline stories and covers a number of broad topics.

## Prepare Training Data

In [8]:
data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep
lee_train_file = os.path.join(data_dir, 'lee_background.cor')


# Prepare training data for Sent2Vec
lee_data = []
with smart_open.smart_open(lee_train_file) as f1, smart_open.smart_open("./input.txt",'w') as f2:
    for line in f1:
        if line not in ['\n', '\r\n']:
            line = re.split(r'\.|\?|\n', line.strip())
            for sentence in line:
                if len(sentence) > 1:
                    sentence = gensim_tokenize(sentence)
                    lee_data.append(list(sentence))
                    f2.write(' '.join(lee_data[-1]) + '\n')


# Prepare training data for Doc2Vec
def read_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
lee_doc2vec_data = list(read_corpus(lee_train_file))


# Prepare training data for FastText
lee_fasttext_data = LineSentence(lee_train_file)


# Print sample training data
for sentence in lee_data[:5]:
    print sentence,'\n'

[u'Hundreds', u'of', u'people', u'have', u'been', u'forced', u'to', u'vacate', u'their', u'homes', u'in', u'the', u'Southern', u'Highlands', u'of', u'New', u'South', u'Wales', u'as', u'strong', u'winds', u'today', u'pushed', u'a', u'huge', u'bushfire', u'towards', u'the', u'town', u'of', u'Hill', u'Top'] 

[u'A', u'new', u'blaze', u'near', u'Goulburn', u'south', u'west', u'of', u'Sydney', u'has', u'forced', u'the', u'closure', u'of', u'the', u'Hume', u'Highway'] 

[u'At', u'about', u'pm', u'AEDT', u'a', u'marked', u'deterioration', u'in', u'the', u'weather', u'as', u'a', u'storm', u'cell', u'moved', u'east', u'across', u'the', u'Blue', u'Mountains', u'forced', u'authorities', u'to', u'make', u'a', u'decision', u'to', u'evacuate', u'people', u'from', u'homes', u'in', u'outlying', u'streets', u'at', u'Hill', u'Top', u'in', u'the', u'New', u'South', u'Wales', u'southern', u'highlands'] 

[u'An', u'estimated', u'residents', u'have', u'left', u'their', u'homes', u'for', u'nearby', u'Mittago
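
The Sent2Vec preparation step above splits each document into sentences and tokenizes them. A standalone sketch of that step is given below. The regular expression used for tokenizing is a stand-in that keeps runs of letters only; it approximates, but is not, `gensim.utils.tokenize`, and the function name `split_sentences` is an assumption.

```python
import re


def split_sentences(paragraph):
    """Split on '.', '?' or newline and tokenize each piece longer than one character."""
    sentences = []
    for piece in re.split(r'\.|\?|\n', paragraph.strip()):
        if len(piece) > 1:
            # Stand-in tokenizer (assumption): runs of letters, digits dropped
            sentences.append(re.findall(r'[^\W\d_]+', piece, re.UNICODE))
    return sentences


print(split_sentences('Fire crews arrived at 3 pm. Is the highway open? Yes.'))
# [['Fire', 'crews', 'arrived', 'at', 'pm'], ['Is', 'the', 'highway', 'open'], ['Yes']]
```

Splitting on every period also cuts decimal numbers and abbreviations in two, and digits are discarded by the tokenizer, which explains fragments such as `At about pm AEDT` in the sample output.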

## Train Models

In [9]:
# Train new Gensim sent2vec model
% time sent2vec_model = Sent2Vec(lee_data, size=100, epochs=20, seed=42, workers=4)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 0.06 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 1531, tokens read: 60302
INFO:gensim.models.sent2vec:training model with 4 workers on 1531 vocabulary and 100 features
INFO:gensim.models.sent2vec:PROGRESS: at 9.11% words, 85690 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 18.22% words, 85582 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 28.98% words, 90985 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 40.57% words, 93338 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 52.99% words, 98055 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 64.59% words, 99523 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 74.53% words, 99278 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 85.30% words, 99112 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 97.72% words, 100813 words/s
INFO:gensim.models.sent2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.

In [9]:
# Train new original c++ sent2vec model
#! git clone https://github.com/epfml/sent2vec.git
% cd sent2vec
#! make
! time ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 20 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 4 -t 0.0001 -dropoutK 2 -bucket 2000000
% cd ..

/Users/prerna135/Documents/GitHub/gensim/docs/notebooks/sent2vec
Read 0M words
Number of words:  1837
Number of labels: 0
Progress: 100.0%  words/sec/thread: 91638  lr: 0.000000  loss: 3.132677  eta: 0h0m

real	0m13.698s
user	0m16.583s
sys	0m2.638s
/Users/prerna135/Documents/GitHub/gensim/docs/notebooks


In [11]:
# Doc2Vec model1 with PV-DM and sum of context word vectors
% time doc2vec_model1 = gensim.models.doc2vec.Doc2Vec(size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
% time doc2vec_model1.build_vocab(lee_doc2vec_data)
% time doc2vec_model1.train(lee_doc2vec_data, total_examples=doc2vec_model1.corpus_count, epochs=doc2vec_model1.iter)

CPU times: user 217 µs, sys: 776 µs, total: 993 µs
Wall time: 1 ms
INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting lay

724836

In [12]:
# Doc2Vec model2 with PV-DBOW and sum of context word vectors
% time doc2vec_model2 = gensim.models.doc2vec.Doc2Vec(dm=0, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
% time doc2vec_model2.build_vocab(lee_doc2vec_data)
% time doc2vec_model2.train(lee_doc2vec_data, total_examples=doc2vec_model2.corpus_count, epochs=doc2vec_model2.iter)

CPU times: user 124 µs, sys: 34 µs, total: 158 µs
Wall time: 146 µs
INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting la

724572

In [13]:
# Doc2Vec model3 with PV-DM and mean of context word vectors
% time doc2vec_model3 = gensim.models.doc2vec.Doc2Vec(dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
% time doc2vec_model3.build_vocab(lee_doc2vec_data)
% time doc2vec_model3.train(lee_doc2vec_data, total_examples=doc2vec_model3.corpus_count, epochs=doc2vec_model3.iter)

CPU times: user 99 µs, sys: 30 µs, total: 129 µs
Wall time: 117 µs
INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting lay

724840

In [14]:
# Doc2Vec model4 with PV-DBOW and mean of context word vectors
% time doc2vec_model4 = gensim.models.doc2vec.Doc2Vec(dm=0, dm_mean=1, size=100, min_count=5, iter=20, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
% time doc2vec_model4.build_vocab(lee_doc2vec_data)
% time doc2vec_model4.train(lee_doc2vec_data, total_examples=doc2vec_model4.corpus_count, epochs=doc2vec_model4.iter)

CPU times: user 122 µs, sys: 20 µs, total: 142 µs
Wall time: 137 µs
INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1750 unique words (25% of original 6981, drops 5231)
INFO:gensim.models.word2vec:min_count=5 leaves 49335 word corpus (84% of original 58152, drops 8817)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 6981 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 51 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 35935 word corpus (72.8% of prior 49335)
INFO:gensim.models.word2vec:estimated required memory for 1750 words and 100 dimensions: 2395000 bytes
INFO:gensim.models.word2vec:resetting la

724538

In [15]:
# Train new Gensim fasttext model
% time fasttext_model = FastText(size=100, alpha=0.2, negative=10, max_vocab_size=30000000, seed=42, iter=20)
% time fasttext_model.build_vocab(lee_fasttext_data)
% time fasttext_model.train(lee_fasttext_data, total_examples=fasttext_model.corpus_count, epochs=fasttext_model.iter)

CPU times: user 118 µs, sys: 269 µs, total: 387 µs
Wall time: 368 µs
INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 10781 word types from a corpus of 59890 raw words and 300 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
INFO:gensim.models.word2vec:min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 10781 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 45 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
INFO:gensim.models.word2vec:estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
INFO:gensim.models.word2vec:resetting layer weights


## Unsupervised similarity evaluation

The learnt sentence embeddings are evaluated without supervision by computing the cosine similarity of each sentence pair in the [SICK 2014](http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools) dataset. These similarity scores are compared to the gold-standard human judgements using [Pearson’s correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). The SICK dataset consists of about 10,000 sentence pairs, each annotated with a relatedness score.
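
A minimal sketch of the scoring described above, assuming sentence vectors are plain lists of floats. It is not the `evaluate_sick` helper used in this notebook; the function names, toy vectors and gold scores below are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy sentence-pair vectors and made-up gold relatedness scores (illustration only)
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [4.8, 3.1, 1.2]

predicted = [cosine_similarity(u, v) for u, v in pairs]
print(pearson(predicted, gold))
```

Pearson's correlation is invariant to linear rescaling, so the cosine scores do not have to be mapped onto the 1 to 5 relatedness range before they are compared with the gold scores.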

In [16]:
# Evaluate Gensim sent2vec model
evaluate_sick(sent2vec_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...




Dev Pearson: 0.454888083808
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.442726701635
Test Spearman: 0.429214793005
Test MSE: 0.819983275569


array([ 3.33202561,  3.22572068,  3.42379462, ...,  3.22189722,
        2.97011595,  3.45620743])

In [17]:
# Evaluate original c++ sent2vec model
evaluate_sick(seed=42, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...




Training...
Dev Pearson: 0.410262365496
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.385389086275
Test Spearman: 0.389124024191
Test MSE: 0.869150134381


array([ 3.03246912,  3.10153695,  3.2572212 , ...,  3.32569634,
        2.58646177,  2.93490493])

In [18]:
# Evaluate doc2vec model1
evaluate_sick(doc2vec_model1, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...




Training...
Dev Pearson: 0.328894630838
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.278608836989
Test Spearman: 0.27301301549
Test MSE: 0.938821396858


array([ 3.52826811,  3.7083939 ,  3.47524076, ...,  2.88458871,
        3.31107186,  3.86412868])

In [19]:
# Evaluate doc2vec model2
evaluate_sick(doc2vec_model2, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...




Dev Pearson: 0.450398849521
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.359457176208
Test Spearman: 0.357782710799
Test MSE: 0.886162110107


array([ 3.49265406,  3.57416919,  3.04496671, ...,  3.05261419,
        3.1936516 ,  2.89496976])

In [20]:
# Evaluate doc2vec model3
evaluate_sick(doc2vec_model3, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...




Training...
Dev Pearson: 0.300688964786
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.256824201286
Test Spearman: 0.253144870099
Test MSE: 0.951138264852


array([ 3.24743648,  3.67006003,  3.3777536 , ...,  3.50714775,
        3.37920226,  3.3091968 ])

In [21]:
# Evaluate doc2vec model4
evaluate_sick(doc2vec_model4, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...




Dev Pearson: 0.437143868676
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.343684457858
Test Spearman: 0.342749636539
Test MSE: 0.897495330406


array([ 3.59765554,  3.72423426,  2.97780104, ...,  3.32320554,
        3.41562496,  2.53789302])

In [22]:
# Evaluate fasttext model
evaluate_sick(fasttext_model, seed=42, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...




Dev Pearson: 0.501943522556
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.514681550184
Test Spearman: 0.506972908332
Test MSE: 0.749812329121


array([ 3.15966516,  3.21333559,  3.42929444, ...,  3.35751852,
        2.77122039,  2.49526048])

## Downstream Supervised Evaluation

Sentence embeddings are also evaluated on several supervised classification tasks: movie review sentiment (MR) (Pang & Lee, 2005), subjectivity classification (SUBJ) (Pang & Lee, 2004) and question type classification (TREC) (Voorhees, 2002). Embeddings are inferred from the input sentences and fed directly to a logistic regression classifier. For the [MR and SUBJ](https://www.cs.cornell.edu/people/pabo/movie-review-data/) datasets, accuracy is obtained using 10-fold cross-validation, with nested cross-validation used to tune the L2 penalty. For the [TREC dataset](http://cogcomp.cs.illinois.edu/Data/QA/QC/), accuracy is computed on the test set.
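
The nested cross-validation scheme can be sketched as follows. This is not the `eval_nested_kfold` helper used in the cells below: `kfold_indices`, `nested_kfold` and the `fit_and_score` callback are hypothetical names, the folds are contiguous and unshuffled, and the toy scorer stands in for training a logistic regression classifier.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for n_folds contiguous folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def nested_kfold(n_samples, penalties, fit_and_score, outer_k=10, inner_k=10):
    """Return one held-out accuracy per outer fold.

    fit_and_score(train_idx, test_idx, penalty) is a hypothetical callback
    that trains a classifier on train_idx with the given L2 penalty and
    returns its accuracy on test_idx.
    """
    outer_scores = []
    for outer_train, outer_test in kfold_indices(n_samples, outer_k):
        # Inner loop: choose the penalty using only the outer training part
        best_penalty, best_score = None, -1.0
        for penalty in penalties:
            inner_scores = []
            for inner_train, inner_val in kfold_indices(len(outer_train), inner_k):
                train_idx = [outer_train[i] for i in inner_train]
                val_idx = [outer_train[i] for i in inner_val]
                inner_scores.append(fit_and_score(train_idx, val_idx, penalty))
            mean_score = sum(inner_scores) / float(len(inner_scores))
            if mean_score > best_score:
                best_penalty, best_score = penalty, mean_score
        # Outer loop: refit with the chosen penalty, score on the held-out fold
        outer_scores.append(fit_and_score(outer_train, outer_test, best_penalty))
    return outer_scores


def toy_score(train_idx, test_idx, penalty):
    """Stand-in scorer whose accuracy peaks at penalty == 4 (illustration only)."""
    return 1.0 / (1.0 + abs(penalty - 4))


scores = nested_kfold(50, [1, 2, 4, 8], toy_score, outer_k=5, inner_k=3)
print(scores)
```

Because the penalty is selected inside each outer training split, the held-out fold never influences the hyperparameter choice, which keeps the per-fold accuracies from being optimistically biased.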

### Evaluation of Gensim Sent2Vec

In [23]:
eval_nested_kfold(model=sent2vec_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.69999999999999996]
[0.69999999999999996, 0.72399999999999998]
[0.69999999999999996, 0.72399999999999998, 0.69899999999999995]
[0.69999999999999996, 0.72399999999999998, 0.69899999999999995, 0.73299999999999998]
[0.69999999999999996, 0.72399999999999998, 0.69899999999999995, 0.73299999999999998, 0.71999999999999997]
[0.69999999999999996, 0.72399999999999998, 0.69899999999999995, 0.73299999999999998, 0.71999999999999997, 0.72699999999999998]
[0.69999999999999996, 0.72399999999999998, 0.69899999999999995, 0.73299999999999998, 0.71999999999999997, 0.72699999999999998, 0.71399999999999997]
[0.69999999999999996, 0.72399999999999998, 0.69899999999999995, 0.73299999999999998, 0.71999999999999997, 0.72699999999999998, 0.71399999999999997, 0.72199999999999998]
[0.69999999999999996, 0.72399999999999998, 0.69899999999999995, 0.73299999999999998, 0.71999999999999997, 0.72699999999999998, 0.71399999999999997, 0.72199999999999998, 0.71899999999999997]
[0.6999999999999

[0.69999999999999996,
 0.72399999999999998,
 0.69899999999999995,
 0.73299999999999998,
 0.71999999999999997,
 0.72699999999999998,
 0.71399999999999997,
 0.72199999999999998,
 0.71899999999999997,
 0.73899999999999999]

In [24]:
eval_nested_kfold(model=sent2vec_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.56607310215557638]
[0.56607310215557638, 0.56513589503280226]
[0.56607310215557638, 0.56513589503280226, 0.58348968105065668]
[0.56607310215557638, 0.56513589503280226, 0.58348968105065668, 0.55534709193245779]
[0.56607310215557638, 0.56513589503280226, 0.58348968105065668, 0.55534709193245779, 0.55065666041275796]
[0.56607310215557638, 0.56513589503280226, 0.58348968105065668, 0.55534709193245779, 0.55065666041275796, 0.56003752345215763]
[0.56607310215557638, 0.56513589503280226, 0.58348968105065668, 0.55534709193245779, 0.55065666041275796, 0.56003752345215763, 0.59756097560975607]
[0.56607310215557638, 0.56513589503280226, 0.58348968105065668, 0.55534709193245779, 0.55065666041275796, 0.56003752345215763, 0.59756097560975607, 0.60225140712945591]
[0.56607310215557638, 0.56513589503280226, 0.58348968105065668, 0.55534709193245779, 0.55065666041275796, 0.56003752345215763, 0.59756097560975607, 0.60225140712945591, 0.58536585365853655]
[0.5660731021555

[0.56607310215557638,
 0.56513589503280226,
 0.58348968105065668,
 0.55534709193245779,
 0.55065666041275796,
 0.56003752345215763,
 0.59756097560975607,
 0.60225140712945591,
 0.58536585365853655,
 0.575046904315197]

In [25]:
evaluate_trec(model=sent2vec_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.584


### Evaluation of Original C++ Sent2Vec

In [10]:
eval_nested_kfold(name='SUBJ', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.78700000000000003]
[0.78700000000000003, 0.80600000000000005]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002, 0.78300000000000003]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002, 0.78300000000000003, 0.79600000000000004]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002, 0.78300000000000003, 0.79600000000000004, 0.76300000000000001]
[0.7870000000000

[0.78700000000000003,
 0.80600000000000005,
 0.78200000000000003,
 0.77800000000000002,
 0.77500000000000002,
 0.76700000000000002,
 0.78300000000000003,
 0.79600000000000004,
 0.76300000000000001,
 0.79700000000000004]

In [11]:
eval_nested_kfold(name='MR', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.57357075913776945]
[0.57357075913776945, 0.57263355201499533]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56660412757973733]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56660412757973733, 0.58348968105065668]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56660412757973733, 0.58348968105065668, 0.59380863039399623]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56660412757973733, 0.58348968105065668, 0.59380863039399623, 0.60037523452157604]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56660412757973733, 0.58348968105065668, 0.59380863039399623, 0.60037523452157604, 0.61819887429643527]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56660412757973733, 0.58348968105065668, 0.59380863039399623, 0.60037523452157604, 0.61819887429643527, 0.60412757973733588]
[0.5735707591377

[0.57357075913776945,
 0.57263355201499533,
 0.56378986866791747,
 0.56660412757973733,
 0.58348968105065668,
 0.59380863039399623,
 0.60037523452157604,
 0.61819887429643527,
 0.60412757973733588,
 0.59380863039399623]

In [12]:
evaluate_trec(evalcv=False, evaltest=True, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.592


### Evaluation of Doc2Vec

In [29]:
eval_nested_kfold(model=doc2vec_model1, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.66900000000000004]
[0.66900000000000004, 0.67200000000000004]
[0.66900000000000004, 0.67200000000000004, 0.67800000000000005]
[0.66900000000000004, 0.67200000000000004, 0.67800000000000005, 0.67500000000000004]
[0.66900000000000004, 0.67200000000000004, 0.67800000000000005, 0.67500000000000004, 0.67800000000000005]
[0.66900000000000004, 0.67200000000000004, 0.67800000000000005, 0.67500000000000004, 0.67800000000000005, 0.67900000000000005]
[0.66900000000000004, 0.67200000000000004, 0.67800000000000005, 0.67500000000000004, 0.67800000000000005, 0.67900000000000005, 0.67700000000000005]
[0.66900000000000004, 0.67200000000000004, 0.67800000000000005, 0.67500000000000004, 0.67800000000000005, 0.67900000000000005, 0.67700000000000005, 0.67200000000000004]
[0.66900000000000004, 0.67200000000000004, 0.67800000000000005, 0.67500000000000004, 0.67800000000000005, 0.67900000000000005, 0.67700000000000005, 0.67200000000000004, 0.67300000000000004]
[0.6690000000000

[0.66900000000000004,
 0.67200000000000004,
 0.67800000000000005,
 0.67500000000000004,
 0.67800000000000005,
 0.67900000000000005,
 0.67700000000000005,
 0.67200000000000004,
 0.67300000000000004,
 0.68000000000000005]

In [30]:
eval_nested_kfold(model=doc2vec_model1, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.55201499531396436]
[0.55201499531396436, 0.55295220243673848]
[0.55201499531396436, 0.55295220243673848, 0.55816135084427765]
[0.55201499531396436, 0.55295220243673848, 0.55816135084427765, 0.5412757973733584]
[0.55201499531396436, 0.55295220243673848, 0.55816135084427765, 0.5412757973733584, 0.56285178236397748]
[0.55201499531396436, 0.55295220243673848, 0.55816135084427765, 0.5412757973733584, 0.56285178236397748, 0.57129455909943716]
[0.55201499531396436, 0.55295220243673848, 0.55816135084427765, 0.5412757973733584, 0.56285178236397748, 0.57129455909943716, 0.575046904315197]
[0.55201499531396436, 0.55295220243673848, 0.55816135084427765, 0.5412757973733584, 0.56285178236397748, 0.57129455909943716, 0.575046904315197, 0.55816135084427765]
[0.55201499531396436, 0.55295220243673848, 0.55816135084427765, 0.5412757973733584, 0.56285178236397748, 0.57129455909943716, 0.575046904315197, 0.55816135084427765, 0.55534709193245779]
[0.55201499531396436, 0.5529

[0.55201499531396436,
 0.55295220243673848,
 0.55816135084427765,
 0.5412757973733584,
 0.56285178236397748,
 0.57129455909943716,
 0.575046904315197,
 0.55816135084427765,
 0.55534709193245779,
 0.57410881801125702]

In [31]:
evaluate_trec(doc2vec_model1, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.38


In [32]:
eval_nested_kfold(model=doc2vec_model2, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.746]
[0.746, 0.72399999999999998]
[0.746, 0.72399999999999998, 0.72499999999999998]
[0.746, 0.72399999999999998, 0.72499999999999998, 0.73199999999999998]
[0.746, 0.72399999999999998, 0.72499999999999998, 0.73199999999999998, 0.72799999999999998]
[0.746, 0.72399999999999998, 0.72499999999999998, 0.73199999999999998, 0.72799999999999998, 0.69499999999999995]
[0.746, 0.72399999999999998, 0.72499999999999998, 0.73199999999999998, 0.72799999999999998, 0.69499999999999995, 0.72499999999999998]
[0.746, 0.72399999999999998, 0.72499999999999998, 0.73199999999999998, 0.72799999999999998, 0.69499999999999995, 0.72499999999999998, 0.72699999999999998]
[0.746, 0.72399999999999998, 0.72499999999999998, 0.73199999999999998, 0.72799999999999998, 0.69499999999999995, 0.72499999999999998, 0.72699999999999998, 0.72999999999999998]
[0.746, 0.72399999999999998, 0.72499999999999998, 0.73199999999999998, 0.72799999999999998, 0.69499999999999995, 0.72499999999999998, 0.726999

[0.746,
 0.72399999999999998,
 0.72499999999999998,
 0.73199999999999998,
 0.72799999999999998,
 0.69499999999999995,
 0.72499999999999998,
 0.72699999999999998,
 0.72999999999999998,
 0.755]

In [33]:
eval_nested_kfold(model=doc2vec_model2, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.5670103092783505]
[0.5670103092783505, 0.56513589503280226]
[0.5670103092783505, 0.56513589503280226, 0.59849906191369606]
[0.5670103092783505, 0.56513589503280226, 0.59849906191369606, 0.56378986866791747]
[0.5670103092783505, 0.56513589503280226, 0.59849906191369606, 0.56378986866791747, 0.56097560975609762]
[0.5670103092783505, 0.56513589503280226, 0.59849906191369606, 0.56378986866791747, 0.56097560975609762, 0.5684803001876173]
[0.5670103092783505, 0.56513589503280226, 0.59849906191369606, 0.56378986866791747, 0.56097560975609762, 0.5684803001876173, 0.60225140712945591]
[0.5670103092783505, 0.56513589503280226, 0.59849906191369606, 0.56378986866791747, 0.56097560975609762, 0.5684803001876173, 0.60225140712945591, 0.57035647279549717]
[0.5670103092783505, 0.56513589503280226, 0.59849906191369606, 0.56378986866791747, 0.56097560975609762, 0.5684803001876173, 0.60225140712945591, 0.57035647279549717, 0.56097560975609762]
[0.5670103092783505, 0.565135

[0.5670103092783505,
 0.56513589503280226,
 0.59849906191369606,
 0.56378986866791747,
 0.56097560975609762,
 0.5684803001876173,
 0.60225140712945591,
 0.57035647279549717,
 0.56097560975609762,
 0.59287054409005624]

In [34]:
evaluate_trec(doc2vec_model2, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.406


In [35]:
eval_nested_kfold(model=doc2vec_model3, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.68799999999999994]
[0.68799999999999994, 0.65800000000000003]
[0.68799999999999994, 0.65800000000000003, 0.65600000000000003]
[0.68799999999999994, 0.65800000000000003, 0.65600000000000003, 0.67900000000000005]
[0.68799999999999994, 0.65800000000000003, 0.65600000000000003, 0.67900000000000005, 0.66500000000000004]
[0.68799999999999994, 0.65800000000000003, 0.65600000000000003, 0.67900000000000005, 0.66500000000000004, 0.67000000000000004]
[0.68799999999999994, 0.65800000000000003, 0.65600000000000003, 0.67900000000000005, 0.66500000000000004, 0.67000000000000004, 0.69099999999999995]
[0.68799999999999994, 0.65800000000000003, 0.65600000000000003, 0.67900000000000005, 0.66500000000000004, 0.67000000000000004, 0.69099999999999995, 0.66800000000000004]
[0.68799999999999994, 0.65800000000000003, 0.65600000000000003, 0.67900000000000005, 0.66500000000000004, 0.67000000000000004, 0.69099999999999995, 0.66800000000000004, 0.67000000000000004]
[0.6879999999999

[0.68799999999999994,
 0.65800000000000003,
 0.65600000000000003,
 0.67900000000000005,
 0.66500000000000004,
 0.67000000000000004,
 0.69099999999999995,
 0.66800000000000004,
 0.67000000000000004,
 0.67400000000000004]

In [36]:
eval_nested_kfold(model=doc2vec_model3, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.53139643861293351]
[0.53139643861293351, 0.53139643861293351]
[0.53139643861293351, 0.53139643861293351, 0.56003752345215763]
[0.53139643861293351, 0.53139643861293351, 0.56003752345215763, 0.53846153846153844]
[0.53139643861293351, 0.53139643861293351, 0.56003752345215763, 0.53846153846153844, 0.53752345215759845]
[0.53139643861293351, 0.53139643861293351, 0.56003752345215763, 0.53846153846153844, 0.53752345215759845, 0.55347091932457781]
[0.53139643861293351, 0.53139643861293351, 0.56003752345215763, 0.53846153846153844, 0.53752345215759845, 0.55347091932457781, 0.59287054409005624]
[0.53139643861293351, 0.53139643861293351, 0.56003752345215763, 0.53846153846153844, 0.53752345215759845, 0.55347091932457781, 0.59287054409005624, 0.53283302063789872]
[0.53139643861293351, 0.53139643861293351, 0.56003752345215763, 0.53846153846153844, 0.53752345215759845, 0.55347091932457781, 0.59287054409005624, 0.53283302063789872, 0.5544090056285178]
[0.53139643861293

[0.53139643861293351,
 0.53139643861293351,
 0.56003752345215763,
 0.53846153846153844,
 0.53752345215759845,
 0.55347091932457781,
 0.59287054409005624,
 0.53283302063789872,
 0.5544090056285178,
 0.56378986866791747]

In [38]:
evaluate_trec(doc2vec_model3, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.404


In [37]:
eval_nested_kfold(model=doc2vec_model4, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.748]
[0.748, 0.72199999999999998]
[0.748, 0.72199999999999998, 0.70799999999999996]
[0.748, 0.72199999999999998, 0.70799999999999996, 0.73299999999999998]
[0.748, 0.72199999999999998, 0.70799999999999996, 0.73299999999999998, 0.73099999999999998]
[0.748, 0.72199999999999998, 0.70799999999999996, 0.73299999999999998, 0.73099999999999998, 0.73099999999999998]
[0.748, 0.72199999999999998, 0.70799999999999996, 0.73299999999999998, 0.73099999999999998, 0.73099999999999998, 0.73599999999999999]
[0.748, 0.72199999999999998, 0.70799999999999996, 0.73299999999999998, 0.73099999999999998, 0.73099999999999998, 0.73599999999999999, 0.71399999999999997]
[0.748, 0.72199999999999998, 0.70799999999999996, 0.73299999999999998, 0.73099999999999998, 0.73099999999999998, 0.73599999999999999, 0.71399999999999997, 0.71999999999999997]
[0.748, 0.72199999999999998, 0.70799999999999996, 0.73299999999999998, 0.73099999999999998, 0.73099999999999998, 0.73599999999999999, 0.713999

[0.748,
 0.72199999999999998,
 0.70799999999999996,
 0.73299999999999998,
 0.73099999999999998,
 0.73099999999999998,
 0.73599999999999999,
 0.71399999999999997,
 0.71999999999999997,
 0.746]

In [39]:
eval_nested_kfold(model=doc2vec_model4, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.56888472352389874]
[0.56888472352389874, 0.55576382380506095]
[0.56888472352389874, 0.55576382380506095, 0.58911819887429639]
[0.56888472352389874, 0.55576382380506095, 0.58911819887429639, 0.56566604127579734]
[0.56888472352389874, 0.55576382380506095, 0.58911819887429639, 0.56566604127579734, 0.56941838649155718]
[0.56888472352389874, 0.55576382380506095, 0.58911819887429639, 0.56566604127579734, 0.56941838649155718, 0.57410881801125702]
[0.56888472352389874, 0.55576382380506095, 0.58911819887429639, 0.56566604127579734, 0.56941838649155718, 0.57410881801125702, 0.59474671669793622]
[0.56888472352389874, 0.55576382380506095, 0.58911819887429639, 0.56566604127579734, 0.56941838649155718, 0.57410881801125702, 0.59474671669793622, 0.54971857410881797]
[0.56888472352389874, 0.55576382380506095, 0.58911819887429639, 0.56566604127579734, 0.56941838649155718, 0.57410881801125702, 0.59474671669793622, 0.54971857410881797, 0.56941838649155718]
[0.5688847235238

[0.56888472352389874,
 0.55576382380506095,
 0.58911819887429639,
 0.56566604127579734,
 0.56941838649155718,
 0.57410881801125702,
 0.59474671669793622,
 0.54971857410881797,
 0.56941838649155718,
 0.60694183864915574]

In [40]:
evaluate_trec(doc2vec_model4, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.424


### Evaluation of sentence vectors obtained from averaging FastText word vectors
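
A minimal sketch of the averaging step named in this heading, assuming word vectors are available through a dictionary-style lookup. The plain dict below stands in for a trained FastText model, and skipping tokens without a vector is an assumption of this sketch, not necessarily what the notebook's helper does.

```python
def average_word_vectors(tokens, word_vectors, dim):
    """Mean of the vectors of those tokens that have an entry in word_vectors.

    Returns an all-zero vector when no token has a vector.
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # assumption: tokens without a vector are skipped
        for i in range(dim):
            total[i] += vec[i]
        count += 1
    if count == 0:
        return total
    return [x / count for x in total]


# Toy 2-dimensional vectors (illustration only)
toy_vectors = {'movie': [1.0, 0.0], 'great': [0.0, 1.0]}
sentence_vector = average_word_vectors(['great', 'movie', 'unseen'], toy_vectors, dim=2)
print(sentence_vector)  # [0.5, 0.5]
```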

In [41]:
eval_nested_kfold(model=fasttext_model, name='SUBJ', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.82599999999999996]
[0.82599999999999996, 0.82099999999999995]
[0.82599999999999996, 0.82099999999999995, 0.78700000000000003]
[0.82599999999999996, 0.82099999999999995, 0.78700000000000003, 0.81699999999999995]
[0.82599999999999996, 0.82099999999999995, 0.78700000000000003, 0.81699999999999995, 0.80400000000000005]
[0.82599999999999996, 0.82099999999999995, 0.78700000000000003, 0.81699999999999995, 0.80400000000000005, 0.79100000000000004]
[0.82599999999999996, 0.82099999999999995, 0.78700000000000003, 0.81699999999999995, 0.80400000000000005, 0.79100000000000004, 0.79500000000000004]
[0.82599999999999996, 0.82099999999999995, 0.78700000000000003, 0.81699999999999995, 0.80400000000000005, 0.79100000000000004, 0.79500000000000004, 0.82599999999999996]
[0.82599999999999996, 0.82099999999999995, 0.78700000000000003, 0.81699999999999995, 0.80400000000000005, 0.79100000000000004, 0.79500000000000004, 0.82599999999999996, 0.79300000000000004]
[0.8259999999999

[0.82599999999999996,
 0.82099999999999995,
 0.78700000000000003,
 0.81699999999999995,
 0.80400000000000005,
 0.79100000000000004,
 0.79500000000000004,
 0.82599999999999996,
 0.79300000000000004,
 0.81000000000000005]

In [42]:
eval_nested_kfold(model=fasttext_model, name='MR', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.5895032802249297]
[0.5895032802249297, 0.60168697282099348]
[0.5895032802249297, 0.60168697282099348, 0.57692307692307687]
[0.5895032802249297, 0.60168697282099348, 0.57692307692307687, 0.58536585365853655]
[0.5895032802249297, 0.60168697282099348, 0.57692307692307687, 0.58536585365853655, 0.575046904315197]
[0.5895032802249297, 0.60168697282099348, 0.57692307692307687, 0.58536585365853655, 0.575046904315197, 0.60694183864915574]
[0.5895032802249297, 0.60168697282099348, 0.57692307692307687, 0.58536585365853655, 0.575046904315197, 0.60694183864915574, 0.62757973733583494]
[0.5895032802249297, 0.60168697282099348, 0.57692307692307687, 0.58536585365853655, 0.575046904315197, 0.60694183864915574, 0.62757973733583494, 0.62570356472795496]
[0.5895032802249297, 0.60168697282099348, 0.57692307692307687, 0.58536585365853655, 0.575046904315197, 0.60694183864915574, 0.62757973733583494, 0.62570356472795496, 0.62288930581613511]
[0.5895032802249297, 0.601686972820

[0.5895032802249297,
 0.60168697282099348,
 0.57692307692307687,
 0.58536585365853655,
 0.575046904315197,
 0.60694183864915574,
 0.62757973733583494,
 0.62570356472795496,
 0.62288930581613511,
 0.61726078799249529]

In [43]:
evaluate_trec(fasttext_model, evalcv=False, evaltest=True, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.632


## Evaluation Results

| S.No. | Model Name                                | Training Time (in seconds) | Pearson/Spearman/MSE on SICK | Mean SUBJ | Mean MR | TREC |
|-------|-------------------------------------------|----------------------------|------------------------------|-----------|---------|------|
| 1.    | Gensim Sent2Vec                           | 9.6                        | 0.44/0.44/0.81               | 0.71      | 0.57    | 0.58 |
| 2.    | Original Sent2Vec                         | 13.6(!)                    | 0.38/0.38/0.86               | 0.78      | 0.58    | 0.59 |
| 3.    | PV-DM with sum of context word vectors    | 1.7                        | 0.27/0.27/0.93               | 0.67      | 0.56    | 0.38 |
| 4.    | PV-DM with mean of context word vectors   | 1.3                        | 0.25/0.25/0.95               | 0.67      | 0.54    | 0.40 |
| 5.    | PV-DBOW with sum of context word vectors  | 1.7                        | 0.34/0.34/0.89               | 0.72      | 0.57    | 0.40 |
| 6.    | PV-DBOW with mean of context word vectors | 1.5                        | 0.34/0.34/0.89               | 0.72      | 0.57    | 0.42 |
| 7.    | Mean of gensim fasttext word vectors      | 8.5                        | 0.49/0.48/0.76               | 0.80      | 0.61    | 0.61 |

(!) NOTE: For the original sent2vec, the time in the table is the total execution time rather than the training time alone; it also includes building the vocabulary and other preprocessing.
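
The Mean SUBJ and Mean MR columns appear to summarise the ten per-fold accuracies printed by `eval_nested_kfold`. As a small worked example, the values below are copied (rounded to three decimals) from the Gensim Sent2Vec SUBJ cell output above; the variable names are illustrative only.

```python
# Per-fold SUBJ accuracies of Gensim Sent2Vec, copied from the cell output above
subj_fold_accuracies = [0.700, 0.724, 0.699, 0.733, 0.720,
                        0.727, 0.714, 0.722, 0.719, 0.739]

mean_accuracy = sum(subj_fold_accuracies) / len(subj_fold_accuracies)
print(round(mean_accuracy, 4))  # 0.7197
```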

# Evaluation on a sample of the Toronto Book Corpus

The Toronto Book Corpus contains all sentences from 11,038 books, of which only 7,087 are unique. Among them, 2,089 books have one duplicate, 733 have two, and 95 have more than two.

In this notebook, we use a sample of 100,000 randomly selected sentences from the Toronto Corpus.
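
The preparation cell below keeps each line with probability 0.5 and stops once 100,000 lines have been accepted, so it draws from the beginning of the file. If a uniform sample over the whole file is wanted instead, reservoir sampling is one alternative. The sketch below is an illustration under that assumption, not the method used in this notebook; `reservoir_sample` is a hypothetical helper.

```python
import random


def reservoir_sample(lines, k, seed=42):
    """Return k items drawn uniformly at random from an iterable of unknown length.

    If the iterable has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line  # replace with probability k / (i + 1)
    return sample


# Toy usage: sample 5 "sentences" from a stream of 1000
toy_stream = ('sentence %d' % i for i in range(1000))
toy_sample = reservoir_sample(toy_stream, 5)
print(toy_sample)
```

The input is consumed in a single pass and only `k` lines are held in memory, which suits a corpus file that is too large to load at once.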

As this is a private corpus, it can only be downloaded on request; see the [corpus page](http://yknzhu.wixsite.com/mbweb) for details.

## Prepare Training Data

In [13]:
# Prepare training data for sent2vec
toronto_data_file = './books_in_sentences/books_large_p1.txt'
toronto_data = []
lines = 0
with smart_open.smart_open(toronto_data_file) as f1, smart_open.smart_open("./input.txt",'w') as f2:
    for line in f1:
        if np.random.random() > 0.5:
            if lines >= 100000:
                break
            lines += 1
            if line not in ['\n', '\r\n']:
                line = re.split(r'\.|\?|\n', line.strip())
                for sentence in line:
                    if len(sentence) > 1:
                        sentence = gensim_tokenize(sentence)
                        toronto_data.append(list(sentence))
                        f2.write(' '.join(toronto_data[-1]) + '\n')


# Prepare training data for doc2vec
def read_toronto_corpus(fname, tokens_only=False):
    with smart_open.smart_open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            if i >= 100000:
                break
            if tokens_only:
                yield gensim.utils.simple_preprocess(line)
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i])
toronto_doc2vec_data = list(read_toronto_corpus(toronto_data_file))


# Print sample training data
for sentence in toronto_data[:10]:
    print sentence,'\n'

[u'the', u'half', u'ling', u'book', u'one', u'in', u'the', u'fall', u'of', u'igneeria', u'series', u'kaylee', u'soderburg', u'copyright', u'kaylee', u'soderburg', u'all', u'rights', u'reserved'] 

[u'isbn', u'isbn', u'for', u'my', u'family', u'who', u'encouraged', u'me', u'to', u'never', u'stop', u'fighting', u'for', u'my', u'dreams', u'chapter', u'summer', u'vacations', u'supposed', u'to', u'be', u'fun', u'right'] 

[u'starlings', u'new', u'york', u'is', u'not', u'the', u'place', u'youd', u'expect', u'much', u'to', u'happen'] 

[u'its', u'a', u'place', u'where', u'your', u'parents', u'wouldnt', u'even', u'care', u'if', u'you', u'stayed', u'out', u'late', u'biking', u'with', u'your', u'friends'] 

[u'they', u'dont', u'know', u'the', u'half', u'of', u'it'] 

[u'the', u'only', u'reason', u'why', u'no', u'one', u'knows', u'this', u'is', u'because', u'jason', u'emily', u'seth', u'and', u'i', u'have', u'kept', u'it', u'that', u'way'] 

[u'i', u'walked', u'along', u'the', u'empty', u'road', 

## Train Models

In [14]:
# Train new sent2vec model
% time sent2vec_toronto_model = Sent2Vec(toronto_data, size=100, epochs=5, seed=42, workers=4)

INFO:gensim.models.sent2vec:Creating dictionary...
INFO:gensim.models.sent2vec:Read 1.00 M words
INFO:gensim.models.sent2vec:Read 1.35 M words
INFO:gensim.models.sent2vec:Dictionary created, dictionary size: 11587, tokens read: 1352333
INFO:gensim.models.sent2vec:training model with 4 workers on 11587 vocabulary and 100 features
INFO:gensim.models.sent2vec:PROGRESS: at 1.18% words, 73365 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 3.10% words, 95459 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 5.02% words, 103381 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 6.79% words, 104080 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 8.57% words, 104455 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 10.49% words, 107485 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 12.41% words, 109406 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 14.62% words, 112886 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 16.54% words, 113359 words/s
INFO:gensim.models.sent2vec:PROGRESS: at 18.02

In [16]:
# Train model using original c++ implementation of sent2vec
% cd sent2vec
! time ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 5 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 4 -t 0.0001 -dropoutK 2 -bucket 2000000
% cd ..

/Users/prerna135/Documents/GitHub/gensim/docs/notebooks/sent2vec
Read 1M words
Number of words:  12999
Number of labels: 0
Progress: 8.0%  words/sec/thread: 47948  lr: 0.183938  loss: 3.334899  eta: 0h0m /bin/sh: line 1:  4272 Segmentation fault: 11  ./fasttext sent2vec -input ../input.txt -output my_model -minCount 5 -dim 100 -epoch 5 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 4 -t 0.0001 -dropoutK 2 -bucket 2000000

real	0m8.875s
user	0m13.991s
sys	0m1.651s
/Users/prerna135/Documents/GitHub/gensim/docs/notebooks
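
The shell `time` prefix used above reports the wall-clock time of the whole process, so vocabulary building and training cannot be told apart. As a rough illustration of how the two phases could be timed separately from Python, here is a minimal sketch. The `timed` helper and the `build_vocab` / `train` stand-ins are hypothetical and only mimic the shape of a two-phase training run; they are not part of Gensim or sent2vec.

```python
from timeit import default_timer


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = default_timer()
    result = fn(*args, **kwargs)
    return result, default_timer() - start


def build_vocab(sentences):
    # Hypothetical stand-in for a vocabulary-building phase: collect distinct tokens.
    return {token for sentence in sentences for token in sentence}


def train(sentences, vocab, epochs):
    # Hypothetical stand-in for a training phase: count in-vocabulary tokens per epoch.
    return sum(1 for _ in range(epochs) for s in sentences for t in s if t in vocab)


toy_corpus = [["i", "walked", "along", "the", "empty", "road"]] * 1000
vocab, vocab_seconds = timed(build_vocab, toy_corpus)
seen, train_seconds = timed(train, toy_corpus, vocab, epochs=5)
print("vocab: %.4fs, training: %.4fs" % (vocab_seconds, train_seconds))
```

Timing each phase on its own makes it possible to report a training-only figure next to a total figure, which is the distinction the results table has to explain for the C++ run.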


In [17]:
# Train new Doc2Vec model with PV-DBOW and sum of context word vectors
%time doc2vec_model = gensim.models.doc2vec.Doc2Vec(dm=0, size=100, min_count=5, iter=5, alpha=0.2, max_vocab_size=30000000, negative=10, seed=42)
%time doc2vec_model.build_vocab(toronto_doc2vec_data)
%time doc2vec_model.train(toronto_doc2vec_data, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)

CPU times: user 211 µs, sys: 1.03 ms, total: 1.25 ms
Wall time: 2.02 ms
INFO:gensim.models.doc2vec:collecting all words and their counts
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #10000, processed 113217 words (760493/s), 7839 word types, 10000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #20000, processed 256086 words (845753/s), 14477 word types, 20000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #30000, processed 377626 words (865909/s), 17118 word types, 30000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #40000, processed 481481 words (764731/s), 19727 word types, 40000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #50000, processed 594311 words (702955/s), 21187 word types, 50000 tags
INFO:gensim.models.doc2vec:PROGRESS: at example #60000, processed 755296 words (972531/s), 24414 word types, 60000 tags
INFO:gensim.models.doc2vec:PROGRESS: at exampl

5690060

In [18]:
# Train new Gensim FastText model
%time fasttext_model = FastText(size=100, alpha=0.2, negative=10, max_vocab_size=30000000, seed=42, iter=5)
%time fasttext_model.build_vocab(toronto_data)
%time fasttext_model.train(toronto_data, total_examples=fasttext_model.corpus_count, epochs=fasttext_model.iter)

CPU times: user 172 µs, sys: 424 µs, total: 596 µs
Wall time: 590 µs
INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #10000, processed 124196 words, keeping 10460 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #20000, processed 241930 words, keeping 14721 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #30000, processed 382938 words, keeping 18492 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #40000, processed 581109 words, keeping 23462 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #50000, processed 707569 words, keeping 25275 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #60000, processed 823532 words, keeping 27145 word types
INFO:gensim.models.word2vec:PROGRESS: at sentence #70000, processed 951042 words, keeping 29090 word types
INFO:gensim.models.wor
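
Doc2Vec and FastText are labelled "(sum)" and "(average)" in the results table, which presumably refers to how word vectors are combined into a sentence vector before evaluation. The sketch below shows both compositions on toy data. The `compose` function and `toy_vectors` dictionary are hypothetical illustrations, not the helpers used elsewhere in this notebook, and out-of-vocabulary tokens are simply skipped here as an assumption.

```python
def compose(tokens, word_vectors, mode="average"):
    """Combine the vectors of known tokens into one sentence vector."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token of the sentence is in the vocabulary")
    # Element-wise sum over all known word vectors.
    summed = [sum(dims) for dims in zip(*known)]
    if mode == "sum":
        return summed
    if mode == "average":
        return [value / float(len(known)) for value in summed]
    raise ValueError("mode must be 'sum' or 'average'")


# Toy 3-dimensional vectors; "the" is deliberately out of vocabulary.
toy_vectors = {"empty": [1.0, 0.0, 2.0], "road": [3.0, 4.0, 0.0]}
sentence = ["the", "empty", "road"]
print(compose(sentence, toy_vectors, "sum"))      # [4.0, 4.0, 2.0]
print(compose(sentence, toy_vectors, "average"))  # [2.0, 2.0, 1.0]
```

Averaging keeps the vector scale independent of sentence length, while summing lets longer sentences produce larger vectors; either choice can affect a downstream classifier.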

## Unsupervised Similarity Evaluation

In [19]:
# Evaluate gensim sent2vec model
evaluate_sick(sent2vec_toronto_model, seed=42, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...




Dev Pearson: 0.468258209783
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.498787005964
Test Spearman: 0.506137348117
Test MSE: 0.764512114702


array([ 3.2211674 ,  3.68008844,  3.37953488, ...,  3.27733084,
        2.89488411,  3.11638982])

In [20]:
# Evaluate original sent2vec model
evaluate_sick(seed=42, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...




Training...
Dev Pearson: 0.40930191142
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.388923255965
Test Spearman: 0.391691012721
Test MSE: 0.866768122911


array([ 3.12643148,  3.12836795,  3.27785104, ...,  3.37349816,
        2.57517373,  2.94589332])

In [21]:
# Evaluate gensim doc2vec model
evaluate_sick(doc2vec_model, seed=42, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...




Dev Pearson: 0.502551939299
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.497833663676
Test Spearman: 0.477028702016
Test MSE: 0.768101925819


array([ 2.84552883,  3.52230607,  3.06499181, ...,  2.86501964,
        2.76126113,  2.84552446])

In [22]:
# Evaluate gensim fasttext model
evaluate_sick(fasttext_model, seed=42, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing development sentence vectors...
Computing feature combinations...
Encoding labels...
Compiling model...
Training...




Dev Pearson: 0.557180612207
Computing test sentence vectors...
Computing feature combinations...
Evaluating...
Test Pearson: 0.569093999422
Test Spearman: 0.559321843143
Test MSE: 0.689002154196


array([ 2.73126831,  2.70995621,  2.80035074, ...,  2.78172965,
        2.28476275,  3.02997125])
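
For readers unfamiliar with the three SICK metrics reported above, the following sketch computes Pearson correlation, Spearman rank correlation and mean squared error in plain Python on made-up scores. It assumes the standard textbook definitions; `evaluate_sick` itself is defined earlier in the notebook and may rely on library routines instead. The `gold` and `predicted` values are invented for illustration.

```python
import math


def mean(values):
    return sum(values) / float(len(values))


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mse(x, y):
    return mean([(a - b) ** 2 for a, b in zip(x, y)])


gold = [1.0, 2.0, 3.0, 4.0, 5.0]       # invented relatedness scores
predicted = [1.5, 2.5, 2.0, 4.5, 5.5]  # invented model predictions
print("Pearson: %.4f" % pearson(gold, predicted))
print("Spearman: %.4f" % spearman(gold, predicted))
print("MSE: %.4f" % mse(gold, predicted))
```

Higher Pearson and Spearman values and a lower MSE indicate predictions closer to the gold relatedness scores, which is how the four models are ranked in this section.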

## Downstream Supervised Evaluation

### Evaluation of Gensim Sent2Vec model

In [23]:
eval_nested_kfold(model=sent2vec_toronto_model, name='SUBJ', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.71399999999999997]
[0.71399999999999997, 0.71699999999999997]
[0.71399999999999997, 0.71699999999999997, 0.69199999999999995]
[0.71399999999999997, 0.71699999999999997, 0.69199999999999995, 0.69899999999999995]
[0.71399999999999997, 0.71699999999999997, 0.69199999999999995, 0.69899999999999995, 0.69599999999999995]
[0.71399999999999997, 0.71699999999999997, 0.69199999999999995, 0.69899999999999995, 0.69599999999999995, 0.72499999999999998]
[0.71399999999999997, 0.71699999999999997, 0.69199999999999995, 0.69899999999999995, 0.69599999999999995, 0.72499999999999998, 0.69499999999999995]
[0.71399999999999997, 0.71699999999999997, 0.69199999999999995, 0.69899999999999995, 0.69599999999999995, 0.72499999999999998, 0.69499999999999995, 0.71699999999999997]
[0.71399999999999997, 0.71699999999999997, 0.69199999999999995, 0.69899999999999995, 0.69599999999999995, 0.72499999999999998, 0.69499999999999995, 0.71699999999999997, 0.69299999999999995]
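[0.71399999999999997, 0.71699999999999997, 0.69199999999999995, 0.69899999999999995, 0.69599999999999995, 0.72499999999999998, 0.69499999999999995, 0.71699999999999997, 0.69299999999999995, 0.70699999999999996]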

[0.71399999999999997,
 0.71699999999999997,
 0.69199999999999995,
 0.69899999999999995,
 0.69599999999999995,
 0.72499999999999998,
 0.69499999999999995,
 0.71699999999999997,
 0.69299999999999995,
 0.70699999999999996]

In [24]:
eval_nested_kfold(model=sent2vec_toronto_model, name='MR', use_nb=False, model_name='sent2vec')

Computing sentence vectors...
[0.59793814432989689]
[0.59793814432989689, 0.57638238050609181]
[0.59793814432989689, 0.57638238050609181, 0.59849906191369606]
[0.59793814432989689, 0.57638238050609181, 0.59849906191369606, 0.59380863039399623]
[0.59793814432989689, 0.57638238050609181, 0.59849906191369606, 0.59380863039399623, 0.56472795497185746]
[0.59793814432989689, 0.57638238050609181, 0.59849906191369606, 0.59380863039399623, 0.56472795497185746, 0.59193245778611636]
[0.59793814432989689, 0.57638238050609181, 0.59849906191369606, 0.59380863039399623, 0.56472795497185746, 0.59193245778611636, 0.59099437148217637]
[0.59793814432989689, 0.57638238050609181, 0.59849906191369606, 0.59380863039399623, 0.56472795497185746, 0.59193245778611636, 0.59099437148217637, 0.57129455909943716]
[0.59793814432989689, 0.57638238050609181, 0.59849906191369606, 0.59380863039399623, 0.56472795497185746, 0.59193245778611636, 0.59099437148217637, 0.57129455909943716, 0.61726078799249529]
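[0.59793814432989689, 0.57638238050609181, 0.59849906191369606, 0.59380863039399623, 0.56472795497185746, 0.59193245778611636, 0.59099437148217637, 0.57129455909943716, 0.61726078799249529, 0.60131332082551592]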

[0.59793814432989689,
 0.57638238050609181,
 0.59849906191369606,
 0.59380863039399623,
 0.56472795497185746,
 0.59193245778611636,
 0.59099437148217637,
 0.57129455909943716,
 0.61726078799249529,
 0.60131332082551592]

In [25]:
evaluate_trec(sent2vec_toronto_model, evalcv=False, evaltest=True, model_name='sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.508


### Evaluation of Original C++ Sent2Vec

In [26]:
eval_nested_kfold(name='SUBJ', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.78700000000000003]
[0.78700000000000003, 0.80600000000000005]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002, 0.78300000000000003]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002, 0.78300000000000003, 0.79600000000000004]
[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002, 0.78300000000000003, 0.79600000000000004, 0.76300000000000001]
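[0.78700000000000003, 0.80600000000000005, 0.78200000000000003, 0.77800000000000002, 0.77500000000000002, 0.76700000000000002, 0.78300000000000003, 0.79600000000000004, 0.76300000000000001, 0.79700000000000004]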

[0.78700000000000003,
 0.80600000000000005,
 0.78200000000000003,
 0.77800000000000002,
 0.77500000000000002,
 0.76700000000000002,
 0.78300000000000003,
 0.79600000000000004,
 0.76300000000000001,
 0.79700000000000004]

In [27]:
eval_nested_kfold(name='MR', use_nb=False, model_name='original_sent2vec')

Computing sentence vectors...
[0.57357075913776945]
[0.57357075913776945, 0.57263355201499533]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56566604127579734]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56566604127579734, 0.58348968105065668]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56566604127579734, 0.58348968105065668, 0.59380863039399623]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56566604127579734, 0.58348968105065668, 0.59380863039399623, 0.60037523452157604]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56566604127579734, 0.58348968105065668, 0.59380863039399623, 0.60037523452157604, 0.61819887429643527]
[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56566604127579734, 0.58348968105065668, 0.59380863039399623, 0.60037523452157604, 0.61819887429643527, 0.60412757973733588]
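[0.57357075913776945, 0.57263355201499533, 0.56378986866791747, 0.56566604127579734, 0.58348968105065668, 0.59380863039399623, 0.60037523452157604, 0.61819887429643527, 0.60412757973733588, 0.59380863039399623]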

[0.57357075913776945,
 0.57263355201499533,
 0.56378986866791747,
 0.56566604127579734,
 0.58348968105065668,
 0.59380863039399623,
 0.60037523452157604,
 0.61819887429643527,
 0.60412757973733588,
 0.59380863039399623]

In [28]:
evaluate_trec(evalcv=False, evaltest=True, model_name='original_sent2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.592


### Evaluation of Gensim Doc2Vec

In [29]:
eval_nested_kfold(model=doc2vec_model, name='SUBJ', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.75600000000000001]
[0.75600000000000001, 0.75900000000000001]
[0.75600000000000001, 0.75900000000000001, 0.75]
[0.75600000000000001, 0.75900000000000001, 0.75, 0.75]
[0.75600000000000001, 0.75900000000000001, 0.75, 0.75, 0.76900000000000002]
[0.75600000000000001, 0.75900000000000001, 0.75, 0.75, 0.76900000000000002, 0.74199999999999999]
[0.75600000000000001, 0.75900000000000001, 0.75, 0.75, 0.76900000000000002, 0.74199999999999999, 0.745]
[0.75600000000000001, 0.75900000000000001, 0.75, 0.75, 0.76900000000000002, 0.74199999999999999, 0.745, 0.77300000000000002]
[0.75600000000000001, 0.75900000000000001, 0.75, 0.75, 0.76900000000000002, 0.74199999999999999, 0.745, 0.77300000000000002, 0.746]
[0.75600000000000001, 0.75900000000000001, 0.75, 0.75, 0.76900000000000002, 0.74199999999999999, 0.745, 0.77300000000000002, 0.746, 0.78200000000000003]


[0.75600000000000001,
 0.75900000000000001,
 0.75,
 0.75,
 0.76900000000000002,
 0.74199999999999999,
 0.745,
 0.77300000000000002,
 0.746,
 0.78200000000000003]

In [30]:
eval_nested_kfold(model=doc2vec_model, name='MR', use_nb=False, model_name='doc2vec')

Computing sentence vectors...
[0.60168697282099348]
[0.60168697282099348, 0.6204311152764761]
[0.60168697282099348, 0.6204311152764761, 0.61069418386491559]
[0.60168697282099348, 0.6204311152764761, 0.61069418386491559, 0.6303939962476548]
[0.60168697282099348, 0.6204311152764761, 0.61069418386491559, 0.6303939962476548, 0.60037523452157604]
[0.60168697282099348, 0.6204311152764761, 0.61069418386491559, 0.6303939962476548, 0.60037523452157604, 0.64165103189493433]
[0.60168697282099348, 0.6204311152764761, 0.61069418386491559, 0.6303939962476548, 0.60037523452157604, 0.64165103189493433, 0.6163227016885553]
[0.60168697282099348, 0.6204311152764761, 0.61069418386491559, 0.6303939962476548, 0.60037523452157604, 0.64165103189493433, 0.6163227016885553, 0.62101313320825513]
[0.60168697282099348, 0.6204311152764761, 0.61069418386491559, 0.6303939962476548, 0.60037523452157604, 0.64165103189493433, 0.6163227016885553, 0.62101313320825513, 0.61444652908067543]
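[0.60168697282099348, 0.6204311152764761, 0.61069418386491559, 0.6303939962476548, 0.60037523452157604, 0.64165103189493433, 0.6163227016885553, 0.62101313320825513, 0.61444652908067543, 0.61350844277673544]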

[0.60168697282099348,
 0.6204311152764761,
 0.61069418386491559,
 0.6303939962476548,
 0.60037523452157604,
 0.64165103189493433,
 0.6163227016885553,
 0.62101313320825513,
 0.61444652908067543,
 0.61350844277673544]

In [31]:
evaluate_trec(doc2vec_model, evalcv=False, evaltest=True, model_name='doc2vec')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.464


### Evaluation of Gensim FastText

In [32]:
eval_nested_kfold(model=fasttext_model, name='SUBJ', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.81000000000000005]
[0.81000000000000005, 0.78400000000000003]
[0.81000000000000005, 0.78400000000000003, 0.76700000000000002]
[0.81000000000000005, 0.78400000000000003, 0.76700000000000002, 0.79200000000000004]
[0.81000000000000005, 0.78400000000000003, 0.76700000000000002, 0.79200000000000004, 0.78900000000000003]
[0.81000000000000005, 0.78400000000000003, 0.76700000000000002, 0.79200000000000004, 0.78900000000000003, 0.76900000000000002]
[0.81000000000000005, 0.78400000000000003, 0.76700000000000002, 0.79200000000000004, 0.78900000000000003, 0.76900000000000002, 0.77000000000000002]
[0.81000000000000005, 0.78400000000000003, 0.76700000000000002, 0.79200000000000004, 0.78900000000000003, 0.76900000000000002, 0.77000000000000002, 0.80100000000000005]
[0.81000000000000005, 0.78400000000000003, 0.76700000000000002, 0.79200000000000004, 0.78900000000000003, 0.76900000000000002, 0.77000000000000002, 0.80100000000000005, 0.77400000000000002]
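[0.81000000000000005, 0.78400000000000003, 0.76700000000000002, 0.79200000000000004, 0.78900000000000003, 0.76900000000000002, 0.77000000000000002, 0.80100000000000005, 0.77400000000000002, 0.81200000000000006]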

[0.81000000000000005,
 0.78400000000000003,
 0.76700000000000002,
 0.79200000000000004,
 0.78900000000000003,
 0.76900000000000002,
 0.77000000000000002,
 0.80100000000000005,
 0.77400000000000002,
 0.81200000000000006]

In [33]:
eval_nested_kfold(model=fasttext_model, name='MR', use_nb=False, model_name='fasttext')

Computing sentence vectors...
[0.60543580131208996]
[0.60543580131208996, 0.61199625117150891]
[0.60543580131208996, 0.61199625117150891, 0.63508442776735463]
[0.60543580131208996, 0.61199625117150891, 0.63508442776735463, 0.62476547842401498]
[0.60543580131208996, 0.61199625117150891, 0.63508442776735463, 0.62476547842401498, 0.59193245778611636]
[0.60543580131208996, 0.61199625117150891, 0.63508442776735463, 0.62476547842401498, 0.59193245778611636, 0.64915572232645402]
[0.60543580131208996, 0.61199625117150891, 0.63508442776735463, 0.62476547842401498, 0.59193245778611636, 0.64915572232645402, 0.64165103189493433]
[0.60543580131208996, 0.61199625117150891, 0.63508442776735463, 0.62476547842401498, 0.59193245778611636, 0.64915572232645402, 0.64165103189493433, 0.63789868667917449]
[0.60543580131208996, 0.61199625117150891, 0.63508442776735463, 0.62476547842401498, 0.59193245778611636, 0.64915572232645402, 0.64165103189493433, 0.63789868667917449, 0.63508442776735463]
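[0.60543580131208996, 0.61199625117150891, 0.63508442776735463, 0.62476547842401498, 0.59193245778611636, 0.64915572232645402, 0.64165103189493433, 0.63789868667917449, 0.63508442776735463, 0.62476547842401498]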

[0.60543580131208996,
 0.61199625117150891,
 0.63508442776735463,
 0.62476547842401498,
 0.59193245778611636,
 0.64915572232645402,
 0.64165103189493433,
 0.63789868667917449,
 0.63508442776735463,
 0.62476547842401498]

In [34]:
evaluate_trec(fasttext_model, evalcv=False, evaltest=True, model_name='fasttext')

Preparing data...
Computing training sentence vectors...
Computing testing sentence vectors...
Evaluating...
Test accuracy: 0.588


## Evaluation Results

These results suggest that more training data helps sent2vec: the model above, trained for only 5 epochs, achieves results similar to those of the model trained for 20 epochs on the much smaller Lee corpus.

| S.No. | Model             | Training Time (in seconds)        | Pearson/Spearman/MSE on SICK | MR   | SUBJ | TREC |
|-------|-------------------|-----------------------------------|------------------------------|------|------|------|
| 1.    | Gensim Sent2Vec   | 54.4                              | 0.49/0.50/0.76               | 0.59 | 0.70 | 0.50 |
| 2.    | Original Sent2Vec | 8.87(!)                           | 0.38/0.39/0.86               | 0.58 | 0.78 | 0.59 |
| 3.    | Doc2Vec DBOW (sum)| 33.8                              | 0.49/0.47/0.76               | 0.61 | 0.75 | 0.46 |
| 4.    | FastText (average)| 56.4                              | 0.56/0.55/0.68               | 0.62 | 0.78 | 0.58 |

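(!) NOTE: For the original sent2vec, the time in the table is the total execution time reported by `time` (`real 0m8.875s`), not just the training time, so it also includes building the vocabulary and other preprocessing. The logged run also ended with a segmentation fault at about 8% progress, so this figure does not reflect a complete 5-epoch training run and is not directly comparable with the other rows.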
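
The MR and SUBJ columns appear to be the mean of the ten fold accuracies printed by `eval_nested_kfold`, truncated (not rounded) to two decimals. This is an inference from the numbers, not something the notebook states. A minimal sketch of that summary step follows; the `summarize` helper is hypothetical, and the fold accuracies are copied from the Gensim Sent2Vec SUBJ run above.

```python
import math


def summarize(fold_scores, decimals=2):
    """Mean of the fold scores, truncated (not rounded) to `decimals` places."""
    mean = sum(fold_scores) / float(len(fold_scores))
    factor = 10 ** decimals
    return math.floor(mean * factor) / factor


# Fold accuracies copied from the Gensim Sent2Vec SUBJ run above.
sent2vec_subj = [0.714, 0.717, 0.692, 0.699, 0.696,
                 0.725, 0.695, 0.717, 0.693, 0.707]
print(summarize(sent2vec_subj))  # 0.7
```

The untruncated mean here is about 0.7055, so rounding to two decimals would give 0.71 rather than the 0.70 shown in the table. The same truncation matches the SICK and TREC columns (for example 0.508 reported as 0.50).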