# gensim doc2vec & IMDB sentiment dataset

TODO: section on introduction & motivation

TODO: prerequisites + dependencies (statsmodels, patsy, ?)

## Load corpus

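The original IMDB workflow fetches and preps the data exactly as in Mikolov's go.sh shell script, skipping those steps once the final summary file (`aclImdb/alldata-id.txt`) is available alongside this notebook. The cell shown here only sets up the imports and the `normalize_text` helper; the documents used below are read from `compiled_data_toclassify.csv` in the next cell.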

In [3]:
import locale
import glob
import os.path
import requests
import tarfile
import pandas as pd

dirname = 'aclImdb'
filename = 'aclImdb_v1.tar.gz'
locale.setlocale(locale.LC_ALL, 'C')


# Convert text to lower-case and strip punctuation/symbols from words
def normalize_text(text):
    norm_text = text.lower()

    # Replace breaks with spaces
    norm_text = norm_text.replace('<br />', ' ')

    # Pad punctuation with spaces on both sides
    for char in ['.', '"', ',', '(', ')', '!', '?', ';', ':']:
        norm_text = norm_text.replace(char, ' ' + char + ' ')

    return norm_text
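
As a quick check of what `normalize_text` produces, the sketch below repeats the helper and runs it on a made-up sentence (the sample text is an assumption, not data from the corpus). Note that the tokenizing cell further down calls plain `.split()` on the raw text, so tokens such as `tongue,` keep their punctuation unless this helper is applied first.

```python
def normalize_text(text):
    """Lower-case, replace HTML line breaks, and pad punctuation with spaces."""
    norm_text = text.lower().replace('<br />', ' ')
    for char in ['.', '"', ',', '(', ')', '!', '?', ';', ':']:
        norm_text = norm_text.replace(char, ' ' + char + ' ')
    return norm_text

sample = 'Great movie!<br />Loved it.'  # made-up example text
tokens = normalize_text(sample).split()
print(tokens)
```

Splitting on whitespace after normalization turns each punctuation mark into its own token.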



In [6]:
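# Load the compiled responses and drop the first CSV column
df = pd.read_csv("compiled_data_toclassify.csv", index_col=None, header=0)
df = df.drop(df.columns[0], axis=1)

# Keep only 'asker' rows, and only columns with at least one non-zero value
df = df.loc[df['response_type'] == 'asker', (df != 0).any(axis=0)]
txt = df['response'].values  # raw text of each response (`.as_matrix()` is gone from recent pandas)
cols = [col for col in df.columns if 'diag_' in col]
diag = (df[cols].values > 0).astype(int)  # binary indicator matrix, one column per diag_* label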

In [18]:
import gensim
from gensim.models.doc2vec import TaggedDocument
from collections import namedtuple

SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

alldocs = []  # will hold all docs in original order
for line_no, line in enumerate(txt):
    tokens = gensim.utils.to_unicode(line).split()
    words = tokens
    tags = [line_no]  # one integer tag per document, equal to its position in `txt`
    split = ['train', 'test', 'extra', 'extra'][line_no * 4 // len(txt)]  # by quarter: train, test, then two quarters of extra
    sentiment = diag[line_no, 14]  # binary target: column 14 of the diag_* indicator matrix
    alldocs.append(SentimentDocument(words, tags, split, sentiment))

train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
doc_list = alldocs[:]  # for reshuffling per pass

print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_docs), len(test_docs)))

5692 docs: 1423 train-sentiment, 1423 test-sentiment
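
The quarter-based split can be checked in isolation. The sketch below is a stand-alone illustration (the `assign_split` helper is hypothetical, not part of the notebook) that uses integer arithmetic only, so it behaves the same under Python 2 and Python 3, and reproduces the counts printed above for 5,692 documents.

```python
from collections import Counter

def assign_split(line_no, n_docs, labels=('train', 'test', 'extra', 'extra')):
    """Map a document position to a split by quarter of the corpus."""
    return labels[line_no * len(labels) // n_docs]

n_docs = 5692
counts = Counter(assign_split(i, n_docs) for i in range(n_docs))
print(counts['train'], counts['test'], counts['extra'])
```

Multiplying before the floor division keeps the index in range even when the corpus size is not divisible by four.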


## Set-up Doc2Vec Training & Evaluation Models

Approximating experiment of Le & Mikolov ["Distributed Representations of Sentences and Documents"](http://cs.stanford.edu/~quocle/paragraph_vector.pdf), also with guidance from Mikolov's [example go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):

`./word2vec -train ../alldata-id.txt -output vectors.txt -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1`

Parameter choices below vary:

* 100-dimensional vectors, as the 400d vectors of the paper don't seem to offer much benefit on this task
* similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out
* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`
* added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model)
* a `min_count=2` saves quite a bit of model memory, discarding only words that appear in a single doc (and are thus no more expressive than the unique-to-each doc vectors themselves)
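
To make the `min_count` bullet concrete, here is a minimal sketch on three made-up documents. It assumes the threshold is applied to a word's total number of occurrences across the corpus; the toy sentences and variable names are illustrative only.

```python
from collections import Counter

# Made-up, already-tokenized documents
docs = [
    'the plot was thin'.split(),
    'the acting was fine'.split(),
    'a thin plot'.split(),
]
min_count = 2

freq = Counter(word for doc in docs for word in doc)
kept = sorted(w for w, c in freq.items() if c >= min_count)
dropped = sorted(w for w, c in freq.items() if c < min_count)
print(kept, dropped)
```

Words seen only once carry no information beyond the document they appear in, which is the rationale the bullet gives for discarding them.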

In [25]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "this will be painfully slow otherwise"

simple_models = [
    # PV-DM w/concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# speed setup by sharing results of 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM/concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
Doc2Vec(dbow,d100,n5,mc2,t4)
Doc2Vec(dm/m,d100,n5,w10,mc2,t4)


Following the paper, we also evaluate models in pairs. These wrappers return the concatenation of the vectors from each model. (Only the singular models are trained.)
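
The wrapper cell itself is not shown in this notebook. As a rough sketch of the idea, the hypothetical class below looks up the same document tag in several models and joins the resulting vectors end to end; plain dicts stand in for trained models so the example runs on its own. (gensim's test module has provided a `ConcatenatedDoc2Vec` helper for this purpose; check that it exists in your installed version before relying on it.)

```python
class ConcatenatedVectors(object):
    """Hypothetical wrapper: look up a tag in several models and join the vectors."""

    def __init__(self, models):
        self.models = models

    def __getitem__(self, tag):
        joined = []
        for model in self.models:
            joined.extend(model[tag])
        return joined

# Stand-ins for trained models: dicts mapping a doc tag to its vector
dbow_vectors = {0: [0.1, 0.2], 1: [0.3, 0.4]}
dm_vectors = {0: [0.5, 0.6, 0.7], 1: [0.8, 0.9, 1.0]}

pair = ConcatenatedVectors([dbow_vectors, dm_vectors])
print(pair[0])
```

The combined vector has as many dimensions as the underlying vectors put together, so two 100-dimensional models would yield 200 regressors per document.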

## Predictive Evaluation Methods

Helper methods for evaluating error rate.

In [36]:
import numpy as np
import statsmodels.api as sm
from random import sample

# for timing
from contextlib import contextmanager
from timeit import default_timer
import time 

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start
    
def logistic_predictor_from_data(train_targets, train_regressors):
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(method='bfgs')
    #print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Report error rate on test_set sentiments, using the supplied model and train_set"""

    train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if infer:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_data]
    test_regressors = sm.add_constant(test_regressors)
    
    # predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)
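
The scoring step at the end of `error_rate_for_model` can be illustrated without gensim or statsmodels. This is a simplified sketch with made-up probabilities and labels; the `error_rate` helper is hypothetical and only mirrors the round-and-compare logic.

```python
def error_rate(predicted_probs, labels):
    """Round probabilities to 0/1 and count mismatches against the true labels."""
    predictions = [int(round(p)) for p in predicted_probs]
    errors = sum(1 for pred, label in zip(predictions, labels) if pred != label)
    return float(errors) / len(labels), errors, len(labels)

# Made-up logistic outputs and true labels
rate, errors, total = error_rate([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0])
print('%f (%d of %d)' % (rate, errors, total))

# With 1423 test documents, 5 misclassifications give this rate:
print('%f' % (5 / 1423.0))
```

Because the rate is a ratio of small integers, it can only take values that are multiples of one over the number of test documents.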


## Bulk Training

Using explicit multiple-pass, alpha-reduction approach as sketched in [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/) – with added shuffling of corpus on each pass.

Note that vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/EXTRA docs.

Evaluation of each model's sentiment-predictive power is repeated after each pass, as an error rate (lower is better), to see the rates-of-relative-improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. 
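
The learning-rate schedule used by the training loop can be written out on its own. The sketch below assumes the same `alpha`, `min_alpha` and `passes` values as the loop and computes the rate for each pass in closed form rather than by repeated subtraction.

```python
alpha, min_alpha, passes = 0.025, 0.001, 20
alpha_delta = (alpha - min_alpha) / passes

# Learning rate used for each pass, from first to last
schedule = [alpha - epoch * alpha_delta for epoch in range(passes)]
print('%.6f %.6f %.6f' % (schedule[0], schedule[1], schedule[-1]))
```

With this step size the last pass runs at 0.0022, one `alpha_delta` above `min_alpha`; the rate would only reach `min_alpha` after the final decrement.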

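(The run logged below, 20 passes of training and evaluating the 3 main models on the 5,692 documents loaded above, took about 10 minutes with 4 worker threads. On the much larger IMDB corpus, the same procedure takes about an hour on a 4-core 2.6GHz Intel Core i7.)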

In [37]:
from collections import defaultdict
best_error = defaultdict(lambda :1.0)  # to selectively-print only best errors achieved

In [39]:
from random import shuffle
import datetime

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)  # shuffling gets best results
    
    for name, train_model in models_by_name.items():
        # train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list)
            duration = '%.1f' % elapsed()
            
        # evaluate
        eval_duration = ''
        with elapsed_timer() as eval_elapsed:
            err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)
        eval_duration = '%.1f' % eval_elapsed()
        best_indicator = ' '
        if err <= best_error[name]:
            best_error[name] = err
            best_indicator = '*' 
        print("%s%f : %i passes : %s %ss %ss" % (best_indicator, err, epoch + 1, name, duration, eval_duration))

        if ((epoch + 1) % 5) == 0 or epoch == 0:
            eval_duration = ''
            with elapsed_timer() as eval_elapsed:
                infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)
            eval_duration = '%.1f' % eval_elapsed()
            best_indicator = ' '
            if infer_err < best_error[name + '_inferred']:
                best_error[name + '_inferred'] = infer_err
                best_indicator = '*'
            print("%s%f : %i passes : %s %ss %ss" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))

    print('completed pass %i at alpha %f' % (epoch + 1, alpha))
    alpha -= alpha_delta
    
print("END %s" % str(datetime.datetime.now()))

START 2016-09-26 14:52:35.807063
         Current function value: 0.086451
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 15.3s 0.4s




         Current function value: 0.086451
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.000000 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_inferred 15.3s 0.8s
         Current function value: 0.035297
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.016163 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.5s 0.3s




         Current function value: 0.035297
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.007042 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,t4)_inferred 5.5s 0.4s
         Current function value: 0.063490
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
*0.004919 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 9.5s 0.3s




         Current function value: 0.063490
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
*0.014085 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4)_inferred 9.5s 0.6s
completed pass 1 at alpha 0.025000
         Current function value: 0.084170
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
*0.003514 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 15.1s 0.2s




         Current function value: 0.027971
         Iterations: 35
         Function evaluations: 40
         Gradient evaluations: 40
 0.023190 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.7s 0.3s
         Current function value: 0.060596
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
 0.007730 : 2 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.6s 0.4s
completed pass 2 at alpha 0.023800




         Current function value: 0.084379
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 3 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 13.7s 0.2s
         Current function value: 0.032239
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.023893 : 3 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 6.1s 0.3s




         Current function value: 0.055943
         Iterations: 35
         Function evaluations: 40
         Gradient evaluations: 40
 0.008433 : 3 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 9.1s 0.3s
completed pass 3 at alpha 0.022600
         Current function value: 0.083235
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 4 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.3s 0.3s




         Current function value: 0.031158
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.028812 : 4 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.5s 0.2s
         Current function value: 0.050490
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42
 0.013352 : 4 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 9.5s 0.4s
completed pass 4 at alpha 0.021400




         Current function value: 0.081472
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
*0.003514 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.0s 0.2s
         Current function value: 0.081472
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.007042 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_inferred 14.0s 0.8s




         Current function value: 0.025673
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.025299 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.4s 0.3s
         Current function value: 0.025673
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.014085 : 5 passes : Doc2Vec(dbow,d100,n5,mc2,t4)_inferred 5.4s 0.4s




         Current function value: 0.050749
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.011947 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.3s 0.3s
         Current function value: 0.050749
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
*0.007042 : 5 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4)_inferred 8.3s 0.6s
completed pass 5 at alpha 0.020200




         Current function value: 0.081301
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
*0.003514 : 6 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.8s 0.3s
         Current function value: 0.027116
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
 0.023190 : 6 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.9s 0.3s




         Current function value: 0.047744
         Iterations: 35
         Function evaluations: 43
         Gradient evaluations: 43
 0.014055 : 6 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 9.5s 0.4s
completed pass 6 at alpha 0.019000
         Current function value: 0.081099
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 7 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.7s 0.2s




         Current function value: 0.024714
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
 0.030218 : 7 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 6.1s 0.3s
         Current function value: 0.046800
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42
 0.014055 : 7 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 9.2s 0.3s
completed pass 7 at alpha 0.017800




         Current function value: 0.081522
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 8 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.8s 0.3s
         Current function value: 0.028792
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.020379 : 8 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 6.0s 0.3s




         Current function value: 0.046141
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42
 0.011244 : 8 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 9.8s 0.2s
completed pass 8 at alpha 0.016600
         Current function value: 0.079449
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 9 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 13.2s 0.2s




         Current function value: 0.029077
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.021785 : 9 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.3s
         Current function value: 0.046610
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42
 0.012649 : 9 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.4s 0.2s
completed pass 9 at alpha 0.015400




         Current function value: 0.079087
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 13.5s 0.2s
         Current function value: 0.079087
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
 0.000000 : 10 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_inferred 13.5s 0.7s




         Current function value: 0.028692
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.020379 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.3s
         Current function value: 0.028692
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.014085 : 10 passes : Doc2Vec(dbow,d100,n5,mc2,t4)_inferred 5.6s 0.5s




         Current function value: 0.045625
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.013352 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.4s 0.3s
         Current function value: 0.045625
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.014085 : 10 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4)_inferred 8.4s 0.7s
completed pass 10 at alpha 0.014200




         Current function value: 0.078824
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 11 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.1s 0.2s
         Current function value: 0.029090
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.023893 : 11 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.3s




         Current function value: 0.045584
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.014055 : 11 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.3s 0.3s
completed pass 11 at alpha 0.013000
         Current function value: 0.078798
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 12 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 13.9s 0.3s




         Current function value: 0.028946
         Iterations: 35
         Function evaluations: 37
         Gradient evaluations: 37
 0.021785 : 12 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.5s 0.3s
         Current function value: 0.045357
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.011947 : 12 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.2s 0.2s
completed pass 12 at alpha 0.011800




         Current function value: 0.079508
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 13 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 13.9s 0.3s
         Current function value: 0.027802
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.027407 : 13 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.9s 0.3s




         Current function value: 0.044872
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42
 0.010541 : 13 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.5s 0.3s
completed pass 13 at alpha 0.010600
         Current function value: 0.079436
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 14 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 13.4s 0.3s




         Current function value: 0.029386
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.023190 : 14 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.5s 0.3s
         Current function value: 0.044476
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.010541 : 14 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.3s 0.2s
completed pass 14 at alpha 0.009400




         Current function value: 0.079456
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.2s 0.2s
         Current function value: 0.079456
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
 0.007042 : 15 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_inferred 14.2s 0.7s




         Current function value: 0.028678
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.021082 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.5s 0.4s
         Current function value: 0.028678
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.007042 : 15 passes : Doc2Vec(dbow,d100,n5,mc2,t4)_inferred 5.5s 0.6s




         Current function value: 0.043993
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42
 0.011244 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.5s 0.3s
         Current function value: 0.043993
         Iterations: 35
         Function evaluations: 42
         Gradient evaluations: 42
 0.007042 : 15 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4)_inferred 8.5s 0.6s
completed pass 15 at alpha 0.008200




         Current function value: 0.079197
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 16 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.0s 0.2s
         Current function value: 0.027826
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.021082 : 16 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.3s




         Current function value: 0.044545
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.011244 : 16 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.3s 0.3s
completed pass 16 at alpha 0.007000
         Current function value: 0.079372
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 17 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 13.8s 0.3s




         Current function value: 0.029555
         Iterations: 35
         Function evaluations: 38
         Gradient evaluations: 38
 0.020379 : 17 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.3s
         Current function value: 0.044636
         Iterations: 35
         Function evaluations: 45
         Gradient evaluations: 45
 0.011244 : 17 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.3s 0.3s
completed pass 17 at alpha 0.005800




         Current function value: 0.079206
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 18 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.5s 0.2s
         Current function value: 0.026905
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
 0.028110 : 18 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.4s




         Current function value: 0.044542
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.011244 : 18 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.3s 0.3s
completed pass 18 at alpha 0.004600
         Current function value: 0.079226
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 19 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.3s 0.3s




         Current function value: 0.027406
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
 0.028110 : 19 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.2s
         Current function value: 0.044980
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.011244 : 19 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.1s 0.2s
completed pass 19 at alpha 0.003400




         Current function value: 0.079212
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
*0.003514 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4) 14.0s 0.3s
         Current function value: 0.079212
         Iterations: 35
         Function evaluations: 36
         Gradient evaluations: 36
 0.007042 : 20 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_inferred 14.0s 0.7s




         Current function value: 0.027718
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
 0.026704 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,t4) 5.6s 0.2s
         Current function value: 0.027718
         Iterations: 35
         Function evaluations: 39
         Gradient evaluations: 39
 0.014085 : 20 passes : Doc2Vec(dbow,d100,n5,mc2,t4)_inferred 5.6s 0.4s




         Current function value: 0.044947
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
 0.010541 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4) 8.1s 0.2s
         Current function value: 0.044947
         Iterations: 35
         Function evaluations: 41
         Gradient evaluations: 41
*0.000000 : 20 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,t4)_inferred 8.1s 0.7s
completed pass 20 at alpha 0.002200
END 2016-09-26 15:02:32.271577




## Achieved Sentiment-Prediction Accuracy

In [40]:
# print best error rates achieved
for rate, name in sorted((rate, name) for name, rate in best_error.items()):
    print("%f %s" % (rate, name))

0.000000 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)_inferred
0.000000 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)_inferred
0.003514 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
0.004919 Doc2Vec(dm/m,d100,n5,w10,mc2,t4)
0.007042 Doc2Vec(dbow,d100,n5,mc2,t4)_inferred
0.016163 Doc2Vec(dbow,d100,n5,mc2,t4)


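In my testing on IMDB, unlike the paper's report, DBOW performs best. Concatenating vectors from different models only offers a small predictive improvement. The best results I've seen there are still just under 10% error rate, still a ways from the paper's 7.42%.

The run printed above, on the 5,692-document corpus, orders the models differently: DM/concat reaches the lowest stored-vector error (0.003514, i.e. 5 of 1,423 test docs) and DBOW the highest (0.016163, i.e. 23 of 1,423). The `_inferred` rates are computed on a 10% subsample of the test set (142 docs), so they move in steps of 1/142 ≈ 0.007042, and a reported 0.000000 rests on a small sample.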


## Examining Results

### Are inferred vectors close to the precalculated ones?

In [13]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

for doc 25430...
Doc2Vec(dm/c,d100,n5,w5,mc2,t8):
 [(25430, 0.6583491563796997), (27314, 0.4142411947250366), (16479, 0.40846431255340576)]
Doc2Vec(dbow,d100,n5,mc2,t8):
 [(25430, 0.9325973987579346), (49281, 0.5766637921333313), (79679, 0.5634804964065552)]
Doc2Vec(dm/m,d100,n5,w10,mc2,t8):
 [(25430, 0.7970066666603088), (97818, 0.6925815343856812), (230, 0.690807580947876)]


(Yes, here the stored vector from 20 epochs of training is usually one of the closest to a freshly-inferred vector for the same words. Note the defaults for inference are very abbreviated – just 3 steps starting at a high alpha – and likely need tuning for other applications.)
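
The scores returned by `most_similar` are cosine similarities. As a small illustration with made-up 3-dimensional vectors (the names `stored`, `inferred` and `unrelated` are placeholders, not notebook variables), a near-parallel pair scores close to 1 and an orthogonal pair scores 0:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

stored = [1.0, 2.0, 3.0]      # stands in for a trained doc vector
inferred = [2.0, 4.0, 6.1]    # nearly the same direction, different length
unrelated = [3.0, -1.5, 0.0]  # orthogonal to `stored`

close = cosine_similarity(stored, inferred)
far = cosine_similarity(stored, unrelated)
print(close, far)
```

Since only direction matters, an inferred vector can differ in magnitude from the stored one and still rank first.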

### Do close documents seem more related than distant ones?

In [43]:
import random

doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))


TARGET (3071): «Hello, i'm a 20 y/o sexually active female, and my partner was recently run for an STD panel, and he was diagnosed with an outbreak of ureaplasma urealyticum, which i read to be a normal comensel of the vagina. i was also run for STDs 2months ago, and the report was clean.i don\'t have any symptoms .should i get tested again? do i need treatment, as ureaplasma urealyticum is a normal commensal of the vagina, can it cause anything in me, when i have a fairly good immune system?»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/c,d100,n5,w5,mc2,t4):

MOST (4240, 0.551862895488739): «hi i have had random very light bleeeding on and off for last to days. i am not due for period for two weeks. i am on no birthcontrol as ttc. i also have a 1year old»

MEDIAN (4316, -0.009853601455688477): «hi there just wondering with fibroids and a uterus that is enlarged 3 times the size could make me feel like im going through perimenopause? i have awful backpain,warm body temperatures at nig

(Somewhat, in terms of reviewer tone, movie genre, etc... the MOST cosine-similar docs usually seem more like the TARGET than the MEDIAN or LEAST.)

### Do the word vectors show useful similarities?

In [44]:
word_models = simple_models[:]

In [51]:
import random
from IPython.display import HTML
# pick a random word with a suitable number of occurrences
while True:
    word = random.choice(word_models[0].index2word)
    if word_models[0].vocab[word].count > 10:
        break
# or uncomment below line, to just pick a word from the relevant domain:
#word = 'comedy/drama'
similars_per_model = [str(model.most_similar(word, topn=20)).replace('), ','),<br>\n') for model in word_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join([str(model) for model in word_models]) + 
    "</th></tr><tr><td>" +
    "</td><td>".join(similars_per_model) +
    "</td></tr></table>")
print("most similar words for '%s' (%d occurences)" % (word, simple_models[0].vocab[word].count))
HTML(similar_table)

most similar words for 'turned' (102 occurences)

| Doc2Vec(dm/c,d100,n5,w5,mc2,t4) | Doc2Vec(dbow,d100,n5,mc2,t4) | Doc2Vec(dm/m,d100,n5,w10,mc2,t4) |
|---|---|---|
| `moved` (0.44929295778274536) | `fluttering` (0.36668860912323) | `came` (0.43062785267829895) |
| `shows` (0.4469910264015198) | `tongue,` (0.3573368191719055) | `turns` (0.40876054763793945) |
| `gone` (0.4420161843299866) | `inside.` (0.3471986651420593) | `stayed` (0.40178734064102173) |
| `poor` (0.4380697011947632) | `bruise` (0.34093177318573) | `reached` (0.40010756254196167) |
| `returned` (0.43781766295433044) | `horrible` (0.33784306049346924) | `comes` (0.3972766101360321) |
| `stays` (0.4321470856666565) | `this?` (0.3312210440635681) | `grows` (0.3926957845687866) |
| `turns` (0.43158257007598877) | `thin.` (0.3285936713218689) | `got` (0.38521766662597656) |
| `gotten` (0.4310656785964966) | `pelvic` (0.3244484066963196) | `ended` (0.3720528483390808) |
| `tasted` (0.43086183071136475) | `partener` (0.3230714201927185) | `settles` (0.3671431541442871) |
| `yest` (0.428362637758255) | `distinct` (0.31963855028152466) | `appeared` (0.3639615774154663) |
| `depends` (0.42474836111068726) | `Hoping` (0.3180590569972992) | `rolls` (0.36344873905181885) |
| `didn't.` (0.4231417179107666) | `bloodshot` (0.3178950846195221) | `fell` (0.3612420856952667) |
| `intend` (0.42213356494903564) | `attempted` (0.3154103755950928) | `armpit,` (0.3517459034919739) |
| `adds` (0.4152626097202301) | `orgasam.` (0.31210803985595703) | `seemed` (0.349328875541687) |
| `got` (0.4135403633117676) | `experiencing:` (0.30775538086891174) | `was` (0.34488916397094727) |
| `grew` (0.408092200756073) | `city` (0.3042522072792053) | `act` (0.34478825330734253) |
| `spreads` (0.4066140651702881) | `shower?` (0.30192750692367554) | `inside).` (0.3439972400665283) |
| `was` (0.40580853819847107) | `noticeable.` (0.3007444143295288) | `labia,` (0.3416295349597931) |
| `(enough` (0.4030357599258423) | `suspects` (0.29737356305122375) | `erosive` (0.34070563316345215) |
| `hurted` (0.4008103311061859) | `food` (0.29411038756370544) | `seems` (0.3406856060028076) |


Do the DBOW words look meaningless? That's because the gensim DBOW model doesn't train word vectors – they remain at their random initialized values – unless you ask with the `dbow_words=1` initialization parameter. Concurrent word-training slows DBOW mode significantly, and offers little improvement (and sometimes a little worsening) of the error rate on this IMDB sentiment-prediction task. 

Words from DM models tend to show meaningfully similar words when there are many examples in the training data (as with 'plot' or 'actor'). (All DM modes inherently involve word vector training concurrent with doc vector training.)

### Are the word vectors from this dataset any good at analogies?

In [52]:
# assuming something like
# https://word2vec.googlecode.com/svn/trunk/questions-words.txt 
# is in local directory
# note: this takes many minutes
for model in word_models:
    sections = model.accuracy('questions-words.txt')
    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
    print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))

IOError: [Errno 2] No such file or directory: 'questions-words.txt'

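The cell above failed in this run because `questions-words.txt` was not in the local directory, so no scores are printed. With the file in place: even though this is a tiny, domain-specific dataset, it shows some meager capability on the general word analogies – at least for the DM/concat and DM/mean models which actually train word vectors. (The untrained random-initialized words of the DBOW model of course fail miserably.)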
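
For readers unfamiliar with the analogy test, each question has the form "a is to b as c is to ?" and is answered by vector arithmetic. The sketch below is a simplified version of that vector-offset method on made-up 3-dimensional vectors; the toy vocabulary and the `solve_analogy` helper are assumptions for illustration, and gensim's own implementation may differ in details such as normalization.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def solve_analogy(vectors, a, b, c):
    """Return the word d maximising cos(d, b - a + c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up word vectors, chosen so the offset man -> woman matches king -> queen
toy = {
    'man': [1.0, 0.0, 0.2],
    'woman': [1.0, 1.0, 0.2],
    'king': [0.2, 0.0, 1.0],
    'queen': [0.2, 1.0, 1.0],
    'apple': [0.9, 0.1, -0.8],
}
answer = solve_analogy(toy, 'man', 'woman', 'king')
print(answer)
```

A model with random, untrained word vectors has no such consistent offsets, which is why DBOW without `dbow_words=1` is expected to score near zero.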

## Slop

In [None]:
This cell left intentionally erroneous. 

To mix the Google dataset (if locally available) into the word tests...

In [None]:
from gensim.models import Word2Vec
w2v_g100b = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
w2v_g100b.compact_name = 'w2v_g100b'
word_models.append(w2v_g100b)

To get copious logging output from above steps...

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
rootLogger = logging.getLogger()
rootLogger.setLevel(logging.INFO)

To auto-reload python code while developing...

In [None]:
%load_ext autoreload
%autoreload 2