# NLP on SEC Forms Using Doc2vec with Gensim
## Introduction
Throughout this notebook, we reference <a href="https://arxiv.org/pdf/1405.4053.pdf">Le and Mikolov 2014</a>. 

### Bag-of-words Model
Traditional document representations are based on the <a href="https://en.wikipedia.org/wiki/Bag-of-words_model">bag-of-words model</a>, which represents each input document as a fixed-length vector. For example, borrowing from the Wikipedia article, the two documents  
(1) `John likes to watch movies. Mary likes movies too.`  
(2) `John also likes to watch football games.`  
are used to construct a list of 10 distinct words  
`["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games"]`  
so that we can represent the two documents as fixed-length vectors whose elements are the frequencies of the corresponding words in our list  
(1) `[1, 2, 1, 1, 2, 1, 1, 0, 0, 0]`  
(2) `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1]`  
Bag-of-words models are surprisingly effective but lose information about word order. Bag-of-<a href="https://en.wikipedia.org/wiki/N-gram">n-grams</a> models consider word phrases of length n to capture local word order, but they suffer from data sparsity and high dimensionality.
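The bag-of-words construction above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the `tokenize` and `bag_of_words` helpers are illustrative names, and the vocabulary is ordered by first appearance to match the list shown above.

```python
from collections import Counter
import re

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Keep alphabetic runs only, so "movies." becomes "movies"
    return re.findall(r"[A-Za-z]+", text)

# Build the vocabulary in order of first appearance across both documents
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

def bag_of_words(text):
    # Count each vocabulary word in the document (0 if absent)
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
print(vocab)
print(vectors)
```

Note that both vectors have length 10 regardless of document length, and that shuffling the words of a document would leave its vector unchanged.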

### Word2vec Model
Word2vec uses a shallow neural network to embed words in a high-dimensional vector space. In the resulting vector space, word vectors that are close together have similar contextual meanings and word vectors that are far apart have different contextual meanings. For example, `strong` and `powerful` would be closer together than `strong` and `Paris`. Word2vec models can be trained with either of two prediction tasks, which correspond to the skip-gram and continuous-bag-of-words models.
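Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors chosen by hand purely for illustration; they are not trained embeddings, and `cosine_similarity` is an illustrative helper.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumption: hand-picked, not learned)
toy_vectors = {
    "strong":   [0.9, 0.8, 0.1],
    "powerful": [0.8, 0.9, 0.2],
    "paris":    [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["strong"], toy_vectors["powerful"])
sim_far = cosine_similarity(toy_vectors["strong"], toy_vectors["paris"])
print(sim_close, sim_far)
```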


#### Word2vec - Skip-gram Model
The skip-gram <a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Word2vec</a> model takes pairs (word1, word2) generated by moving a window across the text and trains a neural network with one hidden layer on a fake task: given an input word, predict the probability distribution of nearby words. The input-to-hidden weights of the trained network become the word embeddings, so a hidden layer with 300 neurons gives us 300-dimensional word embeddings. The input words are <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> encoded.
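To make the windowing concrete, here is a minimal sketch of how (input word, nearby word) pairs and one-hot inputs could be generated. The function names and the tiny window size are illustrative assumptions; this is not how gensim implements skip-gram internally.

```python
def skipgram_pairs(tokens, window=2):
    # For every position, pair the center word with each neighbor in the window
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "john likes to watch movies".split()
pairs = skipgram_pairs(tokens, window=1)

# One-hot encoding over a sorted toy vocabulary
vocab = sorted(set(tokens))

def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(pairs)
print(one_hot("john"))
```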

#### Word2vec - Continuous-bag-of-words Model
The continuous-bag-of-words Word2vec model is also a neural network with one hidden layer. This time, the fake task is to predict the center word from the context words in a window around it. Again, the input-to-hidden weights become the word embeddings and the inputs are one-hot encoded.
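The corresponding training examples flip the direction of prediction: the context words are the input and the center word is the target. A minimal sketch, with `cbow_examples` as an illustrative helper name:

```python
def cbow_examples(tokens, window=2):
    # Each example is (list of context words, center word to predict)
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

tokens = "john likes to watch movies".split()
examples = cbow_examples(tokens, window=1)
for context, center in examples:
    print(context, "->", center)
```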

### Paragraph Vector
Le and Mikolov 2014 introduce the <i>Paragraph Vector</i>, which outperforms representing a document by averaging or concatenating its word vectors. We determine the embedding of a paragraph in the same way as for words: by training a shallow neural network on a fake task. Paragraph Vectors take local word order into account while still giving us a dense vector representation.

#### Paragraph Vector - Distributed Memory (PV-DM)
This is the Paragraph Vector model analogous to the continuous-bag-of-words Word2vec model. Paragraph Vectors are obtained by training a neural network on the fake task of inferring a center word from its context words and a context paragraph. The paragraph serves as shared context for every word in that paragraph.
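A minimal sketch of PV-DM training examples under the description above: each input combines a paragraph identifier with the context words, and the target is the center word. The helper name and toy paragraphs are illustrative assumptions.

```python
def pv_dm_examples(paragraphs, window=1):
    # Each example is ((paragraph id, context words), center word)
    examples = []
    for par_id, tokens in enumerate(paragraphs):
        for i, center in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
            examples.append(((par_id, context), center))
    return examples

paragraphs = [
    ["john", "likes", "movies"],
    ["mary", "likes", "football"],
]
examples = pv_dm_examples(paragraphs, window=1)
for inputs, target in examples:
    print(inputs, "->", target)
```

The same word `likes` appears as a target in both paragraphs, but with a different paragraph identifier in the input each time.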

#### Paragraph Vector - Distributed Bag of Words (PV-DBOW)
This is the Paragraph Vector model analogous to the skip-gram Word2vec model. Paragraph Vectors are obtained by training a neural network on the fake task of predicting words randomly sampled from a paragraph, given only that paragraph's vector as input.
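As described in Le and Mikolov 2014, PV-DBOW ignores context words on the input side: the input is the paragraph and the target is a word sampled from it. A minimal sketch, where the helper name, the number of samples, and the fixed seed are illustrative assumptions:

```python
import random

def pv_dbow_examples(paragraphs, samples_per_paragraph=3, seed=0):
    # Each example is (paragraph id, word sampled from that paragraph)
    rng = random.Random(seed)
    examples = []
    for par_id, tokens in enumerate(paragraphs):
        for _ in range(samples_per_paragraph):
            examples.append((par_id, rng.choice(tokens)))
    return examples

paragraphs = [
    ["john", "likes", "movies"],
    ["mary", "likes", "football"],
]
examples = pv_dbow_examples(paragraphs)
print(examples)
```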

### Requirements
The following Python modules are dependencies for this notebook, in addition to `gensim`, `smart_open`, and `numpy`, which the code cells import directly:
* testfixtures
* statsmodels

In [47]:
from os import remove
import locale
import glob
import os.path
import sys
import smart_open
import re

dirname = 'data'
locale.setlocale(locale.LC_ALL, 'C')
if sys.version_info[0] >= 3:
    control_chars = [chr(0x85)]
else:
    control_chars = [unichr(0x85)]

# Convert text to lower-case and strip punctuation and symbols from words
def normalize_text(text):
    norm_text = text.lower()
    norm_text = norm_text.replace('<br />', ' ')
    norm_text = norm_text.replace('\r', ' ')
    norm_text = norm_text.replace('\n', ' ')
    # Pad punctuation with spaces on both sides
    for char in ['.', '"', ',', '(', ')', '!', '?', ';', ':', '{', '}', '<', '>']:
        norm_text = norm_text.replace(char, ' ' + char + ' ')
    # Collapse runs of whitespace; the result must be assigned back to take effect
    norm_text = re.sub(r'\s+', ' ', norm_text).strip()
    return norm_text

def prepData():
    folders = ['train/pos', 'train/neg', 'test/pos', 'test/neg']
    # Delete .txt files from the last run
    if os.path.isfile(os.path.join(dirname, 'alldata-id.txt')):
        for fol in folders:
            os.remove(os.path.join(dirname, fol.replace('/', '-') + '.txt'))
        # Remove the combined file once, after the per-folder files
        os.remove(os.path.join(dirname, 'alldata-id.txt'))
        
    print("Preparing dataset...")
    alldata = u''
    for fol in folders:
        temp = u''
        txt_files = glob.glob(os.path.join(dirname, fol, '*.txt'))
        i = 0
        for txt in txt_files:
            with smart_open.smart_open(txt, "rb") as t:
                t_clean = t.read().decode("utf-8")
                for c in control_chars:
                    t_clean = t_clean.replace(c, ' ')                
                t_clean = normalize_text(t_clean)
                temp += t_clean
            temp += "\n"
        output = fol.replace('/', '-') + '.txt'
        print("{} aggregated".format(os.path.join(dirname, output)))
        with smart_open.smart_open(os.path.join(dirname, output), "wb") as n:
            for idx, line in enumerate(temp.split('\n')):
                num_line = u"_*{0} {1}\n".format(idx, line)
                n.write(num_line.encode("UTF-8"))
        alldata += temp
    with smart_open.smart_open(os.path.join(dirname, 'alldata-id.txt'), 'wb') as f:
        for idx, line in enumerate(alldata.split('\n')):
            num_line = u"_*{0} {1}\n".format(idx, line)
            f.write(num_line.encode("UTF-8")) 
    print("alldata-id.txt aggregated")
    return alldata

import time
start = time.perf_counter()  # time.clock() was removed in Python 3.8
alldata = prepData()
end = time.perf_counter()
print("Total running time: ", end - start)

Preparing dataset...
data/train-pos.txt aggregated
data/train-neg.txt aggregated
data/test-pos.txt aggregated
data/test-neg.txt aggregated
alldata-id.txt aggregated
Total running time:  29.059736000000015


In [48]:
import os.path
assert os.path.isfile(os.path.join(dirname, "alldata-id.txt")), "alldata-id.txt unavailable"

In [49]:
def file_len(fname):
    # Count lines; returns 0 for an empty file
    with open(fname, encoding='utf-8') as f:
        return sum(1 for _ in f)

num_lines_train_pos = file_len(os.path.join(dirname, 'train-pos.txt'))
num_lines_train_neg = file_len(os.path.join(dirname, 'train-neg.txt'))
num_lines_test_pos = file_len(os.path.join(dirname, 'test-pos.txt'))
num_lines_test_neg = file_len(os.path.join(dirname, 'test-neg.txt'))
train_pos_idx = 0
train_neg_idx = train_pos_idx + num_lines_train_pos
test_pos_idx = train_neg_idx + num_lines_train_neg
test_neg_idx = test_pos_idx + num_lines_test_pos
unsup_idx = test_neg_idx + num_lines_test_neg
num_lines_alldata =  len(alldata.split('\n'))

print(num_lines_alldata)
print(num_lines_train_pos + num_lines_train_neg + num_lines_test_pos + num_lines_test_neg)
print(file_len(os.path.join(dirname, 'alldata-id.txt')))

print("Total number of paragraphs is {}".format(num_lines_alldata))

1392
1395
1392
Total number of paragraphs is 1392
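
The three counts printed above do not agree: the per-folder files sum to 1395 lines, while `alldata` and `alldata-id.txt` each have 1392. Reading `prepData`, this appears to come from appending `"\n"` after every document and then splitting on `"\n"`, which leaves one empty trailing entry per string. A minimal sketch with hypothetical toy strings:

```python
# Each document is followed by "\n", as in prepData (toy data, two "folders")
folder_texts = ["doc a\ndoc b\n", "doc c\n"]

# Splitting each folder separately yields one empty trailing entry per folder
per_folder_lines = [len(text.split("\n")) for text in folder_texts]

# Splitting the concatenation yields only one empty trailing entry in total
all_lines = len("".join(folder_texts).split("\n"))

print(per_folder_lines, sum(per_folder_lines), all_lines)
```

The surplus equals the number of folders minus one, so with four folders it is three, matching 1395 versus 1392. It also suggests that boundaries such as `train_neg_idx`, which are computed from the per-folder counts, may sit up to three lines later than the true boundaries in `alldata-id.txt`.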


In [50]:
import gensim
from collections import namedtuple

SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

def getAllDocs():
    alldocs = []  # Will hold all docs in original order
    with open(os.path.join(dirname, 'alldata-id.txt'), encoding='utf-8') as alldata:
        for line_no, line in enumerate(alldata):
            tokens = gensim.utils.to_unicode(line).split()
            words = tokens[1:]
            tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
            split_enum = ['train', 'test', 'extra']
            sentiment_enum = [1.0, 0.0, None]
            if (train_pos_idx <= line_no and line_no < train_neg_idx):
                split = split_enum[0]
                sentiment = sentiment_enum[0]
            elif (train_neg_idx <= line_no and line_no < test_pos_idx):
                split = split_enum[0]
                sentiment = sentiment_enum[1]
            elif (test_pos_idx <= line_no and line_no < test_neg_idx):
                split = split_enum[1]
                sentiment = sentiment_enum[0]
            elif (test_neg_idx <= line_no and line_no < unsup_idx):
                split = split_enum[1]
                sentiment = sentiment_enum[1]
            else:
                split = split_enum[2]
                sentiment = sentiment_enum[2]
            alldocs.append(SentimentDocument(words, tags, split, sentiment))
    return alldocs
alldocs = getAllDocs()
alldata = None # Clear this out of memory because we're done using it.

In [52]:
print(len(alldocs))
print(alldocs[0])

1392
SentimentDocument(words=['<', 'header', '>', '<', 'filestats', '>', '<', 'filename', '>', '20161101_10-q_edgar_data_1302573_0001564590-16-026741_1', '.', 'txt', '<', '/filename', '>', '<', 'grossfilesize', '>', '5253181', '<', '/grossfilesize', '>', '<', 'netfilesize', '>', '92420', '<', '/netfilesize', '>', '<', 'ascii_embedded_chars', '>', '185805', '<', '/ascii_embedded_chars', '>', '<', 'html_chars', '>', '1212702', '<', '/html_chars', '>', '<', 'xbrl_chars', '>', '1911476', '<', '/xbrl_chars', '>', '<', 'xml_chars', '>', '1137301', '<', '/xml_chars', '>', '<', 'n_tables', '>', '23', '<', '/n_tables', '>', '<', 'n_exhibits', '>', '10', '<', '/n_exhibits', '>', '<', '/filestats', '>', '<', 'sec-header', '>', '0001564590-16-026741', '.', 'hdr', '.', 'sgml', ':', '20161101', '<', 'acceptance-datetime', '>', '20161101162421', 'accession', 'number', ':', '0001564590-16-026741', 'conformed', 'submission', 'type', ':', '10-q', 'public', 'document', 'count', ':', '55', 'conformed', 'p

## Set-up Doc2Vec Training & Evaluation Models

We reference [Le & Mikolov 2014](http://cs.stanford.edu/~quocle/paragraph_vector.pdf) and [go.sh](https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ):

We vary the following parameter choices:
* 100-dimensional vectors, as the 400-d vectors of the paper don't seem to offer much benefit on this task
* Similarly, frequent word subsampling seems to decrease sentiment-prediction accuracy, so it's left out
* `cbow=0` means skip-gram which is equivalent to the paper's 'PV-DBOW' mode, matched in gensim with `dm=0`
* Added to that DBOW model are two DM models, one which averages context vectors (`dm_mean`) and one which concatenates them (`dm_concat`, resulting in a much larger, slower, more data-hungry model); only the concatenating DM model is enabled in the cell below, and the averaging one remains in the commented-out list
* A `min_count=2` saves quite a bit of model memory, discarding only words that appear once in the whole corpus (and are thus no more expressive than the unique-to-each-doc vectors themselves)
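
To illustrate the effect of `min_count`, here is a rough sketch that counts words across a toy corpus and drops those below the threshold. It mimics the outcome only; `prune_vocab` is an illustrative helper and not gensim's implementation.

```python
from collections import Counter

def prune_vocab(tokenized_docs, min_count=2):
    # Count total occurrences over the whole corpus, then keep frequent words
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

docs = [
    ["net", "income", "rose"],
    ["net", "loss", "rose"],
    ["income", "fell"],
]
kept = prune_vocab(docs, min_count=2)
print(sorted(kept))
```

Here `loss` and `fell` each occur once, so they are discarded, while the three words that occur twice are kept.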

In [55]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
print("{} cores".format(cores))
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

# simple_models = [
#     # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
#     Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
#     # PV-DBOW 
#     Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
#     # PV-DM w/ average
#     Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
# ]

simple_models = [
    #   PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores)
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

4 cores
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)
Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)


Le and Mikolov note that combining paragraph vectors from Distributed Bag of Words (DBOW) and Distributed Memory (DM) improves performance. To follow that approach, the models are paired for evaluation by concatenating the paragraph vectors obtained from each. The pairing code in the next cell is commented out, so the results below come from the individual models only.
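
A minimal sketch of what concatenating two paragraph vectors means, using hypothetical 3-dimensional toy vectors rather than the 100-dimensional vectors trained here. `concatenate_docvecs` is an illustrative helper and not part of gensim.

```python
def concatenate_docvecs(dbow_vec, dm_vec):
    # Place the two vectors end to end; dimensions add up
    return list(dbow_vec) + list(dm_vec)

# Hypothetical vectors for the same document from two different models
dbow_vec = [0.1, 0.2, 0.3]
dm_vec = [0.4, 0.5, 0.6]

combined = concatenate_docvecs(dbow_vec, dm_vec)
print(len(combined), combined)
```

With two 100-dimensional models, the combined representation fed to the classifier would have 200 dimensions.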

In [56]:
#from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
#models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])
#models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])

## Predictive Evaluation Methods

Let's define some helper methods for evaluating our Doc2vec models using paragraph vectors. We classify document sentiment with a logistic regression model trained on the paragraph embeddings, and we compare error rates across the paragraph embeddings produced by our various Doc2vec models.
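
The error rate itself is a simple count of misclassified documents. The sketch below shows that step in isolation with hypothetical predicted probabilities; it thresholds strictly above 0.5, which for probabilities in [0, 1] matches rounding with `np.rint` (a value of exactly 0.5 rounds to 0). The `error_rate` helper is an illustrative name.

```python
def error_rate(predicted_probs, true_labels):
    # Turn probabilities into 0/1 predictions, then count disagreements
    predictions = [1.0 if p > 0.5 else 0.0 for p in predicted_probs]
    errors = sum(1 for pred, truth in zip(predictions, true_labels) if pred != truth)
    return errors / len(true_labels), errors

# Hypothetical classifier outputs and labels (1.0 = positive, 0.0 = negative)
probs = [0.9, 0.2, 0.6, 0.4]
labels = [1.0, 0.0, 0.0, 1.0]

rate, errors = error_rate(probs, labels)
print(rate, errors)
```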

In [57]:
import numpy as np
import statsmodels.api as sm
from random import sample

# For timing
from contextlib import contextmanager
from timeit import default_timer
import time 

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start
    
def logistic_predictor_from_data(train_targets, train_regressors):
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set, infer=False, infer_steps=3, infer_alpha=0.1, infer_subsample=0.1):
    """Report error rate on test_set sentiments, using the supplied model and train_set"""

    train_targets, train_regressors = zip(*[(doc.sentiment, test_model.docvecs[doc.tags[0]]) for doc in train_set])
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_data = test_set
    if infer:
        if infer_subsample < 1.0:
            test_data = sample(test_data, int(infer_subsample * len(test_data)))
        test_regressors = [test_model.infer_vector(doc.words, steps=infer_steps, alpha=infer_alpha) for doc in test_data]
    else:
        test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_data]
    test_regressors = sm.add_constant(test_regressors)
    
    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_data])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)


## Bulk Training
We use an explicit multiple-pass, alpha-reduction approach as sketched in this [gensim doc2vec blog post](http://radimrehurek.com/2014/12/doc2vec-tutorial/). That recipe also shuffles the corpus on each pass; the shuffling step is commented out in the training cell of this notebook.

Note that vector training is occurring on *all* documents of the dataset, which includes all TRAIN/TEST/DEV docs.

We evaluate each model's sentiment predictive power based on error rate, and the evaluation is repeated after each pass so we can see the rates of relative improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the _inferred_ results use newly-inferred TEST vectors. 
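
The alpha-reduction schedule can be previewed on its own. This sketch uses the same starting values as the training cell (0.025, 0.001, 10 passes) but with separate, illustrative variable names so that it does not interfere with the notebook's state.

```python
start_alpha, floor_alpha, n_passes = 0.025, 0.001, 10
step = (start_alpha - floor_alpha) / n_passes

# Learning rate used on each pass: a fixed linear decrease per pass
schedule = []
current = start_alpha
for _ in range(n_passes):
    schedule.append(round(current, 6))
    current -= step

print(schedule)
```

Because the rate is decreased after each pass, the last pass runs at `floor_alpha + step` rather than at `floor_alpha` itself.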

In [58]:
from collections import defaultdict
best_error = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved

In [59]:
#from random import shuffle
import datetime

alpha, min_alpha, passes = (0.025, 0.001, 10)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

train_docs = alldocs[train_pos_idx:test_pos_idx]
test_docs = alldocs[test_pos_idx:unsup_idx]

for epoch in range(passes):
    #shuffle(doc_list)  # Shuffling gets best results
    
    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            #train_model.train(doc_list, total_examples=len(doc_list), epochs=1)
            train_model.train(alldocs, total_examples=len(alldocs), epochs=1)
            duration = '%.1f' % elapsed()
            
        # Evaluate
        eval_duration = ''
        with elapsed_timer() as eval_elapsed:
            err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs)
        eval_duration = '%.1f' % eval_elapsed()
        best_indicator = ' '
        if err <= best_error[name]:
            best_error[name] = err
            best_indicator = '*' 
        print("%s%f : %i passes : %s %ss %ss" % (best_indicator, err, epoch + 1, name, duration, eval_duration))

        if ((epoch + 1) % 5) == 0 or epoch == 0:
            eval_duration = ''
            with elapsed_timer() as eval_elapsed:
                infer_err, err_count, test_count, predictor = error_rate_for_model(train_model, train_docs, test_docs, infer=True)
            eval_duration = '%.1f' % eval_elapsed()
            best_indicator = ' '
            if infer_err < best_error[name + '_inferred']:
                best_error[name + '_inferred'] = infer_err
                best_indicator = '*'
            print("%s%f : %i passes : %s %ss %ss" % (best_indicator, infer_err, epoch + 1, name + '_inferred', duration, eval_duration))

    print('Completed pass %i at alpha %f' % (epoch + 1, alpha))
    alpha -= alpha_delta

train_docs = None
test_docs = None 
alldocs = None # We're done with these.
    
print("END %s" % str(datetime.datetime.now()))

START 2017-07-27 12:50:01.449633
*0.542553 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 27.1s 0.1s
*0.500000 : 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred 27.1s 3.0s
*0.514184 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 9.4s 0.0s
*0.535714 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 9.4s 1.4s
Completed pass 1 at alpha 0.025000
*0.489362 : 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 28.3s 0.0s
*0.478723 : 2 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 10.2s 0.0s
Completed pass 2 at alpha 0.022600
*0.485816 : 3 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 33.1s 0.1s
*0.464539 : 3 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 12.9s 0.0s
Completed pass 3 at alpha 0.020200
 0.492908 : 4 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 26.4s 0.0s
 0.524823 : 4 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) 9.9s 0.0s
Completed pass 4 at alpha 0.017800
*0.450355 : 5 passes : Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) 25.9s 0.0s
 0.571429 : 5 passes : D

## Achieved Sentiment-Prediction Accuracy

In [60]:
# Print best error rates achieved
print("Err rate Model")
for rate, name in sorted((rate, name) for name, rate in best_error.items()):
    print("%f %s" % (rate, name))

Err rate Model
0.428571 Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred
0.450355 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)
0.464539 Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)
0.500000 Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4)_inferred


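For comparison, Le and Mikolov 2014 report a 7.42% error rate on the IMDB movie-review sentiment task. That is a different dataset, so the figure is not directly comparable to the SEC-form results above.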

## Examining Results

### Are inferred vectors close to the precalculated ones?

In [61]:
alldocs = getAllDocs()

In [63]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # Pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

for doc 739...
Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4):
 [(739, 0.9389610290527344), (1116, 0.6220653057098389), (757, 0.587980329990387)]
Doc2Vec(dbow,d100,n5,mc2,s0.001,t4):
 [(739, 0.9524559378623962), (1116, 0.5095655918121338), (781, 0.502806544303894)]


### Do close documents seem more related than distant ones?

In [64]:
import random

doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[sims[index][0]].words)))

TARGET (411): «< header > < filestats > < filename > 20161109_10-q_edgar_data_811831_0001437749-16-041549_1 . txt < /filename > < grossfilesize > 13232849 < /grossfilesize > < netfilesize > 120778 < /netfilesize > < ascii_embedded_chars > 399100 < /ascii_embedded_chars > < html_chars > 5718989 < /html_chars > < xbrl_chars > 5332215 < /xbrl_chars > < xml_chars > 1356541 < /xml_chars > < n_tables > 63 < /n_tables > < n_exhibits > 11 < /n_exhibits > < /filestats > < sec-header > 0001437749-16-041549 . hdr . sgml : 20161109 < acceptance-datetime > 20161109154653 accession number : 0001437749-16-041549 conformed submission type : 10-q public document count : 76 conformed period of report : 20160930 filed as of date : 20161109 date as of change : 20161109 filer : company data : company conformed name : northeast bancorp /me/ central index key : 0000811831 standard industrial classification : state commercial banks [6022] irs number : 010425066 state of incorporation : me fiscal year end : 06

### Do the word vectors show useful similarities?

In [65]:
word_models = simple_models[:]

In [71]:
import random
from IPython.display import HTML
# pick a random word with a suitable number of occurrences
while True:
    word = random.choice(word_models[0].wv.index2word)
    if word_models[0].wv.vocab[word].count > 10:
        break
# or uncomment below line, to just pick a word from the relevant domain:
#word = 'comedy/drama'
similars_per_model = [str(model.wv.most_similar(word, topn=20)).replace('), ','),<br>\n') for model in word_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join([str(model) for model in word_models]) + 
    "</th></tr><tr><td>" +
    "</td><td>".join(similars_per_model) +
    "</td></tr></table>")
print("most similar words for '%s' (%d occurrences)" % (word, simple_models[0].wv.vocab[word].count))
HTML(similar_table)

most similar words for '1966' (25 occurrences)

| Doc2Vec(dm/c,d100,n5,w5,mc2,s0.001,t4) | Doc2Vec(dbow,d100,n5,mc2,s0.001,t4) |
|---|---|
| ('dex', 0.6843340396881104) | ('aqx-1125', 0.42173779010772705) |
| ('1972', 0.6816273331642151) | ('stockton', 0.4145112633705139) |
| ('oslo', 0.6721408367156982) | ('ardagh', 0.4088331460952759) |
| ('pecos', 0.6614382266998291) | ('appurtenant', 0.4061877727508545) |
| ('ellsworth', 0.6578540802001953) | ('albeit', 0.39595872163772583) |
| ('1916', 0.657844066619873) | ('9/30/11', 0.38168269395828247) |
| ('ledyard', 0.6578050851821899) | ('point-of-sale-discount', 0.3783271312713623) |
| ('bartlett', 0.6551552414894104) | ('superpass', 0.3761482238769531) |
| ('fy15', 0.6519767045974731) | ('koruna', 0.3753677010536194) |
| ('miami-dade', 0.6326720118522644) | ('royall', 0.3736392855644226) |
| ('ashtabula', 0.626255989074707) | ('banks', 0.3689401149749756) |
| ('1907', 0.6212862730026245) | ('four', 0.3660191297531128) |
| ('1951', 0.61444091796875) | ('syna-ex312_6', 0.3654278516769409) |
| ('jerez', 0.6130475997924805) | ('muscle-related', 0.3601308763027191) |
| ('tupelo', 0.6070981025695801) | ('sub-plans', 0.3599775731563568) |
| ('meantime', 0.6059596538543701) | ("24'00", 0.359175443649292) |
| ('1981', 0.605670690536499) | ('irvine', 0.35802456736564636) |
| ('1971', 0.6050196886062622) | ('maximizing', 0.35775670409202576) |
| ('situ', 0.6001111268997192) | ("advance's", 0.35222744941711426) |
| ('healdsburg', 0.5993859171867371) | ('norwest', 0.3509473204612732) |


The DBOW words look meaningless because the DBOW model doesn't train word vectors; they remain at their random initial values (unless you use `dbow_words=1`, which slows training with little improvement).

DM models show meaningfully similar words when there are enough examples.

In [None]:
# pick random doc
doc_id = np.random.randint(simple_models[0].docvecs.count)
similars_per_model = [str([(' '.join(alldocs[a_sim[0]].words), a_sim[1]) for a_sim in model.docvecs.most_similar(doc_id, topn=1)]).replace('), ','),<br>\n') for model in simple_models]
similar_table = ("<table><tr><th>" +
    "</th><th>".join([str(model) for model in word_models]) + 
    "</th></tr><tr><td>" +
    "</td><td>".join(similars_per_model) +
    "</td></tr></table>")
#print("most similar words for {}".format(' '.join(alldocs[doc_id].words)))
HTML(similar_table)