In [1]:
%matplotlib inline

In [2]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Introduction
------------
This tutorial covers the following steps:

#. Load the IMDB dataset
#. Train a variety of Doc2Vec models on the dataset
#. Evaluate the performance of each model using a logistic regression
#. Examine some of the results directly

When examining results, we will look for answers to the following questions:

#. Are inferred vectors close to the precalculated ones?
#. Do close documents seem more related than distant ones?
#. Do the word vectors show useful similarities?
#. Are the word vectors from this dataset any good at analogies?

Load corpus
-----------

The data is available from the `IMDB archive
<http://ai.stanford.edu/~amaas/data/sentiment/>`_.

Each review is a single line of text containing multiple sentences, for example:

```
One of the best movie-dramas I have ever seen. We do a lot of acting in the
church and this is one that can be used as a resource that highlights all the
good things that actors can do in their work. I highly recommend this one,
especially for those who have an interest in acting, as a "must see."
```

These reviews will be the **documents** that we will work with in this tutorial.
There are 100 thousand reviews in total.

#. 25k reviews for training (12.5k positive, 12.5k negative)
#. 25k reviews for testing (12.5k positive, 12.5k negative)
#. 50k unlabeled reviews

Out of 100k reviews, 50k have a label: either positive (the reviewer liked
the movie) or negative.
The remaining 50k are unlabeled.


To load the corpus, we will:

#. Download the tar.gz file (it's only 84MB, so this shouldn't take too long)
#. Unpack it and extract each movie review
#. Split the reviews into training and test datasets

First, we define a convenient datatype for holding the data of a single document:

* words: The text of the document, as a ``list`` of words.
* tags: Used to keep the index of the document in the entire dataset.
* split: one of ``train``\ , ``test`` or ``extra``. Determines how the document will be used (for training, testing, etc).
* sentiment: either 1 (positive), 0 (negative) or None (unlabeled document).

This data type is helpful for later evaluation and reporting.
In particular, the ``tags`` member, which stores the document's index, will help us quickly and easily retrieve the vectors for a document from a model.




In [3]:
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')
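
As a quick illustration of this datatype, the sketch below builds one document by hand and reads its fields back. The review text, the index ``0`` and the variable name ``doc`` are made up for the example; real documents are created by the loading code.

```python
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')

# A hand-made document; the text and the index 0 are illustrative only.
doc = SentimentDocument(
    words='I highly recommend this one'.split(),
    tags=[0],
    split='train',
    sentiment=1.0,
)

print(doc.words)                 # the tokens of the review
print(doc.tags[0])               # the document's index in the dataset
print(doc.split, doc.sentiment)  # how the document is used, and its label
```

Because ``tags`` holds a one-element list, ``doc.tags[0]`` is the value to use when looking up a document's vector later on.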

In [7]:
! pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/09/ed/b59a2edde05b7f5755ea68648487c150c7c742361e9c8733c6d4ca005020/gensim-3.8.1-cp37-cp37m-win_amd64.whl (24.2MB)
Installing collected packages: gensim
Successfully installed gensim-3.8.1


We can now proceed with loading the corpus.



In [8]:
import io
import re
import tarfile
import os.path

import smart_open
import gensim.utils

def download_dataset(url='http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'):
    fname = url.split('/')[-1]

    if os.path.isfile(fname):
        return fname

    # Download the file to local storage first.
    # We can't read it on the fly because of
    # https://github.com/RaRe-Technologies/smart_open/issues/331
    with smart_open.open(url, "rb", ignore_ext=True) as fin:
        with smart_open.open(fname, 'wb', ignore_ext=True) as fout:
            while True:
                buf = fin.read(io.DEFAULT_BUFFER_SIZE)
                if not buf:
                    break
                fout.write(buf)

    return fname

def create_sentiment_document(name, text, index):
    _, split, sentiment_str, _ = name.split('/')
    sentiment = {'pos': 1.0, 'neg': 0.0, 'unsup': None}[sentiment_str]

    if sentiment is None:
        split = 'extra'

    tokens = gensim.utils.to_unicode(text).split()
    return SentimentDocument(tokens, [index], split, sentiment)

def extract_documents():
    fname = download_dataset()

    index = 0

    with tarfile.open(fname, mode='r:gz') as tar:
        for member in tar.getmembers():
            if re.match(r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$', member.name):
                member_bytes = tar.extractfile(member).read()
                member_text = member_bytes.decode('utf-8', errors='replace')
                assert member_text.count('\n') == 0
                yield create_sentiment_document(member.name, member_text, index)
                index += 1

alldocs = list(extract_documents())
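
To make the path-parsing step more concrete, here is a small standalone sketch of how an archive member name maps to a split and a sentiment label. The helper ``describe_member``, the constant names and the sample paths are hypothetical and exist only for this illustration; the pattern follows the one in ``extract_documents`` with the dot before ``txt`` escaped.

```python
import re

# Pattern in the spirit of the one used in extract_documents().
MEMBER_PATTERN = re.compile(r'aclImdb/(train|test)/(pos|neg|unsup)/\d+_\d+\.txt$')

# Folder name -> sentiment label; unlabeled reviews get None.
SENTIMENT_BY_FOLDER = {'pos': 1.0, 'neg': 0.0, 'unsup': None}

def describe_member(name):
    """Return (split, sentiment) for a review path, or None for any other file."""
    match = MEMBER_PATTERN.match(name)
    if match is None:
        return None
    split, folder = match.groups()
    sentiment = SENTIMENT_BY_FOLDER[folder]
    if sentiment is None:
        # Unlabeled reviews live under train/unsup but are treated as 'extra'.
        split = 'extra'
    return split, sentiment

# Hypothetical member names, for illustration only.
print(describe_member('aclImdb/train/pos/0_9.txt'))
print(describe_member('aclImdb/train/unsup/12_0.txt'))
print(describe_member('aclImdb/README'))
```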

Here's what a single document looks like:



In [9]:
print(alldocs[27])

SentimentDocument(words=['I', 'was', 'looking', 'forward', 'to', 'this', 'movie.', 'Trustworthy', 'actors,', 'interesting', 'plot.', 'Great', 'atmosphere', 'then', '?????', 'IF', 'you', 'are', 'going', 'to', 'attempt', 'something', 'that', 'is', 'meant', 'to', 'encapsulate', 'the', 'meaning', 'of', 'life.', 'First.', 'Know', 'it.', 'OK', 'I', 'did', 'not', 'expect', 'the', 'directors', 'or', 'writers', 'to', 'actually', 'know', 'the', 'meaning', 'but', 'I', 'thought', 'they', 'may', 'have', 'offered', 'crumbs', 'to', 'peck', 'at', 'and', 'treats', 'to', 'add', 'fuel', 'to', 'the', 'fire-Which!', 'they', 'almost', 'did.', 'Things', 'I', "didn't", 'get.', 'A', 'woman', 'wandering', 'around', 'in', 'dark', 'places', 'and', 'lonely', 'car', 'parks', 'alone-oblivious', 'to', 'the', 'consequences.', 'Great', 'riddles', 'that', 'fell', 'by', 'the', 'wayside.', 'The', 'promise', 'of', 'the', 'knowledge', 'therein', 'contained', 'by', 'the', 'original', 'so-called', 'criminal.', 'I', 'had', 'no

Next, we split the extracted documents into training and test sets.



In [10]:
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
print('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))

100000 docs: 25000 train-sentiment, 25000 test-sentiment
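
A finer-grained sanity check is to count documents per split and label. The sketch below does this on a handful of made-up documents (``toy_docs`` is hypothetical); running the same ``Counter`` expression over ``alldocs`` should reproduce the breakdown listed earlier, assuming the corpus loaded as described.

```python
import collections

SentimentDocument = collections.namedtuple('SentimentDocument', 'words tags split sentiment')

# Made-up stand-ins for real reviews.
toy_docs = [
    SentimentDocument(['good'], [0], 'train', 1.0),
    SentimentDocument(['bad'], [1], 'train', 0.0),
    SentimentDocument(['fine'], [2], 'test', 1.0),
    SentimentDocument(['unknown'], [3], 'extra', None),
]

# Count documents for every (split, sentiment) combination.
counts = collections.Counter((doc.split, doc.sentiment) for doc in toy_docs)

for (split, sentiment), n in sorted(counts.items(), key=str):
    print(split, sentiment, n)
```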


Set-up Doc2Vec Training & Evaluation Models
-------------------------------------------


We vary the following parameter choices:

* 100-dimensional vectors, as the 400-d vectors of the paper take a lot of
  memory and, in our tests of this task, don't seem to offer much benefit
* Similarly, frequent word subsampling seems to decrease sentiment-prediction
  accuracy, so it's left out
* ``cbow=0`` (a flag of the original word2vec tool) means skip-gram, which is
  equivalent to the paper's 'PV-DBOW' mode and is selected in gensim with ``dm=0``
* Added to that DBOW model are two DM models, one which averages context
  vectors (\ ``dm_mean``\ ) and one which concatenates them (\ ``dm_concat``\ ,
  resulting in a much larger, slower, more data-hungry model)
* ``min_count=2`` saves quite a bit of model memory, discarding only words
  that appear just once in the corpus, and therefore in a single doc (such
  words are no more expressive than the unique-to-each-doc vectors themselves)
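
To see what a minimum-count threshold does, the following sketch applies frequency-based pruning to a tiny made-up corpus. It only mimics the idea with a ``Counter``; it is not gensim's vocabulary code, and ``toy_corpus`` is invented for the example.

```python
from collections import Counter

# Three tiny made-up "documents".
toy_corpus = [
    ['a', 'great', 'movie'],
    ['a', 'dull', 'movie'],
    ['great', 'acting'],
]

min_count = 2

# Total frequency of each word over the whole corpus.
word_counts = Counter(word for doc in toy_corpus for word in doc)

kept = sorted(w for w, c in word_counts.items() if c >= min_count)
dropped = sorted(w for w, c in word_counts.items() if c < min_count)

print('kept:', kept)
print('dropped:', dropped)
```

Each dropped word occurs in exactly one document, so it carries no information that links documents to each other.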




In [11]:
import multiprocessing
from collections import OrderedDict

import gensim.models.doc2vec
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

from gensim.models.doc2vec import Doc2Vec

common_kwargs = dict(
    vector_size=100, epochs=20, min_count=2,
    sample=0, workers=multiprocessing.cpu_count(), negative=5, hs=0,
)

simple_models = [
    # PV-DBOW plain
    Doc2Vec(dm=0, **common_kwargs),
    # PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
    Doc2Vec(dm=1, window=10, alpha=0.05, comment='alpha=0.05', **common_kwargs),
    # PV-DM w/ concatenation - big, slow, experimental mode
    # window=5 (both sides) approximates paper's apparent 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, window=5, **common_kwargs),
]

for model in simple_models:
    model.build_vocab(alldocs)
    print("%s vocabulary scanned & state initialized" % model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

2019-11-21 17:03:27,612 : INFO : using concatenative 1100-dimensional layer1
2019-11-21 17:03:27,614 : INFO : collecting all words and their counts
2019-11-21 17:03:27,615 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-11-21 17:03:28,227 : INFO : PROGRESS: at example #10000, processed 2292381 words (3747805/s), 150816 word types, 10000 tags
2019-11-21 17:03:28,804 : INFO : PROGRESS: at example #20000, processed 4573645 words (3955736/s), 238497 word types, 20000 tags
2019-11-21 17:03:29,424 : INFO : PROGRESS: at example #30000, processed 6865575 words (3706859/s), 312348 word types, 30000 tags
2019-11-21 17:03:30,056 : INFO : PROGRESS: at example #40000, processed 9190019 words (3683341/s), 377231 word types, 40000 tags
2019-11-21 17:03:30,713 : INFO : PROGRESS: at example #50000, processed 11557847 words (3606381/s), 438729 word types, 50000 tags
2019-11-21 17:03:31,335 : INFO : PROGRESS: at example #60000, processed 13899883 words (3772862/s), 49

Doc2Vec(dbow,d100,n5,mc2,t4) vocabulary scanned & state initialized


2019-11-21 17:03:42,664 : INFO : PROGRESS: at example #10000, processed 2292381 words (3938447/s), 150816 word types, 10000 tags
2019-11-21 17:03:43,176 : INFO : PROGRESS: at example #20000, processed 4573645 words (4468265/s), 238497 word types, 20000 tags
2019-11-21 17:03:43,754 : INFO : PROGRESS: at example #30000, processed 6865575 words (3965285/s), 312348 word types, 30000 tags
2019-11-21 17:03:44,336 : INFO : PROGRESS: at example #40000, processed 9190019 words (4007503/s), 377231 word types, 40000 tags
2019-11-21 17:03:44,922 : INFO : PROGRESS: at example #50000, processed 11557847 words (4038083/s), 438729 word types, 50000 tags
2019-11-21 17:03:45,478 : INFO : PROGRESS: at example #60000, processed 13899883 words (4235782/s), 493913 word types, 60000 tags
2019-11-21 17:03:46,086 : INFO : PROGRESS: at example #70000, processed 16270094 words (3898611/s), 548474 word types, 70000 tags
2019-11-21 17:03:46,650 : INFO : PROGRESS: at example #80000, processed 18598876 words (414105

Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4) vocabulary scanned & state initialized


2019-11-21 17:03:57,185 : INFO : PROGRESS: at example #10000, processed 2292381 words (4246788/s), 150816 word types, 10000 tags
2019-11-21 17:03:57,732 : INFO : PROGRESS: at example #20000, processed 4573645 words (4180321/s), 238497 word types, 20000 tags
2019-11-21 17:03:58,263 : INFO : PROGRESS: at example #30000, processed 6865575 words (4331546/s), 312348 word types, 30000 tags
2019-11-21 17:03:58,850 : INFO : PROGRESS: at example #40000, processed 9190019 words (3964787/s), 377231 word types, 40000 tags
2019-11-21 17:03:59,427 : INFO : PROGRESS: at example #50000, processed 11557847 words (4113837/s), 438729 word types, 50000 tags
2019-11-21 17:03:59,972 : INFO : PROGRESS: at example #60000, processed 13899883 words (4311051/s), 493913 word types, 60000 tags
2019-11-21 17:04:00,588 : INFO : PROGRESS: at example #70000, processed 16270094 words (3856949/s), 548474 word types, 70000 tags
2019-11-21 17:04:01,136 : INFO : PROGRESS: at example #80000, processed 18598876 words (426227

Doc2Vec(dm/c,d100,n5,w5,mc2,t4) vocabulary scanned & state initialized


Le and Mikolov note that combining a paragraph vector from Distributed Bag of
Words (DBOW) and Distributed Memory (DM) improves performance. We follow
their approach and pair the models for evaluation. Here, we concatenate the
paragraph vectors obtained from each model with the help of a thin wrapper
class included in a gensim test module.
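
The idea behind such a wrapper can be sketched in a few lines. The class below, ``ConcatenatedLookup``, is a hypothetical stand-in written for this illustration, not gensim's ``ConcatenatedDoc2Vec``; plain dictionaries with made-up numbers play the role of the per-model document vectors.

```python
class ConcatenatedLookup:
    """Hypothetical stand-in: joins the vectors several lookups return for one tag."""

    def __init__(self, lookups):
        self.lookups = lookups

    def __getitem__(self, tag):
        combined = []
        for lookup in self.lookups:
            # Append this model's vector for the tag to the joint vector.
            combined.extend(lookup[tag])
        return combined

# Made-up document vectors from two imaginary models.
dbow_vectors = {0: [0.1, 0.2], 1: [0.3, 0.4]}
dm_vectors = {0: [0.5, 0.6, 0.7], 1: [0.8, 0.9, 1.0]}

paired = ConcatenatedLookup([dbow_vectors, dm_vectors])
print(paired[0])       # vector of document 0 from both models, joined
print(len(paired[1]))  # dimensionality is the sum of both models' sizes
```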



In [15]:
!pip install testfixtures

Collecting testfixtures
  Downloading https://files.pythonhosted.org/packages/6a/fe/0cf62bbb32c3589e955a447d582c9d07b051baf5edc3fae9973709766353/testfixtures-6.10.2-py2.py3-none-any.whl (86kB)
Installing collected packages: testfixtures
Successfully installed testfixtures-6.10.2


In [16]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[1]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[0], simple_models[2]])

2019-11-21 17:06:36,529 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-11-21 17:06:36,534 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)


Predictive Evaluation Methods
-----------------------------

Given a document, our ``Doc2Vec`` models output a vector representation of the document.
How useful is a particular model?
In the case of sentiment analysis, we want the output vector to reflect the sentiment in the input document.
So, in vector space, positive documents should be distant from negative documents.

We train a logistic regression from the training set:

  - regressors (inputs): document vectors from the Doc2Vec model
  - target (outputs): sentiment labels

So, this logistic regression will be able to predict sentiment given a document vector.

Next, we test our logistic regression on the test set, and measure the rate of errors (incorrect predictions).
If the document vectors from the Doc2Vec model reflect the actual sentiment well, the error rate will be low.

Therefore, the error rate of the logistic regression is an indication of *how well* the given Doc2Vec model represents documents as vectors.
We can then compare different ``Doc2Vec`` models by looking at their error rates.
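
The error-rate arithmetic itself is simple, as this minimal sketch shows. It assumes the classifier outputs one probability per document and thresholds it at 0.5; the function name ``error_rate`` and the numbers are made up, and the notebook code rounds with ``np.rint`` instead of an explicit threshold.

```python
def error_rate(predicted_probabilities, true_labels):
    """Turn probabilities into 0/1 predictions and return (error_rate, error_count)."""
    # Assumption for this sketch: probabilities of 0.5 and above count as positive.
    predictions = [1.0 if p >= 0.5 else 0.0 for p in predicted_probabilities]
    errors = sum(1 for pred, label in zip(predictions, true_labels) if pred != label)
    return errors / len(true_labels), errors

# Made-up predictions for four documents and their true sentiment labels.
probabilities = [0.9, 0.2, 0.6, 0.4]
labels = [1.0, 0.0, 0.0, 1.0]

rate, count = error_rate(probabilities, labels)
print(rate, count)
```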




In [17]:
import numpy as np
import statsmodels.api as sm
from random import sample

def logistic_predictor_from_data(train_targets, train_regressors):
    """Fit a statsmodels logistic predictor on the supplied data"""
    logit = sm.Logit(train_targets, train_regressors)
    predictor = logit.fit(disp=0)
    # print(predictor.summary())
    return predictor

def error_rate_for_model(test_model, train_set, test_set):
    """Report the error rate on test_set sentiments, using the supplied model and train_set"""

    train_targets = [doc.sentiment for doc in train_set]
    train_regressors = [test_model.docvecs[doc.tags[0]] for doc in train_set]
    train_regressors = sm.add_constant(train_regressors)
    predictor = logistic_predictor_from_data(train_targets, train_regressors)

    test_regressors = [test_model.docvecs[doc.tags[0]] for doc in test_set]
    test_regressors = sm.add_constant(test_regressors)

    # Predict & evaluate
    test_predictions = predictor.predict(test_regressors)
    corrects = sum(np.rint(test_predictions) == [doc.sentiment for doc in test_set])
    errors = len(test_predictions) - corrects
    error_rate = float(errors) / len(test_predictions)
    return (error_rate, errors, len(test_predictions), predictor)

Bulk Training & Per-Model Evaluation
------------------------------------

Note that doc-vector training occurs on *all* documents of the dataset,
which includes the train, test and unlabeled (``extra``) docs.  Because the
native document order has similar-sentiment documents in large clumps, which
is suboptimal for training, we work with a once-shuffled copy of the full
document list.

We evaluate each model's sentiment-predictive power by its error rate on the
test set, computed right after that model finishes training.
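
The shuffling step can be illustrated on a toy list. The sketch below copies a clumped list and shuffles only the copy, so the original order stays available for index-based lookups; the labels and the fixed seed are assumptions made for reproducibility of the example.

```python
import random

# A clumped toy "dataset": all positives first, then all negatives.
ordered = ['pos'] * 3 + ['neg'] * 3

# Shuffle a copy so the original order is left untouched.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # fixed seed, only for this illustration

print(ordered)
print(shuffled)
```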

In [18]:
from collections import defaultdict
error_rates = defaultdict(lambda: 1.0)  # To selectively print only best errors achieved
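
The default of ``1.0`` means that any model without a recorded result is treated as having the worst possible error rate. A minimal sketch of that behaviour follows; the key ``'some-model'`` and the value ``0.25`` are made up.

```python
from collections import defaultdict

# Unseen keys start at 1.0, i.e. "every prediction wrong".
error_rates = defaultdict(lambda: 1.0)

candidate = 0.25  # a made-up error rate for a made-up model
if candidate < error_rates['some-model']:
    # Keep the new value only if it beats the best one seen so far.
    error_rates['some-model'] = candidate

print(error_rates['some-model'])
print(error_rates['never-evaluated'])
```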

In [19]:
from random import shuffle
shuffled_alldocs = alldocs[:]
shuffle(shuffled_alldocs)

for model in simple_models:
    print("Training %s" % model)
    model.train(shuffled_alldocs, total_examples=len(shuffled_alldocs), epochs=model.epochs)

    print("\nEvaluating %s" % model)
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

for model in [models_by_name['dbow+dmm'], models_by_name['dbow+dmc']]:
    print("\nEvaluating %s" % model)
    err_rate, err_count, test_count, predictor = error_rate_for_model(model, train_docs, test_docs)
    error_rates[str(model)] = err_rate
    print("\n%f %s\n" % (err_rate, model))

2019-11-21 17:07:00,563 : INFO : training model with 4 workers on 265408 vocabulary and 100 features, using sg=1 hs=0 sample=0 negative=5 window=5


Training Doc2Vec(dbow,d100,n5,mc2,t4)


2019-11-21 17:07:01,601 : INFO : EPOCH 1 - PROGRESS: at 3.88% examples, 900425 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:07:02,609 : INFO : EPOCH 1 - PROGRESS: at 8.43% examples, 977406 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:07:03,618 : INFO : EPOCH 1 - PROGRESS: at 13.14% examples, 1007538 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:07:04,618 : INFO : EPOCH 1 - PROGRESS: at 17.90% examples, 1026540 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:07:05,619 : INFO : EPOCH 1 - PROGRESS: at 22.69% examples, 1037116 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:07:06,620 : INFO : EPOCH 1 - PROGRESS: at 27.28% examples, 1040022 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:07:07,623 : INFO : EPOCH 1 - PROGRESS: at 32.04% examples, 1044958 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:07:08,623 : INFO : EPOCH 1 - PROGRESS: at 36.58% examples, 1048531 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:07:09,626 : INFO : EPOCH 1 - PROGRESS: at 40.78% examples, 1038293 words/s, in

2019-11-21 17:08:06,836 : INFO : EPOCH 3 - PROGRESS: at 86.74% examples, 896661 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:08:07,849 : INFO : EPOCH 3 - PROGRESS: at 91.04% examples, 899556 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:08:08,852 : INFO : EPOCH 3 - PROGRESS: at 95.22% examples, 901582 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:08:09,881 : INFO : EPOCH 3 - PROGRESS: at 98.99% examples, 899400 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:08:10,129 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-21 17:08:10,138 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-21 17:08:10,142 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-21 17:08:10,147 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-21 17:08:10,150 : INFO : EPOCH - 3 : training on 23279529 raw words (22951015 effective words) took 25.5s, 898960 effective words/s
2019-11-21 17:08:11,178 : INFO : EPOCH 4 - P

2019-11-21 17:09:08,718 : INFO : EPOCH 6 - PROGRESS: at 27.16% examples, 1029034 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:09:09,722 : INFO : EPOCH 6 - PROGRESS: at 31.22% examples, 1013373 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:09:10,743 : INFO : EPOCH 6 - PROGRESS: at 34.96% examples, 992157 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:09:11,744 : INFO : EPOCH 6 - PROGRESS: at 39.10% examples, 990263 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:09:12,745 : INFO : EPOCH 6 - PROGRESS: at 43.67% examples, 994669 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:09:13,748 : INFO : EPOCH 6 - PROGRESS: at 48.15% examples, 998295 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:09:14,750 : INFO : EPOCH 6 - PROGRESS: at 52.69% examples, 1002068 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:09:15,751 : INFO : EPOCH 6 - PROGRESS: at 57.26% examples, 1005612 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:09:16,753 : INFO : EPOCH 6 - PROGRESS: at 61.73% examples, 1007706 words/s, in

2019-11-21 17:10:10,857 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-21 17:10:10,858 : INFO : EPOCH - 8 : training on 23279529 raw words (22951015 effective words) took 23.0s, 996324 effective words/s
2019-11-21 17:10:11,884 : INFO : EPOCH 9 - PROGRESS: at 4.35% examples, 990593 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:10:12,892 : INFO : EPOCH 9 - PROGRESS: at 8.59% examples, 988728 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:10:13,899 : INFO : EPOCH 9 - PROGRESS: at 11.78% examples, 896904 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:10:14,905 : INFO : EPOCH 9 - PROGRESS: at 14.64% examples, 836660 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:10:15,911 : INFO : EPOCH 9 - PROGRESS: at 17.45% examples, 796476 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:10:16,912 : INFO : EPOCH 9 - PROGRESS: at 20.78% examples, 787522 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:10:17,948 : INFO : EPOCH 9 - PROGRESS: at 23.64% examples, 766495 words/s, in_q

2019-11-21 17:11:16,629 : INFO : EPOCH 11 - PROGRESS: at 10.29% examples, 579180 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:11:17,633 : INFO : EPOCH 11 - PROGRESS: at 13.06% examples, 588723 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:11:18,649 : INFO : EPOCH 11 - PROGRESS: at 14.81% examples, 555982 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:11:19,668 : INFO : EPOCH 11 - PROGRESS: at 16.74% examples, 538016 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:11:20,690 : INFO : EPOCH 11 - PROGRESS: at 18.70% examples, 524932 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:11:21,692 : INFO : EPOCH 11 - PROGRESS: at 21.34% examples, 532735 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:11:22,696 : INFO : EPOCH 11 - PROGRESS: at 24.00% examples, 540952 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:11:23,704 : INFO : EPOCH 11 - PROGRESS: at 26.59% examples, 545080 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:11:24,732 : INFO : EPOCH 11 - PROGRESS: at 29.31% examples, 549345 words/s

2019-11-21 17:12:26,324 : INFO : EPOCH 12 - PROGRESS: at 68.21% examples, 550852 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:12:27,326 : INFO : EPOCH 12 - PROGRESS: at 70.89% examples, 552828 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:12:28,342 : INFO : EPOCH 12 - PROGRESS: at 73.65% examples, 555118 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:12:29,385 : INFO : EPOCH 12 - PROGRESS: at 76.45% examples, 556536 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:12:30,388 : INFO : EPOCH 12 - PROGRESS: at 79.24% examples, 559335 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:12:31,394 : INFO : EPOCH 12 - PROGRESS: at 82.02% examples, 561369 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:12:32,400 : INFO : EPOCH 12 - PROGRESS: at 84.70% examples, 562859 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:12:33,404 : INFO : EPOCH 12 - PROGRESS: at 87.39% examples, 564675 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:12:34,406 : INFO : EPOCH 12 - PROGRESS: at 90.09% examples, 566150 words/s

2019-11-21 17:13:29,713 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-21 17:13:29,728 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-21 17:13:29,730 : INFO : EPOCH - 14 : training on 23279529 raw words (22951015 effective words) took 25.5s, 899096 effective words/s
2019-11-21 17:13:30,740 : INFO : EPOCH 15 - PROGRESS: at 4.79% examples, 1110056 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:13:31,749 : INFO : EPOCH 15 - PROGRESS: at 9.61% examples, 1109471 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:13:32,749 : INFO : EPOCH 15 - PROGRESS: at 14.64% examples, 1120802 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:13:33,754 : INFO : EPOCH 15 - PROGRESS: at 19.73% examples, 1124796 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:13:34,756 : INFO : EPOCH 15 - PROGRESS: at 24.59% examples, 1125363 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:13:35,760 : INFO : EPOCH 15 - PROGRESS: at 29.56% examples, 1124546 words/s, in_qsize 7, o

2019-11-21 17:14:31,590 : INFO : EPOCH 17 - PROGRESS: at 61.44% examples, 1081224 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:14:32,591 : INFO : EPOCH 17 - PROGRESS: at 64.97% examples, 1061971 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:14:33,594 : INFO : EPOCH 17 - PROGRESS: at 68.07% examples, 1038026 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:14:34,594 : INFO : EPOCH 17 - PROGRESS: at 71.33% examples, 1019227 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:14:35,602 : INFO : EPOCH 17 - PROGRESS: at 74.78% examples, 1005238 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:14:36,602 : INFO : EPOCH 17 - PROGRESS: at 78.50% examples, 996344 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:14:37,605 : INFO : EPOCH 17 - PROGRESS: at 83.48% examples, 1003364 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:14:38,614 : INFO : EPOCH 17 - PROGRESS: at 88.30% examples, 1009272 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:14:39,620 : INFO : EPOCH 17 - PROGRESS: at 92.65% examples, 1007881

2019-11-21 17:15:35,756 : INFO : EPOCH 19 - PROGRESS: at 41.38% examples, 629764 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:15:36,762 : INFO : EPOCH 19 - PROGRESS: at 44.30% examples, 631934 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:15:37,783 : INFO : EPOCH 19 - PROGRESS: at 47.19% examples, 633826 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:15:38,796 : INFO : EPOCH 19 - PROGRESS: at 50.12% examples, 635276 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:15:39,810 : INFO : EPOCH 19 - PROGRESS: at 52.83% examples, 633991 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:15:40,810 : INFO : EPOCH 19 - PROGRESS: at 55.39% examples, 632184 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:15:41,847 : INFO : EPOCH 19 - PROGRESS: at 58.04% examples, 629858 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:15:42,860 : INFO : EPOCH 19 - PROGRESS: at 60.86% examples, 630379 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:15:43,882 : INFO : EPOCH 19 - PROGRESS: at 63.84% examples, 632153 words/s


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)


2019-11-21 17:16:34,293 : INFO : training model with 4 workers on 265408 vocabulary and 100 features, using sg=0 hs=0 sample=0 negative=5 window=10



0.105920 Doc2Vec(dbow,d100,n5,mc2,t4)

Training Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4)


2019-11-21 17:16:35,312 : INFO : EPOCH 1 - PROGRESS: at 1.26% examples, 278325 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:36,344 : INFO : EPOCH 1 - PROGRESS: at 3.04% examples, 345971 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:37,393 : INFO : EPOCH 1 - PROGRESS: at 4.87% examples, 366644 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:38,425 : INFO : EPOCH 1 - PROGRESS: at 6.67% examples, 375948 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:39,448 : INFO : EPOCH 1 - PROGRESS: at 8.19% examples, 370414 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:40,451 : INFO : EPOCH 1 - PROGRESS: at 9.61% examples, 363333 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:41,474 : INFO : EPOCH 1 - PROGRESS: at 11.22% examples, 362906 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:42,474 : INFO : EPOCH 1 - PROGRESS: at 12.86% examples, 363608 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:16:43,498 : INFO : EPOCH 1 - PROGRESS: at 14.46% examples, 363108 words/s, in_qsize 7, o

2019-11-21 17:17:46,498 : INFO : EPOCH 2 - PROGRESS: at 7.24% examples, 329232 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:17:47,519 : INFO : EPOCH 2 - PROGRESS: at 8.96% examples, 340216 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:17:48,527 : INFO : EPOCH 2 - PROGRESS: at 10.86% examples, 352143 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:17:49,528 : INFO : EPOCH 2 - PROGRESS: at 12.66% examples, 358765 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:17:50,532 : INFO : EPOCH 2 - PROGRESS: at 14.38% examples, 362693 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:17:51,538 : INFO : EPOCH 2 - PROGRESS: at 16.37% examples, 371689 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:17:52,567 : INFO : EPOCH 2 - PROGRESS: at 18.30% examples, 376325 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:17:53,581 : INFO : EPOCH 2 - PROGRESS: at 20.06% examples, 376779 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:17:54,608 : INFO : EPOCH 2 - PROGRESS: at 21.75% examples, 377436 words/s, in_qsize 

2019-11-21 17:18:57,428 : INFO : EPOCH 3 - PROGRESS: at 29.52% examples, 389632 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:18:58,447 : INFO : EPOCH 3 - PROGRESS: at 31.27% examples, 389672 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:18:59,450 : INFO : EPOCH 3 - PROGRESS: at 32.85% examples, 388508 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:19:00,483 : INFO : EPOCH 3 - PROGRESS: at 34.14% examples, 383163 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:19:01,488 : INFO : EPOCH 3 - PROGRESS: at 35.36% examples, 378695 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:19:02,528 : INFO : EPOCH 3 - PROGRESS: at 36.58% examples, 374475 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:19:03,546 : INFO : EPOCH 3 - PROGRESS: at 38.00% examples, 372248 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:19:04,565 : INFO : EPOCH 3 - PROGRESS: at 39.60% examples, 371795 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:19:05,626 : INFO : EPOCH 3 - PROGRESS: at 40.87% examples, 367361 words/s, in_qsiz

2019-11-21 17:20:08,033 : INFO : EPOCH 4 - PROGRESS: at 45.10% examples, 404681 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:09,087 : INFO : EPOCH 4 - PROGRESS: at 46.39% examples, 400320 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:10,108 : INFO : EPOCH 4 - PROGRESS: at 48.11% examples, 399542 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:11,163 : INFO : EPOCH 4 - PROGRESS: at 49.83% examples, 398672 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:12,172 : INFO : EPOCH 4 - PROGRESS: at 51.52% examples, 398220 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:13,180 : INFO : EPOCH 4 - PROGRESS: at 53.29% examples, 398349 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:14,190 : INFO : EPOCH 4 - PROGRESS: at 55.09% examples, 398757 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:15,192 : INFO : EPOCH 4 - PROGRESS: at 56.89% examples, 399144 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:20:16,240 : INFO : EPOCH 4 - PROGRESS: at 58.69% examples, 399329 words/s, in_qsiz

2019-11-21 17:21:19,327 : INFO : EPOCH 5 - PROGRESS: at 65.15% examples, 386441 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:21:20,333 : INFO : EPOCH 5 - PROGRESS: at 66.98% examples, 387056 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:21:21,351 : INFO : EPOCH 5 - PROGRESS: at 68.49% examples, 385946 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:21:22,354 : INFO : EPOCH 5 - PROGRESS: at 69.78% examples, 383641 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:21:23,358 : INFO : EPOCH 5 - PROGRESS: at 71.54% examples, 383921 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:21:24,438 : INFO : EPOCH 5 - PROGRESS: at 73.00% examples, 382215 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:21:25,451 : INFO : EPOCH 5 - PROGRESS: at 74.74% examples, 382269 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:21:26,479 : INFO : EPOCH 5 - PROGRESS: at 76.66% examples, 383039 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:21:27,505 : INFO : EPOCH 5 - PROGRESS: at 78.11% examples, 381885 words/s, in_qsiz

2019-11-21 17:22:30,194 : INFO : EPOCH 6 - PROGRESS: at 88.68% examples, 407435 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:22:31,204 : INFO : EPOCH 6 - PROGRESS: at 90.53% examples, 407495 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:22:32,225 : INFO : EPOCH 6 - PROGRESS: at 92.31% examples, 407304 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:22:33,254 : INFO : EPOCH 6 - PROGRESS: at 94.19% examples, 407463 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:22:34,291 : INFO : EPOCH 6 - PROGRESS: at 96.03% examples, 407524 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:22:35,319 : INFO : EPOCH 6 - PROGRESS: at 97.85% examples, 407661 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:22:36,341 : INFO : EPOCH 6 - PROGRESS: at 99.70% examples, 407661 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:22:36,444 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-21 17:22:36,452 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-21 17:22:36,462 : I

2019-11-21 17:23:36,347 : INFO : EPOCH 8 - PROGRESS: at 7.19% examples, 409248 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:23:37,368 : INFO : EPOCH 8 - PROGRESS: at 9.00% examples, 410298 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:23:38,371 : INFO : EPOCH 8 - PROGRESS: at 10.86% examples, 411359 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:23:39,373 : INFO : EPOCH 8 - PROGRESS: at 12.56% examples, 407799 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:23:40,382 : INFO : EPOCH 8 - PROGRESS: at 14.19% examples, 403676 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:23:41,425 : INFO : EPOCH 8 - PROGRESS: at 15.87% examples, 399012 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:23:42,430 : INFO : EPOCH 8 - PROGRESS: at 17.57% examples, 397581 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:23:43,458 : INFO : EPOCH 8 - PROGRESS: at 19.42% examples, 397367 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:23:44,459 : INFO : EPOCH 8 - PROGRESS: at 21.27% examples, 399548 words/s, in_qsize 

2019-11-21 17:24:47,543 : INFO : EPOCH 9 - PROGRESS: at 32.66% examples, 406806 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:24:48,556 : INFO : EPOCH 9 - PROGRESS: at 34.45% examples, 406598 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:24:49,559 : INFO : EPOCH 9 - PROGRESS: at 36.20% examples, 407462 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:24:50,568 : INFO : EPOCH 9 - PROGRESS: at 38.04% examples, 408523 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:24:51,597 : INFO : EPOCH 9 - PROGRESS: at 39.91% examples, 408797 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:24:52,597 : INFO : EPOCH 9 - PROGRESS: at 41.76% examples, 409187 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:24:53,599 : INFO : EPOCH 9 - PROGRESS: at 43.58% examples, 409387 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:24:54,629 : INFO : EPOCH 9 - PROGRESS: at 45.45% examples, 409988 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:24:55,643 : INFO : EPOCH 9 - PROGRESS: at 47.24% examples, 410016 words/s, in_qsiz

2019-11-21 17:25:57,352 : INFO : EPOCH 10 - PROGRESS: at 56.07% examples, 407949 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:25:58,365 : INFO : EPOCH 10 - PROGRESS: at 57.88% examples, 407912 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:25:59,365 : INFO : EPOCH 10 - PROGRESS: at 59.69% examples, 408426 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:26:00,389 : INFO : EPOCH 10 - PROGRESS: at 61.43% examples, 407847 words/s, in_qsize 6, out_qsize 1
2019-11-21 17:26:01,397 : INFO : EPOCH 10 - PROGRESS: at 63.24% examples, 407944 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:26:02,432 : INFO : EPOCH 10 - PROGRESS: at 65.07% examples, 407786 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:26:03,436 : INFO : EPOCH 10 - PROGRESS: at 66.90% examples, 407910 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:26:04,442 : INFO : EPOCH 10 - PROGRESS: at 68.62% examples, 407575 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:26:05,453 : INFO : EPOCH 10 - PROGRESS: at 70.50% examples, 407902 words/s

2019-11-21 17:27:07,178 : INFO : EPOCH 11 - PROGRESS: at 78.67% examples, 403325 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:27:08,214 : INFO : EPOCH 11 - PROGRESS: at 80.52% examples, 403539 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:27:09,219 : INFO : EPOCH 11 - PROGRESS: at 82.24% examples, 403149 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:27:10,256 : INFO : EPOCH 11 - PROGRESS: at 83.94% examples, 402471 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:27:11,298 : INFO : EPOCH 11 - PROGRESS: at 85.55% examples, 401963 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:27:12,351 : INFO : EPOCH 11 - PROGRESS: at 87.25% examples, 401272 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:27:13,352 : INFO : EPOCH 11 - PROGRESS: at 88.98% examples, 401184 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:27:14,371 : INFO : EPOCH 11 - PROGRESS: at 90.86% examples, 401452 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:27:15,410 : INFO : EPOCH 11 - PROGRESS: at 92.74% examples, 401640 words/s

2019-11-21 17:28:16,190 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-21 17:28:16,199 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-21 17:28:16,202 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-21 17:28:16,203 : INFO : EPOCH - 12 : training on 23279529 raw words (22951015 effective words) took 56.8s, 404167 effective words/s
2019-11-21 17:28:17,236 : INFO : EPOCH 13 - PROGRESS: at 1.76% examples, 388626 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:28:18,247 : INFO : EPOCH 13 - PROGRESS: at 3.54% examples, 404659 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:28:19,256 : INFO : EPOCH 13 - PROGRESS: at 5.34% examples, 407421 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:28:20,263 : INFO : EPOCH 13 - PROGRESS: at 7.15% examples, 411277 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:28:21,268 : INFO : EPOCH 13 - PROGRESS: at 8.92% examples, 411397 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:28:22,2

2019-11-21 17:29:22,985 : INFO : EPOCH 14 - PROGRESS: at 18.92% examples, 427948 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:29:24,029 : INFO : EPOCH 14 - PROGRESS: at 20.87% examples, 426795 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:29:25,041 : INFO : EPOCH 14 - PROGRESS: at 22.89% examples, 430238 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:29:26,070 : INFO : EPOCH 14 - PROGRESS: at 24.90% examples, 431901 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:29:27,110 : INFO : EPOCH 14 - PROGRESS: at 26.77% examples, 430320 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:29:28,135 : INFO : EPOCH 14 - PROGRESS: at 28.66% examples, 429418 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:29:29,167 : INFO : EPOCH 14 - PROGRESS: at 30.59% examples, 429002 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:29:30,181 : INFO : EPOCH 14 - PROGRESS: at 32.42% examples, 427936 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:29:31,189 : INFO : EPOCH 14 - PROGRESS: at 34.23% examples, 427174 words/s

2019-11-21 17:30:32,497 : INFO : EPOCH 15 - PROGRESS: at 42.98% examples, 406451 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:30:33,498 : INFO : EPOCH 15 - PROGRESS: at 44.76% examples, 406440 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:30:34,520 : INFO : EPOCH 15 - PROGRESS: at 46.48% examples, 406114 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:30:35,534 : INFO : EPOCH 15 - PROGRESS: at 48.37% examples, 406664 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:30:36,560 : INFO : EPOCH 15 - PROGRESS: at 50.16% examples, 406591 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:30:37,571 : INFO : EPOCH 15 - PROGRESS: at 51.98% examples, 406787 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:30:38,591 : INFO : EPOCH 15 - PROGRESS: at 53.70% examples, 406188 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:30:39,598 : INFO : EPOCH 15 - PROGRESS: at 55.39% examples, 405745 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:30:40,601 : INFO : EPOCH 15 - PROGRESS: at 57.14% examples, 405345 words/s

2019-11-21 17:31:42,843 : INFO : EPOCH 16 - PROGRESS: at 66.32% examples, 404096 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:43,854 : INFO : EPOCH 16 - PROGRESS: at 68.12% examples, 404263 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:44,861 : INFO : EPOCH 16 - PROGRESS: at 70.00% examples, 404744 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:45,871 : INFO : EPOCH 16 - PROGRESS: at 71.82% examples, 404935 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:46,898 : INFO : EPOCH 16 - PROGRESS: at 73.69% examples, 405223 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:47,926 : INFO : EPOCH 16 - PROGRESS: at 75.63% examples, 405507 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:48,958 : INFO : EPOCH 16 - PROGRESS: at 77.51% examples, 405691 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:50,024 : INFO : EPOCH 16 - PROGRESS: at 79.21% examples, 404916 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:31:51,043 : INFO : EPOCH 16 - PROGRESS: at 81.04% examples, 405052 words/s

2019-11-21 17:32:52,958 : INFO : EPOCH 17 - PROGRESS: at 90.82% examples, 409802 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:32:54,002 : INFO : EPOCH 17 - PROGRESS: at 92.61% examples, 409402 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:32:55,006 : INFO : EPOCH 17 - PROGRESS: at 94.41% examples, 409353 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:32:56,040 : INFO : EPOCH 17 - PROGRESS: at 96.11% examples, 408861 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:32:57,074 : INFO : EPOCH 17 - PROGRESS: at 97.85% examples, 408577 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:32:58,091 : INFO : EPOCH 17 - PROGRESS: at 99.66% examples, 408407 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:32:58,221 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-21 17:32:58,248 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-21 17:32:58,262 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-21 17:32:58,271 : INFO : worker thr

2019-11-21 17:33:59,289 : INFO : EPOCH 19 - PROGRESS: at 8.59% examples, 391411 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:00,403 : INFO : EPOCH 19 - PROGRESS: at 10.13% examples, 377391 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:01,428 : INFO : EPOCH 19 - PROGRESS: at 11.55% examples, 368250 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:02,466 : INFO : EPOCH 19 - PROGRESS: at 12.82% examples, 357209 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:03,480 : INFO : EPOCH 19 - PROGRESS: at 13.99% examples, 347430 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:04,498 : INFO : EPOCH 19 - PROGRESS: at 15.57% examples, 347936 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:05,530 : INFO : EPOCH 19 - PROGRESS: at 16.90% examples, 342746 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:06,551 : INFO : EPOCH 19 - PROGRESS: at 18.61% examples, 345610 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:34:07,629 : INFO : EPOCH 19 - PROGRESS: at 20.37% examples, 346690 words/s,

2019-11-21 17:35:08,926 : INFO : EPOCH 20 - PROGRESS: at 28.98% examples, 407560 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:09,949 : INFO : EPOCH 20 - PROGRESS: at 30.59% examples, 404786 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:10,958 : INFO : EPOCH 20 - PROGRESS: at 32.21% examples, 402602 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:11,964 : INFO : EPOCH 20 - PROGRESS: at 33.90% examples, 401684 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:12,989 : INFO : EPOCH 20 - PROGRESS: at 35.52% examples, 400515 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:14,019 : INFO : EPOCH 20 - PROGRESS: at 37.30% examples, 401080 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:15,058 : INFO : EPOCH 20 - PROGRESS: at 39.13% examples, 401464 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:16,060 : INFO : EPOCH 20 - PROGRESS: at 41.04% examples, 402526 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:17,085 : INFO : EPOCH 20 - PROGRESS: at 42.90% examples, 403102 words/s


Evaluating Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4)


2019-11-21 17:35:50,152 : INFO : training model with 4 workers on 265409 vocabulary and 1100 features, using sg=0 hs=0 sample=0 negative=5 window=5



0.174240 Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4)

Training Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


2019-11-21 17:35:51,208 : INFO : EPOCH 1 - PROGRESS: at 0.73% examples, 157697 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:52,211 : INFO : EPOCH 1 - PROGRESS: at 1.79% examples, 198770 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:35:53,306 : INFO : EPOCH 1 - PROGRESS: at 2.87% examples, 212407 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:54,306 : INFO : EPOCH 1 - PROGRESS: at 3.80% examples, 212802 words/s, in_qsize 8, out_qsize 0
2019-11-21 17:35:55,351 : INFO : EPOCH 1 - PROGRESS: at 4.75% examples, 212879 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:56,365 : INFO : EPOCH 1 - PROGRESS: at 5.84% examples, 218660 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:57,375 : INFO : EPOCH 1 - PROGRESS: at 6.87% examples, 221612 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:58,443 : INFO : EPOCH 1 - PROGRESS: at 8.02% examples, 225639 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:35:59,509 : INFO : EPOCH 1 - PROGRESS: at 9.16% examples, 228597 words/s, in_qsize 7, out_

2019-11-21 17:37:06,777 : INFO : EPOCH 1 - PROGRESS: at 81.89% examples, 245275 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:07,805 : INFO : EPOCH 1 - PROGRESS: at 83.10% examples, 245493 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:08,807 : INFO : EPOCH 1 - PROGRESS: at 84.18% examples, 245558 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:09,812 : INFO : EPOCH 1 - PROGRESS: at 85.13% examples, 245363 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:10,823 : INFO : EPOCH 1 - PROGRESS: at 86.18% examples, 245403 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:11,828 : INFO : EPOCH 1 - PROGRESS: at 87.21% examples, 245238 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:12,860 : INFO : EPOCH 1 - PROGRESS: at 88.34% examples, 245335 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:13,907 : INFO : EPOCH 1 - PROGRESS: at 89.45% examples, 245287 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:37:14,959 : INFO : EPOCH 1 - PROGRESS: at 90.61% examples, 245288 words/s, in_qsiz

2019-11-21 17:38:18,557 : INFO : EPOCH 2 - PROGRESS: at 60.82% examples, 255277 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:19,559 : INFO : EPOCH 2 - PROGRESS: at 61.94% examples, 255385 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:20,589 : INFO : EPOCH 2 - PROGRESS: at 63.11% examples, 255494 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:21,618 : INFO : EPOCH 2 - PROGRESS: at 64.24% examples, 255456 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:22,655 : INFO : EPOCH 2 - PROGRESS: at 65.47% examples, 255570 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:23,672 : INFO : EPOCH 2 - PROGRESS: at 66.47% examples, 255080 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:24,758 : INFO : EPOCH 2 - PROGRESS: at 67.61% examples, 254812 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:25,866 : INFO : EPOCH 2 - PROGRESS: at 68.81% examples, 254624 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:38:26,874 : INFO : EPOCH 2 - PROGRESS: at 69.91% examples, 254549 words/s, in_qsiz

2019-11-21 17:39:29,902 : INFO : EPOCH 3 - PROGRESS: at 40.96% examples, 262313 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:30,933 : INFO : EPOCH 3 - PROGRESS: at 42.09% examples, 262082 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:31,963 : INFO : EPOCH 3 - PROGRESS: at 43.14% examples, 261346 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:33,052 : INFO : EPOCH 3 - PROGRESS: at 44.34% examples, 260995 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:34,078 : INFO : EPOCH 3 - PROGRESS: at 45.49% examples, 261098 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:35,079 : INFO : EPOCH 3 - PROGRESS: at 46.68% examples, 261574 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:36,088 : INFO : EPOCH 3 - PROGRESS: at 47.85% examples, 261508 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:37,091 : INFO : EPOCH 3 - PROGRESS: at 49.05% examples, 261715 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:39:38,175 : INFO : EPOCH 3 - PROGRESS: at 50.34% examples, 262081 words/s, in_qsiz

2019-11-21 17:40:40,905 : INFO : EPOCH 4 - PROGRESS: at 22.52% examples, 264528 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:42,024 : INFO : EPOCH 4 - PROGRESS: at 23.72% examples, 263776 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:43,070 : INFO : EPOCH 4 - PROGRESS: at 24.90% examples, 263616 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:44,134 : INFO : EPOCH 4 - PROGRESS: at 26.08% examples, 263164 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:45,162 : INFO : EPOCH 4 - PROGRESS: at 27.28% examples, 263206 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:46,170 : INFO : EPOCH 4 - PROGRESS: at 28.54% examples, 263868 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:47,269 : INFO : EPOCH 4 - PROGRESS: at 29.81% examples, 263930 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:48,278 : INFO : EPOCH 4 - PROGRESS: at 31.10% examples, 264818 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:40:49,340 : INFO : EPOCH 4 - PROGRESS: at 32.33% examples, 264823 words/s, in_qsiz

2019-11-21 17:41:52,729 : INFO : EPOCH 5 - PROGRESS: at 6.05% examples, 272176 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:41:53,747 : INFO : EPOCH 5 - PROGRESS: at 7.11% examples, 267934 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:41:54,780 : INFO : EPOCH 5 - PROGRESS: at 8.23% examples, 265665 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:41:55,798 : INFO : EPOCH 5 - PROGRESS: at 9.34% examples, 264336 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:41:56,822 : INFO : EPOCH 5 - PROGRESS: at 10.56% examples, 264546 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:41:57,845 : INFO : EPOCH 5 - PROGRESS: at 11.86% examples, 266481 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:41:58,885 : INFO : EPOCH 5 - PROGRESS: at 13.06% examples, 266839 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:41:59,908 : INFO : EPOCH 5 - PROGRESS: at 14.27% examples, 267403 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:42:00,947 : INFO : EPOCH 5 - PROGRESS: at 15.49% examples, 267008 words/s, in_qsize 7,

2019-11-21 17:43:07,442 : INFO : EPOCH 5 - PROGRESS: at 93.94% examples, 269876 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:43:08,515 : INFO : EPOCH 5 - PROGRESS: at 95.17% examples, 269770 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:43:09,542 : INFO : EPOCH 5 - PROGRESS: at 96.31% examples, 269690 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:43:10,566 : INFO : EPOCH 5 - PROGRESS: at 97.41% examples, 269514 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:43:11,570 : INFO : EPOCH 5 - PROGRESS: at 98.53% examples, 269299 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:43:12,622 : INFO : EPOCH 5 - PROGRESS: at 99.74% examples, 269157 words/s, in_qsize 6, out_qsize 0
2019-11-21 17:43:12,726 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-21 17:43:12,751 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-21 17:43:12,774 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-21 17:43:12,776 : INFO : worker thread fi

2019-11-21 17:44:18,510 : INFO : EPOCH 6 - PROGRESS: at 77.47% examples, 270417 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:19,570 : INFO : EPOCH 6 - PROGRESS: at 78.75% examples, 270628 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:20,597 : INFO : EPOCH 6 - PROGRESS: at 79.97% examples, 270695 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:21,629 : INFO : EPOCH 6 - PROGRESS: at 81.21% examples, 270733 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:22,742 : INFO : EPOCH 6 - PROGRESS: at 82.59% examples, 270838 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:23,747 : INFO : EPOCH 6 - PROGRESS: at 83.83% examples, 270937 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:24,783 : INFO : EPOCH 6 - PROGRESS: at 85.01% examples, 271053 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:25,792 : INFO : EPOCH 6 - PROGRESS: at 86.22% examples, 271284 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:44:26,802 : INFO : EPOCH 6 - PROGRESS: at 87.48% examples, 271380 words/s, in_qsiz

2019-11-21 17:59:30,680 : INFO : EPOCH 7 - PROGRESS: at 58.86% examples, 14470 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:31,792 : INFO : EPOCH 7 - PROGRESS: at 60.08% examples, 14752 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:32,846 : INFO : EPOCH 7 - PROGRESS: at 61.06% examples, 14974 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:33,860 : INFO : EPOCH 7 - PROGRESS: at 61.86% examples, 15153 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:34,878 : INFO : EPOCH 7 - PROGRESS: at 62.73% examples, 15352 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:35,913 : INFO : EPOCH 7 - PROGRESS: at 63.70% examples, 15570 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:36,915 : INFO : EPOCH 7 - PROGRESS: at 64.76% examples, 15811 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:37,967 : INFO : EPOCH 7 - PROGRESS: at 65.84% examples, 16050 words/s, in_qsize 7, out_qsize 0
2019-11-21 17:59:39,021 : INFO : EPOCH 7 - PROGRESS: at 67.31% examples, 16388 words/s, in_qsize 7, out_

2019-11-21 18:00:42,130 : INFO : EPOCH 8 - PROGRESS: at 61.57% examples, 336780 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:43,157 : INFO : EPOCH 8 - PROGRESS: at 63.24% examples, 337684 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:44,176 : INFO : EPOCH 8 - PROGRESS: at 64.98% examples, 338875 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:45,194 : INFO : EPOCH 8 - PROGRESS: at 66.87% examples, 340639 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:46,199 : INFO : EPOCH 8 - PROGRESS: at 68.62% examples, 342025 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:47,218 : INFO : EPOCH 8 - PROGRESS: at 70.46% examples, 343458 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:48,224 : INFO : EPOCH 8 - PROGRESS: at 72.25% examples, 344739 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:49,226 : INFO : EPOCH 8 - PROGRESS: at 74.03% examples, 346035 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:00:50,242 : INFO : EPOCH 8 - PROGRESS: at 75.89% examples, 347183 words/s, in_qsiz

2019-11-21 18:01:52,595 : INFO : EPOCH 9 - PROGRESS: at 85.63% examples, 403143 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:01:53,622 : INFO : EPOCH 9 - PROGRESS: at 87.48% examples, 403214 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:01:54,675 : INFO : EPOCH 9 - PROGRESS: at 89.32% examples, 403251 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:01:55,681 : INFO : EPOCH 9 - PROGRESS: at 91.13% examples, 403221 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:01:56,682 : INFO : EPOCH 9 - PROGRESS: at 92.93% examples, 403292 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:01:57,695 : INFO : EPOCH 9 - PROGRESS: at 94.74% examples, 403462 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:01:58,732 : INFO : EPOCH 9 - PROGRESS: at 96.56% examples, 403580 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:01:59,754 : INFO : EPOCH 9 - PROGRESS: at 98.44% examples, 404020 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:02:00,550 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-

2019-11-21 18:02:59,209 : INFO : EPOCH 11 - PROGRESS: at 1.68% examples, 373752 words/s, in_qsize 6, out_qsize 1
2019-11-21 18:03:00,220 : INFO : EPOCH 11 - PROGRESS: at 3.43% examples, 392541 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:03:01,223 : INFO : EPOCH 11 - PROGRESS: at 5.13% examples, 393528 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:03:02,246 : INFO : EPOCH 11 - PROGRESS: at 6.91% examples, 397204 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:03:03,292 : INFO : EPOCH 11 - PROGRESS: at 8.71% examples, 398605 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:03:04,329 : INFO : EPOCH 11 - PROGRESS: at 10.60% examples, 400778 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:03:05,331 : INFO : EPOCH 11 - PROGRESS: at 12.43% examples, 402889 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:03:06,341 : INFO : EPOCH 11 - PROGRESS: at 14.15% examples, 401703 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:03:07,360 : INFO : EPOCH 11 - PROGRESS: at 16.04% examples, 403587 words/s, in_

2019-11-21 18:04:08,920 : INFO : EPOCH 12 - PROGRESS: at 23.43% examples, 406013 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:04:09,954 : INFO : EPOCH 12 - PROGRESS: at 25.16% examples, 404521 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:04:10,973 : INFO : EPOCH 12 - PROGRESS: at 26.86% examples, 402876 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:04:11,980 : INFO : EPOCH 12 - PROGRESS: at 28.58% examples, 401809 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:04:12,995 : INFO : EPOCH 12 - PROGRESS: at 30.29% examples, 400686 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:04:14,012 : INFO : EPOCH 12 - PROGRESS: at 32.16% examples, 401738 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:04:15,024 : INFO : EPOCH 12 - PROGRESS: at 33.98% examples, 402270 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:04:16,053 : INFO : EPOCH 12 - PROGRESS: at 35.77% examples, 402882 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:04:17,056 : INFO : EPOCH 12 - PROGRESS: at 37.51% examples, 403375 words/s

2019-11-21 18:05:18,629 : INFO : EPOCH 13 - PROGRESS: at 46.06% examples, 401356 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:05:19,661 : INFO : EPOCH 13 - PROGRESS: at 47.98% examples, 402131 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:05:20,673 : INFO : EPOCH 13 - PROGRESS: at 49.83% examples, 402806 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:05:21,681 : INFO : EPOCH 13 - PROGRESS: at 51.64% examples, 403194 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:05:22,685 : INFO : EPOCH 13 - PROGRESS: at 53.54% examples, 404174 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:05:23,701 : INFO : EPOCH 13 - PROGRESS: at 55.32% examples, 404284 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:05:24,736 : INFO : EPOCH 13 - PROGRESS: at 57.23% examples, 404726 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:05:25,750 : INFO : EPOCH 13 - PROGRESS: at 59.03% examples, 405148 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:05:26,765 : INFO : EPOCH 13 - PROGRESS: at 60.90% examples, 405604 words/s

2019-11-21 18:06:28,157 : INFO : EPOCH 14 - PROGRESS: at 71.06% examples, 411600 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:29,173 : INFO : EPOCH 14 - PROGRESS: at 72.91% examples, 411819 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:30,187 : INFO : EPOCH 14 - PROGRESS: at 74.83% examples, 412077 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:31,190 : INFO : EPOCH 14 - PROGRESS: at 76.75% examples, 412410 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:32,223 : INFO : EPOCH 14 - PROGRESS: at 78.59% examples, 412428 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:33,228 : INFO : EPOCH 14 - PROGRESS: at 80.44% examples, 412736 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:34,267 : INFO : EPOCH 14 - PROGRESS: at 82.38% examples, 412883 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:35,268 : INFO : EPOCH 14 - PROGRESS: at 84.21% examples, 413098 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:06:36,312 : INFO : EPOCH 14 - PROGRESS: at 85.97% examples, 412949 words/s

2019-11-21 18:07:38,232 : INFO : EPOCH 15 - PROGRESS: at 97.22% examples, 413392 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:07:39,274 : INFO : EPOCH 15 - PROGRESS: at 99.11% examples, 413332 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:07:39,706 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-21 18:07:39,736 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-21 18:07:39,748 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-21 18:07:39,754 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-21 18:07:39,756 : INFO : EPOCH - 15 : training on 23279529 raw words (22951015 effective words) took 55.5s, 413386 effective words/s
2019-11-21 18:07:40,773 : INFO : EPOCH 16 - PROGRESS: at 1.68% examples, 374855 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:07:41,773 : INFO : EPOCH 16 - PROGRESS: at 3.38% examples, 390327 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:07:42,775 : INFO : EPOCH 16

2019-11-21 18:08:44,238 : INFO : EPOCH 17 - PROGRESS: at 16.04% examples, 404626 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:08:45,252 : INFO : EPOCH 17 - PROGRESS: at 17.90% examples, 406107 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:08:46,252 : INFO : EPOCH 17 - PROGRESS: at 19.89% examples, 408687 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:08:47,263 : INFO : EPOCH 17 - PROGRESS: at 21.71% examples, 409625 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:08:48,266 : INFO : EPOCH 17 - PROGRESS: at 23.57% examples, 410685 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:08:49,268 : INFO : EPOCH 17 - PROGRESS: at 25.07% examples, 406326 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:08:50,299 : INFO : EPOCH 17 - PROGRESS: at 26.77% examples, 404257 words/s, in_qsize 8, out_qsize 0
2019-11-21 18:08:51,308 : INFO : EPOCH 17 - PROGRESS: at 28.71% examples, 406040 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:08:52,325 : INFO : EPOCH 17 - PROGRESS: at 30.59% examples, 406879 words/s

2019-11-21 18:09:54,197 : INFO : EPOCH 18 - PROGRESS: at 42.36% examples, 417176 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:09:55,197 : INFO : EPOCH 18 - PROGRESS: at 44.22% examples, 417514 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:09:56,263 : INFO : EPOCH 18 - PROGRESS: at 46.09% examples, 417571 words/s, in_qsize 8, out_qsize 1
2019-11-21 18:09:57,270 : INFO : EPOCH 18 - PROGRESS: at 48.03% examples, 418138 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:09:58,273 : INFO : EPOCH 18 - PROGRESS: at 49.91% examples, 418722 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:09:59,280 : INFO : EPOCH 18 - PROGRESS: at 51.77% examples, 418931 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:10:00,284 : INFO : EPOCH 18 - PROGRESS: at 53.54% examples, 418421 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:10:01,312 : INFO : EPOCH 18 - PROGRESS: at 55.43% examples, 418838 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:10:02,334 : INFO : EPOCH 18 - PROGRESS: at 57.39% examples, 419271 words/s

2019-11-21 18:11:04,365 : INFO : EPOCH 19 - PROGRESS: at 70.21% examples, 416538 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:11:05,385 : INFO : EPOCH 19 - PROGRESS: at 72.12% examples, 416827 words/s, in_qsize 8, out_qsize 1
2019-11-21 18:11:06,400 : INFO : EPOCH 19 - PROGRESS: at 73.99% examples, 416954 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:11:07,405 : INFO : EPOCH 19 - PROGRESS: at 75.93% examples, 417171 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:11:08,416 : INFO : EPOCH 19 - PROGRESS: at 77.83% examples, 417509 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:11:09,439 : INFO : EPOCH 19 - PROGRESS: at 79.58% examples, 417082 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:11:10,443 : INFO : EPOCH 19 - PROGRESS: at 81.38% examples, 416833 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:11:11,514 : INFO : EPOCH 19 - PROGRESS: at 83.14% examples, 415713 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:11:12,552 : INFO : EPOCH 19 - PROGRESS: at 84.82% examples, 414945 words/s

2019-11-21 18:12:13,723 : INFO : EPOCH 20 - PROGRESS: at 96.03% examples, 417189 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:12:14,739 : INFO : EPOCH 20 - PROGRESS: at 97.89% examples, 417421 words/s, in_qsize 7, out_qsize 0
2019-11-21 18:12:15,760 : INFO : EPOCH 20 - PROGRESS: at 99.83% examples, 417593 words/s, in_qsize 4, out_qsize 0
2019-11-21 18:12:15,773 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-11-21 18:12:15,812 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-11-21 18:12:15,823 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-21 18:12:15,832 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-21 18:12:15,834 : INFO : EPOCH - 20 : training on 23279529 raw words (22951015 effective words) took 54.9s, 417720 effective words/s
2019-11-21 18:12:15,835 : INFO : training on a 465590580 raw words (459020300 effective words) took 2227.7s, 206048 effective words/s



Evaluating Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.306080 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4)

0.104360 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4)


Evaluating Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)

0.104840 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)



Achieved Sentiment-Prediction Accuracy
--------------------------------------
Compare error rates achieved, best-to-worst



In [20]:
print("Err_rate Model")
for rate, name in sorted((rate, name) for name, rate in error_rates.items()):
    print("%f %s" % (rate, name))

Err_rate Model
0.104360 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4)
0.104840 Doc2Vec(dbow,d100,n5,mc2,t4)+Doc2Vec(dm/c,d100,n5,w5,mc2,t4)
0.105920 Doc2Vec(dbow,d100,n5,mc2,t4)
0.174240 Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4)
0.306080 Doc2Vec(dm/c,d100,n5,w5,mc2,t4)


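In [None]:
# Same ranking as above, expressed as accuracy (1 - error rate), best first
print("Accuracy Model")
for acc, name in sorted(((1.0 - rate, name) for name, rate in error_rates.items()), reverse=True):
    print("%f %s" % (acc, name))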

In our testing, contrary to the results of the paper, PV-DBOW alone performs
about as well as anything else on this problem. Concatenating vectors from
different models only sometimes offers a tiny predictive improvement (here
0.104360 and 0.104840 versus 0.105920 for PV-DBOW alone), and the combined
models stay close to the best-performing solo model they include.
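
As a rough illustration of what "concatenating vectors from different models" means, the sketch below joins one document's vectors from two models end to end to form a single feature vector for a classifier. The toy vectors and the ``concat_docvecs`` helper are hypothetical stand-ins, not the code that produced the numbers above.

```python
# Hypothetical toy data: each "model" maps a document index to its vector.
# The Doc2Vec models in this tutorial use 100 dimensions each (d100).
dbow_vectors = {0: [0.1, 0.2, 0.3], 1: [0.4, 0.5, 0.6]}
dm_vectors = {0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}


def concat_docvecs(doc_id, *vector_tables):
    """Join one document's vectors from several models into one feature vector."""
    features = []
    for table in vector_tables:
        features.extend(table[doc_id])
    return features


combined = concat_docvecs(0, dbow_vectors, dm_vectors)
print(len(combined), combined)
```

With two 100-dimensional models, the classifier would see 200 features per review instead of 100.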

The best results achieved here are just around 10% error rate, still a long
way from the paper's reported 7.42% error rate.

(Other trials not shown, with larger vectors and other changes, also don't
come close to the paper's reported value. Others around the net have reported
a similar inability to reproduce the paper's best numbers. The PV-DM/C mode
improves a bit with many more training epochs – but doesn't reach parity with
PV-DBOW.)
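
The concatenation compared above simply joins, per document, the vector from
one model with the vector from another before the result is handed to the
classifier. A minimal sketch of that idea and of the error-rate computation,
using plain Python lists and toy numbers. The helper names ``concat_docvecs``
and ``error_rate`` are hypothetical and are not part of gensim or of this
tutorial's code:

```python
def concat_docvecs(vecs_a, vecs_b):
    """Join per-document vectors from two models into one longer vector each."""
    if len(vecs_a) != len(vecs_b):
        raise ValueError("both models must cover the same documents")
    return [list(a) + list(b) for a, b in zip(vecs_a, vecs_b)]


def error_rate(predicted, actual):
    """Fraction of documents whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    errors = sum(1 for p, y in zip(predicted, actual) if p != y)
    return errors / len(actual)


# Toy data: two documents, 2-d vectors from each (hypothetical) model.
dbow_vecs = [[0.1, 0.2], [0.3, 0.4]]
dm_vecs = [[0.5, 0.6], [0.7, 0.8]]
combined = concat_docvecs(dbow_vecs, dm_vecs)  # each row now has 4 dimensions

# One wrong prediction out of four documents.
rate = error_rate(predicted=[1, 0, 1, 1], actual=[1, 0, 0, 1])
```

Because the concatenated vector contains everything the solo vectors do, a
classifier trained on it can at best add a little signal from the second
model, which is consistent with the small differences in the table above.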




Examining Results
-----------------

Let's look for answers to the following questions:

#. Are inferred vectors close to the precalculated ones?
#. Do close documents seem more related than distant ones?
#. Do the word vectors show useful similarities?
#. Are the word vectors from this dataset any good at analogies?




Are inferred vectors close to the precalculated ones?
-----------------------------------------------------



In [21]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # Pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

2019-11-21 18:15:28,126 : INFO : precomputing L2-norms of doc weight vectors


for doc 71229...


2019-11-21 18:15:28,518 : INFO : precomputing L2-norms of doc weight vectors
2019-11-21 18:15:28,648 : INFO : precomputing L2-norms of doc weight vectors


Doc2Vec(dbow,d100,n5,mc2,t4):
 [(71229, 0.9894205331802368), (49613, 0.5576101541519165), (20334, 0.5438925623893738)]
Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4):
 [(71229, 0.9080021977424622), (49613, 0.5746079087257385), (80528, 0.5528310537338257)]
Doc2Vec(dm/c,d100,n5,w5,mc2,t4):
 [(71229, 0.7498977184295654), (61396, 0.4301270842552185), (3886, 0.42097970843315125)]


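(Yes. Here the stored vector from 20 epochs of training is usually one of the
closest to a freshly inferred vector for the same words; in the run above it
is the top match for all three models, with similarity ranging from about 0.75
for DM/C to about 0.99 for DBOW. The inference defaults may benefit from
tuning for each dataset or set of model parameters.)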
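
The check above boils down to ranking all stored document vectors by cosine
similarity to the inferred one and seeing where the document's own stored
vector lands. A small sketch of that ranking in pure Python, with toy vectors
and a hypothetical helper ``rank_of_doc`` (not a gensim function):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_of_doc(inferred, stored_vectors, doc_id):
    """Return the 0-based rank of doc_id when stored vectors are sorted by
    descending cosine similarity to the inferred vector (0 = closest)."""
    sims = [(cosine(inferred, vec), i) for i, vec in enumerate(stored_vectors)]
    sims.sort(reverse=True)
    return [i for _, i in sims].index(doc_id)


# Toy stored vectors for three documents, and a noisy re-inference of doc 1.
stored = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
inferred_for_doc_1 = [0.1, 0.9]
rank = rank_of_doc(inferred_for_doc_1, stored, doc_id=1)
```

A rank of 0 corresponds to the document being its own nearest neighbour, as in
the output above.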




Do close documents seem more related than distant ones?
-------------------------------------------------------



In [22]:
import random

doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    s = sims[index]
    i = sims[index][0]
    words = ' '.join(alldocs[i].words)
    print(u'%s %s: «%s»\n' % (label, s, words))

TARGET (20907): «Being a fan of time travel stories I was surprised that there was no 'device' that sent Russell Johnson's character through time. He just appeared in 1865. It was a disappointing part of the episode. I enjoyed the premise of the dangers of 'altering' the future by changing the past. Other Twilight Zone episodes about time travel such as "No Time Like The Past", "Once Upon A Time", etc. were more to my liking because of the uses of time travel 'devices'. Perhaps if Russell Johnson's character had been the same character he played in "Execution" it would have been more acceptable to fans like myself. As sci-fi fans know, no characters really ever have to 'die' in time travel episodes. All sorts of plot 'twists' can be applied in these types of stories. That was the only flaw I could comment on for this episode.»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/c,d100,n5,w5,mc2,t4):

MOST (90064, 0.4897709786891937): «Maybe it's because I just recently got into Indian soap o

Somewhat. In terms of reviewer tone, movie genre, and so on, the MOST
cosine-similar docs usually seem more like the TARGET than the MEDIAN or
LEAST do, especially if the MOST has a cosine similarity above 0.5. Re-run the
cell to try another random target document.




Do the word vectors show useful similarities?
---------------------------------------------




In [23]:
import random

word_models = simple_models[:]

def pick_random_word(model, threshold=10):
    # pick a random word with a suitable number of occurrences
    while True:
        word = random.choice(model.wv.index2word)
        if model.wv.vocab[word].count > threshold:
            return word

target_word = pick_random_word(word_models[0])
# or uncomment below line, to just pick a word from the relevant domain:
# target_word = 'comedy/drama'

for model in word_models:
    print('target_word: %r model: %s similar words:' % (target_word, model))
    for i, (word, sim) in enumerate(model.wv.most_similar(target_word, topn=10), 1):
        print('    %d. %.2f %r' % (i, sim, word))
    print()

2019-11-21 18:20:41,863 : INFO : precomputing L2-norms of word weight vectors


target_word: '(Ron' model: Doc2Vec(dbow,d100,n5,mc2,t4) similar words:


2019-11-21 18:20:42,168 : INFO : precomputing L2-norms of word weight vectors


    1. 0.47 'attests'
    2. 0.45 'Beach,"'
    3. 0.42 'parody?'
    4. 0.42 'unite'
    5. 0.41 'Jansen,'
    6. 0.41 '(wonderful!)'
    7. 0.41 'scamming'
    8. 0.40 "Talia's"
    9. 0.40 'Aeris'
    10. 0.40 'Russo),'

target_word: '(Ron' model: Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4) similar words:


2019-11-21 18:20:42,411 : INFO : precomputing L2-norms of word weight vectors


    1. 0.55 'Smith),'
    2. 0.54 'Ford),'
    3. 0.51 '(Gordon'
    4. 0.51 'Poe.<br'
    5. 0.50 '(Humphrey'
    6. 0.50 '(Willie'
    7. 0.50 'Boone,'
    8. 0.50 'Stein,'
    9. 0.49 'Osborne,'
    10. 0.49 'Curtis)'

target_word: '(Ron' model: Doc2Vec(dm/c,d100,n5,w5,mc2,t4) similar words:
    1. 0.67 '(Timothy'
    2. 0.66 '(Gary'
    3. 0.66 '(Ben'
    4. 0.63 '(Linda'
    5. 0.63 '(Morgan'
    6. 0.63 '(Edward'
    7. 0.63 '(Jack'
    8. 0.63 'Hobbes,'
    9. 0.63 '(Christopher'
    10. 0.62 '(Isabelle'



Do the DBOW words look meaningless? That's because the gensim DBOW model
doesn't train word vectors unless you ask for it with the ``dbow_words=1``
initialization parameter; otherwise they remain at their randomly initialized
values. Concurrent word training slows DBOW mode significantly, and on this
IMDB sentiment-prediction task it offers little improvement in the error rate
(and sometimes slightly worsens it). It may still be appropriate on other
tasks, or if you also need word vectors.
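
To compare against a DBOW model whose word vectors are actually trained, one
option is to construct it with ``dbow_words=1``. The sketch below is hedged:
apart from ``dm`` and ``dbow_words``, the parameter values are read off the
model name printed above (``d100``, ``n5``, ``mc2``, ``t4``) and the 20 logged
epochs, so they are assumptions about how the earlier models were configured,
and the variable names are illustrative only.

```python
# Keyword arguments for a DBOW model that also trains word vectors.
# Values other than dm/dbow_words are assumptions inferred from the
# printed model name "Doc2Vec(dbow,d100,n5,mc2,t4)" and the 20 logged epochs.
dbow_words_kwargs = dict(
    dm=0,             # PV-DBOW mode
    dbow_words=1,     # also train word vectors alongside doc vectors (slower)
    vector_size=100,  # d100
    negative=5,       # n5
    min_count=2,      # mc2
    workers=4,        # t4
    epochs=20,
)

try:
    from gensim.models.doc2vec import Doc2Vec
except ImportError:
    Doc2Vec = None  # gensim not installed; the kwargs above still document the setup

if Doc2Vec is not None:
    # Untrained model; vocabulary building and training would follow
    # in the same way as for the other models in this tutorial.
    dbow_with_words = Doc2Vec(**dbow_words_kwargs)
```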

Words from DM models tend to show meaningfully similar words when there are
many examples in the training data (as with 'plot' or 'actor'). (All DM modes
inherently involve word-vector training concurrent with doc-vector training.)




Are the word vectors from this dataset any good at analogies?
-------------------------------------------------------------



In [25]:
# grab the file if not already local
questions_filename = 'questions-words.txt'
if not os.path.isfile(questions_filename):
    # Download the analogy questions file
    print("Downloading analogy questions file...")
    url = u'https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt'
    with smart_open.open(url, 'rb') as fin:
        with smart_open.open(questions_filename, 'wb') as fout:
            fout.write(fin.read())
assert os.path.isfile(questions_filename), "questions-words.txt unavailable"
print("Success, questions-words.txt is available for next steps.")

# Note: this analysis takes many minutes
for model in word_models:
    score, sections = model.wv.evaluate_word_analogies(questions_filename)
    correct, incorrect = len(sections[-1]['correct']), len(sections[-1]['incorrect'])
    print('%s: %0.2f%% correct (%d of %d)' % (model, float(correct*100)/(correct+incorrect), correct, correct+incorrect))

Downloading analogy questions file...
Success, questions-words.txt is available for next steps.


2019-11-21 18:21:09,211 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2019-11-21 18:21:15,694 : INFO : capital-common-countries: 0.0% (0/420)
2019-11-21 18:21:26,841 : INFO : capital-world: 0.0% (0/902)
2019-11-21 18:21:27,792 : INFO : currency: 0.0% (0/86)
2019-11-21 18:21:45,032 : INFO : city-in-state: 0.0% (0/1510)
2019-11-21 18:21:51,400 : INFO : family: 0.0% (0/506)
2019-11-21 18:22:02,292 : INFO : gram1-adjective-to-adverb: 0.0% (0/992)
2019-11-21 18:22:10,881 : INFO : gram2-opposite: 0.0% (0/756)
2019-11-21 18:22:25,265 : INFO : gram3-comparative: 0.0% (0/1332)
2019-11-21 18:22:36,619 : INFO : gram4-superlative: 0.0% (0/1056)
2019-11-21 18:22:47,368 : INFO : gram5-present-participle: 0.0% (0/992)
2019-11-21 18:23:02,669 : INFO : gram6-nationality-adjective: 0.0% (0/1445)
2019-11-21 18:23:19,227 : INFO : gram7-past-tense: 0.0% (0/1560)
2019-11-21 18:23:31,653 : INFO : gram8-plural: 0.0% (0/1190)
2019-11-21 18:23:41,259 : INFO : gram9-

Doc2Vec(dbow,d100,n5,mc2,t4): 0.00% correct (0 of 13617)


2019-11-21 18:23:42,263 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2019-11-21 18:23:46,695 : INFO : capital-common-countries: 3.3% (14/420)
2019-11-21 18:23:57,115 : INFO : capital-world: 0.4% (4/902)
2019-11-21 18:23:58,037 : INFO : currency: 0.0% (0/86)
2019-11-21 18:24:14,259 : INFO : city-in-state: 0.1% (2/1510)
2019-11-21 18:24:20,041 : INFO : family: 37.2% (188/506)
2019-11-21 18:24:31,152 : INFO : gram1-adjective-to-adverb: 2.8% (28/992)
2019-11-21 18:24:39,537 : INFO : gram2-opposite: 5.6% (42/756)
2019-11-21 18:24:56,137 : INFO : gram3-comparative: 49.2% (656/1332)
2019-11-21 18:25:08,873 : INFO : gram4-superlative: 24.5% (259/1056)
2019-11-21 18:25:21,103 : INFO : gram5-present-participle: 23.3% (231/992)
2019-11-21 18:25:38,774 : INFO : gram6-nationality-adjective: 2.4% (35/1445)
2019-11-21 18:25:59,006 : INFO : gram7-past-tense: 28.1% (438/1560)
2019-11-21 18:26:12,183 : INFO : gram8-plural: 16.9% (201/1190)
2019-11-21 18:26:

Doc2Vec("alpha=0.05",dm/m,d100,n5,w10,mc2,t4): 18.15% correct (2472 of 13617)


2019-11-21 18:26:24,098 : INFO : Evaluating word analogies for top 300000 words in the model on questions-words.txt
2019-11-21 18:26:29,037 : INFO : capital-common-countries: 2.4% (10/420)
2019-11-21 18:26:38,397 : INFO : capital-world: 0.3% (3/902)
2019-11-21 18:26:39,348 : INFO : currency: 0.0% (0/86)
2019-11-21 18:26:55,112 : INFO : city-in-state: 0.3% (5/1510)
2019-11-21 18:27:00,694 : INFO : family: 37.4% (189/506)
2019-11-21 18:27:11,203 : INFO : gram1-adjective-to-adverb: 7.4% (73/992)
2019-11-21 18:27:19,545 : INFO : gram2-opposite: 3.7% (28/756)
2019-11-21 18:27:34,592 : INFO : gram3-comparative: 36.5% (486/1332)
2019-11-21 18:27:46,404 : INFO : gram4-superlative: 28.0% (296/1056)
2019-11-21 18:27:57,470 : INFO : gram5-present-participle: 34.4% (341/992)
2019-11-21 18:28:14,761 : INFO : gram6-nationality-adjective: 2.0% (29/1445)
2019-11-21 18:28:32,915 : INFO : gram7-past-tense: 26.0% (406/1560)
2019-11-21 18:28:47,032 : INFO : gram8-plural: 9.4% (112/1190)
2019-11-21 18:28:5

Doc2Vec(dm/c,d100,n5,w5,mc2,t4): 17.69% correct (2409 of 13617)


Even though this is a tiny, domain-specific dataset, it shows some meager
capability on the general word analogies, at least for the DM/mean and
DM/concat models, which actually train word vectors: about 18% correct each,
concentrated in the family and grammatical categories. (The untrained,
randomly initialized word vectors of the DBOW model of course fail miserably.)
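
Each analogy question has the form "a is to b as c is to ?" and is typically
answered by finding the vocabulary word closest to ``b - a + c``, excluding
the three question words. A toy sketch of that vector arithmetic in pure
Python, with made-up 2-d vectors and a hypothetical helper ``solve_analogy``.
This is a simplification: gensim's own evaluation works on normalized vectors
and a restricted vocabulary, which the sketch does not reproduce exactly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Made-up vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "movie": [-1.0, 2.0],
}
answer = solve_analogy(toy_vectors, "man", "woman", "king")

# Accuracy is then correct / total, as computed in the cell above.
questions = [("man", "woman", "king", "queen"), ("woman", "man", "queen", "king")]
correct = sum(solve_analogy(toy_vectors, a, b, c) == d for a, b, c, d in questions)
accuracy = 100.0 * correct / len(questions)
```

With randomly initialized word vectors, as in the DBOW model, the nearest word
to ``b - a + c`` is essentially arbitrary, which is why that model scores 0%.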


