# Word2Vec Tutorial

In case you missed the buzz, word2vec is widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as "deep learning" (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) =~ vec(“Toronto Maple Leafs”).
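To make the vector arithmetic concrete, here is a minimal sketch using hand-made three-dimensional vectors. The `TOY_VECTORS` table and the `cosine` and `analogy` helpers are invented for illustration; real word2vec vectors are learned from text and have many more dimensions.

```python
import math

# Hypothetical toy vectors (royalty, masculinity, fruitiness); not learned from data.
TOY_VECTORS = {
    "king":  [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "apple": [0.0, 0.5, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))
```

With these toy values the nearest remaining word is "queen". On real data the match is only approximate, which is why the relation above is written with =~.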

Word2vec is very useful in [automatic text tagging](https://github.com/RaRe-Technologies/movie-plots-by-genre), recommender systems and machine translation.

Check out an [online word2vec demo](http://radimrehurek.com/2014/02/word2vec-tutorial/#app) where you can try this vector algebra for yourself. That demo runs `word2vec` on the Google News dataset, of **about 100 billion words**.

## This tutorial

In this tutorial you will learn how to train and evaluate word2vec models on your business data.


## Preparing the Input
Starting from the beginning, gensim’s `word2vec` expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):

In [1]:
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

2017-12-21 11:11:54,294 : INFO : collecting all words and their counts
2017-12-21 11:11:54,296 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:54,297 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences
2017-12-21 11:11:54,298 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:54,300 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-12-21 11:11:54,301 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-12-21 11:11:54,302 : INFO : deleting the raw counts dictionary of 3 items
2017-12-21 11:11:54,303 : INFO : sample=0.001 downsamples 3 most-common words
2017-12-21 11:11:54,304 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-12-21 11:11:54,305 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-12-21 11:11:54,307 : INFO : resetting layer weights
2017-12-21 11:11:54,308 : INFO : training model with 3 workers o

Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.

Gensim only requires that the input provide sentences sequentially when iterated over. There is no need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…

For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:

In [3]:
# create some toy data to use with the following example
import smart_open, os

if not os.path.exists('./data/'):
    os.makedirs('./data/')

filenames = ['./data/f1.txt', './data/f2.txt']

for i, fname in enumerate(filenames):
    with smart_open.smart_open(fname, 'w') as fout:
        for line in sentences[i]:
            fout.write(line + '\n')

In [4]:
class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

In [5]:
sentences = MySentences('./data/') # a memory-friendly iterator
print(list(sentences))

[['second'], ['sentence'], ['first'], ['sentence']]


In [6]:
# generate the Word2Vec model
model = gensim.models.Word2Vec(sentences, min_count=1)

2017-12-21 11:11:54,344 : INFO : collecting all words and their counts
2017-12-21 11:11:54,345 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:54,346 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2017-12-21 11:11:54,347 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:54,348 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-12-21 11:11:54,349 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-12-21 11:11:54,349 : INFO : deleting the raw counts dictionary of 3 items
2017-12-21 11:11:54,350 : INFO : sample=0.001 downsamples 3 most-common words
2017-12-21 11:11:54,351 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-12-21 11:11:54,352 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-12-21 11:11:54,352 : INFO : resetting layer weights
2017-12-21 11:11:54,353 : INFO : training model with 3 workers o

In [7]:
print(model)
print(model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)
{'second': <gensim.models.keyedvectors.Vocab object at 0x7fe7b19207d0>, 'sentence': <gensim.models.keyedvectors.Vocab object at 0x7fe7b1920b10>, 'first': <gensim.models.keyedvectors.Vocab object at 0x7fe7b1920990>}


Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the `MySentences` iterator and `word2vec` doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.
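As a rough sketch of that idea, the iterator below lowercases each token and drops purely numeric ones before yielding a sentence. The class name `PreprocessedSentences` and the specific cleaning rules are assumptions chosen for illustration; they are not part of gensim.

```python
import io
import os

class PreprocessedSentences(object):
    """Stream sentences from every file in a directory, cleaning tokens on the fly."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # sorted() makes the file order deterministic across platforms
        for fname in sorted(os.listdir(self.dirname)):
            with io.open(os.path.join(self.dirname, fname), encoding='utf8') as fin:
                for line in fin:
                    # lowercase, then drop tokens that consist only of digits
                    tokens = [tok.lower() for tok in line.split() if not tok.isdigit()]
                    if tokens:  # skip lines that end up empty
                        yield tokens
```

An instance of this class could be passed to `Word2Vec` in place of `MySentences`, since it also yields one list of words per sentence and can be iterated repeatedly.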

**Note to advanced users:** calling `Word2Vec(sentences, iter=1)` will run **two** passes over the sentences iterator; in general, it runs `iter+1` passes. The default value is `iter=5`, to match Google's original word2vec implementation in C.
  1. The first pass collects words and their frequencies to build an internal dictionary tree structure. 
  2. The second pass trains the neural model.
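Because of these repeated passes, the input normally has to be restartable: a plain generator is exhausted after one pass, whereas an object with an `__iter__` method starts afresh each time. A small sketch in plain Python (the `Restartable` class and the `read_sentences` function are made up for illustration):

```python
def read_sentences():
    """A generator function: each call returns a fresh one-shot generator."""
    yield ['first', 'sentence']
    yield ['second', 'sentence']

class Restartable(object):
    """Wrap a generator function so that every iteration starts from scratch."""

    def __init__(self, generator_func):
        self.generator_func = generator_func

    def __iter__(self):
        return self.generator_func()

one_shot = read_sentences()
first_pass = list(one_shot)    # consumes the generator
second_pass = list(one_shot)   # nothing left, so this is an empty list

restartable = Restartable(read_sentences)
passes = [list(restartable), list(restartable)]  # both passes see every sentence

print(len(first_pass), len(second_pass))
print([len(p) for p in passes])
```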

These two passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:

In [8]:
# build the same model, making the 2 steps explicit
new_model = gensim.models.Word2Vec(min_count=1)  # an empty model, no training
new_model.build_vocab(sentences)                 # can be a non-repeatable, 1-pass generator
new_model.train(sentences, total_examples=new_model.corpus_count, epochs=new_model.iter)  # can be a non-repeatable, 1-pass generator

2017-12-21 11:11:54,372 : INFO : collecting all words and their counts
2017-12-21 11:11:54,373 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:54,375 : INFO : collected 3 word types from a corpus of 4 raw words and 4 sentences
2017-12-21 11:11:54,376 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:54,377 : INFO : min_count=1 retains 3 unique words (100% of original 3, drops 0)
2017-12-21 11:11:54,378 : INFO : min_count=1 leaves 4 word corpus (100% of original 4, drops 0)
2017-12-21 11:11:54,380 : INFO : deleting the raw counts dictionary of 3 items
2017-12-21 11:11:54,381 : INFO : sample=0.001 downsamples 3 most-common words
2017-12-21 11:11:54,382 : INFO : downsampling leaves estimated 0 word corpus (5.7% of prior 4)
2017-12-21 11:11:54,383 : INFO : estimated required memory for 3 words and 100 dimensions: 3900 bytes
2017-12-21 11:11:54,384 : INFO : resetting layer weights
2017-12-21 11:11:54,385 : INFO : training model with 3 workers o

0

In [9]:
print(new_model)
print(new_model.wv.vocab)

Word2Vec(vocab=3, size=100, alpha=0.025)
{'second': <gensim.models.keyedvectors.Vocab object at 0x7fe7b19207d0>, 'sentence': <gensim.models.keyedvectors.Vocab object at 0x7fe7b1920b10>, 'first': <gensim.models.keyedvectors.Vocab object at 0x7fe7b1920990>}


### More data would be nice
For the following examples, we'll use the [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor) (which you already have if you've installed gensim):

In [10]:
# Set file names for train and test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data') + os.sep
lee_train_file = test_data_dir + 'lee_background.cor'

In [11]:
class MyText(object):
    def __iter__(self):
        for line in open(lee_train_file):
            # assume there's one document per line, tokens separated by whitespace
            yield line.lower().split()

sentences = MyText()

print(sentences)

<__main__.MyText object at 0x7fe7b18d3450>


## Training
`Word2Vec` accepts several parameters that affect both training speed and quality.

### min_count
`min_count` is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to learn anything meaningful about those words, so it’s best to ignore them:

In [12]:
# default value of min_count=5
model = gensim.models.Word2Vec(sentences, min_count=10)

2017-12-21 11:11:54,428 : INFO : collecting all words and their counts
2017-12-21 11:11:54,430 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:54,448 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-12-21 11:11:54,449 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:54,455 : INFO : min_count=10 retains 806 unique words (7% of original 10186, drops 9380)
2017-12-21 11:11:54,456 : INFO : min_count=10 leaves 40964 word corpus (68% of original 59890, drops 18926)
2017-12-21 11:11:54,460 : INFO : deleting the raw counts dictionary of 10186 items
2017-12-21 11:11:54,461 : INFO : sample=0.001 downsamples 54 most-common words
2017-12-21 11:11:54,462 : INFO : downsampling leaves estimated 26224 word corpus (64.0% of prior 40964)
2017-12-21 11:11:54,463 : INFO : estimated required memory for 806 words and 100 dimensions: 1047800 bytes
2017-12-21 11:11:54,466 : INFO : resetting layer weights
2017-12-21 11:1
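
The pruning that `min_count` performs is simple to picture. The following sketch counts tokens with `collections.Counter` and keeps only words that occur at least `min_count` times. The helper name `prune_vocab` and the toy corpus are illustrative assumptions, not gensim internals.

```python
from collections import Counter

def prune_vocab(sentences, min_count=5):
    """Return (kept, dropped) word-count dicts after applying a frequency threshold."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = {w: c for w, c in counts.items() if c >= min_count}
    dropped = {w: c for w, c in counts.items() if c < min_count}
    return kept, dropped

toy_corpus = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'catt', 'ran'],   # 'catt' is a typo that appears only once
]
kept, dropped = prune_vocab(toy_corpus, min_count=2)
print(sorted(kept))      # words seen at least twice
print(sorted(dropped))   # rare words that would be ignored during training
```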

### size
`size` is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

In [13]:
# default value of size=100
model = gensim.models.Word2Vec(sentences, size=200)

2017-12-21 11:11:54,616 : INFO : collecting all words and their counts
2017-12-21 11:11:54,617 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:54,639 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-12-21 11:11:54,640 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:54,647 : INFO : min_count=5 retains 1723 unique words (16% of original 10186, drops 8463)
2017-12-21 11:11:54,648 : INFO : min_count=5 leaves 46858 word corpus (78% of original 59890, drops 13032)
2017-12-21 11:11:54,656 : INFO : deleting the raw counts dictionary of 10186 items
2017-12-21 11:11:54,657 : INFO : sample=0.001 downsamples 49 most-common words
2017-12-21 11:11:54,658 : INFO : downsampling leaves estimated 32849 word corpus (70.1% of prior 46858)
2017-12-21 11:11:54,659 : INFO : estimated required memory for 1723 words and 200 dimensions: 3618300 bytes
2017-12-21 11:11:54,669 : INFO : resetting layer weights
2017-12-21 11:

### workers
`workers`, the last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:

In [14]:
# default value of workers=3 (tutorial says 1...)
model = gensim.models.Word2Vec(sentences, workers=4)

2017-12-21 11:11:54,902 : INFO : collecting all words and their counts
2017-12-21 11:11:54,904 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:54,929 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-12-21 11:11:54,930 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:54,981 : INFO : min_count=5 retains 1723 unique words (16% of original 10186, drops 8463)
2017-12-21 11:11:54,982 : INFO : min_count=5 leaves 46858 word corpus (78% of original 59890, drops 13032)
2017-12-21 11:11:54,988 : INFO : deleting the raw counts dictionary of 10186 items
2017-12-21 11:11:54,989 : INFO : sample=0.001 downsamples 49 most-common words
2017-12-21 11:11:54,990 : INFO : downsampling leaves estimated 32849 word corpus (70.1% of prior 46858)
2017-12-21 11:11:54,992 : INFO : estimated required memory for 1723 words and 100 dimensions: 2239900 bytes
2017-12-21 11:11:54,996 : INFO : resetting layer weights
2017-12-21 11:

The `workers` parameter only has an effect if you have [Cython](http://cython.org/) installed. Without Cython, you’ll only be able to use one core because of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) (and `word2vec` training will be [miserably slow](http://rare-technologies.com/word2vec-in-python-part-two-optimizing/)).

## Memory
At its core, `word2vec` model parameters are stored as matrices (NumPy arrays). Each array holds **#vocabulary** (controlled by the `min_count` parameter) times **#size** (the `size` parameter) floats (single precision, i.e. 4 bytes each).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer `size=200`, the model will require approx. `100,000*200*4*3 bytes = ~229MB`.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
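
The estimate above can be reproduced with a few lines of arithmetic. The function name `estimate_matrix_bytes` is made up for this sketch, and it deliberately ignores the vocabulary structure and any other overhead.

```python
def estimate_matrix_bytes(vocab_size, vector_size, n_matrices=3, bytes_per_float=4):
    """Rough RAM needed for the weight matrices alone, in bytes."""
    return vocab_size * vector_size * bytes_per_float * n_matrices

n_bytes = estimate_matrix_bytes(100000, 200)
print(n_bytes)                                # 240000000
print(int(round(n_bytes / float(2 ** 20))))   # 229 (megabytes of 2**20 bytes)
```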

## Evaluating
`Word2Vec` training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google has released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the 'datasets' folder.

For example, a syntactic analogy of the comparative type is bad:worse;good:?. There are a total of 9 types of syntactic comparisons in the dataset, such as plural nouns and words of opposite meaning.

The semantic questions contain five types of semantic analogies, such as capital cities (Paris:France;Tokyo:?) or family members (brother:sister;dad:?). 
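
The test file is plain text: a line starting with `:` names a section, and each following line holds four whitespace-separated words `A B C D`. A minimal parser sketch follows. The function name `parse_analogies` is hypothetical, and the inline sample only mirrors the file's layout using the examples mentioned above rather than quoting the real file.

```python
def parse_analogies(lines):
    """Group 'A B C D' analogy lines under their ': section' headers."""
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(':'):
            # a new section header such as ': family'
            current = line[1:].strip()
            sections[current] = []
        elif current is not None:
            # exactly four words are expected; anything else raises ValueError
            a, b, c, d = line.split()
            sections[current].append((a, b, c, d))
    return sections

sample = """\
: capital-common-countries
Paris France Tokyo Japan
: family
brother sister dad mom
"""
sections = parse_analogies(sample.splitlines())
for name in sorted(sections):
    print(name, len(sections[name]))
```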

Gensim supports the same evaluation set, in exactly the same format:

In [15]:
model.accuracy('./datasets/questions-words.txt')

2017-12-21 11:11:55,204 : INFO : precomputing L2-norms of word weight vectors
2017-12-21 11:11:55,219 : INFO : family: 0.0% (0/2)
2017-12-21 11:11:55,244 : INFO : gram3-comparative: 0.0% (0/12)
2017-12-21 11:11:55,256 : INFO : gram4-superlative: 0.0% (0/12)
2017-12-21 11:11:55,268 : INFO : gram5-present-participle: 5.0% (1/20)
2017-12-21 11:11:55,285 : INFO : gram6-nationality-adjective: 0.0% (0/20)
2017-12-21 11:11:55,308 : INFO : gram7-past-tense: 0.0% (0/20)
2017-12-21 11:11:55,340 : INFO : gram8-plural: 0.0% (0/12)
2017-12-21 11:11:55,350 : INFO : total: 1.0% (1/98)


[{'correct': [], 'incorrect': [], 'section': u'capital-common-countries'},
 {'correct': [], 'incorrect': [], 'section': u'capital-world'},
 {'correct': [], 'incorrect': [], 'section': u'currency'},
 {'correct': [], 'incorrect': [], 'section': u'city-in-state'},
 {'correct': [],
  'incorrect': [(u'HE', u'SHE', u'HIS', u'HER'),
   (u'HIS', u'HER', u'HE', u'SHE')],
  'section': u'family'},
 {'correct': [], 'incorrect': [], 'section': u'gram1-adjective-to-adverb'},
 {'correct': [], 'incorrect': [], 'section': u'gram2-opposite'},
 {'correct': [],
  'incorrect': [(u'GOOD', u'BETTER', u'GREAT', u'GREATER'),
   (u'GOOD', u'BETTER', u'LONG', u'LONGER'),
   (u'GOOD', u'BETTER', u'LOW', u'LOWER'),
   (u'GREAT', u'GREATER', u'LONG', u'LONGER'),
   (u'GREAT', u'GREATER', u'LOW', u'LOWER'),
   (u'GREAT', u'GREATER', u'GOOD', u'BETTER'),
   (u'LONG', u'LONGER', u'LOW', u'LOWER'),
   (u'LONG', u'LONGER', u'GOOD', u'BETTER'),
   (u'LONG', u'LONGER', u'GREAT', u'GREATER'),
   (u'LOW', u'LOWER', u'GOOD',

The `accuracy` method takes an [optional parameter](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy) `restrict_vocab`, which limits which test examples are considered.



In the December 2016 release of Gensim we added a better way to evaluate semantic similarity.

By default it uses the academic dataset WS-353, but you can create a dataset specific to your business based on it. The dataset contains word pairs together with human-assigned similarity judgments, and it measures the relatedness or co-occurrence of two words. For example, 'coast' and 'shore' are very similar as they appear in the same context. At the same time, 'clothes' and 'closet' are less similar because they are related but not interchangeable.
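
This kind of evaluation boils down to correlating the model's similarity scores with the human judgments. The sketch below computes a Spearman rank correlation by hand for a handful of made-up pairs. The scores are invented for illustration, and the simple formula used here assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

# Hypothetical human judgments (0-10) and model similarities for four word pairs.
human = [9.1, 8.0, 3.2, 1.5]
model_scores = [0.83, 0.61, 0.70, 0.12]

print(spearman(human, model_scores))
```

A value near 1 means the model orders the pairs the way people do; a value near 0, as in the tiny-corpus run below, means the two orderings are essentially unrelated.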

In [16]:
model.evaluate_word_pairs(test_data_dir + 'wordsim353.tsv')

2017-12-21 11:11:55,415 : INFO : Pearson correlation coefficient against /usr/local/lib/python2.7/dist-packages/gensim-3.2.0-py2.7-linux-x86_64.egg/gensim/test/test_data/wordsim353.tsv: 0.0826
2017-12-21 11:11:55,426 : INFO : Spearman rank-order correlation coefficient against /usr/local/lib/python2.7/dist-packages/gensim-3.2.0-py2.7-linux-x86_64.egg/gensim/test/test_data/wordsim353.tsv: 0.0491
2017-12-21 11:11:55,427 : INFO : Pairs with unknown words ratio: 85.6%


((0.082634954456099585, 0.5642847319174914),
 SpearmanrResult(correlation=0.049100577101470595, pvalue=0.7322291607477649),
 85.55240793201133)

Once again, **good performance on Google's or WS-353 test set doesn’t mean word2vec will work well in your application, or vice versa**. It’s always best to evaluate directly on your intended task. For an example of how to use word2vec in a classifier pipeline, see this [tutorial](https://github.com/RaRe-Technologies/movie-plots-by-genre).

## Storing and loading models
You can store/load models using the standard gensim methods:

In [17]:
from tempfile import mkstemp

fs, temp_path = mkstemp("gensim_temp")  # creates a temp file

model.save(temp_path)  # save the model

2017-12-21 11:11:55,435 : INFO : saving Word2Vec object under /tmp/tmpuvMZHLgensim_temp, separately None
2017-12-21 11:11:55,436 : INFO : not storing attribute syn0norm
2017-12-21 11:11:55,437 : INFO : not storing attribute cum_table
2017-12-21 11:11:55,454 : INFO : saved /tmp/tmpuvMZHLgensim_temp


In [18]:
new_model = gensim.models.Word2Vec.load(temp_path)  # open the model

2017-12-21 11:11:55,459 : INFO : loading Word2Vec object from /tmp/tmpuvMZHLgensim_temp
2017-12-21 11:11:55,466 : INFO : loading wv recursively from /tmp/tmpuvMZHLgensim_temp.wv.* with mmap=None
2017-12-21 11:11:55,467 : INFO : setting ignored attribute syn0norm to None
2017-12-21 11:11:55,468 : INFO : setting ignored attribute cum_table to None
2017-12-21 11:11:55,469 : INFO : loaded /tmp/tmpuvMZHLgensim_temp


These methods use pickle internally, optionally `mmap`’ing the model’s large internal NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

In addition, you can load models created by the original C tool, both using its text and binary formats:
```
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
  # using gzipped/bz2 input works too, no need to unzip:
  model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
```

## Online training / Resuming training
Advanced users can load a model and continue training it with more sentences and [new vocabulary words](online_w2v_tutorial.ipynb):

In [19]:
model = gensim.models.Word2Vec.load(temp_path)
more_sentences = [['Advanced', 'users', 'can', 'load', 'a', 'model', 'and', 'continue', 'training', 'it', 'with', 'more', 'sentences']]
model.build_vocab(more_sentences, update=True)
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.iter)

# cleaning up temp
os.close(fs)
os.remove(temp_path)

2017-12-21 11:11:55,480 : INFO : loading Word2Vec object from /tmp/tmpuvMZHLgensim_temp
2017-12-21 11:11:55,489 : INFO : loading wv recursively from /tmp/tmpuvMZHLgensim_temp.wv.* with mmap=None
2017-12-21 11:11:55,490 : INFO : setting ignored attribute syn0norm to None
2017-12-21 11:11:55,491 : INFO : setting ignored attribute cum_table to None
2017-12-21 11:11:55,491 : INFO : loaded /tmp/tmpuvMZHLgensim_temp
2017-12-21 11:11:55,497 : INFO : collecting all words and their counts
2017-12-21 11:11:55,498 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:55,499 : INFO : collected 13 word types from a corpus of 13 raw words and 1 sentences
2017-12-21 11:11:55,500 : INFO : Updating model with new vocabulary
2017-12-21 11:11:55,501 : INFO : New added 0 unique words (0% of original 13) and increased the count of 0 pre-existing words (0% of original 13)
2017-12-21 11:11:55,502 : INFO : deleting the raw counts dictionary of 13 items
2017-12-21 11:11:55

You may need to tweak the `total_words` parameter to `train()`, depending on what learning rate decay you want to simulate.
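
To see why the word count matters, recall that the learning rate is lowered as training progresses. The sketch below assumes a linear decay from a starting rate `alpha` to a floor `min_alpha`, with progress measured as the fraction of expected words already processed. The function name `decayed_alpha` and the default values shown are assumptions for illustration, so check them against your gensim version.

```python
def decayed_alpha(words_done, total_words, alpha=0.025, min_alpha=0.0001):
    """Linearly interpolate the learning rate between alpha and min_alpha."""
    progress = min(1.0, float(words_done) / total_words)
    return alpha - (alpha - min_alpha) * progress

# Declaring a larger total makes the rate fall more slowly for the same data.
for total in (1000, 10000):
    print(total, decayed_alpha(500, total))
```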

Note that it’s not possible to resume training with models generated by the C tool and loaded with `KeyedVectors.load_word2vec_format()`. You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

## Using the model
`Word2Vec` supports several word similarity tasks out of the box:

In [20]:
model.most_similar(positive=['human', 'crime'], negative=['party'], topn=1)

2017-12-21 11:11:55,526 : INFO : precomputing L2-norms of word weight vectors


[('ensure', 0.991725504398346)]

In [21]:
model.doesnt_match("input is lunch he sentence cat".split())


'sentence'
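
An odd-one-out query like `doesnt_match` can be pictured as picking the word least similar to the average of the group. The sketch below does this with invented two-dimensional vectors. The helper names and vectors are hypothetical, and the real implementation may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector has the smallest dot product with the mean."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(units[words[0]])
    mean = [sum(units[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical vectors: three meal words point one way, 'car' points another.
toy = {
    'breakfast': [0.9, 0.1],
    'lunch':     [0.8, 0.2],
    'dinner':    [0.85, 0.15],
    'car':       [0.1, 0.9],
}
print(odd_one_out(['breakfast', 'lunch', 'dinner', 'car'], toy))
```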

In [22]:
print(model.similarity('human', 'party'))
print(model.similarity('tree', 'murder'))

0.999161631577
0.995816511186


You can get the probability distribution for the center word given the context words as input:

In [23]:
print(model.predict_output_word(['emergency', 'beacon', 'received']))

[('more', 0.0010627158), ('continue', 0.00092760776), ('can', 0.00091921422), ('training', 0.0009017812), ('it', 0.00077213935), ('australia', 0.00077146909), ('could', 0.00075169088), ('there', 0.00074797159), ('say', 0.00073896395), ('or', 0.0007375718)]


The results here don't look good because the training corpus is very small. To get meaningful results one needs to train on 500k+ words.

If you need the raw output vectors in your application, you can access these either on a word-by-word basis:

In [24]:
model['tree']  # raw NumPy vector of a word


array([  5.35218790e-03,   2.29746513e-02,  -2.94440538e-02,
        -7.62169762e-03,  -3.33879478e-02,  -3.95843573e-02,
         9.22061410e-03,  -9.42539349e-02,  -1.52402772e-02,
        -2.51969919e-02,   2.66638417e-02,  -2.49711890e-02,
        -6.93099424e-02,  -2.20992770e-02,  -2.18875315e-02,
         4.75446582e-02,  -3.00612282e-02,  -1.62413158e-03,
         2.24862937e-02,  -2.83395667e-02,  -6.89103603e-02,
        -6.27040397e-03,   7.02168345e-02,   5.15035242e-02,
         2.38279793e-02,   1.46166543e-02,  -6.40407996e-03,
         1.87507458e-02,  -1.70347784e-02,  -5.62083966e-04,
        -8.30001954e-05,   7.45436326e-02,   9.68717784e-03,
        -4.92604543e-03,   2.64829025e-02,   2.25560013e-02,
        -4.85874750e-02,   1.70466714e-02,   1.34616625e-02,
        -6.92491233e-02,  -8.00219327e-02,   5.77354897e-03,
        -4.43342701e-02,  -2.96465941e-02,   3.13350894e-02,
        -5.80306426e-02,   3.97544727e-02,   3.16670653e-03,
        -1.56737454e-02,

…or en-masse as a 2D NumPy matrix from `model.wv.syn0`.

## Training Loss Computation

The parameter `compute_loss` can be used to toggle computation of loss while training the Word2Vec model. The computed loss is stored in the model attribute `running_training_loss` and can be retrieved using the function `get_latest_training_loss` as follows:

In [25]:
# instantiating and training the Word2Vec model
model_with_loss = gensim.models.Word2Vec(sentences, min_count=1, compute_loss=True, hs=0, sg=1, seed=42)

# getting the training loss value
training_loss = model_with_loss.get_latest_training_loss()
print(training_loss)

2017-12-21 11:11:55,571 : INFO : collecting all words and their counts
2017-12-21 11:11:55,572 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:55,590 : INFO : collected 10186 word types from a corpus of 59890 raw words and 300 sentences
2017-12-21 11:11:55,591 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:55,623 : INFO : min_count=1 retains 10186 unique words (100% of original 10186, drops 0)
2017-12-21 11:11:55,625 : INFO : min_count=1 leaves 59890 word corpus (100% of original 59890, drops 0)
2017-12-21 11:11:55,662 : INFO : deleting the raw counts dictionary of 10186 items
2017-12-21 11:11:55,663 : INFO : sample=0.001 downsamples 37 most-common words
2017-12-21 11:11:55,664 : INFO : downsampling leaves estimated 47231 word corpus (78.9% of prior 59890)
2017-12-21 11:11:55,664 : INFO : estimated required memory for 10186 words and 100 dimensions: 13241800 bytes
2017-12-21 11:11:55,688 : INFO : resetting layer weights
2017-12-21 11:11

1644486.625


### Benchmarks to see the effect of training loss computation code on training time

We first download and set up the test data used for the benchmarks.

In [26]:
input_data_files = []

def setup_input_data():
    # check if test data already present
    if not os.path.isfile('./text8'):

        # download and decompress 'text8' corpus
        ! wget 'http://mattmahoney.net/dc/text8.zip'
        ! unzip 'text8.zip'
    
        # create 1 MB, 10 MB and 50 MB files
        ! head -c1000000 text8 > text8_1000000
        ! head -c10000000 text8 > text8_10000000
        ! head -c50000000 text8 > text8_50000000
                
    # add 25 KB test file
    input_data_files.append(os.path.join(os.getcwd(), '../../gensim/test/test_data/lee_background.cor'))

    # add 1 MB test file
    input_data_files.append(os.path.join(os.getcwd(), 'text8_1000000'))

    # add 10 MB test file
    input_data_files.append(os.path.join(os.getcwd(), 'text8_10000000'))

    # add 50 MB test file
    input_data_files.append(os.path.join(os.getcwd(), 'text8_50000000'))

    # add 100 MB test file
    input_data_files.append(os.path.join(os.getcwd(), 'text8'))

setup_input_data()
print(input_data_files)

['/home/markroxor/Documents/gensim/docs/notebooks/../../gensim/test/test_data/lee_background.cor', '/home/markroxor/Documents/gensim/docs/notebooks/text8_1000000', '/home/markroxor/Documents/gensim/docs/notebooks/text8_10000000', '/home/markroxor/Documents/gensim/docs/notebooks/text8_50000000', '/home/markroxor/Documents/gensim/docs/notebooks/text8']


We now compare the training time taken for different combinations of input data and model training parameters like `hs` and `sg`.

In [27]:
# use only the 25 KB and 50 MB files to generate the output; comment out the next line to use all 5 test files
input_data_files = [input_data_files[0], input_data_files[-2]]
print(input_data_files)

import time
import numpy as np
import pandas as pd

train_time_values = []
seed_val = 42
sg_values = [0, 1]
hs_values = [0, 1]

for data_file in input_data_files:
    data = gensim.models.word2vec.LineSentence(data_file) 
    for sg_val in sg_values:
        for hs_val in hs_values:
            for loss_flag in [True, False]:
                time_taken_list = []
                for i in range(3):
                    start_time = time.time()
                    w2v_model = gensim.models.Word2Vec(data, compute_loss=loss_flag, sg=sg_val, hs=hs_val, seed=seed_val) 
                    time_taken_list.append(time.time() - start_time)

                time_taken_list = np.array(time_taken_list)
                time_mean = np.mean(time_taken_list)
                time_std = np.std(time_taken_list)
                train_time_values.append({'train_data': data_file, 'compute_loss': loss_flag, 'sg': sg_val, 'hs': hs_val, 'mean': time_mean, 'std': time_std})

train_times_table = pd.DataFrame(train_time_values)
train_times_table = train_times_table.sort_values(by=['train_data', 'sg', 'hs', 'compute_loss'], ascending=[False, False, True, False])
print(train_times_table)

['/home/markroxor/Documents/gensim/docs/notebooks/../../gensim/test/test_data/lee_background.cor', '/home/markroxor/Documents/gensim/docs/notebooks/text8_50000000']


2017-12-21 11:11:56,682 : INFO : collecting all words and their counts
2017-12-21 11:11:56,684 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:11:56,708 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2017-12-21 11:11:56,709 : INFO : Loading a fresh vocabulary
2017-12-21 11:11:56,716 : INFO : min_count=5 retains 1762 unique words (16% of original 10781, drops 9019)
2017-12-21 11:11:56,717 : INFO : min_count=5 leaves 46084 word corpus (76% of original 59890, drops 13806)
2017-12-21 11:11:56,724 : INFO : deleting the raw counts dictionary of 10781 items
2017-12-21 11:11:56,725 : INFO : sample=0.001 downsamples 45 most-common words
2017-12-21 11:11:56,726 : INFO : downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
2017-12-21 11:11:56,727 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
2017-12-21 11:11:56,734 : INFO : resetting layer weights
2017-12-21 11:

2017-12-21 11:11:58,032 : INFO : deleting the raw counts dictionary of 10781 items
2017-12-21 11:11:58,033 : INFO : sample=0.001 downsamples 45 most-common words
2017-12-21 11:11:58,033 : INFO : downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
2017-12-21 11:11:58,034 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
2017-12-21 11:11:58,039 : INFO : resetting layer weights
2017-12-21 11:11:58,059 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-12-21 11:11:58,231 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:11:58,234 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:11:58,242 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:11:58,243 : INFO : training on 299450 raw words (162889 effective words) took 0.2s, 897697 effective words/s
2017-12-21 11:11:58,244 

2017-12-21 11:11:59,970 : INFO : deleting the raw counts dictionary of 10781 items
2017-12-21 11:11:59,971 : INFO : sample=0.001 downsamples 45 most-common words
2017-12-21 11:11:59,972 : INFO : downsampling leaves estimated 32610 word corpus (70.8% of prior 46084)
2017-12-21 11:11:59,972 : INFO : estimated required memory for 1762 words and 100 dimensions: 3347800 bytes
2017-12-21 11:11:59,974 : INFO : constructing a huffman tree from 1762 words
2017-12-21 11:12:00,017 : INFO : built huffman tree with maximum node depth 13
2017-12-21 11:12:00,025 : INFO : resetting layer weights
2017-12-21 11:12:00,052 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2017-12-21 11:12:00,351 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:12:00,354 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:12:00,359 : INFO : worker thread finished; awaiting finish of 0 more

2017-12-21 11:12:02,405 : INFO : estimated required memory for 1762 words and 100 dimensions: 2290600 bytes
2017-12-21 11:12:02,415 : INFO : resetting layer weights
2017-12-21 11:12:02,433 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=0 sample=0.001 negative=5 window=5
2017-12-21 11:12:02,869 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:12:02,871 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:12:02,888 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:12:02,889 : INFO : training on 299450 raw words (162889 effective words) took 0.5s, 358088 effective words/s
2017-12-21 11:12:02,892 : INFO : collecting all words and their counts
2017-12-21 11:12:02,894 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:12:02,922 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences

2017-12-21 11:12:06,179 : INFO : resetting layer weights
2017-12-21 11:12:06,199 : INFO : training model with 3 workers on 1762 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2017-12-21 11:12:07,061 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:12:07,070 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:12:07,071 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:12:07,072 : INFO : training on 299450 raw words (162889 effective words) took 0.9s, 187287 effective words/s
2017-12-21 11:12:07,075 : INFO : collecting all words and their counts
2017-12-21 11:12:07,076 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:12:07,100 : INFO : collected 10781 word types from a corpus of 59890 raw words and 300 sentences
2017-12-21 11:12:07,101 : INFO : Loading a fresh vocabulary
2017-12-21 11:12:07,155 : INFO : min_count=5 retains 1762

2017-12-21 11:12:28,851 : INFO : PROGRESS: at 43.58% examples, 896871 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:12:29,851 : INFO : PROGRESS: at 46.88% examples, 904450 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:12:30,852 : INFO : PROGRESS: at 50.27% examples, 913241 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:12:31,867 : INFO : PROGRESS: at 53.33% examples, 915275 words/s, in_qsize 5, out_qsize 2
2017-12-21 11:12:32,867 : INFO : PROGRESS: at 56.44% examples, 917800 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:12:33,884 : INFO : PROGRESS: at 59.69% examples, 921684 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:12:34,884 : INFO : PROGRESS: at 60.73% examples, 893121 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:12:35,890 : INFO : PROGRESS: at 63.89% examples, 896194 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:12:36,891 : INFO : PROGRESS: at 67.21% examples, 901841 words/s, in_qsize 6, out_qsize 2
2017-12-21 11:12:37,892 : INFO : PROGRESS: at 70.53% examples, 907304 wor

2017-12-21 11:13:28,708 : INFO : resetting layer weights
2017-12-21 11:13:29,149 : INFO : training model with 3 workers on 48753 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-12-21 11:13:30,152 : INFO : PROGRESS: at 1.39% examples, 432212 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:13:31,157 : INFO : PROGRESS: at 4.95% examples, 757926 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:13:32,158 : INFO : PROGRESS: at 8.20% examples, 839986 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:13:33,162 : INFO : PROGRESS: at 11.71% examples, 903246 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:13:34,164 : INFO : PROGRESS: at 15.15% examples, 937525 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:13:35,172 : INFO : PROGRESS: at 18.70% examples, 964183 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:13:36,251 : INFO : PROGRESS: at 20.00% examples, 873858 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:13:37,255 : INFO : PROGRESS: at 23.25% examples, 888901 words

2017-12-21 11:14:38,334 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:14:38,337 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:14:38,338 : INFO : training on 42426485 raw words (31023876 effective words) took 33.0s, 940769 effective words/s
2017-12-21 11:14:38,365 : INFO : collecting all words and their counts
2017-12-21 11:14:38,952 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:14:40,895 : INFO : collected 171140 word types from a corpus of 8485297 raw words and 849 sentences
2017-12-21 11:14:40,897 : INFO : Loading a fresh vocabulary
2017-12-21 11:14:41,182 : INFO : min_count=5 retains 48753 unique words (28% of original 171140, drops 122387)
2017-12-21 11:14:41,183 : INFO : min_count=5 leaves 8292974 word corpus (97% of original 8485297, drops 192323)
2017-12-21 11:14:41,332 : INFO : deleting the raw counts dictionary of 171140 items
2017-12-21 11:14:41,365 : INFO : sample

2017-12-21 11:15:39,622 : INFO : PROGRESS: at 57.08% examples, 906395 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:15:41,149 : INFO : PROGRESS: at 59.98% examples, 883908 words/s, in_qsize 1, out_qsize 0
2017-12-21 11:15:42,144 : INFO : PROGRESS: at 63.49% examples, 892325 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:15:43,148 : INFO : PROGRESS: at 67.09% examples, 901735 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:15:44,154 : INFO : PROGRESS: at 70.86% examples, 912857 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:15:45,162 : INFO : PROGRESS: at 74.39% examples, 920515 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:15:46,181 : INFO : PROGRESS: at 78.14% examples, 929221 words/s, in_qsize 6, out_qsize 2
2017-12-21 11:15:47,383 : INFO : PROGRESS: at 79.98% examples, 909130 words/s, in_qsize 2, out_qsize 0
2017-12-21 11:15:48,381 : INFO : PROGRESS: at 83.56% examples, 915717 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:15:49,398 : INFO : PROGRESS: at 87.28% examples, 923329 wor

2017-12-21 11:16:49,700 : INFO : PROGRESS: at 88.36% examples, 527565 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:16:50,714 : INFO : PROGRESS: at 90.20% examples, 528305 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:16:51,721 : INFO : PROGRESS: at 92.06% examples, 529341 words/s, in_qsize 4, out_qsize 1
2017-12-21 11:16:52,721 : INFO : PROGRESS: at 93.88% examples, 529983 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:16:53,763 : INFO : PROGRESS: at 95.59% examples, 529720 words/s, in_qsize 6, out_qsize 2
2017-12-21 11:16:54,763 : INFO : PROGRESS: at 97.27% examples, 529551 words/s, in_qsize 5, out_qsize 2
2017-12-21 11:16:55,769 : INFO : PROGRESS: at 99.20% examples, 530714 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:16:56,331 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:16:56,336 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:16:56,345 : INFO : worker thread finished; awaiting finish of 0 more threads

2017-12-21 11:17:59,661 : INFO : PROGRESS: at 98.99% examples, 525674 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:18:00,368 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:18:00,382 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:18:00,384 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:18:00,385 : INFO : training on 42426485 raw words (31022331 effective words) took 59.1s, 524505 effective words/s
2017-12-21 11:18:00,439 : INFO : collecting all words and their counts
2017-12-21 11:18:00,978 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:18:02,794 : INFO : collected 171140 word types from a corpus of 8485297 raw words and 849 sentences
2017-12-21 11:18:02,796 : INFO : Loading a fresh vocabulary
2017-12-21 11:18:02,980 : INFO : min_count=5 retains 48753 unique words (28% of original 171140, drops 122387)
2017-12-21 11:18:02,981 : INFO : min_coun

2017-12-21 11:19:05,359 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:19:07,128 : INFO : collected 171140 word types from a corpus of 8485297 raw words and 849 sentences
2017-12-21 11:19:07,130 : INFO : Loading a fresh vocabulary
2017-12-21 11:19:07,393 : INFO : min_count=5 retains 48753 unique words (28% of original 171140, drops 122387)
2017-12-21 11:19:07,394 : INFO : min_count=5 leaves 8292974 word corpus (97% of original 8485297, drops 192323)
2017-12-21 11:19:07,546 : INFO : deleting the raw counts dictionary of 171140 items
2017-12-21 11:19:07,578 : INFO : sample=0.001 downsamples 38 most-common words
2017-12-21 11:19:07,580 : INFO : downsampling leaves estimated 6205111 word corpus (74.8% of prior 8292974)
2017-12-21 11:19:07,581 : INFO : estimated required memory for 48753 words and 100 dimensions: 92630700 bytes
2017-12-21 11:19:07,652 : INFO : constructing a huffman tree from 48753 words
2017-12-21 11:19:09,043 : INFO : built huffma

2017-12-21 11:20:09,884 : INFO : built huffman tree with maximum node depth 21
2017-12-21 11:20:10,004 : INFO : resetting layer weights
2017-12-21 11:20:10,472 : INFO : training model with 3 workers on 48753 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
2017-12-21 11:20:11,492 : INFO : PROGRESS: at 0.71% examples, 216977 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:20:12,496 : INFO : PROGRESS: at 2.47% examples, 376732 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:20:13,508 : INFO : PROGRESS: at 4.36% examples, 441223 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:20:14,520 : INFO : PROGRESS: at 5.91% examples, 449596 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:20:15,536 : INFO : PROGRESS: at 7.66% examples, 465960 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:20:16,555 : INFO : PROGRESS: at 9.33% examples, 473600 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:20:17,563 : INFO : PROGRESS: at 11.10% examples, 484200 words/s, in_qsize 6, out_qsize 1

2017-12-21 11:21:18,939 : INFO : PROGRESS: at 0.66% examples, 202646 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:21:19,939 : INFO : PROGRESS: at 2.59% examples, 394819 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:21:20,947 : INFO : PROGRESS: at 4.45% examples, 451566 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:21:21,958 : INFO : PROGRESS: at 6.34% examples, 483888 words/s, in_qsize 4, out_qsize 2
2017-12-21 11:21:22,967 : INFO : PROGRESS: at 8.24% examples, 503566 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:21:23,992 : INFO : PROGRESS: at 10.25% examples, 521599 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:21:24,999 : INFO : PROGRESS: at 12.11% examples, 529802 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:21:26,012 : INFO : PROGRESS: at 14.06% examples, 539681 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:21:27,014 : INFO : PROGRESS: at 16.02% examples, 546634 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:21:28,026 : INFO : PROGRESS: at 18.00% examples, 553156 words/s,

2017-12-21 11:22:29,118 : INFO : PROGRESS: at 12.39% examples, 316302 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:22:30,149 : INFO : PROGRESS: at 13.47% examples, 317309 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:22:31,178 : INFO : PROGRESS: at 14.58% examples, 318770 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:22:32,185 : INFO : PROGRESS: at 15.67% examples, 319667 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:22:33,203 : INFO : PROGRESS: at 16.80% examples, 321233 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:22:34,258 : INFO : PROGRESS: at 17.88% examples, 321155 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:22:35,277 : INFO : PROGRESS: at 18.96% examples, 321717 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:22:36,843 : INFO : PROGRESS: at 19.86% examples, 310518 words/s, in_qsize 5, out_qsize 2
2017-12-21 11:22:37,880 : INFO : PROGRESS: at 21.01% examples, 312019 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:22:38,888 : INFO : PROGRESS: at 22.19% examples, 314157 wor

2017-12-21 11:23:51,961 : INFO : PROGRESS: at 99.84% examples, 326144 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:23:52,032 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:23:52,058 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:23:52,076 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:23:52,077 : INFO : training on 42426485 raw words (31026044 effective words) took 95.1s, 326247 effective words/s
2017-12-21 11:23:52,130 : INFO : collecting all words and their counts
2017-12-21 11:23:52,669 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:23:54,415 : INFO : collected 171140 word types from a corpus of 8485297 raw words and 849 sentences
2017-12-21 11:23:54,417 : INFO : Loading a fresh vocabulary
2017-12-21 11:23:54,783 : INFO : min_count=5 retains 48753 unique words (28% of original 171140, drops 122387)
2017-12-21 11:23:54,784 : INFO : min_coun

2017-12-21 11:25:01,969 : INFO : PROGRESS: at 68.62% examples, 320701 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:25:02,998 : INFO : PROGRESS: at 69.78% examples, 321165 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:25:04,001 : INFO : PROGRESS: at 70.91% examples, 321609 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:25:05,024 : INFO : PROGRESS: at 72.04% examples, 322019 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:25:06,034 : INFO : PROGRESS: at 73.17% examples, 322407 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:25:07,036 : INFO : PROGRESS: at 74.30% examples, 322842 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:25:08,040 : INFO : PROGRESS: at 75.45% examples, 323305 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:25:09,052 : INFO : PROGRESS: at 76.56% examples, 323544 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:25:10,056 : INFO : PROGRESS: at 77.67% examples, 323788 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:25:11,088 : INFO : PROGRESS: at 78.80% examples, 324037 wor

2017-12-21 11:26:12,419 : INFO : PROGRESS: at 38.59% examples, 315339 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:26:13,420 : INFO : PROGRESS: at 39.67% examples, 315852 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:26:14,430 : INFO : PROGRESS: at 39.98% examples, 310256 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:26:15,451 : INFO : PROGRESS: at 41.04% examples, 310526 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:26:16,451 : INFO : PROGRESS: at 42.14% examples, 311209 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:26:17,466 : INFO : PROGRESS: at 43.25% examples, 311670 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:26:18,473 : INFO : PROGRESS: at 44.31% examples, 311937 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:26:19,479 : INFO : PROGRESS: at 45.42% examples, 312571 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:26:20,493 : INFO : PROGRESS: at 46.53% examples, 313101 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:26:21,501 : INFO : PROGRESS: at 47.61% examples, 313543 wor

2017-12-21 11:27:22,377 : INFO : PROGRESS: at 8.60% examples, 327480 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:27:23,423 : INFO : PROGRESS: at 9.71% examples, 327981 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:27:24,458 : INFO : PROGRESS: at 10.77% examples, 327083 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:27:25,463 : INFO : PROGRESS: at 11.85% examples, 328113 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:27:26,480 : INFO : PROGRESS: at 13.00% examples, 330364 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:27:27,503 : INFO : PROGRESS: at 14.13% examples, 331626 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:27:28,504 : INFO : PROGRESS: at 15.24% examples, 332548 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:27:29,512 : INFO : PROGRESS: at 16.35% examples, 333122 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:27:30,515 : INFO : PROGRESS: at 17.48% examples, 334081 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:27:31,535 : INFO : PROGRESS: at 18.61% examples, 334720 words

2017-12-21 11:28:44,023 : INFO : PROGRESS: at 96.00% examples, 331868 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:28:45,039 : INFO : PROGRESS: at 97.13% examples, 332028 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:28:46,043 : INFO : PROGRESS: at 98.28% examples, 332302 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:28:47,055 : INFO : PROGRESS: at 99.43% examples, 332544 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:28:47,682 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:28:47,694 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:28:47,708 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:28:47,709 : INFO : training on 42426485 raw words (31023288 effective words) took 93.4s, 332058 effective words/s
2017-12-21 11:28:47,736 : INFO : collecting all words and their counts
2017-12-21 11:28:48,287 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types

2017-12-21 11:29:54,828 : INFO : PROGRESS: at 66.64% examples, 324014 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:29:55,834 : INFO : PROGRESS: at 67.77% examples, 324373 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:29:56,849 : INFO : PROGRESS: at 68.88% examples, 324586 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:29:57,867 : INFO : PROGRESS: at 70.04% examples, 325042 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:29:58,887 : INFO : PROGRESS: at 71.10% examples, 325072 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:29:59,892 : INFO : PROGRESS: at 72.23% examples, 325474 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:30:00,907 : INFO : PROGRESS: at 73.38% examples, 325921 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:30:01,930 : INFO : PROGRESS: at 74.49% examples, 326118 words/s, in_qsize 4, out_qsize 1
2017-12-21 11:30:02,941 : INFO : PROGRESS: at 75.62% examples, 326423 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:30:03,942 : INFO : PROGRESS: at 76.77% examples, 326829 wor

2017-12-21 11:31:05,368 : INFO : PROGRESS: at 38.28% examples, 332178 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:31:06,389 : INFO : PROGRESS: at 39.46% examples, 332901 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:31:07,460 : INFO : PROGRESS: at 39.84% examples, 326664 words/s, in_qsize 3, out_qsize 2
2017-12-21 11:31:08,453 : INFO : PROGRESS: at 40.97% examples, 327218 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:31:09,456 : INFO : PROGRESS: at 42.05% examples, 327336 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:31:10,458 : INFO : PROGRESS: at 43.23% examples, 328065 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:31:11,494 : INFO : PROGRESS: at 44.41% examples, 328539 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:31:12,502 : INFO : PROGRESS: at 45.56% examples, 329180 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:31:13,510 : INFO : PROGRESS: at 46.74% examples, 329890 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:31:14,515 : INFO : PROGRESS: at 47.89% examples, 330458 wor

2017-12-21 11:32:15,579 : INFO : PROGRESS: at 4.24% examples, 158354 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:16,582 : INFO : PROGRESS: at 4.83% examples, 160764 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:17,602 : INFO : PROGRESS: at 5.37% examples, 161112 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:18,603 : INFO : PROGRESS: at 5.91% examples, 161607 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:19,608 : INFO : PROGRESS: at 6.45% examples, 162051 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:20,612 : INFO : PROGRESS: at 7.02% examples, 163026 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:21,641 : INFO : PROGRESS: at 7.61% examples, 164000 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:22,665 : INFO : PROGRESS: at 8.17% examples, 164422 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:32:23,681 : INFO : PROGRESS: at 8.74% examples, 164995 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:32:24,736 : INFO : PROGRESS: at 9.26% examples, 164340 words/s, in_q

2017-12-21 11:33:38,269 : INFO : PROGRESS: at 48.93% examples, 166760 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:33:39,317 : INFO : PROGRESS: at 49.49% examples, 166815 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:33:40,325 : INFO : PROGRESS: at 50.04% examples, 166794 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:33:41,338 : INFO : PROGRESS: at 50.60% examples, 166886 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:33:42,338 : INFO : PROGRESS: at 51.17% examples, 167005 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:33:43,375 : INFO : PROGRESS: at 51.71% examples, 166961 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:33:44,377 : INFO : PROGRESS: at 52.27% examples, 167070 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:33:45,401 : INFO : PROGRESS: at 52.84% examples, 167140 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:33:46,411 : INFO : PROGRESS: at 53.40% examples, 167215 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:33:47,427 : INFO : PROGRESS: at 53.97% examples, 167293 wor

2017-12-21 11:35:01,161 : INFO : PROGRESS: at 93.29% examples, 166499 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:35:02,205 : INFO : PROGRESS: at 93.83% examples, 166477 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:35:03,260 : INFO : PROGRESS: at 94.39% examples, 166485 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:35:04,370 : INFO : PROGRESS: at 95.03% examples, 166554 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:35:05,443 : INFO : PROGRESS: at 95.59% examples, 166533 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:35:06,456 : INFO : PROGRESS: at 96.18% examples, 166612 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:35:07,464 : INFO : PROGRESS: at 96.68% examples, 166530 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:35:08,475 : INFO : PROGRESS: at 97.24% examples, 166582 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:35:09,523 : INFO : PROGRESS: at 97.83% examples, 166619 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:35:10,635 : INFO : PROGRESS: at 98.42% examples, 166607 wor

2017-12-21 11:36:13,225 : INFO : PROGRESS: at 29.78% examples, 167995 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:36:14,255 : INFO : PROGRESS: at 30.34% examples, 168059 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:36:15,308 : INFO : PROGRESS: at 30.91% examples, 168044 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:36:16,340 : INFO : PROGRESS: at 31.47% examples, 168116 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:36:17,357 : INFO : PROGRESS: at 32.04% examples, 168250 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:36:18,406 : INFO : PROGRESS: at 32.60% examples, 168244 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:36:19,432 : INFO : PROGRESS: at 33.17% examples, 168321 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:36:20,461 : INFO : PROGRESS: at 33.73% examples, 168381 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:36:21,482 : INFO : PROGRESS: at 34.30% examples, 168464 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:36:22,504 : INFO : PROGRESS: at 34.89% examples, 168643 wor

2017-12-21 11:37:35,608 : INFO : PROGRESS: at 74.75% examples, 168897 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:37:36,625 : INFO : PROGRESS: at 75.34% examples, 168966 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:37:37,634 : INFO : PROGRESS: at 75.90% examples, 169006 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:37:38,646 : INFO : PROGRESS: at 76.49% examples, 169085 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:37:39,695 : INFO : PROGRESS: at 77.01% examples, 168977 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:37:40,725 : INFO : PROGRESS: at 77.57% examples, 168984 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:37:41,790 : INFO : PROGRESS: at 78.14% examples, 168961 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:37:42,807 : INFO : PROGRESS: at 78.73% examples, 169043 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:37:43,824 : INFO : PROGRESS: at 79.32% examples, 169121 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:37:45,356 : INFO : PROGRESS: at 79.84% examples, 168462 wor

2017-12-21 11:38:46,469 : INFO : PROGRESS: at 11.33% examples, 170745 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:38:47,477 : INFO : PROGRESS: at 11.87% examples, 170680 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:38:48,549 : INFO : PROGRESS: at 12.46% examples, 170737 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:38:49,565 : INFO : PROGRESS: at 13.03% examples, 170881 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:38:50,636 : INFO : PROGRESS: at 13.55% examples, 170047 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:38:51,655 : INFO : PROGRESS: at 14.11% examples, 170189 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:38:52,659 : INFO : PROGRESS: at 14.70% examples, 170685 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:38:53,689 : INFO : PROGRESS: at 15.24% examples, 170425 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:38:54,731 : INFO : PROGRESS: at 15.81% examples, 170361 words/s, in_qsize 5, out_qsize 2
2017-12-21 11:38:55,736 : INFO : PROGRESS: at 16.42% examples, 170930 wor

2017-12-21 11:40:08,998 : INFO : PROGRESS: at 56.00% examples, 168551 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:40:10,006 : INFO : PROGRESS: at 56.51% examples, 168453 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:40:11,073 : INFO : PROGRESS: at 57.06% examples, 168358 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:40:12,142 : INFO : PROGRESS: at 57.62% examples, 168306 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:40:13,209 : INFO : PROGRESS: at 58.14% examples, 168143 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:40:14,304 : INFO : PROGRESS: at 58.70% examples, 168071 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:40:15,305 : INFO : PROGRESS: at 59.27% examples, 168142 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:40:16,397 : INFO : PROGRESS: at 59.79% examples, 167947 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:40:17,394 : INFO : PROGRESS: at 60.07% examples, 167195 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:40:18,430 : INFO : PROGRESS: at 60.64% examples, 167221 wor

2017-12-21 11:41:31,067 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-12-21 11:41:31,125 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-12-21 11:41:31,166 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-12-21 11:41:31,168 : INFO : training on 42426485 raw words (31021933 effective words) took 185.2s, 167479 effective words/s
2017-12-21 11:41:31,216 : INFO : collecting all words and their counts
2017-12-21 11:41:31,784 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-12-21 11:41:33,595 : INFO : collected 171140 word types from a corpus of 8485297 raw words and 849 sentences
2017-12-21 11:41:33,596 : INFO : Loading a fresh vocabulary
2017-12-21 11:41:33,852 : INFO : min_count=5 retains 48753 unique words (28% of original 171140, drops 122387)
2017-12-21 11:41:33,853 : INFO : min_count=5 leaves 8292974 word corpus (97% of original 8485297, drops 192323)

2017-12-21 11:42:43,750 : INFO : PROGRESS: at 36.00% examples, 165024 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:42:44,766 : INFO : PROGRESS: at 36.58% examples, 165232 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:42:45,845 : INFO : PROGRESS: at 37.17% examples, 165325 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:42:46,851 : INFO : PROGRESS: at 37.74% examples, 165435 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:42:47,881 : INFO : PROGRESS: at 38.33% examples, 165619 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:42:48,961 : INFO : PROGRESS: at 38.94% examples, 165782 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:42:50,012 : INFO : PROGRESS: at 39.51% examples, 165806 words/s, in_qsize 5, out_qsize 2
2017-12-21 11:42:51,261 : INFO : PROGRESS: at 39.84% examples, 164467 words/s, in_qsize 3, out_qsize 2
2017-12-21 11:42:52,243 : INFO : PROGRESS: at 40.40% examples, 164571 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:42:53,249 : INFO : PROGRESS: at 40.97% examples, 164699 wor

2017-12-21 11:44:06,623 : INFO : PROGRESS: at 81.79% examples, 168547 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:44:07,640 : INFO : PROGRESS: at 82.36% examples, 168546 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:44:08,732 : INFO : PROGRESS: at 82.94% examples, 168508 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:44:09,754 : INFO : PROGRESS: at 83.53% examples, 168567 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:44:10,778 : INFO : PROGRESS: at 84.12% examples, 168622 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:44:11,783 : INFO : PROGRESS: at 84.69% examples, 168648 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:44:12,794 : INFO : PROGRESS: at 85.23% examples, 168633 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:44:13,817 : INFO : PROGRESS: at 85.80% examples, 168647 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:44:14,891 : INFO : PROGRESS: at 86.36% examples, 168608 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:44:15,921 : INFO : PROGRESS: at 86.93% examples, 168620 wor

2017-12-21 11:45:18,783 : INFO : PROGRESS: at 16.61% examples, 166031 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:45:19,833 : INFO : PROGRESS: at 17.15% examples, 165896 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:45:20,863 : INFO : PROGRESS: at 17.71% examples, 166005 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:45:21,875 : INFO : PROGRESS: at 18.28% examples, 166255 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:45:22,920 : INFO : PROGRESS: at 18.87% examples, 166528 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:45:23,925 : INFO : PROGRESS: at 19.43% examples, 166755 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:45:25,244 : INFO : PROGRESS: at 19.84% examples, 164225 words/s, in_qsize 6, out_qsize 2
2017-12-21 11:45:26,298 : INFO : PROGRESS: at 20.47% examples, 164797 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:45:27,361 : INFO : PROGRESS: at 21.04% examples, 164813 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:45:28,370 : INFO : PROGRESS: at 21.60% examples, 164993 wor

2017-12-21 11:46:41,440 : INFO : PROGRESS: at 61.65% examples, 168239 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:46:42,446 : INFO : PROGRESS: at 62.17% examples, 168144 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:46:43,472 : INFO : PROGRESS: at 62.69% examples, 167987 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:46:44,574 : INFO : PROGRESS: at 63.20% examples, 167776 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:46:45,575 : INFO : PROGRESS: at 63.77% examples, 167825 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:46:46,575 : INFO : PROGRESS: at 64.36% examples, 167923 words/s, in_qsize 5, out_qsize 1
2017-12-21 11:46:47,582 : INFO : PROGRESS: at 64.90% examples, 167914 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:46:48,586 : INFO : PROGRESS: at 65.49% examples, 168028 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:46:49,613 : INFO : PROGRESS: at 66.08% examples, 168094 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:46:50,665 : INFO : PROGRESS: at 66.64% examples, 168083 wor

2017-12-21 11:47:54,670 : INFO : estimated required memory for 48753 words and 100 dimensions: 92630700 bytes
2017-12-21 11:47:54,750 : INFO : constructing a huffman tree from 48753 words
2017-12-21 11:47:56,225 : INFO : built huffman tree with maximum node depth 21
2017-12-21 11:47:56,349 : INFO : resetting layer weights
2017-12-21 11:47:56,798 : INFO : training model with 3 workers on 48753 vocabulary and 100 features, using sg=1 hs=1 sample=0.001 negative=5 window=5
2017-12-21 11:47:57,835 : INFO : PROGRESS: at 0.16% examples, 50262 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:47:58,854 : INFO : PROGRESS: at 0.68% examples, 103881 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:47:59,909 : INFO : PROGRESS: at 1.22% examples, 123038 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:48:00,928 : INFO : PROGRESS: at 1.79% examples, 134575 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:48:02,003 : INFO : PROGRESS: at 2.36% examples, 139702 words/s, in_qsize 5, out_qsize 0

2017-12-21 11:49:16,760 : INFO : PROGRESS: at 42.99% examples, 166684 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:17,802 : INFO : PROGRESS: at 43.53% examples, 166601 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:18,807 : INFO : PROGRESS: at 44.00% examples, 166325 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:19,825 : INFO : PROGRESS: at 44.59% examples, 166455 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:20,865 : INFO : PROGRESS: at 45.18% examples, 166572 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:21,887 : INFO : PROGRESS: at 45.77% examples, 166708 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:22,910 : INFO : PROGRESS: at 46.38% examples, 166924 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:23,952 : INFO : PROGRESS: at 46.93% examples, 166855 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:25,012 : INFO : PROGRESS: at 47.49% examples, 166841 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:49:26,059 : INFO : PROGRESS: at 48.08% examples, 166912 wor

2017-12-21 11:50:39,642 : INFO : PROGRESS: at 88.74% examples, 168962 words/s, in_qsize 6, out_qsize 1
2017-12-21 11:50:40,695 : INFO : PROGRESS: at 89.35% examples, 169050 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:50:41,724 : INFO : PROGRESS: at 89.92% examples, 169060 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:50:42,735 : INFO : PROGRESS: at 90.48% examples, 169100 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:50:43,736 : INFO : PROGRESS: at 91.07% examples, 169194 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:50:44,738 : INFO : PROGRESS: at 91.66% examples, 169279 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:50:45,796 : INFO : PROGRESS: at 92.25% examples, 169315 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:50:46,810 : INFO : PROGRESS: at 92.82% examples, 169345 words/s, in_qsize 5, out_qsize 0
2017-12-21 11:50:47,906 : INFO : PROGRESS: at 93.45% examples, 169427 words/s, in_qsize 6, out_qsize 0
2017-12-21 11:50:48,957 : INFO : PROGRESS: at 94.04% examples, 169463 wor

    compute_loss  hs        mean  sg       std  \
12          True   0   99.157053   1  0.471764   
13         False   0   97.210701   1  0.778177   
14          True   1  189.541087   1  1.284753   
15         False   1  189.569699   1  1.475226   
8           True   0   37.264525   0  0.843278   
9          False   0   37.029671   0  0.868878   
10          True   1   63.923994   0  0.418183   
11         False   1   62.899229   0  3.403489   
4           True   0    0.529135   1  0.016272   
5          False   0    0.529785   1  0.004869   
6           True   1    1.042990   1  0.047401   
7          False   1    1.020446   1  0.021715   
0           True   0    0.249932   0  0.005673   
1          False   0    0.270726   0  0.017647   
2           True   1    0.423162   0  0.004231   
3          False   1    0.417867   0  0.014918   

                                           train_data  
12  /home/markroxor/Documents/gensim/docs/notebook...  
13  /home/markroxor/Documents/gensim/

### Adding Word2Vec "model to dict" method to production pipeline
Suppose we want still more performance in production.
One good way to get it is to cache the similar words for every query word in a dictionary.
The next time the same query word comes in, we look it up in the dictionary first.
On a hit, we return the result directly from the dictionary.
Otherwise, we query the model and cache the result so that the next lookup for that word does not miss.

In [28]:
most_similars_precalc = {word : model.wv.most_similar(word) for word in model.wv.index2word}
print(most_similars_precalc)



### Comparison with and without caching

For the time being, let's take four words at random.

In [29]:
import time
words = ['voted','few','their','around']

Without caching

In [30]:
start = time.time()
for word in words:
    result = model.wv.most_similar(word)
    print(result)
end = time.time()
print(end-start)

[('action', 0.9988629221916199), ('could', 0.9988350868225098), ('would', 0.9988296031951904), ('need', 0.9988236427307129), ('it', 0.9988223314285278), ('will', 0.9988141059875488), ('expected', 0.9988051056861877), ('legal', 0.9988030791282654), ('we', 0.9987929463386536), ('is', 0.9987916946411133)]
[('australian', 0.999800443649292), ('from', 0.9997938275337219), ('which', 0.9997937679290771), ('police', 0.9997909069061279), ('a', 0.9997888207435608), ('be', 0.9997885227203369), ('told', 0.9997879266738892), ('if', 0.999785840511322), ('it', 0.9997842311859131), ('his', 0.9997828006744385)]
[('up', 0.9999565482139587), ('last', 0.9999539256095886), ('had', 0.9999518990516663), ('on', 0.9999512434005737), ('and', 0.9999508261680603), ('over', 0.99994957447052), ('an', 0.99994957447052), ('are', 0.9999495148658752), ('with', 0.9999477863311768), ('as', 0.9999470710754395)]
[('two', 0.9999292492866516), ('over', 0.9999260902404785), ('their', 0.999923586845398), ('three', 0.9999228119

Now with caching

In [31]:
start = time.time()
for word in words:
    if word in most_similars_precalc:
        # cache hit: read the precomputed result
        result = most_similars_precalc[word]
    else:
        # cache miss: query the model and remember the answer
        result = model.wv.most_similar(word)
        most_similars_precalc[word] = result
    print(result)
    
end = time.time()
print(end-start)

[('action', 0.9988629221916199), ('could', 0.9988350868225098), ('would', 0.9988296031951904), ('need', 0.9988236427307129), ('it', 0.9988223314285278), ('will', 0.9988141059875488), ('expected', 0.9988051056861877), ('legal', 0.9988030791282654), ('we', 0.9987929463386536), ('is', 0.9987916946411133)]
[('australian', 0.999800443649292), ('from', 0.9997938275337219), ('which', 0.9997937679290771), ('police', 0.9997909069061279), ('a', 0.9997888207435608), ('be', 0.9997885227203369), ('told', 0.9997879266738892), ('if', 0.999785840511322), ('it', 0.9997842311859131), ('his', 0.9997828006744385)]
[('up', 0.9999565482139587), ('last', 0.9999539256095886), ('had', 0.9999518990516663), ('on', 0.9999512434005737), ('and', 0.9999508261680603), ('over', 0.99994957447052), ('an', 0.99994957447052), ('are', 0.9999495148658752), ('with', 0.9999477863311768), ('as', 0.9999470710754395)]
[('two', 0.9999292492866516), ('over', 0.9999260902404785), ('their', 0.999923586845398), ('three', 0.9999228119

The improvement is clear, and the difference will be even larger when more words are taken into consideration.
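The lookup-then-fall-back pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of gensim: `make_cached_lookup` and `slow_lookup` are hypothetical names, and `slow_lookup` is an artificially slowed stand-in for `model.wv.most_similar` so that the example runs without a trained model.

```python
import time


def make_cached_lookup(lookup, cache=None):
    """Wrap `lookup` so that repeated queries are answered from a dict."""
    cache = {} if cache is None else cache

    def cached(word):
        if word not in cache:          # miss: compute once and remember
            cache[word] = lookup(word)
        return cache[word]             # hit: plain dict access

    return cached, cache


# Hypothetical stand-in for model.wv.most_similar, slowed down on purpose.
def slow_lookup(word):
    time.sleep(0.01)
    return [(word + '_neighbour', 1.0)]


cached_lookup, cache = make_cached_lookup(slow_lookup)
words = ['voted', 'few', 'their', 'around']

start = time.perf_counter()
first_pass = [cached_lookup(w) for w in words]   # every query is a miss
cold = time.perf_counter() - start

start = time.perf_counter()
second_pass = [cached_lookup(w) for w in words]  # every query is a hit
warm = time.perf_counter() - start

print('cold: %.4fs, warm: %.4fs' % (cold, warm))
```

With a trained model you would presumably pass `model.wv.most_similar` as `lookup`, and could pass the `most_similars_precalc` dictionary as `cache` to start from the precomputed results.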

## Visualising the Word Embeddings

The word embeddings made by the model can be visualised by reducing the dimensionality of the word vectors to 2 dimensions using t-SNE.

Visualisations can be used to notice semantic and syntactic trends in the data.

Example:
- Semantic: words like cat, dog, cow, etc. tend to lie close together.
- Syntactic: words like run, running or cut, cutting lie close together.

Vector relations like vKing - vMan = vQueen - vWoman can also be noticed.
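To make the vector relation concrete, here is a minimal sketch using hand-made 2-D toy vectors. The vectors are invented for illustration (one axis loosely standing for "royalty", the other for "gender") and do not come from a trained model; `cosine` is a helper defined here.

```python
import numpy as np

# Toy vectors invented for illustration, not learned embeddings.
toy = {
    'king':  np.array([1.0,  1.0]),
    'queen': np.array([1.0, -1.0]),
    'man':   np.array([0.0,  1.0]),
    'woman': np.array([0.0, -1.0]),
    'apple': np.array([-1.0, 0.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# vKing - vMan and vQueen - vWoman point in the same direction.
offset_similarity = cosine(toy['king'] - toy['man'],
                           toy['queen'] - toy['woman'])

# Analogy query: which remaining word is closest to king - man + woman?
target = toy['king'] - toy['man'] + toy['woman']
candidates = [w for w in toy if w not in ('king', 'man', 'woman')]
best = max(candidates, key=lambda w: cosine(toy[w], target))

print(offset_similarity, best)
```

On a trained gensim model the same kind of query is usually written as `model.wv.most_similar(positive=['king', 'woman'], negative=['man'])`. On a small corpus the result may well not be "queen".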

Additional dependencies:
- sklearn
- numpy
- plotly

The function below can be used to plot the embeddings in an IPython notebook.
It takes the model as its only required parameter. If you don't have the model in memory, you can load it with

`model = gensim.models.Word2Vec.load('path/to/model')`

If you don't want to plot inside a notebook, set the `plot_in_notebook` parameter to `False`.

Note: the model used for the visualisation is trained on a small corpus, so some of the relations might not be very clear.

Beware: this sort of dimension reduction comes at the cost of a loss of information.
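One rough way to get a feel for how much information a 2-D projection can discard is to measure the variance kept by the first two principal components. This is only a sketch under assumptions: PCA is a linear stand-in here (t-SNE is nonlinear and has no directly comparable number), and the matrix below is random data shaped like 500 words by 100 dimensions, not real embeddings.

```python
import numpy as np

rng = np.random.RandomState(0)
# Random stand-in for an (n_words x 100) embedding matrix; not real word vectors.
vectors = rng.normal(size=(500, 100))

# Centre the data, then read the variance along each principal direction
# from the squared singular values.
centered = vectors - vectors.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
variance = singular_values ** 2

# Share of total variance that survives a projection onto 2 components.
retained = variance[:2].sum() / variance.sum()
print('fraction of variance kept by 2 components: %.3f' % retained)
```

Real embeddings are more structured than random noise, so they usually retain more than this, but a 2-D plot still shows only part of the picture.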

In [32]:
from sklearn.manifold import TSNE                   # dimensionality reduction
import numpy as np                                  # array handling

from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go

def reduce_dimensions(model, plot_in_notebook = True):

    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = []        # positions in vector space
    labels = []         # keep track of words to label our data again later
    for word in model.wv.vocab:
        vectors.append(model.wv[word])  # index through model.wv; model[word] is deprecated
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    logging.info('starting tSNE dimensionality reduction. This may take some time.')
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]

    # create a trace that draws each word as a text label at its 2D position
    trace = go.Scatter(
        x=x_vals,
        y=y_vals,
        mode='text',
        text=labels
        )

    data = [trace]

    logging.info('All done. Plotting.')

    if plot_in_notebook:
        init_notebook_mode(connected=True)
        iplot(data, filename='word-embedding-plot')
    else:
        plot(data, filename='word-embedding-plot.html')

In [33]:
reduce_dimensions(model)

2017-12-21 11:51:00,740 : INFO : starting tSNE dimensionality reduction. This may take some time.
2017-12-21 11:51:17,759 : INFO : All done. Plotting.


## Conclusion

In this tutorial we learned how to train word2vec models on your own data and how to evaluate them. We hope that you too will find this popular tool useful in your machine learning tasks!

## Links


Full `word2vec` API docs [here](http://radimrehurek.com/gensim/models/word2vec.html); get [gensim](http://radimrehurek.com/gensim/) here. Original C toolkit and `word2vec` papers by Google [here](https://code.google.com/archive/p/word2vec/).