# Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in a similar context.
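To make this intuition concrete, here is a minimal sketch that collects the neighbors of each word in a tiny made-up corpus and compares the neighbor sets. It is not part of Gensim and it is not how Word2Vec is actually trained; the corpus, the `window` value, and the helper names `neighbors` and `jaccard` are all assumptions for illustration.

```python
from collections import defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "i was shocked by the dirty room".split(),
    "i was appalled by the dirty room".split(),
    "i was astonished by the rude staff".split(),
    "the breakfast buffet was great".split(),
]

def neighbors(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    context = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return context

def jaccard(a, b):
    """Overlap between two neighbor sets, from 0 (disjoint) to 1 (identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

context = neighbors(corpus)
print(jaccard(context["shocked"], context["appalled"]))   # identical neighbor sets
print(jaccard(context["shocked"], context["breakfast"]))  # only partial overlap
```

Words used in the same contexts end up with the same neighbor sets, which is the signal Word2Vec learns from, only with dense vectors instead of raw sets.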

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance, but it's really a combination of two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

### Imports and logging

First, we start with our imports and get logging established:

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


### Dataset 
Next is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file, which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review. You can download the OpinRank Word2Vec dataset here.

To avoid confusion, while gensim’s word2vec tutorial says that you need to pass it a sequence of sentences as its input, you can always pass it a whole review as a sentence (i.e. a much larger size of text), and it should not make much of a difference. 
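To see what changes when a whole review is passed as one "sentence", here is a small sketch of how (center, context) word pairs can be formed from a token list. This is a simplified illustration, not Gensim's training loop; the function name `context_pairs`, the fixed `window`, and the two sample sentences are assumptions.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# Two made-up sentences from one hypothetical review.
s1 = ["great", "location"]
s2 = ["friendly", "staff"]

# Passed as separate sentences: no pair ever links the two sentences.
separate = set(context_pairs(s1)) | set(context_pairs(s2))
# Passed as one review: the window also spans the sentence boundary.
merged = set(context_pairs(s1 + s2))

print(("location", "friendly") in separate)
print(("location", "friendly") in merged)
```

The only difference is a handful of extra pairs around sentence boundaries, which is why feeding whole reviews usually does not change much.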

Now, let's take a closer look at this data below by printing the first few lines after pre-processing. In the OpinRank file, each line is a full review, so a single line can be pretty hefty.

In [4]:
data_file="/mnt/servx1vol/Bams/genXone/20190914_1350_hackyeah/hackyeah_data_80/testKUBA_first100.txt.gz"

In [81]:


with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        # print the pre-processed tokens, stopping after the first few lines
        print(" ".join(gensim.utils.simple_preprocess(line)))
        if i >= 10:
            break



order asc asc desc

sql sprintf select from data where day curdate interval day and statistic_id order by value limit statistic_id order
echo sql pre
result mysql_query sql
array array
while obj result
array obj player_id obj value

echo pre var_dump array echo pre


### Read files into a list
Now that we've had a sneak peek of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below that I am directly reading the compressed file. I'm also doing some mild pre-processing of the reviews using `gensim.utils.simple_preprocess(line)`. This does some basic pre-processing such as tokenization and lowercasing, and returns a list of tokens (words). Documentation of this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).
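If you want a feel for what this kind of pre-processing does, the following is a rough standard-library approximation. It is not Gensim's implementation: the function name `rough_preprocess` and the regular expression are my assumptions, and the length limits simply mirror the `min_len=2` and `max_len=15` defaults of `simple_preprocess`.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase `text`, split it into non-numeric word tokens, and drop very short or very long ones."""
    if isinstance(text, bytes):  # gzip.open(..., 'rb') yields bytes
        text = text.decode("utf8", errors="ignore")
    # runs of word characters that are not digits (underscores are kept)
    tokens = re.findall(r"(?:(?!\d)\w)+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess(b"The room was CLEAN, and a 5-minute walk from the beach!"))
```

Punctuation and digits disappear, everything is lowercased, and one-letter tokens such as `a` are dropped, so each line comes back as a plain list of words.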



In [86]:

def read_input(input_file):
    """Read the gzip-compressed input file and yield one token list per line."""

    logging.info("reading file {0}...this may take a while".format(input_file))

    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):

            if i % 10000 == 0:
                logging.info("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(data_file))
logging.info("Done reading data file")

2019-09-14 17:59:37,359 : INFO : reading file /mnt/servx1vol/Bams/genXone/20190914_1350_hackyeah/hackyeah_data_80/testKUBA_first100.txt.gz...this may take a while
2019-09-14 17:59:37,360 : INFO : read 0 reviews
2019-09-14 17:59:37,433 : INFO : read 10000 reviews
2019-09-14 17:59:37,503 : INFO : read 20000 reviews
2019-09-14 17:59:37,574 : INFO : read 30000 reviews
2019-09-14 17:59:37,647 : INFO : read 40000 reviews
2019-09-14 17:59:37,721 : INFO : read 50000 reviews
2019-09-14 17:59:37,790 : INFO : read 60000 reviews
2019-09-14 17:59:37,863 : INFO : read 70000 reviews
2019-09-14 17:59:37,936 : INFO : read 80000 reviews
2019-09-14 17:59:38,008 : INFO : read 90000 reviews
2019-09-14 17:59:38,083 : INFO : read 100000 reviews
2019-09-14 17:59:38,157 : INFO : read 110000 reviews
2019-09-14 17:59:38,228 : INFO : read 120000 reviews
2019-09-14 17:59:38,298 : INFO : read 130000 reviews
2019-09-14 17:59:38,371 : INFO : read 140000 reviews
2019-09-14 17:59:38,446 : INFO : read 150000 reviews
...
2019-09-14 18:00:19,559 : INFO : read 4710000 reviews
2019-09-14 18:00:19,631 : INFO : read 4720000 reviews

In [178]:
documents

[[],
 ['order', 'asc', 'asc', 'desc'],
 [],
 ['sql',
  'sprintf',
  'select',
  'from',
  'data',
  'where',
  'day',
  'curdate',
  'interval',
  'day',
  'and',
  'statistic_id',
  'order',
  'by',
  'value',
  'limit',
  'statistic_id',
  'order'],
 ['echo', 'sql', 'pre'],
 ['result', 'mysql_query', 'sql'],
 ['array', 'array'],
 ['while', 'obj', 'result'],
 ['array', 'obj', 'player_id', 'obj', 'value'],
 [],
 ['echo', 'pre', 'var_dump', 'array', 'echo', 'pre'],
 [],
 ['version'],
 ['zones'],
 ['africa', 'abidjan', 'lmt', 'gmt', 'en'],
 ['africa', 'accra', 'lmt', 'gmt', 'eo', 'ep'],
 ['africa', 'nairobi', 'lmt', 'eat', 'eq', 'er', 'es', 'et'],
 ['africa',
  'algiers',
  'lmt',
  'pmt',
  'wet',
  'west',
  'cet',
  'cest',
  'eu',
  'ev',
  'ew',
  'ex',
  'ey',
  'ez'],
 ['africa', 'lagos', 'lmt', 'wat'],
 ['africa', 'bissau', 'lmt', 'gmt'],
 ['africa', 'maputo', 'lmt', 'cat'],
 ['africa',
  'cairo',
  'lmt',
  'eet',
  'eest',
  'fa',
  'fb',
  'fc',
  'fd',
  'fe',
  'ff',
  'fg',
  ...],
 ...]

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step (the `documents`). So, we are essentially passing on a list of lists, where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.
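As a rough picture of what "creating a vocabulary" means, the sketch below counts tokens across a list of lists and keeps the ones that occur often enough. It is only an illustration of the idea, not Gensim's internal code; `build_vocab` and `toy_documents` are hypothetical names, and `min_count` here just plays the same role as the Word2Vec parameter of that name.

```python
from collections import Counter

def build_vocab(docs, min_count=1):
    """Count every token across all documents and keep those seen at least `min_count` times."""
    counts = Counter(token for doc in docs for token in doc)
    return {word: n for word, n in counts.items() if n >= min_count}

# A made-up list of lists, in the same shape as `documents`.
toy_documents = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
    [],
]

vocab = build_vocab(toy_documents)
print(len(vocab))                             # number of unique words
print(vocab["room"])                          # how often "room" occurs
print(sorted(build_vocab(toy_documents, 2)))  # words that occur at least twice
```

Raising `min_count` shrinks the vocabulary by dropping rare words, which is usually where typos and other noise live.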

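After building the vocabulary, we just need to call `train(...)` to start training the Word2Vec model. (When you pass `documents` straight to the constructor, as in the cell below, Gensim builds the vocabulary and runs a first round of training right away, so the explicit `train(...)` call continues training from there.) Training on the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset takes about 10 minutes, so please be patient while running your code on this dataset.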

Behind the scenes we are actually training a simple neural network with a single hidden layer. But we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.
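The reason the hidden-layer weights can be read as word vectors is easier to see with a toy example. The sketch below uses plain Python lists and random numbers in place of a trained network; the vocabulary, `vector_size`, and the helper names are all made up for illustration.

```python
import random

vocab = ["clean", "dirty", "room", "staff"]  # made-up vocabulary
vector_size = 3                              # made-up hidden layer width

# Hidden-layer weight matrix: one row per word, `vector_size` columns.
random.seed(0)
W = [[random.uniform(-0.5, 0.5) for _ in range(vector_size)] for _ in vocab]

def one_hot(word):
    """Input encoding: 1.0 at the word's index, 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_layer(x):
    """Multiply the input vector by W, giving the hidden-layer activations."""
    return [sum(x[i] * W[i][k] for i in range(len(vocab))) for k in range(vector_size)]

# Feeding in a one-hot vector simply selects that word's row of W,
# which is why the rows of W can be read off as the word vectors.
print(hidden_layer(one_hot("room")) == W[vocab.index("room")])
```

In a trained model the rows of this matrix are no longer random: training nudges them so that words appearing in similar contexts end up with similar rows.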

In [87]:
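# size is the vector dimensionality (named vector_size from Gensim 4.0 on); the constructor builds the vocabulary and already trains once
model = gensim.models.Word2Vec(documents, size=200, window=13, min_count=0, workers=80)
# train(...) then continues training on the same data for 10 more epochs
model.train(documents, total_examples=len(documents), epochs=10)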

2019-09-14 18:00:21,681 : INFO : collecting all words and their counts
2019-09-14 18:00:21,682 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-09-14 18:00:21,690 : INFO : PROGRESS: at sentence #10000, processed 36194 words, keeping 7929 word types
2019-09-14 18:00:21,698 : INFO : PROGRESS: at sentence #20000, processed 72534 words, keeping 12531 word types
2019-09-14 18:00:21,706 : INFO : PROGRESS: at sentence #30000, processed 109497 words, keeping 16908 word types
2019-09-14 18:00:21,714 : INFO : PROGRESS: at sentence #40000, processed 146270 words, keeping 21042 word types
2019-09-14 18:00:21,723 : INFO : PROGRESS: at sentence #50000, processed 182763 words, keeping 25888 word types
2019-09-14 18:00:21,731 : INFO : PROGRESS: at sentence #60000, processed 214598 words, keeping 29606 word types
2019-09-14 18:00:21,741 : INFO : PROGRESS: at sentence #70000, processed 251398 words, keeping 34236 word types
...
2019-09-14 18:00:25,857 : INFO : PROGRESS: at sentence #4340000, processed 16224262 words, keeping 953417 word types
...
2019-09-14 18:00:54,108 : INFO : EPOCH 1 - PROGRESS: at 86.54% examples, 1216543 words/s, in_qsize 160, out_qsize 0
...
2019-09-14 18:00:55,412 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-09-14 18:00:55,413 : INFO : EPOCH - 1 : training on 18095866 raw words (17058259 effective words) took 13.4s, 1268879 effective words/s
2019-09-14 18:00:56,436 : INFO : EPOCH 2 - PROGRESS: at 4.05% examples, 691578 words/s, in_qsize 159, out_qsize 0
...
2019-09-14 18:01:49,412 : INFO : worker thread finished; awaiting finish of 22 m

2019-09-14 18:02:03,141 : INFO : worker thread finished; awaiting finish of 44 more threads
2019-09-14 18:02:03,143 : INFO : worker thread finished; awaiting finish of 43 more threads
2019-09-14 18:02:03,144 : INFO : worker thread finished; awaiting finish of 42 more threads
2019-09-14 18:02:03,145 : INFO : worker thread finished; awaiting finish of 41 more threads
2019-09-14 18:02:03,146 : INFO : worker thread finished; awaiting finish of 40 more threads
2019-09-14 18:02:03,148 : INFO : worker thread finished; awaiting finish of 39 more threads
2019-09-14 18:02:03,151 : INFO : worker thread finished; awaiting finish of 38 more threads
2019-09-14 18:02:03,157 : INFO : worker thread finished; awaiting finish of 37 more threads
2019-09-14 18:02:03,159 : INFO : worker thread finished; awaiting finish of 36 more threads
2019-09-14 18:02:03,163 : INFO : worker thread finished; awaiting finish of 35 more threads
2019-09-14 18:02:03,168 : INFO : worker thread finished; awaiting finish of 34 m

2019-09-14 18:02:16,837 : INFO : worker thread finished; awaiting finish of 52 more threads
2019-09-14 18:02:16,840 : INFO : worker thread finished; awaiting finish of 51 more threads
2019-09-14 18:02:16,841 : INFO : worker thread finished; awaiting finish of 50 more threads
2019-09-14 18:02:16,842 : INFO : worker thread finished; awaiting finish of 49 more threads
2019-09-14 18:02:16,842 : INFO : worker thread finished; awaiting finish of 48 more threads
2019-09-14 18:02:16,844 : INFO : worker thread finished; awaiting finish of 47 more threads
2019-09-14 18:02:16,844 : INFO : worker thread finished; awaiting finish of 46 more threads
2019-09-14 18:02:16,854 : INFO : worker thread finished; awaiting finish of 45 more threads
2019-09-14 18:02:16,863 : INFO : worker thread finished; awaiting finish of 44 more threads
2019-09-14 18:02:16,880 : INFO : worker thread finished; awaiting finish of 43 more threads
2019-09-14 18:02:16,894 : INFO : worker thread finished; awaiting finish of 42 m

2019-09-14 18:02:30,897 : INFO : worker thread finished; awaiting finish of 60 more threads
2019-09-14 18:02:30,900 : INFO : worker thread finished; awaiting finish of 59 more threads
2019-09-14 18:02:30,912 : INFO : worker thread finished; awaiting finish of 58 more threads
2019-09-14 18:02:30,916 : INFO : worker thread finished; awaiting finish of 57 more threads
2019-09-14 18:02:30,931 : INFO : worker thread finished; awaiting finish of 56 more threads
2019-09-14 18:02:30,950 : INFO : worker thread finished; awaiting finish of 55 more threads
2019-09-14 18:02:30,956 : INFO : worker thread finished; awaiting finish of 54 more threads
2019-09-14 18:02:30,963 : INFO : worker thread finished; awaiting finish of 53 more threads
2019-09-14 18:02:30,967 : INFO : worker thread finished; awaiting finish of 52 more threads
2019-09-14 18:02:30,968 : INFO : worker thread finished; awaiting finish of 51 more threads
2019-09-14 18:02:30,970 : INFO : worker thread finished; awaiting finish of 50 m

2019-09-14 18:02:44,507 : INFO : worker thread finished; awaiting finish of 68 more threads
2019-09-14 18:02:44,508 : INFO : worker thread finished; awaiting finish of 67 more threads
2019-09-14 18:02:44,513 : INFO : worker thread finished; awaiting finish of 66 more threads
2019-09-14 18:02:44,517 : INFO : worker thread finished; awaiting finish of 65 more threads
2019-09-14 18:02:44,525 : INFO : worker thread finished; awaiting finish of 64 more threads
2019-09-14 18:02:44,527 : INFO : worker thread finished; awaiting finish of 63 more threads
2019-09-14 18:02:44,531 : INFO : worker thread finished; awaiting finish of 62 more threads
2019-09-14 18:02:44,554 : INFO : worker thread finished; awaiting finish of 61 more threads
2019-09-14 18:02:44,555 : INFO : worker thread finished; awaiting finish of 60 more threads
2019-09-14 18:02:44,558 : INFO : worker thread finished; awaiting finish of 59 more threads
2019-09-14 18:02:44,568 : INFO : worker thread finished; awaiting finish of 58 m

2019-09-14 18:02:58,226 : INFO : worker thread finished; awaiting finish of 76 more threads
2019-09-14 18:02:58,243 : INFO : worker thread finished; awaiting finish of 75 more threads
2019-09-14 18:02:58,252 : INFO : worker thread finished; awaiting finish of 74 more threads
2019-09-14 18:02:58,258 : INFO : worker thread finished; awaiting finish of 73 more threads
2019-09-14 18:02:58,270 : INFO : worker thread finished; awaiting finish of 72 more threads
2019-09-14 18:02:58,274 : INFO : worker thread finished; awaiting finish of 71 more threads
2019-09-14 18:02:58,289 : INFO : worker thread finished; awaiting finish of 70 more threads
2019-09-14 18:02:58,299 : INFO : worker thread finished; awaiting finish of 69 more threads
2019-09-14 18:02:58,314 : INFO : worker thread finished; awaiting finish of 68 more threads
2019-09-14 18:02:58,315 : INFO : worker thread finished; awaiting finish of 67 more threads
2019-09-14 18:02:58,319 : INFO : worker thread finished; awaiting finish of 66 m

2019-09-14 18:03:08,681 : INFO : EPOCH 6 - PROGRESS: at 68.73% examples, 1161402 words/s, in_qsize 159, out_qsize 0
2019-09-14 18:03:09,685 : INFO : EPOCH 6 - PROGRESS: at 76.34% examples, 1174514 words/s, in_qsize 160, out_qsize 0
2019-09-14 18:03:10,693 : INFO : EPOCH 6 - PROGRESS: at 84.73% examples, 1192438 words/s, in_qsize 160, out_qsize 2
2019-09-14 18:03:11,699 : INFO : EPOCH 6 - PROGRESS: at 93.85% examples, 1218934 words/s, in_qsize 111, out_qsize 2

2019-09-14 18:03:16,268 : INFO : EPOCH 7 - PROGRESS: at 23.84% examples, 1021385 words/s, in_qsize 160, out_qsize 0
2019-09-14 18:03:17,291 : INFO : EPOCH 7 - PROGRESS: at 30.74% examples, 1042234 words/s, in_qsize 159, out_qsize 0
2019-09-14 18:03:18,294 : INFO : EPOCH 7 - PROGRESS: at 37.00% examples, 1056241 words/s, in_qsize 160, out_qsize 2
2019-09-14 18:03:19,302 : INFO : EPOCH 7 - PROGRESS: at 45.41% examples, 1102409 words/s, in_qsize 160, out_qsize 0
2019-09-14 18:03:20,330 : INFO : EPOCH 7 - PROGRESS: at 52.81% examples, 1118519 words/s, in_qsize 160, out_qsize 1
2019-09-14 18:03:21,368 : INFO : EPOCH 7 - PROGRESS: at 60.88% examples, 1141039 words/s, in_qsize 160, out_qsize 0
2019-09-14 18:03:22,368 : INFO : EPOCH 7 - PROGRESS: at 68.82% examples, 1163167 words/s, in_qsize 159, out_qsize 0
2019-09-14 18:03:23,378 : INFO : EPOCH 7 - PROGRESS: at 76.42% examples, 1173103 words/s, in_qsize 160, out_qsize 0
2019-09-14 18:03:26,005 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-09-14 18:03:26,006 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-09-14 18:03:26,007 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-09-14 18:03:26,007 : INFO : EPOCH - 7 : training on 18095866 raw words (17059805 effective words) took 13.8s, 1239634 effective words/s
2019-09-14 18:03:27,051 : INFO : EPOCH 8 - PROGRESS: at 3.80% examples, 628358 words/s, in_qsize 159, out_qsize 0
2019-09-14 18:03:28,062 : INFO : EPOCH 8 - PROGRESS: at 10.59% examples, 897528 words/s, in_qsize 159, out_qsize 0
2019-09-14 18:03:29,068 : INFO : EPOCH 8 - PROGRESS: at 17.27% examples, 977479 words/s, in_qsize 159, out_qsize 0
2019-09-14 18:03:30,075 : INFO : EPOCH 8 - PROGRESS: at 24.32% examples, 1033342 words/s, in_qsize 159, out_qsize 0
2019-09-14 18:03:31,091 : INFO : EPOCH 8 - PROGRESS: at 30.89% examples, 1044888 words/s, in_qsize 160, out_qsize 0

(170591069, 180958660)

In [151]:
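# path to one additional text file that is used to update the trained model
file = "/mnt/servx1vol/Bams/genXone/20190914_1350_hackyeah/hackyeah_data_80/800/800005f034d638f17bbc734554bdaa92a13432f5"

def read(file):
    """Yield one list of preprocessed tokens per line of the file."""
    with open(file, 'rb') as f:
        for line in f:
            yield gensim.utils.simple_preprocess(line)

# materialise the generator, because the corpus is iterated more than once below
doc = list(read(file))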

In [152]:
# add the new words to the existing vocabulary, then continue training on the new data
model.build_vocab(doc, update=True)
model.train(doc, total_examples=model.corpus_count, epochs=10)

2019-09-14 21:08:37,943 : INFO : collecting all words and their counts
2019-09-14 21:08:37,943 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-09-14 21:08:37,944 : INFO : collected 126 word types from a corpus of 360 raw words and 117 sentences
2019-09-14 21:08:37,944 : INFO : Updating model with new vocabulary
2019-09-14 21:08:37,945 : INFO : New added 126 unique words (50% of original 252) and increased the count of 126 pre-existing words (50% of original 252)
2019-09-14 21:08:37,946 : INFO : deleting the raw counts dictionary of 126 items
2019-09-14 21:08:37,946 : INFO : sample=0.001 downsamples 252 most-common words
2019-09-14 21:08:37,946 : INFO : downsampling leaves estimated 322 word corpus (89.6% of prior 360)
2019-09-14 21:08:39,831 : INFO : estimated required memory for 252 words and 200 dimensions: 529200 bytes
2019-09-14 21:08:39,832 : INFO : updating layer weights
2019-09-14 21:08:39,921 : INFO : training model with 80 workers on 1067657 voc

2019-09-14 21:08:40,035 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-09-14 21:08:40,035 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-09-14 21:08:40,036 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-09-14 21:08:40,036 : INFO : EPOCH - 1 : training on 360 raw words (168 effective words) took 0.0s, 4491 effective words/s

(1611, 3600)

## Now, let's look at some output 
This first example shows a simple case of looking up words similar to the word `dirty`. All we need to do here is to call the `most_similar` function and provide the word `dirty` as the positive example. This returns the top 10 similar words. 

In [109]:

w1 = "dirty"
model.wv.most_similar(positive=w1)


[('sexy', 0.5765877366065979),
 ('naughty', 0.5730270743370056),
 ('blonde', 0.5616953372955322),
 ('whore', 0.5575402975082397),
 ('brunette', 0.5498841404914856),
 ('boy', 0.5348165035247803),
 ('fucking', 0.5327426195144653),
 ('naked', 0.5307685136795044),
 ('blowjob', 0.5305723547935486),
 ('dick', 0.5303226113319397)]

In [170]:
w1 = "bitcoin"
model.wv.most_similar(positive=w1)

[('btc', 0.7070275545120239),
 ('anleitungen', 0.6291753053665161),
 ('paypal', 0.6121591329574585),
 ('ethereum', 0.5984702110290527),
 ('selling', 0.5840017795562744),
 ('payment', 0.553992509841919),
 ('cash', 0.5389730930328369),
 ('payments', 0.5315466523170471),
 ('мгновенно', 0.5269312262535095),
 ('wallet', 0.5216661095619202)]

That looks pretty good, right? Let's look at a few more: the words most similar to `polite`, `france` and `shocked`.

In [57]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar(positive=w1, topn=6)


[('confident', 0.49306565523147583),
 ('honest', 0.4809393584728241),
 ('rude', 0.479707270860672),
 ('respectful', 0.46317926049232483),
 ('joke', 0.43244439363479614),
 ('angry', 0.42868563532829285)]

In [58]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar(positive=w1, topn=6)


[('germany', 0.670470118522644),
 ('spain', 0.6591787934303284),
 ('french', 0.6537880301475525),
 ('paris', 0.6444750428199768),
 ('greece', 0.6278240084648132),
 ('mexico', 0.6266096234321594)]

In [11]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar(positive=w1, topn=6)


[('surprised', 0.7492449283599854),
 ('smiling', 0.6603766679763794),
 ('confused', 0.6447255611419678),
 ('amazed', 0.6390156745910645),
 ('scared', 0.622252345085144),
 ('laughs', 0.6047079563140869)]

That's nice. You can also specify several positive examples to get words that are related in the provided context, and negative examples to say what should not be considered related. In the example below we ask for words that relate to *bed*, *sheet* and *pillow* but not to *couch*:

In [59]:
# get words related to things on a bed, steering away from 'couch'
w1 = ["bed", "sheet", "pillow"]
w2 = ["couch"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)


[('shoulders', 0.42097824811935425),
 ('lightly', 0.4134235382080078),
 ('warm', 0.40145695209503174),
 ('thick', 0.39765384793281555),
 ('cheek', 0.39348864555358887),
 ('sheets', 0.3915254473686218),
 ('shaking', 0.3868190348148346),
 ('selectfor', 0.3864431381225586),
 ('panting', 0.3848966956138611),
 ('head', 0.3830137252807617)]
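
To make the role of positive and negative examples more concrete, here is a minimal sketch of the general idea: positive vectors are added to a query vector, negative vectors are subtracted from it, and every other word is ranked by cosine similarity to the result. The function `rank_similar` and the 2-dimensional `toy_vectors` are hypothetical and made up for illustration; this is not Gensim's actual implementation, which may differ in details.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def rank_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to the combined query vector."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    # Positive words pull the query towards them, negative words push it away.
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    query = unit(query)
    scores = []
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue  # do not return the query words themselves
        scores.append((word, sum(a * b for a, b in zip(query, unit(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-d vectors, invented for this example only.
toy_vectors = {
    "bed": [1.0, 0.1],
    "pillow": [0.9, 0.2],
    "couch": [0.2, 1.0],
    "sheets": [0.95, 0.0],
    "sofa": [0.1, 0.9],
}
result = rank_similar(toy_vectors, positive=["bed", "pillow"],
                      negative=["couch"], topn=2)
print(result)
```

With these toy vectors, `sheets` ranks above `sofa`, because subtracting `couch` moves the query away from the direction that `couch` and `sofa` share.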

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [65]:
# similarity between two different words
model.wv.similarity(w1="dirty", w2="smelly")

0.22341317

In [61]:
# similarity between two identical words
model.wv.similarity(w1="dirty", w2="dirty")

1.0

In [62]:
# similarity between two words with opposite meanings
model.wv.similarity(w1="dirty", w2="clean")

0.22457771

In [182]:
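# word vectors are plain NumPy arrays, so they can be added element-wise;
# the notebook only displays the value of the last expression (the type)
len(model.wv.get_vector("dirty") + model.wv.get_vector("smelly"))
type(model.wv.get_vector("dirty"))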

numpy.ndarray

Under the hood, the three similarity snippets above compute the cosine similarity between the word vectors of the two specified words. Cosine similarity ranges from -1.0 to 1.0, and comparing a word with itself always gives 1.0. In the output shown here, `dirty` scores about 0.22 against both `smelly` and `clean`, so this particular model does not clearly separate a near-synonym from an opposite; with more in-domain training data you would expect `dirty` and `smelly` to score noticeably higher. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
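
As a minimal sketch of the cosine computation itself, the snippet below works on plain Python lists. The helper name `cosine_similarity` and the example vectors are made up for illustration; real word vectors from the model would have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented example vectors, not taken from the trained model.
same = cosine_similarity([1.0, 2.0], [1.0, 2.0])         # identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # unrelated direction
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
print(same, orthogonal, opposite)
```

The three cases land at (approximately) 1.0, 0.0 and -1.0, which are the top, middle and bottom of the score range.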

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [66]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])


'france'

In [72]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])


'duvet'
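
One way to think about the odd-one-out lookup is that it picks the word whose vector is least similar to the average of the group. The sketch below illustrates that idea with a hypothetical helper `odd_one_out` and invented 2-dimensional vectors; it is an approximation for intuition, not a copy of Gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word least similar to the mean of all given words."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    # Average the normalised vectors, then normalise the average again.
    mean = unit([sum(units[w][i] for w in words) / len(words)
                 for i in range(dims)])
    # The odd word has the smallest cosine similarity to the group mean.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Invented vectors: 'cat' and 'dog' point the same way, 'france' does not.
toy_vectors = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "france": [0.1, 1.0]}
odd = odd_one_out(toy_vectors, ["cat", "dog", "france"])
print(odd)
```

This also explains why results can look surprising on a real model: if none of the words sit close together in the vector space, the "odd" word is decided by small differences.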

## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
```

### `size`
The size of the dense vector used to represent each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me. (In Gensim 4.0 and later, this parameter is named `vector_size`.)

### `window`
The maximum distance between the target word and a neighboring word. Neighbors that lie farther than `window` positions to the left or right of the target are not treated as context for it. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, then the window size should not matter too much, as long as it's a decently sized window.
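
To see what the window does, here is a small sketch that lists the (target, neighbor) pairs a fixed window would produce for one tokenized sentence. The function `context_pairs` and the example sentence are hypothetical; the sketch also leaves out a detail of the original algorithm, which samples a smaller effective window at random for each target word.

```python
def context_pairs(tokens, window):
    """List (target, neighbor) pairs for neighbors within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                pairs.append((target, tokens[j]))
    return pairs

# Invented example sentence.
tokens = ["the", "room", "was", "very", "clean"]
pairs = context_pairs(tokens, window=1)
print(pairs)
```

With `window=1`, the pair `("was", "clean")` is not generated because the two words are two positions apart; with `window=2` it is.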

### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count`. Extremely infrequent words are usually unimportant, so it's best to get rid of them. Unless your dataset is really tiny, this does not really affect the model.
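
The effect of `min_count` can be sketched with a simple frequency count. The helper `prune_vocab` and the three tiny "reviews" below are made up for illustration; Gensim performs this pruning internally when it builds the vocabulary.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Invented, already tokenized reviews.
sentences = [["clean", "room"], ["dirty", "room"], ["clean", "sheets"]]
vocab = prune_vocab(sentences, min_count=2)
print(vocab)
```

Here `dirty` and `sheets` occur only once, so they are dropped and only `clean` and `room` remain in the vocabulary.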

### `workers`
The number of worker threads used for training behind the scenes.


## When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You end up with a lexicon not just for sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.
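
A rough sketch of the tag idea could look like the following. The record layout, the helper names `tags_to_sentences` and `train_tag_model`, and the parameter values are all assumptions made for illustration; only the data preparation runs here, and the training function needs Gensim installed and far more data than three records.

```python
def tags_to_sentences(questions):
    """Turn each question's tag set into one training 'sentence'."""
    # A single tag has no co-occurring neighbors, so such questions are skipped.
    return [sorted(q["tags"]) for q in questions if len(q["tags"]) > 1]

def train_tag_model(sentences):
    """Train Word2Vec on tag 'sentences' (requires gensim and lots of data)."""
    import gensim  # imported lazily so the preparation step runs without it
    # A wide window lets every tag in a set see all the others.
    return gensim.models.Word2Vec(sentences, window=10, min_count=2, workers=4)

# Hypothetical records; real input would be millions of tagged questions.
questions = [
    {"id": 1, "tags": ["python", "pandas", "dataframe"]},
    {"id": 2, "tags": ["python", "numpy"]},
    {"id": 3, "tags": ["regex"]},
]
sentences = tags_to_sentences(questions)
print(sentences)
```

With a model trained on a realistically large tag corpus, a call such as `model.wv.most_similar(positive=["python"])` would then return candidate tags to recommend.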
