## Testing and benchmarking distributed word2vec
Distributed word2vec is implemented in TensorFlow. To install TensorFlow, follow the [installation guide](https://www.tensorflow.org/install/); CPU-only support is enough.

We will train both versions of word2vec on the text8 dataset.

In [1]:
import gensim
import os, time

from gensim.models.word2vec import Word2Vec, Text8Corpus
from gensim.models import KeyedVectors

### Benchmarking and testing TfWord2Vec model
#### Start Docker with distributed word2vec
The Docker containers consist of one *parameter server* (ps) and several *workers*. The ps stores and updates the model, and the workers perform the expensive computations.
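
The division of labour between the ps and the workers can be illustrated with a toy, single-process sketch. It is not the distributed word2vec code: the names `ParameterServer` and `worker_step`, the one-parameter model, and the learning rate are all assumptions made for illustration, and the real workers run concurrently over gRPC rather than in a loop.

```python
from typing import List


class ParameterServer:
    """Hypothetical stand-in for the ps: owns the weights and applies updates."""

    def __init__(self, size: int) -> None:
        self.weights: List[float] = [0.0] * size

    def pull(self) -> List[float]:
        # Workers fetch a copy of the current parameters.
        return list(self.weights)

    def push(self, gradient: List[float], learning_rate: float) -> None:
        # The server applies a plain SGD update to its own copy.
        for i, g in enumerate(gradient):
            self.weights[i] -= learning_rate * g


def worker_step(ps: ParameterServer, shard: List[float],
                learning_rate: float = 0.01) -> None:
    """One worker step: pull weights, compute a gradient on the shard, push it."""
    w = ps.pull()
    # Gradient of the mean of 0.5 * (w[0] - x) ** 2 over the worker's shard.
    grad = [sum(w[0] - x for x in shard) / len(shard)]
    ps.push(grad, learning_rate)


ps = ParameterServer(size=1)
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # one data shard per worker
for _ in range(1000):
    for shard in shards:  # sequential here; concurrent in a real cluster
        worker_step(ps, shard)

print(ps.weights[0])  # settles close to the mean of all shard values (3.5)
```

The point of the pattern is that only the ps holds the authoritative parameters, so workers can be added without changing how the model is stored.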

**Interrupt the kernel** once all workers have exited. The final model is saved to the *"models/tfw2v_model"* file.

Number of epochs: 3

In [2]:
!docker-compose up

Starting test_worker1_1 ... 
Starting test_worker1_1
Starting test_worker2_1 ... 
Starting test_worker2_1
Starting test_ps1_1 ... 
Starting test_worker3_1 ... 
Starting test_worker3_1
Starting test_ps1_1
Attaching to test_worker3_1, test_worker2_1, test_worker1_1, test_ps1_1
worker3_1  | Found and verified text8.zip
ps1_1      | Found and verified text8.zip
worker2_1  | Found and verified text8.zip
worker1_1  | Found and verified text8.zip
ps1_1      | Count of vocabulary: 71290. Count of rare words: 182564. Data contains 17005207 words.
ps1_1      | 2017-10-01 22:33:06.481366: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
ps1_1      | 2017-10-01 22:33:06.481435: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker1:2223, 1 -> worker2:2224, 2 -> worker3:2225}
ps1_1

[36mworker3_1  |[0m Epoch: 0. Task: 2. Average loss at step 4500: 13.74557498550415. Data index: 13586804. Time: 45.431280851364136
[32mworker1_1  |[0m Epoch: 0. Task: 0. Average loss at step 8500: 12.887157421112061. Data index: 4250000. Time: 74.43199753761292
[33mworker2_1  |[0m Epoch: 0. Task: 1. Average loss at step 4500: 12.887615405082702. Data index: 7918402. Time: 46.37644934654236
[36mworker3_1  |[0m Epoch: 0. Task: 2. Average loss at step 5000: 11.998390156745911. Data index: 13836804. Time: 50.582481384277344
[32mworker1_1  |[0m Epoch: 0. Task: 0. Average loss at step 9000: 12.189710948944091. Data index: 4500000. Time: 79.58523321151733
[33mworker2_1  |[0m Epoch: 0. Task: 1. Average loss at step 5000: 11.63699865913391. Data index: 8168402. Time: 51.55211806297302
[36mworker3_1  |[0m Epoch: 0. Task: 2. Average loss at step 5500: 11.375207640647888. Data index: 14086804. Time: 55.705252170562744
[32mworker1_1  |[0m Epoch: 0. Task: 0. Average loss at step 950

[33mworker2_1  |[0m Epoch: 1. Task: 1. Average loss at step 3500: 5.196238445281982. Data index: 7418402. Time: 153.74247813224792
[36mworker3_1  |[0m Epoch: 1. Task: 2. Average loss at step 4000: 5.175409344673157. Data index: 13336804. Time: 157.20481514930725
[32mworker1_1  |[0m Epoch: 1. Task: 0. Average loss at step 8000: 5.176196441173554. Data index: 4000000. Time: 186.29354190826416
[33mworker2_1  |[0m Epoch: 1. Task: 1. Average loss at step 4000: 5.244959408760071. Data index: 7668402. Time: 158.92642617225647
[36mworker3_1  |[0m Epoch: 1. Task: 2. Average loss at step 4500: 5.071171466350555. Data index: 13586804. Time: 162.3741512298584
[32mworker1_1  |[0m Epoch: 1. Task: 0. Average loss at step 8500: 5.083449512004853. Data index: 4250000. Time: 191.4587481021881
[33mworker2_1  |[0m Epoch: 1. Task: 1. Average loss at step 4500: 5.2070436115264895. Data index: 7918402. Time: 164.08716320991516
[36mworker3_1  |[0m Epoch: 1. Task: 2. Average loss at step 5000: 

[32mworker1_1  |[0m Epoch: 2. Task: 0. Average loss at step 7000: 4.789488666534424. Data index: 3500000. Time: 293.5945715904236
[33mworker2_1  |[0m Epoch: 2. Task: 1. Average loss at step 3000: 4.788929577827454. Data index: 7168402. Time: 265.84371876716614
[36mworker3_1  |[0m Epoch: 2. Task: 2. Average loss at step 3500: 4.561930554389954. Data index: 13086804. Time: 269.41012740135193
[32mworker1_1  |[0m Epoch: 2. Task: 0. Average loss at step 7500: 4.764244206905365. Data index: 3750000. Time: 298.78954100608826
[33mworker2_1  |[0m Epoch: 2. Task: 1. Average loss at step 3500: 4.663880496501923. Data index: 7418402. Time: 271.0158579349518
[36mworker3_1  |[0m Epoch: 2. Task: 2. Average loss at step 4000: 4.682363587856293. Data index: 13336804. Time: 274.591011762619
[32mworker1_1  |[0m Epoch: 2. Task: 0. Average loss at step 8000: 4.7253683605194094. Data index: 4000000. Time: 303.97411489486694
[33mworker2_1  |[0m Epoch: 2. Task: 1. Average loss at step 4000: 4.

[36mworker3_1  |[0m Nearest to not: juanita, keefe, residual, often, usually, fhm, still, confederate,
[36mworker3_1  |[0m Nearest to many: some, several, these, those, idempotent, burst, powiat, dipyramid,
[36mworker3_1  |[0m Nearest to with: adhesion, between, represents, intermediate, litani, along, neologisms, univac,
[36mworker3_1  |[0m Nearest to be: been, have, make, honesty, levelled, heritable, by, waza,
[36mworker3_1  |[0m Nearest to all: concise, superstar, these, paradiso, inefficient, receiver, both, ditko,
[33mworker2_1  |[0m Epoch: 2. Task: 1. Average loss at step 11000: 4.6352100377082825. Data index: 11168402. Time: 343.67028427124023
[36mtest_worker3_1 exited with code 0
[0m[33mworker2_1  |[0m Nearest to where: when, if, though, although, partition, homeostasis, licensure, positron,
[33mworker2_1  |[0m Nearest to during: in, after, on, upon, entertainers, under, unhappy, gaunt,
[33mworker2_1  |[0m Nearest to in: during, at, on, between, genotype, c

#### Testing TfWord2Vec

In [3]:
wv = KeyedVectors.load("./models/tfw2v_model")
wv.accuracy('questions-words.txt')

[{'correct': [],
  'incorrect': [('ISLAMABAD', 'PAKISTAN', 'MADRID', 'SPAIN'),
   ('ISLAMABAD', 'PAKISTAN', 'ROME', 'ITALY'),
   ('ISLAMABAD', 'PAKISTAN', 'TOKYO', 'JAPAN'),
   ('MADRID', 'SPAIN', 'ROME', 'ITALY'),
   ('MADRID', 'SPAIN', 'TOKYO', 'JAPAN'),
   ('MADRID', 'SPAIN', 'ISLAMABAD', 'PAKISTAN'),
   ('ROME', 'ITALY', 'TOKYO', 'JAPAN'),
   ('ROME', 'ITALY', 'ISLAMABAD', 'PAKISTAN'),
   ('ROME', 'ITALY', 'MADRID', 'SPAIN'),
   ('TOKYO', 'JAPAN', 'ISLAMABAD', 'PAKISTAN'),
   ('TOKYO', 'JAPAN', 'MADRID', 'SPAIN'),
   ('TOKYO', 'JAPAN', 'ROME', 'ITALY')],
  'section': 'capital-common-countries'},
 {'correct': [],
  'incorrect': [('AMMAN', 'JORDAN', 'BUJUMBURA', 'BURUNDI'),
   ('AMMAN', 'JORDAN', 'CARACAS', 'VENEZUELA'),
   ('AMMAN', 'JORDAN', 'DAKAR', 'SENEGAL'),
   ('AMMAN', 'JORDAN', 'DOHA', 'QATAR'),
   ('AMMAN', 'JORDAN', 'DUBLIN', 'IRELAND'),
   ('BUJUMBURA', 'BURUNDI', 'CARACAS', 'VENEZUELA'),
   ('BUJUMBURA', 'BURUNDI', 'DAKAR', 'SENEGAL'),
   ('BUJUMBURA', 'BURUNDI', 'DOHA',
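
The raw result lists every analogy question, which makes the two models hard to compare at a glance. A per-section summary can be computed from the same structure, a list of dicts with `'section'`, `'correct'` and `'incorrect'` keys as seen in the output above. The helper name `summarize_accuracy` and the sample data below are made up for illustration; the sample is not a real evaluation result.

```python
def summarize_accuracy(sections):
    """Map section name -> (n_correct, n_total, accuracy); empty sections are skipped."""
    summary = {}
    for sec in sections:
        n_correct = len(sec["correct"])
        n_total = n_correct + len(sec["incorrect"])
        if n_total:  # avoid dividing by zero for sections with no questions
            summary[sec["section"]] = (n_correct, n_total, n_correct / n_total)
    return summary


# Hand-made stand-in with the same shape as the evaluation output.
sample = [
    {"section": "capital-common-countries",
     "correct": [],
     "incorrect": [("MADRID", "SPAIN", "ROME", "ITALY"),
                   ("ROME", "ITALY", "TOKYO", "JAPAN")]},
    {"section": "toy-section",  # hypothetical section name
     "correct": [("A", "B", "C", "D")],
     "incorrect": [("A", "B", "E", "F")]},
    {"section": "empty", "correct": [], "incorrect": []},
]

summary = summarize_accuracy(sample)
for name, (n_correct, n_total, acc) in summary.items():
    print(f"{name}: {n_correct}/{n_total} ({acc:.1%})")
```

In the notebook, the same function could be applied to the value returned by `wv.accuracy('questions-words.txt')`.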

### Benchmarking and testing gensim word2vec

Train the gensim model on the same dataset.

In [4]:
start = time.time()
corpus = Text8Corpus("text8")
model = Word2Vec(corpus, iter=3, hs=1, negative=0, batch_words=1000)
print("Gensim: ", time.time() - start)

Gensim:  37.43387174606323


Evaluate the gensim model on the same analogy questions.

In [5]:
model.wv.accuracy('questions-words.txt')

[{'correct': [('ATHENS', 'GREECE', 'BANGKOK', 'THAILAND'),
   ('ATHENS', 'GREECE', 'BERLIN', 'GERMANY'),
   ('ATHENS', 'GREECE', 'HAVANA', 'CUBA'),
   ('ATHENS', 'GREECE', 'HELSINKI', 'FINLAND'),
   ('ATHENS', 'GREECE', 'KABUL', 'AFGHANISTAN'),
   ('ATHENS', 'GREECE', 'MOSCOW', 'RUSSIA'),
   ('ATHENS', 'GREECE', 'OTTAWA', 'CANADA'),
   ('ATHENS', 'GREECE', 'PARIS', 'FRANCE'),
   ('ATHENS', 'GREECE', 'ROME', 'ITALY'),
   ('ATHENS', 'GREECE', 'STOCKHOLM', 'SWEDEN'),
   ('ATHENS', 'GREECE', 'TOKYO', 'JAPAN'),
   ('BAGHDAD', 'IRAQ', 'BERLIN', 'GERMANY'),
   ('BAGHDAD', 'IRAQ', 'HELSINKI', 'FINLAND'),
   ('BAGHDAD', 'IRAQ', 'KABUL', 'AFGHANISTAN'),
   ('BAGHDAD', 'IRAQ', 'MOSCOW', 'RUSSIA'),
   ('BAGHDAD', 'IRAQ', 'PARIS', 'FRANCE'),
   ('BAGHDAD', 'IRAQ', 'ATHENS', 'GREECE'),
   ('BANGKOK', 'THAILAND', 'BERLIN', 'GERMANY'),
   ('BANGKOK', 'THAILAND', 'CANBERRA', 'AUSTRALIA'),
   ('BANGKOK', 'THAILAND', 'MOSCOW', 'RUSSIA'),
   ('BANGKOK', 'THAILAND', 'PARIS', 'FRANCE'),
   ('BEIJING', 'CHIN