## Testing and benchmarking distributed word2vec
Distributed word2vec is built on TensorFlow. You can install TensorFlow by following the instructions [here](https://www.tensorflow.org/install/) (CPU-only support is enough).

We will train both versions of word2vec on the text8 dataset.

In [1]:
import numpy as np
import gensim
import os, time

from gensim.models.word2vec import Word2Vec, Text8Corpus
from gensim.models import KeyedVectors

In [2]:
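def eval(results):
    # Fraction of analogy questions answered correctly, summed over all sections.
    # Note: this name shadows Python's built-in eval; it is kept because later cells call it.
    correct = sum(len(i['correct']) for i in results)
    q_all = sum(len(i['correct']) + len(i['incorrect']) for i in results)
    return correct / q_all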
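
To make the computation concrete, here is a small sketch that applies the same formula to a hand-made results list. It assumes only the shape that `eval` relies on: a list of dicts, each with `'correct'` and `'incorrect'` lists. The function name `analogy_accuracy`, the section names, and the word tuples are made up for illustration, and the empty-input guard is an addition that `eval` does not have.

```python
def analogy_accuracy(results):
    """Same ratio as `eval` above, under a name that does not shadow the built-in."""
    correct = sum(len(section['correct']) for section in results)
    total = sum(len(section['correct']) + len(section['incorrect'])
                for section in results)
    # Guard against an empty result list (an addition, not in the original cell).
    return correct / total if total else 0.0

# Hypothetical evaluation output: 3 correct answers out of 5 questions.
toy_results = [
    {'section': 'toy-capitals',
     'correct': [('athens', 'greece', 'paris', 'france')],
     'incorrect': [('athens', 'greece', 'oslo', 'norway'),
                   ('athens', 'greece', 'rome', 'italy')]},
    {'section': 'toy-family',
     'correct': [('boy', 'girl', 'king', 'queen'),
                 ('boy', 'girl', 'man', 'woman')],
     'incorrect': []},
]

toy_accuracy = analogy_accuracy(toy_results)
print(toy_accuracy)
```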

### Benchmarking and testing TfWord2Vec model
#### Start Docker with distributed word2vec
The Docker containers consist of a *parameter server* (ps) and several *workers*. The ps stores and updates the model, while the workers perform the expensive computations.

**Interrupt the kernel** once all workers have exited. The final model will be saved to the file *"models/tfw2v_model"*.

Number of epochs: 3
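
The division of labour between the ps and the workers can be illustrated with a toy, single-process sketch. This is not the TensorFlow implementation used in the containers: the class `ParameterServer`, the function `worker_gradient`, and the quadratic objective are all assumptions chosen for illustration, and the workers run one after another here rather than in parallel.

```python
import numpy as np

class ParameterServer:
    """Holds the shared parameters and applies the updates that workers send."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch a copy of the current parameters.
        return self.weights.copy()

    def push(self, grad):
        # The server applies a plain gradient-descent step.
        self.weights -= self.lr * grad

def worker_gradient(weights, shard):
    """The 'expensive' part: gradient of mean ||w - x||^2 over this worker's shard."""
    return 2.0 * (weights - shard.mean(axis=0))

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=(300, 4))
shards = np.array_split(data, 3)   # one data shard per worker

ps = ParameterServer(dim=4)
for step in range(200):
    for shard in shards:           # workers take turns in this toy version
        grad = worker_gradient(ps.pull(), shard)
        ps.push(grad)

print(ps.weights)                  # ends up close to the mean of the data
```

Each worker only ever sees its own shard, yet the shared parameters settle near the mean of the full dataset because every update goes through the same server.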

In [3]:
!docker-compose up

Creating network "test_default" with the default driver
Creating test_worker3_1 ... 
Creating test_worker1_1 ... 
Creating test_ps1_1 ... 
Creating test_worker2_1 ... 
Creating test_worker3_1
Creating test_worker2_1
Creating test_ps1_1
Creating test_worker1_1
Attaching to test_worker2_1, test_worker3_1, test_worker1_1, test_ps1_1
worker2_1  | Found and verified text8.zip
worker3_1  | Found and verified text8.zip
worker1_1  | Found and verified text8.zip
ps1_1      | Found and verified text8.zip
worker2_1  | Count of vocabulary: 71290. Count of rare words: 182564. Data contains 17005207 words.
worker2_1  | 2017-10-04 11:58:26.566871: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ps1:2222}
worker2_1  | 2017-10-04 11:58:26.566920: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker1:2223, 1

[36mworker2_1  |[0m Epoch: 0. Task: 1. Average loss at step 4000: 13.242281940460206. Data index: 7572948. Time: 37.13486981391907
[33mworker3_1  |[0m Epoch: 0. Task: 2. Average loss at step 4000: 13.212259580612182. Data index: 13145896. Time: 37.16028904914856
[32mworker1_1  |[0m Epoch: 0. Task: 0. Average loss at step 9000: 11.905464023590088. Data index: 4500000. Time: 69.52610516548157
[36mworker2_1  |[0m Epoch: 0. Task: 1. Average loss at step 4500: 12.22135597229004. Data index: 7822948. Time: 41.834819078445435
[33mworker3_1  |[0m Epoch: 0. Task: 2. Average loss at step 4500: 11.865518158912659. Data index: 13395896. Time: 41.77668642997742
[32mworker1_1  |[0m Epoch: 0. Task: 0. Average loss at step 9500: 11.261154238700867. Data index: 4750000. Time: 74.14424729347229
[36mworker2_1  |[0m Epoch: 0. Task: 1. Average loss at step 5000: 10.597343655586243. Data index: 8072948. Time: 46.38004231452942
[33mworker3_1  |[0m Epoch: 0. Task: 2. Average loss at step 5000:

[32mworker1_1  |[0m Epoch: 1. Task: 0. Average loss at step 8000: 4.7942087035179135. Data index: 4000000. Time: 163.90802454948425
[36mworker2_1  |[0m Epoch: 1. Task: 1. Average loss at step 3500: 4.771048245429992. Data index: 7322948. Time: 135.71851205825806
[33mworker3_1  |[0m Epoch: 1. Task: 2. Average loss at step 3500: 4.579245968341827. Data index: 12895896. Time: 135.8993797302246
[32mworker1_1  |[0m Epoch: 1. Task: 0. Average loss at step 8500: 4.732057287693023. Data index: 4250000. Time: 168.5207817554474
[36mworker2_1  |[0m Epoch: 1. Task: 1. Average loss at step 4000: 4.852042723655701. Data index: 7572948. Time: 140.26530838012695
[33mworker3_1  |[0m Epoch: 1. Task: 2. Average loss at step 4000: 4.651783029556275. Data index: 13145896. Time: 140.58714652061462
[32mworker1_1  |[0m Epoch: 1. Task: 0. Average loss at step 9000: 4.723817521095276. Data index: 4500000. Time: 173.12556385993958
[36mworker2_1  |[0m Epoch: 1. Task: 1. Average loss at step 4500: 

[33mworker3_1  |[0m Epoch: 2. Task: 2. Average loss at step 2500: 4.361192439556122. Data index: 12395896. Time: 229.9477822780609
[32mworker1_1  |[0m Epoch: 2. Task: 0. Average loss at step 7500: 4.390030612945557. Data index: 3750000. Time: 262.87275767326355
[36mworker2_1  |[0m Epoch: 2. Task: 1. Average loss at step 3000: 4.371155314922333. Data index: 7072948. Time: 233.8590259552002
[33mworker3_1  |[0m Epoch: 2. Task: 2. Average loss at step 3000: 4.043632297039032. Data index: 12645896. Time: 234.61752796173096
[32mworker1_1  |[0m Epoch: 2. Task: 0. Average loss at step 8000: 4.401861928462982. Data index: 4000000. Time: 267.46152663230896
[36mworker2_1  |[0m Epoch: 2. Task: 1. Average loss at step 3500: 4.318969547271728. Data index: 7322948. Time: 238.44875597953796
[33mworker3_1  |[0m Epoch: 2. Task: 2. Average loss at step 3500: 4.154166238307953. Data index: 12895896. Time: 239.2692632675171
[32mworker1_1  |[0m Epoch: 2. Task: 0. Average loss at step 8500: 4

[36mworker2_1  |[0m Nearest to years: days, months, machines, insertions, decrease, year, bette, hours,
[36mworker2_1  |[0m Nearest to who: never, often, he, which, she, already, still, usually,
[36mworker2_1  |[0m Nearest to one: two, seven, four, five, three, six, eight, nine,
[36mworker2_1  |[0m Nearest to th: nine, six, seven, eight, rd, zs, revitalization, st,
[36mworker2_1  |[0m Nearest to time: period, way, alembic, case, form, evander, position, term,
[33mworker3_1  |[0m Nearest to two: three, four, five, six, seven, one, eight, nine,
[33mworker3_1  |[0m Nearest to between: with, within, in, breeders, against, through, among, amadeus,
[33mworker3_1  |[0m Nearest to from: through, during, in, after, despite, into, including, under,
[33mworker3_1  |[0m Nearest to or: and, than, but, while, though, markedly, although, hessen,
[33mworker3_1  |[0m Nearest to at: during, in, under, while, within, on, trinidad, through,
[33mworker3_1  |[0m Nearest to during: afte

#### Testing TfWord2Vec

In [4]:
tfwv = KeyedVectors.load("./models/tfw2v_model")
eval(tfwv.accuracy('questions-words.txt'))

0.004988958861535945

### Benchmarking and testing gensim word2vec

Train the gensim model on the same dataset.

In [5]:
start = time.time()
corpus = Text8Corpus("text8")
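# 3 epochs, matching the distributed run above; in gensim >= 4.0 the `iter` argument is named `epochs`.
model = Word2Vec(corpus, iter=3, sg=1, batch_words=1000)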
print("Gensim: ", time.time() - start)

Gensim:  83.62504625320435


Evaluate the gensim model on the same analogy questions.

In [6]:
eval(model.wv.accuracy('questions-words.txt'))

0.3103195304858168

### Visualization

In [7]:
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

t-SNE with 2 components (initialized with PCA) for the 1000 most frequent words

In [8]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, init='pca', n_iter=5000)
plot_only = 1000

In [9]:
def plot_with_labels(low_dim_embs, labels):
    trace = go.Scatter(x=low_dim_embs[:, 0], y=low_dim_embs[:, 1], text=labels, mode='markers')
    data = [trace]
    fig = go.Figure(data=data)
    iplot(fig, show_link=False)

TfWord2Vec:

In [10]:
low_dim_embs = tsne.fit_transform(tfwv.syn0[:plot_only, :])
labels = [tfwv.index2word[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)

Gensim word2vec:

In [11]:
low_dim_embs = tsne.fit_transform(model.wv.syn0[:plot_only, :])
labels = [model.wv.index2word[i] for i in range(plot_only)]
plot_with_labels(low_dim_embs, labels)