# Using and benchmarking the TensorFlow wrapper on the text8 corpus
[TensorFlow](https://www.tensorflow.org) is Google Brain's second-generation machine learning system,
with a reference implementation released as open source software on November 9, 2015. This tutorial compares the time taken to train gensim's word2vec implementation with the time taken by TensorFlow's.
Before getting started, you need to install [gensim](https://github.com/RaRe-Technologies/gensim/) and [set up TensorFlow](https://www.tensorflow.org/get_started/os_setup) to run with GPU support.
The wrapper offers the same functionality as gensim's conventional word2vec model.

# Training the model
We will train the model on the [Lee corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor).

In [1]:
%matplotlib inline 
import gensim
import os, time

from gensim.models.wrappers.tfword2vec import TfWord2Vec
from gensim.models.word2vec import Word2Vec, Text8Corpus

import matplotlib.pyplot as plt


Initialising the model trains it as well.


In [2]:
corpus = '../../gensim/test/test_data/lee.cor'
save_path = '.'
model = TfWord2Vec(corpus, epochs_to_train=1, embedding_size=100, batch_size=100000, save_path=save_path)
model.train()

Data file:  ../../gensim/test/test_data/lee.cor
Vocab size:  96  + UNK
Words per epoch:  3983
Data file:  ../../gensim/test/test_data/lee.cor
Vocab size:  96  + UNK
Words per epoch:  3983
Epoch  452 Step       12: lr = 0.000 words/sec =    83296

This model behaves similarly to gensim's word2vec model.

In [3]:
print(model)

<gensim.models.wrappers.tfword2vec.TfWord2Vec object at 0x7fbd49569790>


In [4]:
print(model.most_similar('President'))
print(model.similarity('President', 'military'))

[('Government', 0.2656361758708954), ('on', 0.23094132542610168), ('night', 0.20026281476020813), ('at', 0.19160115718841553), ('not', 0.16708612442016602), ('leader', 0.16425536572933197), ('of', 0.16081653535366058), ('he', 0.15998874604701996), ('it', 0.15337687730789185), ('were', 0.14755600690841675)]
-0.0752603531969
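
For gensim's Word2Vec, `similarity` is the cosine similarity between two word vectors, and `most_similar` ranks the rest of the vocabulary by that score. Assuming the wrapper follows the same convention, the computation can be sketched in plain Python. The function names mirror the calls above, but this is an illustration rather than the wrapper's implementation, and `toy_vectors` holds hypothetical values, not vectors from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy 2-d embeddings (hypothetical values for illustration only).
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(most_similar("a", toy_vectors, topn=2))
```

Scores close to zero or negative, like the `-0.075` printed above, mean the two vectors point in nearly unrelated or opposing directions, which is not surprising after a single epoch on a very small corpus.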


## Saving and loading the model

In [5]:
from tempfile import mkstemp

fs, temp_path = mkstemp("tfword2vec_temp")  # creates a temp file
print(temp_path)
model.save(temp_path)  # save the model


/tmp/tmpSKDNETtfword2vec_temp


PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed

In [None]:
model=TfWord2Vec.load_tf_model(temp_path)


In [None]:
model.similarity('President', 'military')
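
The temporary-file handling in the save cell can be tried on its own with the standard library. `mkstemp` takes the suffix as its first positional argument, which is why the printed path above ends in `tfword2vec_temp`, and it returns an already-open file descriptor together with the path. The sketch below writes a placeholder string where a model's save call would go; it does not touch the wrapper itself.

```python
import os
from tempfile import mkstemp

# mkstemp's first positional argument is the *suffix*, so the generated
# name ends with the given string rather than starting with it.
fd, temp_path = mkstemp(suffix="tfword2vec_temp")
os.close(fd)  # mkstemp returns an open OS-level descriptor; close it

try:
    # A model's save method would write to temp_path at this point.
    with open(temp_path, "w") as handle:
        handle.write("placeholder")
    path_ok = temp_path.endswith("tfword2vec_temp") and os.path.exists(temp_path)
    print(temp_path, path_ok)
finally:
    os.remove(temp_path)  # mkstemp never deletes the file for us
```

Closing the descriptor and removing the file explicitly avoids leaking handles and stray files when the cell is run repeatedly.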

# Gensim benchmark
These benchmarks were run on an Intel i7 6700K with hyper-threading, overclocked to 4.4GHz.

In [None]:
batch_size = []
time_taken = []
size = 100  # smallest batch size; must not exceed the loop bound below
while size <= 10000000:
    start = time.time()
    corpus = Text8Corpus("text8")
    model = Word2Vec(corpus, iter=1, hs=1, negative=0, batch_words=size)
    batch_size.append(size)
    time_taken.append(time.time() - start)
    print("Gensim:\n" + str(time_taken[-1]))
    size *= 10

# Benchmarking TensorFlow
Here we use gensim's TensorFlow wrapper to train the model.

We train for only one epoch, i.e. a single pass over the corpus.
Varying the batch size, i.e. the number of training examples each step processes, gives different results.

In [None]:
batch_size = []
time_taken = []
size = 100
while size<=10000000:
    start = time.time()
    model = TfWord2Vec("text8", epochs_to_train=1, embedding_size=100, batch_size=size, num_neg_samples=0)
    batch_size.append(size)
    time_taken.append(time.time()-start)
    print("\nTensorflow:" + str(time_taken[-1]))
    print("Batch size:" + str(size) + "\n")
    size *= 10
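
Both benchmark loops follow the same pattern: start at a small batch size, time one training run, then multiply the size by ten. A minimal sketch of that pattern as a reusable helper follows. The names `time_over_batch_sizes` and `dummy_train` are hypothetical, and `dummy_train` is only a stand-in so the block runs without gensim or TensorFlow installed.

```python
import time

def time_over_batch_sizes(train_fn, start=100, stop=10000000, factor=10):
    """Call train_fn(batch_size) for a geometric range of batch sizes.

    Returns two parallel lists: the batch sizes tried and the wall-clock
    seconds each call took.
    """
    batch_sizes, seconds = [], []
    size = start
    while size <= stop:
        began = time.time()
        train_fn(size)
        seconds.append(time.time() - began)
        batch_sizes.append(size)
        size *= factor
    return batch_sizes, seconds

def dummy_train(batch_size):
    """Hypothetical stand-in for a real training call."""
    sum(range(1000))

sizes, times = time_over_batch_sizes(dummy_train)
print(sizes)
```

In the notebook, `train_fn` would wrap one of the constructors used above, for example a function that builds `TfWord2Vec("text8", epochs_to_train=1, embedding_size=100, batch_size=size, num_neg_samples=0)`.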

In [None]:
plt.xscale('log')
plt.plot(batch_size, time_taken)
plt.title('Tensorflow benchmarks on NVIDIA GRID K520 (4x cluster)')
plt.ylabel('time taken in seconds')
plt.xlabel('batch size')
print(time_taken)
print(batch_size)

Note how the time taken decreases drastically with increasing batch size.
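
One way to put a number on that drop is to express each run as a speedup over the smallest batch size. The helper below is a sketch under the assumption that `time_taken` is ordered by increasing batch size, as in the loop above. The name `speedups` is hypothetical, and the example timings are made up for illustration; they are not benchmark results.

```python
def speedups(time_taken):
    """Return each run's speedup relative to the first (smallest-batch) run."""
    baseline = time_taken[0]
    return [baseline / t for t in time_taken]

# Made-up timings for illustration only; these are NOT benchmark results.
example_times = [8.0, 4.0, 2.0]
print(speedups(example_times))
```

Passing the notebook's own `time_taken` list instead of `example_times` gives the measured speedup for each batch size.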