# Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. So if two words have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in similar contexts, so we can map them to similar vectors that capture their meaning.
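To make the "company it keeps" idea concrete, here is a minimal sketch that collects the neighbors of a word in a tiny made-up corpus. The sentences and the helper `context_counts` are assumptions for illustration only; they are not part of Gensim or of the dataset used later.

```python
from collections import Counter

# Toy corpus, made up for illustration (not from the OpinRank data).
sentences = [
    "i was shocked by the dirty room".split(),
    "i was appalled by the dirty room".split(),
    "i was astonished by the rude staff".split(),
    "we walked to the hotel from the station".split(),
]

def context_counts(word, sentences, window=2):
    """Count the words that appear within `window` positions of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

shocked = context_counts("shocked", sentences)
appalled = context_counts("appalled", sentences)
hotel = context_counts("hotel", sentences)

# Words used in the same way share most of their neighbors.
shared_similar = set(shocked) & set(appalled)
shared_unrelated = set(shocked) & set(hotel)
print(sorted(shared_similar))
print(sorted(shared_unrelated))
```

In this toy corpus `shocked` and `appalled` share all of their neighbors, while `shocked` and `hotel` only share a very common word. Word2Vec learns vectors from this kind of signal rather than counting overlaps directly.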

In this tutorial, we will learn how to use the Gensim implementation of Word2Vec. The training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

### Imports and logging

First, we start with our imports and get logging established:

In [3]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


### Dataset 
Next is our dataset. The secret to Word2Vec is lots and lots of text data. We will use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset. This dataset has full user reviews of cars and hotels. I have concatenated all of the hotel reviews into one big file which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review.

While Gensim’s Word2Vec tutorial says that you need to pass it a sequence of sentences as input, you can pass a whole review as a sentence (i.e. a much larger chunk of text), and it does not make much difference.

Let's see what this dataset is really about by printing the first line:

In [4]:
data_file = "reviews_data.txt.gz"

# print the first review (raw bytes, since the file is opened in binary mode)
with gzip.open(data_file, 'rb') as f:
    for i, line in enumerate(f):
        print(line)
        break


b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

### Read files into a list
We can now read our dataset into a list so that we can pass it to the Word2Vec model. We read directly from the compressed file and lightly preprocess each line using `gensim.utils.simple_preprocess(line)`, which tokenizes the text (breaks it down into ordered words), lowercases it, and returns a list of tokens (words). Documentation for this preprocessing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html).
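To get a feel for what this kind of light preprocessing produces, here is a rough standard-library stand-in. The function `rough_preprocess`, its regular expression, and its `min_len`/`max_len` length limits are assumptions chosen for illustration; this is not Gensim's implementation, and the real `simple_preprocess` may differ in details.

```python
import re

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, drop very short or very long ones.

    A rough stand-in for illustration only; it is not Gensim's implementation.
    """
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="ignore")
    # runs of letters only: digits, punctuation and whitespace act as separators
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = b"Oct 12 2009 \tNice trendy hotel location not too bad."
tokens = rough_preprocess(sample)
print(tokens)
```

The date digits and punctuation disappear and everything is lowercased, which is the shape of input that Word2Vec expects: one list of word tokens per review.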



In [5]:

def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    
    logging.info("reading file {0}...this may take a while".format(input_file))
    
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info ("read {0} reviews".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess (line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list (read_input (data_file))
logging.info ("Done reading data file")    

2019-04-02 16:02:54,792 : INFO : reading file reviews_data.txt.gz...this may take a while
2019-04-02 16:02:54,794 : INFO : read 0 reviews
2019-04-02 16:02:56,843 : INFO : read 10000 reviews
2019-04-02 16:02:59,090 : INFO : read 20000 reviews
2019-04-02 16:03:01,593 : INFO : read 30000 reviews
2019-04-02 16:03:03,817 : INFO : read 40000 reviews
2019-04-02 16:03:06,386 : INFO : read 50000 reviews
2019-04-02 16:03:08,800 : INFO : read 60000 reviews
2019-04-02 16:03:10,828 : INFO : read 70000 reviews
2019-04-02 16:03:12,942 : INFO : read 80000 reviews
2019-04-02 16:03:14,949 : INFO : read 90000 reviews
2019-04-02 16:03:16,900 : INFO : read 100000 reviews
2019-04-02 16:03:18,908 : INFO : read 110000 reviews
2019-04-02 16:03:21,005 : INFO : read 120000 reviews
2019-04-02 16:03:23,123 : INFO : read 130000 reviews
2019-04-02 16:03:25,556 : INFO : read 140000 reviews
2019-04-02 16:03:27,718 : INFO : read 150000 reviews
2019-04-02 16:03:29,810 : INFO : read 160000 reviews
2019-04-02 16:03:31,812
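
Reading everything into a list keeps all tokenized reviews in memory. If that becomes a concern, one possible alternative is an object that re-reads the file on every pass, since training makes several passes over the data and a plain generator can only be consumed once. The sketch below is an assumption-laden illustration: the class name `ReviewCorpus`, the demo file, and the placeholder tokenizer are all hypothetical, and the demo uses only the standard library so it runs without the real dataset.

```python
import gzip
import os
import tempfile

class ReviewCorpus:
    """Yield one tokenized review per line, re-reading the file on every pass."""

    def __init__(self, path, tokenize):
        self.path = path
        self.tokenize = tokenize

    def __iter__(self):
        with gzip.open(self.path, "rt", encoding="utf-8", errors="ignore") as f:
            for line in f:
                yield self.tokenize(line)

# Tiny stand-in file so the sketch runs without the real dataset.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_reviews.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("Nice trendy hotel\nGreat location\n")

# Placeholder tokenizer; the tutorial itself uses gensim.utils.simple_preprocess.
corpus = ReviewCorpus(demo_path, tokenize=lambda line: line.lower().split())
first_pass = list(corpus)
second_pass = list(corpus)  # a plain generator would already be exhausted here
print(first_pass)
```

For a dataset of this size the plain list used in this tutorial is simpler and works fine.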

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass in the reviews that we read in the previous step (the `documents`). So, we are essentially passing in a list of lists, where each list within the main list contains the tokens from one user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.

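Because we pass `documents` to the constructor, Gensim builds the vocabulary and runs an initial round of training straight away (five epochs in the log below). The explicit call to `train(...)` then continues training for another 10 epochs. Training on the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset takes about 10 minutes, so please be patient while running your code on this dataset.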

Behind the scenes we are training a simple neural network with a single hidden layer. However, we are not going to use the neural network itself after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn.
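
Why the hidden-layer weights double as word vectors can be seen with a toy example: feeding a one-hot input for a word through the hidden layer simply picks out that word's row of the weight matrix. The vocabulary, the hand-picked weights, and the helper names below are assumptions for illustration; a trained model learns its weights from data.

```python
# Toy numbers chosen by hand; a trained model would learn these weights.
vocab = ["dirty", "filthy", "clean"]
hidden_weights = [
    [0.9, 0.1],   # row for "dirty"
    [0.8, 0.2],   # row for "filthy"
    [-0.7, 0.6],  # row for "clean"
]

def one_hot(word):
    """Vector with a 1.0 at the word's vocabulary index and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def hidden_layer(input_vector):
    """Multiply a 1 x V input by the V x N weight matrix."""
    n_dims = len(hidden_weights[0])
    return [
        sum(input_vector[i] * hidden_weights[i][d] for i in range(len(vocab)))
        for d in range(n_dims)
    ]

vector_for_filthy = hidden_layer(one_hot("filthy"))
print(vector_for_filthy)
```

The result is exactly the row for `filthy`, so after training we can discard the rest of the network and keep only this matrix as a lookup table of word vectors.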

In [6]:
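# note: in Gensim 4.0 and later, the `size` parameter is named `vector_size`
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10, sg=1)
# continue training for 10 more epochs on the same reviews
model.train(documents, total_examples=len(documents), epochs=10)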

2019-04-02 16:07:15,185 : INFO : collecting all words and their counts
2019-04-02 16:07:15,186 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-04-02 16:07:15,594 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2019-04-02 16:07:16,018 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2019-04-02 16:07:16,590 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2019-04-02 16:07:17,279 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2019-04-02 16:07:17,914 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2019-04-02 16:07:18,463 : INFO : PROGRESS: at sentence #60000, processed 11013726 words, keeping 76786 word types
2019-04-02 16:07:18,935 : INFO : PROGRESS: at sentence #70000, processed 12637528 words, keeping 83199 word types
2019-04-02 16:07:19,346 : INFO : PROG

2019-04-02 16:08:14,090 : INFO : EPOCH 1 - PROGRESS: at 37.71% examples, 320392 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:08:15,107 : INFO : EPOCH 1 - PROGRESS: at 38.81% examples, 319783 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:08:16,118 : INFO : EPOCH 1 - PROGRESS: at 39.89% examples, 319461 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:08:17,131 : INFO : EPOCH 1 - PROGRESS: at 41.03% examples, 319129 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:08:18,180 : INFO : EPOCH 1 - PROGRESS: at 42.30% examples, 319205 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:08:19,190 : INFO : EPOCH 1 - PROGRESS: at 43.36% examples, 318749 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:08:20,210 : INFO : EPOCH 1 - PROGRESS: at 44.52% examples, 318264 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:08:21,215 : INFO : EPOCH 1 - PROGRESS: at 45.72% examples, 318514 words/s, in_qsize 20, out_qsize 0
2019-04-02 16:08:22,230 : INFO : EPOCH 1 - PROGRESS: at 46.81% examples, 318515 words/s,

2019-04-02 16:09:18,447 : INFO : EPOCH 2 - PROGRESS: at 8.41% examples, 315827 words/s, in_qsize 20, out_qsize 0
2019-04-02 16:09:19,484 : INFO : EPOCH 2 - PROGRESS: at 9.45% examples, 319331 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:09:20,543 : INFO : EPOCH 2 - PROGRESS: at 10.38% examples, 321423 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:09:21,558 : INFO : EPOCH 2 - PROGRESS: at 11.36% examples, 323696 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:09:22,569 : INFO : EPOCH 2 - PROGRESS: at 12.21% examples, 325667 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:09:23,573 : INFO : EPOCH 2 - PROGRESS: at 13.29% examples, 327155 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:09:24,577 : INFO : EPOCH 2 - PROGRESS: at 14.30% examples, 329252 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:09:25,593 : INFO : EPOCH 2 - PROGRESS: at 15.38% examples, 330990 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:09:26,627 : INFO : EPOCH 2 - PROGRESS: at 16.27% examples, 329490 words/s, i

2019-04-02 16:10:31,856 : INFO : EPOCH 2 - PROGRESS: at 88.37% examples, 330554 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:10:32,865 : INFO : EPOCH 2 - PROGRESS: at 89.66% examples, 330891 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:10:33,878 : INFO : EPOCH 2 - PROGRESS: at 90.91% examples, 331118 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:10:34,892 : INFO : EPOCH 2 - PROGRESS: at 92.19% examples, 331418 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:10:35,904 : INFO : EPOCH 2 - PROGRESS: at 93.38% examples, 331869 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:10:36,929 : INFO : EPOCH 2 - PROGRESS: at 94.62% examples, 332036 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:10:37,936 : INFO : EPOCH 2 - PROGRESS: at 95.85% examples, 332346 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:10:38,939 : INFO : EPOCH 2 - PROGRESS: at 97.09% examples, 332636 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:10:39,941 : INFO : EPOCH 2 - PROGRESS: at 98.28% examples, 332771 words/s,

2019-04-02 16:11:36,264 : INFO : EPOCH 3 - PROGRESS: at 63.76% examples, 355880 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:11:37,331 : INFO : EPOCH 3 - PROGRESS: at 65.13% examples, 355785 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:11:38,361 : INFO : EPOCH 3 - PROGRESS: at 66.26% examples, 355768 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:11:39,371 : INFO : EPOCH 3 - PROGRESS: at 67.52% examples, 355866 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:11:40,387 : INFO : EPOCH 3 - PROGRESS: at 68.83% examples, 356226 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:11:41,394 : INFO : EPOCH 3 - PROGRESS: at 69.97% examples, 356356 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:11:42,407 : INFO : EPOCH 3 - PROGRESS: at 71.13% examples, 356473 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:11:43,415 : INFO : EPOCH 3 - PROGRESS: at 72.44% examples, 356729 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:11:44,450 : INFO : EPOCH 3 - PROGRESS: at 73.79% examples, 356689 words/s,

2019-04-02 16:12:40,685 : INFO : EPOCH 4 - PROGRESS: at 38.65% examples, 358235 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:12:41,719 : INFO : EPOCH 4 - PROGRESS: at 39.92% examples, 358161 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:12:42,728 : INFO : EPOCH 4 - PROGRESS: at 41.30% examples, 358353 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:12:43,729 : INFO : EPOCH 4 - PROGRESS: at 42.55% examples, 358183 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:12:44,738 : INFO : EPOCH 4 - PROGRESS: at 43.94% examples, 358546 words/s, in_qsize 20, out_qsize 0
2019-04-02 16:12:45,744 : INFO : EPOCH 4 - PROGRESS: at 45.33% examples, 358899 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:12:46,748 : INFO : EPOCH 4 - PROGRESS: at 46.60% examples, 359060 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:12:47,750 : INFO : EPOCH 4 - PROGRESS: at 47.81% examples, 359074 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:12:48,752 : INFO : EPOCH 4 - PROGRESS: at 49.12% examples, 359080 words/s,

2019-04-02 16:13:45,491 : INFO : EPOCH 5 - PROGRESS: at 14.80% examples, 321452 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:46,492 : INFO : EPOCH 5 - PROGRESS: at 15.80% examples, 322133 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:47,493 : INFO : EPOCH 5 - PROGRESS: at 16.71% examples, 322626 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:48,534 : INFO : EPOCH 5 - PROGRESS: at 17.62% examples, 322826 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:49,566 : INFO : EPOCH 5 - PROGRESS: at 18.63% examples, 324526 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:50,571 : INFO : EPOCH 5 - PROGRESS: at 19.52% examples, 326037 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:51,578 : INFO : EPOCH 5 - PROGRESS: at 20.47% examples, 327430 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:52,593 : INFO : EPOCH 5 - PROGRESS: at 21.67% examples, 328537 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:13:53,603 : INFO : EPOCH 5 - PROGRESS: at 22.60% examples, 329738 words/s,

2019-04-02 16:14:57,159 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-04-02 16:14:57,194 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-04-02 16:14:57,208 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-04-02 16:14:57,213 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-04-02 16:14:57,235 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-04-02 16:14:57,241 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-04-02 16:14:57,252 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-04-02 16:14:57,263 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-04-02 16:14:57,264 : INFO : EPOCH - 5 : training on 41519358 raw words (30348086 effective words) took 87.0s, 348725 effective words/s
2019-04-02 16:14:57,266 : INFO : training on a 207596790 raw words (151742912 effective words) took 441.1s, 343998 effective words/s

2019-04-02 16:16:01,378 : INFO : EPOCH 1 - PROGRESS: at 75.58% examples, 360738 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:16:02,422 : INFO : EPOCH 1 - PROGRESS: at 76.75% examples, 360692 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:16:03,468 : INFO : EPOCH 1 - PROGRESS: at 77.89% examples, 360525 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:16:04,483 : INFO : EPOCH 1 - PROGRESS: at 79.14% examples, 360759 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:16:05,490 : INFO : EPOCH 1 - PROGRESS: at 80.32% examples, 360735 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:16:06,490 : INFO : EPOCH 1 - PROGRESS: at 81.48% examples, 360814 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:16:07,510 : INFO : EPOCH 1 - PROGRESS: at 82.78% examples, 360909 words/s, in_qsize 20, out_qsize 0
2019-04-02 16:16:08,526 : INFO : EPOCH 1 - PROGRESS: at 83.97% examples, 360923 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:16:09,552 : INFO : EPOCH 1 - PROGRESS: at 85.11% examples, 360947 words/s,

2019-04-02 16:17:06,124 : INFO : EPOCH 2 - PROGRESS: at 50.74% examples, 353897 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:07,134 : INFO : EPOCH 2 - PROGRESS: at 51.97% examples, 354070 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:08,198 : INFO : EPOCH 2 - PROGRESS: at 53.13% examples, 354012 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:09,223 : INFO : EPOCH 2 - PROGRESS: at 54.35% examples, 353996 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:10,227 : INFO : EPOCH 2 - PROGRESS: at 55.68% examples, 354129 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:11,242 : INFO : EPOCH 2 - PROGRESS: at 56.93% examples, 354174 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:12,267 : INFO : EPOCH 2 - PROGRESS: at 58.17% examples, 354266 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:13,277 : INFO : EPOCH 2 - PROGRESS: at 59.47% examples, 354476 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:17:14,283 : INFO : EPOCH 2 - PROGRESS: at 60.75% examples, 354755 words/s,

2019-04-02 16:18:10,736 : INFO : EPOCH 3 - PROGRESS: at 25.94% examples, 361613 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:11,740 : INFO : EPOCH 3 - PROGRESS: at 27.37% examples, 362390 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:12,758 : INFO : EPOCH 3 - PROGRESS: at 28.71% examples, 362368 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:13,759 : INFO : EPOCH 3 - PROGRESS: at 29.99% examples, 362499 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:14,765 : INFO : EPOCH 3 - PROGRESS: at 31.37% examples, 362412 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:15,812 : INFO : EPOCH 3 - PROGRESS: at 32.66% examples, 361976 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:16,819 : INFO : EPOCH 3 - PROGRESS: at 33.84% examples, 361867 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:17,831 : INFO : EPOCH 3 - PROGRESS: at 35.11% examples, 361962 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:18:18,843 : INFO : EPOCH 3 - PROGRESS: at 36.45% examples, 361995 words/s,

2019-04-02 16:19:15,123 : INFO : EPOCH 4 - PROGRESS: at 5.97% examples, 360119 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:19:16,130 : INFO : EPOCH 4 - PROGRESS: at 7.20% examples, 362060 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:19:17,164 : INFO : EPOCH 4 - PROGRESS: at 8.38% examples, 361868 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:19:18,188 : INFO : EPOCH 4 - PROGRESS: at 9.45% examples, 361440 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:19:19,192 : INFO : EPOCH 4 - PROGRESS: at 10.38% examples, 361086 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:19:20,222 : INFO : EPOCH 4 - PROGRESS: at 11.38% examples, 359819 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:19:21,244 : INFO : EPOCH 4 - PROGRESS: at 12.30% examples, 360336 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:19:22,291 : INFO : EPOCH 4 - PROGRESS: at 13.44% examples, 359549 words/s, in_qsize 20, out_qsize 0
2019-04-02 16:19:23,293 : INFO : EPOCH 4 - PROGRESS: at 14.52% examples, 361101 words/s, in_

2019-04-02 16:20:28,195 : INFO : EPOCH 4 - PROGRESS: at 93.00% examples, 362268 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:20:29,211 : INFO : EPOCH 4 - PROGRESS: at 94.28% examples, 362282 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:20:30,231 : INFO : EPOCH 4 - PROGRESS: at 95.59% examples, 362357 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:20:31,249 : INFO : EPOCH 4 - PROGRESS: at 96.88% examples, 362606 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:20:32,299 : INFO : EPOCH 4 - PROGRESS: at 98.17% examples, 362521 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:20:33,307 : INFO : EPOCH 4 - PROGRESS: at 99.50% examples, 362646 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:20:33,523 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-04-02 16:20:33,544 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-04-02 16:20:33,557 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-04-02 16:20:33,571 : INFO : worker thr

2019-04-02 16:21:32,553 : INFO : EPOCH 5 - PROGRESS: at 69.75% examples, 362711 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:21:33,556 : INFO : EPOCH 5 - PROGRESS: at 70.89% examples, 362660 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:21:34,563 : INFO : EPOCH 5 - PROGRESS: at 72.10% examples, 362594 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:21:35,579 : INFO : EPOCH 5 - PROGRESS: at 73.45% examples, 362572 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:21:36,582 : INFO : EPOCH 5 - PROGRESS: at 74.66% examples, 362511 words/s, in_qsize 20, out_qsize 0
2019-04-02 16:21:37,614 : INFO : EPOCH 5 - PROGRESS: at 75.82% examples, 362421 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:21:38,622 : INFO : EPOCH 5 - PROGRESS: at 76.94% examples, 362341 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:21:39,624 : INFO : EPOCH 5 - PROGRESS: at 78.03% examples, 362273 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:21:40,637 : INFO : EPOCH 5 - PROGRESS: at 79.27% examples, 362280 words/s,

2019-04-02 16:22:36,777 : INFO : EPOCH 6 - PROGRESS: at 45.46% examples, 361713 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:22:37,784 : INFO : EPOCH 6 - PROGRESS: at 46.66% examples, 361609 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:22:38,788 : INFO : EPOCH 6 - PROGRESS: at 47.91% examples, 361719 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:22:39,822 : INFO : EPOCH 6 - PROGRESS: at 49.27% examples, 361723 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:22:40,889 : INFO : EPOCH 6 - PROGRESS: at 50.55% examples, 361457 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:22:41,924 : INFO : EPOCH 6 - PROGRESS: at 51.78% examples, 361268 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:22:42,938 : INFO : EPOCH 6 - PROGRESS: at 52.92% examples, 361289 words/s, in_qsize 20, out_qsize 1
2019-04-02 16:22:43,951 : INFO : EPOCH 6 - PROGRESS: at 54.20% examples, 361639 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:22:44,955 : INFO : EPOCH 6 - PROGRESS: at 55.57% examples, 361920 words/s,

2019-04-02 16:23:41,704 : INFO : EPOCH 7 - PROGRESS: at 21.57% examples, 360389 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:23:42,705 : INFO : EPOCH 7 - PROGRESS: at 22.53% examples, 360684 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:23:43,726 : INFO : EPOCH 7 - PROGRESS: at 23.49% examples, 360953 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:23:44,727 : INFO : EPOCH 7 - PROGRESS: at 24.41% examples, 360852 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:23:45,785 : INFO : EPOCH 7 - PROGRESS: at 25.72% examples, 360566 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:23:46,801 : INFO : EPOCH 7 - PROGRESS: at 27.11% examples, 360659 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:23:47,828 : INFO : EPOCH 7 - PROGRESS: at 28.46% examples, 360576 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:23:48,853 : INFO : EPOCH 7 - PROGRESS: at 29.77% examples, 360469 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:23:49,871 : INFO : EPOCH 7 - PROGRESS: at 31.13% examples, 360800 words/s,

2019-04-02 16:24:45,910 : INFO : EPOCH - 7 : training on 41519358 raw words (30351952 effective words) took 84.5s, 359233 effective words/s
2019-04-02 16:24:46,932 : INFO : EPOCH 8 - PROGRESS: at 0.89% examples, 292236 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:24:47,996 : INFO : EPOCH 8 - PROGRESS: at 2.08% examples, 316257 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:24:49,074 : INFO : EPOCH 8 - PROGRESS: at 3.28% examples, 322402 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:24:50,093 : INFO : EPOCH 8 - PROGRESS: at 4.47% examples, 330139 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:24:51,116 : INFO : EPOCH 8 - PROGRESS: at 5.62% examples, 333004 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:24:52,120 : INFO : EPOCH 8 - PROGRESS: at 6.64% examples, 330121 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:24:53,137 : INFO : EPOCH 8 - PROGRESS: at 7.70% examples, 330426 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:24:54,139 : INFO : EPOCH 8 - PROGRESS: at 8.73% example

2019-04-02 16:25:59,241 : INFO : EPOCH 8 - PROGRESS: at 81.61% examples, 340916 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:26:00,267 : INFO : EPOCH 8 - PROGRESS: at 82.88% examples, 341150 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:26:01,275 : INFO : EPOCH 8 - PROGRESS: at 84.10% examples, 341559 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:26:02,276 : INFO : EPOCH 8 - PROGRESS: at 85.19% examples, 341779 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:26:03,289 : INFO : EPOCH 8 - PROGRESS: at 86.47% examples, 342047 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:26:04,359 : INFO : EPOCH 8 - PROGRESS: at 87.88% examples, 342262 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:26:05,370 : INFO : EPOCH 8 - PROGRESS: at 89.20% examples, 342633 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:26:06,379 : INFO : EPOCH 8 - PROGRESS: at 90.47% examples, 342830 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:26:07,380 : INFO : EPOCH 8 - PROGRESS: at 91.76% examples, 343135 words/s,

2019-04-02 16:27:03,713 : INFO : EPOCH 9 - PROGRESS: at 57.35% examples, 356695 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:04,730 : INFO : EPOCH 9 - PROGRESS: at 58.59% examples, 356941 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:05,764 : INFO : EPOCH 9 - PROGRESS: at 59.90% examples, 357091 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:06,766 : INFO : EPOCH 9 - PROGRESS: at 61.20% examples, 357322 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:07,767 : INFO : EPOCH 9 - PROGRESS: at 62.39% examples, 357270 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:08,774 : INFO : EPOCH 9 - PROGRESS: at 63.75% examples, 357286 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:09,774 : INFO : EPOCH 9 - PROGRESS: at 65.11% examples, 357461 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:10,784 : INFO : EPOCH 9 - PROGRESS: at 66.23% examples, 357424 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:27:11,825 : INFO : EPOCH 9 - PROGRESS: at 67.48% examples, 357306 words/s,

2019-04-02 16:28:08,378 : INFO : EPOCH 10 - PROGRESS: at 30.70% examples, 357603 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:28:09,404 : INFO : EPOCH 10 - PROGRESS: at 32.07% examples, 357631 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:28:10,405 : INFO : EPOCH 10 - PROGRESS: at 33.38% examples, 357933 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:28:11,451 : INFO : EPOCH 10 - PROGRESS: at 34.55% examples, 357332 words/s, in_qsize 18, out_qsize 1
2019-04-02 16:28:12,473 : INFO : EPOCH 10 - PROGRESS: at 35.78% examples, 357219 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:28:13,474 : INFO : EPOCH 10 - PROGRESS: at 37.05% examples, 357293 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:28:14,486 : INFO : EPOCH 10 - PROGRESS: at 38.34% examples, 357200 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:28:15,497 : INFO : EPOCH 10 - PROGRESS: at 39.60% examples, 357208 words/s, in_qsize 19, out_qsize 0
2019-04-02 16:28:16,526 : INFO : EPOCH 10 - PROGRESS: at 40.89% examples, 357005

(303503457, 415193580)

## Now, let's look at some output 
This first example shows a simple case of looking up words similar to the word `dirty`. All we need to do here is call the `most_similar` function and provide the word `dirty` as the positive example. This returns the top 10 similar words using cosine similarity (which is exactly what it sounds like):

In [8]:

w1 = "dirty"
model.wv.most_similar (positive=w1)


[('filthy', 0.8858400583267212),
 ('unclean', 0.8308728337287903),
 ('smelly', 0.8303340077400208),
 ('stained', 0.8210866451263428),
 ('dingy', 0.8048659563064575),
 ('dusty', 0.7949382066726685),
 ('grubby', 0.7872313857078552),
 ('disgusting', 0.7742273211479187),
 ('gross', 0.7593685388565063),
 ('moldy', 0.7495971322059631)]

Let's look at similarity for `polite`, `france` and `shocked`. 

In [9]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar (positive=w1,topn=6)


[('courteous', 0.9150029420852661),
 ('professional', 0.8537306785583496),
 ('attentive', 0.8418415188789368),
 ('friendly', 0.8116524815559387),
 ('helpful', 0.8044909238815308),
 ('efficient', 0.8029367327690125)]

In [10]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar (positive=w1,topn=6)


[('germany', 0.6528472304344177),
 ('england', 0.6416741013526917),
 ('europe', 0.6261036992073059),
 ('paris', 0.620964527130127),
 ('gaulle', 0.5984228849411011),
 ('spain', 0.5930178761482239)]

In [11]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar (positive=w1,topn=6)


[('horrified', 0.7343902587890625),
 ('surprised', 0.7182973027229309),
 ('amazed', 0.7052606344223022),
 ('dismayed', 0.7013192772865295),
 ('appalled', 0.6982929706573486),
 ('astonished', 0.6873292922973633)]

We can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered related. In the example below we ask for items that *relate to bed, sheet, and pillow* but not to couch:

In [12]:
# get everything related to stuff on the bed
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar (positive=w1,negative=w2,topn=10)


[('duvet', 0.7795587182044983),
 ('sheets', 0.7652061581611633),
 ('comforter', 0.7430680990219116),
 ('pillows', 0.7363089323043823),
 ('blanket', 0.7252493500709534),
 ('quilt', 0.715475857257843),
 ('feather', 0.7088570594787598),
 ('mattress', 0.7060394287109375),
 ('undersheet', 0.6786792278289795),
 ('matress', 0.6667504906654358)]
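
One way to picture a query with positive and negative examples is as a single combined vector: add the length-normalized positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The sketch below follows that idea on hand-made 3-dimensional vectors. The vectors and the function `toy_most_similar` are assumptions for illustration and are not Gensim's code.

```python
import math

# Hand-made vectors for illustration; real vectors come from the trained model.
toy_vectors = {
    "bed":    [0.9, 0.1, 0.0],
    "sheet":  [0.8, 0.2, 0.1],
    "pillow": [0.85, 0.15, 0.05],
    "couch":  [0.3, 0.9, 0.0],
    "duvet":  [0.9, 0.05, 0.1],
    "sofa":   [0.25, 0.95, 0.05],
    "shower": [0.0, 0.1, 0.95],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(positive, negative, topn=3):
    # positive words count with weight +1, negative words with weight -1
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    dims = len(next(iter(toy_vectors.values())))
    query = [0.0] * dims
    for word, sign in signed:
        for d, x in enumerate(unit(toy_vectors[word])):
            query[d] += sign * x / len(signed)
    query = unit(query)
    # the query words themselves are left out of the ranking
    exclude = set(positive) | set(negative)
    scores = {
        w: sum(a * b for a, b in zip(query, unit(v)))
        for w, v in toy_vectors.items() if w not in exclude
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

ranked = toy_most_similar(positive=["bed", "sheet", "pillow"], negative=["couch"])
print(ranked)
```

With these toy vectors `duvet` ranks first, while `sofa` is pushed down because it points in nearly the same direction as the negative example `couch`.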

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [13]:
# similarity between two different words
model.wv.similarity(w1="dirty",w2="couch")

0.30352983

In [14]:
# similarity between two different but contextually similar words
model.wv.similarity(w1="dirty",w2="smelly")

0.8303339

In [15]:
# similarity between two negatively correlated words
model.wv.similarity(w1="dirty",w2="clean")

0.33860257

In [16]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

1.0

In [17]:
# similarity between two opposite verbs
model.wv.similarity(w1='break', w2='fix')

0.15821667

In [18]:
# similarity between two opposite descriptors
model.wv.similarity(w1='broke', w2='fixed')

0.5597079

Under the hood, the snippets above compute the cosine similarity between the two specified words using their word vectors. From the scores, it makes sense that `dirty` is highly similar to `smelly` but far less similar to `clean`. If you compare two identical words, the score will be 1.0, which is the maximum, since cosine similarity always falls in the range [-1.0, 1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).

The reason to use cosine similarity rather than, say, Euclidean distance is that we don't really care about the overall length of the vectors, only the angle between them, since the angle is what indicates how similar they are.
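
The difference between the two measures can be checked with a few lines of plain Python. The helper functions and the example vectors below are assumptions for illustration; they are not part of Gensim.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [1.0, 2.0, 3.0]
longer = [10.0, 20.0, 30.0]    # same direction, ten times the length
opposite = [-1.0, -2.0, -3.0]  # exactly the opposite direction

same_direction = cosine_similarity(v, longer)
opposite_direction = cosine_similarity(v, opposite)
distance = euclidean_distance(v, longer)
print(same_direction, opposite_direction, distance)
```

The two vectors that point the same way get a cosine similarity of about 1 even though they are far apart in Euclidean terms, and vectors that point in opposite directions get about -1.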

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [19]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])


'france'

In [20]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])


'shower'
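
A simple way to think about the odd one out is: average the length-normalized vectors of all the words, then pick the word whose vector is least similar to that average. The sketch below applies this to hand-made 2-dimensional vectors. The vectors and the function `toy_doesnt_match` are assumptions for illustration, not Gensim's code.

```python
import math

# Hand-made vectors for illustration; real vectors come from the trained model.
toy_vectors = {
    "cat":    [0.9, 0.1],
    "dog":    [0.85, 0.2],
    "france": [0.1, 0.95],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_doesnt_match(words):
    units = {w: unit(toy_vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    # average direction of the whole group
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    # cosine similarity of each word to the group average
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

odd_one = toy_doesnt_match(["cat", "dog", "france"])
print(odd_one)
```

Because `cat` and `dog` point in nearly the same direction, they dominate the average, and `france` ends up furthest from it.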

## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10, sg=1)
```

### `size`
The size of the dense vector used to represent each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me.

### `window`
The maximum distance between the target word and a neighboring word. Words that sit further from the target than the window width, to the left or to the right, are not considered related to the target word. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
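
The effect of the window can be sketched by listing which (target, neighbor) pairs fall inside it. The helper `context_pairs` and the example sentence are assumptions for illustration. The sketch uses a fixed window and ignores any sampling tricks a real implementation may apply.

```python
def context_pairs(tokens, window):
    """Return (target, neighbor) pairs with neighbors at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = ["the", "room", "was", "very", "dirty"]
narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=10)
print(len(narrow), len(wide))
```

With `window=1` only adjacent words are paired, so `the` and `dirty` never meet. With `window=10` every word in this short sentence is paired with every other word.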

### `min_count`
Minimum frequency count of words. The model ignores words that occur fewer than `min_count` times. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
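
The pruning itself amounts to counting tokens and keeping the ones that occur often enough. The sketch below shows this on a few made-up documents. The helper `build_vocab` and the data are assumptions for illustration, not Gensim's internals.

```python
from collections import Counter

# Made-up tokenized reviews; "matress" is a deliberate one-off misspelling.
documents_demo = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
    ["clean", "sheets", "matress"],
]

def build_vocab(documents, min_count):
    """Keep only the words that occur at least `min_count` times overall."""
    counts = Counter(token for doc in documents for token in doc)
    return {word for word, n in counts.items() if n >= min_count}

vocab_all = build_vocab(documents_demo, min_count=1)
vocab_pruned = build_vocab(documents_demo, min_count=2)
print(sorted(vocab_pruned))
```

Raising the threshold from 1 to 2 drops every word that was seen only once, including the misspelling, which would otherwise get a vector learned from a single occurrence.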

### `workers`
The number of worker threads used for training behind the scenes.

### `sg`
Set `sg` to 1 to use the skip-gram model and to 0 to use CBOW (continuous bag of words). By default, `sg` is set to 0.
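
Conceptually, the two settings differ in how a window of text is turned into prediction tasks: skip-gram uses the target word to predict each neighbor separately, while CBOW uses the neighbors together to predict the target. The sketch below only shows that difference in the shape of the training examples. The function `training_examples` is an assumption for illustration and does not reflect how Gensim organizes its training internally.

```python
def training_examples(tokens, window, sg):
    """Build conceptual (input, output) examples for skip-gram (sg=1) or CBOW (sg=0)."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if sg == 1:
            # skip-gram: the target word predicts each neighbor separately
            examples.extend((target, neighbor) for neighbor in context)
        else:
            # CBOW: the neighbors together predict the target word
            examples.append((tuple(context), target))
    return examples

tokens = ["room", "was", "dirty"]
skip_gram = training_examples(tokens, window=1, sg=1)
cbow = training_examples(tokens, window=1, sg=0)
print(skip_gram)
print(cbow)
```

Skip-gram produces more, smaller examples from the same text, which is one reason it tends to be slower to train than CBOW.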

## When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that, and you end up with a lexicon not just for sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.
