# Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making and assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying *show me your friends, and I'll tell who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`,`appalled` and `astonished` are typically used in a similar context. 

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance etc, but its really a combination of two things, (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

### Imports and logging

First, we start with our imports and get logging established:

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [None]:
import pandas
from scipy import spatial


In [46]:
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)  


2018-10-26 17:40:05,742 : INFO : loading projection weights from ./GoogleNews-vectors-negative300.bin
2018-10-26 17:40:57,022 : INFO : loaded (3000000, 300) matrix from ./GoogleNews-vectors-negative300.bin


### Dataset 
Next, is our dataset. The secret to getting Word2Vec really working for you is to have lots and lots of text data. In this case I am going to use data from the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset. This dataset has full user reviews of cars and hotels. I have specifically concatenated all of the hotel reviews into one big file which is about 97MB compressed and 229MB uncompressed. We will use the compressed file for this tutorial. Each line in this file represents a hotel review. You can download the OpinRank Word2Vec dataset here.

To avoid confusion, while gensim’s word2vec tutorial says that you need to pass it a sequence of sentences as its input, you can always pass it a whole review as a sentence (i.e. a much larger size of text), and it should not make much of a difference. 

Now, let's take a closer look at this data below by printing the first line. You can see that this is a pretty hefty review.

In [2]:
data_file="reviews_data.txt.gz"

with gzip.open ('reviews_data.txt.gz', 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break


b"Oct 12 2009 \tNice trendy hotel location not too bad.\tI stayed in this hotel for one night. As this is a fairly new place some of the taxi drivers did not know where it was and/or did not want to drive there. Once I have eventually arrived at the hotel, I was very pleasantly surprised with the decor of the lobby/ground floor area. It was very stylish and modern. I found the reception's staff geeting me with 'Aloha' a bit out of place, but I guess they are briefed to say that to keep up the coroporate image.As I have a Starwood Preferred Guest member, I was given a small gift upon-check in. It was only a couple of fridge magnets in a gift box, but nevertheless a nice gesture.My room was nice and roomy, there are tea and coffee facilities in each room and you get two complimentary bottles of water plus some toiletries by 'bliss'.The location is not great. It is at the last metro stop and you then need to take a taxi, but if you are not planning on going to see the historic sites in Be

### Read files into a list
Now that we've had a sneak peak of our dataset, we can read it into a list so that we can pass this on to the Word2Vec model. Notice in the code below, that I am directly reading the 
compressed file. I'm also doing a mild pre-processing of the reviews using `gensim.utils.simple_preprocess (line)`. This does some basic pre-processing such as tokenization, lowercasing, etc and returns back a list of tokens (words). Documentation of this pre-processing method can be found on the official [Gensim documentation site](https://radimrehurek.com/gensim/utils.html). 



In [3]:

def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    
    logging.info("reading file {0}...this may take a while".format(input_file))
    
    with gzip.open (input_file, 'rb') as f:
        for i, line in enumerate (f): 

            if (i%10000==0):
                logging.info ("read {0} reviews".format (i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess (line)

# read the tokenized reviews into a list
# each review item becomes a serries of words
# so this becomes a list of lists
documents = list (read_input (data_file))
logging.info ("Done reading data file")    

2018-10-26 15:51:41,163 : INFO : reading file reviews_data.txt.gz...this may take a while
2018-10-26 15:51:41,166 : INFO : read 0 reviews
2018-10-26 15:51:43,243 : INFO : read 10000 reviews
2018-10-26 15:51:45,116 : INFO : read 20000 reviews
2018-10-26 15:51:47,250 : INFO : read 30000 reviews
2018-10-26 15:51:49,262 : INFO : read 40000 reviews
2018-10-26 15:51:51,373 : INFO : read 50000 reviews
2018-10-26 15:51:53,436 : INFO : read 60000 reviews
2018-10-26 15:51:55,167 : INFO : read 70000 reviews
2018-10-26 15:51:57,148 : INFO : read 80000 reviews
2018-10-26 15:51:58,802 : INFO : read 90000 reviews
2018-10-26 15:52:00,421 : INFO : read 100000 reviews
2018-10-26 15:52:02,049 : INFO : read 110000 reviews
2018-10-26 15:52:03,738 : INFO : read 120000 reviews
2018-10-26 15:52:05,437 : INFO : read 130000 reviews
2018-10-26 15:52:07,216 : INFO : read 140000 reviews
2018-10-26 15:52:08,883 : INFO : read 150000 reviews
2018-10-26 15:52:10,540 : INFO : read 160000 reviews
2018-10-26 15:52:12,897

## Training the Word2Vec model

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step (the `documents`). So, we are essentially passing on a list of lists. Where each list within the main list contains a set of tokens from a user review. Word2Vec uses all these tokens to internally create a vocabulary. And by vocabulary, I mean a set of unique words.

After building the vocabulary, we just need to call `train(...)` to start training the Word2Vec model. Training on the [OpinRank](http://kavita-ganesan.com/entity-ranking-data/) dataset takes about 10 minutes so please be patient while running your code on this dataset.

Behind the scenes we are actually training a simple neural network with a single hidden layer. But, we are actually not going to use the neural network after training. Instead, the goal is to learn the weights of the hidden layer. These weights are essentially the word vectors that we’re trying to learn. 

In [5]:
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=40)
model.train(documents,total_examples=len(documents),epochs=10)

2018-10-26 15:53:01,692 : INFO : collecting all words and their counts
2018-10-26 15:53:01,693 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-10-26 15:53:01,966 : INFO : PROGRESS: at sentence #10000, processed 1655714 words, keeping 25777 word types
2018-10-26 15:53:02,216 : INFO : PROGRESS: at sentence #20000, processed 3317863 words, keeping 35016 word types
2018-10-26 15:53:02,532 : INFO : PROGRESS: at sentence #30000, processed 5264072 words, keeping 47518 word types
2018-10-26 15:53:02,843 : INFO : PROGRESS: at sentence #40000, processed 7081746 words, keeping 56675 word types
2018-10-26 15:53:03,197 : INFO : PROGRESS: at sentence #50000, processed 9089491 words, keeping 63744 word types
2018-10-26 15:53:03,585 : INFO : PROGRESS: at sentence #60000, processed 11013723 words, keeping 76781 word types
2018-10-26 15:53:03,857 : INFO : PROGRESS: at sentence #70000, processed 12637525 words, keeping 83194 word types
2018-10-26 15:53:04,113 : INFO : PROG

2018-10-26 15:53:33,927 : INFO : worker thread finished; awaiting finish of 22 more threads
2018-10-26 15:53:33,928 : INFO : worker thread finished; awaiting finish of 21 more threads
2018-10-26 15:53:33,938 : INFO : worker thread finished; awaiting finish of 20 more threads
2018-10-26 15:53:33,940 : INFO : worker thread finished; awaiting finish of 19 more threads
2018-10-26 15:53:33,941 : INFO : worker thread finished; awaiting finish of 18 more threads
2018-10-26 15:53:33,942 : INFO : worker thread finished; awaiting finish of 17 more threads
2018-10-26 15:53:33,943 : INFO : worker thread finished; awaiting finish of 16 more threads
2018-10-26 15:53:33,944 : INFO : worker thread finished; awaiting finish of 15 more threads
2018-10-26 15:53:33,947 : INFO : worker thread finished; awaiting finish of 14 more threads
2018-10-26 15:53:33,948 : INFO : worker thread finished; awaiting finish of 13 more threads
2018-10-26 15:53:33,949 : INFO : worker thread finished; awaiting finish of 12 m

2018-10-26 15:53:57,515 : INFO : worker thread finished; awaiting finish of 3 more threads
2018-10-26 15:53:57,516 : INFO : worker thread finished; awaiting finish of 2 more threads
2018-10-26 15:53:57,517 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-26 15:53:57,519 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-26 15:53:57,520 : INFO : EPOCH - 2 : training on 41519355 raw words (30348463 effective words) took 23.5s, 1289614 effective words/s
2018-10-26 15:53:58,550 : INFO : EPOCH 3 - PROGRESS: at 3.73% examples, 1141478 words/s, in_qsize 78, out_qsize 1
2018-10-26 15:53:59,554 : INFO : EPOCH 3 - PROGRESS: at 7.78% examples, 1198212 words/s, in_qsize 80, out_qsize 0
2018-10-26 15:54:00,555 : INFO : EPOCH 3 - PROGRESS: at 11.50% examples, 1239091 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:01,560 : INFO : EPOCH 3 - PROGRESS: at 14.89% examples, 1228449 words/s, in_qsize 80, out_qsize 0
2018-10-26 15:54:02,560 : INFO : EPOC

2018-10-26 15:54:33,169 : INFO : EPOCH 4 - PROGRESS: at 49.89% examples, 1287405 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:34,178 : INFO : EPOCH 4 - PROGRESS: at 53.85% examples, 1280461 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:35,179 : INFO : EPOCH 4 - PROGRESS: at 58.28% examples, 1279292 words/s, in_qsize 78, out_qsize 1
2018-10-26 15:54:36,185 : INFO : EPOCH 4 - PROGRESS: at 62.94% examples, 1283157 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:37,186 : INFO : EPOCH 4 - PROGRESS: at 67.40% examples, 1282708 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:38,197 : INFO : EPOCH 4 - PROGRESS: at 71.76% examples, 1285315 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:39,200 : INFO : EPOCH 4 - PROGRESS: at 76.09% examples, 1284915 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:40,205 : INFO : EPOCH 4 - PROGRESS: at 80.39% examples, 1287742 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:54:41,216 : INFO : EPOCH 4 - PROGRESS: at 84.69% examples, 1288494

2018-10-26 15:55:07,965 : INFO : worker thread finished; awaiting finish of 35 more threads
2018-10-26 15:55:07,975 : INFO : worker thread finished; awaiting finish of 34 more threads
2018-10-26 15:55:07,982 : INFO : worker thread finished; awaiting finish of 33 more threads
2018-10-26 15:55:07,990 : INFO : worker thread finished; awaiting finish of 32 more threads
2018-10-26 15:55:07,993 : INFO : worker thread finished; awaiting finish of 31 more threads
2018-10-26 15:55:08,000 : INFO : worker thread finished; awaiting finish of 30 more threads
2018-10-26 15:55:08,003 : INFO : worker thread finished; awaiting finish of 29 more threads
2018-10-26 15:55:08,005 : INFO : worker thread finished; awaiting finish of 28 more threads
2018-10-26 15:55:08,008 : INFO : worker thread finished; awaiting finish of 27 more threads
2018-10-26 15:55:08,009 : INFO : worker thread finished; awaiting finish of 26 more threads
2018-10-26 15:55:08,026 : INFO : worker thread finished; awaiting finish of 25 m

2018-10-26 15:55:31,611 : INFO : worker thread finished; awaiting finish of 20 more threads
2018-10-26 15:55:31,613 : INFO : worker thread finished; awaiting finish of 19 more threads
2018-10-26 15:55:31,614 : INFO : worker thread finished; awaiting finish of 18 more threads
2018-10-26 15:55:31,615 : INFO : worker thread finished; awaiting finish of 17 more threads
2018-10-26 15:55:31,616 : INFO : worker thread finished; awaiting finish of 16 more threads
2018-10-26 15:55:31,617 : INFO : worker thread finished; awaiting finish of 15 more threads
2018-10-26 15:55:31,617 : INFO : worker thread finished; awaiting finish of 14 more threads
2018-10-26 15:55:31,618 : INFO : worker thread finished; awaiting finish of 13 more threads
2018-10-26 15:55:31,619 : INFO : worker thread finished; awaiting finish of 12 more threads
2018-10-26 15:55:31,620 : INFO : worker thread finished; awaiting finish of 11 more threads
2018-10-26 15:55:31,623 : INFO : worker thread finished; awaiting finish of 10 m

2018-10-26 15:55:55,301 : INFO : worker thread finished; awaiting finish of 1 more threads
2018-10-26 15:55:55,302 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-10-26 15:55:55,302 : INFO : EPOCH - 2 : training on 41519355 raw words (30346980 effective words) took 23.6s, 1283805 effective words/s
2018-10-26 15:55:56,325 : INFO : EPOCH 3 - PROGRESS: at 3.61% examples, 1105813 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:55:57,326 : INFO : EPOCH 3 - PROGRESS: at 7.92% examples, 1225118 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:55:58,337 : INFO : EPOCH 3 - PROGRESS: at 11.47% examples, 1238730 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:55:59,339 : INFO : EPOCH 3 - PROGRESS: at 15.45% examples, 1270083 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:56:00,347 : INFO : EPOCH 3 - PROGRESS: at 18.76% examples, 1257746 words/s, in_qsize 80, out_qsize 1
2018-10-26 15:56:01,354 : INFO : EPOCH 3 - PROGRESS: at 22.63% examples, 1276754 words/s, in_qsize 80, o

2018-10-26 15:56:32,865 : INFO : EPOCH 4 - PROGRESS: at 57.74% examples, 1265065 words/s, in_qsize 78, out_qsize 1
2018-10-26 15:56:33,867 : INFO : EPOCH 4 - PROGRESS: at 62.11% examples, 1265566 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:56:34,874 : INFO : EPOCH 4 - PROGRESS: at 66.63% examples, 1266307 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:56:35,878 : INFO : EPOCH 4 - PROGRESS: at 70.76% examples, 1265209 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:56:36,880 : INFO : EPOCH 4 - PROGRESS: at 75.32% examples, 1269291 words/s, in_qsize 80, out_qsize 0
2018-10-26 15:56:37,892 : INFO : EPOCH 4 - PROGRESS: at 79.40% examples, 1269884 words/s, in_qsize 80, out_qsize 1
2018-10-26 15:56:38,896 : INFO : EPOCH 4 - PROGRESS: at 83.73% examples, 1271240 words/s, in_qsize 78, out_qsize 1
2018-10-26 15:56:39,898 : INFO : EPOCH 4 - PROGRESS: at 87.98% examples, 1270522 words/s, in_qsize 79, out_qsize 1
2018-10-26 15:56:40,913 : INFO : EPOCH 4 - PROGRESS: at 92.64% examples, 1271979

2018-10-26 15:57:06,044 : INFO : worker thread finished; awaiting finish of 32 more threads
2018-10-26 15:57:06,054 : INFO : worker thread finished; awaiting finish of 31 more threads
2018-10-26 15:57:06,071 : INFO : worker thread finished; awaiting finish of 30 more threads
2018-10-26 15:57:06,073 : INFO : worker thread finished; awaiting finish of 29 more threads
2018-10-26 15:57:06,077 : INFO : worker thread finished; awaiting finish of 28 more threads
2018-10-26 15:57:06,086 : INFO : worker thread finished; awaiting finish of 27 more threads
2018-10-26 15:57:06,096 : INFO : worker thread finished; awaiting finish of 26 more threads
2018-10-26 15:57:06,099 : INFO : worker thread finished; awaiting finish of 25 more threads
2018-10-26 15:57:06,105 : INFO : worker thread finished; awaiting finish of 24 more threads
2018-10-26 15:57:06,107 : INFO : worker thread finished; awaiting finish of 23 more threads
2018-10-26 15:57:06,108 : INFO : worker thread finished; awaiting finish of 22 m

2018-10-26 15:57:29,977 : INFO : worker thread finished; awaiting finish of 13 more threads
2018-10-26 15:57:29,983 : INFO : worker thread finished; awaiting finish of 12 more threads
2018-10-26 15:57:29,987 : INFO : worker thread finished; awaiting finish of 11 more threads
2018-10-26 15:57:29,988 : INFO : worker thread finished; awaiting finish of 10 more threads
2018-10-26 15:57:29,989 : INFO : worker thread finished; awaiting finish of 9 more threads
2018-10-26 15:57:29,992 : INFO : worker thread finished; awaiting finish of 8 more threads
2018-10-26 15:57:29,993 : INFO : worker thread finished; awaiting finish of 7 more threads
2018-10-26 15:57:29,994 : INFO : worker thread finished; awaiting finish of 6 more threads
2018-10-26 15:57:29,995 : INFO : worker thread finished; awaiting finish of 5 more threads
2018-10-26 15:57:29,996 : INFO : worker thread finished; awaiting finish of 4 more threads
2018-10-26 15:57:29,997 : INFO : worker thread finished; awaiting finish of 3 more thr

2018-10-26 15:57:57,865 : INFO : EPOCH 8 - PROGRESS: at 15.34% examples, 1260397 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:57:58,876 : INFO : EPOCH 8 - PROGRESS: at 19.01% examples, 1274645 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:57:59,894 : INFO : EPOCH 8 - PROGRESS: at 22.74% examples, 1278046 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:58:00,899 : INFO : EPOCH 8 - PROGRESS: at 26.46% examples, 1271575 words/s, in_qsize 80, out_qsize 0
2018-10-26 15:58:01,906 : INFO : EPOCH 8 - PROGRESS: at 31.27% examples, 1275822 words/s, in_qsize 80, out_qsize 1
2018-10-26 15:58:02,907 : INFO : EPOCH 8 - PROGRESS: at 35.47% examples, 1268797 words/s, in_qsize 78, out_qsize 0
2018-10-26 15:58:03,916 : INFO : EPOCH 8 - PROGRESS: at 40.02% examples, 1269667 words/s, in_qsize 80, out_qsize 0
2018-10-26 15:58:04,919 : INFO : EPOCH 8 - PROGRESS: at 44.70% examples, 1269853 words/s, in_qsize 77, out_qsize 3
2018-10-26 15:58:05,921 : INFO : EPOCH 8 - PROGRESS: at 49.20% examples, 1270733

2018-10-26 15:58:37,635 : INFO : EPOCH 9 - PROGRESS: at 84.22% examples, 1281661 words/s, in_qsize 80, out_qsize 0
2018-10-26 15:58:38,637 : INFO : EPOCH 9 - PROGRESS: at 88.56% examples, 1280549 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:58:39,651 : INFO : EPOCH 9 - PROGRESS: at 93.18% examples, 1282506 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:58:40,652 : INFO : EPOCH 9 - PROGRESS: at 97.63% examples, 1283137 words/s, in_qsize 79, out_qsize 0
2018-10-26 15:58:40,915 : INFO : worker thread finished; awaiting finish of 39 more threads
2018-10-26 15:58:40,927 : INFO : worker thread finished; awaiting finish of 38 more threads
2018-10-26 15:58:40,935 : INFO : worker thread finished; awaiting finish of 37 more threads
2018-10-26 15:58:40,943 : INFO : worker thread finished; awaiting finish of 36 more threads
2018-10-26 15:58:40,967 : INFO : worker thread finished; awaiting finish of 35 more threads
2018-10-26 15:58:40,978 : INFO : worker thread finished; awaiting finish of 34 m

2018-10-26 15:59:04,783 : INFO : worker thread finished; awaiting finish of 25 more threads
2018-10-26 15:59:04,786 : INFO : worker thread finished; awaiting finish of 24 more threads
2018-10-26 15:59:04,791 : INFO : worker thread finished; awaiting finish of 23 more threads
2018-10-26 15:59:04,798 : INFO : worker thread finished; awaiting finish of 22 more threads
2018-10-26 15:59:04,806 : INFO : worker thread finished; awaiting finish of 21 more threads
2018-10-26 15:59:04,814 : INFO : worker thread finished; awaiting finish of 20 more threads
2018-10-26 15:59:04,816 : INFO : worker thread finished; awaiting finish of 19 more threads
2018-10-26 15:59:04,818 : INFO : worker thread finished; awaiting finish of 18 more threads
2018-10-26 15:59:04,820 : INFO : worker thread finished; awaiting finish of 17 more threads
2018-10-26 15:59:04,820 : INFO : worker thread finished; awaiting finish of 16 more threads
2018-10-26 15:59:04,822 : INFO : worker thread finished; awaiting finish of 15 m

(303492568, 415193550)

## Now, let's look at some output 
This first example shows a simple case of looking up words similar to the word `dirty`. All we need to do here is to call the `most_similar` function and provide the word `dirty` as the positive example. This returns the top 10 similar words. 

In [51]:
def avg_feature_vector(words, model, num_features):
    #function to average all words vectors in a given paragraph
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0

    for word in words:
        nwords = nwords+1
        featureVec = np.add(featureVec, model.wv.get_vector(word))

    if nwords>0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec

In [54]:
sentence_2 = "Hundreds Palestinians flee floods in Gaza as Israel opens dams"
sentence_2_avg_vector = avg_feature_vector(sentence_2.split(), model=model, num_features=300)


  


In [61]:
sentence_1 = "Hundreds Palestinians were evacuated from their homes Sunday morning after Israeli authorities opened number dams near the flooding the Gaza Valley in the wake recent severe winter"
sentence_1_avg_vector = avg_feature_vector(sentence_1.split(), model=model, num_features=300)




  


In [133]:
sentence_1_avg_vector = avg_feature_vector(words, model=model, num_features=300)


  


In [150]:
embded = avg_feature_vector(list(filter(lambda x: x in model.vocab, df.Body[0].split())), model=model, num_features=300)

  


In [155]:
embded.reshape(1,-1)

array([[ 0.04116887,  0.05220095,  0.03116181,  0.03356571, -0.0278487 ,
        -0.03325611, -0.00654184, -0.08596455,  0.08265134,  0.07709363,
        -0.02667697, -0.10175342, -0.01959968,  0.04229806, -0.06995567,
        -0.00867642, -0.000292  ,  0.06517272, -0.02176409, -0.01755957,
        -0.01371862,  0.03844424, -0.02036962, -0.01049426,  0.03681479,
        -0.03064112, -0.06727696,  0.02893789,  0.03638294, -0.03837384,
         0.008601  , -0.03856645, -0.05807032, -0.0066028 , -0.00654647,
        -0.0353211 , -0.00530177, -0.03146559, -0.00054639,  0.05194515,
         0.03757776, -0.00895299,  0.06523016, -0.02891982,  0.00709442,
        -0.05471928, -0.06348508,  0.0402124 , -0.02746599,  0.03954405,
         0.06720875,  0.01042708, -0.02429103, -0.01328754, -0.04901773,
         0.02334564, -0.07133711, -0.00444152,  0.01180557, -0.05457889,
        -0.0662168 ,  0.06294639, -0.03937907, -0.10272601, -0.02113726,
        -0.00306502,  0.00174435,  0.07157054, -0.0

In [180]:
embeddings =pandas.DataFrame((avg_feature_vector(list(filter(lambda x: x in model.vocab, df.Body[0].split())), model=model, num_features=300)).reshape(1,-1))

  


In [181]:
for sents in df.Body:
    filtered_list = list(filter(lambda x: x in model.vocab, sents.split()))
    avg_embed = avg_feature_vector(filtered_list, model=model, num_features=300)
    avg_embed_trans = avg_embed.reshape(1,-1)
    temp_df = pandas.DataFrame(avg_embed_trans)
    embeddings = embeddings.append(temp_df)
    

  


In [202]:
samp1filt = list(filter(lambda x: x in model.vocab, df.Headline[0].split()))
samp1avg = avg_feature_vector(filtered_list, model=model, num_features=300)
samp2filt = list(filter(lambda x: x in model.vocab, df.Body[0].split()))
samp2avg = avg_feature_vector(filtered_list, model=model, num_features=300)

  


In [203]:
1 - spatial.distance.cosine(samp1avg, samp2avg)

1.0

In [204]:
df


Unnamed: 0,Headline,Body,Stance
0,Hundreds Palestinians flee floods in Gazas Isr...,Hundreds Palestinians were evacuated from thei...,agree
1,Spider burrowed through tourists stomach and u...,Fear not arachnophobes the story Bunburys spid...,disagree
2,NasConfirms Earth Will Experience 6 Days Total...,Thousands people have been duped by fake news ...,agree
3,Banksy Arrested & Real Identity Revealed Is Th...,If youâve seen story floating around on your F...,agree
4,Woman detained in Lebanon is not al-Baghdadis ...,An Iraqi official denied that woman detained i...,agree
5,No Robert Plant Didnât Rip Up an $800 Million ...,Led Zeppelin fans will be disappointed to lear...,agree
6,NET Extra: Back-from-the-dead Catholic priest ...,71 years old cleric Father John Micheal Oâneal...,agree
7,Rumor debunked: RoboCop-style robots are not p...,Knightscope co-founder Stacy Stephens said rum...,agree
8,Fisherman lands 19 STONE catfish which could b...,Dino Ferrari hooked the whopper wels catfish w...,agree
9,Student accidentally sets college on fire duri...,He popped the question â€” and burned down his...,agree


In [197]:
embeddings = embeddings.drop(columns=['Class','f'])

In [198]:
embeddings.to_csv("embeddings.csv", sep=',')

In [75]:
reshaped = sentence_1_avg_vector.reshape(1,-1)

In [81]:
np.shape(sentence_2_avg_vector.reshape(1,-1))

(1, 300)

In [82]:
df = pandas.DataFrame(sentence_1_avg_vector.reshape(1,-1))

In [87]:
df2 = pandas.DataFrame(sentence_2_avg_vector.reshape(1,-1))

In [95]:
from numpy import genfromtxt

In [None]:
df = pandas.read_csv("df = pandas.read_csv("Binary.csv", sep=',', error_bad_lines=False)df = pandas.read_csv("Binary.csv", sep=',', error_bad_lines=False)df = pandas.read_csv("Binary.csv", sep=',', error_bad_lines=False)zzzzzzzzzzzzzzzzzzzzzzzzz.csv", sep=',', error_bad_lines=False)vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

In [None]:
filtered_list = list(filter(lambda x: x in model.vocab, df.Headline[0].split()))

In [201]:
1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)

Unnamed: 0,Headline,Body,Stance
0,Hundreds Palestinians flee floods in Gazas Isr...,Hundreds Palestinians were evacuated from thei...,agree
1,Spider burrowed through tourists stomach and u...,Fear not arachnophobes the story Bunburys spid...,disagree
2,NasConfirms Earth Will Experience 6 Days Total...,Thousands people have been duped by fake news ...,agree
3,Banksy Arrested & Real Identity Revealed Is Th...,If youâve seen story floating around on your F...,agree
4,Woman detained in Lebanon is not al-Baghdadis ...,An Iraqi official denied that woman detained i...,agree
5,No Robert Plant Didnât Rip Up an $800 Million ...,Led Zeppelin fans will be disappointed to lear...,agree
6,NET Extra: Back-from-the-dead Catholic priest ...,71 years old cleric Father John Micheal Oâneal...,agree
7,Rumor debunked: RoboCop-style robots are not p...,Knightscope co-founder Stacy Stephens said rum...,agree
8,Fisherman lands 19 STONE catfish which could b...,Dino Ferrari hooked the whopper wels catfish w...,agree
9,Student accidentally sets college on fire duri...,He popped the question â€” and burned down his...,agree


In [200]:
df = pandas.read_csv("relevancy.csv", sep=',', error_bad_lines=False)

In [111]:
df = pandas.read_csv("Binary.csv", sep=',', error_bad_lines=False)

In [88]:
df.append(df2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,290,291,292,293,294,295,296,297,298,299
0,0.063072,0.035162,0.002528,0.002925,-0.018912,-0.097343,-0.042887,-0.038532,0.10054,0.082913,...,-0.096976,-0.060467,-0.137662,-0.026505,0.013073,0.006938,-0.126447,-0.009122,0.093155,-0.033905
0,0.147607,0.079312,-0.071631,0.029242,-0.044415,-0.071875,-0.069897,-0.032141,0.086273,0.10083,...,-0.162549,-0.061475,-0.245752,-0.06781,-0.074834,0.059155,-0.161785,-0.054615,0.081311,0.004245


In [62]:
1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)

0.7869243025779724

In [49]:
w1= "polite"
vect = model.wv.get_vector(w1)

  


In [16]:
import numpy as np

In [17]:
np.shape(vect)

(150,)

In [9]:

w1 = "shocked"
model.wv.most_similar (positive=w1)


[('horrified', 0.8127754926681519),
 ('dismayed', 0.78623366355896),
 ('amazed', 0.7840966582298279),
 ('appalled', 0.7795482277870178),
 ('stunned', 0.7516888976097107),
 ('astonished', 0.7505988478660583),
 ('surprised', 0.720969557762146),
 ('suprised', 0.7207925319671631),
 ('astounded', 0.7204748392105103),
 ('surprized', 0.6929066777229309)]

That looks pretty good, right? Let's look at a few more. Let's look at similarity for `polite`, `france` and `shocked`. 

In [50]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar (positive=w1,topn=6)


[('courteous', 0.9174547791481018),
 ('friendly', 0.8309274911880493),
 ('cordial', 0.7990915179252625),
 ('professional', 0.7945970892906189),
 ('attentive', 0.7732747197151184),
 ('gracious', 0.7469891309738159)]

In [53]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar (positive=w1,topn=6)


[('canada', 0.6603403091430664),
 ('germany', 0.6510637998580933),
 ('spain', 0.6431018114089966),
 ('barcelona', 0.61174076795578),
 ('mexico', 0.6070996522903442),
 ('rome', 0.6065913438796997)]

In [54]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar (positive=w1,topn=6)


[('horrified', 0.80775386095047),
 ('amazed', 0.7797470092773438),
 ('astonished', 0.7748459577560425),
 ('dismayed', 0.7680633068084717),
 ('stunned', 0.7603034973144531),
 ('appalled', 0.7466776371002197)]

That's, nice. You can even specify several positive examples to get things that are related in the provided context and provide negative examples to say what should not be considered as related. In the example below we are asking for all items that *relate to bed* only:

In [55]:
# get everything related to stuff on the bed
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar (positive=w1,negative=w2,topn=10)


[('duvet', 0.7086508274078369),
 ('blanket', 0.7016597390174866),
 ('mattress', 0.7002605199813843),
 ('quilt', 0.6868821978569031),
 ('matress', 0.6777950525283813),
 ('pillowcase', 0.6413239240646362),
 ('sheets', 0.6382123827934265),
 ('foam', 0.6322235465049744),
 ('pillows', 0.6320573687553406),
 ('comforter', 0.5972476601600647)]

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [57]:
# similarity between two different words
model.wv.similarity(w1="dirty",w2="smelly")

0.76181122646029453

In [58]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

1.0000000000000002

In [59]:
# similarity between two unrelated words
model.wv.similarity(w1="dirty",w2="clean")

0.25355593501920781

Under the hood, the above three snippets computes the cosine similarity between the two specified words using word vectors of each. From the scores, it makes sense that `dirty` is highly similar to `smelly` but `dirty` is dissimilar to `clean`. If you do a similarity between two identical words, the score will be 1.0 as the range of the cosine similarity score will always be between [0.0-1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [63]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

'france'

In [77]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])


'shower'

## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10)
```

### `size`
The size of the dense vector to represent each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, its good to experiment with various sizes. A value of 100-150 has worked well for me. 

### `window`
The maximum distance between the target word and its neighboring word. If your neighbor's position is greater than the maximum window width to the left and the right, then, some neighbors are not considered as being related to the target word. In theory, a smaller window should give you terms that are more related. If you have lots of data, then the window size should not matter too much, as long as its a decent sized window. 

### `min_count`
Minimium frequency count of words. The model would ignore words that do not statisfy the `min_count`. Extremely infrequent words are usually unimportant, so its best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.

### `workers`
How many threads to use behind the scenes?


## When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine if you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You have a lexicon for not just sentiment, but for most words in the vocabulary. 

Beyond, raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million stackoverflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occuring tags as a "sentence" and train a Word2Vec model on this data. Granted, you still need a large number of examples to make it work. 
