# Getting started with Word2Vec in Gensim and making it work!

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying *show me your friends, and I'll tell you who you are*. So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words `shocked`, `appalled` and `astonished` are typically used in similar contexts.

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance, but it's really a combination of two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the [original Word2Vec implementation by Google](https://arxiv.org/pdf/1301.3781.pdf) and extended with additional functionality.

### Imports and logging

First, we start with our imports and get logging established:

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging
import pandas
import numpy as np
from scipy import spatial
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [46]:
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)  


2018-10-26 17:40:05,742 : INFO : loading projection weights from ./GoogleNews-vectors-negative300.bin
2018-10-26 17:40:57,022 : INFO : loaded (3000000, 300) matrix from ./GoogleNews-vectors-negative300.bin


In [51]:
def avg_feature_vector(words, model, num_features):
    # Average the word vectors of all words in a given paragraph.
    # `model` is the KeyedVectors object loaded above, so vectors are looked up on it directly.
    featureVec = np.zeros((num_features,), dtype="float32")
    nwords = 0

    for word in words:
        nwords = nwords + 1
        featureVec = np.add(featureVec, model.get_vector(word))

    # Guard against division by zero when `words` is empty.
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec
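
To make the averaging step concrete without loading the pretrained vectors, here is a minimal sketch of the same idea using a small hand-made dictionary. The `toy_vectors` dictionary and the `average_vector` helper are illustrative assumptions, not part of Gensim, and the 3-dimensional values are made up.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; the pretrained vectors above have 300 dimensions.
toy_vectors = {
    "floods": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "gaza": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of all in-vocabulary words; return zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros((num_features,), dtype="float32")
    return np.mean(known, axis=0)

# "in" has no vector in the toy dictionary, so it is skipped.
avg = average_vector(["floods", "in", "gaza"], toy_vectors, num_features=3)
print(avg)
```

Unlike `avg_feature_vector`, this sketch filters out-of-vocabulary words inside the helper, so the caller does not have to filter first. That is a design choice for the example only.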

In [318]:
df = pandas.read_csv("agreeDisagreeDiscuss.csv", sep=',', error_bad_lines=False, encoding= "ISO-8859-1")

In [319]:
df

| | Headline | Body | Stance |
|---|---|---|---|
| 0 | Hundreds of Palestinians flee floods in Gaza a... | Hundreds of Palestinians were evacuated from t... | agree |
| 1 | Spider burrowed through tourists stomach and u... | Fear not arachnophobes the story of Bunburys s... | disagree |
| 2 | Nasa Confirms Earth Will Experience 6 Days of ... | Thousands of people have been duped by a fake ... | agree |
| 3 | Banksy Arrested Real Identity Revealed Is The... | If youve seen a story floating around on your ... | agree |
| 4 | Gateway Pundit | A British rapper whose father is awaiting tria... | discuss |
| 5 | Woman detained in Lebanon is not alBaghdadis w... | An Iraqi official denied that a woman detained... | agree |
| 6 | Soon Marijuana May Lead to Ticket Not Arrest i... | After campaigning on a promise to reform stopa... | discuss |
| 7 | Boko Haram Denies Nigeria CeaseFire Claim | ABUJA Nigeria The leader of Nigerias Islamist... | discuss |
| 8 | No Robert Plant Didnt Rip Up an $800 Million C... | Led Zeppelin fans will be disappointed to lear... | agree |
| 9 | ISIL Beheads American Photojournalist in Iraq | James Foley an American journalist who went mi... | discuss |


In [277]:
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
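nltk.download('stopwords')
# English stop words, used below to filter headlines and bodies
stop_words = set(stopwords.words('english'))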

In [248]:
samp1filt = list(filter(lambda x: x in model.vocab, df.Headline[2].split()))
sentence_1_avg_vector = avg_feature_vector(samp1filt, model=model, num_features=300)
samp2filt = list(filter(lambda x: x in model.vocab, df.Body[2].split()))
sentence_2_avg_vector = avg_feature_vector(samp2filt, model=model, num_features=300)

  


In [271]:
df.to_csv("rel_df.csv", sep=',')

In [311]:
rel = []
for i in range(len(df)):
    # Drop stop words, then keep only tokens the model has a vector for.
    stopHeadline = [w for w in df.Headline[i].split() if w not in stop_words]
    samp1filt = list(filter(lambda x: x in model.vocab, stopHeadline))
    sentence_1_avg_vector = avg_feature_vector(samp1filt, model=model, num_features=300)

    stopBody = [w for w in df.Body[i].split() if w not in stop_words]
    samp2filt = list(filter(lambda x: x in model.vocab, stopBody))
    sentence_2_avg_vector = avg_feature_vector(samp2filt, model=model, num_features=300)

    # Cosine similarity between the averaged headline and body vectors.
    rel.append(1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector))

  


In [315]:
df['Relevancy'] = rel
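
If a headline or body has no in-vocabulary words left after filtering, `avg_feature_vector` returns an all-zero vector, and the cosine of a zero vector is undefined because the denominator is zero. Below is a minimal sketch of a guard for that case. `safe_cosine_similarity` is a hypothetical helper name, not a SciPy or Gensim function, and returning NaN is just one possible convention.

```python
import numpy as np

def safe_cosine_similarity(u, v):
    """Cosine similarity that returns NaN instead of dividing by a zero norm."""
    u = np.asarray(u, dtype="float64")
    v = np.asarray(v, dtype="float64")
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        # At least one vector is all zeros, so the angle is undefined.
        return float("nan")
    return float(np.dot(u, v) / norm_product)

print(safe_cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(safe_cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(safe_cosine_similarity([0.0, 0.0], [1.0, 0.0]))
```

Rows that come back as NaN can then be dropped or inspected separately before using the `Relevancy` column.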

In [317]:
df.to_csv("rel_df.csv", sep=',')

In [54]:
sentence_2 = "Hundreds Palestinians flee floods in Gaza as Israel opens dams"
sentence_2_avg_vector = avg_feature_vector(sentence_2.split(), model=model, num_features=300)


  


In [61]:
sentence_1 = "Hundreds Palestinians were evacuated from their homes Sunday morning after Israeli authorities opened number dams near the flooding the Gaza Valley in the wake recent severe winter"
sentence_1_avg_vector = avg_feature_vector(sentence_1.split(), model=model, num_features=300)




  


In [133]:
sentence_1_avg_vector = avg_feature_vector(words, model=model, num_features=300)


  


In [321]:
embded = avg_feature_vector(list(filter(lambda x: x in model.vocab, df.Body[0].split())), model=model, num_features=300)

  


In [323]:
embeddings = pandas.DataFrame((avg_feature_vector(list(filter(lambda x: x in model.vocab, df.Body[0].split())), model=model, num_features=300)).reshape(1,-1))

  


In [324]:
embeddings

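| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.042348 | 0.050587 | 0.010934 | 0.027728 | -0.034364 | -0.032291 | -0.00525 | -0.082431 | 0.074518 | 0.072256 | ... | -0.105168 | -0.008516 | -0.106848 | 0.001479 | -0.022616 | 0.038849 | -0.058478 | -0.03246 | 0.052244 | -0.012583 |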


In [325]:
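for sents in df.Body:
    filtered_list = list(filter(lambda x: x in model.vocab, sents.split()))
    avg_embed = avg_feature_vector(filtered_list, model=model, num_features=300)
    avg_embed_trans = avg_embed.reshape(1,-1)
    temp_df = pandas.DataFrame(avg_embed_trans)
    # pandas.concat replaces DataFrame.append, which was removed in pandas 2.0.
    # Note: `embeddings` already holds the row for df.Body[0], so that row ends up in the result twice.
    embeddings = pandas.concat([embeddings, temp_df])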
    

  


In [326]:
embeddings.to_csv("ADD_embeddings.csv", sep=',')

In [75]:
reshaped = sentence_1_avg_vector.reshape(1,-1)

In [81]:
np.shape(sentence_2_avg_vector.reshape(1,-1))

(1, 300)

In [82]:
df = pandas.DataFrame(sentence_1_avg_vector.reshape(1,-1))

In [87]:
df2 = pandas.DataFrame(sentence_2_avg_vector.reshape(1,-1))

In [95]:
from numpy import genfromtxt

In [None]:
filtered_list = list(filter(lambda x: x in model.vocab, df.Headline[0].split()))

In [225]:
1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)

0.4982364773750305

In [49]:
w1= "polite"
vect = model.wv.get_vector(w1)

  


In [9]:

w1 = "shocked"
model.wv.most_similar (positive=w1)


[('horrified', 0.8127754926681519),
 ('dismayed', 0.78623366355896),
 ('amazed', 0.7840966582298279),
 ('appalled', 0.7795482277870178),
 ('stunned', 0.7516888976097107),
 ('astonished', 0.7505988478660583),
 ('surprised', 0.720969557762146),
 ('suprised', 0.7207925319671631),
 ('astounded', 0.7204748392105103),
 ('surprized', 0.6929066777229309)]

That looks pretty good, right? Let's look at a few more. Let's look at similarity for `polite`, `france` and `shocked`. 

In [50]:
# look up top 6 words similar to 'polite'
w1 = ["polite"]
model.wv.most_similar (positive=w1,topn=6)


[('courteous', 0.9174547791481018),
 ('friendly', 0.8309274911880493),
 ('cordial', 0.7990915179252625),
 ('professional', 0.7945970892906189),
 ('attentive', 0.7732747197151184),
 ('gracious', 0.7469891309738159)]

In [53]:
# look up top 6 words similar to 'france'
w1 = ["france"]
model.wv.most_similar (positive=w1,topn=6)


[('canada', 0.6603403091430664),
 ('germany', 0.6510637998580933),
 ('spain', 0.6431018114089966),
 ('barcelona', 0.61174076795578),
 ('mexico', 0.6070996522903442),
 ('rome', 0.6065913438796997)]

In [54]:
# look up top 6 words similar to 'shocked'
w1 = ["shocked"]
model.wv.most_similar (positive=w1,topn=6)


[('horrified', 0.80775386095047),
 ('amazed', 0.7797470092773438),
 ('astonished', 0.7748459577560425),
 ('dismayed', 0.7680633068084717),
 ('stunned', 0.7603034973144531),
 ('appalled', 0.7466776371002197)]

That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered related. In the example below we are asking for items that *relate to bed* only:

In [55]:
# get everything related to stuff on the bed
w1 = ["bed",'sheet','pillow']
w2 = ['couch']
model.wv.most_similar (positive=w1,negative=w2,topn=10)


[('duvet', 0.7086508274078369),
 ('blanket', 0.7016597390174866),
 ('mattress', 0.7002605199813843),
 ('quilt', 0.6868821978569031),
 ('matress', 0.6777950525283813),
 ('pillowcase', 0.6413239240646362),
 ('sheets', 0.6382123827934265),
 ('foam', 0.6322235465049744),
 ('pillows', 0.6320573687553406),
 ('comforter', 0.5972476601600647)]
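
To give a feel for how positive and negative examples combine, here is a simplified sketch: it averages the unit-length positive vectors together with the negated negative vectors, then ranks the remaining words by cosine similarity to that average. The 2-D `toy_vectors` and the local `most_similar` function are made up for illustration. This approximates the idea and should not be read as Gensim's exact implementation.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype="float64")
    return v / np.linalg.norm(v)

# Hypothetical 2-D vectors, chosen only for illustration.
toy_vectors = {
    "bed": unit([1.0, 0.1]),
    "pillow": unit([0.9, 0.2]),
    "couch": unit([0.2, 1.0]),
    "duvet": unit([1.0, 0.0]),
    "sofa": unit([0.1, 1.0]),
}

def most_similar(vectors, positive, negative=(), topn=10):
    """Rank words by cosine similarity to the mean of positive and negated negative vectors."""
    signed = [vectors[w] for w in positive] + [-vectors[w] for w in negative]
    query = unit(np.mean(signed, axis=0))
    # The query words themselves are excluded from the results.
    exclude = set(positive) | set(negative)
    scores = {w: float(np.dot(query, v)) for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

ranked = most_similar(toy_vectors, positive=["bed", "pillow"], negative=["couch"])
print(ranked)
```

In this toy space `duvet` ranks above `sofa`, because subtracting `couch` pushes the query away from the couch-like direction.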

### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [57]:
# similarity between two different words
model.wv.similarity(w1="dirty",w2="smelly")

0.76181122646029453

In [58]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

1.0000000000000002

In [59]:
# similarity between two words with opposite meanings
model.wv.similarity(w1="dirty",w2="clean")

0.25355593501920781

Under the hood, the above three snippets compute the cosine similarity between the two specified words using the word vector of each. From the scores, it makes sense that `dirty` is highly similar to `smelly` but much less similar to `clean`. If you compute the similarity between two identical words, the score is 1.0 (the `1.0000000000000002` above is floating-point rounding), which is the maximum: cosine similarity always lies in the range [-1.0, 1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
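
For reference, the standard definition of cosine similarity for two word vectors $u$ and $v$ (the symbols are introduced here, not taken from Gensim) is:

$$
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
$$

When $u = v$, the numerator equals $\lVert u \rVert^2$ and the score is exactly 1. Orthogonal vectors score 0, and vectors pointing in opposite directions score -1.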

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [63]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

'france'

In [77]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])


'shower'
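
One way to picture the odd-one-out check is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below does this with made-up 2-D vectors. `odd_one_out` and `toy_vectors` are illustrative names, and the sketch is an approximation of the idea rather than Gensim's code.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype="float64")
    return v / np.linalg.norm(v)

# Hypothetical 2-D vectors: "cat" and "dog" point in similar directions.
toy_vectors = {
    "cat": unit([1.0, 0.2]),
    "dog": unit([0.9, 0.3]),
    "france": unit([0.1, 1.0]),
}

def odd_one_out(vectors, words):
    """Return the word whose vector is least similar to the mean of all the words' vectors."""
    matrix = np.vstack([vectors[w] for w in words])
    mean = unit(matrix.mean(axis=0))
    # Rows are unit length, so the dot product with the unit mean is the cosine similarity.
    similarities = matrix @ mean
    return words[int(np.argmin(similarities))]

result = odd_one_out(toy_vectors, ["cat", "dog", "france"])
print(result)
```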

## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to train the model.

```
model = gensim.models.Word2Vec (documents, size=150, window=10, min_count=2, workers=10)
```

### `size`
The size of the dense vector used to represent each token or word. If you have very limited data, then `size` should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me. Note that Gensim 4.0 and later name this parameter `vector_size`.

### `window`
The maximum distance between the target word and a neighboring word. A neighbor whose position lies beyond the window width, to the left or to the right of the target, is not considered related to the target word. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, then the window size should not matter too much, as long as it's a decently sized window.
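
The maximum-distance idea can be sketched in plain Python. The `context_pairs` function below is a hypothetical helper that lists which neighbors fall inside the window for each target word. Gensim's actual training loop is more involved, so this only illustrates the window boundary.

```python
def context_pairs(tokens, window):
    """Yield (target, neighbor) pairs for neighbors at most `window` positions away."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbor
                yield target, tokens[j]

tokens = ["the", "room", "was", "very", "clean"]
pairs = list(context_pairs(tokens, window=1))
print(pairs)
```

With `window=1`, `room` is paired with `the` and `was` but never with `very`. Raising `window` to 2 would add that pair.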

### `min_count`
Minimum frequency count of words. The model ignores words that do not satisfy the `min_count`. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
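
As a rough illustration of what a frequency cutoff does to the vocabulary, the sketch below counts tokens over a tiny made-up corpus and keeps only those that occur at least `min_count` times. The `documents` list is invented for the example, and this mirrors the idea only, not Gensim's internal vocabulary building.

```python
from collections import Counter

# Hypothetical tokenized documents, one list of tokens per document.
documents = [
    ["clean", "room", "friendly", "staff"],
    ["dirty", "room", "rude", "staff"],
    ["clean", "sheets"],
]
min_count = 2

# Count every token across all documents, then keep the frequent ones.
counts = Counter(token for doc in documents for token in doc)
vocabulary = {token for token, count in counts.items() if count >= min_count}
print(sorted(vocabulary))
```

Words such as `sheets` that appear only once are dropped, so no vector would be learned for them.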

### `workers`
The number of worker threads to use for training behind the scenes.


## When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine if you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that. You have a lexicon for not just sentiment, but for most words in the vocabulary. 

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.
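
The data preparation for the tag scenario could look roughly like the sketch below. The `question_tags` values are invented for illustration, and real data would come from wherever your tags are stored.

```python
# Hypothetical tag sets, one per question.
question_tags = [
    {"python", "pandas", "dataframe"},
    {"python", "numpy", "arrays"},
    {"java", "spring", "hibernate"},
]

# Each question's tags become one "sentence" (a list of tokens).
# Sorting only makes the output deterministic; tag order carries no meaning here.
tag_sentences = [sorted(tags) for tags in question_tags]
print(tag_sentences)
```

The resulting lists could then be passed as `documents` to the training command shown earlier. Because tag order is arbitrary, a `window` large enough to cover every tag in a question is probably a sensible starting point, though that is an assumption to validate on your own data.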
