### Imports and logging

First, we start with our imports and get logging established:

In [1]:
# imports needed and set up logging
import gzip
import gensim 
import logging
import pandas as pd
from nltk.tokenize import word_tokenize

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


In [2]:
train_data = pd.read_csv('train.tsv', sep='\t')
data = train_data.title
data = list(data)

In [3]:
def splitSentence(sentence):
    # lowercase the title, then split it into word and punctuation tokens with NLTK
    sentence = sentence.lower()
    res = word_tokenize(sentence)
    return res
splitSentence('How to draw a stacked dotplot in R?')

['how', 'to', 'draw', 'a', 'stacked', 'dotplot', 'in', 'r', '?']
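
To see why a tokenizer is used here rather than plain whitespace splitting, the sketch below compares the two on the same title using only the standard library. The regular expression is an assumption chosen for illustration; it only approximates the output of `word_tokenize` on this input and is not NLTK's algorithm.

```python
import re

title = 'How to draw a stacked dotplot in R?'.lower()

# Whitespace splitting leaves punctuation glued to the neighboring word.
by_space = title.split(' ')

# Illustrative pattern (an assumption, not NLTK's rules):
# runs of word characters, or any single non-space symbol.
by_regex = re.findall(r"\w+|[^\w\s]", title)

print(by_space)
print(by_regex)
```

With whitespace splitting the last token keeps its question mark, so `r?` and `r` would end up as two different vocabulary entries.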

In [4]:
documents = []
for row in data:
    documents.append(splitSentence(row))
logging.info("Done Processing")
len(documents)

2019-03-15 18:07:36,023 : INFO : Done Processing


100000

In [5]:
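# Passing the corpus to the constructor builds the vocabulary and runs an initial training pass.
# Note: gensim 4.x renames `size` to `vector_size`.
model = gensim.models.Word2Vec(documents, size=300, min_count=2, workers=10)
# Train for 10 more epochs over the same corpus.
model.train(documents, total_examples=len(documents), epochs=10)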

2019-03-15 18:07:36,029 : INFO : collecting all words and their counts
2019-03-15 18:07:36,030 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-03-15 18:07:36,051 : INFO : PROGRESS: at sentence #10000, processed 93014 words, keeping 8922 word types
2019-03-15 18:07:36,066 : INFO : PROGRESS: at sentence #20000, processed 185238 words, keeping 13435 word types
2019-03-15 18:07:36,083 : INFO : PROGRESS: at sentence #30000, processed 278067 words, keeping 17059 word types
2019-03-15 18:07:36,105 : INFO : PROGRESS: at sentence #40000, processed 370895 words, keeping 20252 word types
2019-03-15 18:07:36,123 : INFO : PROGRESS: at sentence #50000, processed 464208 words, keeping 23233 word types
2019-03-15 18:07:36,144 : INFO : PROGRESS: at sentence #60000, processed 556960 words, keeping 25993 word types
2019-03-15 18:07:36,165 : INFO : PROGRESS: at sentence #70000, processed 649567 words, keeping 28521 word types
2019-03-15 18:07:36,185 : INFO : PROGRESS: at se

2019-03-15 18:07:39,867 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-15 18:07:39,883 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-15 18:07:39,884 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-15 18:07:39,893 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-15 18:07:39,896 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-15 18:07:39,897 : INFO : EPOCH - 1 : training on 925908 raw words (688705 effective words) took 0.5s, 1257015 effective words/s
2019-03-15 18:07:40,418 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-03-15 18:07:40,422 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-03-15 18:07:40,423 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-15 18:07:40,452 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-03-15 18:07:40,461 : INFO : worker threa

2019-03-15 18:07:44,758 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-03-15 18:07:44,759 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-15 18:07:44,759 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-15 18:07:44,768 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-15 18:07:44,781 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-15 18:07:44,782 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-15 18:07:44,783 : INFO : EPOCH - 9 : training on 925908 raw words (688452 effective words) took 0.6s, 1081318 effective words/s
2019-03-15 18:07:45,427 : INFO : worker thread finished; awaiting finish of 9 more threads
2019-03-15 18:07:45,435 : INFO : worker thread finished; awaiting finish of 8 more threads
2019-03-15 18:07:45,436 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-15 18:07:45,449 : INFO : worker threa

(6885538, 9259080)
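
The tuple returned by `train` can be checked against the log. As the gensim documentation describes it, the first number is the effective word count (after dropping out-of-vocabulary words and down-sampling frequent ones) and the second is the raw word count, both summed over all epochs. A quick sanity check, using only numbers copied from the output above:

```python
# Numbers copied from the log lines and the return value above.
raw_words_per_epoch = 925908
epochs = 10
effective_total, raw_total = 6885538, 9259080

# The raw total should equal the per-epoch raw count times the number of epochs.
assert raw_total == raw_words_per_epoch * epochs

# Fraction of raw words that actually contributed to training.
kept_fraction = effective_total / raw_total
print(round(kept_fraction, 3))
```

Roughly three quarters of the raw tokens are used for training; the rest are rare words removed by `min_count` or frequent words that were down-sampled.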

## Now, let's look at some output 
After checking the vocabulary size, this first example shows a simple case of looking up words similar to the word `rails`. All we need to do here is call the `most_similar` function and provide the word `rails` as the positive example. By default, this returns the top 10 similar words.

In [6]:
len(model.wv.vocab)

13265

In [7]:

w1 = "rails"
model.wv.most_similar(positive=w1)

2019-03-15 18:07:51,685 : INFO : precomputing L2-norms of word weight vectors


[('sinatra', 0.6881442070007324),
 ('laravel', 0.679559588432312),
 ('cakephp', 0.657805323600769),
 ('rails3', 0.6141403317451477),
 ('codeigniter', 0.5789684653282166),
 ('django', 0.5771294832229614),
 ('ror', 0.5584676265716553),
 ('rspec', 0.5502299070358276),
 ('devise', 0.5450743436813354),
 ('symfony2', 0.5361107587814331)]

That looks pretty good, right? Let's look at a few more. Let's look at similarity for `python`, `numpy`, `machine` and `javascript`.

In [8]:
# look up top 6 words similar to 'python'
w1 = ["python"]
model.wv.most_similar(positive=w1, topn=6)


[('r', 0.6384516954421997),
 ('pandas', 0.5217745900154114),
 ('python3', 0.5198883414268494),
 ('scipy', 0.5000096559524536),
 ('numpy', 0.47042790055274963),
 ('scikit-learn', 0.46748247742652893)]

In [None]:
# look up top 6 words similar to 'numpy'
w1 = ["numpy"]
model.wv.most_similar(positive=w1, topn=6)


In [None]:
# look up top 6 words similar to 'machine'
w1 = ["machine"]
model.wv.most_similar(positive=w1, topn=6)


In [None]:
# look up top 6 words similar to 'javascript'
w1 = ["javascript"]
model.wv.most_similar(positive=w1, topn=6)

In [None]:
# get everything related to stuff on the bed
# (every query word must be in the model's vocabulary, otherwise gensim raises a KeyError)
w1 = ["bed", "sheet", "pillow"]
w2 = ["couch"]
model.wv.most_similar(positive=w1, negative=w2, topn=10)
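
To make the `positive` / `negative` arguments more concrete, here is a minimal pure-Python sketch of one way such a query can be answered: average the unit-length vectors of the query words (positive words count as +1, negative words as -1) and rank the remaining words by cosine similarity to that average. This is my understanding of the general idea, not gensim's actual code, and the 2-D vectors and helper names are invented for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def query_vector(vectors, positive, negative):
    """Average unit vectors, counting positive words as +1 and negative words as -1."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for i, x in enumerate(unit(vectors[word])):
                total[i] += sign * x
    count = len(positive) + len(negative)
    return unit([x / count for x in total])

def most_similar(vectors, positive, negative=(), topn=3):
    """Rank the words not used in the query by cosine similarity to the query vector."""
    query = query_vector(vectors, positive, negative)
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-D vectors, invented for illustration only.
toy = {
    "bed": [1.0, 0.1],
    "pillow": [0.9, 0.2],
    "duvet": [0.95, 0.15],
    "couch": [0.6, 0.8],
    "table": [0.1, 1.0],
}
ranking = most_similar(toy, positive=["bed", "pillow"], negative=["couch"])
print(ranking)
```

In this toy space `duvet` ranks first because it points in nearly the same direction as `bed` and `pillow`, while subtracting `couch` pushes the query away from `table`.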


### Similarity between two words in the vocabulary

You can even use the Word2Vec model to return the similarity between two words that are present in the vocabulary. 

In [None]:
# similarity between two different words
model.wv.similarity(w1="dirty",w2="smelly")

In [None]:
# similarity between two identical words
model.wv.similarity(w1="dirty",w2="dirty")

In [None]:
# similarity between two words with opposite meanings
model.wv.similarity(w1="dirty",w2="clean")

Under the hood, the three snippets above compute the cosine similarity between the word vectors of the two specified words. On a model trained on general English text such as user reviews, you would expect `dirty` to score higher against `smelly` than against `clean`. Comparing a word with itself gives a score of 1.0, the maximum, because cosine similarity always lies in the range [-1.0, 1.0]. You can read more about cosine similarity scoring [here](https://en.wikipedia.org/wiki/Cosine_similarity).
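
The computation itself fits in a few lines of plain Python. The vectors below are toy assumptions; real word vectors from this model have 300 dimensions. Note that vectors pointing in opposite directions score -1.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors chosen for illustration only.
same = cosine_similarity([3.0, 4.0], [3.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])
print(same, orthogonal, opposite)
```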

### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.

In [None]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

In [None]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","duvet","shower"])
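
One way to picture what `doesnt_match` does, which to my understanding is close to gensim's approach, is this: normalize each word vector, average them, and return the word that is least similar to that average. The sketch below uses hypothetical 2-D vectors and a made-up helper name, so treat it as an illustration rather than the library's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors, words):
    """Return the word whose unit vector is least similar to the mean of the group."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dims)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical vectors, invented for illustration only.
toy = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "france": [0.1, 0.9],
}
odd = odd_one_out(toy, ["cat", "dog", "france"])
print(odd)
```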


## Understanding some of the parameters
To train the model earlier, we had to set some parameters. Now, let's try to understand what some of them mean. For reference, this is the command that we used to create the model. We did not pass `window`, so gensim used its default of 5.

```
model = gensim.models.Word2Vec(documents, size=300, min_count=2, workers=10)
```

### `size`
The size of the dense vector that represents each token or word. If you have very limited data, then size should be a much smaller value. If you have lots of data, it's good to experiment with various sizes. A value of 100-150 has worked well for me; the model above uses 300.

### `window`
The maximum distance between the target word and a neighboring word. Neighbors that are farther away than the window width, to the left or to the right, are not treated as context for the target word. In theory, a smaller window should give you terms that are more closely related. If you have lots of data, then the window size should not matter too much, as long as it's a decent-sized window.
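
As a rough illustration of what the window controls, the sketch below lists the (target, context) pairs that a fixed window produces for one tokenized title. The function name is hypothetical, and real implementations may additionally vary the effective window during training, so this shows only the basic idea.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for every neighbor within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ['how', 'to', 'draw', 'a', 'stacked', 'dotplot', 'in', 'r', '?']
narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=3)

# A wider window yields more pairs, including more distant neighbors.
print(len(narrow), len(wide))
draw_context = [context for target, context in wide if target == 'draw']
print(draw_context)
```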

### `min_count`
Minimum frequency count of words. The model ignores words that occur fewer than `min_count` times. Extremely infrequent words are usually unimportant, so it's best to get rid of those. Unless your dataset is really tiny, this does not really affect the model.
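
The effect of this threshold can be mimicked with a counter. This is a minimal sketch with a hypothetical helper and a two-document toy corpus; gensim applies the filter internally while building its vocabulary.

```python
from collections import Counter

def filter_rare(documents, min_count):
    """Drop every token that occurs fewer than `min_count` times in the whole corpus."""
    counts = Counter(token for doc in documents for token in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in documents]

# Toy corpus, invented for illustration only.
docs = [
    ['how', 'to', 'draw', 'a', 'dotplot'],
    ['how', 'to', 'sort', 'a', 'list'],
]
filtered = filter_rare(docs, min_count=2)
print(filtered)
```

Only tokens that appear in both titles survive, which is what `min_count=2` means for a corpus this small.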

### `workers`
The number of worker threads used to train the model.


## When should you use Word2Vec?

There are many application scenarios for Word2Vec. Imagine that you need to build a sentiment lexicon. Training a Word2Vec model on large amounts of user reviews helps you achieve that: you get a lexicon not just for sentiment, but for most words in the vocabulary.

Beyond raw unstructured text data, you could also use Word2Vec for more structured data. For example, if you had tags for a million Stack Overflow questions and answers, you could find tags that are related to a given tag and recommend the related ones for exploration. You can do this by treating each set of co-occurring tags as a "sentence" and training a Word2Vec model on this data. Granted, you still need a large number of examples to make it work.
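
The data preparation for the tag scenario could look roughly like this. The tag strings and the space separator are assumptions made up for the example; a real export may store tags differently.

```python
# Hypothetical tag strings, one per question; the data and separator are assumptions.
rows = [
    "python pandas numpy",
    "python django",
    "ruby-on-rails ruby rspec",
    "python numpy scipy",
]

# Each question's tag set becomes one "sentence" (a list of tokens).
tag_sentences = [row.split() for row in rows]
print(tag_sentences[0])

# These lists have the same shape as `documents` above, so they could be
# passed to gensim.models.Word2Vec in the same way.
```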
