# Working with Word2Vec with Gensim

Setup:
1. pip install gensim
1. cython -V
1. (if no cython): pip install cython

We have been working with a number of techniques and tools that help us navigate the world of NLP. For example, we have Vectorizers:kljm

In [None]:
%matplotlib inline
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


import numpy as np, seaborn as sb
import matplotlib.pyplot as plt  

text = ['That is should come to this!', 'This above all: to thine own self be true.', 'Something is rotten in the state of Denmark.']
vectorizer = CountVectorizer(ngram_range=(1,2))

vectorizer.fit(text)
x = vectorizer.transform(text)
x_back = x.toarray()

pd.DataFrame(x_back, columns=vectorizer.get_feature_names())

Unnamed: 0,above,above all,all,all to,be,be true,come,come to,denmark,in,...,the,the state,thine,thine own,this,this above,to,to thine,to this,true
0,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,1,0,1,0,1,0
1,1,1,1,1,1,1,0,0,0,0,...,0,0,1,1,1,1,1,1,0,1
2,0,0,0,0,0,0,0,0,1,1,...,1,1,0,0,0,0,0,0,0,0


The problem with Bag of Words models is context and semantic meaning does not play a role!! This is problematic, because we're humans and we care about context.

And then came Word2Vec...

Context is hugely important, and [new techniques like t-SNE and Tensorflow](https://www.tensorflow.org/tutorials/word2vec) allow us to visualize these word and phrase relationships.

![](https://www.tensorflow.org/images/linear-relationships.png)

In [2]:
import gensim
documents = ["Will this work?  I'm not sure.  If not go to step #4 (above)"]
texts = [[word for word in document.lower().split()]
         for document in documents]

print(texts)
model = gensim.models.Word2Vec(texts, size=100, window=5, min_count=1, workers=4,sg=1)

[['will', 'this', 'work?', "i'm", 'not', 'sure.', 'if', 'not', 'go', 'to', 'step', '#4', '(above)']]



**4) If you see the following error : "UserWarning: C extension not loaded for Word2Vec"**


[Do the following](https://groups.google.com/forum/#!topic/gensim/isBqIhrw9mk):

1.  pip uninstall gensim
2.  pip uninstall scipy 
3. pip install --no-cache-dir scipy==0.15.1
4. pip install --no-cache-dir gensim==0.12.1

In [3]:
#  A 'Gensim' example: 
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

## Word2Vec Preprocessing:

1. Tokenization   
2. Remove stop words    
3. Convert to lowercase     
4. Others: stemming.. 

In [4]:
# The type of input that Word2Vec is looking for.. 
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

texts

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

###  2: Word Representation

Learn a continuous representation of words.
Each word (w) is associated with it's own word vector

In [27]:
import gensim
model1 = gensim.models.Word2Vec(texts, size=100, window=5, min_count=1, workers=2,sg=1)

#print model.vocab
#print(model.vocab['minors'])

In [26]:
#  and Voila !    We have our word vector 
model1['computer']

array([ 0.00358155,  0.00264495, -0.00401788,  0.00149795, -0.00194208,
        0.00045494, -0.00017517,  0.0013563 , -0.00399634,  0.00104367,
       -0.00279939,  0.00215701, -0.0022176 , -0.00280289, -0.00152524,
        0.00427915,  0.00151263, -0.00368231, -0.00362804,  0.0014829 ,
        0.00033702,  0.0014282 , -0.00318798, -0.00481104,  0.00064264,
        0.0041865 , -0.00427888,  0.00062682, -0.00080342, -0.00499865,
       -0.00364611,  0.00132216, -0.00449747, -0.00371323, -0.00272111,
        0.00494667, -0.00280392,  0.00320229,  0.00346256, -0.00225677,
       -0.001115  , -0.00231972,  0.00036926, -0.00090166, -0.00046707,
       -0.00385928, -0.00196428, -0.00238861, -0.00350803, -0.00146319,
        0.00112949,  0.00418342,  0.00376145,  0.00302149,  0.00321145,
        0.00430367,  0.00351456,  0.00297585,  0.00080843, -0.00160295,
       -0.00343848,  0.00124714,  0.00095836,  0.00302949, -0.00010518,
       -0.0009199 , -0.00101466, -0.00157668,  0.00212804,  0.00

In [8]:
model1.similarity('trees', 'machine')

-0.069379084445707534

In [9]:
model1.similarity('human', 'machine')

-0.039476059360845459

In [10]:
model1.similarity('computer', 'system')

0.0032223047000629855

In [11]:
model1.similarity('computer', 'machine')

0.13791865731732256

In [20]:
model1.doesnt_match("computer human trees".split())

'computer'

In [21]:
model1.doesnt_match("computer human machine".split())

'human'

In [22]:
model1.doesnt_match("computer machine system".split())

'computer'

###  What do we have?   Word Embeddings 

**A word embedding W : words → ℝn **

The output above is the result of 'word' projections in a latent space
of N dimensions, (N ~ size of NN layer we chose).     
Our float values above represent the coordinates for the word 'computer' in our 100-dimensional space!

Our high dimensional vectors stand in place for words.    
Note, that these dimensions are encoding 'latent' properties for 'computer' (such that 'queen' will be geometrically closer to 'king' than it would to be to (let's say) 'computer'. 


Word Embeddings are useful because:

1.  We can measure the semantic similarity between two words
2.  We can use these word vectors as features for various NLP supervised learning tasks (such as classifcation, sentiment analysis). 

We will see how we get here.. 


![check this](http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png)

### 3:  Skip-Gram Methods &  Continuous Bag of Words (CBOW) Methods : 

#### Skip-Gram: 

**example sentence:**  

**  "We are on the cusp of deep learning for the masses"

For Context Window = 2:

*We could get the following training examples: (Where target word is in bold) *


**We** are on 

We **are** on the

We are **on** the cusp


#### What's happening underneath the hood? :
for this example : we have input of skip-gram is a single word (Wi) **'learning'**, we will determine the probability of seeing the words (Wo) : 'of','deep', 'for','the'

Step 1) Transform our vobabulary into a 'bag of indices'

Step 2) [One-hot encode](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science) (input vectors) 

Step 3) Randomly initialize the Weight Vectors

Step 4) Get dot product: (Input vector.InputWeightMatrix) ~ (this is just the weight vector for 'learning')

Step 5) Get dot product:  ('learning' weight vector).(Output Weight Matrix) 

Step 6) Calculate Softmax probabilities
What is the probability of 'seeing' the word 'deep' given that we've seen the word 'learning'?  -- >  Using SGD together with softmax regression, we will maximize the probability for 'deep' 

P(Wo|Wi) = (exp(Wi.Wo)/ sum(exp(Wi.Woj)   (sum~ sum of all Woj for all j in Vocabulary)

Step 7) As always, we update our Weight matrix to reduce our errors
Wi=Wi-a*ej*Wo

Step 7) Repeat..

 
<img src='img/skip_gram.png'/>


### CBOW: 


CBOW: very similiar model with the inputs & outputs reversed.  The input layer consists of our word window 

<img src='img/CBOW.png'/>



In [8]:
import os
import nltk
from nltk.corpus import stopwords

# we may need to download gutenberg
#nltk.download()

In [9]:
stop = stopwords.words('english')
stop+=['?','!','.',',',':',';']

#creating our iterator
# An Illustration.. 

import os

class MySentences(object):
     def __init__(self, dirname):
            self.dirname = dirname
 
     def __iter__(self):
        # iterate through all file names in our directory
         for fname in os.listdir(self.dirname):
                for line in open(os.path.join(self.dirname, fname), encoding='utf-8', errors='ignore'):
                    word=line.lower().split()
                    if word not in stop:
                        yield word

In [28]:
sentences = MySentences('/Users/jb/nltk_data/corpora/gutenberg/') # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences,min_count=1,workers=-1)

# looks, like workers = -1, doesnt work for us!
#model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=1, workers=-1)



In [51]:
model.most_similar('king' ,topn=10)

[('queen', 0.5949533581733704),
 ('captain', 0.5392752289772034),
 ("''tis", 0.5350890159606934),
 ('sluggard,"\'', 0.5228051543235779),
 ('ahab', 0.5076287984848022),
 ('huzza!"', 0.503913402557373),
 ("fun?'", 0.5004797577857971),
 ('refutations', 0.49884623289108276),
 ("what?'", 0.49297577142715454),
 ("'prentice", 0.487434983253479)]

In [45]:
# Similarity

model.similarity('woman','man')

0.49153613754258013

In [46]:
# Compute cosine_similarity

model.n_similarity(['woman', 'girl'], ['man', 'boy'])

0.59014822458792149

In [47]:
model.doesnt_match("breakfast soldier cowboy warrior".split())

'breakfast'

In [48]:
model.doesnt_match("breakfast good lunch dinner".split())

'good'

### Some things to keep in Mind when using Word2Vec:

Word2vec requires a lot of data (100M+ words) to train refined models, or use [pretrained vectors](http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/)!

We don't track where in a window our co-occurant words exist, so if 'New' always appears before 'York' there's a 100% change of 'New' given 'York', but if the window is 3, that doesn't mean any of the slots have a 100% chance of 'New', and it doesn't know of the phrase 'New York'.

Doc2Vec method extends the word2vec algorithm to larger blocks of texts (paragraphs, documents, articles):
- https://radimrehurek.com/gensim/models/doc2vec.html
- http://learningaboutdata.blogspot.com/2014/06/plotting-word-embedding-using-tsne-with.html
- https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
- https://radimrehurek.com/gensim/models/word2vec.html