# WORD2VEC

## Working with Word2Vec with Gensim


We have been working with a number of techniques and tools that help us navigate the world of NLP.       
**For example, we have our Vectorizer:**


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text = ['That is should come to this!', 'This above all: to thine own self be true.', 'Something is rotten in the state of Denmark.']
vectorizer = CountVectorizer(ngram_range=(1,2))

vectorizer.fit(text)
x = vectorizer.transform(text)
x_back = x.toarray()

pd.DataFrame(x_back, columns=vectorizer.get_feature_names())

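|   | above | above all | all | all to | be | be true | come | come to | denmark | in | ... | the | the state | thine | thine own | this | this above | to | to thine | to this | true |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |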


**Two adverse traits of the Bag of Words model:**      
1)  Word context & semantic meaning do not play a role   
2)  The size of our feature vectors grows with the vocabulary size. 
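
Both traits can be seen in a few lines of plain Python. The sketch below is a simplified, hand-rolled bag of words (unigrams only, not scikit-learn's `CountVectorizer`), and the example sentences and the helper names `bag_of_words` and `dot` are made up for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(word for doc in tokenized for word in doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Trait 1: two sentences with similar meaning share no words,
# so their count vectors have nothing in common.
vocab, (v1, v2) = bag_of_words(["the film was great", "a movie is superb"])
print(dot(v1, v2))  # 0

# Trait 2: every new word adds one more column to every document vector.
small_vocab, _ = bag_of_words(["the film was great"])
print(len(small_vocab), len(vocab))  # 4 8
```
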

**And then came Word2Vec**

We will see that with Word2Vec context does play a role, and that it can capture relationships between words, including: 

linguistic relationships (e.g., vector('king') - vector('man') + vector('woman') ≈ vector('queen'))
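
To make the vector arithmetic concrete, here is a minimal sketch using hand-made 2-D vectors. These numbers are NOT learned by Word2Vec; they are assumptions chosen so that one axis loosely means "royalty" and the other "maleness". The `analogy` helper is hypothetical and only mimics the idea of ranking candidates by cosine similarity.

```python
import math

# Hand-made toy vectors: [royalty, maleness]. Purely illustrative.
vectors = {
    'king':   [0.9, 0.9],
    'queen':  [0.9, 0.1],
    'man':    [0.1, 0.9],
    'woman':  [0.1, 0.1],
    'prince': [0.8, 0.8],
    'apple':  [-0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(['king', 'woman'], ['man'], vectors))  # queen
```
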

## First things first:


**1) Install Gensim:**

pip install gensim

**2) Make sure Cython is installed:**

cython -V

(if no cython):

pip install cython


**3) Test (run the following):**

from gensim.models import Word2Vec
text = [['testing','testing','testing']]
model = Word2Vec(text,min_count=1,workers=4)  # min_count=1 so this tiny vocabulary is not discarded

**4) If you see the following error : "UserWarning: C extension not loaded for Word2Vec"**


Do the following:

1.  pip uninstall gensim
2.  pip uninstall scipy 

3. pip install --no-cache-dir scipy==0.15.1
4. pip install --no-cache-dir gensim==0.12.1


**Refer to the following:** https://groups.google.com/forum/#!topic/gensim/isBqIhrw9mk


In [38]:
#  A 'Gensim' example: 

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

##  Word2Vec (In a Word..)

###  Idea 1: Preprocessing

1) Tokenization   
2) Remove stop words    
3) Convert to lowercase     
4) Others: stemming.. 

In [3]:
# The type of input that Word2Vec is looking for.. 
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

print(texts)

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]


###  Idea 2: Word Representation

Learn a continuous representation of words.    
Each word (w) is associated with its own word vector C(w).

In [4]:
import gensim
model = gensim.models.Word2Vec(texts, size=100, window=5, min_count=1, workers=4,sg=1)

In [5]:
# Look up the learned vector for the word 'computer'
print(model['computer'])

[  3.60290846e-03   2.64952192e-03  -4.01750021e-03   1.51745591e-03
  -1.94312329e-03   4.38236195e-04  -1.41889264e-04   1.34513830e-03
  -4.01126919e-03   1.00765983e-03  -2.78462586e-03   2.17301771e-03
  -2.22336175e-03  -2.83751870e-03  -1.49124057e-03   4.26495587e-03
   1.51252816e-03  -3.69674829e-03  -3.60761560e-03   1.50641694e-03
   2.96684972e-04   1.45111303e-03  -3.22056771e-03  -4.79736738e-03
   6.92113303e-04   4.17999458e-03  -4.27227514e-03   6.63728337e-04
  -8.40774272e-04  -5.01733180e-03  -3.63163697e-03   1.31287775e-03
  -4.52222209e-03  -3.73024936e-03  -2.75684590e-03   4.95271757e-03
  -2.78775464e-03   3.16175050e-03   3.49858007e-03  -2.28901766e-03
  -1.11150870e-03  -2.29695300e-03   3.72917828e-04  -8.70059594e-04
  -4.48034843e-04  -3.86656355e-03  -1.96747994e-03  -2.37318990e-03
  -3.50443041e-03  -1.47374463e-03   1.15248945e-03   4.21493221e-03
   3.77143174e-03   3.08511173e-03   3.19681736e-03   4.27480042e-03
   3.56297242e-03   2.96207075e-03

###  What do we have?   Word Embeddings 

**A word embedding W : words → ℝⁿ**

The output above is the result of projecting the word into a latent space
of N dimensions (N = the size of the hidden layer we chose, here size=100).     
The float values above are the coordinates of the word 'computer' in our 100-dimensional space!

Our high-dimensional vectors stand in for words.    
Note that these dimensions encode 'latent' properties of each word, such that (given enough training data) 'queen' will be geometrically closer to 'king' than it would be to, let's say, 'computer'. 


<img src='img/vector_queen.png'/>

<img src='img/vector_queen2.png'/>


Word Embeddings are useful because:

1.  We can measure the semantic similarity between two words
2.  We can use these word vectors as features for various supervised NLP tasks (such as classification or sentiment analysis). 

We will see how we get here.. 


###  Idea 3: Skip-Gram & CBOW Methods

#### Skip-Gram: 

**example sentence:**  "We are on the cusp of deep learning for the masses"

a) The input of skip-gram is a single word (Wi), e.g. 'cusp', and the outputs are the words (Wo) in Wi's context window (defined by our window size C).

Context window for 'cusp': "We are on the cusp of deep learning for the"  (using window=5 & sg=1, i.e. up to 5 words on either side of Wi)

We will utilize 2 for loops: (1) the first iterates through our inputs (our 1st Wi will be 'We');   
(2) the second iterates through that Wi's Wo's: {'are','on','the','cusp','of'}

b) One-hot encode both input & output vectors

c) Our W matrix is just a massive matrix containing all of our word embeddings (row by row basis)

d) To learn the weight matrices W & W', we first initialize them randomly.  

e) We then sequentially feed training examples into our model & observe the error (some function of the difference between the expected output and the actual output). 

f) We then compute the gradient of this error with respect to the elements of both matrices and correct them in the direction opposite to the gradient (aha!  stochastic gradient descent).    As we learned with SGD, the goal is to take a small step (as controlled by the learning rate) that pulls the vectors for wi and wo closer together, thereby increasing the probability P(wo|wi). 

g) We define our loss function so that minimizing it maximizes the conditional probability of the output (context) words given our input word. 

h) By repeating this process over the entire training set, the vectors for words that habitually co-occur are nudged closer together, and by gradually lowering the learning rate the process converges towards some final state for the vectors.  
 
<img src='img/skip_gram.png'/>
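
Steps a) through h) can be sketched end to end in plain Python. This is a minimal illustration under several assumptions, not what Gensim runs internally: it uses a full softmax over the vocabulary, a toy window of C=2, a 5-dimensional embedding, and made-up names (`pairs`, `W_out`, `train_epoch`) and learning-rate schedule.

```python
import math
import random

sentence = "we are on the cusp of deep learning for the masses".split()
vocab = sorted(set(sentence))
index = {word: i for i, word in enumerate(vocab)}
V, N, C = len(vocab), 5, 2  # vocabulary size, embedding size, window (toy values)

# The two for loops: every input word Wi is paired with each Wo in its window.
pairs = []
for i, wi in enumerate(sentence):
    for j in range(max(0, i - C), min(len(sentence), i + C + 1)):
        if j != i:
            pairs.append((index[wi], index[sentence[j]]))

# d) Random initialization. Row k of W is the embedding of word k;
#    W_out plays the role of W' (stored one row per output word).
random.seed(0)
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]

def train_epoch(lr):
    """One SGD pass over all pairs; returns the mean of -log P(wo|wi)."""
    total_loss = 0.0
    for wi, wo in pairs:
        h = W[wi]  # one-hot(wi) times W is just row wi of W
        scores = [sum(a * b for a, b in zip(h, row)) for row in W_out]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        probs = [e / norm for e in exps]  # softmax: P(word k | wi)
        total_loss += -math.log(probs[wo])
        # e) error = predicted distribution minus the one-hot target
        err = [p - (1.0 if k == wo else 0.0) for k, p in enumerate(probs)]
        # f) gradients, then a small step against them
        grad_h = [sum(err[k] * W_out[k][d] for k in range(V)) for d in range(N)]
        for k in range(V):
            for d in range(N):
                W_out[k][d] -= lr * err[k] * h[d]
        for d in range(N):
            h[d] -= lr * grad_h[d]  # h is row wi of W, so this updates W in place
    return total_loss / len(pairs)

epochs = 50
# h) repeat over the training set while gradually lowering the learning rate
losses = [train_epoch(lr=0.1 * (1.0 - epoch / float(epochs)) + 0.01)
          for epoch in range(epochs)]
print(round(losses[0], 3), round(losses[-1], 3))  # the mean loss should fall
```

A full softmax touches every output row on every step, which is fine for ten words but too slow for a real vocabulary; that cost is the reason practical implementations rely on approximations such as hierarchical softmax or negative sampling.
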


#### CBOW: 


CBOW: a very similar model with the inputs & outputs reversed.  The input layer consists of the context words in our word window (size C), and the output is the word at the center of that window.

<img src='img/CBOW.png'/>
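
As a rough sketch of how CBOW training examples differ from skip-gram ones, the snippet below builds (context words, center word) examples from the same sentence. The window of 2 and the helper name `cbow_examples` are assumptions for illustration only.

```python
sentence = "we are on the cusp of deep learning for the masses".split()
C = 2  # toy window size: up to 2 words on each side of the center word

def cbow_examples(tokens, window):
    """Return (context words, center word) pairs: many inputs, one output."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

examples = cbow_examples(sentence, C)
print(examples[4])  # (['on', 'the', 'of', 'deep'], 'cusp')
```

During training, the vectors of the context words are combined (summed or averaged) into a single hidden-layer vector that is then used to predict the center word.
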



In [29]:
# An Illustration.. 

import os

class MySentences(object):
    """Stream tokenized lines from every file in a directory, one line at a time."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/Users/julialintern/nltk_data/corpora/gutenberg') # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences,min_count=3,workers=5)

In [7]:
# We can test our model
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[('prophet', 0.5949121713638306),
 ('beforetime', 0.5525686740875244),
 ("king's", 0.5368978977203369),
 ('impotent', 0.5293000936508179),
 ('prophet.', 0.5250883102416992)]

In [10]:
import os
import nltk
from nltk.corpus import stopwords

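stop = set(stopwords.words('english'))


class MySentences2(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # Lowercase, then drop stop words token by token
                # (testing the whole token list against `stop` never filters anything)
                words = [word for word in line.lower().split() if word not in stop]
                if words:
                    yield words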
                    
sentences = MySentences2('/Users/julialintern/nltk_data/corpora/gutenberg') 
model = gensim.models.Word2Vec(sentences,min_count=3,workers=5)

In [11]:
# We can test our model
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=5)

[('queen', 0.5984632968902588),
 ('paul', 0.5767810344696045),
 ('prince', 0.5267163515090942),
 ('corinthians', 0.5077587962150574),
 ('emperor', 0.5019620656967163)]

In [12]:
# Similarity

model.similarity('woman','man')

0.57056338874927059

In [15]:
# Compute cosine_similarity

model.n_similarity(['woman', 'girl'], ['man', 'boy'])

0.68047994582732718

In [26]:
print(model['whale'])

[-0.19909103  0.11658057 -0.16124374 -0.0753553   0.280846   -0.33586395
 -0.09959811  0.07986737  0.09244389 -0.00216066  0.15571938 -0.15189013
  0.01881346  0.40199128 -0.0604613   0.22984141  0.1188383  -0.00243269
  0.02065047 -0.34811813 -0.08189397 -0.19340491  0.0159527  -0.01274251
 -0.09855368  0.07735041  0.05886059 -0.23424435  0.12708429  0.04924433
 -0.08262601 -0.16553693  0.15812439 -0.20336491 -0.16151191 -0.15788379
 -0.00684571  0.00556067 -0.24170582  0.35226527 -0.47907963 -0.10631083
 -0.3883279   0.07438131 -0.18755727 -0.17352945 -0.21855398  0.13600644
 -0.12352561 -0.22306238  0.03754399  0.15342307  0.45225978 -0.02662029
  0.01002679 -0.11355209 -0.15838839 -0.33109254 -0.02738108 -0.17900626
 -0.13094853 -0.063792   -0.08352834 -0.03761846  0.09131543  0.04872246
  0.11330067  0.03604474  0.16260488 -0.01536125  0.13360719  0.0912374
  0.14135809 -0.00672222  0.17967939 -0.12232968 -0.18073294 -0.07064348
  0.19077745  0.10241222  0.00484772  0.06955955  0.

In [28]:
print(model['fish'])

[-0.09021866 -0.01703615 -0.02865197  0.02772826 -0.00929441  0.03133548
  0.05567687 -0.0292524   0.11020087  0.07419246  0.01350545 -0.07690042
  0.11439198  0.00588876 -0.13341768 -0.02478784  0.00963654  0.00378606
 -0.03490182  0.00524183  0.08824384 -0.04764377 -0.0797544  -0.03424416
  0.07177494 -0.02164323 -0.07376256 -0.11372524  0.15830228  0.07450563
 -0.04994248 -0.05376245  0.00284377 -0.08964711  0.04206511 -0.1294381
  0.1155242   0.03627551  0.0479129   0.1475217   0.00926567 -0.06451008
  0.03168656  0.04195825  0.00065164 -0.05297829 -0.07661257 -0.01360856
 -0.04855314 -0.10175702 -0.03983173  0.04128741  0.1897212   0.02466719
 -0.01337607  0.04385811 -0.13051809 -0.04868311 -0.04781685  0.05410472
  0.00764993 -0.0131494  -0.10523718 -0.05013266  0.10316719  0.08746231
  0.00569277  0.05293912 -0.06064452 -0.01950141  0.02286505  0.08702739
  0.0897103   0.11196417  0.09544533  0.11684236  0.00404374 -0.08111674
 -0.05138755 -0.00552055  0.00084776  0.08524694 -0.

In [37]:
# Looks like we should refine! ('cereal' is the odd one out, yet the model picks 'lunch')

model.doesnt_match("breakfast cereal dinner lunch".split())

'lunch'
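
To see what an "odd one out" query is doing, here is a rough sketch of the idea as we understand it: normalize each word vector, average them, and return the word least similar to that average. The 2-D vectors below are hand-made assumptions (not taken from our model), and `odd_one_out` is a hypothetical helper, not Gensim's code.

```python
import math

# Hand-made toy vectors: three "meal" words point one way, 'cereal' another.
toy_vectors = {
    'breakfast': [1.0, 0.1],
    'dinner':    [0.9, 0.2],
    'lunch':     [1.0, 0.0],
    'cereal':    [0.1, 1.0],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    units = [unit(vectors[w]) for w in words]
    dim = len(units[0])
    mean = unit([sum(u[d] for u in units) / len(units) for d in range(dim)])
    # Cosine similarity of each word to the mean direction
    sims = [sum(a * b for a, b in zip(u, mean)) for u in units]
    return min(zip(sims, words))[1]

print(odd_one_out("breakfast cereal dinner lunch".split(), toy_vectors))  # cereal
```

With well-separated vectors the answer is the intuitive one; with vectors trained on a small corpus the meal words are not yet clustered, which is why our model above picks the wrong word.
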

Other very cool methods: 

https://radimrehurek.com/gensim/models/word2vec.html

### But if you really want to refine your model, you'll need more data:


https://code.google.com/p/word2vec/

Download:  'freebase-vectors-skipgram1000-en.bin.gz'

###   How you can use Word2Vec in your models: 

1) Sum the word vectors in a sentence (like in CBOW) and predict the label. 
2) Sentiment analysis! 
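
A minimal sketch of the first idea: turn each sentence into one fixed-length feature vector by averaging its word vectors (summing works the same way, without the division). The 2-D `toy_vectors` are invented for illustration; in practice they would come from a trained model, and the resulting features would be passed to any classifier.

```python
# Invented toy vectors standing in for trained word embeddings.
toy_vectors = {
    'good':  [0.9, 0.1],
    'great': [0.8, 0.2],
    'bad':   [-0.9, 0.1],
    'movie': [0.0, 0.7],
}

def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

sentences = ["good movie", "bad movie"]
features = [sentence_vector(s.lower().split(), toy_vectors, 2) for s in sentences]
print(features)  # one row of features per sentence, ready for a classifier
```
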

###   Some things to keep in Mind when using Word2Vec:

1) Word2vec requires a lot of data to train.

As we've illustrated, you can download pretrained vectors. However, if you need to train on your own data, 
you will need a lot of it!  (Think hundreds of millions of words!) 


OTHER REFERENCES:

- https://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
- http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
