### Word vector visualization 

Jay Urbain, PhD

Word vector visualization with [Gensim](https://github.com/RaRe-Technologies/gensim)

Credits:  
https://www.machinelearningplus.com/nlp/gensim-tutorial/  
https://radimrehurek.com/gensim/downloader.html   
[Stanford Class CS224n](https://web.stanford.edu/class/cs224n/)

In [0]:
import numpy as np

# Matplotlib for plotting
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# sklearn for PCA dimensionality reduction
from sklearn.decomposition import PCA

# Gensim for word vectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

Gensim is an NLP library that is especially handy for working with word vectors. It isn't really a deep learning package: it is a package for word and text similarity modeling that started with [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) topic models and grew to include [Singular Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) and neural word representations. It is efficient, scalable, and widely used.

You can try 50d, 100d, 200d, or 300d vectors. Larger vectors generally capture more information, but reported gains tend to level off beyond roughly 300 dimensions, which is why 300d is a common upper choice.
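To compare dimensionalities in practice, note that the gensim-data catalogue publishes the Wikipedia/Gigaword GloVe vectors at each of these sizes. The sketch below is only a convenience wrapper: the helper names `glove_model_name` and `load_glove` are hypothetical and not part of Gensim, and loading a model downloads a large file on first use.

```python
# Dimensionalities available for the Wikipedia/Gigaword GloVe vectors in gensim-data
GLOVE_WIKI_DIMS = (50, 100, 200, 300)

def glove_model_name(dim):
    """Return the gensim-data name of the GloVe model with `dim` dimensions (hypothetical helper)."""
    if dim not in GLOVE_WIKI_DIMS:
        raise ValueError(f"dim must be one of {GLOVE_WIKI_DIMS}, got {dim}")
    return f"glove-wiki-gigaword-{dim}"

def load_glove(dim):
    """Download (on first use) and load the vectors; requires network access."""
    import gensim.downloader as api  # imported lazily so the name helper works offline
    return api.load(glove_model_name(dim))

print(glove_model_name(100))  # glove-wiki-gigaword-100
```

Starting with the 50d vectors is a quick way to check that everything works before committing to the much larger 300d download.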

#### Download

We can download and evaluate fastText, word2vec, and GloVe models using the `gensim.downloader` API. These are large files, so you will have to be a little patient.

In [0]:
import gensim.downloader as api

#print( api.info() )  # return dict with info about available models/datasets
print( api.info("text8") )  # return dict with info about "text8" dataset

In [0]:
import gensim.downloader as api

model = api.load("glove-twitter-25")  # load 25-dimensional GloVe vectors trained on Twitter
model.most_similar("cat")  # show the words most similar to 'cat'

In [0]:
import gensim.downloader as api

# Download the models
# fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
# word2vec_model300 = api.load('word2vec-google-news-300')
glove_model300 = api.load('glove-wiki-gigaword-300')

# Find the words closest to 'support'
glove_model300.most_similar('support')
# [('supporting', 0.6251285076141357),
#  ...
#  ('backing', 0.6007589101791382),
#  ('supports', 0.5269277691841125),
#  ('assistance', 0.520713746547699),
#  ('supportive', 0.5110025405883789)]
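
The scores returned by `most_similar` are cosine similarities between word vectors. As a minimal sketch of that measure, using made-up 3-dimensional vectors rather than real embeddings (the function name `cosine_similarity` is our own, not a Gensim call):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors invented for illustration
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a: 1*(-3) + 2*0 + 3*1 = 0

print(cosine_similarity(a, b))   # approximately 1.0
print(cosine_similarity(a, c))   # 0.0
```

Because the measure ignores vector length, two words can score as highly similar even when their vectors differ a lot in magnitude.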

#### Evaluation

To run the following code, set `model` to the model you would like to evaluate.

In [0]:
model = glove_model300

In [0]:
model.most_similar('obama')

In [0]:
model.most_similar('banana')

In [0]:
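# Pass the word in a list: some Gensim versions iterate over a bare string character by character
model.most_similar(negative=['banana'])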

In [0]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))


  **Visualizing word vectors - blog.acolyer.org**  

<img src="https://adriancolyer.files.wordpress.com/2016/04/word2vec-king-queen-vectors.png?w=400"/>


**word2vec King - Queen Composition - blog.acolyer.org**

<img src="https://adriancolyer.files.wordpress.com/2016/04/word2vec-king-queen-composition.png" width="400px"/>

**The Illustrated Word2Vec - Jay Alammar**

<img src="http://jalammar.github.io/images/word2vec/king-analogy-viz.png" width="400px"/>


In [0]:
# x1 is to x2 as y1 is to ?
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]
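
To make the arithmetic behind `most_similar(positive=..., negative=...)` concrete, here is a rough sketch on toy data. It assumes the usual recipe: add the unit vectors of the positive words, subtract those of the negative words, then rank the remaining words by cosine similarity to the result. The 3-dimensional vectors and the function `toy_most_similar` are invented for illustration and are not Gensim's implementation.

```python
import numpy as np

# Toy vectors invented for illustration; axes are roughly (royalty, male, female)
toy = {
    "king":  np.array([0.9, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.1, 0.5, 0.5]),
}

def unit(v):
    """Scale v to length 1 so that dot products become cosine similarities."""
    return v / np.linalg.norm(v)

def toy_most_similar(vectors, positive, negative, topn=3):
    # Combine the query: add positive words, subtract negative words
    query = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    # Input words are excluded from the ranking
    exclude = set(positive) | set(negative)
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

result = toy_most_similar(toy, positive=["woman", "king"], negative=["man"])
print(result[0][0])  # queen
```

Excluding the input words matters: without that step, one of the query words itself is often the nearest neighbor.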

In [0]:
analogy('japan', 'japanese', 'australia')

In [0]:
analogy('australia', 'beer', 'france')

In [0]:
analogy('obama', 'clinton', 'reagan')

In [0]:
analogy('tall', 'tallest', 'long')

In [0]:
analogy('good', 'fantastic', 'bad')

In [0]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))
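
One plausible way to find the odd word out, and to our understanding close to what `doesnt_match` does, is to average the unit vectors of all the words and return the word least similar to that average. The sketch below uses invented 2-dimensional vectors, and `toy_doesnt_match` is a hypothetical name.

```python
import numpy as np

# Toy vectors invented for illustration: three meals point one way, 'cereal' another
toy = {
    "breakfast": np.array([0.90, 0.20]),
    "lunch":     np.array([0.80, 0.30]),
    "dinner":    np.array([0.85, 0.25]),
    "cereal":    np.array([0.20, 0.90]),
}

def toy_doesnt_match(vectors, words):
    # Normalize each vector so every word contributes equally to the mean
    unit_vecs = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit_vecs.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    # Cosine similarity of each word to the mean direction
    sims = unit_vecs @ mean
    return words[int(np.argmin(sims))]

print(toy_doesnt_match(toy, "breakfast cereal dinner lunch".split()))  # cereal
```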

In [0]:
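%matplotlib inline
def display_pca_scatterplot(model, words=None, sample=0):
    """Project word vectors onto their first two principal components and plot them."""
    if words is None:
        # Gensim >= 4 exposes the vocabulary as `key_to_index`; older releases use `vocab`
        vocab = model.key_to_index if hasattr(model, 'key_to_index') else model.vocab
        if sample > 0:
            # Draw `sample` distinct words at random
            words = np.random.choice(list(vocab.keys()), sample, replace=False)
        else:
            words = list(vocab.keys())

    word_vectors = np.array([model[w] for w in words])

    # Keep only the first two principal components
    twodim = PCA().fit_transform(word_vectors)[:, :2]

    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)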
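
For intuition about what the PCA step does, the following is a minimal NumPy-only sketch: center the data, take the singular value decomposition, and project onto the first two right singular vectors. The function `pca_2d` and the random data are assumptions made for illustration. scikit-learn's `PCA` applies its own sign convention, so its components may come out with flipped signs.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the two directions of greatest variance."""
    X_centered = X - X.mean(axis=0)                 # PCA operates on mean-centered data
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T                    # coordinates along the top two components

# Synthetic stand-in for word vectors: 20 "words" with 5 dimensions of decreasing spread
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

coords = pca_2d(X)
print(coords.shape)  # (20, 2)
```

The first column always has at least as much variance as the second, which is why the horizontal axis of the scatterplot tends to separate the broadest groups of words.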

In [0]:
display_pca_scatterplot(model, 
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])

In [0]:
display_pca_scatterplot(model, sample=300)

#### To do: explore some concepts on your own.

I've started looking at medical concepts.

In [0]:
model.most_similar('cardiac')

In [0]:
model.most_similar('diabetes')

In [0]:
model.most_similar('opioid', topn=20)

In [0]:
model.most_similar('alzheimers', topn=20)

In [0]:
analogy('endocrine', 'diabetes', 'neural')

#### Summary

We've explored the concept of learned word representations and, in doing so, identified semantic relationships between word vectors.

A significant disadvantage of word2vec, GloVe, and fastText is that they are context-free word representations: they represent each word with a single vector and do not take the surrounding context into account.
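
A small sketch of what "context-free" means in practice: a static embedding is just a lookup table, so an ambiguous word such as "bank" gets the same vector whether it appears next to "river" or "money". The vectors and the `embed` helper below are invented for illustration.

```python
import numpy as np

# Hypothetical static embedding table: exactly one vector per word
static_vectors = {
    "river": np.array([0.9, 0.1]),
    "money": np.array([0.1, 0.9]),
    "bank":  np.array([0.5, 0.5]),
}

def embed(sentence, vectors):
    """Look up each known token independently; neighboring words play no role."""
    return [vectors[tok] for tok in sentence.lower().split() if tok in vectors]

bank_near_river = embed("river bank", static_vectors)[-1]
bank_near_money = embed("money bank", static_vectors)[-1]

print(np.array_equal(bank_near_river, bank_near_money))  # True
```

Contextual models address this by computing a different vector for each occurrence of a word, based on the sentence around it.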

A more advanced tutorial can be found here:  
    https://github.com/sismetanin/word2vec-tsne