### Word vector visualization 

Jay Urbain, PhD

Word vector visualization with [Gensim](https://github.com/RaRe-Technologies/gensim)

Credits:  
https://www.machinelearningplus.com/nlp/gensim-tutorial/  
https://radimrehurek.com/gensim/downloader.html   
[Stanford Class CS224n](https://web.stanford.edu/class/cs224n/)

In [None]:
import numpy as np

# Matplotlib for plotting
%matplotlib notebook
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# sklearn for PCA dimensionality reduction
from sklearn.decomposition import PCA

# Gensim for word vectors
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

Gensim is an NLP library that is especially handy for working with word vectors. It isn't really a deep learning package; it is a package for word and text similarity modeling that started with [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) topic models and grew to include [Singular Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) and neural word representations. It is nonetheless efficient, scalable, and widely used.

You can try the 50d, 100d, 200d, or 300d vectors (for example, `glove-wiki-gigaword-50` through `glove-wiki-gigaword-300`). Reported results suggest that performance improves little, if at all, with vectors larger than 300d.

#### Download

We can download and evaluate fastText, word2vec, and GloVe models using the `gensim.downloader` API. These are large files, so you will have to be a little patient.

In [4]:
import gensim.downloader as api

#print( api.info() )  # return dict with info about available models/datasets
print( api.info("text8") )  # return dict with info about "text8" dataset

{'num_records': 1701, 'record_format': 'list of str (tokens)', 'file_size': 33182058, 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py', 'license': 'not found', 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.', 'checksum': '68799af40b6bda07dfa47a32612e5364', 'file_name': 'text8.gz', 'read_more': ['http://mattmahoney.net/dc/textdata.html'], 'parts': 1}


In [5]:
import gensim.downloader as api

model = api.load("glove-twitter-25")  # load 25-d GloVe vectors trained on Twitter
model.most_similar("cat")  # show the words most similar to 'cat'



[('dog', 0.9590819478034973),
 ('monkey', 0.9203578233718872),
 ('bear', 0.9143137335777283),
 ('pet', 0.9108031392097473),
 ('girl', 0.8880630135536194),
 ('horse', 0.8872727155685425),
 ('kitty', 0.8870542049407959),
 ('puppy', 0.886769711971283),
 ('hot', 0.8865255117416382),
 ('lady', 0.8845518827438354)]
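
The scores that `most_similar` returns are cosine similarities between word vectors. The sketch below reproduces that ranking with plain NumPy on a few hand-made 3-d vectors. The vectors and the `toy_most_similar` helper are hypothetical and only illustrate the idea; they are not Gensim's implementation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d toy vectors, not real GloVe values.
toy = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def toy_most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = toy_most_similar("cat", toy)
print(ranking)  # 'dog' ranks above 'car'
```

Because cosine similarity depends only on direction, two vectors of very different length can still score close to 1.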

In [7]:
import gensim.downloader as api

# Download the models
# fasttext_model300 = api.load('fasttext-wiki-news-subwords-300')
# word2vec_model300 = api.load('word2vec-google-news-300')
glove_model300 = api.load('glove-wiki-gigaword-300')

# Show the words most similar to 'support'
glove_model300.most_similar('support')
# [('supporting', 0.6251285076141357),
#  ...
#  ('backing', 0.6007589101791382),
#  ('supports', 0.5269277691841125),
#  ('assistance', 0.520713746547699),
#  ('supportive', 0.5110025405883789)]

[('supported', 0.740031898021698),
 ('supporting', 0.6803102493286133),
 ('backing', 0.6659233570098877),
 ('supports', 0.6377385258674622),
 ('provide', 0.6045100092887878),
 ('assistance', 0.587337076663971),
 ('efforts', 0.5793647766113281),
 ('providing', 0.561307430267334),
 ('strong', 0.5610021352767944),
 ('help', 0.5547006130218506)]

#### Evaluation

To run the following code, set `model` to the model you would like to evaluate.

In [8]:
model = glove_model300

In [9]:
model.most_similar('obama')

[('barack', 0.9254721403121948),
 ('mccain', 0.7590768337249756),
 ('bush', 0.7570987939834595),
 ('clinton', 0.7085603475570679),
 ('hillary', 0.6497915983200073),
 ('kerry', 0.6144052743911743),
 ('rodham', 0.6138635873794556),
 ('biden', 0.5940852165222168),
 ('gore', 0.5885976552963257),
 ('democrats', 0.5608304738998413)]

In [10]:
model.most_similar('banana')

[('bananas', 0.6691170930862427),
 ('mango', 0.580410361289978),
 ('pineapple', 0.5492371916770935),
 ('coconut', 0.5462779402732849),
 ('papaya', 0.541056752204895),
 ('fruit', 0.5218108296394348),
 ('growers', 0.4877638816833496),
 ('nut', 0.4839959144592285),
 ('peanut', 0.48062020540237427),
 ('potato', 0.4806118607521057)]

In [11]:
model.most_similar(negative='banana')

[('keyrates', 0.6847262382507324),
 ('rw97', 0.6595869064331055),
 ('+9.00', 0.6340475678443909),
 ('ryryryryryry', 0.6322759985923767),
 ('zety', 0.5784541368484497),
 ('.0342', 0.5776804089546204),
 ('k586-1', 0.5598777532577515),
 ('cw96', 0.5540916323661804),
 ('mongkolporn', 0.5488854050636292),
 ('purva.patel@chron.com', 0.5483731627464294)]

In [12]:
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))

queen: 0.6713


**Visualizing word vectors - blog.acolyer.org**
<img src="https://adriancolyer.files.wordpress.com/2016/04/word2vec-king-queen-vectors.png?w=400"/>


**word2vec King - Queen Composition - blog.acolyer.org**

<img src="https://adriancolyer.files.wordpress.com/2016/04/word2vec-king-queen-composition.png" width="400px"/>

**The Illustrated Word2Vec - Jay Alammar**

<img src="http://jalammar.github.io/images/word2vec/king-analogy-viz.png" width="400px"/>
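
The king/queen composition can be sketched with a few hand-made vectors. The example below is a minimal illustration, assuming toy 3-d vectors whose axes loosely mean (royalty, gender, food). The `toy_analogy` helper is hypothetical: it adds the unit-normalized positive vectors, subtracts the negative ones, and returns the remaining word with the highest cosine similarity, which approximates what `most_similar(positive=..., negative=...)` does.

```python
import numpy as np

# Hypothetical toy vectors; the axes loosely mean (royalty, gender, food).
toy = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def unit(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

def toy_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: float(np.dot(unit(vectors[w]), target)))

answer = toy_analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(answer)  # queen
```

Excluding the query words matters: without that step, one of the inputs is often the nearest neighbor of the composed vector.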

In [13]:
def analogy(x1, x2, y1):
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]

In [14]:
analogy('japan', 'japanese', 'australia')

'australian'

In [15]:
analogy('australia', 'beer', 'france')

'champagne'

In [16]:
analogy('obama', 'clinton', 'reagan')

'ronald'

In [17]:
analogy('tall', 'tallest', 'long')

'longest'

In [18]:
analogy('good', 'fantastic', 'bad')

'horrible'

In [19]:
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

cereal
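
One way to find the odd word out is to average the unit-normalized vectors and pick the word least similar to that average. The sketch below applies this idea to hypothetical 2-d toy vectors whose axes loosely mean (meal, grain). The `toy_doesnt_match` helper is an illustration under that assumption, not Gensim's own code.

```python
import numpy as np

# Hypothetical toy vectors; the axes loosely mean (meal, grain).
toy = {
    "breakfast": np.array([1.0, 0.2]),
    "lunch":     np.array([1.0, 0.0]),
    "dinner":    np.array([1.0, 0.1]),
    "cereal":    np.array([0.3, 1.0]),
}

def toy_doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the group's mean direction."""
    unit_vectors = np.vstack([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centroid = unit_vectors.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    similarities = unit_vectors @ centroid  # cosine similarity of each word to the centroid
    return words[int(np.argmin(similarities))]

odd_one = toy_doesnt_match("breakfast cereal dinner lunch".split(), toy)
print(odd_one)  # cereal
```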




In [20]:
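def display_pca_scatterplot(model, words=None, sample=0):
    """Project word vectors to 2-D with PCA and plot them with text labels."""
    if words is None:
        # gensim >= 4 exposes the vocabulary as `key_to_index`; older releases use `vocab`
        vocab = list(model.key_to_index if hasattr(model, 'key_to_index') else model.vocab)
        if sample > 0:
            # Sample without replacement so no word is plotted twice
            words = np.random.choice(vocab, sample, replace=False)
        else:
            words = vocab

    word_vectors = np.array([model[w] for w in words])

    # Fit PCA and keep only the first two principal components
    twodim = PCA().fit_transform(word_vectors)[:, :2]

    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)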
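
For intuition about the PCA step, the projection can be written directly with NumPy: center the vectors, take the singular value decomposition, and project onto the first two principal directions. This is a minimal sketch on random stand-in data (the `toy_vectors` array is hypothetical); scikit-learn's `PCA` may flip the sign of a component relative to this version.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 50))  # 10 hypothetical "words", 50 dimensions each
coords = pca_2d(toy_vectors)
print(coords.shape)  # (10, 2)
```

The first output column captures at least as much variance as the second, which is why the scatterplot uses those two columns as its axes.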

In [21]:
display_pca_scatterplot(model, 
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])


In [22]:
display_pca_scatterplot(model, sample=300)


#### To do: explore some concepts on your own.

I've started looking at medical concepts.

In [23]:
model.most_similar('cardiac')

[('cardiovascular', 0.6490300893783569),
 ('arrhythmia', 0.6411479711532593),
 ('coronary', 0.6186240911483765),
 ('arrhythmias', 0.599291980266571),
 ('heart', 0.580522894859314),
 ('pulmonary', 0.5803776383399963),
 ('vascular', 0.55409836769104),
 ('catheterization', 0.5496689081192017),
 ('renal', 0.5444202423095703),
 ('neurological', 0.5365750789642334)]

In [24]:
model.most_similar('diabetes')

[('hypertension', 0.7783902883529663),
 ('obesity', 0.7220836877822876),
 ('asthma', 0.6963690519332886),
 ('alzheimer', 0.6956477165222168),
 ('arthritis', 0.6762572526931763),
 ('diabetics', 0.6576086282730103),
 ('osteoporosis', 0.6514430642127991),
 ('cardiovascular', 0.6253440380096436),
 ('disease', 0.6239296197891235),
 ('epilepsy', 0.6215373873710632)]

In [25]:
model.most_similar('opioid', topn=20)

[('opioids', 0.6966589689254761),
 ('analgesic', 0.656631350517273),
 ('opiate', 0.6186232566833496),
 ('agonists', 0.5982040166854858),
 ('agonist', 0.5847464799880981),
 ('benzodiazepine', 0.5649693012237549),
 ('cannabinoid', 0.5598238706588745),
 ('analgesics', 0.5569523572921753),
 ('endogenous', 0.5363828539848328),
 ('receptors', 0.532952606678009),
 ('benzodiazepines', 0.5248212814331055),
 ('dopamine', 0.5204257965087891),
 ('morphine', 0.5166813731193542),
 ('serotonin', 0.5131203532218933),
 ('receptor', 0.5015822052955627),
 ('painkillers', 0.5012892484664917),
 ('anti-inflammatory', 0.4960584044456482),
 ('μ-opioid', 0.4935702681541443),
 ('neurotransmitter', 0.4874703884124756),
 ('nmda', 0.4839155673980713)]

In [26]:
model.most_similar('alzheimers', topn=20)

[('parkinsons', 0.5796595811843872),
 ('neurone', 0.44790858030319214),
 ('alzheimer', 0.44505855441093445),
 ('dementia', 0.4296298623085022),
 ('tay-sachs', 0.42227107286453247),
 ('coeliac', 0.41936880350112915),
 ('parkinson', 0.393898069858551),
 ('myasthenia', 0.3933151662349701),
 ('foot-and-mouth', 0.3902786374092102),
 ('spondylitis', 0.3854370713233948),
 ('sciatica', 0.3840393126010895),
 ('atherosclerosis', 0.37351036071777344),
 ('sandhoff', 0.3727266788482666),
 ('occlusive', 0.37110474705696106),
 ('amyotrophic', 0.36899054050445557),
 ('senility', 0.367917537689209),
 ('graft-versus-host', 0.3652462363243103),
 ('chorea', 0.3651826083660126),
 ('post-modernism', 0.36478275060653687),
 ('diarrhoeal', 0.36266517639160156)]

In [27]:
analogy('endocrine', 'diabetes', 'neural')

'obesity'

#### Summary

We've explored learned word representations and, in doing so, identified semantic relationships between word vectors.

A significant disadvantage of word2vec, GloVe, and fastText is that they are `context free` word representations, i.e., they represent each word with a single vector and do not take context into account.
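
The limitation is easy to see with a static lookup table. In the sketch below, the `embeddings` dictionary and the `embed` helper are hypothetical stand-ins for a trained model: the word "bank" receives exactly the same vector whether it appears next to "river" or next to "money".

```python
import numpy as np

# Hypothetical lookup table standing in for a trained static embedding.
embeddings = {
    "river": np.array([0.9, 0.1]),
    "money": np.array([0.1, 0.9]),
    "bank":  np.array([0.5, 0.5]),
}

def embed(sentence):
    """Look up one fixed vector per token, ignoring the surrounding words."""
    return [embeddings[token] for token in sentence.split()]

river_bank = embed("river bank")
money_bank = embed("money bank")

# The vector for 'bank' is identical in both contexts.
same_vector = np.array_equal(river_bank[1], money_bank[1])
print(same_vector)  # True
```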

A more advanced tutorial can be found here:  
    https://github.com/sismetanin/word2vec-tsne