# Using pre-trained embeddings and NLP corpora

Gensim lets you load pre-trained GloVe and Word2Vec embeddings directly through its downloader API. It also offers reusable corpora that you can download and immediately use to train a Word2Vec embedding. The code snippets below show you how. The embeddings and corpora are hosted here: https://github.com/RaRe-Technologies/gensim-data.

A word of warning: I'm not impressed with the quality of the pre-trained word embeddings. Either the underlying dataset is noisy or it's just too general. More on this later.

In [30]:
import warnings
warnings.filterwarnings('ignore')

## Imports

In [31]:
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

## Pre-trained: Twitter GloVe Embeddings
This first step downloads the pre-trained embeddings and loads them for re-use. As the name suggests, these are GloVe embeddings built from tweets: 2B tweets, 27B tokens, 1.2M vocabulary, uncased. The original source can be found here: https://nlp.stanford.edu/projects/glove/. The `25` in the model name refers to the dimensionality of the vectors.

In [32]:
# download the model and return as object ready for use
model = api.load("glove-twitter-25")

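Once you have loaded the pre-trained model, use it much as you would the word vectors of any gensim Word2Vec model. `api.load` returns the vectors as a `KeyedVectors` object, so the query methods are called on it directly. Here are a few similarity examples:

In [33]:
model.most_similar("pelosi", topn=10)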

[('clegg', 0.9653651118278503),
 ('miliband', 0.9515050053596497),
 ('bachmann', 0.9484400749206543),
 ('mcconnell', 0.9416399002075195),
 ('carney', 0.934025764465332),
 ('coulter', 0.9311323761940002),
 ('boehner', 0.9286302328109741),
 ('santorum', 0.9269059300422668),
 ('farage', 0.919365406036377),
 ('mourdock', 0.9186689853668213)]

In [34]:
model.most_similar("policies", topn=10)

[('policy', 0.9484813213348389),
 ('reforms', 0.9403933882713318),
 ('laws', 0.94012051820755),
 ('government', 0.923071026802063),
 ('regulations', 0.916893482208252),
 ('economy', 0.9110006093978882),
 ('immigration', 0.9105910062789917),
 ('legislation', 0.908964991569519),
 ('govt', 0.9054746627807617),
 ('regulation', 0.9050779342651367)]

Which of these words doesn't fit?

In [35]:
# which word doesn't fit with the others?
model.doesnt_match(["trump", "bernie", "obama", "pelosi", "orange"])

'orange'
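
To see why `orange` gets picked, it helps to know roughly how `doesnt_match` decides. As far as I understand gensim's implementation, it unit-normalizes each word vector, averages them, and returns the word whose vector is least similar to that average. Here is a minimal sketch of that idea in plain Python. The 2-d `toy_vectors` are made up for illustration, and `unit` and `odd_one_out` are hypothetical helper names, not gensim functions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    # average the unit vectors, then normalize the average as well
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # the dot product of two unit vectors is their cosine similarity
    scores = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(scores, key=scores.get)

# made-up vectors: three words pointing the same way and one outlier
toy_vectors = {
    "trump": [0.9, 0.1],
    "obama": [0.8, 0.2],
    "pelosi": [0.85, 0.15],
    "orange": [0.1, 0.9],
}
print(odd_one_out(toy_vectors))
```

With real embeddings the same logic applies in 25 dimensions instead of 2.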

Word vectors for `trump` and `obama`

In [36]:
# show weight vector for trump and obama
model["trump"],model['obama']

(array([-0.56174 ,  0.69419 ,  0.16733 ,  0.055867, -0.26266 , -0.6303  ,
        -0.28311 , -0.88244 ,  0.57317 , -0.82376 ,  0.46728 ,  0.48607 ,
        -2.1942  , -0.41972 ,  0.31795 , -0.70063 ,  0.060693,  0.45279 ,
         0.6564  ,  0.20738 ,  0.84496 , -0.087537, -0.38856 , -0.97028 ,
        -0.40427 ], dtype=float32),
 array([ 0.77126 ,  0.81259 , -0.5901  , -0.015908, -0.082797, -1.2261  ,
         0.098286,  0.087488,  0.012586, -0.35884 ,  0.80733 ,  0.12569 ,
        -4.0522  ,  0.14856 ,  0.6988  , -0.78948 , -0.77125 ,  0.49512 ,
         0.16366 , -0.9713  ,  0.95064 ,  0.19921 , -0.27903 , -1.6844  ,
        -0.79424 ], dtype=float32))

### Rank phrases by similarity

The goal here is, given a query phrase, to rank all other phrases by semantic similarity (using the word embeddings) and to compare that ranking with surface-level similarity measured by the Jaccard similarity index.

In [87]:
import pandas as pd

phrases=["barrack obama","barrack h. obama","barrack hussein obama","michelle obama","donald trump","melania trump"]
query="barack hussain obama"

results_glove=[]
results_jaccard=[]

def compute_jaccard(t1,t2):
    # tokens of t1 that also appear in t2
    intersect = [value for value in t1 if value in t2]

    # distinct tokens across both lists
    union = set(t1) | set(t2)

    # the 0.01 avoids a division by zero when both lists are empty
    jaccard = len(intersect) / (len(union) + 0.01)
    return jaccard


# keep only the query tokens that have a vector in the model
tokens_2 = [t for t in query.split() if t in model]

for p in phrases:
    # keep only the phrase tokens that have a vector in the model
    tokens_1 = [t for t in p.split() if t in model]

    # compute jaccard similarity
    jaccard = compute_jaccard(tokens_1, tokens_2)
    results_jaccard.append([p, jaccard])

    # compute cosine similarity using word embeddings
    # (stays 0 if either phrase has no in-vocabulary tokens)
    cosine = 0
    if len(tokens_1) > 0 and len(tokens_2) > 0:
        cosine = model.n_similarity(tokens_1, tokens_2)
    results_glove.append([p, cosine])

print("Phrases most similar to '{0}' using glove word embeddings".format(query))
pd.DataFrame(results_glove,columns=["phrase","score"]).sort_values(by=["score"],ascending=False)

Phrases most similar to 'barack hussain obama' using glove word embeddings


| | phrase | score |
|---|---|---|
| 0 | barrack obama | 0.956801 |
| 1 | barrack h. obama | 0.944671 |
| 2 | barrack hussein obama | 0.937 |
| 3 | michelle obama | 0.905201 |
| 4 | donald trump | 0.729601 |
| 5 | melania trump | 0.614963 |
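
`n_similarity` is what turns individual word vectors into a phrase-level score here. To my understanding, it averages the vectors of each token list and returns the cosine similarity between the two averages. A rough sketch of that computation follows. The 3-d vectors in `toy` are invented, and `mean_vector`, `cosine`, and `phrase_similarity` are illustrative names rather than gensim's.

```python
import math

def mean_vector(tokens, vectors):
    """Average the vectors of the given tokens, component by component."""
    rows = [vectors[t] for t in tokens]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def phrase_similarity(tokens_1, tokens_2, vectors):
    """Cosine similarity between the mean vectors of two token lists."""
    return cosine(mean_vector(tokens_1, vectors), mean_vector(tokens_2, vectors))

# made-up 3-d vectors, for illustration only
toy = {
    "barack": [1.0, 0.2, 0.0],
    "michelle": [0.8, 0.4, 0.1],
    "obama": [0.9, 0.3, 0.0],
    "donald": [0.1, 0.9, 0.3],
    "trump": [0.0, 1.0, 0.2],
}

same_family = phrase_similarity(["michelle", "obama"], ["barack", "obama"], toy)
other = phrase_similarity(["donald", "trump"], ["barack", "obama"], toy)
print(same_family > other)
```

Because every token contributes equally to the average, a shared surname alone is enough to pull a phrase close to the query, which fits the high score for `michelle obama` in the table.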


In [88]:
print("Phrases most similar to '{0}' using jaccard similarity".format(query))
pd.DataFrame(results_jaccard,columns=["phrase","score"]).sort_values(by=["score"],ascending=False)

Phrases most similar to 'barack hussain obama' using jaccard similarity


| | phrase | score |
|---|---|---|
| 0 | barrack obama | 0.249377 |
| 3 | michelle obama | 0.249377 |
| 1 | barrack h. obama | 0.199601 |
| 2 | barrack hussein obama | 0.199601 |
| 4 | donald trump | 0.0 |
| 5 | melania trump | 0.0 |
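
For comparison, the textbook Jaccard index is the size of the intersection divided by the size of the union, both taken over sets. Below is a small sketch of that definition. The `jaccard` helper is my own name. Unlike `compute_jaccard` above, it uses sets on both sides, returns 0.0 for two empty inputs instead of adding 0.01 to the denominator, and skips the vocabulary filtering.

```python
def jaccard(tokens_1, tokens_2):
    """Set-based Jaccard index: |A & B| / |A | B|."""
    a, b = set(tokens_1), set(tokens_2)
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    return len(a & b) / len(a | b)

query = "barack hussain obama".split()
for phrase in ["barrack obama", "michelle obama", "donald trump"]:
    print(phrase, jaccard(phrase.split(), query))
```

Either way, the score only sees exact token overlap, so `barrack obama` and `michelle obama` tie: each shares exactly one token with the query.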


## Pre-trained: GloVe Wikipedia + Gigaword
The example below uses pre-trained GloVe vectors based on Wikipedia 2014 and Gigaword. The original source of these embeddings can be found here: https://nlp.stanford.edu/projects/glove/

In [5]:
#again, download and load the model
model_gigaword = api.load("glove-wiki-gigaword-100")

In [6]:
# find similarity
model_gigaword.most_similar(positive=["dirty", "grimy"], topn=10)

[('filthy', 0.7690386176109314),
 ('smelly', 0.7392696738243103),
 ('shabby', 0.7025482654571533),
 ('dingy', 0.7022336721420288),
 ('grubby', 0.6754513382911682),
 ('grungy', 0.6414024233818054),
 ('dank', 0.6263698935508728),
 ('sweaty', 0.622745156288147),
 ('dreary', 0.6216242909431458),
 ('gritty', 0.6215749382972717)]

In [7]:
model_gigaword.most_similar(positive=["summer", "winter"], topn=10)

[('spring', 0.8519278764724731),
 ('autumn', 0.7865706086158752),
 ('olympics', 0.6915045380592346),
 ('weekend', 0.6908971667289734),
 ('days', 0.6872981786727905),
 ('during', 0.6861999034881592),
 ('season', 0.6849778294563293),
 ('year', 0.6827663779258728),
 ('rainy', 0.6744828820228577),
 ('day', 0.671191930770874)]
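
Passing several words in `positive` asks for neighbors of the group rather than of a single word. Roughly, and as far as I understand it, gensim averages the unit-normalized vectors of the input words and ranks the rest of the vocabulary by cosine similarity to that average, leaving the input words out of the results. A toy sketch of that ranking is below. The 2-d vectors are made up, and `unit` and `nearest` are hypothetical helpers.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def nearest(positive, vectors, topn=2):
    """Rank words by cosine similarity to the mean of the `positive` words."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    query = unit([sum(units[w][i] for w in positive) / len(positive) for i in range(dim)])
    scored = [
        (w, sum(a * b for a, b in zip(u, query)))
        for w, u in units.items()
        if w not in positive  # input words are excluded from the results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# made-up vectors: four season-like words and one unrelated word
toy = {
    "summer": [1.0, 0.1],
    "winter": [0.9, 0.3],
    "spring": [0.95, 0.2],
    "autumn": [0.85, 0.35],
    "keyboard": [0.1, 1.0],
}
print([w for w, _ in nearest(["summer", "winter"], toy)])
```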

## Load a dataset and train a model
Instead of loading pre-trained embeddings, you can also load a corpus and train a model on it yourself. The list of datasets that you can download can be found here: https://github.com/RaRe-Technologies/gensim-data#datasets

In [8]:
from gensim.models.word2vec import Word2Vec

# this loads the text8 dataset
corpus = api.load('text8')

# train a Word2Vec model from the corpus
# (gensim 4.x parameter names; gensim 3.x called these `size` and `iter`)
model_text8 = Word2Vec(corpus, vector_size=150, window=10, min_count=2, epochs=10, workers=10)
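
`Word2Vec` consumes the corpus as an iterable of token lists, and because training makes several passes over the data, that iterable has to be restartable (a plain generator is exhausted after one pass). If you wanted to train on your own text in the same way, a minimal sketch of such an iterable might look like this. The `LineCorpus` class and the sample lines are hypothetical.

```python
class LineCorpus:
    """Restartable iterable that yields one lowercase token list per line."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # a fresh pass starts every time iteration begins
        for line in self.lines:
            tokens = line.lower().split()
            if tokens:  # skip blank lines
                yield tokens

sample = LineCorpus(["The cat sat", "", "Dogs chase cats"])
first_pass = list(sample)
second_pass = list(sample)  # iterating again works, unlike a one-shot generator
print(first_pass == second_pass, first_pass)
```

An object like `sample` could then be passed to `Word2Vec` in place of `corpus`.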


In [9]:
# similarity 
model_text8.wv.most_similar("shocked")

[('surprised', 0.7146514058113098),
 ('outraged', 0.7117233276367188),
 ('disappointed', 0.6712729930877686),
 ('angered', 0.6455301642417908),
 ('offended', 0.6371268630027771),
 ('overwhelmed', 0.6347959637641907),
 ('confronted', 0.6278891563415527),
 ('betrayed', 0.6236147284507751),
 ('disgusted', 0.6220308542251587),
 ('alarmed', 0.6148042678833008)]

In [10]:
# similarity between two different words
model_text8.wv.similarity(w1="dirty",w2="smelly")

0.44690064

In [11]:
# Which one is the odd one out in this list?
model_text8.wv.doesnt_match(["cat","dog","france"])

'france'