# 01 - Exploring Word Embeddings
In this notebook, we'll look at pre-trained word embeddings. We'll plot the embeddings so we can compare them visually, then look at analogies and word similarities. We'll use the Gensim library, which makes it easy to work with embeddings.

In [29]:
import gensim
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

#### Download a table of pre-trained embeddings

In [3]:
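# Load 300-dimensional GloVe vectors trained on Wikipedia and Gigaword
# (downloaded and cached the first time this runs)
model = api.load("glove-wiki-gigaword-300")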




How many words does this table have?

In [30]:
model.vectors.shape

(400000, 300)

Which means:
* 400,000 words (vocab_size)
* Each has an embedding composed of 300 numbers (embedding_size)
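To make the shape concrete, here is a minimal sketch of an embedding table as a matrix plus a word-to-row mapping. The vocabulary, the values, and the `lookup` helper are all made up for illustration; this shows the idea behind indexing a model by word, not how Gensim stores its vectors internally.

```python
import numpy as np

# Hypothetical toy vocabulary; the real table has 400,000 rows and 300 columns.
vocab = ["cat", "dog", "car"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# One row per word, one column per embedding dimension (values are made up).
vectors = np.array([
    [0.1, 0.3, -0.2, 0.5],
    [0.2, 0.4, -0.1, 0.4],
    [-0.6, 0.1, 0.7, 0.0],
])

vocab_size, embedding_size = vectors.shape


def lookup(word):
    """Return the embedding row for a word (illustrative helper)."""
    return vectors[word_to_index[word]]


print(vocab_size, embedding_size)
print(lookup("dog"))
```

Looking up a word is just selecting one row of the matrix, which is why `model.vectors.shape` reports the vocabulary size first and the embedding size second.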

In [31]:
model['cat']

array([-0.29353  ,  0.33247  , -0.047372 , -0.12247  ,  0.071956 ,
       -0.23408  , -0.06238  , -0.0037192, -0.39462  , -0.69411  ,
        0.36731  , -0.12141  , -0.044485 , -0.15268  ,  0.34864  ,
        0.22926  ,  0.54361  ,  0.25215  ,  0.097972 , -0.087305 ,
        0.87058  , -0.12211  , -0.079825 ,  0.28712  , -0.68563  ,
       -0.27265  ,  0.22056  , -0.75752  ,  0.56293  ,  0.091377 ,
       -0.71004  , -0.3142   , -0.56826  , -0.26684  , -0.60102  ,
        0.26959  , -0.17992  ,  0.10701  , -0.57858  ,  0.38161  ,
       -0.67127  ,  0.10927  ,  0.079426 ,  0.022372 , -0.081147 ,
        0.011182 ,  0.67089  , -0.19094  , -0.33676  , -0.48471  ,
       -0.35406  , -0.15209  ,  0.44503  ,  0.46385  ,  0.38409  ,
        0.045081 , -0.59079  ,  0.21763  ,  0.38576  , -0.44567  ,
        0.009332 ,  0.442    ,  0.097062 ,  0.38005  , -0.11881  ,
       -0.42718  , -0.31005  , -0.025058 ,  0.12689  , -0.13468  ,
        0.11976  ,  0.76253  ,  0.2524   , -0.26934  ,  0.0686

#### Compare two word embeddings

In [32]:
# Access embeddings for words
embedding1 = model["cat"]
embedding2 = model["dog"]

You can check how similar two words are based on their cosine similarity.

In [33]:
similarity = cosine_similarity([embedding1], [embedding2])
print(f"Cosine Similarity between 'cat' and 'dog': {similarity[0][0]}")


Cosine Similarity between 'cat' and 'dog': 0.6816747188568115
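Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it only depends on the angle between them. The sketch below computes it by hand with NumPy on small made-up vectors; `cosine`, `u`, `v`, and `w` are illustrative names, not part of the notebook.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])     # same direction as u, twice as long
w = np.array([-1.0, -2.0, -3.0])  # opposite direction to u

print(cosine(u, v))  # close to 1: same direction, length is ignored
print(cosine(u, w))  # close to -1: opposite direction
```

Applied to `embedding1` and `embedding2`, the same formula should agree with the scikit-learn result up to floating-point error.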


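Measure the Euclidean distance between the two embeddings. Unlike cosine similarity, it is sensitive to vector length, and smaller values mean the vectors are closer.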

In [34]:
distance = np.linalg.norm(embedding1 - embedding2)
print(f"Euclidean Distance between 'cat' and 'dog': {distance}")


Euclidean Distance between 'cat' and 'dog': 5.195904731750488


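Check the dot product of the two embeddings. It reflects both direction and magnitude: it equals the cosine similarity multiplied by the lengths of both vectors, so it is not bounded to the range [-1, 1].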

In [35]:
dot_product = np.dot(embedding1, embedding2)
print(f"Dot Product between 'cat' and 'dog': {dot_product}")


Dot Product between 'cat' and 'dog': 28.778003692626953
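The three measures above are closely related. As a small sketch on made-up 2-d vectors (the names `a`, `b`, and `b_scaled` are illustrative), we can check that the squared Euclidean distance equals the two squared lengths minus twice the dot product, and that rescaling a vector changes the dot product but not the cosine similarity.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot = np.dot(a, b)                       # 3*4 + 4*3 = 24
norm_a = np.linalg.norm(a)               # length of a: 5
norm_b = np.linalg.norm(b)               # length of b: 5
cos = dot / (norm_a * norm_b)            # 24 / 25
dist = np.linalg.norm(a - b)             # length of (-1, 1)

# Squared distance = |a|^2 + |b|^2 - 2 * (a . b)
print(np.isclose(dist**2, norm_a**2 + norm_b**2 - 2 * dot))

# Make b ten times longer: the dot product grows, the cosine does not.
b_scaled = 10 * b
dot_scaled = np.dot(a, b_scaled)
cos_scaled = dot_scaled / (norm_a * np.linalg.norm(b_scaled))
print(dot, dot_scaled)
print(cos, cos_scaled)
```

This is why cosine similarity is usually preferred for comparing word embeddings: vector length does not affect the score.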


Top 100 similar words, ranked by cosine similarity to the query word:

In [42]:
similar_words = model.most_similar("pakistan", topn=100)

# Print each word with its similarity score, one per line
for word, score in similar_words:
    print(f"{word}: {score}")



pakistani: 0.7468125820159912
india: 0.7285578846931458
bangladesh: 0.6560508608818054
islamabad: 0.6534844040870667
afghanistan: 0.6449339985847473
kashmir: 0.6071956157684326
musharraf: 0.6063677668571472
lanka: 0.5710559487342834
karachi: 0.564299464225769
lahore: 0.5641221404075623
afghan: 0.5493603944778442
iran: 0.5435517430305481
pervez: 0.5388423800468445
delhi: 0.5302056074142456
taliban: 0.5259105563163757
punjab: 0.5258557200431824
sri: 0.5210655927658081
militants: 0.5195590853691101
bhutto: 0.517187774181366
khan: 0.5122395753860474
indian: 0.5028319358825684
peshawar: 0.49511292576789856
pakistanis: 0.4813428819179535
arabia: 0.4808826744556427
nepal: 0.479999840259552
cricket: 0.47665661573410034
syria: 0.4667215049266815
islamic: 0.4537121653556824
yemen: 0.4491788148880005
saudi: 0.4431416094303131
uzbekistan: 0.44242429733276367
australia: 0.43970999121665955
zealand: 0.4385705292224884
militant: 0.4382505416870117
malaysia: 0.43476229906082153
subcontinent: 0.4337340

## Analogies
### cat - horse + love = ?

In [37]:
model.most_similar(positive=["cat", "love"], negative=["horse"])

[('loves', 0.48272591829299927),
 ('lover', 0.4793064594268799),
 ('lovers', 0.4516974985599518),
 ('loved', 0.42938748002052307),
 ('friends', 0.42440059781074524),
 ('sings', 0.42226406931877136),
 ('mom', 0.41548898816108704),
 ('romantic', 0.41227686405181885),
 ('affection', 0.40540915727615356),
 ('cats', 0.40343040227890015)]
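To see what an analogy query does, here is a rough sketch of the arithmetic on a made-up 2-d vocabulary. It assumes the query is built by adding unit-length positive vectors, subtracting unit-length negative vectors, and ranking the remaining words by cosine similarity, which is our understanding of what `most_similar` does; the `toy` table, `unit`, and `analogy` are hypothetical names and values, not Gensim's implementation.

```python
import numpy as np

# Made-up 2-d embeddings (hypothetical values, chosen so the analogy works).
toy = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([-0.7, 0.1]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def analogy(positive, negative, topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    signed = [unit(toy[w]) for w in positive] + [-unit(toy[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    scores = {
        word: float(np.dot(query, unit(vec)))
        for word, vec in toy.items()
        if word not in positive and word not in negative  # skip the input words
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]


# king - man + woman = ?
print(analogy(positive=["king", "woman"], negative=["man"]))
```

In this toy table "queen" ranks first because the values were picked to make it so. With real embeddings the top results are often just inflections of the input words, as the output above shows.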

### man - woman + love = ?

In [38]:
model.most_similar(positive=["man", "love"], negative=["woman"])

[('passion', 0.534675657749176),
 ('you', 0.5236014127731323),
 ('always', 0.5187980532646179),
 ('me', 0.5141775012016296),
 ('loves', 0.5118120908737183),
 ('good', 0.5114336609840393),
 ('?', 0.499277800321579),
 ('know', 0.4981096088886261),
 ('really', 0.49251341819763184),
 ('fun', 0.4884888827800751)]

### woman - man + love = ?

In [40]:
model.most_similar(positive=["woman", "love"], negative=["man"])

[('her', 0.5882567763328552),
 ('mother', 0.5708790421485901),
 ('she', 0.567848265171051),
 ('romance', 0.5440021753311157),
 ('herself', 0.5413269996643066),
 ('beautiful', 0.5150748491287231),
 ('girl', 0.5132489800453186),
 ('lover', 0.5117532014846802),
 ('loves', 0.503203809261322),
 ('romantic', 0.5030348896980286)]