<a href="https://colab.research.google.com/github/kretchmar/CS339_2023/blob/main/GloveExploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embeddings
Matt Kretchmar <p>
March 2023 <p>

An exploration of the GloVe model, which, like Word2Vec, represents each word as a dense vector.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import gensim.downloader as api
from gensim.models import KeyedVectors


## Download the Model
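Here we download the pretrained `glove-wiki-gigaword-100` vectors through gensim's downloader and load them into memory. Each word is represented by a 100-dimensional vector. The first call downloads the data; later calls reuse the locally cached copy.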

In [2]:
model = api.load('glove-wiki-gigaword-100')
print(model)

<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f8e844fee20>


### Word to Vector
Here we can directly access the vector representation of any individual word by indexing the model with that word.

In [3]:
print(model['king'])

[-0.32307  -0.87616   0.21977   0.25268   0.22976   0.7388   -0.37954
 -0.35307  -0.84369  -1.1113   -0.30266   0.33178  -0.25113   0.30448
 -0.077491 -0.89815   0.092496 -1.1407   -0.58324   0.66869  -0.23122
 -0.95855   0.28262  -0.078848  0.75315   0.26584   0.3422   -0.33949
  0.95608   0.065641  0.45747   0.39835   0.57965   0.39267  -0.21851
  0.58795  -0.55999   0.63368  -0.043983 -0.68731  -0.37841   0.38026
  0.61641  -0.88269  -0.12346  -0.37928  -0.38318   0.23868   0.6685
 -0.43321  -0.11065   0.081723  1.1569    0.78958  -0.21223  -2.3211
 -0.67806   0.44561   0.65707   0.1045    0.46217   0.19912   0.25802
  0.057194  0.53443  -0.43133  -0.34311   0.59789  -0.58417   0.068995
  0.23944  -0.85181   0.30379  -0.34177  -0.25746  -0.031101 -0.16285
  0.45169  -0.91627   0.64521   0.73281  -0.22752   0.30226   0.044801
 -0.83741   0.55006  -0.52506  -1.7357    0.4751   -0.70487   0.056939
 -0.7132    0.089623  0.41394  -1.3363   -0.61915  -0.33089  -0.52881
  0.16483  -0.98878 ]
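
Conceptually, this lookup is a row fetch from an embedding matrix. The sketch below mimics that idea with a tiny hand-made table. The names `toy_vocab`, `toy_vectors`, and `lookup` are hypothetical, and the 3-dimensional numbers are invented for illustration, not taken from GloVe.

```python
import numpy as np

# Hypothetical toy vocabulary: word -> row index (not real GloVe data).
toy_vocab = {'king': 0, 'queen': 1, 'banana': 2}

# One row per word. Rows from the real model have 100 entries instead of 3.
toy_vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.9, 0.7, 0.2],
    [0.1, 0.0, 0.9],
], dtype=np.float32)

def lookup(word):
    '''Return the embedding row for word, analogous to model[word].'''
    return toy_vectors[toy_vocab[word]]

print(lookup('king'))
print(lookup('king').shape)  # (3,)
```

With the real model, the same shape check on `model['king']` should report 100 entries, matching the 100 values printed above.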

### Synonyms
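Here we can find "similar" words, those located near a given word in vector space. The neighbors are related words (for example `queen` and `throne` for `king`), not strictly synonyms. Each result is paired with its similarity score.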

In [6]:
model.most_similar('king')

[('prince', 0.7682329416275024),
 ('queen', 0.7507690191268921),
 ('son', 0.7020887136459351),
 ('brother', 0.6985775232315063),
 ('monarch', 0.6977890729904175),
 ('throne', 0.6919990181922913),
 ('kingdom', 0.6811410188674927),
 ('father', 0.6802029013633728),
 ('emperor', 0.6712857484817505),
 ('ii', 0.6676074266433716)]

In [7]:
model.most_similar('banana')

[('coconut', 0.7097253799438477),
 ('mango', 0.7054824233055115),
 ('bananas', 0.6887733936309814),
 ('potato', 0.6629636287689209),
 ('pineapple', 0.6534532904624939),
 ('fruit', 0.6519855260848999),
 ('peanut', 0.6420576572418213),
 ('pecan', 0.6349173188209534),
 ('cashew', 0.6294420957565308),
 ('papaya', 0.6246591210365295)]
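
To make the idea of "nearby in vector space" concrete, here is a minimal sketch of a nearest-neighbor search over a toy embedding table. It is an illustration under stated assumptions, not gensim's implementation: `words`, `vectors`, and `toy_most_similar` are hypothetical names, and the numbers are invented.

```python
import numpy as np

# Hypothetical toy embeddings (invented numbers, not GloVe values).
words = ['king', 'queen', 'prince', 'banana', 'mango']
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.1],
    [0.9, 0.6, 0.2],
    [0.1, 0.1, 0.9],
    [0.2, 0.1, 0.8],
])

def toy_most_similar(word, topn=2):
    '''Rank every other word by cosine similarity to word.'''
    # Scale each row to unit length so that a dot product equals a cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[words.index(word)]
    sims = unit @ query            # one similarity score per vocabulary word
    order = np.argsort(-sims)      # indices sorted by descending similarity
    ranked = [(words[i], float(sims[i])) for i in order if words[i] != word]
    return ranked[:topn]

print(toy_most_similar('king'))
print(toy_most_similar('banana'))
```

In this toy table the royalty words rank each other highest and `mango` is the nearest neighbor of `banana`, mirroring the pattern in the real outputs above.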

### Cosine Similarity
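The check below reproduces the score that `most_similar` reported for `coconut`, which shows that the model uses cosine similarity as its metric for finding similar words. In gensim, `most_similar` still compares the query against every word in the vocabulary, but it does so as a single vectorized matrix-vector product rather than a Python loop, so the search is fast in practice.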

In [15]:
from numpy.linalg import norm
def cosine_sim(A, B):
    '''Cosine of the angle between vectors A and B.'''
    return np.dot(A, B) / (norm(A) * norm(B))

In [16]:
cosine_sim(model['banana'],model['coconut'])

0.70972526
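
Cosine similarity depends only on the direction of the vectors, not on their lengths. The short sketch below illustrates this with two made-up vectors `a` and `b` (hypothetical values, not word embeddings): rescaling either vector leaves the score unchanged, and the score equals the dot product of the unit-length versions.

```python
import numpy as np
from numpy.linalg import norm

def cosine_sim(A, B):
    '''Cosine of the angle between vectors A and B.'''
    return np.dot(A, B) / (norm(A) * norm(B))

# Made-up example vectors (not taken from the model).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.5])

same_direction = cosine_sim(a, 10 * a)       # same direction, different length
unscaled = cosine_sim(a, b)
scaled = cosine_sim(a, 5 * b)                # rescaling b does not change the angle
unit_dot = np.dot(a / norm(a), b / norm(b))  # dot product of unit vectors

print(np.isclose(same_direction, 1.0))  # True
print(np.isclose(scaled, unscaled))     # True
print(np.isclose(unit_dot, unscaled))   # True
```

This is why a similarity search can normalize all vectors once and then rank candidates with plain dot products.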

### Analogy
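We can use `most_similar` to do basic vector arithmetic: the vectors of the words listed in `positive` are added, those in `negative` are subtracted, and the words nearest the resulting vector are returned. The query below computes roughly `king - man + woman`.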

In [8]:
model.most_similar(positive=['woman','king'],negative=['man'])

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380928039551),
 ('throne', 0.6755735874176025),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534753799438),
 ('prince', 0.6517034769058228),
 ('elizabeth', 0.6464517712593079),
 ('mother', 0.6311717629432678),
 ('emperor', 0.6106470823287964),
 ('wife', 0.6098655462265015)]
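
The arithmetic behind this query can be sketched on a toy table. This is a simplified illustration, not gensim's code: the dictionary `toy`, the helpers `unit` and `toy_analogy`, and the 2-D vectors are all hypothetical, with axis 0 loosely standing for "royalty" and axis 1 for "gender". As `most_similar` does, the sketch leaves the query words out of the candidate answers.

```python
import numpy as np

# Hypothetical 2-D embeddings (invented): axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    'man':   np.array([0.1, 1.0]),
    'woman': np.array([0.1, -1.0]),
    'king':  np.array([1.0, 1.0]),
    'queen': np.array([1.0, -1.0]),
    'apple': np.array([-1.0, 0.1]),
}

def unit(v):
    '''Scale v to unit length.'''
    return v / np.linalg.norm(v)

def toy_analogy(positive, negative):
    '''Return the word whose vector is closest (by cosine) to the
    sum of the positive unit vectors minus the negative unit vectors.'''
    target = sum(unit(toy[w]) for w in positive) - sum(unit(toy[w]) for w in negative)
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in toy if w not in positive + negative]
    return max(candidates, key=lambda w: np.dot(unit(toy[w]), unit(target)))

print(toy_analogy(['woman', 'king'], ['man']))  # queen
```

Subtracting `man` and adding `woman` flips the "gender" axis while keeping the "royalty" axis, so the result lands nearest `queen` in this toy space.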

In [9]:
def analogy(x1,x2,y1):
  '''
  x1 is to x2 as y1 is to y2
  Returns the best answer for y2
  '''
  # y2 is approximately x2 - x1 + y1; most_similar returns (word, score) pairs, best first
  res = model.most_similar(positive=[y1,x2],negative=[x1])
  return res[0][0]
  

In [10]:
analogy('man','king','woman')

'queen'

In [18]:
analogy('dallas','usa','tokyo')

'japan'