# Metadata

```
Course:   DS 5001 
Module:   09 Lab
Topic:    Using GloVe
Author:   R.C. Alvarado

Purpose:  We use some pretrained word vectors from the developers of GloVe.

```

# GloVe

Provided by the <a href="https://nlp.stanford.edu/">Stanford NLP Group</a>.

GloVe is an **unsupervised learning algorithm for obtaining vector representations for words**. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

GloVe provides both **code to train models** and a set of **pre-trained models**.

More info here:
* <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>
* <a href="https://github.com/stanfordnlp/GloVe">GitHub</a>

**Glove50**
* Released in 2014
* Based on Wikipedia 
* 50 features
* Trained with Global Vectors (GloVe) method
* Encodes 1,193,515 word vectors
* All tokens outside the vocabulary encoded as the zero-vector
* Case is ignored
* Used by spaCy
* Intended to be used with cosine similarity (or Euclidean distance)
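The last point can be illustrated with a minimal sketch on made-up toy vectors (not real GloVe values). Cosine similarity ignores vector length, while Euclidean distance does not; for unit-length vectors the two are tied together by the identity squared distance = 2 × (1 − cosine similarity). The helper names here are illustrative only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between two 1-D vectors."""
    return float(np.linalg.norm(u - v))

# Toy vectors chosen for easy arithmetic; they are not GloVe values.
u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 4.0, 4.0])    # same direction as u, twice the length
w = np.array([2.0, -1.0, 0.0])   # orthogonal to u

cos_uv = cosine_similarity(u, v)     # direction is identical, so length is ignored
cos_uw = cosine_similarity(u, w)     # orthogonal vectors
dist_uv = euclidean_distance(u, v)   # length difference still counts here

# After scaling to unit length, squared Euclidean distance
# equals 2 * (1 - cosine similarity).
u_hat = u / np.linalg.norm(u)
w_hat = w / np.linalg.norm(w)
sq_dist_unit = euclidean_distance(u_hat, w_hat) ** 2

print(cos_uv, cos_uw, dist_uv, sq_dist_unit)
```

In this sketch `u` and `v` have cosine similarity 1 even though they are 3 units apart, which is one reason cosine measures are often preferred when vector length is not meaningful.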

# Set Up

In [1]:
data_home = "../data"
glove_file = f"{data_home}/misc/glove50.csv"

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import manhattan_distances, cosine_distances

# Import GloVe data

In [3]:
glove = pd.read_csv(glove_file).set_index('term_str')
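The CSV loaded here is assumed to be a conversion of the plain-text files distributed by the GloVe project, in which each line holds a token followed by its space-separated vector components. The notebook does not show that conversion, so the following is only a sketch of how it might look: `parse_glove_text` is a hypothetical helper, and the sample lines contain made-up values with three features instead of fifty.

```python
import io
import pandas as pd

def parse_glove_text(handle):
    """Parse lines of 'token v1 v2 ... vn' into a DataFrame indexed by term_str."""
    rows = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        rows[parts[0]] = [float(x) for x in parts[1:]]
    df = pd.DataFrame.from_dict(rows, orient="index")
    df.columns = range(1, df.shape[1] + 1)   # features numbered from 1, as in the table
    df.index.name = "term_str"
    return df

# Made-up sample in the assumed format (3 features, not 50).
sample = io.StringIO("the 0.1 0.2 0.3\nnull 0.4 0.5 0.6\n")
toy_glove = parse_glove_text(sample)
print(toy_glove)
```

Parsing line by line keeps tokens such as `null` as ordinary strings, whereas a generic CSV reader may convert them to missing values unless told otherwise.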

In [4]:
glove

term_str,1,2,3,4,5,6,7,8,9,10,...,41,42,43,44,45,46,47,48,49,50
the,0.418000,0.249680,-0.41242,0.121700,0.34527,-0.044457,-0.496880,-0.178620,-0.000660,-0.656600,...,-0.298710,-0.157490,-0.347580,-0.045637,-0.442510,0.187850,0.002785,-0.184110,-0.115140,-0.785810
",",0.013441,0.236820,-0.16899,0.409510,0.63812,0.477090,-0.428520,-0.556410,-0.364000,-0.239380,...,-0.080262,0.630030,0.321110,-0.467650,0.227860,0.360340,-0.378180,-0.566570,0.044691,0.303920
.,0.151640,0.301770,-0.16763,0.176840,0.31719,0.339730,-0.434780,-0.310860,-0.449990,-0.294860,...,-0.000064,0.068987,0.087939,-0.102850,-0.139310,0.223140,-0.080803,-0.356520,0.016413,0.102160
of,0.708530,0.570880,-0.47160,0.180480,0.54449,0.726030,0.181570,-0.523930,0.103810,-0.175660,...,-0.347270,0.284830,0.075693,-0.062178,-0.389880,0.229020,-0.216170,-0.225620,-0.093918,-0.803750
to,0.680470,-0.039263,0.30186,-0.177920,0.42962,0.032246,-0.413760,0.132280,-0.298470,-0.085253,...,-0.094375,0.018324,0.210480,-0.030880,-0.197220,0.082279,-0.094340,-0.073297,-0.064699,-0.260440
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
chanty,0.232040,0.025672,-0.70699,-0.045465,0.13989,-0.628070,0.726250,0.341080,0.446140,0.163290,...,-0.095526,-0.296050,0.385670,0.136840,0.593310,-0.694860,0.124100,-0.180690,-0.258300,-0.039673
kronik,-0.609210,-0.672180,0.23521,-0.111950,-0.46094,-0.007462,0.255780,0.856320,0.055977,-0.237920,...,0.672050,-0.598220,-0.202590,0.392430,0.028873,0.030003,-0.106170,-0.114110,-0.249010,-0.120260
rolonda,-0.511810,0.058706,1.09130,-0.551630,-0.10249,-0.126500,0.995030,0.079711,-0.162460,0.564880,...,0.024747,0.200920,-1.085100,-0.136260,0.350520,-0.858910,0.067858,-0.250030,-1.125000,1.586300
zsombor,-0.758980,-0.474260,0.47370,0.772500,-0.78064,0.232330,0.046114,0.840140,0.243710,0.022978,...,0.454390,-0.842540,0.106500,-0.059397,0.090449,0.305810,-0.614240,0.789540,-0.014116,0.644800


# Remove non-words

The vocabulary contains many tokens that are not words, such as punctuation marks. These may be useful for training the vectors, but we don't need them in our queries, so we keep only terms made up entirely of lowercase letters a-z.

In [5]:
glove = glove[glove.index.str.match(r'^[a-z]+$')]
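To see what this filter keeps, here is a small sketch that applies the same pattern to a made-up index. The toy terms and the single placeholder column are assumptions for illustration only.

```python
import pandas as pd

# Made-up terms standing in for the GloVe vocabulary.
toy_index = pd.Index(["the", ",", ".", "u.s.", "2014", "café", "queen"], name="term_str")
toy = pd.DataFrame({"1": list(range(len(toy_index)))}, index=toy_index)

# Same pattern as above: one or more lowercase ASCII letters and nothing else.
mask = toy.index.str.match(r'^[a-z]+$')
kept = toy[mask].index.tolist()
print(kept)
```

Only `the` and `queen` survive. Punctuation, digits, and terms containing periods are dropped, and so is `café`, because `é` falls outside the range `a-z`.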

In [6]:
glove

term_str,1,2,3,4,5,6,7,8,9,10,...,41,42,43,44,45,46,47,48,49,50
the,0.418000,0.249680,-0.41242,0.121700,0.345270,-0.044457,-0.496880,-0.178620,-0.000660,-0.656600,...,-0.298710,-0.157490,-0.347580,-0.045637,-0.442510,0.187850,0.002785,-0.184110,-0.115140,-0.785810
of,0.708530,0.570880,-0.47160,0.180480,0.544490,0.726030,0.181570,-0.523930,0.103810,-0.175660,...,-0.347270,0.284830,0.075693,-0.062178,-0.389880,0.229020,-0.216170,-0.225620,-0.093918,-0.803750
to,0.680470,-0.039263,0.30186,-0.177920,0.429620,0.032246,-0.413760,0.132280,-0.298470,-0.085253,...,-0.094375,0.018324,0.210480,-0.030880,-0.197220,0.082279,-0.094340,-0.073297,-0.064699,-0.260440
and,0.268180,0.143460,-0.27877,0.016257,0.113840,0.699230,-0.513320,-0.473680,-0.330750,-0.138340,...,-0.069043,0.368850,0.251680,-0.245170,0.253810,0.136700,-0.311780,-0.632100,-0.250280,-0.380970
in,0.330420,0.249950,-0.60874,0.109230,0.036372,0.151000,-0.550830,-0.074239,-0.092307,-0.328210,...,-0.486090,-0.008027,0.031184,-0.365760,-0.426990,0.421640,-0.116660,-0.507030,-0.027273,-0.532850
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
chanty,0.232040,0.025672,-0.70699,-0.045465,0.139890,-0.628070,0.726250,0.341080,0.446140,0.163290,...,-0.095526,-0.296050,0.385670,0.136840,0.593310,-0.694860,0.124100,-0.180690,-0.258300,-0.039673
kronik,-0.609210,-0.672180,0.23521,-0.111950,-0.460940,-0.007462,0.255780,0.856320,0.055977,-0.237920,...,0.672050,-0.598220,-0.202590,0.392430,0.028873,0.030003,-0.106170,-0.114110,-0.249010,-0.120260
rolonda,-0.511810,0.058706,1.09130,-0.551630,-0.102490,-0.126500,0.995030,0.079711,-0.162460,0.564880,...,0.024747,0.200920,-1.085100,-0.136260,0.350520,-0.858910,0.067858,-0.250030,-1.125000,1.586300
zsombor,-0.758980,-0.474260,0.47370,0.772500,-0.780640,0.232330,0.046114,0.840140,0.243710,0.022978,...,0.454390,-0.842540,0.106500,-0.059397,0.090449,0.305810,-0.614240,0.789540,-0.014116,0.644800


Still a lot of words!

## Define some semantic functions

In [8]:
def get_word_vector(term_str):
    """Get a term's vector from the glove matrix as a (1, n_features) array for the cosine function"""
    # cosine_distances expects 2-D input, so the 1-D row is turned into a single-row matrix
    wv = glove.loc[term_str].values.reshape(-1, 1).T
    return wv

def get_dists(term_str, n=10):
    """Get the n words closest to a given word, ranked by cosine distance (smallest first)"""
    wv = get_word_vector(term_str)
    
    # One distance per vocabulary row; the query word itself scores 0
    dists = cosine_distances(glove.values, wv)
    
    return pd.DataFrame(dists, index=glove.index, columns=['score'])\
        .sort_values('score').head(n)

def get_nearest_vector(wv):
    """Get the second-closest word to a given word vector, by cosine distance"""
    dists = cosine_distances(glove.values, wv)
    # iloc[0] is the closest row; iloc[1] skips it and returns the next one
    return pd.DataFrame(dists, index=glove.index, columns=['score'])\
        .sort_values('score').iloc[1]

def get_analogy(a, b, c):
    """Infer missing analogical term d in a : b :: c : d"""
    try:
        A = get_word_vector(a)
        B = get_word_vector(b)
        C = get_word_vector(c)
        # Apply the offset from a to b, starting at c
        D = np.add(np.subtract(B, A), C)
        X = get_nearest_vector(D)
        return X.name
    except (KeyError, ValueError) as e:
        # glove.loc raises KeyError for terms missing from the vocabulary
        print(e)
        return None
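The `reshape(-1, 1).T` step in `get_word_vector` is there because scikit-learn's pairwise functions expect 2-D arrays of shape `(n_samples, n_features)`. A short sketch on a made-up five-element vector shows the shapes involved, and that the result matches the more direct `reshape(1, -1)`.

```python
import numpy as np

row = np.arange(5, dtype=float)   # stands in for one 1-D row of vector values

as_column = row.reshape(-1, 1)    # shape (5, 1): one column
as_row = as_column.T              # shape (1, 5): one sample with five features
direct = row.reshape(1, -1)       # the same single-row shape in one step

print(row.shape, as_column.shape, as_row.shape, direct.shape)
```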

## Test similarity function

In [9]:
get_dists('queen')

term_str,score
queen,0.0
princess,0.148483
lady,0.194939
elizabeth,0.212696
king,0.216096
prince,0.217814
coronation,0.230722
consort,0.23739
royal,0.255714
crown,0.261735


In [10]:
get_dists('king')

term_str,score
king,0.0
prince,0.176382
queen,0.216096
ii,0.225377
emperor,0.226375
son,0.233281
uncle,0.237285
kingdom,0.245784
throne,0.246009
brother,0.250759


## Test analogy functions

In [11]:
get_analogy('life','death','male')

'female'

In [12]:
get_analogy('dog','male','cat')

'female'

In [12]:
get_analogy('male','doctor','female')

'nurse'

In [13]:
get_analogy('queen','female','king')

'male'

In [14]:
get_analogy('female','princess','male')

'duchess'

In [15]:
get_analogy('right','left','life')

'survived'

In [16]:
get_analogy('left','right','death')

'punishment'

In [17]:
get_analogy('right','left','male')

'male'

In [18]:
get_analogy('male','female','right')

'put'

In [20]:
get_analogy('left','right','black')

'white'

In [21]:
get_analogy('left','right','white')

'black'

In [22]:
get_analogy('moon','sun','male')

'female'

In [23]:
get_analogy('moon','sun','female')

'female'

In [24]:
get_analogy('sun','moon','male')

'male'

In [25]:
get_analogy('day','sun','night')

'sky'
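Some of the results above return one of the query terms themselves, for example `'male'` for `get_analogy('right','left','male')`. A common variation, which is not what this notebook does, is to drop the three query terms from the candidates before picking the nearest word. The sketch below shows that variation on made-up three-dimensional vectors; `toy`, `cosine_distance_to_all`, and `toy_analogy` are hypothetical names, and the cosine distance is computed directly with NumPy rather than with scikit-learn.

```python
import numpy as np
import pandas as pd

# Made-up vectors, chosen so the query term "king" is the nearest row to the analogy vector.
toy = pd.DataFrame(
    [[3.0, 0.0, 1.0],    # man
     [3.0, 0.0, 0.5],    # woman
     [3.0, 2.0, 1.0],    # king
     [3.0, 2.0, -1.0],   # queen
     [-1.0, 0.0, 0.0]],  # apple
    index=pd.Index(["man", "woman", "king", "queen", "apple"], name="term_str"),
    columns=["c1", "c2", "c3"],
)

def cosine_distance_to_all(matrix, vec):
    """Cosine distance from vec to every row of matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
    return 1.0 - (matrix @ vec) / norms

def toy_analogy(a, b, c, vectors=toy, exclude_query=True):
    """Return the term nearest to b - a + c, optionally ignoring a, b, and c."""
    d = vectors.loc[b].values - vectors.loc[a].values + vectors.loc[c].values
    scores = pd.Series(cosine_distance_to_all(vectors.values, d), index=vectors.index)
    if exclude_query:
        scores = scores.drop([a, b, c])
    return scores.idxmin()

with_query = toy_analogy("man", "woman", "king", exclude_query=False)
without_query = toy_analogy("man", "woman", "king")
print(with_query, without_query)
```

With the query terms left in, the nearest row is `king`, because the offset from `man` to `woman` is small compared with the vector for `king`. With them removed, the answer is `queen`. Taking the second-ranked row, as `get_nearest_vector` does, gives the same answer here, but it skips the top row whether or not that row is a query term.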