# GloVe Example

This notebook uses [this](https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db) tutorial as a starting point for exploring GloVe.

In [2]:
# Import packages
import numpy as np
from scipy import spatial
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

I am using pretrained vectors provided by the GloVe team, which are available on their [website](https://nlp.stanford.edu/projects/glove/). These text files are nicely formatted: each line contains a single word followed by *N* numbers. The *N* numbers that follow the word represent a point in *N*-dimensional space. Words whose points are close to each other tend to be similar in meaning. Higher-dimensional vectors generally capture more information, but need more computation. I will be using an *N* of 50 for this project to mimic the tutorial that this is from.
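To make the file format concrete, here is a minimal sketch of how a single line breaks down into a word and its vector. The line below is made up for illustration (five numbers instead of 50) and is not taken from the GloVe files.

```python
import numpy as np

# Hypothetical line in the GloVe text format: a word followed by N numbers.
# These values are invented for illustration; the file used here has 50 per line.
sample_line = "shoe 0.12 -0.48 0.93 0.05 -0.27"

parts = sample_line.split()                       # split on whitespace
word = parts[0]                                   # first item is the word
vector = np.asarray(parts[1:], dtype="float32")   # the rest are the vector components

print(word)          # the word itself
print(vector.shape)  # one entry per dimension, here (5,)
```

This is the same split-and-convert step that the loading loop further down applies to every line of the file.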

In [5]:
# First I will create a dictionary to hold the word embeddings.
# Word embedding involves mapping words or phrases to vectors of numbers
embeddings_dict = {}
# Open the file with 50 dimensions
with open('glove/glove.6B.50d.txt', 'r', encoding='utf-8') as f:
    # For every line, which is every word in this file
    for line in f:
        # Split the line up using default separator which is whitespace
        values = line.split()
        # The first item in the line is the word so assign that
        word = values[0]
        # Every other value in the line is one component of the word's vector
        vector = np.asarray(values[1:], dtype="float32")
        # Now map the word to the vectors that we have got
        embeddings_dict[word] = vector

The above code takes all the words in the file and stores them with their respective vectors in the embeddings_dict dictionary. We can now use this dictionary to do some cool things.

Below is a function that takes a word's vector and finds the words most similar to it, using the vector positions I talked about earlier. It goes through the dictionary keys, which are the words we embedded above, and calculates the distance between each word's vector and the input vector. Instead of returning an alphabetically sorted list, it uses that distance as the sorting key, so the most similar words come first. The closest match is the input word itself (at distance 0), so the function skips it and returns the next five.

In [52]:
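def find_closest_embeddings(embedding):
    # Sort all words by Euclidean distance to the given vector, then skip
    # index 0 (the query word itself) and keep the next five
    return sorted(embeddings_dict.keys(), key=lambda word:
                  spatial.distance.euclidean(embeddings_dict[word], embedding))[1:6]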

In [53]:
find_closest_embeddings(embeddings_dict['shoe'])

['shoes', 'handbag', 'underwear', 'sneakers', 'leather']

As an example, I used 'shoe' as the test word and output the five most similar words in our dictionary. As you can see, the first word, which is supposedly the most similar, is 'shoes'. Makes sense! The other four words are all clothing-related, which shows that these pretrained GloVe vectors capture not just words that look alike but also words with related meanings. An interesting thing to note is that 'sneakers', a type of shoe, is fourth in the list, behind 'handbag' and 'underwear'.
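To see how the distance-based sort behaves, and why the query word itself lands at index 0, here is a small self-contained sketch. The dictionary toy_embeddings and the helper rank_by_distance are hypothetical names, and the 2-dimensional vectors are invented for illustration rather than taken from GloVe.

```python
import numpy as np
from scipy import spatial

# Toy 2-dimensional "embeddings", invented for illustration only.
toy_embeddings = {
    "shoe": np.array([1.0, 1.0]),
    "shoes": np.array([1.1, 0.9]),
    "sneakers": np.array([1.5, 1.2]),
    "banana": np.array([-3.0, 4.0]),
}

def rank_by_distance(embedding, embeddings):
    # Same sorting idea as find_closest_embeddings, but without the slice,
    # so the full ranking is visible.
    return sorted(embeddings.keys(),
                  key=lambda w: spatial.distance.euclidean(embeddings[w], embedding))

ranking = rank_by_distance(toy_embeddings["shoe"], toy_embeddings)
print(ranking)      # the query word comes first, at distance 0
print(ranking[1:])  # dropping index 0 leaves only the neighbours
```

Note that sorting the whole vocabulary for each query is simple but not fast; with the full GloVe dictionary, every call computes a distance for every word.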

What we can do next is adapt find_closest_embeddings to show us the Euclidean distances between the five closest words and the word we want. We can use these distances to see how 'far' each of the words is from 'shoe'.

In [75]:
def find_distance_of_embeddings(embedding):
    words = find_closest_embeddings(embedding)
    distances = [spatial.distance.euclidean(embeddings_dict[word], embedding) for word in words]
    return dict(zip(words, distances))

In [76]:
find_distance_of_embeddings(embeddings_dict['shoe'])

{'shoes': 2.918541431427002,
 'handbag': 3.0618388652801514,
 'underwear': 3.096071481704712,
 'sneakers': 3.323274850845337,
 'leather': 3.34987211227417}

We can see that 'shoes' is the closest, which makes sense. We can also see that 'handbag' and 'underwear' are at very similar distances from 'shoe', while 'sneakers' is a bit further away.
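For reference, the Euclidean distance used throughout is the straight-line distance between two points: the square root of the sum of the squared differences between their components. The sketch below checks this on two made-up vectors, computing the distance once with scipy and once by hand with NumPy.

```python
import numpy as np
from scipy import spatial

# Two made-up 2-dimensional vectors (the sides of a 3-4-5 right triangle).
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

d_scipy = spatial.distance.euclidean(a, b)   # library call used in this notebook
d_manual = np.sqrt(np.sum((a - b) ** 2))     # the same formula written out

print(d_scipy, d_manual)  # both are 5.0
```

The distances in the output above are the same calculation, just carried out over 50 components instead of 2.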