# GloVe Example

This notebook follows [this](https://medium.com/analytics-vidhya/basics-of-using-pre-trained-glove-vectors-in-python-d38905f356db) tutorial as an introduction to the world of GloVe.

In [78]:
# Import packages
import numpy as np
import pandas
from scipy import spatial
from sklearn.manifold import TSNE
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)

I am using pretrained vectors provided by the GloVe team, available on their [website](https://nlp.stanford.edu/projects/glove/). The text files are nicely formatted: each line contains a single word followed by *N* numbers. These *N* numbers represent a point in *N*-dimensional space, and words whose points lie close together tend to have similar meanings or to be used in similar contexts. Higher-dimensional vectors generally capture more detail but need more memory and computation. I will be using an *N* of 50 for this project to match the tutorial.
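
To make the file format concrete, here is a minimal sketch that parses a single line. The line itself is made up and shortened to 4 numbers (real lines in the 50-dimensional file have 50), and the names `sample_line`, `sample_word` and `sample_vector` are only for illustration.

```python
import numpy as np

# A made-up line in the GloVe text format, shortened to 4 dimensions.
# Real lines in glove.6B.50d.txt have 50 numbers after the word.
sample_line = "shoe 0.12 -0.48 0.33 0.91"

parts = sample_line.split()                              # split on whitespace
sample_word = parts[0]                                   # first token is the word
sample_vector = np.asarray(parts[1:], dtype="float32")   # the rest are coordinates

print(sample_word, sample_vector.shape)
```

The loading cell below does the same thing for every line in the file.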

In [2]:
# First I will create an embeddings dictionary for word embedding. 
# Word embedding involves mapping words or phrases to vectors or numbers
embeddings_dict = {}
# Open the file with 50 dimensions
with open('glove/glove.6B.50d.txt', 'r', encoding='utf-8') as f:
    # For every line, which is every word in this file
    for line in f:
        # Split the line up using default separator which is whitespace
        values = line.split()
        # The first item in the line is the word so assign that
        word = values[0]
        # Every other value in the line is a vector position
        vector = np.asarray(values[1:], "float32")
        # Now map the word to the vectors that we have got
        embeddings_dict[word] = vector

The above code takes all the words in the file and stores them with their respective vectors in the embeddings_dict dictionary. We can now use this dictionary to do some cool things.

Below is a function that takes a word's vector and finds the words most similar to it, using the vector positions described earlier. It goes through every key in the dictionary (the words we loaded above) and computes the distance between that word's vector and the input vector. Instead of sorting the words alphabetically, it uses that distance as the sort key, so the most similar words come first. The slice [1:6] skips the first result, which is the input word itself at distance zero, and keeps the next five.
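
The same idea can be tried on a tiny hand-made dictionary first. This is only a sketch: the 2-dimensional vectors and the names `toy_embeddings` and `toy_closest` are invented for illustration and have nothing to do with the real GloVe values.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings", invented purely for illustration.
toy_embeddings = {
    "shoe": np.array([1.0, 1.0]),
    "shoes": np.array([1.1, 1.0]),
    "sneakers": np.array([1.5, 1.2]),
    "banana": np.array([-3.0, 4.0]),
}

def toy_closest(query_word, top_n=2):
    """Return the top_n words nearest to query_word, excluding the word itself."""
    query = toy_embeddings[query_word]
    candidates = [w for w in toy_embeddings if w != query_word]
    # Sort by Euclidean distance to the query vector rather than alphabetically.
    return sorted(
        candidates,
        key=lambda w: np.linalg.norm(toy_embeddings[w] - query),
    )[:top_n]

print(toy_closest("shoe"))
```

Here the query word is excluded by name rather than by slicing off the first result, which is one possible alternative to the approach used in this notebook.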

In [3]:
def find_closest_embeddings(embedding):
    return sorted(embeddings_dict.keys(), key=lambda word: 
                  spatial.distance.euclidean(embeddings_dict[word], embedding))[1:6]

In [4]:
find_closest_embeddings(embeddings_dict['shoe'])

['shoes', 'handbag', 'underwear', 'sneakers', 'leather']

As an example, I used 'shoe' as the test word and output the 5 most similar words in our dictionary. As you can see, the first word, and so the most similar one, is 'shoes'. Makes sense! The other 4 are clothing items or closely related ('leather' is a material rather than a garment), which shows that these pretrained GloVe vectors capture similarity of meaning and not just similarity of spelling. An interesting thing to note is that 'sneakers', a type of shoe, is only 4th in the list, behind 'handbag' and 'underwear'.

We can adapt the above function to show the Euclidean distance between each of the top 5 words and the word we want. These distances tell us how 'far' each word is from 'shoe'.
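
As a quick reminder of what the Euclidean distance actually computes, here is a small worked sketch on two made-up 2-dimensional points (`u` and `v` are illustrative names). It takes the square root of the sum of squared coordinate differences, which is the same quantity `np.linalg.norm` returns for the difference vector.

```python
import numpy as np

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

# Written out by hand: sqrt((0 - 3)^2 + (0 - 4)^2)
manual = np.sqrt(np.sum((u - v) ** 2))

# The same value using NumPy's vector norm
shortcut = np.linalg.norm(u - v)

print(manual, shortcut)
```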

In [5]:
def find_distance_of_embeddings(embedding):
    words = find_closest_embeddings(embedding)
    distances = list(map(lambda word: spatial.distance.euclidean(embeddings_dict[word], embedding), words))
    return dict(zip(words, distances))

In [6]:
find_distance_of_embeddings(embeddings_dict['shoe'])

{'shoes': 2.918541431427002,
 'handbag': 3.0618388652801514,
 'underwear': 3.096071481704712,
 'sneakers': 3.323274850845337,
 'leather': 3.34987211227417}

These are the distances between 'shoe' and its five closest words. An important thing to note is that GloVe is trained on a co-occurrence matrix, which counts how often pairs of words appear near each other across a corpus. Words such as shoe and shoes end up close together because they tend to appear in similar contexts.
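
To give a feel for what a co-occurrence count is, here is a rough sketch on a made-up sentence. It only covers the counting step with a fixed window of 2 words; it is not how the GloVe team built their matrix and it does no training. The names `toy_corpus`, `window` and `cooccurrence` are assumptions for illustration.

```python
from collections import Counter

# A made-up corpus, already split into words.
toy_corpus = "the shoe shop sells shoes and the shoe shop repairs shoes".split()
window = 2  # how many words to the right count as "nearby"

cooccurrence = Counter()
for i, word in enumerate(toy_corpus):
    # Pair `word` with each neighbour up to `window` positions to its right.
    for j in range(i + 1, min(i + 1 + window, len(toy_corpus))):
        # Sort the pair so (a, b) and (b, a) are counted together.
        pair = tuple(sorted((word, toy_corpus[j])))
        cooccurrence[pair] += 1

print(cooccurrence[("shoe", "shop")])
```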

We can see this using a number of different words below.

In [14]:
find_distance_of_embeddings(embeddings_dict['ice'])

{'hot': 3.627650260925293,
 'plate': 3.771793842315674,
 'dust': 3.7886362075805664,
 'snow': 3.79959774017334,
 'melting': 3.9464774131774902}

In [10]:
find_distance_of_embeddings(embeddings_dict['horse'])

{'horses': 2.7341508865356445,
 'dog': 3.245368480682373,
 'bull': 3.3311214447021484,
 'riding': 3.397752523422241,
 'cat': 3.7490692138671875}

In [12]:
find_distance_of_embeddings(embeddings_dict['tesco'])

{'asda': 2.3777077198028564,
 'sainsbury': 2.838521957397461,
 'woolworths': 3.094175338745117,
 'morrisons': 3.1852293014526367,
 'waitrose': 3.4131877422332764}

In [15]:
find_distance_of_embeddings(embeddings_dict['virus'])

{'viruses': 3.052818775177002,
 'flu': 3.061753034591675,
 'h5n1': 3.106165647506714,
 'influenza': 3.1839656829833984,
 'infected': 3.265507221221924}

Another important property of these *Global Vectors* is that we can add and subtract them like terms in an equation. For example, I can compute puppy - dog + cat and see which words lie closest to the resulting vector.

In [48]:
find_distance_of_embeddings(embeddings_dict['puppy'] - embeddings_dict['dog'] + embeddings_dict['cat'])

{'kitten': 3.0314016342163086,
 'puppies': 3.083557367324829,
 'cat': 3.1496877670288086,
 'pug': 3.2214677333831787,
 'frisky': 3.229954957962036}

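As you can see, the closest result shown is 'kitten', which is what the equation suggests: a puppy is to a dog as a kitten is to a cat. Note that find_closest_embeddings always drops the first match (to skip the query word itself), so for a combined vector like this one the single nearest word is not displayed.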
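
The arithmetic is easier to follow on hand-made vectors. In this sketch the 2-dimensional values are invented so that one axis loosely separates dogs from cats and the other separates young from adult. Real GloVe dimensions are not this tidy, and the names `toy`, `target`, `inputs` and `best` are illustrative only.

```python
import numpy as np

# Hypothetical vectors: axis 0 ~ dog vs cat, axis 1 ~ adult vs young.
toy = {
    "dog": np.array([1.0, 0.0]),
    "puppy": np.array([1.0, 1.0]),
    "cat": np.array([-1.0, 0.0]),
    "kitten": np.array([-1.0, 1.0]),
    "car": np.array([5.0, -3.0]),
}

# puppy - dog removes the "dog" part and keeps "young"; adding cat gives "young cat".
target = toy["puppy"] - toy["dog"] + toy["cat"]

# Leave out the words used in the equation, then pick the nearest remaining word.
inputs = {"puppy", "dog", "cat"}
best = min(
    (w for w in toy if w not in inputs),
    key=lambda w: np.linalg.norm(toy[w] - target),
)
print(best)
```

Excluding the input words by name, instead of always dropping the first result, is one way to avoid losing the nearest match for a combined vector.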

We can also visualise our word embeddings using t-SNE (t-distributed stochastic neighbour embedding). We imported TSNE from scikit-learn at the start, so we can use it now. It reduces the 50 dimensions we have down to 2 so that we can plot the words on a graph.

In [56]:
tsne = TSNE(n_components=2, random_state=0)

Now we need to convert our words and vectors into lists so that t-SNE can map them onto 2 dimensions.

In [57]:
words = list(embeddings_dict.keys())
vectors = [embeddings_dict[word] for word in words]

Now we fit t-SNE to our lists. I will use just the first 100 words to keep the plot readable. The larger the slice of data you take, the longer it will take to process.

In [81]:
Y = tsne.fit_transform(vectors[:100])
labels = words[:100]
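
To see what goes in and what comes out of this step without loading the GloVe file, here is a small sketch on random data. The 20 random vectors and the names `toy_vectors`, `X`, `toy_tsne` and `toy_Y` are assumptions for illustration. The perplexity is lowered to 5 because, as far as I know, recent scikit-learn versions require it to be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# 20 random 50-dimensional vectors standing in for word embeddings.
rng = np.random.default_rng(0)
toy_vectors = [rng.normal(size=50).astype("float32") for _ in range(20)]

# Stack the list of 1-D arrays into one 2-D array of shape (n_samples, n_features).
X = np.stack(toy_vectors)

# Perplexity is kept below the number of samples (20 here).
toy_tsne = TSNE(n_components=2, perplexity=5, random_state=0)
toy_Y = toy_tsne.fit_transform(X)

print(X.shape, toy_Y.shape)
```

Each row of the result is the 2-dimensional position of the corresponding input vector, which is why the plotting cell below can use the two columns as x and y.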

In [82]:
plot_df = pandas.DataFrame({'labels': labels, 'x': Y[:, 0], 'y': Y[:, 1]})

In [84]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=plot_df.x, y=plot_df.y, mode='markers', text=plot_df.labels))
fig.show()