In [10]:
# IMPORT STATEMENTS
import numpy as np
import sys
from nltk.corpus import stopwords
import string
from scipy.spatial.distance import cdist

**Glo**bal **Ve**ctors, or GloVe for short, is an unsupervised ML algorithm for converting words into vectors that encode semantic relationships. [Here](https://nlp.stanford.edu/pubs/glove.pdf) is the paper from the Stanford NLP team that introduced GloVe to the NLP world. The Stanford NLP team has pre-trained GloVe models on large text corpora and [shared them](https://nlp.stanford.edu/projects/glove/) for public use. We can use a pre-trained GloVe model for NLP tasks such as article recommendation, or just play around with it! This notebook should help you get familiar with the GloVe file and how to load it in Python for further use.

The GloVe file is a plain text file in which each row contains a word followed by the numbers that make up that word's vector in a high-dimensional space. Because each word is represented by a vector, we can perform vector operations such as computing the Euclidean distance. In fact, this will be our approach for finding the words closest to a given word. These closest words need not be synonyms, as we shall see: the GloVe model embeds some interesting relationships that are surprisingly powerful.
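
To make the row layout concrete, here is a minimal sketch that parses two rows written in the same "word followed by numbers" format and measures the Euclidean distance between them. The sample rows, their 4-dimensional values, and the helper name `parse_glove_line` are made up for illustration; real rows in 'glove.6B.300d.txt' carry 300 numbers each.

```python
import numpy as np

# Hypothetical sample rows in the GloVe text layout: a word, then its vector
# components separated by spaces. The numbers are invented for illustration.
sample_lines = [
    "king 0.5 0.1 -0.3 0.8",
    "queen 0.4 0.2 -0.3 0.9",
]


def parse_glove_line(line):
    """Split one row into a (word, vector) pair."""
    parts = line.split()
    return parts[0], np.array(parts[1:], dtype=np.float32)


toy_vectors = dict(parse_glove_line(line) for line in sample_lines)

# Euclidean distance is the norm of the difference between the two vectors.
toy_distance = np.linalg.norm(toy_vectors["king"] - toy_vectors["queen"])
print(toy_distance)
```

The smaller the printed distance, the closer the two words sit in the vector space.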

We load the GloVe file into a dictionary with each word as a key and its vector stored as a NumPy array. We also remove stop words and punctuation, and we read the file line by line so that the operation works even on machines with limited memory. Loading the 'glove.6B.300d.txt' file took about 22 s on my 16-inch MacBook Pro, and I suppose more capable machines can load it faster.

In [3]:
# Stop words and punctuation to skip; a set gives fast membership tests
stop_words = set(stopwords.words('english'))
stop_words.update(string.punctuation)
glove = {}
with open('glove.6B.300d.txt', encoding='utf-8', mode='r') as f:
    # Iterate over the file object to read one line at a time
    # (f.readlines() would load the whole file into memory first)
    for line in f:
        line = line.split()
        if line[0] not in stop_words:
            glove[line[0]] = np.array(line[1:], dtype=np.float32)

Next we create a function that takes a word and the GloVe dictionary and returns the n words in the dictionary that are closest to it.

In [26]:
def distance(v1,v2):
    ''' 
    Takes in two vectors and returns the Euclidean distance between them
    '''
    return np.linalg.norm(v1-v2)


def top_n_closest_words(word,glove,n):
    ''' 
    Given a word and a glove dictionary object, returns the n closest words
    '''

    # Collect a (key, distance to word) pair for every key in the glove dictionary
    result = []

    for key in glove.keys():
        dist = distance(glove[key],glove[word])
        result.append((key,dist))

    # Sort by distance, closest first
    result = sorted(result,key=lambda x: x[1])

    # The first closest word is the word itself, so we start our index from 1
    return result[1:n+1]
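
The loop above calls `np.linalg.norm` once per vocabulary word. As a rough sketch of an alternative, the `cdist` function imported at the top of the notebook can compute every distance in a single call. The function name `top_n_closest_words_vectorized` and the toy dictionary are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist


def top_n_closest_words_vectorized(word, glove, n):
    """Return the n closest (word, distance) pairs using one cdist call."""
    words = list(glove.keys())
    # Stack every vector into one (vocabulary_size, dimensions) matrix.
    matrix = np.stack([glove[w] for w in words])
    query = glove[word].reshape(1, -1)
    # cdist returns a (1, vocabulary_size) array of Euclidean distances.
    dists = cdist(query, matrix, metric="euclidean")[0]
    order = np.argsort(dists)
    # Skip the query word by name rather than assuming it sorts first.
    closest = [(words[i], float(dists[i])) for i in order if words[i] != word]
    return closest[:n]


# Toy 2-dimensional vectors, invented for illustration only.
toy_glove = {
    "a": np.array([0.0, 0.0], dtype=np.float32),
    "b": np.array([1.0, 0.0], dtype=np.float32),
    "c": np.array([0.0, 3.0], dtype=np.float32),
}
result = top_n_closest_words_vectorized("a", toy_glove, 2)
print(result)
```

This sketch rebuilds the matrix on every call, so for repeated queries against the full vocabulary it would make sense to build the word list and matrix once and reuse them.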


To illustrate the power of GloVe vectors, let's look at the 10 closest words to 'ally'. Notice the second entry, 'staunch'. It ranks high because the two words often appear together (as in '...is a staunch ally...') and therefore end up close to each other. We see synonyms as well as antonyms.

In [38]:
top_n_closest_words('ally',glove,10)

[('allies', 4.8981414),
 ('staunch', 5.6140485),
 ('stalwart', 5.660538),
 ('supporter', 5.893143),
 ('backer', 5.9614644),
 ('longtime', 6.0186133),
 ('adversary', 6.028124),
 ('adversaries', 6.061235),
 ('foe', 6.071723),
 ('considers', 6.216141)]

We can even use GloVe to infer relationships with simple arithmetic: the vector 'uncle' - 'man' + 'woman' lands closest to 'aunt'!

In [39]:
glove['test_string'] = glove['uncle'] - glove['man'] + glove['woman']

In [40]:
test = 'test_string'

In [41]:
top_n_closest_words(test,glove,5)

[('aunt', 4.668237),
 ('niece', 4.720575),
 ('uncle', 4.7539396),
 ('grandmother', 4.8348556),
 ('mother', 5.1565957)]
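
The cells above store the computed vector in the dictionary under the placeholder key 'test_string' so that the word-based lookup can be reused. As a hedged sketch of another way to do it, the search can take a vector directly, which leaves the dictionary unchanged and lets us leave the input words out of the results. The helper name `closest_to_vector`, its `exclude` parameter, and the toy vectors are all assumptions introduced for illustration.

```python
import numpy as np


def closest_to_vector(vector, glove, n, exclude=()):
    """Return the n (word, distance) pairs nearest to an arbitrary vector."""
    result = [
        (key, float(np.linalg.norm(vec - vector)))
        for key, vec in glove.items()
        if key not in exclude
    ]
    # Sort by distance, closest first.
    return sorted(result, key=lambda pair: pair[1])[:n]


# Toy 2-dimensional vectors, invented so the arithmetic is easy to follow.
toy_glove = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "uncle": np.array([3.0, 0.0]),
    "aunt": np.array([3.0, 1.0]),
    "table": np.array([-5.0, 4.0]),
}
target = toy_glove["uncle"] - toy_glove["man"] + toy_glove["woman"]
nearest = closest_to_vector(
    target, toy_glove, 2, exclude=("uncle", "man", "woman")
)
print(nearest)
```

With the real GloVe dictionary, the same call pattern would be `closest_to_vector(glove['uncle'] - glove['man'] + glove['woman'], glove, 5)`, with `exclude` left empty to compare against the output above.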