# Fun With Word Embeddings

GloVe embeddings are available on this webpage: https://nlp.stanford.edu/projects/glove/. There, you can download publicly licensed pre-trained GloVe word embeddings. Four zip files of word embeddings are available for download:

    1. Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)
    2. Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)
    3. Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
    4. Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)

We will be using the **Common Crawl 42B tokens** download in this tutorial. However, you can practice with any of the zip files on the GloVe website.

## Download Data File

To begin working with GloVe word embeddings, you must first download the Common Crawl 42B tokens zip file here: https://nlp.stanford.edu/projects/glove/. The download may take a few minutes because the file is 1.75 GB.

Once the file has been downloaded and unzipped, you will load the text file into your notebook as a dictionary of NumPy arrays.
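Each line of the GloVe text file holds one token followed by its vector components, separated by spaces. As a rough sketch of the loading step, the example below parses a tiny in-memory stand-in for the file. The helper name `load_glove`, its `limit` parameter, and the sample numbers are assumptions made for illustration. Plain Python lists are used so the sketch runs without NumPy or the real download.

```python
import io

def load_glove(lines, limit=None):
    """Parse GloVe-style text lines into a {token: vector} dictionary.

    `limit` is a hypothetical convenience parameter (not part of GloVe):
    stop after this many lines to keep memory use small.
    """
    embeddings = {}
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        values = line.rstrip().split(" ")
        token, numbers = values[0], values[1:]
        embeddings[token] = [float(x) for x in numbers]
    return embeddings

# Tiny in-memory stand-in for the real file (made-up numbers, 3 dimensions).
sample_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    ", 0.0 0.5 -0.5\n"
    "king 1.0 -1.0 2.0\n"
)

sample_embs = load_glove(sample_file, limit=2)
print(sample_embs)  # {'the': [0.1, 0.2, 0.3], ',': [0.0, 0.5, -0.5]}
```

With the real download, you could pass an open file object instead of `sample_file`. A `limit` can be handy while experimenting, since the full 1.9M-word vocabulary needs a lot of memory.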

In [None]:
!pip install pandas
!pip install numpy

In [None]:
import pandas as pd 
import numpy as np 

In [None]:
# read the words from the txt file and save them as a dictionary

word_embs = {}

# change the file name to reflect your txt file's path
with open("glove.42B.300d.txt", "r", encoding="utf-8") as f:
    for line in f:
        # each line holds a token followed by its vector components
        values = line.split()
        words = values[0]
        word_embs[words] = np.asarray(values[1:], dtype="float32")


In [None]:
word_embs

## Similarity Using Euclidean Distance

Euclidean distance is the length of the straight line connecting two points in Euclidean space. For GloVe embeddings, the similarity of two words can be measured by the Euclidean distance between their vectors. Words that are closer together are considered more similar, and words that are farther apart are considered less similar.

The **scipy** package in Python lets you measure the Euclidean distance between two points, each given as a 1D array.
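For two vectors $x$ and $y$, the Euclidean distance is $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. As a minimal sketch of what that formula computes, the example below works it out by hand on two made-up 2D points and compares the result with Python's built-in `math.dist` (available from Python 3.8). The function name `euclidean` and the points are assumptions for illustration.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

p = [0.0, 0.0]
q = [3.0, 4.0]

print(euclidean(p, q))  # 5.0
print(math.dist(p, q))  # 5.0
```

The same calculation applies to 300-dimensional word vectors; only the number of terms in the sum changes.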

In [None]:
!pip install scipy

In [None]:
from scipy.spatial import distance



def get_distances(X, Y):
    dist = distance.euclidean(X, Y)
    return dist

In [None]:
print("The Euclidean distance between man and queen is ", get_distances(word_embs["man"], word_embs["queen"]))

In [None]:
print("The Euclidean distance between woman and king is ",get_distances(word_embs["woman"], word_embs["king"]))

Words with a higher Euclidean distance score are considered more dissimilar. Here, man and queen are relatively far apart, and so are woman and king. If you run the code above on man and king, or on woman and queen, the distance between the two words will be smaller, indicating that man is more similar to king than woman is.

While this is a good first step in your analysis, note that king and queen are closer to each other than either is to man or woman, and that punctuation marks can sit closer to king and queen than the words man and woman do. This is likely because punctuation marks appear alongside nearly every word in the dataset. To get a more reliable distance score, we need to perform some more analysis.

Let's start by looking at the words most similar to queen and king. We will create a **most_similar** function that returns the vocabulary sorted by distance score, closest words first.
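Before running this on the full vocabulary, the ranking idea can be tried on a toy dictionary. The sketch below is only an illustration: the 2D vectors are made up, and the helper name `nearest` is an assumption. It uses `heapq.nsmallest` from the standard library, which picks the `k` closest words without sorting the whole vocabulary.

```python
import heapq
import math

# Made-up 2D vectors standing in for real 300-dimensional embeddings.
toy_embs = {
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "apple": [-4.0, 5.0],
}

def nearest(query, embeddings, k=3):
    """Return the k tokens whose vectors lie closest to `query`."""
    return heapq.nsmallest(
        k, embeddings, key=lambda token: math.dist(embeddings[token], query)
    )

print(nearest(toy_embs["queen"], toy_embs))  # ['queen', 'king', 'woman']
```

Note that the query word itself comes first, because its distance to its own vector is 0.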

In [None]:
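def most_similar(embs):
    # sort every word in the vocabulary by its distance to the query vector
    # (this compares against the whole vocabulary, so it can take a while)
    return sorted(word_embs.keys(), key=lambda word: distance.euclidean(word_embs[word], embs))

In [None]:
# returns the top 5 most similar words by their distance scores
most_similar(word_embs["queen"])[:5]

In [None]:
# returns the top 5 most similar words by their distance scores
most_similar(word_embs["king"])[:5]

The query word itself always comes first, since its distance to itself is 0. Look closely at the rest of each list: very frequent tokens such as punctuation marks and articles can show up among the nearest neighbors even though they tell us little about the meaning of queen or king. This is a common occurrence with word embeddings, because articles and punctuation marks are used far more often than other words.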

In [None]:
# vector arithmetic: king - man + woman should land close to queen
print(most_similar(word_embs['king'] - word_embs['man'] + word_embs['woman'])[:5])
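The cell above performs analogy arithmetic on word vectors: subtracting man from king and adding woman should give a vector near queen. A minimal sketch of the same idea on made-up 2D vectors follows. The helper name `analogy` is an assumption, and skipping the three input words in the ranking is a design choice of this sketch rather than something the cell above does.

```python
import math

# Made-up 2D vectors standing in for real 300-dimensional embeddings.
toy_embs = {
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "apple": [-4.0, 5.0],
}

def analogy(a, b, c, embeddings):
    """Rank tokens by distance to (a - b + c), skipping the three inputs."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [t for t in embeddings if t not in (a, b, c)]
    return sorted(candidates, key=lambda t: math.dist(embeddings[t], target))

print(analogy("king", "man", "woman", toy_embs))  # ['queen', 'apple']
```

Leaving out the input words matters in practice, because the result of the arithmetic often stays close to one of them.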

### Sentence Preprocessing
In Natural Language Processing, a technique called "removing stop words" takes away commonly used English words, such as articles, that do not contribute to our understanding of a word's context.

We will also remove punctuation marks to keep the analysis focused. The **NLTK** package provides a list of English stop words, and we can treat each punctuation character as one more entry in that list. We will install the packages and download the stop-word list before cleaning the dictionary.
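As a rough sketch of this filtering step, the example below drops stop words and punctuation from a toy dictionary. The three-word stop list, the vectors, and the helper name `without_stop_words` are assumptions for illustration; `string.punctuation` comes from the standard library. Unlike removing entries in place, this version builds a new dictionary and leaves the original untouched.

```python
import string

# Hypothetical, tiny stop-word list; NLTK's English list is much longer.
toy_stop_words = {"the", "a", "of"}
to_remove = toy_stop_words | set(string.punctuation)

# Made-up 2D vectors standing in for real embeddings.
toy_embs = {
    "the": [0.1, 0.2],
    ",": [0.0, 0.5],
    "king": [1.0, -1.0],
    "queen": [1.0, -0.8],
}

def without_stop_words(embeddings, stop_tokens):
    """Return a new dictionary that omits every token in `stop_tokens`."""
    return {t: v for t, v in embeddings.items() if t not in stop_tokens}

cleaned = without_stop_words(toy_embs, to_remove)
print(sorted(cleaned))  # ['king', 'queen']
print(len(toy_embs))    # 4
```

Keeping the original dictionary makes it easy to compare results before and after cleaning.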

In [None]:
!pip install nltk
!pip install regex

In [None]:
# import the stop-word list and the punctuation helper
import string
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

In [None]:
stop_words = stopwords.words('english')
# every ASCII punctuation character; extend() below adds them one by one
punct = string.punctuation
stop_words.extend(punct)

stop_words

In [None]:
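def clean_dict(embeddings):
    # removes stop words and punctuation in place, so the dictionary
    # that is passed in is modified and then returned
    for token in stop_words:
        embeddings.pop(token, None)
    return embeddings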

In [None]:
word_embs_clean = clean_dict(word_embs)

In [None]:
word_embs_clean

### Let's Think

Now that we have removed stop words from our word embeddings dictionary, how does the dictionary look?

Do you notice more words or punctuation marks that may be worth removing to improve the cleanliness of the dictionary even further?

Run the **most_similar** and **get_distances** functions on the new dictionary. Are the distance scores the same or different? Why might that be? Are the top 5 most similar words the same or different? Why?