# Exploring Word Embeddings

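In this exercise, we'll train a set of word embeddings with `gensim` on the Brown corpus (or, alternatively, load pre-trained ones from disk) and use them to explore similarity.

Let's start by importing `gensim` and `nltk`, filtering stopwords and punctuation out of the Brown corpus sentences, and training a Word2Vec model. The commented-out line at the end of the cell loads pre-trained embeddings instead.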

In [10]:
import gensim
import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords
import string

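# If the corpora are missing, run once: nltk.download('brown'); nltk.download('stopwords')
sentences = nltk.corpus.brown.sents()

# Build the stopword set once; calling stopwords.words() for every token is extremely slow.
# No language is passed, so this covers every stopword list NLTK ships.
stop_words = set(stopwords.words())

final_sents = []

for sent in sentences:
    # Lowercase each token, then drop stopwords and punctuation.
    tokens = [w.lower() for w in sent]
    tokens = [w for w in tokens if w not in stop_words and w not in string.punctuation]
    final_sents.append(tokens)

# min_count=1 keeps every word, including those seen only once.
model = Word2Vec(final_sents, min_count=1, workers=1)
w2v_embeddings = model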

#w2v_embeddings = gensim.models.Word2Vec.load('wsj-embeddings.w2v')


That's it! We can do a few operations on the embeddings, such as getting the number of words in the vocabulary.

In [60]:
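# gensim < 4.0 exposes the vocabulary as wv.vocab; in gensim 4.x use len(w2v_embeddings.wv.key_to_index)
print(len(w2v_embeddings.wv.vocab))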

44377


## Using Embeddings for Similarity

More importantly, particularly for the purposes of the homework, we want to use our word embeddings to calculate similarities between words.

Using the `gensim.models.Word2Vec` class, we can simply call

    w2v_embeddings.wv.similarity(word_1, word_2)
    
to get the cosine similarity score between the two word vectors ([documentation here](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.similarity)).
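For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python; the `cosine` helper is a hypothetical name introduced here for illustration, not part of `gensim`, and the vectors are toy values rather than real embeddings.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Orthogonal toy vectors share no direction, so their similarity is zero.
print(cosine([2, 0, -1], [0, -1, 0]))  # 0.0
```

Vectors pointing in the same direction score close to 1, and vectors pointing in opposite directions score close to -1, regardless of their lengths.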

In [61]:
def print_similarity(word_1, word_2):
    
    def check_word(word):
        contained = word in w2v_embeddings.wv
        if not contained:
            print('"{}" not seen in embeddings.'.format(word))
        return contained
        
    if check_word(word_1) and check_word(word_2):
        sim = w2v_embeddings.wv.similarity(word_1, word_2)
        print('{:<18}  {:<15} = {}'.format(word_1, word_2, sim))

print_similarity('man', 'woman')
print_similarity('person', 'child')
print_similarity('tax', 'money')
print_similarity('tax', 'tariff')
print_similarity('tax', 'savings')
print_similarity('tax', 'dogs')
print_similarity('pope', 'catholic')
print_similarity('pope', 'senator')
print_similarity('pope', 'leader')
print_similarity('pope', 'tax')

man                 woman           = 0.9413719773292542
person              child           = 0.8452761173248291
tax                 money           = 0.6546809673309326
tax                 tariff          = 0.6185605525970459
tax                 savings         = 0.47217410802841187
tax                 dogs            = 0.43527430295944214
pope                catholic        = 0.8322218656539917
pope                senator         = 0.8930568695068359
pope                leader          = 0.8474855422973633
pope                tax             = 0.1644563525915146


## Correlating with Human Judgments

Now, I've designed a short in-class poll for us to go through a number of word pairs and get judgments from the class. We can use these results as a convenience sample of human judgments.

In [62]:
human_judgments = '''man,woman,5.0
person,child,5.0
tax,tariff,5.0
tax,money,4.0
tax,savings,3.0
tax,dogs,1.0
pope,catholic,4.0
pope,senator,3.0
pope,leader,3.0
pope,tax,1.0
'''

import csv
judgments = [(row[0], row[1], float(row[2])) for row in csv.reader(human_judgments.split('\n')) if row]

With our human judgments obtained, let's use the human scores as array $a$ and the embedding scores as array $b$, and use [`scipy.stats.spearmanr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html) to calculate the Spearman rank-order correlation coefficient.

In [67]:
human_scores = [entry[2] for entry in judgments]
embedding_scores = [w2v_embeddings.wv.similarity(w1, w2) for w1, w2, score in judgments]

from scipy.stats import spearmanr

print(spearmanr(human_scores, embedding_scores).correlation)

0.5315095895586142
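To see what a rank-order correlation measures, here is a rough sketch that replaces each score with its rank (tied scores share the mean of their positions) and then takes the Pearson correlation of the two rank lists. The helper names `average_ranks`, `pearson`, and `spearman` are hypothetical and written for this illustration; this is not SciPy's implementation, and the inputs are toy values.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Two tied top scores share rank 3.5.
print(average_ranks([5.0, 5.0, 3.0, 1.0]))  # [3.5, 3.5, 2.0, 1.0]
```

Because only the ordering matters, the 1-to-5 scale of the human judgments and the cosine scale of the embedding scores can be compared directly.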


In [12]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([[2, 0, -1]], [[0,-1, 0]])


array([[0.]])