# Activity 10: Exploring Word Embeddings

---
In this activity, we'll load word embeddings and take a look at a few of their properties. This will prepare us to create a predictive model based on word embeddings in a later computational exercise. The goals of the activity are to:

- Learn how to load word embeddings
- Calculate distances between words
- Find the word(s) that are most similar to a given word

## Loading the Embeddings

We'll begin by loading our word vectors/embeddings. We're going to use 300-dimensional embeddings from the [Stanford NLP group (i.e. GloVe)](https://nlp.stanford.edu/projects/glove/). However, we won't be loading all of the >1 million words available directly from the Stanford site. Instead, we'll be working from a much smaller file that contains only those words that are common in PubMed abstracts (from the PubMed 200k RCT dataset). No need to worry about how we're loading the file.

In [1]:
import numpy as np
import pandas as pd
import requests
import shutil

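# Download the pre-filtered GloVe file; stop early if the request fails.
response = requests.get(
    'https://github.com/mengelhard/mmci_applied_ds/raw/master/data/glove/ce3_glove.npy',
    stream=True)
response.raise_for_status()

# Stream the raw bytes to a local file.
with open('glove.npy', 'wb') as fout:
    shutil.copyfileobj(response.raw, fout)

# The file stores a pickled dict of {word: vector}; .item() unwraps it from the 0-d array.
glove_dict = np.load('glove.npy', allow_pickle=True).item()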

`glove_dict` is a Python *dictionary*, which maps unique *keys* to *values*. In our case, the keys are words, and the values are the associated word vectors. If we want to know if a given word is one of the keys in our dictionary, we can write `'word' in glove_dict.keys()`:

In [4]:
'word' in glove_dict.keys()

True

Then, if we want to look up the vector for `'word'`, we can write `glove_dict['word']`, which will give us a 300-dimensional vector:

In [7]:
glove_dict['word'].shape

(300,)

## Part 1: Look up the embeddings for a few medical terms

How rich is the vocabulary we get from our word vectors? In this portion of the activity, you should:
- check whether a few medical terms are found in `glove_dict`
- retrieve the word vectors associated with these terms

Note that our dictionary only contains words found in **both** the PubMed dataset and the original GloVe dictionary, which contains words from Wikipedia. You'll find some medical terms but not others, and you're unlikely to find many legal terms, for instance, because they rarely appear in PubMed abstracts.
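As a starting point, here is a minimal sketch of how a vocabulary check over several terms might look. It uses a small stand-in dictionary, `toy_dict`, filled with random vectors so that it runs on its own; the terms and the names `toy_dict`, `terms`, `found`, and `missing` are illustrative assumptions, and in the activity you would use `glove_dict` instead.

```python
import numpy as np

# Hypothetical stand-in for glove_dict (random vectors, not real embeddings).
rng = np.random.default_rng(0)
toy_dict = {w: rng.normal(size=300) for w in ['aspirin', 'insulin', 'stroke']}

# Illustrative terms to look up; 'tort' is deliberately absent from toy_dict.
terms = ['aspirin', 'insulin', 'tort', 'stroke']

# Membership can be tested on the dictionary directly; .keys() is optional.
found = [t for t in terms if t in toy_dict]
missing = [t for t in terms if t not in toy_dict]

# Stack the vectors of the terms that were found into one (n_found, 300) array.
vectors = np.stack([toy_dict[t] for t in found])

print('found:', found)
print('missing:', missing)
print('shape:', vectors.shape)
```

Splitting the terms into `found` and `missing` first avoids a `KeyError` when a term is not in the dictionary.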

In [8]:
### YOUR CODE HERE ###



## Part 2: Calculate the similarity between words

Using our word vectors, we can evaluate the similarity between pairs of words in our dictionary by taking the inner product (also called the *dot product*). Because these vectors have been unit-normalized, the dot product equals the *cosine similarity*, i.e. the cosine of the angle between the two word vectors.

- When the angle $\theta$ is close to zero, $\cos(\theta)$ will be close to 1.
- As $\theta$ grows toward 180°, $\cos(\theta)$ gets smaller, and it becomes negative once $\theta$ exceeds 90°.

Supposing 'word1' and 'word2' are both in `glove_dict`, we can calculate their dot product as `np.sum(glove_dict['word1'] * glove_dict['word2'])`.
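The relationship between the dot product and the cosine can be checked with a short sketch. The vectors below are made-up 3-dimensional examples (not real embeddings), and `cosine_similarity` is a helper name introduced here for illustration. Note that `np.dot(u, v)` computes the same quantity as `np.sum(u * v)`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors (no unit norm required)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors, chosen only to keep the arithmetic easy to follow.
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])    # points in the same direction as a
c = np.array([2.0, -1.0, 0.0])   # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)

# After rescaling to unit length, the plain dot product is the cosine similarity.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_unit = float(np.dot(a_unit, b_unit))

print(sim_ab, sim_ac, dot_unit)
```

Dividing by the norms makes the function safe for vectors that are not unit-normalized; for the vectors in `glove_dict`, the plain dot product should give the same result.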

In this part of the exercise, you should calculate the word similarity between:
1. a few pairs of words you'd expect to be closely related
2. a few pairs of words you'd expect to be unrelated

In [10]:
### YOUR CODE HERE ###



## Part 3: Words most similar to a given word

Last but not least, we can find the words in our dictionary that are *most similar* to a given word by iterating over the dictionary. We can iterate in either of two ways:
1. Iterating over the keys (e.g. `for key in glove_dict.keys()`)
2. Iterating over key, value pairs with `.items()` (e.g. `for key, value in glove_dict.items()`)

Either way, we want to:
- calculate the similarity between the word of interest and each other word in the dictionary
- sort the words by their similarity to the word of interest

`pandas` gives us an easy way to do the latter: we can create a series, then sort it by the similarity values. The block below provides a possible implementation along with an example. Please either (a) write your own implementation, or (b) go over this implementation carefully to make sure you understand how it works.

In [35]:
def similarity_to(wordvec):
    return pd.Series({k: np.sum(wordvec * v) for k, v in glove_dict.items()}).sort_values(ascending=False)

similarity_to(glove_dict['lisinopril']).head(20)

lisinopril      1.000000
ramipril        0.546203
captopril       0.541375
atenolol        0.506404
infliximab      0.503944
hydralazine     0.501791
donepezil       0.483091
verapamil       0.482338
budesonide      0.481172
enalapril       0.479861
fluvoxamine     0.476956
clonidine       0.474255
propranolol     0.473654
exenatide       0.472517
ppis            0.464988
salmeterol      0.464125
pantoprazole    0.463856
etanercept      0.463551
atomoxetine     0.461104
temozolomide    0.457288
dtype: float64
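The implementation above computes one dot product per word in a Python loop. As an alternative sketch (not the approach used in this activity), the vectors can be stacked into a single matrix so that all similarities come from one matrix-vector product. The toy dictionary, its 2-dimensional unit vectors, and the name `similarity_to_vectorized` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy embeddings: 2-dimensional unit vectors with placeholder keys.
toy_dict = {
    'a': np.array([1.0, 0.0]),
    'b': np.array([0.6, 0.8]),
    'c': np.array([0.0, 1.0]),
    'd': np.array([-1.0, 0.0]),
}

# Fix a word order once, then stack the vectors into an (n_words, dim) matrix.
words = list(toy_dict)
matrix = np.stack([toy_dict[w] for w in words])

def similarity_to_vectorized(wordvec):
    # A single matrix-vector product yields the dot product with every word.
    return pd.Series(matrix @ wordvec, index=words).sort_values(ascending=False)

ranked = similarity_to_vectorized(toy_dict['a'])
print(ranked)
```

Building the matrix once and reusing it is likely to be faster than the dictionary comprehension when many queries are run, at the cost of holding a second copy of the vectors in memory.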

Finally, in this part of the exercise, you should:
1. find words that are most similar to a few medical terms of interest and comment on whether / how much you agree with the ranking
2. see which words are *least* similar to those words and comment on their significance or lack thereof
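One possible way to look at both ends of the ranking is sketched below. It assumes a sorted series like the one returned by `similarity_to`; here the function takes the dictionary as a second argument (a change made only so the example is self-contained), and the toy vectors are hypothetical. The query word is dropped first, since its similarity to itself is always 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy embeddings: 2-dimensional unit vectors with placeholder keys.
toy_dict = {
    'a': np.array([1.0, 0.0]),
    'b': np.array([0.6, 0.8]),
    'c': np.array([0.0, 1.0]),
    'd': np.array([-0.6, 0.8]),
    'e': np.array([-1.0, 0.0]),
}

def similarity_to(wordvec, embeddings):
    # Dot product with every word, sorted from most to least similar.
    sims = {k: np.dot(wordvec, v) for k, v in embeddings.items()}
    return pd.Series(sims).sort_values(ascending=False)

# Remove the query word itself before inspecting the ranking.
sims = similarity_to(toy_dict['a'], toy_dict).drop('a')

most_similar = sims.head(2)    # top of the descending ranking
least_similar = sims.tail(2)   # bottom of the descending ranking

print(most_similar)
print(least_similar)
```

`sims.nsmallest(2)` would return the same two words as `sims.tail(2)`, ordered from the lowest similarity upward.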

In [45]:
### YOUR CODE HERE ###

