# Introduction

In this lab we will illustrate some of the ideas of word embeddings.

In [1]:
%tensorflow_version 1.x

import os
from google.colab import drive
import numpy as np

TensorFlow 1.x selected.


## Set up -- getting the data and word embeddings
As in the cats vs. dogs lab, I have shared the necessary files with you in a Google Drive folder.  To get the data into Colab, follow these steps:

1. Sign in to drive.google.com
2. Click on "Shared with me" on the left side of the screen
3. Right click on the stat344ne_imdb folder, select "Add shortcut to Drive", and choose "My Drive" for the shortcut location.
4. Run the code cell below and click on the link that is displayed.  It will open a new browser tab where you have to authorize Colab to access your Google Drive.  Then copy the sequence of numbers and letters that is displayed and paste it into the space that shows up in the code cell below.


In [2]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# exist_ok=True avoids a FileExistsError if the folder was created on an earlier run
os.makedirs("/content/stat344ne_imdb/", exist_ok=True)

In [0]:
!unzip -uq "/content/drive/My Drive/stat344ne_imdb/glove.6B.50d.txt.zip" -d "/content/stat344ne_imdb/glove/"

### Load word embeddings

We are working here with the GloVe (**Glo**bal **Ve**ctors for word representation) word embeddings.  The word embeddings are stored in a large text file.  Each line contains one word followed by the coefficients of its embedding (that is, its vector representation).  In the particular version we are working with, the embedding dimension is 50.  I chose this dimension because it is the smallest one available, so the file is the smallest.  Other options for the dimension are 100, 200, and 300.

In [2]:
glove_dir = "/content/stat344ne_imdb/glove"

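embeddings_index = {}
# Each line holds a word followed by its embedding coefficients.
# The with block closes the file even if an error occurs while reading.
with open(os.path.join(glove_dir, 'glove.6B.50d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs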

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
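
To make the file format concrete, the sketch below parses two made-up lines laid out the same way (a word followed by space-separated coefficients).  The three-dimensional values and the names `sample_text` and `toy_index` are invented for illustration; the real file has 50 coefficients per line.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a word, then its coefficients.
# These values are illustrative only; they are not real GloVe coefficients.
sample_text = "cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n"

toy_index = {}
with io.StringIO(sample_text) as f:
    for line in f:
        values = line.split()
        # The first field is the word; the remaining fields are the vector.
        toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(toy_index['cat'].shape)
```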


Here are the first 20 words in the embeddings dictionary:

In [3]:
list(embeddings_index)[:20]

['the',
 ',',
 '.',
 'of',
 'to',
 'and',
 'in',
 'a',
 '"',
 "'s",
 'for',
 '-',
 'that',
 'on',
 'is',
 'was',
 'said',
 'with',
 'he',
 'as']

Here is the embedding of the word 'cat'.  It is a vector of 50 real numbers since we are working with embedding dimension 50.

In [4]:
embeddings_index['cat']

array([ 0.45281 , -0.50108 , -0.53714 , -0.015697,  0.22191 ,  0.54602 ,
       -0.67301 , -0.6891  ,  0.63493 , -0.19726 ,  0.33685 ,  0.7735  ,
        0.90094 ,  0.38488 ,  0.38367 ,  0.2657  , -0.08057 ,  0.61089 ,
       -1.2894  , -0.22313 , -0.61578 ,  0.21697 ,  0.35614 ,  0.44499 ,
        0.60885 , -1.1633  , -1.1579  ,  0.36118 ,  0.10466 , -0.78325 ,
        1.4352  ,  0.18629 , -0.26112 ,  0.83275 , -0.23123 ,  0.32481 ,
        0.14485 , -0.44552 ,  0.33497 , -0.95946 , -0.097479,  0.48138 ,
       -0.43352 ,  0.69455 ,  0.91043 , -0.28173 ,  0.41637 , -1.2609  ,
        0.71278 ,  0.23782 ], dtype=float32)

### Cosine Similarity

In our videos today, we saw that the similarity of two vectors can be measured by the cosine of the angle between the vectors, which is closely related to the inner product of the vectors.  We also saw that with a one-hot encoding, the similarity for two different words will be 0.
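
As a quick check of the one-hot claim, here is a minimal sketch with a hypothetical vocabulary of 5 words (the size and the word indices are arbitrary choices for illustration).  Two different one-hot vectors never have a 1 in the same position, so their inner product is 0.

```python
import numpy as np

vocab_size = 5  # hypothetical tiny vocabulary

# Rows of the identity matrix are the one-hot vectors for word indices 0..4.
one_hot = np.eye(vocab_size)
v = one_hot[1]
w = one_hot[3]

print(np.dot(v, w))  # inner product of two different words
print(np.dot(v, v))  # inner product of a word with itself
```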

In the videos, I gave the relation $v \cdot w = \|v\| \, \|w\| \, \cos(\theta)$, where $\theta$ is the angle between the vectors $v$ and $w$.  We can rearrange this to obtain the **cosine similarity score**:

$$\cos(\theta) = \frac{v \cdot w}{\|v\| \, \|w\|}.$$
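
As a small worked example, with vectors chosen only for illustration, take $v = (1, 0)$ and $w = (1, 1)$.  Then $v \cdot w = 1$, $\|v\| = 1$, and $\|w\| = \sqrt{2}$, so

$$\cos(\theta) = \frac{1}{1 \cdot \sqrt{2}} \approx 0.707,$$

which corresponds to an angle of $\theta = 45^\circ$.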

#### 1. Implement a function to calculate the similarity of two vectors $v$ and $w$ using the cosine similarity score.

You may use the functions `np.dot`, `np.sqrt`, and/or `np.linalg.norm`.

In [0]:
def cos_similarity(v, w):
  '''
  Calculate cosine similarity of vectors v and w

  Arguments:
   - v: column vector of shape (d, 1)
   - w: column vector of shape (d, 1)
  
  Return:
   - cosine similarity of v and w
  '''
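  # add your calculation here.  You can add more lines if it's helpful
  # For column vectors of shape (d, 1), np.dot(v.T, w) has shape (1, 1),
  # so the result is a 1x1 array holding the similarity score.
  result = np.dot(v.T, w) / (np.linalg.norm(v) * np.linalg.norm(w))
  return result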

#### 2. Find the cosine similarity of the word embeddings for "cat" and "dog".
Note that you will need to `reshape` the embeddings to have shape (50, 1) for your function to work correctly.

In [6]:
cos_similarity(
    embeddings_index['cat'].reshape((50,1)),
    embeddings_index['dog'].reshape((50,1))
)

array([[0.92180043]], dtype=float32)
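
The result is displayed as a 1x1 array rather than a plain number because `np.dot` of a row of shape (1, d) and a column of shape (d, 1) is a matrix product.  Below is a minimal sketch of pulling out the scalar, using made-up vectors rather than GloVe embeddings; the variable names are only for illustration.

```python
import numpy as np

# Made-up column vectors of shape (3, 1), used only for illustration.
v = np.array([1.0, 2.0, 3.0]).reshape((3, 1))
w = np.array([2.0, 4.0, 6.0]).reshape((3, 1))

result = np.dot(v.T, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(result.shape)

# .item() converts a one-element array to a plain Python float.
score = result.item()
print(score)
```

Since `w` is a positive multiple of `v` here, the two vectors point in the same direction and the score is 1 up to rounding.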

#### 3. Find the cosine similarity of the word embeddings for "cat" and "antarctica".

In [7]:
cos_similarity(
    embeddings_index['cat'].reshape((50,1)),
    embeddings_index['antarctica'].reshape((50,1))
)

array([[0.19274694]], dtype=float32)

#### 4. Find the cosine similarity of the embedding differences for 'paris' minus 'france' and 'rome' minus 'italy'.

An interesting, amazing, and actually sort of sinister aspect of word embeddings is that arithmetic differences of word embeddings tend to respect analogies.  For example, the difference between the embeddings of 'paris' and 'france' is similar to the difference between the embeddings of 'rome' and 'italy'.  The interpretation is that if $e_{paris}$ is the embedding vector of 'paris' and $e_{france}$ is the embedding vector of 'france', then the difference $e_{paris} - e_{france}$ represents a meaningful direction in the vector space, capturing the relationship between a city and the country it is in.  We will explore this more in a future lab.  For now, calculate the cosine similarity between the difference $e_{paris} - e_{france}$ and the difference $e_{rome} - e_{italy}$.

In [8]:
cos_similarity(
    embeddings_index['paris'].reshape((50,1)) - embeddings_index['france'].reshape((50,1)),
    embeddings_index['rome'].reshape((50,1)) - embeddings_index['italy'].reshape((50,1))
)

array([[0.67514807]], dtype=float32)
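
To see what a shared direction means, here is a toy sketch with hand-made 2-dimensional vectors (invented values, not GloVe embeddings; all names are placeholders).  Each hypothetical city vector is its country vector plus the same offset, so the two differences are exactly parallel and their cosine similarity is 1.  With real embeddings the differences are only roughly parallel, so the score falls below 1.

```python
import numpy as np

# Invented 2-D "embeddings"; the names are placeholders for illustration.
offset = np.array([0.0, 1.0])   # stands in for a city-minus-country direction
country_a = np.array([1.0, 0.5])
country_b = np.array([3.0, -0.5])
city_a = country_a + offset
city_b = country_b + offset

# Difference vectors, analogous to e_paris - e_france and e_rome - e_italy.
d1 = city_a - country_a
d2 = city_b - country_b

score = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(score)
```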