# Word Embeddings - Exploration

In this part of the assignment, we'll explore a few properties of word embeddings. We'll use pre-trained GloVe ([Pennington et al. 2014](https://nlp.stanford.edu/pubs/glove.pdf)) embeddings, and evaluate on the analogy task described in ([Mikolov et al. 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)).

You may also want to take a look at the [Embedding Co-occurrence and Visualization notebook](Embeddings_SVD_Viz.ipynb); this notebook will build somewhat on that material.

In [None]:
# Install a few python packages using pip
from common import utils
utils.require_package("wget")      # for fetching dataset

In [None]:
# Standard python helper libraries.
import os, sys, re, json, time
import itertools, collections
from importlib import reload
from IPython.display import display

# NumPy and SciPy for matrix ops
import numpy as np
import scipy.sparse

# NLTK for NLP utils
import nltk

# Helper libraries
from common import utils, vocabulary, tf_embed_viz

# Fits like a GloVe

Word embeddings take a long time to train - since the goal is to provide a good representation for as many words as possible, generating good embeddings often requires making several passes over a very large corpus. 

Fortunately, it's possible to learn fairly general embeddings from large corpora that are useful for many downstream tasks. We'll use the GloVe vectors available at https://nlp.stanford.edu/projects/glove/ - specifically, a set with a vocabulary of 400,000 words, trained on a corpus of 6B tokens from Wikipedia and Gigaword.

The vectors are distributed as a (very) large text file, with one word per line followed by its vector:
```
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459
```
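
As a rough illustration of this file format, the sketch below splits one such line into a word and a NumPy vector. The function name `parse_glove_line` is made up for this example; it is not the parser used in `glove_helper.py`, which may work differently.

```python
import numpy as np

def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector).

    Hypothetical helper for illustration; not part of glove_helper.py.
    """
    parts = line.rstrip().split(" ")
    word = parts[0]
    # Everything after the first token is a component of the vector.
    vector = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return word, vector

# The sample line shown above (only the components printed there).
line = ("the -0.038194 -0.24487 0.72812 -0.39961 0.083172 "
        "0.043953 -0.39141 0.3344 -0.57545 0.087459")
word, vector = parse_glove_line(line)
print(word, vector.shape)  # the (10,)
```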

We've implemented a helper class, `Hands` in `glove_helper.py`, that will parse these files in a memory-efficient manner and provide a wrapper object over a NumPy array containing the actual vectors.

Run the cell below; the first time, it will download an ~800 MB file to the `data/` directory. **_Please do not check this in to git!_** Also be sure that you have at least 2 GB of free memory on your workstation; Python is not very memory-efficient, so these embeddings can be significantly larger when loaded into memory.

In [None]:
import glove_helper; reload(glove_helper)

hands = glove_helper.Hands(ndim=100)  # 50, 100, 200, 300 dim are available

`hands` has a few properties and methods that might be useful:
- `hands.vocab` is a `vocabulary.Vocabulary` object that manages the set of available words
- `hands.W` is a matrix of shape $|V| \times d$ containing the actual vectors, one per row. Row indices are as given by `hands.vocab.word_to_id[word]`.
- `hands.get_vector(word)` returns the vector for a word (passed as a string).

Note that we let $|V| = $`hands.W.shape[0]`, which in addition to the actual words includes three special tokens: `<s>` (begin sentence), `</s>` (end sentence), and `<unk>` (unknown word).
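
To make the relationship between these pieces concrete, here is a toy stand-in built from plain Python and NumPy. Everything in it (the five-word vocabulary, the ordering of the special tokens, the random vectors) is invented for illustration; only the idea that row `word_to_id[word]` of `W` holds that word's vector comes from the description above.

```python
import numpy as np

# Hypothetical stand-in for `hands`; the real object is built by glove_helper.
words = ["<s>", "</s>", "<unk>", "the", "bank"]   # made-up tiny vocabulary
word_to_id = {w: i for i, w in enumerate(words)}
id_to_word = {i: w for w, i in word_to_id.items()}

d = 4                                    # toy embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(len(words), d))     # shape |V| x d, one vector per row

def get_vector(word):
    """Look up a word's row in W via its integer id."""
    return W[word_to_id[word]]

row = word_to_id["bank"]
print(W.shape)                                       # (5, 4)
print(row, np.array_equal(get_vector("bank"), W[row]))  # 4 True
```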

# Part (a): Nearest Neighbors

### Cosine Similarity

To measure the similarity of two words, we'll use the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between their representation vectors:

$$ D^{cos}_{ij} = \frac{v_i^T v_j}{||v_i||\ ||v_j||}$$

*Note that this is called cosine similarity because $D^{cos}_{ij} = \cos(\theta_{ij})$, where $\theta_{ij}$ is the angle between the two vectors.*
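
As a quick sanity check of the formula, the sketch below computes the cosine similarity of a few hand-picked 2-d vectors. The helper name `cosine_similarity` is only for this example; it handles a single pair of vectors and is not the `find_nn_cos(...)` function you are asked to write.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-d vectors (illustrative helper)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])    # same direction as a, different length
c = np.array([0.0, 3.0])    # orthogonal to a
d = np.array([-1.0, 0.0])   # opposite direction to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # -1.0
```

Because the norms are divided out, only the direction of each vector matters, not its length.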

## Part (a) Questions

1. In `vector_math.py`, implement the `find_nn_cos(...)` function. Read the docstring _carefully_ - it describes what you should return. *Hint: use NumPy functions instead of a `for` loop.*
<p>
2. Use the `show_nns(...)` function below to find the nearest neighbors for the words `"bank"`, `"plane"`, and `"flies"`. Are the neighbors dominated by one sense of these words or another? Is there evidence that the vector encodes meaning of the other senses as well?
<p>
3. Like `word2vec`, GloVe constructs representations by summarizing word-word co-occurrence statistics. Use `show_nns(...)` to find the neighbors of `"green"` and `"celadon"`, and `"orange"` and `"ochre"`. Explain what you find in terms of the distributional hypothesis and the grounding problem. Do the vectors for `"ochre"` and `"celadon"` appear to encode a notion of color? What do they represent, instead?

_(Recall that the Distributional Hypothesis is the idea that "you shall know a word by the company it keeps" (Firth, 1957) - that meaning is derived from the context in which a word is used. Grounding refers to the meaning of language in terms of external concepts, such as real-world entities or physical characteristics.)_

In [None]:
import vector_math; reload(vector_math)

def show_nns(hands, word, k=10):
    """Helper function to print neighbors of a given word."""
    word = word.lower()
    print("Nearest neighbors for '{:s}'".format(word))
    v = hands.get_vector(word)
    for i, sim in zip(*vector_math.find_nn_cos(v, hands.W, k)):
        target_word = hands.vocab.id_to_word[i]
        print("{:.03f} : '{:s}'".format(sim, target_word))
    print("")
    
show_nns(hands, "the")

In [5]:
#### YOUR CODE HERE ####
# Code for Part (a).2
show_nns(hands, "flies")   #--SOLUTION--
show_nns(hands, "plane")   #--SOLUTION--
show_nns(hands, "bank")    #--SOLUTION--


#### END(YOUR CODE) ####

Nearest neighbors for 'flies'
1.000 : 'flies'
0.741 : 'fly'
0.644 : 'flying'
0.634 : 'insects'
0.632 : 'flew'
0.618 : 'butterflies'
0.614 : 'moths'
0.609 : 'moth'
0.581 : 'planes'
0.576 : 'plane'

Nearest neighbors for 'plane'
1.000 : 'plane'
0.865 : 'airplane'
0.822 : 'flight'
0.820 : 'jet'
0.808 : 'crashed'
0.793 : 'crash'
0.775 : 'planes'
0.773 : 'helicopter'
0.758 : 'airliner'
0.753 : 'flying'

Nearest neighbors for 'bank'
1.000 : 'bank'
0.806 : 'banks'
0.753 : 'banking'
0.704 : 'credit'
0.694 : 'investment'
0.678 : 'financial'
0.669 : 'securities'
0.665 : 'lending'
0.648 : 'funds'
0.648 : 'ubs'



In [6]:
#### YOUR CODE HERE ####
# Code for Part (a).3
show_nns(hands, "orange")  #--SOLUTION--
show_nns(hands, "ochre")   #--SOLUTION--
show_nns(hands, "green")   #--SOLUTION--
show_nns(hands, "celadon") #--SOLUTION--


#### END(YOUR CODE) ####

Nearest neighbors for 'orange'
1.000 : 'orange'
0.736 : 'yellow'
0.714 : 'red'
0.712 : 'blue'
0.711 : 'green'
0.678 : 'pink'
0.677 : 'purple'
0.671 : 'black'
0.665 : 'colored'
0.625 : 'lemon'

Nearest neighbors for 'ochre'
1.000 : 'ochre'
0.687 : 'pigment'
0.677 : 'reddish'
0.674 : 'ocher'
0.662 : 'coloured'
0.658 : 'greenish'
0.648 : 'magenta'
0.634 : 'pigments'
0.632 : 'yellowish'
0.629 : 'mottled'

Nearest neighbors for 'green'
1.000 : 'green'
0.820 : 'red'
0.787 : 'blue'
0.781 : 'brown'
0.771 : 'yellow'
0.762 : 'white'
0.749 : 'gray'
0.733 : 'black'
0.729 : 'pink'
0.728 : 'purple'

Nearest neighbors for 'celadon'
1.000 : 'celadon'
0.620 : 'faience'
0.602 : 'porcelains'
0.594 : 'majolica'
0.591 : 'ocher'
0.585 : 'blue-and-white'
0.575 : 'glazes'
0.563 : 'unglazed'
0.558 : 'porcelain'
0.549 : 'steatite'



# Part (b): Linear Analogies

In this part, you'll implement the word analogy task described in Section 4 of ([Mikolov et al. 2013](https://arxiv.org/pdf/1301.3781.pdf)), and discussed in sections 4.8 and 4.11 of the async.

1. In `vector_math.py`, implement the `analogy(...)` function. (*Hint: this should be a very short function, given what you've already written above.*)
<p>
2. Evaluate a few analogies using the `show_analogy(...)` function below. In particular, find at least one analogy that tests each of the following relationships, and that the model gets right:<ul>
<li> Singular / plural
<li> Superlatives
<li> Verb tense
<li> Country / capital
</ul>
(See Table 1 of ([Mikolov et al. 2013](https://arxiv.org/pdf/1301.3781.pdf)) for a few ideas)
<p>
3. Evaluate the following analogies:
<ul>
<li> `"lizard" is to "reptile" as "dog" is to ____`
<li> `"finger" is to  "hand"   as "toe" is to ____`
</ul>
What types of relations do these test? (*Hint: think back to WordNet, and things that end in -nymy.*) Does our approach of linear analogies work well here? What assumption is violated by these sorts of relationships? (*Hint: what if we reversed the order, and tested "reptile" is to "lizard", and so on?*)
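
For intuition, the usual way to pose "a is to b as c is to ___" with vectors is to look for the word whose vector is most similar to $v_b - v_a + v_c$. The sketch below does this on made-up 2-d vectors; the dictionary `toy`, the function `toy_analogy`, and the axis meanings are all invented for illustration, and this is not the `analogy(...)` signature from `vector_math.py`.

```python
import numpy as np

# Made-up 2-d vectors: axis 0 loosely "royalty", axis 1 loosely "gender".
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cos(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_analogy(a, b, c, vectors):
    """Rank all words by cosine similarity to v_b - v_a + v_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    scores = {w: cos(target, v) for w, v in vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = toy_analogy("king", "queen", "man", toy)
print(ranked[0][0])  # woman
```

Like this toy version, a ranking over the full vocabulary does not exclude the query words themselves, so they can appear among the top results.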

In [7]:
import vector_math; reload(vector_math)

def show_analogy(hands, a, b, c, k=5):
    """Compute and print a vector analogy."""
    a, b, c = a.lower(), b.lower(), c.lower()
    va = hands.get_vector(a)
    vb = hands.get_vector(b)
    vc = hands.get_vector(c)
    print("'{a:s}' is to '{b:s}' as '{c:s}' is to ___".format(**locals()))
    for i, sim in zip(*vector_math.analogy(va, vb, vc, hands.W, k)):
        target_word = hands.vocab.id_to_word[i]
        print("{:.03f} : '{:s}'".format(sim, target_word))
    print("")
    
show_analogy(hands, "king", "queen", "man")

'king' is to 'queen' as 'man' is to ___
0.804 : 'woman'
0.779 : 'man'
0.735 : 'girl'
0.682 : 'she'
0.659 : 'her'



In [8]:
#### YOUR CODE HERE ####
# Code for Part (b).2
show_analogy(hands, "mouse", "mice", "dog")        #--SOLUTION--
show_analogy(hands, "fast", "fastest", "slow")     #--SOLUTION--
show_analogy(hands, "russia", "moscow", "greece")  #--SOLUTION--
show_analogy(hands, "work", "works", "speak")      #--SOLUTION--


#### END(YOUR CODE) ####

'mouse' is to 'mice' as 'dog' is to ___
0.808 : 'dogs'
0.729 : 'rats'
0.715 : 'dog'
0.704 : 'cats'
0.689 : 'mice'

'fast' is to 'fastest' as 'slow' is to ___
0.855 : 'fastest'
0.744 : 'slowest'
0.640 : 'fourth'
0.640 : 'slower'
0.629 : 'slow'

'russia' is to 'moscow' as 'greece' is to ___
0.797 : 'athens'
0.753 : 'greece'
0.701 : 'thessaloniki'
0.698 : 'istanbul'
0.668 : 'moscow'

'work' is to 'works' as 'speak' is to ___
0.851 : 'speak'
0.749 : 'speaks'
0.719 : 'spoken'
0.627 : 'spoke'
0.621 : 'speaking'



In [9]:
#### YOUR CODE HERE ####
# Code for Part (b).3
show_analogy(hands, "lizard", "reptile", "dog")  #--SOLUTION--
show_analogy(hands, "finger", "hand", "toe")     #--SOLUTION--


#### END(YOUR CODE) ####

'lizard' is to 'reptile' as 'dog' is to ___
0.768 : 'dog'
0.682 : 'dogs'
0.662 : 'pet'
0.635 : 'puppy'
0.629 : 'animal'

'finger' is to 'hand' as 'toe' is to ___
0.776 : 'toe'
0.625 : 'hand'
0.581 : 'shoes'
0.567 : 'back'
0.566 : 'hands'

