# Word Embeddings

In this exercise, you will see how to train and use word embeddings using the ``gensim`` package.


## **Step 1:**
Run the following code block to download pre-trained word embeddings. These word embeddings were trained on Google News data, and each vector has 300 dimensions. The download might take around 15 minutes.

In [None]:
import numpy as np
import gensim.downloader as gensim_api

#Download vectors. This might take ~15 minutes or longer
word_vectors = gensim_api.load("word2vec-google-news-300")

In [None]:
# TODO: compute the Pearson correlation between the model's similarities
# and the human scores from the file downloaded below
! wget http://alfonseca.org/pubs/ws353simrel.tar.gz
! tar -xf ws353simrel.tar.gz
! head wordsim353_sim_rel/wordsim_relatedness_goldstandard.txt
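The correlation mentioned in the TODO above could be computed along the following lines. This is only a sketch: the helper `pearson_with_human_scores` is a hypothetical name, and it assumes each line of the gold-standard file has the form `word1<TAB>word2<TAB>score`, which you should confirm from the `head` output. Toy data stands in for the real file and model so the block runs on its own.

```python
import numpy as np

def pearson_with_human_scores(lines, similarity_fn):
    """Correlate model similarities with human scores.

    lines: iterable of "word1 word2 score" strings (assumed file format).
    similarity_fn: callable (word1, word2) -> float. It may raise KeyError
                   for out-of-vocabulary words; such pairs are skipped.
    """
    human, model = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        w1, w2, score = parts
        try:
            sim = similarity_fn(w1, w2)
        except KeyError:
            continue  # word missing from the vocabulary
        human.append(float(score))
        model.append(float(sim))
    # Pearson correlation is the off-diagonal entry of the 2x2 matrix
    return float(np.corrcoef(human, model)[0, 1])

# Toy stand-ins for the gold-standard file and the embedding model
toy_lines = ["cat\tdog\t8.0", "cat\tcar\t2.0", "dog\tcar\t1.0"]
toy_sims = {("cat", "dog"): 0.8, ("cat", "car"): 0.2, ("dog", "car"): 0.1}
r = pearson_with_human_scores(toy_lines, lambda a, b: toy_sims[(a, b)])
print(round(r, 3))
```

With the real data, one option is to pass the open file object as `lines` and `word_vectors.similarity` as `similarity_fn`.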



## Step 2:

Using the word embeddings stored in ``word_vectors``, find the most similar words to a given word.
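"Most similar" is usually measured with cosine similarity between vectors. The sketch below illustrates the idea on three made-up 3-dimensional vectors; the function name `most_similar_cosine` and the toy values are assumptions for illustration only. On the real model, gensim's `word_vectors.most_similar(word, topn=...)` returns a comparable ranking.

```python
import numpy as np

def most_similar_cosine(query, vectors, topn=3):
    """Rank all other words by cosine similarity to `query`."""
    q = vectors[query]
    q = q / np.linalg.norm(q)
    scores = []
    for word, vec in vectors.items():
        if word == query:
            continue  # a word is trivially most similar to itself
        scores.append((word, float(np.dot(q, vec) / np.linalg.norm(vec))))
    # highest cosine similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: "cat" and "dog" point in nearly the same direction
toy = {
    "cat": np.array([1.0, 0.9, 0.0]),
    "dog": np.array([0.9, 1.0, 0.1]),
    "car": np.array([0.0, 0.1, 1.0]),
}
neighbours = most_similar_cosine("cat", toy, topn=2)
print(neighbours)
```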

### **Question 1:**

* (a) What are the 10 closest words to "france" ?
* (b) What are the 10 closest words to "great" ?
* (c) What are the 10 closest words to "film" ?


In [None]:
#Your code here

## Step 3:

Answer Analogy Questions.

### **Question 2:**

Using the word embeddings stored in ``word_vectors``, complete the following sentences:

* (a) Man is to uncle as woman is to ...
* (b) work is to working as talk is to ...
* (c) France is to paris as Germany is to ...
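Analogies of the form "a is to b as c is to ?" are commonly answered with vector arithmetic: take `b - a + c` and look for the nearest remaining word. The sketch below shows this on made-up vectors; `solve_analogy` and the toy values are hypothetical. On the real model, gensim's `most_similar(positive=[...], negative=[...])` performs a comparable computation.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the word d that best completes a : b :: c : d."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the input words are not valid answers
        score = float(np.dot(target, vec) / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up vectors in which one axis separates "man" from "woman"
toy = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.0, 0.5, 0.5]),
}
answer = solve_analogy("man", "king", "woman", toy)
print(answer)
```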

In [None]:
#your code here

## Step 4:

Now we will plot the word embeddings to see how they are related. The following block implements a function for computing PCA components, which projects the 300-dimensional vectors onto 2 dimensions so that we can plot them. The code is taken from: https://necromuralist.github.io/Neurotic-Networking/posts/nlp/pca-dimensionality-reduction-and-word-vectors/

In [None]:
import numpy
def compute_pca(X: numpy.ndarray, n_components: int=2) -> numpy.ndarray:
    """Calculate the principal components for X

    Args:
       X: of dimension (m,n) where each row corresponds to a word vector
       n_components: Number of components you want to keep.

    Return:
       X_reduced: data projected onto the top n_components principal components, of dimension (m, n_components)
    """
    # you need to set axis to 0 or it will calculate the mean of the entire matrix instead of one per column
    X_demeaned = X - X.mean(axis=0)

    # calculate the covariance matrix
    # the default numpy.cov assumes the rows are variables, not columns so set rowvar to False
    covariance_matrix = numpy.cov(X_demeaned, rowvar=False)

    # calculate eigenvectors & eigenvalues of the covariance matrix
    eigen_vals, eigen_vecs = numpy.linalg.eigh(covariance_matrix)

    # sort eigenvalue in increasing order (get the indices from the sort)
    idx_sorted = numpy.argsort(eigen_vals)

    # reverse the order so that it's from highest to lowest.
    idx_sorted_decreasing = list(reversed(idx_sorted))

    # sort the eigen values by idx_sorted_decreasing
    eigen_vals_sorted = eigen_vals[idx_sorted_decreasing]

    # sort eigenvectors using the idx_sorted_decreasing indices
    # We're only sorting the columns so remember to get all the rows in the slice
    eigen_vecs_sorted = eigen_vecs[:, idx_sorted_decreasing]

    # select the first n eigenvectors (n is the desired dimension
    # of the reduced data array, i.e. n_components)
    # once again, make sure to get all the rows and only slice the columns
    eigen_vecs_subset = eigen_vecs_sorted[:, :n_components]

    # transform the data by multiplying the transpose of the eigenvectors 
    # with the transpose of the de-meaned data
    # Then take the transpose of that product.
    X_reduced = numpy.dot(eigen_vecs_subset.T, X_demeaned.T).T
    return X_reduced

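import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
plt.style.use('ggplot')

def display_pca_scatterplot(model, words=None, sample=0):
    # gensim >= 4.0 exposes the vocabulary as key_to_index; older versions use vocab
    vocab = model.key_to_index if hasattr(model, "key_to_index") else model.vocab
    if words is None:
        if sample > 0:
            # plot a random sample of the vocabulary
            words = np.random.choice(list(vocab.keys()), sample)
        else:
            # plot the whole vocabulary
            words = list(vocab)
    else:
        # skip words that are not in the vocabulary
        words = [word for word in words if word in vocab]
    vectors = np.array([model[w] for w in words])

    # project the vectors onto the first two principal components
    twodim = PCA(n_components=2).fit_transform(vectors)

    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    # label each point with its word, slightly offset from the marker
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)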
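One way to sanity-check an eigendecomposition-based PCA such as `compute_pca` is to verify that the variance of each projected column equals the corresponding eigenvalue of the covariance matrix. The sketch below does this on random data with a compact re-implementation; `pca_project` is a hypothetical helper written for this check, not part of the original code.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its top principal components via eigendecomposition."""
    X_centered = X - X.mean(axis=0)            # subtract the per-column mean
    cov = np.cov(X_centered, rowvar=False)     # columns are the variables
    eigen_vals, eigen_vecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigen_vals)[::-1][:n_components]  # largest first
    return X_centered @ eigen_vecs[:, order], eigen_vals[order]

# Random data whose columns have very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

X_reduced, top_vals = pca_project(X)
# np.cov uses the unbiased estimator, so compare with ddof=1
projected_var = X_reduced.var(axis=0, ddof=1)
print(X_reduced.shape)
print(np.allclose(projected_var, top_vals))
```

The same check can be applied to the output of `compute_pca` on a small matrix of word vectors.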

## Plot 

Using the ``display_pca_scatterplot`` function defined above, you can now plot a subset of words of your choice. An example is shown below. Select other words and plot them to see how they are related in the vector space.

In [None]:
display_pca_scatterplot(word_vectors,
                        ['paris', 'berlin', 'france', 'germany'])