# Exercises Week 10
This week's exercises are about training word embeddings using Noise Contrastive Estimation (NCE) with the Skip-Gram model and the CBOW model.

In my_util.py we have included standard code for loading data, for training a neural network model with PyTorch (as you have seen in earlier exercises), and for nearest neighbour similarity search.

Your task is simply to implement the neural network architectures required for training the embedding model.

Before we get to that, let us start with the basic Skip-Gram model and how we train it.
In the classic Skip-Gram model we learn embeddings by training a simple neural net to predict the neighbouring words in a sentence.


The code for the Skip-Gram architecture is shown in the cell below.
Notice that the model has two matrices:
- the embedding matrix, named embedding, of size $v \times d$, where $v$ is the number of different words and $d$ is the dimension of the embedding. The i'th row of the matrix is the embedding vector for the i'th word in the list of all considered words.
- the softmax output matrix, named linear, of size $d \times v$, which maps an embedding vector of size $d$ to $v$ numbers, one for each word. These numbers define the model's estimate of which words are likely to be near the input word. To turn the values into "probabilities", the $v$ numbers are passed through the softmax function.

The forward pass simply looks up the current embedding vector for each input point, multiplies that vector with the weights defined in linear (just like a standard linear model), and returns the result. The loss is the negative log likelihood of the softmax-transformed output, known as cross entropy.
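
As a small worked example of this loss, the sketch below computes the softmax and the negative log likelihood by hand for a single output vector. The scores and target indices are made-up values for illustration only. In PyTorch, `F.cross_entropy` computes the same quantity directly from the raw outputs.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Negative log likelihood of the target index under softmax(scores)."""
    return -math.log(softmax(scores)[target])

# Hypothetical model output for a vocabulary of v = 4 words
scores = [2.0, 0.5, -1.0, 0.0]
probs = softmax(scores)
loss_good = cross_entropy(scores, target=0)  # target has the highest score
loss_bad = cross_entropy(scores, target=2)   # target has the lowest score
print(probs, loss_good, loss_bad)
```

The loss is small when the target word receives a high score and large when it receives a low one, which is what pushes the model towards predicting real neighbours.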

Note that the training data for such a Skip-Gram model can be extracted from any text, and that both the input and the target are integers in $\{0,\dots, v-1\}$ (the index range that nn.Embedding accepts), each encoding the index of a word.
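
To illustrate how such training pairs can be extracted from text, here is a minimal sketch. The function name `skipgram_pairs`, the window size, and the toy sentence are assumptions made for this example; the provided data file may have been built differently. Indices start at 0, as `nn.Embedding` expects.

```python
def skipgram_pairs(tokens, word_to_idx, window=2):
    """Return (input, target) index pairs for every word and its neighbours."""
    ids = [word_to_idx[t] for t in tokens]
    pairs = []
    for i, center in enumerate(ids):
        lo = max(0, i - window)
        hi = min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                pairs.append((center, ids[j]))
    return pairs

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
word_to_idx = {w: i for i, w in enumerate(vocab)}
pairs = skipgram_pairs(tokens, word_to_idx, window=1)
print(word_to_idx)
print(pairs)
```

Each pair is one training example: the first index is the input word and the second is a word the model should predict as a neighbour.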


In [1]:
%matplotlib inline
import torch
import torch.utils.data as torch_data
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import json

class SkipGram(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        """ Construct and save the parameters needed for the model 
        
            The embedding parameters must be stored in self.embedding
        """

        super(SkipGram, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        """ Evaluation of the neural net. The forward pass 
        
            Args:
                inputs: torch.tensor n x 1 (each data point is a word index)
                
            Returns:
                out: torch.tensor shape n x v
        """
        embeds = self.embedding(inputs).squeeze(1)  # (batch_size, e_dim); squeeze only dim 1 so a batch of size 1 keeps its batch dimension
        #print(embeds.shape)
        out = self.linear(embeds)
        return out
    


## Exercise 1: Noise Contrastive Estimation for Skip-Gram
To get around the problem of updating parameters for every word in each iteration of gradient descent, which may take very long, we instead train a Skip-Gram embedding model with Noise Contrastive Estimation (NCE). NCE turns the problem into a binary classification problem, where the neural net must learn to determine whether two words come from the same sentence context or not.

In NCE each input data point consists of two words, and we predict whether they are from the same sentence context, whereas in the classic Skip-Gram model we try to predict word 2 from word 1. The two words are represented by their indexes, e.g. word 7 and word 15 are simply represented as the vector $[7, 15]$. The label is an integer in $\{0, 1\}$ indicating whether the two words are taken from the same real sentence context or form a noise sample created by us.
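
The sketch below shows one way such labelled data points could be generated from real word pairs. It assumes that real pairs get label 1 and noise pairs label 0, and that noise words are drawn uniformly at random; the function name `make_nce_samples` and the number of noise samples are hypothetical, and the provided data set may have been created differently.

```python
import random

def make_nce_samples(pairs, vocab_size, num_noise=2, seed=0):
    """Label real pairs 1 and add `num_noise` random fake pairs labelled 0."""
    rng = random.Random(seed)
    samples = []
    for word, context in pairs:
        samples.append(([word, context], 1))
        for _ in range(num_noise):
            noise = rng.randrange(vocab_size)  # assumption: uniform noise words
            samples.append(([word, noise], 0))
    return samples

real_pairs = [(7, 15), (15, 7), (3, 7)]
samples = make_nce_samples(real_pairs, vocab_size=20)
for x, y in samples:
    print(x, y)
```

A randomly drawn noise word can occasionally coincide with a real context word. With a large vocabulary this is rare, and it is usually ignored.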

The neural net architecture works as follows on a single input point (note that the input to the neural net comes in batches):
- Step 1: Compute the embedding vectors for the two input words
- Step 2: Compute the inner product of the two embeddings
- Step 3: Return this value
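
To see how the returned value can act as a binary classifier, the plain-Python sketch below passes an inner product through the sigmoid function and evaluates the logistic loss for both labels. The two vectors are made up, and the use of the logistic loss is an assumption, since the loss used inside `my_util.fit_model` is not shown here.

```python
import math

def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(score, label):
    """Logistic loss for one raw score and a label in {0, 1}."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Two hypothetical 3-dimensional embedding vectors
u = [0.5, -1.0, 2.0]
w = [1.0, 0.5, 1.5]
score = sum(a * b for a, b in zip(u, w))  # inner product: 0.5 - 0.5 + 3.0
print(score, sigmoid(score))
print(binary_cross_entropy(score, 1), binary_cross_entropy(score, 0))
```

A large positive inner product gives a low loss for label 1 and a high loss for label 0, so training pulls the embeddings of words from the same context together and pushes noise pairs apart.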

Implement the architecture in the cell below in SkipGramNCE.
Remember to name the embedding layer embedding (i.e. self.embedding).

In [2]:
class SkipGramNCE(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        """ Construct and save the parameters needed for the model 
        
            The embedding parameters must be stored in self.embedding
        """
    
        super(SkipGramNCE, self).__init__()
        ### YOUR CODE HERE
        ### END CODE

    def forward(self, inputs):
        """ Evaluation of the neural net. The forward pass 
        
            Args:
                inputs: torch.tensor n x 2 (each data point is a pair of word indexes)
                
            Returns:
                out: torch.tensor shape n,
        """
        out = None
        ### YOUR CODE HERE
        ### END CODE
        return out

# trivial test to see if it computes something
smodel = SkipGramNCE(5, 10)
test_data = torch.tensor([[2, 3], [3, 1]])
res = smodel.forward(test_data)
print(res)

## Exercise 1 continued: Testing on Large Data
To test your Skip-Gram NCE implementation, run the cells below. They also show the nearest words for a few example words, which you can experiment with if you like. You can also change the embedding dimension.

In the following three cells we test your implementation:
- cell 1: load the data (slow, so in a separate cell). You can load a smaller data set for testing if need be (data_size=-1 means all data)
- cell 2: training (slow, so in a separate cell), particularly if you use all the data, many epochs, and a high embedding dimension such as 150 or 200
- cell 3: print some nearest neighbours in embedding space; results should improve with data size and training time
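
For intuition about the nearest neighbour search in the last cell, here is a minimal sketch on toy data. The implementation of `my_util.KNN` is not shown here, so the use of cosine similarity, the function names, and the 2-dimensional vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to `word`'s vector."""
    query = embeddings[word]
    others = [(cosine(query, vec), w) for w, vec in embeddings.items() if w != word]
    others.sort(reverse=True)  # highest similarity first
    return [w for _, w in others[:k]]

# Made-up 2-dimensional embeddings, for illustration only
toy = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.2],
    "cat": [0.1, 0.9],
    "dog": [0.2, 0.8],
}
print(nearest("king", toy, k=1))
print(nearest("cat", toy, k=2))
```

With well-trained embeddings, words used in similar contexts end up with similar vectors, which is why the neighbours of a word give a quick sanity check of the model.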

In [3]:
import my_util 
data_path = './skip_grams.json'
### YOUR CODE HERE - set the correct data path here
### END CODE
dataset, dataloader, words_to_idx = my_util.load_skipgram_data(data_path, data_size=-1)


In [4]:
print('Index of horse:', words_to_idx['horse'])
tmp = dataset[0]
print(tmp)
print(words_to_idx.inverse[1516])
print('first data point: {0} -> {1}: label {2}'.format(words_to_idx.inverse[tmp[0][0].item()], 
                                                      words_to_idx.inverse[tmp[0][1].item()], tmp[1]))
# Settings
embedding_dim = 100
vocab_size = len(words_to_idx)
skip_net = SkipGramNCE(vocab_size, embedding_dim)

hist = my_util.fit_model(skip_net, dataloader, epochs=3)

In [5]:
print(vocab_size, embedding_dim)
skip_embedding = skip_net.embedding.weight.data
knn = my_util.KNN(skip_embedding, words_to_idx)
test_words = ["three", "cat", "city", "player", "king", "queen"]
knn.print_nearest(test_words)


## Exercise 2: Noise Contrastive Estimation for CBOW
This exercise is very similar to Exercise 1, except that we now use NCE for the CBOW model instead of the Skip-Gram model. The difference from Skip-Gram is that instead of using one word to predict the surrounding words, we try to predict the middle word given the surrounding words. Again we use NCE and get a binary classification problem.

The input is now a $(k+1)$-dimensional vector, where $k$ is the size of the context (number of words) around the word we need to predict. The first $k$ values are the indexes of the $k$ context words and the last entry is the index of the target word. The label is again a value in $\{0, 1\}$ indicating whether the data point comes from a real sentence or is a fake, generated one.
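
The sketch below builds such input vectors for the real (non-noise) case from a toy sentence. The function name `cbow_examples`, the symmetric window with $k = 4$, and the sentence are assumptions for this example; the provided CBOW data file may have been built differently.

```python
def cbow_examples(tokens, word_to_idx, half_window=2):
    """Build [context..., target] index vectors with k = 2 * half_window context words."""
    ids = [word_to_idx[t] for t in tokens]
    examples = []
    # Only positions with a full context on both sides are used
    for i in range(half_window, len(ids) - half_window):
        context = ids[i - half_window:i] + ids[i + 1:i + 1 + half_window]
        examples.append(context + [ids[i]])  # the target word goes last
    return examples

tokens = "the quick brown fox jumps over the lazy dog".split()
word_to_idx = {w: i for i, w in enumerate(sorted(set(tokens)))}
examples = cbow_examples(tokens, word_to_idx)
for ex in examples:
    print(ex)
```

A noise sample can then be made by keeping the context and replacing the last entry with a randomly chosen word index.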

The neural net architecture is as follows on a given input $x = [x_0,\dots, x_k]$ (note that the input comes in batches; here we describe a single point):
- Step 1: Compute the embedding vectors for all the input indices
- Step 2: Compute the mean of the embedding vectors for the first $k$ words (all except the last)
- Step 3: Compute the inner product between this mean embedding vector and the embedding of the last word
- Step 4: Return this value
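
As a worked arithmetic example of these steps for a single data point, the plain-Python sketch below uses made-up 2-dimensional vectors in place of learned embeddings. It only illustrates the computation; your PyTorch implementation must work on whole batches of indices.

```python
def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical embeddings for k = 4 context words and one target word
context_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 2.0], [0.0, 4.0]]
target_vector = [2.0, 0.5]

mean_ctx = mean_vector(context_vectors)  # mean of the context embeddings
score = dot(mean_ctx, target_vector)     # value returned for this data point
print(mean_ctx, score)
```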

Implement the architecture in the cell below in CBOW_NCE.
Remember to name the embedding layer embedding (i.e. self.embedding).

In [6]:
class CBOW_NCE(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        """ Construct and save the parameters needed for the model 
        
            The embedding parameters must be stored in self.embedding
        """
    
        super(CBOW_NCE, self).__init__()
        ### YOUR CODE HERE
        ### END CODE

    def forward(self, inputs):
        """ Evaluation of the neural net. The forward pass 
        
            Args:
                inputs: torch.tensor n x (k+1) (each data point is k context word indexes followed by the target word index)
                
            Returns:
                out: torch.tensor shape n,
        """
        out = None
        ### YOUR CODE HERE
        ### END CODE
        return out

# trivial test to see if it computes something
smodel = CBOW_NCE(20, 10)
test_data = torch.tensor([[1, 2, 3, 4, 5], [3, 3, 3, 3, 10]])
res = smodel.forward(test_data)
print(res)

## Exercise 2 continued: Testing on Large Data
To test your CBOW NCE implementation, run the cells below. They also show the nearest words for a few example words, which you can experiment with if you like. You can also change the embedding dimension.

In the following three cells we test your implementation:
- cell 1: load the data (slow, so in a separate cell). You can load a smaller data set for testing if need be (data_size=-1 means all data)
- cell 2: training (slow, so in a separate cell), particularly if you use all the data, many epochs, and a high embedding dimension such as 150 or 200
- cell 3: print some nearest neighbours in embedding space; results should improve with data size and training time

In [7]:
import my_util 
data_path_cbow = './cbow.json'
### YOUR CODE HERE - set the correct data path here
### END CODE
dataset_cbow, dataloader_cbow, words_to_idx_cbow = my_util.load_cbow_data(data_path_cbow, data_size=-1)



In [8]:
print('Index of horse:', words_to_idx_cbow['horse'])
tmp = dataset_cbow[0]
print(tmp)
print(words_to_idx_cbow.inverse[1516])
print('first data point: {0} -> {1}: label {2}'.format([words_to_idx_cbow.inverse[tmp[0][i].item()] for i in range(4)],
                                                      words_to_idx_cbow.inverse[tmp[0][4].item()], tmp[1]))
# Settings
embedding_dim_cbow = 100
vocab_size_cbow = len(words_to_idx_cbow)
cbow_net = CBOW_NCE(vocab_size_cbow, embedding_dim_cbow)

hist = my_util.fit_model(cbow_net, dataloader_cbow, epochs=20)

In [9]:
print(vocab_size_cbow, embedding_dim_cbow)
cbow_embedding = cbow_net.embedding.weight.data
knn = my_util.KNN(cbow_embedding, words_to_idx_cbow)
test_words = ["three", "cat", "city", "player", "king", "queen"]
knn.print_nearest(test_words)

