In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from utils import read_embeddings

# 0. Introduction

## 0.1 Readings

The readings for next week will be:
- [Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) - A well-written introduction to recurrent neural networks by Andrej Karpathy (one of Fei-Fei Li's former students and current head of Tesla Autopilot).
- [Chris Olah: Understanding LSTMs](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) - An introduction to the computations that go on inside an LSTM, written by Chris Olah (Google Brain, and creator of interactive research journal [distill.pub](http://distill.pub)).

A good resource for today's task (besides the assigned Jurafsky readings from last week) is the following tutorial: [Word2Vec Tutorial: The SkipGram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/). Part 2 provides a really good overview of negative sampling!

## 0.2 Task

This week your tasks will be:
1. using off-the-shelf embeddings to reattempt the IMDB classification
2. training a neural network on the Skipgram task to create word embeddings. If you're accurate enough, you can

# 1. Off-the-Shelf Embeddings

The first task is to use off-the-shelf embeddings to reattempt the IMDB classification. The function below reads the embeddings from a text file and returns three things: the vectors as a `np.array`; `word_to_ix`, a `dict` that maps each word to the index of its row in the array; and `ix_to_word`, a list that maps each row index back to its word.

In [2]:
vectors, word_to_ix, ix_to_word = read_embeddings('../data/glove.6B.50d.txt', length=50)

In [3]:
print(word_to_ix['purple'])
print(vectors[word_to_ix['purple'], :])
print(ix_to_word[7644])

7644
[ -5.12440000e-01   1.37550000e+00  -1.02030000e+00  -1.61290000e-01
   4.63910000e-01   5.20210000e-01  -1.25540000e-01  -9.24370000e-01
  -2.89470000e-01  -1.43390000e-01   3.35830000e-01   2.51460000e-01
   1.02190000e+00  -1.28130000e-01  -3.98560000e-01  -7.64740000e-02
  -6.97520000e-01   2.09050000e-01  -9.28610000e-01  -9.80310000e-01
  -1.01630000e+00  -5.03380000e-01   1.10990000e+00  -1.04600000e+00
  -8.72510000e-01  -4.71210000e-01  -8.33200000e-01   1.74180000e+00
   4.39090000e-01  -1.20890000e+00   1.46100000e+00   3.15650000e-01
  -3.03300000e-01   3.27280000e-02  -1.77280000e-01   4.93680000e-01
  -2.78910000e-03  -3.54150000e-01  -2.78760000e-01  -6.22390000e-01
   9.43790000e-02   1.72130000e-03  -6.85390000e-01  -8.15770000e-01
   1.00790000e+00  -2.43800000e-01  -2.20430000e-03  -1.40590000e+00
  -4.30230000e-01  -5.63840000e-01]
purple


In order to featurize all the documents, you'll be averaging the word vectors of all the words in the documents. There are two ways to tackle this problem:

1. an iterative solution, using a `for` loop over documents and then a `for` loop over words, to sum, count, and divide
2. a matrix-based solution (which is much faster)

To arrive at the matrix solution, we want to compute a document matrix $D$ in which every row is the feature vector of one document. So, if we have 25,000 documents and 300-dimensional word embeddings, $D$ is a $25{,}000 \times 300$ matrix. We already have an array of word vectors $W$, which in this case we'll say is $|V| \times 300$. What we're missing is a $25{,}000 \times |V|$ matrix $C$ that counts how often each word appears in each document: the product $CW$ sums the word vectors of every document, and dividing each row by that document's word count turns the sum into an average. What's an efficient way to get that count matrix?
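To make the matrix idea concrete, here is a minimal NumPy sketch on toy data. The names `C`, `W`, and `average_vectors` are assumptions made for illustration (they are not part of `utils`), and the tiny matrices stand in for the real count matrix and GloVe vectors.

```python
import numpy as np

# Toy stand-ins (assumptions): a 4-word vocabulary with 3-dimensional vectors.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# C[d, v] = number of times word v occurs in document d.
C = np.array([[2, 0, 0, 0],    # document 0: word 0 twice
              [1, 1, 0, 0],    # document 1: word 0 and word 1
              [0, 0, 0, 0]])   # document 2: no in-vocabulary words


def average_vectors(C, W):
    """Row d of the result is the mean embedding of the words counted in C[d]."""
    totals = C.sum(axis=1, keepdims=True)
    # Clamp the divisor to 1 so documents without known words stay all-zero.
    return (C @ W) / np.maximum(totals, 1)


D = average_vectors(C, W)
print(D.shape)
```

One matrix product replaces both `for` loops, which is where the speed-up comes from. The clamp on `totals` is a design choice for documents that contain no in-vocabulary words.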

It turns out we already know one! Scikit-learn's `CountVectorizer` returns exactly that! Note that **you'll need to pass in the existing vocabulary so everything works smoothly**, but that's a parameter you can initialize the `CountVectorizer` with!
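As a rough illustration of passing in an existing vocabulary, the sketch below uses a hypothetical four-word `toy_word_to_ix` in place of the real `word_to_ix`. The point is only that the columns of the count matrix line up with the rows of the embedding array.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the real `word_to_ix` returned by read_embeddings.
toy_word_to_ix = {'the': 0, 'movie': 1, 'was': 2, 'great': 3}

docs = ['The movie was great', 'great great movie', 'unknown words only']

# A fixed vocabulary pins column j of the count matrix to row j of the embeddings.
vectorizer = CountVectorizer(vocabulary=toy_word_to_ix)

# No fitting is needed when the vocabulary is supplied up front.
counts = vectorizer.transform(docs)   # sparse matrix, shape (n_docs, |V|)
print(counts.toarray())
```

Words outside the vocabulary are simply ignored. Note that the default `token_pattern` only keeps tokens of two or more alphanumeric characters, so single-character and punctuation entries in the embedding vocabulary will never be counted unless you change it.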

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Your code goes here.

Once you have constructed the document matrix, you can classify each of the rows using softmax regression. You're welcome to use either your previous implementation or the existing scikit-learn implementation.
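If you go the scikit-learn route, the classification step could look roughly like the sketch below. `D_train` and `y_train` are made-up toy values standing in for your document matrix and the IMDB labels; with two classes, softmax regression reduces to ordinary logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy document matrix: 6 documents with 2-dimensional averaged embeddings.
D_train = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5],
                    [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])   # e.g. 0 = negative review, 1 = positive

clf = LogisticRegression()
clf.fit(D_train, y_train)

predictions = clf.predict(D_train)          # hard class labels
probabilities = clf.predict_proba(D_train)  # one probability per class and row
print(predictions)
```

On the real data you would fit on the training documents and report accuracy on held-out documents rather than on the rows you trained on.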

In [5]:
# Up to you if you use the following or not:
#
# from sklearn.linear_model import LogisticRegression

# Your code goes here.

# 2. Skipgrams

The second task is to train a SkipGram model. Take inspiration from the CBoW classifier trained in the slides, but remember the implementation differences in SkipGram.

For one, you generally think about it as having 2 embedding matrices, a word-embedding and a context-embedding. Secondly, you're predicting the context from the words, not the other way around. That part won't change your model code, but it will change how you preprocess the data.
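One possible way to lay out the two embedding matrices is sketched below. This is an illustration under assumptions, not the reference solution: the class name `SkipgramSketch` is hypothetical, and it uses a full softmax over the vocabulary, with the rows of the linear layer's weight matrix playing the role of the context embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipgramSketch(nn.Module):
    """Hypothetical full-softmax skip-gram: scores every context word for a given word."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Word embeddings: one row per vocabulary item.
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Context embeddings: the (vocab_size, embedding_dim) weight of this layer.
        self.context_embeddings = nn.Linear(embedding_dim, vocab_size, bias=False)

    def forward(self, word_ix):
        hidden = self.word_embeddings(word_ix)       # (batch, embedding_dim)
        scores = self.context_embeddings(hidden)     # (batch, vocab_size)
        return F.log_softmax(scores, dim=1)          # log P(context | word)


model = SkipgramSketch(vocab_size=10, embedding_dim=4)
log_probs = model(torch.tensor([1, 5, 7]))
print(log_probs.shape)
```

Because the model returns log-probabilities, it pairs with `nn.NLLLoss`; returning raw scores and using `nn.CrossEntropyLoss` would be an equivalent choice.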

In [6]:
X = '''
it was the best of times it was the worst of times 
it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of credulity
'''.split()

# You'll need to write code that takes in the dataset and creates 
# the training examples from it. Remember, the training examples
# are (w, c) pairs for every word c in the context window of word w.

# For instance, on the training set above, the pairs for w = best and a
# context size of 2 will be:
#
# (best, was)
# (best, the)
# (best, of)
# (best, times)
def make_examples(X, context_size=2):
    # Your code goes here.
    pass
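The windowing described in the comments above can be sketched as follows. The helper name `make_pairs` is hypothetical (kept distinct from `make_examples`, which is still yours to write), and it returns string pairs; you would still need to map the words to integer indices before training.

```python
def make_pairs(tokens, context_size=2):
    """Return (w, c) pairs for every word c within context_size positions of word w."""
    pairs = []
    for i, w in enumerate(tokens):
        # Clip the window at the start and end of the token list.
        lo = max(0, i - context_size)
        hi = min(len(tokens), i + context_size + 1)
        for j in range(lo, hi):
            if j != i:                 # a word is not its own context
                pairs.append((w, tokens[j]))
    return pairs


tokens = 'it was the best of times'.split()
pairs = make_pairs(tokens)
best_pairs = [p for p in pairs if p[0] == 'best']
print(best_pairs)
# [('best', 'was'), ('best', 'the'), ('best', 'of'), ('best', 'times')]
```

Words near the edges of the text get fewer pairs because their windows are clipped.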

In [7]:
class Skipgram(nn.Module):
    """ Your code goes here. """
    pass

Once you've created the model, you'll need to train it. I've provided you with a batch function.

In [8]:
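import numpy as np

def batch_data(X, y, batch_size=8):
    # Shuffle the row indices so each call yields the batches in a new order.
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    count = 0

    while count < X.shape[0]:
        # Select rows through the shuffled indices so the shuffle takes effect.
        batch = indices[count:(count + batch_size)]
        yield X[batch], y[batch]
        count += batch_size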

In [9]:
NUM_EPOCHS = 1000

# Your code goes here.
# Remember to instantiate a model, loss function, and optimizer

for i in range(NUM_EPOCHS):
    # Your code goes here.
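    # You should use the batch_data function, but remember to convert your batches
    # to tensors (e.g. with torch.as_tensor); since PyTorch 0.4, wrapping them in
    # a Variable() is no longer necessary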
    pass
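For orientation, a bare-bones training loop might look like the sketch below. Everything in it is an assumption made for illustration: the toy index tensors, the `nn.Sequential` stand-in for your own Skipgram class, the learning rate, and the use of `torch.randperm` for shuffling instead of the provided `batch_data` function.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy data: integer ids for (word, context) pairs.
words = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])
contexts = torch.tensor([1, 0, 3, 2, 1, 0, 3, 2])

vocab_size, dim, batch_size = 4, 8, 4
model = nn.Sequential(
    nn.Embedding(vocab_size, dim),            # word embeddings
    nn.Linear(dim, vocab_size, bias=False),   # context embeddings
)
loss_fn = nn.CrossEntropyLoss()               # applies log-softmax internally
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(100):
    permutation = torch.randperm(len(words))  # reshuffle every epoch
    epoch_loss = 0.0
    for start in range(0, len(words), batch_size):
        idx = permutation[start:start + batch_size]
        optimizer.zero_grad()                 # clear gradients from the last step
        loss = loss_fn(model(words[idx]), contexts[idx])
        loss.backward()                       # backpropagate
        optimizer.step()                      # update both embedding matrices
        epoch_loss += loss.item()
    losses.append(epoch_loss)

print(losses[0], losses[-1])
```

Tracking the summed loss per epoch is a cheap sanity check: it should trend downwards as training proceeds.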

# Extra Credit: Negative Sampling

Finally, for those of you who want to take on the challenge, I encourage you to attempt to implement negative sampling! Your two options are to write a general-purpose negative sampling loss module, or to wrap the whole thing up into the implementation of Skipgram.
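To give a feel for the second option, here is a hedged sketch that folds the loss into the model. The class name `NegativeSamplingSketch` and its argument names are hypothetical, and the negatives are drawn uniformly purely for brevity; word2vec itself samples them from the unigram distribution raised to the 3/4 power.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NegativeSamplingSketch(nn.Module):
    """Hypothetical skip-gram whose forward pass returns the negative-sampling loss."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, word_ix, context_ix, negative_ix):
        w = self.word_embeddings(word_ix)            # (batch, dim)
        c = self.context_embeddings(context_ix)      # (batch, dim)
        n = self.context_embeddings(negative_ix)     # (batch, k, dim)
        pos_score = (w * c).sum(dim=1)               # (batch,)
        neg_score = torch.bmm(n, w.unsqueeze(2)).squeeze(2)   # (batch, k)
        # Push true pairs towards sigmoid = 1 and sampled pairs towards sigmoid = 0.
        loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=1)
        return loss.mean()


torch.manual_seed(0)
vocab_size, k = 20, 5
model = NegativeSamplingSketch(vocab_size, embedding_dim=8)
word_ix = torch.tensor([1, 2, 3])
context_ix = torch.tensor([4, 5, 6])
# Assumption: uniform negatives, k per training pair.
negative_ix = torch.randint(0, vocab_size, (3, k))

loss = model(word_ix, context_ix, negative_ix)
loss.backward()
print(loss.item())
```

Each training pair now touches only `k + 1` context rows instead of the whole vocabulary, which is what makes negative sampling cheaper than the full softmax.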

The first step will be using a larger dataset to train the vectors on. We'll use the IMDB data.

In [10]:
from utils import read_imdb_data

In [11]:
X, y = read_imdb_data('../data/aclImdb/test')
X = ' '.join(X).split()

We can limit the size of the vocabulary to speed up training as well.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1000)
vectorizer.fit(X)

X = [w for w in X if w in vectorizer.vocabulary_]

Now go through the same process you went through above to generate the training pairs.

In [14]:
class NegativeSamplingSkipgram(nn.Module):
    """ Your code goes here. """
    pass

In [15]:
NUM_EPOCHS = 1000

# Your code goes here.
# Remember to instantiate a model, loss function, and optimizer

for i in range(NUM_EPOCHS):
    # Your code goes here.
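    # You should use the batch_data function, but remember to convert your batches
    # to tensors (e.g. with torch.as_tensor); since PyTorch 0.4, wrapping them in
    # a Variable() is no longer necessary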
    pass