# CS 224D Assignment #2
# Part [2]: Recurrent Neural Networks

This notebook provides starter code, testing snippets, and additional guidance for implementing the Recurrent Neural Network Language Model (RNNLM) described in Part 2 of the handout.

Please complete parts (a), (b), and (c) of Part 2 before beginning this notebook.

In [32]:
import sys, os
from numpy import *
from matplotlib.pyplot import *
%matplotlib inline
matplotlib.rcParams['savefig.dpi'] = 100

%load_ext autoreload
%autoreload 2


## (e): Implement a Recurrent Neural Network Language Model

Follow the instructions on the handout to implement your model in `rnnlm.py`, then use the code below to test.

In [33]:
from rnnlm import RNNLM
# Gradient check on toy data, for speed
random.seed(10)
wv_dummy = random.randn(10,50)
model = RNNLM(L0 = wv_dummy, U0 = wv_dummy,
              alpha=0.005, rseed=10, bptt=20)
model.grad_check(array([1,2,3]), array([2,3,4]))

NOTE: temporarily setting self.bptt = len(y) = 3 to compute true gradient.
grad_check: dJ/dH error norm = 6.67e-10 [ok]
    H dims: [50, 50] = 2500 elem
grad_check: dJ/dU error norm = 1.014e-09 [ok]
    U dims: [10, 50] = 500 elem
grad_check: dJ/dL[3] error norm = 4.208e-10 [ok]
    L[3] dims: [50] = 50 elem
grad_check: dJ/dL[2] error norm = 4.199e-10 [ok]
    L[2] dims: [50] = 50 elem
grad_check: dJ/dL[1] error norm = 4.192e-10 [ok]
    L[1] dims: [50] = 50 elem
Reset self.bptt = 20
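
The check above compares analytic gradients against numerical estimates. As a rough illustration of the idea, here is a minimal central-difference sketch on a toy loss. The helper `numerical_grad` and the toy loss are hypothetical and not part of `rnnlm.py`; the starter's `grad_check` may differ in its details.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-4):
    """Estimate dJ/dtheta by central differences (hypothetical helper)."""
    grad = np.zeros_like(theta)
    for ix in np.ndindex(*theta.shape):
        old = theta[ix]
        theta[ix] = old + eps
        f_plus = f(theta)
        theta[ix] = old - eps
        f_minus = f(theta)
        theta[ix] = old  # restore the original value
        grad[ix] = (f_plus - f_minus) / (2 * eps)
    return grad

# Toy loss J(theta) = 0.5 * sum(theta**2), whose analytic gradient is theta itself.
rng = np.random.RandomState(10)
theta = rng.randn(3, 4)
analytic = theta.copy()
numeric = numerical_grad(lambda t: 0.5 * np.sum(t ** 2), theta)
err = np.linalg.norm(analytic - numeric)  # should be tiny if the gradient is right
```

Each parameter needs two extra loss evaluations, which is why the check is run on toy data here.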


## Prepare Vocabulary and Load PTB Data

We've prepared a list of the words in the Penn Treebank vocabulary, along with their absolute counts and unigram frequencies. The document loader code below will "canonicalize" words and replace any unknown words with a `"UUUNKKK"` token, then convert the data to lists of indices.

In [34]:
from data_utils import utils as du
import pandas as pd

# Load the vocabulary
vocab = pd.read_table("data/lm/vocab.ptb.txt", header=None, sep=r"\s+",
                      index_col=0, names=['count', 'freq'])

# Choose how many top words to keep
vocabsize = 2000
num_to_word = dict(enumerate(vocab.index[:vocabsize]))
word_to_num = du.invert_dict(num_to_word)
##
# Below needed for 'adj_loss': DO NOT CHANGE
fraction_lost = float(sum([vocab['count'][word] for word in vocab.index
                           if (not word in word_to_num) 
                               and (not word == "UUUNKKK")]))
fraction_lost /= sum([vocab['count'][word] for word in vocab.index
                      if (not word == "UUUNKKK")])
print "Retained %d words from %d (%.02f%% of all tokens)" % (vocabsize, len(vocab),
                                                             100*(1-fraction_lost))

Retained 2000 words from 38444 (84.00% of all tokens)
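
To make the unknown-word replacement concrete, here is a minimal sketch of mapping words to indices with a fallback to the unknown token. The helper `to_indices` and the toy dictionary are hypothetical; the real `du.docs_to_indices` also canonicalizes words, which is omitted here because its exact rules are not shown in this notebook.

```python
def to_indices(words, word_to_num, unk="UUUNKKK"):
    """Map words to indices, falling back to the unknown token (hypothetical helper)."""
    unk_id = word_to_num[unk]
    return [word_to_num.get(w, unk_id) for w in words]

# Toy vocabulary for illustration only; the real indices come from word_to_num.
toy_word_to_num = {"UUUNKKK": 0, "the": 1, "traders": 2, "say": 3}
toy_ids = to_indices(["the", "beleaguered", "traders", "say"], toy_word_to_num)
```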


Load the datasets, using the vocabulary in `word_to_num`. Our starter code handles this for you, and also generates lists of lists `X` and `Y`, corresponding to input words and target words*.

*(Of course, the target words are just the input words, shifted by one position, but it can be cleaner and less error-prone to keep them separate.)*

In [35]:
# Load the training set
docs = du.load_dataset('data/lm/ptb-train.txt')
S_train = du.docs_to_indices(docs, word_to_num)
X_train, Y_train = du.seqs_to_lmXY(S_train)

# Load the dev set (for tuning hyperparameters)
docs = du.load_dataset('data/lm/ptb-dev.txt')
S_dev = du.docs_to_indices(docs, word_to_num)
X_dev, Y_dev = du.seqs_to_lmXY(S_dev)

# Load the test set (final evaluation only)
docs = du.load_dataset('data/lm/ptb-test.txt')
S_test = du.docs_to_indices(docs, word_to_num)
X_test, Y_test = du.seqs_to_lmXY(S_test)

# Display some sample data
print " ".join(d[0] for d in docs[7])
print S_test[7]

Big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock , traders say .
[   4  147  169  250 1879    7 1224   64    7    1    3    7  456    1    3
 1024  255   24  378  147    3    6   67    0  255  138    2    5]
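
The one-position shift between inputs and targets can be sketched as follows. The helper `seq_to_lm_xy` is hypothetical and handles a single sequence; the starter's `du.seqs_to_lmXY` works on a list of sequences and may differ in detail.

```python
def seq_to_lm_xy(seq):
    """Split one index sequence into (inputs, targets), offset by one position."""
    return seq[:-1], seq[1:]

# Toy sequence of word indices, for illustration only.
x, y = seq_to_lm_xy([4, 147, 169, 250, 5])
```

The gradient-check call `grad_check(array([1,2,3]), array([2,3,4]))` follows the same pattern: each target is the input that comes next.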


## (f): Train and evaluate your model

Once your implementation passes the gradient check, run the model on some real language!

You should randomly initialize the word vectors as Gaussian noise, i.e. $L_{ij} \sim \mathcal{N}(0, 0.1)$ and $U_{ij} \sim \mathcal{N}(0, 0.1)$; the function `random.randn` may be helpful here.

As in Part 1, you should tune hyperparameters to get a good model.
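
A minimal sketch of the random initialization described above. It assumes that 0.1 is the standard deviation; if the handout means a variance of 0.1, scale by `sqrt(0.1)` instead. The names `rng`, `L0_init`, and `U0_init` are illustrative only.

```python
import numpy as np

vocabsize, hdim = 2000, 100
rng = np.random.RandomState(10)

# Assumption: N(0, 0.1) is read as standard deviation 0.1.
# randn draws from N(0, 1), so scaling by 0.1 gives the desired spread.
L0_init = 0.1 * rng.randn(vocabsize, hdim)
U0_init = 0.1 * rng.randn(vocabsize, hdim)  # drawn separately from L0_init
```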

In [36]:
hdim = 100 # dimension of hidden layer = dimension of word vectors
random.seed(10)
L0 = zeros((vocabsize, hdim)) # replace with random init, 
                              # or do in RNNLM.__init__()
# test parameters; you probably want to change these
model = RNNLM(L0, U0 = L0, alpha=0.05, rseed=10, bptt=3)

# Gradient check is going to take a *long* time here
# since it's quadratic-time in the number of parameters.
# run at your own risk... (but do check this!)
model.grad_check(array([1,2,3]), array([2,3,4]))

NOTE: temporarily setting self.bptt = len(y) = 3 to compute true gradient.
grad_check: dJ/dH error norm = 2.279e-09 [ok]
    H dims: [100, 100] = 10000 elem
grad_check: dJ/dU error norm = 5.83e-09 [ok]
    U dims: [2000, 100] = 200000 elem
grad_check: dJ/dL[3] error norm = 5.518e-10 [ok]
    L[3] dims: [100] = 100 elem
grad_check: dJ/dL[2] error norm = 5.131e-10 [ok]
    L[2] dims: [100] = 100 elem
grad_check: dJ/dL[1] error norm = 7.321e-10 [ok]
    L[1] dims: [100] = 100 elem
Reset self.bptt = 3


In [37]:
#### YOUR CODE HERE ####
# No hyperparameter tuning attempted here; this is plain SGD
model = RNNLM(L0, U0 = L0, alpha=0.05, rseed=10, bptt=3)

##
# Pare down to a smaller dataset, for speed
# (optional - recommended to not do this for your final model)
ntrain = len(Y_train)
X = X_train[:ntrain]
Y = Y_train[:ntrain]

print 'training set size: %d' % ntrain
model.train_sgd(X=X, y=Y, printevery=1000, costevery=10000)

#### END YOUR CODE ####

training set size: 56522
Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 9.00999
  Seen 1000 in 382.69 s
  Seen 2000 in 403.13 s
  Seen 3000 in 424.72 s
  Seen 4000 in 446.17 s
  Seen 5000 in 466.95 s
  Seen 6000 in 487.96 s
  Seen 7000 in 509.72 s
  Seen 8000 in 530.50 s
  Seen 9000 in 551.26 s
  Seen 10000 in 572.53 s
  [10000]: mean loss 4.63868
  Seen 11000 in 961.23 s
  Seen 12000 in 983.33 s
  Seen 13000 in 1004.64 s
  Seen 14000 in 1025.86 s
  Seen 15000 in 1046.55 s
  Seen 16000 in 1066.84 s
  Seen 17000 in 1087.87 s
  Seen 18000 in 1108.80 s
  Seen 19000 in 1128.74 s
  Seen 20000 in 1148.47 s
  [20000]: mean loss 4.38977
  Seen 21000 in 1530.79 s
  Seen 22000 in 1551.61 s
  Seen 23000 in 1572.37 s
  Seen 24000 in 1593.40 s
  Seen 25000 in 1614.09 s
  Seen 26000 in 1634.29 s
  Seen 27000 in 1654.94 s
  Seen 28000 in 1674.44 s
  Seen 29000 in 1694.70 s
  Seen 30000 in 1714.92 s
  [30000]: mean loss 4.2913
  Seen 31000 in 2094.58 s
  Seen 32000 in 2115.44 s
  Seen 33000 in 2135.

[(0, 9.0099876549519617),
 (10000, 4.6386767385681313),
 (20000, 4.3897730782142856),
 (30000, 4.2912980919415933),
 (40000, 4.1988311706922934),
 (50000, 4.1332312755493925),
 (56522, 4.0957669077505159)]

In [38]:
## Evaluate cross-entropy loss on the dev set,
## then convert to perplexity for your writeup
dev_loss = model.compute_mean_loss(X_dev, Y_dev)
print dev_loss

4.14701969121
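
Converting the loss to perplexity is a single exponentiation. This sketch assumes the mean loss uses natural logarithms, as the `exp(dev_loss)` call later in this notebook suggests.

```python
import numpy as np

dev_loss = 4.14701969121        # mean cross-entropy per token, copied from the output above
perplexity = np.exp(dev_loss)   # perplexity = exp(mean cross-entropy)
```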


The model's score is skewed somewhat by the large number of `UUUNKKK` tokens: if these make up 1/6 of the dataset, that is a sizeable fraction that we are just waving our hands at. Naively, the model gets credit for predicting them that it doesn't really deserve; the formula below roughly removes this contribution from the average loss. Don't worry about how it's derived, but do report both scores, since this helps us compare across models with different vocabulary sizes.

In [39]:
## DO NOT CHANGE THIS CELL ##
# Report your numbers, after computing dev_loss above.
def adjust_loss(loss, funk, q, mode='basic'):
    if mode == 'basic':
        # remove freebies only: score if had no UUUNKKK
        return (loss + funk*log(funk))/(1 - funk)
    else:
        # remove freebies, replace with best prediction on remaining
        return loss + funk*log(funk) - funk*log(q)
# q = best unigram frequency from omitted vocab
# this is the best expected loss out of that set
q = vocab.freq[vocabsize] / sum(vocab.freq[vocabsize:])
print "Unadjusted: %.03f" % exp(dev_loss)
print "Adjusted for missing vocab: %.03f" % exp(adjust_loss(dev_loss, fraction_lost, q))

Unadjusted: 63.245
Adjusted for missing vocab: 98.270
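
For reference, the two modes of `adjust_loss` can be written out as below, with $J$ the mean loss, $f$ the fraction of tokens lost to `UUUNKKK` (`funk` in the code), and $q$ the best unigram frequency among the omitted words. The interpretation that follows is one plausible reading, not a derivation taken from the handout.

$$
J_{\text{basic}} = \frac{J + f \log f}{1 - f},
\qquad
J_{\text{other}} = J + f \log f - f \log q
$$

One way to read the basic mode: if a fraction $f$ of tokens are `UUUNKKK` and the model predicts each with probability of about $f$, they contribute roughly $-f \log f$ to $J$. Adding $f \log f$ removes that contribution, and dividing by $1 - f$ renormalizes over the remaining tokens.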


### Save Model Parameters

In [40]:
##
# Save to .npy files; should only be a few MB total
assert(min(model.sparams.L.shape) <= 100) # don't be too big
assert(max(model.sparams.L.shape) <= 5000) # don't be too big
save("rnnlm.L.npy", model.sparams.L)
save("rnnlm.U.npy", model.params.U)
save("rnnlm.H.npy", model.params.H)
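
Saved parameters can be read back with `numpy.load`. The sketch below round-trips a toy array through a temporary file; the array and path are stand-ins rather than the actual model parameters.

```python
import os
import tempfile
import numpy as np

params = np.arange(6.0).reshape(2, 3)   # toy stand-in for model.sparams.L
path = os.path.join(tempfile.mkdtemp(), "toy.L.npy")
np.save(path, params)                   # same call as save(...) under `from numpy import *`
restored = np.load(path)
```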

## (g): Generating Data

Once you've trained your model to your satisfaction, use it to generate some sentences!

Implement the `generate_sequence` function in `rnnlm.py`, and call it below.
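
As a rough guide to the sampling loop, here is a minimal sketch that draws words one at a time until the end token appears. The function `generate` and the callback `next_probs` are hypothetical stand-ins for the RNN forward step, and the transition table is made up. The sketch also assumes that the returned `J` is the summed cross-entropy of the sampled words, which the handout may define differently.

```python
import numpy as np

def generate(next_probs, init, end, maxlen=100, rng=None):
    """Sample word indices until `end` is drawn or `maxlen` is reached.

    next_probs(seq) is a hypothetical stand-in for the RNN forward step:
    it returns a probability vector over the vocabulary given the words so far.
    """
    if rng is None:
        rng = np.random.RandomState(0)
    seq, J = [init], 0.0
    while len(seq) < maxlen:
        p = next_probs(seq)
        w = int(rng.choice(len(p), p=p))  # sample the next word index
        J += -np.log(p[w])                # accumulate cross-entropy of the sample
        seq.append(w)
        if w == end:
            break
    return seq, J

# Toy 3-word vocabulary: 0 = <s>, 1 = </s>, 2 = some word (made-up transitions).
T = np.array([[0.0, 0.2, 0.8],
              [0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5]])
toy_seq, toy_J = generate(lambda s: T[s[-1]], init=0, end=1, maxlen=20)
```

In the real model, the distribution at each step would come from the softmax output given the current hidden state, not from a fixed table.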

In [41]:
def seq_to_words(seq):
    return [num_to_word[s] for s in seq]
    
seq, J = model.generate_sequence(word_to_num["<s>"], 
                                 word_to_num["</s>"], 
                                 maxlen=100)
print J
# print seq
print " ".join(seq_to_words(seq))

selected 1 of 2000 with p=0.109404
selected 254 of 2000 with p=0.000922
selected 925 of 2000 with p=0.000204
selected 346 of 2000 with p=0.002342
selected 863 of 2000 with p=0.000171
selected 14 of 2000 with p=0.212176
selected 403 of 2000 with p=0.000339
selected 9 of 2000 with p=0.310122
selected 431 of 2000 with p=0.011205
selected 1111 of 2000 with p=0.000049
selected 2 of 2000 with p=0.222543
selected 5 of 2000 with p=0.999749
selected 78 of 2000 with p=0.001395
selected 0 of 2000 with p=0.129957
selected 74 of 2000 with p=0.000218
selected 0 of 2000 with p=0.455104
selected 13 of 2000 with p=0.009339
selected 226 of 2000 with p=0.002116
selected 0 of 2000 with p=0.040343
selected 521 of 2000 with p=0.000231
selected 255 of 2000 with p=0.000615
selected 11 of 2000 with p=0.017121
selected 3 of 2000 with p=0.281324
selected 6 of 2000 with p=0.017217
selected 1108 of 2000 with p=0.000125
selected 11 of 2000 with p=0.015899
selected 817 of 2000 with p=0.000459
selected 645 of 2000 wi

**BONUS:** Use the unigram distribution given in the `vocab` table to fill in any `UUUNKKK` tokens in your generated sequences with words that we omitted from the vocabulary. You'll want to use `list(vocab.index)` to get a list of words, and `vocab.freq` to get a list of corresponding frequencies.

In [None]:
# Replace UUUNKKK with a random unigram,
# drawn from vocab that we skipped
from nn.math import MultinomialSampler, multinomial_sample
def fill_unknowns(words):
    #### YOUR CODE HERE ####
    ret = words # do nothing; replace this
    

    #### END YOUR CODE ####
    return ret
    
print " ".join(fill_unknowns(seq_to_words(seq)))
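
One possible approach to the bonus, sketched on toy data: renormalize the unigram frequencies of the omitted words and sample a replacement for each `UUUNKKK`. The function `fill_unknowns_sketch` and the toy lists are hypothetical. In the notebook, the omitted words and frequencies would presumably come from `list(vocab.index)[vocabsize:]` and `vocab.freq[vocabsize:]`, and the provided `multinomial_sample` helper could replace `rng.choice`.

```python
import numpy as np

def fill_unknowns_sketch(words, omitted_words, omitted_freqs,
                         unk="UUUNKKK", rng=None):
    """Replace each unknown token with a word sampled from the omitted vocabulary."""
    if rng is None:
        rng = np.random.RandomState(0)
    p = np.asarray(omitted_freqs, dtype=float)
    p = p / p.sum()  # renormalize over the omitted words only
    return [omitted_words[rng.choice(len(p), p=p)] if w == unk else w
            for w in words]

# Toy stand-ins for the omitted part of the vocabulary (made-up frequencies).
omitted = ["beleaguered", "plate", "blocks"]
freqs = [3e-5, 2e-5, 1e-5]
filled = fill_unknowns_sketch(["the", "UUUNKKK", "traders"], omitted, freqs)
```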