# CS 224D Assignment #2
# Part [2]: Recurrent Neural Networks

This notebook will provide starter code, testing snippets, and additional guidance for implementing the Recurrent Neural Network Language Model (RNNLM) described in Part 2 of the handout.

Please complete parts (a), (b), and (c) of Part 2 before beginning this notebook.

In [2]:
import sys, os
from numpy import *
from matplotlib.pyplot import *
%matplotlib inline
matplotlib.rcParams['savefig.dpi'] = 100

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## (e): Implement a Recurrent Neural Network Language Model

Follow the instructions on the handout to implement your model in `rnnlm.py`, then use the code below to test.

In [20]:
from rnnlm import RNNLM
# Gradient check on toy data, for speed
random.seed(10)
wv_dummy = random.randn(10,50)
model = RNNLM(L0 = wv_dummy, U0 = wv_dummy,
              alpha=0.005, rseed=10, bptt=1)
model.grad_check(array([1,2,3]), array([2,3,4]))

NOTE: temporarily setting self.bptt = len(y) = 3 to compute true gradient.
grad_check: dJ/dH error norm = 6.621e-08 [ok]
    H dims: [50, 50] = 2500 elem
grad_check: dJ/dU error norm = 9.449e-10 [ok]
    U dims: [10, 50] = 500 elem
grad_check: dJ/dL[1] error norm = 1.566e-08 [ok]
    L[1] dims: [50] = 50 elem
grad_check: dJ/dL[2] error norm = 1.598e-08 [ok]
    L[2] dims: [50] = 50 elem
grad_check: dJ/dL[3] error norm = 1.249e-08 [ok]
    L[3] dims: [50] = 50 elem
Reset self.bptt = 1
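
The check above compares analytic gradients against numerical estimates. The `grad_check` implementation in the starter code is not shown here, so the following is only a minimal sketch of the general idea, using a centered difference on a toy objective. The helper name `numerical_grad` and the toy function `f` are hypothetical.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-4):
    """Centered-difference estimate of df/dtheta (hypothetical helper)."""
    grad = np.zeros_like(theta)
    for ix in np.ndindex(*theta.shape):
        old = theta[ix]
        theta[ix] = old + eps
        f_plus = f(theta)
        theta[ix] = old - eps
        f_minus = f(theta)
        theta[ix] = old  # restore the original value
        grad[ix] = (f_plus - f_minus) / (2 * eps)
    return grad

# Toy objective: J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.random.RandomState(10).randn(3, 4)
f = lambda t: 0.5 * np.sum(t ** 2)
analytic = theta.copy()
numeric = numerical_grad(f, theta)
err = np.linalg.norm(analytic - numeric)
print("error norm = %.3e" % err)
```

Each parameter needs two evaluations of the objective, which is why the check is run on toy data here.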


In [15]:
from rnnlm import ExtraCreditRNNLM
random.seed(10)
wv_dummy = random.randn(10, 50)
u0 = random.randn(12, 50)
c_words = array([[0, 1, 2, -1, -1, -1, -1], [3, 4, 5, 6, 7, 8, 9]])
lcluster = array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
args = {"regular": 1e-3,"isME": True, "ngram": 3, "hash_size": 100, "bptt": 3, "alpha": 0.005, "rseed": 10, "isCompression":False, "compression_size": 30, "class_size": 2, "U0": u0, "Lcluster": lcluster, "cfreq": array([3, 7]), "cwords": c_words}

model = ExtraCreditRNNLM(L0=wv_dummy, **args)
model.grad_check(array([1,2,3]), array([2,3,4]))

NOTE: temporarily setting self.bptt = len(y) = 3 to compute true gradient.
grad_check: dJ/dcluster_direct error norm = 1.703e-11 [ok]
    cluster_direct dims: [10, 2] = 20 elem
grad_check: dJ/dword_direct error norm = 3.328e-10 [ok]
    word_direct dims: [10, 10] = 100 elem
grad_check: dJ/dU error norm = 8.248e-10 [ok]
    U dims: [12, 50] = 600 elem
grad_check: dJ/dH error norm = 3.202e-09 [ok]
    H dims: [50, 50] = 2500 elem
grad_check: dJ/dL[1] error norm = 1.535e-09 [ok]
    L[1] dims: [50] = 50 elem
grad_check: dJ/dL[2] error norm = 9.996e-10 [ok]
    L[2] dims: [50] = 50 elem
grad_check: dJ/dL[3] error norm = 1.125e-09 [ok]
    L[3] dims: [50] = 50 elem
Reset self.bptt = 3


## Prepare Vocabulary and Load PTB Data

We've prepared a list of the vocabulary in the Penn Treebank, along with each word's absolute count and unigram frequency. The document loader code below will "canonicalize" words and replace any unknowns with a `"UUUNKKK"` token, then convert the data to lists of indices.
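
The exact canonicalization rules live in `data_utils`, which is not shown in this notebook. As a rough sketch of the idea only, the block below assumes lowercasing and digit masking, then falls back to `"UUUNKKK"` for anything outside the vocabulary. The names `canonicalize`, `words_to_indices`, and `toy_word_to_num` are hypothetical.

```python
import re

UNK = "UUUNKKK"

def canonicalize(word, known):
    """Hypothetical stand-in for the starter code's canonicalization."""
    # Assumed rules: lowercase the word and mask every digit as "DG"
    word = re.sub(r"\d", "DG", word.lower())
    return word if word in known else UNK

def words_to_indices(words, word_to_num):
    """Map each word to its index, using the unknown token as a fallback."""
    return [word_to_num[canonicalize(w, word_to_num)] for w in words]

toy_word_to_num = {"UUUNKKK": 0, "the": 1, "plate": 2, "DGDG": 3}
idx = words_to_indices(["The", "beleaguered", "plate", "42"], toy_word_to_num)
print(idx)
```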

In [27]:
from data_utils import utils as du
import pandas as pd

# Load the vocabulary
vocab = pd.read_table("data/lm/vocab.ptb.txt", header=None, sep="\s+",
                     index_col=0, names=['count', 'freq'], )

# Choose how many top words to keep            
vocabsize = 2000
num_to_word = dict(enumerate(vocab.index[:vocabsize]))
word_to_num = du.invert_dict(num_to_word)
##
# Below needed for 'adj_loss': DO NOT CHANGE
fraction_lost = float(sum([vocab['count'][word] for word in vocab.index
                           if (not word in word_to_num) 
                               and (not word == "UUUNKKK")]))
fraction_lost /= sum([vocab['count'][word] for word in vocab.index
                      if (not word == "UUUNKKK")])
print "Retained %d words from %d (%.02f%% of all tokens)" % (vocabsize, len(vocab),
                                                             100*(1-fraction_lost))
def getVocabClass(csize = 10, v=vocab, vs=vocabsize):
    # Assign the top vs words to csize classes of roughly equal unigram mass.
    # Class ids never decrease with the word index, so each class is a
    # contiguous range of indices.
    counts = v['count'][:vs]
    freq_sum = float(counts.sum())
    freq_percent = 0.0
    start = 0
    lcluster = []          # class id of each word
    freq_c = zeros(csize)  # number of words in each class
    for i in xrange(vs):
        freq_percent += float(counts.iloc[i]) / freq_sum
        if freq_percent > 1.0:
            freq_percent = 1.0
        lcluster.append(start)
        freq_c[start] += 1
        # Move on once the current class has reached its share of the mass
        if freq_percent > (start + 1.0) / csize and start < csize-1:
            start += 1
    
    # Row i lists the word indices in class i; unused slots stay 0
    freq_words = zeros((csize, int(amax(freq_c)+1)))
    start = 0
    for i in xrange(csize):
        for j in xrange(int(freq_c[i])):
            freq_words[i,j] = start
            start += 1 
    print len(lcluster)
    return array(lcluster).astype(int), freq_c.astype(int), freq_words.astype(int)

lcluster, freq_c, freq_words = getVocabClass()



Retained 2000 words from 38444 (84.00% of all tokens)
2000


Load the datasets, using the vocabulary in `word_to_num`. Our starter code handles this for you, and also generates lists of lists `X` and `Y`, corresponding to input words and target words.

*(Of course, the target words are just the input words, shifted by one position, but it can be cleaner and less error-prone to keep them separate.)*
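
To make the shift concrete, here is a minimal sketch on a toy index sequence. It only illustrates the relationship between inputs and targets; the function name `seq_to_xy` is hypothetical and is not the starter code's `seqs_to_lmXY`.

```python
def seq_to_xy(seq):
    """Split one index sequence into inputs and next-word targets."""
    return seq[:-1], seq[1:]

seq = [4, 147, 169, 250, 5]  # toy sequence of word indices
x, y = seq_to_xy(seq)
print(x)
print(y)
```

At every position `t`, `y[t]` is the word that follows `x[t]`, so the two lists always have the same length.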

In [28]:
# Load the training set
docs = du.load_dataset('data/lm/ptb-train.txt')
S_train = du.docs_to_indices(docs, word_to_num)
X_train, Y_train = du.seqs_to_lmXY(S_train)

# Load the dev set (for tuning hyperparameters)
docs = du.load_dataset('data/lm/ptb-dev.txt')
S_dev = du.docs_to_indices(docs, word_to_num)
X_dev, Y_dev = du.seqs_to_lmXY(S_dev)

# Load the test set (final evaluation only)
docs = du.load_dataset('data/lm/ptb-test.txt')
S_test = du.docs_to_indices(docs, word_to_num)
X_test, Y_test = du.seqs_to_lmXY(S_test)

# Display some sample data
print " ".join(d[0] for d in docs[7])
print S_test[7]

Big investment banks refused to step up to the plate to support the beleaguered floor traders by buying big blocks of stock , traders say .
[   4  147  169  250 1879    7 1224   64    7    1    3    7  456    1    3
 1024  255   24  378  147    3    6   67    0  255  138    2    5]


## (f): Train and evaluate your model

Once your model passes the gradient check, let's run it on some real language!

You should randomly initialize the word vectors as Gaussian noise, i.e. $L_{ij} \sim \mathcal{N}(0, 0.1)$ and $U_{ij} \sim \mathcal{N}(0, 0.1)$; the function `random.randn` may be helpful here.
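
A minimal sketch of this initialization follows. It assumes that 0.1 is the standard deviation, which matches the `sigma = 0.1` used in this notebook's code; if the handout intends a variance of 0.1, the scale would be `sqrt(0.1)` instead. The sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(10)
vocabsize, hdim = 2000, 50   # toy sizes for illustration
sigma, mu = 0.1, 0.0         # assumed: 0.1 is the standard deviation

# randn draws from N(0, 1); scale and shift to get N(mu, sigma^2)
L0 = mu + sigma * rng.randn(vocabsize, hdim)
U0 = mu + sigma * rng.randn(vocabsize, hdim)
print(L0.shape)
print(round(float(L0.std()), 3))
```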

As in Part 1, you should tune hyperparameters to get a good model.

In [26]:
hdim = 70 # dimension of hidden layer = dimension of word vectors
cdim = 30
csize = 10
random.seed(10)
L = zeros((vocabsize, hdim)) # replace with random init, 
                              # or do in RNNLM.__init__()

# test parameters; you probably want to change these
sigma = 0.1
mu = 0
L = random.randn(vocabsize, hdim)*sigma + mu
U0 = random.randn(vocabsize+csize, hdim)*sigma + mu 

args = {"regular":1e-4,"isME": True, "bptt": 5, "alpha": 0.1, "rseed": 10, "isCompression":False, "compression_size": cdim, "class_size": csize, "U0": U0, "Lcluster": lcluster, "cfreq": freq_c, "cwords": freq_words}
#print model.cdim
model = ExtraCreditRNNLM(L0=L, **args)
#print model.hsize

# Gradient check is going to take a *long* time here
# since it's quadratic-time in the number of parameters.
# run at your own risk... (but do check this!)
model.grad_check(array([1,2,3]), array([2,3,4]))

NOTE: temporarily setting self.bptt = len(y) = 3 to compute true gradient.
grad_check: dJ/dcluster_direct error norm = 6.5e-10 [ok]
    cluster_direct dims: [200, 10] = 2000 elem
grad_check: dJ/dword_direct error norm = 3.257e-10 [ok]
    word_direct dims: [200, 200] = 40000 elem
grad_check: dJ/dU error norm = 1.359e-09 [ok]
    U dims: [210, 70] = 14700 elem
grad_check: dJ/dH error norm = 5.141e-10 [ok]
    H dims: [70, 70] = 4900 elem
grad_check: dJ/dL[1] error norm = 1.6e-10 [ok]
    L[1] dims: [70] = 70 elem
grad_check: dJ/dL[2] error norm = 1.743e-10 [ok]
    L[2] dims: [70] = 70 elem
grad_check: dJ/dL[3] error norm = 1.996e-10 [ok]
    L[3] dims: [70] = 70 elem
Reset self.bptt = 5

200


In [29]:
#### YOUR CODE HERE ####

##
# Pare down to a smaller dataset, for speed
# (optional - recommended to not do this for your final model)
ntrain = len(Y_train)
X = X_train[:ntrain]
Y = Y_train[:ntrain]

sequence = range(ntrain)
def minibatches(k = 5, seq=sequence):
    num_batches = len(seq) / k
    for i in xrange(num_batches):
        yield random.choice(seq, k)

def decreaselr(start=0.1, number=10000):
    for i in xrange(number):
        yield start * (1 - float(i)/float(number))
#ten percent of data for tuning
ntrain_small = len(Y_train) / 10
X_small = X_train[:ntrain_small]
Y_small = Y_train[:ntrain_small]

print ntrain_small / 5
#model_output = model.train_sgd(X_small, Y_small, idxiter=minibatches(5, range(ntrain_small)), printevery= 300, costevery=1000)
#L0 = random.randn(vocabsize, 100)*0.1 + 0
#model = RNNLM(L0, U0 = L0, alpha=0.05, rseed=10, bptt=1)

#model_output = model.train_sgd(X_small, Y_small, idxiter=minibatches(5, range(ntrain_small)), printevery= 200, costevery=500)
#model_test = model.train_sgd(X_small, Y_small, idxiter=(5, range(ntrain_small)), printevery=200, costevery=500)


#### END YOUR CODE ####

1130


In [28]:
model_output = model.train_sgd(X, Y, idxiter=minibatches(5, range(ntrain)), printevery=500, costevery=1000)
train_loss = model.compute_mean_loss(X, Y)
dev_loss = model.compute_mean_loss(X_dev, Y_dev)

Begin SGD...
  Seen 0 in 0.01 s
  [0]: mean loss 5.43369
  Seen 500 in 251.49 s
  Seen 1000 in 294.29 s
  [1000]: mean loss 50.9193
  Seen 1500 in 538.14 s
  Seen 2000 in 580.21 s
SGD Interrupted: saw 2000 examples in 772.82 seconds.


In [22]:
bptts = range(1, 6)
optim_bptt = 5
loss_best = 10
first = True
print "beginning tuning the proper bptt"
for i in bptts:
    model = RNNLM(L0, U0 = L0, alpha=0.1, rseed=10, bptt=i)
    model_output = model.train_sgd(X_small, Y_small, idxiter=minibatches(5, range(ntrain_small)), printevery=200, costevery=500)
    
    if first:
        # The first run is the best seen so far, so record its bptt too
        loss_best = model_output
        optim_bptt = i
        first = False
    else:
        if loss_best > model_output:
            loss_best = model_output
            optim_bptt = i
print "Ending tuning the proper bptt"
    
    

beginning tuning the proper bptt
Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 7.6893
  Seen 200 in 83.23 s
  Seen 400 in 123.33 s
  [500]: mean loss 5.0037
  Seen 600 in 206.51 s
  Seen 800 in 247.07 s
  Seen 1000 in 287.79 s
  [1000]: mean loss 4.7962
  [1130]: mean loss 4.73605
SGD complete: 1130 examples in 403.91 seconds.
Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 7.70895
  Seen 200 in 88.01 s
  Seen 400 in 131.37 s
  [500]: mean loss 6.42164
  Seen 600 in 217.84 s
  Seen 800 in 259.28 s
  Seen 1000 in 302.29 s
  [1000]: mean loss 5.14403
  [1130]: mean loss 4.91755
SGD complete: 1130 examples in 417.05 seconds.
Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 7.66984
  Seen 200 in 86.46 s
  Seen 400 in 129.71 s
  [500]: mean loss 5.00425
  Seen 600 in 213.09 s
  Seen 800 in 254.63 s
  Seen 1000 in 296.11 s
  [1000]: mean loss 4.82172
  [1130]: mean loss 4.82933
SGD complete: 1130 examples in 408.33 seconds.
Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 7.6686
  Seen 20

In [17]:
model_output = model.train_sgd(X, Y, idxiter=minibatches(5, range(ntrain)), printevery=1000, costevery=5000)
train_loss = model.compute_mean_loss(X, Y)
dev_loss = model.compute_mean_loss(X_dev, Y_dev)

Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 5.49968
  Seen 1000 in 323.15 s
  Seen 2000 in 472.09 s
  Seen 3000 in 618.96 s
  Seen 4000 in 765.11 s
  Seen 5000 in 909.97 s
  [5000]: mean loss 5.43254


KeyboardInterrupt: 

In [13]:
def adjust_loss(loss, funk, q, mode='basic'):
    if mode == 'basic':
        # remove freebies only: score if had no UUUNKKK
        return (loss + funk*log(funk))/(1 - funk)
    else:
        # remove freebies, replace with best prediction on remaining
        return loss + funk*log(funk) - funk*log(q)
# q = best unigram frequency from omitted vocab
# this is the best expected loss out of that set
q = vocab.freq[vocabsize] / sum(vocab.freq[vocabsize:])
print "Unadjusted: %.03f" % exp(dev_loss)
print "Adjusted for missing vocab: %.03f" % exp(adjust_loss(dev_loss, fraction_lost, q, mode="basic"))

Unadjusted: 71.584
Adjusted for missing vocab: 113.881


In [33]:
hdim = 80 # dimension of hidden layer = dimension of word vectors
csize = 10
random.seed(10)
L = zeros((vocabsize, hdim)) # replace with random init, 
                              # or do in RNNLM.__init__()

# test parameters; you probably want to change these
sigma = 0.1
mu = 0
L = random.randn(vocabsize, hdim)*sigma + mu
U0 = random.randn(vocabsize+csize, hdim)*sigma + mu 

args = {"isME": True,"bptt": 5, "alpha": 0.1, "rseed": 10, "isCompression":False, "class_size": csize, "U0": U0, "Lcluster": lcluster, "cfreq": freq_c, "cwords": freq_words}

model = ExtraCreditRNNLM(L0=L, **args)

model_output = model.train_sgd(X, Y, idxiter=minibatches(5, range(ntrain)), printevery=1000, costevery=5000)
train_loss = model.compute_mean_ppl(X, Y)
dev_loss = model.compute_mean_ppl(X_dev, Y_dev)

def adjust_loss(loss, funk, q, mode='basic'):
    if mode == 'basic':
        # remove freebies only: score if had no UUUNKKK
        return (loss + funk*log(funk))/(1 - funk)
    else:
        # remove freebies, replace with best prediction on remaining
        return loss + funk*log(funk) - funk*log(q)
# q = best unigram frequency from omitted vocab
# this is the best expected loss out of that set
q = vocab.freq[vocabsize] / sum(vocab.freq[vocabsize:])
print "Unadjusted: %.03f" % exp(dev_loss)
print "Adjusted for missing vocab: %.03f" % exp(adjust_loss(dev_loss, fraction_lost, q, mode="basic"))

Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 5.59847
  Seen 1000 in 295.52 s
  Seen 2000 in 391.79 s
  Seen 3000 in 488.52 s
  Seen 4000 in 585.40 s
  Seen 5000 in 681.73 s
  [5000]: mean loss 4.29344
  Seen 6000 in 976.79 s
  Seen 7000 in 1073.25 s
  Seen 8000 in 1169.72 s
  Seen 9000 in 1266.09 s
  Seen 10000 in 1362.60 s
  [10000]: mean loss 4.12286
  Seen 11000 in 1660.99 s
  [11304]: mean loss 4.07628
SGD complete: 11304 examples in 1897.84 seconds.
Unadjusted: 61.800
Adjusted for missing vocab: 95.603


In [34]:
hdim = 80 # dimension of hidden layer = dimension of word vectors
cdim = 40
csize = 10
random.seed(10)
L = zeros((vocabsize, hdim)) # replace with random init, 
                              # or do in RNNLM.__init__()

# test parameters; you probably want to change these
sigma = 0.1
mu = 0
L = random.randn(vocabsize, hdim)*sigma + mu
U0 = random.randn(vocabsize+csize, cdim)*sigma + mu 

args = {"bptt": 5, "alpha": 0.1, "rseed": 30, "isCompression":True, "compression_size": cdim, "class_size": csize, "U0": U0, "Lcluster": lcluster, "cfreq": freq_c, "cwords": freq_words} 

#print "Finishing args setting"
model = ExtraCreditRNNLM(L0=L, **args)
#print "Starting to train model"
model_output = model.train_sgd(X, Y, idxiter=minibatches(5, range(ntrain)), printevery=1000, costevery=5000)
train_loss = model.compute_mean_ppl(X, Y)
dev_loss = model.compute_mean_ppl(X_dev, Y_dev)

def adjust_loss(loss, funk, q, mode='basic'):
    if mode == 'basic':
        # remove freebies only: score if had no UUUNKKK
        return (loss + funk*log(funk))/(1 - funk)
    else:
        # remove freebies, replace with best prediction on remaining
        return loss + funk*log(funk) - funk*log(q)
# q = best unigram frequency from omitted vocab
# this is the best expected loss out of that set
q = vocab.freq[vocabsize] / sum(vocab.freq[vocabsize:])
print "Unadjusted: %.03f" % exp(dev_loss)
print "Adjusted for missing vocab: %.03f" % exp(adjust_loss(dev_loss, fraction_lost, q, mode="basic"))

Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 5.50063
  Seen 1000 in 233.34 s
  Seen 2000 in 297.53 s
  Seen 3000 in 360.81 s
  Seen 4000 in 426.96 s
  Seen 5000 in 491.42 s
  [5000]: mean loss 5.43527
  Seen 6000 in 722.58 s
  Seen 7000 in 787.95 s
  Seen 8000 in 851.20 s
  Seen 9000 in 915.36 s
  Seen 10000 in 979.49 s
  [10000]: mean loss 5.29684
  Seen 11000 in 1207.73 s
  [11304]: mean loss 5.46928
SGD complete: 11304 examples in 1391.36 seconds.
Unadjusted: 236.207
Adjusted for missing vocab: 471.697


In [9]:
## Evaluate cross-entropy loss on the dev set,
## then convert to perplexity for your writeup
optim_dim = 100
optim_bptt = 5
L0 = random.randn(vocabsize, optim_dim) * sigma + mu
model = RNNLM(L0, U0 = L0, alpha=0.1, rseed=10, bptt=optim_bptt)
model_output = model.train_sgd(X, Y, idxiter=minibatches(5, range(ntrain)), printevery=1000, costevery=5000)
dev_loss = model.compute_mean_loss(X_dev, Y_dev)

Begin SGD...
  Seen 0 in 0.00 s
  [0]: mean loss 7.5281
  Seen 1000 in 659.80 s
  Seen 2000 in 889.35 s
  Seen 3000 in 1118.35 s
  Seen 4000 in 1343.70 s
  Seen 5000 in 1570.45 s
  [5000]: mean loss 4.42817
  Seen 6000 in 2227.53 s
  Seen 7000 in 2454.10 s
  Seen 8000 in 2684.65 s
  Seen 9000 in 2911.24 s
  Seen 10000 in 3141.05 s
  [10000]: mean loss 4.27154
  Seen 11000 in 3800.62 s
  [11304]: mean loss 4.2159
SGD complete: 11304 examples in 4296.65 seconds.


The performance of the model is skewed somewhat by the large number of `UUUNKKK` tokens; if these are 1/6 of the dataset, then that's a sizeable fraction that we're just waving our hands at. Naively, our model gets credit for these that's not really deserved; the formula below roughly removes this contribution from the average loss. Don't worry about how it's derived, but do report both scores - it helps us compare across models with different vocabulary sizes.
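
As a rough worked example of the `'basic'` adjustment defined in the `adjust_loss` cells of this notebook, the sketch below starts from a perplexity of 68.019 and assumes that about 16% of tokens fall outside the vocabulary (the vocabulary cell reports 84.00% retained). The value 0.16 is a rounded assumption, not the exact `fraction_lost`.

```python
import numpy as np

def adjust_loss_basic(loss, funk):
    # Same expression as the 'basic' branch of adjust_loss
    return (loss + funk * np.log(funk)) / (1 - funk)

dev_loss = np.log(68.019)  # cross-entropy for a perplexity of 68.019
funk = 0.16                # assumed fraction of out-of-vocabulary tokens
adjusted_ppl = np.exp(adjust_loss_basic(dev_loss, funk))
print("Unadjusted: %.3f" % np.exp(dev_loss))
print("Adjusted for missing vocab: %.3f" % adjusted_ppl)
```

Since `funk*log(funk)` is negative, the numerator shrinks slightly, but dividing by `1 - funk` spreads the remaining loss over fewer tokens, so the adjusted perplexity comes out higher than the unadjusted one.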

In [22]:
## DO NOT CHANGE THIS CELL ##
# Report your numbers, after computing dev_loss above.
#without compression layer
def adjust_loss(loss, funk, q, mode='basic'):
    if mode == 'basic':
        # remove freebies only: score if had no UUUNKKK
        return (loss + funk*log(funk))/(1 - funk)
    else:
        # remove freebies, replace with best prediction on remaining
        return loss + funk*log(funk) - funk*log(q)
# q = best unigram frequency from omitted vocab
# this is the best expected loss out of that set
q = vocab.freq[vocabsize] / sum(vocab.freq[vocabsize:])
print "Unadjusted: %.03f" % exp(dev_loss)
print "Adjusted for missing vocab: %.03f" % exp(adjust_loss(dev_loss, fraction_lost, q, mode="basic"))

Unadjusted: 68.019
Adjusted for missing vocab: 107.163


In [27]:
## DO NOT CHANGE THIS CELL ##
# Report your numbers, after computing dev_loss above.
#with compression layer, csize=10, vocab = 2000, cdim=50
def adjust_loss(loss, funk, q, mode='basic'):
    if mode == 'basic':
        # remove freebies only: score if had no UUUNKKK
        return (loss + funk*log(funk))/(1 - funk)
    else:
        # remove freebies, replace with best prediction on remaining
        return loss + funk*log(funk) - funk*log(q)
# q = best unigram frequency from omitted vocab
# this is the best expected loss out of that set
q = vocab.freq[vocabsize] / sum(vocab.freq[vocabsize:])
print "Unadjusted: %.03f" % exp(dev_loss)
print "Adjusted for missing vocab: %.03f" % exp(adjust_loss(dev_loss, fraction_lost, q, mode="basic"))

Unadjusted: 58.317
Adjusted for missing vocab: 89.224


### Save Model Parameters

In [11]:
##
# Save to .npy files; should only be a few MB total
assert(min(model.sparams.L.shape) <= 100) # don't be too big
assert(max(model.sparams.L.shape) <= 5000) # don't be too big
save("rnnlm.L2.npy", model.sparams.L)
save("rnnlm.U2.npy", model.params.U)
save("rnnlm.H2.npy", model.params.H)
save("rnnlm.C2.npy", model.params.C)

## (g): Generating Data

Once you've trained your model to satisfaction, let's use it to generate some sentences!

Implement the `generate_sequence` function in `rnnlm.py`, and call it below.
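
The sketch below shows one way such a sampler could work. It is not the reference solution: it assumes a simple RNN with hidden state `h = sigmoid(H h_prev + L[x])` and output distribution `softmax(U h)`, uses toy random parameters, and is written as a standalone function rather than a method of `RNNLM`. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generate_sequence(L, H, U, init, end, maxlen=100, rng=np.random):
    """Sample word indices until `end` is drawn or `maxlen` is reached."""
    seq = [init]
    J = 0.0                      # total cross-entropy of the sampled words
    h = np.zeros(H.shape[0])     # initial hidden state
    while len(seq) < maxlen:
        h = sigmoid(H.dot(h) + L[seq[-1]])
        p = softmax(U.dot(h))
        w = int(rng.choice(len(p), p=p))  # draw the next word from p
        J += -np.log(p[w])
        seq.append(w)
        if w == end:
            break
    return seq, J

rng = np.random.RandomState(10)
V, D = 6, 4                      # toy vocabulary and hidden sizes
L = 0.1 * rng.randn(V, D)
H = 0.1 * rng.randn(D, D)
U = 0.1 * rng.randn(V, D)
seq, J = generate_sequence(L, H, U, init=0, end=1, maxlen=20, rng=rng)
print(seq)
print(J)
```

In this sketch `maxlen` counts the start token, and `J` accumulates the negative log-probability of each sampled word.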

In [15]:
def seq_to_words(seq):
    return [num_to_word[s] for s in seq]
    
seq, J = model.generate_sequence(word_to_num["<s>"], 
                                 word_to_num["</s>"], 
                                 maxlen=100)
print J
# print seq
print " ".join(seq_to_words(seq))

117.591262186
<s> the mortgage regulatory UUUNKKK no institute in the japan ) acquisition would be buying UUUNKKK -- america on UUUNKKK 's UUUNKKK , '' he added . </s>


**BONUS:** Use the unigram distribution given in the `vocab` table to fill in any `UUUNKKK` tokens in your generated sequences with words that we omitted from the vocabulary. You'll want to use `list(vocab.index)` to get a list of words, and `vocab.freq` to get a list of corresponding frequencies.

In [21]:
# Replace UUUNKKK with a random unigram,
# drawn from vocab that we skipped
from nn.math import MultinomialSampler, multinomial_sample
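def fill_unknowns(words):
    #### YOUR CODE HERE ####
    # Unigram distribution over only the words we dropped from the vocabulary
    skipped_words = list(vocab.index[vocabsize:])
    skipped_freq = array(vocab.freq[vocabsize:])
    unigram = skipped_freq / float(sum(skipped_freq))
    ret = list(words) # copy, so the input list is not modified
    for i in xrange(len(ret)):
        if ret[i] == "UUUNKKK":
            rep = multinomial_sample(unigram)
            ret[i] = skipped_words[rep]

    #### END YOUR CODE ####
    return ret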
    
print " ".join(fill_unknowns(seq_to_words(seq)))

<s> the mortgage regulatory reform no institute in the japan ) acquisition would be buying DGDGDGDG -- america on damage 's funding , '' he added . </s>
