# Assignment 4: Recurrent Neural Network Language Model

This is the "working notebook", with skeleton code to load and train your model, as well as run unit tests. See [rnnlm-instructions.ipynb](rnnlm-instructions.ipynb) for the main writeup.

Run the cell below to import packages.

In [1]:
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division

import json, os, re, shutil, sys, time
from importlib import reload
import collections, itertools
import unittest
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import tensorflow as tf
assert(tf.__version__.startswith("1."))

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz

# Your code
import rnnlm; reload(rnnlm)
import rnnlm_test; reload(rnnlm_test)

import pandas as pd
from glob import glob
import csv

## (a) RNNLM Inputs and Parameters  

### Questions for Part (a)
You should use big-O notation when appropriate (e.g., computing $\exp(\mathbf{v})$ for a vector $\mathbf{v}$ of length $n$ is $O(n)$ operations).  Assume for problems a(1-5) that:   
> -- the cell is one layer,  
> -- the embedding feature length and hidden-layer feature length are both H, and   
> -- the batch and max_time dimensions can be ignored for now.  

1. Let $\text{CellFunc}$ be a simple RNN cell (see async Section 5.8). Write the cell equation in terms of nonlinearities and matrix multiplication. How many parameters (matrix or vector elements) are there for this cell, in terms of `V` and `H`?
<p>
2. How many parameters are in the embedding layer? In the output layer? (By parameters, we mean the total number of matrix elements across all trainable tensors. An $m \times n$ matrix has $mn$ elements.)
<p>
3. How many calculations (floating point operations) are required to compute $\hat{P}(w^{(i+1)})$ for a given *single* target word $w^{(i+1)}$, assuming $w^{(i)}$ given and $h^{(i-1)}$ already computed? How about for *all* target words?
<p>
4. How does your answer to 3. change if we approximate $\hat{P}(w^{(i+1)})$ with a sampled softmax with $k$ samples? How about if we use a hierarchical softmax? (*Recall that hierarchical softmax makes a series of left/right decisions using a binary classifier $P_s(\text{right}) = \sigma(u_s \cdot o^{(i)} + b_s)$ at each split $s$ in the tree.*)
<p>
5. If you have an LSTM with $H = 200$ and use sampled softmax with $k = 100$, what part of the network takes up the most computation time during training? (*Choose "embedding layer", "recurrent layer", or "output layer"*.)

Note: for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times l}$, computing the matrix product $AB$ takes $O(mnl)$ time.
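To make the $O(mnl)$ cost concrete, here is a minimal sketch that counts the multiply-add operations in a naive matrix product. The helper name `count_matmul_ops` is purely illustrative and is not part of the assignment code.

```python
def count_matmul_ops(m, n, l):
    """Count multiply-adds performed by a naive (m x n) @ (n x l) product."""
    ops = 0
    for _ in range(m):      # each row of A
        for _ in range(l):  # each column of B
            ops += n        # one dot product of length n
    return ops

print(count_matmul_ops(2, 3, 4))  # 24 multiply-adds
```

The count is exactly $m \cdot n \cdot l$, which is where the big-O bound comes from.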

### Answers for Part (a)

####  [Use the Google forms to answer a1) - a5)!](https://docs.google.com/forms/d/e/1FAIpQLSc7kpuOzErVE_H0vMfsDvFcBHz9dkwXGSzqHUGa70QsTIC1ow/viewform)

## (b) Implementing the RNNLM

### Aside: Shapes Review

Before we start, let's review our understanding of the shapes involved in this assignment and how they manifest themselves in the TF API.

As in the [instructions](rnnlm-instructions.ipynb) notebook, $w$ is a matrix of wordids with shape batch_size x max_time.  Passing this through the embedding layer, we retrieve the word embedding for each, resulting in $x$ having shape batch_size x max_time x embedding_dim.  I find it useful to draw this out on a piece of paper.  When you do, you should end up with a rectangular prism with batch_size height, max_time width and embedding_dim depth.  Many tensors in this assignment share this shape (e.g. $o$, the output from the LSTM, which represents the hidden layer going into the softmax to make a prediction at every time step in every batch).

Since batch size and sentence length are only resolved when we run the graph, we construct the placeholder using "None" in the dimensions we don't know.  The .shape property renders these as ?s.  This should be familiar to you from batch size handling in earlier assignments, only now there are two dimensions of variable length.

See the next cell for a concrete example (though in practice, we'd use a TensorFlow variable that we can train for the embeddings rather than a static array).  Notice how the shape of x_val matches the shape described earlier in this cell.

In [2]:
tf.reset_default_graph()

wordid_ph = tf.placeholder(tf.int32, shape=[None, None])
embedding_matrix = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
x = tf.nn.embedding_lookup(embedding_matrix, wordid_ph)

print('wordid placeholder shape:', wordid_ph.shape)
print('x shape:', x.shape)

sess = tf.Session()
# Two sentences, each with four words.
wordids = [[1, 2, 1, 2], [0, 0, 0, 0]]
x_val = sess.run(x, feed_dict={wordid_ph: wordids})
print('Embeddings shape:', x_val.shape)  # 2 sentences, 4 words, 3-dim embeddings
print('Embeddings value:\n', x_val)

wordid placeholder shape: (?, ?)
x shape: (?, ?, 3)
Embeddings shape: (2, 4, 3)
Embeddings value:
 [[[2 2 2]
  [3 3 3]
  [2 2 2]
  [3 3 3]]

 [[1 1 1]
  [1 1 1]
  [1 1 1]
  [1 1 1]]]
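For intuition, the same lookup can be sketched in plain NumPy, where integer-array indexing plays the role of `tf.nn.embedding_lookup`. This is only an illustration of the shapes involved, not how the TensorFlow op is implemented.

```python
import numpy as np

embedding_matrix = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
wordids = np.array([[1, 2, 1, 2], [0, 0, 0, 0]])

# Integer-array indexing picks one row of the matrix per word id.
x_val = embedding_matrix[wordids]
print(x_val.shape)  # (2, 4, 3): batch_size x max_time x embedding_dim
```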


### Implementing the RNNLM

In order to better manage the model parameters, we'll implement our RNNLM in the `RNNLM` class in `rnnlm.py`. We've given you a skeleton of starter code for this, but the bulk of the implementation is left to you.

In [3]:
reload(rnnlm)

TF_GRAPHDIR = "/tmp/artificial_hotel_reviews/a4_graph"

# Clear old log directory.
shutil.rmtree(TF_GRAPHDIR, ignore_errors=True)

lm = rnnlm.RNNLM(V=10000, H=200, num_layers=2)
lm.BuildCoreGraph()
lm.BuildTrainGraph()
lm.BuildSamplerGraph()

summary_writer = tf.summary.FileWriter(TF_GRAPHDIR, lm.graph)

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.



The code above will load your implementation, construct the graph, and write a logdir for TensorBoard. You can bring up TensorBoard with:
```
cd assignment/a4
tensorboard --logdir /tmp/artificial_hotel_reviews/a4_graph --port 6006
```
As usual, check http://localhost:6006/ and visit the "Graphs" tab to inspect your implementation. Remember, judicious use of `tf.name_scope()` and/or `tf.variable_scope()` will greatly improve the visualization, and make code easier to debug.

In [4]:
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()

with tf.Session(graph=lm.graph) as session:
    session.run(initializer)
    variables_names = [v.name for v in tf.trainable_variables()]
    values = session.run(variables_names)
    for k, v in zip(variables_names, values):
        print("Variable: ", k)
        print("Shape: ", v.shape)
        print(v)

Variable:  W_in:0
Shape:  (10000, 200)
[[-0.78748155 -0.05848813 -0.64172769 ...,  0.84141541 -0.94063973
   0.6318059 ]
 [ 0.73271012  0.9191277   0.16166019 ..., -0.77998519  0.349545
  -0.45933056]
 [ 0.56432152 -0.03329587  0.20995927 ..., -0.3828721   0.12824035
  -0.13679957]
 ..., 
 [-0.12901855 -0.29448342 -0.08786106 ..., -0.73254704 -0.37692833
  -0.20111871]
 [-0.62848496 -0.50970197  0.9452734  ...,  0.99917698 -0.81992722
  -0.37649083]
 [-0.09145284  0.00680947  0.67164588 ..., -0.92668557  0.46153927
   0.74277234]]
Variable:  rnn/multi_rnn_cell/cell_0/basic_lstm_cell/kernel:0
Shape:  (400, 800)
[[ 0.02885439 -0.00585808 -0.04069098 ...,  0.00699186 -0.05404399
   0.02202466]
 [ 0.06509807 -0.02223223  0.02077565 ..., -0.00037639  0.04783005
  -0.00804034]
 [ 0.02742642  0.05705827 -0.03898749 ...,  0.00316557  0.06507974
  -0.01836298]
 ..., 
 [-0.00754741 -0.04928695  0.06990113 ..., -0.06414328 -0.06926721
  -0.05963309]
 [-0.06505661 -0.0187963   0.05553678 ..., -0.0

We've provided a few unit tests below to verify some *very* basic properties of your model.

In [2]:
reload(rnnlm); reload(rnnlm_test)
utils.run_tests(rnnlm_test, ["TestRNNLMCore", "TestRNNLMTrain", "TestRNNLMSampler"])

test_shapes_embed (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_output (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_recurrent (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_train (rnnlm_test.TestRNNLMTrain) ... 

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.



ok
test_shapes_sample (rnnlm_test.TestRNNLMSampler) ... ok

----------------------------------------------------------------------
Ran 5 tests in 3.315s

OK


Note that the error messages are intentionally somewhat spare, and that passing tests are no guarantee of model correctness! Your best chance of success is through careful coding and understanding of how the model works.

## (c) Training your RNNLM (5 points)

We'll give you data loader functions in **`utils.py`**. They work similarly to the loaders in the Week 5 notebook.

In particular, `utils.rnnlm_batch_generator` will return an iterator that yields minibatches in the correct format. Batches will be of size `[batch_size, max_time]`, and consecutive batches will line up along rows so that the final state $h^{\text{final}}$ of one batch can be used as the initial state $h^{\text{init}}$ for the next.

For example, using a toy corpus:  
*(Ignore the ugly formatter code.)*

In [6]:
toy_corpus = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>"
toy_corpus = np.array(toy_corpus.split())

html = "<h3>Input words w:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.rnnlm_batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    cols = ["w_%d" % d for d in range(w.shape[1])]
    html += "<td>{:s}</td>".format(utils.render_matrix(w, cols=cols, dtype=object))
html += "</tr></table>"
display(HTML(html))

html = "<h3>Target words y:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.rnnlm_batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    cols = ["y_%d" % d for d in range(y.shape[1])]
    html += "<td>{:s}</td>".format(utils.render_matrix(y, cols=cols, dtype=object))
html += "</tr></table>"
display(HTML(html))

Input words w:

Batch 0:
   w_0  w_1   w_2   w_3
0  <s>  Mary  had   a
1  <s>  The   lamb  was

Batch 1:
   w_0     w_1   w_2
0  little  lamb  .
1  white   as    snow


Target words y:

Batch 0:
   y_0   y_1   y_2  y_3
0  Mary  had   a    little
1  The   lamb  was  white

Batch 1:
   y_0   y_1   y_2
0  lamb  .     <s>
1  as    snow  .


Note that the data we feed to our model will be word indices, but the shape will be the same.
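The layout shown above can be reproduced with a short NumPy sketch. The function `toy_batch_generator` is a hypothetical stand-in written for illustration; it is not claimed to match the implementation of `utils.rnnlm_batch_generator`, only to yield the same batches on this toy corpus.

```python
import numpy as np

def toy_batch_generator(ids, batch_size, max_time):
    """Yield (w, y) minibatches where y is w shifted one token ahead."""
    ids = np.asarray(ids)
    # Keep a length divisible by batch_size, leaving one token for the last target.
    clip_len = ((len(ids) - 1) // batch_size) * batch_size
    w = ids[:clip_len].reshape([batch_size, -1])
    y = ids[1:clip_len + 1].reshape([batch_size, -1])
    # Slice columns so row r of one batch continues in row r of the next.
    for start in range(0, w.shape[1], max_time):
        yield w[:, start:start + max_time], y[:, start:start + max_time]

corpus = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>".split()
batches = list(toy_batch_generator(corpus, batch_size=2, max_time=4))
for w, y in batches:
    print(w.tolist(), "->", y.tolist())
```

Because each row is one contiguous stretch of the corpus, the last column of a batch is immediately followed by the first column of the next batch in the same row, which is what makes carrying the hidden state across batches meaningful.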

### 1. Implement the `run_epoch` function
We've given you some starter code for logging progress; fill this in with actual call(s) to `session.run` with the appropriate arguments to run a training step. 

Be sure to handle the initial state properly at the beginning of an epoch, and remember to carry over the final state from each batch and use it as the initial state for the next.
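To see why carrying the state matters, consider a toy scalar recurrence (an assumption made for illustration; it is not the LSTM used in this assignment). Processing a sequence in chunks while feeding each chunk's final state into the next gives the same result as processing the whole sequence at once.

```python
import numpy as np

def toy_rnn(x, h):
    """Run a scalar tanh recurrence over x, starting from state h."""
    for x_t in x:
        h = np.tanh(0.5 * h + x_t)
    return h

x = np.linspace(-1.0, 1.0, 12)

# Process the whole sequence in one pass.
h_full = toy_rnn(x, 0.0)

# Process it in chunks of 4, feeding each final state into the next chunk.
h = 0.0
for start in range(0, len(x), 4):
    h = toy_rnn(x[start:start + 4], h)

print(np.isclose(h_full, h))  # True
```

Resetting `h` to zero at every chunk would instead make the model forget everything before the chunk boundary.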

**Note:** we provide a `train=True` flag to enable train mode. If `train=False`, this function can also be used for scoring the dataset - see `score_dataset()` below.

In [3]:
def run_epoch(lm, session, batch_iterator,
              train=False, verbose=False,
              tick_s=10, learning_rate=None):
    assert(learning_rate is not None)
    start_time = time.time()
    tick_time = start_time  # for showing status
    total_cost = 0.0  # total cost, summed over all words
    total_batches = 0
    total_words = 0

    if train:
        train_op = lm.train_step_
        use_dropout = True
        loss = lm.train_loss_
    else:
        train_op = tf.no_op()
        use_dropout = False  # no dropout at test time
        loss = lm.loss_  # true loss, if train_loss is an approximation

    for i, (w, y) in enumerate(batch_iterator):
        cost = 0.0
        # At first batch in epoch, get a clean initial state.
        if i == 0:
            h = session.run(lm.initial_h_, {lm.input_w_: w})

        #### YOUR CODE HERE ####
        feed_dict = {lm.input_w_:w,
                     lm.target_y_:y,
                     lm.learning_rate_: learning_rate,
                     lm.use_dropout_: use_dropout,
                     lm.initial_h_:h}
        cost, h, _ = session.run([loss, lm.final_h_, train_op],feed_dict=feed_dict)

        #### END(YOUR CODE) ####
        total_cost += cost
        total_batches = i + 1
        total_words += w.size  # w.size = batch_size * max_time

        ##
        # Print average loss-so-far for epoch
        # If using train_loss_, this may be an underestimate.
        if verbose and (time.time() - tick_time >= tick_s):
            avg_cost = total_cost / total_batches
            avg_wps = total_words / (time.time() - start_time)
            print("[batch {:d}]: seen {:d} words at {:.1f} wps, loss = {:.3f}".format(
                i, total_words, avg_wps, avg_cost))
            tick_time = time.time()  # reset time ticker

    return total_cost / total_batches

In [4]:
def score_dataset(lm, session, ids, name="Data"):
    # For scoring, we can use larger batches to speed things up.
    bi = utils.rnnlm_batch_generator(ids, batch_size=100, max_time=100)
    cost = run_epoch(lm, session, bi, 
                     learning_rate=0.0, train=False, 
                     verbose=False, tick_s=3600)
    print("{:s}: avg. loss: {:.03f}  (perplexity: {:.02f})".format(name, cost, np.exp(cost)))
    return cost
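`score_dataset` reports perplexity as `np.exp(cost)`. As a quick sanity check on that relationship, here is a minimal sketch, assuming the loss is an average per-token cross-entropy measured in nats; the helper `perplexity` is illustrative only.

```python
import math

def perplexity(avg_loss_nats):
    """Convert an average per-token cross-entropy (in nats) to perplexity."""
    return math.exp(avg_loss_nats)

# A uniform guess over V = 10000 types has loss log(V) and perplexity V.
print(round(perplexity(math.log(10000)), 6))  # 10000.0
```

A trained model should score well below this uniform baseline.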

You can use the cell below to verify your implementation of `run_epoch`, and to test your RNN on a (very simple) toy dataset:

In [5]:
reload(rnnlm); reload(rnnlm_test)
th = rnnlm_test.RunEpochTester("test_toy_model")
th.setUp(); th.injectCode(run_epoch, score_dataset)
unittest.TextTestRunner(verbosity=2).run(th)

test_toy_model (rnnlm_test.RunEpochTester) ... 

[batch 0]: seen 50 words at 43.1 wps, loss = 2.028
[batch 98]: seen 4950 words at 2284.2 wps, loss = 1.114
[batch 210]: seen 10550 words at 3329.2 wps, loss = 0.959
[batch 346]: seen 17350 words at 4155.3 wps, loss = 1.024
[batch 481]: seen 24100 words at 4656.4 wps, loss = 0.983
[batch 615]: seen 30800 words at 4987.1 wps, loss = 0.943
[batch 751]: seen 37600 words at 5237.5 wps, loss = 0.954
[batch 887]: seen 44400 words at 5425.6 wps, loss = 0.964
[batch 1024]: seen 51250 words at 5579.2 wps, loss = 0.954
[batch 1160]: seen 58050 words at 5697.4 wps, loss = 0.943
[batch 1296]: seen 64850 words at 5794.1 wps, loss = 0.930
[batch 1431]: seen 71600 words at 5872.2 wps, loss = 0.921
[batch 1567]: seen 78400 words at 5940.1 wps, loss = 0.917
[batch 1702]: seen 85150 words at 5995.7 wps, loss = 0.905
[batch 1839]: seen 92000 words at 6050.8 wps, loss = 0.909
[batch 1976]: seen 98850 words at 6098.0 wps, loss = 0.914
[batch 2112]: seen 105650 words at 6138.3 wps, loss = 0.906
[batch 2245]:

ok

----------------------------------------------------------------------
Ran 1 test in 33.626s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

Note that as above, this is a *very* simple test case that does not guarantee model correctness.

### 2. Run Training

We'll give you the outline of the training procedure, but you'll need to fill in a call to your `run_epoch` function. 

At the end of training, we use a `tf.train.Saver` to save a copy of the model to `/tmp/artificial_hotel_reviews/a4_model/rnnlm_trained` (the `trained_filename` set in the training-parameters cell). You'll want to load this from disk to work on later parts of the assignment; see **part (d)** for an example of how this is done.

#### Tuning Hyperparameters
With a sampled softmax loss, the default hyperparameters should train 5 epochs in around 15 minutes on a single-core GCE instance, and reach a training set perplexity between 120 and 140.

However, it's possible to do significantly better. Try experimenting with multiple RNN layers (`num_layers` > 1) or a larger hidden state - though you may also need to adjust the learning rate and number of epochs for a larger model.

You can also experiment with a larger vocabulary. This will look worse for perplexity, but will be a better model overall as it won't treat so many words as `<unk>`.

#### Notes on Speed

To speed things up, you may want to re-start your GCE instance with more CPUs. Using a 16-core machine will train *very* quickly if using a sampled softmax loss, almost as fast as a GPU. (Because of the sequential nature of the model, GPUs aren't actually much faster than CPUs for training and running RNNs.) The training code will print the words-per-second processed; with the default settings on a single core, you can expect around 8000 WPS, or up to more than 25000 WPS on a fast multi-core machine.

You might also want to modify the code below to run `score_dataset` only at the very end, after all epochs are completed. This will speed things up significantly, since `score_dataset` uses the full softmax loss and so can often take longer than a whole training epoch!

#### Submitting your model
You should submit your trained model along with the assignment. Do:
```
cp /tmp/artificial_hotel_reviews/a4_model/rnnlm_trained* .
git add rnnlm_trained*
git commit -m "Adding trained model."
```
Unless you train a very large model, these files should be < 50 MB and no problem for git to handle. If you do also train a large model, please only submit the smaller one.

In [10]:
# Load the dataset
V = 10000
vocab, train_ids, test_ids = utils.load_corpus("brown", split=0.8, V=V, shuffle=42)

[nltk_data] Downloading package brown to /home/kalvin_kao/nltk_data...
[nltk_data]   Package brown is already up-to-date!
Vocabulary: 10,000 types
Loaded 57,340 sentences (1.16119e+06 tokens)
Training set: 45,872 sentences (924,077 tokens)
Test set: 11,468 sentences (237,115 tokens)


In [6]:
review_path = '/home/kalvin_kao/yelp_challenge_dataset/review.csv'
business_path = '/home/kalvin_kao/yelp_challenge_dataset/business.csv'

In [7]:
review_df = pd.read_csv(review_path)

In [8]:
business_df = pd.read_csv(business_path)

In [9]:
five_star_review_df = review_df[review_df['stars']==5]
five_star_review_series = five_star_review_df['text']

In [10]:
# Build a list of lists of characters from the 5-star reviews
def preprocess_review_series(review_series):
    review_list = []
    # Iterate over the argument rather than the global series
    for new_review in review_series:
        # Drop the first two characters and the last one (the slice [2:-1])
        clipped_review = new_review[2:-1]
        char_list = list(clipped_review.lower())
        semifinal_review = []
        last_char = ''
        for ascii_char in char_list:
            # Skip each backslash and the character right after it
            if ascii_char == '\\' or last_char == '\\':
                pass
            else:
                semifinal_review.append(ascii_char)
            last_char = ascii_char
        # Mark the start and end of each review
        final_review = ['<SOR>'] + semifinal_review + ['<EOR>']
        #print(final_review)
        review_list.append(final_review)
    return review_list
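The escape-stripping loop in `preprocess_review_series` can be tried in isolation. The sketch below pulls it into a hypothetical helper, `strip_escapes`, and applies it to a made-up review string; the assumption is that raw reviews look like the text form of a bytes literal, with literal backslash escapes.

```python
def strip_escapes(text):
    """Drop each backslash and the single character that follows it."""
    kept = []
    last_char = ''
    for ch in text:
        if ch != '\\' and last_char != '\\':
            kept.append(ch)
        last_char = ch
    return kept

raw = "b'Great food!\\nWill return.'"  # hypothetical raw review string
print("".join(strip_escapes(raw[2:-1].lower())))  # great food!will return.
```

Note that an escaped newline is removed without leaving a space behind, so the words on either side of a line break are joined together.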

In [11]:
#preprocessed reviews
#review_list = preprocess_review_series(five_star_review_series)
review_list = preprocess_review_series(five_star_review_series)

In [12]:
#combined_review_list = [item for sublist in review_list for item in sublist]
training_review_list = [item for sublist in review_list[:250000] for item in sublist]
#unique_characters = list(set(combined_review_list))
unique_characters = list(set(training_review_list))
#len(unique_characters)

In [15]:
#vocabulary
char_dict = {w:i for i, w in enumerate(unique_characters)}
#print(char_dict)

{'h': 0, '.': 1, '4': 2, '*': 3, '_': 4, '`': 5, 'p': 6, 'j': 7, 'n': 8, 'w': 9, ';': 10, 'z': 11, '#': 12, 'f': 13, 'r': 14, 'b': 15, 'q': 16, '^': 17, '<EOR>': 18, 'x': 19, 'y': 20, '@': 21, '5': 22, '$': 23, "'": 24, '{': 25, 't': 26, 'k': 27, '"': 28, '7': 29, '8': 30, 'c': 31, '|': 32, 'a': 33, 'e': 34, 'o': 35, '}': 36, '(': 37, 'l': 38, ' ': 39, '?': 40, 'g': 41, '-': 42, '%': 43, '[': 44, '2': 45, '+': 46, '9': 47, '~': 48, 'd': 49, '6': 50, ',': 51, ')': 52, ':': 53, ']': 54, 's': 55, '0': 56, 'u': 57, '/': 58, '=': 59, '1': 60, '<SOR>': 61, '3': 62, '&': 63, 'm': 64, '!': 65, 'v': 66, 'i': 67}


In [17]:
#convert to flat (1D) np.array(int) of ids
#add memory to VM and remove 1000 slice
#combined_review_ids = [char_dict.get(token) for token in combined_review_list[:1000]]
training_review_ids = [char_dict.get(token) for token in training_review_list]

In [19]:
train_ids = np.array(training_review_ids)
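Two caveats apply to the vocabulary built above. First, `list(set(...))` does not give a stable order for strings across Python processes, so ids may change after a kernel restart and no longer match a saved model. Second, `char_dict.get(token)` returns `None` for any character not seen in training, and the `test_ids` loaded earlier come from the Brown word vocabulary rather than from this character vocabulary. The sketch below shows one possible way to address both; the helper names and the `<UNK>` token are assumptions for illustration, not part of the original code.

```python
def build_char_vocab(token_lists, unk_token="<UNK>"):
    """Map each distinct token to an id, in sorted (reproducible) order."""
    tokens = sorted({tok for toks in token_lists for tok in toks})
    vocab = {tok: i for i, tok in enumerate(tokens)}
    vocab.setdefault(unk_token, len(vocab))  # reserve an id for unseen tokens
    return vocab

def tokens_to_ids(tokens, vocab, unk_token="<UNK>"):
    """Look up ids, falling back to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]

train_reviews = [["<SOR>", "o", "k", "<EOR>"], ["<SOR>", "n", "o", "<EOR>"]]
vocab = build_char_vocab(train_reviews)
print(tokens_to_ids(["<SOR>", "o", "z", "<EOR>"], vocab))  # [1, 4, 5, 0]
```

With this approach the model's `V` would be `len(vocab)`, which includes the extra `<UNK>` entry, and held-out reviews (for example, those after the first 250,000) could be converted with the same `tokens_to_ids` call.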

In [24]:
# Training parameters
max_time = 25
batch_size = 100
learning_rate = 0.01
#num_epochs = 10
num_epochs = 1

# Model parameters
#model_params = dict(V=vocab.size, 
                    #H=200, 
                    #softmax_ns=200,
                    #num_layers=2)
model_params = dict(V=len(unique_characters), 
                    H=200, 
                    softmax_ns=len(unique_characters),
                    num_layers=2)

TF_SAVEDIR = "/tmp/artificial_hotel_reviews/a4_model"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [None]:
# Will print status every this many seconds
print_interval = 5

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)
    
    #check trainable variables
    #variables_names = [v.name for v in tf.trainable_variables()]
    #values = session.run(variables_names)
    #for k, v in zip(variables_names, values):
        #print("Variable: ", k)
        #print("Shape: ", v.shape)
        #print(v)

    for epoch in range(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.rnnlm_batch_generator(train_ids, batch_size, max_time)
        print("[epoch {:d}] Starting epoch {:d}".format(epoch, epoch))
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, batch_iterator=bi, train=True, verbose=True, tick_s=10, learning_rate=learning_rate)

        
        #### END(YOUR CODE) ####
        print("[epoch {:d}] Completed in {:s}".format(epoch, utils.pretty_timedelta(since=t0_epoch)))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
        print("[epoch {:d}]".format(epoch), end=" ")
        score_dataset(lm, session, train_ids, name="Train set")
        print("[epoch {:d}]".format(epoch), end=" ")
        score_dataset(lm, session, test_ids, name="Test set")
        print("")
    
    # Save final model
    saver.save(session, trained_filename)
    
    #variables_names = [v.name for v in tf.trainable_variables()]
    #values = session.run(variables_names)
    #for k, v in zip(variables_names, values):
        #print("Variable: ", k)
        #print("Shape: ", v.shape)
        #print(v)

[epoch 1] Starting epoch 1
[batch 42]: seen 107500 words at 10684.1 wps, loss = 2.917
[batch 86]: seen 217500 words at 10778.3 wps, loss = 2.631
[batch 130]: seen 327500 words at 10821.0 wps, loss = 2.491
[batch 174]: seen 437500 words at 10860.9 wps, loss = 2.403
[batch 218]: seen 547500 words at 10868.3 wps, loss = 2.339
[batch 262]: seen 657500 words at 10867.9 wps, loss = 2.288
[batch 306]: seen 767500 words at 10862.0 wps, loss = 2.244
[batch 350]: seen 877500 words at 10850.8 wps, loss = 2.211
[batch 394]: seen 987500 words at 10845.7 wps, loss = 2.182
[batch 438]: seen 1097500 words at 10848.4 wps, loss = 2.156
[batch 481]: seen 1205000 words at 10838.4 wps, loss = 2.135
[batch 525]: seen 1315000 words at 10834.5 wps, loss = 2.116
[batch 569]: seen 1425000 words at 10828.3 wps, loss = 2.098
[batch 612]: seen 1532500 words at 10815.8 wps, loss = 2.081
[batch 656]: seen 1642500 words at 10816.8 wps, loss = 2.066
[batch 699]: seen 1750000 words at 10810.6 wps, loss = 2.053
[batch 7

[batch 5834]: seen 14587500 words at 10854.8 wps, loss = 1.761
[batch 5879]: seen 14700000 words at 10855.9 wps, loss = 1.761
[batch 5923]: seen 14810000 words at 10856.5 wps, loss = 1.760
[batch 5967]: seen 14920000 words at 10856.9 wps, loss = 1.760
[batch 6012]: seen 15032500 words at 10858.1 wps, loss = 1.759
[batch 6056]: seen 15142500 words at 10858.7 wps, loss = 1.759
[batch 6100]: seen 15252500 words at 10859.5 wps, loss = 1.758
[batch 6144]: seen 15362500 words at 10859.6 wps, loss = 1.758
[batch 6188]: seen 15472500 words at 10860.1 wps, loss = 1.757
[batch 6232]: seen 15582500 words at 10860.7 wps, loss = 1.757
[batch 6276]: seen 15692500 words at 10861.0 wps, loss = 1.757
[batch 6320]: seen 15802500 words at 10861.6 wps, loss = 1.756
[batch 6364]: seen 15912500 words at 10862.3 wps, loss = 1.756
[batch 6409]: seen 16025000 words at 10864.4 wps, loss = 1.755
[batch 6454]: seen 16137500 words at 10865.8 wps, loss = 1.755
[batch 6498]: seen 16247500 words at 10866.7 wps, loss 

## (d) Sampling Sentences (5 points)

If you didn't already in **part (b)**, implement the `BuildSamplerGraph()` method in `rnnlm.py`. See the function docstring for more information.

#### Implement the `sample_step()` method below (5 points)
This should access the Tensors you create in `BuildSamplerGraph()`. Given an input batch and initial states, it should return a vector of shape `[batch_size,1]` containing sampled indices for the next word of each batch sequence.

Run the method using the provided code to generate 10 sentences.

In [None]:
def sample_step(lm, session, input_w, initial_h):
    """Run a single RNN step and return sampled predictions.
  
    Args:
      lm : rnnlm.RNNLM
      session: tf.Session
      input_w : [batch_size] vector of indices
      initial_h : [batch_size, hidden_dims] initial state
    
    Returns:
      final_h : final hidden state, compatible with initial_h
      samples : [batch_size, 1] vector of indices
    """
    # Reshape input to column vector
    input_w = np.array(input_w, dtype=np.int32).reshape([-1,1])
  
    #### YOUR CODE HERE ####
    # Run sample ops
    #lm.BuildSamplerGraph()
    feed_dict = {lm.input_w_:input_w, lm.initial_h_:initial_h}
    final_h, samples = session.run([lm.final_h_, lm.pred_samples_], feed_dict=feed_dict)


    #### END(YOUR CODE) ####
    # Note indexing here: 
    #   [batch_size, max_time, 1] -> [batch_size, 1]
    return final_h, samples[:,-1,:]
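For intuition about what a sampling op does, here is a NumPy sketch that draws one index per row from the softmax of a logits matrix. It is an illustration under the assumption that `pred_samples_` samples from the model's output distribution; the function `sample_from_logits` is hypothetical and is not part of `rnnlm.py`.

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Draw one index per row from softmax(logits); returns [batch_size, 1]."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    draws = [rng.choice(len(p), p=p) for p in probs]
    return np.array(draws).reshape(-1, 1)

rng = np.random.default_rng(42)
logits = [[0.0, 50.0, 0.0], [2.0, 1.0, 0.0]]
samples = sample_from_logits(logits, rng)
print(samples.shape)  # (2, 1)
```

Sampling, rather than always taking the argmax, is what lets repeated runs produce different sequences from the same starting token.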

In [None]:
# Same as above, but as a batch
max_steps = 20
num_samples = 10
random_seed = 42

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildSamplerGraph()

with lm.graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(random_seed)
    
    # Load the trained model
    saver.restore(session, trained_filename)

    # Make initial state for a batch with batch_size = num_samples
    #w = np.repeat([[vocab.START_ID]], num_samples, axis=0)
    # Feed integer ids (not the '<SOR>' string itself) to the model
    w = np.repeat([[char_dict['<SOR>']]], num_samples, axis=0)
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    # We'll take one step for each sequence on each iteration 
    for i in range(max_steps):
        h, y = sample_step(lm, session, w[:,-1:], h)
        w = np.hstack((w,y))

    # Invert char_dict so sampled ids can be printed as characters
    id_to_char = {i: c for c, i in char_dict.items()}

    # Print generated sentences
    for row in w:
        for i, char_id in enumerate(row):
            print(id_to_char[char_id], end=" ")
            #if (i != 0) and (word_id == vocab.START_ID):
            if (i != 0) and (char_id == char_dict['<EOR>']):
                break
        print("")

## (e) Linguistic Properties (5 points)

Now that we've trained our RNNLM, let's test a few properties of the model to see how well it learns linguistic phenomena. We'll do this with a scoring task: given two or more test sentences, our model should score the more plausible (or more correct) sentence with a higher log-probability.

We'll define a scoring function to help us:

In [15]:
def score_seq(lm, session, seq, vocab):
    """Score a sequence of words. Returns total log-probability."""
    padded_ids = vocab.words_to_ids(utils.canonicalize_words(["<s>"] + seq + ["</s>"], 
                                                             wordset=vocab.word_to_id))
    w = np.reshape(padded_ids[:-1], [1,-1])
    y = np.reshape(padded_ids[1:],  [1,-1])
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    feed_dict = {lm.input_w_:w,
                 lm.target_y_:y,
                 lm.initial_h_:h,
                 lm.dropout_keep_prob_: 1.0}
    # Return log(P(seq)) = -1*loss
    return -1*session.run(lm.loss_, feed_dict)

def load_and_score(inputs, sort=False):
    """Load the trained model and score the given words."""
    lm = rnnlm.RNNLM(**model_params)
    lm.BuildCoreGraph()
    
    with lm.graph.as_default():
        saver = tf.train.Saver()

    with tf.Session(graph=lm.graph) as session:  
        # Load the trained model
        saver.restore(session, trained_filename)

        if isinstance(inputs[0], str) or isinstance(inputs[0], bytes):
            inputs = [inputs]

        # Actually run scoring
        results = []
        for words in inputs:
            score = score_seq(lm, session, words, vocab)
            results.append((score, words))

        # Sort if requested
        if sort: results = sorted(results, reverse=True)

        # Print results
        for score, words in results:
            print("\"{:s}\" : {:.02f}".format(" ".join(words), score))
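The idea behind the scoring task can be sketched without a model: a sequence's log-probability is the sum of the log-probabilities of its tokens given their history. The per-token probabilities below are made up for illustration, and `sequence_log_prob` is a hypothetical helper.

```python
import math

def sequence_log_prob(token_probs):
    """Sum log P(w_i | history) over the tokens of one sequence."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for two candidate sentences.
plausible = [0.20, 0.10, 0.30]
implausible = [0.20, 0.10, 0.01]

print(sequence_log_prob(plausible) > sequence_log_prob(implausible))  # True
```

Because every extra token adds a negative term, summed scores are most comparable between sequences of the same length. If the model's loss is averaged per token rather than summed, the score is a mean log-probability instead.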

Now we can test as:

In [16]:
sents = ["once upon a time",
         "the quick brown fox jumps over the lazy dog"]
load_and_score([s.split() for s in sents])

INFO:tensorflow:Restoring parameters from /tmp/artificial_hotel_reviews/a4_model/rnnlm_trained
"once upon a time" : -8.74
"the quick brown fox jumps over the lazy dog" : -7.83


In [17]:
#sents = ["is sentence nonsense this", "i drive a car"]
#load_and_score([s.split() for s in sents])

### 1. Number agreement

Compare **"the boy and the girl [are/is]"**. Which is more plausible according to your model?

If your model doesn't order them correctly (*this is OK*), why do you think that might be? (answer in cell below)

In [18]:
#### YOUR CODE HERE ####
sents = ["the boy and the girl are", "the boy and the girl is"]
load_and_score([s.split() for s in sents])

#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from /tmp/artificial_hotel_reviews/a4_model/rnnlm_trained
"the boy and the girl are" : -6.03
"the boy and the girl is" : -5.90


#### Answer to part 1. question(s)

*According to the model, "the boy and the girl is" is slightly more plausible (-5.90 vs. -6.03), so the pair is not ordered correctly, although the log probabilities of the two phrases are very close.  This is likely because "is" and "are" occur in such similar contexts.*

### 2. Type/semantic agreement

Compare:
- **"peanuts are my favorite kind of [nut/vegetable]"**
- **"when I'm hungry I really prefer to [eat/drink]"**

Of each pair, which is more plausible according to your model?

How would you expect a 3-gram language model to perform at this example? How about a 5-gram model? (answer in cell below)

In [19]:
#### YOUR CODE HERE ####
sents = ["peanuts are my favorite kind of nut",
         "peanuts are my favorite kind of vegetable"]
load_and_score([s.split() for s in sents])

sents = ["when I'm hungry I really prefer to eat",
         "when I'm hungry I really prefer to drink"]
load_and_score([s.split() for s in sents])

#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from /tmp/artificial_hotel_reviews/a4_model/rnnlm_trained
"peanuts are my favorite kind of nut" : -7.46
"peanuts are my favorite kind of vegetable" : -7.29
INFO:tensorflow:Restoring parameters from /tmp/artificial_hotel_reviews/a4_model/rnnlm_trained
"when I'm hungry I really prefer to eat" : -7.49
"when I'm hungry I really prefer to drink" : -7.52


#### Answer to part 2. question(s)

*Of each pair, the sentences "peanuts are my favorite kind of vegetable" and "when I'm hungry I really prefer to eat" are more plausible according to the model.*

*For the differing word in each pair of sentences, a 3-gram language model conditions only on the previous two words, "kind of" and "prefer to", so it would not have enough context to correctly score the likely final word.*

*A 5-gram language model conditions on the previous four words, "my favorite kind of" and "I really prefer to".  In the first pair of sentences, this still excludes "peanuts", so there is not enough context to correctly score the next expected word.  In the second pair, the context also stops one word short of "hungry", so a 5-gram model would have to rely on how often "eat" and "drink" follow "I really prefer to"; only a model with longer context (a 6-gram or the RNN) can use "hungry" to score "eat" higher than "drink".*
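An n-gram model conditions on the previous n-1 tokens when predicting the next word. The sketch below makes the context windows explicit for the second example, assuming whitespace tokenization as in `s.split()`; `ngram_context` is a hypothetical helper.

```python
def ngram_context(tokens, n):
    """Return the n-1 tokens an n-gram model conditions on for the next word."""
    return tokens[-(n - 1):] if n > 1 else []

prefix = "when I'm hungry I really prefer to".split()
print(ngram_context(prefix, 3))  # ['prefer', 'to']
print(ngram_context(prefix, 5))  # ['I', 'really', 'prefer', 'to']
```

Under this tokenization, "hungry" only enters the window at n = 6.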

### 3. Adjective ordering (just for fun)

Let's repeat the exercise from Week 2:

![Adjective Order](adjective_order.jpg)
*source: https://twitter.com/MattAndersonBBC/status/772002757222002688?lang=en*

We'll consider a toy example (literally), and consider all possible adjective permutations.

Note that this is somewhat sensitive to training, and even a good language model might not get it all correct. Why might the NN fail, if the trigram model from Week 2 was able to solve it?

In [20]:
prefix = "I have lots of".split()
noun = "toys"
adjectives = ["square", "green", "plastic"]
inputs = []
for adjs in itertools.permutations(adjectives):
    words = prefix + list(adjs) + [noun]
    inputs.append(words)
    
load_and_score(inputs, sort=True)

INFO:tensorflow:Restoring parameters from /tmp/artificial_hotel_reviews/a4_model/rnnlm_trained
"I have lots of plastic green square toys" : -8.51
"I have lots of green square plastic toys" : -8.56
"I have lots of green plastic square toys" : -8.57
"I have lots of plastic square green toys" : -8.64
"I have lots of square green plastic toys" : -8.69
"I have lots of square plastic green toys" : -8.71
