# Assignment 4: Recurrent Neural Network Language Model

This is the "working notebook", with skeleton code to load and train your model, as well as run unit tests. See [rnnlm-instructions.ipynb](rnnlm-instructions.ipynb) for the main writeup.

Run the cell below to import packages.

In [1]:
import json, os, re, shutil, sys, time
import collections, itertools
import unittest
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import tensorflow as tf
assert(tf.__version__.startswith("1."))

# utils.pretty_print_matrix uses Pandas. Configure float format here.
import pandas as pd
pd.set_option('float_format', lambda f: "{0:.04f}".format(f))

# Helper libraries
from shared_lib import utils, vocabulary, tf_embed_viz

# Your code
import rnnlm
import rnnlm_test
reload(rnnlm)
reload(rnnlm_test)

<module 'rnnlm_test' from 'rnnlm_test.pyc'>

## (a) RNNLM Inputs and Parameters

### Answers for Part (a)
You can use LaTeX to typeset math, e.g. `$ f(x) = x^2 $` will render as $ f(x) = x^2 $.

1. *$tanh(MW+B)$ The number of parameters in the recurrent cell W is $2H x H$ or $2H^2$ + plus the bias terms of size $H$.*
2. *The embedding layer is of size $VH$.*
3. *To convert from a word from its index representation to its vector representation, we can index the row in the embeddings matrix since the index representation is one-hot.  Inside the recurrent cell, the number of operations is $O(H^2)$. The softmax to calculate $P(w_{t+1})$ for a single target word is $O(HV)$, so the overall operation is $O(H(H+V))$ since to normalize the percentages, it has to compute the logit for all of the words.  So for calculating the probability across all target words in the vocabulary, the number of computations is the same since we have to compute all of the logits for the single target.*
4. *With sampled softmax, computing the probability for each word is $O(K)$.  So the softmax for k samples is $O(k*K)$.  So the total computation, including the recurrent term, is $O(H^2 + H*K)$.  With hierarchical softmax, computing the probability for each word is $O(log_2(V))$.  So the softmax for k samples is $O(k*log_2(V))$.  So the total computation, including the recurrent term, is $O(H^2 + k*log(V))$.*
5. *The recurrent layer.*

## (b) Implementing the RNNLM

In order to better manage the model parameters, we'll implement our RNNLM in the `RNNLM` class in `rnnlm.py`. We've given you a skeleton of starter code for this, but the bulk of the implementation is left to you.

In [2]:
reload(rnnlm)

TF_GRAPHDIR = "tf_graph"

# Clear old log directory.
shutil.rmtree(TF_GRAPHDIR, ignore_errors=True)

lm = rnnlm.RNNLM(V=10000, H=200, num_layers=2)
lm.BuildCoreGraph()
lm.BuildTrainGraph()
lm.BuildSamplerGraph()

summary_writer = tf.summary.FileWriter(TF_GRAPHDIR, lm.graph)

The code above will load your implementation, construct the graph, and write a logdir for TensorBoard. You can bring up TensorBoard with:
```
cd assignment/a4
tensorboard --logdir tf_graph --port 6006
```
As usual, check http://localhost:6006/ and visit the "Graphs" tab to inspect your implementation. Remember, judicious use of `tf.name_scope()` and/or `tf.variable_scope()` will greatly improve the visualization, and make code easier to debug.

We've provided a few unit tests below to verify some *very* basic properties of your model.

In [3]:
reload(rnnlm)
reload(rnnlm_test)

testnames = ["TestRNNLMCore", "TestRNNLMTrain", "TestRNNLMSampler"]

unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromNames(
        testnames, rnnlm_test))

test_shapes_embed (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_output (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_recurrent (rnnlm_test.TestRNNLMCore) ... ok
test_shapes_train (rnnlm_test.TestRNNLMTrain) ... ok
test_shapes_sample (rnnlm_test.TestRNNLMSampler) ... ok

----------------------------------------------------------------------
Ran 5 tests in 1.374s

OK


<unittest.runner.TextTestResult run=5 errors=0 failures=0>

Note that the error messages are intentionally somewhat spare, and that passing tests are no guarantee of model correctness! Your best chance of success is through careful coding and understanding of how the model works.

## (c) Training your RNNLM (5 points)

We'll give you data loader functions in **`utils.py`**. They work similarly to the loaders in the Week 5 notebook.

Particularly, `utils.batch_generator` will return an iterator that yields minibatches in the correct format. Batches will be of size `[batch_size, max_time]`, and consecutive batches will line up along rows so that the final state $h^{\text{final}}$ of one batch can be used as the initial state $h^{\text{init}}$ for the next.

For example, using a toy corpus:  
*(Ignore the ugly formatter code.)*

In [4]:
toy_corpus = "<s> Mary had a little lamb . <s> The lamb was white as snow . <s>"
toy_corpus = np.array(toy_corpus.split())

html = "<h3>Input words w:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    html += "<td>" + utils.render_matrix(w, cols=["w_%d" % d for d in range(w.shape[1])], dtype=object) + "</td>"
html += "</tr></table>"
display(HTML(html))

html = "<h3>Target words y:</h3>"
html += "<table><tr><th>Batch 0</th><th>Batch 1</th></tr><tr>"
bi = utils.batch_generator(toy_corpus, batch_size=2, max_time=4)
for i, (w,y) in enumerate(bi):
    html += "<td>" + utils.render_matrix(y, cols=["y_%d" % d for d in range(y.shape[1])], dtype=object) + "</td>"
html += "</tr></table>"
display(HTML(html))

Unnamed: 0_level_0,w_0,w_1,w_2,w_3
Unnamed: 0_level_1,w_0,w_1,w_2,Unnamed: 4_level_1
0,<s>,Mary,had,a
1,<s>,The,lamb,was
0,little,lamb,.,
1,white,as,snow,
Batch 0,Batch 1,,,
"w_0  w_1  w_2  w_3  0  <s>  Mary  had  a  1  <s>  The  lamb  was  var df = $('table.dataframe'); var cells = df.children('tbody').children('tr')  .children('td'); cells.css(""width"", ""30px"").css(""height"", ""30px"");","w_0  w_1  w_2  0  little  lamb  .  1  white  as  snow  var df = $('table.dataframe'); var cells = df.children('tbody').children('tr')  .children('td'); cells.css(""width"", ""30px"").css(""height"", ""30px"");",,,

Unnamed: 0,w_0,w_1,w_2,w_3
0,<s>,Mary,had,a
1,<s>,The,lamb,was

Unnamed: 0,w_0,w_1,w_2
0,little,lamb,.
1,white,as,snow


Unnamed: 0_level_0,y_0,y_1,y_2,y_3
Unnamed: 0_level_1,y_0,y_1,y_2,Unnamed: 4_level_1
0,Mary,had,a,little
1,The,lamb,was,white
0,lamb,.,<s>,
1,as,snow,.,
Batch 0,Batch 1,,,
"y_0  y_1  y_2  y_3  0  Mary  had  a  little  1  The  lamb  was  white  var df = $('table.dataframe'); var cells = df.children('tbody').children('tr')  .children('td'); cells.css(""width"", ""30px"").css(""height"", ""30px"");","y_0  y_1  y_2  0  lamb  .  <s>  1  as  snow  .  var df = $('table.dataframe'); var cells = df.children('tbody').children('tr')  .children('td'); cells.css(""width"", ""30px"").css(""height"", ""30px"");",,,

Unnamed: 0,y_0,y_1,y_2,y_3
0,Mary,had,a,little
1,The,lamb,was,white

Unnamed: 0,y_0,y_1,y_2
0,lamb,.,<s>
1,as,snow,.


Note that the data we feed to our model will be word indices, but the shape will be the same.

### 1. Implement the `run_epoch` function
We've given you some starter code for logging progress; fill this in with actual call(s) to `session.run` with the appropriate arguments to run a training step. 

Be sure to handle the initial state properly at the beginning of an epoch, and remember to carry over the final state from each batch and use it as the initial state for the next.

**Note:** we provide a `train=True` flag to enable train mode. If `train=False`, this function can also be used for scoring the dataset - see `score_dataset()` below.

In [5]:
def run_epoch(lm, session, batch_iterator,
              train=False, verbose=False,
              tick_s=10, learning_rate=0.1):
    start_time = time.time()
    tick_time = start_time  # for showing status
    total_cost = 0.0  # total cost, summed over all words
    total_batches = 0
    total_words = 0

    if train:
        train_op = lm.train_step_
        use_dropout = True
        loss = lm.train_loss_
    else:
        train_op = tf.no_op()
        use_dropout = False  # no dropout at test time
        loss = lm.loss_  # true loss, if train_loss is an approximation

    for i, (w, y) in enumerate(batch_iterator):
        cost = 0.0
        # At first batch in epoch, get a clean intitial state.
        if i == 0:
            h = session.run(lm.initial_h_, {lm.input_w_: w})

        #### YOUR CODE HERE ####
        feed_dict = { lm.input_w_ : w, lm.target_y_ : y, 
                     lm.learning_rate_ : learning_rate, 
                     lm.use_dropout_ : use_dropout }
        cost, _ = session.run([loss, train_op], feed_dict=feed_dict)
        lm.initial_h_ = lm.final_h_

        #### END(YOUR CODE) ####
        total_cost += cost
        total_batches = i + 1
        total_words += w.size  # w.size = batch_size * max_time

        ##
        # Print average loss-so-far for epoch
        # If using train_loss_, this may be an underestimate.
        if verbose and (time.time() - tick_time >= tick_s):
            avg_cost = total_cost / total_batches
            avg_wps = total_words / (time.time() - start_time)
            print "[batch %d]: seen %d words at %d wps, loss = %.3f" % (
                i, total_words, avg_wps, avg_cost)
            tick_time = time.time()  # reset time ticker

    return total_cost / total_batches

In [6]:
def score_dataset(lm, session, ids, name="Data"):
    # For scoring, we can use larger batches to speed things up.
    bi = utils.batch_generator(ids, batch_size=100, max_time=100)
    cost = run_epoch(lm, session, bi, 
                     learning_rate=1.0, train=False, 
                     verbose=False, tick_s=3600)
    print "%s: avg. loss: %.03f  (perplexity: %.02f)" % (name, cost, np.exp(cost))

You can use the cell below to verify your implementation of `run_epoch`, and to test your RNN on a (very simple) toy dataset:

In [7]:
reload(rnnlm); reload(rnnlm_test)
th = rnnlm_test.RunEpochTester("test_toy_model")
th.setUp(); th.injectCode(run_epoch, score_dataset)
unittest.TextTestRunner(verbosity=2).run(th)

test_toy_model (rnnlm_test.RunEpochTester) ... 

[batch 155]: seen 7800 words at 7792 wps, loss = 0.922
[batch 347]: seen 17400 words at 8681 wps, loss = 0.537
[batch 550]: seen 27550 words at 9163 wps, loss = 0.390
[batch 756]: seen 37850 words at 9437 wps, loss = 0.313
Train set: avg. loss: 0.001  (perplexity: 1.00)
Test set: avg. loss: 0.002  (perplexity: 1.00)


ok

----------------------------------------------------------------------
Ran 1 test in 5.281s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

Note that as above, this is a *very* simple test case that does not guarantee model correctness.

### 2. Run Training

We'll give you the outline of the training procedure, but you'll need to fill in a call to your `run_epoch` function. 

At the end of training, we use a `tf.train.Saver` to save a copy of the model to `./tf_saved/rnnlm_trained`. You'll want to load this from disk to work on later parts of the assignment; see **part (d)** for an example of how this is done.

#### Tuning Hyperparameters
With a sampled softmax loss, the default hyperparameters should train 5 epochs in around 15 minutes on a single-core GCE instance, and reach a training set perplexity below 200.

However, it's possible to do significantly better. Try experimenting with multiple RNN layers (`num_layers` > 1) or a larger hidden state - though you may also need to adjust the learning rate and number of epochs for a larger model.

You can also experiment with a larger vocabulary. This will look worse for perplexity, but will be a better model overall as it won't treat so many words as `<unk>`.

#### Notes on Speed

To speed things up, you may want to re-start your GCE instance with more CPUs. Using a 16-core machine will train *very* quickly if using a sampled softmax lost, almost as fast as a GPU. (Because of the sequential nature of the model, GPUs aren't actually much faster than CPUs for training and running RNNs.) The training code will print the words-per-second processed; with the default settings on a single core, you can expect around 8000 WPS, or up to more than 25000 WPS on a fast multi-core machine.

You might also want to modify the code below to only run score_dataset at the very end, after all epochs are completed. This will speed things up significantly, since `score_dataset` uses the full softmax loss - and so often can take longer than a whole training epoch!

#### Submitting your model
You should submit your trained model along with the assignment. Do:
```
git add tf_saved/rnnlm_trained tf_saved/rnnlm_trained.meta
git commit -m "Adding trained model."
```
Unless you train a very large model, these files should be < 50 MB and no problem for git to handle. If you do also train a large model, please only submit the smaller one.

In [8]:
# Load the dataset
V = 10000
vocab, train_ids, test_ids = utils.load_corpus("brown", split=0.8, V=V, shuffle=42)

Loaded 57340 sentences (1.16119e+06 tokens)
Training set: 45872 sentences (924077 tokens)
Test set: 11468 sentences (237115 tokens)


In [13]:
# Training parameters
max_time = 20
batch_size = 50
learning_rate = 0.2
num_epochs = 20

# Model parameters
model_params = dict(V=vocab.size, 
                    H=200, 
                    softmax_ns=200,
                    num_layers=3)

TF_SAVEDIR = "tf_saved"
checkpoint_filename = os.path.join(TF_SAVEDIR, "rnnlm")
trained_filename = os.path.join(TF_SAVEDIR, "rnnlm_trained")

In [14]:
# Will print status every this many seconds
print_interval = 5

# Clear old log directory
shutil.rmtree("tf_summaries", ignore_errors=True)

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildTrainGraph()

# Explicitly add global initializer and variable saver to LM graph
with lm.graph.as_default():
    initializer = tf.global_variables_initializer()
    saver = tf.train.Saver()
    
# Clear old log directory
shutil.rmtree(TF_SAVEDIR, ignore_errors=True)
if not os.path.isdir(TF_SAVEDIR):
    os.makedirs(TF_SAVEDIR)

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(42)

    session.run(initializer)

    for epoch in xrange(1,num_epochs+1):
        t0_epoch = time.time()
        bi = utils.batch_generator(train_ids, batch_size, max_time)
        print "[epoch %d] Starting epoch %d" % (epoch, epoch)
        #### YOUR CODE HERE ####
        # Run a training epoch.
        run_epoch(lm, session, bi,
              train=True, verbose=True,
              tick_s=30, learning_rate=learning_rate)
        
        #### END(YOUR CODE) ####
        print "[epoch %d] Completed in %s" % (epoch, utils.pretty_timedelta(since=t0_epoch))
    
        # Save a checkpoint
        saver.save(session, checkpoint_filename, global_step=epoch)
    
        ##
        # score_dataset will run a forward pass over the entire dataset
        # and report perplexity scores. This can be slow (around 1/2 to 
        # 1/4 as long as a full epoch), so you may want to comment it out
        # to speed up training on a slow machine. Be sure to run it at the 
        # end to evaluate your score.
        print ("[epoch %d]" % epoch),
        #score_dataset(lm, session, train_ids, name="Train set")
        print ("[epoch %d]" % epoch),
        score_dataset(lm, session, test_ids, name="Test set")
        print ""
    
    # Save final model
    saver.save(session, trained_filename)

[epoch 1] Starting epoch 1
[batch 276]: seen 277000 words at 9230 wps, loss = 5.702
[batch 549]: seen 550000 words at 9159 wps, loss = 5.327
[batch 814]: seen 815000 words at 9048 wps, loss = 5.135
[epoch 1] Completed in 0:01:51
[epoch 1] [epoch 1] Test set: avg. loss: 5.872  (perplexity: 354.83)

[epoch 2] Starting epoch 2
[batch 276]: seen 277000 words at 9214 wps, loss = 4.576
[batch 562]: seen 563000 words at 9357 wps, loss = 4.551
[batch 842]: seen 843000 words at 9339 wps, loss = 4.528
[epoch 2] Completed in 0:01:48
[epoch 2] [epoch 2] Test set: avg. loss: 5.711  (perplexity: 302.06)

[epoch 3] Starting epoch 3
[batch 287]: seen 288000 words at 9575 wps, loss = 4.431
[batch 567]: seen 568000 words at 9439 wps, loss = 4.422
[batch 844]: seen 845000 words at 9369 wps, loss = 4.407
[epoch 3] Completed in 0:01:48
[epoch 3] [epoch 3] Test set: avg. loss: 5.628  (perplexity: 278.12)

[epoch 4] Starting epoch 4
[batch 283]: seen 284000 words at 9439 wps, loss = 4.347
[batch 569]: seen 5

## (d) Sampling Sentences (5 points)

If you didn't already in **part (b)**, implement the `BuildSamplerGraph()` method in `rnnlm.py` See the function docstring for more information.

#### Implement the `sample_step()` method below (5 points)
This should access the Tensors you create in `BuildSamplerGraph()`. Given an input batch and initial states, it should return a vector of shape `[batch_size,1]` containing sampled indices for the next word of each batch sequence.

Run the method using the provided code to generate 10 sentences.

In [27]:
def sample_step(lm, session, input_w, initial_h):
    """Run a single RNN step and return sampled predictions.
  
    Args:
      lm : rnnlm.RNNLM
      session: tf.Session
      input_w : [batch_size] vector of indices
      initial_h : [batch_size, hidden_dims] initial state
    
    Returns:
      final_h : final hidden state, compatible with initial_h
      samples : [batch_size, 1] vector of indices
    """
    # Reshape input to column vector
    input_w = np.array(input_w, dtype=np.int32).reshape([-1,1])
  
    #### YOUR CODE HERE ####
    # Run sample ops
    feed_dict = { lm.initial_h_ : initial_h, lm.input_w_ : input_w }
    samples, final_h = session.run([lm.pred_samples_, lm.final_h_], feed_dict)

    #### END(YOUR CODE) ####
    # Note indexing here: 
    #   [batch_size, max_time, 1] -> [batch_size, 1]
    return final_h, samples[:,-1,:]

In [28]:
# Same as above, but as a batch
max_steps = 20
num_samples = 10
random_seed = 42

lm = rnnlm.RNNLM(**model_params)
lm.BuildCoreGraph()
lm.BuildSamplerGraph()

with lm.graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=lm.graph) as session:
    # Seed RNG for repeatability
    tf.set_random_seed(random_seed)
    
    # Load the trained model
    saver.restore(session, trained_filename)

    # Make initial state for a batch with batch_size = num_samples
    w = np.repeat([[vocab.START_ID]], num_samples, axis=0)
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    # We'll take one step for each sequence on each iteration 
    for i in xrange(max_steps):
        h, y = sample_step(lm, session, w[:,-1:], h)
        w = np.hstack((w,y))

    # Print generated sentences
    for row in w:
        for i, word_id in enumerate(row):
            print vocab.id_to_word[word_id],
            if (i != 0) and (word_id == vocab.START_ID):
                break
        print ""

INFO:tensorflow:Restoring parameters from tf_saved/rnnlm_trained
<s> the doctor few to frequency impossible with ideal -- which questions or antibodies in new birth or <unk> , but 
<s> the man of his possible to gown the only of the shaken <unk> within <unk> fashionable to their unity , 
<s> they <unk> males in less planted , her business , <unk> <unk> , but the shirt time of the appendix 
<s> we had be good after the of nitrogen . </s> <s> 
<s> in an violently , my absurd <unk> speak and he am through the presence <unk> teachers . </s> <s> 
<s> the lincoln of eugene shall undoubtedly contract invariably the glasses <unk> , and the american of the <unk> of new 
<s> <unk> adoption , and for her undue of the whole of its entire hansen that most <unk> , or to 
<s> it has out the whether of his own total and people by tend by the appreciation , they will painter 
<s> a `` pan belt '' , he would gone such one of <unk> in present the state became street , 
<s> phil sure longer and her said k

## (e) Linguistic Properties (5 points)

Now that we've trained our RNNLM, let's test a few properties of the model to see how well it learns linguistic phenomena. We'll do this with a scoring task: given two or more test sentences, our model should score the more plausible (or more correct) sentence with a higher log-probability.

We'll define a scoring function to help us:

In [15]:
def score_seq(lm, session, seq, vocab):
    """Score a sequence of words. Returns total log-probability."""
    padded_ids = vocab.words_to_ids(utils.canonicalize_words(["<s>"] + seq + ["</s>"], 
                                                             wordset=vocab.word_to_id))
    w = np.reshape(padded_ids[:-1], [1,-1])
    y = np.reshape(padded_ids[1:],  [1,-1])
    h = session.run(lm.initial_h_, {lm.input_w_: w})
    feed_dict = {lm.input_w_:w,
                 lm.target_y_:y,
                 lm.initial_h_:h,
                 lm.dropout_keep_prob_: 1.0}
    # Return log(P(seq)) = -1*loss
    return -1*session.run(lm.loss_, feed_dict)

def load_and_score(inputs, sort=False):
    """Load the trained model and score the given words."""
    lm = rnnlm.RNNLM(**model_params)
    lm.BuildCoreGraph()
    
    with lm.graph.as_default():
        saver = tf.train.Saver()

    with tf.Session(graph=lm.graph) as session:  
        # Load the trained model
        saver.restore(session, trained_filename)

        if isinstance(inputs[0], str) or isinstance(inputs[0], unicode):
            inputs = [inputs]

        # Actually run scoring
        results = []
        for words in inputs:
            score = score_seq(lm, session, words, vocab)
            results.append((score, words))

        # Sort if requested
        if sort: results = sorted(results, reverse=True)

        # Print results
        for score, words in results:
            print "\"%s\" : %.02f" % (" ".join(words), score)

Now we can test as:

In [16]:
sents = ["once upon a time",
         "the quick brown fox jumps over the lazy dog"]
load_and_score([s.split() for s in sents])

INFO:tensorflow:Restoring parameters from tf_saved/rnnlm_trained
"once upon a time" : -5.91
"the quick brown fox jumps over the lazy dog" : -6.93


### 1. Number agreement

Compare **"the boy and the girl [are/is]"**. Which is more plausible according to your model?

If your model doesn't order them correctly (*this is OK*), why do you think that might be? (answer in cell below)

In [20]:
#### YOUR CODE HERE ####
sents = ["the boy and the girl are",
         "the boy and the girl is",
         "the boy and the girls are",
         "the boy and the girls is",
         "the girl and the boys are",
         "the girl and the boys is",
         "the boys are",
          "the boys is", ]
load_and_score([s.split() for s in sents])
#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from tf_saved/rnnlm_trained
"the boy and the girl are" : -5.06
"the boy and the girl is" : -4.87
"the boy and the girls are" : -5.23
"the boy and the girls is" : -5.03
"the girl and the boys are" : -5.28
"the girl and the boys is" : -5.09
"the boys are" : -5.77
"the boys is" : -5.37


#### Answer to part 1. question(s)

*For the given examples, there are cases where "and the girl is" is valid.  For example, if the "and" is connecting what could be two independent sentences that the author choose to join into one with "and" as the connector.  More perplexing is that the model prefers "is" over "are" when the preceding word is plural, even without the "and".  Perhaps, even with a hidden state size of 200, the model wasn't able to capture number agreement, or there weren't enough examples of cases with "is" and "are".*

### 2. Type/semantic agreement

Compare:
- **"peanuts are my favorite kind of [nut/vegetable]"**
- **"when I'm hungry I really prefer to [eat/drink]"**

Of each pair, which is more plausible according to your model?

How would you expect a 3-gram language model to perform at this example? How about a 5-gram model? (answer in cell below)

In [18]:
#### YOUR CODE HERE ####
sents = ["peanuts are my favorite kind of nut",
         "peanuts are my favorite kind of vegetable",
         "when I'm hungry I really prefer to eat",
         "when I'm hungry I really prefer to drink",]
load_and_score([s.split() for s in sents])


#### END(YOUR CODE) ####

INFO:tensorflow:Restoring parameters from tf_saved/rnnlm_trained
"peanuts are my favorite kind of nut" : -6.43
"peanuts are my favorite kind of vegetable" : -6.51
"when I'm hungry I really prefer to eat" : -7.21
"when I'm hungry I really prefer to drink" : -7.32


#### Answer to part 2. question(s)

*For the peanut example, neither a 3-gram or 5-gram model have a large enough window to be able to select the correct response for the last word.  Both models would be just as likely to choose "book" or "movie" as the selection.  For the hunger eample, the 5-gram model would have just large enough of a window to see both "hungry" and "eat/drink" to distinguish between the two.*

### 3. Adjective ordering (just for fun)

Let's repeat the exercise from Week 2:

![Adjective Order](adjective_order.jpg)
*source: https://twitter.com/MattAndersonBBC/status/772002757222002688?lang=en*

We'll consider a toy example (literally), and consider all possible adjective permutations.

Note that this is somewhat sensitive to training, and even a good language model might not get it all correct. Why might the NN fail, if the trigram model from Week 2 was able to solve it?

In [19]:
prefix = "I have lots of".split()
noun = "toys"
adjectives = ["square", "green", "plastic"]
inputs = []
for adjs in itertools.permutations(adjectives):
    words = prefix + list(adjs) + [noun]
    inputs.append(words)
    
load_and_score(inputs, sort=True)

INFO:tensorflow:Restoring parameters from tf_saved/rnnlm_trained
"I have lots of plastic square green toys" : -7.73
"I have lots of plastic green square toys" : -7.74
"I have lots of green plastic square toys" : -7.76
"I have lots of green square plastic toys" : -7.77
"I have lots of square plastic green toys" : -7.86
"I have lots of square green plastic toys" : -7.87


#### Answer to part 3. question(s)

*In this case, the model gets it completely wrong.  An earlier version of the NN model did get it correct.  This shows how sensitive this case is.  The hidden state might not be large enough to capture the nuances of this case.  The n-gram models would be able perform better if it had seen examples in the corpus of "square green" and "green plastic".  Then it would be able to choose the correct order.  Though sparsity is an issue.  It would take an enormous corpus to capture all the combinations of all of the different possible adjectives.*