Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train an LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [418]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from pprint import pprint
from six.moves import range
from six.moves.urllib.request import urlretrieve

In [419]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [420]:
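def read_data(filename):
  """Return the contents of the first file in the zip archive as a string."""
  # The context manager closes the archive; the original f.close() sat after
  # the return statement and was never reached.
  with zipfile.ZipFile(filename) as f:
    for name in f.namelist():
      return tf.compat.as_str(f.read(name))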
  
text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [421]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [422]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0
  
def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  
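
As a quick check of the mapping above, the sketch below round-trips a short string through the two functions. The helpers `encode` and `decode` are hypothetical names introduced only for this illustration, and `char2id` is re-declared here without the warning print so the snippet runs on its own.

```python
import string

# Re-declared so the sketch is self-contained; same mapping as the cell above.
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    return 0  # space and any unexpected character share id 0

def id2char(dictid):
    return chr(dictid + first_letter - 1) if dictid > 0 else ' '

def encode(text):
    """Hypothetical helper: map a string to a list of vocabulary IDs."""
    return [char2id(c) for c in text]

def decode(ids):
    """Hypothetical helper: map a list of vocabulary IDs back to a string."""
    return ''.join(id2char(i) for i in ids)

ids = encode('hello world')
print(ids)
print(decode(ids))
```

Because unexpected characters collapse to ID 0, the round trip is lossless only for text made of `[a-z]` and spaces, which is what Text8 contains.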


Function to generate a training batch for the LSTM model.

In [423]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()
  
  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch
  
  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [424]:
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    '''Takes in a character and returns an integer id number.'''
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0
    
def id2char(dictid):
    '''Takes the integer id number of a character and returns the matching character.'''
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))

Unexpected character: ï
1 26 0 0
a z  


In [425]:
def logprob(predictions, labels):
  """Mean negative log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]
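
To make the link between `logprob` and perplexity concrete, here is a minimal sketch. It assumes perplexity is computed as `exp(logprob(...))`, as in the training loop of this notebook, and re-declares `logprob` in a variant that does not modify its argument in place. The toy `uniform` and `confident` predictions are made up for illustration.

```python
import numpy as np

def logprob(predictions, labels):
    """Mean negative log-probability of the true labels (same formula as
    above), written without modifying `predictions` in place."""
    clipped = np.maximum(predictions, 1e-10)
    return np.sum(labels * -np.log(clipped)) / labels.shape[0]

vocabulary_size = 27

# One-hot labels for three arbitrary characters (ids 1, 5 and 0).
labels = np.eye(vocabulary_size)[[1, 5, 0]]

# A model that knows nothing: uniform probability over all 27 characters.
uniform = np.full((3, vocabulary_size), 1.0 / vocabulary_size)
print(np.exp(logprob(uniform, labels)))    # perplexity close to 27

# A model that puts 90% of its mass on the correct character.
confident = np.full((3, vocabulary_size), 0.1 / (vocabulary_size - 1))
confident[labels == 1] = 0.9
print(np.exp(logprob(confident, labels)))  # perplexity close to 1 / 0.9
```

A uniform guess gives a perplexity equal to the vocabulary size, which is consistent with the minibatch perplexity of about 27 logged at step 0 of the training run.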

Simple LSTM Model.

In [426]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))
  
  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

  # Predictions.
  train_prediction = tf.nn.softmax(logits)
  
  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
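
The optimizer section above combines a staircase learning-rate decay with gradient clipping by global norm. The sketch below restates both in plain Python, following the formulas as I understand them from the TensorFlow documentation. The functions `staircase_decay` and `clip_by_global_norm` are stand-ins written for this illustration, not the TensorFlow implementations.

```python
import math

def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate after `step` updates with staircase exponential decay:
    the rate is multiplied by decay_rate once every decay_steps steps."""
    return initial_rate * decay_rate ** (step // decay_steps)

def clip_by_global_norm(gradients, clip_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm is at
    most clip_norm. Gradients already below the limit are left unchanged."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in gradients], global_norm

# Same settings as the graph above: start at 10.0, divide by 10 every 5000 steps.
for step in (0, 4999, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))

# Two toy gradient vectors with a joint norm of 5.0, clipped to 1.25.
clipped, norm = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], 1.25)
print(norm, clipped)
```

With `num_steps = 7001`, the schedule drops the rate once, at step 5000. Clipping rescales all gradients by the same factor, so their direction is preserved.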

In [11]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.295486 learning rate: 10.000000
Minibatch perplexity: 26.99
z ikephskna omtmytewgrctoionopijm v  ler mscyrdtacngqvtejhgul bgtmzjlbgbnevltg  
w ame ljqazfekrz b ahyepppuazk gn ebn ldlarfol sidgx dyassblerjsiqtderishvf japx
gslorh qofjoa vnftfqjnrl b lrlujaiae nnq  ieecle tuiinqurfum witxa lrl fuix unzz
bt ynxs eq qvtizlpezefanelnevhahrrlprwnlcmdndybvgprtzrnmqmntz nhg  filu kp a lul
vxcxjirksdgnv epw i  qjkx i kr zahlggicuuaio tperm yhio gk ninkegeonpaqlatge er 
Validation set perplexity: 20.28
Average loss at step 100: 2.598224 learning rate: 10.000000
Minibatch perplexity: 11.10
Validation set perplexity: 10.49
Average loss at step 200: 2.265844 learning rate: 10.000000
Minibatch perplexity: 8.73
Validation set perplexity: 8.61
Average loss at step 300: 2.103437 learning rate: 10.000000
Minibatch perplexity: 7.56
Validation set perplexity: 7.86
Average loss at step 400: 2.004244 learning rate: 10.000000
Minibatch perplexity: 7.45
Validation set per

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

### Problem 1: Scratch

The cells in this section are copied-and-pasted versions of the cells above that I annotated as a way of figuring out how things work. My solution is in the next section.

In [165]:
# Annotated stats funcs

def logprob(predictions, labels):
    """Mean negative log-probability of the true labels in a predicted batch."""
    
    '''
    Predictions and labels here are matching (n, vocabulary_size) matrices
    where each row is a character vector. Because each label row is one-hot,
    the product below keeps only -log(p) for the true character, so the
    result is smaller when the predictions agree with the labels.
    '''
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normalized
    probabilities.
    """
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]

In [166]:
# Annotated batch generation

batch_size = 64
num_unrollings = 10

class BatchGenerator(object):
    
    '''
    A batch in this context is a (batch_size, vocabulary_size) tensor. Each row of
    the batch represents a single character, and the character in each row is drawn
    from a different segment of the text. When you generate another batch, the
    character in each row shifts one position to the right. It's important to see
    that the rows of a batch don't represent sequential characters.
    
    For example, if the text started out 'apples are nice', then the character at
    the 0th index of the first batch would be 'a'. The character at the 0th index
    of the second batch would be 'p', then 'p' again in the third batch, and so on.
    
    Also note that the name of the class, 'BatchGenerator', is misleading. It actually
    returns an 'unrolling', which is a list of num_unrollings + 1 batches.
    The first batch in the unrolling is the last batch in the previous unrolling.
    '''
    
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        # The initial cursor is a list of the first indices of each of the
        # segments.
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
    
    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data.
        """
        # He creates a matrix of zeros with one row per encoded character.
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
        # Here he walks through the rows and inserts a one into the char_id'th
        # position of the one-hot encoding vector for each character.
        for b in range(self._batch_size):
            char_id = char2id(self._text[self._cursor[b]])
            batch[b, char_id] = 1.0
            # Here he adds one to the cursor for this row. The modulo is
            # presumably there so the cursor wraps around when it goes past
            # the end of the text.
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
    
    def next(self):
        """This returns a list of batches, aka an 'unrolling'."""
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

def characters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

# print(batches2string(train_batches.next()))
# print(batches2string(train_batches.next()))
# print(batches2string(valid_batches.next()))
# print(batches2string(valid_batches.next()))
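
The cursor layout described in the annotations above can be seen on a toy text. This sketch uses plain characters instead of one-hot vectors; `segment_cursors` and `next_chars` are hypothetical helpers that mirror `__init__` and `_next_batch`.

```python
def segment_cursors(text_size, batch_size):
    """Starting index of each row's segment, as in BatchGenerator.__init__."""
    segment = text_size // batch_size
    return [offset * segment for offset in range(batch_size)]

def next_chars(text, cursors):
    """Read one character per row, then advance every cursor by one,
    wrapping around at the end of the text."""
    chars = [text[c] for c in cursors]
    cursors[:] = [(c + 1) % len(text) for c in cursors]
    return chars

text = 'apples are nice'                             # 15 characters
cursors = segment_cursors(len(text), batch_size=3)   # [0, 5, 10]
batches = [next_chars(text, cursors) for _ in range(4)]
print(batches)

# Reading the same row index across successive batches recovers contiguous text.
rows = [''.join(batch[r] for batch in batches) for r in range(3)]
print(rows)
```

Each batch holds one character from each of the three segments, and only by following one row across batches do you get sequential characters. This is what `batches2string` does.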

In [None]:
# Annotated model

num_nodes = 128

graph = tf.Graph()
with graph.as_default():
    
    # Parameters
    
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Memory cell: input, state and bias.                                                         
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
        
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""        
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        # Update is the heart of the cell. It's the same calculation you would do
        # even if you weren't using an LSTM. The difference here is that, instead of
        # outputting the result, you add it to a persistent state that is shared by all
        # unrollings.
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        # The state is the shared memory that runs through all unrollings. The 
        # input and forget gates allow this cell to delete old data and store
        # parts of the data generated in the update step.
        state = (forget_gate * state) + (input_gate * tf.tanh(update))
        # The output gate determines what parts of the state will be used as output 
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        output = output_gate * tf.tanh(state)
        return output, state

    # Input data
    
    '''
    train_data: A list of num_unrollings + 1 batch matrices used to
    construct train_inputs and train_labels.
    
    train_inputs: A list of num_unrollings batch matrices, each of which is
    fed into a cell as input.
    
    train_labels: train_labels[n] contains the desired output for the cell
    given train_inputs[n] as input.
    '''
    
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]    # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    
    '''
    outputs: The direct outputs of the LSTM cells, one for each cell
    in the unrolling. Each of these is a (batch_size, num_nodes) tensor.
    
    tf.concat(0, outputs) takes a list of (batch_size, num_nodes) tensors
    and combines them into a single (batch_size * num_unrollings, num_nodes)
    tensor. The 0 says "take all the rows and join them together into a single
    tall matrix". If it had been 1, it would have meant "take all the columns
    and join them into a single wide matrix".
    '''
    
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    
    '''    
    He needs to save 'output' and 'state' so that he can use them in the next
    set of unrollings. He's created TF variables to hold them, but you can't assign
    TF variables with an equal sign. You have to use the TF operation,
    'Variable.assign'.
    
    Where it gets weird is that the result of 'saved_output.assign(output)' is
    actually a tensor, and you have to evaluate that tensor for the assignment
    to take place.
    
    'control_dependencies' evaluates the tensors passed into it before those
    declared in its context. So in effect, this block tells TensorFlow to do
    the assignments before it does the final projection from node space back to
    character space.
    
    This is a little misleading because none of those operations change the
    value of 'outputs', and none of the tensors declared below depend on the value
    of the tensors returned by the assignments. This is just a convenient place
    to do it because you know that before TF can eval the tensors below, 'output'
    and 'state' must hold the correct values.
    '''
    
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        
        '''
        At this point, the output of each cell still has num_nodes
        dimensions. He uses a final NN layer to project it back down into
        character space.
        
        I've de-nested some of the variables as compared with the original, but
        it's still a bit hard to follow because instead of projecting the output
        of each cell individually, he does them all at once with a single matrix
        multiply.
        
        So, 'tf.concat(0, outputs)' takes a list of (batch_size, num_nodes) tensors
        and combines the list into a single (batch_size * num_unrollings, num_nodes)
        tensor. The 0 says "take all the rows and join them together into a single
        tall matrix" (if it had been 1, it would have meant "take all the columns
        and join them into a single wide matrix").
        
        The same thing happens to the labels.
        '''
        
        combined_outputs = tf.concat(0, outputs)
        combined_labels = tf.concat(0, train_labels)
        logits = tf.matmul(combined_outputs, w) + b
        tk_error = tf.nn.softmax_cross_entropy_with_logits(logits, combined_labels)
        loss = tf.reduce_mean(tk_error)
    
    # Optimizer.
    
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions.
    
    '''
    Here he sets up another set of ops in the same graph to do the test
    predictions. These are done one character at a time by repeatedly evaluating
    the sample_prediction tensor.
    
    There are no unrollings, but the state and output are maintained between calls
    so it's still taking the past into consideration. I think you only unroll when
    you optimize so that the weight adjustments in a given step can be optimized for
    their effects over time.
    '''
    
    train_prediction = tf.nn.softmax(logits)
    
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    
    sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
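
The effect of concatenating the per-step outputs along axis 0 versus axis 1, as discussed in the annotations above, can be checked with NumPy. This is a sketch with made-up toy sizes; `np.concatenate` stands in for `tf.concat`, and the zero-valued classifier matrix `w` is a placeholder used only to show the resulting shape.

```python
import numpy as np

batch_size, num_nodes, num_unrollings, vocabulary_size = 2, 3, 4, 27

# One (batch_size, num_nodes) array per unrolled step, filled with the step
# index so that the stacking order is visible.
outputs = [np.full((batch_size, num_nodes), step)
           for step in range(num_unrollings)]

tall = np.concatenate(outputs, axis=0)   # analogous to tf.concat(0, outputs)
wide = np.concatenate(outputs, axis=1)   # what axis 1 would have produced

print(tall.shape)    # (8, 3)
print(wide.shape)    # (2, 12)
print(tall[:, 0])    # [0 0 1 1 2 2 3 3]

# A single matrix multiply then projects every step's output at once.
w = np.zeros((num_nodes, vocabulary_size))
logits = tall.dot(w)
print(logits.shape)  # (8, 27)
```

The rows of the tall matrix are grouped by time step. Since the labels are concatenated the same way, row `k` of the logits lines up with row `k` of the labels.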

In [None]:
# Annotated run

num_steps = 1
summary_frequency = 10

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    for step in range(num_steps):    
        # Set up fetches
        batches = train_batches.next()
        print(len(batches))
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            # train_data is a list of placeholder tensors, so he associates
            # each of those placeholders with a batch in the unrolling.
            feed_dict[train_data[i]] = batches[i]
        # Run the session    
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], 
            feed_dict=feed_dict)
        mean_loss += l  # accumulated here, averaged every summary_frequency steps below
        
        # Log status
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                # Remember here that the validation unrollings are shaped differently. They
                # only contain two characters: b[0] is the input character and b[1] is the
                # character that follows it, which serves as the label. Specifically, b is a
                # normal Python list which contains two (1, vocabulary_size) character vectors.
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))

import os
os.system('say "Training complete."')

### Problem 1: Solution

In [427]:
def lstm_cell(x, o, state, x_weights, x_bias, mem_weights, mem_bias, keep_prob=1):
    """
    Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates.

    Args:
        x (Tensor): A Tensor with shape (batch_size, input_size).
        o (Tensor): A Tensor with shape (batch_size, num_nodes).
        state (Tensor): A Tensor with shape (batch_size, num_nodes).
        x_weights (Tensor): A Tensor with shape (input_size, num_nodes * 4).
        x_bias (Tensor): A Tensor with shape (1, num_nodes * 4).
        mem_weights (Tensor): A Tensor with shape (num_nodes, num_nodes * 4).
        mem_bias (Tensor): A Tensor with shape (1, num_nodes * 4).
        keep_prob (float): Not used by this cell.
    
    Returns:
        tuple: A tuple containing Tensors representing a cell's output and state.
    
    """

    x = tf.matmul(x, x_weights) + x_bias
    mem = tf.matmul(o, mem_weights) + mem_bias

    x_input = x[:, :num_nodes]
    x_forget = x[:, num_nodes:num_nodes * 2]
    x_update = x[:, num_nodes * 2:num_nodes * 3]
    x_output = x[:, num_nodes * 3:num_nodes * 4]

    mem_input = mem[:, :num_nodes]
    mem_forget = mem[:, num_nodes:num_nodes * 2]
    mem_update = mem[:, num_nodes * 2:num_nodes * 3]
    mem_output = mem[:, num_nodes * 3:num_nodes * 4]

    input_gate = tf.sigmoid(x_input + mem_input)
    forget_gate = tf.sigmoid(x_forget + mem_forget)
    update = tf.tanh(x_update + mem_update)
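    # Note: unlike the standard LSTM, this formulation squashes the stored cell
    # state itself with tanh, rather than applying tanh only when computing the output.
    state = tf.tanh((forget_gate * state) + (input_gate * update))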
    output_gate = tf.sigmoid(x_output + mem_output)
    output = output_gate * state
    return output, state
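
As a sanity check on the shapes, the fused-weight layout above can be mirrored in plain NumPy. This is a minimal sketch, not the notebook's TensorFlow code: `np_lstm_cell` and the toy sizes are assumptions made for illustration. It follows the same gate order (input, forget, update, output) and the same tanh-squashed state as `lstm_cell`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def np_lstm_cell(x, o, state, x_weights, x_bias, mem_weights, mem_bias):
    """NumPy mirror of lstm_cell above (hypothetical helper, for shape checks)."""
    # One fused projection per source; the columns hold the four gates side by side.
    gates = x.dot(x_weights) + x_bias + o.dot(mem_weights) + mem_bias
    i, f, u, out = np.split(gates, 4, axis=1)
    input_gate = sigmoid(i)
    forget_gate = sigmoid(f)
    update = np.tanh(u)
    # Same formulation as the notebook: the stored state is itself squashed.
    new_state = np.tanh(forget_gate * state + input_gate * update)
    output = sigmoid(out) * new_state
    return output, new_state

rng = np.random.RandomState(0)
batch_size, input_size, num_nodes = 4, 27, 8   # toy sizes, assumed
x = rng.rand(batch_size, input_size)
o = np.zeros((batch_size, num_nodes))
state = np.zeros((batch_size, num_nodes))
x_weights = rng.normal(0, 0.1, (input_size, num_nodes * 4))
mem_weights = rng.normal(0, 0.1, (num_nodes, num_nodes * 4))
x_bias = np.zeros((1, num_nodes * 4))
mem_bias = np.zeros((1, num_nodes * 4))

output, new_state = np_lstm_cell(x, o, state, x_weights, x_bias, mem_weights, mem_bias)
print(output.shape, new_state.shape)
```

Both returned arrays have shape `(batch_size, num_nodes)`, which is what lets the output be fed back in as `o` on the next step.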

In [428]:
# Optimized model
num_nodes = 128

graph = tf.Graph()
with graph.as_default():
    
    # Parameters
    
    x_weights = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes * 4], -0.1, 0.1))
    x_bias = tf.Variable(tf.zeros([1, num_nodes * 4]))
    
    mem_weights = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1))
    mem_bias = tf.Variable(tf.zeros([1, num_nodes * 4]))
        
    # Variables saving state across unrollings.
    
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases.
    
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Input data
    
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]    # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state, x_weights, x_bias, mem_weights, mem_bias)
        outputs.append(output)

    # State saving across unrollings
    
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        
        combined_outputs = tf.concat(0, outputs)
        combined_labels = tf.concat(0, train_labels)
        logits = tf.matmul(combined_outputs, w) + b
        tk_error = tf.nn.softmax_cross_entropy_with_logits(logits, combined_labels)
        loss = tf.reduce_mean(tk_error)
    
    # Optimizer
    
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions
    
    train_prediction = tf.nn.softmax(logits)
    
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    
    sample_output, sample_state = lstm_cell(sample_input, 
                                        saved_sample_output, 
                                        saved_sample_state, 
                                        x_weights, 
                                        x_bias, 
                                        mem_weights, 
                                        mem_bias)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
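
The optimizer in the cell above combines a staircase learning-rate decay with clipping by global norm. The sketch below restates both in NumPy, as I understand their documented behavior. The names `clip_by_global_norm_np` and `staircase_decay` are hypothetical helpers, not TensorFlow functions.

```python
import numpy as np

def clip_by_global_norm_np(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the joint norm is already small enough, the scale is exactly 1.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

def staircase_decay(base_rate, step, decay_steps, decay_rate):
    """Piecewise-constant exponential decay: the rate drops every decay_steps steps."""
    return base_rate * decay_rate ** (step // decay_steps)

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm is 13
clipped, norm = clip_by_global_norm_np(grads, 1.25)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)
print(staircase_decay(10.0, 4999, 5000, 0.1), staircase_decay(10.0, 5000, 5000, 0.1))
```

Because every gradient is scaled by the same factor, clipping changes the step length but not its direction. With the settings used above, the learning rate stays at 10.0 until step 5000 and then drops by a factor of ten.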

In [430]:
# Run model

num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    for step in range(num_steps):    
        # Set up fetches
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        
        # Run the session    
        output_requested = [optimizer, 
                            loss, 
                            train_prediction, 
                            learning_rate, 
                            combined_outputs, 
                            combined_labels]
        output = session.run(output_requested, feed_dict=feed_dict)
        output_loss = output[1]
        predictions = output[2]
        lr = output[3]
        co = output[4]
        cl = output[5]
        
        mean_loss += output_loss
        
        # Log status
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                # Remember here that the validation unrollings are shaped differently. They
                # only contain two characters, the current character and the previous one.
                # And they only have one batch. Specifically, it's a normal Python list which
                # contains two character vectors.
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))

import os
os.system('say "Training complete."')

Initialized
Average loss at step 0: 3.299650 learning rate: 10.000000
Minibatch perplexity: 27.10
ptndpsxijfp pu orkiefyia mq dssrstyf boin  dcztnelkciinhq kss  aqatw ssinqucucxe
omwotqsskoa irniltbkppbsznww dsfkik zwtnq ijua s tfkb ncjjgorte ismgyf jaa   rea
m hbt wod ysweqharkaayx pcbmiw hvpwofyasg   sw  eeqt hu cm iribt  ccbid ujij yon
nybd caliwsaeaqwbllveao grws l vefsa eso  lto qmzeeg a  iwfjudpdo tamtpufyavvcy 
zbhh   su o  qqevdaradnscrkek erevoivdx c ls tuyhkf r izwdbievyn ba ul dtejiyjdo
Validation set perplexity: 20.27
Average loss at step 100: 2.658969 learning rate: 10.000000
Minibatch perplexity: 11.34
Validation set perplexity: 10.79
Average loss at step 200: 2.335114 learning rate: 10.000000
Minibatch perplexity: 8.88
Validation set perplexity: 9.62
Average loss at step 300: 2.159666 learning rate: 10.000000
Minibatch perplexity: 8.49
Validation set perplexity: 9.11
Average loss at step 400: 2.054146 learning rate: 10.000000
Minibatch perplexity: 6.91
Validation set per

0
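
To put the perplexity numbers in context, here is a minimal sketch of the computation. It assumes `logprob` returns the average negative log-probability of the true labels; `mean_neg_log_prob` is a hypothetical stand-in written for this example.

```python
import numpy as np

def mean_neg_log_prob(predictions, labels):
    """Average negative log-probability assigned to the true (one-hot) labels."""
    predictions = np.clip(predictions, 1e-10, 1.0)   # avoid log(0)
    return np.sum(labels * -np.log(predictions)) / labels.shape[0]

# A model that guesses uniformly over 27 characters.
uniform = np.full((5, 27), 1.0 / 27)
labels = np.eye(27)[[0, 3, 7, 26, 1]]
perplexity = float(np.exp(mean_neg_log_prob(uniform, labels)))
print('%.2f' % perplexity)
```

A uniform guess over 27 characters gives a perplexity of 27, which is close to the minibatch perplexity of 27.10 logged at step 0. Lower values mean the model is, on average, choosing among fewer plausible next characters.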

---
Problem 2
---------

We want to train an LSTM over bigrams, that is, pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is computationally wasteful.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---
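
To make the motivation concrete: multiplying a 1-hot vector by a weight matrix just selects one row, so an embedding lookup gives the same result without the wasted arithmetic. A minimal NumPy sketch, with random values standing in for trained embeddings and an arbitrary bigram id:

```python
import numpy as np

vocab_size = 27 * 27        # 729 possible bigrams over [a-z] plus space
embedding_size = 16         # assumed size, for illustration
rng = np.random.RandomState(0)
embeddings = rng.uniform(-1.0, 1.0, (vocab_size, embedding_size))

bigram_id = 42              # arbitrary id, for illustration
one_hot = np.zeros(vocab_size)
one_hot[bigram_id] = 1.0

via_matmul = one_hot.dot(embeddings)   # 729 multiply-adds per output value
via_lookup = embeddings[bigram_id]     # a single row read
print(np.allclose(via_matmul, via_lookup))
```

The lookup also shrinks the LSTM's input from 729 values to 16, which reduces the size of the input weight matrix accordingly.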

### Part A: Generating the embeddings

In [431]:
# Split the whole text up into an array of bigrams

bigrams = [text[i * 2:i * 2 + 2] for i in range(len(text) // 2)]
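
Note that this split produces non-overlapping bigrams, and a trailing odd character is dropped. A small example on a short string (the sample text is taken from the start of the validation set printed earlier):

```python
sample_text = ' anarchism originated'   # 21 characters
sample_bigrams = [sample_text[i * 2:i * 2 + 2] for i in range(len(sample_text) // 2)]
print(sample_bigrams)
# [' a', 'na', 'rc', 'hi', 'sm', ' o', 'ri', 'gi', 'na', 'te']
```

The final 'd' does not appear because 21 characters only yield 10 complete pairs.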

In [102]:
# Further divide those into training and validation sets

validation_size = 1000
valid_bigrams = bigrams[:validation_size]
train_size = len(bigrams) - validation_size
train_bigrams = bigrams[validation_size:]

In [103]:
# Create a dictionary to map bigrams to an id number

dictionary = dict()
possible_chars = string.ascii_lowercase + ' '
for c1 in possible_chars:
    for c2 in possible_chars:
        bigram = c1 + c2
        if bigram not in dictionary:
            dictionary[bigram] = len(dictionary)
reverse_dictionary = {v: k for k, v in dictionary.items()}
bigram_vocab_size = len(dictionary)
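
Because the two loops visit the characters in a fixed order, each id is simply `index(c1) * 27 + index(c2)`. The sketch below rebuilds the same mapping and checks it against that closed form; `bigram_id_closed_form` is a hypothetical helper added for illustration.

```python
import string

possible_chars = string.ascii_lowercase + ' '
dictionary = {}
for c1 in possible_chars:
    for c2 in possible_chars:
        dictionary[c1 + c2] = len(dictionary)

def bigram_id_closed_form(bigram):
    """Same id as the dictionary, computed from the character positions."""
    n = len(possible_chars)
    return possible_chars.index(bigram[0]) * n + possible_chars.index(bigram[1])

print(len(dictionary), dictionary['aa'], dictionary['ab'], dictionary['ba'], dictionary['  '])
# 729 0 1 27 728
```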

In [104]:
# Bigram utils

def bigram_to_id(bigram):
    return dictionary[bigram]

def id_to_bigram(id):
    return reverse_dictionary[id]

In [105]:
# Skip-gram embedding generator

class EmbeddingBatchGenerator(object):
    
    def __init__(self, batch_size, skip_window):
        assert batch_size % (skip_window * 2) == 0
        self.batch_size = batch_size
        self.skip_window = skip_window
        self.text_index = skip_window
        self.num_skips = skip_window * 2
    
    def next(self):
        skips = []
        labels = []

        for _ in range(self.batch_size // self.num_skips):
            for i in range(self.skip_window):
                # Pair the centre bigram with its neighbours at distance i + 1
                # on the left and on the right.
                skips.append(bigrams[self.text_index])
                labels.append(bigrams[self.text_index - (i + 1)])
                skips.append(bigrams[self.text_index])
                labels.append(bigrams[self.text_index + (i + 1)])
            self.text_index += 1
        
        output_skips = np.ndarray(shape=(self.batch_size), dtype=np.int32)
        output_labels = np.ndarray(shape=(self.batch_size, 1), dtype=np.int32)
        
        for i in range(self.batch_size):
            output_skips[i] = bigram_to_id(skips[i])
            output_labels[i, 0] = bigram_to_id(labels[i])
        
        return output_skips, output_labels
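
With `skip_window = 1`, the generator pairs each bigram with its left and right neighbours. The sketch below shows that pairing on a toy list and generalises it to a wider window. `skipgram_pairs` is a hypothetical helper, not part of the notebook.

```python
def skipgram_pairs(tokens, center, skip_window):
    """Return (input, label) pairs for one centre position."""
    pairs = []
    for offset in range(1, skip_window + 1):
        pairs.append((tokens[center], tokens[center - offset]))   # left neighbour
        pairs.append((tokens[center], tokens[center + offset]))   # right neighbour
    return pairs

tokens = [' a', 'na', 'rc', 'hi', 'sm']
print(skipgram_pairs(tokens, center=2, skip_window=1))
# [('rc', 'na'), ('rc', 'hi')]
print(skipgram_pairs(tokens, center=2, skip_window=2))
# [('rc', 'na'), ('rc', 'hi'), ('rc', ' a'), ('rc', 'sm')]
```

Each centre position contributes `2 * skip_window` pairs, which is why the generator asserts that `batch_size` is a multiple of that number.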

In [106]:
# Embedding training model

import math  # needed for math.sqrt in the softmax weight initializer

graph = tf.Graph()
batch_size = 128
embedding_size = 16
skip_window = 1 
num_skips = skip_window * 2
vocabulary_size = len(dictionary)
num_sampled = 64

with graph.as_default(), tf.device('/cpu:0'):
    
    # Input placeholders
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    
    # Variables
    embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, 
                                                       embedding_size], 
                                                       stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    
    # Loss
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, 
                                                     softmax_biases, 
                                                     embed, 
                                                     train_labels, 
                                                     num_sampled, 
                                                     vocabulary_size))
    
    # Optimizer
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    
    # Final embeddings
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm 

In [107]:
# Embedding training session

gen = EmbeddingBatchGenerator(batch_size, skip_window)
num_steps = 100001
log_freq = 2000

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = gen.next()
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        # Logging
        if step % log_freq == 0:
            if step > 0:
                average_loss = average_loss / log_freq
            print('Average loss at step {}: {}'.format(step, average_loss))
            average_loss = 0
    
    final_embeddings = normalized_embeddings.eval()

Initialized
Average loss at step 0: 3.82160639763
Average loss at step 2000: 2.10316807246
Average loss at step 4000: 1.84800698766
Average loss at step 6000: 1.83102092531
Average loss at step 8000: 1.81702862722
Average loss at step 10000: 1.84194074395
Average loss at step 12000: 1.812926651
Average loss at step 14000: 1.76442282325
Average loss at step 16000: 1.79538573244
Average loss at step 18000: 1.82905118611
Average loss at step 20000: 1.78603089705
Average loss at step 22000: 1.74286911315
Average loss at step 24000: 1.81091421971
Average loss at step 26000: 1.71641250476
Average loss at step 28000: 1.79508699706
Average loss at step 30000: 1.72991965634
Average loss at step 32000: 1.79332515079
Average loss at step 34000: 1.77370179448
Average loss at step 36000: 1.76807139125
Average loss at step 38000: 1.7781546174
Average loss at step 40000: 1.82313803336
Average loss at step 42000: 1.78247744182
Average loss at step 44000: 1.72845778793
Average loss at step 46000: 1.806

In [117]:
# Various embedding to/from one-hot utils

def embed_to_id(embed):
    cosine_sims = np.dot(final_embeddings, embed)
    return np.argmax(cosine_sims)

def id_to_embed(id):
    return final_embeddings[id]

def bigram_to_embed(bigram):
    return id_to_embed(bigram_to_id(bigram))

def embed_to_bigram(embed):
    return id_to_bigram(embed_to_id(embed))

def one_hot_to_id(probs):
    return np.argmax(probs)

def one_hot_to_bigram(probs):
    return id_to_bigram(one_hot_to_id(probs))

def one_hot_to_embed(probs):
    return id_to_embed(one_hot_to_id(probs))
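
`embed_to_id` works because the final embeddings are normalized to unit length, so a dot product against every row is a cosine similarity and each row is most similar to itself. A minimal sketch of that round trip, using random vectors as a stand-in for the trained embeddings (`nearest_id` is a hypothetical name):

```python
import numpy as np

rng = np.random.RandomState(0)
raw = rng.uniform(-1.0, 1.0, (729, 16))                  # stand-in for trained embeddings
norm = np.sqrt(np.sum(np.square(raw), axis=1, keepdims=True))
unit_embeddings = raw / norm                              # every row has length 1

def nearest_id(embed):
    # With unit-length rows, the dot product is the cosine similarity.
    return int(np.argmax(unit_embeddings.dot(embed)))

recovered = [nearest_id(unit_embeddings[i]) for i in range(729)]
print(recovered == list(range(729)))
```

The round trip is only exact for vectors that are actual rows of the table. For an arbitrary vector, the function returns the closest bigram by angle.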

### Part B: Putting them into an LSTM

In [124]:
# A bigram aware batch generator

batch_size = 64
num_unrollings = 10

class LSTMBatchGenerator(object):
    """
    Generates one-hot batches for the bigram LSTM.
    
    This is basically the same as the earlier version, but works
    on the bigram and embedding dictionaries declared above. 
    """
    
    def __init__(self, bigrams, batch_size, num_unrollings):
        self._text_size = len(bigrams)
        self._batch_size = batch_size
        self._bigrams = bigrams
        self._num_unrollings = num_unrollings
        self.reset()
    
    def reset(self):
        segment_size = self._text_size // self._batch_size
        self._cursor = [offset * segment_size for offset in range(self._batch_size)]
        self._last_batch = self._next_batch()
    
    def _next_batch(self):
        batch = np.zeros(shape=(self._batch_size, bigram_vocab_size), dtype=np.float64)
        for b in range(self._batch_size):
            bigram_id = bigram_to_id(self._bigrams[self._cursor[b]])
            batch[b, bigram_id] = 1
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
    
    def next(self):
        """This returns a list of batches, aka an 'unrolling'."""
        unrolling = [self._last_batch]
        for step in range(self._num_unrollings):
            unrolling.append(self._next_batch())
        self._last_batch = unrolling[-1]
        return unrolling
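
The cursor logic is easier to see on small numbers. The sketch below tracks only the text positions each batch row reads, not the one-hot encoding. Both helper functions are hypothetical and written for this illustration.

```python
def segment_cursors(text_size, batch_size):
    """Starting position of each batch row, evenly spaced through the text."""
    segment_size = text_size // batch_size
    return [offset * segment_size for offset in range(batch_size)]

def first_unrolling_positions(text_size, batch_size, num_unrollings):
    """Positions read by the first call to next(): num_unrollings + 1 steps per row."""
    cursors = segment_cursors(text_size, batch_size)
    return [[(c + step) % text_size for c in cursors]
            for step in range(num_unrollings + 1)]

print(segment_cursors(text_size=20, batch_size=4))
# [0, 5, 10, 15]
print(first_unrolling_positions(text_size=20, batch_size=4, num_unrollings=2))
# [[0, 5, 10, 15], [1, 6, 11, 16], [2, 7, 12, 17]]
```

Each row walks through its own segment of the text. The last batch of one unrolling is reused as the first batch of the next, so every input has a label one step ahead.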

In [115]:
# Batch generator related utils

def batch_to_characters(batch):
    return [one_hot_to_bigram(bigram) for bigram in batch]

def unrolling_to_strings(unrolling):
    strings = []
    chars = [batch_to_characters(b) for b in unrolling]
    batch_size = len(unrolling[0])
    for b in range(batch_size):
        batch_bigrams = [chars[u][b] for u in range(len(unrolling))]
        strings.append(''.join(batch_bigrams))
    return strings

def random_one_hot():
    b = np.random.uniform(0.0, 1.0, size=[1, bigram_vocab_size])
    return b/np.sum(b, 1)[:,None]

def random_embedding(): 
    return id_to_embed(one_hot_to_id(random_one_hot()))

def one_hot_unroll_to_embed(unrolling):
    embed_unrolling = []
    for batch in unrolling:
        embed_unrolling.append([one_hot_to_embed(row) for row in batch])
    return np.array(embed_unrolling)

def embed_unroll_to_one_hot(unrolling):
    one_hot_unrolling = []
    batch_size = len(unrolling[0])
    for batch in unrolling:
        oh = np.zeros(shape=(batch_size, bigram_vocab_size), dtype=np.float64)
        for row_num in range(len(batch)):
            embed = batch[row_num]
            oh[row_num, embed_to_id(embed)] = 1.0
        one_hot_unrolling.append(oh)
    return one_hot_unrolling 

In [125]:
# Create the batch generators for the model

embed_lstm_train_batches = LSTMBatchGenerator(train_bigrams, batch_size, num_unrollings)
embed_lstm_valid_batches = LSTMBatchGenerator(valid_bigrams, 1, 1)

In [150]:
# The bigram LSTM model

# This is mostly the same as the old model, but it takes embeddings
# instead of one-hots as input. It still outputs a distribution over
# bigrams by projecting the output of the cells from `num_nodes` to
# `bigram_vocab_size` in the final layer.

num_nodes = 128

graph = tf.Graph()
with graph.as_default():
    
    # Cell weights and biases
    x_weights = tf.Variable(tf.truncated_normal([embedding_size, num_nodes * 4], -0.1, 0.1))
    x_bias = tf.Variable(tf.zeros([1, num_nodes * 4]))
    mem_weights = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1))
    mem_bias = tf.Variable(tf.zeros([1, num_nodes * 4]))
        
    # Output saved across unrollings
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases
    w = tf.Variable(tf.truncated_normal([num_nodes, bigram_vocab_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([bigram_vocab_size]))

    # Input data
    train_inputs = list()
    for _ in range(num_unrollings):
        train_inputs.append(tf.placeholder(tf.float32, shape=[batch_size, embedding_size]))
    train_labels = list()
    for _ in range(num_unrollings):
        train_labels.append(tf.placeholder(tf.float32, shape=[batch_size, bigram_vocab_size]))   

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state, x_weights, x_bias, mem_weights, mem_bias)
        outputs.append(output)

    # State saving across unrollings
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        combined_outputs = tf.concat(0, outputs)
        combined_labels = tf.concat(0, train_labels)
        logits = tf.matmul(combined_outputs, w) + b
        tk_error = tf.nn.softmax_cross_entropy_with_logits(logits, combined_labels)
        loss = tf.reduce_mean(tk_error)
    
    # Optimizer
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions
    train_prediction = tf.nn.softmax(logits)
    
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, embedding_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    
    sample_output, sample_state = lstm_cell(sample_input, 
                                            saved_sample_output, 
                                            saved_sample_state, 
                                            x_weights, 
                                            x_bias, 
                                            mem_weights, 
                                            mem_bias)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [151]:
# Run the model

num_steps = 7001
summary_frequency = 100

embed_lstm_train_batches.reset()
embed_lstm_valid_batches.reset()

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    for step in range(num_steps):    
        # Set up fetches dictionary
        train_unrolling = embed_lstm_train_batches.next()
        train_input_unrolling = one_hot_unroll_to_embed(train_unrolling[:num_unrollings])
        train_label_unrolling = train_unrolling[1:]
        feed_dict = dict()
        for i in range(num_unrollings):
            feed_dict[train_inputs[i]] = train_input_unrolling[i]
            feed_dict[train_labels[i]] = train_label_unrolling[i]
            
        # Run the session    
        output_requested = [optimizer, 
                            loss, 
                            train_prediction, 
                            learning_rate]
        output = session.run(output_requested, feed_dict=feed_dict)
        output_loss = output[1]
        predictions = output[2]
        lr = output[3]
        
        mean_loss += output_loss
        
        # Log status
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            # np.concatenate(train_label_unrolling) gives labels the shape
            # (batch_size * num_unrollings, vocab_size), just like predictions.
            labels = np.concatenate(train_label_unrolling)
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    rand = random_one_hot()
                    feed = [one_hot_to_embed(rand)]
                    sentence = batch_to_characters(rand)[0]
                    reset_sample_state.run()
                    for _ in range(79 // 2):
                        feed = sample_prediction.eval({sample_input: feed})
                        sentence += batch_to_characters(feed)[0]
                        feed = [one_hot_to_embed(feed[0])]
                    print(sentence)
                print('=' * 80)
            
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                # Remember here that the validation unrollings are shaped differently. They
                # only contain two batches, one for the current bigram and one for the previous one.
                # And each of those batches contains only a single row.
                valid_unrolling = embed_lstm_valid_batches.next()
                valid_input_unrolling = one_hot_unroll_to_embed(valid_unrolling[:1])
                valid_label_unrolling = valid_unrolling[1:]
                valid_predictions = sample_prediction.eval(
                    {sample_input: valid_input_unrolling[0]})
                valid_logprob = valid_logprob + logprob(
                    valid_predictions, valid_label_unrolling[0])
            print('Validation set perplexity: %.2f' % 
                  float(np.exp(valid_logprob / valid_size)))

import os
os.system('say "Training complete."')

Initialized
Average loss at step 0: 6.590360 learning rate: 10.000000
Minibatch perplexity: 728.04
oce e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 
bke e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 
mie e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 
sbe e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 
roe e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e 
Validation set perplexity: 670.12
Average loss at step 100: 5.282624 learning rate: 10.000000
Minibatch perplexity: 123.89
Validation set perplexity: 134.20
Average loss at step 200: 4.507654 learning rate: 10.000000
Minibatch perplexity: 72.68
Validation set perplexity: 86.78
Average loss at step 300: 4.172103 learning rate: 10.000000
Minibatch perplexity: 57.35
Validation set perplexity: 71.77
Average loss at step 400: 3.963395 learning rate: 10.000000
Minibatch perplexity: 56.90
Validatio

0
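
The generated samples above collapse into a repeated "e " pattern. One plausible contributor is that the sampling loop decodes with `batch_to_characters`, which takes the argmax of each prediction, whereas the character model drew from the predicted distribution with `sample`. The sketch below contrasts the two strategies on a toy distribution. The function names are hypothetical, and this is an illustration of the difference rather than a verified fix for the model.

```python
import numpy as np

def greedy_pick(distribution):
    """Always choose the most probable id (what argmax-based decoding does)."""
    return int(np.argmax(distribution))

def sampled_pick(distribution, rng):
    """Draw an id in proportion to its predicted probability."""
    return int(rng.choice(len(distribution), p=distribution))

rng = np.random.RandomState(0)
distribution = np.array([0.4, 0.35, 0.25])   # toy 3-way prediction
greedy = [greedy_pick(distribution) for _ in range(1000)]
sampled = [sampled_pick(distribution, rng) for _ in range(1000)]
print(set(greedy), sorted(set(sampled)))
```

Greedy decoding returns the same id every time for the same prediction, so once the model falls into a loop it cannot leave it. Sampling keeps the less likely ids in play.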

### Part C: Introducing Dropout

If I'm interpreting the paper right, the idea is to apply dropout to each cell's input and output, but not to its recurrent connection. This is how I've implemented it, but it still caused an increase in validation perplexity.

Either I'm missing something or the network just isn't large enough for dropout to be effective.
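
For reference, here is a minimal NumPy sketch of dropout restricted to the non-recurrent connections, as I read the paper. It uses a plain tanh RNN step as a stand-in for the LSTM cell, and every name in it (`dropout`, `toy_cell`, the toy sizes) is an assumption for illustration. The key point is that the value carried to the next time step is never the dropped-out copy.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Inverted dropout: zero units with probability 1 - keep_prob, rescale the rest."""
    if keep_prob >= 1.0:
        return x
    mask = rng.binomial(1, keep_prob, size=x.shape)
    return x * mask / keep_prob

def toy_cell(x, h, w_x, w_h):
    """Stand-in for an LSTM cell (a plain tanh RNN step, for brevity)."""
    return np.tanh(x.dot(w_x) + h.dot(w_h))

rng = np.random.RandomState(0)
batch, input_size, num_nodes, keep_prob = 4, 16, 8, 0.5   # toy sizes, assumed
w_x = rng.normal(0, 0.1, (input_size, num_nodes))
w_h = rng.normal(0, 0.1, (num_nodes, num_nodes))
inputs = [rng.rand(batch, input_size) for _ in range(3)]

h = np.zeros((batch, num_nodes))
classifier_inputs = []
for x in inputs:
    # Dropout on the input (non-recurrent) connection.
    h = toy_cell(dropout(x, keep_prob, rng), h, w_x, w_h)
    # Dropout on the copy sent up to the classifier; h itself carries over untouched.
    classifier_inputs.append(dropout(h, keep_prob, rng))

print(h.shape, len(classifier_inputs))
```

At evaluation time the same loop would run with `keep_prob = 1.0`, which turns `dropout` into a no-op.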

In [398]:
def lstm_cell(x, o, state, x_weights, x_bias, mem_weights, mem_bias, keep_prob=1):
    """
    Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
    Note that in this formulation, we omit the various connections between the
    previous state and the gates.

    Args:
        x (Tensor): A Tensor with shape (batch_size, input_size).
        o (Tensor): A Tensor with shape (batch_size, num_nodes).
        state (Tensor): A Tensor with shape (batch_size, num_nodes).
        x_weights (Tensor): A Tensor with shape (input_size, num_nodes * 4)
        x_bias (Tensor): A Tensor with shape (1, num_nodes * 4)
        mem_weights (Tensor): A Tensor with shape (num_nodes, num_nodes * 4)
        mem_bias (Tensor): A Tensor with shape (1, num_nodes * 4)
        keep_prob (float): Probability of keeping a unit in the dropout applied
            to the cell's input and output. A value of 1 disables dropout.
    
    Returns:
        tuple: A tuple containing Tensors representing a cell's output and state.
    
    """

    x = tf.matmul(tf.nn.dropout(x, keep_prob), x_weights) + x_bias
    mem = tf.matmul(o, mem_weights) + mem_bias

    x_input = x[:, :num_nodes]
    x_forget = x[:, num_nodes:num_nodes * 2]
    x_update = x[:, num_nodes * 2:num_nodes * 3]
    x_output = x[:, num_nodes * 3:num_nodes * 4]

    mem_input = mem[:, :num_nodes]
    mem_forget = mem[:, num_nodes:num_nodes * 2]
    mem_update = mem[:, num_nodes * 2:num_nodes * 3]
    mem_output = mem[:, num_nodes * 3:num_nodes * 4]

    input_gate = tf.sigmoid(x_input + mem_input)
    forget_gate = tf.sigmoid(x_forget + mem_forget)
    update = tf.tanh(x_update + mem_update)
    state = tf.tanh((forget_gate * state) + (input_gate * update))
    output_gate = tf.sigmoid(x_output + mem_output)
    output = output_gate * state
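    # Note: the caller feeds this returned output back in as `o` on the next
    # step, so this dropout also lands on the recurrent connection.
    return tf.nn.dropout(output, keep_prob), state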

In [152]:
# The bigram LSTM model with dropout

num_nodes = 128
keep_prop = 0.5

graph = tf.Graph()
with graph.as_default():
    
    # Cell weights and biases
    x_weights = tf.Variable(tf.truncated_normal([embedding_size, num_nodes * 4], -0.1, 0.1))
    x_bias = tf.Variable(tf.zeros([1, num_nodes * 4]))
    mem_weights = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1))
    mem_bias = tf.Variable(tf.zeros([1, num_nodes * 4]))
        
    # Output saved across unrollings
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    
    # Classifier weights and biases
    w = tf.Variable(tf.truncated_normal([num_nodes, bigram_vocab_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([bigram_vocab_size]))

    # Input data
    train_inputs = list()
    for _ in range(num_unrollings):
        train_inputs.append(tf.placeholder(tf.float32, shape=[batch_size, embedding_size]))
    train_labels = list()
    for _ in range(num_unrollings):
        train_labels.append(tf.placeholder(tf.float32, shape=[batch_size, bigram_vocab_size]))   

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state, x_weights, x_bias, mem_weights, mem_bias, keep_prob=keep_prop)
        outputs.append(output)

    # State saving across unrollings
    with tf.control_dependencies([saved_output.assign(output), saved_state.assign(state)]):
        # Classifier.
        combined_outputs = tf.concat(0, outputs)
        combined_labels = tf.concat(0, train_labels)
        logits = tf.matmul(combined_outputs, w) + b
        tk_error = tf.nn.softmax_cross_entropy_with_logits(logits, combined_labels)
        loss = tf.reduce_mean(tk_error)
    
    # Optimizer
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)

    # Predictions
    train_prediction = tf.nn.softmax(logits)
    
    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, embedding_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    
    sample_output, sample_state = lstm_cell(sample_input, 
                                            saved_sample_output, 
                                            saved_sample_state, 
                                            x_weights, 
                                            x_bias, 
                                            mem_weights, 
                                            mem_bias)
    
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [153]:
# Run the model

num_steps = 7001
summary_frequency = 100

embed_lstm_train_batches.reset()
embed_lstm_valid_batches.reset()

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    for step in range(num_steps):    
        # Set up fetches dictionary
        train_unrolling = embed_lstm_train_batches.next()
        train_input_unrolling = one_hot_unroll_to_embed(train_unrolling[:num_unrollings])
        train_label_unrolling = train_unrolling[1:]
        feed_dict = dict()
        for i in range(num_unrollings):
            feed_dict[train_inputs[i]] = train_input_unrolling[i]
            feed_dict[train_labels[i]] = train_label_unrolling[i]
            
        # Run the session    
        output_requested = [optimizer, 
                            loss, 
                            train_prediction, 
                            learning_rate]
        output = session.run(output_requested, feed_dict=feed_dict)
        output_loss = output[1]
        predictions = output[2]
        lr = output[3]
        
        mean_loss += output_loss
        
        # Log status
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            # np.concatenate(train_label_unrolling) gives labels the shape
            # (batch_size * num_unrollings, vocab_size), just like predictions.
            labels = np.concatenate(train_label_unrolling)
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    rand = random_one_hot()
                    feed = [one_hot_to_embed(rand)]
                    sentence = batch_to_characters(rand)[0]
                    reset_sample_state.run()
                    for _ in range(79 // 2):
                        feed = sample_prediction.eval({sample_input: feed})
                        sentence += batch_to_characters(feed)[0]
                        feed = [one_hot_to_embed(feed[0])]
                    print(sentence)
                print('=' * 80)
            
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                # Remember here that the validation unrollings are shaped differently. They
                # only contain two batches, one for the current character and the previous one.
                # And each of those batches contains only a single row.
                b = valid_batches.next()
                valid_unrolling = embed_lstm_valid_batches.next()
                valid_input_unrolling = one_hot_unroll_to_embed(valid_unrolling[:1])
                valid_label_unrolling = valid_unrolling[1:]
                valid_predictions = sample_prediction.eval(
                    {sample_input: valid_input_unrolling[0]})
                valid_logprob = valid_logprob + logprob(
                    valid_predictions, valid_label_unrolling[0])
            print('Validation set perplexity: %.2f' % 
                  float(np.exp(valid_logprob / valid_size)))

import os
os.system('say "Training complete."')

Initialized
Average loss at step 0: 6.598445 learning rate: 10.000000
Minibatch perplexity: 733.95
hkd e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d 
tfd e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d 
u e e e e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d 
 ue e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d 
ckd e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d e d 
Validation set perplexity: 670.66
Average loss at step 100: 5.392356 learning rate: 10.000000
Minibatch perplexity: 162.52
Validation set perplexity: 157.61
Average loss at step 200: 4.995656 learning rate: 10.000000
Minibatch perplexity: 143.22
Validation set perplexity: 119.12
Average loss at step 300: 4.857814 learning rate: 10.000000
Minibatch perplexity: 123.86
Validation set perplexity: 103.74
Average loss at step 400: 4.744691 learning rate: 10.000000
Minibatch perplexity: 118.66
Vali

0
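
The log reports both a loss and a perplexity at each summary step. As a minimal sketch, assuming `logprob` returns the mean negative log-likelihood in nats (its definition is not shown in this section), perplexity is just the exponential of that value. The helper name `perplexity` is hypothetical and used only for illustration.

```python
import math

def perplexity(mean_cross_entropy):
    """Return exp(mean cross-entropy), with the cross-entropy given in nats."""
    return math.exp(mean_cross_entropy)

# The step-0 loss from the log is a single-batch loss, so its perplexity
# should match the minibatch perplexity printed on the next log line.
print('%.2f' % perplexity(6.598445))  # prints 733.95
```

At later steps the two numbers differ, because the printed loss is averaged over the last `summary_frequency` batches while the minibatch perplexity uses only the most recent batch.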

---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.
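
For reference, the target transformation can be written directly in a few lines of Python. This is only a sketch of the expected input/output mapping that the network has to learn; the function name `mirror_words` is hypothetical and not part of the assignment.

```python
def mirror_words(sentence):
    """Reverse the characters of each space-separated word, keeping word order."""
    return ' '.join(word[::-1] for word in sentence.split(' '))

print(mirror_words('the quick brown fox'))  # prints: eht kciuq nworb xof
```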

---

### Problem 3: Solution

The model below uses the encoder/decoder scheme from the paper provided with a few simplifications:

1. The encoder and decoder are each only one layer deep.
2. I didn't use 'go' or 'eos' characters. I experimented with them, but they seemed to make the network harder to train.
3. I didn't vary the sequence length during training. I was worried this might mean the network wouldn't be able to reverse shorter sequences without padding characters, but that doesn't appear to be the case (see last cell).

The resulting network still misses an occasional character, but it does seem to have learned to encode an ordered list of `num_unrollings` characters into a single vector and then decode them in reverse.

Also note: training is prone to getting stuck in a local minimum of the error, so I had to run it a few times before I got weights that worked consistently.
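
To make the training target concrete, here is a minimal pure-Python sketch of the input/label pairs implied by the simplifications above: fixed-length windows with no 'go' or 'eos' markers, each paired with its character-wise reversal. The helper `make_mirror_pairs` is hypothetical. The notebook itself builds the same pairing on one-hot batches by reversing the list of batches in the training loop.

```python
def make_mirror_pairs(text, seq_len=10):
    """Split text into non-overlapping windows and pair each with its reversal."""
    pairs = []
    for start in range(0, len(text) - seq_len + 1, seq_len):
        window = text[start:start + seq_len]
        pairs.append((window, window[::-1]))
    return pairs

for source, target in make_mirror_pairs(' anarchism originated', seq_len=10):
    print(repr(source), '->', repr(target))
```

Note that this reverses the whole window, spaces included, so a window that spans several words also has its word order reversed. That is a slightly different task from the word-by-word mirroring in the problem statement.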

In [399]:
# mir for mirror problem
mir_vocabulary_size = len(string.ascii_lowercase) + 1
batch_size = 64
num_unrollings = 10
mir_valid_size = batch_size * num_unrollings * 10
mir_valid_text = text[:mir_valid_size]
mir_train_text = text[mir_valid_size:]
mir_train_size = len(mir_train_text)

In [400]:
def mir_char_to_id(char):
    if char == eos:
        return mir_vocabulary_size - 1
    else:
        return char2id(char)

def mir_id_to_char(id):
    if id >= mir_vocabulary_size:
        raise Exception()
    if id == mir_vocabulary_size - 1:
        return eos
    else:
        return id2char(id)

def mir_char_to_one_hot(char):
    one_hot = np.zeros(shape=mir_vocabulary_size, dtype=np.float64)
    one_hot[mir_char_to_id(char)] = 1.0
    return one_hot

In [401]:
def mir_batch_to_strings(batch):
    return [mir_id_to_char(np.argmax(char_one_hot)) for char_one_hot in batch]

def mir_unrolling_to_strings(unrolling):
    strings = []
    chars = [mir_batch_to_strings(b) for b in unrolling]
    batch_size = len(unrolling[0])
    for b in range(batch_size):
        batch_chars = [chars[u][b] for u in range(len(unrolling))]
        strings.append(''.join(batch_chars))
    return strings

In [403]:
class MirBatchGenerator(object):
    
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        self.reset()
        
    def reset(self):
        # Use the generator's own batch size rather than the global one.
        segment = self._text_size // self._batch_size
        self._cursor = [offset * segment for offset in range(self._batch_size)]
    
    def _eos_batch(self):
        batch = np.zeros(shape=(self._batch_size, mir_vocabulary_size), dtype=np.float64)
        for b in range(self._batch_size):
            batch[b, mir_char_to_id(eos)] = 1.0
        return batch
    
    def _next_batch(self):
        batch = np.zeros(shape=(self._batch_size, mir_vocabulary_size), dtype=np.float64)
        for b in range(self._batch_size):
            char_id = mir_char_to_id(self._text[self._cursor[b]])
            batch[b, char_id] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
    
    def next(self):
        """This returns a list of batches, aka an 'unrolling'."""
        batches = []
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        return batches

In [408]:
mir_train_batches = MirBatchGenerator(mir_train_text, batch_size, num_unrollings)
mir_valid_batches = MirBatchGenerator(mir_valid_text, batch_size, num_unrollings)

In [409]:
num_nodes = 256

graph = tf.Graph()
with graph.as_default():
    
    # Stage 1 (encoder) cells weights and biases
    s1_x_weights = tf.Variable(tf.truncated_normal([mir_vocabulary_size, num_nodes * 4], -0.1, 0.1), 
                               name='s1_x_weights')
    s1_x_bias = tf.Variable(tf.zeros([1, num_nodes * 4]),
                            name='s1_x_bias')
    s1_mem_weights = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1), 
                                 name='s1_mem_weights')
    s1_mem_bias = tf.Variable(tf.zeros([1, num_nodes * 4]), name='s1_mem_bias')
    
    # Stage 2 (decoder) cells weights and biases
    s2_x_weights = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1), name='s2_x_weights')
    s2_x_bias = tf.Variable(tf.zeros([1, num_nodes * 4]), name='s2_x_bias')
    s2_mem_weights = tf.Variable(tf.truncated_normal([num_nodes, num_nodes * 4], -0.1, 0.1), name='s2_mem_weights')
    s2_mem_bias = tf.Variable(tf.zeros([1, num_nodes * 4]), name='s2_mem_bias')
    
    # Final classifier weights and biases
    weights = tf.Variable(tf.truncated_normal([num_nodes, mir_vocabulary_size], -0.1, 0.1), name='weights')
    biases = tf.Variable(tf.zeros([mir_vocabulary_size]), name='biases')
    
    # Output saved across unrollings
    s1_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    s1_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    s2_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    s2_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
        
    # Input data
    train_inputs = list()
    train_labels = list()
    for _ in range(num_unrollings):
        train_inputs.append(tf.placeholder(tf.float32, shape=[batch_size, mir_vocabulary_size]))
        train_labels.append(tf.placeholder(tf.float32, shape=[batch_size, mir_vocabulary_size]))
        
    s1_outputs = list()
    output = s1_output
    state = s1_state
    for train_input in train_inputs:
        output, state = lstm_cell(train_input, 
                                  output, 
                                  state, 
                                  s1_x_weights, 
                                  s1_x_bias, 
                                  s1_mem_weights, 
                                  s1_mem_bias)
        s1_outputs.append(output)
        
    s2_outputs = list()
    output = s2_output
    state = s2_state
    for i in range(len(train_inputs)):
        if i == 0:
            s2_input = s1_outputs[-1]
        else:
            s2_input = output
        output, state = lstm_cell(s2_input, 
                                  output, 
                                  state, 
                                  s2_x_weights, 
                                  s2_x_bias, 
                                  s2_mem_weights, 
                                  s2_mem_bias)
        s2_outputs.append(output)

    # Classifier
    combined_outputs = tf.concat(0, s2_outputs) # Shape is (batch_size * num_unrollings, num_nodes).
    combined_labels = tf.concat(0, train_labels) # Shape is (batch_size * num_unrollings, mir_vocabulary_size).
    logits = tf.matmul(combined_outputs, weights) + biases
    train_prediction = tf.nn.softmax(logits) # Used for logging.
    tk_error = tf.nn.softmax_cross_entropy_with_logits(logits, combined_labels)
    loss = tf.reduce_mean(tk_error)
    
    # Optimizer
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)
    

        
    # Sampling
    s1_sample_input = tf.placeholder(tf.float32, shape=[1, mir_vocabulary_size], name='s1_sample_input')
    s1_saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]), name='s1_saved_sample_state')
    s1_saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]), name='s1_saved_sample_output')
    
    
    s1_sample_output, s1_sample_state = lstm_cell(s1_sample_input, 
                                            s1_saved_sample_output, 
                                            s1_saved_sample_state, 
                                            s1_x_weights, 
                                            s1_x_bias, 
                                            s1_mem_weights, 
                                            s1_mem_bias)
    
    with tf.control_dependencies([s1_sample_output, s1_sample_state]):
        s1_store_state = tf.group(s1_saved_sample_output.assign(s1_sample_output), 
                                  s1_saved_sample_state.assign(s1_sample_state))
    
    s1_reset_state = tf.group(s1_saved_sample_output.assign(tf.zeros([1, num_nodes])),
                              s1_saved_sample_state.assign(tf.zeros([1, num_nodes])))    
    
    s2_sample_input = tf.Variable(tf.zeros([1, num_nodes]))
    s2_saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    s2_saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    
    setup_s2 = s2_sample_input.assign(s1_saved_sample_output)
    
    s2_sample_output, s2_sample_state = lstm_cell(s2_sample_input, 
                                            s2_saved_sample_output, 
                                            s2_saved_sample_state, 
                                            s2_x_weights, 
                                            s2_x_bias, 
                                            s2_mem_weights, 
                                            s2_mem_bias)
    
    with tf.control_dependencies([s2_sample_output, s2_sample_state]):
        s2_store_state = tf.group(s2_saved_sample_output.assign(s2_sample_output), 
                                  s2_saved_sample_state.assign(s2_sample_state), 
                                  s2_sample_input.assign(s2_sample_output))
        with tf.control_dependencies([s2_store_state]):
            sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(s2_sample_output, weights, biases))
    
    s2_reset_state = tf.group(s2_saved_sample_output.assign(tf.zeros([1, num_nodes])),
                              s2_saved_sample_state.assign(tf.zeros([1, num_nodes])))
        

In [414]:
num_steps = 30001
summary_frequency = 200

mir_train_batches.reset()
mir_valid_batches.reset()

with tf.Session(graph=graph) as session:
    tf.initialize_all_variables().run()
    print('Initialized')
    mean_loss = 0
    
    train_unrolling = list()
    for step in range(num_steps):    
        # Set up the feed dictionary
        train_input_unrolling = mir_train_batches.next()
        train_label_unrolling = train_input_unrolling[:]
        train_label_unrolling.reverse()
        
        feed_dict = dict()
        for i in range(num_unrollings):
            feed_dict[train_inputs[i]] = train_input_unrolling[i]
            feed_dict[train_labels[i]] = train_label_unrolling[i]
            
        # Run the session    
        output_requested = [optimizer, 
                            loss, 
                            train_prediction, 
                            learning_rate]
        output = session.run(output_requested, feed_dict=feed_dict)
        output_loss = output[1]
        predictions = output[2]
        lr = output[3]
        
        mean_loss += output_loss
        
        # Log status
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            
            # Log perplexity
            labels = np.concatenate(train_label_unrolling)
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            
            # Log the prediction for the first row of the unrolling
            reshaped_unrolling = np.array(train_input_unrolling)[:, :1, :]
            print('Sample Input:', mir_unrolling_to_strings(reshaped_unrolling)[0])
            reshaped_pred = predictions.reshape((num_unrollings, batch_size, mir_vocabulary_size))
            reshaped_pred = reshaped_pred[:, :1, :]
            print('Sample Pred :', mir_unrolling_to_strings(reshaped_pred)[0])
        
        
    # Measure validation set perplexity.
    valid_logprob = 0
    num_steps = mir_valid_size // (num_unrollings * batch_size)
    for _ in range(num_steps):        
        valid_unrolling = list()
        
        # Set up the feed dictionary
        valid_input_unrolling = mir_valid_batches.next()
        valid_label_unrolling = valid_input_unrolling[:]
        valid_label_unrolling.reverse()

        feed_dict = dict()
        for i in range(num_unrollings):
            feed_dict[train_inputs[i]] = valid_input_unrolling[i]
            feed_dict[train_labels[i]] = valid_label_unrolling[i]

        # Run the session    
        output_requested = [train_prediction]
        output = session.run(output_requested, feed_dict=feed_dict)
        valid_predictions = output[0]        
        valid_labels = np.concatenate(valid_label_unrolling)
        valid_logprob = valid_logprob + logprob(valid_predictions, valid_labels)
        
    print('Validation set perplexity: %.2f' % float(np.exp(valid_logprob / num_steps)))

    mir_saver = tf.train.Saver([s1_x_weights, 
                                s1_x_bias, 
                                s1_mem_weights, 
                                s1_mem_bias, 
                                s2_x_weights, 
                                s2_x_bias, 
                                s2_mem_weights, 
                                s2_mem_bias, 
                                weights,
                                biases])
    
    mir_saver.save(session, os.path.expanduser('~/Desktop/Mirror Model/mir_data'))
    

Initialized
Average loss at step 0: 3.300151 learning rate: 10.000000
Minibatch perplexity: 27.12
Sample Input:  community
Sample Pred : h dhbhbh  
Average loss at step 200: 3.039025 learning rate: 10.000000
Minibatch perplexity: 17.10
Sample Input: to active 
Sample Pred :           
Average loss at step 400: 2.856922 learning rate: 10.000000
Minibatch perplexity: 18.16
Sample Input: aire one e
Sample Pred : e  eee eee
Average loss at step 600: 2.864120 learning rate: 10.000000
Minibatch perplexity: 18.10
Sample Input: he same ti
Sample Pred :           
Average loss at step 800: 2.831155 learning rate: 10.000000
Minibatch perplexity: 16.90
Sample Input: yndicalist
Sample Pred :  eee eee e
Average loss at step 1000: 2.778417 learning rate: 10.000000
Minibatch perplexity: 15.11
Sample Input: ral uphold
Sample Pred : oi   e    
Average loss at step 1200: 2.662394 learning rate: 10.000000
Minibatch perplexity: 13.41
Sample Input: though opp
Sample Pred : aa        
Average loss at step 1

Because I didn't vary the length of the training sequences, and I didn't use an 'eos' character to mark the end of input, I was a little concerned that the final network might not be able to handle sequences of arbitrary length. For example, I could imagine a situation where the encoder learned to wait until it saw the 10th character before formatting its output in a way that would be usable by the decoder.

That doesn't appear to have happened (see next cell). It looks as though the encoder learned to continuously encode an ordered sequence of 10 letters into a finite vector of size `num_nodes`, which could be understood by the decoder at any point in the process.
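
As a loose analogy only, and not a description of what the trained LSTM actually stores, the behavior resembles an encoder that folds each character into a running state which a decoder can unwind at any point. The toy functions below are hypothetical stand-ins that use a plain string as the "state"; they follow the same encode-then-decode control flow as the `mirror` helper in the next cell.

```python
def toy_encode(state, char):
    """Fold one character into the running state (here: a plain string)."""
    return state + char

def toy_decode(state):
    """Emit the most recently encoded character and return the remaining state."""
    return state[-1], state[:-1]

def toy_mirror(text):
    # Encoder pass: consume the input one character at a time.
    state = ''
    for char in text:
        state = toy_encode(state, char)
    # Decoder pass: run for as many steps as there were input characters.
    decoded = []
    for _ in text:
        char, state = toy_decode(state)
        decoded.append(char)
    return ''.join(decoded)

print(toy_mirror('abcdf'))  # prints: fdcba
```

Because the state is valid after every encoder step, the decoder can start after 3 characters just as well as after 10, which is the property the experiment in the next cell checks for the real network.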

In [417]:
with tf.Session(graph=graph) as session:    
    tf.initialize_all_variables().run()
    
    mir_saver = tf.train.Saver([s1_x_weights, 
                                s1_x_bias, 
                                s1_mem_weights, 
                                s1_mem_bias, 
                                s2_x_weights, 
                                s2_x_bias, 
                                s2_mem_weights, 
                                s2_mem_bias, 
                                weights,
                                biases])
    mir_saver.restore(session, os.path.expanduser('~/Desktop/Mirror Model/mir_data'))
    
    def mirror(test_string):
        print('Input:', test_string)
        s1_reset_state.run()
        s2_reset_state.run()
        for char in test_string:
            outputs = session.run([s1_store_state], {s1_sample_input: [mir_char_to_one_hot(char)]})
        setup_s2.eval()
        sentence = list()
        for char in test_string:
            sentence.append(sample_prediction.eval())
        print('Pred:', mir_unrolling_to_strings(sentence)[0])
        
    test_string = 'ons anarch'
    test_string = 'ons'
    
    mirror('abcdf')
    mirror('cow')

Input: abcdf
Pred: fdcba
Input: cow
Pred: woc
