Deep Learning
=============

Assignment 6
------------

After training a skip-gram model in `5_word2vec.ipynb`, the goal of this notebook is to train a LSTM character model over [Text8](http://mattmahoney.net/dc/textdata) data.

In [1]:
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

print('Tensorflow Version: ', tf.__version__)

Tensorflow Version:  1.3.0


In [2]:
url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


In [3]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        name = f.namelist()[0]
        data = tf.compat.as_str(f.read(name))
    return data

text = read_data(filename)
print('Data size %d' % len(text))

Data size 100000000


Create a small validation set.

In [4]:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

99999000 ons anarchists advocate social relations based upon voluntary as
1000  anarchism originated as a term of abuse first used against earl


Utility functions to map characters to vocabulary IDs and back.

In [5]:
vocabulary_size = len(string.ascii_lowercase) + 1  # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print('Unexpected character: %s' % char)
        return 0

def id2char(dictid):
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))


Unexpected character: ï
1 26 0 0
a z  


Function to generate a training batch for the LSTM model.

In [41]:
batch_size=64
num_unrollings=10

class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
        for b in range(self._batch_size):
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

def characters(probabilities):
    """Turn a 1-hot encoding or a probability distribution over the possible
    characters back into its (most likely) character representation."""
    return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
    """Convert a sequence of batches back into their (most likely) string
    representation."""
#     print(batches[0].shape[0])
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nat

In [42]:
def logprob(predictions, labels):
    """Log-probability of the true labels in a predicted batch."""
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
    """Sample one element from a distribution assumed to be an array of normalized
    probabilities.
    """
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1

def sample(prediction):
    """Turn a (column) prediction into 1-hot encoded samples."""
    p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p

def random_distribution():
    """Generate a random column of probabilities."""
    b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
    return b/np.sum(b, 1)[:,None]

Simple LSTM Model.

In [43]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
        saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [44]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval(
                            {sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                print('=' * 80)
                print(sentence)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.292858 learning rate: 10.000000
Minibatch perplexity: 26.92
n biu ifcpoqpyxf pemtujpmssewnreiehcul m defkmqkohlwnbvcwaot e ridsa nzgwyjr s i
Validation set perplexity: 20.08
Average loss at step 100: 2.603985 learning rate: 10.000000
Minibatch perplexity: 11.10
Validation set perplexity: 10.56
Average loss at step 200: 2.249387 learning rate: 10.000000
Minibatch perplexity: 8.43
Validation set perplexity: 8.63
Average loss at step 300: 2.095807 learning rate: 10.000000
Minibatch perplexity: 7.38
Validation set perplexity: 7.95
Average loss at step 400: 1.994223 learning rate: 10.000000
Minibatch perplexity: 7.40
Validation set perplexity: 7.71
Average loss at step 500: 1.927092 learning rate: 10.000000
Minibatch perplexity: 6.33
Validation set perplexity: 6.92
Average loss at step 600: 1.900494 learning rate: 10.000000
Minibatch perplexity: 6.26
Validation set perplexity: 6.71
Average loss at step 700: 1.849040 learning rate: 10.000000
Minibatch pe

Validation set perplexity: 4.25
Average loss at step 5700: 1.562760 learning rate: 1.000000
Minibatch perplexity: 4.46
Validation set perplexity: 4.26
Average loss at step 5800: 1.579059 learning rate: 1.000000
Minibatch perplexity: 4.81
Validation set perplexity: 4.26
Average loss at step 5900: 1.568666 learning rate: 1.000000
Minibatch perplexity: 5.22
Validation set perplexity: 4.26
Average loss at step 6000: 1.542613 learning rate: 1.000000
Minibatch perplexity: 4.98
her as them the jecam a succeed curtexially iesorcoge masia valia gran there its
Validation set perplexity: 4.24
Average loss at step 6100: 1.559675 learning rate: 1.000000
Minibatch perplexity: 5.11
Validation set perplexity: 4.24
Average loss at step 6200: 1.530015 learning rate: 1.000000
Minibatch perplexity: 4.92
Validation set perplexity: 4.24
Average loss at step 6300: 1.540296 learning rate: 1.000000
Minibatch perplexity: 5.00
Validation set perplexity: 4.20
Average loss at step 6400: 1.532069 learning rate: 1.0

---
Problem 1
---------

You might have noticed that the definition of the LSTM cell involves 4 matrix multiplications with the input, and 4 matrix multiplications with the output. Simplify the expression by using a single matrix multiply for each, and variables that are 4 times larger.

---

In [45]:
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
    # Parameters:
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Concatenate parameters
    sx = tf.concat([ix, fx, cx, ox], 1)
    sm = tf.concat([im, fm, cm, om], 1)
    sb = tf.concat([ib, fb, cb, ob], 1)
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""    
        smatmul = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
        smatmul_input, smatmul_forget, update, smatmul_output = \
          tf.split(axis=1, num_or_size_splits=4, value=smatmul)
        input_gate = tf.sigmoid(smatmul_input)
        forget_gate = tf.sigmoid(smatmul_forget)
        output_gate = tf.sigmoid(smatmul_output)
        #input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        #forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        #update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        #output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state    

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
        saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [47]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval(
                            {sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.297015 learning rate: 10.000000
Minibatch perplexity: 27.03
qemo   hizmu  jinytvmb oarj     nzc  op  ddgs  tcoesqna i oquoho sncngrsecshe ha
jrmtp vssycnf fw redhrn zwuhtn y vevo ite kqebdal  akv y to wbmexfohsnbveprffim 
f noi hsqegagoiw  q  njqbtyzwrge txizssf zpenma cyd ruwcrr  ecnscrhtakcqenmokhsz
exy yknejaxtrxbt  vx e zwq xh ematmiazmunrqrzwzo sicoaoy t  cxtsii hsnuadhquf  l
x kpcwf adt  erwhkv sam  hbqnftdkwklgwynsdn ijevrgrxdhnicmf  akenel ysfswer dhdi
Validation set perplexity: 19.95
Average loss at step 100: 2.579453 learning rate: 10.000000
Minibatch perplexity: 12.14
Validation set perplexity: 10.61
Average loss at step 200: 2.238851 learning rate: 10.000000
Minibatch perplexity: 8.47
Validation set perplexity: 8.76
Average loss at step 300: 2.077987 learning rate: 10.000000
Minibatch perplexity: 6.91
Validation set perplexity: 7.87
Average loss at step 400: 1.995008 learning rate: 10.000000
Minibatch perplexity: 6.97
Validation set per

Validation set perplexity: 5.09
Average loss at step 4500: 1.628428 learning rate: 10.000000
Minibatch perplexity: 5.03
Validation set perplexity: 5.02
Average loss at step 4600: 1.627153 learning rate: 10.000000
Minibatch perplexity: 5.07
Validation set perplexity: 4.92
Average loss at step 4700: 1.598402 learning rate: 10.000000
Minibatch perplexity: 5.47
Validation set perplexity: 4.99
Average loss at step 4800: 1.584996 learning rate: 10.000000
Minibatch perplexity: 5.32
Validation set perplexity: 5.02
Average loss at step 4900: 1.594357 learning rate: 10.000000
Minibatch perplexity: 5.19
Validation set perplexity: 4.98
Average loss at step 5000: 1.618720 learning rate: 1.000000
Minibatch perplexity: 5.20
ja chread the three six determed his uce vicroses of carros at also luggther new
zaw sition with s bottle of whate eight if one nighe accona df at windorgined si
ts projetted the gubors offirum ausur in actor jussibuly pendinia history belaru
y intervion god martin the six three e

---
Problem 2
---------

We want to train a LSTM over bigrams, that is pairs of consecutive characters like 'ab' instead of single characters like 'a'. Since the number of possible bigrams is large, feeding them directly to the LSTM using 1-hot encodings will lead to a very sparse representation that is very wasteful computationally.

a- Introduce an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

b- Write a bigram-based LSTM, modeled on the character LSTM above.

c- Introduce Dropout. For best practices on how to use Dropout in LSTMs, refer to this [article](http://arxiv.org/abs/1409.2329).

---

Introducing an embedding lookup on the inputs, and feed the embeddings to the LSTM cell instead of the inputs themselves.

In [48]:
embedding_size = 128 # Dimension of the embedding vector.
num_nodes = 64

graph = tf.Graph()
with graph.as_default():
  
    # Parameters: 
    vocabulary_embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.                             
    cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Concatenate parameters
    sx = tf.concat([ix, fx, cx, ox], 1)
    sm = tf.concat([im, fm, cm, om], 1)
    sb = tf.concat([ib, fb, cb, ob], 1)
    # Variables saving state across unrollings.
    saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))
  
    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""    
        smatmul = tf.matmul(i, sx) + tf.matmul(o, sm) + sb
        smatmul_input, smatmul_forget, update, smatmul_output = \
            tf.split(axis=1, num_or_size_splits=4, value=smatmul)
        input_gate = tf.sigmoid(smatmul_input)
        forget_gate = tf.sigmoid(smatmul_forget)
        output_gate = tf.sigmoid(smatmul_output)
        #input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        #forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        #update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        #output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state    

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
    train_inputs = train_data[:num_unrollings]
    train_labels = train_data[1:]  # labels are inputs shifted by one time step.

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, tf.argmax(i, axis=1))
        output, state = lstm_cell(i_embed, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
        saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
    sample_input_embedding = tf.nn.embedding_lookup(vocabulary_embeddings, tf.argmax(sample_input, axis=1))
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
    sample_input_embedding, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
        saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [49]:
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval(
                            {sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({sample_input: b[0]})
                valid_logprob = valid_logprob + logprob(predictions, b[1])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))

Initialized
Average loss at step 0: 3.302303 learning rate: 10.000000
Minibatch perplexity: 27.18
mueiwj fjtweqqtkbtj zsiyu prnjeonmnxxneaxohenetoftczie h   st o ob raoaupmz hrxr
utanntxe  odnd eqpkikm qne lex chwpi jkn e dd gxues  sezsr   lawkcugizfiaeh toeo
ie  bohv hzewn  iov mwgh ovbxwq bppsojo owgsidri edvigox prevago e undnixtamc  j
oezrge g  niupjomewilqipdiwnbbgjdcldneveei uloqrrcogtwgvveiue lg il ax gmsyp tuz
ze iej qraaatlatl  ro hhksuvial e edacwu rmnv fcttgnraul l omxcy wc rt iiqnnei u
Validation set perplexity: 19.44
Average loss at step 100: 2.294293 learning rate: 10.000000
Minibatch perplexity: 8.10
Validation set perplexity: 8.61
Average loss at step 200: 2.039192 learning rate: 10.000000
Minibatch perplexity: 7.32
Validation set perplexity: 7.39
Average loss at step 300: 1.957545 learning rate: 10.000000
Minibatch perplexity: 6.59
Validation set perplexity: 6.87
Average loss at step 400: 1.900739 learning rate: 10.000000
Minibatch perplexity: 6.95
Validation set perpl

Validation set perplexity: 4.71
Average loss at step 4500: 1.599520 learning rate: 10.000000
Minibatch perplexity: 4.84
Validation set perplexity: 4.56
Average loss at step 4600: 1.604352 learning rate: 10.000000
Minibatch perplexity: 4.89
Validation set perplexity: 4.60
Average loss at step 4700: 1.619297 learning rate: 10.000000
Minibatch perplexity: 4.90
Validation set perplexity: 4.57
Average loss at step 4800: 1.615255 learning rate: 10.000000
Minibatch perplexity: 4.60
Validation set perplexity: 4.84
Average loss at step 4900: 1.637515 learning rate: 10.000000
Minibatch perplexity: 5.68
Validation set perplexity: 4.91
Average loss at step 5000: 1.641613 learning rate: 1.000000
Minibatch perplexity: 5.02
manys he time instrate in demence gives the maitaining manizalide succulated and
x one one nine zero malidies hos romanzies the he terms mestique and one set voi
 of mide reromon agmine cominton s mardes are is long island the were history st
can two fives foo and making a bdurned

Writing a bigram-based LSTM, modeled on the character LSTM above.


In [50]:
embedding_size = 128  # Dimension of the embedding vector.
num_nodes = 64
keep_prob_train = 1.0

graph = tf.Graph()
with graph.as_default():

    # Parameters:
    vocabulary_embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.
    cx = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal(
        [num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_chars = train_data[:num_unrollings]
    train_inputs = zip(train_chars[:-1], train_chars[1:])
    # labels are inputs shifted by one time step.
    train_labels = train_data[2:]

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        bigram_index = tf.argmax(i[0], axis=1) + \
            vocabulary_size * tf.argmax(i[1], axis=1)
        i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, bigram_index)
        drop_i = tf.nn.dropout(i_embed, keep_prob_train)
        output, state = lstm_cell(drop_i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(axis=0, values=outputs), w, b)
        drop_logits = tf.nn.dropout(logits, keep_prob_train)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=logits, labels=tf.concat(axis=0, values=train_labels)))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 15000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    keep_prob_sample = tf.placeholder(tf.float32)
    sample_input = list()
    for _ in range(2):
        sample_input.append(tf.placeholder(
            tf.float32, shape=[1, vocabulary_size]))
    samp_in_index = tf.argmax(
        sample_input[0], axis=1) + vocabulary_size * tf.argmax(sample_input[1], axis=1)
    sample_input_embedding = tf.nn.embedding_lookup(
        vocabulary_embeddings, samp_in_index)
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embedding, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [51]:
import collections
num_steps = 21001
summary_frequency = 100

valid_batches = BatchGenerator(valid_text, 1, 2)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = collections.deque(maxlen=2)
                    for _ in range(2):
                        feed.append(random_distribution())
                    sentence = characters(feed[0])[0] + characters(feed[1])[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({
                            sample_input[0]: feed[0],
                            sample_input[1]: feed[1],
                        })
                        feed.append(sample(prediction))
                        sentence += characters(feed[1])[0]
                    print(sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({
                    sample_input[0]: b[0],
                    sample_input[1]: b[1],
                    keep_prob_sample: 1.0
                })
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))


Initialized
Average loss at step 0: 3.298619 learning rate: 10.000000
Minibatch perplexity: 27.08
uwe psjh vpgn q  rauenvewobxsjte mnb  c   s hbajsdtlvywmqajtowosbgbaoe kkexs ihtr
ps eqy  heoss bcx vg  dp rvja npa m qmyigufvtqiv nmhzefpqb eshhn ani qe qz hcskqa
zd voyoya  w w vt nzn sa fyssuniembooxwo csi h eiil wlwfhrdja urzueee sreatme tnp
fj dx   hcl  j ad daznesohb us prxfpssgkthi vf rim eb cef hsrpty u    me hejs lm 
xwe brsurs tduetstku t vtn qh hm e fr c s hkey oj  eh l ceshekdfa sepv sli  rqmbb
Validation set perplexity: 20.39
Average loss at step 100: 2.288707 learning rate: 10.000000
Minibatch perplexity: 7.93
Validation set perplexity: 9.54
Average loss at step 200: 1.960973 learning rate: 10.000000
Minibatch perplexity: 7.49
Validation set perplexity: 8.19
Average loss at step 300: 1.872497 learning rate: 10.000000
Minibatch perplexity: 6.15
Validation set perplexity: 8.41
Average loss at step 400: 1.821193 learning rate: 10.000000
Minibatch perplexity: 6.22
Validation set 

Validation set perplexity: 7.48
Average loss at step 4500: 1.601645 learning rate: 10.000000
Minibatch perplexity: 5.40
Validation set perplexity: 7.67
Average loss at step 4600: 1.593318 learning rate: 10.000000
Minibatch perplexity: 4.71
Validation set perplexity: 7.43
Average loss at step 4700: 1.592260 learning rate: 10.000000
Minibatch perplexity: 5.25
Validation set perplexity: 7.54
Average loss at step 4800: 1.605559 learning rate: 10.000000
Minibatch perplexity: 5.38
Validation set perplexity: 7.55
Average loss at step 4900: 1.588053 learning rate: 10.000000
Minibatch perplexity: 5.01
Validation set perplexity: 7.51
Average loss at step 5000: 1.602069 learning rate: 10.000000
Minibatch perplexity: 4.76
rx of frequented to discreonactles recent and system the presses wifall wices nin
ys bt was kuwait s water and effectim early finnive seven nine six five by sca co
rl of for one nine zero zero eight timation two zero zero zero s team victual typ
pjamerica callion alteration da ca

g orventineers digarther is renerage are used rocvling betweenth isbn zero germen
Validation set perplexity: 7.38
Average loss at step 9100: 1.604790 learning rate: 10.000000
Minibatch perplexity: 4.88
Validation set perplexity: 7.29
Average loss at step 9200: 1.623803 learning rate: 10.000000
Minibatch perplexity: 5.28
Validation set perplexity: 7.23
Average loss at step 9300: 1.612519 learning rate: 10.000000
Minibatch perplexity: 5.39
Validation set perplexity: 7.20
Average loss at step 9400: 1.595515 learning rate: 10.000000
Minibatch perplexity: 5.14
Validation set perplexity: 6.98
Average loss at step 9500: 1.607570 learning rate: 10.000000
Minibatch perplexity: 4.38
Validation set perplexity: 7.29
Average loss at step 9600: 1.605453 learning rate: 10.000000
Minibatch perplexity: 4.64
Validation set perplexity: 7.11
Average loss at step 9700: 1.613257 learning rate: 10.000000
Minibatch perplexity: 4.90
Validation set perplexity: 6.84
Average loss at step 9800: 1.611466 learning r

Validation set perplexity: 7.22
Average loss at step 13900: 1.567368 learning rate: 10.000000
Minibatch perplexity: 5.43
Validation set perplexity: 7.04
Average loss at step 14000: 1.561589 learning rate: 10.000000
Minibatch perplexity: 4.18
mw large or solinss and because pranner the also an auricones resultional gening 
greatting some s shumang made of honority deason orgous hold in one pornistried m
vp as the the most with jid  active cause youn disalines eightenre traditions of 
with the pop of corructure of declarvated billes is government trence brings puts
qvictors catlend a duity cone sout web bapple is becodicgraphic one nine eight th
Validation set perplexity: 7.33
Average loss at step 14100: 1.572196 learning rate: 10.000000
Minibatch perplexity: 4.93
Validation set perplexity: 7.06
Average loss at step 14200: 1.580040 learning rate: 10.000000
Minibatch perplexity: 5.45
Validation set perplexity: 7.17
Average loss at step 14300: 1.574481 learning rate: 10.000000
Minibatch pe

Validation set perplexity: 6.61
Average loss at step 18400: 1.568290 learning rate: 1.000000
Minibatch perplexity: 4.10
Validation set perplexity: 6.54
Average loss at step 18500: 1.564477 learning rate: 1.000000
Minibatch perplexity: 5.10
Validation set perplexity: 6.58
Average loss at step 18600: 1.573521 learning rate: 1.000000
Minibatch perplexity: 4.50
Validation set perplexity: 6.54
Average loss at step 18700: 1.562265 learning rate: 1.000000
Minibatch perplexity: 5.29
Validation set perplexity: 6.49
Average loss at step 18800: 1.568153 learning rate: 1.000000
Minibatch perplexity: 5.40
Validation set perplexity: 6.55
Average loss at step 18900: 1.548026 learning rate: 1.000000
Minibatch perplexity: 4.83
Validation set perplexity: 6.54
Average loss at step 19000: 1.596814 learning rate: 1.000000
Minibatch perplexity: 4.57
qoids polls many faditions many what very of builbers beyer conpulf inna are betw
aarial stall for ensultiple for one nine seven ny see two caraited english of 

Introducing Dropout in the LSTMs

In [52]:
embedding_size = 128  # Dimension of the embedding vector.
num_nodes = 64
keep_prob_train = 1.0

graph = tf.Graph()
with graph.as_default():

    # Parameters:
    vocabulary_embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
    # Input gate: input, previous output, and bias.
    ix = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ib = tf.Variable(tf.zeros([1, num_nodes]))
    # Forget gate: input, previous output, and bias.
    fx = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    fb = tf.Variable(tf.zeros([1, num_nodes]))
    # Memory cell: input, state and bias.
    cx = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    cb = tf.Variable(tf.zeros([1, num_nodes]))
    # Output gate: input, previous output, and bias.
    ox = tf.Variable(tf.truncated_normal(
        [embedding_size, num_nodes], -0.1, 0.1))
    om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
    ob = tf.Variable(tf.zeros([1, num_nodes]))
    # Variables saving state across unrollings.
    saved_output = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    saved_state = tf.Variable(
        tf.zeros([batch_size, num_nodes]), trainable=False)
    # Classifier weights and biases.
    w = tf.Variable(tf.truncated_normal(
        [num_nodes, vocabulary_size], -0.1, 0.1))
    b = tf.Variable(tf.zeros([vocabulary_size]))

    # Definition of the cell computation.
    def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        return output_gate * tf.tanh(state), state

    # Input data.
    train_data = list()
    for _ in range(num_unrollings + 1):
        train_data.append(
            tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
    train_chars = train_data[:num_unrollings]
    train_inputs = zip(train_chars[:-1], train_chars[1:])
    # labels are inputs shifted by one time step.
    train_labels = train_data[2:]

    # Unrolled LSTM loop.
    outputs = list()
    output = saved_output
    state = saved_state
    for i in train_inputs:
        bigram_index = tf.argmax(i[0], axis=1) + \
            vocabulary_size * tf.argmax(i[1], axis=1)
        i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, bigram_index)
        drop_i = tf.nn.dropout(i_embed, keep_prob_train)
        output, state = lstm_cell(drop_i, output, state)
        outputs.append(output)

    # State saving across unrollings.
    with tf.control_dependencies([saved_output.assign(output),
                                  saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(axis=0, values=outputs), w, b)
        drop_logits = tf.nn.dropout(logits, keep_prob_train)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=logits, labels=tf.concat(axis=0, values=train_labels)))

    # Optimizer.
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(
        10.0, global_step, 15000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)

    # Predictions.
    train_prediction = tf.nn.softmax(logits)

    # Sampling and validation eval: batch 1, no unrolling.
    keep_prob_sample = tf.placeholder(tf.float32)
    sample_input = list()
    for _ in range(2):
        sample_input.append(tf.placeholder(
            tf.float32, shape=[1, vocabulary_size]))
    samp_in_index = tf.argmax(
        sample_input[0], axis=1) + vocabulary_size * tf.argmax(sample_input[1], axis=1)
    sample_input_embedding = tf.nn.embedding_lookup(
        vocabulary_embeddings, samp_in_index)
    saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
    saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
    reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
    sample_output, sample_state = lstm_cell(
        sample_input_embedding, saved_sample_output, saved_sample_state)
    with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                  saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

In [None]:
import collections
num_steps = 21001
summary_frequency = 100

valid_batches = BatchGenerator(valid_text, 1, 2)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    mean_loss = 0
    for step in range(num_steps):
        batches = train_batches.next()
        feed_dict = dict()
        for i in range(num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run(
            [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
        mean_loss += l
        if step % summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / summary_frequency
            # The mean loss is an estimate of the loss over the last few batches.
            print(
                'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[2:])
            print('Minibatch perplexity: %.2f' % float(
                np.exp(logprob(predictions, labels))))
            if step % (summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = collections.deque(maxlen=2)                    
                    for _ in range(2):
                        feed.append(random_distribution())
                    sentence = characters(feed[0])[0] + characters(feed[1])[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({
                            sample_input[0]: feed[0],
                            sample_input[1]: feed[1],
                        })
                        feed.append(sample(prediction))
                        sentence += characters(feed[1])[0]
                    print('Sentence: ', sentence)
                print('=' * 80)
            # Measure validation set perplexity.
            reset_sample_state.run()
            valid_logprob = 0
            print('valid_size: ', valid_size)
            for _ in range(valid_size):
                b = valid_batches.next()
                predictions = sample_prediction.eval({
                    sample_input[0]: b[0],
                    sample_input[1]: b[1],
                    keep_prob_sample: 1.0
                })
                valid_logprob = valid_logprob + logprob(predictions, b[2])
            print('Validation set perplexity: %.2f' % float(np.exp(
                valid_logprob / valid_size)))


---
Problem 3
---------

(difficult!)

Write a sequence-to-sequence LSTM which mirrors all the words in a sentence. For example, if your input is:

    the quick brown fox
    
the model should attempt to output:

    eht kciuq nworb xof
    
Refer to the lecture on how to put together a sequence-to-sequence model, as well as [this article](http://arxiv.org/abs/1409.3215) for best practices.

---

Create a utility for probabiity disctribution and sampling

In [6]:
class ProbabilityUtil:
    @staticmethod
    def logprob(predictions, labels):
        """Log-probability of the true labels in a predicted batch."""
        predictions[predictions < 1e-10] = 1e-10
        return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
    
    @staticmethod
    def sample_distribution(distribution):
        """Sample one element from a distribution assumed to be an array of normalized
        probabilities.
        """
        r = random.uniform(0, 1)
        s = 0
        for i in range(len(distribution)):
            s += distribution[i]
            if s >= r:
                return i
        return len(distribution) - 1

    @staticmethod
    def sample(prediction, vocabulary_size):
        """Turn a (column) prediction into 1-hot encoded samples."""
        p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
        p[0, ProbabilityUtil.sample_distribution(prediction)] = 1.0
        return p[0]

    @staticmethod
    def random_distribution(vocabulary_size):
        """Generate a random column of probabilities."""
        b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
        dist = b/np.sum(b, 1)[:,None]
        return dist[0]
    
print('Random distribution:', ProbabilityUtil.random_distribution(20))
print('Random Sample:', ProbabilityUtil.sample(ProbabilityUtil.random_distribution(20), 20))

Random distribution: [ 0.08192215  0.00858568  0.02007259  0.02326842  0.08370015  0.03233085
  0.03482114  0.04786063  0.05115731  0.0470081   0.04312964  0.02920952
  0.04899316  0.08340634  0.00887989  0.07267074  0.08039548  0.07684735
  0.03643272  0.08930814]
Random Sample: [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.]


We need to use alphabets as our basic unit of information and keep track of the words and their ID in a dictionary

In [7]:
class CharacterDictionary:
    """Stores all the english alphabet characters in a dictionary along with 
    known non-alphabet characters '.', '#', and ' '.
    """
    END = '.'
    UNK = '#'
    PAD = ' '
    def __init__(self):
        self._dictionary = { self.PAD: 0, self.END: 1, self.UNK: 2 }
        vocabulary_size = len(string.ascii_lowercase) + 3 # [a-z] + '.' + '#' + ' '
        for char in string.ascii_lowercase:
            self._dictionary[char.lower()] = len(self._dictionary)

    def sentence2ids(self, sentence):
        return [self.char2id(word) for word in sentence]
    
    def ids2sentence(self, sentence_ids):
        return ' '.join([self.id2char(word_id) for word_id in sentence_ids])

    def char2id(self, word):
        if self.hasChar(word):
            return self._dictionary[word]
        else:
            return self._dictionary[self.UNK]

    def id2char(self, word_id):
        if not self.hasId(word_id):
            return self.UNK
        else:
            return self._dictionary.keys()[self._dictionary.values().index(word_id)]
                
    def hasChar(self, word):
        return word in self._dictionary.keys()
    
    def hasId(self, word_id):
        return word_id in self._dictionary.values()

    def isEnd(self, word):
        return word == self.END
    
    def idIsEnd(self, word_id):
        return word_id == self._dictionary[self.END]
        
    def idIsUnknown(self, word_id):
        return word_id == self._dictionary[self.UNK]
        
    def idIsPadding(self, word_id):
        return word_id == self._dictionary[self.PAD]
    
    def length(self):
        return len(self._dictionary)
    
    @staticmethod
    def join(sentences):
        """Given an _array_ of strings we should join both sentences as one with an '<END>'
        label inbetween the strings and a '<PAD>' tag at the end of both sentences.
        """
        return ''.join([CharacterDictionary.END.join(sentences), CharacterDictionary.END])

random_int = np.random.randint(0, len(valid_text))
test_string =  valid_text[random_int:random_int + np.random.randint(10, 20)]
test_array = [85, 25, 36, 16, 27, 1, 2, 68, 31, 42, 37, 59, 81, 12, 91, 23, 0]
    
print('Sentence of "%s" translated to:' % test_string, CharacterDictionary().sentence2ids(test_string))
print(test_array, ' will translate to "%s"' % CharacterDictionary().ids2sentence(test_array))
print('Joining sentences looks like this: "%s"' % CharacterDictionary.join(['sentence to join ', 'valid text']))

Sentence of " the elimin" translated to: [0, 22, 10, 7, 0, 7, 14, 11, 15, 11, 16]
[85, 25, 36, 16, 27, 1, 2, 68, 31, 42, 37, 59, 81, 12, 91, 23, 0]  will translate to "# w # n y . # # # # # # # j # u  "
Joining sentences looks like this: "sentence to join .valid text."


In [8]:
class WordFlipper:
    @staticmethod
    def flip_word(word):
        """ Just flip the characters in this given word"""
        return word[::-1]
    
    @staticmethod
    def flip(input_sentence):
        flipped_sentence = ''
        word_to_flip = ''
        max_size = len(input_sentence) - 1
        for index, char in enumerate(input_sentence):
            if char == ' ' or index == max_size:
                if index == max_size and char != ' ':
                    flipped_sentence = ''.join([flipped_sentence, WordFlipper.flip_word(''.join([word_to_flip, char]))])
                else:
                    flipped_sentence =  ''.join([flipped_sentence, WordFlipper.flip_word(word_to_flip), char])
                word_to_flip = ''
            else:
                word_to_flip += char
        return flipped_sentence

class MirroredCorpus:    
    @staticmethod
    def create(unflipped_text, min_chars = 20, max_chars = 40):
        """ Given a corpus string of text I want to get random substets of sentences 
        and pair them with a sentence made up of a mirror of all its words.
        """
        max_char_count = 0
        flipped_text = ''
        word_to_flip = ''  
        string_to_flip = ''
        char_count = 0
        cursor_index = 0
        random_int = np.random.randint(min_chars, max_chars)
        flipped_text_pairs = []
        for char in unflipped_text:
            char_count += 1
            if char_count >= random_int and char == ' ' or cursor_index == (len(unflipped_text) - 1):
                if len(string_to_flip) > 0:
                    flipped_text = CharacterDictionary.join([WordFlipper.flip(string_to_flip)])
                    flipped_text_pairs.append([CharacterDictionary.join([string_to_flip]), flipped_text])
                string_to_flip = ''
                char_count = 0
                random_int = np.random.randint(min_chars, max_chars)
            else:
                flipped_text = ''.join([flipped_text, char])
                string_to_flip = ''.join([string_to_flip, char])
            max_char_count = max(max_char_count, (char_count + 1)) # char_count includes dot
            cursor_index += 1
        return flipped_text_pairs, max_char_count # return the flipped text and the max number of characters

test_string =  train_text[:330]

print('Sample sentence "%s" is mirrored to: "%s"' %(test_string,  MirroredCorpus.create(test_string)))
print('Sample sentence "%s" is mirrored to: "%s"' %(' This is some string ',  MirroredCorpus.create(' This is some string ', 0, 20)))

Sample sentence "ons anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist society might " is mirrored to: "([['ons anarchists advocate social relations.', 'sno stsihcrana etacovda laicos snoitaler.'], ['based upon voluntary association.', 'desab nopu yratnulov noitaicossa.'], ['of autonomous individuals.', 'fo suomonotua slaudividni.'], ['mutual aid and self governance.', 'lautum dia dna fles ecnanrevog.'], ['while anarchism is most easily.', 'elihw msihcrana si tsom ylisae.'], ['defined by what it is against anarchists.', 'denifed yb tahw ti si tsniaga stsihcrana.'], ['also offer positive visions.', 'osla reffo evitisop snoisiv.'], ['of what they believe to be a truly.', 'fo tahw yeht eveileb ot eb a ylurt.'], ['free society howev

In [9]:
batch_size = 64
num_unrollings = 30

np.set_printoptions(threshold=np.nan)

class SequenceBatchGenerator:
    def __init__(self, data, batch_size, num_unrollings=30):        
        self._dictionary = CharacterDictionary()
        self._data = data
        self._data_size = len(self._data)
        self._batch_size = batch_size       
        self._num_unrollings = num_unrollings
        self._vocabulary_size = self._dictionary.length()
        self._cursor = 0
        self._last_batch = []

    def _next_batch(self):
        """Generate a single batch from the current cursor position in the data.
        Ensure a batch is a complete sentence with both input and ouput string.
        An input and output string must have '<END>' label at the end.
        batch: [batch_size, [input: [vocab_size]], [output: [vocab_size]] ]
        """
        batch = np.zeros(shape=(self._batch_size, 2, self._vocabulary_size), dtype=np.float)
        item_index = 0;
        for batch_index in range(self._batch_size):
            char_in = self._data[self._cursor % self._data_size][0][item_index % len(self._data[self._cursor % self._data_size][0])]
            char_out = self._data[self._cursor % self._data_size][1][item_index % len(self._data[self._cursor % self._data_size][1])]            
            item_index += 1
            batch[batch_index, 0, self._dictionary.char2id(char_in)] = 1.0
            batch[batch_index, 1, self._dictionary.char2id(char_out)] = 1.0
            if self._dictionary.isEnd(char_in):
                self._cursor  += 1
                break
        return batch
        
    def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        batches = []
        if len(self._last_batch):
            batches = [self._last_batch]            
        for _ in range(self._batch_size - len(batches)):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches
    
    def decode_item(self, hot_encoding):
        return self._dictionary.id2char(np.argmax(hot_encoding))
    
    def decode_batch(self, batch, single=False):
        """Turn all 1-hot encoding or all probability distribution over the possible
        string back into its (most likely) words representation."""
        string, string_in, string_out = [],[],[]
        sentences = []
        for hot_encoding in batch:
            if not single:
                char_in = self.decode_item(hot_encoding[0])
                char_out = self.decode_item(hot_encoding[1])
                string_in.append(char_in)
                string_out.append(char_out)
                if self._dictionary.isEnd(char_in):
                    string_in = ''.join(string_in)
                    string_out = ''.join(string_out)                   
                    sentences = [string_in, string_out]
                    string_in, string_out = [],[]
            else:
                char = self.decode_item(hot_encoding)
                string.append(char)
                if self._dictionary.isEnd(char):
                    sentences = ''.join(string)
                    string = []
        return sentences
    
    def decode_batches(self, batches, single=False):
        """Convert a sequence of batches back into their (most likely) sentence
        representations."""
        if not single:
            return [self.decode_batch(batch)
                for batch in batches ]
        else:
            return [self.decode_batch(batch, single)
                for batch in batches ]
    
flip_train_data, max_train_char = MirroredCorpus.create(train_text[:300])
flip_valid_data, max_valid_char = MirroredCorpus.create(valid_text[:100])
train_sequence_batches = SequenceBatchGenerator(flip_train_data, max_train_char)
valid_sequence_batches = SequenceBatchGenerator(flip_valid_data, max_valid_char)

print('max_train_char: "%s"'% max_train_char)
print('max_valid_char: "%s"'% max_valid_char)
print('train_text[:300]: "%s"'% flip_train_data[:30])
print('flip_train_text[:300]: "%s"'% flip_train_data[:])
print('-'*50)
print('Printing test batches:')
print(train_sequence_batches.decode_batches(train_sequence_batches.next()))
print(train_sequence_batches.decode_batches(train_sequence_batches.next()))
print('Single batch:', train_sequence_batches.decode_batch(train_sequence_batches.next()[2]))
print('-'*50)
print(valid_sequence_batches.decode_batches(valid_sequence_batches.next()))
print(valid_sequence_batches.decode_batches(valid_sequence_batches.next()))
print('Single batch:', valid_sequence_batches.decode_batch(valid_sequence_batches.next()[1]))


max_train_char: "41"
max_valid_char: "39"
train_text[:300]: "[['ons anarchists advocate social relations.', 'sno stsihcrana etacovda laicos snoitaler.'], ['based upon voluntary association.', 'desab nopu yratnulov noitaicossa.'], ['of autonomous individuals mutual.', 'fo suomonotua slaudividni lautum.'], ['aid and self governance while anarchism.', 'dia dna fles ecnanrevog elihw msihcrana.'], ['is most easily defined by what it is.', 'si tsom ylisae denifed yb tahw ti si.'], ['against anarchists also offer positive.', 'tsniaga stsihcrana osla reffo evitisop.'], ['visions of what they believe to.', 'snoisiv fo tahw yeht eveileb ot.'], ['be a truly free society.', 'eb a ylurt eerf yteicos.'], ['however ideas about .', 'revewoh saedi tuoba .']]"
flip_train_text[:300]: "[['ons anarchists advocate social relations.', 'sno stsihcrana etacovda laicos snoitaler.'], ['based upon voluntary association.', 'desab nopu yratnulov noitaicossa.'], ['of autonomous individuals mutual.', 'fo suomonotua s

Let's build out our the LSTM Cell with a better understanding of its gate base structure

In [10]:
def LSTMCell(x, m_prev, c_prev, cell_weigths, cell_biases):
    """Create a LSTM cell that takes in the current input and previous
    output (if any) to generate a cell output and state
    References:
        - http://arxiv.org/pdf/1402.1128v1.pdf
        - https://colah.github.io/posts/2015-08-Understanding-LSTMs/
    Parameters:
        - x: input for the cell
        - m_prev: output of the previous cell unrolling
        - c_prev: cell state of the previous cell unrolling
        - cell_weights: weights for all gates in the cell 
        - cell_biases: biases for all gates in the cell
    Notes:
        - w represents the weight matrices (e.g. w_ix is the matrix
        of weights from the input gate to the input)
        - the b terms denote bias vectors (b_i is the input gate bias vector)
        - the labels i, f, o, c and m are respectively vectors from the input gate, 
        forget gate, output gate, cell activation and output activation
        - c is also known as the cell state
        - m can also be known as the cell output sometimes denoted by h
        - elements from previous vectors are postfixed with _prev
        - cell_weights: [[w_ix, w_im, w_ic], [w_fx, w_fm, w_fc], [w_ox, w_om, w_oc], [w_cx, w_cm]]
        - cell_biases: [b_i, b_f, b_o, b_c]
    """
    # Get LSTM Cell features (parameters)
    b_i, b_f, b_o, b_c = cell_biases
    w_i, w_f, w_o, w_c = cell_weigths
    w_ix, w_im, w_ic = w_i
    w_fx, w_fm, w_fc = w_f
    w_ox, w_om, w_oc = w_o
    w_cx, w_cm = w_c
    # Create LSTM gates
    i_gate = tf.sigmoid(tf.matmul(x, w_ix) + tf.matmul(m_prev, w_im) + 
                        tf.matmul(c_prev, w_ic) + b_i)
    f_gate = tf.sigmoid(tf.matmul(x, w_fx) + tf.matmul(m_prev, w_fm) +
                        tf.matmul(c_prev, w_fc) + b_f)
    c_tanh = tf.tanh(tf.matmul(x, w_cx) + tf.matmul(m_prev, w_cm) + b_c)
    c_gate = f_gate * c_prev + i_gate * c_tanh
    o_gate = tf.sigmoid(tf.matmul(x, w_ox) + tf.matmul(m_prev, w_om) +
                        tf.matmul(c_prev, w_oc) + b_o)
    m_gate = o_gate * tf.tanh(c_gate)
    return m_gate, c_gate

Then we go on to building our encoder network

In [73]:
class Encoder:
    def __init__(self, options):
        """Create a number of LSTM and prepare them for training to predict the letter
        that comes after a sequence of characters. The output will be used by and decoder.
        Using the many-to-one architecture for variable length inputs.
        """
        self._graph = options['graph']
        self._cell_size = options['cell_size']
        self._cell_layers = options['cell_layers']
        self._batch_size = options['batch_size']
        self._char_size = options['char_size']
        self._num_nodes = options['num_nodes']
        self._embedding_size = options['embedding_size']
        self._vocabulary_size = options['vocabulary_size']
        self._num_unrollings = options['num_unrollings']
        self._vocabulary_embeddings = options['vocabulary_embeddings']

        self._loss = 0
        self._global_step = 0
        self._learning_rate = 0
        self._logits = None
        self._optimizer = None
        self._train_prediction = None
        self._sample_prediction = None
        self._weights = []
        self._biases = []
        self._cell_weights = []
        self._cell_biases = []
        
        self._train_input = []
        self._train_output = []
        self._sample_state = []
        self._sample_input = []
        self._sample_output = []
        self._relevant_state = None
        self._relevant_output = None
        self._reset_sample_state = None
        
        self._scope = None
        self._session = None
        self._variable_initializer = None
        self._build()

    def _build(self):
        with tf.variable_scope("encoder") as self._scope:
            # Variables saving state across unrollings.
            saved_output =  tf.get_variable("saved_outut", shape=[self._batch_size, self._num_nodes], initializer=tf.zeros_initializer, trainable=False)
            saved_state = tf.Variable(tf.zeros([self._batch_size, self._num_nodes]), trainable=False)
            # Classifier weights and biases.
            self._weights = tf.Variable(tf.truncated_normal([self._num_nodes, self._vocabulary_size], -0.1, 0.1))
            self._biases = tf.Variable(tf.zeros([self._vocabulary_size]))
            # Variables for the LSTM Cell
            # Input gate weights
            w_ix = tf.Variable(tf.truncated_normal([self._embedding_size, self._num_nodes], -0.1, 0.1))
            w_im = tf.Variable(tf.truncated_normal([self._num_nodes, self._num_nodes], -0.1, 0.1))
            w_ic = tf.Variable(tf.truncated_normal([self._num_nodes, self._num_nodes], -0.1, 0.1))
            # Forget gate weights
            w_fx = tf.Variable(tf.truncated_normal([self._embedding_size, self._num_nodes], -0.1, 0.1))
            w_fm = tf.Variable(tf.truncated_normal([self._num_nodes, self._num_nodes], -0.1, 0.1))
            w_fc = tf.Variable(tf.truncated_normal([self._num_nodes, self._num_nodes], -0.1, 0.1))
            # Output gate weights
            w_ox = tf.Variable(tf.truncated_normal([self._embedding_size, self._num_nodes], -0.1, 0.1))
            w_om = tf.Variable(tf.truncated_normal([self._num_nodes, self._num_nodes], -0.1, 0.1))
            w_oc = tf.Variable(tf.truncated_normal([self._num_nodes, self._num_nodes], -0.1, 0.1))
            # State gate
            w_cx = tf.Variable(tf.truncated_normal([self._embedding_size, self._num_nodes], -0.1, 0.1))
            w_cm = tf.Variable(tf.truncated_normal([self._num_nodes, self._num_nodes], -0.1, 0.1))
            # Biases
            b_i = tf.Variable(tf.zeros([1, self._num_nodes]))
            b_f = tf.Variable(tf.zeros([1, self._num_nodes]))
            b_o = tf.Variable(tf.zeros([1, self._num_nodes]))
            b_c = tf.Variable(tf.zeros([1, self._num_nodes]))
            # Concatenate Varaiables so we can pass them to the LSTM cell easily
            self._cell_weights = [[w_ix, w_im, w_ic], [w_fx, w_fm, w_fc], [w_ox, w_om, w_oc], [w_cx, w_cm]]
            self._cell_biases = [b_i, b_f, b_o, b_c]
            
            # Input data.
            self._train_input = list()
            for _ in range(self._num_unrollings):
                self._train_input.append( 
                    tf.placeholder(tf.float32, shape=[self._batch_size, self._vocabulary_size]))
            # Change output to be next character in sequence
            # Outputs (labels) of the encoder are inputs at the next time step.
            self._train_output = tf.placeholder(tf.float32, shape=[self._batch_size, self._vocabulary_size])
            
            # Unrolled LSTM loop.
            self._outputs = []
            output = saved_output
            state = saved_state
            for i in self._train_input:
                i_embed = tf.nn.embedding_lookup(self._vocabulary_embeddings, tf.argmax(i, axis=1))
                output, state = LSTMCell(i_embed, output, state, self._cell_weights, self._cell_biases)
                self._outputs.append(output)
            
            # The relevant output is at the last time step
            self._relevant_output = output
            self._relevant_state = state
                
            # State saving across unrollings.
            with tf.control_dependencies([saved_output.assign(self._relevant_output),
                saved_state.assign(self._relevant_state)]):
                # Classifier.
                self._logits = tf.nn.xw_plus_b(output, self._weights, self._biases)
                self._loss = tf.reduce_mean(
                    tf.nn.softmax_cross_entropy_with_logits(
                        labels=self._train_output, logits=self._logits))

            # Optimizer.
            self._global_step = tf.Variable(0)
            self._learning_rate = tf.train.exponential_decay(
                10.0, self._global_step, 5000, 0.1, staircase=True)
            self._optimizer = tf.train.GradientDescentOptimizer(self._learning_rate)
            gradients, v = zip(*self._optimizer.compute_gradients(self._loss))
            gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
            self._optimizer = self._optimizer.apply_gradients(
                zip(gradients, v), global_step=self._global_step)

            # Predictions.
            self._train_prediction = tf.nn.softmax(self._logits)
            
            # Validate 
            self._validate()
            
            # Variable Initializer for the Encoder scope
            self._variable_initializer = tf.global_variables_initializer()

                
    def _validate(self, batch_size=1):
        """Sampling and validation evaluation with a batch size of one and no unrolling.
        """
        self._sample_input = tf.placeholder(tf.float32, shape=[1, self._vocabulary_size])
        sample_input_embedding = tf.nn.embedding_lookup(self._vocabulary_embeddings, tf.argmax(self._sample_input, axis=1))
        saved_sample_output = tf.Variable(tf.zeros([1, self._num_nodes]))
        saved_sample_state = tf.Variable(tf.zeros([1, self._num_nodes]))
        self._reset_sample_state = tf.group( 
            saved_sample_output.assign(tf.zeros([1, self._num_nodes])),
            saved_sample_state.assign(tf.zeros([1, self._num_nodes])))
        self._sample_output, self._sample_state = LSTMCell( 
            sample_input_embedding, saved_sample_output, saved_sample_state, self._cell_weights, self._cell_biases)
        with tf.control_dependencies([saved_sample_output.assign(self._sample_output),
            saved_sample_state.assign(self._sample_state)]):
            self._sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(self._sample_output, self._weights, self._biases))
    
    def _train(self, sequence_batches, num_steps = 7001, summary_frequency = 200, valid_sequence_batches = None):
        """Train the encoder model with our batched data
        """ 
        print('Training the sequence encoder!')
        with tf.Session(graph=self._graph) as self._session:
            assert self._session.graph is self._graph
            self._session.run(self._variable_initializer)
            mean_loss = 0
            for step in range(num_steps):
                batches = sequence_batches.next()
                feed_dict = dict()
                for i in range(self._num_unrollings):
                    feed_dict[self._train_input[i]] = np.array([b[0] for b in batches[i]])
                ## The output feed (label) is the first character of the output string
                feed_dict[self._train_output] = np.array([batch[1][0] for batch in batches])
                _, l, predictions, lr = self._session.run(
                    [self._optimizer, self._loss, self._train_prediction, self._learning_rate], 
                    feed_dict=feed_dict)
                mean_loss += l
                if step % summary_frequency == 0:
                    if step > 0:
                        mean_loss = mean_loss / summary_frequency
                    # The mean loss is an estimate of the loss over the last few batches.
                    print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
                    mean_loss = 0
                    labels = feed_dict[self._train_output]
                    print('Minibatch perplexity: %.2f' % float(
                        np.exp(ProbabilityUtil.logprob(predictions, labels))))
                    if step % (summary_frequency * 10) == 0:
                        # Generate some single one-hot encoding samples.
                        print('=' * 80)
                        input_feed, output_feed = [],[]
                        input_chars, output_chars = '',''
                        ## We should feed the model one character at a time and take the last prediction
                        for _ in range(self._batch_size):
                            item = ProbabilityUtil.sample(ProbabilityUtil.random_distribution(self._vocabulary_size), self._vocabulary_size)                            
                            item = item.reshape([1, self._vocabulary_size])
                            input_feed.append(item)
                            self._reset_sample_state.run()
                            prediction = self._sample_prediction.eval(
                                {self._sample_input: item})
                            out_item = ProbabilityUtil.sample(prediction[0], self._vocabulary_size)
                            output_feed.append(out_item)
                            input_chars += sequence_batches.decode_item(item)
                            output_chars += sequence_batches.decode_item(out_item)
                        print('For input sentence "%s" we got an encoder output of "%s"' %(
                            input_chars, 
                            output_chars))
                        # NOTE: Okay, we are only able to predict one char at a time. This is not enough
                        print('=' * 80)
                    # Measure validation set perplexity.
                    self._reset_sample_state.run()
                    if valid_sequence_batches:
                        valid_logprob = 0
                        for _ in range(valid_size):
                            b = valid_sequence_batches.next()                                         
                            bv = np.array([ ba[1] for ba in b[i]])
                            predictions = self._sample_prediction.eval({self._sample_input: bv})
                            valid_logprob = valid_logprob + ProbabilityUtil.logprob(predictions, b[1])
                        print('Validation set perplexity: %.2f' % float(np.exp(
                            valid_logprob / valid_size)))
        print('Training completed!')

After an encoder we of course need our decoder

In [74]:
class Decoder:
    def __init__(self, cell_size, cell_layers=1):
        """Create a number of LSTM and prepare them for training to predict the next letter
        given a sequence of characters. Beam serach can also be used to improve prediction.
        """
        self._cell_size = cell_size
        self._cell_layers = cell_layers
        pass
    
    def _train(self, sequence_batches, num_steps = 7001, summary_frequency = 100):
        print('Training the sequence decoder!')
        pass     

We will then need to build a sequence-to-sequence model for our prediction.

In [75]:
class SequenceModel:
    def __init__(self, options):
        self._options = options
        self._encoder = None
        self._decoder = None
        # Create the Tensor graph
        self._graph = tf.Graph()
        with self._graph.as_default():
            self._vocabulary_embeddings = tf.Variable( 
                tf.random_uniform([self._options['vocabulary_size'], 
                                   self._options['embedding_size']], -1.0, 1.0))
            options['vocabulary_embeddings'] = self._vocabulary_embeddings
            options['graph'] = self._graph
            self._encoder = self._build_encoder(self._options)
            self._decoder = self._build_decoder(self._options)
    
    def _build_encoder(self, options):
        """Build the encoder network using its own personal variable scope
        """
        return Encoder(options)
            
    def _build_decoder(self, options):
        """Build the decoder network using its own personal variable scope
        """
        return Decoder(options)
    
    def train(self, sequence_batches):
        """We would like to train our encode to predict a first letter given some state.
        Then we will also train our decoder to predict the next character given.
        Training data contains batches with arrays of both input and output sequences.
        Takes the training sequence of batches.
        """
        self._encoder._train(sequence_batches)        
        self._decoder._train(sequence_batches)
    
    def validate(self):
        self._encoder._validate()
        self._decoder._validate()        
    
    def predict(self, input_data):
        pass

Let us use the model to predict a sequence of words that ends with `<EOS>`.

In [32]:
embedding_size = 128 # Dimension of the embedding vector.
num_nodes = 60
batch_size = max_train_char # TODO we should not use maxiumum character count as the batch size
vocabulary_size = train_sequence_batches._dictionary.length()
max_char_count = max(max_train_char, max_valid_char)

# Create options for our network
model_options = {
    'batch_size': batch_size, 
    'char_size': max_train_char, 
    'embedding_size': embedding_size, 
    'num_nodes': num_nodes, 
    'vocabulary_size': vocabulary_size, 
    'num_unrollings': num_unrollings,
    'num_time_steps': num_unrollings,
    'cell_size': 1,
    'cell_layers': 1
}
sequence_model = SequenceModel(model_options)
train_sequence_batches
valid_sequence_batches
sequence_model.train(train_sequence_batches)

print('Model completed!')


Training the sequence encoder!
Average loss at step 0: 3.438741 learning rate: 10.000000
Minibatch perplexity: 31.15
For input sentence "cwxqjj fsnyaxhseqyga.lpjjtukrzgmaeo" we got an encoder output of "le.fdoosseoeoesoftgeooffoazeshkgoet"
Average loss at step 200: 3.149766 learning rate: 10.000000
Minibatch perplexity: 27.90
Average loss at step 400: 3.092173 learning rate: 10.000000
Minibatch perplexity: 21.24
Average loss at step 600: 3.148765 learning rate: 10.000000
Minibatch perplexity: 10.92
Average loss at step 800: 3.150891 learning rate: 10.000000
Minibatch perplexity: 13.04
Average loss at step 1000: 3.092864 learning rate: 10.000000
Minibatch perplexity: 19.03
Average loss at step 1200: 2.990771 learning rate: 10.000000
Minibatch perplexity: 15.17
Average loss at step 1400: 3.078714 learning rate: 10.000000
Minibatch perplexity: 10.76
Average loss at step 1600: 2.987751 learning rate: 10.000000
Minibatch perplexity: 11.04
Average loss at step 1800: 3.003209 learning rate: 1