In [1]:
import sys
import os
import numpy as np
import matplotlib.pyplot as plt

import IPython.display as ipyd
import tensorflow as tf



In [9]:
%pylab
from six.moves import urllib
import ssl

with open('alice.txt', 'r', encoding='utf-8') as fp:
    txt = fp.read()


tf.reset_default_graph()

Using matplotlib backend: Qt5Agg
Populating the interactive namespace from numpy and matplotlib


In [13]:
# Sort the characters so the character <-> integer mapping is the same on every run
vocab = sorted(set(txt))
len(txt), len(vocab)

(163781, 85)

Great, so we now have about 164 thousand characters and 85 unique characters in our vocabulary, which we can use to train a model of language. Rather than use the characters themselves, we'll convert each character to a unique integer. We'll later see that when we work with words, we can achieve a similar goal using a very popular model called word2vec: https://www.tensorflow.org/versions/r0.9/tutorials/word2vec/index.html

In [17]:
encoder = dict(zip(vocab, range(len(vocab))))
decoder = dict(zip(range(len(vocab)), vocab))

In [21]:
# encoder
# decoder
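
To see what these two dictionaries do, here is a minimal sketch on a toy string. The names `toy_txt`, `toy_vocab`, `toy_encoder`, and `toy_decoder` are made up for this illustration so they don't overwrite the real ones, and sorting the vocabulary is an assumption made so that the ids come out the same on every run.

```python
# Toy text standing in for `txt` (hypothetical example string)
toy_txt = "hello world"

# Sorted so the character -> id mapping is reproducible
toy_vocab = sorted(set(toy_txt))
toy_encoder = dict(zip(toy_vocab, range(len(toy_vocab))))
toy_decoder = dict(zip(range(len(toy_vocab)), toy_vocab))

# Encode each character to its integer id, then decode back again
ids = [toy_encoder[ch] for ch in toy_txt]
roundtrip = "".join(toy_decoder[i] for i in ids)

print(ids)
print(roundtrip)
```

Encoding followed by decoding should give back the original string, which is a quick sanity check worth running on the real `encoder` and `decoder` too.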

### Creating the Model
For our model, we'll need to define a few parameters.

In [22]:
# Number of sequences in a mini batch
batch_size = 100

# Number of characters in a sequence
sequence_length = 100

# Number of cells in our LSTM layer
n_cells = 256

# Number of LSTM layers
n_layers = 2

# Total number of characters in the one-hot encoding
n_chars = len(vocab)

Now create the input and output to the network. Rather than having batch size x number of features, or batch size x height x width x channels, we're going to have batch size x sequence length.

In [23]:
X = tf.placeholder(tf.int32, [None, sequence_length], name='X')

# We'll have a placeholder for our true outputs
Y = tf.placeholder(tf.int32, [None, sequence_length], name='Y')

Now remember with MNIST that we used a one-hot vector representation of our numbers. We could transform our input data into such a representation. But instead, we'll use tf.nn.embedding_lookup so that we don't need to compute the encoded vector. Let's see how this works:

In [24]:
# we first create a variable to take us from our one-hot representation to our LSTM cells
embedding = tf.get_variable("embedding", [n_chars, n_cells])

# And then use tensorflow's embedding lookup to look up the ids in X
Xs = tf.nn.embedding_lookup(embedding, X)

# The resulting lookups are concatenated into a dense tensor
print(Xs.get_shape().as_list())

[None, 100, 256]
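
To make the connection to one-hot vectors concrete, here is a small NumPy sketch (not TensorFlow) suggesting that looking up rows of a matrix gives the same result as multiplying a one-hot encoding by that matrix. The sizes and the names `embedding_np`, `X_np`, `n_chars_toy`, and `n_cells_toy` are made up for the illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Small, made-up sizes so the arrays are easy to inspect
n_chars_toy, n_cells_toy = 5, 3

# Stand-in for the `embedding` variable: one row per character id
embedding_np = rng.randn(n_chars_toy, n_cells_toy)

# A batch of 2 sequences, each 4 character ids long
X_np = np.array([[0, 2, 2, 4],
                 [1, 3, 0, 0]])

# Lookup: simply index the rows of the embedding matrix
looked_up = embedding_np[X_np]

# Equivalent one-hot route: build the one-hot tensor, then matrix multiply
one_hot = np.eye(n_chars_toy)[X_np]        # 2 x 4 x 5
via_matmul = one_hot.dot(embedding_np)     # 2 x 4 x 3

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup avoids ever building the batch x sequence length x n_chars one-hot tensor, which is the saving the text refers to.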


To create a recurrent network, we're going to need to slice our sequences into individual inputs. That will give us a list with one entry per timestep, each of shape batch_size x input_size. Each character will then be connected to a recurrent layer composed of n_cells LSTM units.

In [26]:
# Let's create a name scope for the operations to clean things up in our graph
with tf.name_scope('reslice'):
    Xs = [tf.squeeze(seq, [1]) for seq in tf.split(Xs, sequence_length, axis=1)]
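
The same slicing can be sketched in NumPy to check the shapes by hand. This is only an illustration with made-up sizes (`batch_toy`, `seq_toy`, `feat_toy`), not the graph code itself.

```python
import numpy as np

batch_toy, seq_toy, feat_toy = 2, 4, 3
Xs_np = np.arange(batch_toy * seq_toy * feat_toy).reshape(
    batch_toy, seq_toy, feat_toy)

# Split along the time axis into seq_toy pieces of shape 2 x 1 x 3,
# then drop the singleton time axis from each piece
steps = [np.squeeze(piece, axis=1)
         for piece in np.split(Xs_np, seq_toy, axis=1)]

print(len(steps), steps[0].shape)
```

Each element of `steps` holds one timestep for the whole batch, which is the form the recurrent layer below expects as input.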

Now we'll create our recurrent layer composed of LSTM cells.

In [31]:
cells = tf.contrib.rnn.BasicLSTMCell(num_units=n_cells)

We'll initialize our LSTMs using the convenience method provided by tensorflow. We could explicitly define the batch size here or use the tf.shape method to compute it based on whatever X is, letting us feed in different sizes into the graph.

In [32]:
initial_state = cells.zero_state(tf.shape(X)[0], tf.float32)

Great, now we have a layer of recurrent cells and a way to initialize them. If we wanted to make this a multi-layer recurrent network, we could use the MultiRNNCell like so:

In [33]:
# Build deeper recurrent net if using more than 1 layer
if n_layers > 1:
    cells = [cells]
    for layer_i in range(1, n_layers):
        with tf.variable_scope('{}'.format(layer_i)):
            this_cell = tf.contrib.rnn.BasicLSTMCell(
                num_units=n_cells)
            cells.append(this_cell)
    cells = tf.contrib.rnn.MultiRNNCell(cells)
    initial_state = cells.zero_state(tf.shape(X)[0], tf.float32)

In either case, the state of the cells is composed of their outputs, as modulated by the LSTM's output gate, and whatever is currently stored in their memory contents. Now let's connect our input to it.

In [34]:
# This returns a list with one output for every element in our sequence.
# Each output is `batch_size` x `n_cells`.
# It also returns the final state as a tuple of the cells' memory and
# their output, which we can feed back in the next time we use the recurrent layer.
outputs, state = tf.contrib.rnn.static_rnn(cells, Xs, initial_state=initial_state)

# We'll now stack the outputs of every timestep into a single 2-D tensor
outputs_flat = tf.reshape(tf.concat(outputs, axis=1), [-1, n_cells])
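
It is worth checking what order the rows of `outputs_flat` end up in, since the targets have to be flattened the same way. The NumPy sketch below mimics the concatenate-then-reshape step on made-up data, where every value encodes its batch index `b` and timestep `t` as `10 * b + t`.

```python
import numpy as np

batch_toy, seq_toy, cells_toy = 2, 3, 4

# One array per timestep, each batch x cells, like `outputs`
outputs_np = [np.array([[10 * b + t] * cells_toy for b in range(batch_toy)])
              for t in range(seq_toy)]

# Same two steps as the cell above: concatenate on axis 1, then reshape
flat = np.reshape(np.concatenate(outputs_np, axis=1), [-1, cells_toy])

# Targets shaped batch x sequence length, flattened in row-major order
Y_np = np.array([[10 * b + t for t in range(seq_toy)]
                 for b in range(batch_toy)])
Y_flat_np = Y_np.reshape(-1)

# Rows come out as (b=0, t=0), (b=0, t=1), (b=0, t=2), (b=1, t=0), ...
print(flat[:, 0])
print(Y_flat_np)
```

Both printed vectors list the first sequence's timesteps and then the second's, so flattening a batch x sequence length target array lines up row for row with the flattened outputs.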

For our output, we'll simply try to predict the very next timestep. So if our input sequence was "networ", our output sequence should be: "etwork". This will give us the same batch size coming out, and the same number of elements as our input sequence.
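
As a quick illustration of that shift, here is a tiny sketch. `make_pair` is a hypothetical helper written for this example only; it is not part of the notebook's code.

```python
def make_pair(text, start, length):
    """Return an (input, target) pair where the target is the input
    window moved one character to the right."""
    x = text[start:start + length]
    y = text[start + 1:start + length + 1]
    return x, y


x, y = make_pair("network", 0, 6)
print(x, y)
```

Each target character is the character that follows the corresponding input character, so the two strings overlap in all but one position.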

In [35]:
with tf.variable_scope('prediction'):
    W = tf.get_variable(
        "W",
        shape=[n_cells, n_chars],
        initializer=tf.random_normal_initializer(stddev=0.1))
    b = tf.get_variable(
        "b",
        shape=[n_chars],
        initializer=tf.random_normal_initializer(stddev=0.1))

    # Find the output prediction of every single character in our minibatch
    # we denote the pre-activation prediction, logits.
    logits = tf.matmul(outputs_flat, W) + b

    # We get the probabilistic version by calculating the softmax of this
    probs = tf.nn.softmax(logits)

    # And then we can find the index of maximum probability
    Y_pred = tf.argmax(probs, axis=1)
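
The softmax and argmax steps can be sketched in NumPy on a couple of made-up rows of logits. The `softmax` function here is a plain reimplementation for illustration, not TensorFlow's.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)


logits_np = np.array([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])
probs_np = softmax(logits_np)
pred_np = np.argmax(probs_np, axis=1)

print(probs_np.sum(axis=1), pred_np)
```

Because softmax preserves the ordering of the logits within a row, taking the argmax of the logits directly would give the same predictions; the probabilities are mainly useful if you want to sample from the model rather than always pick the most likely character.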

### Loss
Our loss function will take the reshaped predictions and targets, and compute the softmax cross entropy.

In [36]:
with tf.variable_scope('loss'):
    # Compute mean cross entropy loss for each output.
    # Y is already a single tensor, so we only need to flatten it
    Y_true_flat = tf.reshape(Y, [-1])
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=Y_true_flat)
    mean_loss = tf.reduce_mean(loss)
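
For intuition about what the sparse softmax cross entropy computes, here is a NumPy sketch under the assumption that the loss for each row is the negative log-probability the softmax assigns to the true class. `sparse_softmax_xent` is a name invented for this example.

```python
import numpy as np


def sparse_softmax_xent(logits, labels):
    # Log-softmax computed stably, then pick out the log-probability
    # of the true class for each row
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    log_probs = shifted - np.log(
        np.sum(np.exp(shifted), axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


logits_np = np.array([[2.0, 1.0, 0.1],
                      [0.0, 0.0, 0.0]])
labels_np = np.array([0, 2])   # integer class ids, not one-hot vectors
losses = sparse_softmax_xent(logits_np, labels_np)

print(losses, losses.mean())
```

The second row has equal logits, so its loss is the log of the number of classes. By the same reasoning, a model that guesses uniformly over our 85 characters would have a loss of about log(85), roughly 4.44, which is a useful baseline when watching the training loss.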

### Clipping the Gradient
Normally, we would just create an optimizer, give it a learning rate, and tell it to minimize our loss. But with recurrent networks, we can help out a bit by telling it to clip gradients. That helps with the exploding gradient problem, ensuring they can't get any bigger than the value we tell it. We can do that in TensorFlow by iterating over every gradient and variable pair, and clipping each gradient's values before we apply the update to every trainable variable.

In [37]:
with tf.name_scope('optimizer'):
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
    gradients = []
    clip = tf.constant(5.0, name="clip")
    for grad, var in optimizer.compute_gradients(mean_loss):
        gradients.append((tf.clip_by_value(grad, -clip, clip), var))
    updates = optimizer.apply_gradients(gradients)

We could also explore other methods of clipping the gradient based on a percentile of the norm of activations or other similar methods, like when we explored deep dream regularization. But the LSTM has been built to help regularize the network through its own gating mechanisms, so this may not be the best idea for your problem. Really, the only way to know is to try different approaches and see how it affects the output on your problem.
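
As one example of an alternative, the NumPy sketch below contrasts the element-wise clipping used in the cell above with clipping by the global norm, which rescales all gradients together. Both functions are illustrative reimplementations with made-up names and data; TensorFlow offers `tf.clip_by_global_norm` for the latter, but whether it helps on this problem is something you would need to test.

```python
import numpy as np


def clip_by_value(grads, clip):
    # Clamp every element independently, as in the optimizer cell
    return [np.clip(g, -clip, clip) for g in grads]


def clip_by_global_norm(grads, max_norm):
    # Rescale all gradients together so their joint L2 norm does not
    # exceed max_norm; the direction of the update is preserved
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm


grads = [np.array([3.0, -4.0]), np.array([12.0])]
by_value = clip_by_value(grads, 5.0)
by_norm, norm_before = clip_by_global_norm(grads, 5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in by_norm))

print(by_value)
print(norm_before, norm_after)
```

Element-wise clipping only touches the entries that exceed the threshold, which can change the direction of the update, while norm clipping shrinks everything by the same factor.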


### Training

In [None]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

cursor = 0
it_i = 0
while True:
    Xs, Ys = [], []
    for batch_i in range(batch_size):
        if (cursor + sequence_length) >= len(txt) - sequence_length - 1:
            cursor = 0
        Xs.append([encoder[ch]
                   for ch in txt[cursor:cursor + sequence_length]])
        Ys.append([encoder[ch]
                   for ch in txt[cursor + 1: cursor + sequence_length + 1]])

        cursor = (cursor + sequence_length)
    Xs = np.array(Xs).astype(np.int32)
    Ys = np.array(Ys).astype(np.int32)

    loss_val, _ = sess.run([mean_loss, updates],
                           feed_dict={X: Xs, Y: Ys})
    print(it_i, loss_val)

    if it_i % 500 == 0:
        p = sess.run([Y_pred], feed_dict={X: Xs})[0]
        preds = [decoder[p_i] for p_i in p]
        print("".join(preds).split('\n'))

    it_i += 1
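
The batching logic inside the loop can also be pulled out into a function, which makes it easier to test on its own. The following is a sketch, not the code used above: `next_batch` is a hypothetical helper, the toy text is made up, and the wrap-around condition is a slightly simpler one than in the loop (it only requires room for one input window plus its shifted target).

```python
import numpy as np


def next_batch(text, enc, cursor, batch_size, sequence_length):
    """Return (inputs, targets, new_cursor) for one mini batch.

    Assumes len(text) >= sequence_length + 1.
    """
    xs, ys = [], []
    for _ in range(batch_size):
        # Wrap around when there is no room for a window and its target
        if cursor + sequence_length + 1 > len(text):
            cursor = 0
        xs.append([enc[ch] for ch in text[cursor:cursor + sequence_length]])
        ys.append([enc[ch]
                   for ch in text[cursor + 1:cursor + sequence_length + 1]])
        cursor += sequence_length
    return (np.array(xs, dtype=np.int32),
            np.array(ys, dtype=np.int32),
            cursor)


toy_text = "abcdefghij"
toy_enc = {ch: i for i, ch in enumerate(sorted(set(toy_text)))}
bx, by, cur = next_batch(toy_text, toy_enc, 0, batch_size=2, sequence_length=3)

print(bx)
print(by)
print(cur)
```

Every row of the targets is the matching row of the inputs shifted by one character, which is the property the loss depends on.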

### Extensions
There are certainly a lot of additions that can speed up or help with training, including dropout and batch normalization, that I haven't gone into here. Also, when dealing with variable-length sequences, you may want to consider using a special token to denote the last character or element in your sequence.
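
One possible way to handle variable-length sequences is sketched below: append an end token to each sequence and pad the shorter ones to a common length. The token strings, the `pad_sequences` helper, and the 0/1 mask are all assumptions made for this illustration; the mask is a common companion to padding because it lets you exclude padded positions from the loss.

```python
# Hypothetical special tokens; any symbols that never occur in the data would do
EOS, PAD = "<eos>", "<pad>"


def pad_sequences(seqs, eos=EOS, pad=PAD):
    """Append an end token to each sequence and pad to a common length.

    Returns the padded sequences and a 0/1 mask marking the real elements.
    """
    max_len = max(len(s) for s in seqs) + 1  # +1 for the end token
    padded, mask = [], []
    for s in seqs:
        tokens = list(s) + [eos]
        n_pad = max_len - len(tokens)
        padded.append(tokens + [pad] * n_pad)
        mask.append([1] * len(tokens) + [0] * n_pad)
    return padded, mask


padded, mask = pad_sequences(["cat", "a"])
print(padded)
print(mask)
```

If you go this route, the special tokens also need entries in the vocabulary so that the encoder and decoder can map them to integers and back.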

As for applications, they are completely endless, and I think that is really what makes this field so exciting right now. There doesn't seem to be any limit to what is possible. First of all, you are not limited to text. You may want to feed in MIDI data to create a piece of algorithmic music. I've tried it with raw sound data and this even works, but it requires a lot of memory and at least 30k iterations to run before it sounds like anything. Or perhaps you might try some other unexpected text-based information, such as encodings of image data like JPEG in base64, or other compressed data formats. Or perhaps you are more adventurous and want to combine what you've learned here with the previous sessions to add recurrent layers to a traditional convolutional model.


### Future
If you're still here, then I'm really excited for you and to see what you'll create. By now, you've seen most of the major building blocks of neural networks. From here, you are only limited by the time it takes to train all of the interesting ideas you'll have. But there is still so much more to discover, and it's very likely that this entire course is already out of date, because this field just moves incredibly fast. In any case, the applications of these techniques have been much slower to develop, so if you're here to see how your creative practice could grow with these techniques, then you should already have plenty to discover.

I'm very excited about how the field is moving. Often, it is very hard to find labels for a lot of data in a meaningful and consistent way. But there is a lot of interesting stuff starting to emerge in the unsupervised models. Those are the models that just take data in, and the computer reasons about it. And even more interesting is the combination of general purpose learning algorithms. That's really where reinforcement learning is starting to shine. But that's for another course, perhaps.

### Reading
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks. 2014. https://arxiv.org/abs/1406.2661

Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. 2014.

Alec Radford, Luke Metz, Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. 2015. https://arxiv.org/abs/1511.06434

Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus. Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks. 2015. https://arxiv.org/abs/1506.05751

Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther. Autoencoding beyond pixels using a learned similarity metric. 2015. https://arxiv.org/abs/1512.09300

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, Aaron Courville. Adversarially Learned Inference. 2016. https://arxiv.org/abs/1606.00704

A. Graves. Generating sequences with recurrent neural networks. In Arxiv preprint, arXiv:1308.0850, 2013.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.

Yoon Kim, Yacine Jernite, David Sontag, Alexander M. Rush. Character-Aware Neural Language Models. 2015. https://arxiv.org/abs/1508.06615

I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neural networks. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, pages 1017–1024, New York, NY, USA, June 2011. ACM.