# MI-MVI tutorial 3 #

In this tutorial, you will use **Recurrent Neural Networks (RNN)** to predict words in English text. Large neural language models have recently been very successful at language modelling tasks, so it is worth learning the basics.

 - Based on TF RNN tutorial: https://www.tensorflow.org/tutorials/recurrent
 - Pretty images and some theory come from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

![RNN unrolled](images/RNN-unrolled.png "Structure of unrolled RNN.")

In the above diagram, a chunk of neural network, *A*, looks at some input *x_t* and outputs a value *h_t*. A loop allows information to be passed from one step of the network to the next. The boxes labelled *A* are elementary modules whose internals depend on the particular type of RNN. In the standard (vanilla) RNN they look like:

![RNN cell](images/SimpleRNNcell.png)
in comparison to the LSTM cell:
![LSTM cell](images/LSTMcell.png)

We will focus on the standard RNN cell. The yellow box with *tanh* is a neural network layer. It combines the input *x_t* with the state produced by the preceding cell. We are going to build such a cell from scratch, but you can use the **tf.contrib.rnn.RNNCell** abstract class for an easier and faster implementation.
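
To make the role of the *tanh* layer concrete, here is a minimal pure-Python sketch of a single vanilla RNN step. It is not the TensorFlow code used later in this tutorial: the helper names `matvec` and `rnn_step` and the tiny 2-dimensional weights are illustrative assumptions, chosen only so that the numbers can be checked by hand.

```python
import math

def matvec(vector, matrix):
    """Multiply a row vector (list) by a matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def rnn_step(h_prev, x, W, U, b):
    """Compute h_t = tanh(h_prev W + x U + b) for a single example."""
    hw = matvec(h_prev, W)   # contribution of the previous state
    xu = matvec(x, U)        # contribution of the current input
    return [math.tanh(a + c + bias) for a, c, bias in zip(hw, xu, b)]

# toy sizes: a 2-dimensional state and input (hypothetical values)
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]                      # zero initial state
for x in ([1.0, 0.0], [0.0, 1.0]):  # two time steps
    h = rnn_step(h, x, W, U, b)
print(h)
```

The same weights are reused at every time step; only the state `h` changes, which is what "multiple copies of the same network" means in the unrolled picture.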

**Import** all packages that will be used.

In [2]:
import os, sys, tarfile
import collections
from six.moves.urllib.request import urlretrieve
import numpy as np
import tensorflow as tf

Download the [Penn Tree Bank (PTB)](https://catalog.ldc.upenn.edu/ldc99t42) dataset. We use the same approach as in the previous tutorial.

In [3]:
url = 'http://www.fit.vutbr.cz/~imikolov/rnnlm/'
data_root = 'data/rnn'
last_percent_reported = None

# make sure the dataset directory exists
if not os.path.isdir(data_root):
  os.makedirs(data_root)

def download_progress_hook(count, blockSize, totalSize):
  """A hook to report the progress of a download. This is mostly intended for users with
  slow internet connections. Reports every 5% change in download progress.
  """
  global last_percent_reported
  percent = int(count * blockSize * 100 / totalSize)

  if last_percent_reported != percent:
    if percent % 5 == 0:
      sys.stdout.write("%s%%" % percent)
      sys.stdout.flush()
    else:
      sys.stdout.write(".")
      sys.stdout.flush()
      
    last_percent_reported = percent
    
def maybe_download(filename, expected_bytes, force=False):
  """Download a file if not present, and make sure it's the right size."""
  dest_filename = os.path.join(data_root, filename)
  if force or not os.path.exists(dest_filename):
    print('Attempting to download:', filename) 
    filename, _ = urlretrieve(url + filename, dest_filename, reporthook=download_progress_hook)
    print('\nDownload Complete!')
  statinfo = os.stat(dest_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', dest_filename)
  else:
    raise Exception(
      'Failed to verify ' + dest_filename + '. Can you get to it with a browser?')
  return dest_filename

train_filename = maybe_download('simple-examples.tgz', 34869662)

Found and verified data/rnn/simple-examples.tgz


We need to unpack the downloaded data. The content we need is in the **data** subdirectory.

In [4]:
def maybe_extract(filename, force=False):
  root = os.path.splitext(os.path.splitext(filename)[0])[0]  # strip the archive extension (.tgz or .tar.gz)
  if os.path.isdir(root) and not force:
    # You may override by setting force=True.
    print('%s already present - Skipping extraction of %s.' % (root, filename))
  else:
    print('Extracting data for %s. This may take a while. Please wait.' % root)
    tar = tarfile.open(filename)
    sys.stdout.flush()
    tar.extractall(data_root)
    tar.close()
  data_folders = [
    os.path.join(root, d) for d in sorted(os.listdir(root))
    if os.path.isdir(os.path.join(root, d))]
  
  print(data_folders)
  return data_folders
  
train_folders = maybe_extract(train_filename)

data/rnn/simple-examples already present - Skipping extraction of data/rnn/simple-examples.tgz.
['data/rnn/simple-examples/1-train', 'data/rnn/simple-examples/2-nbest-rescore', 'data/rnn/simple-examples/3-combination', 'data/rnn/simple-examples/4-data-generation', 'data/rnn/simple-examples/5-one-iter', 'data/rnn/simple-examples/6-recovery-during-training', 'data/rnn/simple-examples/7-dynamic-evaluation', 'data/rnn/simple-examples/8-direct', 'data/rnn/simple-examples/9-char-based-lm', 'data/rnn/simple-examples/data', 'data/rnn/simple-examples/models', 'data/rnn/simple-examples/rnnlm-0.2b', 'data/rnn/simple-examples/temp']


Now we define a few helpers to manipulate the data. Because TensorFlow needs tensors of numbers, we represent words by integer indexes. To do so, we build a vocabulary from a given file. The vocabulary contains key-value pairs of the form 'word': ID.

The **ptb_raw_data** function loads all necessary files, builds the vocabulary from the training file and transforms the content of the train, validation and test files into sequences of numbers.
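
As a small illustration of the idea, the sketch below builds such a vocabulary from an in-memory string instead of a file. The text and the variable names are made up for this example; the sorting rule (most frequent word first, ties broken alphabetically) is the same one used by the helpers in this tutorial.

```python
import collections

text = "the cat sat on the mat\nthe dog sat\n"

# mark sentence ends explicitly, then split on whitespace
# (spaces are added around <eos> because this toy text has none before "\n")
words = text.replace("\n", " <eos> ").split()

# most frequent words get the lowest IDs; ties are broken alphabetically
counter = collections.Counter(words)
count_pairs = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
word_to_id = {word: idx for idx, (word, _) in enumerate(count_pairs)}

# the text as a sequence of word IDs
ids = [word_to_id[word] for word in words]

print(word_to_id)
print(ids)
```

Words that are missing from the vocabulary have no ID at all, which is why the real helper silently drops them when it converts a file.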

In [5]:
def _read_words(filename):
  with tf.gfile.GFile(filename, "r") as f:
    return f.read().replace("\n", "<eos>").split()

def _build_vocab(filename, wordsLimit=None):
  data = _read_words(filename)
  counter = collections.Counter(data)
  count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
  if wordsLimit is not None:
    count_pairs = count_pairs[:wordsLimit]
  words, _ = list(zip(*count_pairs))
  word_to_id = dict(zip(words, range(len(words))))
  return word_to_id

def _file_to_word_ids(filename, word_to_id):
  data = _read_words(filename)
  return [word_to_id[word] for word in data if word in word_to_id]

def ptb_raw_data(data_path=None, wordsLimit=None):
  """Load PTB raw data from data directory "data_path".
  Reads PTB text files and converts strings to integer ids.
  Args:
    data_path: string path to the "data" directory of the extracted simple-examples.tgz.
    wordsLimit: optional int, keep only this many most frequent words in the vocabulary.
  Returns:
    tuple (train_data, valid_data, test_data, vocabulary)
    where each of the data objects can be passed to ptb_producer
    and vocabulary is the number of words in the vocabulary.
  """

  train_path = os.path.join(data_path, "ptb.train.txt")
  valid_path = os.path.join(data_path, "ptb.valid.txt")
  test_path = os.path.join(data_path, "ptb.test.txt")

  word_to_id = _build_vocab(train_path, wordsLimit)
  train_data = _file_to_word_ids(train_path, word_to_id)
  valid_data = _file_to_word_ids(valid_path, word_to_id)
  test_data = _file_to_word_ids(test_path, word_to_id)
  vocabulary = len(word_to_id)
  return train_data, valid_data, test_data, vocabulary

Let us inspect the data:

In [6]:
# if learning is very slow, we can trim the vocabulary (10000 is the baseline)
wordsLimit=10000

train_data, valid_data, test_data, vocabulary = ptb_raw_data(os.path.join(data_root, 'simple-examples/data'), wordsLimit)
# NOTE: this vocabulary is built from ptb.test.txt, so its word IDs differ from the IDs
# in train_data, which ptb_raw_data derives from ptb.train.txt
vocab = _build_vocab(os.path.join(data_root, 'simple-examples','data','ptb.test.txt'), wordsLimit)
firstitems = {k: vocab[k] for k in sorted(vocab.keys())[:30]}

print('train data len:', len(train_data))
print('validation data len:', len(valid_data))
print('test data len:', len(test_data))
print('vocabulary item count:', vocabulary)
print('the first 30 vocabulary items:', firstitems)

train data len: 929589
validation data len: 73760
test data len: 82430
vocabulary item count: 10000
the first 30 vocabulary items: {'#': 1181, '$': 14, '&': 72, "'": 131, "'d": 1182, "'ll": 963, "'m": 964, "'re": 234, "'s": 9, "'ve": 670, '10-year': 2056, '12-month': 4166, '12-year': 3108, '13th': 1059, '190-point': 1747, '190.58-point': 2057, '1920s': 4167, '1930s': 4168, '1960s': 2058, '1970s': 1748, '1980s': 1327, '1990s': 2494, '20th': 4169, '30-year': 2495, '500-stock': 2496, '52-week': 2059, '<eos>': 2, '<unk>': 0, 'N': 3, 'a': 6}


In [7]:
def ptb_producer(raw_data, batch_size, num_steps, name=None):
  """
  Iterate on the raw PTB data.
  This chunks up raw_data into batches of examples and returns Tensors that are drawn from these batches.
  Args:
    raw_data: one of the raw data outputs from ptb_raw_data.
    batch_size: int, the batch size.
    num_steps: int, the number of unrolls.
    name: the name of this operation (optional).
  Returns:
    A pair of Tensors, each shaped [batch_size, num_steps]. The second element
    of the tuple is the same data time-shifted to the right by one.
  Raises:
    tf.errors.InvalidArgumentError: if batch_size or num_steps are too high.
  """

  with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)

    data_len = tf.size(raw_data)
    batch_len = data_len // batch_size
    data = tf.reshape(raw_data[0 : batch_size * batch_len], [batch_size, batch_len])

    epoch_size = (batch_len - 1) // num_steps
    assertion = tf.assert_positive(epoch_size, message="epoch_size == 0, decrease batch_size or num_steps")
    with tf.control_dependencies([assertion]):
      epoch_size = tf.identity(epoch_size, name="epoch_size")

    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    x = tf.strided_slice(data, [0, i * num_steps], [batch_size, (i + 1) * num_steps])
    x.set_shape([batch_size, num_steps])
    y = tf.strided_slice(data, [0, i * num_steps + 1],[batch_size, (i + 1) * num_steps + 1])
    y.set_shape([batch_size, num_steps])
    return x, y
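
The slicing in `ptb_producer` is easier to follow on a tiny example. The following sketch mimics the same layout with plain Python lists and no queues; `toy_batches` is a hypothetical helper written for this illustration, not part of the tutorial code.

```python
def toy_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs laid out like ptb_producer's output, using plain lists."""
    batch_len = len(raw_data) // batch_size
    # each row is one contiguous stretch of the corpus
    data = [raw_data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        # inputs: num_steps consecutive words per row
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        # labels: the same window moved one word ahead
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y

batches = list(toy_batches(list(range(20)), batch_size=2, num_steps=3))
for x, y in batches:
    print(x, y)
```

Every label is the word that follows the corresponding input word, so the network is trained to predict the next word at each position.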

It is time to put it all together...

## Vanilla RNN and LSTM ##

Preparing the building blocks for both types of RNN.

In [8]:
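# number of floats that represent one word in the embeddings (the "width" of the embeddings);
# the same value is used as the size of the RNN hidden state
state_size = 200

# weights in RNN_logits and in the RNN cells are initialized uniformly in [-init_scale, init_scale]
init_scale = 0.1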

In [9]:
def RNN_logits(states, output_size):
  """
  Create a final classification layer that is run on top of an RNN.

  :param states:             Output states of an RNN.
  :param output_size:        Number of classes to predict.
  :return:                   Logits.
  """
        
  # RNN parameters
  V = tf.get_variable('V', shape=[state_size, output_size], 
                        initializer=tf.random_uniform_initializer(minval=-init_scale, maxval=init_scale))
  bo = tf.get_variable('bl', shape=[output_size], initializer=tf.constant_initializer(0.))
    
  # calculate logits
  return tf.matmul(states, V) + bo

![RNN cell](images/SimpleRNNcell.png)

In [10]:
# one piece of unrolled vanilla RNN
def RNN_step(previous_hidden_state, input_tensor):
  """
  Unroll an RNN for a single step.
  
  :param previous_hidden_state:         Hidden state of the previous time step of the RNN.
  :param input_tensor:                  Input for the current time step.
  :return                               New hidden state.
  """
    
  # RNN parameters
  W = tf.get_variable("W", shape=[state_size, state_size], 
                        initializer=tf.random_uniform_initializer(minval=-init_scale, maxval=init_scale))
  U = tf.get_variable("U", shape=[state_size, state_size], 
                        initializer=tf.random_uniform_initializer(minval=-init_scale, maxval=init_scale))
  b = tf.get_variable("b", shape=[state_size], initializer=tf.constant_initializer(0.))
    
  # calculate new hidden state
  hidden_state = tf.tanh(tf.matmul(previous_hidden_state, W) + tf.matmul(input_tensor,U) + b)
    
  return hidden_state

![LSTM cell](images/LSTMcell.png)

In [11]:
# one piece of unrolled LSTM
def LSTM_step(previous_hidden_state, input_tensor):
  """
  Unroll an LSTM for a single step.
  
  :param previous_hidden_state:         Stacked [output state, cell state] of the previous time step of the LSTM.
  :param input_tensor:                  Input for the current time step.
  :return                               Stacked [output state, cell state] of the current time step.
  """
    
  # weights for previous hidden state (one matrix per gate)
  W = tf.get_variable('W', shape=[4, state_size, state_size], 
                        initializer=tf.random_uniform_initializer(minval=-init_scale, maxval=init_scale))
  # weights for input (one matrix per gate)
  U = tf.get_variable('U', shape=[4, state_size, state_size], 
                        initializer=tf.random_uniform_initializer(minval=-init_scale, maxval=init_scale))
    
  bi = tf.get_variable("bi", shape=[state_size], initializer=tf.constant_initializer(0.))
  bf = tf.get_variable("bf", shape=[state_size], initializer=tf.constant_initializer(0.))
  bo = tf.get_variable("bo", shape=[state_size], initializer=tf.constant_initializer(0.))
  bc = tf.get_variable("bc", shape=[state_size], initializer=tf.constant_initializer(0.))
    
  # unpack the previous output state and the previous internal cell state
  state, cell = tf.unstack(previous_hidden_state)
    
  # gates
  input_gate = tf.sigmoid(tf.matmul(input_tensor, U[0]) + tf.matmul(state, W[0]) + bi)
  forget_gate = tf.sigmoid(tf.matmul(input_tensor, U[1]) + tf.matmul(state, W[1]) + bf)
  output_gate = tf.sigmoid(tf.matmul(input_tensor, U[2]) + tf.matmul(state, W[2]) + bo)
  gate_weights = tf.tanh(tf.matmul(input_tensor, U[3]) + tf.matmul(state, W[3]) + bc)
    
  # new internal cell state
  cell = cell * forget_gate + gate_weights * input_gate
    
  # output state
  state = tf.tanh(cell) * output_gate
  return tf.stack([state, cell])
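
The computation in `LSTM_step` can be summarised by the equations below. The symbols are chosen here to mirror the code and are not taken from any particular paper: *i*, *f*, *o* are the input, forget and output gates, *g* stands for `gate_weights`, *h* for `state`, *c* for `cell`, and ⊙ denotes element-wise multiplication. Vectors are treated as rows, as in the `tf.matmul` calls.

```latex
\begin{aligned}
i_t &= \sigma(x_t U_0 + h_{t-1} W_0 + b_i) \\
f_t &= \sigma(x_t U_1 + h_{t-1} W_1 + b_f) \\
o_t &= \sigma(x_t U_2 + h_{t-1} W_2 + b_o) \\
g_t &= \tanh(x_t U_3 + h_{t-1} W_3 + b_c) \\
c_t &= c_{t-1} \odot f_t + g_t \odot i_t \\
h_t &= \tanh(c_t) \odot o_t
\end{aligned}
```

The cell state *c* is only scaled by the forget gate and added to, which is what lets information survive over many time steps.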

Because we will dive into the implementation of the vanilla RNN and the LSTM, it will come in handy to understand **tf.transpose** by playing with it for a while. **tf.transpose** permutes the dimensions of a tensor. You need to specify the new order (a permutation) of all dimensions, counted from 0 to N-1.

In [12]:
c = tf.constant([[[ 1,  2,  3],
                  [ 4,  5,  6]],
                 [[ 7,  8,  9],
                  [10, 11, 12]]])

ctr = tf.transpose(c, perm=[1, 0, 2])

with tf.Session() as session:
  res = session.run(ctr)
  print(res)

[[[ 1  2  3]
  [ 7  8  9]]

 [[ 4  5  6]
  [10 11 12]]]


For a better understanding of the **embeddings** and **rnn_inputs** variables, see:

![Embeddings and rnn_internals](images/rnn-internals.jpg)
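
In plain Python, an embedding lookup is nothing more than selecting rows of a table. The sketch below uses a made-up 4-word table with 3 floats per word; the `toy_` names are assumptions for this illustration and do not appear in the tutorial code.

```python
# toy embedding table: 4 words, each represented by 3 floats
toy_embeddings = [
    [0.0, 0.1, 0.2],   # word ID 0
    [1.0, 1.1, 1.2],   # word ID 1
    [2.0, 2.1, 2.2],   # word ID 2
    [3.0, 3.1, 3.2],   # word ID 3
]

# a batch of 2 sequences with 3 word IDs each: shape [batch_size, num_steps]
toy_ids = [[0, 2, 1],
           [3, 3, 0]]

# the lookup replaces every ID with the corresponding row of the table,
# which gives shape [batch_size, num_steps, state_size]
toy_inputs = [[toy_embeddings[word_id] for word_id in sequence]
              for sequence in toy_ids]

print(toy_inputs[0][1])  # the row that belongs to word ID 2
```

In the real model the table is a trainable variable, so the rows move during training and similar words can end up with similar vectors.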

### Vanilla RNN ###

In [16]:
num_classes = vocabulary
num_steps = 20
batch_size = 20
max_gradient_norm = 5
learning_rate = 1.0

# simple RNN cell
rnn_type = "vanilla"
tf.reset_default_graph()

# take a subset of data
input_tensor, labels_tensor = ptb_producer(train_data, batch_size=batch_size, num_steps=num_steps)

# with no initializer specified, tf.get_variable uses tf.glorot_uniform_initializer,
# also called the Xavier uniform initializer
embeddings = tf.get_variable("embeddings", [num_classes, state_size])
# rnn_inputs holds, for every word ID in input_tensor, the matching row of embeddings;
# this turns the integers that represent words into dense vectors
# that a neural network can work with
rnn_inputs = tf.nn.embedding_lookup(embeddings, input_tensor)

# during training we feed a zero-filled tensor as the initial hidden state;
# at inference time we can feed a different initial state
init_hidden_state = tf.placeholder(shape=[batch_size, state_size], dtype=tf.float32)

states = tf.scan(RNN_step, tf.transpose(rnn_inputs, [1,0,2]), initializer=init_hidden_state) 
states = tf.transpose(states, [1,0,2])

# RNN_logits multiplies by the 2D matrix V, so we first flatten
# the 3D states tensor into a 2D matrix
states_reshaped = tf.reshape(states, [-1, state_size])
# process that as matrix
logits = RNN_logits(states_reshaped, num_classes)
# reconstruct tensor
logits = tf.reshape(logits, [batch_size, num_steps, -1])

predictions = tf.nn.softmax(logits)

# cross-entropy between the predicted distribution and the correct labels
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels_tensor)
loss = tf.reduce_mean(losses)

# For the motivation behind gradient clipping see: http://arxiv.org/pdf/1211.5063.pdf
trainable_vars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, trainable_vars), max_gradient_norm)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(grads, trainable_vars), 
                                     global_step=tf.contrib.framework.get_or_create_global_step())
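
The gradient clipping used above rescales all gradients together rather than one by one. The following sketch shows the arithmetic on plain lists, assuming the documented behaviour of `tf.clip_by_global_norm` (every gradient is multiplied by `clip_norm / max(global_norm, clip_norm)`); `toy_clip_by_global_norm` is a hypothetical stand-in, not the TensorFlow implementation.

```python
import math

def toy_clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one factor so that their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

# two "variables" with gradients whose joint norm is sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = toy_clip_by_global_norm(grads, clip_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(norm, clipped_norm)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is limited. Gradients whose joint norm is already below the threshold pass through unchanged.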

### LSTM ###

In [14]:
num_classes = vocabulary
num_steps = 20
batch_size = 20
max_gradient_norm = 5
learning_rate = 1.0

# LSTM cell
rnn_type = "LSTM"
tf.reset_default_graph()

# take a subset of data
input_tensor, labels_tensor = ptb_producer(train_data, batch_size=batch_size, num_steps=num_steps)

# with no initializer specified, tf.get_variable uses tf.glorot_uniform_initializer,
# also called the Xavier uniform initializer
embeddings = tf.get_variable("embeddings", [num_classes, state_size])
# rnn_inputs holds, for every word ID in input_tensor, the matching row of embeddings;
# this turns the integers that represent words into dense vectors
# that a neural network can work with
rnn_inputs = tf.nn.embedding_lookup(embeddings, input_tensor)

# placeholder for the initial [output state, cell state] pair; we feed zeros during training
init_hidden_state = tf.placeholder(shape=[2, batch_size, state_size], dtype=tf.float32, name='initial_state')

states = tf.scan(LSTM_step, tf.transpose(rnn_inputs, [1,0,2]), initializer=init_hidden_state) 
states = tf.transpose(states, [1,2,0,3])[0]

# RNN_logits multiplies by the 2D matrix V, so we first flatten
# the 3D states tensor into a 2D matrix
states_reshaped = tf.reshape(states, [-1, state_size])
# process that as matrix
logits = RNN_logits(states_reshaped, num_classes)
# reconstruct tensor
logits = tf.reshape(logits, [batch_size, num_steps, -1])

predictions = tf.nn.softmax(logits)

# cross-entropy between the predicted distribution and the correct labels
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels_tensor)
loss = tf.reduce_mean(losses)

# For the motivation behind gradient clipping see: http://arxiv.org/pdf/1211.5063.pdf
trainable_vars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, trainable_vars), max_gradient_norm)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(grads, trainable_vars), 
                                     global_step=tf.contrib.framework.get_or_create_global_step())
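
Both graph definitions transpose `rnn_inputs` before calling `tf.scan` and transpose the result back afterwards. The reason is that a scan walks along the first dimension, so time has to come first. Here is a minimal list-based sketch of that pattern; `toy_scan`, `transpose_batch_time` and the running-sum "cell" are illustrative assumptions rather than TensorFlow code.

```python
def toy_scan(step_fn, sequence, initializer):
    """Apply step_fn along the first axis, feeding each result back as the next state."""
    state = initializer
    states = []
    for element in sequence:
        state = step_fn(state, element)
        states.append(state)
    return states

def transpose_batch_time(nested):
    """Swap the first two axes of a nested list, e.g. [batch, time] -> [time, batch]."""
    return [list(step) for step in zip(*nested)]

def running_sum(prev_state, inputs_at_t):
    """Stand-in for an RNN cell: the state is a running sum per sequence."""
    return [s + x for s, x in zip(prev_state, inputs_at_t)]

# 2 sequences of 3 numbers, batch-major like rnn_inputs
batch_major = [[1, 2, 3],
               [10, 20, 30]]

time_major = transpose_batch_time(batch_major)                  # [[1, 10], [2, 20], [3, 30]]
states = toy_scan(running_sum, time_major, initializer=[0, 0])  # one state per time step
states_batch_major = transpose_batch_time(states)

print(states_batch_major)
```

At each step the function receives the inputs of all sequences in the batch for one time step, exactly as `RNN_step` and `LSTM_step` do.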

You can choose to train either the **vanilla** or the **LSTM** version of the Recurrent Neural Network by running one of the graph definitions above. You may notice that the vanilla RNN is much harder to train even in this small-scale experiment. The difference between the vanilla RNN and the LSTM becomes much more pronounced when you experiment with larger networks. Moreover, there are many tasks that are impossible to solve with vanilla RNNs (see [Sepp Hochreiter's and Jürgen Schmidhuber's seminal paper for examples](http://www.bioinf.jku.at/publications/older/2604.pdf)).

In [17]:
num_training_steps = 101

# depending on the RNN type we prepare a zero-filled initial hidden state of the right shape
def create_feed_dict(rnn_type):
  if rnn_type == "vanilla":
    return np.zeros([batch_size, state_size])
  else:
    return np.zeros([2, batch_size, state_size])
    
with tf.Session() as session:
  print("RNN type: ", rnn_type)
  session.run(tf.global_variables_initializer())
    
  # a bit of threading magic: start the queue runners that feed batches from ptb_producer
  input_coord = tf.train.Coordinator() 
  input_threads = tf.train.start_queue_runners(session, coord=input_coord)
    
  for step in range(num_training_steps):
    
    loss_val, _ = session.run([loss, train_op], feed_dict={
      init_hidden_state: create_feed_dict(rnn_type)
    })
    
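    # NOTE: this extra run dequeues one more batch, so the training step above
    # only sees every other batch; the fetched values are not used further
    input_vals, labels_vals = session.run([input_tensor, labels_tensor])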

    if step > 0 and step % 10 == 0:
        print("step:", step)
        print("loss:", loss_val)
        print()
    
  # stop the input threads and wait for them to finish
  input_coord.request_stop()
  input_coord.join(input_threads)  
  

RNN type:  vanilla
step: 10
loss: 9.18488

step: 20
loss: 10.7168

step: 30
loss: 10.3884

step: 40
loss: 12.1762

step: 50
loss: 12.8388

step: 60
loss: 10.3577

step: 70
loss: 9.33503

step: 80
loss: 8.59152

step: 90
loss: 7.86242

step: 100
loss: 7.85907



### Task (bonus points) ###

Visualize how the loss changes during the training of your Recurrent Neural Network using [TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard). Use MI-MVI tutorial 2 as a reference.

## (Optional) Large multi-layer LSTM ##

So far, you have experimented with small models and trained them for tens or hundreds of iterations. However, models that are used in practice usually contain **millions** of weights and are trained for **hundreds of thousands of steps**.

You can see an implementation of such a model below. We have made several changes to increase its performance:

* truncated back-propagation
* multiple layers of LSTM cells stacked on top of each other
* a smart learning rate schedule

These techniques are beyond the scope of this course but Google Scholar is your friend.
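
As a rough illustration of a learning rate schedule, the sketch below keeps the rate constant for the first few epochs and then halves it every epoch. The default values mirror the configuration used in this section (`learning_rate = 1.0`, `learning_rate_decay = 0.5`, `epoch_end_decay = 4`, 13 epochs); the function name `decayed_learning_rate` is an assumption made for this example.

```python
def decayed_learning_rate(epoch, base_rate=1.0, decay=0.5, epoch_end_decay=4):
    """Keep base_rate for the first epoch_end_decay epochs, then multiply by decay every epoch."""
    return base_rate * decay ** max(epoch + 1 - epoch_end_decay, 0)

# epochs are counted from 0 in the code and printed from 1
schedule = [decayed_learning_rate(epoch) for epoch in range(13)]
for epoch, rate in enumerate(schedule, start=1):
    print("epoch {:2d}: learning rate {:.6f}".format(epoch, rate))
```

Large steps at the beginning let the model make fast progress, while the shrinking steps later on help it settle into a good solution.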

In [42]:
# configuration
num_classes = vocabulary
max_gradient_norm = 5
hidden_size = 200
num_steps = 20
batch_size = 20
num_layers = 2

learning_rate = 1.0
learning_rate_decay = 0.5
epoch_end_decay = 4

num_epochs = 13

# LSTM cell
rnn_type = "LSTM"
tf.reset_default_graph()

# take a subset of data
input_tensor, labels_tensor = ptb_producer(train_data, batch_size=batch_size, num_steps=num_steps)

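# TODO: where does embeddings get filled?
# (it is initialized randomly by tf.global_variables_initializer and then trained like the other weights)
embeddings = tf.get_variable("embeddings", [num_classes, state_size])
# TODO: are columns or rows selected from it here according to the IDs in input_tensor?
# (tf.nn.embedding_lookup selects rows: one row per word ID)
rnn_inputs = tf.nn.embedding_lookup(embeddings, input_tensor)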

def build_layer(rnn_inputs, layer_idx):
    
  with tf.variable_scope("layer{}".format(layer_idx)):
    
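    # truncated backprop: the initial state is fed through this placeholder,
    # so gradients are not propagated beyond the current batch of num_steps words
    hidden_state = tf.placeholder(tf.float32, shape=[2, batch_size, state_size])
        
    # TODO: why is it transposed there and back again?
    # (tf.scan iterates over the first dimension, so the time axis has to come first)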
    states = tf.scan(LSTM_step, tf.transpose(rnn_inputs, [1,0,2]), initializer=hidden_state) 
    states = tf.transpose(states, [1,2,0,3])
       
    return states, hidden_state
    
sequence = rnn_inputs

final_states = []
hidden_states = []

for layer_idx in range(num_layers):
  states, hidden_state = build_layer(sequence, layer_idx)
  final_states.append(states[:, :, -1, :])
  hidden_states.append(hidden_state)
    
  sequence = states[0]
    
states_reshaped = tf.reshape(sequence, [-1, state_size])
logits = RNN_logits(states_reshaped, num_classes)
logits = tf.reshape(logits, [batch_size, num_steps, -1])

predictions = tf.nn.softmax(logits)

# cross-entropy between the predicted distribution and the correct labels
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels_tensor)
loss = tf.reduce_sum(losses) / batch_size

learning_rate_tensor = tf.Variable(learning_rate, name="learning_rate")
learning_rate_pl = tf.placeholder(tf.float32, name="learning_rate_pl")
assign_learning_rate = tf.assign(learning_rate_tensor, learning_rate_pl)

trainable_vars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, trainable_vars), max_gradient_norm)
optimizer = tf.train.GradientDescentOptimizer(learning_rate_tensor)
train_op = optimizer.apply_gradients(zip(grads, trainable_vars), 
                                     global_step=tf.contrib.framework.get_or_create_global_step())

In [45]:
epoch_size = ((len(train_data) // batch_size) - 1) // num_steps

with tf.Session() as session:
    
  print("RNN type: ", rnn_type)
  print()
    
  saver = tf.train.Saver()

  session.run(tf.global_variables_initializer())
    
  # start the queue-runner threads that feed batches from ptb_producer
  input_coord = tf.train.Coordinator() 
  input_threads = tf.train.start_queue_runners(session, coord=input_coord)
    
  for epoch in range(num_epochs):
      
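    # keep the configured decay base untouched; overwriting learning_rate_decay here
    # would turn it into 1.0 after the first epoch and disable the decay
    epoch_decay = learning_rate_decay ** max(epoch + 1 - epoch_end_decay, 0.0)
    session.run(assign_learning_rate, feed_dict={
      learning_rate_pl: learning_rate * epoch_decay
    })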
        
    total_loss = 0
    total_time_steps = 0
   
    # every layer starts the epoch with a zero-filled [output state, cell state] pair
    epoch_hidden_states = []
    for state_pl in hidden_states:
      epoch_hidden_states.append(np.zeros((2, batch_size, state_size)))

    for step in range(epoch_size):

      # build feed dict
      feed_dict = {}
        
      for state_pl, state_val in zip(hidden_states, epoch_hidden_states):
        feed_dict[state_pl] = state_val
            
      loss_val, _, epoch_hidden_states = session.run([loss, train_op, final_states], feed_dict=feed_dict)

      total_loss += loss_val
      total_time_steps += num_steps
            
      epoch_perplexity = np.exp(total_loss / total_time_steps)
    
    print("epoch {} - perplexity: {:.3f}".format(epoch + 1, epoch_perplexity))
    
  saver.save(session, "language-rnn", global_step=0)
    
  # stop the input threads and wait for them to finish
  input_coord.request_stop()
  input_coord.join(input_threads)

RNN type:  LSTM

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.CancelledError'>, Enqueue operation was cancelled
	 [[Node: PTBProducer/input_producer/input_producer_EnqueueMany = QueueEnqueueManyV2[Tcomponents=[DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](PTBProducer/input_producer, PTBProducer/input_producer/range)]]


KeyboardInterrupt: 
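
The training loop above reports **perplexity**, which is the exponential of the average cross-entropy per predicted word. The sketch below shows the relation on made-up numbers; the function name and the example losses are assumptions for this illustration.

```python
import math

def perplexity(per_word_losses):
    """Exponential of the mean cross-entropy (natural logarithm) per predicted word."""
    return math.exp(sum(per_word_losses) / len(per_word_losses))

vocabulary_size = 10000

# a model that gives every word the same probability 1/V has a loss of ln(V) per word
uniform_loss = -math.log(1.0 / vocabulary_size)
uniform_perplexity = perplexity([uniform_loss] * 5)

print(uniform_loss)
print(uniform_perplexity)
```

A perplexity equal to the vocabulary size therefore means that the model is no better than guessing uniformly. This also explains why the loss of the small vanilla RNN above starts at around 9.2, which is roughly ln(10000).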

## Language Modelling ##

Finally, we would like to show you what a medium-size neural language model can do. You can load two models, small and big, and let them finish sentences for you. We thank Showmax for letting us train the networks on their hardware.

### Small Model ###

The small model consists of two layers of 200 LSTM cells trained for about 10 minutes on a high-end GPU. 

In [46]:
batch_size = 1

model_type = "small"
num_layers = 2

rnn_type = "LSTM"
tf.reset_default_graph()

words_pl = tf.placeholder(tf.int32, shape=[batch_size, None])
num_steps = tf.shape(words_pl)[1]

embeddings = tf.get_variable("embeddings", [num_classes, state_size])
rnn_inputs = tf.nn.embedding_lookup(embeddings, words_pl)

def build_layer(rnn_inputs, layer_idx):
    
  with tf.variable_scope("layer{}".format(layer_idx)):
    
    hidden_state = tf.placeholder(tf.float32, shape=[2, batch_size, state_size])
        
    states = tf.scan(LSTM_step, tf.transpose(rnn_inputs, [1,0,2]), initializer=hidden_state) 
    states = tf.transpose(states, [1,2,0,3])
       
    return states, hidden_state
    
sequence = rnn_inputs

final_states = []
hidden_states = []

for layer_idx in range(num_layers):
  states, hidden_state = build_layer(sequence, layer_idx)
  final_states.append(states[:, :, -1, :])
  hidden_states.append(hidden_state)
    
  sequence = states[0]
    
states_reshaped = tf.reshape(sequence, [-1, state_size])
logits = RNN_logits(states_reshaped, num_classes)
logits = tf.reshape(logits, [batch_size, num_steps, -1])

predictions = tf.argmax(logits, -1)

### Large model ###

In [None]:
batch_size = 1

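# NOTE: the rest of this cell is identical to the small model; the 650-unit model
# described below would presumably also need state_size = 650 (assumption)
model_type = "large"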
num_layers = 2

rnn_type = "LSTM"
tf.reset_default_graph()

words_pl = tf.placeholder(tf.int32, shape=[batch_size, None])
num_steps = tf.shape(words_pl)[1]

embeddings = tf.get_variable("embeddings", [num_classes, state_size])
rnn_inputs = tf.nn.embedding_lookup(embeddings, words_pl)

def build_layer(rnn_inputs, layer_idx):
    
  with tf.variable_scope("layer{}".format(layer_idx)):
    
    hidden_state = tf.placeholder(tf.float32, shape=[2, batch_size, state_size])
        
    states = tf.scan(LSTM_step, tf.transpose(rnn_inputs, [1,0,2]), initializer=hidden_state) 
    states = tf.transpose(states, [1,2,0,3])
       
    return states, hidden_state
    
sequence = rnn_inputs

final_states = []
hidden_states = []

for layer_idx in range(num_layers):
  states, hidden_state = build_layer(sequence, layer_idx)
  final_states.append(states[:, :, -1, :])
  hidden_states.append(hidden_state)
    
  sequence = states[0]
    
states_reshaped = tf.reshape(sequence, [-1, state_size])
logits = RNN_logits(states_reshaped, num_classes)
logits = tf.reshape(logits, [batch_size, num_steps, -1])

predictions = tf.argmax(logits, -1)

The large model is made of two layers of 650 LSTM cells regularized using dropout. The training time is about 2 hours on a single high-end GPU.

### Inference ###

In [79]:
def word_by_index(index):
  """
  Find a word in vocabulary by its index.
  
  :param index:    Index of the word to find.
  :return:         A string.
  """
    
  for word, idx in vocab.items():
    if idx == index:
      return word
        
def parse_sentence(sentence):
  """
  Transform a sentence into word indexes.
  
  :param sentence:  A sentence.
  :return:          An array of word indexes.
  """
  
  words = sentence.split(" ")
  indexes = []
    
  for word in words:
    
    if word in vocab:
      indexes.append(vocab[word])
    else:
      indexes.append(vocab["<unk>"])

  return indexes
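
`word_by_index` scans the whole vocabulary on every call. If you need to translate many IDs, one option is to invert the dictionary once. The sketch below uses a made-up five-word vocabulary; `toy_vocab`, `id_to_word` and `toy_parse_sentence` are hypothetical names for this illustration.

```python
toy_vocab = {"<unk>": 0, "the": 1, "meaning": 2, "of": 3, "life": 4}

# invert the word -> ID mapping once; IDs are unique, so nothing is lost
id_to_word = {idx: word for word, idx in toy_vocab.items()}

def toy_parse_sentence(sentence):
    """Map words to IDs, falling back to the <unk> ID for unknown words."""
    return [toy_vocab.get(word, toy_vocab["<unk>"]) for word in sentence.split()]

ids = toy_parse_sentence("the meaning of pizza")
decoded = [id_to_word[i] for i in ids]

print(ids)
print(decoded)
```

The round trip is lossy for unknown words: "pizza" is mapped to the `<unk>` ID and comes back as `<unk>`.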

In [89]:
# PUT THE BEGINNING OF THE SENTENCE YOU WANT TO COMPLETE HERE
input_sentence = "the meaning of life is"

words_to_generate = 50
# the network sometimes outputs an "<eos>" token that marks the end of a sentence
# you can choose to ignore it for more interesting outputs
end_early = False             

with tf.Session() as session:
    
  saver = tf.train.Saver()
  saver.restore(session, "models/language-rnn-small")
    
  sentence_hidden_states = []
  for state_pl in hidden_states:
    sentence_hidden_states.append(np.random.uniform(low=-1, high=1, size=(2, batch_size, state_size)))
    
  inputs = [parse_sentence(input_sentence)]
  for i in range(words_to_generate):
        
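    # NOTE: every iteration feeds the whole sentence generated so far, while the hidden
    # state carried over from the previous iteration has already seen all but its last word
    feed_dict = {
      words_pl: inputs
    }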
        
    for state_pl, state_val in zip(hidden_states, sentence_hidden_states):
      feed_dict[state_pl] = state_val
                    
    pred, sentence_hidden_states = \
      session.run([predictions, final_states], feed_dict=feed_dict)
        
    last_word = pred[0][-1]
    
    if end_early and last_word == vocab["<eos>"]:
      break
        
    inputs[0].append(last_word)
        
  sentence = []
  for word_idx in inputs[0]:
    if word_idx not in [vocab["<unk>"], vocab["N"], vocab["<eos>"]]:
      sentence.append(word_by_index(word_idx))
    
  for word in sentence:
    print(word, end=" ")

INFO:tensorflow:Restoring parameters from models/language-rnn-small
the meaning of life is its in report with think of depends two this it this said market current brady to from deficit-reduction this it this said market current brady to central over an show this it this said market ethnic in 

**Further reading**

  * A clear explanation of LSTMs: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  * http://blog.echen.me/2017/05/30/exploring-lstms/