# MI-MVI tutorial 3 #

Today we will use recurrent neural networks (RNNs) to predict words in English text.

 - Based on the TF RNN tutorial: https://www.tensorflow.org/tutorials/recurrent
 - The images and some of the theory come from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

![RNN unrolled](images/RNN-unrolled.png "Structure of an unrolled RNN.")

In the above diagram, a chunk of neural network, *A*, looks at some input *x_t* and outputs a value *h_t*. A loop allows information to be passed from one step of the network to the next. The boxes labelled *A* are elementary modules whose internals depend on the particular type of RNN. In a standard RNN they look like this:

![RNN cell](images/SimpleRNNcell.png)

In comparison, an LSTM cell looks like this:

![LSTM cell](images/LSTMcell.png)

We will focus on the standard RNN cell. The yellow box with *tanh* is a neural network layer. It combines the input *x_t* with the state passed on from the preceding cell. We are going to build such a cell from scratch, but you could instead implement the **tf.contrib.rnn.RNNCell** abstract class.
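
To make the recurrence concrete, here is a minimal NumPy sketch of a standard RNN cell applied to a short sequence. It is only an illustration under stated assumptions: the names `rnn_step` and `unroll`, the toy sizes and the random weights are all hypothetical and are not part of the notebook.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    """One step of a standard RNN cell: the tanh layer from the diagram."""
    return np.tanh(h_prev @ W + x_t @ U + b)

def unroll(xs, h0, W, U, b):
    """Apply the same cell to every input in turn, passing the state along."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, W, U, b)
        states.append(h)
    return np.stack(states)

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
input_size, state_size, num_steps = 4, 3, 5
W = rng.normal(size=(state_size, state_size))   # weights for the previous state
U = rng.normal(size=(input_size, state_size))   # weights for the input
b = np.zeros(state_size)

xs = rng.normal(size=(num_steps, input_size))   # inputs x_0 ... x_4
hs = unroll(xs, np.zeros(state_size), W, U, b)  # states h_0 ... h_4
print(hs.shape)  # (5, 3)
```

Every step reuses the same `W`, `U` and `b`, which is what "multiple copies of the same network" means in the unrolled picture.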

**Import** all packages that will be used.

In [None]:
import os, sys, tarfile
import collections
from six.moves.urllib.request import urlretrieve
import numpy as np
import tensorflow as tf

Download the [Penn Treebank (PTB)](https://catalog.ldc.upenn.edu/ldc99t42) dataset. We use the same approach as in the previous tutorial.

In [None]:
url = 'http://www.fit.vutbr.cz/~imikolov/rnnlm/'
data_root = 'data/rnn'
last_percent_reported = None

# make sure the dataset directory exists
if not os.path.isdir(data_root):
  os.makedirs(data_root)

def download_progress_hook(count, blockSize, totalSize):
  """A hook to report the progress of a download. This is mostly intended for users with
  slow internet connections. Reports every 5% change in download progress.
  """
  global last_percent_reported
  percent = int(count * blockSize * 100 / totalSize)

  if last_percent_reported != percent:
    if percent % 5 == 0:
      sys.stdout.write("%s%%" % percent)
      sys.stdout.flush()
    else:
      sys.stdout.write(".")
      sys.stdout.flush()
      
    last_percent_reported = percent
    
def maybe_download(filename, expected_bytes, force=False):
  """Download a file if not present, and make sure it's the right size."""
  dest_filename = os.path.join(data_root, filename)
  if force or not os.path.exists(dest_filename):
    print('Attempting to download:', filename) 
    filename, _ = urlretrieve(url + filename, dest_filename, reporthook=download_progress_hook)
    print('\nDownload Complete!')
  statinfo = os.stat(dest_filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', dest_filename)
  else:
    raise Exception(
      'Failed to verify ' + dest_filename + '. Can you get to it with a browser?')
  return dest_filename

train_filename = maybe_download('simple-examples.tgz', 34869662)

We need to unpack the downloaded data. The content we need is in the **data** subdirectory.

In [None]:
def maybe_extract(filename, force=False):
  root = os.path.splitext(os.path.splitext(filename)[0])[0]  # strip .tgz or .tar.gz
  if os.path.isdir(root) and not force:
    # You may override by setting force=True.
    print('%s already present - Skipping extraction of %s.' % (root, filename))
  else:
    print('Extracting data for %s. This may take a while. Please wait.' % root)
    sys.stdout.flush()
    # the context manager closes the archive even if extraction fails
    with tarfile.open(filename) as tar:
      tar.extractall(data_root)
  data_folders = [
    os.path.join(root, d) for d in sorted(os.listdir(root))
    if os.path.isdir(os.path.join(root, d))]
  
  print(data_folders)
  return data_folders
  
train_folders = maybe_extract(train_filename)

Now we define a few helpers to manipulate the data. Because TF needs tensors of numbers, we have to represent words by numbers. For this purpose we build a vocabulary from a given file. The vocabulary contains key-value pairs of the form 'word': ID, where more frequent words get smaller IDs.

The **ptb_raw_data** function loads all the necessary files, builds the vocabulary from the training file and transforms the content of the training, validation and test files into sequences of numbers.
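
The word-to-ID mapping can be tried out on a tiny example first. The sketch below mirrors the logic of the helpers in the next cell in plain Python; the function names `build_vocab` and `words_to_ids` and the toy corpus are hypothetical and serve only as an illustration.

```python
import collections

def build_vocab(words, words_limit=None):
    """Map each word to an ID; more frequent words get smaller IDs."""
    counter = collections.Counter(words)
    # sort by descending count, ties are broken alphabetically
    count_pairs = sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))
    if words_limit is not None:
        count_pairs = count_pairs[:words_limit]
    return {word: idx for idx, (word, _) in enumerate(count_pairs)}

def words_to_ids(words, word_to_id):
    """Replace words by their IDs; words missing from the vocabulary are dropped."""
    return [word_to_id[w] for w in words if w in word_to_id]

# Hypothetical toy corpus; "<eos>" marks the end of a line, as in the PTB helpers.
text = "the cat sat <eos> the dog sat <eos> the end <eos>"
words = text.split()

word_to_id = build_vocab(words)
print(word_to_id)
# {'<eos>': 0, 'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'end': 5}

small_vocab = build_vocab(words, words_limit=3)
print(words_to_ids(words, small_vocab))
# [1, 2, 0, 1, 2, 0, 1, 0]
```

Note that limiting the vocabulary silently removes the rarer words from the data, so the encoded sequence becomes shorter than the original text.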

In [None]:
def _read_words(filename):
  with tf.gfile.GFile(filename, "r") as f:
    return f.read().replace("\n", "<eos>").split()

def _build_vocab(filename, wordsLimit=None):
  data = _read_words(filename)
  counter = collections.Counter(data)
  count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
  if wordsLimit is not None:
    count_pairs = count_pairs[0:wordsLimit]
  words, _ = list(zip(*count_pairs))
  word_to_id = dict(zip(words, range(len(words))))
  return word_to_id

def _file_to_word_ids(filename, word_to_id):
  data = _read_words(filename)
  return [word_to_id[word] for word in data if word in word_to_id]

def ptb_raw_data(data_path=None, wordsLimit=None):
  """Load PTB raw data from data directory "data_path".
  Reads PTB text files and converts strings to integer ids.
  Args:
    data_path: string path to the directory where simple-examples.tgz has been extracted.
    wordsLimit: optional int, keep only this many most frequent words in the vocabulary.
  Returns:
    tuple (train_data, valid_data, test_data, vocabulary)
    where each of the data objects can be passed to ptb_producer
    and vocabulary is the number of words in the vocabulary.
  """

  train_path = os.path.join(data_path, "ptb.train.txt")
  valid_path = os.path.join(data_path, "ptb.valid.txt")
  test_path = os.path.join(data_path, "ptb.test.txt")

  word_to_id = _build_vocab(train_path, wordsLimit)
  train_data = _file_to_word_ids(train_path, word_to_id)
  valid_data = _file_to_word_ids(valid_path, word_to_id)
  test_data = _file_to_word_ids(test_path, word_to_id)
  vocabulary = len(word_to_id)
  return train_data, valid_data, test_data, vocabulary

Let us inspect the data:

In [None]:
# TODO: if 10000 words in the vocabulary turn out to be too many for one-hot encoding, filter the vocabulary.
wordsLimit = 10000

# TODO: take care of the one-hot encoding

train_data, valid_data, test_data, vocabulary = ptb_raw_data(os.path.join(data_root, 'simple-examples/data'), wordsLimit)
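# rebuild the vocabulary from the training file, the same file ptb_raw_data uses, so the printed IDs match the data
vocab = _build_vocab(os.path.join(data_root, 'simple-examples', 'data', 'ptb.train.txt'), wordsLimit)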
firstitems = {k: vocab[k] for k in sorted(vocab.keys())[:30]}

print('train data len:', len(train_data))
print('validation data len:', len(valid_data))
print('test data len:', len(test_data))
print('vocabulary item count:', vocabulary)
print('the first 30 vocabulary items:', firstitems)


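The **ptb_producer** function below splits the raw sequence of word IDs into `batch_size` parallel rows and returns a pair of tensors, the inputs `x` and the targets `y`, each of shape `[batch_size, num_steps]`. The targets are the inputs shifted by one position, so at every position the network learns to predict the next word.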
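
The batching scheme can be tried out without TensorFlow. The following NumPy sketch mirrors the reshaping and slicing done by `ptb_producer` in the next cell; `batch_iterator` is a hypothetical helper written only for this illustration, and the "word IDs" are simply the numbers 0 to 19.

```python
import numpy as np

def batch_iterator(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs shaped [batch_size, num_steps]; y is x shifted by one word."""
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # cut the sequence into batch_size consecutive rows
    data = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    if epoch_size <= 0:
        raise ValueError("epoch_size == 0, decrease batch_size or num_steps")
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

# Hypothetical toy data: the "word IDs" 0..19.
batches = list(batch_iterator(range(20), batch_size=2, num_steps=3))
x0, y0 = batches[0]
print(len(batches))  # 3
print(x0.tolist())   # [[0, 1, 2], [10, 11, 12]]
print(y0.tolist())   # [[1, 2, 3], [11, 12, 13]]
```

Each row of a batch continues where the same row of the previous batch ended, which is what allows the hidden state to be carried from one batch to the next.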

In [None]:
def ptb_producer(raw_data, batch_size, num_steps, name=None):
  """Iterate on the raw PTB data.
  This chunks up raw_data into batches of examples and returns Tensors that are drawn from these batches.
  Args:
    raw_data: one of the raw data outputs from ptb_raw_data.
    batch_size: int, the batch size.
    num_steps: int, the number of unrolls.
    name: the name of this operation (optional).
  Returns:
    A pair of Tensors, each shaped [batch_size, num_steps]. The second element
    of the tuple is the same data time-shifted to the right by one.
  Raises:
    tf.errors.InvalidArgumentError: if batch_size or num_steps are too high.
  """
  with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
    raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)

    data_len = tf.size(raw_data)
    batch_len = data_len // batch_size
    data = tf.reshape(raw_data[0 : batch_size * batch_len], [batch_size, batch_len])

    epoch_size = (batch_len - 1) // num_steps
    assertion = tf.assert_positive(epoch_size, message="epoch_size == 0, decrease batch_size or num_steps")
    with tf.control_dependencies([assertion]):
      epoch_size = tf.identity(epoch_size, name="epoch_size")

    i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
    x = tf.strided_slice(data, [0, i * num_steps], [batch_size, (i + 1) * num_steps])
    x.set_shape([batch_size, num_steps])
    y = tf.strided_slice(data, [0, i * num_steps + 1],[batch_size, (i + 1) * num_steps + 1])
    y.set_shape([batch_size, num_steps])
    return x, y

It is time to put it all together.

In [None]:
# TODO:
# 1) implement a standard RNN cell and...
# 2) only then put everything together (the code below is a draft, for inspiration only)

tf.reset_default_graph()

num_batches = 2
batch_size = 5
lstm_size = 10
num_steps = 3

# input words and the words that follow them, both shaped [batch_size, num_steps]
x, y = ptb_producer(train_data, batch_size=batch_size, num_steps=num_steps)

# represent every word ID by a trainable vector of size lstm_size
embeddings = tf.get_variable("embeddings", [vocabulary, lstm_size])
inputs = tf.nn.embedding_lookup(embeddings, x)

lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size, state_is_tuple=True)
softmax_w = tf.get_variable("softmax_w", [lstm_size, vocabulary])
softmax_b = tf.get_variable("softmax_b", [vocabulary])

# Initial state of the LSTM memory: a (c, h) tuple of zero tensors.
initial_state = state = lstm.zero_state(batch_size, tf.float32)
loss = 0.0

for step in range(num_steps):
  # one step of the unrolled network, all steps share the same weights
  output, state = lstm(inputs[:, step, :], state)
  # The LSTM output can be used to make next word predictions
  logits = tf.matmul(output, softmax_w) + softmax_b
  loss += tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y[:, step]))
loss /= num_steps
final_state = state

with tf.Session() as session:
  session.run(tf.global_variables_initializer())
  tf.train.start_queue_runners(sess=session)

  numpy_state = session.run(initial_state)
  total_loss = 0.0

  for batch in range(num_batches):
    # the final state of one batch becomes the initial state of the next one
    numpy_state, current_loss = session.run([final_state, loss], feed_dict={
        initial_state.c: numpy_state.c,
        initial_state.h: numpy_state.h}
    )
    total_loss += current_loss
    print('batch:', batch, 'loss:', current_loss)

  print('average loss:', total_loss / num_batches)
  # TODO: add a training op and evaluate the model on valid_data

## Vanilla RNN and LSTM ##
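
The step functions defined in the next cell take the previous state and the current input and return the new state. That is the signature expected by `tf.scan`, which is used further below to run a step function over the time dimension (this is why the inputs are transposed so that time comes first). As a rough illustration of the idea, here is a pure-Python sketch; the `scan` helper below is hypothetical and only imitates the behaviour on Python lists, it is not the TensorFlow implementation.

```python
def scan(fn, elems, initializer):
    """Pure-Python sketch of a scan: thread a state through fn and keep every state."""
    state = initializer
    states = []
    for elem in elems:
        state = fn(state, elem)
        states.append(state)
    return states

# Toy step function: the "state" is a running sum of the inputs.
def add_step(previous_state, input_value):
    return previous_state + input_value

print(scan(add_step, [1, 2, 3, 4], initializer=0))  # [1, 3, 6, 10]
```

The result contains one state per input element, which for an RNN means one hidden state per time step.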

In [None]:
state_size = 200
xav_init = tf.contrib.layers.xavier_initializer

def RNN_step(previous_hidden_state, input_tensor):
    
    # RNN parameters
    W = tf.get_variable("W", shape=[state_size, state_size], initializer=xav_init())
    U = tf.get_variable("U", shape=[state_size, state_size], initializer=xav_init())
    b = tf.get_variable("b", shape=[state_size], initializer=tf.constant_initializer(0.))
    
    # calculate new hidden state
    hidden_state = tf.tanh(tf.matmul(previous_hidden_state, W) + tf.matmul(input_tensor,U) + b)
    
    return hidden_state

def LSTM_step(previous_hidden_state, input_tensor):
    
    # weights for previous hidden state
    W = tf.get_variable('W', shape=[4, state_size, state_size], initializer=xav_init())
    # weights for input
    U = tf.get_variable('U', shape=[4, state_size, state_size], initializer=xav_init())
    
    # unpack previous output state and internal cell state
    state, cell = tf.unstack(previous_hidden_state)
    
    # gates
    input_gate = tf.sigmoid(tf.matmul(input_tensor, U[0]) + tf.matmul(state, W[0]))
    forget_gate = tf.sigmoid(tf.matmul(input_tensor, U[1]) + tf.matmul(state, W[1]))
    output_gate = tf.sigmoid(tf.matmul(input_tensor, U[2]) + tf.matmul(state, W[2]))
    # candidate values for the internal cell state
    gate_weights = tf.tanh(tf.matmul(input_tensor, U[3]) + tf.matmul(state, W[3]))
    
    # new internal cell state
    cell = cell * forget_gate + gate_weights * input_gate
    
    # output state
    state = tf.tanh(cell) * output_gate
    return tf.stack([state, cell])

def RNN_logits(states):
        
    # RNN parameters
    V = tf.get_variable('V', shape=[state_size, num_classes], initializer=xav_init())
    bl = tf.get_variable('bl', shape=[num_classes], initializer=tf.constant_initializer(0.))
    
    # calculate logits
    return tf.matmul(states, V) + bl

## Vanilla RNN ##
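
The graph in the next cell ends with a softmax over the vocabulary and a cross-entropy loss against the true next words. The following NumPy sketch shows what this loss computes for integer labels. It is an illustration under stated assumptions, not the TensorFlow implementation: the function name and the toy logits are hypothetical.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Cross-entropy between softmax(logits) and integer class labels, one value per row."""
    # subtract the row maximum for numerical stability; the softmax is unchanged
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability of the correct class in every row
    return -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]

# Hypothetical toy example: 2 positions, vocabulary of 3 words.
logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

losses = sparse_softmax_cross_entropy(logits, labels)
loss = losses.mean()
print(losses, loss)
```

In the second row all logits are equal, so the model is maximally uncertain and the loss equals the natural logarithm of the vocabulary size, here ln(3).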

In [None]:
tf.reset_default_graph()

rnn_type = "vanilla"

num_classes = vocabulary
num_steps = 20
batch_size = 20
num_layers = 2

input_tensor, labels_tensor = ptb_producer(train_data, batch_size=batch_size, num_steps=num_steps)

embeddings = tf.get_variable("embeddings", [num_classes, state_size])
rnn_inputs = tf.nn.embedding_lookup(embeddings, input_tensor)

init_hidden_state = tf.placeholder(shape=[batch_size, state_size], dtype=tf.float32)

states = tf.scan(RNN_step, tf.transpose(rnn_inputs, [1,0,2]), initializer=init_hidden_state) 
states = tf.transpose(states, [1,0,2])

states_reshaped = tf.reshape(states, [-1, state_size])
logits = RNN_logits(states_reshaped)
logits = tf.reshape(logits, [batch_size, num_steps, -1])

predictions = tf.nn.softmax(logits)

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels_tensor)
loss = tf.reduce_mean(losses)
train_op = tf.train.AdamOptimizer(learning_rate=0.1).minimize(loss)

## LSTM ##
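
Before building the LSTM graph, it may help to see the gate arithmetic of `LSTM_step` on plain arrays. The NumPy sketch below follows the same equations (no bias terms, `W` applied to the previous output state and `U` to the input). The names, toy sizes and random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(state, cell, x_t, W, U):
    """One LSTM step; W[k] acts on the previous output state, U[k] on the input."""
    input_gate = sigmoid(x_t @ U[0] + state @ W[0])
    forget_gate = sigmoid(x_t @ U[1] + state @ W[1])
    output_gate = sigmoid(x_t @ U[2] + state @ W[2])
    candidate = np.tanh(x_t @ U[3] + state @ W[3])
    # forget part of the old cell content, then write the gated candidate
    new_cell = cell * forget_gate + candidate * input_gate
    # the output state is a gated view of the squashed cell state
    new_state = np.tanh(new_cell) * output_gate
    return new_state, new_cell

# Hypothetical toy sizes.
rng = np.random.default_rng(0)
batch_size, state_size = 2, 3
W = rng.normal(size=(4, state_size, state_size))
U = rng.normal(size=(4, state_size, state_size))

state = np.zeros((batch_size, state_size))
cell = np.zeros((batch_size, state_size))
for x_t in rng.normal(size=(5, batch_size, state_size)):
    state, cell = lstm_step(state, cell, x_t, W, U)
print(state.shape, cell.shape)  # (2, 3) (2, 3)
```

Unlike the vanilla cell, the LSTM carries two tensors from step to step, which is why the initial state placeholder in the next cell has a leading dimension of 2.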

In [None]:
tf.reset_default_graph()

rnn_type = "LSTM"

num_classes = vocabulary
num_steps = 20
batch_size = 20

input_tensor, labels_tensor = ptb_producer(train_data, batch_size=batch_size, num_steps=num_steps)

embeddings = tf.get_variable("embeddings", [num_classes, state_size])
rnn_inputs = tf.nn.embedding_lookup(embeddings, input_tensor)

init_hidden_state = tf.placeholder(shape=[2, batch_size, state_size], dtype=tf.float32, name='initial_state')

states = tf.scan(LSTM_step, tf.transpose(rnn_inputs, [1,0,2]), initializer=init_hidden_state) 
states = tf.transpose(states, [1,2,0,3])[0]

states_reshaped = tf.reshape(states, [-1, state_size])
logits = RNN_logits(states_reshaped)
logits = tf.reshape(logits, [batch_size, num_steps, -1])

predictions = tf.nn.softmax(logits)

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels_tensor)
loss = tf.reduce_mean(losses)

train_op = tf.train.AdamOptimizer(learning_rate=0.1).minimize(loss)

In [None]:
num_training_steps = 1000

def create_feed_dict(rnn_type):
    if rnn_type == "vanilla":
        return np.zeros([batch_size, state_size])
    else:
        return np.zeros([2, batch_size, state_size])

with tf.Session() as session:
  session.run(tf.global_variables_initializer())
  tf.train.start_queue_runners(sess=session)
    
  for step in range(num_training_steps):
    
    loss_val, _ = session.run([loss, train_op], feed_dict={
        init_hidden_state: create_feed_dict(rnn_type)
    })
    print("step:", step)
    print("loss:", loss_val)
    print()

TODO: 
* calculate perplexity
* show sample outputs
* multi-layer RNN? - I would avoid this if possible
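
As a starting point for the perplexity item, here is a minimal sketch. It assumes that perplexity is computed as the exponential of the average per-word cross-entropy, with the loss measured in natural-log units as in the graphs above; the function name and the toy numbers are hypothetical.

```python
import math

def perplexity(losses):
    """Perplexity = exp of the average per-word cross-entropy (natural log)."""
    return math.exp(sum(losses) / len(losses))

# A model that assigns probability 1/4 to every correct word has
# a loss of ln(4) per word and therefore a perplexity of about 4.
uniform_losses = [math.log(4)] * 10
print(perplexity(uniform_losses))
```

In the training loop one could collect the `loss_val` values over an epoch and pass them to such a function, since each of them is already a mean over the words of a batch.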

**Further reading**

  * Well explained LSTM: https://colah.github.io/posts/2015-08-Understanding-LSTMs/