# RNN Tutorial

## General applications

Natural language processing (NLP) is concerned with the interaction between computers and human natural language. Its applications include:
 * Translation
 * Sentiment analysis (e.g. comments on products)
 * Document categorization (e.g. spam filtering, language identification, news categorization)
 * Automatic evaluation (e.g. answers to open questions)
 * Automatic summarization (e.g. exploring large collections of documents)
 * Paraphrase detection (e.g. trending topic analysis on Twitter)
 * Grammar parsing
 
Standard machine learning approaches directly take a set of fixed-length features with numerical or categorical data. However, to deal with sequences such as words or sound, a preprocessing step is usually required (images adapted from [indico's tutorial](https://indico.io/blog/general-sequence-learning-using-recurrent-neural-nets/)).
![How text is dealt](http://i.imgur.com/0aDV3fC.png)
However, structure is important:
![Structure is important](http://i.imgur.com/xaBWxI2.png)
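
To make the point about structure concrete, here is a minimal sketch of a bag-of-words preprocessing step. The helper `bag_of_words` and the two sentences are made up for illustration; they are not part of the tutorial's code.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

# Both sentences collapse to exactly the same counts,
# although they describe opposite events.
print(a == b)  # True
```

Any model that only sees these counts cannot tell the two sentences apart, which is the motivation for models that read the sequence in order.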

## Language Modeling
A language model computes the probability of a sequence of words. The main idea is to learn a model that assigns higher probabilities to more likely sequences of words, e.g.:
$$P(the\ cat\ is\ small) \geq  P(small\ the\ is\ cat)$$
$$P(walking\ home\ after\ work) \geq  P(walking\ house\ after\ work)$$
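
Such probabilities are usually factorized with the chain rule of probability, a standard identity rather than something specific to this tutorial. Writing $w_1, \dots, w_T$ for the words of a sequence (notation assumed here, not used elsewhere in the text):
$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$
A language model therefore only needs to estimate the probability of the next word given the words that precede it.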

To obtain such a model, one can use contextual information about which words occur in similar contexts:
<img src="http://i.imgur.com/j6ryjWW.png" alt="Distributed Space" height="500" width="500">
Above we can see the different contexts in which the word *stars* is likely to occur. For instance, words like *constellations* and *moon* appear in the same contexts as *stars*.

The bag-of-words representation can be extended to deal with sequences by including bi-grams or n-grams and large tables of co-occurrence statistics. However, this approach is limited by the number of combinations we can handle, which grows quickly with the length of the n-grams.
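
As an illustration of the n-gram idea, the following sketch estimates a bigram language model from counts. The three-sentence corpus, the `<s>` start marker and the add-alpha smoothing are assumptions chosen for the example, not something the tutorial prescribes.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat is small",
    "the dog is small",
    "the cat is here",
]

unigrams = Counter()   # how often each word appears as a context
bigrams = Counter()    # how often each (previous, current) pair appears
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

vocab = {w for s in corpus for w in s.split()}

def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

def sentence_prob(sentence):
    """Probability of a sentence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

p_good = sentence_prob("the cat is small")
p_bad = sentence_prob("small the is cat")
print(p_good > p_bad)  # True
```

With a vocabulary of `len(vocab)` words there are `len(vocab) ** 2` possible bigrams and `len(vocab) ** n` possible n-grams, which is why the count tables quickly become unmanageable.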

# Quick review of recurrent neural networks
![RNN approach](http://i.imgur.com/6siwjNl.png)
An RNN models sequences through recurrent connections in the hidden units of the network, as depicted above. This structure can be shown in its unfolded version:
![RNN approach](http://i.imgur.com/3qvKyoP.png)

Recurrent neural networks (RNNs) are a popular option for learning a language model and distributed representations of words, since they are good at modelling sequences. In theory, they can condition the model on all previous words in the corpus. The following is the classical architecture of an RNN: <img src="http://i.imgur.com/uGNd1LZ.png">

RNNs are also attractive because they can handle inputs of arbitrary length. Concretely, they have three layers: an input, a hidden and an output layer. An RNN combines the input vector with the state vector (the hidden layer) using a learned function to produce a new state vector, from which the output vector is computed. In essence, RNNs receive a sequence of vectors as input and produce a sequence of vectors as output. **An output produced by an RNN is influenced not only by its current input, but by all the past inputs the RNN has been fed.**
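
The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions: the weight names `W_xh`, `W_hh`, `b_h`, the `tanh` nonlinearity and the random weights are choices made for the example, and the output layer is omitted.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    pre = [a + b + c
           for a, b, c in zip(matvec(W_xh, x), matvec(W_hh, h_prev), b_h)]
    return [math.tanh(p) for p in pre]

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Feed a sequence of any length and collect the state after each step."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

rng = random.Random(0)
input_dim, hidden_dim = 3, 4
W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)]
        for _ in range(hidden_dim)]
W_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        for _ in range(hidden_dim)]
b_h = [0.0] * hidden_dim

a, b, c = [1, 0, 0], [0, 1, 0], [0, 0, 1]   # one-hot inputs
states_ab_c = run_rnn([a, b, c], W_xh, W_hh, b_h)
states_ba_c = run_rnn([b, a, c], W_xh, W_hh, b_h)

# Both sequences end with the same input `c`, but their histories differ,
# so the final states are compared to see the effect of past inputs.
print(states_ab_c[-1] != states_ba_c[-1])
```

The same `run_rnn` function accepts sequences of any length, because the weights are shared across time steps.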

### A more expressive recurrent unit: the Gated Recurrent Unit
To overcome issues related to the training process (vanishing and exploding gradients), hidden units with gates have been proposed. In particular, we will use the Gated Recurrent Unit (GRU) ([Cho et al.](http://arxiv.org/abs/1409.1259)):

<img src="gru.png"/>

In summary, $r$ and $z$ work as gates that control the information flow. $r$ (the reset gate) determines how much of the previous state ($h_{t-1}$) is used to compute the current state ($h_t$). $z$ (the update gate) controls whether $h_t$ is updated.
<img src="http://img.blog.csdn.net/20150830152611813"/>
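
The role of the two gates can be checked numerically with a scalar GRU. This is a sketch under assumptions: the parameter names in `params` are hypothetical, the state is a single number instead of a vector, and the update rule follows the convention in which $z$ close to 1 keeps the previous state (some implementations swap the roles of $z$ and $1 - z$).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU; `p` is a dict of hypothetical weights."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])   # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales how much of h_prev is visible.
    h_candidate = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # Convention assumed here: z close to 1 keeps the previous state.
    return z * h_prev + (1.0 - z) * h_candidate

params = dict(w_z=0.5, u_z=0.5, b_z=0.0,
              w_r=0.5, u_r=0.5, b_r=0.0,
              w_h=1.0, u_h=1.0, b_h=0.0)

h_prev, x = 0.8, -1.0
h_mixed = gru_step(x, h_prev, params)

# Update gate pushed towards 1: the state is carried over almost unchanged.
h_kept = gru_step(x, h_prev, dict(params, b_z=50.0))

# Update and reset gates pushed towards 0: the new state is computed
# from the input alone and the previous state is ignored.
h_reset = gru_step(x, h_prev, dict(params, b_z=-50.0, b_r=-50.0))

print(h_mixed, h_kept, h_reset)
```

Being able to copy the state forward unchanged is what helps gradients survive over long sequences.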

# Our practical exercise

One can use the learned model to predict the next token iteratively, so that, by sampling from its predictions, our neural network generates sequences. We follow the architecture tested on [Andrej Karpathy's blog](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), which can be trained on a variety of text documents.
<img src="http://karpathy.github.io/assets/rnn/charseq.jpeg" height="500" width="500">

This section helps to build an RNN model using the Theano and Blocks frameworks. It requires two Python files that contain utility functions, such as plotting and monitoring.

## Architecture
This is the neural network architecture for our character-based neural language model (NLM):
<img src="https://raw.githubusercontent.com/fagonzalezo/nn_nlp_tutorial/gh-pages/rnn_architecture.jpg" width="400">

## Preparing the environment
First, we import the required libraries

In [1]:
import numpy
import codecs
from fuel.datasets import IndexableDataset
from fuel.streams import DataStream
from fuel.transformers import Mapping
from fuel.schemes import ShuffledScheme
from collections import OrderedDict

Using gpu device 0: Graphics Device (CNMeM is disabled)


Then, set the seed to get reproducible results

In [2]:
numpy.random.seed(0)

and define parameters of the model

In [3]:
seq_length = 50 # number of chars in the sequence
embedding_size = 128 # number of hidden units per layer
learning_rate = 0.002
nepochs = 10 # number of full passes through the training data
batch_size = 50 # number of samples taken per each update
decay_rate = 0.95 # decay rate for rmsprop
step_clipping = 0.5 # rescale update steps whose L2 norm exceeds this value

model_name = 'shakespeare'
url_bokeh = 'http://localhost:5006/' # url to online plot training progress
text_file = 'input.txt' # input file
train_size = 0.95 # fraction of data that goes into train set
save_path = 'best_model.pkl' # name to export model file

## Building the dataset
Load text file

In [4]:
with codecs.open(text_file, 'r', 'utf-8') as f:
    data = f.read()
print(data[1000:1200])

Second Citizen:
Would you proceed especially against Caius Marcius?

All:
Against him first: he's a very dog to the commonalty.

Second Citizen:
Consider you what services he has done for his country?


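Now we create the vocabulary from all the distinct characters in the text file and compute the number of training examples. The text is first trimmed to a multiple of `seq_length` plus one extra character, so that every sequence has a target for its last position.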

In [5]:
if len(data) % seq_length > 0:
    data = data[:len(data) - len(data) % seq_length + 1]
else:
    data = data[:len(data) - seq_length + 1]

nsamples = len(data) // seq_length
chars = list(set(data))
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

We go over the dataset in chunks of `seq_length` characters and transform each chunk into a sequence of integers using the dictionary above. `targets` are the same sequences shifted by one character: the target at each position is the next character in the text.

In [6]:
features = numpy.empty((nsamples, seq_length), dtype='uint8')
targets = numpy.zeros_like(features)
for i, p in enumerate(range(0, len(data) - 1, seq_length)):
    features[i] = numpy.array([char_to_ix[ch] for ch in data[p:p + seq_length]])
    targets[i] = numpy.array([char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]])
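
To see what the trimming and the one-character shift do, here is the same logic applied to a tiny string. The string and `seq_length = 4` are toy values chosen for the illustration, and plain strings are used instead of integer arrays to keep the result readable.

```python
seq_length = 4            # toy value; the tutorial uses 50
data = "hello world!"     # hypothetical stand-in for the corpus

# Same trimming rule as above: keep k * seq_length + 1 characters, so that
# the last chunk still has a "next character" to use as its final target.
if len(data) % seq_length > 0:
    data = data[:len(data) - len(data) % seq_length + 1]
else:
    data = data[:len(data) - seq_length + 1]

nsamples = len(data) // seq_length
feature_chunks = [data[p:p + seq_length]
                  for p in range(0, len(data) - 1, seq_length)]
target_chunks = [data[p + 1:p + seq_length + 1]
                 for p in range(0, len(data) - 1, seq_length)]

for f, t in zip(feature_chunks, target_chunks):
    print("%r -> %r" % (f, t))
# 'hell' -> 'ello'
# 'o wo' -> ' wor'
```

Each target chunk is its feature chunk moved one position ahead in the text, so the network is asked to predict the next character at every position.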

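Now split the samples into training and validation sets. The split itself keeps the original order; shuffling is handled later by the `ShuffledScheme` of each stream.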

In [7]:
# Build dataset objects
nsamples_train = int(nsamples * train_size)
train_dataset = IndexableDataset(indexables=OrderedDict(
    [('features', features[:nsamples_train]), ('targets', targets[:nsamples_train])]))
dev_dataset = IndexableDataset(indexables=OrderedDict(
    [('features', features[nsamples_train:]), ('targets', targets[nsamples_train:])]))

Finally, we wrap the datasets in Fuel streams. To learn more about Fuel, check [the docs](http://fuel.readthedocs.org/en/latest/)

In [8]:
def transpose_stream(data):
    return (data[0].T, data[1].T)

# Define the way samples are going to be retrieved
train_stream = DataStream(dataset=train_dataset, iteration_scheme=ShuffledScheme(
    examples=train_dataset.num_examples, batch_size=batch_size))
dev_stream = DataStream(dataset=dev_dataset, iteration_scheme=ShuffledScheme(
    examples=dev_dataset.num_examples, batch_size=batch_size))

# Required because Recurrent bricks receive as input [sequence, batch, features]
train_stream = Mapping(train_stream, transpose_stream)
dev_stream = Mapping(dev_stream, transpose_stream)

In [9]:
x_vals, y_vals = next(dev_stream.get_epoch_iterator())
x_vals[:,0], y_vals[:,0]

(array([43,  2, 56, 39, 43, 53, 42, 56, 56,  7,  2, 56, 52, 53,  2, 29, 42,
        59, 57, 58, 40, 47, 46, 52,  7,  0, 20,  2, 59, 47, 46, 53, 48,  2,
        59, 47, 52, 58,  2, 47, 39, 56, 59,  2, 59, 47, 42,  2, 61, 42], dtype=uint8),
 array([ 2, 56, 39, 43, 53, 42, 56, 56,  7,  2, 56, 52, 53,  2, 29, 42, 59,
        57, 58, 40, 47, 46, 52,  7,  0, 20,  2, 59, 47, 46, 53, 48,  2, 59,
        47, 52, 58,  2, 47, 39, 56, 59,  2, 59, 47, 42,  2, 61, 42, 57], dtype=uint8))

## Build the model
The Blocks framework helps us build and train neural networks easily. Again, we import the required classes:

In [10]:
import theano
import numpy
import sys
from theano import tensor
from blocks import initialization
from blocks import roles
from blocks.model import Model
from blocks.bricks import Linear, NDimensionalSoftmax
from blocks.graph import ComputationGraph
from blocks.algorithms import StepClipping, GradientDescent, CompositeRule, RMSProp
from blocks.extensions import FinishAfter, Timing, Printing, saveload, predicates
from blocks.extensions.monitoring import DataStreamMonitoring, TrainingDataMonitoring
from blocks.extensions.training import TrackTheBest
from blocks.extras.extensions.plot import Plot
from blocks.bricks.parallel import Fork
from blocks.bricks.recurrent import GatedRecurrent
from blocks.bricks.lookup import LookupTable
from blocks.filter import VariableFilter
from blocks.main_loop import MainLoop


Now we build the RNN architecture using Blocks. First, we use a lookup table to map vocabulary indices to real-valued vectors of dimension `embedding_size`.

In [11]:
# MODEL
x = tensor.imatrix('features')
y = tensor.imatrix('targets')

lookup = LookupTable(length=vocab_size, dim=embedding_size)
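
Conceptually, a lookup table is a matrix with one row per vocabulary index, and applying it replaces each index by its row. The sketch below shows this in plain Python with toy sizes and random values; it does not reproduce how Blocks implements `LookupTable`.

```python
import random

rng = random.Random(0)
vocab_size, embedding_size = 5, 3   # toy sizes; the tutorial uses 128 dimensions

# One learnable row per vocabulary index (random here, trained in practice).
table = [[rng.gauss(0.0, 0.1) for _ in range(embedding_size)]
         for _ in range(vocab_size)]

def lookup(indices):
    """Replace each index by its row of the table."""
    return [table[i] for i in indices]

sequence = [3, 0, 3]
vectors = lookup(sequence)

# A sequence of 3 indices becomes 3 vectors of size `embedding_size`;
# repeated indices map to the same vector.
print(len(vectors), len(vectors[0]), vectors[0] == vectors[2])
```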

Then, we are adding two RNN layers. In particular we are going to use Gated Recurrent Units ([Cho et al.](http://arxiv.org/abs/1409.1259))

In [12]:
# Layer 1
fork1 = Fork(output_names=['linear1', 'gates1'], name='fork1',
             input_dim=embedding_size, output_dims=[embedding_size, embedding_size * 2])
grnn1 = GatedRecurrent(dim=embedding_size, name='gru1')

# Layer 2
fork2 = Fork(output_names=['linear2', 'gates2'], name='fork2',
             input_dim=embedding_size, output_dims=[embedding_size, embedding_size * 2])
grnn2 = GatedRecurrent(dim=embedding_size, name='gru2')

On top of our model we set a Softmax classifier for each predicted character

In [13]:
# Softmax layer
hidden_to_output = Linear(name='hidden_to_output', input_dim=embedding_size,
                          output_dim=vocab_size)
softmax = NDimensionalSoftmax()

With the defined objects, now we are able to build the whole network, performing the forward propagation starting from `x` until `y_hat` prediction

In [14]:
# Propagate x until top brick to get y_hat predictions
embedding = lookup.apply(x)
linear1, gates1 = fork1.apply(embedding)
h1 = grnn1.apply(linear1, gates1)
linear2, gates2 = fork2.apply(h1)
h2 = grnn2.apply(linear2, gates2)
linear_output = hidden_to_output.apply(h2)
linear_output.name = 'linear_output'
y_hat = softmax.apply(linear_output, extra_ndim=1)
y_hat.name = 'y_hat'

Finally we define our cost function as the cross entropy between predictions (`y_hat`) and original targets (`y`)

In [15]:
# COST
cost = softmax.categorical_cross_entropy(y, linear_output, extra_ndim=1).mean()
cost.name = 'cost'
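
For intuition about the values this cost takes, here is a small worked example of softmax followed by cross entropy in plain Python. The helper functions and the vocabulary size `V = 65` are assumptions for the illustration; the source does not print the actual `vocab_size`.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])

# A confident, correct prediction has a low cost ...
low = cross_entropy([5.0, 0.0, 0.0], target=0)
# ... while a confident, wrong one has a high cost.
high = cross_entropy([5.0, 0.0, 0.0], target=1)

# A model that spreads probability uniformly over V classes costs log(V).
V = 65  # hypothetical vocabulary size, not taken from the source
uniform = cross_entropy([0.0] * V, target=7)

print(low, high, uniform)
```

This gives a useful sanity check for the training log further down: the `dev_cost` of about 4.17 reported before the first epoch is what a nearly uniform prediction over roughly 65 characters would produce, since `exp(4.17185)` is approximately 64.8.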

## Define learning algorithm
Now we need to define an initialization strategy for every learnable brick. This step allocates the variables in GPU memory, initializes the weight matrices with random orthogonal values and sets the bias vectors to zero

In [16]:
# Set initialization strategies
to_init = [lookup, grnn1, fork1, grnn2, fork2, hidden_to_output]
for brick in to_init:
    brick.weights_init = initialization.Orthogonal()
    brick.biases_init = initialization.Constant(0)
    brick.initialize()

Now we define our algorithm based on the `cost` and the parameters previously defined:

In [17]:
# Learning algorithm
cg = ComputationGraph(cost)
step_rules = [RMSProp(learning_rate=learning_rate, decay_rate=decay_rate),
              StepClipping(step_clipping)]
algorithm = GradientDescent(cost=cost,
                            parameters=cg.parameters,
                            step_rule=CompositeRule(step_rules))
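
The two step rules can be pictured with the simplified sketch below: an RMSProp-style scaling followed by rescaling of the whole step when its L2 norm exceeds a threshold. The functions `rmsprop_step` and `clip_step` and the constant `eps` are hypothetical; this is not Blocks' exact implementation, which has additional details that are left out here.

```python
import math

def rmsprop_step(grads, mean_square, learning_rate=0.002, decay_rate=0.95,
                 eps=1e-8):
    """Scale each gradient by a running estimate of its magnitude."""
    new_ms = [decay_rate * m + (1.0 - decay_rate) * g * g
              for m, g in zip(mean_square, grads)]
    steps = [learning_rate * g / (math.sqrt(m) + eps)
             for g, m in zip(grads, new_ms)]
    return steps, new_ms

def clip_step(steps, threshold):
    """Rescale the whole step if its L2 norm exceeds `threshold`."""
    norm = math.sqrt(sum(s * s for s in steps))
    if norm > threshold:
        steps = [s * threshold / norm for s in steps]
    return steps

# Gradients of very different magnitude end up with steps of similar size.
steps, mean_square = rmsprop_step([0.1, -2.0], [0.0, 0.0])

# A step that is too long is shortened but keeps its direction.
clipped = clip_step([3.0, 4.0], threshold=0.5)

print(steps, clipped)
```

Applying the clipping after the RMSProp scaling, as the order in `step_rules` suggests, bounds the size of the actual parameter update rather than the raw gradient.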

This is the last step. We include some extensions to monitor the training process:

In [18]:
# Extensions
def track_best(channel, save_path):
    tracker = TrackTheBest(channel, choose_best=min)
    checkpoint = saveload.Checkpoint(
        save_path, after_training=False, use_cpickle=True)
    checkpoint.add_condition(["after_epoch"],
                             predicate=predicates.OnLogRecord('{0}_best_so_far'.format(channel)))
    return [tracker, checkpoint]

dev_monitor = DataStreamMonitoring(variables=[cost],
                                   before_first_epoch=True, after_epoch=True,
                                   data_stream=dev_stream, prefix="dev")
train_monitor = TrainingDataMonitoring(variables=[cost],
                                       before_first_epoch=True,
                                       after_batch=True, prefix='tra')

extensions = [train_monitor, dev_monitor,
    Timing(),
    Printing(after_epoch=True),
    FinishAfter(after_n_epochs=nepochs),
]

extensions.extend(track_best('dev_cost', save_path))
extensions.append(Plot('jearevaloo_gru_nlm', server_url=url_bokeh,
                       channels=[['tra_cost', 'dev_cost']], before_first_epoch=True,
                       after_batch=False, after_n_batches=200))

Using saved session configuration for http://localhost:5006/
To override, pass 'load_from_config=False' to Session


## Train the model
Finally build the main loop and train the model

In [19]:
main_loop = MainLoop(data_stream=train_stream, algorithm=algorithm,
                     model=Model(cost), extensions=extensions)
main_loop.run()


-------------------------------------------------------------------------------
BEFORE FIRST EPOCH
-------------------------------------------------------------------------------
Training status:
	 batch_interrupt_received: False
	 epoch_interrupt_received: False
	 epoch_started: True
	 epochs_done: 0
	 iterations_done: 0
	 received_first_batch: False
	 resumed_from: None
	 training_started: True
Log records from the iteration 0:
	 dev_cost: 4.17185163498
	 time_initialization: 15.4278581142
	 tra_cost: nan


-------------------------------------------------------------------------------
AFTER ANOTHER EPOCH
-------------------------------------------------------------------------------
Training status:
	 batch_interrupt_received: False
	 epoch_interrupt_received: False
	 epoch_started: False
	 epochs_done: 1
	 iterations_done: 424
	 received_first_batch: True
	 resumed_from: None
	 training_started: True
Log records from the iteration 424:
	 dev_cost: 1.97388708591
	 time_read_data_th

In [20]:
main_loop.profile.report()

Section                                  Time     % of total
------------------------------------------------------------
Before training                          0.00          0.00%
  TrainingDataMonitoring                 0.00          0.00%
  DataStreamMonitoring                   0.00          0.00%
  Timing                                 0.00          0.00%
  Printing                               0.00          0.00%
  FinishAfter                            0.00          0.00%
  TrackTheBest                           0.00          0.00%
  Checkpoint                             0.00          0.00%
  Plot                                   0.00          0.00%
  Other                                  0.00          0.00%
Initialization                          15.43          5.90%
Training                               245.98         94.09%
  Before epoch                           0.26          0.10%
    TrainingDataMonitoring               0.00          0.00%
    DataStreamMonitoring

# Generating text
Hopefully, our model is now good at predicting the next character given a sequence. Thus, we can use it to generate text by iteratively feeding the model with its own output. We first define a Theano function that propagates the input and returns the hidden activations as well as the probability distribution of the next element in the sequence:

In [21]:
# Hidden states of both GRU layers
activations = VariableFilter(theano_name_regex='gru._apply_states')(main_loop.model.variables)
# Keep only the activations of the last element of the sequence
activations = [act[-1].flatten() for act in activations]
initial_states = VariableFilter(roles=[roles.INITIAL_STATE])(main_loop.model.parameters)[::-1]
# Symbolic inputs that replace the initial states, so the state can be carried between calls
states_as_params = [tensor.vector(dtype=initial.dtype) for initial in initial_states]
# Get the probability distribution of the last element in the last sequence of the batch
fprop = theano.function([x] + states_as_params, activations + [y_hat[-1, -1, :]],
                        givens=list(zip(initial_states, states_as_params)))

In [22]:
def sample(x_curr, states_values, fprop, temperature=1.0):
    '''
    Propagate the x_curr sequence and pick the next element.
    With probability `temperature` the element is drawn from the predicted
    distribution; otherwise the most likely element (argmax) is taken.
    Return: the chosen element and a list of the hidden activations produced by fprop.
    '''
    outvars = fprop(x_curr, *states_values)
    activations = outvars[:-1]
    probs = outvars[-1].astype('float64')

    if numpy.random.binomial(1, temperature) == 1:
        # Renormalize in float64 so that multinomial accepts the probabilities
        probs = probs / probs.sum()
        idx = numpy.random.multinomial(1, probs).nonzero()[0][0]
    else:
        idx = probs.argmax()

    return idx, activations
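
The selection rule inside `sample` can be isolated and tried on a fixed distribution. The sketch below re-implements it with the standard `random` module; the function name `sample_index` and the example probabilities are made up. Note that `temperature` acts here as a mixing probability between sampling and greedy decoding, which differs from the more common practice of dividing the logits by a temperature before the softmax.

```python
import random

def sample_index(probs, temperature, rng=random):
    """With probability `temperature` draw from `probs`, else take the argmax."""
    if rng.random() < temperature:
        threshold = rng.random() * sum(probs)
        cumulative = 0.0
        for i, p in enumerate(probs):
            cumulative += p
            if threshold < cumulative:
                return i
        return len(probs) - 1   # guard against rounding at the upper edge
    return max(range(len(probs)), key=lambda i: probs[i])

rng = random.Random(0)
probs = [0.1, 0.6, 0.3]

# temperature = 0: always the most likely index (deterministic output).
greedy = [sample_index(probs, 0.0, rng) for _ in range(1000)]
# temperature = 1: always a random draw, so less likely indices appear too.
mixed = [sample_index(probs, 1.0, rng) for _ in range(1000)]

print(sorted(set(greedy)), sorted(set(mixed)))
```

Values in between trade off the repetitiveness of greedy decoding against the noisier output of pure sampling.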

Set the initial characters or pick the first one at random. Finally we can sample:

In [None]:
primetext = ix_to_char[numpy.random.randint(vocab_size)]
#primetext = 'VICEN'
primetext = ''.join([ch for ch in primetext if ch in char_to_ix.keys()])
    
x_curr = numpy.expand_dims(
    numpy.array([char_to_ix[ch] for ch in primetext], dtype='uint8'), axis=1)
length = 5000
temperature = 0.4
states_values = [initial.get_value() for initial in initial_states]
sys.stdout.write('Starting sampling\n' + primetext)
for _ in range(length):
    idx, states_values = sample(x_curr, states_values, fprop, temperature)
    sys.stdout.write(ix_to_char[idx])
    x_curr = [[idx]]

sys.stdout.write('\n')

Starting sampling
cry, and the maid
Than the since commit the see the stot against tune our state of the still man
To a subject the maid together with a grow unto the father
And the sun the state of the sweatent unto the princess.

LUCIO:
Go towardness so longing liberwer be so news,
Do not to the world be which the sun and the state of the Earl of Marcius.
The state his parting to this life of this
Maric of some stay the king of death,
And not the state content the state of the son of Boling beain:
And tell the father will door command of igers and my sons and my speech
To speak to the sense two shall play the king of the son,
For was to the common news.

DUKE VINCENTIO:
The walk of the sweet some since me to''d the res
Shobe to bed the sun the soldiers of the swift the king.

QUEEN MARGARET:
And the suns of what thy brook of the air to be so long
Lord Marcius Romeo wilt thou we shall be so long of wakes be stones
The seen the oll gain.

LARTwen'd the shame; but the sons and my soul l

### And that concludes the tutorial...
<blockquote class="twitter-video" lang="en"><p lang="en" dir="ltr">… and that concludes Machine Learning 101. Now, go forth and apply what you&#39;ve learned to real data! <a href="http://t.co/D6wSKgdjeM">pic.twitter.com/D6wSKgdjeM</a></p>&mdash; ML Hipster (@ML_Hipster) <a href="https://twitter.com/ML_Hipster/status/633954383542128640">August 19, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<img src="http://i.imgur.com/ZfkhOt4.png" style="max-width:100%; width: 60%; max-width: none; float:left;"/><img src="https://pbs.twimg.com/media/CPhVYYbUkAA-m6D.jpg:small"/>
