# Recurrent Neural Networks

## Seoul AI Meetup, August 5

Martin Kersner, <m.kersner@gmail.com>

## References

### Books
* Hands-On Machine Learning with Scikit-Learn and TensorFlow (Chapter 14. Recurrent Neural Networks)
    * https://www.safaribooksonline.com
    * https://github.com/ageron/handson-ml
* Deep Learning Book (Chapter 10: Sequence Modeling: Recurrent and Recursive Nets)
    * http://www.deeplearningbook.org/
    * https://github.com/HFTrader/DeepLearningBook
    
### Videos
TODO

In [1]:
import numpy as np
import tensorflow as tf

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

In [2]:
# source https://github.com/ageron/handson-ml
from IPython.display import clear_output, Image, display, HTML

def strip_consts(graph_def, max_const_size=32):
    """Strip large constant values from graph_def."""
    strip_def = tf.GraphDef()
    for n0 in graph_def.node:
        n = strip_def.node.add() 
        n.MergeFrom(n0)
        if n.op == 'Const':
            tensor = n.attr['value'].tensor
            size = len(tensor.tensor_content)
            if size > max_const_size:
                # tensor_content is a bytes field, so the placeholder must be a bytes literal
                tensor.tensor_content = b"<stripped %d bytes>" % size
    return strip_def

def show_graph(graph_def, max_const_size=32):
    """Visualize TensorFlow graph."""
    if hasattr(graph_def, 'as_graph_def'):
        graph_def = graph_def.as_graph_def()
    strip_def = strip_consts(graph_def, max_const_size=max_const_size)
    code = """
        <script>
          function load() {{
            document.getElementById("{id}").pbtxt = {data};
          }}
        </script>
        <link rel="import" href="https://tensorboard.appspot.com/tf-graph-basic.build.html" onload=load()>
        <div style="height:600px">
          <tf-graph-basic id="{id}"></tf-graph-basic>
        </div>
    """.format(data=repr(str(strip_def)), id='graph'+str(np.random.rand()))

    iframe = """
        <iframe seamless style="width:1200px;height:620px;border:0" srcdoc="{}"></iframe>
    """.format(code.replace('"', '&quot;'))
    display(HTML(iframe))

## Feed Forward Neural Networks

Feedforward neural networks have the following limitations:

* Inputs and outputs of **fixed size**.
* Assume **independence** between input data.

## Basic RNN architectures

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg" />

### Recurrent Neurons

Weights and biases are shared over time.

<img src="https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_1401.png" style="height: 70%; width: 70%" />

*Left*: Recurrent network with one neuron in cell.

*Right*: **Unfolded (= unrolled)** recurrent network.

### Implementation of single RNN cell
```python
# x represents input data             [batch_size, n_input_features]
# h represents hidden state           [batch_size, n_neurons]
# W_xh weights applied to input data  [n_input_features, n_neurons]
# W_hh weights of hidden state        [n_neurons, n_neurons]
# W_hy weights for output             [n_neurons, n_outputs] 
# activation function tanh -> [-1, 1]

h = np.tanh(np.dot(x, W_xh) + np.dot(h, W_hh))
#h = np.tanh(np.dot(np.hstack((x, h)), np.vstack((W_xh, W_hh)))) # same as expression above

# prediction at current time
y = np.dot(h, W_hy)
```
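
A self-contained sketch of the step above. The sizes, the random weights, and the helper name `rnn_step` are illustrative assumptions, not part of the original notebook. It also checks that the concatenated form from the commented-out line gives the same hidden state.

```python
import numpy as np

rng = np.random.RandomState(42)
batch_size, n_input_features, n_neurons, n_outputs = 4, 3, 5, 2

# Randomly initialized weights (shapes follow the comments in the snippet above)
W_xh = rng.randn(n_input_features, n_neurons)
W_hh = rng.randn(n_neurons, n_neurons)
W_hy = rng.randn(n_neurons, n_outputs)

def rnn_step(x, h):
    """One time step: returns the new hidden state and the prediction."""
    h_new = np.tanh(np.dot(x, W_xh) + np.dot(h, W_hh))
    y = np.dot(h_new, W_hy)
    return h_new, y

x = rng.randn(batch_size, n_input_features)
h_prev = np.tanh(rng.randn(batch_size, n_neurons))  # stand-in for the previous hidden state

h_new, y = rnn_step(x, h_prev)

# Same computation with inputs and weights concatenated into single matrices
h_concat = np.tanh(np.dot(np.hstack((x, h_prev)), np.vstack((W_xh, W_hh))))

print(np.allclose(h_new, h_concat))
```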

## Layer of Recurrent Neurons 

Connections between

* input and hidden layer
* hidden layer at time $t_{i}$ and hidden layer at time $t_{i+1}$
* hidden layer and output layer

are **fully connected**.

<img src="https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_1402.png" style="width: 60%; height: 60%" />

*Left*: Recurrent network with cell containing 5 neurons

*Right*: **Unfolded (= unrolled)** recurrent network.

### Memory Cells
TODO

### Input and Output Sequences
TODO

### Basic RNN

In [3]:
n_features = 3
n_neurons  = 5
n_steps    = 2

In [4]:
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]) # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]) # t = 1

In [5]:
# source https://github.com/ageron/handson-ml
reset_graph()

X0 = tf.placeholder(tf.float32, [None, n_features])
X1 = tf.placeholder(tf.float32, [None, n_features])

# weight for original connection
Wx = tf.Variable(tf.random_normal(shape=[n_features, n_neurons], dtype=tf.float32))

# weight for recurrent connection
Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))
b  = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

# Y0 = tf.matmul(X0, Wx)
# [None, n_features] * [n_features, n_neurons] = [None, n_neurons]
Y0 = tf.tanh(tf.matmul(X0, Wx) + b)

# tf.matmul(Y0, Wy) : [None,   n_neurons] * [n_neurons, n_neurons] = [None, n_neurons]
# tf.matmul(X1, Wx) : [None,  n_features] * [n_features, n_neurons] = [None, n_neurons]
# b :[1, n_neurons]
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

In [6]:
show_graph(tf.get_default_graph())

In [7]:
def train_rnn(X0_batch, X1_batch):    
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        init.run()
        Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})
    
    print("Y0\n",   Y0_val)
    print("\nY1\n", Y1_val)

In [8]:
train_rnn(X0_batch, X1_batch)

Y0
 [[-0.0664006   0.96257669  0.68105787  0.70918542 -0.89821595]
 [ 0.9977755  -0.71978885 -0.99657625  0.9673925  -0.99989718]
 [ 0.99999774 -0.99898815 -0.99999893  0.99677622 -0.99999988]
 [ 1.         -1.         -1.         -0.99818915  0.99950868]]

Y1
 [[ 1.         -1.         -1.          0.40200216 -1.        ]
 [-0.12210433  0.62805319  0.96718419 -0.99371207 -0.25839335]
 [ 0.99999827 -0.9999994  -0.9999975  -0.85943311 -0.9999879 ]
 [ 0.99928284 -0.99999815 -0.99990582  0.98579615 -0.92205751]]


## `static_rnn()`

* [tf.contrib.rnn.BasicRNNCell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicRNNCell)
* [tf.nn.static_rnn](https://www.tensorflow.org/api_docs/python/tf/nn/static_rnn) creates one cell per time step.
* Each input placeholder (`X0`, `X1`) has to be defined manually.

In [9]:
# source https://github.com/ageron/handson-ml
reset_graph()

X0 = tf.placeholder(tf.float32, [None, n_features])
X1 = tf.placeholder(tf.float32, [None, n_features])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.nn.static_rnn(basic_cell, [X0, X1],
                                                dtype=tf.float32)
Y0, Y1 = output_seqs

In [10]:
show_graph(tf.get_default_graph())

## `static_rnn()` output

In [11]:
train_rnn(X0_batch, X1_batch)

Y0
 [[ 0.30741334 -0.32884315 -0.65428472 -0.93850589  0.52089024]
 [ 0.99122757 -0.95425421 -0.75180793 -0.99952078  0.98202348]
 [ 0.99992681 -0.99783254 -0.82473528 -0.9999963   0.99947774]
 [ 0.99677098 -0.68750614  0.84199691  0.93039107  0.8120684 ]]

Y1
 [[ 0.99998885 -0.99976051 -0.06679298 -0.99998039  0.99982214]
 [-0.65249437 -0.51520866 -0.37968954 -0.59225935 -0.08968385]
 [ 0.99862403 -0.99715197 -0.03308626 -0.99915648  0.99329019]
 [ 0.99681675 -0.95981938  0.39660636 -0.83076048  0.79671967]]


### `static_rnn()` with single input placeholder

In [12]:
# source https://github.com/ageron/handson-ml
reset_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_features])
X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.nn.static_rnn(basic_cell, X_seqs,
                                                dtype=tf.float32)
outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])

In [13]:
def train_rnn2(X0_batch, X1_batch):
    X0_batch_tmp = X0_batch[:,:,np.newaxis]
    X1_batch_tmp = X1_batch[:,:,np.newaxis]

    X_batch = np.concatenate((X0_batch_tmp, X1_batch_tmp), axis=2)
    X_batch = np.transpose(X_batch, (0, 2, 1))    

    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        init.run()
        outputs_val = outputs.eval(feed_dict={X: X_batch})

    print("Y0\n", np.transpose(outputs_val, axes=[1, 0, 2])[0])
    print("\nY1\n", np.transpose(outputs_val, axes=[1, 0, 2])[1])
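
The batch layout built in `train_rnn2` can be checked in isolation. This is a NumPy-only sketch; the `np.stack` shortcut is an alternative added here for comparison and is not used in the notebook.

```python
import numpy as np

X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # t = 1

# Route taken in train_rnn2: add a trailing axis, concatenate, swap the last two axes
X_batch = np.concatenate((X0_batch[:, :, np.newaxis],
                          X1_batch[:, :, np.newaxis]), axis=2)
X_batch = np.transpose(X_batch, (0, 2, 1))   # [batch, n_steps, n_features]

# Equivalent shortcut: stack the per-step batches along a new time axis
X_batch_stacked = np.stack((X0_batch, X1_batch), axis=1)

print(X_batch.shape)  # (4, 2, 3)
```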

In [14]:
# Y0 output at t = 0
# Y1 output at t = 1
train_rnn2(X0_batch, X1_batch)

Y0
 [[-0.45652324 -0.68064123  0.40938237  0.63104504 -0.45732826]
 [-0.80015349 -0.99218267  0.78177971  0.9971031  -0.99646091]
 [-0.93605185 -0.99983788  0.93088669  0.99998152 -0.99998295]
 [ 0.99273688 -0.99819332 -0.55543643  0.9989031  -0.9953323 ]]

Y1
 [[-0.94288003 -0.99988687  0.94055814  0.99999851 -0.9999997 ]
 [-0.63711601  0.11300932  0.5798437   0.43105593 -0.63716984]
 [-0.9165386  -0.99456042  0.89605415  0.99987197 -0.99997509]
 [-0.02746334 -0.73191994  0.7827872   0.95256817 -0.97817713]]


In [15]:
show_graph(tf.get_default_graph())

## `dynamic_rnn()` 

* [tf.nn.dynamic_rnn](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn)
* No need to unstack, stack and transpose.
* Input `[None, n_steps, n_features]`.
* Output `[None, n_steps, n_neurons]`

In [18]:
# source https://github.com/ageron/handson-ml
reset_graph()

X = tf.placeholder(tf.float32, [None, n_steps, n_features])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

train_rnn2(X0_batch, X1_batch)

Y0
 [[ 0.80872238 -0.52312446 -0.6716494  -0.69762248 -0.54384488]
 [ 0.99547106 -0.02155113 -0.99482894  0.17964774 -0.83173698]
 [ 0.99990267  0.49111056 -0.9999314   0.8413834  -0.9444679 ]
 [-0.80632919  0.93928123 -0.97309881  0.99996096  0.97433066]]

Y1
 [[ 0.9995454   0.99339807 -0.99998379  0.99919224 -0.98379493]
 [-0.06013332  0.4030143   0.02884481 -0.29437575 -0.85681593]
 [ 0.99406189  0.95815992 -0.99768937  0.98646194 -0.91752487]
 [ 0.95047355 -0.51205158 -0.27763969  0.83108062  0.81631833]]


In [19]:
show_graph(tf.get_default_graph())

### Variable Length Input Sequences
The `sequence_length` parameter of `dynamic_rnn` specifies the length of each input sequence in the batch.
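
A NumPy sketch that imitates the documented behaviour of `dynamic_rnn` with `sequence_length`: outputs past the end of a sequence are zero, and the returned state is the state at the last valid step. The helper `rnn_with_lengths`, the weights, and the batch are hypothetical and only illustrate the idea; this is not TensorFlow's implementation.

```python
import numpy as np

def rnn_with_lengths(X, seq_length, Wx, Wh, b):
    """tanh RNN over X [batch, n_steps, n_features] honouring per-example lengths."""
    batch_size, n_steps, _ = X.shape
    n_neurons = Wh.shape[0]
    h = np.zeros((batch_size, n_neurons))
    outputs = np.zeros((batch_size, n_steps, n_neurons))
    for t in range(n_steps):
        h_new = np.tanh(X[:, t] @ Wx + h @ Wh + b)
        active = (t < seq_length)[:, np.newaxis]       # True while a sequence is still running
        h = np.where(active, h_new, h)                 # finished sequences keep their last state
        outputs[:, t] = np.where(active, h_new, 0.0)   # and emit zeros
    return outputs, h

rng = np.random.RandomState(0)
n_steps, n_features, n_neurons = 2, 3, 5
X = rng.randn(4, n_steps, n_features)
seq_length = np.array([2, 1, 2, 2])   # the second sequence has only one valid step
Wx = rng.randn(n_features, n_neurons)
Wh = rng.randn(n_neurons, n_neurons)
b = np.zeros((1, n_neurons))

outputs, states = rnn_with_lengths(X, seq_length, Wx, Wh, b)
```

For the shorter sequence, `outputs[1, 1]` is a zero vector and `states[1]` equals `outputs[1, 0]`.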

### Variable Length Output Sequences
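Variable-length output sequences can be handled by emitting a special end-of-sequence (EOS) token; any output past the EOS token is ignored.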

## Training RNN

* Backpropagation Through Time (BPTT)
    * Forward pass
    * Compute cost function $C(Y_0, Y_1, ..., Y_{n-1}, Y_n)$.
    * Propagate gradient of cost function through the unrolled network.
    * Update model parameters using the gradients computed during BPTT.
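
The four steps above can be sketched in NumPy for a single-layer tanh RNN. The squared-error cost over all time steps, the helper names (`forward`, `loss_fn`, `bptt`), and the random data are assumptions chosen to keep the example short; the MNIST example in this notebook leaves the gradient computation to TensorFlow.

```python
import numpy as np

def forward(xs, Wx, Wh, b):
    """Forward pass over xs [n_steps, batch, n_features]; returns all hidden states."""
    h = np.zeros((xs.shape[1], Wh.shape[0]))
    hs = [h]                       # hs[0] is the initial state
    for x in xs:
        h = np.tanh(x @ Wx + h @ Wh + b)
        hs.append(h)               # hs[t + 1] is the state after step t
    return hs

def loss_fn(hs, targets):
    """Cost C(Y_0, ..., Y_n): squared error summed over all time steps."""
    return 0.5 * sum(np.sum((h - y) ** 2) for h, y in zip(hs[1:], targets))

def bptt(xs, targets, Wx, Wh, b):
    """Propagate the gradient of the cost backwards through the unrolled network."""
    hs = forward(xs, Wx, Wh, b)
    dWx, dWh, db = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)
    dh_next = np.zeros_like(hs[0])
    for t in reversed(range(len(xs))):
        dh = (hs[t + 1] - targets[t]) + dh_next   # cost at step t plus gradient from the future
        dz = dh * (1.0 - hs[t + 1] ** 2)          # derivative of tanh
        dWx += xs[t].T @ dz                       # shared weights: gradients add up over time
        dWh += hs[t].T @ dz
        db += dz.sum(axis=0, keepdims=True)
        dh_next = dz @ Wh.T                       # gradient flowing to the previous step
    return loss_fn(hs, targets), (dWx, dWh, db)

rng = np.random.RandomState(0)
n_steps, batch_size, n_features, n_neurons = 3, 4, 2, 5
xs = rng.randn(n_steps, batch_size, n_features)
targets = 0.5 * rng.randn(n_steps, batch_size, n_neurons)
Wx = 0.5 * rng.randn(n_features, n_neurons)
Wh = 0.5 * rng.randn(n_neurons, n_neurons)
b = np.zeros((1, n_neurons))

loss, (dWx, dWh, db) = bptt(xs, targets, Wx, Wh, b)

# Plain gradient-descent update using the gradients computed during BPTT
learning_rate = 0.01
Wx_new, Wh_new, b_new = (Wx - learning_rate * dWx,
                         Wh - learning_rate * dWh,
                         b - learning_rate * db)
```

Because the weights are shared over time, each parameter's gradient is the sum of its per-step gradients.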

### MNIST

In [20]:
# source https://github.com/ageron/handson-ml
reset_graph()

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

In [21]:
show_graph(tf.get_default_graph())

In [22]:
# source https://github.com/ageron/handson-ml
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [23]:
# source https://github.com/ageron/handson-ml
n_epochs = 100
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

0 Train accuracy: 0.906667 Test accuracy: 0.9061
1 Train accuracy: 0.953333 Test accuracy: 0.9426
2 Train accuracy: 0.946667 Test accuracy: 0.9536
3 Train accuracy: 0.966667 Test accuracy: 0.9626
4 Train accuracy: 0.953333 Test accuracy: 0.9689
5 Train accuracy: 0.953333 Test accuracy: 0.9619
6 Train accuracy: 0.993333 Test accuracy: 0.9734
7 Train accuracy: 0.973333 Test accuracy: 0.9682
8 Train accuracy: 0.98 Test accuracy: 0.9681
9 Train accuracy: 0.986667 Test accuracy: 0.9715
10 Train accuracy: 0.986667 Test accuracy: 0.9754
11 Train accuracy: 0.973333 Test accuracy: 0.9749
12 Train accuracy: 0.986667 Test accuracy: 0.9748
13 Train accuracy: 0.98 Test accuracy: 0.9723
14 Train accuracy: 0.98 Test accuracy: 0.9691
15 Train accuracy: 1.0 Test accuracy: 0.9734
16 Train accuracy: 0.98 Test accuracy: 0.9771
17 Train accuracy: 0.98 Test accuracy: 0.9718
18 Train accuracy: 0.993333 Test accuracy: 0.9714
19 Train accuracy: 0.986667 Test accuracy: 0.979
20 Train accuracy: 0.966667 Test acc

## [tf.contrib.rnn.OutputProjectionWrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/OutputProjectionWrapper)

Reduces the number of outputs of a `BasicRNNCell` by adding a fully connected layer on top of the output at each time step. Its weights and biases are shared across time.


## Only one fully connected layer instead of one per time step

Faster than using `OutputProjectionWrapper`.

```python
from tensorflow.contrib.layers import fully_connected

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = fully_connected(stacked_rnn_outputs, n_outputs,
                                  activation_fn=None)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
```
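
Why the reshape trick is valid can be checked with plain NumPy. The random array below is only a stand-in for the output of `dynamic_rnn`, and `W`, `b` are hypothetical weights of the shared fully connected layer.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, n_steps, n_neurons, n_outputs = 4, 20, 100, 1

rnn_outputs = rng.randn(batch_size, n_steps, n_neurons)  # stand-in for dynamic_rnn outputs
W = rng.randn(n_neurons, n_outputs)                      # shared dense-layer weights
b = rng.randn(n_outputs)

# One matrix product over all (example, time step) pairs at once
stacked_rnn_outputs = rnn_outputs.reshape(-1, n_neurons)    # [batch * n_steps, n_neurons]
stacked_outputs = stacked_rnn_outputs @ W + b
outputs = stacked_outputs.reshape(-1, n_steps, n_outputs)   # [batch, n_steps, n_outputs]

# Reference: apply the same layer separately at every time step
per_step = np.stack([rnn_outputs[:, t] @ W + b for t in range(n_steps)], axis=1)

print(np.allclose(outputs, per_step))
```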

## Generating sequence

## Deep RNN

* Stack of multiple layers of cells.
* [tf.contrib.rnn.MultiRNNCell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/MultiRNNCell)

```python
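# create a separate cell object per layer; [basic_cell] * n_layers would reuse one cell for all layers
layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
          for _ in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)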
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
```
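
What stacking means can be sketched in NumPy: every layer has its own weights and state, and the output of layer *i* at a given time step is the input of layer *i + 1* at the same step. The helpers `make_layer` and `deep_rnn` are hypothetical names, and the sketch is not how `MultiRNNCell` is implemented internally.

```python
import numpy as np

def make_layer(n_in, n_neurons, rng):
    """Create independent weights for one recurrent layer."""
    return {"Wx": 0.1 * rng.randn(n_in, n_neurons),
            "Wh": 0.1 * rng.randn(n_neurons, n_neurons),
            "b": np.zeros((1, n_neurons))}

def deep_rnn(X, layers):
    """Run stacked tanh RNN layers over X [batch, n_steps, n_features]."""
    batch_size, n_steps, _ = X.shape
    states = [np.zeros((batch_size, layer["Wh"].shape[0])) for layer in layers]
    outputs = []
    for t in range(n_steps):
        inp = X[:, t]
        for i, layer in enumerate(layers):
            states[i] = np.tanh(inp @ layer["Wx"] + states[i] @ layer["Wh"] + layer["b"])
            inp = states[i]          # output of layer i feeds layer i + 1
        outputs.append(inp)          # the top layer's output is the network output
    return np.stack(outputs, axis=1), states

rng = np.random.RandomState(0)
n_features, n_neurons, n_layers = 3, 5, 3
sizes = [n_features] + [n_neurons] * n_layers
layers = [make_layer(sizes[i], sizes[i + 1], rng) for i in range(n_layers)]

X = rng.randn(4, 2, n_features)
outputs, states = deep_rnn(X, layers)   # outputs: [4, 2, 5], states: one array per layer
```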

### Dropout

[tf.contrib.rnn.DropoutWrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/DropoutWrapper) applies dropout during both the training and the testing phase!

**Solution**

* Create your own wrapper.
* Create two graphs; one for training, one for testing.

#### Example of two graphs (training/testing)

```python
keep_prob = 0.5
cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
if is_training:
    cell = tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)

...

with tf.Session() as sess:
    if is_training:
        init.run()
        for iteration in range(n_iterations):
            pass  # train the model
        save_path = saver.save(sess, "model.ckpt")
    else:
        saver.restore(sess, "model.ckpt")
        # use the model
```

### Bidirectional Neural Networks
TODO

## RNN problems

With long input sequences, RNNs suffer from several problems.

* Vanishing/exploding gradients
* Non-convergence
* Training takes a long time.
* Memory of the first inputs fades away.

**Solutions**

* Good parameter initialization (weights initialized as identity matrix)
* Nonsaturating activation functions (e.g., ReLU)
* Batch Normalization
* Gradient Clipping
* Faster optimizers

But training is **slow**.

Unrolling over fewer time steps (Truncated Backpropagation Through Time) speeds up training, but the model can then no longer learn long-term dependencies.
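
Gradient clipping from the list of solutions can be sketched as follows. Clipping by the global norm of all gradients is one common variant and is an assumption here; the notebook does not say which variant is meant. `clip_by_global_norm` is a hypothetical NumPy helper, not the TensorFlow function.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so that their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))   # 1.0 leaves small gradients untouched
    return [g * scale for g in grads], global_norm

# Two made-up gradient arrays with a joint norm of sqrt(84), roughly 9.17
grads = [np.full((2, 2), 3.0), np.full((3,), 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Rescaling all gradients by the same factor keeps the update direction and only limits the step size.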

## LSTM  Cell ([Long Short-Term Memory, 1997](http://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735#.WIxuWvErJnw))

* Same inputs and outputs as basic RNN cell, but state is split.
* Converges faster.
* Detects long-term dependencies in data.
* 4 different fully connected layers
* 3 gates (learn what to store in the long-term state, what to throw away, and what to read from it)
    * Input
    * Forget
    * Output
* 2 states
    * short-term
    * long-term


* TensorFlow [tf.contrib.rnn.BasicLSTMCell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell)
* Keras [keras.layers.recurrent.LSTM](https://keras.io/layers/recurrent/#lstm)
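
A NumPy sketch of one step of a standard LSTM cell (without peephole connections), showing the four fully connected layers, the three gates, and the two states from the list above. The helper names, the packing of all four layers into one weight matrix `W`, and the gate order inside it are assumptions made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step.

    x: [batch, n_features]; h (short-term state) and c (long-term state): [batch, n_neurons]
    W: [n_features + n_neurons, 4 * n_neurons]; b: [4 * n_neurons]
    """
    n = h.shape[1]
    z = np.hstack((x, h)) @ W + b     # the 4 fully connected layers in one product
    i = sigmoid(z[:, 0 * n:1 * n])    # input gate: what to store in the long-term state
    f = sigmoid(z[:, 1 * n:2 * n])    # forget gate: what to throw away
    o = sigmoid(z[:, 2 * n:3 * n])    # output gate: what to read from the long-term state
    g = np.tanh(z[:, 3 * n:4 * n])    # candidate values computed from x and h
    c_new = f * c + i * g             # new long-term state
    h_new = o * np.tanh(c_new)        # new short-term state, also the cell output
    return h_new, c_new

rng = np.random.RandomState(0)
batch_size, n_features, n_neurons = 4, 3, 5
W = 0.1 * rng.randn(n_features + n_neurons, 4 * n_neurons)
b = np.zeros(4 * n_neurons)

x = rng.randn(batch_size, n_features)
h = np.zeros((batch_size, n_neurons))
c = np.zeros((batch_size, n_neurons))
h, c = lstm_step(x, h, c, W, b)
```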

### Visualization of LSTM cell

<img src="https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_1413.png" style="height: 30%; width: 30%;" />

### Peephole Connections

* "[Recurrent Nets that Time and Count](ftp://ftp.idsia.ch/pub/juergen/TimeCount-IJCNN2000.pdf)", F. Gers and J. Schmidhuber (2000)
* In a standard LSTM, the gate controllers use only the previous state and the current input.
* Peephole connections allow them to use ("peep" at) the long-term state as well.

<img src="https://raw.githubusercontent.com/martinkersner/rnn-meetup/master/images/peephole_connections.png" style="height: 30%; width: 30%;" />

## GRU Cell

* Simplified version of LSTM cell

## Examples

### Character-Level Text Generation

* http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    * https://gist.github.com/karpathy/587454dc0146a6ae21fc
    * https://github.com/jcjohnson/torch-rnn

### QA bAbI tasks

* https://research.fb.com/downloads/babi/
* Synthetic dataset of 20 different tasks for testing text understanding and reasoning.

Example of task with two supporting facts (QA2):

```
1 Mary got the milk there.                                                      
2 John moved to the bedroom.                                                    
3 Sandra went back to the kitchen.                                              
4 Mary travelled to the hallway.                                                
5 Where is the milk?  hallway 1 4 
```

### Question Answering

http://smerity.com/articles/2015/keras_qa.html

The following information relates to the **Two Supporting Facts (QA2)** task, which can be found in *tasks_1-20_v1-2/en/qa2_two-supporting-facts_[train|test].txt*.

* The QA2 subdataset contains 1,000 training and 1,000 testing samples.
* The lengths of stories and questions **differ**.
* Test accuracy: 31 % (the baseline reported in [Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks](https://arxiv.org/abs/1502.05698) is 20 %)

#### Data Preprocessing
```python
(# story
 ['Mary', 'got', 'the', 'milk', 'there', '.',
  'John', 'moved', 'to', 'the', 'bedroom', '.',
  'Sandra', 'went', 'back', 'to', 'the', 'kitchen', '.',
  'Mary', 'travelled', 'to', 'the', 'hallway', '.'],
 # question
 ['Where', 'is', 'the', 'milk', '?'],
 # answer
  'hallway')
```

#### Word vocabulary

Only 35 words (36 entries including index 0, which is reserved for padding)!

```python
['.', '?', 'Daniel', 'John', 'Mary', 'Sandra', 'Where', 'apple', 'back', 'bathroom', 'bedroom', 'discarded', 'down', 'dropped', 'football', 'garden', 'got', 'grabbed', 'hallway', 'is', 'journeyed', 'kitchen', 'left', 'milk', 'moved', 'office', 'picked', 'put', 'the', 'there', 'to', 'took', 'travelled', 'up', 'went']
```

#### Converting stories to vectors

```python
[0 ... 5 17 29 24 30  1  4 25 31 29 11  1  6 35  9 31 29 22  1  5 33 31 29 19  1] # pre-padded with zeros
```
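
A minimal sketch of this conversion. The helper `vectorize` and the maximum length of 30 are illustrative assumptions; word indices start at 1 because 0 is used for padding. The sketch assumes the story is not longer than `maxlen` and does no truncation.

```python
vocab = ['.', '?', 'Daniel', 'John', 'Mary', 'Sandra', 'Where', 'apple', 'back',
         'bathroom', 'bedroom', 'discarded', 'down', 'dropped', 'football', 'garden',
         'got', 'grabbed', 'hallway', 'is', 'journeyed', 'kitchen', 'left', 'milk',
         'moved', 'office', 'picked', 'put', 'the', 'there', 'to', 'took',
         'travelled', 'up', 'went']

# Index 0 is reserved for padding, so the first word gets index 1
word_idx = {word: i + 1 for i, word in enumerate(vocab)}

def vectorize(tokens, maxlen):
    """Map tokens to vocabulary indices and pre-pad with zeros up to maxlen."""
    ids = [word_idx[token] for token in tokens]
    return [0] * (maxlen - len(ids)) + ids

story = ['Mary', 'got', 'the', 'milk', 'there', '.',
         'John', 'moved', 'to', 'the', 'bedroom', '.',
         'Sandra', 'went', 'back', 'to', 'the', 'kitchen', '.',
         'Mary', 'travelled', 'to', 'the', 'hallway', '.']

story_vector = vectorize(story, maxlen=30)   # 5 padding zeros followed by 25 word indices
print(story_vector)
```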

## RNN models

The following models can be applied to all bAbI tasks, but have to be trained separately for each task.

#### Model #1 (August 5, 2015)

```python
sentrnn = Sequential()
sentrnn.add(Embedding(vocab_size, EMBED_HIDDEN_SIZE, mask_zero=True))
sentrnn.add(RNN(EMBED_HIDDEN_SIZE, SENT_HIDDEN_SIZE, return_sequences=False))

qrnn = Sequential()
qrnn.add(Embedding(vocab_size, EMBED_HIDDEN_SIZE))
qrnn.add(RNN(EMBED_HIDDEN_SIZE, QUERY_HIDDEN_SIZE, return_sequences=False))

model = Sequential()
model.add(Merge([sentrnn, qrnn], mode='concat'))
model.add(Dense(SENT_HIDDEN_SIZE + QUERY_HIDDEN_SIZE, vocab_size, activation='softmax'))
```

<img src="http://smerity.com/media/images/articles/2015/keras_qa_model.svg" style="height: 50%; width: 50%" />

#### Model #2

```python
sentence = layers.Input(shape=(story_maxlen,), dtype='int32')                   
encoded_sentence = layers.Embedding(vocab_size, EMBED_HIDDEN_SIZE)(sentence)       
encoded_sentence = layers.Dropout(0.3)(encoded_sentence)                        
                                                                                
question = layers.Input(shape=(query_maxlen,), dtype='int32')                   
encoded_question = layers.Embedding(vocab_size, EMBED_HIDDEN_SIZE)(question)       
encoded_question = layers.Dropout(0.3)(encoded_question)                        
encoded_question = RNN(EMBED_HIDDEN_SIZE)(encoded_question)
encoded_question = layers.RepeatVector(story_maxlen)(encoded_question)          
                                                                                
merged = layers.add([encoded_sentence, encoded_question])                       
merged = RNN(EMBED_HIDDEN_SIZE)(merged)                                         
merged = layers.Dropout(0.3)(merged)                                            
preds = layers.Dense(vocab_size, activation='softmax')(merged) 
```

## Generating Sequences With Recurrent Neural Networks, A. Graves, 2013

* [Paper](https://arxiv.org/abs/1308.0850)
* [Source code](https://github.com/szcom/rnnlib)
* [Online demo](https://www.cs.toronto.edu/~graves/handwriting.cgi)

<img src="https://raw.githubusercontent.com/martinkersner/rnn-meetup/master/images/saim1.jpg" />
<img src="https://raw.githubusercontent.com/martinkersner/rnn-meetup/master/images/saim2.jpg" />
<img src="https://raw.githubusercontent.com/martinkersner/rnn-meetup/master/images/saim3.jpg" />