<a href="https://colab.research.google.com/github/lblogan14/master_tensorflow_keras/blob/master/ch6_rnn_tensorflow_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os

import numpy as np
np.random.seed(123)
print("NumPy:{}".format(np.__version__))

import tensorflow as tf
tf.set_random_seed(123)
print("TensorFlow:{}".format(tf.__version__))

NumPy:1.14.6
TensorFlow:1.12.0-rc2


A **Recurrent Neural Network (RNN)** is a specialized neural network architecture for handling
sequential data.

#Simple Recurrent Neural Network

![](https://github.com/armando-fandango/Mastering-TensorFlow/blob/master/images/ch-06/06_01.png?raw=true)

The neural network $N$ takes input $x_t$ to produce output $y_t$.

At the next time step $t+1$, it takes the previous output $y_t$ along with the input $x_{t+1}$ to produce output $y_{t+1}$. In general,
$$y_t=\phi(w^{(x)}\times x_t+w^{(y)}\times y_{t-1}+b)$$

If we unroll the network at time step 5,
![alt text](https://github.com/armando-fandango/Mastering-TensorFlow/blob/master/images/ch-06/06_03.png?raw=true)
At every time step, the same learning function, $\phi(\cdot)$, and the same parameters, $w$ and $b$, are
used.

We can also add hidden layers in RNN, then unroll it at time step 5 as well,
![alt text](https://github.com/armando-fandango/Mastering-TensorFlow/blob/master/images/ch-06/06_04.png?raw=true)
As you can see, the output $y$ is not always produced at every time step. Instead, an output $h$ is produced at
every time step, and another activation function is applied to this output $h$ to produce the
output $y$,
$$h_t=\phi(w^{(hx)}\times x_t+w^{(hh)}\times h_{t-1}+b^{(h)})$$
$$y_t=\phi(w^{(yh)}\times h_t+b^{(y)})$$
* $w^{(hx)}$ is the weight vector for $x$ inputs that are connected to the hidden layer
* $w^{(hh)}$ is the weight vector for $h$ from the previous time step
* $w^{(yh)}$ is the weight vector for the layer connecting the hidden layer to the output layer
* $\phi$ is usually a nonlinear function, such as $\tanh$ or ReLU.

In an RNN, the same parameters, $(w^{(hx)}, w^{(hh)}, w^{(yh)}, b^{(h)}, b^{(y)})$, are used at every time step.
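
The two equations above can be sketched as a short NumPy loop. This is a minimal illustration, not the TensorFlow or Keras implementation: the function name `rnn_forward`, the layer sizes, the zero initial state, and the choice of $\tanh$ for both activations are assumptions made for the example.

```python
import numpy as np

def rnn_forward(x_seq, w_hx, w_hh, w_yh, b_h, b_y):
    """Run a simple RNN over x_seq, reusing the same parameters at every step."""
    h = np.zeros(w_hh.shape[0])  # assumed initial hidden state h_0 = 0
    hidden_states, outputs = [], []
    for x_t in x_seq:
        # h_t = phi(w_hx x_t + w_hh h_{t-1} + b_h), with phi = tanh (assumption)
        h = np.tanh(w_hx @ x_t + w_hh @ h + b_h)
        # y_t = phi(w_yh h_t + b_y)
        y = np.tanh(w_yh @ h + b_y)
        hidden_states.append(h)
        outputs.append(y)
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.RandomState(123)
n_steps, n_in, n_hidden, n_out = 5, 3, 4, 2  # illustrative sizes
x_seq = rng.randn(n_steps, n_in)
w_hx = rng.randn(n_hidden, n_in) * 0.1
w_hh = rng.randn(n_hidden, n_hidden) * 0.1
w_yh = rng.randn(n_out, n_hidden) * 0.1
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

hs, ys = rnn_forward(x_seq, w_hx, w_hh, w_yh, b_h, b_y)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Note that the loop body never changes between iterations: only `h` carries information from one time step to the next.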

#RNN variants
* **Bidirectional RNN (BRNN)** is used when the output depends on both the
previous and future elements of a sequence. BRNN is implemented by stacking
two RNNs, known as the forward and backward layers, and the output is computed
from the hidden states of both RNNs.
* **Deep Bidirectional RNN (DBRNN)** extends the BRNN further by
adding multiple layers. The BRNN has hidden layers or cells across the time
dimensions. However, by stacking BRNNs, we get a hierarchical representation in
DBRNN.
* **Long Short-Term Memory (LSTM)** network extends the RNN by using an
architecture that involves multiple nonlinear functions instead of one simple
nonlinear function to compute the hidden state. The LSTM is composed of black
boxes called cells that take the three inputs: the working memory, $h_{t-1}$ at time $t-1$, current input, $x_t$, and long-term memory $c_{t-1}$ at time $t-1$, and
produce the two outputs: updated working memory, $h_t$, and long-term memory $c_t$. The cells use the functions known as gates, to make decisions about saving
and erasing the content selectively from the memory.
* **Gated Recurrent Unit (GRU)** network is a simplified variation of
LSTM. It combines the function of the *forget* and the *input* gates in a
simpler *update* gate. It also combines the *hidden* state and *cell* state into one single
state. Hence, GRU is computationally less expensive as compared to LSTM.
* **seq2seq** model combines the encoder-decoder architecture with RNN
architectures. In seq2seq architecture, the model is trained on sequences of data,
such as text data or time series data, and then the model is used to generate the
output sequences. The seq2seq model consists of an encoder and a
decoder model, both of them built with the RNN architecture. The seq2seq
models can be stacked to build hierarchical multi-layer models.
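
As a rough sketch of the bidirectional idea from the list above, the following NumPy code runs one simple RNN forward in time and a second one backward in time, then joins their hidden states. Combining the two by concatenation is one common choice and is an assumption here, as are the helper names `run_rnn` and `bidirectional_rnn` and all sizes.

```python
import numpy as np

def run_rnn(x_seq, w_x, w_h, b):
    """Return the hidden state of a simple tanh RNN at every time step."""
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in x_seq:
        h = np.tanh(w_x @ x_t + w_h @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(x_seq, fwd_params, bwd_params):
    """Concatenate forward-in-time and backward-in-time hidden states."""
    h_fwd = run_rnn(x_seq, *fwd_params)
    # The backward layer reads the sequence reversed; flip its states back
    # so that row t of both arrays refers to the same time step.
    h_bwd = run_rnn(x_seq[::-1], *bwd_params)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.RandomState(123)
n_steps, n_in, n_hidden = 6, 3, 4  # illustrative sizes

def make_params():
    return (rng.randn(n_hidden, n_in) * 0.1,
            rng.randn(n_hidden, n_hidden) * 0.1,
            np.zeros(n_hidden))

x_seq = rng.randn(n_steps, n_in)
fwd_params, bwd_params = make_params(), make_params()
out = bidirectional_rnn(x_seq, fwd_params, bwd_params)
print(out.shape)  # (6, 8)
```

At every time step the first half of the output depends only on past and current inputs, while the second half depends only on current and future inputs.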

##LSTM network
When RNNs are trained over very long sequences of data, the gradients tend to become
either very large (exploding) or so small that they vanish to almost zero. **Long Short-Term Memory
(LSTM)** networks address the vanishing/exploding gradient problem by adding gates for
controlling the access to past information.
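
To get a feel for the scale of the problem, consider a deliberately simplified scalar recurrence in which backpropagating through each time step multiplies the gradient by the same factor $w$ (this ignores the activation derivative, and the function name below is hypothetical). After $n$ steps the gradient is scaled by $w^n$:

```python
def repeated_factor(w, n_steps):
    """Scale applied to a gradient multiplied by the same factor w at each step."""
    return abs(w) ** n_steps

for w in (0.5, 1.0, 1.5):
    # Factors below 1 shrink toward zero; factors above 1 blow up.
    print("w={}: scale after 50 steps = {:.3e}".format(w, repeated_factor(w, 50)))
```

Even a modest deviation from 1 leads to a scale many orders of magnitude away from 1 after 50 steps.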

In LSTM, a repeating module consisting of four main functions is used. This module that builds the LSTM is called the **cell**. The LSTM cell helps train the
model more effectively when long sequences are passed, by selectively learning or erasing
information. The functions composing the cell are also known as **gates** as they act as
gatekeepers for the information that is passed in and out of the cell.

The LSTM model has two kinds of memory:
* working memory, $h$ (hidden state)
* long-term memory, $c$ (cell state)

The cell state or long-term memory flows from cell to cell with only two linear interactions.
The LSTM adds information to the long term memory, or removes information from the
long-term memory, through gates, as shown below:
![alt text](https://github.com/armando-fandango/Mastering-TensorFlow/blob/master/images/ch-06/06_05.png?raw=true)

1. **Forget Gate** $f(\cdot)$ (or remember gate): $h_{t-1}$ and $x_t$ flow as inputs to the $f(\cdot)$ gate: $$f(\cdot)=\sigma(w^{(fx)}\times x_t+w^{(fh)}\times h_{t-1}+b^{(f)})$$
The function of **forget gate** is to decide which information to forget and which
information to remember. The *sigmoid* activation function is used here, so that an
output of 1 represents that the information is carried over to the next step within
the cell, and an output of 0 represents that the information is selectively
discarded.
2. **Input Gate** $i(\cdot)$ (or save gate): $h_{t-1}$ and $x_t$ flow as inputs to the $i(\cdot)$ gate:
$$i(\cdot)=\sigma(w^{(ix)}\times x_t+w^{(ih)}\times h_{t-1}+b^{(i)})$$
The function of **input gate** is to decide whether to save or discard the input. The
input function also allows the cell to learn which part of candidate memory to
keep or discard.
3. **Candidate Long-Term Memory**: The candidate long-term memory is computed
from $h_{t-1}$ and $x_t$ using an activation function, which is usually $\tanh$,
$$\tilde c(\cdot)=\tanh(w^{(\tilde c x)}\times x_t+w^{(\tilde c h)}\times h_{t-1}+b^{(\tilde c)})$$
4. The preceding three calculations are combined to get the updated long-term
memory,
$$c_t = c_{t-1}\times f(\cdot) + i(\cdot)\times\tilde c(\cdot)$$
5. **Output Gate** $o(\cdot)$ (or focus/attention gate): $h_{t-1}$ and $x_t$ flow as inputs to the $o(\cdot)$ gate:
$$o(\cdot)=\sigma(w^{(ox)}\times x_t+w^{(oh)}\times h_{t-1}+b^{(o)})$$
The function of **output gate** is to decide how much information can be used to
update the working memory.
6. **Working memory** $h_t$ is updated from the long-term memory $c_t$ and the focus/attention vector $o(\cdot)$:
$$h_t=\phi(c_t)\times o(\cdot)$$
where $\phi(\cdot)$ is the activation function, usually $\tanh$.
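
The six steps above can be sketched as a single NumPy function. This is an illustrative implementation of the equations as written, not the code behind `LSTMCell`; the parameter dictionary layout, the names `lstm_step` and `init_params`, and the sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns (h_t, c_t)."""
    f = sigmoid(p["w_fx"] @ x_t + p["w_fh"] @ h_prev + p["b_f"])        # 1. forget gate
    i = sigmoid(p["w_ix"] @ x_t + p["w_ih"] @ h_prev + p["b_i"])        # 2. input gate
    c_tilde = np.tanh(p["w_cx"] @ x_t + p["w_ch"] @ h_prev + p["b_c"])  # 3. candidate memory
    c_t = c_prev * f + i * c_tilde                                      # 4. long-term memory
    o = sigmoid(p["w_ox"] @ x_t + p["w_oh"] @ h_prev + p["b_o"])        # 5. output gate
    h_t = np.tanh(c_t) * o                                              # 6. working memory
    return h_t, c_t

def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases for the gates f, i, o and candidate c."""
    p = {}
    for g in "fico":
        p["w_%sx" % g] = rng.randn(n_hidden, n_in) * 0.1
        p["w_%sh" % g] = rng.randn(n_hidden, n_hidden) * 0.1
        p["b_%s" % g] = np.zeros(n_hidden)
    return p

rng = np.random.RandomState(123)
n_in, n_hidden = 3, 4  # illustrative sizes
params = init_params(n_in, n_hidden, rng)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.randn(5, n_in):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Pushing the forget-gate bias strongly positive and the input-gate bias strongly negative makes the cell copy $c_{t-1}$ to $c_t$ almost unchanged, which is the mechanism that lets information survive across many time steps.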

##GRU network
The LSTM network is computationally expensive, so researchers developed an almost equally
effective configuration of RNNs, known as the **Gated Recurrent Unit (GRU)** architecture.

In GRU, instead of a working and a long-term memory, only one kind of memory is used,
indicated with $h$ (hidden state). The GRU cell adds information to this state memory or
removes information from this state memory through reset and update gates.
![alt text](https://github.com/armando-fandango/Mastering-TensorFlow/blob/master/images/ch-06/06_06.png?raw=true)

1. **Update Gate** $u(\cdot)$: $h_{t-1}$ and $x_t$ flow as inputs to the $u(\cdot)$ gate:
$$u(\cdot)=\sigma(w^{(ux)}\times x_t+w^{(uh)}\times h_{t-1}+b^{(u)})$$
2. **Reset Gate** $r(\cdot)$: $h_{t-1}$ and $x_t$ flow as inputs to the $r(\cdot)$ gate:
$$r(\cdot)=\sigma(w^{(rx)}\times x_t+w^{(rh)}\times h_{t-1}+b^{(r)})$$
3. **Candidate State Memory**: The candidate state memory is computed from
the output of the $r(\cdot)$ gate, $h_{t-1}$, and $x_t$:
$$\tilde h(\cdot)=\tanh(w^{(\tilde h x)}\times x_t+w^{(\tilde h h)}\times (r(\cdot)\times h_{t-1})+b^{(\tilde h)})$$
4. The preceding three calculations are combined to get the updated state
memory, $h_t$,
$$h_t=(u(\cdot)\times\tilde h(\cdot))+((1-u(\cdot))\times h_{t-1})$$
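
A minimal NumPy sketch of one GRU step follows. It applies the reset gate to $h_{t-1}$ inside the candidate computation, which is the standard GRU formulation and matches the statement that the candidate is computed from the output of the $r(\cdot)$ gate. The function name `gru_step`, the parameter dictionary, and the sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step; returns the updated state h_t."""
    u = sigmoid(p["w_ux"] @ x_t + p["w_uh"] @ h_prev + p["b_u"])  # 1. update gate
    r = sigmoid(p["w_rx"] @ x_t + p["w_rh"] @ h_prev + p["b_r"])  # 2. reset gate
    # 3. candidate state: the reset gate scales how much of h_{t-1} is used
    h_tilde = np.tanh(p["w_hx"] @ x_t + p["w_hh"] @ (r * h_prev) + p["b_h"])
    # 4. blend the candidate with the previous state
    return u * h_tilde + (1.0 - u) * h_prev

rng = np.random.RandomState(123)
n_in, n_hidden = 3, 4  # illustrative sizes
params = {}
for g in "urh":
    params["w_%sx" % g] = rng.randn(n_hidden, n_in) * 0.1
    params["w_%sh" % g] = rng.randn(n_hidden, n_hidden) * 0.1
    params["b_%s" % g] = np.zeros(n_hidden)

h = np.zeros(n_hidden)
for x_t in rng.randn(5, n_in):
    h = gru_step(x_t, h, params)
print(h.shape)  # (4,)
```

With this convention, an update gate close to 0 keeps the previous state and a gate close to 1 replaces it with the candidate.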

#TensorFlow for RNN
##TensorFlow RNN Cell Classes
The `tf.nn.rnn_cell` module provides the following classes:

Class | Description
--- | ---
BasicRNNCell | Provides simple RNN cell
BasicLSTMCell | Provides simple LSTM RNN cell
LSTMCell | Provides LSTM RNN cell
GRUCell | Provides GRU RNN cell
MultiRNNCell | Provides RNN cell made of multiple simple cells joined sequentially

The `tf.contrib.rnn` module provides the following classes:

Class | Description
--- | ---
LSTMBlockCell | Provides the block LSTM RNN cell
LSTMBlockFusedCell | Provides the block fused LSTM RNN cell
GLSTMCell | Provides the group LSTM cell
GridLSTMCell | Provides the grid LSTM RNN cell
GRUBlockCell | Provides the block GRU RNN cell
BidirectionalGridLSTMCell | Provides bidirectional grid LSTM with bi-direction only in frequency and not in time
NASCell | Provides neural architecture search RNN cell
UGRNNCell | Provides update gate RNN cell

##TensorFlow RNN Model Construction Classes
TensorFlow provides classes to create RNN models from the RNN cell objects. The static
RNN classes add unrolled cells for the time steps when the graph is built, while the dynamic RNN
classes unroll the cells over the time steps at run time.
* `tf.nn.static_rnn`
* `tf.nn.static_state_saving_rnn`
* `tf.nn.static_bidirectional_rnn`
* `tf.nn.dynamic_rnn`
*  `tf.nn.bidirectional_dynamic_rnn`
*  `tf.nn.raw_rnn`
*  `tf.contrib.rnn.stack_bidirectional_dynamic_rnn`

##TensorFlow RNN Cell Wrapper Classes
TensorFlow also provides classes that wrap other cell classes:
* `tf.contrib.rnn.LSTMBlockWrapper`
* `tf.contrib.rnn.DropoutWrapper`
* `tf.contrib.rnn.EmbeddingWrapper`
* `tf.contrib.rnn.InputProjectionWrapper`
* `tf.contrib.rnn.OutputProjectionWrapper`
* `tf.contrib.rnn.DeviceWrapper`
* `tf.contrib.rnn.ResidualWrapper`

#Keras for RNN
To build the RNN model, add layers from the `keras.layers.recurrent` module. This module contains the `SimpleRNN`, `LSTM`, and `GRU` layers.

###Stateful Models
Keras recurrent layers also support RNN models that save state between the batches. You
can create a stateful RNN, LSTM, or GRU model by passing the `stateful` parameter as `True`.
For stateful models, the batch size specified for the inputs has to be a fixed value. In stateful
models, the hidden state learnt from training a batch is reused for the next batch. If you
want to reset the memory at some point during training, it can be done with extra code by
calling the `model.reset_states()` or `layer.reset_states()` functions.
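
The effect of carrying state between batches can be illustrated without Keras. In the conceptual NumPy sketch below (not Keras internals; `run_chunk` and all sizes are hypothetical), a sequence is split into two chunks. Passing the final state of the first chunk as the initial state of the second reproduces the result of processing the whole sequence at once, whereas starting the second chunk from zeros, which is analogous to resetting the state, does not.

```python
import numpy as np

def run_chunk(x_chunk, h0, w_x, w_h, b):
    """Process one chunk of a sequence, starting from state h0; return the final state."""
    h = h0
    for x_t in x_chunk:
        h = np.tanh(w_x @ x_t + w_h @ h + b)
    return h

rng = np.random.RandomState(123)
n_in, n_hidden = 3, 4  # illustrative sizes
x_seq = rng.randn(8, n_in)
w_x = rng.randn(n_hidden, n_in) * 0.5
w_h = rng.randn(n_hidden, n_hidden) * 0.5
b = np.zeros(n_hidden)
zeros = np.zeros(n_hidden)

h_full = run_chunk(x_seq, zeros, w_x, w_h, b)          # whole sequence in one pass

h_mid = run_chunk(x_seq[:4], zeros, w_x, w_h, b)       # first chunk
h_stateful = run_chunk(x_seq[4:], h_mid, w_x, w_h, b)  # state carried over
h_reset = run_chunk(x_seq[4:], zeros, w_x, w_h, b)     # state reset between chunks

print(np.allclose(h_full, h_stateful))  # True
```

This is also why a stateful model needs a fixed batch size: the saved state of sample $i$ in one batch is reused as the initial state of sample $i$ in the next batch.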

##RNN in Keras for MNIST data

In [3]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('mnist', one_hot=True)

x_train = mnist.train.images
x_test = mnist.test.images
y_train = mnist.train.labels
y_test = mnist.test.labels
n_classes = 10

Extracting mnist/train-images-idx3-ubyte.gz
Extracting mnist/train-labels-idx1-ubyte.gz
Extracting mnist/t10k-images-idx3-ubyte.gz
Extracting mnist/t10k-labels-idx1-ubyte.gz


In [0]:
# preprocessing
x_train = x_train.reshape(-1, 28, 28)
x_test = x_test.reshape(-1, 28, 28)
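
The reshape above turns each flattened 784-pixel image into a sequence of 28 time steps with 28 features each, so the RNN reads an image one pixel row at a time. A small check on stand-in data (the array `flat` below is made up for illustration, not MNIST):

```python
import numpy as np

# Two fake "flattened images" whose pixel values equal their flat index.
flat = np.arange(2 * 784, dtype=np.float32).reshape(2, 784)
seq = flat.reshape(-1, 28, 28)  # (samples, time steps, features per step)

print(seq.shape)  # (2, 28, 28)
# Time step 1 of the first image is its second row of 28 pixels.
print(np.array_equal(seq[0, 1], flat[0, 28:56]))  # True
```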

In [5]:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.recurrent import SimpleRNN
from keras.optimizers import RMSprop
from keras.optimizers import SGD

Using TensorFlow backend.


In [0]:
tf.reset_default_graph()
keras.backend.clear_session()

In [7]:
# build RNN
model = Sequential()
model.add(SimpleRNN(units=16, activation='relu', input_shape=(28, 28)))
model.add(Dense(n_classes))
model.add(Activation('softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (None, 16)                720       
_________________________________________________________________
dense_1 (Dense)              (None, 10)                170       
_________________________________________________________________
activation_1 (Activation)    (None, 10)                0         
Total params: 890
Trainable params: 890
Non-trainable params: 0
_________________________________________________________________
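
The parameter counts in the summary can be reproduced by hand. A `SimpleRNN` layer has one weight per input feature and one per recurrent unit for each of its units, plus a bias; a `Dense` layer has one weight per input plus a bias for each unit. The helper names below are hypothetical.

```python
def simple_rnn_params(n_inputs, n_units):
    # input weights + recurrent weights + bias, for each unit
    return n_units * (n_inputs + n_units + 1)

def dense_params(n_inputs, n_units):
    # input weights + bias, for each unit
    return n_units * (n_inputs + 1)

rnn = simple_rnn_params(28, 16)  # 28 features per time step, 16 units
dense = dense_params(16, 10)     # 16 RNN outputs, 10 classes
print(rnn, dense, rnn + dense)   # 720 170 890
```

The count does not depend on the number of time steps, because the same weights are reused at every step.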


In [9]:
# compile and train
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(lr=0.01),
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=100, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f0cebc368d0>

In [10]:
score = model.evaluate(x_test, y_test)
print('\nTest loss:', score[0])
print('Test Accuracy:', score[1])


Test loss: 14.573981651306152
Test Accuracy: 0.0958
