## Recurrent Neural Networks (RNN) with Keras

- Recurrent neural networks (RNN) are a class of neural networks that is powerful for modeling sequence data such as time series or natural language: e.g. speech recognition, speech synthesis, text generation
- Schematically, a RNN layer uses a `for loop` to iterate over the timesteps of a sequence, while maintaining an internal state that encodes information about the timesteps it has seen so far

### with Keras

There are three built-in RNN layers in Keras:
1. `keras.layers.SimpleRNN`: a fully connected RNN in which the output of the previous timestep is fed to the next timestep; it builds on the generic [keras.layers.RNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RNN) base layer
2. [keras.layers.LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): Long Short-Term Memory layer (Hochreiter & Schmidhuber, 1997)
3. [keras.layers.GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU): Gated Recurrent Unit (Cho et al., 2014)

RNN suffer from short-term memory problems: if a sequence is long, RNN have difficulty carrying information from earlier timepoints over to later timepoints. Specifically, during backpropagation RNN experience the **vanishing gradient problem**: the gradient (the values used to update the network weights) shrinks over successive backpropagation steps, and once it becomes too small it no longer contributes to learning. For example, with a weight $w_t = 2.1$, a learning rate $\alpha = 0.1$ and a gradient of $0.001$:

$$
w_{t+1} = w_t - \alpha \cdot \frac{\partial}{\partial w_t}J(w) = 2.1 - 0.1 \cdot 0.001 = 2.0999
$$

(not much of a difference!)

Therefore, RNN can forget what they saw early in the sequence $\rightarrow$ **short-term memory!**
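
The weight update in the equation above can be checked numerically. The second half of this minimal sketch shows why the gradient gets so small in the first place: if every backpropagation step scales the gradient by a factor below 1, it decays geometrically. The values of `step_factor` and `n_steps` are assumptions chosen only for illustration.

```python
# Weight update from the equation above
w_t = 2.1      # current weight
alpha = 0.1    # learning rate
grad = 0.001   # (very small) gradient of the loss J with respect to w

w_next = w_t - alpha * grad
print(round(w_next, 4))  # 2.0999

# Hypothetical illustration: each step back in time multiplies the
# gradient by a factor below 1, so it shrinks geometrically.
step_factor = 0.5   # assumed value, for illustration only
n_steps = 10        # assumed number of timesteps to backpropagate through
grad_at_first_step = 1.0 * step_factor ** n_steps
print(grad_at_first_step)  # 0.0009765625
```

After only 10 steps the signal reaching the earliest timestep is about a thousandth of its original size, so the weights that depend on early inputs barely move.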


#### Gates in RNN

To address issues with short-term memory, the evolved RNN variants (LSTM and GRU) use internal structures (sublayers) called **gates**. Gates learn which data in a sequence are important to keep and which can be thrown away, and pass the relevant information along the chain of timesteps.
It is a bit like remembering only those words in an ad that stuck in your memory (e.g. the price, the deadline, the main characteristics).

#### Inside a RNN

- words (or sounds) are transformed to vectors
- the RNN processes the sequence of vectors one by one, passing the hidden state sequentially to the next step: in this way, the RNN holds information seen in each previous step
- the input vector (word) and the previous hidden state are combined to form a new vector that has information on the current and previous inputs
- the combined vector goes through the activation function (e.g. `tanh`) and the output is the new hidden state, which is combined with the next input at the next step
- `tanh` squashes the linear combination of the combined vector values (input + hidden state) to the range $[-1,1]$
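
The steps above can be sketched in plain NumPy. This is a minimal illustration, not the Keras implementation: the toy sizes and the randomly initialised weights `W_x`, `W_h` and `b` are assumptions (a trained layer would learn these values).

```python
import numpy as np

rng = np.random.default_rng(0)

timesteps, input_dim, hidden_units = 5, 3, 4   # assumed toy sizes

# Hypothetical weights, randomly initialised for illustration
W_x = rng.normal(size=(input_dim, hidden_units))     # input -> hidden
W_h = rng.normal(size=(hidden_units, hidden_units))  # hidden -> hidden
b = np.zeros(hidden_units)                           # bias

inputs = rng.normal(size=(timesteps, input_dim))  # one sequence of vectors
h = np.zeros(hidden_units)                        # initial hidden state

for x_t in inputs:
    # combine the current input with the previous hidden state,
    # then squash the result to [-1, 1] with tanh
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print(h.shape)  # (4,)
```

The same weights are reused at every timestep; only the hidden state `h` changes as the loop moves along the sequence.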

Simple RNN (no *gates*, or rather only a single internal sublayer) use far fewer computational resources than the evolved variants, **LSTM** and **GRU**.


#### LSTM

Unlike simple RNN, the operations inside a LSTM cell (sublayer/unit) allow the network to keep or forget pieces of information. In this way, information from earlier time steps can also make its way to later time steps, reducing the effects of short-term memory. Along the way, information can be added or removed through **gates** (LSTM layers typically have **4** internal sublayers: the forget, input and output gates, plus a candidate-state layer).
LSTM gates (sublayers) use the **sigmoid** activation function, with output between 0 and 1, which makes it possible to 'forget' information by returning values close to 0, or to keep it by returning values close to 1.
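
As a rough sketch of how the gates act, here is one LSTM time step written in NumPy using the standard LSTM equations. The helper names `sigmoid` and `lstm_step`, the toy sizes, the random weights and the order in which the four sublayers are stacked are all assumptions made for this illustration; Keras may lay out its weights differently.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, output between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b stack the four internal sublayers
    in the (assumed) order: forget, input, candidate, output."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b          # shape (4 * units,)
    f = sigmoid(z[0 * units:1 * units])   # forget gate: what to drop from c_prev
    i = sigmoid(z[1 * units:2 * units])   # input gate: how much new info to add
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: what to expose as h
    c = f * c_prev + i * g                # updated cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 3, 2                   # assumed toy sizes
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 timesteps
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (2,) (2,)
```

A forget-gate value near 0 erases the corresponding entry of the cell state, while a value near 1 carries it over unchanged to the next timestep.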

## Setting up a RNN

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Simple example of a `Sequential` model that processes sequences of integers, embeds each integer into a 64-dimensional vector, then processes the sequence of vectors using a `LSTM` layer, followed by a `Dense` layer with 10 units.

In [None]:
## parameters
embed_input_size = 1000
embed_output_size = 64
lstm_units = 128
dense_units = 10

In [None]:
model = keras.Sequential() ## not unlike usual NN models (including CNN)

# Add an Embedding layer expecting input vocab of size 1000, and
# output embedding dimension of size 64.
model.add(layers.Embedding(input_dim=embed_input_size, output_dim=embed_output_size)) ## this is unlike dense NN models (model.add(Dense()))

# Add a LSTM layer with 128 internal units.
model.add(layers.LSTM(lstm_units))

# Add a Dense layer with 10 units.
model.add(layers.Dense(dense_units))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          64000     
_________________________________________________________________
lstm (LSTM)                  (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 10)                1290      
Total params: 164,106
Trainable params: 164,106
Non-trainable params: 0
_________________________________________________________________


Let's calculate the number of parameters in LSTM layers. Knowing how they are counted gives:
- insight into how LSTM handles time-dependent or sequence input data
- a sense of model capacity and complexity, useful to:
    - handle overfitting or underfitting
    - adjust the number of parameters of each layer

`LSTM` expects input data to be a `3D tensor` such that:

`[batch_size, timesteps, features]`

1. `batch_size`: how many samples in each batch during training and testing
2. `timesteps`: how many values in a sequence, e.g. in `[4, 7, 8, 4]` there are 4 timesteps
3. `features`: how many dimensions are used to represent the data at one time step; e.g. if each value in the sequence is one-hot encoded with 9 zeros and 1 one, then `features` is 10
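
The three dimensions can be made concrete with a small NumPy sketch that one-hot encodes the sequence `[4, 7, 8, 4]` from the list above. The variable names and the choice of 10 possible values are assumptions for illustration.

```python
import numpy as np

sequence = [4, 7, 8, 4]   # 4 timesteps
n_features = 10           # length of each one-hot vector (assumed 10 possible values)

# Row k of the identity matrix is the one-hot vector for value k
one_hot = np.eye(n_features)[sequence]   # shape (timesteps, features)
batch = one_hot[np.newaxis, ...]         # add the batch dimension

print(batch.shape)  # (1, 4, 10)
```

Here `batch_size` is 1 (a single sequence), `timesteps` is 4 and `features` is 10.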

`LSTM` layers have **4 dense layers** in their internal structure

#### Illustration of a LSTM layer

<img src="https://github.com/kmkarakaya/ML_tutorials/blob/master/images/LSTM_internal2.png?raw=true" width="700">

- 3 inputs
- 4 dense layers within the LSTM layer
- 2 LSTM units (hidden/cell state)

For each internal dense layer, the number of LSTM units (128 in the example above) is added to the number of input features (64, the embedding output size in the example above) and the sum is multiplied by the number of LSTM units; the bias terms are then added (128, one per LSTM unit):

$$
(64+128) \cdot 128+128 = 24\,704
$$

This is multiplied by the number of internal dense layers (4):

$$
24\,704 \cdot 4 = 98\,816
$$

This is how the number of parameters in a LSTM layer is calculated; the result matches the `98816` reported by `model.summary()` above.
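
The same arithmetic can be written as a small helper to try other layer sizes. The function name `lstm_param_count` is hypothetical and not part of Keras; the sizes are those of the model defined earlier.

```python
def lstm_param_count(input_dim: int, units: int, n_internal_layers: int = 4) -> int:
    """Parameters of an LSTM layer: each internal dense layer has
    (input_dim + units) * units weights plus `units` biases."""
    per_layer = (input_dim + units) * units + units
    return n_internal_layers * per_layer

embedding_params = 1000 * 64   # vocabulary size x embedding dimension
lstm_params = lstm_param_count(input_dim=64, units=128)
dense_params = 128 * 10 + 10   # weights + biases of the Dense layer

print(lstm_params)                                    # 98816
print(embedding_params + lstm_params + dense_params)  # 164106
```

The total agrees with the `Total params` line of the model summary.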


### GRU layers

- 