# TensorFlow for Deep Learning - Processing Sequences Using RNNs and CNNs

Credits:
- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
- [Udacity Deep Learning Nanodegree](https://www.udacity.com/course/deep-learning-nanodegree--nd101)

- [RNN Example](https://youtu.be/MDLk3fhpTx0)
- [Implementing a Char-RNN](https://youtu.be/MMtgZXzFB10)

## Recurrent Neural Networks (RNNs)

### Recurrent Neurons and Layers

A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward. Let's look at the simplest possible RNN, composed of one neuron receiving inputs, producing an output, and sending that output back to itself. At each
time step t (also called a frame), this recurrent neuron receives the inputs $\mathbf{x}_{(t)} $ as well as its own output from the previous time step, $\mathbf{y}_{(t-1)} $. Since there is no previous output at the first time step, it is generally set to 0. 

<img src="images/RNN2.png" align="center" width="500"/>

Each recurrent neuron has two sets of weights: one for the inputs $\mathbf{x}_{(t)} $  and the other for the outputs of the previous time step, $\mathbf{y}_{(t-1)} $. Let's call these weight vectors $\mathbf{w}_x$ and $\mathbf{w}_y$. If we consider the whole recurrent layer instead of just one recurrent neuron, we can place all the weight vectors in two weight matrices, $\mathbf{W}_x$ and $\mathbf{W}_y$. The output vector of the whole recurrent layer for a single instance can then be computed as:

$ \mathbf{y}_{(t)} = \phi (\mathbf{W}_x^T \mathbf{x}_{(t)}  + \mathbf{W}_y^T \mathbf{y}_{(t-1)}  + \mathbf{b}) $

Just as with feedforward neural networks, we can compute a recurrent layer's output in one shot for a whole mini-batch by placing all the inputs at time step t in an input matrix $\mathbf{X}_t$:

<img src="images/RNN3.png" align="center" width="500"/>
<img src="images/RNN4.png" align="center" width="500"/>
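As a rough illustration of how the single-instance formula extends to a mini-batch, the NumPy sketch below computes one time step of a basic recurrent layer using the row-wise form $\mathbf{Y}_{(t)} = \phi(\mathbf{X}_{(t)} \mathbf{W}_x + \mathbf{Y}_{(t-1)} \mathbf{W}_y + \mathbf{b})$. The function name `recurrent_step`, the layer sizes, and the random weights are assumptions made for this example only.

```python
import numpy as np

def recurrent_step(X_t, Y_prev, W_x, W_y, b, activation=np.tanh):
    """One time step of a basic recurrent layer for a whole mini-batch.

    X_t:    [batch_size, n_inputs]   inputs at time step t
    Y_prev: [batch_size, n_neurons]  layer outputs at time step t-1
    """
    return activation(X_t @ W_x + Y_prev @ W_y + b)

rng = np.random.default_rng(42)
batch_size, n_inputs, n_neurons = 4, 3, 5    # illustrative sizes
W_x = rng.normal(size=(n_inputs, n_neurons))
W_y = rng.normal(size=(n_neurons, n_neurons))
b = np.zeros(n_neurons)

X_0 = rng.normal(size=(batch_size, n_inputs))
X_1 = rng.normal(size=(batch_size, n_inputs))
Y_init = np.zeros((batch_size, n_neurons))   # no previous output at the first step

Y_0 = recurrent_step(X_0, Y_init, W_x, W_y, b)
Y_1 = recurrent_step(X_1, Y_0, W_x, W_y, b)  # depends on X_1 and, through Y_0, on X_0
print(Y_1.shape)
```

Each row of `Y_0` equals the single-instance formula applied to the corresponding row of `X_0`, which is why the batched and per-instance views are interchangeable.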

- [Recurrent Neural Network - Part a](https://youtu.be/ofbnDxGSUcg)
- [Recurrent Neural Network - Part b](https://youtu.be/wsif3p5t7CI)
- [RNN Unfolded](https://youtu.be/xLIA_PTWXog)

### Memory Cell

- Since the output of a recurrent neuron at time step t is a function of all the inputs from previous time steps, you could say it has a form of _memory_. 
- A part of a neural network that preserves some state across time steps is called a _memory cell_ (or simply a _cell_). A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, capable of learning only short patterns (typically about 10 steps long, but this varies depending on the task).

    <img src="images/RNN5.png" align="center" width="400"/>


### Input and Output Sequences

- A _sequence-to-sequence network_ is useful for predicting time series such as stock prices: you feed it the prices over the last N days, and it must output the prices shifted by one day into the future (i.e., from N-1 days ago to tomorrow).
- A _sequence-to-vector network_ reads a whole input sequence and produces a single output. For example, you could feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score.
- A _vector-to-sequence network_ takes a single input and produces a sequence. For example, the input could be an image (or the output of a CNN), and the output could be a caption for that image.
- A _sequence-to-vector network_ (the _encoder_) followed by a _vector-to-sequence network_ (the _decoder_) could be used for translating a sentence from one language to another. You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language.

    <img src="images/RNN6.png" align="center" width="400"/>

### Deep RNN

<img src="images/RNN7.png" align="center" width="400"/>

### Training RNNs via BPTT

- [Backpropagation through Time - Part a](https://youtu.be/eE2L3-2wKac)
- [Backpropagation through Time - Part b](https://youtu.be/bUU9BEQw0IA)
- [Backpropagation through Time - Part c](https://youtu.be/bUU9BEQw0IA)

## Forecasting a Time Series

### Time Series Data

For simplicity, we are using a time series generated by the
generate_time_series() function, shown here:

In [1]:
def generate_time_series(batch_size, n_steps): 
    import numpy as np
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1) 
    time = np.linspace(0, 1, n_steps) 
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))  #   wave 1 
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # + wave 2 
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # + noise 
    return series[..., np.newaxis].astype(np.float32)

This function creates as many time series as requested (via the batch_size
argument), each of length n_steps, and there is just one value per time step
in each series (i.e., all series are univariate). The function returns a NumPy
array of shape [batch size, time steps, 1], where each series is the sum of
two sine waves of fixed amplitudes but random frequencies and phases,
plus a bit of noise.

**NOTE**: When dealing with time series (and other types of sequences such as sentences), the
input features are generally represented as 3D arrays of shape [batch size, time steps,
dimensionality], where dimensionality is 1 for univariate time series and more for
multivariate time series.
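To make the shape convention concrete, here is a small NumPy sketch. The batch size and the three-feature multivariate case are made-up values chosen only for illustration.

```python
import numpy as np

batch_size, n_steps = 32, 50

# Univariate: one value per time step, so the dimensionality is 1
univariate = np.zeros((batch_size, n_steps, 1), dtype=np.float32)

# Multivariate: e.g. 3 hypothetical measurements per time step
multivariate = np.zeros((batch_size, n_steps, 3), dtype=np.float32)

# A 2D array [batch size, time steps] gains the trailing axis with np.newaxis
flat = np.zeros((batch_size, n_steps), dtype=np.float32)
as_3d = flat[..., np.newaxis]

print(univariate.shape, multivariate.shape, as_3d.shape)
```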

Train, Valid, Test split:

In [2]:
import numpy as np
n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

In [3]:
y_train

array([[ 0.33758703],
       [-0.01265792],
       [ 0.5252002 ],
       ...,
       [ 0.5760455 ],
       [ 0.6067221 ],
       [-0.25718474]], dtype=float32)

### Baseline Metrics (Models)

(i) The simplest approach is to predict the last value in each series. This is called _naive forecasting_, and it is sometimes surprisingly difficult to outperform.

In [4]:
import tensorflow as tf
y_pred = X_valid[:, -1]
np.mean(tf.keras.losses.mean_squared_error(y_valid, y_pred))


0.020387158

(ii) Another simple approach is to use a fully connected network. Let's just use a simple Linear Regression model so that each prediction will
be a linear combination of the values in the time series:

In [5]:
model = tf.keras.models.Sequential([ 
    tf.keras.layers.Flatten(input_shape=[50, 1]), 
    tf.keras.layers.Dense(1)
])

model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')
model.fit(X_train, y_train, epochs=20)


In [6]:
model.evaluate(X_valid, y_valid)



0.0035277805291116238

### Implementing a Simple RNN

In [7]:
model = tf.keras.models.Sequential([ 
  tf.keras.layers.SimpleRNN(1, input_shape=[None, 1])
    # recurrent-neurons, [steps, features]
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn (SimpleRNN)       (None, 1)                 3         
Total params: 3
Trainable params: 3
Non-trainable params: 0
_________________________________________________________________


That's really the simplest RNN you can build. It just contains a single layer, with a single neuron. We do not need to specify
the length of the input sequences (unlike in the previous model), since a recurrent neural network can process any number of time steps (this is why we set the first input dimension to None). By default, the SimpleRNN layer
uses the hyperbolic tangent activation function. It works exactly as we saw
earlier: the initial state $h_{(init)}$ is set to 0, and it is passed to a single recurrent
neuron, along with the value of the first time step, $x_{(0)}$. The neuron
computes a weighted sum of these values and applies the hyperbolic tangent
activation function to the result, and this gives the first output, $y_{(0)}$. In a
simple RNN, this output is also the new state $h_{(0)}$. This new state is passed to
the same recurrent neuron along with the next input value, $x_{(1)}$, and the
process is repeated until the last time step. Then the layer just outputs the
last value, $y_{(49)}$. All of this is performed simultaneously for every time series.

**NOTE**: By default, recurrent layers in Keras only return the final output. To make them return one output per time step, you must set ```return_sequences=True```.
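The step-by-step behavior described above can be sketched in plain NumPy. This is only an illustration of the computation, not how Keras implements it; the function name `single_neuron_rnn`, the parameter values, and the input series are all made up for the example.

```python
import numpy as np

def single_neuron_rnn(x_seq, w_x, w_h, b, return_sequences=False):
    """Run one tanh recurrent neuron over a univariate sequence.

    x_seq: 1D array of length n_steps. w_x, w_h, b: the 3 scalar parameters.
    """
    h = 0.0                                    # initial state
    outputs = []
    for x_t in x_seq:
        h = np.tanh(w_x * x_t + w_h * h + b)   # the output is also the new state
        outputs.append(h)
    return np.array(outputs) if return_sequences else outputs[-1]

x_seq = np.sin(np.linspace(0, 3, 50))          # made-up input series, 50 steps
w_x, w_h, b = 0.8, 0.5, 0.1                    # arbitrary illustrative values

last = single_neuron_rnn(x_seq, w_x, w_h, b)
all_steps = single_neuron_rnn(x_seq, w_x, w_h, b, return_sequences=True)
print(all_steps.shape, np.isclose(last, all_steps[-1]))
```

The `return_sequences` flag here mirrors the idea behind the Keras argument of the same name: the same loop runs in both cases, and only the amount of output that is kept differs.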

In [8]:
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')
model.fit(X_train, y_train, epochs=20)


In [9]:
model.evaluate(X_valid, y_valid)



0.013325328007340431

So the simple RNN is better than the naive approach, but it does not beat the simple linear model. Note that for each neuron, a linear model has one parameter per input and per time step, plus a bias term (in the simple linear model we used, that's a
total of **51** parameters). In contrast, for each recurrent neuron in a simple
RNN, there is just one parameter per input and per hidden state dimension
(in a simple RNN, that's just the number of recurrent neurons in the layer),
plus a bias term. In this simple RNN, that's a total of just **3** parameters.
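The two parameter counts quoted above can be checked with a couple of helper functions. The names `dense_params` and `simple_rnn_params` are hypothetical helpers written for this sketch, not Keras functions.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def simple_rnn_params(n_inputs, n_units):
    """Input weights + recurrent weights + biases of a basic recurrent layer."""
    return n_inputs * n_units + n_units * n_units + n_units

print(dense_params(50, 1))       # linear model on 50 flattened time steps: 51
print(simple_rnn_params(1, 1))   # one recurrent neuron on a univariate series: 3
```

Note that the recurrent layer's count does not depend on the number of time steps, since the same weights are reused at every step.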

### Deep RNN

Implementing a deep RNN with tf.keras is quite simple: just stack recurrent
layers. In this example, we use three SimpleRNN layers (but we could add
any other type of recurrent layer, such as an LSTM layer or a GRU layer):

```python
# seq-to-seq network
model = tf.keras.models.Sequential([ 
    tf.keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]), 
    tf.keras.layers.SimpleRNN(20, return_sequences=True), 
    tf.keras.layers.SimpleRNN(1)
])
```

**WARNING**: Make sure to set return_sequences=True for all recurrent layers (except the last one, if you only care about the last output). If you don't, they will output a 2D array (containing only the output of the last time step) instead of a 3D array (containing outputs for all time steps), and the next recurrent layer will complain that you are not feeding it sequences in the expected 3D format.

Note that the last layer is not ideal: it must have a single unit because we
want to forecast a univariate time series, and this means we must have a
single output value per time step. However, having a single unit means that
the hidden state is just a single number. That's really not much, and it's
probably not that useful; presumably, the RNN will mostly use the hidden
states of the other recurrent layers to carry over all the information it needs
from time step to time step, and it will not use the final layer's hidden state
very much. Moreover, since a SimpleRNN layer uses the tanh activation
function by default, the predicted values must lie within the range -1 to 1.
But what if you want to use another activation function? For both these
reasons, it might be preferable to replace the output layer with a Dense
layer: it would run slightly faster, the accuracy would be roughly the same,
and it would allow us to choose any output activation function we want. If
you make this change, also make sure to remove ```return_sequences=True```
from the second (now last) recurrent layer (because we now have a _seq-to-vec_ network):

In [10]:
# seq-to-vec network
model = tf.keras.models.Sequential([ 
    tf.keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]), 
    tf.keras.layers.SimpleRNN(20), 
    tf.keras.layers.Dense(1)
])

model.summary() # [features*n_neurons + n_neurons*n_neurons + bias] = [1x20 + 20x20 + 20], [20x20 + 20x20 + 20], [20x1 + 1]

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (None, None, 20)          440       
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 20)                820       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 21        
Total params: 1,281
Trainable params: 1,281
Non-trainable params: 0
_________________________________________________________________


In [11]:
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')
model.fit(X_train, y_train, epochs=20)


In [12]:
model.evaluate(X_valid, y_valid)



0.0027229059487581253

### Forecasting Several Time Steps Ahead

- The first option is to use the model we already trained, make it predict the
next value, then add that value to the inputs (acting as if this predicted value
had actually occurred), and use the model again to predict the following
value, and so on. As you might expect, the prediction for the next step will usually be more
accurate than the predictions for later time steps, since the errors might
accumulate.
- The second option is to train an RNN to predict all 10 next values at once.
We can still use a _sequence-to-vector_ model, but it will output 10 values
instead of 1.
- We can still do better: indeed,
instead of training the model to forecast the next 10 values only at the very
last time step, we can train it to forecast the next 10 values at each and
every time step. In other words, we can turn this _sequence-to-vector_ RNN
into a _sequence-to-sequence_ RNN. The advantage of this technique is that
the loss will contain a term for the output of the RNN at each and every
time step, not just the output at the last time step. This means there will be
many more error gradients flowing through the model, and they won't have
to flow only through time; they will also flow from the output of each time
step. This will both stabilize and speed up training. To be clear, at time step 0 the model will output a vector containing the
forecasts for time steps 1 to 10, then at time step 1 the model will forecast
time steps 2 to 11, and so on. So each target must be a sequence of the same
length as the input sequence, containing a 10-dimensional vector at each
step.

    **NOTE**: It may be surprising that the targets will contain values that appear in the inputs (there is
a lot of overlap between X_train and Y_train). Isn't that cheating? Fortunately, not at
all: at each time step, the model only knows about past time steps, so it cannot look
ahead. It is said to be a _causal_ model.
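The first option above (predicting one step at a time and feeding each prediction back in) can be sketched as follows. The helper `forecast_iteratively` and the stand-in `naive_model` are hypothetical names introduced for this example; any callable that maps a `[batch size, time steps, 1]` array to a `[batch size, 1]` array of next-step predictions could take the place of `naive_model`.

```python
import numpy as np

def forecast_iteratively(predict_next, X, n_ahead):
    """Forecast n_ahead steps by feeding each prediction back as an input.

    predict_next: callable mapping [batch, time steps, 1] -> [batch, 1]
                  (stand-in for a trained one-step-ahead model).
    X:            [batch, time steps, 1] array of known values.
    """
    n_known = X.shape[1]
    for _ in range(n_ahead):
        y_next = predict_next(X)                                    # [batch, 1]
        # Append the prediction as if it had actually occurred
        X = np.concatenate([X, y_next[:, np.newaxis, :]], axis=1)
    return X[:, n_known:, 0]                                        # [batch, n_ahead]

# Stand-in "model": naive forecasting, i.e. repeat the last value seen
naive_model = lambda X: X[:, -1, :]

X_new = np.arange(12, dtype=np.float32).reshape(2, 6, 1)
forecast = forecast_iteratively(naive_model, X_new, n_ahead=3)
print(forecast.shape)
```

With a trained sequence-to-vector Keras model whose output has shape `[batch size, 1]`, passing `model.predict` as `predict_next` should work the same way, although that is an assumption about the model rather than something this sketch verifies.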

In [13]:
series = generate_time_series(10000, n_steps + 10)
Y = np.empty((10000, n_steps, 10)) # each target is a sequence of 10D vectors
for step_ahead in range(1, 10 + 1): 
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]
X_train, Y_train = series[:7000, :n_steps], Y[:7000]
X_valid, Y_valid = series[7000:9000, :n_steps], Y[7000:9000]
X_test, Y_test = series[9000:, :n_steps], Y[9000:]
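To see what the target construction above produces, the sketch below repeats the same loop on a tiny toy series whose value at each time step is simply the step index. The sizes (5 input steps, 3 steps ahead) are assumptions chosen to keep the output readable.

```python
import numpy as np

n_steps, n_ahead = 5, 3
# Toy "series" whose value equals its time index, so targets are easy to read
series = np.arange(n_steps + n_ahead, dtype=np.float32).reshape(1, -1, 1)

Y = np.empty((1, n_steps, n_ahead), dtype=np.float32)
for step_ahead in range(1, n_ahead + 1):
    Y[:, :, step_ahead - 1] = series[:, step_ahead:step_ahead + n_steps, 0]

X = series[:, :n_steps]
print(X[0, :, 0])   # inputs at time steps 0..4
print(Y[0, 0])      # targets at time step 0: the values at steps 1, 2, 3
print(Y[0, -1])     # targets at the last input step: the values at steps 5, 6, 7
```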

To turn the model into a sequence-to-sequence model, we must set
return_sequences=True in all recurrent layers (even the last one), and we
must apply the output Dense layer at every time step. Keras offers a
TimeDistributed layer for this very purpose: it wraps any layer (e.g., a
Dense layer) and applies it at every time step of its input sequence. It does
this efficiently, by reshaping the inputs so that each time step is treated as a
separate instance (i.e., it reshapes the inputs from [batch size, time steps,
input dimensions] to [batch size × time steps, input dimensions]; in this
example, the number of input dimensions is 20 because the previous
SimpleRNN layer has 20 units), then it runs the Dense layer, and finally it
reshapes the outputs back to sequences (i.e., it reshapes the outputs from
[batch size × time steps, output dimensions] to [batch size, time steps,
output dimensions]; in this example the number of output dimensions is 10,
since the Dense layer has 10 units).
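The reshaping trick described above can be reproduced in NumPy. This is a sketch of the idea only, with random arrays standing in for the recurrent layer's outputs and the Dense layer's kernel and bias; it does not call Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_steps, n_in, n_out = 4, 50, 20, 10

H = rng.normal(size=(batch_size, n_steps, n_in))   # outputs of the last recurrent layer
W = rng.normal(size=(n_in, n_out))                 # stand-in Dense kernel
b = rng.normal(size=n_out)                         # stand-in Dense bias

# 1. Treat every time step as a separate instance
flat = H.reshape(batch_size * n_steps, n_in)
# 2. Apply the dense transformation once
flat_out = flat @ W + b
# 3. Restore the sequence structure
Y_seq = flat_out.reshape(batch_size, n_steps, n_out)

# Same result as applying the dense transformation step by step
Y_loop = np.stack([H[:, t] @ W + b for t in range(n_steps)], axis=1)
print(Y_seq.shape, np.allclose(Y_seq, Y_loop))
```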

In [14]:
# seq-to-seq network
model = tf.keras.models.Sequential([ 
    tf.keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]), 
    tf.keras.layers.SimpleRNN(20, return_sequences=True), 
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(10))
])

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_3 (SimpleRNN)     (None, None, 20)          440       
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, None, 20)          820       
_________________________________________________________________
time_distributed (TimeDistri (None, None, 10)          210       
Total params: 1,470
Trainable params: 1,470
Non-trainable params: 0
_________________________________________________________________


All outputs are needed during training, but only the output at the last time
step is useful for predictions and for evaluation. So although we will rely on
the MSE over all the outputs for training, we will use a custom metric for
evaluation, to only compute the MSE over the output at the last time step:

In [15]:
def last_time_step_mse(Y_true, Y_pred): 
    return tf.keras.metrics.mean_squared_error(Y_true[:, -1], Y_pred[:, -1])

In [16]:
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), metrics=[last_time_step_mse])
model.fit(X_train, Y_train, epochs=20)


In [17]:
model.evaluate(X_valid, Y_valid)



[0.01866469345986843, 0.006900131236761808]

## Trend and Seasonality

There are many other models to forecast time series, such as _weighted
moving average_ models or _autoregressive integrated moving average_
(ARIMA) models. Some of them require you to first remove the trend
and seasonality. For example, if you are studying the number of active
users on your website, and it is growing by 10% every month, you
would have to remove this trend from the time series. Once the model is
trained and starts making predictions, you would have to add the trend
back to get the final predictions. Similarly, if you are trying to predict
the amount of sunscreen lotion sold every month, you will probably
observe strong seasonality: since it sells well every summer, a similar
pattern will be repeated every year. You would have to remove this
seasonality from the time series, for example by computing the
difference between the value at each time step and the value one year
earlier (this technique is called _differencing_). Again, after the model is
trained and makes predictions, you would have to add the seasonal
pattern back to get the final predictions.
When using RNNs, it is generally not necessary to do all this, but it
may improve performance in some cases, since the model will not have
to learn the trend or the seasonality.
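Seasonal differencing and its inverse can be sketched in a few lines of NumPy. The helper names and the synthetic monthly sales series below are assumptions made for illustration, and the inverse helper only handles forecasts of at most one period.

```python
import numpy as np

def seasonal_difference(series, period):
    """Return series[t] - series[t - period] for every t >= period."""
    return series[period:] - series[:-period]

def undo_seasonal_difference(diff_forecast, history, period):
    """Add the values from one period earlier back to differenced forecasts.

    Assumes len(diff_forecast) <= period, so every reference value is in history.
    """
    reference = history[len(history) - period:][:len(diff_forecast)]
    return diff_forecast + reference

# Made-up monthly sales: a yearly pattern plus a slow upward drift
months = np.arange(36)
sales = 100 + 30 * np.sin(2 * np.pi * months / 12) + 0.5 * months

# The yearly pattern cancels out; only 12 months of drift remain in each value
diff = seasonal_difference(sales, period=12)

# Pretend a model forecast the next 3 differenced values, then restore the seasonality
restored = undo_seasonal_difference(np.full(3, 6.0), sales, period=12)
print(np.round(diff[:3], 2), np.round(restored, 2))
```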

## Handling Long Sequences

Simple RNNs can be quite good at forecasting time series or handling other
kinds of sequences, but they do not perform as well on long time series or
sequences.

### Fighting the Unstable Gradients Problem

- [RNN Vanishing Gradient --> LSTM](https://youtu.be/nXP0oGGRrO8)

- _Why not the ReLU activation function?_ Many of the tricks we used in deep nets to alleviate the unstable gradients
problem can also be used for RNNs: good parameter initialization, faster
optimizers, dropout, and so on. However, nonsaturating activation functions
(e.g., ReLU) may not help as much here; in fact, they may actually lead the
RNN to be even more unstable during training. Why? Well, suppose
Gradient Descent updates the weights in a way that increases the outputs
slightly at the first time step. Because the same weights are used at every
time step, the outputs at the second time step may also be slightly increased,
and those at the third, and so on until the outputs explode -- and a
nonsaturating activation function does not prevent that. You can reduce this
risk by using a smaller learning rate, but you can also simply use a
saturating activation function like the hyperbolic tangent (this explains why
it is the default). In much the same way, the gradients themselves can
explode. If you notice that training is unstable, you may want to monitor the
size of the gradients (e.g., using TensorBoard) and perhaps use Gradient
Clipping. 
- _Layer Normalization_: Batch Normalization cannot be used as efficiently with RNNs as
with deep feedforward nets. In fact, you cannot use it between time steps,
only between recurrent layers. Another form of normalization often works 
better with RNNs: Layer Normalization. It is very similar to Batch Normalization, but instead of normalizing
across the batch dimension, it normalizes across the features dimension.
One advantage is that it can compute the required statistics on the fly, at
each time step, independently for each instance. This also means that it
behaves the same way during training and testing (as opposed to BN), and it
does not need to use exponential moving averages to estimate the feature
statistics across all instances in the training set. Like BN, Layer
Normalization learns a scale and an offset parameter for each input. In an
RNN, it is typically used right after the linear combination of the inputs and
the hidden states.
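The core computation of Layer Normalization, as described above, can be sketched in NumPy. The function `layer_norm` and the small constant `eps` (added for numerical stability) are assumptions of this sketch; it illustrates the idea rather than reproducing the Keras layer.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-3):
    """Normalize each instance across its features (the last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(4, 20))   # [batch, features] at one time step
gamma, beta = np.ones(20), np.zeros(20)            # scale and offset, at typical initial values

out = layer_norm(x, gamma, beta)
# Statistics come from each row alone, so an instance gives the same result
# whether it is processed by itself or inside a batch
alone = layer_norm(x[:1], gamma, beta)
print(np.allclose(out[:1], alone), np.round(out.mean(axis=-1), 6))
```

Because no statistic is shared across instances, nothing needs to be tracked between training and testing, which is the property that makes this form of normalization convenient inside a recurrent loop.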

In [18]:
class LNSimpleRNNCell(tf.keras.layers.Layer): 
    def __init__(self, units, activation="tanh", **kwargs): 
        super().__init__(**kwargs) 
        self.state_size = units 
        self.output_size = units 
        self.simple_rnn_cell = tf.keras.layers.SimpleRNNCell(units, activation=None) 
        self.layer_norm = tf.keras.layers.LayerNormalization() 
        self.activation = tf.keras.activations.get(activation) 
    def call(self, inputs, states):
        outputs, new_states = self.simple_rnn_cell(inputs, states) 
        norm_outputs = self.activation(self.layer_norm(outputs)) 
        return norm_outputs, [norm_outputs]

The LNSimpleRNNCell class inherits
from the keras.layers.Layer class, just like any custom layer. The
constructor takes the number of units and the desired activation function,
and it sets the state_size and output_size attributes, then creates a
SimpleRNNCell with no activation function (because we want to perform
Layer Normalization after the linear operation but before the activation
function). Then the constructor creates the LayerNormalization layer, and
finally it fetches the desired activation function. The call() method starts
by applying the simple RNN cell, which computes a linear combination of
the current inputs and the previous hidden states, and it returns the result
twice (indeed, in a SimpleRNNCell, the outputs are just equal to the hidden
states: in other words, new_states[0] is equal to outputs, so we can
safely ignore new_states in the rest of the call() method). Next, the
call() method applies Layer Normalization, followed by the activation
function. Finally, it returns the outputs twice (once as the outputs, and once
as the new hidden states).

In [19]:
model = tf.keras.models.Sequential([ 
    tf.keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True, 
                     input_shape=[None, 1]), 
    tf.keras.layers.RNN(LNSimpleRNNCell(20), return_sequences=True), 
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(10))
])

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
rnn (RNN)                    (None, None, 20)          480       
_________________________________________________________________
rnn_1 (RNN)                  (None, None, 20)          860       
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 10)          210       
Total params: 1,550
Trainable params: 1,550
Non-trainable params: 0
_________________________________________________________________


All recurrent layers (except for
keras.layers.RNN) and all cells provided by Keras have a dropout
hyperparameter and a recurrent_dropout hyperparameter: the former
defines the dropout rate to apply to the inputs (at each time step), and the
latter defines the dropout rate for the hidden states (also at each time step).
So there is no need to create a custom cell to apply dropout at each time step in an
RNN.

In [20]:
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), metrics=[last_time_step_mse])
model.fit(X_train, Y_train, epochs=20)


In [21]:
model.evaluate(X_valid, Y_valid)



[0.018381061032414436, 0.007873216643929482]

### Tackling the Short-Term Memory Problem

Due to the transformations that the data goes through when traversing an
RNN, some information is lost at each time step. After a while, the RNN's
state contains virtually no trace of the first inputs. To tackle
this problem, various types of cells with long-term memory have been
introduced. They have proven so successful that the basic cells are not used
much anymore. 

#### The _Long Short-Term Memory_ (LSTM) Cells

In [22]:
model = tf.keras.models.Sequential([ 
    tf.keras.layers.LSTM(20, return_sequences=True, input_shape=[None, 1]), 
    tf.keras.layers.LSTM(20, return_sequences=True), 
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(10))
])

model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, None, 20)          1760      
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 20)          3280      
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 10)          210       
Total params: 5,250
Trainable params: 5,250
Non-trainable params: 0
_________________________________________________________________


In [23]:
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), metrics=[last_time_step_mse])
model.fit(X_train, Y_train, epochs=20)


In [24]:
model.evaluate(X_valid, Y_valid)



[0.013086429797112942, 0.002533094258978963]

- [LSTM Basics](https://youtu.be/gjb68a4XsqE)
- [The Forget Gate](https://youtu.be/iWxpfxLUPSU)
- [The Input/Learn Gate](https://youtu.be/aVHVI7ovbHY)
- [The Remember Gate](https://youtu.be/0qlm86HaXuU)
- [The Output/Use Gate](https://youtu.be/5Ifolm1jTdY)
- [Putting it All Together](https://youtu.be/IF8FlKW-Zo0)

<img src="images/RNN9.png" align="center" width="500"/>
<img src="images/RNN8.png" align="center" width="500"/>

As the long-term state $\mathbf{c}_{(t-1)}$ traverses the network from left to right, you can see that it first goes through a _forget gate_, dropping some memories, and then it adds some new memories via the addition operation (which adds the memories that were selected by an _input gate_). The result $\mathbf{c}_{(t)}$ is sent straight out, without any further transformation. So, at each time step, some memories are dropped and some memories are added. Moreover, after the addition operation, the long-term state is copied and passed through the tanh function, and then the result is filtered by the _output gate_. This produces the short-term state $\mathbf{h}_{(t)}$ (which is equal to the cell's output for this time step, $\mathbf{y}_{(t)}$).
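The data flow just described can be sketched as a single LSTM time step in NumPy. The function `lstm_step`, the random weights, and the ordering of the four internal layers inside the concatenated matrices are assumptions of this sketch, not a description of Keras internals.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step for a mini-batch.

    x_t: [batch, n_inputs]; h_prev and c_prev: [batch, n_units].
    W: [n_inputs, 4 * n_units], U: [n_units, 4 * n_units], b: [4 * n_units],
    holding the parameters of the four internal layers side by side.
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4, axis=-1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget and output gates
    g = np.tanh(g)                                 # candidate memories
    c_t = f * c_prev + i * g     # drop some memories, then add the selected new ones
    h_t = o * np.tanh(c_t)       # short-term state, which is also the output
    return h_t, c_t

rng = np.random.default_rng(0)
batch, n_inputs, n_units = 2, 1, 20
W = rng.normal(size=(n_inputs, 4 * n_units))
U = rng.normal(size=(n_units, 4 * n_units))
b = np.zeros(4 * n_units)

h = c = np.zeros((batch, n_units))
for t in range(5):                                 # five made-up time steps
    h, c = lstm_step(rng.normal(size=(batch, n_inputs)), h, c, W, U, b)

n_params = W.size + U.size + b.size
print(h.shape, c.shape, n_params)
```

With 1 input feature and 20 units, the parameter count in this sketch comes to 1,760, the same figure reported for the first LSTM layer in the model summary above.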

<img src="images/RNN10.png" align="center" width="500"/>

In short, an LSTM cell can learn to recognize an important input (that's the
role of the input gate), store it in the long-term state, preserve it for as long
as it is needed (that's the role of the forget gate), and extract it whenever it
is needed. This explains why these cells have been amazingly successful at
capturing long-term patterns in time series, long texts, audio recordings, and
more.

<img src="images/RNN11.png" align="center" width="500"/>


#### The _Gated Recurrent Unit_ (GRU) Cells

In [25]:
model = tf.keras.models.Sequential([ 
    tf.keras.layers.GRU(20, return_sequences=True, input_shape=[None, 1]), 
    tf.keras.layers.GRU(20, return_sequences=True), 
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(10))
])

model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
gru (GRU)                    (None, None, 20)          1380      
_________________________________________________________________
gru_1 (GRU)                  (None, None, 20)          2520      
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 10)          210       
Total params: 4,110
Trainable params: 4,110
Non-trainable params: 0
_________________________________________________________________


In [26]:
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), metrics=[last_time_step_mse])
model.fit(X_train, Y_train, epochs=20)


In [27]:
model.evaluate(X_valid, Y_valid)



[0.01425903383642435, 0.003919767681509256]

- [GRUs](https://youtu.be/MsxFDuYlTuQ)

<img src="images/RNN12.png" align="center" width="500"/>
<img src="images/RNN13.png" align="center" width="500"/>
<img src="images/RNN14.png" align="center" width="500"/>

#### WaveNet

<img src="images/RNN15.png" align="center" width="500"/>