# Introduction to Recurrent Neural Networks

The final neural network architecture we will cover is the recurrent neural network.

## What we will accomplish

In this notebook we will:
- Discuss the kinds of problems recurrent nets are designed for,
- Give an overview of basic RNN architectures,
- Build an RNN to predict IMDB review sentiment.

Let's go!

In [None]:
## For data handling
import pandas as pd
import numpy as np

## For plotting
import matplotlib.pyplot as plt
from seaborn import set_style

## This sets the plot style
## to have a grid on a white background
set_style("whitegrid")

### Single Hidden Layer RNN

Recurrent neural networks (RNNs) were built to deal with sequential data. Some examples of sequential data include:
- Time series,
- Natural language,
- Music

Consider sequential data of the form $(x_t, y_t)$, with $x_t \in \mathbb{R}^p$ and $y_t \in \mathbb{R}^m$.

A Single Hidden Layer RNN has the following form.  We will call the single hidden layer the "state", and give it dimension $d$.  Let $\sigma_1$ and $\sigma_2$ be differentiable activation functions (often a single variable function applied to each coordinate).

$$
\begin{cases}
h_t = \sigma_1 (W_{hx} x_t + W_{hh} h_{t-1} + b_h)\\
y_t = \sigma_2 (W_{yh} h_t + b_y)
\end{cases}
$$

where the learnable parameters of the model are

* $h_{-1} \in \mathbb{R}^d$
* $W_{hx}$ is a $d \times p$ matrix 
* $W_{hh}$ is a $d \times d$ matrix
* $b_h$ is a vector in $\mathbb{R}^d$
* $W_{yh}$ is an $m \times d$ matrix
* $b_y$ is a vector in $\mathbb{R}^m$

The most common choice for $\sigma_1$ is $\tanh$ and the most common choice for $\sigma_2$ is the identity function.

RNNs can be more complicated than this, but we will stick with this example for the purposes of this introduction to the concept.

### Custom implementation

The following custom implementation of an RNN class might help to understand what is going on:

In [None]:
class RNN:
  def __init__(self, p,d,m, sigma_1 = np.tanh, sigma_2 = None, print_h = False):
    self.h = np.zeros((d,1))
    self.W_hh = np.random.randn(d,d)
    self.W_hx = np.random.randn(d,p)
    self.W_yh = np.random.randn(m,d)
    self.b_h = np.random.randn(d,1)
    self.b_y = np.random.randn(m,1)
    self.sigma_1 = sigma_1
    self.sigma_2 = sigma_2
    self.print_h = print_h
  def predict(self, x):
    # update the hidden state
    x = x.reshape(-1,1)
    if self.print_h:
      print(f"old h = {self.h}")
    self.h = np.dot(self.W_hh, self.h) + np.dot(self.W_hx, x) + self.b_h
    if self.sigma_1:
      self.h = self.sigma_1(self.h)
    if self.print_h:
      print(f"new h = {self.h}")
    # compute the output vector
    y = np.dot(self.W_yh, self.h) + self.b_y
    if self.sigma_2:
      y = self.sigma_2(y)
    return y

In [None]:
# Shape check: a (3,2) matrix times a (2,1) column vector gives a (3,1) column vector
np.dot(np.random.randn(3,2), np.zeros((2,1)))

In [None]:
# Example with input dimension 2, hidden state of dimension 3, output of dimension 1.
rnn = RNN(2,3,1, print_h=True)

In [None]:
# printing the attributes
import pprint
pprint.pprint(rnn.__dict__)

In [None]:
# run this a few times and discuss how the outputs are being computed.
rnn.predict(np.array([0.3, 0.1]))

The only thing which gets updated when we predict is the state vector $h$.  All of the other attributes are parameters which we would update during training but not when making predictions.

In [None]:
pprint.pprint(rnn.__dict__)

### Double Exponential Smoothing is an instance of an RNN

Just as linear regression is the most basic instance of a FFNN, exponential smoothing is the most basic instance of an RNN.

Recall the set up of double exponential smoothing:

We iteratively update a hidden state consisting of level $s_t$ and slope $b_t$ for a time series $y_t$.

$$
\hat{y}_{t} = \left\lbrace \begin{array}{l c c} s_{t-1} + b_{t-1} & \text{for} & 1<t\leq n \\
                                                s_n + (t-n)b_{n}& \text{for} & t > n \end{array}\right\rbrace, 
$$

where 

$$
s_{t} = \alpha y_t + (1-\alpha) (s_{t-1} + b_{t-1}), \ s_1 = y_1,
$$

$$
b_{t} = \beta (s_t - s_{t-1}) + (1-\beta) b_{t-1}, \ b_1 = y_2 - y_1.
$$

This can be re-written (with some algebra) using our notation above as

$$
\begin{aligned}
h_t = \begin{bmatrix} s_t \\ b_t \end{bmatrix} &= \begin{bmatrix} \alpha \\ \alpha \beta \end{bmatrix} y_t + \begin{bmatrix} 1-\alpha & 1-\alpha \\ -\alpha\beta & 1-\alpha\beta \end{bmatrix} \begin{bmatrix} s_{t-1} \\ b_{t-1}\end{bmatrix}\\
\hat{y}_{t+1} &= \begin{bmatrix} 1 & 1\end{bmatrix} \begin{bmatrix} s_t \\ b_t\end{bmatrix}
\end{aligned}
$$

So double exponential smoothing is in fact a special case of `RNN(1,2,1)`.
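The algebra can be sanity-checked numerically. The sketch below is only an illustration: the helper names `des_step_scalar` and `des_step_matrix` and the sample values are assumptions, not part of the `RNN` class above. It runs the scalar recursions for $s_t$ and $b_t$ alongside the matrix update and checks that the two agree at every step.

```python
import numpy as np

def des_step_scalar(s_prev, b_prev, y, alpha, beta):
    """One step of the level/slope recursions for double exponential smoothing."""
    s = alpha * y + (1 - alpha) * (s_prev + b_prev)
    b = beta * (s - s_prev) + (1 - beta) * b_prev
    return s, b

def des_step_matrix(h_prev, y, alpha, beta):
    """The same step written as h_t = W_hx * y_t + W_hh @ h_{t-1}."""
    W_hx = np.array([alpha, alpha * beta])
    W_hh = np.array([[1 - alpha, 1 - alpha],
                     [-alpha * beta, 1 - alpha * beta]])
    return W_hx * y + W_hh @ h_prev

alpha, beta = 0.7, 0.8
s, b = 1.0, 2.0           # arbitrary starting level and slope
h = np.array([s, b])      # the same starting point as a state vector

for y in [3.1, 4.9, 7.2, 9.0]:
    s, b = des_step_scalar(s, b, y, alpha, beta)
    h = des_step_matrix(h, y, alpha, beta)
    # Both formulations should produce the same state.
    assert np.allclose(h, [s, b])

print("scalar and matrix updates agree")
```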

Let's implement it as a subclass!

In [None]:
class DoubleExpSmooth(RNN):
    def __init__(self, alpha, beta, print_h = False):
        # sigma_1 = None makes the state update linear (identity activation)
        super().__init__(1,2,1, sigma_1 = None)
        self.W_hh = np.array([[1-alpha, 1-alpha],[-alpha*beta, 1-alpha*beta]])
        self.W_hx = np.array([[alpha],[beta*alpha]])
        self.W_yh = np.array([[1, 1]])
        self.b_h = np.array([[0],[0]])
        self.b_y = np.array([[0]])
        self.print_h = print_h

In [None]:
# Let's see how it handles predicting a noisy linear sequence

des = DoubleExpSmooth(0.7,0.8, print_h = True)
ys = 0.8*np.random.randn(10) + 2* np.arange(10) + 1
preds = list()

for i in range(10):
    pred = des.predict(ys[i])[0,0]
    print(f"\ninput = {ys[i]:.4f}, pred = {pred:.4f} \n")
    preds.append(pred)

In [None]:
plt.title("RNN implementation of Double Exponential Smoothing")
plt.scatter(range(10),ys, label = 'Data')
plt.scatter(range(1,11),preds, label = 'One Step Predictions', marker='x')
plt.legend()
plt.show()


### Training an RNN

Training an RNN involves [backpropagation through time](https://en.wikipedia.org/wiki/Backpropagation_through_time).  We "unroll" the RNN for $k$ time steps to treat it like a feed forward neural network with $k$ layers, but where the parameters of each layer are shared.  We would then implement gradient descent on the loss function as usual.  Note that the resulting trained model will still have an "infinite memory" even though we only accounted for $k$ time steps in the loss minimization process.
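To make the unrolling concrete, here is a minimal sketch of backpropagation through time for a scalar RNN $h_t = \tanh(w h_{t-1} + u x_t)$ with a squared-error loss on the final state. The model, the loss, and the function names are illustrative assumptions chosen to keep the example small. The gradient with respect to the shared weight $w$ is accumulated across all unrolled steps and compared to a finite-difference estimate.

```python
import math

def forward(w, u, xs, h0=0.0):
    """Unroll h_t = tanh(w * h_{t-1} + u * x_t) and return every state, h0 included."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    """Squared error between the final state and a target value."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grad_w(w, u, xs, target):
    """Gradient of the loss with respect to w, computed by walking back through time."""
    hs = forward(w, u, xs)
    dL_dh = hs[-1] - target                 # dL/dh at the last step
    grad_w = 0.0
    for t in range(len(xs), 0, -1):
        dL_da = dL_dh * (1.0 - hs[t] ** 2)  # through tanh: tanh'(a) = 1 - tanh(a)^2
        grad_w += dL_da * hs[t - 1]         # w is shared, so contributions add up
        dL_dh = dL_da * w                   # pass the gradient back to h_{t-1}
    return grad_w

w, u = 0.9, 0.5
xs = [0.3, -0.1, 0.8, 0.2]
target = 0.4

eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
analytic = bptt_grad_w(w, u, xs, target)
print(f"BPTT gradient: {analytic:.6f}, finite difference: {numeric:.6f}")
```

Because the same $w$ appears in every unrolled layer, its gradient is a sum with one term per time step, which is the main difference from backpropagation in an ordinary feed forward network.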

A major technical issue when training RNNs is the [vanishing/exploding gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem):  because the same recurrent weights are applied at every time step, the gradients of the loss with respect to the model parameters can decay or grow exponentially with the number of unrolled steps.
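A simplified illustration, assuming an identity activation so that the Jacobian of $h_k$ with respect to $h_0$ is exactly $W_{hh}^k$: scaling an orthogonal matrix by a factor below or above $1$ makes that Jacobian shrink or grow exponentially in $k$. The function name `jacobian_norm` and the chosen scales are assumptions made for this sketch.

```python
import numpy as np

def jacobian_norm(scale, k, d=4, seed=0):
    """Spectral norm of W_hh^k, where W_hh is `scale` times a random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    # The QR factorization of a random matrix gives an orthogonal factor Q.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    W_hh = scale * Q
    # With an identity activation, d h_k / d h_0 = W_hh^k.
    J = np.linalg.matrix_power(W_hh, k)
    return np.linalg.norm(J, 2)

for k in (1, 10, 50):
    print(f"k = {k:2d}: scale 0.5 -> {jacobian_norm(0.5, k):.3e}, "
          f"scale 1.5 -> {jacobian_norm(1.5, k):.3e}")
```

With a nonlinear activation such as $\tanh$ the Jacobian also includes the activation's derivative at each step, which is at most $1$, so this linear case shows the mechanism rather than exact magnitudes.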

## Example: IMDB sentiment analysis

As an illustrative example we will use `keras` to build a sentiment classifier using IMDB movie reviews. Let's first load this data set.

In [None]:
## The data is stored in here
from keras.datasets import imdb

In [None]:
## This will determine the number of vocab words in our
## dictionary
max_features = 10000

## num_words tells keras to return the reviews so they contain only
## the num_words most used words across all the reviews
(X_train, y_train), (X_test,y_test) = imdb.load_data(num_words=max_features)

## Note you may receive a warning, this is not your fault, and is due to how
## keras is loading the data

In [None]:
## Let's look at the first training observation
print(X_train[0])
print(y_train[0])

The data is stored as a list of indices, each of which is representative of a word. Let's see what this particular review looks like, once we have translated it from indices to words. Do not focus on the following code for now, as it is not important for building the neural network.

In [None]:
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key,value) in word_index.items()])

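## The first training review, where words outside the top 10000
## (max_features) are replaced with ? marks.
## Indices are shifted by 3 because keras reserves 0, 1, and 2
## for padding, start-of-sequence, and unknown words.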
print(" ".join([reverse_word_index.get(i-3, '?') for i in X_train[0]]))
print()
print("sentiment value =", y_train[0])

The review above had a $y$ value of $1$, meaning that it has positive sentiment. A value of $0$ indicates a negative sentiment.



In [None]:
## Making our validation set
from sklearn.model_selection import train_test_split

X_train_train,X_val,y_train_train,y_val = train_test_split(X_train, y_train,
                                                           test_size=.2,
                                                           shuffle=True,
                                                           stratify = y_train,
                                                           random_state=440)
                                                           

A key feature of the RNN model is its ability to handle variable-length inputs.

In [None]:
print(len(X_train[0]),len(X_train[1]))
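The reason variable lengths are not a problem can be seen in a small sketch: the same weights are applied at every time step, so the loop simply runs for as many steps as the sequence has, and the final state always has dimension $d$. The function `final_state` and the random toy sequences below are assumptions for illustration and do not use the IMDB data.

```python
import numpy as np

def final_state(xs, W_hx, W_hh, b_h):
    """Run a tanh RNN over a sequence of any length and return the last hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)
    return h

rng = np.random.default_rng(0)
p, d = 2, 3                             # input dimension and state dimension
W_hx = rng.standard_normal((d, p))
W_hh = rng.standard_normal((d, d))
b_h = rng.standard_normal(d)

short_seq = rng.standard_normal((4, p))   # 4 time steps
long_seq = rng.standard_normal((9, p))    # 9 time steps

# Different sequence lengths, same parameters, same output shape.
print(final_state(short_seq, W_hx, W_hh, b_h).shape)
print(final_state(long_seq, W_hx, W_hh, b_h).shape)
```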

TensorFlow has a "ragged" tensor type which allows for variable lengths in some dimensions.
We now convert our data into tensors of the appropriate type and shape.

In [None]:
from tensorflow import convert_to_tensor
from tensorflow import ragged

In [None]:
X_train = ragged.constant(X_train)
X_train_train = ragged.constant(X_train_train)
X_val = ragged.constant(X_val)
X_test = ragged.constant(X_test)
y_train = convert_to_tensor(y_train)
y_train_train = convert_to_tensor(y_train_train)
y_val = convert_to_tensor(y_val)
y_test = convert_to_tensor(y_test)

In [None]:
# Look at a few and see that they have different lengths.
X_train_train[1]

### Making the Network

This network will introduce two new layer types `Embedding` and `SimpleRNN`. 

The [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer is a preprocessing step that is specific to NLP tasks. You can think of it as a dense layer which takes your one-hot encoded vocabulary to a latent space of specified dimension.  It is implemented as its own custom class instead of just using a dense layer for a couple of reasons:  

* Each input of the model will use only a small fraction of the vocabulary, so it is inefficient to actually use one-hot encoding.
* Creating these kinds of "word embeddings" is common enough, and has accumulated enough task specific techniques, that it makes sense to have a dedicated class.
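The equivalence between a dense layer on one-hot vectors and a table lookup can be checked directly. The tiny vocabulary size, embedding dimension, and variable names below are made up for the sketch. This is not a description of how `keras` implements the layer internally, only of the mathematical idea.

```python
import numpy as np

vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim))   # one row per vocabulary word

word_ids = np.array([4, 0, 4, 2])                  # a toy "review" of four word indices

# Dense-layer view: one-hot encode each word, then multiply by the weight matrix.
one_hot = np.eye(vocab_size)[word_ids]
dense_result = one_hot @ E

# Lookup view: just read off the corresponding rows of E.
lookup_result = E[word_ids]

print(np.allclose(dense_result, lookup_result))
```

The lookup never builds the `(sequence length, vocab_size)` one-hot matrix, which is where the efficiency gain comes from when the vocabulary has thousands of words.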

The [`SimpleRNN`](https://keras.io/api/layers/recurrent_layers/simple_rnn/) layer is akin to the RNN architecture we described above, but with $y_t = h_t$, i.e.

* $\sigma_2$ is the identity function
* $W_{yh}$ is the identity matrix
* $b_y = 0$

In [None]:
## Import all the keras stuff we'll need
from keras import models
from keras import layers
from keras import optimizers
from keras import losses
from keras import metrics

In [None]:
# Define sequential model using
# Embedding layer of shape (max_features,32),
# SimpleRNN layer with hidden layer size 10 and return_sequences = False, 
# Dense layer with sigmoid activation.

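model = models.Sequential([
    layers.Embedding(max_features, 32),
    layers.SimpleRNN(10, return_sequences=False),
    layers.Dense(1, activation='sigmoid')
])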

* The Embedding layer has $10000 \times 32 = 320000$ parameters 
* The SimpleRNN layer has:
    * $W_{hx}$ gives us $10 \times 32 = 320$ parameters
    * $W_{hh}$ gives us $10 \times 10 = 100$ parameters
    * $b_h$ gives us $10$ parameters
    * Total of $430$ parameters
* The dense layer has $10 + 1 = 11$ parameters ($10$ weights and $1$ bias).
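These counts follow directly from the shapes of the weight matrices. As a quick check, the helper functions below (hypothetical names, plain Python arithmetic) reproduce them.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-dimensional vector per vocabulary word."""
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    """W_hx (units x input_dim) + W_hh (units x units) + b_h (units)."""
    return units * input_dim + units * units + units

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

counts = {
    "embedding": embedding_params(10000, 32),
    "simple_rnn": simple_rnn_params(32, 10),
    "dense": dense_params(10, 1),
}
counts["total"] = sum(counts.values())
print(counts)
```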

In [None]:
model.summary()

In [None]:
for weight in model.get_weights():
    print(weight.shape)

In [None]:
model.compile(optimizer='rmsprop',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])

In [None]:
# This takes about 3 minutes to run on my computer.

epochs = 8

history = model.fit(X_train_train, y_train_train,
                    epochs = epochs,
                    batch_size=128,
                    validation_data=(X_val,y_val))

In [None]:
history_dict = history.history

In [None]:
## Plotting the training and validation accuracy
plt.figure(figsize = (8,6))

plt.scatter(range(1,epochs+1), history_dict['accuracy'], label = "Training Accuracy")
plt.scatter(range(1,epochs+1), history_dict['val_accuracy'], label = "Validation Set Accuracy")

plt.xlabel("Epoch", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.legend(fontsize=12)

plt.show()

In [None]:
## Plotting the training and validation loss
plt.figure(figsize = (8,6))

plt.scatter(range(1,epochs+1), history_dict['loss'], label = "Training Loss")
plt.scatter(range(1,epochs+1), history_dict['val_loss'], label = "Validation Set Loss")

plt.xlabel("Epoch", fontsize=12)
plt.ylabel("Loss Function Value", fontsize=12)

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

plt.legend(fontsize=12)

plt.show()

## Train and save the final model

Four epochs seem like enough.  Let's retrain on the training + validation set, check the test set accuracy, and save the model checkpoint.

In [None]:
final_model = models.Sequential([
    layers.Embedding(max_features, 32), 
    layers.SimpleRNN(10, return_sequences=False),
    layers.Dense(1, activation='sigmoid')
    ])

final_model.compile(optimizer='rmsprop',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])

epochs = 4

final_model.fit(X_train, y_train,
                epochs = epochs,
                batch_size=128)

In [None]:
preds = final_model.predict(X_test)

In [None]:
preds.reshape(-1).round()

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, preds.reshape(-1).round())

In [None]:
# final_model.save('lecture_12_assets/imdb_model.keras')

## Next steps

`SimpleRNN` is actually a seldom used RNN layer type because there are more complicated `keras` layers that tend to outperform it.  In particular, the RNN layers most commonly used in practice are [Long Short Term Memory (`LSTM`)](https://en.wikipedia.org/wiki/Long_short-term_memory) layers, which partially address the vanishing gradient problem by allowing the RNN to "forget" irrelevant pieces of information.

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.  Modified by Steven Gubkin 2024.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).