# Recurrent Neural Networks - Basics

A Recurrent Neural Network is a type of neural network able to handle sequential data. But what does it mean for data to be "sequential"?

# Introducing sequential data

Sequence data, also known as **sequences**, have a unique characteristic compared to other types of data: *Order matters*. Sequence data, as the name suggests, appear in a certain order and individual data points are **not** independent of each other.

This is unlike what I have seen so far. In what I am used to, training examples are independent of one another. I learned that this type of data is called **independent and identically distributed (IID)** data. Because the examples are independent, the order in which training samples are fed into the neural network does not matter.

When dealing with sequences, however, this is no longer true.

# Sequential data versus time series data 📃📈

Before moving forward and discussing how to represent sequences, I would like to talk about what time series data is, and how it differs from sequential data.

**Time series data** is *a special type* of sequential data. Each data point is associated with time. In other words, samples are captured at **successive timestamps**. Stock prices and voice recordings, for instance, are time series data.

On the other hand, **not all sequential data is time series data**. Text data, or DNA sequences for instance, are sequential data because the order is important, but they do *not* qualify as time series data.

RNNs can be used to handle both sequential data and time series data.

# How is sequential data represented?

Moving forward, sequences will be represented as such:

$$
\langle x^{(1)}, x^{(2)}, \dots , x^{(T)} \rangle
$$

The superscript indices indicate *the order of the instances*, and $T$ is *the length of the sequence*. If the sequence represents time series data, the superscript represents a particular time. As such, $x^{(t)}$ represents an example point that belongs to a particular time $t$.
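As a minimal sketch of this notation, assuming NumPy and a made-up toy sequence (the values and the names `sequence`, `T` and `n_features` are illustrative only), a sequence of length $T$ can be stored as an array whose first axis is the order index:

```python
import numpy as np

# Hypothetical toy sequence: T = 4 steps, each x^(t) is a 2-dimensional vector.
sequence = np.array([
    [0.1, 0.2],  # x^(1)
    [0.3, 0.4],  # x^(2)
    [0.5, 0.6],  # x^(3)
    [0.7, 0.8],  # x^(4)
])

T, n_features = sequence.shape

# The math uses 1-based superscripts while NumPy uses 0-based indices,
# so x^(t) is stored at sequence[t - 1].
x_2 = sequence[2 - 1]

# Reordering the rows produces a different sequence, because order matters.
reversed_sequence = sequence[::-1]
```

The off-by-one between the superscript $t$ and the array index is easy to get wrong, which is why the comment spells it out.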

Since RNNs are used to model sequences, they are able to "remember" past information in order to produce new information. This is unlike MLPs and CNNs, which do not incorporate ordering information since they treat training examples as independent of each other. In that regard, RNNs can be said to have "memory".

# The different categories of sequence modeling

As previously mentioned, RNNs are used to model sequences. In this section, I would like to describe the different modeling tasks based on the explanation in Andrej Karpathy's article [The unreasonable effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

![Type of modeling tasks](./images/img-1.png)

*Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue, and green vectors hold the RNN's state (more on this soon).*

- **One-to-one**: From fixed-sized input to fixed-sized output. Image classification is an example of such a task.

- **One-to-many**: The input is in a standard (fixed-sized) format, and the output is a sequence. Image captioning, which takes an image and outputs a sequence of words (i.e. a sentence), falls into this category.

- **Many-to-one**: The input data is a sequence this time, and the output is a fixed-sized vector or a scalar. Sentiment analysis falls into this category. In sentiment analysis, a given sequence of words is classified as expressing "positive" or "negative" sentiment.

- **Many-to-many**: Here, both the input and the output are sequences. This category can be further separated into two subcategories.

    - Delayed: Machine translation falls into this category. An RNN reads a whole sentence in English and only then outputs its translation in French.

    - Synchronized: Video classification falls into this category. An RNN reads successive frames of a video and labels each of them.
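To make the categories above more concrete, here is a small sketch of the array shapes each task could involve. The sizes and the dictionary `task_shapes` are hypothetical, chosen only for illustration; no model is involved, just placeholder arrays:

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
T_in, T_out = 5, 3  # input / output sequence lengths
n_in, n_out = 4, 2  # size of one input / output vector

# (input shape, output shape) for each modeling task.
task_shapes = {
    "one-to-one": ((n_in,), (n_out,)),
    "one-to-many": ((n_in,), (T_out, n_out)),
    "many-to-one": ((T_in, n_in), (n_out,)),
    # Delayed: the output length may differ from the input length.
    "many-to-many (delayed)": ((T_in, n_in), (T_out, n_out)),
    # Synchronized: one output per input step, so the lengths match.
    "many-to-many (synchronized)": ((T_in, n_in), (T_in, n_out)),
}

for task, (in_shape, out_shape) in task_shapes.items():
    x = np.zeros(in_shape)   # placeholder input
    y = np.zeros(out_shape)  # placeholder output
    print(f"{task:<28} input {x.shape} -> output {y.shape}")
```

The only structural difference between the two many-to-many variants in this sketch is whether the output length is tied to the input length.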

# Understanding the flow of data in RNNs

In this section of the notebook, I would like to talk about how data flows through a Recurrent Neural Network. Let's consider the flow in a standard feedforward Neural Network and in an RNN side by side for comparison:

![data flow in a NN and in an RNN](./images/img-2.jpg)

Both networks have only one hidden layer.

In classic feedforward networks (MLP, CNN) like we have seen so far, data flows from the input layer, to the hidden layer... then from the hidden layer to the output layer.

In an RNN, on the other hand, the hidden layer receives its input from the **input layer of the current time step** and **the hidden layer from the previous time step**. The flow of information between successive time steps in the hidden layer is displayed as a **loop**, also known as a **recurrent edge** in graph notation, which is how the general RNN architecture got its name. To make it easier to reason about, RNNs are represented **unfolded** as shown below:

![RNN Unfolded](./images/img-3.jpg)

The internal state of the $t^{th}$ hidden unit is obtained from the $t^{th}$ input vector and the internal state of the $(t - 1)^{th}$ hidden unit, thus giving the network a "memory" of past inputs when computing an output.

Like MLPs, RNNs can also have multiple layers. The previous illustration showcases an RNN with only one hidden layer. The following RNN, however, has 2 hidden layers. Unfolded, a 2-layer RNN looks like this:

![Two layer RNN Unfolded](./images/img-4.jpg)

We can summarize the information flow in the above 2-layer RNN as follows:

- $h_1^{(t)}$ represents the $t^{th}$ unit in the first hidden layer. It receives input from $x^{(t)}$, the $t^{th}$ input vector, and the previous hidden unit $h_1^{(t - 1)}$.

- $h_2^{(t)}$ represents the $t^{th}$ unit in the second hidden layer. Unlike $h_1^{(t)}$, $h_2^{(t)}$ receives input from the $t^{th}$ output of the previous layer and the previous hidden unit $h_2^{(t - 1)}$.
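The information flow summarized above can be sketched in NumPy as follows. This is only an illustration under several assumptions: the weight names (`W_xh1`, `W_hh1`, `W_h1h2`, `W_hh2`) are hypothetical, the weights are random, the initial hidden states are zero, biases are left out for brevity, and $\tanh$ is picked as the hidden activation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: sequence length, input size, hidden sizes of both layers.
T, n_in, n_h1, n_h2 = 5, 3, 4, 2
x = rng.normal(size=(T, n_in))  # x[t] plays the role of x^(t+1)

# Layer 1: input-to-hidden and hidden-to-hidden (recurrent) weights.
W_xh1 = rng.normal(size=(n_h1, n_in))
W_hh1 = rng.normal(size=(n_h1, n_h1))
# Layer 2: its "input" is the output of layer 1 at the same time step.
W_h1h2 = rng.normal(size=(n_h2, n_h1))
W_hh2 = rng.normal(size=(n_h2, n_h2))

h1 = np.zeros(n_h1)  # initial state of layer 1 (assumed zero)
h2 = np.zeros(n_h2)  # initial state of layer 2 (assumed zero)
h2_history = []

for t in range(T):
    # h_1^(t) depends on the current input and on h_1^(t-1).
    h1 = np.tanh(W_xh1 @ x[t] + W_hh1 @ h1)
    # h_2^(t) depends on h_1^(t) and on h_2^(t-1).
    h2 = np.tanh(W_h1h2 @ h1 + W_hh2 @ h2)
    h2_history.append(h2)

h2_history = np.stack(h2_history)  # shape (T, n_h2)
```

Note that the second layer never sees $x^{(t)}$ directly; it only sees what the first layer produced at the same time step.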

# Computing activations in an RNN

With a good understanding of how information flows in RNNs, there is one remaining question: how do I compute activation values? In a previous section, I quoted that "*arrows represent functions (e.g. matrix multiply)*". In fact, each directed edge (representing the connection between boxes) represents a **weight matrix**. Let's describe the different weight matrices. To make it easier, I consider the following single-layer RNN:

![Single layer RNN](./images/img-5.jpg)

- $W_{xh}$: refers to the weight matrix between the $t^{th}$ input vector, and the $t^{th}$ unit in the hidden layer.

- $W_{hh}$: refers to the weight matrix associated with the recurrent edge. The one in the above illustration is called a *hidden-to-hidden recurrence*. But in fact, there exist three types of recurrent connections.

- $W_{ho}$: refers to the weight matrix between the $t^{th}$ unit in the hidden layer, and the $t^{th}$ unit in the output layer.

Before diving into the math, I would like to talk about the three recurrent connections I mentioned a minute ago. Here is an illustration showing them all:

![Different recurrent connection model](./images/img-6.jpg)

Like I said earlier, there exist three types of recurrent connections:

- **Hidden-to-hidden** recurrence: denoted $W_{hh}$. This is the one I mentioned earlier. It is the weight matrix between the $(t - 1)^{th}$ unit in the hidden layer and the $t^{th}$ unit in the hidden layer.

- **Output-to-hidden** recurrence: denoted $W_{oh}$. It is the weight matrix between the $(t - 1)^{th}$ output unit and the $t^{th}$ hidden unit.

- **Output-to-output** recurrence: denoted $W_{oo}$. It is the weight matrix between the $(t - 1)^{th}$ output unit and the $t^{th}$ output unit.
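Using the symbols of this section, one way to write the three schemes is sketched below. This formulation is an assumption on my part, since the illustration only shows which units are connected. Here $\sigma_h$ and $\sigma_o$ stand for the activation functions of the hidden and output layers, and $b_h$ and $b_o$ for their bias vectors:

$$
\begin{aligned}
\text{Hidden-to-hidden:} \quad & h^{(t)} = \sigma_h(W_{xh} x^{(t)} + W_{hh} h^{(t - 1)} + b_h) \\
\text{Output-to-hidden:} \quad & h^{(t)} = \sigma_h(W_{xh} x^{(t)} + W_{oh} o^{(t - 1)} + b_h) \\
\text{Output-to-output:} \quad & o^{(t)} = \sigma_o(W_{ho} h^{(t)} + W_{oo} o^{(t - 1)} + b_o)
\end{aligned}
$$

In the output-to-output case, the hidden unit itself would have no recurrent term in this sketch, i.e. $h^{(t)} = \sigma_h(W_{xh} x^{(t)} + b_h)$. In every case, the only thing that changes is which quantity from step $t - 1$ is fed back, and where.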

Now I bring our attention back to the following illustration:

![Single layer RNN](./images/img-5.jpg)

Computing the activation of the $t^{th}$ unit in the output layer starts with computing the net input (preactivation) of the $t^{th}$ unit in the hidden layer: $z_h^{(t)}$.

$$
z_h^{(t)} = W_{xh} x^{(t)} + W_{hh} h^{(t - 1)} + b_h
$$

where $b_h$ is the bias vector.

Then, the activation in the $t^{th}$ hidden unit is given by the following:

$$
h^{(t)} = \sigma_{h}(W_{xh} x^{(t)} + W_{hh} h^{(t - 1)} + b_h)
$$

Here, $\sigma_{h}(\cdot)$ is the activation function of the hidden layer.

The activation in the $t^{th}$ output unit is thus given by:

$$
o^{(t)} = \sigma_{o}(W_{ho}h^{(t)} + b_o)
$$

Where $\sigma_{o}(\cdot)$ is the activation function of the output layer, and $b_o$ is the bias vector for the output units.
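The three formulas above can be turned into a short NumPy sketch. The function name `rnn_forward`, the sizes and the random weights are hypothetical. I also assume $\sigma_h = \tanh$, an identity output activation $\sigma_o$, and a zero initial hidden state, none of which is imposed by the formulas themselves.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_ho, b_h, b_o):
    """Run a single-layer RNN over x_seq of shape (T, n_in)."""
    h = np.zeros(W_hh.shape[0])  # h^(0), assumed to be zero
    hidden_states, outputs = [], []
    for x_t in x_seq:
        z_h = W_xh @ x_t + W_hh @ h + b_h  # net input z_h^(t)
        h = np.tanh(z_h)                   # h^(t), with sigma_h assumed tanh
        o = W_ho @ h + b_o                 # o^(t), with sigma_o assumed identity
        hidden_states.append(h)
        outputs.append(o)
    return np.stack(hidden_states), np.stack(outputs)

rng = np.random.default_rng(42)
T, n_in, n_hidden, n_out = 3, 5, 2, 1  # hypothetical sizes

x_seq = rng.normal(size=(T, n_in))
W_xh = rng.normal(size=(n_hidden, n_in))
W_hh = rng.normal(size=(n_hidden, n_hidden))
W_ho = rng.normal(size=(n_out, n_hidden))
b_h = np.zeros(n_hidden)
b_o = np.zeros(n_out)

hidden_states, outputs = rnn_forward(x_seq, W_xh, W_hh, W_ho, b_h, b_o)
print(hidden_states.shape)  # (3, 2)
print(outputs.shape)        # (3, 1)
```

The same three weight matrices are reused at every time step; only `h` changes from one iteration to the next, which is what carries the "memory".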

$h^{(t)}$ can also be computed by **concatenating** $W_{xh}$ and $W_{hh}$. If so, the formula for computing $h^{(t)}$ becomes the following:

$$
h^{(t)} = \sigma_{h}([W_{xh}; W_{hh}] \begin{bmatrix}x^{(t)} \\ h^{(t - 1)}\end{bmatrix}+ b_h)
$$
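A quick numerical check of this equivalence, as a sketch with random values (the sizes are hypothetical and $\sigma_h$ is assumed to be $\tanh$): stacking $W_{xh}$ and $W_{hh}$ side by side and stacking $x^{(t)}$ on top of $h^{(t - 1)}$ gives the same result as the two separate products.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 5, 2  # hypothetical sizes

W_xh = rng.normal(size=(n_hidden, n_in))
W_hh = rng.normal(size=(n_hidden, n_hidden))
b_h = rng.normal(size=n_hidden)
x_t = rng.normal(size=n_in)
h_prev = rng.normal(size=n_hidden)

# Two separate matrix-vector products.
separate = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# One product with the concatenated matrix and the stacked vector.
W_h = np.hstack([W_xh, W_hh])            # shape (n_hidden, n_in + n_hidden)
stacked = np.concatenate([x_t, h_prev])  # shape (n_in + n_hidden,)
combined = np.tanh(W_h @ stacked + b_h)

same = np.allclose(separate, combined)
print(same)  # True
```

The order matters: $W_{xh}$ comes first in the concatenated matrix because $x^{(t)}$ comes first in the stacked vector.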

To clarify this further, the following illustration shows the process of computing the activation in the $t^{th}$ output unit:

![Computing activation](./images/img-7.jpg)

# Long Short-Term Memory Cells

LSTMs were introduced to overcome the vanishing gradient problem. The basic building block of an LSTM is a *memory cell*. This cell replaces the hidden layer of standard RNNs. Here is a block diagram of an LSTM memory cell:

![LSTM memory cell](./images/img-8.jpg)

There is *a lot* going on in this picture. At least, that's what I thought the first time I saw it! Thanks to this [video](https://youtu.be/YCzL96nL7j0) from StatQuest on YouTube, I got a better understanding. I am grateful to him for the insight he provided in this video. Hopefully, I will be able to properly share my understanding as well. But you're more than welcome to watch the original video.

The flow of information in the cell is controlled by **computation units** (also known as "*gates*"). There are three main gates that I will soon discuss:

- The **"forget" gate** with the little $f$ on top of it.

- The **"input" gate** with the little $i$ on top of it.

- The **"output" gate** with the little $o$  next to it.

Now, I will talk about how these gates work together to make LSTMs "work". In the following sections, I discuss how values are computed in an LSTM cell and, most importantly, what they "mean".

Once again, my understanding of LSTMs was greatly facilitated by the following illustration (i.e. frame) sourced from the StatQuest video I linked earlier. It's another representation of an LSTM cell, just like the image I showed earlier. But this representation is less compact, thus easier to read. At least, for me.

![LSTM memory cell StatQuest](./images/img-9.png)

The green line running across the image is called the **cell state** and represents the **long term memory**. In the first LSTM illustration, the "long term memory" coming from the previous LSTM cell is denoted by $C^{(t -1)}$. You will also notice that there are no weights or biases on this line that modify the value directly. This allows long term memories, that is information from previous LSTM cells, to flow through a series of LSTM units without causing the gradient to explode or vanish. This mechanism allows the network to "hold onto information" longer compared to vanilla RNNs.

The pink line, denoted by $h^{(t-1)}$ in the previous illustration, is the **hidden state** of the cell. It represents short term memories. Unlike the green line, the pink line is connected to weights that directly modify it. Those weights cause individual LSTM cells in the network to have their "own way" of manipulating short term memories coming from preceding cells.

We are making progress. But I am not there yet.

Values in an LSTM cell are computed in stages, which I would like to describe. Moving forward, I will consider the original LSTM illustration I shared earlier, and will draw parallels to the StatQuest illustration when appropriate.

## LSTM Stage 1: Percent to remember

Computing values in an LSTM cell starts with **computing the value that comes out of the "forget" gate**. It is the box with the little "$f$" on top. In the StatQuest video, the "forget" gate is the blue box. Again, I recommend watching the video, it's really good.

The short term memory from the previous LSTM cell, $h^{(t-1)}$, and the current input vector, $x^{(t)}$, are fed into the first box. $h^{(t-1)}$ gets matrix multiplied with the weight matrix $W_{hf}$, while the current input vector is matrix multiplied with the weight matrix $W_{xf}$. The results of both matrix multiplications are added together, and a bias vector, $b_f$, is then added to the resulting sum. The result of all this goes through a sigmoid function $\sigma$ that squeezes it into the range $[0, 1]$. This value between 0 and 1 is the output of the forget gate "f".

The output of the "forget" gate is then **multiplied with $C^{(t-1)}$**, which is the long term memory coming from the previous LSTM cell. By multiplying these two values, we are basically *scaling the value of $C^{(t-1)}$*. Let's think about it for a second. If the output of the forget gate is $1$, then multiplying it with $C^{(t-1)}$ leaves it *unchanged*. On the other hand, if the output of the forget gate is $0$, then multiplying it with $C^{(t-1)}$ zeroes out $C^{(t-1)}$, essentially losing the information completely.

In that sense, it is fair to say that through this mechanism, the current LSTM cell is determining the **amount (i.e. percentage) of long term memory to remember**. In other words, the cell is choosing the *amount (i.e. percentage) of long term memory to forget*. That's why the scaling factor coming out of the box with the little "$f$" on top, is called the "**forget**" gate.
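Stage 1 can be sketched in a few lines of NumPy. The sizes and the random values are hypothetical; the weight names mirror $W_{xf}$, $W_{hf}$ and $b_f$ from the text. With vectors, the gate outputs one value between 0 and 1 per entry of the long term memory, and the scaling is done element by element.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, squeezes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4  # hypothetical sizes

W_xf = rng.normal(size=(n_hidden, n_in))
W_hf = rng.normal(size=(n_hidden, n_hidden))
b_f = np.zeros(n_hidden)

x_t = rng.normal(size=n_in)         # current input x^(t)
h_prev = rng.normal(size=n_hidden)  # short term memory h^(t-1)
C_prev = rng.normal(size=n_hidden)  # long term memory C^(t-1)

# Forget gate: fraction of each long term memory entry to keep.
f_t = sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)

# Scale the long term memory element by element.
C_kept = f_t * C_prev
```

An entry of `f_t` close to 1 keeps the corresponding entry of `C_prev` almost intact, while an entry close to 0 nearly erases it.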

## LSTM Stage 2: Updating the long term memory

$h^{(t-1)}$ and $x^{(t)}$ are fed into the box with the little "$i$" on top of it. In the StatQuest video, this box is the green one. $h^{(t-1)}$ and $x^{(t)}$ get matrix multiplied with weight matrices $W_{hi}$ and $W_{xi}$ respectively. The results of both matrix multiplications are added together, and the bias $b_i$ is also added to this sum. The overall result goes through the sigmoid function $\sigma$ that outputs a value between 0 and 1.

$h^{(t-1)}$ and $x^{(t)}$ are then fed into the box with the little weird $C$ on top of it. In the StatQuest video, this box is the yellow one. $h^{(t-1)}$ and $x^{(t)}$ get matrix multiplied with weight matrices $W_{hc}$ and $W_{xc}$ respectively; the two products are added together with the bias $b_c$. The result this time goes through a $\tanh$ activation function that squeezes the input between -1 and 1.

The outputs of the box with the little $i$ and the box with the little $C$ are then multiplied together. But... what does it mean? To help us understand, let's look at what happens in the box with the little "C" on top. In this box, the short term memory and the current input are combined to **create a potential long term memory**. This value is then multiplied (i.e. scaled) by the output of the $i$ box. This output says what percentage of this potential long term memory to consider (i.e. remember). The newly scaled "potential" long term memory value is then **added** to the long term memory coming out of Stage 1, that is $C^{(t - 1)}$ already scaled by the forget gate, and that gives us a *new long term memory*: $C^{(t)}$.

I personally think of this addition as the current cell's *personal contribution* to the long term memory. The new value of the long term memory, $C^{(t)}$, is then propagated to the next LSTM cell.
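Putting Stage 1 and Stage 2 together, the long term memory update can be sketched as below. The function `update_cell_state`, the parameter dictionary `params`, the sizes and the random weights are all hypothetical. The new memory is the forget-gate-scaled old memory plus the input-gate-scaled candidate, all element by element.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, squeezes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def update_cell_state(x_t, h_prev, C_prev, p):
    """Return C^(t) and the intermediate values of Stages 1 and 2."""
    # Stage 1: percentage of the old long term memory to remember.
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])
    # Stage 2a: percentage of the potential memory to remember.
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])
    # Stage 2b: potential long term memory, between -1 and 1.
    C_tilde = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    # New long term memory: scaled old memory plus scaled candidate.
    C_t = f_t * C_prev + i_t * C_tilde
    return C_t, (f_t, i_t, C_tilde)

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4  # hypothetical sizes

params = {}
for gate in ("f", "i", "c"):
    params[f"W_x{gate}"] = rng.normal(size=(n_hidden, n_in))
    params[f"W_h{gate}"] = rng.normal(size=(n_hidden, n_hidden))
    params[f"b_{gate}"] = np.zeros(n_hidden)

x_t = rng.normal(size=n_in)
h_prev = rng.normal(size=n_hidden)
C_prev = rng.normal(size=n_hidden)

C_t, (f_t, i_t, C_tilde) = update_cell_state(x_t, h_prev, C_prev, params)
```

Notice that `C_prev` is only ever multiplied and added to, never passed through a weight matrix, which matches the green line having no weights or biases on it.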

## LSTM Stage 3: Updating the short term memory