# 1. Understanding Long-term Dependencies

In our previous posts, we learned how Recurrent Neural Networks can be applied to Natural Language Processing problems such as Named Entity Recognition. We also learned how to build and train language models using basic RNNs, and how to sample novel sequences from a trained language model to generate Shakespeare-like text or any other kind of text we want. If you recall, our RNN architecture looked something like this:

<img src="figures/rnn_summary.jpg" width="600px">

Over time, however, we also realized that the English language is full of dependencies between words, especially long-term dependencies, and that language models built with the basic RNN algorithm struggle with them.

Let’s take another Language Modelling example to fully understand what we mean by long-term dependencies.

<img src="figures/long_dependencies.jpg" width="600px">

If you carefully observe the two sentences above, you will notice how the word “Dog” at the very beginning of the sentence influences the word “has” which is at the very end. If we change the singular word to a plural word “Dogs”, there is a direct effect on the word “have” which is very far from the influencer word “Dogs”.

Now, the clause in between can be arbitrarily long and is not under our control. Something like, say, “The dog, which ran out of the door and sat on the neighbor’s porch for three hours and roamed the street for a couple of hours more, has come back.” When a word near the end of a very long sentence depends on a word near the beginning of that sentence, we call it a “long-term dependency”.

The basic Recurrent Neural Networks that we have seen so far are not very good at handling such long-term dependencies, mainly due to the Vanishing Gradient Problem.

## 1.1. Vanishing Gradients

Let’s understand vanishing gradients in detail. Have a look at this very deep neural network.

<img src="figures/deep_nn.jpg" width="800px">

In this 100+ layer deep neural network, we carry out forward propagation and then backpropagate the error at the output 𝑦̂ so that it can affect the computations in earlier layers. This is extremely difficult: the gradient of the error at the output will almost vanish by the time it reaches the earlier layers during backpropagation.

An RNN unrolled over a long sequence behaves much like this very deep network, with one layer per time step. Essentially, what we are demanding from our neural network is to memorize whether the noun used at the beginning of the sentence, i.e., “Dog” or “Dogs”, is singular or plural. Only then can it generate either “has” or “have” later in the sequence. Depending on the length of the middle portion of the sequence, which as we saw can be arbitrarily long, the neural network would have to memorize the singular/plural noun for a very long time.

<img src="figures/deep_rnn.jpg" width="800px">

This brings us to an important observation about basic RNN models. In such models, an output 𝑦̂⟨t⟩ is mainly influenced by the inputs close to it. It is hard for an output late in the sequence to be influenced by an input at the very start of the sequence, say 𝑥⟨1⟩, because the error has to be backpropagated through many time steps and the gradient shrinks along the way. This is the main weakness of the basic Recurrent Neural Network (RNN) algorithm.

To build a model that is good at capturing long-term dependencies, we need to focus on handling the vanishing gradient problem, as we will do in the upcoming sections of this blog post. Basic RNN models also suffer from other issues, such as “exploding gradients”, but those are easier to handle than vanishing gradients. Let’s see how.

## 1.2. Exploding Gradients

When we are backpropagating through time in our RNN model, the gradient can not only decrease exponentially (as we saw above) but also increase exponentially. These exploding gradients can be disastrous for our networks as they can cause our parameters to become so large that our network just goes bonkers!
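To get a feel for how quickly both effects kick in, here is a minimal sketch. It assumes a toy setting in which every backward step through time multiplies the gradient by the same constant factor; a real RNN multiplies by Jacobian matrices instead, but the compounding behaves similarly. The function name and the factors 0.9 and 1.1 are made up for illustration.

```python
def gradient_after_steps(factor, steps, initial=1.0):
    """Scale `initial` by `factor` once per backward step through time."""
    grad = initial
    for _ in range(steps):
        grad *= factor
    return grad


# A factor slightly below 1 shrinks the gradient, slightly above 1 blows it up.
for factor in (0.9, 1.1):
    print(factor, gradient_after_steps(factor, steps=100))
```

After 100 steps, the factor 0.9 leaves a gradient on the order of 0.00003, while the factor 1.1 produces one larger than 10,000.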

The silver lining with exploding gradients is that they are easier to spot than vanishing gradients. You will often see NaN (Not a Number) values show up in the loss or the parameters, which means there has been a numerical overflow in our neural network computations.

We can solve the problem of exploding gradients by applying gradient clipping. This means rescaling the gradient vector whenever its magnitude exceeds a chosen threshold, so that it never gets larger than that maximum value. It is a simple and fairly robust solution for exploding gradients.
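Clipping by norm can be sketched in a few lines. This is a minimal illustration rather than any particular library's implementation: the function name `clip_by_norm`, the threshold `max_norm` and the example gradient are all assumptions made for this example, and the gradient is a plain list of floats.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)        # small enough: leave it unchanged
    scale = max_norm / norm      # shrink the length, keep the direction
    return [g * scale for g in grad]


exploding = [30.0, -40.0]        # L2 norm is 50
clipped = clip_by_norm(exploding, max_norm=5.0)
print(clipped)                   # same direction, norm of about 5
```

In practice you rarely need to write this yourself. Keras optimizers, for example, accept a `clipnorm` argument that applies this kind of rescaling during training.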

Exploding gradients might look dangerous, but they are easy to fix. Vanishing gradients, however, are quite tricky. The solutions we will explore in the next section require modifying the hidden layers of our RNN model. These new models are called GRUs, or Gated Recurrent Units, and they will help us capture long-term dependencies more easily than our current models.

# 2. Gated Recurrent Unit (GRU)

As we learned in the previous section, vanishing gradients make it hard for RNN models, especially when they are unrolled over long sequences, to capture long-term dependencies. This problem can be addressed with a modified hidden layer for our Recurrent Neural Network, called a GRU (Gated Recurrent Unit).

First, let’s recall what the hidden layer of our basic RNN looks like:

<img src="figures/rnn_unit.jpg" width="500px">

The activation value of the RNN at time $t$ is computed as:

$$a^{<t>}=g\left(W_{a}\left[a^{<t-1>}, x^{<t>}\right]+b_{a}\right)$$
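As a rough sketch, the formula above can be written in NumPy as follows. It assumes that the activation function $g$ is tanh and that $\left[a^{<t-1>}, x^{<t>}\right]$ means stacking the two vectors. The layer sizes and the random parameters are illustrative only.

```python
import numpy as np


def rnn_step(a_prev, x_t, W_a, b_a):
    """One forward step: a<t> = tanh(W_a [a<t-1>, x<t>] + b_a)."""
    stacked = np.concatenate([a_prev, x_t])      # [a<t-1>, x<t>]
    return np.tanh(W_a @ stacked + b_a)


rng = np.random.default_rng(0)
n_a, n_x = 4, 3                                  # hidden and input sizes (illustrative)
W_a = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
b_a = np.zeros(n_a)

a = np.zeros(n_a)                                # a<0>
for x_t in rng.normal(size=(5, n_x)):            # five time steps of random input
    a = rnn_step(a, x_t, W_a, b_a)
print(a.shape)
```

The same `W_a` and `b_a` are reused at every time step, which is why the gradient ends up being multiplied by closely related factors over and over during backpropagation through time.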


## GRU Architecture

Let’s recall the example sentence we previously used: “The dog, which ran out …, has come back.”

A GRU reads this sentence in much the same way as a basic RNN unit, with a few modifications. One is the introduction of a memory variable called $c$. The job of this memory variable is to remember whether the “dog” was singular or plural so that this information can be used in the later part of the sentence. At time $t$, this memory cell has some value $c^{⟨t⟩}$. The GRU unit outputs an activation value $a^{⟨t⟩}$, which is equal to the memory cell value $c^{⟨t⟩}$. Even though the two are equal in a GRU, we will use separate symbols for the memory cell value and the output activation value, because they will no longer be equal when we move on to Long Short-Term Memory (LSTM) units later in this post.

At every time step, we will consider overwriting the memory variable with a candidate value $ \tilde{c}^{\left \langle t \right \rangle} $. The candidate is computed by applying the tanh activation function to the parameter matrix $w_c$ multiplied by the previous memory cell value $c^{⟨t-1⟩}$ (which is also the previous activation value) stacked with the current input value $x^{⟨t⟩}$, plus the bias $b_c$. This is what the equation looks like:

$$\tilde{c}^{\left \langle t \right \rangle}= tanh\left ( w_{c}\begin{bmatrix} c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle}\end{bmatrix} +b_{c}\right )$$

The key idea of the Gated Recurrent Unit (GRU) is that it literally creates a gate that decides whether or not the memory cell should be updated at the current time step. This gate is represented by $\Gamma _{u}$, where u stands for update, and for intuition we can think of its value as being either 0 or 1: 1 means “overwrite the memory cell with the candidate” and 0 means “keep the old value”. Our candidate $\tilde{c}^{⟨t⟩}$ for replacing $c^{⟨t⟩}$ is passed through this gate, and the gate decides at which time steps the candidate is actually used. In our example, the gate would be close to 1 at the word “dog”, so that the memory cell records whether the noun is singular or plural, and close to 0 for the words that follow, so that the stored value is still available at the word “has”.

We calculate the gate value using a sigmoid function as represented below:


$$\Gamma _{u}= \sigma \left ( w_{u}\begin{bmatrix} c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle}\end{bmatrix} +b_{u}\right )$$

<img src="figures/sigmoid.jpg" width="400px">

In reality, the sigmoid function outputs values strictly between 0 and 1, and for most of its input range the output is very close to either 0 or 1. For intuition purposes, we treat the gate value as exactly 0 or exactly 1.
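A quick numerical check shows how fast the sigmoid saturates. This is only a small sketch, and the sample inputs are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-10, -2, 0, 2, 10):
    print(z, sigmoid(z))
```

At $z=-10$ the output is already below 0.0001 and at $z=10$ it is above 0.9999, which is why a gate can behave almost like an on/off switch.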


Now, coming to the key GRU equation, which uses the gate to decide how the memory cell is updated:


$$c^{\left \langle t \right \rangle}= \Gamma _{u}\times \tilde{c}^{\left \langle t \right \rangle}+\left ( 1-\Gamma _{u} \right )\times c^{\left \langle t-1 \right \rangle}$$

At the word “dog”, the memory cell value c will be set to 1 (assuming 1 means singular) or 0 (if the word is “dogs” and not “dog”, i.e., plural). The GRU unit will hold on to this value of $c^{⟨t⟩}$ all the way until the word “has” or “have”. The job of our gate $\Gamma _{u}$ is to decide, as the words are read one by one, when the memory cell value should be updated, for example when the singular/plural information changes. Once the memory cell value has been used, the gate can update it again, since there is no further need to memorize that value.


<img src="figures/plural.jpg" width="500px">


## Simplified Notation For GRU

Let us now compile all the concepts we have learned so far about GRU and present a simplified Gated Recurrent Unit with three key equations:


$$\tilde{c}^{\left \langle t \right \rangle}= tanh\left ( w_{c}\begin{bmatrix} c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle}\end{bmatrix} +b_{c}\right )$$

$$\Gamma _{u}= \sigma \left ( w_{u}\begin{bmatrix} c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle}\end{bmatrix} +b_{u}\right )$$

$$c^{\left \langle t \right \rangle}= \Gamma _{u}\times \tilde{c}^{\left \langle t \right \rangle}+\left ( 1-\Gamma _{u} \right )\times c^{\left \langle t-1 \right \rangle}$$
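The third equation is the one that lets the memory survive a long stretch of words. The sketch below applies it with hand-picked scalar values instead of learned ones: the words, candidate values and gate values are all made up for illustration, with 1.0 standing for “singular”.

```python
def update_memory(c_prev, c_candidate, gate_u):
    """c<t> = gate_u * c~<t> + (1 - gate_u) * c<t-1>."""
    return gate_u * c_candidate + (1.0 - gate_u) * c_prev


# (word, candidate value, update gate); every number here is illustrative
steps = [
    ("dog",   1.0, 1.0),   # gate open: store "singular"
    ("which", 0.3, 0.0),   # gate closed: keep the stored value
    ("ran",  -0.7, 0.0),
    ("out",   0.5, 0.0),
    ("has",   0.0, 0.0),   # the memory still says "singular" here
]

c = 0.0
history = []
for word, candidate, gate in steps:
    c = update_memory(c, candidate, gate)
    history.append((word, c))
print(history)
```

Because the gate stays at 0 after “dog”, the term $\left ( 1-\Gamma _{u} \right )\times c^{\left \langle t-1 \right \rangle}$ simply copies the old value forward, no matter what the candidates are.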

<img src="figures/gru_simplified.jpg" width="700px">

We will make just one minor change to the first equation to get the full GRU algorithm. By adding another gate $\Gamma_r$ to the calculation of the new candidate value of the memory cell, we can control how relevant $ c^{\left \langle t-1 \right \rangle}$ is in calculating the next candidate for $ c^{\left \langle t \right \rangle}$. We also add another parameter matrix, $w_{r}$, to compute this relevance gate $\Gamma_r$. So, our revised first equation is now this:

$$\tilde{c}^{\left \langle t \right \rangle}= tanh\left ( w_{c}\begin{bmatrix} \Gamma _{r}\times c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle} \end{bmatrix} +b_{c}\right )$$

And in the equation above, the relevance gate $ \Gamma _{r}$ is computed as follows:

$$\Gamma _{r}= \sigma \left ( w_{r}\begin{bmatrix} c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle} \end{bmatrix} +b_{r}\right )$$
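Putting the update gate, the relevance gate and the candidate together, one step of the full GRU can be sketched in NumPy as below. This is a minimal illustration under some assumptions: the bracket notation is taken to mean stacking vectors, the multiplications by gates are element-wise, and the sizes, the `params` dictionary and the random initial values are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(c_prev, x_t, p):
    """One GRU step; `p` maps names like "w_u" and "b_u" to parameters."""
    concat = np.concatenate([c_prev, x_t])             # [c<t-1>, x<t>]
    gate_r = sigmoid(p["w_r"] @ concat + p["b_r"])     # relevance gate
    gate_u = sigmoid(p["w_u"] @ concat + p["b_u"])     # update gate
    gated = np.concatenate([gate_r * c_prev, x_t])     # [gate_r * c<t-1>, x<t>]
    c_cand = np.tanh(p["w_c"] @ gated + p["b_c"])      # candidate memory value
    # Element-wise blend of the candidate and the old memory
    return gate_u * c_cand + (1.0 - gate_u) * c_prev   # c<t>, which is also a<t>


rng = np.random.default_rng(0)
n_c, n_x = 4, 3                                        # illustrative sizes
params = {}
for name in ("c", "u", "r"):
    params[f"w_{name}"] = rng.normal(scale=0.1, size=(n_c, n_c + n_x))
    params[f"b_{name}"] = np.zeros(n_c)

c = np.zeros(n_c)                                      # c<0>
for x_t in rng.normal(size=(6, n_x)):                  # six time steps
    c = gru_step(c, x_t, params)
print(c)
```

Each entry of the memory vector has its own gate values, so the unit can keep some entries untouched for many steps while overwriting others.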

# 3. Long Short-Term Memory (LSTM) Unit

As we learned in the earlier section, there is no single way to solve the problem of long-term dependencies or long-range connections. The Gated Recurrent Unit (GRU) is one commonly used option. Another, more general and often more powerful option is the Long Short-Term Memory (LSTM) unit, which we will look into now.

Before we lay down the equations and notations for LSTM, let’s quickly recap the equations for GRU:

$$\tilde{c}^{\left \langle t \right \rangle}= tanh\left ( w_{c}\begin{bmatrix} \Gamma _{r}\times c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle} \end{bmatrix} +b_{c}\right )$$

$$\Gamma _{r}= \sigma \left ( w_{r}\begin{bmatrix} c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle} \end{bmatrix} +b_{r}\right )$$

$$\Gamma _{u}= \sigma \left ( w_{u}\begin{bmatrix} c^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle}\end{bmatrix} +b_{u}\right )$$

$$c^{\left \langle t \right \rangle}= \Gamma _{u}\times \tilde{c}^{\left \langle t \right \rangle}+\left ( 1-\Gamma _{u} \right )\times c^{\left \langle t-1 \right \rangle}$$

$$a^{\left \langle t \right \rangle}= c^{\left \langle t \right \rangle}$$

In the case of GRU, we had $a^{⟨t⟩}=c^{⟨t⟩}$. We also had two gates, the update gate and the relevance gate. The update gate $\Gamma _{u}$ decides whether or not to update the memory cell value $c^{⟨t⟩}$ using the candidate value $\tilde{c}^{⟨t⟩}$, and the relevance gate $\Gamma _{r}$ decides how much $c^{⟨t-1⟩}$ contributes to that candidate.

## Notation And Architecture of LSTM Units

The Long Short-Term Memory unit was introduced in a seminal paper by Sepp Hochreiter and Jürgen Schmidhuber that has had a huge impact on sequence modeling. The paper goes quite deep into the theory of vanishing gradients.

Let us look at the equations that govern LSTM Units, as learned from the research paper by Hochreiter and Schmidhuber.

$$\tilde{c}^{\left \langle t \right \rangle}= tanh\left ( w_{c}\begin{bmatrix} a^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle}\end{bmatrix} +b_{c}\right )$$

$$\Gamma _{f}= \sigma \left ( w_{f}\begin{bmatrix} a^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle} \end{bmatrix} +b_{f}\right )$$

$$\Gamma _{u}= \sigma \left ( w_{u}\begin{bmatrix} a^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle} \end{bmatrix} +b_{u}\right )$$

$$\Gamma _{o}= \sigma \left ( w_{o}\begin{bmatrix} a^{\left \langle t-1 \right \rangle}, &x^{\left \langle t \right \rangle} \end{bmatrix} +b_{o}\right )$$

$$c^{\left \langle t \right \rangle}= \Gamma _{u}\ast  \tilde{c}^{\left \langle t \right \rangle}+\Gamma _{f}\times c^{\left \langle t-1 \right \rangle}$$

$$a^{\left \langle t \right \rangle}=\Gamma _{o} \times tanh\left ( c^{\left \langle t \right \rangle} \right )$$

In the case of LSTM, the first difference from GRU is that $a^{⟨t⟩}=c^{⟨t⟩}$ no longer holds. The gates and the candidate value are now computed from $a^{⟨t-1⟩}$ rather than $c^{⟨t-1⟩}$. In addition, we won’t be using the relevance gate, $\Gamma _{r}$. We could create a variation of LSTM that uses this relevance gate, but the common version of LSTM doesn’t require it. The update gate $\Gamma _{u}$ is still there, as in the GRU, but it too is computed from $a^{⟨t-1⟩}$.

One new addition in the LSTM unit is another gate that uses the sigmoid function, which we call the forget gate. This forget gate $\Gamma _{f}$ is used in place of the term $1−\Gamma _{u}$, so the LSTM can decide separately how much of the old memory to keep and how much of the candidate to add. We also have a new output gate $\Gamma _{o}$, which is again computed with a sigmoid. The update to the memory cell is $c^{\left \langle t \right \rangle}= \Gamma _{u}\times  \tilde{c}^{\left \langle t \right \rangle}+\Gamma _{f}\times c^{\left \langle t-1 \right \rangle}$.

As with GRUs, the multiplications in the above equation are element-wise multiplications between vectors, not matrix products.
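One LSTM step can be sketched in NumPy as follows. Treat this as an illustration under stated assumptions rather than a reference implementation: the candidate and all three gates are computed from $a^{⟨t-1⟩}$ and $x^{⟨t⟩}$ stacked together, and a tanh is applied to the memory cell before the output gate, as in the commonly used LSTM formulation. The sizes, the `params` dictionary and the random values are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step; returns the new activation a<t> and memory cell c<t>."""
    concat = np.concatenate([a_prev, x_t])             # [a<t-1>, x<t>]
    c_cand = np.tanh(p["w_c"] @ concat + p["b_c"])     # candidate memory value
    gate_u = sigmoid(p["w_u"] @ concat + p["b_u"])     # update gate
    gate_f = sigmoid(p["w_f"] @ concat + p["b_f"])     # forget gate
    gate_o = sigmoid(p["w_o"] @ concat + p["b_o"])     # output gate
    c_t = gate_u * c_cand + gate_f * c_prev            # element-wise products
    a_t = gate_o * np.tanh(c_t)                        # assumption: tanh before output gate
    return a_t, c_t


rng = np.random.default_rng(0)
n_a, n_x = 4, 3                                        # illustrative sizes
params = {}
for name in ("c", "u", "f", "o"):
    params[f"w_{name}"] = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
    params[f"b_{name}"] = np.zeros(n_a)

a, c = np.zeros(n_a), np.zeros(n_a)                    # a<0> and c<0>
for x_t in rng.normal(size=(3, n_x)):                  # three time steps
    a, c = lstm_step(a, c, x_t, params)
print(a.shape, c.shape)
```

Unlike the GRU, the memory cell and the activation are returned separately here, because the output gate decides how much of the memory is exposed at each step.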

In total, the LSTM uses three gates: the update gate, the forget gate, and the output gate. Let’s see what the LSTM architecture looks like.


<img src="figures/lstm_unit.jpg" width="700px">


Note: The diagrams we have used in this blog post for LSTM are inspired by a blog post by Chris Olah, titled ‘Understanding LSTM Networks’. So, big thanks to Chris!

Now, as we can see in the above diagram, all the gate values (forget gate, update gate and output gate) are computed using $a^{⟨t-1⟩}$ and $x^{⟨t⟩}$. These two values also go through a tanh function to calculate the candidate value $\tilde{c}^{⟨t⟩}$. All these values are then combined using element-wise multiplication to get ${c}^{⟨t⟩}$ from the previous ${c}^{⟨t-1⟩}$.

Let’s try chaining this LSTM unit with the subsequent units, one per time step, to see how propagation works in LSTM.

<img src="figures/lstm_connected_units.jpg" width="700px">

Just like in any forward propagation, the LSTM units receive their respective inputs $x^{⟨1⟩}$, $x^{⟨2⟩}$ and $x^{⟨3⟩}$. Each unit outputs an activation value, say, $a^{⟨1⟩}$, which then becomes the input $a^{⟨t-1⟩}$ for the next time step. We can simplify the above diagram further and notice how easy it is for LSTM units to take some value $c^{⟨0⟩}$ and keep it memorized all the way to the end of the sequence, giving $c^{⟨3⟩}=c^{⟨0⟩}$, as long as the forget and update gates are set appropriately.

<img src="figures/lstm_connected_units_2.jpg" width="700px">

This is the real advantage of LSTM units, and of GRUs as well: they are very good at memorizing certain values, even over a very long time.

# 5. RNN: Vanishing Gradients, GRU and LSTM

* Basic RNNs are not good at capturing long-term dependencies or long-range connections
* Vanishing gradients and exploding gradients are both common problems with basic RNNs; exploding gradients can be handled with gradient clipping
* Gated Recurrent Units (GRU) are simple and fast, and their update gate goes a long way towards mitigating the vanishing gradient problem
* Long Short-Term Memory (LSTM) units are slightly more complex, with separate update, forget and output gates, and are often more powerful
* There is no clear winner between GRU and LSTM
* Many other variations of GRU and LSTM are possible

# 6. Bidirectional RNN

<img src="figures/brnn.png" width="700px">


# 7. Deep RNN

<img src="figures/deep_rnn.png" width="700px">


# TF Keras Implementation

```python
import tensorflow as tf
from tensorflow.keras import layers

# Three integer-encoded sequences of different lengths
raw_inputs = [[83, 91, 1, 645, 1253, 927], [73, 8, 3215, 55, 927], [711, 632, 71]]

# Pad with zeros at the end so that every sequence has the same length
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs,
                                                              padding='post')
print(padded_inputs)

# No masking here: the padded zeros are embedded like any other token
embedding = layers.Embedding(input_dim=5000, output_dim=2)
embedded_inputs = embedding(padded_inputs)

# return_sequences=True returns the LSTM output at every time step
lstm = layers.LSTM(6, return_sequences=True)
lstm_output = lstm(embedded_inputs)

# try return_sequences=False
# try bidirectional rnn layer

print(lstm_output)
```

Output (truncated):

```
[[  83   91    1  645 1253  927]
 [  73    8 3215   55  927    0]
 [ 711  632   71    0    0    0]]
tf.Tensor(
[[[ 1.4606985e-03  2.0931803e-03  3.1025200e-03  5.7469361e-04
   -6.9825188e-04 -1.8627032e-03]
  [-2.4986619e-04  3.7122189e-04 -5.6528294e-04 -5.8170420e-04
   -3.4507827e-06 -1.3484832e-05]
  [ 3.7245403e-04  1.3702239e-03 -3.0252014e-03 -3.1880827e-03
    1.1276641e-03  1.9911020e-03]
  [ 7.0879387e-04  9.8479970e-04  1.0087808e-03 -5.0106853e-05
   -5.1621342e-04 -5.8800390e-04]
  [ 2.0653524e-03  3.3100978e-03  2.7797094e-03 -5.6353566e-04
   -7.1271160e-04 -1.6459689e-03]
  [ 1.7910317e-03  2.9859166e-03  8.9065228e-03  4.1447533e-03
   -3.3387528e-03 -6.4124120e-03]]

 [[ 1.2746120e-03  2.1944484e-03 -1.1191622e-03 -2.5478089e-03
    8.5613702e-04  1.1107987e-03]
  [ 9.1523631e-04  1.8221253e-03 -3.6056302e-03 -4.0208520e-03
    1.4223986e-03  2.7387633e-03]
  [ 1.5726050e-03  1.7777006e-03  2.8914653e-03  5.8847020e-04
   -1.0767709e-03 -1.5257648e-03]
  ...
```

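```python
import tensorflow as tf
from tensorflow.keras import layers

# Three integer-encoded sequences of different lengths
raw_inputs = [[83, 91, 1, 645, 1253, 927], [73, 8, 3215, 55, 927], [711, 632, 71]]

# Pad with zeros at the end so that every sequence has the same length
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs,
                                                              padding='post')
print(padded_inputs)

# mask_zero=True marks index 0 as padding, so mask-aware layers
# such as LSTM can skip the padded time steps
embedding = layers.Embedding(input_dim=5000, output_dim=2, mask_zero=True)
masked_output = embedding(padded_inputs)

# return_sequences=True returns the LSTM output at every time step
lstm = layers.LSTM(6, return_sequences=True)
lstm_output = lstm(masked_output)

# try return_sequences=False
# try bidirectional rnn layer

print(lstm_output)
```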

Output (truncated):

```
[[  83   91    1  645 1253  927]
 [  73    8 3215   55  927    0]
 [ 711  632   71    0    0    0]]
tf.Tensor(
[[[-2.3462430e-03 -2.5434341e-04 -7.8200661e-03  5.8227386e-03
   -5.3193527e-03  6.4351060e-03]
  [-6.3073933e-03  1.6863116e-03 -9.8816315e-03  9.4954576e-03
   -8.3921272e-03  8.8143321e-03]
  [-7.5757806e-03  4.0312167e-03 -2.4713585e-03  6.1046337e-03
   -5.2855238e-03  3.4030250e-03]
  [-2.9076454e-03  1.9918033e-03 -4.7708245e-04  3.2860762e-03
   -3.2701399e-03  1.2332930e-03]
  [-4.1460395e-03  2.8077299e-03  1.2275530e-03  2.1052039e-03
   -2.1505619e-03 -3.1639775e-04]
  [-5.8850911e-03  3.2690505e-03 -8.4176316e-04  3.3044540e-03
   -3.4416821e-03  1.1254834e-03]]

 [[-8.5104775e-04  2.9248875e-04 -8.6646876e-04  8.0311514e-04
   -7.2057929e-04  7.4657972e-04]
  [-4.4985008e-03  2.3430372e-03 -5.9190422e-04  1.7068068e-03
   -1.4912287e-03  8.8470429e-04]
  [-8.4528839e-03  4.8773363e-03  8.9857931e-04  1.6667370e-03
   -1.7453183e-03  5.9934442e-05]
  ...
```