# Introducing Recurrent Neural Networks (RNNs)

An RNN can have multiple architectures. Some of the possible ways of architecting
an RNN are as follows:

![nlp_common_architectures](../imgs/nlp0.png)

In the preceding diagram, the boxes at the bottom are the input layer, the boxes in
the middle are the hidden layer, and the boxes at the top are the output layer.
The one-to-one architecture is a typical neural network with a hidden layer between
the input and output layers. Examples of the other architectures are as follows:

- One-to-many: The input is an image and the output is a caption of the
image.

- Many-to-one: The input is a movie review (multiple words in input) and
the output is the sentiment associated with the review.

- Many-to-many: Machine translation of a sentence in one language to a
sentence in another language.

## The idea behind the need for RNN architecture

RNNs are useful when we want to predict the next event given a sequence of events.
An example of that could be to predict the word that comes after this: <i>This is an ___</i>.

Let's say that in reality, the sentence is <i>This is an example</i>.

Traditional text-mining techniques would solve the problem in the following way:

1. Encode each word while having an additional index for potential new
words:

<i>This</i> : {1, 0, 0, 0}

<i>is</i> : {0, 1, 0, 0}

<i>an</i> : {0, 0, 1, 0}

2. Encode the phrase <i> This is an </i>:

<i> This is an </i> : {1, 1, 1, 0}

3. Create the training dataset:

Input --> {1, 1, 1, 0}

Output --> {0, 0, 0, 1}

4. Build a model with the given input and output combination.

One of the major drawbacks of the model is that the input representation
does not change with the order of the words in the input sentence: it is the same
whether the sentence is <i>this is an</i>, <i>an is this</i>, or <i>this an is</i>.

However, intuitively, we know that each of the preceding sentences is
different and cannot be represented by the same structure mathematically.
This calls for having a different architecture, which looks as follows:

![nlp](../imgs/nlp1.png)

In the preceding architecture, each of the individual words from the sentence enters
an individual box in the input boxes. This ensures that we preserve the structure of
the input sentence; for example, <i>this</i> enters the first box, <i>is</i> enters the second
box, and <i>an</i> enters the third box. The output box at the top will be the output, that
is, <i>example</i>.
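To make the difference concrete, here is a minimal sketch that compares the two representations. The helper names (`one_hot`, `bag_of_words`, `as_sequence`) are hypothetical and chosen for illustration; the vocabulary and the extra index for unseen words follow the encoding steps described earlier.

```python
import numpy as np

# Vocabulary index; the last slot is reserved for potential new words.
vocab = {"this": 0, "is": 1, "an": 2}
UNKNOWN = 3
SIZE = 4

def one_hot(word):
    # One-hot vector for a single word; unseen words map to the last index.
    vec = np.zeros(SIZE, dtype=int)
    vec[vocab.get(word.lower(), UNKNOWN)] = 1
    return vec

def bag_of_words(sentence):
    # Element-wise maximum of the one-hot vectors: word order is discarded.
    return np.max([one_hot(w) for w in sentence.split()], axis=0)

def as_sequence(sentence):
    # One row per word (one per input box): word order is preserved.
    return np.stack([one_hot(w) for w in sentence.split()])

a = bag_of_words("this is an")
b = bag_of_words("an is this")
same_bag = np.array_equal(a, b)
same_seq = np.array_equal(as_sequence("this is an"), as_sequence("an is this"))
print(a, same_bag, same_seq)  # [1 1 1 0] True False
```

The single combined vector is identical for both word orders, whereas the per-word sequence differs, which is the property the architecture above relies on.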



## Exploring the structure of an RNN

You can think of an RNN as a mechanism to hold memory – where the hidden layer
contains the memory. The unfolded version of an RNN is as follows:

![nlp](../imgs/nlp3.png)

The network on the right is an unrolled version of the network on the left. The network
on the right takes one input in each time step and extracts the output at each time
step.

Note that while predicting the output of the third time step, we are incorporating
values from the first two time steps through the hidden layer, which is connecting the
values across time steps.

Let's explore the preceding diagram:

- The u weight represents the weights that connect the input layer to the
hidden layer.

- The w weight represents the hidden layer to the hidden layer connection.

- The v weight represents the hidden layer to the output layer connection.

The output in a given time step depends on both the input in the current time step
and the hidden layer value from the previous time step. Because the previous time
step's hidden layer is fed in alongside the current time step's input, each step
receives information from the time steps before it. This way, we are creating a chain
of connections that enables memory storage.
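The recurrence described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a reference implementation: the `tanh` activation, the absence of bias terms, the layer sizes, and the function name `rnn_forward` are all assumptions made for the example; only the roles of the weights (input to hidden, hidden to hidden, hidden to output) come from the text.

```python
import numpy as np

def rnn_forward(xs, U, W, V, h0=None):
    """Run a simple RNN over a sequence; return hidden states and outputs."""
    hidden_size = W.shape[0]
    h = np.zeros(hidden_size) if h0 is None else h0
    hs, outs = [], []
    for x in xs:
        # New hidden state mixes the current input (via U) with the
        # previous hidden state (via W). tanh is an assumed activation.
        h = np.tanh(U @ x + W @ h)
        hs.append(h)
        # The output at this time step is read from the hidden state (via V).
        outs.append(V @ h)
    return np.stack(hs), np.stack(outs)

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, steps = 4, 3, 2, 3
U = rng.normal(size=(hidden_size, input_size))   # input -> hidden
W = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
V = rng.normal(size=(output_size, hidden_size))  # hidden -> output
xs = np.eye(input_size)[:steps]                  # three one-hot inputs

hs, outs = rnn_forward(xs, U, W, V)

# Change only the first input and rerun: the third output changes too,
# because the first input reaches it through the hidden state.
xs_alt = xs.copy()
xs_alt[0] = np.eye(input_size)[3]
_, outs_alt = rnn_forward(xs_alt, U, W, V)
print(hs.shape, outs.shape, np.array_equal(outs[2], outs_alt[2]))
```

Note that the same three weight matrices are reused at every time step; only the hidden state changes as the sequence is consumed.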



## Why store memory?

Memory is needed because, in the preceding example and in text generation in
general, the next word depends not only on the immediately preceding word but also
on the context provided by the words that come before it.

Given that we are looking at the preceding words, there should be a way to keep
them in memory so that we can predict the next word more accurately.

The memory should also preserve order; more often than not, the recent words are
more useful in predicting the next word than the words that are further away from
the word to predict.

A traditional RNN that takes multiple time steps into account for giving predictions
can be visualized as follows:

![nlp](../imgs/nlp4.png)

Notice that as the time step increases, the impact of the input present at a much
earlier time step (time step 1) would be lower on the output at a much later time step
(time step 7). An example of this can be seen here (for a moment, let's ignore the bias
term and assume that the hidden layer input at time step 1 is 0 and we are predicting
the value of the hidden layer at time step 5 – $h_5$ ):

![nlp](../imgs/nlp5.png)

You can see that as the time step increases, the value of the hidden layer ($h_5$)
depends heavily on $X_1$ if $U>1$; however, it is much less dependent on $X_1$ if $U<1$.

The dependency on the $U$ matrix can also result in the hidden layer ($h_5$) value being
very small, which leads to a vanishing gradient when the value of $U$ is very small,
and can cause exploding gradients when the value of $U$ is very high.
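A quick numeric check makes this effect visible. The sketch below assumes a scalar, linear recurrence $h_t = w_{in} x_t + U h_{t-1}$ with no bias and no activation, where $U$ plays the role of the recurrent weight as in the paragraph above; `w_in` and the function name are hypothetical and used only for illustration.

```python
def contribution_of_first_input(u, steps=5, w_in=1.0, x1=1.0):
    """Contribution of x1 to h_steps for h_t = w_in * x_t + u * h_{t-1}."""
    h = 0.0  # hidden state before the first time step
    for t in range(1, steps + 1):
        # Later inputs are set to zero so that only x1 contributes.
        x_t = x1 if t == 1 else 0.0
        h = w_in * x_t + u * h
    return h

small = contribution_of_first_input(0.5)  # u < 1: contribution shrinks as u**4
large = contribution_of_first_input(2.0)  # u > 1: contribution grows as u**4
print(small, large)  # 0.0625 16.0
```

After four recurrent steps, the first input is scaled by $U^4$, so its influence on $h_5$ either fades or blows up depending on whether $U$ is below or above 1.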

The preceding phenomenon causes problems when predicting the next word depends on
words far back in the sequence (a long-term dependency). To solve this problem, we'll
use the LSTM architecture.

# Long Short-Term Memory - LSTM architecture

In the previous section, we learned about how a traditional RNN faces a vanishing or
exploding gradient problem resulting in it not being able to accommodate long-term
memory. In this section, we will learn about how to leverage LSTM to get around this
problem.

In order to further understand the scenario with an example, let's consider the
following sentence:

<i> I am from England. I speak __. </i>

In the preceding sentence, intuitively, we know that the majority of the people from
England speak English. The blank value to be filled (English) is obtained from the fact
that the person is from England. While in this scenario we have the signal word
(England) close to the blank value, in a realistic scenario, we might find that the
signal word is far away from the blank space (the word we are trying to predict).
When the distance between the signal word and blank value is large, the predictions
through traditional RNNs might be wrong because of the vanishing or exploding
gradient phenomenon. LSTM addresses this scenario – which we will learn about in
the following section.



## The working details of LSTM

A standard LSTM architecture is as follows:

![nlp](../imgs/nlp6.png)

In the preceding diagram, you can see that while input $X$ and output $h$ remain similar
to what we saw in the <i>Exploring the structure of an RNN</i> section, the computations
that happen between the input and output are different in LSTM. Let's understand the
various activations that happen between the input and output:



![nlp](../imgs/nlp7.png)

In the preceding diagram, we can observe the following:

- $X$ and $h$ represent the input and output at time step $t$.

- $C$ represents the cell state. This potentially helps in storing long-term
memory.

- $C_{t-1}$ is the cell state that is transferred from the previous time step.

- $h_{t-1}$ represents the output of the previous time step.

- $f_t$ represents activations that help with forgetting certain information.

- $i_t$ represents the transformation corresponding to the input combined with
the previous time step's output ($h_{t-1}$).

The content that needs to be forgotten, $f_t$ , is obtained as follows:

![nlp](../imgs/nlp8.png)

Note that $W_{xf}$ and $W_{hf}$ represent the weights associated with the input and the
previous hidden layer, respectively.

The cell state is updated by multiplying the cell state from the previous time step,
$C_{t-1}$, by the forget activation, $f_t$.

The updated cell state is as follows:

![nlp](../imgs/nlp9.png)
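The forgetting step can be sketched as follows. This is a sketch under assumptions: it uses the standard LSTM formulation in which $f_t$ is a sigmoid of the weighted input and previous output, followed by an element-wise product with $C_{t-1}$; the bias term `b_f`, the function name `forget_step`, and the weight values are illustrative and not taken from the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_step(x_t, h_prev, c_prev, W_xf, W_hf, b_f):
    # Forget gate: values near 0 erase a cell-state entry, near 1 keep it.
    f_t = sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)
    # Element-wise product of the previous cell state and the forget gate.
    return f_t, c_prev * f_t

x_t = np.array([1.0, 0.0])
h_prev = np.zeros(2)
c_prev = np.array([2.0, -3.0])
# Illustrative weights: keep the first cell entry, forget the second.
W_xf = np.array([[10.0, 0.0], [-10.0, 0.0]])
W_hf = np.zeros((2, 2))
b_f = np.zeros(2)

f_t, c_forgot = forget_step(x_t, h_prev, c_prev, W_xf, W_hf, b_f)
print(np.round(f_t, 3), np.round(c_forgot, 3))
```

With these weights, the first entry of the cell state is carried over almost unchanged while the second is driven close to zero.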

To understand how the preceding operations help, let's go through the input
sentence: <i> I am from England. I speak __ </i>.

In the next step, we will add information from the current time step to the cell
state as well as to the output. The modified cell state (after forgetting what is
to be forgotten) is updated by the input activation, $i_t$ (which is based on the
current time step's input and also the previous time step's output), and the
modulation gate, $g_t$ (which helps in identifying the amount by which the cell state
is to be updated).

The input activation is calculated as follows:

![nlp](../imgs/nlp10.png)

Note that $W_{xi}$ and $W_{hi}$ represent the weights associated with the input and the
previous hidden layer, respectively.

The modulation gate's activation, $g_t$, is calculated as follows:

![nlp](../imgs/nlp11.png)

Note that $W_{xg}$ and $W_{hg}$ represent the weights associated with the input and the
previous hidden layer, respectively.

The modified cell state, $C_t$ , which will be passed to the next time step, is now as
follows:

![nlp](../imgs/nlp12.png)

Finally, we multiply the activated updated cell state ($\tanh(C_t)$) by the activated
output values, $O_t$, to obtain the final output, $h_t$, at time step $t$:

![nlp](../imgs/nlp13.png)


This way, we can leverage the various gates present in an LSTM to selectively
retain information across long sequences of time steps.
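Putting the pieces together, a single LSTM time step can be sketched in NumPy. This sketch follows the standard LSTM formulation and is not guaranteed to match the figures term by term: bias terms are omitted for brevity, the output-gate weight names `W_xo` and `W_ho` are assumed by analogy with the other gates, and the function name `lstm_step`, the layer sizes, and the random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps weight names such as 'W_xf' to matrices."""
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev)  # forget gate
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev)  # input activation
    g_t = np.tanh(p["W_xg"] @ x_t + p["W_hg"] @ h_prev)  # modulation gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev)  # output gate (assumed names)
    c_t = f_t * c_prev + i_t * g_t   # forget old content, then add new content
    h_t = o_t * np.tanh(c_t)         # output at this time step
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
p = {}
for gate in "figo":
    p[f"W_x{gate}"] = rng.normal(scale=0.5, size=(hidden_size, input_size))
    p[f"W_h{gate}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size))

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in np.eye(input_size)[:3]:   # three one-hot inputs, one per time step
    h, c = lstm_step(x_t, h, c, p)
print(h.shape, c.shape)  # (3,) (3,)
```

The cell state `c` is only ever scaled by the forget gate and incremented by the gated new content, which is what lets information survive across many time steps when the forget gate stays close to 1.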