<a href="https://colab.research.google.com/github/ngtinc21/Machine-Learning-Algorithms/blob/main/RNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Overview of Recurrent Neural Network**

---


A recurrent neural network (RNN) is a special type of artificial neural network:


*   adapted to work with time series data or other data that involves sequences,
*   used in areas such as natural language processing (NLP), daily stock prices, image captioning, and sensor measurements

***Recurrent*** means the output at the current time step becomes the input to the next time step. At each element of the sequence, the model considers not just the current input, but what it remembers about the **preceding elements**.

The figure below shows the conversion of a feed-forward neural network into an RNN. The nodes in different layers of the neural network are **compressed to form a single layer** of the recurrent neural network. A, B, and C are the network parameters, used to improve the output of the model.

“x” is the input layer, “h” is the hidden layer, and “y” is the output layer. 

At any given time t, the network combines the current input x(t) with the hidden state carried over from time t-1. The output at any given time is fed back into the network to improve the output.

![image.png](https://www.simplilearn.com/ice9/free_resources_article_thumb/Simple_Recurrent_Neural_Network.png)

![image.png](https://www.simplilearn.com/ice9/free_resources_article_thumb/Fully_connected_Recurrent_Neural_Network.gif)

# **Why Use an RNN over Feed-Forward Neural Nets (Conventional ANNs)?**

---

## **Problems of Conventional ANN**
1.   Cannot handle sequential data
2.   Considers only the current input
3.   Cannot memorize previous inputs

> Solution to these issues: ***RNN***







# **Working Mechanisms of RNN**

---
The figure below shows a **forward-propagating RNN** with its mathematical details.

1.   The input layer takes in the input **$x_t$**, processes it, and passes it on to the middle layer **$a_t$**.

2.   The middle layer can consist of multiple hidden layers, each with its own activation functions ($g_a$ and $g_y$), weights ($W_{aa}$, $W_{ax}$ and $W_{ya}$), and biases ($b_a$ and $b_y$).

3.   The RNN **standardizes** the activation functions, weights, and biases so that each hidden layer has the **same parameters**.

4.   Instead of creating several hidden layers with different parameters, it creates one and loops over it as many times as required.
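
The looping idea in the steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: $g_a$ is taken to be `tanh`, $g_y$ is taken to be the identity, the initial hidden state is zero, and the sizes (`n_x`, `n_a`, `n_y`, `T`) and the function name `rnn_forward` are made up for the example.

```python
import numpy as np

def rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y):
    """Loop ONE recurrent cell over the sequence, reusing the same parameters."""
    a_t = np.zeros(W_aa.shape[0])            # assumed initial hidden state a_0 = 0
    hidden_states, outputs = [], []
    for x_t in xs:                            # one iteration per time step
        a_t = np.tanh(W_aa @ a_t + W_ax @ x_t + b_a)   # g_a assumed to be tanh
        y_t = W_ya @ a_t + b_y                          # g_y assumed to be identity
        hidden_states.append(a_t)
        outputs.append(y_t)
    return np.stack(hidden_states), np.stack(outputs)

# Hypothetical sizes: 3 input features, 4 hidden units, 2 outputs, 5 time steps
rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 3, 4, 2, 5
W_aa = rng.normal(scale=0.5, size=(n_a, n_a))
W_ax = rng.normal(scale=0.5, size=(n_a, n_x))
W_ya = rng.normal(scale=0.5, size=(n_y, n_a))
b_a = np.zeros(n_a)
b_y = np.zeros(n_y)
xs = rng.normal(size=(T, n_x))

hidden_states, outputs = rnn_forward(xs, W_aa, W_ax, W_ya, b_a, b_y)
print(hidden_states.shape, outputs.shape)  # (5, 4) (5, 2)
```

Note that the parameter arrays are created once and reused at every iteration of the loop, which is what "same parameters" means here.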

## **Mathematical representation**
> The parameters of the RNN shown in the figure are as follows:
$\require{color}$
>   - **$a_t$** is the value of the hidden units/states at time $t$
>   - **$\colorbox{orange}{\(x_t\)}$** is the input at time step $t$
>   - **$\colorbox{yellowgreen}{\(y_t\)}$** is the output of the network at time step $t$
>   - **$\color{red}{\text{$W_{aa}$}}$** are the weights associated with the hidden units in the recurrent layer
>   - **$\color{purple}{\text{$W_{ax}$}}$** are the weights associated with the inputs in the recurrent layer
>   - **$\color{cyan}{\text{$W_{ya}$}}$** are the weights associated with the hidden-to-output layer
>   - **$b_a$** is the bias associated with the recurrent layer
>   - **$b_y$** is the bias associated with the output layer
>   - **$g_a$** is the activation function in the hidden layer used to compute the activation value $a_t$
>   - **$g_y$** is the activation function in the output layer used to compute the output value $y_t$
>
> ![image.png](https://camo.githubusercontent.com/694a18c571b70f2f15bd5d54f7caff19290c75f94a5e9cba5aaeba64e00ce57b/68747470733a2f2f666972656261736573746f726167652e676f6f676c65617069732e636f6d2f76302f622f646565702d6c6561726e696e672d63726173682d636f757273652e61707073706f742e636f6d2f6f2f36524e4e362e706e673f616c743d6d6564696126746f6b656e3d66663939333764362d613830392d346438312d396137662d386135653131366563643530)
>
> ### At time step $t$, the above RNN uses input value $x_t$ and activation value $a_{t-1}$ to predict output value $y_t$
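
In text form, and assuming the figure follows the standard formulation, the two computations at each time step can be written with the symbols defined above as:

$$a_t = g_a\left(W_{aa}\,a_{t-1} + W_{ax}\,x_t + b_a\right)$$

$$y_t = g_y\left(W_{ya}\,a_t + b_y\right)$$

Here $a_{t-1}$ is the hidden state from the previous step. The initial state $a_0$ is commonly set to a zero vector, which is an assumption rather than something stated above.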


## **Further Simplification of the Above Notation**
>  Combine **$\color{red}{\text{$W_{aa}$}}$** and **$\color{purple}{\text{$W_{ax}$}}$** into a new parameter **$\color{green}{\text{$W_a$}}$**
> - Compressing the two parameter matrices into one simplifies the notation, especially for more complex models
> - Similarly, **$\color{cyan}{\text{$W_{ya}$}}$** is renamed to a new parameter **$\color{royalblue}{\text{$W_y$}}$**
>
> ![image.png](https://camo.githubusercontent.com/4f55d6af69c8eb429fa3f2c749a579ee43cc189c139ed8950b10c286956773c1/68747470733a2f2f666972656261736573746f726167652e676f6f676c65617069732e636f6d2f76302f622f646565702d6c6561726e696e672d63726173682d636f757273652e61707073706f742e636f6d2f6f2f36524e4e372e706e673f616c743d6d6564696126746f6b656e3d61646531633232362d366436632d346330382d383361352d343265306364613335323737)
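
A quick numerical check can make the simplification concrete. The sketch below assumes that $W_a$ is formed by placing $W_{aa}$ and $W_{ax}$ side by side and that it is applied to the stacked vector $[a_{t-1}, x_t]$; `tanh` stands in for $g_a$, and all sizes and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 4, 3                       # hypothetical hidden and input sizes

W_aa = rng.normal(size=(n_a, n_a))
W_ax = rng.normal(size=(n_a, n_x))
b_a = rng.normal(size=n_a)
a_prev = rng.normal(size=n_a)         # a_{t-1}
x_t = rng.normal(size=n_x)

# Combined parameter: W_a = [W_aa | W_ax], shape (n_a, n_a + n_x)
W_a = np.hstack([W_aa, W_ax])
# Stacked input: [a_{t-1}; x_t], shape (n_a + n_x,)
stacked = np.concatenate([a_prev, x_t])

separate = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
combined = np.tanh(W_a @ stacked + b_a)

print(np.allclose(separate, combined))  # True
```

Both forms give the same hidden state, so the combined matrix is purely a notational convenience.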

## **Back Propagation Through Time (BPTT)**

Backpropagation is the training algorithm used to compute the gradients that **update the weights** of a neural network in order to minimize the error between the expected output and the predicted output for a given input. For an RNN, the traditional backpropagation algorithm is modified to include the recurrent loop in time when training the weights of the network.

BPTT differs from the traditional approach in that BPTT **sums errors at each time step** whereas feedforward networks do not need to sum errors as they do not share parameters across each layer.
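
To illustrate the summation over time steps, here is a minimal sketch of BPTT for a scalar RNN. Everything in it is an assumption made for the example: the cell is $a_t = \tanh(w\,a_{t-1} + u\,x_t)$ with scalar weights `w` and `u`, the loss is a sum of squared errors over all steps, and the data values are arbitrary. The result is compared against a finite-difference estimate as a sanity check.

```python
import numpy as np

def forward(w, u, xs):
    """Return [a_0, a_1, ..., a_T] for the scalar RNN a_t = tanh(w*a_{t-1} + u*x_t)."""
    states = [0.0]                            # a_0 assumed to be zero
    for x in xs:
        states.append(np.tanh(w * states[-1] + u * x))
    return states

def loss(w, u, xs, targets):
    states = forward(w, u, xs)
    return 0.5 * sum((a - y) ** 2 for a, y in zip(states[1:], targets))

def bptt_grad_w(w, u, xs, targets):
    """Gradient of the loss w.r.t. the shared weight w, summed over all time steps."""
    states = forward(w, u, xs)
    grad_w = 0.0
    da_next = 0.0                             # error flowing back from step t+1
    for t in range(len(xs), 0, -1):           # walk backwards through time
        da = (states[t] - targets[t - 1]) + da_next   # local error + future error
        dz = da * (1.0 - states[t] ** 2)               # back through tanh
        grad_w += dz * states[t - 1]                   # add this step's contribution
        da_next = dz * w                               # pass error to a_{t-1}
    return grad_w

xs = [0.5, -1.0, 0.3, 0.8]                    # arbitrary example inputs
targets = [0.2, -0.1, 0.4, 0.6]               # arbitrary example targets
w, u = 0.7, 1.2

analytic = bptt_grad_w(w, u, xs, targets)
eps = 1e-6
numeric = (loss(w + eps, u, xs, targets) - loss(w - eps, u, xs, targets)) / (2 * eps)
print(analytic, numeric)                      # the two values should agree closely
```

Because `w` is reused at every step, its gradient is the sum of one term per time step, which is the `grad_w +=` line in the loop.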

**Since Keras is being used, there is no need to worry about how this happens behind the scenes**; the only concern (at least at this stage) is setting up the network correctly.

# **Advantages and Disadvantages of RNNs**

---

### **Advantages:**

1.   Ability to handle sequence data.
2.   Ability to handle inputs of varying lengths.
3.   Ability to store or ‘memorize’ historical information.

### **Disadvantages:**

1.   Possibly (very) slow computation

2.   The network does not take future inputs into account when making decisions, as only previous information is used for prediction

3.  Vanishing gradient problem
  - the gradients used to compute the weight updates may get very close to zero, preventing the network from learning new weights
  - the deeper the network, the more pronounced this problem is
  - good at remembering 3-4 past steps, but a larger number of steps does not give good results

4.  Exploding gradient problem
  - accumulation of large error gradients can result in very large updates to the neural network model weights during the training process
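
The vanishing and exploding behaviours can be seen in a small numerical sketch. It makes a strong simplifying assumption: the activation derivative is ignored, so passing a gradient back one time step is modelled as a single multiplication by the transposed recurrent matrix. The recurrent matrix is built as a scaled orthogonal matrix (a hypothetical choice) so that the effect of the scale is easy to read off.

```python
import numpy as np

def backpropagated_norms(W_aa, steps):
    """Track the norm of a gradient as it is passed back through `steps` time steps."""
    n = W_aa.shape[0]
    grad = np.ones(n) / np.sqrt(n)            # start from a unit-norm gradient
    norms = []
    for _ in range(steps):
        grad = W_aa.T @ grad                  # one step back in time (activation ignored)
        norms.append(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # orthogonal matrix: preserves norms

vanishing = backpropagated_norms(0.5 * Q, steps=20)   # scale < 1
exploding = backpropagated_norms(1.5 * Q, steps=20)   # scale > 1

print(vanishing[-1])   # shrinks like 0.5 ** 20
print(exploding[-1])   # grows like 1.5 ** 20
```

With a scale below 1 the gradient norm decays geometrically with the number of steps, and with a scale above 1 it grows geometrically, which is why long sequences are hard for a plain RNN.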


# **Different Architectures of RNN**

---

### **Types of RNN**

*   One To One - Traditional neural networks
*   One to Many - A single input can produce multiple outputs, e.g. music generation or text generation
*   Many to One - Many inputs from different time steps produce a single output, e.g. sentiment analysis, emotion detection and text classification
*   Many to Many - e.g. language translation system and voice recognition

![image.png](https://miro.medium.com/max/1400/1*RsRIEyJyfgvisdW363CbDw.jpeg)
<p align='right'>Image Source: Andrej Karpathy</p>
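
As a rough sketch of how two of these types differ in practice, the snippet below runs the same recurrent cell over a sequence and then either keeps an output for every step (a many-to-many setup where input and output lengths match) or only for the last step (many-to-one). The helper `run_rnn`, the `tanh` activation, and all sizes are assumptions for illustration; tasks such as translation usually need a more elaborate encoder-decoder arrangement that is not shown here.

```python
import numpy as np

def run_rnn(xs, W_aa, W_ax, b_a):
    """Return the hidden state at every time step, shape (T, n_a)."""
    a_t = np.zeros(W_aa.shape[0])             # assumed zero initial state
    states = []
    for x_t in xs:
        a_t = np.tanh(W_aa @ a_t + W_ax @ x_t + b_a)
        states.append(a_t)
    return np.stack(states)

rng = np.random.default_rng(2)
T, n_x, n_a, n_y = 6, 3, 4, 2                 # hypothetical sizes
W_aa = rng.normal(scale=0.5, size=(n_a, n_a))
W_ax = rng.normal(scale=0.5, size=(n_a, n_x))
W_ya = rng.normal(scale=0.5, size=(n_y, n_a))
b_a, b_y = np.zeros(n_a), np.zeros(n_y)
xs = rng.normal(size=(T, n_x))

states = run_rnn(xs, W_aa, W_ax, b_a)

many_to_many = states @ W_ya.T + b_y          # one output per time step
many_to_one = states[-1] @ W_ya.T + b_y       # a single output from the last step

print(many_to_many.shape, many_to_one.shape)  # (6, 2) (2,)
```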

### **RNN Architectures**

1.   Vanilla RNN
  - Basic RNN architecture with self-looping, **as described above**
  - A hidden state represents past knowledge and also serves as an input to the current and next steps
  - Every step of the RNN shares the same activation function and the same set of parameters
  - BUT a brain with a single set of memory (only a few sets of parameters to process) is easily overloaded and cannot remember the distant past

2.  Bidirectional recurrent neural networks (BRNN)
  - **Inputs from future time steps are used** to improve the accuracy of the network
  - Like knowing the first and last words of a sentence to predict the middle words

3.  Gated Recurrent Units (GRU)
  - Designed to **handle the vanishing and exploding gradient problems**
  - A **reset gate and an update gate** determine which information is retained for future predictions

4.  Long Short Term Memory (LSTM)
  - Designed to **address the vanishing gradient problem**
  - There are three gates: the **input, output and forget gates**. Similar to the GRU, these gates determine which information to emphasize, to output, and to forget for the prediction
  - Having three logical units of memory to process different dimensions greatly improves the LSTM's ability to remember both long- and short-term information
  - Instead of a single neural network layer (a recurrent tanh hidden layer), each LSTM cell has four interacting layers
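
To make the GRU's reset and update gates more concrete, here is a minimal single-step sketch in NumPy. It follows one common textbook formulation; the parameter names (`W_z`, `W_r`, `W_h` and the matching biases), the sizes, and the convention that the update gate `z` weights the new candidate are all assumptions, and some libraries use the opposite convention for `z`.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, p):
    """One GRU time step; p is a dict of (hypothetical) parameter arrays."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])      # update gate: how much to overwrite
    r = sigmoid(p["W_r"] @ hx + p["b_r"])      # reset gate: how much past to use
    h_cand = np.tanh(p["W_h"] @ np.concatenate([r * h_prev, x_t]) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_cand     # blend old state and candidate

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                                # hypothetical sizes
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(n_h, n_h + n_x))
    params[f"b_{gate}"] = np.zeros(n_h)

h = np.zeros(n_h)                              # assumed zero initial state
for x_t in rng.normal(size=(5, n_x)):          # run over a 5-step sequence
    h = gru_step(x_t, h, params)

print(h.shape)  # (4,)
```

When the update gate is close to zero, the previous state is copied through almost unchanged, which is how information can be retained over many steps.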

### **Long Short Term Memory (LSTM)**
Detailed Mechanism:

![image.png](https://miro.medium.com/max/1400/1*cU0pLdq-KvhebSNYiF-hYw.png)

LSTMs work in 3 steps:

  1. Decide how much historical data it should remember by a **forget gate**
      - determined by a sigmoid function
      - the leftmost sigmoid function in the above plot

      ![image.png](https://www.simplilearn.com/ice9/free_resources_article_thumb/LSTMs_step1.png)

  2. Decide how much this unit adds to the current state by an **input gate**
      - determined by a sigmoid function and a tanh function
      - the sigmoid function decides which values to let through (between 0 and 1), while the tanh function weights the values that are passed, deciding their level of importance (-1 to 1)

      ![image.png](https://www.simplilearn.com/ice9/free_resources_article_thumb/LSTMs_step2.png)

  3. Decide what part of the current cell state makes it to the output by an **output gate**
      - determined by a sigmoid function, which decides what parts of the cell state make it to the output
      - Then, put the cell state through tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid gate

      ![image.png](https://s3.amazonaws.com/static2.simplilearn.com/ice9/free_resources_article_thumb/LSTMs_step3.png)
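
The three steps above can be sketched as a single LSTM time step in NumPy. This follows the commonly used formulation rather than any specific library's implementation; the parameter names (`W_f`, `W_i`, `W_c`, `W_o` and the matching biases) and the sizes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p is a dict of (hypothetical) parameter arrays."""
    hx = np.concatenate([h_prev, x_t])
    # Step 1: forget gate decides how much of the old cell state to keep
    f = sigmoid(p["W_f"] @ hx + p["b_f"])
    # Step 2: input gate (sigmoid) scales the candidate values (tanh)
    i = sigmoid(p["W_i"] @ hx + p["b_i"])
    c_cand = np.tanh(p["W_c"] @ hx + p["b_c"])
    c_t = f * c_prev + i * c_cand
    # Step 3: output gate selects which parts of tanh(cell state) are emitted
    o = sigmoid(p["W_o"] @ hx + p["b_o"])
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_x, n_h = 3, 4                                # hypothetical sizes
params = {}
for gate in ("f", "i", "c", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(n_h, n_h + n_x))
    params[f"b_{gate}"] = np.zeros(n_h)

h, c = np.zeros(n_h), np.zeros(n_h)            # assumed zero initial states
for x_t in rng.normal(size=(5, n_x)):          # run over a 5-step sequence
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)  # (4,) (4,)
```

The four weight matrices correspond to the four interacting layers inside the cell: three sigmoid gates and one tanh candidate layer.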

> Graphical difference with RNN:
![image.png](https://camo.githubusercontent.com/5e4e339bfad09c8532386f8befec19e2609fe5a77f4f302782da7472cf23b67c/68747470733a2f2f666972656261736573746f726167652e676f6f676c65617069732e636f6d2f76302f622f646565702d6c6561726e696e672d63726173682d636f757273652e61707073706f742e636f6d2f6f2f36524e4e32362e706e673f616c743d6d6564696126746f6b656e3d61623938643732302d633066392d346333322d613663342d633539376438646139303462)