### Why RNNs?

Recurrent Neural Network (retains and leverages sequence information)

* We know that the architecture of **CNN** is different from the architecture of **MLP**.
    - **MLP** works very well if the input data is a vector
    - **CNN** works very well if the input data is an image

* What if the input is a **sequence**?
    - sequence of words (Amazon food reviews)
        - this phone is very smart (the sequence is very important)
        - when we apply **BoW**, **TFIDF**, or **W2V**, the context and the word order are lost
    - when we have time series data (sequence prediction problem)
        - stock market prediction
        - windowing the data based on the interval
        - frequency domain (fourier transformation)
    - machine translation (language)
        - English to Spanish (sequence information)
    - speech recognition (**Siri**, **Alexa**, **Google Assistant**)
    - image captioning
        - input is the image
        - output is a sentence describing the image
        - Google image search

* RNNs are important because they use the sequence information, thereby building more robust models.
    - the output depends on the whole sequence of inputs, including their order
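As a small illustration of why order matters, the sketch below (plain Python; the two review sentences are made up for this example) shows that a bag-of-words count cannot tell apart two sentences that use the same words in a different order.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences, ignoring the order of the words."""
    return Counter(sentence.lower().split())

# Two hypothetical reviews with opposite sentiment but identical words
a = "the food was good not bad"
b = "the food was bad not good"

same_bag = bag_of_words(a) == bag_of_words(b)  # the counts are identical
same_sequence = a.split() == b.split()         # the word order is not
print(same_bag, same_sequence)                 # prints: True False
```

A model that only sees the counts receives the same input for both reviews, which is the gap that sequence models are meant to close.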

### Recurrent Neural Network

structure and example

* Recurrent means repeating

* Amazon Food Reviews → binary classification task
    - $D = \{(x_i, y_i)\}$
    - $x_i = \text{sequence of words or a sentence that has a context}$
    - $y_i = \text{binary class - positive or negative}$
    - each word is represented as a one-hot encoded vector
    
    <br>
    
    - $x_1 \rightarrow \langle x_{11}, x_{12}, x_{13}, x_{14}, x_{15} \rangle$ and $y_1$
    
    ![rnn-structure-example](https://user-images.githubusercontent.com/63333753/143395161-0bf024bf-fcf4-4355-ba32-d68575fe0f5d.jpeg)
    
    <br>
    
    - to get the prediction for $y_1$, take $O_5$ (the output of the last time step) and feed it to a softmax or logistic layer
    
    <br>
    
    ![simple-rnn-structure](https://user-images.githubusercontent.com/63333753/143396318-ff5598e1-40f4-4f12-9bf6-3ab45d3f92fc.jpeg)
    
    <br>
    
    - to create that repetitive structure, we can add $O_0$, an initial state that is a set of zeros or random numbers
    
    <br>
    
    ![simple-rnn-structure-ano](https://user-images.githubusercontent.com/63333753/143396829-557292fa-3a56-4eb6-8522-386751b1002b.jpeg)
    
    <br>
    
    - the final structure is as follows
    
    <br>
    
    <img src="https://www.researchgate.net/profile/Weijiang-Feng/publication/318332317/figure/fig1/AS:614309562437664@1523474221928/The-standard-RNN-and-unfolded-RNN.png">
    
**Credits** - Image from Internet
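The one-hot representation of a review described above can be sketched in a few lines of plain Python. The review text and the helper names `build_vocab` and `one_hot_sequence` are hypothetical, chosen only for this illustration; a real vocabulary would be built from the whole dataset.

```python
def build_vocab(sentences):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot_sequence(sentence, vocab):
    """Turn a sentence into a list of one-hot vectors, one vector per word."""
    vectors = []
    for word in sentence.lower().split():
        vec = [0] * len(vocab)
        vec[vocab[word]] = 1
        vectors.append(vec)
    return vectors

reviews = ["the food was very tasty"]       # made-up five-word review
vocab = build_vocab(reviews)
x_1 = one_hot_sequence(reviews[0], vocab)   # <x_11, ..., x_15>, one vector per time step
for vec in x_1:
    print(vec)
```

Each vector in `x_1` is fed to the network at its own time step, so the order of the words is preserved in the order of the vectors.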

### Backprop Over Time

backpropagation in RNN

![fb_prop_rnn](https://user-images.githubusercontent.com/63333753/143411274-583b7035-5bdb-4b25-93a9-abebabca1924.jpeg)

From the above image we can formulate -

**Forward Propagation** (weights are not updated; the computation follows the arrows, and the same $w$ and $w^1$ are reused at every time step)

* $O_1 = f(wx_{i1} + w^1O_0)$

* $O_2 = f(wx_{i2} + w^1O_1)$

* ...

* $O_{10} = f(wx_{i10} + w^1O_9)$

* $\hat{y_i} = g(w^{11}O_{10})$
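The forward equations above can be sketched as a short loop. This is a scalar toy version under stated assumptions: every input, output and weight is a single number (in practice they are vectors and matrices), $f$ is taken to be `tanh`, $g$ is taken to be a sigmoid, and the input values and weights are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_forward(xs, w, w1, w11, o0=0.0):
    """Many-to-one forward pass.

    Returns the list [O_0, O_1, ..., O_T] and the prediction y_hat.
    """
    outputs = [o0]
    for x_t in xs:
        # O_t = f(w * x_t + w1 * O_{t-1}); the same w and w1 are used at every step
        outputs.append(math.tanh(w * x_t + w1 * outputs[-1]))
    # y_hat = g(w11 * O_T), computed from the last output only
    y_hat = sigmoid(w11 * outputs[-1])
    return outputs, y_hat

# Ten made-up inputs x_i1 ... x_i10
xs = [0.5, -1.0, 0.25, 0.0, 1.0, -0.5, 0.75, 0.1, -0.2, 0.9]
outputs, y_hat = rnn_forward(xs, w=0.8, w1=0.5, w11=1.2)
print(len(outputs) - 1, "time steps, prediction:", y_hat)
```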

**Backward Propagation** (weights are updated; gradients flow in the opposite direction of the arrows)

$$\frac{\partial L}{\partial \hat{y_i}}$$

$$\frac{\partial L}{\partial O_{10}} = \frac{\partial L}{\partial \hat{y_i}}\frac{\partial \hat{y_i}}{\partial O_{10}}$$

$$\frac{\partial L}{\partial w^{11}} = \frac{\partial L}{\partial \hat{y_i}}\frac{\partial \hat{y_i}}{\partial w^{11}}$$

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y_i}}\frac{\partial \hat{y_i}}{\partial O_{10}}\frac{\partial O_{10}}{\partial w} \ (\text{for the first } w \text{ reached going backwards, i.e. at the last time step})$$
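A hedged sketch of where the remaining contributions come from: because the same $w$ is reused at every time step, the full gradient adds one term per step, and each term carries a chain of $\frac{\partial O_k}{\partial O_{k-1}}$ factors back from $O_{10}$. Here $\frac{\partial O_t}{\partial w}$ denotes only the direct dependence of $O_t$ on $w$ with $O_{t-1}$ held fixed; this convention is an assumption added for clarity and is not taken from the figure.

$$\frac{\partial L}{\partial w} = \sum_{t=1}^{10} \frac{\partial L}{\partial \hat{y_i}}\frac{\partial \hat{y_i}}{\partial O_{10}} \left( \prod_{k=t+1}^{10} \frac{\partial O_k}{\partial O_{k-1}} \right) \frac{\partial O_t}{\partial w}$$

For $t = 10$ the product is empty (equal to 1), which gives the single term written above. The gradient with respect to $w^1$ has the same form with $\frac{\partial O_t}{\partial w^1}$ in place of $\frac{\partial O_t}{\partial w}$.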

* In this way we update all of $\{w, w^1, w^{11}\}$; this happens for a single sequence, across all of its time steps.
* The problem is that the gradient reaching the early time steps is a product of many partial derivatives. If these values are less than 1, the product shrinks towards zero, which is the vanishing gradient problem (values greater than 1 can instead make it explode).
    - this happens not because the network has many layers but because the sequence has many time steps
    - LSTMs and GRUs are mainly designed to overcome this problem
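The effect of multiplying many per-step derivatives can be seen with simple arithmetic. The sketch below assumes, purely for illustration, that every time step contributes the same factor (0.9 or 1.1); in a real network the factors differ from step to step.

```python
def gradient_scale(factor, steps):
    """Product of `steps` identical per-step derivative factors."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

for steps in (10, 50, 100):
    shrinking = gradient_scale(0.9, steps)  # factors below 1: gradient vanishes
    growing = gradient_scale(1.1, steps)    # factors above 1: gradient explodes
    print(steps, shrinking, growing)
```

Even a factor as close to 1 as 0.9 leaves almost nothing of the gradient after 100 time steps, so the weights barely learn from early inputs.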

---

- An intuition for RNNs using a simple cooking analogy:
    - When cooking a dish, we use a single pan (the function) and add different ingredients (the inputs) one by one in sequence; the pan stays the same.
    - Every time a new ingredient is added, the flavour (the output) changes.
    - Backpropagation is like tasting the finished dish: if it does not have enough salt ($w$, $w^1$), we add a little and taste again, repeating until the dish tastes right.

### Types of RNNs

* **One to One RNN**
    - equivalent to a simple MLP
    - one input (no sequence)
    - one output (no sequence)

* **Many to One RNN**
    - regular RNN with sequence of inputs and one output
    - useful for sentiment classification
    
<!--     <img src="https://www.researchgate.net/profile/Yishan-Jiao/publication/307889236/figure/fig1/AS:614056591388688@1523413908449/Many-to-one-RNN-structure-used-in-the-method-of-DNN-with-RNNon-sequence.png"> -->
    

* **One to Many RNN**
    - one input (no sequence) - can be an image
    - the output can be sequence
    - useful for image captioning
    
<!--     <img src="https://www.simplilearn.com/ice9/free_resources_article_thumb/One_to_Many_RNN.png"> -->
    

* **Many to Many RNN (A)**
    - the input and output are of same length
    - the input is a sequence
    - the output is also a sequence
    - useful for identifying the parts of speech of the words in a sentence
        - the input is a sentence (sequence of words)
        - the output is again a sequence (one part-of-speech tag per word)

* **Many to Many RNN (B)**
    - the input and output are of different length
    - the input is a sequence
    - the output is also a sequence
    - useful in machine translation
        - an English sentence can have 10 words whereas its Spanish translation can have 6 words
    - this is also called an encoder-decoder network

<img src="https://www.dummies.com/wp-content/uploads/deep-learning-recurrent-neural-network-input-output.png">

**Credits** - Image from Internet
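The first three input/output patterns listed above differ only in what is fed in and which outputs are kept, as the scalar sketch below suggests. The weights, the `tanh` activation and the function names are assumptions made for illustration; in particular, feeding each output back as the next input in `one_to_many` is just one possible design.

```python
import math

def step(x_t, o_prev, w=0.8, w1=0.5):
    """One recurrent step with shared (made-up) weights."""
    return math.tanh(w * x_t + w1 * o_prev)

def many_to_one(xs):
    """Sequence in, single output out (e.g. sentiment classification)."""
    o = 0.0
    for x_t in xs:
        o = step(x_t, o)
    return o

def many_to_many_same_length(xs):
    """Sequence in, one output per input (e.g. part-of-speech tagging)."""
    o, outputs = 0.0, []
    for x_t in xs:
        o = step(x_t, o)
        outputs.append(o)
    return outputs

def one_to_many(x, steps):
    """Single input in, sequence out (e.g. image captioning).

    Assumption: the input is used at the first step only, and each later
    step receives the previous output as its input.
    """
    o, x_t, outputs = 0.0, x, []
    for _ in range(steps):
        o = step(x_t, o)
        outputs.append(o)
        x_t = o
    return outputs

xs = [0.5, -1.0, 0.25]
print(many_to_one(xs))
print(many_to_many_same_length(xs))
print(one_to_many(0.5, steps=4))
```

The different-length case (B) is left out here because it needs two networks, an encoder and a decoder, rather than a single loop.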

### LSTMs / GRUs

need → https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

* If a later output (for example the fourth) depends on a much earlier input (the first), it is called a **long-term dependency**. During backpropagation the gradient may vanish before it reaches that far back.
    - the simple RNN cannot take care of long-term dependencies

* Long-term dependencies are especially common in problems like machine translation.

* To avoid these problems, researchers came up with LSTMs and GRUs.

### LSTM

Long Short Term Memory RNN - http://colah.github.io/posts/2015-08-Understanding-LSTMs/

* It can take care of long-term dependency as well as short-term dependency.

* LSTM - 1997
    - input gate
    - output gate
    - forget gate

    ![lstm-structure](https://user-images.githubusercontent.com/63333753/143560894-207d3382-d4ee-4d1f-8f58-0ff703ab0402.png)

* Two ways of representing an RNN
    - left side → addition operation
    - right side → concatenation operation
    
    ![two-ways-rnn](https://user-images.githubusercontent.com/63333753/143535208-a61e40e1-52f9-4edf-ba09-d6ce6b35ca18.jpeg)

<br>

* **Read the above blog carefully.**

    <img src="https://i.imgur.com/wM8Wk1Q.png">

* YouTube → https://youtu.be/8HyCNIVRbSU
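To make the three gates concrete, here is a minimal single-step sketch that follows the commonly used LSTM formulation (the one described in the blog linked above). It is a scalar toy version: the state and every weight are single numbers, and the parameter layout, names and values are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step with scalar state.

    params maps "forget", "input", "candidate" and "output" to a
    (w_h, w_x, b) triple; this layout is illustrative, not a library API.
    """
    def unit(name, activation):
        w_h, w_x, b = params[name]
        return activation(w_h * h_prev + w_x * x_t + b)

    f_t = unit("forget", sigmoid)            # how much of the old cell state to keep
    i_t = unit("input", sigmoid)             # how much of the candidate to write
    c_tilde = unit("candidate", math.tanh)   # proposed new cell content
    o_t = unit("output", sigmoid)            # how much of the cell state to expose

    c_t = f_t * c_prev + i_t * c_tilde       # additive update of the cell state
    h_t = o_t * math.tanh(c_t)               # output / hidden state
    return h_t, c_t

params = {                                   # made-up weights
    "forget": (0.5, 0.5, 1.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (0.5, 0.5, 0.0),
    "output": (0.5, 0.5, 0.0),
}
h, c = 0.0, 0.0
for x_t in [0.5, -1.0, 0.25, 1.0]:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is carried forward almost unchanged, which is what lets information (and gradients) survive over many time steps.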

### GRU - Gated Recurrent Unit

https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be

* GRU - 2014
    - simplified variant inspired by LSTM
    - fewer parameters, so usually faster to train
    - often performs comparably to LSTM
    - two gates
        - reset gate
        - update gate

![gru-structure](https://user-images.githubusercontent.com/63333753/143561226-dc36119e-7ca1-4bb9-9e74-c9a9ec5e26d1.png)

**Credits** - Image from Internet
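A single GRU step can be sketched in the same scalar style. Note that sources differ on whether the update gate $z_t$ weights the old state or the new candidate; the version below lets $z_t$ weight the candidate, and the parameter layout, names and values are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step with scalar state.

    params maps "reset", "update" and "candidate" to a (w_h, w_x, b)
    triple; this layout is illustrative, not a library API.
    """
    def pre_activation(name, h_in):
        w_h, w_x, b = params[name]
        return w_h * h_in + w_x * x_t + b

    r_t = sigmoid(pre_activation("reset", h_prev))    # how much past state feeds the candidate
    z_t = sigmoid(pre_activation("update", h_prev))   # how much of the state to overwrite
    h_tilde = math.tanh(pre_activation("candidate", r_t * h_prev))
    return (1.0 - z_t) * h_prev + z_t * h_tilde       # blend old state and candidate

params = {                                            # made-up weights
    "reset": (0.5, 0.5, 0.0),
    "update": (0.5, 0.5, 0.0),
    "candidate": (0.5, 0.5, 0.0),
}
h = 0.0
for x_t in [0.5, -1.0, 0.25, 1.0]:
    h = gru_step(x_t, h, params)
print(h)
```

Compared with the LSTM there is no separate cell state and one gate fewer, which is where the saving in parameters comes from.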

### Deep RNN

* Forward propagation happens via **→**
* Backward propagation happens via **←**

<img src="https://i.stack.imgur.com/8ngKs.png">

**Credits** - Image from Internet
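Stacking can be sketched as feeding the full output sequence of one recurrent layer into the next layer as its input sequence. The scalar weights, the `tanh` activation and the function names below are assumptions made for illustration.

```python
import math

def rnn_layer(xs, w, w1, o0=0.0):
    """Run one recurrent layer over a sequence and return every output."""
    outputs, o = [], o0
    for x_t in xs:
        o = math.tanh(w * x_t + w1 * o)
        outputs.append(o)
    return outputs

def deep_rnn(xs, layer_weights):
    """Stack layers: the output sequence of one layer is the input of the next."""
    sequence = xs
    for w, w1 in layer_weights:
        sequence = rnn_layer(sequence, w, w1)
    return sequence

xs = [0.5, -1.0, 0.25, 1.0]
layer_weights = [(0.8, 0.5), (1.1, -0.3), (0.6, 0.4)]  # made-up (w, w1) for each layer
top_outputs = deep_rnn(xs, layer_weights)
print(top_outputs)
```

Each layer keeps its own weights, shared across time steps within that layer, so gradients flow both back through time and down through the stack.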

### Bi-directional RNN

BRNN

* Sometimes an output depends not only on previous inputs but also on following inputs, which a forward-only model has not seen yet.
    - in notation, $y_{it}$ depends on both $x_{it}$ and $x_{i,t+k}$ (for some $k > 0$)
* An RNN that reads the sequence from both ends (previous and following) is called a bidirectional RNN.

<img src="https://miro.medium.com/max/764/1*6QnPUSv_t9BY9Fv8_aLb-Q.png">

* Forward propagation happens via **→**
* Backward propagation happens via **←**

* This model is mostly used in NLP tasks where the sentiment or output depends on context from both directions.

* First, the sequence of words is fed to one RNN in its original order, which produces the forward outputs. The sequence is then reversed and fed into another RNN, which also generates an output at each time step. The two outputs for each time step are concatenated, the loss is calculated on the result, and the gradients are backpropagated to both the forward LSTM and the backward LSTM.


**Credits** - Image from Internet
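The procedure described above can be sketched with two independent recurrent layers, one reading the sequence as given and one reading it reversed. This is a scalar toy version with made-up weights and a plain `tanh` recurrence in place of LSTM cells; pairing the two values per time step stands in for concatenation.

```python
import math

def rnn_layer(xs, w, w1, o0=0.0):
    """Run one recurrent layer over a sequence and return every output."""
    outputs, o = [], o0
    for x_t in xs:
        o = math.tanh(w * x_t + w1 * o)
        outputs.append(o)
    return outputs

def bidirectional_rnn(xs, forward_weights, backward_weights):
    """Return one (forward, backward) pair of outputs per time step."""
    forward = rnn_layer(xs, *forward_weights)
    # Run the second RNN on the reversed input, then flip its outputs back
    # so that position t in both lists refers to the same input x_t
    backward = rnn_layer(xs[::-1], *backward_weights)[::-1]
    return list(zip(forward, backward))

xs = [0.5, -1.0, 0.25, 1.0]
outputs = bidirectional_rnn(xs, (0.8, 0.5), (0.7, 0.4))
for t, pair in enumerate(outputs, start=1):
    print(t, pair)
```

At the first time step the forward value has seen only the first input, while the backward value has already seen the entire sequence, which is how later words can influence earlier outputs.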