# Recurrent Neural Network (RNN)

- Refer to [RNN-in-NLP.ipynb](RNN-in-NLP.ipynb) for more information on how RNNs are applied in NLP

- Sequential data prediction is considered a key problem in machine learning and artificial intelligence
- Unlike images where we look at the entire image, we read text documents sequentially to understand the conte
- The likelihood of any sentence can be determined from everyday use of language
- The earlier sequence of words (in time) is important to predict the next word, sentence, paragraph or chapter
- If a word occurs twice in a sentence, but both occurrences cannot be accommodated in the sliding window, then the word is learned twice
- __*An architecture that does not impose a fixed-length limit on the prior context*__ - _The Key Aspect Of RNN_

- __States__ are important in the reading exercise. The previous state definitely affects the next state
- In order to use the previous state, we need to store it or remember it
- Traditional Neural Networks were not designed as state machines
- Traditional Neural Networks do not accept arbitrary input length
- Inherent ability to model sequential input
- Handle variable length inputs without the use of arbitrary fixed-sized windows
- Use its own output as input
- __RNN__s encode not only __attributional similarities between words__, but also __similarities between pairs of words__ (analogies)
  - i.e., it finds out the analogy
    - $\texttt{Chennai : Tamil :: London : English}$
    - $\texttt{go : went}$ is the same relation as $\texttt{run : ran}$
    - $\texttt{queen} \approx \texttt{king - man + woman}$
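The analogy arithmetic above can be illustrated with a toy example. This is only a sketch: the 2-D vectors are hand-made for illustration and were not learned by an RNN or any other model, and the helper name `analogy` is hypothetical.

```python
import numpy as np

# Hypothetical hand-made 2-D "embeddings" (axis 0 ~ royalty, axis 1 ~ gender).
# Real embeddings are learned and have many more dimensions.
vectors = {
    "man":   np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king":  np.array([5.0, 1.0]),
    "queen": np.array([5.0, 3.0]),
    "apple": np.array([-3.0, -2.0]),
}

def analogy(a, b, c, vectors):
    """Return the word d that best completes a : b :: c : d."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the query words themselves, then pick the nearest vector.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))

result = analogy("man", "king", "woman", vectors)
print(result)  # queen
```

With these toy vectors, `king - man + woman` lands exactly on `queen`; with learned vectors the match is only approximate, which is why a nearest-neighbour search is used.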

## A Simple RNN

- An RNN is similar to a Traditional NN, except that there is a recurrent part ($\large h_t$)
  - A small loop in the hidden layer part
- Why a loop in the hidden layer?
  - We want to capture the previous state $\large h_{t-1}$, which we have calculated from the input as a linear combination, and include this as part of the new state $\large h_t$
  - i.e., we are incorporating whatever learned in the embedding layer as part of the current state
- $\large h_t$ is going to give the _time-series_

![RNN_A_Simple_RNN](images/RNN_A_Simple_RNN.jpg)

## RNN - An Extension of a Feed Forward Network

- A new parameter $\large U$ is introduced, which carries the previous state of the hidden layer into the current step
  - It is a hidden layer to hidden layer parameter
  - The previous state $\large h_{t-1}$ is fed back through $\large U$; since $\large h_{t-1}$ itself depends on all earlier steps, information from earlier time slices is carried along, not just the most recent input
  - It is the memory part of RNN, which we want to retain
- In an RNN, three parameters need to be learnt: $\large V, W, U$
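Written out, one common formulation of this recurrence is shown below. The notes only state that $U$ is the hidden-to-hidden parameter, so the roles given here to $W$ (input to hidden) and $V$ (hidden to output), and the activation functions $g$ and $f$, are assumptions that follow a widely used textbook convention and may differ from the figure.

$$
h_t = g(U h_{t-1} + W x_t), \qquad y_t = f(V h_t)
$$

Here $g$ is typically $\tanh$ or a sigmoid, and $f$ is typically a softmax when the output is a distribution over classes or words.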

![RNN_An_Extension_of_a_Feed_Forward_Network](images/RNN_An_Extension_of_a_Feed_Forward_Network.jpg)

## Multiple Architectures of RNN

- With the memory capability of an RNN, a Neural Network can be built in several ways
  - __One-to-One__
    - Use the neural network as a standard neural network for classification purposes
  - __One-to-Many__
    - Can be used for image description
      - Input is single image
      - Output is description of scene in that image
  - __Many-to-One__
    - Basically used in Sentiment Analysis
  - __Many-to-Many__
    - Used for Machine Translation (right corner image in the diagram shown below)
      - Given a set of words (a sentence), translate it into a set of words in another language
    - Used for frame labeling of video sequences

![RNN_Multiple_Architectures_Of_RNN](images/RNN_Multiple_Architectures_Of_RNN.jpg)
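The difference between the many-to-one and many-to-many setups can be sketched in a few lines of NumPy. This is an illustrative sketch only: the weights are random and untrained, the dimensions are arbitrary, and the use of `tanh` and the roles of `W`, `U`, `V` (input-to-hidden, hidden-to-hidden, hidden-to-output) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 5, 3

# Randomly initialised (untrained) parameters, for shape illustration only.
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden -> output

def hidden_states(xs):
    """Run the recurrence over a sequence and return every hidden state."""
    h = np.zeros(hidden_dim)
    states = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(6, input_dim))  # a sequence of 6 input vectors
states = hidden_states(xs)

# Many-to-one (e.g. sentiment analysis): one output from the final state.
many_to_one = V @ states[-1]

# Many-to-many (e.g. frame labeling): one output per time step.
many_to_many = states @ V.T

print(many_to_one.shape)   # (3,)
print(many_to_many.shape)  # (6, 3)
```

The same recurrence serves both cases; only the choice of which hidden states are passed to the output layer changes.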

## Feed-Forward Algorithm

- The algorithm below shows the Feed Forward flow of an RNN

![RNN_Feed_Forward_Algorithm](images/RNN_Feed_Forward_Algorithm.jpg)
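Since the algorithm is only given as an image, here is a minimal NumPy sketch of one common forward pass. It assumes $h_t = \tanh(U h_{t-1} + W x_t)$ and $y_t = \mathrm{softmax}(V h_t)$; the function name `rnn_forward`, the activation choices and the shapes are illustrative assumptions and may differ from the figure.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W, U, V, h0=None):
    """Forward pass of a simple RNN.

    xs has shape (T, input_dim). Returns the hidden states (T, hidden_dim)
    and the outputs (T, output_dim).
    """
    h = np.zeros(U.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(U @ h + W @ x)   # new state from previous state and input
        y = softmax(V @ h)           # output distribution at this time step
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

# Usage with random, untrained weights (illustration only).
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # input -> hidden
U = rng.normal(size=(5, 5))   # hidden -> hidden
V = rng.normal(size=(2, 5))   # hidden -> output
xs = rng.normal(size=(4, 3))  # 4 time steps, 3 features each
hs, ys = rnn_forward(xs, W, U, V)
```

Note that the same `W`, `U` and `V` are reused at every time step; only `h` changes as the sequence is consumed.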

## Recurrent Neuron

- Various representations of the 'Recurrent Neuron' appear in research papers
- Two of these representations are shown below

| Representation 1 | Representation 2 |
| ---- | ---- |
| ![RNN_Recurrent_Neuron_Representation_1](images/RNN_Recurrent_Neuron_Representation_1.jpg) | ![RNN_Recurrent_Neuron_Representation_2](images/RNN_Recurrent_Neuron_Representation_2.jpg) |

## Unrolled RNN

- From [A Gentle Introduction to RNN Unrolling](https://machinelearningmastery.com/rnn-unrolling/)
  - Recurrent neural networks are a type of neural network where the outputs from previous time steps are fed as input to the current time step
  - RNNs are fit and make predictions over many time steps. We can simplify the model by unfolding or unrolling the RNN graph over the input sequence.
    - ![Example-of-an-RNN-with-a-cycle](images/Example-of-an-RNN-with-a-cycle.png)*Example-of-an-RNN-with-a-cycle*
  - Consider the case where we have multiple time steps of input (X(t), X(t+1), …), multiple time steps of internal state (u(t), u(t+1), …), and multiple time steps of outputs (y(t), y(t+1), …).
  - We can unfold the above network schematic into a graph without any cycles.
    - ![Example-of-Unrolled-RNN-on-the-forward-pass](images/Example-of-Unrolled-RNN-on-the-forward-pass.png)*Example-of-Unrolled-RNN-on-the-forward-pass*

- From [v1] Lecture 55
  - We can unroll the RNN through time, sharing the values of the hidden weights across time steps, and compute the output in the same manner as in an ANN. The only change is that $\large h_t$ is computed using the previous value stored in memory, $\large h_{t-1}$
  - $\large h_0$ is the initial state, $\large x_0$ is the initial input value
  - Using $\large h_0$ and $\large x_0$, we compute the activation $\large h_0$, using which we compute the output $\large y_0$

  - ![RNN_Unrolled_RNN](images/RNN_Unrolled_RNN.jpg)
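The idea that unrolling only re-draws the same computation, with the same weights at every step, can be checked numerically. The sketch below assumes a `tanh` recurrence with an input-to-hidden matrix `W` and a hidden-to-hidden matrix `U`; the weights are random and the names `step` and `h_init` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))           # input -> hidden (shared across time)
U = rng.normal(size=(2, 2))           # hidden -> hidden (shared across time)
x0, x1, x2 = rng.normal(size=(3, 3))  # three input vectors
h_init = np.zeros(2)                  # initial state

def step(h_prev, x):
    """One application of the recurrence."""
    return np.tanh(U @ h_prev + W @ x)

# Rolled form: a loop that feeds the state back into the same cell.
h = h_init
for x in (x0, x1, x2):
    h = step(h, x)
rolled = h

# Unrolled form: the loop written out as a cycle-free chain of three copies,
# each copy using the very same U and W.
unrolled = np.tanh(
    U @ np.tanh(U @ np.tanh(U @ h_init + W @ x0) + W @ x1) + W @ x2
)

print(np.allclose(rolled, unrolled))  # True
```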

### RNN Unrolled in Time

- The diagram below shows different representations of an RNN unrolled over time

![RNN_Unrolled_In_Time](images/RNN_Unrolled_In_Time.jpg)

## RNN-Based Language Model

### Character based LM - RNN

- Used in keyboard word prediction: while characters are being typed, the model predicts the current word, and once a word has been typed, the upcoming words
- In this language model, character will be the input
  - Input will be a __*One-Hot-Vector*__
  - Since characters are the input, the size of the vocabulary will be $26$, the number of letters in the English alphabet
  - Example: For the word $\texttt{success}$, the characters $\texttt{s}$, $\texttt{u}$, $\texttt{c}$, $\texttt{c}$, $\texttt{e}$, $\texttt{s}$, $\texttt{s}$ are fed in one by one as a time series
    - First $\texttt{s}$ is given as input and it is expected to give $\texttt{u}$ as output, if not the error is backpropagated
    - Next $\texttt{u}$ is given as input and it is expected to give $\texttt{c}$ as output, if not the error is backpropagated
    - The above steps are repeated until the stop symbol $\texttt{\$}$ is encountered.
    - Multiple epochs will be executed until the model is trained.
    - Once trained, the model can predict the next character after each current character input

![RNN_Character_Based_LM](images/RNN_Character_Based_LM.jpg)
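The training data described above (one-hot inputs, next-character targets) can be sketched as follows. As an assumption for this example, the stop symbol `$` is added to the 26 letters, giving 27 symbols; the helper name `one_hot` is hypothetical.

```python
import string
import numpy as np

# 26 lowercase letters plus the stop symbol "$" (an assumption of this sketch).
alphabet = string.ascii_lowercase + "$"
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot(ch):
    """Return the one-hot vector for a single character."""
    vec = np.zeros(len(alphabet))
    vec[char_to_index[ch]] = 1.0
    return vec

word = "success" + "$"

# Each character is paired with the character the model should predict next.
pairs = list(zip(word[:-1], word[1:]))
# [('s','u'), ('u','c'), ('c','c'), ('c','e'), ('e','s'), ('s','s'), ('s','$')]

inputs = np.stack([one_hot(a) for a, _ in pairs])          # shape (7, 27)
targets = np.array([char_to_index[b] for _, b in pairs])   # class indices
```

Note that the input `s` appears with three different targets (`u`, `s` and `$`), so the current character alone is not enough; the hidden state has to carry the context that disambiguates them.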

### Word based Language Model - RNN

- Similar to character based model, we can have Word based LM as well
- Input will be a __*One-Hot-Vector*__
- We input one word at a time until the end-of-sentence symbol is encountered
- Unlike a traditional ANN, this is flexible: the number of time slices can grow with the length of the sentence, and the system can then be trained to give a model that can be used for prediction
  - Theoretically, we can handle very long sentences
- Normally this model is used for generating text

| Word LM | Example of Word LM |
| :---: | :---: |
| ![RNN_Word_Based_LM_1](images/RNN_Word_Based_LM_1.jpg) | ![RNN_Word_Based_LM_2](images/RNN_Word_Based_LM_2.jpg) |
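Text generation with a word-based model can be sketched as a loop that feeds each sampled word back in as the next input. Everything here is a hypothetical illustration: the toy vocabulary, the `<s>` and `</s>` markers, and the random untrained weights are assumptions, so the generated words are meaningless; only the control flow is of interest.

```python
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "</s>"]  # hypothetical toy vocabulary
rng = np.random.default_rng(0)
hidden_dim = 8

# Random, untrained weights (a trained model would supply these).
W = rng.normal(scale=0.5, size=(hidden_dim, len(vocab)))   # input -> hidden
U = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))   # hidden -> hidden
V = rng.normal(scale=0.5, size=(len(vocab), hidden_dim))   # hidden -> output

def generate(max_len=10):
    """Sample words one at a time until "</s>" or max_len is reached."""
    h = np.zeros(hidden_dim)
    word = "<s>"
    sentence = []
    for _ in range(max_len):
        x = np.zeros(len(vocab))
        x[vocab.index(word)] = 1.0           # one-hot vector for current word
        h = np.tanh(U @ h + W @ x)           # update the state
        logits = V @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                 # softmax over the vocabulary
        word = vocab[rng.choice(len(vocab), p=probs)]
        if word == "</s>":                   # end-of-sentence symbol: stop
            break
        sentence.append(word)
    return sentence

sentence = generate()
```

The length of the output is not fixed in advance; it is decided by when the end-of-sentence symbol is sampled, with `max_len` as a safety cap.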

### Advantages and Trouble with RNN

- Theoretically, it is possible to store all historical information in the RNN
  - As noted for the word-based model, the number of time slices can grow with the length of the sentence, so in theory very long sentences can be handled
  - When we go for a long time series
    - We get into the problem of __*Vanishing Gradient*__
    - In some cases, we get into the problem of __*Exploding Gradient*__
- Vanishing gradient problem - The diminishing value of $\large \delta$ makes it difficult to capture the long term memory as we move down the memory lane or layers of hidden nodes
  - What is the solution?
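The shrinking (or growing) of $\delta$ over many time steps can be seen in a deliberately simplified sketch. It assumes a linear recurrence, so the activation derivative is taken to be 1 and each backward step just multiplies the error signal by $U^\top$; the function name `backprop_norms` and the two example matrices are hypothetical.

```python
import numpy as np

def backprop_norms(U, steps):
    """Norm of an error signal pushed back through `steps` linear time steps."""
    delta = np.ones(U.shape[0])
    norms = [np.linalg.norm(delta)]
    for _ in range(steps):
        # One step back in time; the activation derivative is omitted
        # (assumed to be 1), which is a simplification.
        delta = U.T @ delta
        norms.append(np.linalg.norm(delta))
    return norms

small = backprop_norms(0.5 * np.eye(3), steps=30)  # shrinks by 0.5 per step
large = backprop_norms(1.5 * np.eye(3), steps=30)  # grows by 1.5 per step

print(f"after 30 steps: small={small[-1]:.2e}, large={large[-1]:.2e}")
```

With a `tanh` or sigmoid activation the derivative is at most 1, which can only shrink the signal further; this is why vanishing gradients are the more common of the two problems.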

## BPTT-Forward Pass

## Study Links

- [Chapter 7: The Simple Recurrent Network: A Simple Model that Captures the Structure in Sequences](https://web.stanford.edu/group/pdplab/pdphandbook/handbookch8.html)
- [A Beginner's Guide to LSTMs and Recurrent Neural Networks](https://skymind.ai/wiki/lstm)
- [How to implement a simple RNN](https://peterroelants.github.io/posts/rnn-implementation-part01/)
- [Rolling and Unrolling RNNs](https://shapeofdata.wordpress.com/2016/04/27/rolling-and-unrolling-rnns/)
- [Finding Structure in Time](https://crl.ucsd.edu/~elman/Papers/fsit.pdf)