![banner](../../src/visuals/banner.png)

# Deep Learning for Sequences

It is finally time to change our data type! Up until now, we have explored a lot of PyTorch methods specifically for images. In that setting we have some static data and are trying to make some downstream prediction from it. What was missing was any temporal or sequence dimension, where a datapoint at a specific time is related to the datapoints before and after it.

We have all types of sequences in the real world!

- Natural Language is a Sequence of Words
- Speech is a Sequence of Signals at varying amplitudes
- Stock Market is a Sequence of Price data
- Videos are a Sequence of Images

The ability to model this is crucial! Here are a few of the options we currently have:

- **Convolution!** Instead of sliding a filter across a 2D image, we would just slide it across a 1D vector of values. Otherwise this is the same as everything we did with Vision. Unfortunately, a single convolution layer doesn't have any mechanism to model relationships between tokens that are further apart than the filter size. Therefore, if we have a kernel of size 7 and a sentence with 30 words, we couldn't relate the last words to the first words.

- **Recurrent Neural Network:** An RNN models sequence data by carrying a hidden state from one timestep to the next, and is trained with a technique called "Backpropagation Through Time". We will explore backprop through time a little later, but the main problem with RNNs is that they can only successfully relate short sequences, as they don't have the "memory" to remember things far back in time.

- **Long Short Term Memory:** A popular variant of the RNN that incorporates additional logic to enable remembering things further back in time!

- **Transformers:** The most popular and powerful sequence model today due to a lot of great properties! We will go into depth on the Transformer Architecture in a future lesson!
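
To make the convolution limitation from the first bullet concrete, here is a minimal pure-Python sketch. The `conv1d` helper, the all-ones kernel and the toy 30-value "sentence" are assumptions made up for illustration, not code from this lesson. With a kernel of size 7, the first output only ever sees the first 7 values, so changing the last value of the sequence cannot affect it.

```python
def conv1d(signal, kernel):
    """1D convolution without padding (cross-correlation, as in deep learning)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1.0] * 7                          # hypothetical kernel of size 7
sentence = [float(i) for i in range(30)]    # stand-in for 30 word features

out_before = conv1d(sentence, kernel)

# Change only the last "word" and convolve again
changed = sentence[:-1] + [1000.0]
out_after = conv1d(changed, kernel)

# 30 - 7 + 1 = 24 output positions
print(len(out_before))

# The first output never sees the last word, while the final output does
print(out_before[0] == out_after[0])
print(out_before[-1] == out_after[-1])
```

Stacking several convolution layers widens this window, but each layer on its own is still limited to its kernel size.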


## Recurrent Neural Network

Let's first take a look at the RNN Architecture!

![RNN](https://editor.analyticsvidhya.com/uploads/17464JywniHv.png)

[credit](https://www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn/)

On the left side we see how the RNN works. We pass in the first element of the input X, which updates a hidden state, and some weight matrix W maps that to an output. We then grab the next element of the sequence, reuse the updated hidden state from before, and again map to an output for the next timestep. This process is repeated until the sequence is complete. On the right, we see this process laid out across timesteps, and this is known as the **Unrolled View** of the RNN.
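
The loop described above can be sketched in a few lines of plain Python. This is a minimal sketch, assuming the common tanh formulation $h_t = \tanh(w_{xh} x_t + w_{hh} h_{t-1} + b)$ with a scalar input and a scalar hidden state. The weights `w_xh`, `w_hh`, `b` and the toy sequence are made-up numbers, and the hidden state itself is treated as the output at each timestep.

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b):
    """One timestep: mix the current input with the previous hidden state."""
    return math.tanh(w_xh * x_t + w_hh * h_prev + b)

# Made-up weights; note the SAME weights are reused at every timestep
w_xh, w_hh, b = 0.5, 0.8, 0.0

sequence = [1.0, -0.5, 0.25, 2.0]
h = 0.0                      # initial hidden state h_0
hidden_states = []
for x_t in sequence:
    h = rnn_step(x_t, h, w_xh, w_hh, b)   # the updated state feeds the next step
    hidden_states.append(h)

print(hidden_states)         # one hidden state per timestep
```

Writing the four calls to `rnn_step` out one after another, instead of looping, is exactly the Unrolled View.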


Here is a visual that shows you how it all fits together!
![rnngif](https://www.simplilearn.com/ice9/free_resources_article_thumb/Fully_connected_Recurrent_Neural_Network.gif)
[credit](https://www.simplilearn.com/tutorials/deep-learning-tutorial/rnn)


Now in these examples, we see that we have only a single layer of an RNN, but you can theoretically have as many as you want! Here is a visual of this that really helped me piece all this together. **Do note that this image below is actually of an LSTM which has one crucial difference from RNN!**

![Unfolded LSTM](https://i.stack.imgur.com/SjnTl.png)

[credit](https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm)

Here are the main ideas to take away from the image above!
- We start with some $h_0$ and $c_0$ known as the hidden and cell states. 
    - RNN and LSTM both have a Hidden State that acts as the "working memory" for the model. This typically cannot model long sequences very well on its own.
    - The Cell State is unique to the LSTM and acts as the long term memory for the model to remember attributes of the sequence far back in time. 
    
- The **depth** shows the number of LSTM/RNN layers we want and **t** is the number of timesteps we have in our data

- The output gives us the hidden states at every timestep of the sequence from the last LSTM/RNN layer, which can be passed to another LSTM/RNN if we wanted. On the right, we get the final hidden state/cell state for all layers, which can be used for prediction (potentially, it depends on what we want). 

    - To clarify, the output gives all hidden states $h_1, h_2, \dots, h_n$, whereas on the right we get only $h_n$, as well as $c_n$ if it is an LSTM model. 
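
To see these pieces in PyTorch, here is a small sketch using `nn.LSTM`. The batch size, sequence length and layer sizes are arbitrary numbers chosen for illustration, and `batch_first=True` is my choice so that the batch dimension comes first.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, input_size = 4, 10, 8   # illustrative sizes
hidden_size, num_layers = 16, 2              # num_layers is the "depth"

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# h_0 and c_0 default to zeros when they are not provided
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (batch, seq_len, hidden): last layer, every timestep
print(h_n.shape)     # (num_layers, batch, hidden): final hidden state, every layer
print(c_n.shape)     # (num_layers, batch, hidden): final cell state, every layer

# The last timestep of `output` is the final hidden state of the top layer
print(torch.allclose(output[:, -1, :], h_n[-1]))
```

Swapping `nn.LSTM` for `nn.RNN` gives the same `output`, but only `h_n` is returned since there is no cell state.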
    
    
## RNN vs LSTM

Going in depth into the architectural differences between RNN and LSTM and how "long term memory" is encoded is a bit tough to explain, but let me first offer some intuition. RNNs suffer from a problem that we talked about when going through ResNets: **Vanishing Gradients**. This happens because of the way these sequence models are optimized, through a technique known as backpropagation through time. 


#### Backpropagation Through Time 
First think back to the Rolled version of the RNN module. At every iteration we are passing a single timepoint through some learnable parameters, but the underlying weights to optimize are the same (i.e. you can have a sequence as long as you want but the model will be the same size, so the number of parameters does not grow with the sequence length). This means the same set of weights is optimized at every step, and we encode the progress through the sequence by updating the hidden state, which hopefully aggregates previous information. Therefore, if we have inputs $x_1, x_2, \dots, x_n$, we pass in one value at a time, update the hidden state, and output a value. Afterwards we calculate a loss depending on what our task was (classification, regression, etc.) and then apply the chain rule, backpropagating through every timestep at which an input was passed through our weights. If we had $N$ inputs in our sequence, then we will have $N$ terms to multiply together in our gradients.

It seems like a weird idea but it will make more sense with the visual:

![backpropthroughtime](https://media.licdn.com/dms/image/D5612AQEgpJmnvwHxyA/article-cover_image-shrink_600_2000/0/1680279761408?e=2147483647&v=beta&t=p1mo3UbmkkXzkO_EyyY_PKGYSYuRn8DWJF_pOUT-r-s)

[credit](https://dennybritz.com/posts/wildml/recurrent-neural-networks-tutorial-part-3/)

You can clearly see that to get back to the first timestep, $x_0$, we need to backpropagate through all the timesteps after it! So the derivative of $E_3$, the error computed from our RNN's output at timestep 3, with respect to the hidden state $s_0$ will be as such:

$$\frac{dE_3}{ds_0} =\frac{dE_3}{ds_3}\frac{ds_3}{ds_2}\frac{ds_2}{ds_1}\frac{ds_1}{ds_0}$$


Therefore, as our sequence gets longer, if each of these factors is small, their product shrinks toward zero, and we will have the same Vanishing Gradients issue that ResNet tried to solve! 
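
Here is a rough numerical sketch of that chain-rule product. It assumes a toy scalar RNN, $s_t = \tanh(w \cdot s_{t-1} + x)$, where the weight `w` and the constant input `x` are made-up values. Each factor $\frac{ds_t}{ds_{t-1}} = w(1 - s_t^2)$ is at most `w` in size, so multiplying more of them together drives the gradient toward zero.

```python
import math

def gradient_wrt_first_state(num_steps, w=0.5, x=0.1):
    """Product of d s_t / d s_{t-1} over num_steps for a scalar tanh RNN."""
    s = 0.0
    grad = 1.0
    for _ in range(num_steps):
        s = math.tanh(w * s + x)
        # Chain rule factor: derivative of tanh(w * s_prev + x) w.r.t. s_prev
        grad *= w * (1.0 - s ** 2)
    return grad

# The longer the sequence, the smaller the gradient reaching the first state
for n in (1, 5, 20, 50):
    print(n, gradient_wrt_first_state(n))
```

With a weight larger than 1 the same product can blow up instead, which is the related exploding gradient problem.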


![image.png](https://ashutoshtripathicom.files.wordpress.com/2021/06/rnn-vs-lstm.png)

[credit](https://ashutoshtripathi.com/2021/07/02/what-is-the-main-difference-between-rnn-and-lstm-nlp-rnn-vs-lstm/)

As we compare our LSTM and RNN architectures, we can see that the LSTM has much more going on, but there are a few important ways it avoids the "forgetfulness" of the RNN.
- Forget Gate: Decides what past information is important and what to remove. The gate is computed with a sigmoid, where values close to 0 mean forget and values close to 1 mean keep.
- Input Gate: A calculation to figure out how much of the information from $x_t$ (the current timestep value) and $h_{t-1}$ (the hidden state from the previous timestep) should be encoded into the cell state. Essentially, how important is the current timestep, and should we add it to our long term memory? Again it uses a sigmoid to scale between 0 and 1.
- Output Gate: Calculation of what the next hidden state should be for the next timestep. 


The Cell State update is a simple sum: the previous cell state scaled by the Forget Gate, plus the new information scaled by the Input Gate. It then moves on to the next timestep! Essentially, during backpropagation, this additive pathway through the cell state offers a new route for gradients that circumvents all the messy calculations happening in the gates, greatly reducing the vanishing gradient problem, similar to how skip connections dealt with it in ResNet!
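
To make the gates a bit less abstract, here is a minimal sketch of a single LSTM timestep with scalar values, following the standard LSTM equations. The `lstm_step` helper and every number in `params` are assumptions picked for illustration: the forget gate is biased to stay near 1 and the input gate near 0, so we can watch the cell state survive 100 timesteps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM timestep; p maps each gate to made-up (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i_t = sigmoid(pre("input"))         # how much new information to write
    g_t = math.tanh(pre("candidate"))   # the new information itself
    o_t = sigmoid(pre("output"))        # how much of the cell state to expose

    c_t = f_t * c_prev + i_t * g_t      # additive cell state update
    h_t = o_t * math.tanh(c_t)          # next hidden state
    return h_t, c_t

params = {
    "forget": (0.0, 0.0, 10.0),     # bias of +10 keeps the forget gate near 1
    "input": (0.0, 0.0, -10.0),     # bias of -10 keeps the input gate near 0
    "candidate": (1.0, 1.0, 0.0),
    "output": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0                     # start with something stored in the cell state
for _ in range(100):
    h, c = lstm_step(0.5, h, c, params)

print(c)   # stays close to the starting value of 1.0
```

In a trained LSTM the gates are learned rather than fixed like this, so the model decides for itself what to keep and what to overwrite.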


## What Kinds of Problems can we Solve with Sequence Models?

![mapping](https://api.wandb.ai/files/ayush-thakur/images/projects/103390/4fc355be.png)

[credit](https://wandb.ai/ayush-thakur/dl-question-bank/reports/LSTM-RNN-in-Keras-Examples-of-One-to-Many-Many-to-One-Many-to-Many---VmlldzoyMDIzOTM)

- One to One is what we have been doing up till now: given a single image, make a single prediction
- One to Many: Given a vector of Image Features, can we generate a text caption?
- Many to One: Can we classify a whole sequence?
- Many to Many: This can be two things
    - If Input/Output are not aligned, this would be used for Language Translation, where we input a sequence and output another sequence
    - If Input/Output are aligned, then this can be Video Classification, where we want to classify each frame but use the information in previous frames
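
As a rough sketch of how two of these mappings look in PyTorch, the snippet below runs an `nn.RNN` once and then reads the result in two ways. All sizes are arbitrary, and sharing one `classifier` layer between both cases is just a shortcut for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, input_size = 4, 12, 8   # illustrative sizes
hidden_size, num_classes = 16, 3

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
classifier = nn.Linear(hidden_size, num_classes)

x = torch.randn(batch_size, seq_len, input_size)
output, h_n = rnn(x)                 # output: (batch, seq_len, hidden)

# Many to One: classify the whole sequence from the final timestep only
many_to_one = classifier(output[:, -1, :])

# Aligned Many to Many: one prediction per timestep
many_to_many = classifier(output)

print(many_to_one.shape)    # (batch, num_classes)
print(many_to_many.shape)   # (batch, seq_len, num_classes)
```

The unaligned Many to Many case (like translation) typically needs two models, one to read the input sequence and one to generate the output sequence, so it doesn't fit in a snippet this small.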