### Why not Feedforward Networks?

Feedforward networks are commonly used to classify images. Consider a feedforward network that has been trained to classify images of animals. If we feed it an image of a cat, it identifies the image and assigns it the relevant label. Similarly, if we feed it an image of a dog, it assigns the relevant label to that image as well.

Consider the following diagram:

![image.png](attachment:image.png)

Notice that the new output, which classifies a dog, has no relation to the previous output, which classified a cat. In other words, the output at time 't' is independent of the output at time 't-1'. So, in feedforward networks, the outputs are independent of each other.

There are scenarios, however, where we need the previous output to produce the new output. Let us discuss one such scenario.

![image-2.png](attachment:image-2.png)

Consider what happens when you read a book: you understand each word based on your understanding of the previous words. If we use a feedforward network to predict the next word in a sentence, it will struggle, because the prediction depends on the previous words. In a feedforward network, the new output is independent of the previous outputs, i.e., the output at 't+1' has no relation to the outputs at 't-2', 't-1', and 't'. Therefore, a feedforward network that sees one input at a time is poorly suited to predicting the next word in a sentence. Many other tasks similarly need the previous output, or some information from it, to infer the new output.

### How to overcome this challenge?

Consider the following diagram:

![image.png](attachment:image.png)

We have an input at 't-1', which we feed to the network to get the output at 't-1'. At the next timestamp, 't', the input at 't' is given to the network along with the information from the previous timestamp, 't-1', and together they produce the output at 't'. Similarly, to get the output at 't+1', the network receives two inputs: the new input at 't+1' and the information coming from the previous timestamp, 't'. The process continues in the same way for later timestamps. The diagram also shows a more generalized representation: a loop through which the information from the previous timestamp flows. This loop is how the network overcomes the challenge described above.
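The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the names `rnn_step`, `W_xh`, `W_hh`, and `b_h`, the choice of tanh as the activation, and the sizes are all assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: combine the new input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5   # illustrative sizes (assumed)

# Randomly initialised weights; a trained network would learn these.
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))
h = np.zeros(hidden_size)                    # nothing to remember before the first step
hidden_states = []
for x_t in inputs:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # h carries information to the next timestamp
    hidden_states.append(h)
hidden_states = np.array(hidden_states)
```

Because `h` is fed back into `rnn_step` at every iteration, the hidden state at 't' depends on the input at 't' and on everything the network saw before it.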

### What are Recurrent Neural Networks?

"Recurrent Networks are one such kind of artificial neural network that are mainly intended to identify patterns in data sequences, such as text, genomes, handwriting, the spoken word, numerical times series data emanating from sensors, stock markets, and government agencies".

### Training a Recurrent Neural Network

A recurrent neural network is trained with the backpropagation algorithm, but backpropagation is applied at every timestamp, which is why it is commonly called backpropagation through time. Backpropagation through time suffers from two well-known issues, vanishing and exploding gradients, which we will look at one by one.

#### Vanishing Gradient

As the backpropagation algorithm advances backward from the output layer towards the input layer, the gradients often get smaller and smaller and approach zero, which leaves the weights of the initial (lower) layers nearly unchanged. As a result, gradient descent may never converge to a good solution. In a recurrent network the same effect occurs across timestamps, so the influence of early inputs fades. This is known as the vanishing gradients problem.

#### Exploding Gradient

On the contrary, in some cases the gradients keep getting larger and larger as the backpropagation algorithm progresses. This causes very large weight updates and makes gradient descent diverge. This is known as the exploding gradients problem.
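Both effects can be illustrated with a toy calculation. During backpropagation through time, the gradient is multiplied by roughly one factor per timestamp. The sketch below assumes that factor is a constant scalar, which is a simplification: in a real network it is a Jacobian matrix that changes at every step. The function name `gradient_after` and the values 0.9 and 1.1 are illustrative only.

```python
def gradient_after(steps, factor, initial=1.0):
    """Scale a gradient by the same factor once per timestamp."""
    grad = initial
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = gradient_after(steps=100, factor=0.9)   # factor slightly below 1
exploding = gradient_after(steps=100, factor=1.1)   # factor slightly above 1

print(f"factor 0.9 over 100 steps: {vanishing:.2e}")
print(f"factor 1.1 over 100 steps: {exploding:.2e}")
```

After 100 steps, the first gradient has shrunk to the order of 1e-5 and the second has grown to the order of 1e4, even though both factors are close to 1.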

### How to overcome these challenges?

## Long Short-Term Memory Networks (LSTMs)

Long Short-Term Memory networks, commonly known as "LSTMs," are a special kind of recurrent neural network capable of learning long-term dependencies.

### What are long-term dependencies?

Often, a model needs only recent data to make a prediction. But at times it also needs data that appeared much earlier in the sequence.

Consider the following example to have a better understanding of it.

A machine has the task of completing the following sentences.

**Today**, due to my current job situation and family conditions, I ...

**Last year**, due to my current job situation and family conditions, I ...

Depending on whether the sentence starts with "Today" or "Last year", the auto-completed sentences will be different. For example:

**Today**, due to my current job situation and family conditions, I **need** to take a loan.

**Last year**, due to my current job situation and family conditions, I **had** to take a loan.

The decision between **need** and **had** was based on the words that appeared at the beginning of the sentence.
This is called a long-term dependency.

### LSTM Architecture

Long Short-Term Memory networks (LSTMs) are a special kind of recurrent neural network capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber in 1997. Remembering information for long periods of time is their default behavior. An LSTM unit is made up of a memory cell, an input gate, an output gate, and a forget gate. The memory cell is responsible for remembering the previous state, while the gates control how much information is written to, kept in, and exposed from that memory.

![image-2.png](attachment:image-2.png)

The memory cell keeps track of the dependencies between the elements in the input sequence. The present input and the previous output are passed to the forget gate, and the output of the forget gate is applied to the previous cell state. After that, the output of the input gate is added to the cell state. The output gate then uses the updated cell state to generate the output.
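The flow above is often written as the following set of equations. This is the commonly used formulation, given here as a summary under assumed notation: $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, $[h_{t-1}, x_t]$ is the concatenation of the previous output and the present input, and the weight matrices $W$ and biases $b$ are learned parameters.

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(new cell state)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(new output)}
\end{aligned}
$$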

### Forget Gate

Some information from the previous cell state is not needed by the present unit in an LSTM. The forget gate is responsible for removing this information from the cell state. Information that is no longer required, or that is of less importance, is removed by multiplying the cell state by a filter. This is required for optimizing the performance of the LSTM network. In other words, the forget gate determines how much of the previous state is passed on to the next state.

![image-3.png](attachment:image-3.png)

The gate has two inputs, $x_t$ and $h_{t-1}$: $h_{t-1}$ is the output of the previous cell and $x_t$ is the input at the current time step. The inputs are multiplied by the weight matrices and a bias is added. The sigmoid activation function is then applied to this value, producing a number between 0 and 1 for each entry of the cell state: values near 0 mean "forget" and values near 1 mean "keep".
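The forget gate can be sketched in NumPy as follows. This is a minimal example under assumptions: it uses the common formulation in which the previous output and the present input are concatenated before being multiplied by a single weight matrix, and the names `forget_gate`, `W_f`, `b_f`, and `c_kept`, along with the sizes and random values, are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x_t, h_prev, W_f, b_f):
    """Return values in (0, 1): the fraction of each cell-state entry to keep."""
    concat = np.concatenate([h_prev, x_t])   # previous output and present input
    return sigmoid(W_f @ concat + b_f)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4               # illustrative sizes (assumed)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)
x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)
c_prev = rng.normal(size=hidden_size)        # previous cell state

f_t = forget_gate(x_t, h_prev, W_f, b_f)
c_kept = f_t * c_prev                        # element-wise filter on the cell state
```

Each entry of `c_kept` is the matching entry of `c_prev` scaled down by a factor between 0 and 1, which is the "multiplication of a filter" described above.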

### Input Gate

![image-4.png](attachment:image-4.png)

The input gate is where new information is added to the cell state. First, a tanh function creates a vector of candidate values that could be added to the cell state, as perceived from $h_{t-1}$ and $x_t$. Second, a sigmoid function applied to the same inputs produces a filter that decides how much of each candidate to let through. The two are multiplied element-wise, and the result is added to the cell state. This step ensures that only information that is important and not redundant is added to the cell state.
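A minimal NumPy sketch of this step is shown below, again assuming the common concatenation-based formulation. The names `input_gate_update`, `W_i`, `b_i`, `W_c`, and `b_c` are illustrative, and `c_kept` is a random placeholder standing in for the cell state that has already passed through the forget gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_update(x_t, h_prev, W_i, b_i, W_c, b_c):
    """Return the filtered candidate values to add to the cell state."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)          # filter: how much of each candidate to admit
    c_candidate = np.tanh(W_c @ concat + b_c)  # candidate values in (-1, 1)
    return i_t * c_candidate

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4                 # illustrative sizes (assumed)
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)
x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)
c_kept = rng.normal(size=hidden_size)          # placeholder: cell state after the forget gate

update = input_gate_update(x_t, h_prev, W_i, b_i, W_c, b_c)
c_t = c_kept + update                          # new cell state
```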

### Output Gate

![image-5.png](attachment:image-5.png)

First, a vector is created by applying the tanh function to the cell state. Then a filter is made from the values of $h_{t-1}$ and $x_t$ to regulate which values of that vector are output. This filter again uses a sigmoid function. Finally, the vector and the filter are multiplied to form the output of the cell.
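The output step can be sketched in NumPy in the same style. As before, this assumes the common concatenation-based formulation; the names `output_gate`, `W_o`, and `b_o` are illustrative, and `c_t` is a random placeholder for the updated cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(x_t, h_prev, c_t, W_o, b_o):
    """Return the new output: a sigmoid filter applied to tanh of the cell state."""
    concat = np.concatenate([h_prev, x_t])
    o_t = sigmoid(W_o @ concat + b_o)   # filter built from the previous output and present input
    return o_t * np.tanh(c_t)           # filtered view of the cell state

rng = np.random.default_rng(2)
input_size, hidden_size = 3, 4          # illustrative sizes (assumed)
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)
x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)
c_t = rng.normal(size=hidden_size)      # placeholder: updated cell state

h_t = output_gate(x_t, h_prev, c_t, W_o, b_o)
```

The result `h_t` is both the output of the current step and the previous output that the gates of the next step receive.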

### YouTube link to understand LSTM clearly

https://www.youtube.com/watch?v=LfnrRPFhkuY