# Natural Language Processing using RNNs and LSTMs

Examples of Sequential Data:
- Speech
- Text (NLP)
- Music
- Protein and DNA sequences
- Stock prices and other time series


Types of sequence problems, by input and output shape:
- One to one: POS tagging
- One to many: image captioning
- Many to one: sentiment analysis
- Many to many: language translation

# Exploding and Vanishing Gradients

- Vanishing:
As backpropagation moves backward from the output layer toward the input layer, the gradients often get smaller and smaller and approach zero, which leaves the weights of the initial (lower) layers nearly unchanged. As a result, gradient descent makes little progress on those layers and may never converge to a good solution. This is known as the vanishing gradients problem. In an RNN, the same effect occurs across time steps, so early inputs in a long sequence have little influence on learning.

- Exploding:
Conversely, in some cases the gradients grow larger and larger as backpropagation progresses. This causes very large weight updates and makes gradient descent diverge. This is known as the exploding gradients problem.
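
Both effects can be seen in a toy calculation. The sketch below assumes a deliberately simplified scalar recurrence `h_t = w * h_(t-1)` with no activation function, so the gradient of the last state with respect to the first is just `w` multiplied by itself once per time step. The function name and the chosen values of `w` are illustrative only.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 for the linear recurrence h_t = w * h_(t-1)."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: every time step contributes one factor of w
    return grad


# |w| < 1 shrinks the gradient toward zero, |w| > 1 blows it up.
for w in (0.5, 1.0, 1.5):
    print(f"w={w}: d h_50 / d h_0 = {gradient_through_time(w, 50):.3e}")
```

A real RNN multiplies by a weight matrix and an activation derivative at each step rather than by a single number, but the repeated-product structure is the same.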

# Recurrent Neural Networks and Natural Language Processing
Recurrent Neural Networks (RNNs) are a class of neural networks well suited to sequential data such as text, time series, financial data, speech, audio, and video.

For text, RNNs perform well at tasks such as:
- Sequence labelling
- Text classification
- Text generation

RNNs effectively have an internal memory that allows previous inputs to affect subsequent predictions. It is much easier to predict the next word in a sentence accurately if you know what the previous words were. In tasks well suited to RNNs, the order of the items is often as important as, or more important than, the item that came immediately before.
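
The "internal memory" can be illustrated with a minimal scalar RNN. This is a sketch under assumptions: the update rule `h_t = tanh(w_xh * x_t + w_hh * h_(t-1) + b)` is the textbook form, and the weights and inputs are arbitrary hand-picked values rather than trained ones. The two sequences end with the same input, yet the final hidden states differ because the earlier inputs differ.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """One step of a scalar RNN: mix the new input with the previous hidden state."""
    return math.tanh(w_xh * x + w_hh * h_prev + bias)


def run_rnn(inputs, w_xh=0.8, w_hh=0.5, bias=0.0):
    """Feed a sequence through the RNN and return the hidden state after every step."""
    h = 0.0  # initial hidden state: no memory yet
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, bias)
        states.append(h)
    return states


states_a = run_rnn([1.0, 0.0, 1.0])
states_b = run_rnn([-1.0, 0.0, 1.0])
# Same final input, different history, different final hidden state.
print(states_a[-1], states_b[-1])
```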

### Long Short-Term Memory (LSTM)
A plain RNN has only short-term memory. Replacing its recurrent unit with a Long Short-Term Memory (LSTM) cell gives the network long-term memory as well.
Instead of the single recurrent layer of an RNN, an LSTM cell is a small neural network consisting of four layers: the recurrent layer from the RNN plus three layers acting as gates.
An LSTM also has a cell state alongside the hidden state. This cell state is the long-term memory. Rather than returning only the hidden state at each iteration, the cell returns a tuple made up of the hidden state and the cell state.
An LSTM has three gates:
- An **input gate**, which controls how much new information is written at each time step.
- An **output gate**, which controls how much information is passed on to the next cell or the layer above.
- A **forget gate**, which controls how much stored information is discarded at each time step.
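
To make the interplay of the gates concrete, here is a minimal scalar LSTM step. It is a sketch that assumes the standard LSTM equations (sigmoid gates, a tanh candidate, cell update `c_t = f * c_(t-1) + i * g`); the parameter layout, names, and values are hypothetical and chosen by hand so that the forget gate stays close to 1 and the input gate close to 0, which lets the cell state act as long-term memory.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. params maps a layer name to (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate values (the RNN-style layer)
    c = f * c_prev + i * g                      # long-term memory
    h = o * math.tanh(c)                        # short-term (hidden) state
    return h, c


# Hand-picked, untrained parameters: forget gate near 1, input gate near 0.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = 0.0, 0.7
for x in [0.3, -0.9, 0.5, 0.1] * 5:  # 20 time steps of arbitrary input
    h, c = lstm_step(x, h, c, params)
print(f"cell state after 20 steps: {c:.4f}")  # stays close to the initial 0.7
```

In a trained network the gate values are learned and depend on the input, so the cell decides at each step what to keep and what to overwrite.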

### Gated recurrent unit (GRU)
A gated recurrent unit is sometimes referred to as a gated recurrent network.
At the output of each iteration there is a small neural network with three layers: the recurrent layer from the RNN, a reset gate, and an update gate. The update gate acts as both a forget gate and an input gate. Together, these two gates perform a function similar to that of the forget, input, and output gates in an LSTM.
Unlike an LSTM, which keeps the cell state and hidden state separate, a GRU merges them into a single hidden state.
#### Reset gate
The reset gate takes the activations of the previous hidden state and multiplies them by a reset factor between 0 and 1. The reset factor is calculated by a neural network with no hidden layer (like a logistic regression): it multiplies a weight matrix by the sum or concatenation of the previous hidden state and the new input, then passes the result through the sigmoid function e^x / (1 + e^x).
The gate can learn to behave differently in different situations, for example to forget more information when it sees a full stop token.
#### Update gate
The update gate controls how much of the new hidden state to take and how much of the previous hidden state to keep. This is a linear interpolation: (1 - Z) multiplied by the previous hidden state plus Z multiplied by the new hidden state, where Z is the output of the update gate, a value between 0 and 1. This controls to what degree we keep information from the previous states and to what degree we use information from the new state.
The update gate is often drawn as a switch in diagrams, although it can sit at any position between the two extremes, giving a linear interpolation between the two hidden states.
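
The two gates can be sketched as a scalar GRU step. The code assumes the interpolation convention used in the text, `(1 - Z) * previous + Z * new`, and the common form in which the reset factor scales the previous hidden state before the candidate is computed. The parameter layout, the helper `make_params`, and all numeric values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, params):
    """One scalar GRU step. params maps a layer name to (w_x, w_h, bias)."""
    def gate(name):
        w_x, w_h, bias = params[name]
        return sigmoid(w_x * x + w_h * h_prev + bias)

    r = gate("reset")   # reset factor in (0, 1)
    z = gate("update")  # update factor Z in (0, 1)
    w_x, w_h, bias = params["candidate"]
    h_new = math.tanh(w_x * x + w_h * (r * h_prev) + bias)  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_new  # linear interpolation between old and new


def make_params(update_bias):
    """Hand-picked, untrained parameters; only the update gate bias varies."""
    return {
        "reset": (0.0, 0.0, 0.0),
        "update": (0.0, 0.0, update_bias),
        "candidate": (1.0, 1.0, 0.0),
    }


kept = gru_step(-1.0, 0.6, make_params(-20.0))     # Z near 0: keep the previous state
replaced = gru_step(-1.0, 0.6, make_params(20.0))  # Z near 1: take the new state
print(kept, replaced)
```

Values of Z between these extremes blend the two states, which is why the gate is better thought of as a dial than as a switch.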

### Encoder Decoder

![image.png](attachment:image.png)


#### Encoder
- A stack of several recurrent units (LSTM or GRU cells for better performance), each of which accepts a single element of the input sequence, collects information for that element, and propagates it forward.
- In a question-answering problem, the input sequence is the collection of all words from the question. Each word is represented as x_i, where i is the position of that word.
- The hidden states h_t are computed using the formula:

$$h_t = f\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$$

This simple formula represents the result of an ordinary recurrent neural network: we apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t, then pass the sum through an activation function f.

#### Encoder Vector
- This is the final hidden state produced from the encoder part of the model. It is calculated using the formula above.
- This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.
- It acts as the initial hidden state of the decoder part of the model.
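
A minimal sketch of how an encoder vector might be produced, assuming an ordinary RNN recurrence `h_t = tanh(W_hh h_(t-1) + W_hx x_t)` with tanh as the activation. The one-hot token vectors, weight matrices, and function names are made-up illustrations, not trained values.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode(inputs, w_hh, w_hx):
    """Run the encoder over the inputs and return the final hidden state."""
    h = [0.0] * len(w_hh)  # initial hidden state
    for x in inputs:
        recurrent = matvec(w_hh, h)  # contribution of the previous hidden state
        incoming = matvec(w_hx, x)   # contribution of the current input
        h = [math.tanh(a + b) for a, b in zip(recurrent, incoming)]
    return h  # the encoder vector


# Hypothetical one-hot vectors for a three-word vocabulary.
TOKENS = {"how": [1.0, 0.0, 0.0], "are": [0.0, 1.0, 0.0], "you": [0.0, 0.0, 1.0]}
W_HH = [[0.5, -0.3], [0.2, 0.4]]             # hidden-to-hidden weights (2 x 2)
W_HX = [[0.9, -0.4, 0.1], [-0.2, 0.7, 0.3]]  # input-to-hidden weights (2 x 3)

encoder_vector = encode([TOKENS[w] for w in ["how", "are", "you"]], W_HH, W_HX)
print(encoder_vector)
```

The vector has a fixed size regardless of how many words were fed in, and it depends on word order, since each step builds on the hidden state left by the previous one.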

#### Decoder
- A stack of several recurrent units, each of which predicts an output y_t at time step t.
Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
- In the question-answering problem, the output sequence is the collection of all words from the answer. Each word is represented as y_i, where i is the position of that word.
- Any hidden state h_t is computed using the formula:

$$h_t = f\left(W^{(hh)} h_{t-1}\right)$$

As you can see, we are just using the previous hidden state to compute the next one.
- The output y_t at time step t is computed using the formula:

$$y_t = \mathrm{softmax}\left(W^{(S)} h_t\right)$$

We calculate the outputs using the hidden state at the current time step together with the respective weight W^(S). Softmax is used to create a probability vector, which helps us determine the final output (e.g. the next word in the question-answering problem).

The power of this model lies in the fact that it can map sequences of different lengths to each other. The inputs and outputs do not need to be aligned one to one, and their lengths can differ. This opens up a whole new range of problems that can be solved with such an architecture.
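
The decoder loop can be sketched in the same spirit. This assumes the simplified decoder described above, where each hidden state is computed from the previous one alone and a softmax over `W_S h_t` picks the output word. The greedy word choice, the `<eos>` stop token, the vocabulary, and the weights are all hypothetical, and with untrained weights the words produced are meaningless.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into a probability vector."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode(encoder_vector, w_hh, w_s, vocab, end_token="<eos>", max_steps=10):
    """Greedy decoding: emit the most probable word at each step until end_token."""
    h = list(encoder_vector)  # the encoder vector is the initial hidden state
    words = []
    for _ in range(max_steps):
        h = [math.tanh(v) for v in matvec(w_hh, h)]  # next hidden state
        probs = softmax(matvec(w_s, h))              # distribution over the vocabulary
        word = vocab[probs.index(max(probs))]
        if word == end_token:
            break
        words.append(word)
    return words


VOCAB = ["fine", "thanks", "<eos>"]
W_HH = [[0.5, -0.3], [0.2, 0.4]]              # hidden-to-hidden weights (2 x 2)
W_S = [[1.0, 0.5], [-0.5, 1.0], [0.2, -0.8]]  # hidden-to-vocabulary weights (3 x 2)

answer = decode([0.5, -0.2], W_HH, W_S, VOCAB, max_steps=4)
print(answer)
```

The number of words produced is governed by the stop token and `max_steps`, not by the length of the input sequence, which is what lets the model map between sequences of different lengths.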

## Some Applications of LSTM

- Sentiment Analysis
- Sentence Generation
- Machine Translation

## Advanced Applications

- Multi-layer LSTM
- Bidirectional LSTM
- LSTM with attention mechanism