# BPTT - Exploding and Vanishing Gradient

- From [v1] Lecture 59

- RNNs solved many problems that are not possible with traditional neural networks
- As people started using RNNs to solve their problems, they ran into two inherent issues:
  - __*Exploding Gradient*__
    - Values become so large that the computer starts producing 'NaN'
  - __*Vanishing Gradient*__
    - Values become extremely small, very close to zero but not equal to zero
- When either issue occurs, the neural network effectively stops updating
  - The neural network is unable to learn
    - You need a good gradient so that the network moves towards an equilibrium / local minimum that solves the problem
    - If you don't have a good gradient, the training stops, so it can't learn
- To design a good RNN for various purposes, we __*should understand these two difficulties*__ that we would face __*in the design of an RNN*__
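
A minimal sketch of why both problems arise, assuming a single scalar factor stands in for the term that the gradient is multiplied by at every time step. The helper `repeated_product` and the factors `1.5`, `0.5` and `10.0` are illustrative choices, not values from the lecture.

```python
import math

def repeated_product(factor, steps):
    """Multiply 1.0 by `factor` once per time step (hypothetical helper)."""
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value

# Factor > 1: the product keeps growing (exploding).
exploding = repeated_product(1.5, 100)
# Factor < 1: the product shrinks towards zero (vanishing).
vanishing = repeated_product(0.5, 100)
# A large enough product overflows to infinity in floating point ...
overflow = repeated_product(10.0, 400)
# ... and arithmetic on infinities then yields NaN.
not_a_number = overflow - overflow

print(exploding, vanishing, overflow, math.isnan(not_a_number))
```

The same mechanism is at work in BPTT, where the repeated factor depends on the weights and the activation derivatives rather than being a fixed constant.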

## Understanding Exploding/Vanishing Gradient

### Initial Setup

![RNN_Exploding_Vanishing_Gradient_1](images/RNN_Exploding_Vanishing_Gradient_1.jpg)
![RNN_Exploding_Vanishing_Gradient_2](images/RNN_Exploding_Vanishing_Gradient_2.jpg)
![RNN_Exploding_Vanishing_Gradient_3](images/RNN_Exploding_Vanishing_Gradient_3.jpg)
![RNN_Exploding_Vanishing_Gradient_4](images/RNN_Exploding_Vanishing_Gradient_4.jpg)

### Example: Vanishing Gradient

- RNNs are claimed to accept input sequences of any length
- But the example below makes it clear that an RNN starts forgetting its memories (See [Gradient Clipping](#Gradient_Clipping) to know more)
  - This is due to the vanishing gradient
  - With the values used (rough values given by the lecturer, rather than a mathematical proof), we see that the derivative is very small
- That is, an RNN is not able to remember a long sequence
  - The value (the gradient flowing through $U$) vanishes when the word needed for the prediction is somewhere very far away in the same sentence.

![RNN_Exploding_Vanishing_Gradient_Example](images/RNN_Exploding_Vanishing_Gradient_Example.jpg)
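
The shrinking derivative can also be sketched numerically. The code below assumes a scalar RNN with state update $s_t = \tanh(u \, s_{t-1} + w \, x_t)$, where `u` plays the role of the recurrent weight; this form, the function name `gradient_through_time` and all numeric values are assumptions for illustration and may differ from the setup in the figures.

```python
import math

def gradient_through_time(u, w, inputs, s0=0.0):
    """Return d s_t / d s_0 for every time step t of a scalar tanh RNN.

    Assumed update rule: s_t = tanh(u * s_{t-1} + w * x_t).
    By the chain rule, each step multiplies the gradient by
    d s_t / d s_{t-1} = u * (1 - s_t ** 2).
    """
    s = s0
    grad = 1.0
    history = []
    for x in inputs:
        s = math.tanh(u * s + w * x)
        grad *= u * (1.0 - s * s)
        history.append(grad)
    return history

# 30 time steps with a constant input and a recurrent weight below 1.
history = gradient_through_time(u=0.9, w=1.0, inputs=[0.5] * 30)
for t in (1, 5, 10, 30):
    print(f"step {t:2d}: d s_t / d s_0 = {history[t - 1]:.3e}")
```

Because the tanh derivative is at most 1, every factor here has magnitude at most `|u|`, so the influence of the first state on later states decays with distance.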

## Gradient Clipping

<a id='Gradient_Clipping'></a>

- The values of $\large U, V$ and $W$ are unbounded. We do not place any boundary on them.
  - If you look at the activation functions for $\large s_t$ and $\large z_t$, their values always lie in $[-1, +1]$ and $[0, 1]$ respectively
  - $U$ and $V$ are not bounded in this way, i.e., no constraint is placed on their values
  - So it is possible for $U$ to become very small during backpropagation, and when it becomes extremely small, the long-term remembrance, or long-term dependency, is gone (vanished).
    - It means the network only remembers the short term
- This implies that it is not that the RNN forgets; rather, the gradient that we propagate from the last time step back to the first time step gets smaller and smaller, which causes this issue
  - The values of $U$, $V$ and $W$ are not constrained, and because of that the gradient explodes or vanishes
- The gradient is either very large or very small.
  - This can cause the optimizer to converge slowly
- To speed up training, clip the gradient at certain values
  - If $\large g \lt -1$, then $\large g = -1$; if $\large g \gt 1$, then $\large g = 1$
  - Or
  - if $\large ||g|| \gt \textit{threshold}$, then $\large g \leftarrow \dfrac{\textit{threshold}}{||g||}g$
- Clip the gradient if it exceeds a $threshold$
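
The two clipping rules above can be sketched in plain Python. The function names `clip_by_value` and `clip_by_norm`, and the choice to represent the gradient as a list of floats with an L2 norm, are assumptions made for this example and are not taken from the lecture.

```python
import math

def clip_by_value(grad, limit=1.0):
    """Clamp every gradient component into [-limit, +limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

def clip_by_norm(grad, threshold):
    """Rescale the gradient so that its L2 norm is at most `threshold`.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)

print(clip_by_value([-3.0, 0.5, 2.0]))   # components outside [-1, 1] are clamped
print(clip_by_norm([3.0, 4.0], 1.0))     # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], 1.0))     # norm 0.5 is below the threshold, unchanged
```

Clipping by value can change the direction of the gradient, whereas clipping by norm only shortens it. Deep learning frameworks typically ship their own clipping utilities, which would normally be used instead of hand-written helpers like these.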