# Author: Zhengxiang (Jack) Wang 
# Date: 2021-08-03
# GitHub: https://github.com/jaaack-wang 
# About: Vanishing Gradients and Fancy RNNs for Stanford CS224N - NLP with Deep Learning | Winter 2019

# Table of Contents
- [1. Vanishing gradient](#1)
    - [1.1 Problem defined and cause](#1-1)
    - [1.2 Potential problems](#1-2)
    - [1.3 Possible consequences of Vanishing Gradient on RNN-LM in possible scenarios](#1-3)
- [2. Exploding gradient](#2)
    - [2.1 Problem defined and cause](#2-1)
    - [2.2 Potential problems](#2-2)
    - [2.3 Solutions: gradient clipping](#2-3)
- [3. Long Short-Term Memory (LSTM)](#3)
    - [3.1 Description](#3-1)
    - [3.2 Graphic representation](#3-2)
    - [3.3 LSTM success and replacement](#3-3)
- [4. Gated Recurrent Units (GRU)](#4)
    - [4.1 Description](#4-1)
    - [4.2 LSTM vs GRU](#4-2)
- [5. General solutions to vanishing/exploding gradient](#5)
    - [5.1 Vanishing/exploding gradient in NN](#5-1)
    - [5.2 Residual connections](#5-2)
    - [5.3 Dense connections](#5-3)
    - [5.4 Highway connections](#5-4)
- [6. Bidirectional RNNs](#6)
    - [6.1 Motivation](#6-1)
    - [6.2 Structure](#6-2)
    - [6.3 Restrictions](#6-3)
- [7. Multi-layer RNNs (Stacked)](#7)
    - [7.1 Description](#7-1)
    - [7.2 In practice](#7-2)
- [8. Summary](#8)
- [9. References](#9)

<a name='1'></a>
# 1. Vanishing gradient

<a name='1-1'></a>
## 1.1 Problem defined and cause
- Reference: [Pascanu, R; Mikolov, T; Bengio, Y. 2013. On the difficulty of training recurrent neural networks](http://proceedings.mlr.press/v28/pascanu13.pdf)

<img src='../images/7-vanishingGradients.png' width='600' height='300'>

**Formal proof**
<img src='../images/7-vanishingGradients2.png' width='600' height='300'>

**Exploding gradient**
<img src='../images/7-vanishingGradients3.png' width='600' height='300'>


<a name='1-2'></a>
## 1.2 Potential problems

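For example, typically $\left\lVert\frac{\partial J^{(4)}}{\partial h^{(1)}}\right\rVert < \left\lVert\frac{\partial J^{(4)}}{\partial h^{(2)}}\right\rVert < \left\lVert\frac{\partial J^{(4)}}{\partial h^{(3)}}\right\rVert < \left\lVert\frac{\partial J^{(4)}}{\partial h^{(4)}}\right\rVert$. The more time steps the gradient is backpropagated through, the smaller it tends to become, so the gradient signal from distant steps (those far away from $J$) approaches 0 and is swamped by the signal from nearby steps.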

<img src='../images/7-whyVGaProblem.png' width='600' height='300'>
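The shrinking effect can be sketched with a toy model. The block below assumes a scalar, linear RNN $h^{(t)} = w \cdot h^{(t-1)}$ (no nonlinearity, no input), which is a simplification made only for illustration; `gradient_factor` is a hypothetical helper name, not something from the lecture.

```python
def gradient_factor(w: float, distance: int) -> float:
    """Return d h^(t) / d h^(t - distance) for the toy RNN h^(t) = w * h^(t-1).

    By the chain rule the derivative is w multiplied by itself `distance` times.
    """
    return w ** distance


# |w| < 1: the factor vanishes with distance; |w| > 1: it explodes.
for w in (0.5, 1.0, 1.5):
    factors = [gradient_factor(w, k) for k in (1, 5, 10, 20)]
    print(f"w={w}: {factors}")
```

In a real RNN the scalar `w` is replaced by a product of Jacobian matrices, but the same repeated multiplication is what drives the gradient toward 0 (or toward very large values).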



<a name='1-3'></a>
## 1.3 Possible consequences of Vanishing Gradient on RNN-LM in possible scenarios 

- Too long a sequence. This is a major problem because in a vanilla RNN (the RNN introduced so far), the hidden state is rewritten at every time step, which makes it hard or even impossible for the model to preserve long-distance dependencies. In other words, the farther back a piece of information is, the harder it is for the model to retain it.
<img src='../images/7-VGEffectonRNN-LM0.png' width='600' height='300'>


- Mixed info not fully learnt: when a nearby cue and a distant cue conflict, the model tends to follow the nearby one
<img src='../images/7-VGEffectonRNN-LM.png' width='600' height='300'>

<a name='2'></a>
# 2. Exploding gradient

<a name='2-1'></a>
## 2.1 Problem defined and cause

See the **Exploding gradient** figure (the third one) in [1.1](#1-1).

<a name='2-2'></a>
## 2.2 Potential problems

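- So-called overshooting: an overly large gradient makes the update step too big, so the parameters can jump to a region with a much higher loss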

<img src='../images/7-explodingGradient.png' width='600' height='300'>


<a name='2-3'></a>
## 2.3 Solutions: gradient clipping

- Reference: [Pascanu, R; Mikolov, T; Bengio, Y. 2013. On the difficulty of training recurrent neural networks](http://proceedings.mlr.press/v28/pascanu13.pdf)

<img src='../images/7-GradientClipping.png' width='600' height='300'>

- Reference: [“Deep Learning”, Goodfellow, Bengio and Courville, 2016. Chapter 10.11.1. Sequence Modeling: Recurrent and Recursive Nets](https://www.deeplearningbook.org/contents/rnn.html)
<img src='../images/7-GradientClipping2.png' width='600' height='300'>

- More sophisticated optimizers with adaptive learning rates, such as Adam, Adagrad, and RMSprop, can also help mitigate exploding gradients.
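Gradient clipping by norm can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the gradient is given as a flat list of floats; `clip_by_norm` and `threshold` are hypothetical names chosen here.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale `grad` so its L2 norm does not exceed `threshold`.

    The direction of the gradient is kept; only its length is reduced.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)  # already small enough, return an unchanged copy


clipped = clip_by_norm([3.0, 4.0], threshold=1.0)  # original norm is 5
print(clipped, math.sqrt(sum(g * g for g in clipped)))
```

Deep learning frameworks ship their own clipping utilities; the sketch only shows the rescaling step itself.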



<a name='3'></a>
# 3. Long Short-Term Memory (LSTM)

<a name='3-1'></a>
## 3.1 Description
- [Hochreiter and Schmidhuber, 1997. “Long short-term memory”.](https://www.bioinf.jku.at/publications/older/2604.pdf)

<img src='../images/7-LSTMDesc.png' width='600' height='300'>


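- The forget gate is loosely similar to the idea of Dropout in deep neural networks in that it masks part of the state, although the mask is learned rather than random. When the gate is set to remember, the cell state can carry information across many steps, which reduces the risk of vanishing gradients.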
<img src='../images/7-LSTMDesc2.png' width='600' height='300'>
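A single LSTM step can be sketched as follows. This is a plain-Python illustration of the standard gate equations (forget, input, output gates plus a candidate cell), assuming each gate has its own `(W, U, b)` triple stored in a dict; the layout of `params` and the helper names are assumptions made for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, h, x):
    """Compute W h + U x + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * hj for w, hj in zip(W_row, h))
        + sum(u * xj for u, xj in zip(U_row, x))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(z) for z in affine(W, U, b, h_prev, x)]

    f = gate("f", sigmoid)        # forget gate: what to keep from c_prev
    i = gate("i", sigmoid)        # input gate: what to write into the cell
    o = gate("o", sigmoid)        # output gate: what to expose as h
    c_tilde = gate("c", math.tanh)  # candidate cell content
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Hidden size 1, input size 1. A large forget bias and a very negative
# input bias make the cell state carry over (almost) unchanged.
zeros = ([[0.0]], [[0.0]], [0.0])
params = {
    "f": ([[0.0]], [[0.0]], [100.0]),
    "i": ([[0.0]], [[0.0]], [-100.0]),
    "o": zeros,
    "c": zeros,
}
h, c = lstm_step([1.0], [0.0], [2.0], params)
print(h, c)
```

The additive update of `c` is what lets information survive many steps: when `f` is close to 1 and `i` close to 0, the previous cell state is passed through.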


<a name='3-2'></a>
## 3.2 Graphic representation
- Reference: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
<img src='../images/7-LSTMDiag.png' width='600' height='300'>


<a name='3-3'></a>
## 3.3 LSTM success and replacement

- Source: "Findings of the 2016 Conference on Machine Translation (WMT16)", Bojar et al. 2016, http://www.statmt.org/wmt16/pdf/W16-2301.pdf 
- Source: "Findings of the 2018 Conference on Machine Translation (WMT18)", Bojar et al. 2018, http://www.statmt.org/wmt18/pdf/WMT028.pdf


**Research paradigms are changing very fast!**
<img src='../images/7-LSTMSuccess.png' width='600' height='300'>

<a name='4'></a>
# 4. Gated Recurrent Units (GRU)

<a name='4-1'></a>
## 4.1 Description 

- Reference: "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation", Cho et al. 2014, https://arxiv.org/pdf/1406.1078v3.pdf

<img src='../images/7-GRUDesc.png' width='600' height='300'>
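A single GRU step can be sketched in the same style. The block assumes the convention in which the update gate `u` weights the new candidate, i.e. $h^{(t)} = (1-u) \circ h^{(t-1)} + u \circ \tilde{h}^{(t)}$ (some papers swap the roles of `u` and `1 - u`); the `params` layout and helper names are assumptions made for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, h, x):
    """Compute W h + U x + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * hj for w, hj in zip(W_row, h))
        + sum(u * xj for u, xj in zip(U_row, x))
        + b_i
        for W_row, U_row, b_i in zip(W, U, b)
    ]


def gru_step(x, h_prev, params):
    """One GRU time step; returns the new hidden state (there is no cell state)."""
    W_u, U_u, b_u = params["u"]
    u = [sigmoid(z) for z in affine(W_u, U_u, b_u, h_prev, x)]  # update gate
    W_r, U_r, b_r = params["r"]
    r = [sigmoid(z) for z in affine(W_r, U_r, b_r, h_prev, x)]  # reset gate
    W_h, U_h, b_h = params["h"]
    reset_h = [rt * hp for rt, hp in zip(r, h_prev)]  # r selects parts of h_prev
    h_tilde = [math.tanh(z) for z in affine(W_h, U_h, b_h, reset_h, x)]
    # Interpolate between keeping the old state and taking the candidate.
    return [(1.0 - ut) * hp + ut * ht for ut, hp, ht in zip(u, h_prev, h_tilde)]


zeros = ([[0.0]], [[0.0]], [0.0])
keep = {"u": ([[0.0]], [[0.0]], [-100.0]), "r": zeros, "h": zeros}
print(gru_step([1.0], [0.8], keep))  # update gate near 0, so h_prev is retained
```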


<a name='4-2'></a>
## 4.2 LSTM vs GRU

- Researchers have proposed many gated RNN variants, but LSTM and GRU are the most widely-used
- The biggest difference is that **GRU is quicker** to compute and has fewer parameters
- There is **no conclusive evidence** that one consistently performs better than the other
- **LSTM is a good default choice** (especially if your data has particularly long dependencies, or you have lots of training data)
- **Rule of thumb**: start with LSTM, but switch to GRU if you want something more efficient
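The "fewer parameters" point can be checked with simple arithmetic. The sketch below assumes each gate or candidate block has one recurrent matrix, one input matrix, and one bias vector, so an LSTM has 4 such blocks and a GRU has 3; `gated_rnn_param_count` is a hypothetical helper, and library implementations may count biases differently.

```python
def gated_rnn_param_count(input_size, hidden_size, num_blocks):
    """Count parameters of a gated RNN layer with `num_blocks` affine blocks.

    Each block holds W (hidden x hidden), U (hidden x input) and b (hidden).
    """
    per_block = hidden_size * hidden_size + hidden_size * input_size + hidden_size
    return num_blocks * per_block


lstm_params = gated_rnn_param_count(input_size=100, hidden_size=200, num_blocks=4)
gru_params = gated_rnn_param_count(input_size=100, hidden_size=200, num_blocks=3)
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.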

<a name='5'></a>
# 5. General solutions to vanishing/exploding gradient 

<a name='5-1'></a>
## 5.1 Vanishing/exploding gradient in NN

Vanishing/exploding gradient is a problem not only for RNNs but for all neural networks (including feed-forward and convolutional ones), especially deep ones. **For RNNs, however, these problems are more serious because of their design (i.e., the repeated multiplication by the same weight matrix)**. See: "Learning Long-Term Dependencies with Gradient Descent is Difficult", Bengio et al. 1994, http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf.

**Causes and solutions:**
- Due to chain rule / choice of nonlinearity function, gradient can become vanishingly small as it backpropagates
- Thus lower layers are learnt very slowly (hard to train)
- Solution: lots of new deep feedforward/convolutional architectures that add more direct connections (thus allowing the gradient to flow)


<a name='5-2'></a>
## 5.2 Residual connections

- Reference: [He et al. 2015. Deep Residual Learning for Image Recognition.](https://arxiv.org/pdf/1512.03385.pdf)
- This is a very general trick
<img src='../images/7-ResNet.png' width='600' height='300'>
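Why the identity connection helps the gradient can be sketched with scalars. The block assumes a toy stack of layers $F(x) = \tanh(w x)$ and compares the input gradient of a plain stack with that of a residual stack $x + F(x)$; `stack_gradient` is a hypothetical helper written for this illustration.

```python
import math


def stack_gradient(x, weights, residual):
    """Return d(output)/d(input) of a stack of scalar tanh layers via the chain rule."""
    grad = 1.0
    for w in weights:
        out = math.tanh(w * x)
        local = w * (1.0 - out * out)  # derivative of tanh(w * x) w.r.t. x
        if residual:
            # y = x + F(x), so dy/dx = 1 + F'(x): the identity path adds 1.
            x, grad = x + out, grad * (1.0 + local)
        else:
            # y = F(x), so dy/dx = F'(x), which is small when w is small.
            x, grad = out, grad * local
    return grad


weights = [0.1] * 20  # 20 layers with small weights
print(stack_gradient(0.5, weights, residual=False))  # vanishingly small
print(stack_gradient(0.5, weights, residual=True))   # stays at or above 1
```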


<a name='5-3'></a>
## 5.3 Dense connections

- Reference: "Densely Connected Convolutional Networks", Huang et al. 2017. https://arxiv.org/pdf/1608.06993.pdf
- This is more specific to CNN
<img src='../images/7-DenseNet.png' width='600' height='300'>


<a name='5-4'></a>
## 5.4 Highway connections 

- Reference: "Highway Networks", Srivastava et al. 2015. https://arxiv.org/pdf/1505.00387.pdf
- Highway connections aka “HighwayNet”
- Similar to residual connections, but the identity connection vs the transformation layer is controlled by a dynamic gate
- Inspired by LSTMs, but applied to deep feedforward/convolutional networks
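The gating idea can be sketched with a scalar highway layer, $y = T(x) \cdot H(x) + (1 - T(x)) \cdot x$, where $T$ is a sigmoid transform gate and $H$ is the transformation. The parameter names `w_h`, `b_h`, `w_t`, `b_t` and the choice of `tanh` for $H$ are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Scalar highway layer: a gate t mixes the transformed value with the input."""
    h = math.tanh(w_h * x + b_h)  # transformation H(x)
    t = sigmoid(w_t * x + b_t)    # transform gate T(x), computed from the input
    return t * h + (1.0 - t) * x  # t near 0: pass x through; t near 1: use H(x)


print(highway_layer(0.7, w_h=2.0, b_h=0.0, w_t=0.0, b_t=-100.0))  # close to the input
print(highway_layer(0.7, w_h=2.0, b_h=0.0, w_t=0.0, b_t=100.0))   # close to tanh(1.4)
```

A residual connection corresponds to always adding the identity path, whereas here the gate decides per input how much of it to use.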

<a name='6'></a>
# 6. Bidirectional RNNs

<a name='6-1'></a>
## 6.1 Motivation

- **Contextual representation**
- Look in both directions: the representation of a word can draw on both its left and right context
<img src='../images/7-BiRNN.png' width='600' height='300'>


<a name='6-2'></a>
## 6.2 Structure 
<img src='../images/7-BiRNN2.png' width='600' height='300'>
<img src='../images/7-BiRNN3.png' width='600' height='300'>
<img src='../images/7-BiRNN4.png' width='600' height='300'>
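The structure can be sketched with two scalar vanilla RNNs, one reading left to right and one reading right to left, whose states are paired at each position. The scalar weights and the helper names `rnn_states` and `birnn_states` are assumptions made to keep the illustration short.

```python
import math


def rnn_states(inputs, w_h, w_x, h0=0.0):
    """Scalar vanilla RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t); returns all h_t."""
    states, h = [], h0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def birnn_states(inputs, fwd, bwd):
    """Pair forward and backward states; each direction has its own weights."""
    forward = rnn_states(inputs, *fwd)
    # Run the backward RNN on the reversed sequence, then re-align positions.
    backward = rnn_states(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))  # concatenated state per position


states = birnn_states([1.0, 2.0, 3.0], fwd=(0.5, 1.0), bwd=(0.5, 1.0))
for t, (f, b) in enumerate(states):
    print(t, f, b)
```

The forward state at position 0 sees only the first input, while the backward state at position 0 has seen the whole sequence, which is why the full input must be available.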


<a name='6-3'></a>
## 6.3 Restrictions
<img src='../images/7-BiRNNRestriction.png' width='600' height='300'>



<a name='7'></a>
# 7. Multi-layer RNNs (Stacked)

<a name='7-1'></a>
## 7.1 Description
- RNNs are already “deep” on one dimension (they unroll over many timesteps)
- We can also make them “deep” in another dimension by applying multiple RNNs – this is a multi-layer RNN.
- This allows the network to compute more complex representations
- The lower RNNs should compute lower-level features and the higher RNNs should compute higher-level features.
- Multi-layer RNNs are also called stacked RNNs.


- This can be bidirectional provided that the entire input sentence is accessible.
<img src='../images/7-MultiLRNN.png' width='600' height='300'>
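Stacking can be sketched by feeding the hidden states of one RNN as the input sequence of the next. The block reuses a scalar vanilla RNN for brevity; `stacked_rnn` and the `(w_h, w_x)` layer format are hypothetical names chosen for this illustration.

```python
import math


def rnn_states(inputs, w_h, w_x, h0=0.0):
    """Scalar vanilla RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t); returns all h_t."""
    states, h = [], h0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


def stacked_rnn(inputs, layers):
    """Run several RNN layers; layer i+1 reads the hidden states of layer i."""
    sequence = inputs
    for w_h, w_x in layers:
        sequence = rnn_states(sequence, w_h, w_x)
    return sequence  # hidden states of the top layer


top = stacked_rnn([1.0, 2.0, 3.0], layers=[(0.5, 1.0), (0.2, 0.7)])
print(top)
```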


<a name='7-2'></a>
## 7.2 In practice 
- Reference: “Massive Exploration of Neural Machine Translation Architectures”, Britz et al. 2017. https://arxiv.org/pdf/1703.03906.pdf
- Skip connections are usually used heavily, especially when training deeper RNNs.
<img src='../images/7-MultiLRNNInPractice.png' width='600' height='300'>

<a name='8'></a>
# 8. Summary

<img src='../images/7-Summary.png' width='600' height='300'>

<a name='9'></a>
# 9. References

- [Course website](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/index.html)

- [Lecture video](https://www.youtube.com/watch?v=QEw0qEa0E50&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&index=7) 

- [Lecture slides](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture07-fancy-rnn.pdf)