# General RNN

$\textbf{Backpropagation through time (BPTT)}$:
$$h^{(t)} = tanh(b + W h^{(t-1)} + U x^{(t)})$$
$$o^{(t)} = softmax(c + V h^{(t)}), L = -\sum_{t=1}^{\tau} \sum_{k=1}^K I(k=y^{(t)}) log \left({o}^{(t)}_k \right)$$
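
The forward pass and loss above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`matvec`, `add`, `rnn_forward`), the zero initial state, and the toy weights are all assumptions made for the example.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(xs, ys, h0, W, U, V, b, c):
    """Run the recurrence over a sequence; return per-step outputs and the loss L."""
    h, outputs, loss = h0, [], 0.0
    for x, y in zip(xs, ys):
        h = [math.tanh(a) for a in add(b, matvec(W, h), matvec(U, x))]
        o = softmax(add(c, matvec(V, h)))
        outputs.append(o)
        loss -= math.log(o[y])  # the indicator keeps only the true class y
    return outputs, loss

# Toy sizes (arbitrary): 2 inputs, 2 hidden units, 3 classes, 3 time steps.
W = [[0.5, -0.3], [0.1, 0.4]]
U = [[0.2, 0.7], [-0.6, 0.3]]
V = [[0.3, -0.2], [0.1, 0.5], [-0.4, 0.2]]
b, c = [0.0, 0.0], [0.0, 0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [0, 2, 1]  # class indices y^(t)
outputs, loss = rnn_forward(xs, ys, [0.0, 0.0], W, U, V, b, c)
```

BPTT then differentiates `loss` with respect to the shared parameters by unrolling this loop and applying the chain rule backwards through every time step.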

<img src="https://miro.medium.com/max/4136/1*SKGAqkVVzT6co-sZ29ze-g.png" alt="RNN Architecture" style="width:700px;"/>

$\textbf{RNN Task Applications}$:
- Many-to-One: Sentiment Analysis, Video Activity Recognition
- Many-to-Many: Machine Translation, Speech Recognition, Named Entity Recognition
- One-to-Many: Music Generation

<img src="https://cbare.github.io/images/recurrent-neural-net-types.png" alt="RNN types" style="width:700px;"/>

$\textbf{Handle Gradient Explosion}$:
- Gradient Clipping: cap the gradient at a chosen threshold, for example by rescaling it whenever its norm exceeds that threshold
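
One common variant is clipping by norm, sketched below. The function name `clip_by_norm` and the threshold `max_norm` are illustrative assumptions; clipping each component to a fixed range is another option that is not shown here.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; the direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5, rescaled to norm 1
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left as is
```

Rescaling keeps the update direction and only limits the step size, which is why it is usually preferred over simply truncating large values.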

$\textbf{Handle Gradient Vanishing}$:
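- ReLU: the derivative is exactly 1 for positive inputs, so it does not shrink the gradient the way a saturating tanh or sigmoid does
- Batch Normalization
  - speeds up convergence
  - helps control overfitting
  - reduces the need for dropout and other regularization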
- LSTM
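
To see where the vanishing comes from, the sketch below tracks $|\partial h^{(t)} / \partial h^{(0)}|$ in a scalar tanh RNN. The scalar setting, the function name `gradient_through_time`, and the chosen weights are assumptions made to keep the example small.

```python
import math

def gradient_through_time(w, u, xs, h0=0.0):
    """Return |d h^(t) / d h^(0)| for each t in a scalar tanh RNN."""
    h, grad, history = h0, 1.0, []
    for x in xs:
        h = math.tanh(w * h + u * x)
        grad *= w * (1.0 - h * h)  # chain rule: tanh'(a) = 1 - tanh(a)^2
        history.append(abs(grad))
    return history

# With |w| < 1 every step multiplies the gradient by a factor below 1.
history = gradient_through_time(w=0.5, u=1.0, xs=[0.5] * 20)
```

Each factor is at most `|w|` in magnitude, so after 20 steps the gradient has shrunk geometrically. The remedies listed above all aim to keep these per-step factors close to 1.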

## Bidirectional RNN

$$h^{(t)} = tanh(b_{left} + W_{left} h^{(t-1)} + U_{left} x^{(t)})$$
$$g^{(t)} = tanh(b_{right} + W_{right} g^{(t+1)} + U_{right} x^{(t)})$$
$$o^{(t)} = softmax(c + V_{left} h^{(t)} + V_{right} g^{(t)})$$
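
A minimal sketch of the two passes, assuming the right-to-left state $g^{(t)}$ is computed by running the same recurrence over the reversed sequence. The helper names, zero initial states, and toy weights are illustrative assumptions.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_states(xs, W, U, b):
    """Hidden state after each element of xs, starting from a zero state."""
    h, states = [0.0] * len(b), []
    for x in xs:
        h = [math.tanh(a) for a in add(b, matvec(W, h), matvec(U, x))]
        states.append(h)
    return states

def birnn_forward(xs, left, right, V_left, V_right, c):
    hs = rnn_states(xs, *left)               # h^(t) summarizes x^(1..t)
    gs = rnn_states(xs[::-1], *right)[::-1]  # g^(t) summarizes x^(t..tau)
    return [softmax(add(c, matvec(V_left, h), matvec(V_right, g)))
            for h, g in zip(hs, gs)]

# Each direction has its own (W, U, b); sizes are 1 input, 2 hidden, 2 classes.
left = ([[0.5, 0.0], [0.0, 0.5]], [[1.0], [-1.0]], [0.0, 0.0])
right = ([[0.3, 0.2], [-0.2, 0.3]], [[0.5], [0.5]], [0.0, 0.0])
V_left = [[1.0, 0.0], [0.0, 1.0]]
V_right = [[0.5, 0.2], [-0.3, 0.4]]
c = [0.0, 0.0]

first = birnn_forward([[1.0], [0.5], [-1.0]], left, right, V_left, V_right, c)
second = birnn_forward([[1.0], [0.5], [1.0]], left, right, V_left, V_right, c)
```

The two inputs differ only in the last element, yet `first[0]` and `second[0]` differ: the output at the first step already depends on the end of the sequence, which a one-directional RNN cannot achieve.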

<img src="http://www.huaxiaozhuan.com/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/imgs/dl_rnn/bi_RNN.png" align="left" alt="Bi-RNN" style="width:500px;"/>

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/bidirectional-rnn-ltr.png?e3e66fae56ea500924825017917b464a" align="left" alt="bi-RNN-stanford" style="width:400px;"/>

## Deep RNN

$$h^{(t)} = tanh(b_1 + W_1 h^{(t-1)} + U x^{(t)})$$
$$z^{(t)} = tanh(b_2 + W_2 z^{(t-1)} + R h^{(t)})$$
$$o^{(t)} = softmax(c + V z^{(t)})$$
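
A two-layer sketch of the stacked recurrence, assuming the second layer receives the first layer's state $h^{(t)}$ through $R$. Helper names, zero initial states, and toy weights are illustrative assumptions.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def tanh_vec(v):
    return [math.tanh(a) for a in v]

def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def deep_rnn_forward(xs, layer1, layer2, V, c):
    W1, U, b1 = layer1
    W2, R, b2 = layer2
    h = [0.0] * len(b1)  # first-layer state
    z = [0.0] * len(b2)  # second-layer state
    outputs = []
    for x in xs:
        h = tanh_vec(add(b1, matvec(W1, h), matvec(U, x)))  # reads the input
        z = tanh_vec(add(b2, matvec(W2, z), matvec(R, h)))  # reads layer 1
        outputs.append(softmax(add(c, matvec(V, z))))
    return outputs

# Toy sizes (arbitrary): 1 input, 2 units per layer, 2 classes.
layer1 = ([[0.4, -0.1], [0.2, 0.3]], [[1.0], [-0.5]], [0.0, 0.0])
layer2 = ([[0.3, 0.1], [-0.2, 0.5]], [[0.6, -0.4], [0.2, 0.7]], [0.0, 0.0])
V = [[0.5, -0.5], [-0.5, 0.5]]
c = [0.0, 0.0]
outputs = deep_rnn_forward([[1.0], [0.0], [-1.0]], layer1, layer2, V, c)
```

Only the top layer feeds the softmax; each layer keeps its own recurrent weights and state.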

<img src="http://www.huaxiaozhuan.com/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/imgs/dl_rnn/more_hiden.png" align="left" alt="Deep-RNN" style="width:500px;"/>

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/deep-rnn-ltr.png?f57da6de44ddd4709ad3b696cac6a912" align="right" alt="deep-RNN-stanford" style="width:400px;"/>

# Long Short Term Memory (LSTM)

$\textbf{Key Components}$:

$$\text{forget gate:} {f}^{(t)} = \sigma \left({b}^{f} + U^f {x}^{(t)} + W^f {h}^{(t-1)} \right)$$

$$\text{input gate:} {g}^{(t)} = \sigma \left({b}^{i} + U^i {x}^{(t)} + W^i {h}^{(t-1)} \right)$$

$$\text{output gate:} {q}^{(t)} = \sigma \left({b}^{o} + U^o {x}^{(t)} + W^o {h}^{(t-1)} \right)$$

$$\text{cell state:} {C}^{(t)} = {f}^{(t)} \odot {C}^{(t-1)} + {g}^{(t)} \odot tanh \left({b} + U {x}^{(t)} + W {h}^{(t-1)} \right)$$

$$\text{cell output:} {h}^{(t)} = tanh \left({C}^{(t)} \right) \odot {q}^{(t)}$$

$${o}^{(t)} = softmax(c + V {h}^{(t)}), L = -\sum_{t=1}^{\tau} \sum_{k=1}^K I(k=y^{(t)}) log \left({o}^{(t)}_k \right)$$
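
The gate and state updates above can be sketched as a single step function. The name `lstm_step`, the parameter dictionary layout, and the toy values are assumptions for illustration; the softmax output layer is left out to keep the focus on the cell.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh_vec(v):
    return [math.tanh(a) for a in v]

def hadamard(a, b):
    return [p * q for p, q in zip(a, b)]

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM step; params maps a name to its (U, W, b) triple."""
    def affine(name):
        U, W, b = params[name]
        return add(b, matvec(U, x), matvec(W, h_prev))
    f = sigmoid(affine("forget"))  # how much old memory to keep
    g = sigmoid(affine("input"))   # how much new content to write
    q = sigmoid(affine("output"))  # how much of the cell to expose
    candidate = tanh_vec(affine("cell"))
    C = add(hadamard(f, C_prev), hadamard(g, candidate))
    h = hadamard(tanh_vec(C), q)
    return h, C

# Forget gate saturated open and input gate saturated shut (large biases).
zeros_U = [[0.0], [0.0]]
zeros_W = [[0.0, 0.0], [0.0, 0.0]]
params = {
    "forget": (zeros_U, zeros_W, [50.0, 50.0]),
    "input": (zeros_U, zeros_W, [-50.0, -50.0]),
    "output": (zeros_U, zeros_W, [0.0, 0.0]),
    "cell": (zeros_U, zeros_W, [0.0, 0.0]),
}
h, C = [0.0, 0.0], [0.7, -0.3]
for x in [[1.0], [-2.0], [0.5]]:
    h, C = lstm_step(x, h, C, params)
```

With these deliberately extreme biases the cell state is carried across all three steps unchanged, which is the memory behaviour the gates are designed to allow.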

<img src="http://www.huaxiaozhuan.com/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/imgs/dl_rnn/lstm_1.png" alt="LSTM" style="width:700px;"/>

$\textbf{Handle gradient vanishing}$:
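- the cell state is updated additively: the gated old memory and the gated new input are summed rather than squashed through tanh at every step
- the influence of an earlier input on the cell state persists until the forget gate closes ($f^{(t)} \approx 0$)
- while the forget gate stays open ($f^{(t)} \approx 1$), the gradient flows back along the cell state almost unchanged, so it does not vanish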
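
The points above can be made concrete with a short derivation. As a simplifying assumption, only the direct path through the cell state is followed, and the gates are treated as constants with respect to $C^{(t-1)}$ (their indirect dependence through $h^{(t-1)}$ is ignored):

$$\frac{\partial C^{(t)}}{\partial C^{(t-1)}} \approx \text{diag} \left( f^{(t)} \right), \quad \frac{\partial C^{(t)}}{\partial C^{(k)}} \approx \prod_{j=k+1}^{t} \text{diag} \left( f^{(j)} \right)$$

For comparison, the plain RNN has

$$\frac{\partial h^{(t)}}{\partial h^{(t-1)}} = \text{diag} \left( 1 - (h^{(t)})^2 \right) W$$

In the plain RNN every step multiplies by $W$ and by a tanh derivative that is at most 1, so the product tends to shrink. In the LSTM the per-step factor is the forget gate itself, which the network can learn to hold near 1.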

# Gated Recurrent Unit (GRU)

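- two gates (update and reset) instead of LSTM's three: no output gate and no separate cell state, with the update gate covering the roles of the forget and input gates
- fewer parameters than LSTM, so it often trains faster and can do well with less training data
- simpler and easier to modify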

$\textbf{Key Components}$:

$$\text{update gate:} {z}^{(t)} = \sigma \left({b}^{z} + U^z {x}^{(t)} + W^z {h}^{(t-1)} \right)$$

$$\text{reset gate:} {r}^{(t)} = \sigma \left({b}^{r} + U^r {x}^{(t)} + W^r {h}^{(t-1)} \right)$$

$$\text{cell output:} {h}^{(t)} = {z}^{(t)} \odot {h}^{(t-1)} + (1 - {z}^{(t)}) \odot tanh \left({b} + U x^{(t)} + W \left( r^{(t)} \odot h^{(t-1)} \right) \right)$$

$${o}^{(t)} = softmax(c + V {h}^{(t)}), L = -\sum_{t=1}^{\tau} \sum_{k=1}^K I(k=y^{(t)}) log \left({o}^{(t)}_k \right)$$
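
A single GRU step can be sketched as follows, reading the reset gate as an element-wise mask on $h^{(t-1)}$ inside the candidate. The name `gru_step`, the parameter dictionary layout, and the toy values are assumptions for illustration.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def hadamard(a, b):
    return [p * q for p, q in zip(a, b)]

def gru_step(x, h_prev, params):
    """One GRU step; params maps a name to its (U, W, b) triple."""
    def gate(name):
        U, W, b = params[name]
        return sigmoid(add(b, matvec(U, x), matvec(W, h_prev)))
    z = gate("update")  # near 1 keeps the old state, near 0 overwrites it
    r = gate("reset")   # masks the old state inside the candidate
    U, W, b = params["candidate"]
    pre = add(b, matvec(U, x), matvec(W, hadamard(r, h_prev)))
    candidate = [math.tanh(a) for a in pre]
    return [zi * hp + (1.0 - zi) * ci
            for zi, hp, ci in zip(z, h_prev, candidate)]

zeros_U = [[0.0], [0.0]]
zeros_W = [[0.0, 0.0], [0.0, 0.0]]

def make_params(update_bias):
    """Toy parameters where only the update-gate bias varies."""
    return {
        "update": (zeros_U, zeros_W, [update_bias, update_bias]),
        "reset": (zeros_U, zeros_W, [0.0, 0.0]),
        "candidate": ([[1.0], [-1.0]], zeros_W, [0.0, 0.0]),
    }

h_prev = [0.7, -0.3]
kept = gru_step([0.5], h_prev, make_params(50.0))       # z saturated at 1
replaced = gru_step([0.5], h_prev, make_params(-50.0))  # z close to 0
```

The two calls show the extremes of the update gate: `kept` reproduces `h_prev`, while `replaced` is essentially the new candidate computed from the input.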

<img src="http://www.huaxiaozhuan.com/%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0/imgs/dl_rnn/GRU.png" alt="GRU" style="width:300px;"/>


# References

- RNN: https://www.coursera.org/learn/nlp-sequence-models
- LSTM: https://www.bioinf.jku.at/publications/older/2604.pdf
- GRU: https://arxiv.org/pdf/1406.1078.pdf