# Recurrent Neural Networks (RNNs)

## Advantages

- RNNs work well with **sequential data** where order matters.
- Useful in Natural Language Processing applications.

## Disadvantages

- RNNs share the disadvantages of neural networks in general.

## References

1. Fundamentals of Deep Learning, Chapter 7.
1. Hands-On Machine Learning, Chapters 15 and 16.
1. Practical Natural Language Processing.
1. Multi-class Text Classification with LSTM: [https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17](https://towardsdatascience.com/multi-class-text-classification-with-lstm-1590bee1bd17)

# Gated Recurrent Unit (GRU)

**Gated recurrent units** combine the new input $x_t$ with the internal state $h_{t-1}$ from the previous step.

Gates use **sigmoid** activation $\sigma$ and **element-wise product** $\odot$.

Update gate:

$$
    z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
$$

Reset gate:

$$
    r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
$$

Current memory:

$$
    \tilde{h_t} = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])
$$

Internal state (output):

$$
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h_t}
$$
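
The four GRU equations can be sketched as a single step function in NumPy. This is a minimal illustration, not a reference implementation: the function name `gru_step`, the layer sizes, and the random weights are assumptions made for the example, and bias terms are omitted because the equations above do not include them.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step. Each weight matrix has shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                 # update gate
    r_t = sigmoid(W_r @ concat)                 # reset gate
    # Candidate state: the reset gate scales h_{t-1} element-wise
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))
    # Blend the old state and the candidate using the update gate
    return (1.0 - z_t) * h_prev + z_t * h_tilde


# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
hidden, n_in = 4, 3
W_z, W_r, W = (rng.normal(scale=0.1, size=(hidden, hidden + n_in)) for _ in range(3))

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):          # a sequence of 5 inputs
    h = gru_step(x_t, h, W_z, W_r, W)
print(h.shape)
```

Because $h_t$ is an element-wise blend of $h_{t-1}$ and a $\tanh$ output, every component of the state stays inside $(-1, 1)$ when the initial state is zero.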

# LSTM: Long Short-Term Memory

## The Short-term Memory Problem

## Process

What information we want to forget:

$$
    f_t = \sigma (W_f \cdot [h_{t-1},x_t] + b_f)
$$

What new information we want:

$$
    i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\
    \tilde{C_t} = \tanh (W_C \cdot[h_{t-1}, x_t] + b_C)
$$

Update the old cell state:

$$
    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C_t}
$$

What do we output:

$$
    O_t = \sigma (W_O \cdot [h_{t-1}, x_t] + b_O)\\
    h_t = O_t \odot \tanh(C_t)
$$
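
The forget, input, update, and output steps above can be sketched as one NumPy function. The function name `lstm_step`, the `params` dictionary layout, and the sizes are assumptions made for this example; it is meant as a rough illustration of the equations rather than a production implementation.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step. Weights have shape (hidden, hidden + input), biases (hidden,)."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                          # updated cell state
    O_t = sigmoid(params["W_O"] @ concat + params["b_O"])       # output gate
    h_t = O_t * np.tanh(C_t)                                    # new hidden state
    return h_t, C_t


# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
hidden, n_in = 4, 3
params = {}
for name in ("f", "i", "C", "O"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + n_in))
    params[f"b_{name}"] = np.zeros(hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):                          # a sequence of 5 inputs
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)
```

The cell state `C` is carried alongside `h` from step to step; only `h` is exposed as the output.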



# LSTM with Peephole Connections

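Peephole connections let the gates also look at the cell state. In the commonly used formulation, the forget and input gates see the previous cell state $C_{t-1}$, while the output gate sees the updated cell state $C_t$.

$$
    f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)\\
    i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)\\
    O_t = \sigma(W_O \cdot [C_t, h_{t-1}, x_t] + b_O)
$$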
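
A peephole step can be sketched by adding the cell state to each gate's input. The name `peephole_lstm_step` and the `params` layout are assumptions made for this example. One choice worth flagging: the output gate here peeks at the updated cell state `C_t`, as in the commonly cited formulation; pass `C_prev` to that gate instead if it should peek at the previous state.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))


def peephole_lstm_step(x_t, h_prev, C_prev, params):
    """One peephole LSTM step.

    Gate weights have shape (hidden, 2 * hidden + input) because each gate
    also receives a cell state; W_C keeps shape (hidden, hidden + input).
    """
    concat = np.concatenate([h_prev, x_t])                # [h_{t-1}, x_t]
    peep_prev = np.concatenate([C_prev, concat])          # [C_{t-1}, h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ peep_prev + params["b_f"])
    i_t = sigmoid(params["W_i"] @ peep_prev + params["b_i"])
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])
    C_t = f_t * C_prev + i_t * C_tilde
    # Assumption: the output gate peeks at the updated cell state C_t
    peep_now = np.concatenate([C_t, concat])
    O_t = sigmoid(params["W_O"] @ peep_now + params["b_O"])
    h_t = O_t * np.tanh(C_t)
    return h_t, C_t


# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
hidden, n_in = 4, 3
params = {"W_C": rng.normal(scale=0.1, size=(hidden, hidden + n_in)),
          "b_C": np.zeros(hidden)}
for name in ("f", "i", "O"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, 2 * hidden + n_in))
    params[f"b_{name}"] = np.zeros(hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):                    # a sequence of 5 inputs
    h, C = peephole_lstm_step(x_t, h, C, params)
print(h.shape, C.shape)
```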