# GRU

There are 3 main problems with 'vanilla' RNNs:
* Some information needs to be remembered for a long time
* Some information needs to be skipped
* In case of a change of context, we should reset the memory state

**GRU** (Gated Recurrent Unit) is an architecture that lets the network decide when to *reset* and when to *update* its memory state.

Compared to a 'vanilla' RNN, a GRU learns extra weights that decide what to keep and what to update, based on the input and the previous hidden state (i.e., the memory).

<center>
    <img src='data/gru-1.svg' width="65%" style="margin-left:auto; margin-right:auto"/>
    <p style="font-size:14px;">Source: <a href='d2l.ai'>D2L</a></p>
</center>

Each gate is a function of the input $\mathbf{X}_t$ and the previous hidden state $\mathbf{H}_{t-1}$; together, the gates determine the next hidden state $\mathbf{H}_t$. The reset gate $\mathbf{R}_t$ and the update gate $\mathbf{Z}_t$ are computed as:

$$
\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r)\\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z)
\end{aligned}
$$

We apply the sigmoid function ($\sigma$) so that every entry of $\mathbf{R}_t$ and $\mathbf{Z}_t$ lies between 0 and 1.
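As a minimal sketch, the two gate equations can be written directly with tensor operations. The sizes, the random initialisation, and the helper `init_gate_params` are illustrative assumptions, not part of the original notes.

```python
import torch

torch.manual_seed(0)
batch_size, num_inputs, num_hiddens = 4, 8, 16  # illustrative sizes (assumed)

X_t = torch.randn(batch_size, num_inputs)       # input at time step t
H_prev = torch.zeros(batch_size, num_hiddens)   # previous hidden state H_{t-1}

def init_gate_params():
    """Hypothetical helper: returns (W_x, W_h, b) for one gate."""
    return (torch.randn(num_inputs, num_hiddens) * 0.01,
            torch.randn(num_hiddens, num_hiddens) * 0.01,
            torch.zeros(num_hiddens))

W_xr, W_hr, b_r = init_gate_params()  # reset gate parameters
W_xz, W_hz, b_z = init_gate_params()  # update gate parameters

# Same form as the two equations above; sigmoid squashes every entry into (0, 1)
R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)
Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)

print(R_t.shape, Z_t.shape)
```

Both gates have the same shape as the hidden state, which is what allows them to act on it elementwise.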

Once we have $\mathbf{R}_t$ and $\mathbf{Z}_t$, we use the reset gate $\mathbf{R}_t$ together with $\mathbf{H}_{t-1}$ to compute a candidate hidden state $\tilde{\mathbf{H}}_t$:
$$
\tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \left(\mathbf{R}_t \odot \mathbf{H}_{t-1}\right) \mathbf{W}_{hh} + \mathbf{b}_h)
$$
where $\odot$ is the **elementwise** product.

Remember that the entries of $\mathbf{R}_t$ lie in $[0, 1]$. Thus, $\mathbf{R}_t$ decides how much of each component of the previous hidden state flows into the candidate: entries close to 1 keep that component, as in a 'vanilla' RNN, while entries close to 0 drop it.
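The two extreme settings of the reset gate can be checked numerically. The sketch below sets $\mathbf{R}_t$ by hand instead of computing it (an assumption made purely for illustration) and uses random weights with arbitrary small sizes.

```python
import torch

torch.manual_seed(0)
batch_size, num_inputs, num_hiddens = 2, 3, 4  # illustrative sizes (assumed)

X_t = torch.randn(batch_size, num_inputs)
H_prev = torch.randn(batch_size, num_hiddens)
W_xh = torch.randn(num_inputs, num_hiddens)
W_hh = torch.randn(num_hiddens, num_hiddens)
b_h = torch.zeros(num_hiddens)

def candidate(R_t):
    """Candidate hidden state for a given reset gate R_t."""
    return torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)

R_open = torch.ones(batch_size, num_hiddens)     # reset gate fully open
R_closed = torch.zeros(batch_size, num_hiddens)  # reset gate fully closed

vanilla_rnn = torch.tanh(X_t @ W_xh + H_prev @ W_hh + b_h)  # plain RNN update
input_only = torch.tanh(X_t @ W_xh + b_h)                   # previous state ignored

assert torch.allclose(candidate(R_open), vanilla_rnn)
assert torch.allclose(candidate(R_closed), input_only)
```

With the gate fully open the candidate matches a 'vanilla' RNN update; with the gate fully closed it depends only on the current input.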

<center>
    <img src='data/gru-2.svg' width="65%" style="margin-left:auto; margin-right:auto"/>
    <p style="font-size:14px;">Source: <a href='d2l.ai'>D2L</a></p>
</center>

Now that we have computed the candidate hidden state, we use $\mathbf{Z}_t$ to determine how much of the previous hidden state to keep and how much to replace with the candidate:

$$
\mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1}  + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t
$$

<center>
    <img src='data/gru-3.svg' width="65%" style="margin-left:auto; margin-right:auto"/>
    <p style="font-size:14px;">Source: <a href='d2l.ai'>D2L</a></p>
</center>
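Putting the equations for $\mathbf{R}_t$, $\mathbf{Z}_t$, $\tilde{\mathbf{H}}_t$ and $\mathbf{H}_t$ together gives one GRU time step. The following is a from-scratch sketch under stated assumptions: `gru_step` and `init_params` are hypothetical names, and the initialisation scale and tensor sizes are arbitrary.

```python
import torch

def gru_step(X_t, H_prev, params):
    """One GRU time step, following the equations above."""
    W_xr, W_hr, b_r, W_xz, W_hz, b_z, W_xh, W_hh, b_h = params
    R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)            # reset gate
    Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)            # update gate
    H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)   # candidate
    return Z_t * H_prev + (1 - Z_t) * H_tilde                        # new hidden state

def init_params(num_inputs, num_hiddens, scale=0.01):
    """Hypothetical helper: (W_x, W_h, b) for reset gate, update gate, candidate."""
    def triple():
        return (torch.randn(num_inputs, num_hiddens) * scale,
                torch.randn(num_hiddens, num_hiddens) * scale,
                torch.zeros(num_hiddens))
    return (*triple(), *triple(), *triple())

torch.manual_seed(0)
num_inputs, num_hiddens, batch_size, seq_len = 8, 16, 4, 5  # assumed sizes
params = init_params(num_inputs, num_hiddens)

H = torch.zeros(batch_size, num_hiddens)           # initial hidden state
X = torch.randn(seq_len, batch_size, num_inputs)   # a random input sequence
for X_t in X:                                      # iterate over time steps
    H = gru_step(X_t, H, params)

print(H.shape)  # torch.Size([4, 16])
```

Because $\mathbf{H}_t$ is an elementwise convex combination of $\mathbf{H}_{t-1}$ and a $\tanh$ output, its entries stay in $[-1, 1]$ when the initial state is zero.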

Let's go back to our 3 issues with RNNs:
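* **Some information needs to be remembered for a long time**: When entries of $\mathbf{Z}_t$ are close to 1, the update gate carries the corresponding part of the hidden state over unchanged, possibly for many time steps
* **Some information needs to be skipped**: With $\mathbf{Z}_t$ close to 1, the current input barely changes the hidden state, so irrelevant inputs can be skipped
* **In case of a change of context, we should reset the memory state**: When both the reset gate and the update gate are close to 0, $\mathbf{H}_t$ depends only on the current input, which fully resets the hidden state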
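A small numerical check of the update equation makes these cases concrete. The values of `H_prev`, `H_tilde` and the gates are set by hand for illustration; in a real GRU they are computed from the input and the previous hidden state.

```python
import torch

H_prev = torch.tensor([[0.8, -0.6, 0.4]])   # previous hidden state (made-up values)
H_tilde = torch.tensor([[-0.5, 0.5, 0.0]])  # candidate hidden state (made-up values)

def update(Z_t):
    """Final GRU update: mix the old state and the candidate elementwise."""
    return Z_t * H_prev + (1 - Z_t) * H_tilde

keep = update(torch.ones(1, 3))                   # Z = 1: memory kept, input skipped
replace = update(torch.zeros(1, 3))               # Z = 0: state replaced by candidate
mixed = update(torch.tensor([[1.0, 0.0, 0.5]]))   # a different choice per component

assert torch.allclose(keep, H_prev)
assert torch.allclose(replace, H_tilde)
assert torch.allclose(mixed, torch.tensor([[0.8, 0.5, 0.2]]))
```

The `mixed` case shows that the decision is made per component: one unit can hold its value while another is overwritten in the same time step.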

In PyTorch, replacing an RNN layer with a GRU layer is straightforward:

```python
import torch.nn as nn

num_inputs = 8    # size of each input vector
num_hiddens = 16  # size of the hidden state
gru_layer = nn.GRU(num_inputs, num_hiddens)  # drop-in replacement for nn.RNN
```
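To see what the layer returns, here is a short sketch of a forward pass on random data. The sequence length and batch size are arbitrary assumptions; with the default `batch_first=False`, `nn.GRU` expects input of shape `(seq_len, batch_size, num_inputs)`.

```python
import torch
import torch.nn as nn

num_inputs, num_hiddens = 8, 16
gru_layer = nn.GRU(num_inputs, num_hiddens)

seq_len, batch_size = 5, 4                        # illustrative sizes (assumed)
X = torch.randn(seq_len, batch_size, num_inputs)  # random input sequence

# With no initial state given, the hidden state starts at zero
outputs, H_n = gru_layer(X)

print(outputs.shape)  # torch.Size([5, 4, 16]): hidden state at every time step
print(H_n.shape)      # torch.Size([1, 4, 16]): final hidden state of the single layer

# For a single-layer, unidirectional GRU the last output is the final hidden state
assert torch.allclose(outputs[-1], H_n[0])
```

Note that PyTorch's documented GRU formulation applies the reset gate after the hidden-to-hidden product rather than before it, so its weights do not map term by term onto the candidate equation above.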