# GRU

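* The Gated Recurrent Unit (GRU) extends the standard RNN with gates that control how the hidden state is updated.
* GRUs are closely related to LSTMs, but they use fewer gates and keep no separate cell state.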
<p align="left">
    <br><br>
    &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;<em>GRU</em> &emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;<em>LSTM</em>
    <img src="https://d2l.ai/_images/gru-3.svg" height="520" width="440" align="left">
    <img src = "https://d2l.ai/_images/lstm-3.svg" height="520" width="440">
    </p><br><br>
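* Just like an LSTM, a GRU uses gates to control the flow of information.
    - **Reset gate** decides how much of the previous hidden state is used when computing the candidate hidden state.
    - **Update gate** decides how much of the previous hidden state is carried forward and how much is replaced by the candidate, which helps capture long-term dependencies in a sequence.
    - **Hidden state** - Finally, the new hidden state is computed as a combination of the previous hidden state and the candidate hidden state, weighted by the update gate.
* GRUs are simpler than LSTMs (two gates instead of three, and no separate cell state), so they are usually faster to train.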
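
One rough way to see the "simpler" claim is to count trainable parameters. The sketch below compares PyTorch's built-in `nn.GRU` and `nn.LSTM` for the same sizes; the sizes and the helper name `count_parameters` are illustrative assumptions, and parameter count is only a proxy for training speed.

```python
import torch.nn as nn


def count_parameters(module):
    # Hypothetical helper: total number of trainable scalars in a module.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Illustrative sizes, not taken from the text above.
input_size, hidden_size = 10, 20

gru = nn.GRU(input_size, hidden_size)    # 3 weight blocks: reset, update, candidate
lstm = nn.LSTM(input_size, hidden_size)  # 4 weight blocks: input, forget, cell, output

gru_params = count_parameters(gru)
lstm_params = count_parameters(lstm)

print(gru_params, lstm_params)  # 1920 2560
```

With equal sizes the GRU has three quarters of the LSTM's parameters, since it has three weight blocks where the LSTM has four.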

In [1]:
import torch
import torch.nn as nn

## Step by Step GRU 
- <font size=3>Step 1: Reset Gate And Update Gate<br></font>
    \begin{split}\begin{aligned}
    \mathbf{R}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r)\\
    \mathbf{Z}_t = \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z)
    \end{aligned}\end{split}<br>
    <img src = "https://d2l.ai/_images/gru-1.svg" height="620" width="380">
<br><br>

- <font size=3>Step 2: Candidate Hidden state<br></font>
    \begin{split}\begin{aligned}
    \tilde{\mathbf{H}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xh} + \left(\mathbf{R}_t \odot \mathbf{H}_{t-1}\right) \mathbf{W}_{hh} + \mathbf{b}_h)
    \end{aligned}\end{split}<br>
    <img src = "https://d2l.ai/_images/gru-2.svg" height="720" width="480"><br>
<br><br>

- <font size=3>Step 3: Hidden state<br></font>
    \begin{split}\begin{aligned}
    \mathbf{H}_t = \mathbf{Z}_t \odot \mathbf{H}_{t-1}  + (1 - \mathbf{Z}_t) \odot \tilde{\mathbf{H}}_t.
    \end{aligned}\end{split}<br>
    <img src = "https://d2l.ai/_images/gru-3.svg" height="720" width="480"><br>
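
The three steps above can be sketched with plain tensors. This is a minimal illustration that follows the row-vector convention of the step equations (`X_t @ W`); the sizes, the random initialisation, and the helper name `init_gate` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the text).
batch_size, num_inputs, num_hiddens = 2, 4, 3


def init_gate():
    # Hypothetical helper: one (W_x, W_h, b) triple with small random weights.
    return (torch.randn(num_inputs, num_hiddens) * 0.1,
            torch.randn(num_hiddens, num_hiddens) * 0.1,
            torch.zeros(num_hiddens))


W_xr, W_hr, b_r = init_gate()  # reset gate parameters
W_xz, W_hz, b_z = init_gate()  # update gate parameters
W_xh, W_hh, b_h = init_gate()  # candidate hidden state parameters

X_t = torch.randn(batch_size, num_inputs)      # input at time t
H_prev = torch.randn(batch_size, num_hiddens)  # hidden state at time t-1

# Step 1: reset gate and update gate, both squashed into (0, 1)
R_t = torch.sigmoid(X_t @ W_xr + H_prev @ W_hr + b_r)
Z_t = torch.sigmoid(X_t @ W_xz + H_prev @ W_hz + b_z)

# Step 2: candidate hidden state; the reset gate scales H_prev before W_hh
H_tilde = torch.tanh(X_t @ W_xh + (R_t * H_prev) @ W_hh + b_h)

# Step 3: new hidden state, an element-wise blend of old state and candidate
H_t = Z_t * H_prev + (1 - Z_t) * H_tilde

print(H_t.shape)  # torch.Size([2, 3])
```

Because `Z_t` lies strictly between 0 and 1, every entry of `H_t` sits between the corresponding entries of `H_prev` and `H_tilde`: an update gate near 1 keeps the old state, and a gate near 0 replaces it with the candidate.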

## GRU Equations:
Input vector:&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; $\large {X_t} $
<br><br>
Reset Gate: &ensp;&nbsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; $\large R_t = \sigma(W_{xr} X_t + W_{hr} H_{t-1}) $
<br><br>
Update Gate:&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; $ \large Z_t = \sigma(W_{xz} X_t + W_{hz} H_{t-1})$
<br><br>
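Candidate hidden state: &emsp; $ \large \tilde H_t = \tanh(W_{xh} X_t + R_t \odot (W_{hh} H_{t-1}) )$
<br><br>
Hidden state: &emsp;&emsp;&emsp;&emsp;&emsp; $ \large H_t = Z_t \odot H_{t-1} + (1-Z_t) \odot \tilde H_t$
<br><br>
Bias terms are omitted here for brevity. Note that this candidate hidden state applies the reset gate after the hidden-to-hidden transform, $R_t \odot (W_{hh} H_{t-1})$, which is what the implementation in this notebook (and PyTorch's `nn.GRU`) computes. Step 2 applies it before the transform, $(R_t \odot H_{t-1}) W_{hh}$. Both are common GRU variants.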

Reference: [Dive into Deep Learning](https://d2l.ai/chapter_recurrent-modern)
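
As a sanity check, the summary equations can be compared against PyTorch's built-in `nn.GRUCell`, which stores its weights stacked in the order reset, update, candidate. The sketch below recomputes one step by hand from those weights; the sizes and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions).
input_size, hidden_size, batch_size = 4, 3, 2

cell = nn.GRUCell(input_size, hidden_size)
x = torch.randn(batch_size, input_size)
h = torch.randn(batch_size, hidden_size)

# weight_ih is (3 * hidden_size, input_size); weight_hh is (3 * hidden_size, hidden_size).
# Each chunk belongs to one of: reset gate, update gate, candidate.
w_xr, w_xz, w_xh = cell.weight_ih.chunk(3, dim=0)
w_hr, w_hz, w_hh = cell.weight_hh.chunk(3, dim=0)
b_xr, b_xz, b_xh = cell.bias_ih.chunk(3)
b_hr, b_hz, b_hh = cell.bias_hh.chunk(3)

with torch.no_grad():
    r = torch.sigmoid(x @ w_xr.T + b_xr + h @ w_hr.T + b_hr)  # reset gate
    z = torch.sigmoid(x @ w_xz.T + b_xz + h @ w_hz.T + b_hz)  # update gate
    # Reset gate applied after the hidden-to-hidden transform.
    cand = torch.tanh(x @ w_xh.T + b_xh + r * (h @ w_hh.T + b_hh))
    h_manual = z * h + (1 - z) * cand   # new hidden state, computed by hand
    h_builtin = cell(x, h)              # new hidden state from nn.GRUCell

matches = torch.allclose(h_manual, h_builtin, atol=1e-6)
print(matches)
```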

In [2]:
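class GRU(nn.Module):
    """A single GRU cell built from linear layers (biases live inside nn.Linear)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # stored so that init_hidden() can build a state of the right size
        self.hidden_size = hidden_size

        # reset gate
        self.w_xr = nn.Linear(input_size, hidden_size)
        self.w_hr = nn.Linear(hidden_size, hidden_size)

        # update gate
        self.w_xz = nn.Linear(input_size, hidden_size)
        self.w_hz = nn.Linear(hidden_size, hidden_size)

        # candidate hidden state
        self.w_xh = nn.Linear(input_size, hidden_size)
        self.w_hh = nn.Linear(hidden_size, hidden_size)

    def init_hidden(self):
        # initial hidden state: a tensor of zeros
        # shape -> (num_layers, batch_size, hidden_size)
        return torch.zeros((1, 1, self.hidden_size), dtype=torch.float32)

    def forward(self, x, hidden):
        # R_t: how much of the previous state feeds the candidate
        reset_gate = torch.sigmoid(self.w_xr(x) + self.w_hr(hidden))

        # Z_t: how much of the previous state is kept
        update_gate = torch.sigmoid(self.w_xz(x) + self.w_hz(hidden))

        # candidate state; the reset gate is applied after the hidden-to-hidden transform
        candidate = torch.tanh(self.w_xh(x) + torch.mul(reset_gate, self.w_hh(hidden)))

        # H_t: blend of the previous state and the candidate
        hidden_state = torch.mul(update_gate, hidden) + torch.mul((1 - update_gate), candidate)

        # return the full new hidden state and its last slice along dim 0
        return hidden_state, hidden_state[-1]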
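
A cell like this processes one time step per call, so a sequence is handled by looping and feeding the hidden state back in. The sketch below shows that loop; to stay self-contained it uses a condensed copy of the cell under the hypothetical name `GRUSketch`, and the sequence length and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GRUSketch(nn.Module):
    """Condensed copy of the cell above (hypothetical name, demo only)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.w_xr = nn.Linear(input_size, hidden_size)
        self.w_hr = nn.Linear(hidden_size, hidden_size)
        self.w_xz = nn.Linear(input_size, hidden_size)
        self.w_hz = nn.Linear(hidden_size, hidden_size)
        self.w_xh = nn.Linear(input_size, hidden_size)
        self.w_hh = nn.Linear(hidden_size, hidden_size)

    def init_hidden(self):
        # shape -> (num_layers, batch_size, hidden_size)
        return torch.zeros(1, 1, self.hidden_size)

    def forward(self, x, hidden):
        r = torch.sigmoid(self.w_xr(x) + self.w_hr(hidden))
        z = torch.sigmoid(self.w_xz(x) + self.w_hz(hidden))
        cand = torch.tanh(self.w_xh(x) + r * self.w_hh(hidden))
        new_hidden = z * hidden + (1 - z) * cand
        return new_hidden, new_hidden[-1]


torch.manual_seed(0)

# Illustrative sizes (assumptions).
seq_len, input_size, hidden_size = 5, 4, 3
gru = GRUSketch(input_size, hidden_size)

# One (1, 1, input_size) slice per time step.
sequence = torch.randn(seq_len, 1, 1, input_size)

hidden = gru.init_hidden()
outputs = []
with torch.no_grad():
    for x_t in sequence:
        # The hidden state returned at step t is the input state at step t+1.
        hidden, last = gru(x_t, hidden)
        outputs.append(last)

outputs = torch.stack(outputs)  # (seq_len, batch_size, hidden_size)
print(outputs.shape)            # torch.Size([5, 1, 3])
```

Starting from a zero state, every hidden value stays within [-1, 1], because each update is a convex combination of the previous state and a `tanh` output.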