# Gated Recurrent Unit (GRU): Lecture Notes

## I. Introduction and Context

### A. What are GRUs?
*   Gated Recurrent Units (GRUs) are a type of **Recurrent Neural Network (RNN) architecture**.
*   They are primarily used to **process sequential data**.
*   GRUs were introduced in **2014**, which makes them considerably newer than LSTMs (Long Short-Term Memory), which date from 1997.

### B. Why GRUs Exist (Addressing Limitations of Previous RNNs)
*   **Simple RNNs:** Suffer from the **vanishing and exploding gradient problems**. In practice, this means they struggle to retain long-term context when the sequence (e.g., a sentence) becomes very long.
*   **LSTMs:** Largely mitigate the vanishing gradient problem by maintaining two different contexts, **Long-Term Memory (Cell State)** and **Short-Term Memory (Hidden State)**, at the cost of a complex architecture involving **three gates** (Forget, Input, and Output).
*   **The Flaw of LSTMs:** While effective, LSTMs have a **complex architecture** and a **high number of parameters**. A high parameter count means that training takes longer, especially when working with large datasets.
*   **The GRU Solution:** GRUs solve this training time issue by offering a **simpler architecture** and possessing **fewer parameters** compared to LSTMs. This means training time is reduced.
*   **Performance:** Despite their simpler structure and fewer gates, GRU performance is **comparable** to that of LSTMs. On certain datasets and problems, GRUs have been empirically shown to **outperform LSTMs**.

## II. GRU Architecture: Components and Setup

<img src="https://miro.medium.com/0*mXUQSGG-WCt2UE-Y">

### A. The Goal of the GRU Cell
*   For any time step $t$, the GRU aims to calculate the **Current Hidden State ($H_t$)**.
*   The GRU receives two inputs at each time step:
    1.  The **Previous Hidden State ($H_{t-1}$)**.
    2.  The **Current Time Step Input ($X_t$)**.

### B. Key Components (Vectors)
All components mentioned below are fundamentally **vectors** (ordered lists of numbers).

| Term | Notation | Description/Role |
| :--- | :--- | :--- |
| **Previous Hidden State** | $H_{t-1}$ | The memory/context carried from the previous time step. |
| **Current Hidden State** | $H_t$ | The calculated memory/context for the current time step. |
| **Current Input** | $X_t$ | The vectorized input (e.g., a word) at time $t$. |
| **Reset Gate** | $R_t$ | Controls which parts of $H_{t-1}$ should be forgotten/reset based on $X_t$. |
| **Update Gate** | $Z_t$ | Determines the balance between the Past Memory ($H_{t-1}$) and the Candidate Memory ($\tilde{H}_t$). |
| **Candidate Hidden State** | $\tilde{H}_t$ | The proposed new memory state, heavily influenced by $X_t$, before final balancing. |

**Dimensionality Constraint:** Excluding $X_t$ (which can have a different dimension), the vectors $H_{t-1}$, $H_t$, $R_t$, $Z_t$, and $\tilde{H}_t$ must **always have the same dimension**. This dimension is determined by the **number of hidden units/nodes** specified in the architecture (a hyperparameter).
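
The dimensionality constraint can be checked with a small shape experiment. This is a minimal sketch, assuming NumPy and the concatenated-input formulation used later in these notes; the sizes, the random weights, and the variable names are illustrative assumptions, not values from the lecture.

```python
import numpy as np

hidden_size = 4   # number of hidden units (hyperparameter, chosen for illustration)
input_size = 3    # dimension of X_t, which may differ from hidden_size

rng = np.random.default_rng(0)
h_prev = np.zeros(hidden_size)          # H_{t-1}
x_t = rng.normal(size=input_size)       # X_t

# Every layer sees the concatenation [H_{t-1}, X_t] ...
concat = np.concatenate([h_prev, x_t])  # shape: (hidden_size + input_size,)

# ... and maps it back to a vector with hidden_size entries.
W = rng.normal(size=(hidden_size, hidden_size + input_size))  # random stand-in for learned weights
b = np.zeros(hidden_size)
out = W @ concat + b                    # shape: (hidden_size,)

print(concat.shape, W.shape, out.shape)
```

Because every layer outputs a vector of length `hidden_size`, the gates, the candidate, and both hidden states can be combined element-wise without any reshaping.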

### C. Input Vectorization ($X_t$)
*   Since the GRU cannot process words directly, the input text must be **converted into vectors** (numbers).
*   Vectorization techniques include simple methods like Bag of Words (BoW) or more advanced methods like Word2Vec embeddings. $X_t$ is the resulting vector for the word/unit being processed at time $t$.
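
As a rough illustration of vectorization, the sketch below looks up one vector per word from a table. The vocabulary, the embedding size, and the `vectorize` helper are hypothetical, and the table is filled with random numbers as a stand-in for learned embeddings such as Word2Vec.

```python
import numpy as np

vocab = {"the": 0, "king": 1, "fell": 2}   # hypothetical toy vocabulary
embedding_dim = 5                          # illustrative size of each X_t

rng = np.random.default_rng(0)
# Random stand-in for a learned embedding table: one row per word.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def vectorize(sentence):
    """Map each word of a whitespace-separated sentence to its vector."""
    return np.stack([embedding_table[vocab[word]] for word in sentence.split()])

X = vectorize("the king fell")   # X[t] plays the role of X_t
print(X.shape)
```

Each row of `X` is then fed to the GRU one time step at a time.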

### D. Operational Elements
1.  **Neural Network Layers (Fully Connected Layers):** Represented by yellow boxes in diagrams.
    *   The layers corresponding to the gates ($R_t, Z_t$) use the **Sigmoid** ($\sigma$) activation function, ensuring outputs are between 0 and 1.
    *   The layer calculating the Candidate Hidden State ($\tilde{H}_t$) uses the **$\tanh$** activation function.
    *   The number of nodes must be **identical** across all three neural network layers.
2.  **Element-wise Operations:** Represented by pink circles.
    *   **Multiplication ($\times$):** Point-wise multiplication (element-wise).
    *   **Addition ($+$):** Point-wise addition (element-wise).
    *   **One-Minus ($1-$):** Calculates $(1 - Z_t)$ for element-wise use.
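
The operational elements above can be tried out on small vectors. This is a minimal sketch, assuming NumPy; the input values are arbitrary and the `sigmoid` helper is defined here for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic function: squashes each entry into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

a = np.array([-2.0, 0.0, 2.0])
gate = sigmoid(a)      # gate-style output, each entry between 0 and 1
cand = np.tanh(a)      # candidate-style output, each entry between -1 and 1

h_prev = np.array([0.5, -1.0, 2.0])
product = gate * h_prev               # element-wise multiplication
one_minus = 1.0 - gate                # the "1 -" operation
total = product + one_minus * cand    # element-wise addition

print(gate, cand, total)
```

All three operations act entry by entry, which is why the vectors involved must share the same dimension.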

## III. The GRU Architecture: Conceptual Flow

<img src="https://i.ibb.co/YB2TNnnc/image.png">

### A. The Big Idea: Single State Memory
*   Unlike LSTMs, the GRU's "big idea" is that it does **not need two separate states** (cell state and hidden state) to carry long-term and short-term context.
*   It uses a **single Hidden State** ($H$) to manage both types of context through careful manipulation by its two gates.
*   **Intuition:** The Hidden State acts as the **memory of the system**, where different vector dimensions represent different aspects of the context (e.g., power, conflict, tragedy, or revenge in a story).

### B. Two-Step Transition Process
The transition from Past Memory ($H_{t-1}$) to Current Memory ($H_t$) occurs in two steps:

1.  **Step 1: Creation of Candidate Memory ($\tilde{H}_t$)**
    *   The past memory ($H_{t-1}$) is combined with the current input ($X_t$) to create a $\tilde{H}_t$ (Candidate Hidden State).
    *   This step uses the **Reset Gate ($R_t$)** to modulate (reset or reduce) irrelevant parts of $H_{t-1}$ based on $X_t$.
2.  **Step 2: Balancing and Final Output ($H_t$)**
    *   The GRU cannot directly use $\tilde{H}_t$ as $H_t$ because $\tilde{H}_t$ is often **too heavily reliant** on the current input ($X_t$).
    *   The **Update Gate ($Z_t$)** is used to dynamically balance the Past Memory ($H_{t-1}$) and the Candidate Memory ($\tilde{H}_t$).
    *   The resulting balance forms the final $H_t$.


## IV. Detailed Architecture: Calculation Steps

The calculation of $H_t$ involves four chronological steps:

### Step 1: Calculate Reset Gate ($R_t$)
*   **Purpose:** To determine which dimensions of the *past memory* ($H_{t-1}$) should be substantially forgotten (reset) based on the significance of the current input ($X_t$).
*   **Calculation:** $H_{t-1}$ and $X_t$ are concatenated and passed through a Sigmoid activation layer.
    $$R_t = \sigma(W_R \cdot [H_{t-1}, X_t] + B_R)$$
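
The reset gate formula can be written out directly. This is a minimal sketch, assuming NumPy; the sizes are illustrative and `W_R` is filled with random numbers as a stand-in for learned weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

hidden_size, input_size = 3, 2           # illustrative sizes
rng = np.random.default_rng(1)

h_prev = rng.normal(size=hidden_size)    # H_{t-1}
x_t = rng.normal(size=input_size)        # X_t

# Random stand-ins for the learned parameters W_R and B_R.
W_R = rng.normal(size=(hidden_size, hidden_size + input_size))
B_R = np.zeros(hidden_size)

# R_t = sigma(W_R . [H_{t-1}, X_t] + B_R)
r_t = sigmoid(W_R @ np.concatenate([h_prev, x_t]) + B_R)
print(r_t)
```

Each entry of `r_t` is a number between 0 and 1 that says how much of the matching dimension of $H_{t-1}$ is kept when the candidate is built.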

### Step 2: Calculate Candidate Hidden State ($\tilde{H}_t$)
This step involves two sub-operations:
1.  **Modulation of Past Memory:** The Reset Gate $R_t$ is multiplied element-wise with $H_{t-1}$ to produce the **modulated (reset) past memory**:
    $$R_t \odot H_{t-1}$$
2.  **Proposal Generation:** The modulated memory is concatenated with the Current Input $X_t$ and passed through a $\tanh$ activation layer. This proposes the new memory state $\tilde{H}_t$.
    $$\tilde{H}_t = \tanh(W_C \cdot [ (R_t \odot H_{t-1}), X_t] + B_C)$$
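
The effect of the reset gate on the candidate is easiest to see in a limiting case. The sketch below, assuming NumPy, sets $R_t$ to exactly zero (a sigmoid only approaches this value) and shows that two different past memories then yield the same candidate. The `candidate` helper and the random weights are hypothetical.

```python
import numpy as np

def candidate(h_prev, x_t, r_t, W_C, B_C):
    """Candidate hidden state: tanh(W_C . [(R_t * H_{t-1}), X_t] + B_C)."""
    reset_memory = r_t * h_prev                       # R_t (element-wise) H_{t-1}
    return np.tanh(W_C @ np.concatenate([reset_memory, x_t]) + B_C)

hidden_size, input_size = 3, 2                        # illustrative sizes
rng = np.random.default_rng(2)
W_C = rng.normal(size=(hidden_size, hidden_size + input_size))  # random stand-in weights
B_C = np.zeros(hidden_size)
x_t = rng.normal(size=input_size)

h_a = rng.normal(size=hidden_size)   # one possible past memory
h_b = rng.normal(size=hidden_size)   # a different past memory
r_zero = np.zeros(hidden_size)       # limiting case: reset gate fully closed

# With R_t = 0 the past memory is wiped, so both candidates depend only on X_t.
c_a = candidate(h_a, x_t, r_zero, W_C, B_C)
c_b = candidate(h_b, x_t, r_zero, W_C, B_C)
print(np.allclose(c_a, c_b))
```

With a gate of all ones instead, the full past memory would flow into the candidate, and the two results would generally differ.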

### Step 3: Calculate Update Gate ($Z_t$)
*   **Purpose:** To determine the precise **balance** (weighting) to be applied to the Past Memory ($H_{t-1}$) versus the Candidate Memory ($\tilde{H}_t$).
*   **Calculation:** $H_{t-1}$ and $X_t$ are concatenated and passed through a Sigmoid activation layer.
    $$Z_t = \sigma(W_Z \cdot [H_{t-1}, X_t] + B_Z)$$

### Step 4: Calculate Current Hidden State ($H_t$)
*   **Purpose:** To combine the Past Memory and Candidate Memory using the balance determined by $Z_t$.
*   **Mechanism:** If $Z_t$ is high, more weight is given to the new Candidate Memory ($\tilde{H}_t$); if $Z_t$ is low, more weight is given to the Past Memory ($H_{t-1}$).
*   **Calculation:** This uses the $1-Z_t$ operation and element-wise addition and multiplication.
    $$H_t = (1 - Z_t) \odot H_{t-1} + Z_t \odot \tilde{H}_t$$
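
The four steps can be put together into a single forward pass. This is a minimal sketch, assuming NumPy and the equations exactly as written in these notes; `init_params` and `gru_step` are hypothetical helper names, and the parameters are random stand-ins rather than trained weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(input_size, hidden_size, seed=0):
    """Random stand-ins for the learned parameters (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shape = (hidden_size, hidden_size + input_size)
    return {
        "W_R": rng.normal(scale=0.1, size=shape), "B_R": np.zeros(hidden_size),
        "W_Z": rng.normal(scale=0.1, size=shape), "B_Z": np.zeros(hidden_size),
        "W_C": rng.normal(scale=0.1, size=shape), "B_C": np.zeros(hidden_size),
    }

def gru_step(h_prev, x_t, p):
    """One GRU time step following Steps 1 to 4 of these notes."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(p["W_R"] @ hx + p["B_R"])                          # Step 1: reset gate
    h_cand = np.tanh(
        p["W_C"] @ np.concatenate([r_t * h_prev, x_t]) + p["B_C"]    # Step 2: candidate
    )
    z_t = sigmoid(p["W_Z"] @ hx + p["B_Z"])                          # Step 3: update gate
    return (1.0 - z_t) * h_prev + z_t * h_cand                       # Step 4: new hidden state

params = init_params(input_size=3, hidden_size=4)
xs = np.random.default_rng(1).normal(size=(5, 3))   # a sequence of five inputs
h = np.zeros(4)                                     # initial hidden state H_0
for x_t in xs:
    h = gru_step(h, x_t, params)                    # H_t is fed back in as H_{t-1}
print(h)
```

The same `params` are reused at every time step; only the hidden state changes as the sequence is consumed.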

## V. Key Differences Between LSTMs and GRUs

| Feature | LSTM (Long Short-Term Memory) | GRU (Gated Recurrent Unit) |
| :--- | :--- | :--- |
| **Number of Gates** | Three (Input, Forget, Output). | Two (Reset, Update). |
| **Memory Units** | Two separate states: Cell State and Hidden State. | One single state: Hidden State only. |
| **Architecture Complexity** | Complex. | Simpler. |
| **Parameter Count** | Higher: the equations use four weight matrices ($W$). | Lower: the equations use three weight matrices ($W$). |
| **Computational Speed** | Higher computational complexity; generally slower. | Faster to compute, especially on smaller datasets or when resources are limited. |
| **Performance** | Can do better on some tasks, particularly more complex tasks and larger datasets. | Often comparable to LSTMs; usually trains faster due to fewer parameters. |
| **Usage Recommendation** | Used when empirical testing shows better results, or when tasks are complex. | Often the **first choice** when starting out due to simplicity. |
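
The parameter-count row can be made concrete with a quick calculation. This is a sketch under the assumption that each layer has one weight matrix over the concatenation $[H_{t-1}, X_t]$ and one bias vector, as in the equations of these notes; framework implementations may use additional bias terms, so their reported counts can differ. The function names and the sizes are illustrative.

```python
def layer_params(input_size, hidden_size):
    """Weights plus biases of one layer acting on [H_{t-1}, X_t]."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def gru_params(input_size, hidden_size):
    return 3 * layer_params(input_size, hidden_size)   # reset, update, candidate

def lstm_params(input_size, hidden_size):
    return 4 * layer_params(input_size, hidden_size)   # forget, input, output, candidate

d, h = 100, 128   # illustrative input and hidden sizes
print(gru_params(d, h), lstm_params(d, h))   # prints: 87936 117248
```

Under this assumption a GRU always has three quarters of the parameters of an LSTM with the same input and hidden sizes.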