<a href="https://www.kaggle.com/code/mrafraim/dl-day-25-lstm-gru?scriptVersionId=290137071" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 25: LSTM & GRU

Welcome to Day 25!

Today you'll learn:

1. Why vanilla RNNs fail on long sequences
2. What vanishing gradient really means (intuitively)
3. What an LSTM is and why it fixes RNN problems
4. LSTM gates: Forget, Input, Output
5. How information flows through an LSTM cell
6. What a GRU is and how it differs from LSTM
7. When to use RNN vs LSTM vs GRU

> By the end of this notebook, you will understand why LSTM exists, not just how it works.


---

# What is a Vanilla RNN?

A vanilla RNN is the simplest possible recurrent neural network.
It is the original RNN formulation - no gates, no cell state, no memory control.

It consists of:
- An input
- A hidden state (memory)
- A nonlinear activation function

That's it.

## Core idea

A vanilla RNN processes a sequence one time step at a time, while carrying forward a single hidden state that summarizes the past.

At each time step $t$:
- It reads the current input $x_t$
- It combines it with the previous hidden state $h_{t-1}$
- It produces a new hidden state $h_t$

Mathematically:

$$
h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b)
$$

Where:
- $x_t$ → input at time step $t$
- $h_{t-1}$ → memory from the past
- $h_t$ → updated memory
- $\tanh$ → squashing nonlinearity
- Same weights are reused at every time step
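As a minimal sketch of the update above, the step can be written in a few lines of NumPy. The function name `rnn_step` and the weight values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_prev + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Illustrative weights (assumed values, chosen only for the demo).
W_xh = np.array([[0.5, -0.3], [0.1, 0.8]])   # input -> hidden
W_hh = np.array([[0.2, 0.0], [0.0, 0.2]])    # hidden -> hidden
b = np.zeros(2)

h0 = np.zeros(2)                 # empty memory before the first input
x1 = np.array([1.0, 2.0])
h1 = rnn_step(x1, h0, W_xh, W_hh, b)
print(h1)                        # every entry lies strictly between -1 and 1
```

Because of the `tanh`, the hidden state is always bounded, no matter how large the inputs are.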

## Why it's called "vanilla"

"Vanilla" means:
- No gating mechanisms
- No selective memory
- No protection against gradient decay
- One single memory vector doing everything

> A vanilla RNN blindly mixes past and present at every step.


## Structural view

At each time step:

$$
x_t + h_{t-1} \rightarrow [\text{Linear} + \tanh] \rightarrow h_t
$$

Unrolled over time:

$$
x_1 \rightarrow [\text{RNN}] \rightarrow h_1 \rightarrow [\text{RNN}] \rightarrow h_2 \rightarrow [\text{RNN}] \rightarrow h_3 \rightarrow \cdots
$$

Same cell. Same weights. Growing dependency chain.
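The unrolled view corresponds to a plain loop over the sequence. The sketch below assumes random weights with a fixed seed and a hypothetical helper named `rnn_forward`. The point is only that one set of weights is reused at every step.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step."""
    h = h0
    states = []
    for x_t in xs:                          # one iteration per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)              # fixed seed; weights are illustrative only
W_xh = rng.normal(scale=0.5, size=(3, 2))
W_hh = rng.normal(scale=0.5, size=(3, 3))
b = np.zeros(3)

xs = rng.normal(size=(5, 2))                # a sequence of 5 inputs, each of size 2
hs = rnn_forward(xs, np.zeros(3), W_xh, W_hh, b)
print(hs.shape)                             # one hidden state per time step
```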

## What the hidden state really is

The hidden state $h_t$ is:
- A compressed summary of everything seen so far
- Fixed-size, regardless of sequence length
- Overwritten at every time step

This creates a fundamental tension:
- Short-term details vs long-term memory
- New input vs old information

Vanilla RNNs have no mechanism to manage this tradeoff.

## What vanilla RNNs can do well

Vanilla RNNs work when:
- Sequences are short
- Dependencies are local
- Patterns are simple

Examples:
- Simple signal smoothing
- Toy language models
- Educational demonstrations

## What vanilla RNNs cannot do well

They fail when:
- Sequences are long
- Important information appears far in the past
- Memory must be preserved precisely

Examples they struggle with:
- Long sentences
- Long time-series forecasting
- Context-dependent language tasks

## The critical flaw

The vanilla RNN:
- Reuses the same transformation repeatedly
- Multiplies gradients through many time steps
- Has no mechanism to protect important memory

This leads directly to:
- Unstable gradients, because the same multiplication is repeated at every step:
  - **Vanishing gradients** (→ 0)
  - **Exploding gradients** (→ ∞)
- Forgetting of long-term information
- Inability to learn dependencies far in the past

> LSTM exists specifically to fix this flaw.

# Vanishing Gradient

Vanishing gradient is not a bug. It is a mathematical consequence of how vanilla RNNs are built.

During training, RNNs use Backpropagation Through Time (BPTT):
- The RNN is unrolled across time
- Gradients flow backward from later time steps to earlier ones

To update early weights, gradients must pass through many repeated transformations.


Hidden state:

$$
h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)
$$

Key observation:
- The same weight matrix $W_{hh}$ is applied at every time step
- The same activation function (`tanh`) is applied repeatedly


## Gradient flow across time

Consider the gradient of the loss $L$ with respect to an early hidden state $h_k$:

$$
\frac{\partial L}{\partial h_k}
=
\prod_{t=k+1}^{T}
\frac{\partial h_t}{\partial h_{t-1}}
\cdot
\frac{\partial L}{\partial h_T}
$$

Each factor is the Jacobian of a single step:

$$
\frac{\partial h_t}{\partial h_{t-1}}
=
\operatorname{diag}\big(\tanh'(z_t)\big)\, W_{hh},
\quad z_t = W_{hh} h_{t-1} + W_{xh} x_t
$$

So gradients are repeatedly multiplied by:
- The recurrent weight matrix
- The derivative of `tanh`

## Why gradients vanish 

Key facts:
- `tanh'(x) ≤ 1`
- Usually < 1, especially near saturation
- Eigenvalues of $W_{hh}$ often have magnitude < 1 for stability

```
An eigenvalue tells you how much a matrix stretches a vector along a particular direction.

In RNNs, the eigenvalues of W_hh tell you how the hidden state (and the gradients) grow or shrink over time.

- |λ| > 1 → exponential growth (exploding)
- |λ| < 1 → exponential decay (vanishing)
- |λ| = 1 → stable (ideal for long sequences)
```
So each step multiplies the gradient by a number slightly less than 1.

Example:
$$
0.8^{10} \approx 0.11
$$
$$
0.8^{50} \approx 0.000014
$$

After many time steps:

Gradient → almost zero

Early time steps receive almost no learning signal.
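To get a feel for how quickly this happens, here is a tiny sketch that raises the illustrative per-step factor 0.8 to increasing powers. A real network does not have one fixed factor; this is a simplification.

```python
factor = 0.8  # per-step shrink factor, the illustrative value from the example

for steps in (1, 10, 25, 50, 100):
    # A gradient travelling `steps` time steps back is scaled by factor**steps.
    print(f"{steps:>3} steps: {factor ** steps:.2e}")
```

The decay is exponential in the number of steps, so doubling the sequence length squares the shrink factor.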


## Exploding Gradient

If the eigenvalues of $W_{hh}$ have magnitude > 1 (for example, 1.2)

Then:
$$
1.2^{50} \approx 9100
$$

Result:
- Exploding gradients  
- Numerical instability  
- Training collapses

So vanilla RNNs live in a narrow unstable zone:
- Too small → vanishing
- Too large → exploding
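The three regimes can be seen side by side in a small sketch. It assumes a linear recurrence (no `tanh`) with a diagonal recurrent matrix, so the eigenvalues are just the diagonal entries. Both are simplifications made for illustration.

```python
import numpy as np

# Assumption: diagonal W_hh, so its eigenvalues are 0.8, 1.0 and 1.2.
W_hh = np.diag([0.8, 1.0, 1.2])
steps = 50

grad_T = np.ones(3)                                   # gradient at the last time step
# Backpropagating through `steps` linear steps multiplies by W_hh^T each time.
grad_0 = np.linalg.matrix_power(W_hh.T, steps) @ grad_T

for lam, g in zip(np.diag(W_hh), grad_0):
    print(f"|lambda| = {lam:.1f} -> gradient after {steps} steps: {g:.3e}")
```

Only the direction with eigenvalue exactly 1 keeps its gradient. The other two directions vanish or explode.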


## Why this Breaks Learning

Vanilla RNNs:
- Learn recent inputs well
- Forget distant inputs completely

They behave like:
> "Short-term memory machines"

This is why they fail at:
- Long sentences
- Long time dependencies
- Context-heavy tasks

> Vanishing gradients occur because backpropagation through many time steps repeatedly multiplies gradients by numbers less than 1, causing early information to disappear.

---
<p style="text-align:center; font-size:18px;"> (Optional) </p>

### 1. Jacobian Definition

When a function maps vectors:

$$
\mathbf{y} = f(\mathbf{x}), \quad \mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m
$$

the derivative is not a number. It's a matrix of partial derivatives, called the Jacobian:

$$
J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} 
\end{bmatrix}
$$

- Each row = gradient of one output w.r.t all inputs  
- Each column = how one input affects all outputs


**Simple Jacobian Example**

Define:

$$
\mathbf{y} = \tanh(\mathbf{x}), \quad \mathbf{x} = \begin{bmatrix}x_1\\x_2\end{bmatrix}, \mathbf{y} = \begin{bmatrix}y_1\\y_2\end{bmatrix}
$$

$$
y_1 = \tanh(x_1), \quad y_2 = \tanh(x_2)
$$

Derivative (Jacobian):

$$
\frac{\partial y_i}{\partial x_j} =
\begin{cases}
1 - \tanh^2(x_i) & i = j \\
0 & i \neq j
\end{cases}
$$

<br>

$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
\begin{bmatrix}
1-\tanh^2(x_1) & 0 \\
0 & 1-\tanh^2(x_2)
\end{bmatrix} = \operatorname{diag}(1-\tanh^2(\mathbf{x}))
$$

**Numerical example:**

$$
\mathbf{x} = \begin{bmatrix}1\\2\end{bmatrix} \quad \Rightarrow \quad
\tanh(1)\approx0.761, \quad \tanh(2)\approx0.964
$$

$$
J \approx
\begin{bmatrix}
0.42 & 0 \\
0 & 0.07
\end{bmatrix}
$$
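One way to check the diagonal formula is to compare it against finite differences. The helper names `tanh_jacobian` and `numerical_jacobian` are assumptions introduced for this sketch.

```python
import numpy as np

def tanh_jacobian(x):
    """Analytic Jacobian of element-wise tanh: diag(1 - tanh(x)^2)."""
    return np.diag(1.0 - np.tanh(x) ** 2)

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite differences, perturbing one input coordinate at a time."""
    n = x.size
    J = np.zeros((f(x).size, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

x = np.array([1.0, 2.0])
J_analytic = tanh_jacobian(x)
J_numeric = numerical_jacobian(np.tanh, x)

print(np.round(J_analytic, 2))                          # diagonal entries 0.42 and 0.07
print(np.allclose(J_analytic, J_numeric, atol=1e-6))    # True
```

The off-diagonal entries are zero because each output of an element-wise function depends on exactly one input.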


### 2. Vanilla RNN Hidden State & Gradient Flow

Hidden state:

$$
h_t = \tanh(z_t), \quad z_t = W_{hh} h_{t-1} + W_{xh} x_t
$$

Backprop:

$$
\frac{\partial L}{\partial h_{t-1}} =
\frac{\partial h_t}{\partial h_{t-1}} \cdot \frac{\partial L}{\partial h_t}, \quad
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1-\tanh^2(z_t)) W_{hh}
$$

- Each time step multiplies gradient by the Jacobian  
- Diagonal entries < 1 → gradient shrinks → vanishing gradient


**Numerical Example (2D, 2 time steps)**

Weights:

$$
W_{hh} = \begin{bmatrix}0.5 & 0 \\ 0 & 0.5\end{bmatrix},\quad
W_{xh} = \begin{bmatrix}0.8 & 0 \\ 0 & 0.8\end{bmatrix}
$$

Inputs:

$$
x_1 = \begin{bmatrix}1\\2\end{bmatrix},\quad
x_2 = \begin{bmatrix}2\\1\end{bmatrix},\quad
h_0 = \begin{bmatrix}0\\0\end{bmatrix}
$$

Forward pass:

$$
\begin{aligned}
z_1 &= W_{hh}h_0 + W_{xh}x_1 = \begin{bmatrix}0.8\\1.6\end{bmatrix} \Rightarrow h_1 = \tanh(z_1) \approx \begin{bmatrix}0.664\\0.922\end{bmatrix} \\
z_2 &= W_{hh}h_1 + W_{xh}x_2 = \begin{bmatrix}1.932\\1.261\end{bmatrix} \Rightarrow h_2 = \tanh(z_2) \approx \begin{bmatrix}0.959\\0.851\end{bmatrix}
\end{aligned}
$$

Jacobians:

$$
\frac{\partial h_1}{\partial h_0} = \operatorname{diag}(1-\tanh^2(z_1)) W_{hh} \approx \begin{bmatrix}0.280 & 0 \\ 0 & 0.075\end{bmatrix}
$$

$$
\frac{\partial h_2}{\partial h_1} = \operatorname{diag}(1-\tanh^2(z_2)) W_{hh} \approx \begin{bmatrix}0.040 & 0 \\ 0 & 0.138\end{bmatrix}
$$

Gradient flow:

Assume $\frac{\partial L}{\partial h_2} = [1,1]^T$

$$
\frac{\partial L}{\partial h_0} =
\frac{\partial h_1}{\partial h_0} \cdot 
\frac{\partial h_2}{\partial h_1} \cdot
\frac{\partial L}{\partial h_2} \approx 
\begin{bmatrix}0.011 \\ 0.010\end{bmatrix}
$$

- The gradient at $h_2$ is $[1, 1]^T$; after two steps back it is about $[0.011, 0.010]^T$
- It shrinks roughly a hundredfold because every diagonal entry of each Jacobian is well below 1
- Over many more steps, early hidden states receive almost zero gradient

> Gradient shrinks drastically → vanishing gradient
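The two-step example can be recomputed directly from the stated weights and inputs. This is a minimal NumPy sketch. The transposes implement the reverse-mode chain rule and make no difference here because the matrices are diagonal.

```python
import numpy as np

W_hh = np.diag([0.5, 0.5])
W_xh = np.diag([0.8, 0.8])
x1, x2 = np.array([1.0, 2.0]), np.array([2.0, 1.0])
h0 = np.zeros(2)

# Forward pass
z1 = W_hh @ h0 + W_xh @ x1
h1 = np.tanh(z1)
z2 = W_hh @ h1 + W_xh @ x2
h2 = np.tanh(z2)

# Per-step Jacobians dh_t/dh_{t-1} = diag(1 - tanh(z_t)^2) W_hh
J1 = np.diag(1 - h1 ** 2) @ W_hh
J2 = np.diag(1 - h2 ** 2) @ W_hh

# Backpropagate dL/dh_2 = [1, 1] down to h_0
grad_h2 = np.ones(2)
grad_h0 = J1.T @ (J2.T @ grad_h2)

print("h1 =", np.round(h1, 3), " h2 =", np.round(h2, 3))
print("dL/dh0 =", np.round(grad_h0, 3))
```

Extending the loop to more time steps keeps multiplying by factors of similar size, which is the exponential shrink described above.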


**Key Takeaways**

- Hidden state derivative = diagonal Jacobian × weight
- Backprop across many steps = product of Jacobians
- Each diagonal < 1 → exponential shrink
- Longer sequences → gradients at earlier steps are practically 0

---

# Long Short-Term Memory (LSTM)

LSTM is a vanilla RNN on steroids:  

It mitigates the vanishing gradient problem by introducing controlled memory through gates.

Core idea:
> LSTM decides what to remember, what to forget, and what to output at each time step.


## Components of LSTM

1. **Cell state ($C_t$)**: long-term memory  
   - Carries information across time steps
   - Changes slowly (additive updates)
   
2. **Hidden state ($h_t$)**: short-term memory / output  
   - Used for predictions at each time step

3. **Gates (sigmoid activations)**: control information flow:
   - **Forget gate ($f_t$)** → decide what to erase from $C_{t-1}$
   - **Input gate ($i_t$)** → decide what new info to add
   - **Output gate ($o_t$)** → decide what part of $C_t$ to output as $h_t$

Mathematically:

**1. Forget Gate ($f_t$)**

$$
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
$$

Where,

- $h_{t-1}$ → previous hidden state (short-term memory)
- $x_t$ → current input
- $[h_{t-1}, x_t]$ → concatenation of previous hidden state and current input
- $W_f$ → weights for the forget gate
- $b_f$ → bias term
- $\sigma$ → sigmoid activation, which outputs values between 0 and 1


> $f_t$ decides how much of the previous cell memory $C_{t-1}$ to keep or forget.  

- $f_t = 0$ → forget everything
- $f_t = 1$ → keep everything


**2. Input Gate ($i_t$) and Candidate Memory ($\tilde{C}_t$)**

<u>Input Gate ($i_t$)</u>

$$
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
$$

Explanation:  
- Controls how much new information to write to the cell state.  
- The sigmoid keeps the gate output between 0 (ignore new info) and 1 (fully write new info).

<u>Candidate Memory ($\tilde{C}_t$)</u>

$$
\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
$$

Explanation: 
- Generates new candidate values that could be added to memory.  
- $\tanh$ squashes values to [-1, 1], keeping memory stable.  
- $\tilde{C}_t$ is proposed new content; input gate $i_t$ decides how much actually enters $C_t$.


**3. Cell State Update ($C_t$)**

$$
C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t
$$

Explanation:  

- **Previous memory:** $f_t \cdot C_{t-1}$ → retained portion of old memory
- **New information:** $i_t \cdot \tilde{C}_t$ → portion of new candidate added
- **Additive update** (instead of overwriting) lets gradients flow along $C_t$ more easily, which mitigates the vanishing gradient problem
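A rough way to see why the additive form helps is the sketch below. It treats the gates as constants, which means it ignores their indirect dependence on $C_{t-1}$ through $h_{t-1}$:

$$
\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)
\quad\Rightarrow\quad
\frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
$$

Each factor is a forget-gate value that the network learns to set, not a fixed $\tanh'$ term times $W_{hh}$. If $f_t$ stays near 1 for some unit, the product stays near 1 as well: $0.99^{50} \approx 0.605$, compared with $0.8^{50} \approx 0.000014$ in the vanilla RNN example.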

Intuition:
> The cell state is like a water tank:  
> - Forget gate = drain valve  
> - Input gate = faucet adding new water  


**4. Output Gate ($o_t$) and Hidden State ($h_t$)**

<u>Output Gate ($o_t$)</u>

$$
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
$$

Explanation:  
- Determines how much of the cell state $C_t$ should be exposed as the hidden state (short-term output)  
- Sigmoid → 0 means "hide everything", 1 means "reveal everything"

<u>Hidden State ($h_t$)</u>

$$
h_t = o_t \cdot \tanh(C_t)
$$

Explanation: 
- $\tanh(C_t)$ → squashes memory values to [-1,1]
- Multiply by $o_t$ → output only selected memory
- $h_t$ is both the output for this step and the hidden input for next step

Intuition:

The hidden state is like the water flowing out of the tank tap.  
> - $C_t$ = water stored  
> - $o_t$ = how wide the tap is open    

## How LSTM Works

1. **Forget gate** → erase old memory (partial)
2. **Input gate + candidate memory** → add new info
3. **Cell state update** → memory carries forward gradually
4. **Output gate → hidden state** → controlled short-term output

> LSTM separates long-term memory ($C_t$) and short-term output ($h_t$), allowing it to preserve important information across long sequences.

> Additive updates to $C_t$ help preserve gradients, mitigating the vanishing gradient problem.
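The four stages map almost line for line onto code. Below is a minimal NumPy sketch of one LSTM step. The `params` dictionary layout, the sizes, and the random weights are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the four stages above.

    params maps a gate name to (W, b); each W acts on the concatenation [h_prev, x_t].
    """
    hx = np.concatenate([h_prev, x_t])
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_c, b_c = params["c"]
    W_o, b_o = params["o"]

    f_t = sigmoid(W_f @ hx + b_f)          # forget gate
    i_t = sigmoid(W_i @ hx + b_i)          # input gate
    C_tilde = np.tanh(W_c @ hx + b_c)      # candidate memory
    C_t = f_t * C_prev + i_t * C_tilde     # additive cell-state update
    o_t = sigmoid(W_o @ hx + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)               # exposed hidden state
    return h_t, C_t

# Illustrative sizes and random weights (assumptions, not from the text).
hidden, inputs = 3, 2
rng = np.random.default_rng(0)
params = {k: (rng.normal(scale=0.5, size=(hidden, hidden + inputs)), np.zeros(hidden))
          for k in "fico"}

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(4, inputs)):   # a sequence of 4 inputs
    h, C = lstm_step(x_t, h, C, params)
print(h, C)
```

Note that `C` is never passed through a weight matrix on its way to the next step. It is only scaled by `f_t` and shifted by `i_t * C_tilde`.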

**Visual Analogy**

Imagine a water tank:

- **Cell state** = water level (long-term memory)  
- **Forget gate** = drain valve (how much old memory to discard)  
- **Input gate** = faucet (how much new memory to add)  
- **Output gate** = tap to output water (hidden state $h_t$)

Water flows controlled by valves (gates), not by random overflow.

## Manual LSTM Example

We will simulate a single LSTM layer with a simple sequence `[1, 2]` and scalar weights to see how memory ($C_t$) and hidden state ($h_t$) evolve.

**Setup**

- Input sequence: `x = [1, 2]`  
- Previous hidden state: `h0 = 0`  
- Previous cell state: `C0 = 0`  
- Simplified weights and biases:


| Gate | Weight | Bias |
|------|--------|------|
| Forget $W_f$ | 0.5 | 0 |
| Input $W_i$ | 0.6 | 0 |
| Candidate $W_c$ | 0.9 | 0 |
| Output $W_o$ | 0.7 | 0 |


- Activation: sigmoid $\sigma(x) = 1/(1+e^{-x})$, $\tanh$ as usual

<u>**Time Step 1: Input = 1**</u>

**Step 1: Forget Gate**

$$
f_1 = \sigma(W_f * x_1 + W_f * h_0) = \sigma(0.5*1 + 0.5*0) = \sigma(0.5) \approx 0.622
$$

**Step 2: Input Gate**

$$
i_1 = \sigma(W_i * x_1 + W_i * h_0) = \sigma(0.6*1 + 0.6*0) = \sigma(0.6) \approx 0.645
$$

**Step 3: Candidate Memory**

$$
\tilde{C}_1 = \tanh(W_c * x_1 + W_c * h_0) = \tanh(0.9*1 + 0) = \tanh(0.9) \approx 0.716
$$

**Step 4: Cell State Update**

$$
C_1 = f_1 * C_0 + i_1 * \tilde{C}_1 = 0.622*0 + 0.645*0.716 \approx 0.462
$$

**Step 5: Output Gate**

$$
o_1 = \sigma(W_o * x_1 + W_o * h_0) = \sigma(0.7*1 + 0) = \sigma(0.7) \approx 0.668
$$

**Step 6: Hidden State**

$$
h_1 = o_1 * \tanh(C_1) = 0.668 * \tanh(0.462) \approx 0.668 * 0.432 \approx 0.288
$$

After first time step: $C_1 \approx 0.462$, $h_1 \approx 0.288$


<u>**Time Step 2: Input = 2**</u>

**Step 1: Forget Gate**

$$
f_2 = \sigma(W_f * x_2 + W_f * h_1) = \sigma(0.5*2 + 0.5*0.288) = \sigma(1.144) \approx 0.758
$$

**Step 2: Input Gate**

$$
i_2 = \sigma(W_i * x_2 + W_i * h_1) = \sigma(0.6*2 + 0.6*0.288) = \sigma(1.373) \approx 0.798
$$

**Step 3: Candidate Memory**

$$
\tilde{C}_2 = \tanh(W_c * x_2 + W_c * h_1) = \tanh(0.9*2 + 0.9*0.288) = \tanh(2.059) \approx 0.968
$$

**Step 4: Cell State Update**

$$
C_2 = f_2 * C_1 + i_2 * \tilde{C}_2 = 0.758*0.462 + 0.798*0.968 \approx 1.123
$$

**Step 5: Output Gate**

$$
o_2 = \sigma(W_o * x_2 + W_o * h_1) = \sigma(0.7*2 + 0.7*0.288) = \sigma(1.602) \approx 0.832
$$

**Step 6: Hidden State**

$$
h_2 = o_2 * \tanh(C_2) = 0.832 * \tanh(1.123) \approx 0.832 * 0.809 \approx 0.673
$$

After second time step: $C_2 \approx 1.123$, $h_2 \approx 0.673$


**Summary Table**


| Time step | Input | Forget $f_t$ | Input $i_t$ | Candidate $\tilde{C}_t$ | Cell $C_t$ | Output $o_t$ | Hidden $h_t$ |
|-----------|-------|--------------|-------------|-------------------------|------------|--------------|---------------|
| 1         | 1     | 0.622        | 0.645       | 0.716                   | 0.462      | 0.668        | 0.288         |
| 2         | 2     | 0.758        | 0.798       | 0.968                   | 1.123      | 0.832        | 0.673         |
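The hand calculation can be replayed with a short script. It is a sketch that uses only the scalar weights from the setup table and applies each weight to both $x_t$ and $h_{t-1}$, as the worked steps do. The `history` list is a convenience introduced here.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Scalar weights from the setup table; all biases are 0.
W_f, W_i, W_c, W_o = 0.5, 0.6, 0.9, 0.7

h, C = 0.0, 0.0                      # h0 and C0
history = []
for x in [1.0, 2.0]:
    s = x + h                        # W*x + W*h == W*(x + h) for a shared scalar weight
    f = sigmoid(W_f * s)             # forget gate
    i = sigmoid(W_i * s)             # input gate
    C_tilde = math.tanh(W_c * s)     # candidate memory
    C = f * C + i * C_tilde          # cell state update
    o = sigmoid(W_o * s)             # output gate
    h = o * math.tanh(C)             # hidden state
    history.append((f, i, C_tilde, C, o, h))
    print(f"x={x:.0f}  f={f:.3f}  i={i:.3f}  C~={C_tilde:.3f}  "
          f"C={C:.3f}  o={o:.3f}  h={h:.3f}")
```

The script carries full precision between steps, so the last printed digit can differ slightly from values that were rounded by hand at each stage.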


**What This Shows**

1. Cell state $C_t$ accumulates long-term memory gradually  
2. Hidden state $h_t$ is controlled output at each step  
3. Gates regulate memory flow (forget old info, add new info, control output)  
4. LSTM mitigates vanishing gradients through additive updates in $C_t$


## LSTM vs RNN

| Aspect | RNN | LSTM |
|----|----|----|
| Memory | Short | Long + Short |
| Vanishing Gradient | Yes |  Controlled |
| Gates | No |  Yes |
| Complexity | Simple | More parameters |
| Performance | Limited | Strong on sequences |


#  Gated Recurrent Unit (GRU)

GRU is a simplified version of LSTM designed to solve the vanishing gradient problem while being computationally lighter.

- GRU is a type of RNN that controls memory using gates, similar to LSTM.  
- Differences from LSTM:
  1. **No separate cell state**: only a hidden state $h_t$.
  2. **Fewer gates**: combines forget + input gate into update gate.
  3. Faster to train and requires fewer parameters.  

Core idea:  

> Decide what to keep from the past and what to update from new input, using fewer gates.


## Components of GRU

A GRU cell has two main gates:

1. **Update gate ($z_t$)**: controls how much of the hidden state is overwritten by the new candidate ($1 - z_t$ is the share of the previous state that is kept)

   $$
   z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)
   $$
   
2. **Reset gate ($r_t$)**: controls how much of previous hidden state to combine with current input

   $$
   r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)
   $$

3. **Candidate hidden state ($\tilde{h}_t$)**: new information to be added

   $$
   \tilde{h}_t = \tanh(W_h [r_t * h_{t-1}, x_t] + b_h)
   $$

4. **Final hidden state ($h_t$)**: combination of previous state and candidate

   $$
   h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
   $$


Intuition:

- **Update gate $z_t$:** "How much of my memory should I replace with new information?" (a share $1 - z_t$ of the past is kept)
- **Reset gate $r_t$:** "How much of the previous memory should I ignore when calculating new info?"
- **Candidate $\tilde{h}_t$:** new info computed with selective memory  
- **Hidden state $h_t$:** mix of old and new memory controlled by $z_t$

## How GRU Works

For each time step:

1. **Compute update gate** $z_t$ → decide what proportion of the hidden state to replace (the rest, $1 - z_t$, is kept)
2. **Compute reset gate** $r_t$ → decide how much past info affects the candidate
3. **Compute candidate hidden state** $\tilde{h}_t$ using $r_t$
4. **Update hidden state** $h_t$ → weighted sum of previous hidden state and candidate

> GRU merges LSTM's forget and input gates into one update gate, so it's simpler.
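The steps above can be sketched as a single NumPy function. The `params` layout, the sizes, and the random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU step using the equations above.

    params maps a gate name to (W, b); each W acts on a concatenation [h, x_t].
    """
    W_z, b_z = params["z"]
    W_r, b_r = params["r"]
    W_h, b_h = params["h"]

    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)                      # update gate
    r_t = sigmoid(W_r @ hx + b_r)                      # reset gate
    # The candidate sees the previous state only through the reset gate.
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_tilde        # blend old state and candidate

# Illustrative sizes and random weights (assumptions, not from the text).
hidden, inputs = 3, 2
rng = np.random.default_rng(0)
params = {k: (rng.normal(scale=0.5, size=(hidden, hidden + inputs)), np.zeros(hidden))
          for k in "zrh"}

h = np.zeros(hidden)
for x_t in rng.normal(size=(4, inputs)):               # a sequence of 4 inputs
    h = gru_step(x_t, h, params)
print(h)
```

With this form of the final equation, $z_t = 0$ copies the previous state unchanged and $z_t = 1$ replaces it entirely with the candidate.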


## Manual Example of GRU

**Setup:**

- Sequence: `x = [1, 2]`  
- Previous hidden: `h0 = 0`  
- Weights (scalars):

| Gate | Weight | Bias |
|------|--------|------|
| Update $W_z$ | 0.6 | 0 |
| Reset $W_r$ | 0.5 | 0 |
| Candidate $W_h$ | 0.9 | 0 |

- Activation: sigmoid & tanh  


<u>**Time Step 1: Input = 1**</u>

1. **Update gate**
$$
z_1 = \sigma(W_z * x_1 + W_z * h_0) = \sigma(0.6*1 + 0.6*0) = \sigma(0.6) \approx 0.645
$$

2. **Reset gate**
$$
r_1 = \sigma(W_r * x_1 + W_r * h_0) = \sigma(0.5*1 + 0) = \sigma(0.5) \approx 0.622
$$

3. **Candidate hidden state**
$$
\tilde{h}_1 = \tanh(W_h * (r_1 * h_0 + x_1)) = \tanh(0.9 * (0.622*0 + 1)) = \tanh(0.9) \approx 0.716
$$

4. **Hidden state**
$$
h_1 = (1 - z_1) * h_0 + z_1 * \tilde{h}_1 = (1 - 0.645)*0 + 0.645*0.716 \approx 0.462
$$


<u>**Time Step 2: Input = 2**</u>

1. **Update gate**
$$
z_2 = \sigma(0.6*2 + 0.6*0.462) = \sigma(1.477) \approx 0.814
$$

2. **Reset gate**
$$
r_2 = \sigma(0.5*2 + 0.5*0.462) = \sigma(1.231) \approx 0.774
$$

3. **Candidate hidden state**
$$
\tilde{h}_2 = \tanh(0.9*(r_2*h_1 + x_2)) = \tanh(0.9*(0.774*0.462 + 2)) = \tanh(2.122) \approx 0.972
$$

4. **Hidden state**
$$
h_2 = (1 - z_2) * h_1 + z_2 * \tilde{h}_2 = (1 - 0.814)*0.462 + 0.814*0.972 \approx 0.877
$$

**Summary Table**


| Time step | Input | $z_t$ | $r_t$ | Candidate $\tilde{h}_t$ | Hidden $h_t$ |
|-----------|-------|-----|-----|-------------------------|------------|
| 1         | 1     | 0.645 | 0.622 | 0.716 | 0.462 |
| 2         | 2     | 0.814 | 0.774 | 0.972 | 0.877 |


The hidden state grows, incorporating both previous memory and new input. GRU is simpler than LSTM because it has no separate cell state.
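As with the LSTM example, the GRU walkthrough can be replayed in a few lines. This sketch uses only the scalar weights from the setup table. The `history` list is a convenience introduced here.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Scalar weights from the setup table; all biases are 0.
W_z, W_r, W_h = 0.6, 0.5, 0.9

h = 0.0                                    # h0
history = []
for x in [1.0, 2.0]:
    z = sigmoid(W_z * (x + h))             # update gate
    r = sigmoid(W_r * (x + h))             # reset gate
    h_tilde = math.tanh(W_h * (r * h + x)) # candidate hidden state
    h = (1 - z) * h + z * h_tilde          # new hidden state
    history.append((z, r, h_tilde, h))
    print(f"x={x:.0f}  z={z:.3f}  r={r:.3f}  h~={h_tilde:.3f}  h={h:.3f}")
```

The script carries full precision between steps, so the last printed digit can differ slightly from values that were rounded by hand at each stage.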


## LSTM vs GRU


| Aspect | LSTM | GRU |
|--------|------|-----|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| Cell State | Separate $C_t$ | No separate cell; only hidden state |
| Complexity | More parameters | Fewer parameters → faster |
| Memory Control | Fine-grained (long + short term) | Combined memory (less flexible) |
| Training Speed | Slower | Faster |
| Performance | Slightly better on very long sequences | Comparable in practice |
| Use Case | When long-term dependencies are crucial | When data is smaller or speed is important |


- LSTM = heavy-duty memory machine  
- GRU = light, fast, almost as effective, simpler to implement  
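To put a number on "fewer parameters", the sketch below counts weights for the formulations used in this notebook: one $W[h_{t-1}, x_t]$ matrix and one bias per gate or candidate. The function names and layer sizes are assumptions. Library implementations may store additional bias vectors, so their counts can differ slightly.

```python
def lstm_param_count(input_size, hidden_size):
    """Four blocks (forget, input, candidate, output), each a W[h, x] matrix plus a bias."""
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

def gru_param_count(input_size, hidden_size):
    """Three blocks (update, reset, candidate), each a W[h, x] matrix plus a bias."""
    return 3 * (hidden_size * (hidden_size + input_size) + hidden_size)

input_size, hidden_size = 64, 128          # illustrative sizes (assumption)
n_lstm = lstm_param_count(input_size, hidden_size)
n_gru = gru_param_count(input_size, hidden_size)
print(n_lstm, n_gru, n_gru / n_lstm)       # 98816 74112 0.75
```

For equal sizes, a GRU layer in this formulation has three quarters of the parameters of an LSTM layer.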

# What to Remember About RNN, LSTM, and GRU


1️⃣ **The ONE Thing You Must Never Forget**

- **Vanilla RNN fails** → vanishing gradients → short memory
- **LSTM exists** → protect long-term information
- **GRU exists** → simpler, faster alternative to LSTM

If you remember nothing else, remember this causal chain.

2️⃣ **One-Line Mental Models**

**Vanilla RNN**
> "Hidden state is repeatedly multiplied → gradients die."

**LSTM**
> "Separate memory highway + gates decide what to keep, add, and expose."

**GRU**
> "Single memory state that blends old and new information efficiently."


3️⃣ **Structural Facts Worth Storing**

**LSTM: remember ONLY these**
- Two states:
  - **Cell state ($C_t$)** → long-term memory
  - **Hidden state ($h_t$)** → output / short-term memory
- Gates are control valves, not math:
  - Forget → erase memory
  - Input → write memory
  - Output → expose memory
- Additive memory update → gradients survive

**GRU: remember ONLY these**
- One state: **hidden state**
- Two gates:
  - **Update gate** → how much of the past to keep versus overwrite
  - **Reset gate** → how much past to ignore
- No separate cell state → faster, simpler


# Practical Guidelines

- Use **RNN** → very short sequences, teaching concepts
- Use **LSTM** → long dependencies, language, time series
- Use **GRU** → limited data, faster training

A common rule of thumb:
> Start with GRU, move to LSTM if needed


# Key Takeaways from Day 25

- Vanilla RNNs fail due to vanishing gradients
- LSTM introduces gated memory control
- The additive cell-state update, regulated by the forget gate, is the most critical innovation
- Cell state enables long-term dependency learning
- GRU is a lighter alternative to LSTM

---

<p style="text-align:center; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
