# 🔁 Recurrent Neural Networks (RNNs)

## 📌 What is an RNN?

A **Recurrent Neural Network (RNN)** is a type of neural network designed to process **sequential data**, such as text, time series, or speech.

The key feature is a **hidden state** that carries information from previous time steps, enabling the network to have "memory."

---

## 🔧 RNN Cell Update Formula

At each time step \( t \), the hidden state is updated as:

$$
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
$$

- \( x_t \): input at time step \( t \)
- \( h_{t-1} \): hidden state from the previous step
- \( W_{xh} \): input-to-hidden weights
- \( W_{hh} \): hidden-to-hidden weights (shared across all time steps)
- \( b_h \): bias
- \( \tanh \): non-linearity to keep activations bounded

---

## 🔄 Unrolling the RNN

For a sequence of length \( T \), the RNN is "unrolled" like this:

$$
\begin{aligned}
h_1 &= \tanh(W_{xh} x_1 + W_{hh} h_0 + b_h) \\
h_2 &= \tanh(W_{xh} x_2 + W_{hh} h_1 + b_h) \\
&\;\;\vdots \\
h_T &= \tanh(W_{xh} x_T + W_{hh} h_{T-1} + b_h)
\end{aligned}
$$

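Here \( h_0 \) is the initial hidden state (commonly a zero vector). The same weights are reused at every step, so the hidden state flows through time and accumulates information from earlier inputs.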
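The unrolled recurrence can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes, the random weights, and the zero initial state are assumptions chosen only to make the example run.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Apply h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) over a whole sequence.

    xs has shape (T, input_size); the result has shape (T, hidden_size)
    and holds h_1 ... h_T.
    """
    h = h0
    hs = []
    for x_t in xs:  # one iteration per time step
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # same weights at every step
        hs.append(h)
    return np.stack(hs)

# Hypothetical sizes and random weights, for illustration only.
T, input_size, hidden_size = 5, 3, 4
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, input_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

hs = rnn_forward(xs, np.zeros(hidden_size), W_xh, W_hh, b_h)
print(hs.shape)  # (5, 4)
```

The loop makes the sequential dependency explicit: step \( t \) cannot start until step \( t-1 \) has produced its hidden state.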

---

## 🧠 Key Concepts

### ✅ Strengths:
- Naturally handles variable-length sequences
- Captures temporal patterns (e.g. language, stock prices)
- Simple and intuitive architecture

### ❌ Limitations:
- **Vanishing gradients**: gradients reaching early time steps shrink toward zero, so long-range dependencies are hard to learn
- **Exploding gradients**: gradients can become too large and destabilize training
- **Slow training**: can't parallelize across time steps

---

## 📉 Vanishing & Exploding Gradients

To compute the gradient of the loss \( L \) with respect to a past hidden state:

$$
\frac{\partial L}{\partial h_{t-k}} = \frac{\partial L}{\partial h_t} \cdot \prod_{i=1}^{k} \frac{\partial h_{t-i+1}}{\partial h_{t-i}}
$$
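
For the \( \tanh \) cell defined earlier, each factor in this product can be written out explicitly. A short derivation sketch, where the square is taken element-wise and \( \operatorname{diag}(\cdot) \) places a vector on the diagonal of a matrix:

$$
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\left(1 - h_t^{2}\right) W_{hh}
$$

Because every entry of \( 1 - h_t^{2} \) is at most 1, the norm of each factor is at most \( \lVert W_{hh} \rVert \), so the product over \( k \) steps is bounded by \( \lVert W_{hh} \rVert^{k} \). A recurrent matrix with small norm therefore drives the gradient toward zero, while a large one allows it to grow.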

### Vanishing:
If each factor in the product has magnitude below 1 (say 0.5), the gradient shrinks exponentially with \( k \). For \( k = 3 \):

$$
0.5 \cdot 0.5 \cdot 0.5 = 0.125
$$

→ The gradient reaching early time steps is **near zero**, so the network struggles to learn from them.

### Exploding:
If each factor has magnitude above 1 (say 5), the gradient grows exponentially with \( k \). For \( k = 3 \):

$$
5 \cdot 5 \cdot 5 = 125
$$

→ Early time steps receive a **huge gradient**, which can blow up the weights.
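
The same arithmetic can be extended to longer sequences with a small sketch. It assumes every per-step factor is the same scalar, which is a simplification of the Jacobian product above, and `gradient_scale` is a hypothetical helper name.

```python
def gradient_scale(factor, steps):
    """Multiply `steps` identical per-step factors together."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

for steps in (3, 10, 50):
    shrinking = gradient_scale(0.5, steps)
    growing = gradient_scale(5.0, steps)
    print(f"k={steps:2d}  0.5^k={shrinking:.3e}  5^k={growing:.3e}")
```

After 50 steps the shrinking product is below \( 10^{-15} \) and the growing one is above \( 10^{34} \), which is why both problems get worse as sequences get longer.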

---

## 🛡️ Fixes

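- **Gradient clipping** to prevent explosion:
  ```python
  import torch  # call the line below after loss.backward() and before optimizer.step()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  ```
- **Gated architectures** (LSTM and GRU, covered below) to mitigate vanishing gradients.
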

# 🔐 Long Short-Term Memory (LSTM)

## 📌 What is an LSTM?

An **LSTM (Long Short-Term Memory)** is a type of recurrent neural network (RNN) designed to **learn long-term dependencies** more effectively than a basic RNN.

It addresses the **vanishing gradient problem** by introducing a **cell state** and **gating mechanisms** that control the flow of information.

---

## 🧠 Core Components

LSTM maintains:
- A **hidden state** $h_t$ — short-term working memory
- A **cell state** $c_t$ — long-term memory
- **Gates** to manage memory updates

---

## ⚙️ LSTM Cell Equations

At time step $t$, with input $x_t$, previous hidden state $h_{t-1}$, and previous cell state $c_{t-1}$:

### 🔒 Forget gate:
$$
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
$$

### ➕ Input gate:
$$
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
$$

### 🧠 Candidate memory:
$$
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
$$

### 🔁 Cell state update:
$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
$$

### 📤 Output gate:
$$
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
$$

### 🧠 Hidden state update:
$$
h_t = o_t \odot \tanh(c_t)
$$

Here:
- $\sigma$ is the sigmoid activation function  
- $\tanh$ is the hyperbolic tangent function  
- $\odot$ is element-wise multiplication
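
The six equations map almost line for line onto code. The following NumPy sketch is one possible transcription; the parameter dictionary layout, the sizes, and the random initial weights are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update. `p` maps names like "W_f", "U_f", "b_f" to arrays."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                                # cell state update
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                          # hidden state update
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only.
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
params = {}
for g in "fico":  # forget, input, candidate, output
    params[f"W_{g}"] = rng.normal(scale=0.5, size=(hidden_size, input_size))
    params[f"U_{g}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    params[f"b_{g}"] = np.zeros(hidden_size)

h_t, c_t = lstm_step(rng.normal(size=input_size),
                     np.zeros(hidden_size), np.zeros(hidden_size), params)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```

Note that `*` on NumPy arrays is element-wise, matching $\odot$, while `@` is the matrix product.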

---

## 🧱 What Each Gate Does

| Gate | Equation | Purpose |
|------|----------|---------|
| Forget Gate $f_t$ | $\sigma(W_f x_t + U_f h_{t-1} + b_f)$ | Decides what to forget from $c_{t-1}$ |
| Input Gate $i_t$ | $\sigma(W_i x_t + U_i h_{t-1} + b_i)$ | Decides what new info to add |
| Candidate $\tilde{c}_t$ | $\tanh(W_c x_t + U_c h_{t-1} + b_c)$ | Proposes new memory |
| Output Gate $o_t$ | $\sigma(W_o x_t + U_o h_{t-1} + b_o)$ | Decides what to output as $h_t$ |

---

## 💡 Why LSTMs Solve the Vanishing Gradient Problem

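- The **cell state** update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ is **additive**: the old memory is scaled by the forget gate instead of being pushed through a weight matrix and a $\tanh$ at every step.
- Along this path the per-step gradient factor is approximately $f_t$, which the network can learn to keep close to 1.
- When $f_t \approx 1$ and $i_t \approx 0$, the cell state is carried forward almost unchanged:
  $$
  c_t \approx c_{t-1}
  $$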

---

## 📊 Activation Function Roles

- **Sigmoid**: used in gates — outputs in \( (0, 1) \), perfect for gating/filtering
- **Tanh**: used for candidate and output signal — outputs in \( (-1, 1) \), good for representing values with direction (positive or negative)

| Function | Used For | Role |
|----------|----------|------|
| \( \sigma \) (sigmoid) | Forget, Input, Output gates | Controls how much info flows |
| \( \tanh \) | Candidate memory, final hidden state | Represents information signal |

---

## 🧠 Summary

- LSTM introduces a **cell state** that persists through time with minimal modification.
- **Gates** allow the model to learn what to keep, forget, and output.
- Mitigates vanishing gradients, so important long-term dependencies can be preserved.
- Used heavily before transformers, still useful in speech, time series, and edge devices.



# 💡 Why Gradients in LSTM Flow Additively

The core LSTM cell state update equation is:

$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
$$

Let’s look at how this evolves over multiple time steps:

### 🧮 Step-by-Step Cell State Flow

Unrolling the cell across 3 time steps:

- $c_1 = f_1 \odot c_0 + i_1 \odot \tilde{c}_1$
- $c_2 = f_2 \odot c_1 + i_2 \odot \tilde{c}_2$
- $c_3 = f_3 \odot c_2 + i_3 \odot \tilde{c}_3$

Now substitute $c_1$ and $c_2$ recursively:

$$
\begin{aligned}
c_3 &= f_3 \odot [f_2 \odot (f_1 \odot c_0 + i_1 \odot \tilde{c}_1) + i_2 \odot \tilde{c}_2] + i_3 \odot \tilde{c}_3 \\
&= f_3 \odot f_2 \odot f_1 \odot c_0 + f_3 \odot f_2 \odot i_1 \odot \tilde{c}_1 + f_3 \odot i_2 \odot \tilde{c}_2 + i_3 \odot \tilde{c}_3
\end{aligned}
$$

### ✅ Key Insight

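- Each contribution is **gated** and then **added**.
- The oldest term, $c_0$, is scaled only by the product of forget gates $f_3 \odot f_2 \odot f_1$, not by repeated weight matrices and $\tanh$ derivatives as in vanilla RNNs.
- If the network learns to keep those forget gates near 1, gradients **don’t vanish as easily** and useful information can persist much longer.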
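
The expansion can be checked numerically. The sketch below treats the gate values as fixed random numbers, which is a simplifying assumption (in a real LSTM they depend on $x_t$ and $h_{t-1}$), and confirms that the recursive and expanded forms of $c_3$ agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                          # hypothetical cell size
f = rng.uniform(0.0, 1.0, size=(3, n))         # f_1, f_2, f_3
i = rng.uniform(0.0, 1.0, size=(3, n))         # i_1, i_2, i_3
c_tilde = rng.uniform(-1.0, 1.0, size=(3, n))  # candidate memories
c0 = rng.normal(size=n)

# Recursive form: c_t = f_t * c_{t-1} + i_t * c~_t
c = c0
for t in range(3):
    c = f[t] * c + i[t] * c_tilde[t]

# Expanded form from the derivation above
expanded = (f[2] * f[1] * f[0] * c0
            + f[2] * f[1] * i[0] * c_tilde[0]
            + f[2] * i[1] * c_tilde[1]
            + i[2] * c_tilde[2])

print(np.allclose(c, expanded))  # True

# With the gates held fixed, the sensitivity of c_3 to c_0 is this product.
dc3_dc0 = f[2] * f[1] * f[0]
print(dc3_dc0)
```

The last line shows the factor that multiplies $c_0$: it stays close to 1 only when all three forget gates are close to 1.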

---

# 🎯 Why Use Sigmoid vs Tanh?

## 🧩 Sigmoid: For Gates

The sigmoid function:

$$
\sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1)
$$

Used in:
- Forget gate: $f_t$
- Input gate: $i_t$
- Output gate: $o_t$

### ✅ Reason:
- Acts like a **soft switch**
- Values near 0 block the signal; values near 1 let it pass
- Perfect for **controlling flow**

---

## 🧠 Tanh: For Memory & Signal

The tanh function:

$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \in (-1, 1)
$$

Used in:
- Candidate memory: $\tilde{c}_t$
- Hidden state output: $h_t = o_t \odot \tanh(c_t)$

### ✅ Reason:
- Outputs centered around 0 (good for **expressiveness**)
- Can encode **positive and negative** values
- Keeps the activations **bounded**, which helps training stability

---

## 🔍 Summary Table

| Component               | Activation | Purpose                                |
|------------------------|------------|----------------------------------------|
| Forget/Input/Output Gates | Sigmoid    | Soft control (0 to 1)                  |
| Candidate Memory, Output | Tanh       | Rich signal (−1 to 1)                  |
| Cell State Update        | Additive   | Preserves memory across time steps     |


# 🧠 Gated Recurrent Units (GRUs)

## Overview

GRUs are a type of recurrent neural network that aim to mitigate the **vanishing gradient problem** found in vanilla RNNs. They use a simpler gating mechanism than LSTMs, while still enabling the network to retain or forget information over long sequences.

GRUs are often faster to train than LSTMs due to fewer parameters and offer competitive performance.

---

## 🔢 Key Equations

Let:
- $x_t$ = input at time $t$
- $h_{t-1}$ = previous hidden state
- $h_t$ = current hidden state
- $\sigma$ = sigmoid activation
- $\tanh$ = tanh activation
- $\odot$ = element-wise multiplication

**1. Update Gate**
$$
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
$$

**2. Reset Gate**
$$
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
$$

**3. Candidate Hidden State**
$$
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
$$

**4. Final Hidden State**
$$
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
$$
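
As a rough illustration, the four equations can be transcribed into NumPy as follows. The parameter dictionary layout, the sizes, and the random weights are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU update. `p` maps names like "W_z", "U_z", "b_z" to arrays."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])              # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])              # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                               # blend

# Hypothetical sizes and random weights, for illustration only.
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
params = {}
for g in "zrh":  # update, reset, candidate
    params[f"W_{g}"] = rng.normal(scale=0.5, size=(hidden_size, input_size))
    params[f"U_{g}"] = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
    params[f"b_{g}"] = np.zeros(hidden_size)

h_prev = rng.uniform(-1.0, 1.0, size=hidden_size)
h_t = gru_step(rng.normal(size=input_size), h_prev, params)
print(h_t.shape)  # (4,)
```

Some libraries write the final blend with the roles of $z_t$ and $1 - z_t$ swapped (PyTorch's `nn.GRU` documentation does), so it is worth checking the convention before comparing gate values across implementations.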

---

## 🧠 Intuition

- The **update gate** $z_t$ decides how much of the new candidate state vs. the previous hidden state to keep.
- The **reset gate** $r_t$ controls how much of the past hidden state to forget *when computing the candidate*.
- The final hidden state $h_t$ is a **blend** of the previous hidden state and the new candidate, controlled by $z_t$.
- This **additive structure** helps preserve gradient flow across time steps, which mitigates vanishing gradients.

---

## ✅ Why GRUs Help with Vanishing Gradients

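While GRUs do use **multiplicative gates**, it’s the **additive composition** of the final hidden state:
$$
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
$$
that gives gradients a direct path through time. When $z_t \approx 0$, $h_t \approx h_{t-1}$ and the gradient passes through almost unchanged, unlike vanilla RNNs, where every step multiplies by the recurrent weight matrix and a $\tanh$ derivative. Exploding gradients can still occur, so gradient clipping remains useful.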

---

# 🔁 Comparison Table: Vanilla RNN vs LSTM vs GRU

| Feature                      | Vanilla RNN                    | LSTM                                          | GRU                                      |
|-----------------------------|--------------------------------|-----------------------------------------------|------------------------------------------|
| States                      | $h_t$                          | $h_t$, $c_t$ (hidden + cell)                  | $h_t$ only                               |
| Gates                       | None                           | Input, Forget, Output                         | Update, Reset                            |
| Update Equation             | $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ | Complex gating with cell and hidden state     | Blended candidate and hidden state       |
| Vanishing Gradient Handling | ❌                             | ✅ via cell state with additive updates       | ✅ via additive updates in $h_t$         |
| Training Speed              | Fast                           | Slower due to more parameters                 | Faster than LSTM                         |
| Parameter Count             | Low                            | High                                          | Medium                                   |
| Use Case Suitability        | Short-term dependencies        | Long-term dependencies, large datasets        | Similar to LSTM; often preferred for smaller datasets or tighter compute budgets |
| Output                      | $h_t$                          | $h_t$ (modulated by $o_t$ and $c_t$)          | $h_t$                                    |

---

# ✅ Summary

- GRUs simplify LSTMs by combining the input and forget mechanisms into one **update gate**, and removing the separate cell state.
- They often perform comparably to LSTMs while training faster, and with fewer parameters they may overfit less on small datasets.
- The key to avoiding vanishing gradients is the **additive blending** in the hidden state update.



# Padding and Masking in Sequence Models

## Why This Matters

- Real-world sequences such as sentences or time-series data naturally have different lengths.
- Batched training requires every sequence in a batch to have the same length, so that the batch can be stored as a single rectangular tensor.
- **Padding** is used to bring all sequences to the same length, while **masking** tells the model which parts of the sequence are actual data versus padding.

---

## Padding Sequences

When you pad a sequence, you add a special token (often represented by 0 or a token like `[PAD]`) to sequences shorter than the maximum length in a batch so that every sequence has the same length. For example, if you have three sequences:

- Sequence 1: [1, 2, 3]
- Sequence 2: [4, 5]
- Sequence 3: [6, 7, 8, 9]

After padding (to match the length 4, which is the longest), you get:

- Padded Sequence 1: [1, 2, 3, 0]
- Padded Sequence 2: [4, 5, 0, 0]
- Padded Sequence 3: [6, 7, 8, 9]

Here, 0 acts as the padding token.

---

## Creating a Mask

A mask is created alongside the padded sequences to indicate which tokens are real data and which are padding. In the example above, the mask would be:

- For Sequence 1: [1, 1, 1, 0]
- For Sequence 2: [1, 1, 0, 0]
- For Sequence 3: [1, 1, 1, 1]

A '1' (or `True`) indicates a real token, while a '0' (or `False`) indicates a padded token. This mask is used during model computations (like in attention mechanisms or loss calculations) so that the padding does not affect the learning process.
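
Padding and mask creation for the example above can be sketched in plain Python. `pad_and_mask` is a hypothetical helper, and right-padding with 0 is an assumption taken from the example.

```python
def pad_and_mask(sequences, pad_value=0):
    """Right-pad each sequence to the longest length and build a 1/0 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # built from lengths, not values
    return padded, mask

padded, mask = pad_and_mask([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
print(padded)  # [[1, 2, 3, 0], [4, 5, 0, 0], [6, 7, 8, 9]]
print(mask)    # [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
```

Building the mask from sequence lengths rather than by testing for the pad value keeps it correct even if 0 also appears as a real token.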

---

## Optional: Packing for RNNs

When working with RNNs (such as GRUs or LSTMs), it can be beneficial to pack the batch (e.g., with `pack_padded_sequence` in PyTorch). Packing lets the RNN process only the real time steps instead of spending computation on padding, which makes training more efficient and keeps padded positions from influencing the final hidden state.
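
To show the idea behind packing without depending on a framework, here is a conceptual sketch in plain Python. It is not PyTorch's implementation: `pack_sorted` is a hypothetical helper that only mimics the general layout of a packed batch, namely the real tokens flattened in time-major order plus the number of active sequences at each time step.

```python
def pack_sorted(sequences):
    """Flatten sequences time step by time step, skipping padding entirely.

    Assumes `sequences` is already sorted by length, longest first.
    """
    lengths = [len(seq) for seq in sequences]
    if lengths != sorted(lengths, reverse=True):
        raise ValueError("sequences must be sorted by length, longest first")
    data, batch_sizes = [], []
    for t in range(lengths[0]):
        active = [seq[t] for seq in sequences if len(seq) > t]
        data.extend(active)
        batch_sizes.append(len(active))  # sequences still running at step t
    return data, batch_sizes

data, batch_sizes = pack_sorted([[6, 7, 8, 9], [1, 2, 3], [4, 5]])
print(data)         # [6, 1, 4, 7, 2, 5, 8, 3, 9]
print(batch_sizes)  # [3, 3, 2, 1]
```

The batch sizes sum to 9, the number of real tokens, whereas the padded batch would contain 12 positions.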

---

## Summary

| Term     | Purpose                                      |
|----------|----------------------------------------------|
| Padding  | Makes sequences the same length by adding tokens (e.g., 0 or [PAD]) to shorter sequences |
| Masking  | Identifies which positions in the sequence are real data vs. padding, so that the model can ignore the padding during computation |
| Packing  | (Optional) Groups valid data together for RNNs to avoid processing padding tokens, leading to efficient training |

This approach—combining padding, masking, and optionally packing—ensures that variable-length sequences can be handled effectively by neural networks without contaminating the learning process with irrelevant padding information.


# 🧱 Sequence Preprocessing: Padding, Masking, Tokenization & Special Tokens

---

## 📏 Managing Variable-Length Sequences

### 🔹 Padding

- Sequences (like sentences) are rarely the same length.
- Neural networks, especially in batched training, require fixed-size inputs.
- Padding is used to make all sequences the same length by appending a special token (commonly `0` or `[PAD]`).

**Example:**

Raw tokenized sequences:  
[5, 3, 7]  
[12, 5]  
[15, 8, 9, 4]  

After padding to max length 4:  
[5, 3, 7, 0]  
[12, 5, 0, 0]  
[15, 8, 9, 4]  

---

### 🔸 Masking

- Padding introduces fake data — we don’t want the model to learn from that.
- A **mask** is created to indicate which tokens are real (1) and which are padding (0).

**Example Mask:**  
[1, 1, 1, 0]  
[1, 1, 0, 0]  
[1, 1, 1, 1]  

Masks are used in:  
- Attention layers (to block out padding)
- Loss calculations (to ignore padded tokens)
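
For the loss case, a masked average can be sketched as below. The per-token loss values are made up for illustration (padded positions are given a large value to make their effect visible), and `masked_mean` is a hypothetical helper.

```python
def masked_mean(per_token_loss, mask):
    """Average the loss over real tokens only (positions where mask is 1)."""
    total, count = 0.0, 0
    for loss_row, mask_row in zip(per_token_loss, mask):
        for value, keep in zip(loss_row, mask_row):
            if keep:
                total += value
                count += 1
    return total / count

# Hypothetical per-token losses; 9.0 marks padded positions.
loss = [[0.2, 0.4, 0.6, 9.0],
        [0.1, 0.3, 9.0, 9.0],
        [0.5, 0.5, 0.5, 0.5]]
mask = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 1]]

masked = masked_mean(loss, mask)
unmasked = sum(sum(row) for row in loss) / 12
print(masked, unmasked)
```

The masked average is about 0.4, while the unmasked one is about 2.55 because the three padded positions dominate it.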

---

## 🧠 Tokenization Methods

Transformers operate on **token IDs**, not raw text. Tokenization turns raw text into smaller units called tokens.

---

### 🔹 Word-Level Tokenization

- Splits text by spaces or punctuation.
- Easy but can't handle rare or unknown words.
- Example:  
  `"unhappiness"` → `["unhappiness"]`

---

### 🔸 Character-Level Tokenization

- Breaks text into individual characters.
- Always works (no OOV), but loses word-level semantics and creates long sequences.  
  `"hello"` → `["h", "e", "l", "l", "o"]`

---

### 🔶 Subword Tokenization (used in BERT, GPT, T5, etc.)

Breaks rare or complex words into more frequent, reusable **subword units**.

---

#### 🧩 WordPiece (Used in BERT)

- Learns subwords that maximize likelihood of training data.
- Adds `##` to mark subword continuations.
- Example (illustrative; the actual split depends on the learned vocabulary):  
  `"unhappiness"` → `["un", "##happi", "##ness"]`
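
Once a WordPiece vocabulary exists, words are commonly split with a greedy longest-match-first scan. The sketch below illustrates only that splitting step, not how the vocabulary is learned; the toy vocabulary is invented so that it reproduces the example above.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word using a fixed vocabulary."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "##happi", "##ness", "##s"}  # hypothetical vocabulary
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happi', '##ness']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

With a different vocabulary the same word would split differently, which is why published examples of subword splits should be read as illustrations.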

---

#### 🔗 Byte-Pair Encoding (BPE) (Used in GPT-2, GPT-3)

- Inspired by compression algorithms.
- Starts with characters, then merges most frequent **adjacent pairs**.
- Example (illustrative; the actual split depends on the learned merges):  
  `"blockchainbro"` → `["block", "chain", "bro"]`  
  `"unhappiness"` → `["un", "happiness"]` or `["un", "happi", "ness"]`
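
The merge-learning loop can be sketched on a toy corpus. This is a simplified, character-level version for illustration; the word list is invented, and production tokenizers add details (such as byte-level handling and end-of-word markers) that are left out here.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, k = [], 0
            while k < len(symbols):
                if k + 1 < len(symbols) and (symbols[k], symbols[k + 1]) == best:
                    merged.append(symbols[k] + symbols[k + 1])  # fuse the pair
                    k += 2
                else:
                    merged.append(symbols[k])
                    k += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["hug", "hug", "pug", "hugs"], num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
print(dict(corpus))
```

The first merge fuses `u` and `g` because that pair occurs four times; the second fuses `h` with the new `ug` symbol, so `hug` becomes a single token while `pug` stays split as `p` + `ug`.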

---

#### 📦 SentencePiece (Used in T5, mT5)

- Trained directly on raw text; whitespace is treated as an ordinary symbol (`▁`) rather than used to pre-split words.
- Useful for multilingual models or languages without spaces.
- Operates at the Unicode level.

---

## 🧩 Special Tokens in Transformer Models

| Token     | Purpose                                                                 |
|-----------|-------------------------------------------------------------------------|
| `[PAD]`   | Padding token; excluded from attention and loss via the mask            |
| `[MASK]`  | Used during masked language modeling (BERT pretraining)                 |
| `[CLS]`   | Classification token; summary vector for sentence(s)                    |
| `[SEP]`   | Separator token between sentences or segments                           |
| `[UNK]`   | Unknown token (used when subword split is impossible — rare in practice) |

---

## 📌 Example: BERT Input Format

Suppose you're inputting two sentences for classification:

- Sentence A: "I love AI."
- Sentence B: "It is powerful."

**BERT Input Tokens (structured format):**  
`[CLS] I love AI . [SEP] It is powerful . [SEP]`

- `[CLS]` goes at the start and its final embedding is used for classification.
- `[SEP]` separates Sentence A and B.
- The words are then split into subwords using WordPiece, and every token is mapped to an integer ID (the **input IDs**).

---

### Behind the scenes, the BERT tokenizer outputs:

- `input_ids`: Token IDs including `[CLS]`, `[SEP]`, subwords, etc.
- `token_type_ids`: 0s for Sentence A, 1s for Sentence B
- `attention_mask`: 1s for real tokens, 0s for padding
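
The three outputs can be assembled by hand to see how they line up. The sketch below uses a made-up vocabulary with made-up IDs (not the real BERT vocabulary), assumes lowercased tokens that are already split, and `encode_pair` is a hypothetical helper rather than a library function.

```python
def encode_pair(tokens_a, tokens_b, vocab, max_len):
    """Build input_ids, token_type_ids and attention_mask for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    n_pad = max_len - len(tokens)
    if n_pad < 0:
        raise ValueError("sequence longer than max_len; truncation is not handled")
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # Segment 0: [CLS], sentence A, first [SEP]. Segment 1: sentence B, last [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    # Pad all three lists to max_len.
    input_ids += [vocab["[PAD]"]] * n_pad
    token_type_ids += [0] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, token_type_ids, attention_mask

# Hypothetical vocabulary and IDs, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "i": 4, "love": 5,
             "ai": 6, ".": 7, "it": 8, "is": 9, "powerful": 10}

input_ids, token_type_ids, attention_mask = encode_pair(
    ["i", "love", "ai", "."], ["it", "is", "powerful", "."], toy_vocab, max_len=12)
print(input_ids)       # [2, 4, 5, 6, 7, 3, 8, 9, 10, 7, 3, 0]
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
```

All three lists have the same length, and the single padded position at the end is the only place where `attention_mask` is 0.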

---

## ✅ Summary

| Concept              | Purpose                                                                 |
|----------------------|-------------------------------------------------------------------------|
| Padding              | Aligns sequence lengths for batch processing                            |
| Masking              | Prevents model from learning from padded tokens                         |
| Word-Level Tokenization | Splits by word, simple but limited                                    |
| Character-Level      | No OOV, but loses structure                                              |
| Subword Tokenization | Breaks into reusable chunks, balances vocab size and coverage           |
| WordPiece            | BERT-style subwords with `##` continuation markers                       |
| BPE                  | GPT-style subword merges based on frequency of adjacent symbol pairs    |
| SentencePiece        | Raw-text trained, good for multilingual or no-space languages           |
| `[PAD]`, `[MASK]`, `[CLS]`, `[SEP]`, `[UNK]` | Special tokens that signal padding, masking, classification, segmentation, and unknown words |

