# 📚 1. Basic Derivatives and Chain Rule

---

## 🔹 What is a Derivative?

- Measures the **rate of change** of a function.
- In deep learning, derivatives tell us **how much to adjust weights** to minimize loss.

**Example:**

$$
f(x) = x^2
\quad \Rightarrow \quad
f'(x) = 2x
$$
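
The example above can be checked numerically. The following is a minimal sketch, assuming a central finite difference with a hypothetical step size `h`; the helper name `central_difference` is illustrative and not from any library.

```python
def f(x):
    return x ** 2

def central_difference(func, x, h=1e-5):
    """Approximate func'(x) with a symmetric finite difference."""
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 3.0
numeric = central_difference(f, x0)  # numerical estimate of f'(3)
analytic = 2 * x0                    # f'(x) = 2x from the formula above
```

The two values should agree up to floating-point rounding, which is the same kind of check commonly used to validate hand-derived gradients.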

---

## 🔹 Common Derivatives You Must Know

| Function                  | Derivative                          |
|----------------------------|-------------------------------------|
| $ x^n $                    | $ nx^{n-1} $                       |
| $ e^x $                    | $ e^x $                            |
| $ \ln(x) $                 | $ \frac{1}{x} $                    |
| $ \sin(x) $                | $ \cos(x) $                        |
| $ \cos(x) $                | $ -\sin(x) $                       |
| $ \tanh(x) $               | $ 1 - \tanh^2(x) $                 |
| $ \sigma(x) = \frac{1}{1+e^{-x}} $ (Sigmoid) | $ \sigma(x)(1-\sigma(x)) $ |

---

## 🔹 Chain Rule (Single-variable)

If:

$$
y = f(g(x))
$$

then:

$$
\frac{dy}{dx} = \frac{dy}{dg} \times \frac{dg}{dx}
$$

**Intuition:**  
You chain together how changes in $ x $ affect $ g(x) $, then how $ g(x) $ affects $ y $.
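
As a small illustration, here is a sketch of the chain rule for $y = \sin(x^2)$. The choice of $f = \sin$ and $g(x) = x^2$ is an assumption made only for this example.

```python
import math

def g(x):
    return x ** 2          # inner function

def f(u):
    return math.sin(u)     # outer function

def chain_rule_derivative(x):
    dy_dg = math.cos(g(x))  # derivative of the outer function, evaluated at g(x)
    dg_dx = 2 * x           # derivative of the inner function
    return dy_dg * dg_dx

# Finite-difference check at an arbitrary point
x0, h = 0.7, 1e-6
numeric = (f(g(x0 + h)) - f(g(x0 - h))) / (2 * h)
```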

---

## 🔹 Chain Rule (Multivariable)

Suppose:

$$
z = f(x, y), \quad x = g(t), \quad y = h(t)
$$

then:

$$
\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}
$$

- Multivariable chain rule **adds up contributions** from each path.
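
A sketch of the two-path sum, assuming the illustrative choice $z = x^2 y$ with $x = \cos t$ and $y = \sin t$ (these functions are not from the text above):

```python
import math

def z_of_t(t):
    x, y = math.cos(t), math.sin(t)
    return x ** 2 * y

def dz_dt(t):
    x, y = math.cos(t), math.sin(t)
    dz_dx, dz_dy = 2 * x * y, x ** 2          # partial derivatives of z
    dx_dt, dy_dt = -math.sin(t), math.cos(t)  # derivatives of each path
    return dz_dx * dx_dt + dz_dy * dy_dt      # sum over both paths

t0, h = 0.4, 1e-6
numeric = (z_of_t(t0 + h) - z_of_t(t0 - h)) / (2 * h)
```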

---

## 🔹 Product Rule

If:

$$
f(x) = u(x) v(x)
$$

then:

$$
f'(x) = u'(x) v(x) + u(x) v'(x)
$$

---

## 🔹 Quotient Rule

If:

$$
f(x) = \frac{u(x)}{v(x)}
$$

then:

$$
f'(x) = \frac{u'(x) v(x) - u(x) v'(x)}{v(x)^2}
$$
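
Both the product and quotient rules can be sanity-checked numerically. This sketch assumes $u(x) = x^3$ and $v(x) = \sin x$ purely as examples, evaluated where $v(x) \neq 0$.

```python
import math

def u(x): return x ** 3
def du(x): return 3 * x ** 2
def v(x): return math.sin(x)
def dv(x): return math.cos(x)

def product_rule(x):
    return du(x) * v(x) + u(x) * dv(x)

def quotient_rule(x):
    return (du(x) * v(x) - u(x) * dv(x)) / v(x) ** 2

def numeric_derivative(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 1.2
product_numeric = numeric_derivative(lambda x: u(x) * v(x), x0)
quotient_numeric = numeric_derivative(lambda x: u(x) / v(x), x0)
```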

---

## 🧠 Key Insight:

- **Chain Rule** glues functions together.
- **Product/Quotient Rules** handle multiplying/dividing functions.
- **These tools are the heart of backpropagation.**


# 📚 2. Multivariable Calculus (Partial Derivatives, Gradients, Jacobians)

---

## 🔹 Partial Derivatives

- When a function depends on multiple variables, we take the **partial derivative** with respect to one variable at a time, treating the others as constants.

**Example:**

If:

$$
f(x, y) = x^2 + 3xy + y^2
$$

then:

$$
\frac{\partial f}{\partial x} = 2x + 3y
\quad\quad
\frac{\partial f}{\partial y} = 3x + 2y
$$

---

## 🔹 Gradient Vector

- The **gradient** is a vector of all partial derivatives.
- Points in the direction of steepest increase.

**Definition:**

$$
\nabla f(x, y) =
\left[
\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}
\right]
$$
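
A minimal sketch of the gradient for the earlier example $f(x, y) = x^2 + 3xy + y^2$, compared against finite differences. The helper `numeric_grad` is a hypothetical name introduced for illustration.

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2

def grad_f(x, y):
    # [df/dx, df/dy] from the partial derivatives derived above
    return (2 * x + 3 * y, 3 * x + 2 * y)

def numeric_grad(func, x, y, h=1e-6):
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)  # hold y constant
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)  # hold x constant
    return (df_dx, df_dy)

analytic = grad_f(1.0, 2.0)           # (8.0, 7.0)
numeric = numeric_grad(f, 1.0, 2.0)
```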

---

## 🔹 Jacobian Matrix

- Generalizes gradients when output is a vector (multi-output functions).

Suppose:

$$
\mathbf{y} =
\begin{bmatrix}
y_1(x_1, x_2, ..., x_n) \\
y_2(x_1, x_2, ..., x_n) \\
\vdots \\
y_m(x_1, x_2, ..., x_n)
\end{bmatrix}
$$

then the **Jacobian** is:

$$
J_{ij} = \frac{\partial y_i}{\partial x_j}
$$

- Rows = output components
- Columns = input components
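
To make the row/column layout concrete, here is a sketch using NumPy with a made-up function from $\mathbb{R}^2$ to $\mathbb{R}^3$. The function and the helper `numeric_jacobian` are assumptions for illustration only.

```python
import numpy as np

def func(x):
    x1, x2 = x
    return np.array([x1 * x2, x1 + x2 ** 2, np.sin(x1)])

def analytic_jacobian(x):
    x1, x2 = x
    # Row i holds the partial derivatives of output y_i
    return np.array([[x2, x1],
                     [1.0, 2 * x2],
                     [np.cos(x1), 0.0]])

def numeric_jacobian(f, x, h=1e-6):
    x = np.asarray(x, dtype=float)
    J = np.zeros((f(x).shape[0], x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h                      # perturb only input x_j
        J[:, j] = (f(x + step) - f(x - step)) / (2 * h)
    return J

x0 = np.array([0.5, -1.5])
J = analytic_jacobian(x0)                # shape (3, 2): 3 outputs, 2 inputs
```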

---

## 🔹 Why Jacobians Matter in Deep Learning

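- **Backpropagation** across layers multiplies the upstream gradient by each layer's **Jacobian matrix** (a vector-Jacobian product).
- This matters most for operations like **softmax**, where each output depends on all inputs, so the Jacobian is dense rather than diagonal.
- Chaining these products layer by layer gives the gradient of the loss through a deep composition without differentiating the whole network at once.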

---

## 🧠 Key Intuition:

- **Partial derivatives**: "How sensitive is output to one input?"
- **Gradient**: "Which way should I move to increase output fastest?"
- **Jacobian**: "How do multiple outputs change with multiple inputs?"


## 🔢 Deep Learning Math Essentials: Exponentials, Logarithms, and Their Role in Loss Functions

---

### 🔹 What is $e$?

- $e \approx 2.718$ is Euler's number.
- Defined by:

$$
e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^n
$$

- Unique property:

$$
\frac{d}{dx} e^x = e^x
$$

Used heavily in deep learning because $e^x$ is smooth, always positive, and its own derivative, which keeps gradient formulas simple.

---

### 🔹 Exponential Function: $e^x$

- Always positive: $e^x > 0$
- Grows rapidly as $x \to \infty$
- Flattens near zero as $x \to -\infty$
- Derivative:

$$
\frac{d}{dx} e^x = e^x
$$

- Operations:
  - $e^{a + b} = e^a \cdot e^b$
  - $e^{a - b} = \frac{e^a}{e^b}$

---

### 🔹 Logarithmic Function: $\log(x)$

- Inverse of $e^x$ (throughout these notes $\log$ means the natural logarithm, written $\ln$ in the derivative table)
- Defined as:

$$
\log(x) = y \iff e^y = x
$$

- Only defined for $x > 0$
- Grows slowly, explodes negatively as $x \to 0^+$
- Derivative:

$$
\frac{d}{dx} \log(x) = \frac{1}{x}
$$

- Operation rules:
  - $\log(ab) = \log a + \log b$
  - $\log\left(\frac{a}{b}\right) = \log a - \log b$
  - $\log(a^b) = b \log a$
  - $\log(e^x) = x$
  - $e^{\log(x)} = x$

---

### 🔹 Inverse Identity

Exponentials and logs undo each other (the second identity requires $x > 0$):

$$
\log(e^x) = x \quad \text{and} \quad e^{\log(x)} = x
$$

---

### 🔹 Why Use $\log$ and $e^x$ in Deep Learning?

| Purpose                            | Example                                           | Why                         |
|-----------------------------------|---------------------------------------------------|------------------------------|
| Convert scores to probabilities   | $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | $e^x$ exaggerates differences |
| Stabilize products                | $\log(p_1 \cdot p_2) = \log p_1 + \log p_2$       | Avoids underflow             |
| Gradient-based optimization       | $\frac{d}{dx} e^x,\ \frac{d}{dx} \log(x)$         | Smooth derivatives           |

---

![Alt Text](log_e_viz.png)


### 🔹 Binary Cross-Entropy (BCE)

Used in binary classification tasks:

$$
\text{BCE}(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]
$$

Where:
- $y \in \{0, 1\}$ is the true label
- $\hat{y} \in (0, 1)$ is the predicted probability

Special cases:
- If $y = 1$: $\text{Loss} = -\log(\hat{y})$
- If $y = 0$: $\text{Loss} = -\log(1 - \hat{y})$

Gradients:
- If $y = 1$: $\frac{dL}{d\hat{y}} = -\frac{1}{\hat{y}}$
- If $y = 0$: $\frac{dL}{d\hat{y}} = \frac{1}{1 - \hat{y}}$
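
A minimal sketch of BCE and its gradient with respect to $\hat{y}$, written in plain Python. It assumes $\hat{y}$ is strictly inside $(0, 1)$; real implementations typically clamp or work from logits to avoid `log(0)`.

```python
import math

def bce(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_grad(y_hat, y):
    """dL/dy_hat: reduces to -1/y_hat when y = 1 and 1/(1 - y_hat) when y = 0."""
    return -y / y_hat + (1 - y) / (1 - y_hat)

confident_right = bce(0.9, 1)   # small loss: prediction agrees with label
confident_wrong = bce(0.9, 0)   # large loss: prediction contradicts label
```

The loss grows without bound as a confident prediction moves toward the wrong label, which is what drives the large corrective gradients.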

---

### 🔹 Softmax Function

Used to convert logits to a probability distribution:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

Properties:
- $\hat{y}_i \in (0, 1)$
- $\sum_i \hat{y}_i = 1$
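
A direct translation of the formula into plain Python, as a sketch. It deliberately omits any stabilization, so it is only suitable for moderate logits; the input values are arbitrary.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]   # exponentiate each logit
    total = sum(exps)                 # shared normalizer
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

The outputs preserve the ordering of the logits and sum to 1.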

---

### 🔹 Cross-Entropy Loss (Multiclass)

With one-hot true labels:

$$
\text{CE}(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)
$$

If class $k$ is the true class, then $y_k = 1$ and every other $y_i = 0$, so the sum collapses to one term:

$$
\text{Loss} = -\log(\hat{y}_k)
$$

---

### 🔹 Log-Softmax Trick

Instead of computing:

$$
\log(\text{softmax}(z_i)) = \log\left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
$$

Use:

$$
\log(\text{softmax}(z_i)) = z_i - \log\left(\sum_j e^{z_j}\right)
$$

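This form never divides two exponentials, and once the maximum logit is subtracted inside the log-sum-exp it is numerically stable. Combined with the cross-entropy loss $L$, it also gives a clean gradient: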
$$
\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i
$$

---

### 🔹 KL Divergence

Measures how much one distribution $Q$ diverges from a true distribution $P$:

$$
D_{\text{KL}}(P \parallel Q) = \sum_i P(i) \log\left(\frac{P(i)}{Q(i)}\right)
$$

Can be rewritten as:

$$
D_{\text{KL}}(P \parallel Q) = H(P, Q) - H(P)
$$

- $H(P, Q)$ = cross-entropy
- $H(P)$ = entropy of true distribution

---

### 🧠 Final Summary

| Concept             | Formula                                                        | Use Case                     |
|---------------------|----------------------------------------------------------------|------------------------------|
| Exponential         | $e^x$                                                          | Amplify scores (softmax)     |
| Logarithm           | $\log(x)$                                                      | Stabilize, inverse of $e^x$  |
| BCE                 | $- [y \log(\hat{y}) + (1-y) \log(1 - \hat{y})]$                | Binary classification        |
| Softmax             | $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$                  | Convert logits to probs      |
| Cross-Entropy       | $-\sum_i y_i \log(\hat{y}_i)$                                   | Multiclass classification    |
| Log-Softmax         | $z_i - \log \sum_j e^{z_j}$                                     | Numerically stable log-probs |
| KL Divergence       | $\sum_i P(i) \log \frac{P(i)}{Q(i)}$                            | Measure distribution mismatch|



# 📚 3. Activation Functions and Their Derivatives

---

## 🔹 Sigmoid Activation

**Formula:**

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

**Derivative:**

$$
\sigma'(x) = \sigma(x) (1 - \sigma(x))
$$

---

## 🔹 Tanh Activation

**Formula:**

$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$

**Derivative:**

$$
\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x)
$$
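
Both derivative formulas reuse the forward value, which is why they are cheap in backpropagation. A sketch in plain Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)             # reuses the forward output

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2   # reuses the forward output

def numeric_derivative(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2 * h)
```

At $x = 0$ the sigmoid derivative reaches its maximum of 0.25 and the tanh derivative reaches its maximum of 1; both shrink toward 0 for large $|x|$, which is the saturation behind vanishing gradients.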

---

![activation_sigmoid_tanh](activations.png)

## 🔹 ReLU (Rectified Linear Unit)

**Formula:**

$$
\text{ReLU}(x) = \max(0, x)
$$

**Derivative:**

$$
\text{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

**Intuition:**
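- Keeps positive values, zeros out negatives.
- The gradient is exactly 1 for $x > 0$, so it does not shrink as it passes through active units.
- The derivative is undefined at $x = 0$; the convention used here sets it to 0.
- Issue: a unit whose pre-activation is $\leq 0$ for every input receives zero gradient and stops learning (a "dead neuron").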

![relu](relu_viz.png)

---

## 🔹 Leaky ReLU

**Formula:**

$$
\text{LeakyReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
\quad \text{where typically } \alpha \in [0.01, 0.1]
$$

**Derivative:**

$$
\text{LeakyReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
\alpha & \text{if } x \leq 0
\end{cases}
$$
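
A sketch of ReLU and Leaky ReLU with their derivatives for scalar inputs. The default `alpha=0.01` is an assumed value from the range given above.

```python
def relu(x):
    return max(0.0, x)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0      # zero gradient for x <= 0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_prime(x, alpha=0.01):
    return 1.0 if x > 0 else alpha    # small but nonzero gradient for x <= 0
```

For a negative input such as `-2.0`, `relu_prime` returns 0 while `leaky_relu_prime` returns `alpha`, so some gradient signal still flows.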

**Fix:**
- Allows a small gradient $\alpha$ for $x \leq 0$, preventing dead neurons.

![leaky_relu](leaky_relu.png)

---

## 🔹 GELU (Gaussian Error Linear Unit)

**Exact Formula:**

$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

where:

$$
\Phi(x) = \frac{1}{2} \left[ 1 + \text{erf}\left( \frac{x}{\sqrt{2}} \right) \right]
$$

**Derivative (using Product Rule):**

$$
\frac{d}{dx} \text{GELU}(x) = \Phi(x) + x \cdot \phi(x)
$$

where:

$$
\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
$$

---

**Approximate Formula (used in practice):**

$$
\text{GELU}(x) \approx 0.5x \left[ 1 + \tanh\left( \sqrt{\frac{2}{\pi}} (x + 0.044715x^3) \right) \right]
$$
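
To see how close the approximation is, here is a sketch comparing the exact and tanh forms using `math.erf` from the standard library. The evaluation grid from $-5$ to $5$ is an arbitrary choice.

```python
import math

def gelu_exact(x):
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard normal CDF
    return x * Phi

def gelu_tanh(x):
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_exact_prime(x):
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return Phi + x * phi                                     # product rule

# Largest absolute gap between the two forms on the grid
max_gap = max(abs(gelu_exact(i / 10) - gelu_tanh(i / 10)) for i in range(-50, 51))
```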

**Notes:**
- Smooth version of ReLU.
- Default activation in Transformers like **BERT** and **GPT**.
- Encourages smoother gradient flows for better optimization.

![gelu](gelu_viz.png)

---

## 🧠 Key Intuition:

- Activations introduce **non-linearities** → allow deep networks to approximate complex functions.
- **ReLU variants** (Leaky ReLU, GELU) address dead neuron issues or smoothness needs.


# 📚 4. Softmax and Cross-Entropy: Full Gradient Derivations

---

## 🔹 Softmax Function

Given a logits vector $\mathbf{z}$, softmax outputs probabilities:

$$
\hat{y}_i = \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

Properties:
- $0 < \hat{y}_i < 1$
- $\sum_i \hat{y}_i = 1$


---

## 🔹 Cross-Entropy Loss

Given:
- True labels $\mathbf{y}$ (one-hot vector)
- Predicted probabilities $\hat{\mathbf{y}}$

The cross-entropy loss is:

$$
L = -\sum_i y_i \log(\hat{y}_i)
$$

Since exactly one $y_i$ equals 1 in a one-hot vector:

$$
L = -\log(\hat{y}_{\text{true class}})
$$

---

# 🚀 Full Derivation: Gradient of Softmax w.r.t. Logits

We want:

$$
\frac{\partial \hat{y}_i}{\partial z_j}
$$

Expand:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_k e^{z_k}}
$$

Apply the quotient rule, treating the cases $i = j$ and $i \neq j$ separately:

---

### 🔸 Case 1: $i = j$

When differentiating $\hat{y}_i$ w.r.t. $z_i$:

$$
\frac{\partial \hat{y}_i}{\partial z_i} =
\frac{e^{z_i} (\sum_k e^{z_k}) - e^{z_i} e^{z_i}}{(\sum_k e^{z_k})^2}
= \hat{y}_i (1 - \hat{y}_i)
$$

---

### 🔸 Case 2: $i \neq j$

When differentiating $\hat{y}_i$ w.r.t. a different $z_j$:

$$
\frac{\partial \hat{y}_i}{\partial z_j} =
\frac{0 \cdot (\sum_k e^{z_k}) - e^{z_i} e^{z_j}}{(\sum_k e^{z_k})^2}
= -\hat{y}_i \hat{y}_j
$$

---

# 🚀 Now: Derivative of Loss w.r.t. Logits $\mathbf{z}$

We derive this in two ways:

---

## 🔹 Method 1: Using Chain Rule and Gradient of Softmax

By chain rule:

$$
\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j}
$$

Where:

$$
\frac{\partial L}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}
$$

Thus:

$$
\frac{\partial L}{\partial z_j} = -\sum_i \frac{y_i}{\hat{y}_i} \frac{\partial \hat{y}_i}{\partial z_j}
$$

Substituting the softmax gradients:

- For $i = j$:

$$
\frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i (1 - \hat{y}_i)
$$

- For $i \neq j$:

$$
\frac{\partial \hat{y}_i}{\partial z_j} = -\hat{y}_i \hat{y}_j
$$

Splitting the sum into the $i = j$ term and the $i \neq j$ terms:

$$
\frac{\partial L}{\partial z_j} = -\left( \frac{y_j}{\hat{y}_j} \hat{y}_j (1-\hat{y}_j) + \sum_{i \neq j} \frac{y_i}{\hat{y}_i} (-\hat{y}_i \hat{y}_j) \right)
$$

Simplify:

First term:

$$
- y_j (1-\hat{y}_j)
$$

Second term:

$$
+ \hat{y}_j \sum_{i \neq j} y_i
$$

Since:

$$
\sum_i y_i = 1
\quad \Rightarrow \quad
\sum_{i \neq j} y_i = 1 - y_j
$$

Thus:

$$
\frac{\partial L}{\partial z_j} = -y_j (1-\hat{y}_j) + \hat{y}_j (1-y_j)
$$

Expand:

$$
= -y_j + y_j \hat{y}_j + \hat{y}_j - \hat{y}_j y_j
$$

Simplify:

$$
= \hat{y}_j - y_j
$$

✅ Final result.

---

![backprop]( cross_enropy_gradient_wrt_z_via_traditional_softmax_dervitatives_pt_1.png)
![backprop](cross_enropy_gradient_wrt_z_via_softmax_dervitatives_pt_2.png)

## 🔹 Method 2: Using Log-Sum-Exp Trick Directly

Expand cross-entropy loss:

First:

$$
L = -\sum_i y_i \log(\hat{y}_i)
$$

Substitute $\hat{y}_i = \frac{e^{z_i}}{\sum_k e^{z_k}}$:

$$
L = -\sum_i y_i \left(z_i - \log\sum_k e^{z_k}\right)
$$

Expand, using $\sum_i y_i = 1$ so that the log term factors out of the sum:

$$
L = -\sum_i y_i z_i + \log\sum_k e^{z_k}
$$

Now differentiate w.r.t. $z_j$:

First term:

$$
\frac{\partial}{\partial z_j} \left( -\sum_i y_i z_i \right) = -y_j
$$

Second term:

$$
\frac{\partial}{\partial z_j} \log\sum_k e^{z_k} = \frac{e^{z_j}}{\sum_k e^{z_k}} = \hat{y}_j
$$

Thus:

$$
\frac{\partial L}{\partial z_j} = \hat{y}_j - y_j
$$

✅ Same clean result.

---

# 🎯 Final Gradient Summary

Whether you differentiate via:

- Chain rule using softmax gradient
- Log-sum-exp trick

**You always get:**

$$
\frac{\partial L}{\partial z_j} = \hat{y}_j - y_j
$$
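
The result $\hat{y}_j - y_j$ can be confirmed with a finite-difference gradient check. This is a sketch in plain Python; the logits and the one-hot label are arbitrary example values, and the softmax here subtracts the maximum logit for stability.

```python
import math

def softmax(z):
    m = max(z)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(z)))

def analytic_grad(z, y):
    return [pi - yi for pi, yi in zip(softmax(z), y)]   # y_hat - y

def numeric_grad(z, y, h=1e-6):
    grad = []
    for j in range(len(z)):
        z_plus = z[:j] + [z[j] + h] + z[j + 1:]
        z_minus = z[:j] + [z[j] - h] + z[j + 1:]
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
    return grad

z = [2.0, -1.0, 0.5]
y = [0.0, 1.0, 0.0]
```

The gradient components sum to zero because both $\hat{\mathbf{y}}$ and $\mathbf{y}$ sum to 1.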

---

# 🧠 Key Takeaways

- Softmax derivatives differ depending on whether $i = j$ or $i \neq j$.
- Cross-entropy loss with softmax makes backpropagation efficient.
- Deep learning libraries (e.g., PyTorch, TensorFlow) fuse softmax + cross-entropy for numerical stability.


## 🔁 Final Result — Jacobian of the Softmax Function

---

### 🧠 What Are We Doing?

We want to compute the **Jacobian matrix** of the softmax function:

Given logits:

$$
\mathbf{z} = [z_1, z_2, \dots, z_n]
$$

The softmax output is:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_k e^{z_k}} = \frac{e^{z_i}}{S}, \quad \text{where } S = \sum_k e^{z_k}
$$

Our goal:  
Compute the partial derivatives:

$$
J_{ij} = \frac{\partial \hat{y}_i}{\partial z_j}
$$

This tells us **how each softmax output** $\hat{y}_i$ changes if we **nudge the input logit** $z_j$.

---

### 🔧 Step-by-Step Derivation

---

#### ✅ Case 1: $i = j$

We're looking at:

$$
\frac{\partial \hat{y}_i}{\partial z_i}
$$

Using the quotient rule:

Let:
- $u = e^{z_i}$
- $v = S = \sum_k e^{z_k}$
- $u' = e^{z_i}$
- $v' = e^{z_i}$ (since only one term in the sum depends on $z_i$)

Then:

$$
\frac{\partial \hat{y}_i}{\partial z_i}
= \frac{u' \cdot v - u \cdot v'}{v^2}
= \frac{e^{z_i} \cdot S - e^{z_i} \cdot e^{z_i}}{S^2}
$$

Factor:

$$
= \frac{e^{z_i}}{S} \left(1 - \frac{e^{z_i}}{S} \right)
= \hat{y}_i (1 - \hat{y}_i)
$$

✅ This is the **diagonal of the Jacobian**

---

#### ✅ Case 2: $i \ne j$

Now $z_j$ doesn’t affect the numerator $e^{z_i}$, but **does** affect the denominator:

Let:
- $u = e^{z_i}$ → constant
- $v = S = \sum_k e^{z_k}$
- $u' = 0$
- $v' = e^{z_j}$

Then:

$$
\frac{\partial \hat{y}_i}{\partial z_j}
= \frac{0 \cdot S - e^{z_i} \cdot e^{z_j}}{S^2}
= - \frac{e^{z_i} e^{z_j}}{S^2}
= -\hat{y}_i \hat{y}_j
$$

✅ These are the **off-diagonal entries**

---

### ✅ Final Formula (Unified for All $i, j$):

$$
\boxed{
\frac{\partial \hat{y}_i}{\partial z_j} = \hat{y}_i \left( \delta_{ij} - \hat{y}_j \right)
}
$$

Where:
- $\delta_{ij} = 1$ if $i = j$, else $0$  
  (this is the **Kronecker delta** — a switch that says: “are we on the diagonal?”)

---

### 🔁 Matrix Form:

Let $\hat{\mathbf{y}} \in \mathbb{R}^n$ be the softmax output vector. Then:

- Diagonal matrix:

$$
\text{diag}(\hat{\mathbf{y}}) =
\begin{bmatrix}
\hat{y}_1 & 0 & \dots & 0 \\
0 & \hat{y}_2 & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & \hat{y}_n
\end{bmatrix}
$$

- Outer product:

$$
\hat{\mathbf{y}} \hat{\mathbf{y}}^\top =
\begin{bmatrix}
\hat{y}_1^2 & \hat{y}_1 \hat{y}_2 & \dots \\
\hat{y}_2 \hat{y}_1 & \hat{y}_2^2 & \dots \\
\vdots & \vdots & \ddots
\end{bmatrix}
$$

So the **Jacobian of softmax is:**

$$
\boxed{
J = \text{diag}(\hat{\mathbf{y}}) - \hat{\mathbf{y}} \hat{\mathbf{y}}^\top
}
$$
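
The matrix form maps directly onto NumPy. The following is a sketch with arbitrary example logits; `softmax_jacobian` is a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))     # shift by the max for stability
    return exps / exps.sum()

def softmax_jacobian(z):
    y_hat = softmax(z)
    # diag(y_hat) - y_hat y_hat^T, so J[i, j] = y_hat[i] * (delta_ij - y_hat[j])
    return np.diag(y_hat) - np.outer(y_hat, y_hat)

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)
```

`J` is symmetric, and each row and column sums to zero because the outputs always sum to 1: any probability gained by one class is lost by the others.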

---

### 🔍 What It Really Means:

- **Diagonal entries**: $\hat{y}_i (1 - \hat{y}_i)$  
  → If I increase my own logit $z_i$, my probability goes up (but never past 1)

- **Off-diagonal entries**: $-\hat{y}_i \hat{y}_j$  
  → If another logit $z_j$ increases, it steals probability mass from class $i$

---

### 💡 TL;DR

> **Softmax Jacobian** =  
> *"I get a diagonal boost if I nudge my own logit...  
> but I lose ground if any other logit rises."*

That’s how softmax redistributes probability mass — and why it’s perfect for multiclass classification.



# 📚 5. Full Chain Rule: Backpropagation through Composite Layers

---

## 🔹 Setup: A Simple 3-Layer Neural Network

Suppose we have:

1. **Input** $x$
2. **Hidden layer 1 output**:

$$
h_1 = f_1(W_1 x + b_1)
$$

3. **Hidden layer 2 output**:

$$
h_2 = f_2(W_2 h_1 + b_2)
$$

4. **Final output (logits)**:

$$
z = W_3 h_2 + b_3
$$

5. **Predictions** (after softmax):

$$
\hat{y} = \text{softmax}(z)
$$

6. **Loss**:

$$
L = \text{CrossEntropy}(\hat{y}, y)
$$

---

## 🔹 Goal

We want to compute:

$$
\frac{\partial L}{\partial W_1}, \quad \frac{\partial L}{\partial W_2}, \quad \frac{\partial L}{\partial W_3}
$$

using **backpropagation**.

---

## 🔹 Step-by-Step Chain of Gradients

1. From **loss to logits**:

Already derived earlier:

$$
\frac{\partial L}{\partial z} = \hat{y} - y
$$

---

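2. From **logits to second hidden layer**:

Treating gradients as column vectors:

$$
\frac{\partial L}{\partial h_2} = W_3^T \frac{\partial L}{\partial z},
\qquad
\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial z} \, h_2^T
$$

because:

$$
z = W_3 h_2 + b_3
$$

is linear in both $h_2$ and $W_3$: the Jacobian of $z$ with respect to $h_2$ is $W_3$, so the gradient is pulled back through its transpose.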

---

3. From **second hidden layer to first hidden layer**:

Apply the chain rule through activation $f_2$:

$$
\frac{\partial L}{\partial (W_2 h_1 + b_2)} = \frac{\partial L}{\partial h_2} \circ f_2'(W_2 h_1 + b_2)
$$

where $\circ$ denotes elementwise multiplication (Hadamard product).

Then:

$$
\frac{\partial L}{\partial W_2} = \left( \frac{\partial L}{\partial (W_2 h_1 + b_2)} \right) h_1^T
$$

because:

$$
W_2 h_1 + b_2
$$

is a linear transformation.

---

4. From **first hidden layer back to input**:

Similarly:

$$
\frac{\partial L}{\partial h_1} = W_2^T \left( \frac{\partial L}{\partial (W_2 h_1 + b_2)} \right)
$$

Then apply the chain rule through $f_1$:

$$
\frac{\partial L}{\partial (W_1 x + b_1)} = \frac{\partial L}{\partial h_1} \circ f_1'(W_1 x + b_1)
$$

Thus:

$$
\frac{\partial L}{\partial W_1} = \left( \frac{\partial L}{\partial (W_1 x + b_1)} \right) x^T
$$
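
The whole chain can be sketched in NumPy. This example assumes $f_1 = f_2 = \tanh$, made-up layer sizes, random weights, and gradients stored as column vectors (so each weight transpose multiplies from the left); none of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input
y = np.array([0.0, 1.0, 0.0])               # one-hot target
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 5)), np.zeros(5)
W3, b3 = rng.normal(size=(3, 5)), np.zeros(3)

def forward(W1, W2, W3):
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    z = W3 @ h2 + b3
    z = z - z.max()                          # stabilize before exponentiating
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax
    loss = -(y * log_probs).sum()            # cross-entropy
    return loss, (h1, h2, np.exp(log_probs))

def backward(W1, W2, W3):
    _, (h1, h2, y_hat) = forward(W1, W2, W3)
    dz = y_hat - y                           # dL/dz
    dW3 = np.outer(dz, h2)                   # dL/dW3 = dz h2^T
    dh2 = W3.T @ dz                          # back through the linear layer
    da2 = dh2 * (1 - h2 ** 2)                # tanh'(a) = 1 - tanh(a)^2
    dW2 = np.outer(da2, h1)
    dh1 = W2.T @ da2
    da1 = dh1 * (1 - h1 ** 2)
    dW1 = np.outer(da1, x)
    return dW1, dW2, dW3

dW1, dW2, dW3 = backward(W1, W2, W3)
```

Each gradient has the same shape as the weight matrix it belongs to, which is a quick way to catch a misplaced transpose.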

---

## 🔹 Visual Flow of Backpropagation



![backprop](back_prop_computation_graph.png)

At each step:
- Multiply by the weight matrix transpose
- Apply elementwise derivative of activation function
- Keep chaining backward

---

## 🧠 Key Intuitions:

- **Each layer's gradient** depends on the **gradients from layers after it**.
- **Matrix transposes** appear naturally from how linear layers work.
- **Elementwise multiplications** appear naturally from activations like ReLU, tanh, sigmoid.
- **Backpropagation is just repeated chain rule** — no magic, only bookkeeping!

---

# 🎯 Summary of Full Backprop Chain

| Gradient | Formula |
|----------|---------|
| $ \frac{\partial L}{\partial z} $ | $ \hat{y} - y $ |
| $ \frac{\partial L}{\partial W_3} $ | $ (\hat{y} - y) h_2^T $ |
| $ \frac{\partial L}{\partial h_2} $ | $ W_3^T (\hat{y} - y) $ |
| $ \frac{\partial L}{\partial W_2} $ | $ \left[ (\partial L / \partial h_2) \circ f_2'(W_2 h_1 + b_2) \right] h_1^T $ |
| $ \frac{\partial L}{\partial h_1} $ | $ W_2^T \left[ (\partial L / \partial h_2) \circ f_2'(W_2 h_1 + b_2) \right] $ |
| $ \frac{\partial L}{\partial W_1} $ | $ \left[ (\partial L / \partial h_1) \circ f_1'(W_1 x + b_1) \right] x^T $ |


## 🔐 Log-Softmax and the Log-Sum-Exp Trick

---

### 🔹 The Problem

When computing:

$$
\log(\text{softmax}(z_i)) = \log\left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
= z_i - \log\left( \sum_j e^{z_j} \right)
$$

You're at risk of:

- **Overflow** if any $z_j$ is large (e.g. $e^{1000}$)
- **Underflow** if softmax returns very small values (log of near-zero → $-\infty$)
- **NaNs** in training and gradient instability

---

### 🔹 The Log-Sum-Exp Trick (LSE)

To stabilize:

$$
\log \sum_j e^{z_j}
$$

We apply:

$$
\log \sum_j e^{z_j} = \max_j z_j + \log \sum_j e^{z_j - \max_j z_j}
$$

This avoids numerical overflow by subtracting the largest logit before exponentiation.
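
A sketch of the trick in plain Python. The function names are illustrative, and the logits are chosen to be large enough that a naive `math.exp(1000.0)` would raise `OverflowError`.

```python
import math

def logsumexp(z):
    m = max(z)                                   # largest logit
    # After the shift every exponent is <= 0, so exp() cannot overflow
    return m + math.log(sum(math.exp(v - m) for v in z))

def log_softmax(z):
    lse = logsumexp(z)
    return [v - lse for v in z]

big = [1000.0, 999.0, 998.0]
stable = log_softmax(big)                        # finite log-probabilities
```

Exponentiating `stable` recovers probabilities that sum to 1, even though the raw logits could not be exponentiated directly.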

---

### ✅ Used in Practice: `log_softmax`

Instead of doing:

```python
log_probs = torch.log(torch.softmax(logits, dim=-1))
```

This is unsafe: `softmax` can underflow to exactly 0 for very negative logits, and `log(0)` then returns `-inf`.

Use:

```python
log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
```

This is:

- **Numerically stable** (uses log-sum-exp trick internally)
- **Efficient** (avoids extra computation)
- **Safe for backprop** (avoids the `-inf` and NaN values that `log(0)` would introduce)

---

### 🔹 Log-Softmax Identity

The math behind `log_softmax`:

$$
\log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j}
$$

With the LSE trick applied:

$$
= z_i - \left[ \max_j z_j + \log \sum_j e^{z_j - \max_j z_j} \right]
$$

This ensures the entire operation is stable, even if logits are huge.

---

### ❌ Why You Shouldn’t Do `log(softmax(...))`

| Issue                        | Result                        |
|-----------------------------|-------------------------------|
| $e^x$ on large logits        | Overflow / `inf`              |
| Dividing large exponentials | Loss of precision             |
| Taking $\log(0)$             | `-inf`                        |
| Manual implementation       | No use of LSE trick           |

---

### ✅ `log_softmax` Summary

| Property                  | Description                                      |
|---------------------------|--------------------------------------------------|
| Input                     | Raw logits (unbounded real numbers)             |
| Output                    | Log-probabilities (each $\leq 0$; exponentials sum to 1) |
| Implementation            | Uses log-sum-exp trick internally               |
| Use case                  | Preferred for NLL, CE, KL, language modeling    |
| Gradient                  | Clean: $\nabla_{z_i} L = \hat{y}_i - y_i$       |
| Alternative to            | `log(softmax(z))` (don’t do this)              |

---

### 🧠 Final Mental Model

- `softmax` → converts scores to probs (risk of overflow)
- `log(softmax(...))` → unstable unless you apply LSE manually
- `log_softmax` → **optimized fusion** with **log-sum-exp built in**

> **Moral of the story:**  
> 💯 Always use `log_softmax` when working with log-probabilities from logits.



## 🔥 KL Divergence in Deep Learning — Full Intuition + Math

---

### 🔹 What is KL Divergence?

KL divergence measures **how different** a predicted probability distribution $Q$ is from a reference (true) distribution $P$:

$$
D_{\text{KL}}(P \parallel Q) = \sum_i P(i) \log \left( \frac{P(i)}{Q(i)} \right)
$$

> Read as: “KL of $P$ relative to $Q$”

- It is **not symmetric**:
  $$
  D_{\text{KL}}(P \parallel Q) \ne D_{\text{KL}}(Q \parallel P)
  $$

---

### 🔹 Log Rule Breakdown

We can break the KL formula using log rules:

$$
\log\left(\frac{P(i)}{Q(i)}\right) = \log P(i) - \log Q(i)
$$

So:

$$
D_{\text{KL}}(P \parallel Q) = \sum_i P(i) [\log P(i) - \log Q(i)] = H(P, Q) - H(P)
$$

Where:
- $H(P, Q)$ is the **cross-entropy**
- $H(P)$ is the **entropy** of the true distribution

---

### 🔹 When $P$ is One-Hot (Normal Classification)

Let:
$$
P = [0, 1, 0] \quad \text{(true class is index 1)}
$$

Then:

- Only $i = 1$ contributes: $P(1) \log P(1) = 1 \cdot \log 1 = 0$, and the other terms vanish under the convention $0 \log 0 = 0$
- So:
  $$
  H(P) = -\sum_i P(i) \log P(i) = 0
  $$

And:

$$
H(P, Q) = -\sum_i P(i) \log Q(i) = -\log Q(1)
$$

Therefore:

$$
D_{\text{KL}}(P \parallel Q) = -\log Q(\text{true class})
$$

✅ KL = Cross-Entropy

---

### 🔹 When $P$ is NOT One-Hot (Soft Targets)

Let:

$$
P = [0.7,\ 0.2,\ 0.1], \quad Q = [0.6,\ 0.3,\ 0.1]
$$

Then:

- $H(P, Q) = -\sum_i P(i) \log Q(i)$
- $H(P) = -\sum_i P(i) \log P(i)$
- $D_{\text{KL}} = H(P, Q) - H(P)$

Now KL $\ne$ CE. The **difference reflects how much Q diverges from the full shape of P**, not just its top class.
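
Plugging in the numbers above makes the gap concrete. This is a sketch in plain Python using natural logarithms (so values are in nats); terms with $P(i) = 0$ are skipped, following the convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.6, 0.3, 0.1]

h_p = entropy(P)             # about 0.802 nats
h_pq = cross_entropy(P, Q)   # about 0.829 nats
kl = kl_divergence(P, Q)     # about 0.027 nats, equal to h_pq - h_p
```

With a one-hot `P` such as `[0, 1, 0]`, `entropy(P)` is 0 and `kl_divergence` returns the same value as `cross_entropy`.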

---

### 🧠 Intuition

- KL divergence measures the **inefficiency** of assuming $Q$ when the true distribution is $P$
- It quantifies the **extra bits of information** (with $\log_2$; nats with the natural log) needed to encode samples from $P$ using $Q$

---

### 🔹 Connection to Maximum Likelihood

Maximum Likelihood Estimation (MLE) seeks to:

$$
\arg\max_\theta \sum_x \log Q_\theta(x)
$$

When $P$ is the empirical distribution of the training data, this is equivalent to:

$$
\arg\min_\theta D_{\text{KL}}(P \parallel Q_\theta)
$$

✅ Minimizing KL divergence = Maximizing log-likelihood

They are mathematically tied:
- **KL is just cross-entropy minus entropy**
- $H(P)$ does not depend on $\theta$, so **minimizing KL = minimizing cross-entropy = maximizing model fit**

---

### 🔹 When to Use KL vs Cross-Entropy

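| Scenario | Is $P$ one-hot? | Does CE equal KL? | Typical loss |
|----------|------------------|---------------|------------|
| Classification | ✅ Yes | ✅ Yes | CrossEntropyLoss |
| Label smoothing | ❌ No | ❌ No | KL Divergence |
| Knowledge distillation | ❌ No | ❌ No | KL Divergence |
| Probabilistic models (e.g. VAEs, RL) | ❌ No | ❌ No | KL Divergence |

When $P$ is fixed, CE and KL differ only by the constant $H(P)$, so both give the same gradients with respect to the model; the choice changes the reported loss value, not the optimization.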

---

### 🔹 Summary

| Concept | Formula | Notes |
|--------|---------|-------|
| KL Divergence | $D_{\text{KL}}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$ | Measures inefficiency |
| KL = CE - Entropy | $D_{\text{KL}} = H(P, Q) - H(P)$ | Holds for any $P$; $H(P) > 0$ when $P$ is not one-hot |
| KL = CE (special case) | $D_{\text{KL}} = -\log Q(\text{true})$ | When $P$ is one-hot |
| MLE ≈ KL minimization | $\arg\max \log Q_\theta(x) = \arg\min D_{\text{KL}}(P \parallel Q_\theta)$ | Core learning principle |


