# 🧠 Derivative Rules, Guidelines for Substitution, and Common Function Derivatives

This note summarizes essential differentiation rules for deep learning, when to use substitution (i.e., the chain rule), and how to differentiate common functions such as sigmoid and exponentials.

---

## 📘 Core Derivative Rules

| Rule             | Description                                | Formula |
|------------------|--------------------------------------------|---------|
| Constant Rule     | Derivative of a constant                   | $\frac{d}{dx}(c) = 0$ |
| Power Rule        | For $x^n$, pull the exponent down          | $\frac{d}{dx}(x^n) = nx^{n-1}$ |
| Sum Rule          | Derivative of a sum = sum of derivatives   | $\frac{d}{dx}(f + g) = f' + g'$ |
| Product Rule      | For multiplying functions                  | $(fg)' = f'g + fg'$ |
| Quotient Rule     | For dividing functions                     | $\left( \frac{f}{g} \right)' = \frac{f'g - fg'}{g^2}$ |
| Chain Rule        | For composite functions $f(g(x))$          | $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$ |

---

## 🧠 Guidelines for Substitution / Chain Rule

Use substitution (and chain rule) when:
- You're differentiating a **composite function**
- The expression includes **something more complex than plain $x$**
- Example: $e^{-x}$, $\log(1 + x^2)$, $\tanh(x^3)$

**Key principle:**  
If the argument of the outer function is anything other than plain $x$ (for example $-x$, $1 + x^2$, or $x^3$), you need the chain rule.

---

### 💡 Chain Rule in Action (with $e^{-x}$)

Let:
- $f(x) = e^{-x}$
- Think of this as $f(x) = e^{u(x)}$, where $u(x) = -x$

Then:
$$
\frac{d}{dx}(e^{-x}) = e^{-x} \cdot \frac{d}{dx}(-x) = e^{-x} \cdot (-1) = -e^{-x}
$$

✅ The negative sign comes from the derivative of the inner function ($-x$).
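As a quick sanity check, the chain-rule result can be compared against a central finite difference. This is a minimal sketch using only Python's standard library; the helper name `numerical_derivative` and the step size `h` are illustrative choices, not part of the original note.

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return math.exp(-x)

def f_prime(x):
    # Chain rule: outer derivative e^{u} times inner derivative d(-x)/dx = -1.
    return -math.exp(-x)

for x in (-1.0, 0.0, 2.0):
    approx = numerical_derivative(f, x)
    exact = f_prime(x)
    print(f"x={x:+.1f}  finite diff={approx:.6f}  chain rule={exact:.6f}")
```

The two columns should agree to several decimal places; a sign mismatch would point to a forgotten inner derivative.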

---

## 🔢 Derivatives of Common Functions

### 🔷 Polynomials:
- $\frac{d}{dx}(x^2) = 2x$
- $\frac{d}{dx}(x^n) = nx^{n-1}$

### 🔷 Exponentials:
- $\frac{d}{dx}(e^x) = e^x$
- $\frac{d}{dx}(e^{-x}) = -e^{-x}$
- $\frac{d}{dx}(a^x) = a^x \log a$ (for a constant $a > 0$)

### 🔷 Logarithms (natural log):
- $\frac{d}{dx}(\log x) = \frac{1}{x}$

### 🔷 Trigonometric (FYI only):
- $\frac{d}{dx}(\sin x) = \cos x$
- $\frac{d}{dx}(\cos x) = -\sin x$
- $\frac{d}{dx}(\tan x) = \sec^2 x$

### 🔷 Hyperbolic / Neural Net Activations:
- $\frac{d}{dx}(\tanh x) = 1 - \tanh^2 x$
- $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$, where $\sigma(x) = \text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$
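
The sigmoid identity follows from the chain rule. A short derivation, assuming the standard definition $\sigma(x) = \frac{1}{1 + e^{-x}}$:

$$
\sigma(x) = (1 + e^{-x})^{-1}
\quad\Rightarrow\quad
\sigma'(x) = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2}
$$

Splitting the fraction and using $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$:

$$
\frac{e^{-x}}{(1 + e^{-x})^2}
= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
= \sigma(x)\,(1 - \sigma(x))
$$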

---

## ✅ Summary

- Derivative rules = tools. Chain rule = glue.
- Substitution helps identify when to apply chain rule
- Even simple expressions like $e^{-x}$ are composite under the hood
- Practice rewriting and differentiating in baby steps with annotations

---


![Activation Functions](activations.png)

## 🔍 Intuition Behind Sigmoid vs Tanh in Deep Learning

Understanding activation functions isn’t just about taking derivatives — it's about how they behave **during optimization**, especially for **gradient flow**, **convergence speed**, and **vanishing gradients**.

---

### 🔹 1. Output Range

| Function | Output Range | Zero-Centered? |
|----------|--------------|----------------|
| Sigmoid  | $(0, 1)$     | ❌ No          |
| Tanh     | $(-1, 1)$    | ✅ Yes         |

- **Why it matters:**  
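  Zero-centered outputs (like tanh) can be positive or negative, so the gradients on a downstream layer's weights are not all forced to share one sign. With sigmoid, every activation is positive, which tends to produce zig-zagging, less efficient weight updates.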

---

### 🔹 2. Vanishing Gradient Problem

Both functions **saturate** when input $x$ is very positive or negative.

| Function | Saturation Zones                  | Derivative Trend |
|----------|-----------------------------------|------------------|
| Sigmoid  | $x < -3$ or $x > 3$ → flattens    | Derivative $< 0.05$, shrinking toward $0$ |
| Tanh     | $x < -3$ or $x > 3$ → flattens    | Derivative $< 0.01$, shrinking toward $0$ |

- **Why it matters:**  
  When neurons operate in these zones, their gradients are close to zero. Backpropagation multiplies these small factors layer by layer, so early layers of a deep network learn **very slowly**.

---

### 🔹 3. Gradient Strength

| Function | Max Derivative | Location     |
|----------|----------------|--------------|
| Sigmoid  | $0.25$         | At $x = 0$   |
| Tanh     | $1.0$          | At $x = 0$   |

- **Why it matters:**  
  At $x = 0$, tanh passes a gradient $4\times$ larger than sigmoid ($1.0$ vs. $0.25$), so updates near zero are larger and training tends to progress faster.
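
The derivative values quoted in the tables can be reproduced directly from the closed-form derivatives. This is a small sketch in plain Python; the function names `sigmoid_grad` and `tanh_grad` are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 3.0, 5.0):
    print(f"x={x:.1f}  sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

Both derivatives peak at $x = 0$ and shrink as $|x|$ grows, which is the saturation behaviour described above.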

---

### 🔹 4. Use Cases in Deep Learning

| Use Case                        | Activation |
|----------------------------------|------------|
| Binary classification output     | Sigmoid    |
| Multiclass classification output | Softmax    |
| Hidden layers (historically)     | Tanh       |
| Modern hidden layers             | ReLU       |

- **Why tanh over sigmoid in hidden layers?**  
  It's zero-centered and provides stronger gradients.

---

### 🔹 5. Why ReLU Took Over

While tanh and sigmoid are still useful:
- **ReLU** doesn’t saturate for $x > 0$
- It keeps gradients alive
- Great for deep networks
- Easier to optimize

But:
- **Sigmoid** is still used in output layers
- **Tanh + Sigmoid** still power LSTM/GRU gates

---

### 📊 Visual Summary

#### Sigmoid & Tanh

- Sigmoid squashes input to $(0, 1)$, flattens out at extremes  
- Tanh squashes to $(-1, 1)$, symmetric and zero-centered

#### Their Derivatives

- **Sigmoid Derivative** peaks at $0.25$ and decays more slowly  
- **Tanh Derivative** peaks at $1$ but is narrower around the center, since $\tanh(x) = 2\sigma(2x) - 1$  
- Both are close to zero for $|x| > 3$


## 🔥 Activation Functions + Softmax + Cross-Entropy — Deep Learning Intuition

---

### 🔹 ReLU (Rectified Linear Unit)

**Formula:**

$$
\text{ReLU}(x) = \max(0, x)
$$

$$
\text{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

**Intuition:**
- Keep positive values, zero out negatives.
- Fast to compute, no saturation in the positive range.

**Issues:**
- "Dead neurons" — if a neuron gets stuck negative, it may never recover.

---

### 🔹 Leaky ReLU

**Formula:**

$$
\text{LeakyReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
\quad\text{where } \alpha \approx 0.01 \text{ or } 0.1
$$

**Derivative:**

$$
\text{LeakyReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
\alpha & \text{if } x \leq 0
\end{cases}
$$

**Fixes:**
- Lets small gradients pass through for $x < 0$ → avoids dead neurons.
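
The two piecewise definitions translate almost line for line into code. This is a minimal scalar sketch; the default `alpha=0.01` is one of the values mentioned above, and the gradient at exactly $x = 0$ follows the convention used in the formulas.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention from the formula above: gradient is 0 for x <= 0.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # The small slope alpha keeps a gradient signal for negative inputs.
    return 1.0 if x > 0 else alpha

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, `relu_grad` returns zero while `leaky_relu_grad` returns `alpha`, which is the difference that lets a "stuck" neuron keep learning.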

---

### 🔹 GELU (Gaussian Error Linear Unit)

**Exact Formula:**

$$
\text{GELU}(x) = x \cdot \Phi(x)
= x \cdot \frac{1}{2} \left[ 1 + \text{erf}\left( \frac{x}{\sqrt{2}} \right) \right]
$$

**Derivative (Exact):**

$$
\frac{d}{dx} \text{GELU}(x) = \Phi(x) + x \cdot \phi(x)
\quad\text{where } \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
$$

**Approximate Formula (Used in practice):**

$$
\text{GELU}(x) \approx 0.5x \left[ 1 + \tanh\left( \sqrt{\frac{2}{\pi}}(x + 0.044715x^3) \right) \right]
$$

---

### 🔹 Softmax + Cross-Entropy

#### Softmax:

Given logits vector $ \mathbf{z} $, softmax converts to probabilities:

$$
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

---

#### Cross-Entropy Loss:

Given true class $ \mathbf{y} $ (one-hot), and predicted probs $ \hat{\mathbf{y}} = \text{softmax}(\mathbf{z}) $:

$$
\text{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_i y_i \log(\hat{y}_i)
= -\log(\hat{y}_{\text{true class}})
$$

---

#### Combined Derivative:

$$
\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i
$$

This is why frameworks like PyTorch have `nn.CrossEntropyLoss` take raw logits and targets directly, rather than softmax outputs.

---

### 🧠 Visual Intuitions

- Softmax with larger logit differences → more confident predictions (sharper output)
- Cross-entropy loss:
  - Low when $ \hat{y}_{\text{true}} \approx 1 $
  - Very high when $ \hat{y}_{\text{true}} \approx 0 $


## 🔥 Activation Functions + Softmax + Cross-Entropy — Deep Learning Intuition (With Math)

---

### 🔹 ReLU (Rectified Linear Unit)

**Formula:**

$$
\text{ReLU}(x) = \max(0, x)
$$

**Derivative:**

$$
\text{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

**Intuition:**
- Outputs the input directly if it's positive, else outputs 0.
- Introduces non-linearity while maintaining simplicity.
- **No gradient saturation** in the positive region → keeps gradients alive.

**Drawback:**
- Neurons can "die" if they fall into the $x \leq 0$ region (zero gradient forever).

---

### 🔹 Leaky ReLU

**Formula:**

$$
\text{LeakyReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
\quad\text{where } \alpha \in [0.01, 0.1]
$$

**Derivative:**

$$
\text{LeakyReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
\alpha & \text{if } x \leq 0
\end{cases}
$$

**Intuition:**
- Small negative slope instead of flat zero.
- Allows small gradient when $x < 0$ → avoids dead neurons.
- Often used in GANs or deep CNNs for better convergence.

---

### 🔹 GELU (Gaussian Error Linear Unit)

**Exact Formula:**

$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

Where:

$$
\Phi(x) = \frac{1}{2} \left[ 1 + \text{erf}\left( \frac{x}{\sqrt{2}} \right) \right]
$$

**Derivative (Exact)(Product Rule):**

$$
\frac{d}{dx} \text{GELU}(x) = \Phi(x) + x \cdot \phi(x)
$$

Where:

$$
\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
$$

**Approximate Formula (used in practice):**

$$
\text{GELU}(x) \approx 0.5x \left[ 1 + \tanh\left( \sqrt{\frac{2}{\pi}}(x + 0.044715x^3) \right) \right]
$$

**Intuition:**
- Smoothed version of ReLU that uses probability weighting.
- Weighs inputs by their likelihood of being positive under a standard normal distribution.
- No hard threshold → smoother gradient flow.

**Use case:**
- Default activation in Transformer models (e.g. BERT, GPT).
- Helps with convergence and generalization in large-scale deep learning.
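
To see how close the tanh approximation is to the exact form, both can be evaluated side by side. This is a sketch using `math.erf` from the standard library; the function names are illustrative, and the derivative is the product-rule expression given above.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh-based approximation from the formula above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_grad(x):
    # Product rule: Phi(x) + x * phi(x).
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  "
          f"approx={gelu_tanh(x):+.5f}  grad={gelu_grad(x):+.5f}")
```

On these sample points the two versions differ only in the later decimal places, which is why the approximation is a common drop-in.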

---

### 🔹 Softmax + Cross-Entropy

---

#### ✅ Softmax Function

Given raw logits vector:

$$
\mathbf{z} = [z_1, z_2, \dots, z_n]
$$

Softmax converts it to a probability distribution:

$$
\hat{y}_i = \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

- Output: $0 < \hat{y}_i < 1$
- Ensures: $\sum_i \hat{y}_i = 1$
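
A minimal softmax in plain Python makes both properties easy to check. Subtracting the maximum logit before exponentiating is an implementation choice added here for numerical safety; it is not part of the formula above and does not change the result.

```python
import math

def softmax(z):
    # Subtracting max(z) leaves the output unchanged but avoids overflow in exp.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

Every entry lies strictly between 0 and 1, and the entries sum to 1 up to floating-point rounding.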

---

#### ✅ Cross-Entropy Loss

Given one-hot encoded true label $ \mathbf{y} $ and predicted probabilities $ \hat{\mathbf{y}} $, the cross-entropy loss is:

$$
\text{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_i y_i \log(\hat{y}_i)
$$

Since only one $ y_i = 1 $, this simplifies to:

$$
\text{Loss} = -\log(\hat{y}_{\text{true class}})
$$

---

#### ✅ Log-Softmax Identity

Softmax followed by log simplifies to:

$$
\log(\text{softmax}(z_i)) = z_i - \log\left( \sum_j e^{z_j} \right)
$$

Used in practice as `log_softmax()` for **numerical stability** (avoids overflow).

---

#### ✅ Derivative of Softmax + Cross-Entropy

Combined, the gradient becomes extremely clean:

$$
\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i
$$

- Just the **difference between predicted and actual class**
- Avoids needing to separately backprop through softmax and log
- Efficient and stable, which is why PyTorch and TensorFlow provide it as a **single combined op** that takes raw logits

---

### 🧠 Visual Insights

- **Softmax**:
  - With small logit differences → soft probability distribution
  - With large logit differences → sharp confidence spike

- **Cross-Entropy**:
  - Loss is **low** when the predicted probability for the true class is **close to 1**
  - Loss is **high** when the model is confident **but wrong**


## 🔥 Log-Softmax: Full Derivation + Gradient

---

### 🔹 1. Recall Softmax

Given logits:

$$
z = [z_1, z_2, \dots, z_n]
$$

The softmax function is:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

---

### 🔹 2. Derive Log-Softmax

Take the log of softmax:

$$
\log(\hat{y}_i) = \log\left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
= z_i - \log \left( \sum_j e^{z_j} \right)
$$

This is the **log-softmax identity**:

$$
\boxed{
\log(\text{softmax}(z_i)) = z_i - \log\left( \sum_j e^{z_j} \right)
}
$$

---

### 🔹 3. Log-Softmax as a Function

Let’s define:

$$
\ell_i = \log(\text{softmax}(z_i)) = z_i - \log \left( \sum_j e^{z_j} \right)
$$

We want to compute the derivative:

$$
\frac{\partial \ell_i}{\partial z_k}
$$

---

### 🔸 Case 1: $i = k$


$$
\frac{\partial \ell_i}{\partial z_i} =
\frac{\partial}{\partial z_i} \left( z_i - \log \sum_j e^{z_j} \right)
$$


Split it:

- First term: $ \frac{\partial z_i}{\partial z_i} = 1 $
- Second term:

$$
\frac{\partial}{\partial z_i} \log \left( \sum_j e^{z_j} \right)
= \frac{e^{z_i}}{\sum_j e^{z_j}} = \hat{y}_i
$$

So:

$$
\frac{\partial \ell_i}{\partial z_i} = 1 - \hat{y}_i
$$

---

### 🔸 Case 2: $i \ne k$

Only the second term depends on $z_k$:

$$
\frac{\partial \ell_i}{\partial z_k} = -\frac{\partial}{\partial z_k} \log \left( \sum_j e^{z_j} \right)
= -\frac{e^{z_k}}{\sum_j e^{z_j}} = -\hat{y}_k
$$

---

### 🔹 4. Final Result — Jacobian of Log-Softmax

$$
\frac{\partial \ell_i}{\partial z_k} =
\begin{cases}
1 - \hat{y}_i & \text{if } i = k \\
-\hat{y}_k & \text{if } i \ne k
\end{cases}
$$

This is the **Jacobian matrix** of log-softmax, and it's used in general backprop.
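
The two cases can be checked numerically. The sketch below builds the analytic Jacobian as $\delta_{ik} - \hat{y}_k$ and compares it against central finite differences; all function names are illustrative, and only the standard library is used.

```python
import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def log_softmax_jacobian(z):
    # J[i][k] = d l_i / d z_k = (1 if i == k else 0) - softmax(z)[k]
    y_hat = [math.exp(lp) for lp in log_softmax(z)]
    n = len(z)
    return [[(1.0 if i == k else 0.0) - y_hat[k] for k in range(n)]
            for i in range(n)]

def numerical_jacobian(z, h=1e-6):
    n = len(z)
    jac = [[0.0] * n for _ in range(n)]
    for k in range(n):
        z_plus = list(z)
        z_plus[k] += h
        z_minus = list(z)
        z_minus[k] -= h
        lp_plus, lp_minus = log_softmax(z_plus), log_softmax(z_minus)
        for i in range(n):
            jac[i][k] = (lp_plus[i] - lp_minus[i]) / (2 * h)
    return jac

z = [2.0, 1.0, 0.1]
analytic = log_softmax_jacobian(z)
numeric = numerical_jacobian(z)
max_err = max(abs(analytic[i][k] - numeric[i][k])
              for i in range(3) for k in range(3))
print("max abs difference:", max_err)
```

Each row of the Jacobian sums to zero, because adding the same constant to every logit leaves log-softmax unchanged.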

---

### 🔹 5. Cross-Entropy with Log-Softmax (Combined)

Cross-entropy loss:

$$
L = -\sum_i y_i \log(\hat{y}_i)
= -\sum_i y_i \cdot \ell_i
$$

Now take derivative of $L$ w.r.t. logits $z_k$:

$$
\frac{\partial L}{\partial z_k}
= -\sum_i y_i \cdot \frac{\partial \ell_i}{\partial z_k}
$$

Now plug in the two cases from above:

- When $i = k$: contributes $y_k (1 - \hat{y}_k)$
- When $i \ne k$: contributes $y_i (-\hat{y}_k)$

So total derivative:

$$
\frac{\partial L}{\partial z_k} =
- \left( y_k (1 - \hat{y}_k) + \sum_{i \ne k} y_i (-\hat{y}_k) \right)
$$

Factor out $\hat{y}_k$:

$$
= -y_k (1 - \hat{y}_k) + \hat{y}_k \sum_{i \ne k} y_i
$$

Since $\sum_i y_i = 1$, we know $\sum_{i \ne k} y_i = 1 - y_k$, so:

$$
\frac{\partial L}{\partial z_k} =
-y_k + y_k \hat{y}_k + \hat{y}_k (1 - y_k)
= - y_k + \hat{y}_k
$$

Rewritten:

$$
\boxed{
\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k
}
$$

---

### ✅ Final Takeaway

- This is why softmax + cross-entropy are **usually combined** into one operation
- The gradient simplifies to:

$$
\nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y}
$$

- No need to manually backprop through softmax or log
- Frameworks like PyTorch rely on this in `nn.CrossEntropyLoss`, which takes raw logits and labels
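
The $\hat{\mathbf{y}} - \mathbf{y}$ result can be verified without any framework. This sketch computes the loss from raw logits and compares the analytic gradient with central finite differences; the function names and the example logits are illustrative.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical safety
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(z, y):
    # L = -sum_i y_i * (z_i - logsumexp(z))
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return -sum(y_i * (z_i - lse) for z_i, y_i in zip(z, y))

def numerical_grad(z, y, h=1e-6):
    grads = []
    for k in range(len(z)):
        z_plus = list(z)
        z_plus[k] += h
        z_minus = list(z)
        z_minus[k] -= h
        grads.append((cross_entropy_from_logits(z_plus, y)
                      - cross_entropy_from_logits(z_minus, y)) / (2 * h))
    return grads

z = [2.0, 1.0, 0.1]
y = [1.0, 0.0, 0.0]
analytic = [p - t for p, t in zip(softmax(z), y)]
numeric = numerical_grad(z, y)
for a, n in zip(analytic, numeric):
    print(f"analytic={a:+.6f}  numeric={n:+.6f}")
```

The gradient entries sum to zero, and the entry for the true class is negative, so a gradient-descent step raises that logit.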

---

### 📌 Summary

| Component | Expression |
|-----------|------------|
| Log-Softmax | $ \log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j} $ |
| $\frac{\partial \ell_i}{\partial z_k}$ | $ 1 - \hat{y}_i$ if $i=k$, else $-\hat{y}_k$ |
| Cross-Entropy | $ L = -\sum_i y_i \log(\hat{y}_i) $ |
| Final Gradient | $ \frac{\partial L}{\partial z_k} = \hat{y}_k - y_k $ |



## 🧠 Log of Softmax — What It Actually Means (Step-by-Step Breakdown)

---

### 🔹 Goal:

Understand the identity:

$$
\log(\text{softmax}(z_i)) = z_i - \log \left( \sum_j e^{z_j} \right)
$$

---

### 🧪 Example Logits:

Let:

$$
z = [2.0, 1.0, 0.1]
$$

---

### ✅ Step 1: Compute Softmax

The softmax formula is:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

Numerically:

- $e^2 \approx 7.39$
- $e^1 \approx 2.72$
- $e^{0.1} \approx 1.105$

So:

$$
\sum_j e^{z_j} \approx 7.39 + 2.72 + 1.105 = 11.215
$$

Softmax outputs:

- $\hat{y}_0 = \frac{7.39}{11.215} \approx 0.659$
- $\hat{y}_1 = \frac{2.72}{11.215} \approx 0.242$
- $\hat{y}_2 = \frac{1.105}{11.215} \approx 0.099$

---

### ✅ Step 2: Take the Log of Softmax

Now we compute:

$$
\log(\text{softmax}(z_i)) = \log \left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
= \log(e^{z_i}) - \log\left( \sum_j e^{z_j} \right)
= z_i - \log \sum_j e^{z_j}
$$

---

### 🔍 Numerical Example for $z_0 = 2.0$

$$
\log(\text{softmax}(2.0)) = 2.0 - \log(11.215) \approx 2.0 - 2.417 = -0.417
$$

Other entries:

- $\log(\text{softmax}(1.0)) \approx 1.0 - 2.417 = -1.417$
- $\log(\text{softmax}(0.1)) \approx 0.1 - 2.417 = -2.317$
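
The worked numbers can be reproduced in a few lines. This sketch computes log-softmax both ways, via the identity and by taking the log of softmax directly, for the same example logits; the variable names are illustrative.

```python
import math

z = [2.0, 1.0, 0.1]

# Identity: log_softmax(z_i) = z_i - log(sum_j exp(z_j))
lse = math.log(sum(math.exp(v) for v in z))
log_probs = [v - lse for v in z]

# Direct route: softmax first, then log (fine for small logits like these).
total = sum(math.exp(v) for v in z)
direct = [math.log(math.exp(v) / total) for v in z]

print("log-sum-exp:", round(lse, 3))
print("via identity:", [round(v, 3) for v in log_probs])
print("via log(softmax):", [round(v, 3) for v in direct])
```

Both routes agree here; they only diverge numerically when the logits are large enough to overflow or underflow.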

---

### 🧨 Common Mistake Explained

If you saw something like:

$$
2 - (-0.417) = 2.417 \Rightarrow \text{(wrong interpretation)}
$$

You probably did:

$$
2 - \log(\text{softmax}(2.0)) = 2 - \log(0.659) \approx 2 - (-0.417) = 2.417
$$

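That is **not** the log-softmax. Rearranging the identity gives $z_i - \log(\text{softmax}(z_i)) = \log \sum_j e^{z_j}$, so this calculation recovers the log-sum-exp normalizer ($\approx 2.417$), not the log-probability itself ($\approx -0.417$).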

---

### 🧠 Final Takeaway:

To compute log-softmax **correctly**:

$$
\log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j}
$$

This is:
- **Numerically stable** once the maximum logit is subtracted before exponentiating (the log-sum-exp trick)
- **Logically correct**
- **Used in deep learning frameworks** (e.g., `log_softmax()` in PyTorch)

---

## 🔥 Combined Log-Softmax + Cross-Entropy — Why Gradient is $\hat{y} - y$

---

### 🔹 1. What Happens Separately

Logits:

$$
z = [2.0, 1.0, 0.1]
$$

Softmax:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}} \Rightarrow \hat{y} \approx [0.659, 0.242, 0.099]
$$

Cross-entropy loss with true label $y = [1, 0, 0]$:

$$
L = -\sum_i y_i \log(\hat{y}_i) = -\log(0.659) \approx 0.417
$$

---

### 🔹 2. What Happens When Combined

Instead of computing softmax and log separately:

$$
\log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j}
$$

So cross-entropy becomes (using $\sum_i y_i = 1$ for the second term):

$$
L = -\sum_i y_i \cdot \left(z_i - \log \sum_j e^{z_j}\right)
= -\sum_i y_i z_i + \log \sum_j e^{z_j}
$$

---

### 🔹 3. Derivative of Cross-Entropy w.r.t. Logits $z_k$

Take derivative of:

$$
L = -\sum_i y_i z_i + \log \sum_j e^{z_j}
$$

Split into two parts:

#### (1) First term:

$$
\frac{\partial}{\partial z_k} \left( -\sum_i y_i z_i \right) = -y_k
$$

#### (2) Second term:

$$
\frac{\partial}{\partial z_k} \left( \log \sum_j e^{z_j} \right) = \frac{e^{z_k}}{\sum_j e^{z_j}} = \hat{y}_k
$$

---

### ✅ Final Result:

Add both parts:

$$
\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k
$$

This gives you **clean, simple gradients** straight from raw logits, with no need to apply softmax manually first.

---

### 📌 Summary

| Concept                     | Formula |
|----------------------------|---------|
| Softmax                    | $ \hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}} $ |
| Log-Softmax                | $ \log(\hat{y}_i) = z_i - \log \sum_j e^{z_j} $ |
| Cross-Entropy Loss         | $ L = -\sum_i y_i \log(\hat{y}_i) $ |
| Combined Derivative        | $ \frac{\partial L}{\partial z_k} = \hat{y}_k - y_k $ |

---



## 🔢 Deep Learning Math Essentials: Exponentials, Logarithms, and Their Role in Loss Functions

---

### 🔹 What is $e$?

- $e \approx 2.718$ is Euler's number.
- Defined by:

$$
e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^n
$$

- Unique property:

$$
\frac{d}{dx} e^x = e^x
$$

Used heavily in deep learning because $e^x$ is smooth, always positive, and its own derivative, which keeps gradient expressions simple.

---

### 🔹 Exponential Function: $e^x$

- Always positive: $e^x > 0$
- Grows rapidly as $x \to \infty$
- Flattens near zero as $x \to -\infty$
- Derivative:

$$
\frac{d}{dx} e^x = e^x
$$

- Operations:
  - $e^{a + b} = e^a \cdot e^b$
  - $e^{a - b} = \frac{e^a}{e^b}$

---

### 🔹 Logarithmic Function: $\log(x)$

- Inverse of $e^x$
- Defined as:

$$
\log(x) = y \iff e^y = x
$$

- Only defined for $x > 0$
- Grows slowly as $x \to \infty$, and tends to $-\infty$ as $x \to 0^+$
- Derivative:

$$
\frac{d}{dx} \log(x) = \frac{1}{x}
$$

- Operation rules:
  - $\log(ab) = \log a + \log b$
  - $\log\left(\frac{a}{b}\right) = \log a - \log b$
  - $\log(a^b) = b \log a$
  - $\log(e^x) = x$
  - $e^{\log(x)} = x$

---

### 🔹 Inverse Identity

Exponentials and logs undo each other:

$$
\log(e^x) = x \quad \text{and} \quad e^{\log(x)} = x
$$

---

### 🔹 Why Use $\log$ and $e^x$ in Deep Learning?

| Purpose                            | Example                                           | Why                         |
|-----------------------------------|---------------------------------------------------|------------------------------|
| Convert scores to probabilities   | $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | $e^x$ exaggerates differences |
| Stabilize products                | $\log(p_1 \cdot p_2) = \log p_1 + \log p_2$       | Avoids underflow             |
| Gradient-based optimization       | $\frac{d}{dx} e^x,\ \frac{d}{dx} \log(x)$         | Smooth derivatives           |

---

### 🔹 Binary Cross-Entropy (BCE)

Used in binary classification tasks:

$$
\text{BCE}(\hat{y}, y) = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]
$$

Where:
- $y \in \{0, 1\}$ is the true label
- $\hat{y} \in (0, 1)$ is the predicted probability

Special cases:
- If $y = 1$: $\text{Loss} = -\log(\hat{y})$
- If $y = 0$: $\text{Loss} = -\log(1 - \hat{y})$

Gradients:
- If $y = 1$: $\frac{dL}{d\hat{y}} = -\frac{1}{\hat{y}}$
- If $y = 0$: $\frac{dL}{d\hat{y}} = \frac{1}{1 - \hat{y}}$
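
Both special cases come from one general gradient, $\frac{dL}{d\hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$. This is a small sketch in plain Python; the function names `bce` and `bce_grad` are illustrative, and inputs are assumed to lie strictly inside $(0, 1)$.

```python
import math

def bce(y_hat, y):
    # Binary cross-entropy for a single prediction y_hat in (0, 1).
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def bce_grad(y_hat, y):
    # dL/dy_hat = -y / y_hat + (1 - y) / (1 - y_hat)
    return -y / y_hat + (1 - y) / (1 - y_hat)

for y_hat in (0.9, 0.5, 0.1):
    print(f"y_hat={y_hat}:  "
          f"y=1 -> loss={bce(y_hat, 1):.4f}, grad={bce_grad(y_hat, 1):+.4f}  |  "
          f"y=0 -> loss={bce(y_hat, 0):.4f}, grad={bce_grad(y_hat, 0):+.4f}")
```

The gradient magnitude grows sharply as the prediction moves toward the wrong extreme, which is what drives strong corrections for confident mistakes.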

---

### 🔹 Softmax Function

Used to convert logits to a probability distribution:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

Properties:
- $\hat{y}_i \in (0, 1)$
- $\sum_i \hat{y}_i = 1$

---

### 🔹 Cross-Entropy Loss (Multiclass)

With one-hot true labels:

$$
\text{CE}(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)
$$

If class $k$ is true, and $y_k = 1$:

$$
\text{Loss} = -\log(\hat{y}_k)
$$

---

### 🔹 Log-Softmax Trick

Instead of computing:

$$
\log(\text{softmax}(z_i)) = \log\left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
$$

Use:

$$
\log(\text{softmax}(z_i)) = z_i - \log\left(\sum_j e^{z_j}\right)
$$

Combined with max-subtraction (the log-sum-exp trick), this is numerically stable, and with cross-entropy on top it gives clean gradients:
$$
\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i
$$

---

### 🔹 KL Divergence

Measures how much one distribution $Q$ diverges from a true distribution $P$:

$$
D_{\text{KL}}(P \parallel Q) = \sum_i P(i) \log\left(\frac{P(i)}{Q(i)}\right)
$$

Can be rewritten as:

$$
D_{\text{KL}}(P \parallel Q) = H(P, Q) - H(P)
$$

- $H(P, Q)$ = cross-entropy
- $H(P)$ = entropy of true distribution

---

### 🧠 Final Summary

| Concept             | Formula                                                        | Use Case                     |
|---------------------|----------------------------------------------------------------|------------------------------|
| Exponential         | $e^x$                                                          | Amplify scores (softmax)     |
| Logarithm           | $\log(x)$                                                      | Stabilize, inverse of $e^x$  |
| BCE                 | $- [y \log(\hat{y}) + (1-y) \log(1 - \hat{y})]$                | Binary classification        |
| Softmax             | $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$                  | Convert logits to probs      |
| Cross-Entropy       | $-\sum_i y_i \log(\hat{y}_i)$                                   | Multiclass classification    |
| Log-Softmax         | $z_i - \log \sum_j e^{z_j}$                                     | Numerically stable softmax   |
| KL Divergence       | $\sum_i P(i) \log \frac{P(i)}{Q(i)}$                            | Measure distribution mismatch|



## 🔥 KL Divergence in Deep Learning — Full Intuition + Math

---

### 🔹 What is KL Divergence?

KL divergence measures **how different** a predicted probability distribution $Q$ is from a reference (true) distribution $P$:

$$
D_{\text{KL}}(P \parallel Q) = \sum_i P(i) \log \left( \frac{P(i)}{Q(i)} \right)
$$

> Read as: “KL of $P$ relative to $Q$”

- It is **not symmetric**:
  $$
  D_{\text{KL}}(P \parallel Q) \ne D_{\text{KL}}(Q \parallel P)
  $$

---

### 🔹 Log Rule Breakdown

We can break the KL formula using log rules:

$$
\log\left(\frac{P(i)}{Q(i)}\right) = \log P(i) - \log Q(i)
$$

So:

$$
D_{\text{KL}}(P \parallel Q) = \sum_i P(i) [\log P(i) - \log Q(i)] = H(P, Q) - H(P)
$$

Where:
- $H(P, Q)$ is the **cross-entropy**
- $H(P)$ is the **entropy** of the true distribution

---

### 🔹 When $P$ is One-Hot (Normal Classification)

Let:
$$
P = [0, 1, 0] \quad \text{(true class is index 1)}
$$

Then:

- By the convention $0 \log 0 = 0$, entries with $P(i) = 0$ contribute 0, and the remaining term is $1 \cdot \log 1 = 0$
- So:
  $$
  H(P) = -\sum_i P(i) \log P(i) = 0
  $$

And:

$$
H(P, Q) = -\sum_i P(i) \log Q(i) = -\log Q(1)
$$

Therefore:

$$
D_{\text{KL}}(P \parallel Q) = -\log Q(\text{true class})
$$

✅ KL = Cross-Entropy

---

### 🔹 When $P$ is NOT One-Hot (Soft Targets)

Let:

$$
P = [0.7,\ 0.2,\ 0.1], \quad Q = [0.6,\ 0.3,\ 0.1]
$$

Then:

- $H(P, Q) = -\sum_i P(i) \log Q(i)$
- $H(P) = -\sum_i P(i) \log P(i)$
- $D_{\text{KL}} = H(P, Q) - H(P)$

Now KL $\ne$ CE. The **difference reflects how much Q diverges from the full shape of P**, not just its top class.
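
The three quantities for this example can be computed directly. This sketch uses natural logs (so values are in nats) and skips terms with $P(i) = 0$, following the $0 \log 0 = 0$ convention; the function names are illustrative.

```python
import math

def entropy(p):
    # Terms with p_i == 0 are skipped (convention 0 * log 0 = 0).
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.6, 0.3, 0.1]

h_p = entropy(P)
h_pq = cross_entropy(P, Q)
kl = kl_divergence(P, Q)

print(f"H(P)      = {h_p:.4f}")
print(f"H(P, Q)   = {h_pq:.4f}")
print(f"KL(P||Q)  = {kl:.4f}")
print(f"H(P,Q) - H(P) = {h_pq - h_p:.4f}")
print(f"KL(Q||P)  = {kl_divergence(Q, P):.4f}")
```

The KL value is small but positive (roughly 0.027 nats), matches $H(P, Q) - H(P)$, and differs from $D_{\text{KL}}(Q \parallel P)$, illustrating the asymmetry.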

---

### 🧠 Intuition

- KL divergence measures the **inefficiency** of assuming $Q$ when the true distribution is $P$
- It quantifies the **extra bits of information** (nats, when using the natural log) needed to encode samples from $P$ using a code optimized for $Q$

---

### 🔹 Connection to Maximum Likelihood

Maximum Likelihood Estimation (MLE), over samples $x$ drawn from the data distribution $P$, seeks:

$$
\arg\max_\theta \sum_x \log Q_\theta(x)
$$

This is equivalent to:

$$
\arg\min_\theta D_{\text{KL}}(P \parallel Q_\theta)
$$

✅ Minimizing KL divergence = Maximizing log-likelihood

They are mathematically tied:
- **KL is just cross-entropy minus entropy**
- $H(P)$ does not depend on $\theta$, so **minimizing KL = minimizing cross-entropy = maximizing model fit**

---

### 🔹 When to Use KL vs Cross-Entropy

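| Scenario | Is $P$ one-hot? | Does CE equal KL? | Typical Loss |
|----------|------------------|---------------|------------|
| Classification | ✅ Yes | ✅ Yes | CrossEntropyLoss |
| Label smoothing | ❌ No | ❌ No (differs by $H(P)$) | Cross-entropy with soft targets, or KL Divergence |
| Knowledge distillation | ❌ No | ❌ No (differs by $H(P)$) | KL Divergence |
| Probabilistic models (e.g. VAEs, RL) | ❌ No | ❌ No | KL Divergence |

When $P$ is fixed, CE and KL differ only by the constant $H(P)$, so they give the same gradients with respect to the model's parameters.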

---

### 🔹 Summary

| Concept | Formula | Notes |
|--------|---------|-------|
| KL Divergence | $D_{\text{KL}}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$ | Measures inefficiency |
| KL = CE - Entropy | $D_{\text{KL}} = H(P, Q) - H(P)$ | Always holds; $H(P) > 0$ when $P$ is not one-hot |
| KL = CE (special case) | $D_{\text{KL}} = -\log Q(\text{true})$ | When $P$ is one-hot |
| MLE ≈ KL minimization | $\arg\max \log Q_\theta(x) = \arg\min D_{\text{KL}}(P \parallel Q_\theta)$ | Core learning principle |



## 🔐 Log-Softmax and the Log-Sum-Exp Trick

---

### 🔹 The Problem

When computing:

$$
\log(\text{softmax}(z_i)) = \log\left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
= z_i - \log\left( \sum_j e^{z_j} \right)
$$

You're at risk of:

- **Overflow** if any $z_j$ is large (e.g. $e^{1000}$)
- **Underflow** if softmax returns very small values (log of near-zero → $-\infty$)
- **NaNs** in training and gradient instability

---

### 🔹 The Log-Sum-Exp Trick (LSE)

To stabilize:

$$
\log \sum_j e^{z_j}
$$

We apply:

$$
\log \sum_j e^{z_j} = \max_j z_j + \log \sum_j e^{z_j - \max_j z_j}
$$

This avoids numerical overflow by subtracting the largest logit before exponentiation.
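
The difference is easy to demonstrate in plain Python, where `math.exp` raises `OverflowError` for large arguments. This is a sketch with illustrative function names; real frameworks implement the same idea on tensors.

```python
import math

def logsumexp_naive(z):
    # Direct translation of log(sum(exp(z))); overflows for large logits.
    return math.log(sum(math.exp(v) for v in z))

def logsumexp_stable(z):
    # Log-sum-exp trick: factor out the largest logit before exponentiating.
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

small = [2.0, 1.0, 0.1]
print(logsumexp_naive(small), logsumexp_stable(small))

large = [1000.0, 999.0, 998.1]
try:
    logsumexp_naive(large)
except OverflowError as err:
    print("naive version failed:", err)
print(logsumexp_stable(large))
```

After the shift, the largest exponent is $e^0 = 1$, so the sum inside the log always lies between 1 and the number of logits.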

---

### ✅ Used in Practice: `log_softmax`

Instead of doing:

```python
log_probs = torch.log(torch.softmax(logits, dim=-1))
```

Which is unsafe…

Use:

```python
log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
```

This is:

- **Numerically stable** (uses log-sum-exp trick internally)
- **Efficient** (avoids extra computation)
- **Safe for backprop** (avoids the `-inf` and NaN values that `log(0)` would introduce)

---

### 🔹 Log-Softmax Identity

The math behind `log_softmax`:

$$
\log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j}
$$

With the LSE trick applied:

$$
= z_i - \left[ \max_j z_j + \log \sum_j e^{z_j - \max_j z_j} \right]
$$

This ensures the entire operation is stable, even if logits are huge.

---

### ❌ Why You Shouldn’t Do `log(softmax(...))`

| Issue                        | Result                        |
|-----------------------------|-------------------------------|
| $e^x$ on large logits        | Overflow / `inf`              |
| Dividing large exponentials | `inf / inf` → `nan`, or loss of precision |
| Taking $\log(0)$             | `-inf`                        |
| Manual implementation       | No use of LSE trick           |

---

### ✅ `log_softmax` Summary

| Property                  | Description                                      |
|---------------------------|--------------------------------------------------|
| Input                     | Raw logits (unbounded real numbers)             |
| Output                    | Log-probabilities (each $\leq 0$; their exponentials sum to 1) |
| Implementation            | Uses log-sum-exp trick internally               |
| Use case                  | Preferred for NLL, CE, KL, language modeling    |
| Gradient                  | Clean when paired with cross-entropy: $\nabla_{z_i} L = \hat{y}_i - y_i$ |
| Alternative to            | `log(softmax(z))` (don’t do this)              |

---

### 🧠 Final Mental Model

- `softmax` → converts scores to probs (naive implementations risk overflow)
- `log(softmax(...))` → unstable unless you apply LSE manually
- `log_softmax` → **optimized fusion** with **log-sum-exp built in**

> **Moral of the story:**  
> 💯 Always use `log_softmax` when working with log-probabilities from logits.



## 🧠 Section 2: Partial Derivatives and Gradients

---

### 🔹 What Is a Partial Derivative?

If a function depends on multiple variables, a **partial derivative** tells you how it changes when you change **just one variable**, holding the others constant.

---

### ✅ Example

Let:

$$
f(x, y) = x^2 y + \sin(y)
$$

- Partial derivative w.r.t. $x$:

$$
\frac{\partial f}{\partial x} = 2xy
$$

- Partial derivative w.r.t. $y$:

$$
\frac{\partial f}{\partial y} = x^2 + \cos(y)
$$

When taking $\frac{\partial f}{\partial x}$, you treat $y$ like a **constant**.

---

### 🔹 Notation

- Regular derivative: $\frac{df}{dx}$ — use when $f$ has one variable  
- Partial derivative: $\frac{\partial f}{\partial x}$ — use when $f$ has many variables

---

### 🔹 Gradient Vector

If $f$ is a function of multiple variables:

$$
f(x_1, x_2, \dots, x_n)
$$

Then the **gradient** is the vector of all partial derivatives:

$$
\nabla f =
\begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\frac{\partial f}{\partial x_2} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{bmatrix}
$$

---

### 🔹 Geometric Meaning

- Gradient points in the direction of **steepest ascent**  
- In deep learning, we **move opposite the gradient** to minimize the loss

---

### 🔧 Common Structure in DL Loss Functions

Let:

$$
L = (w_1 x_1 + w_2 x_2 - y)^2
$$

Then:

$$
\frac{\partial L}{\partial w_1} = 2(w_1 x_1 + w_2 x_2 - y) \cdot x_1
$$

$$
\frac{\partial L}{\partial w_2} = 2(w_1 x_1 + w_2 x_2 - y) \cdot x_2
$$

This pattern appears in:
- Linear regression  
- MLPs  
- Backpropagation  
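
The same pattern in code, followed by a single gradient-descent step. This is a sketch; the weights, inputs, target, and learning rate `lr` are made-up illustrative values.

```python
def loss(w, x, y):
    pred = w[0] * x[0] + w[1] * x[1]
    return (pred - y) ** 2

def loss_grad(w, x, y):
    # dL/dw_i = 2 * (w1*x1 + w2*x2 - y) * x_i
    residual = w[0] * x[0] + w[1] * x[1] - y
    return [2 * residual * x[0], 2 * residual * x[1]]

w = [0.5, -0.3]   # illustrative starting weights
x = [1.0, 2.0]    # illustrative input
y = 1.0           # illustrative target
lr = 0.05         # illustrative learning rate

grad = loss_grad(w, x, y)
# Move opposite the gradient to reduce the loss.
w_new = [wi - lr * gi for wi, gi in zip(w, grad)]

print("gradient:", grad)
print("loss before:", loss(w, x, y))
print("loss after: ", loss(w_new, x, y))
```

With a suitably small learning rate, the loss after the step is lower than before, which is the basic mechanism behind SGD.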

---

### 🔍 Strategy for Computing Partials

If:

$$
f(x, y, z) = (2x - 3y + 4z - 6)^2
$$

Then:

1. Let $a = 2x - 3y + 4z - 6$  
2. $f = a^2$  
3. Use chain rule:  
   $$
   \frac{\partial f}{\partial x} = 2a \cdot \frac{\partial a}{\partial x} = 4a
   $$  
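4. Same idea for $y$ and $z$: $\frac{\partial f}{\partial y} = 2a \cdot (-3) = -6a$ and $\frac{\partial f}{\partial z} = 2a \cdot 4 = 8a$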
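
The substitution strategy can be checked numerically. In this sketch, `grad_f` applies the chain rule with $a$ as the inner expression, and `numerical_grad` is an illustrative finite-difference helper evaluated at an arbitrary test point.

```python
def f(x, y, z):
    return (2 * x - 3 * y + 4 * z - 6) ** 2

def grad_f(x, y, z):
    a = 2 * x - 3 * y + 4 * z - 6
    # Chain rule: 2a times the coefficient of each variable inside a.
    return [2 * a * 2, 2 * a * (-3), 2 * a * 4]

def numerical_grad(func, point, h=1e-6):
    grads = []
    for i in range(len(point)):
        plus = list(point)
        plus[i] += h
        minus = list(point)
        minus[i] -= h
        grads.append((func(*plus) - func(*minus)) / (2 * h))
    return grads

point = [1.0, 2.0, 3.0]  # arbitrary test point
analytic = grad_f(*point)
numeric = numerical_grad(f, point)
print("analytic:", analytic)
print("numeric: ", numeric)
```

Each partial is the same outer factor $2a$ scaled by the coefficient of the corresponding variable, so the three entries stay in the ratio $2 : -3 : 4$.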

---

### 🧠 Practice Prompt

Let:

$$
f(x, y) = (3x + 4y - 5)^2
$$

Then:

$$
\frac{\partial f}{\partial x} = 6(3x + 4y - 5)
$$

$$
\frac{\partial f}{\partial y} = 8(3x + 4y - 5)
$$

You can distribute if you prefer:

$$
6(3x + 4y - 5) = 18x + 24y - 30
$$

$$
8(3x + 4y - 5) = 24x + 32y - 40
$$

---

### 🔄 Summary

| Concept | Formula | Notes |
|--------|---------|-------|
| Partial Derivative | $\frac{\partial f}{\partial x}$ | Derivative w.r.t one input |
| Gradient Vector | $\nabla f = [ \frac{\partial f}{\partial x_1}, \dots ]$ | All partials stacked |
| Loss Gradient | $\frac{\partial L}{\partial w} = 2(\hat{y} - y) \cdot x$ | Linear regression form |
| Direction | Opposite gradient = downhill | Used in SGD & Adam |
