# 🧠 Derivative Rules, Guidelines for Substitution, and Common Function Derivatives

This note summarizes essential differentiation rules for deep learning, when to use substitution (aka chain rule), and how to differentiate common functions like sigmoid and exponentials.

---

## 📘 Core Derivative Rules

| Rule             | Description                                | Formula |
|------------------|--------------------------------------------|---------|
| Constant Rule     | Derivative of a constant                   | $\frac{d}{dx}(c) = 0$ |
| Power Rule        | For $x^n$, pull the exponent down          | $\frac{d}{dx}(x^n) = nx^{n-1}$ |
| Sum Rule          | Derivative of a sum = sum of derivatives   | $\frac{d}{dx}(f + g) = f' + g'$ |
| Product Rule      | For multiplying functions                  | $(fg)' = f'g + fg'$ |
| Quotient Rule     | For dividing functions                     | $\left( \frac{f}{g} \right)' = \frac{f'g - fg'}{g^2}$ |
| Chain Rule        | For composite functions $f(g(x))$          | $\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$ |

---

## 🧠 Guidelines for Substitution / Chain Rule

Use substitution (and chain rule) when:
- You're differentiating a **composite function**
- The expression includes **something more complex than plain $x$**
- Example: $e^{-x}$, $\log(1 + x^2)$, $\tanh(x^3)$

**Key principle:**  
If the argument of a function is anything other than plain $x$ (an inner function $u(x)$), differentiate the outer function with respect to $u$ and multiply by $u'(x)$.

---

### 💡 Chain Rule in Action (with $e^{-x}$)

Let:
- $f(x) = e^{-x}$
- Think of this as $f(x) = e^{u(x)}$, where $u(x) = -x$

Then:
$$
\frac{d}{dx}(e^{-x}) = e^{-x} \cdot \frac{d}{dx}(-x) = e^{-x} \cdot (-1) = -e^{-x}
$$

✅ The negative sign comes from the derivative of the inner function ($-x$).
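
A quick numerical sanity check of this result, as a minimal sketch. The helper names (`f`, `analytic_derivative`, `central_difference`) and the step size `h` are illustrative choices, not part of the note. The chain-rule answer should agree with a finite-difference estimate.

```python
import math

def f(x):
    # Composite function: outer exp(u), inner u(x) = -x
    return math.exp(-x)

def analytic_derivative(x):
    # Chain rule result: e^{-x} * (-1)
    return -math.exp(-x)

def central_difference(func, x, h=1e-5):
    # Symmetric finite-difference approximation of func'(x)
    return (func(x + h) - func(x - h)) / (2 * h)

for x in (-1.0, 0.0, 2.0):
    print(x, analytic_derivative(x), central_difference(f, x))
```

The two columns should match to many decimal places. Dropping the `-1` factor from the inner function would flip the sign and the mismatch would be obvious.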

---

## 🔢 Derivatives of Common Functions

### 🔷 Polynomials:
- $\frac{d}{dx}(x^2) = 2x$
- $\frac{d}{dx}(x^n) = nx^{n-1}$

### 🔷 Exponentials:
- $\frac{d}{dx}(e^x) = e^x$
- $\frac{d}{dx}(e^{-x}) = -e^{-x}$
- $\frac{d}{dx}(a^x) = a^x \log a$ (for $a > 0$, with $\log$ the natural log)

### 🔷 Logarithms (natural log):
- $\frac{d}{dx}(\log x) = \frac{1}{x}$

### 🔷 Trigonometric (FYI only):
- $\frac{d}{dx}(\sin x) = \cos x$
- $\frac{d}{dx}(\cos x) = -\sin x$
- $\frac{d}{dx}(\tan x) = \sec^2 x$

### 🔷 Hyperbolic / Neural Net Activations:
- $\frac{d}{dx}(\tanh x) = 1 - \tanh^2 x$
- $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid
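
The sigmoid result follows from the power and chain rules, assuming the standard logistic definition $\sigma(x) = (1 + e^{-x})^{-1}$:

$$
\sigma'(x) = -(1 + e^{-x})^{-2} \cdot \frac{d}{dx}(1 + e^{-x})
= -(1 + e^{-x})^{-2} \cdot (-e^{-x})
= \frac{e^{-x}}{(1 + e^{-x})^2}
$$

Splitting the fraction and using $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$:

$$
\frac{e^{-x}}{(1 + e^{-x})^2}
= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
= \sigma(x)\,(1 - \sigma(x))
$$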

---

## ✅ Summary

- Derivative rules = tools. Chain rule = glue.
- Substitution helps identify when to apply chain rule
- Even simple expressions like $e^{-x}$ are composite under the hood
- Practice rewriting and differentiating in baby steps with annotations

---


![Activation Functions](activations.png)

## 🔍 Intuition Behind Sigmoid vs Tanh in Deep Learning

Understanding activation functions isn’t just about taking derivatives — it's about how they behave **during optimization**, especially for **gradient flow**, **convergence speed**, and **vanishing gradients**.

---

### 🔹 1. Output Range

| Function | Output Range | Zero-Centered? |
|----------|--------------|----------------|
| Sigmoid  | $(0, 1)$     | ❌ No          |
| Tanh     | $(-1, 1)$    | ✅ Yes         |

- **Why it matters:**  
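  With sigmoid, every activation is positive, so the gradients for all weights feeding one next-layer neuron share the same sign and updates zig-zag. Zero-centered outputs (like tanh) can be **positive or negative**, which makes weight updates more balanced and efficient.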

---

### 🔹 2. Vanishing Gradient Problem

Both functions **saturate** when input $x$ is very positive or negative.

| Function | Saturation Zones                  | Derivative Trend |
|----------|-----------------------------------|------------------|
| Sigmoid  | $x < -3$ or $x > 3$ → flattens    | $\approx 0.045$ at $x = \pm 3$, then $\to 0$ |
| Tanh     | $x < -3$ or $x > 3$ → flattens    | $\approx 0.010$ at $x = \pm 3$, then $\to 0$ |

- **Why it matters:**  
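  When pre-activations land in these zones, the local gradient is close to zero, so earlier layers receive almost no signal → **very slow training** in deeper networks. (Truly "dead" neurons are a ReLU phenomenon; sigmoid and tanh units saturate instead.)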

---

### 🔹 3. Gradient Strength

| Function | Max Derivative | Location     |
|----------|----------------|--------------|
| Sigmoid  | $0.25$         | At $x = 0$   |
| Tanh     | $1.0$          | At $x = 0$   |

- **Why it matters:**  
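  Backpropagated gradients are multiplied by the activation's derivative at every layer. Sigmoid scales them by at most $0.25$ per layer, while tanh passes them at up to full strength near $0$, so **tanh gives faster updates** in the core training zone.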
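
The peak values in the table, and how quickly each derivative falls off, can be checked directly. This is a small sketch using only the standard library; the function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # 1 - tanh^2(x)
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 3.0, 5.0):
    print(f"x={x:>4}: sigmoid'={sigmoid_grad(x):.4f}  tanh'={tanh_grad(x):.4f}")
```

At $x = 0$ this prints the maxima ($0.25$ and $1.0$). By $x = 3$ both derivatives are small, with tanh's falling off faster relative to its peak.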

---

### 🔹 4. Use Cases in Deep Learning

| Use Case                        | Activation |
|----------------------------------|------------|
| Binary classification output     | Sigmoid    |
| Multiclass classification output | Softmax    |
| Hidden layers (historically)     | Tanh       |
| Modern hidden layers             | ReLU       |

- **Why tanh over sigmoid in hidden layers?**  
  It's zero-centered and provides stronger gradients.

---

### 🔹 5. Why ReLU Took Over

While tanh and sigmoid are still useful:
- **ReLU** doesn’t saturate for $x > 0$
- Its derivative is exactly $1$ for active units, so gradients are not shrunk on the way back
- That makes deep networks easier to optimize

But:
- **Sigmoid** is still used in output layers
- **Sigmoid** (gates) and **tanh** (candidate and output values) still power LSTM/GRU cells

---

### 📊 Visual Summary

#### Sigmoid & Tanh

- Sigmoid squashes input to $(0, 1)$, flattens out at extremes  
- Tanh squashes to $(-1, 1)$, symmetric and zero-centered

#### Their Derivatives

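- **Sigmoid Derivative** peaks at $0.25$ and decays gradually (still $\approx 0.045$ at $|x| = 3$)  
- **Tanh Derivative** peaks at $1$ but is narrower around the center ($\approx 0.010$ at $|x| = 3$)  
- Both are close to zero for $|x| > 3$ and keep shrinking beyond that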


## 🔥 Activation Functions + Softmax + Cross-Entropy — Deep Learning Intuition

---

### 🔹 ReLU (Rectified Linear Unit)

**Formula:**

$$
\text{ReLU}(x) = \max(0, x)
$$

$$
\text{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

**Intuition:**
- Keep positive values, zero out negatives.
- Fast to compute, no saturation in the positive range.

**Issues:**
- "Dead neurons" — if a neuron gets stuck negative, it may never recover.

---

### 🔹 Leaky ReLU

**Formula:**

$$
\text{LeakyReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
\quad\text{where } \alpha \approx 0.01 \text{ or } 0.1
$$

**Derivative:**

$$
\text{LeakyReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
\alpha & \text{if } x \leq 0
\end{cases}
$$

**Fixes:**
- Lets small gradients pass through for $x < 0$ → avoids dead neurons.
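
A minimal scalar sketch of both activations and their derivatives. The default `alpha=0.01` and the convention of using the $x \leq 0$ branch at exactly $x = 0$ follow the formulas above; the function names are illustrative.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention from the formula above: gradient 0 at x = 0
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # Small but nonzero slope on the negative side
    return 1.0 if x > 0 else alpha

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For a negative input, `relu_grad` returns exactly zero (no update signal), while `leaky_relu_grad` returns `alpha`, which is what keeps the neuron trainable.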

---

### 🔹 GELU (Gaussian Error Linear Unit)

**Exact Formula:**

$$
\text{GELU}(x) = x \cdot \Phi(x)
= x \cdot \frac{1}{2} \left[ 1 + \text{erf}\left( \frac{x}{\sqrt{2}} \right) \right]
$$

**Derivative (Exact):**

$$
\frac{d}{dx} \text{GELU}(x) = \Phi(x) + x \cdot \phi(x)
\quad\text{where } \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
$$

**Approximate Formula (Used in practice):**

$$
\text{GELU}(x) \approx 0.5x \left[ 1 + \tanh\left( \sqrt{\frac{2}{\pi}}(x + 0.044715x^3) \right) \right]
$$
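
To see how close the approximation is, here is a sketch that evaluates both formulas plus the exact derivative with the standard library (`math.erf` supplies the error function). Function names are illustrative.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation from the formula above
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_grad_exact(x):
    # Product rule: Phi(x) + x * phi(x)
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  "
          f"tanh={gelu_tanh(x):+.5f}  grad={gelu_grad_exact(x):+.5f}")
```

The two forward values should agree closely across this range, and the gradient at $x = 0$ is $\Phi(0) = 0.5$.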

---

### 🔹 Softmax + Cross-Entropy

#### Softmax:

Given a logits vector $\mathbf{z}$, softmax converts it to probabilities:

$$
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

---

#### Cross-Entropy Loss:

Given a one-hot true label $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}} = \text{softmax}(\mathbf{z})$:

$$
\text{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_i y_i \log(\hat{y}_i)
= -\log(\hat{y}_{\text{true class}})
$$

---

#### Combined Derivative:

$$
\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i
$$

This is why frameworks like PyTorch have `nn.CrossEntropyLoss` take raw logits and targets directly: softmax and log are applied inside the loss.

---

### 🧠 Visual Intuitions

- Softmax with larger gaps between logits → more confident predictions (sharper output)
- Cross-entropy loss:
  - Low when $\hat{y}_{\text{true}} \approx 1$
  - Very high when $\hat{y}_{\text{true}} \approx 0$


## 🔥 Activation Functions + Softmax + Cross-Entropy — Deep Learning Intuition (With Math)

---

### 🔹 ReLU (Rectified Linear Unit)

**Formula:**

$$
\text{ReLU}(x) = \max(0, x)
$$

**Derivative:**

$$
\text{ReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}
$$

**Intuition:**
- Outputs the input directly if it's positive, else outputs 0.
- Introduces non-linearity while maintaining simplicity.
- **No gradient saturation** in the positive region → keeps gradients alive.

**Drawback:**
- Neurons can "die": if the pre-activation stays in the $x \leq 0$ region for every input, the gradient is zero and the weights stop updating.

---

### 🔹 Leaky ReLU

**Formula:**

$$
\text{LeakyReLU}(x) =
\begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
\quad\text{where } \alpha \in [0.01, 0.1]
$$

**Derivative:**

$$
\text{LeakyReLU}'(x) =
\begin{cases}
1 & \text{if } x > 0 \\
\alpha & \text{if } x \leq 0
\end{cases}
$$

**Intuition:**
- Small negative slope instead of flat zero.
- Allows small gradient when $x < 0$ → avoids dead neurons.
- Often used in GANs or deep CNNs for better convergence.

---

### 🔹 GELU (Gaussian Error Linear Unit)

**Exact Formula:**

$$
\text{GELU}(x) = x \cdot \Phi(x)
$$

Where:

$$
\Phi(x) = \frac{1}{2} \left[ 1 + \text{erf}\left( \frac{x}{\sqrt{2}} \right) \right]
$$

**Derivative (Exact)(Product Rule):**

$$
\frac{d}{dx} \text{GELU}(x) = \Phi(x) + x \cdot \phi(x)
$$

Where:

$$
\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}
$$

**Approximate Formula (used in practice):**

$$
\text{GELU}(x) \approx 0.5x \left[ 1 + \tanh\left( \sqrt{\frac{2}{\pi}}(x + 0.044715x^3) \right) \right]
$$

**Intuition:**
- Smoothed version of ReLU that uses probability weighting.
- Weighs inputs by their likelihood of being positive under a standard normal distribution.
- No hard threshold → smoother gradient flow.

**Use case:**
- Used in many Transformer models (e.g. BERT, GPT).
- Reported to help convergence and generalization in large-scale deep learning.

---

### 🔹 Softmax + Cross-Entropy

---

#### ✅ Softmax Function

Given raw logits vector:

$$
\mathbf{z} = [z_1, z_2, \dots, z_n]
$$

Softmax converts it to a probability distribution:

$$
\hat{y}_i = \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

- Output: $0 < \hat{y}_i < 1$
- Ensures: $\sum_i \hat{y}_i = 1$
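
A minimal softmax sketch in plain Python. One assumption beyond the formula above: the maximum logit is subtracted before exponentiating. This is a common implementation trick that cancels in the ratio, so the result is unchanged, but it keeps `exp()` from overflowing.

```python
import math

def softmax(logits):
    # Shift by the max: every exponent is <= 0, so exp() cannot overflow
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Large logits are fine with the shift; a naive math.exp(1000.0) would overflow
print(softmax([1000.0, 999.0, 998.0]))
```

Each output lies strictly between 0 and 1 and the outputs sum to 1, matching the two properties listed above.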

---

#### ✅ Cross-Entropy Loss

Given one-hot encoded true label $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}}$, the cross-entropy loss is:

$$
\text{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_i y_i \log(\hat{y}_i)
$$

Since only one $y_i = 1$, this simplifies to:

$$
\text{Loss} = -\log(\hat{y}_{\text{true class}})
$$

---

#### ✅ Log-Softmax Identity

Softmax followed by log simplifies to:

$$
\log(\text{softmax}(z_i)) = z_i - \log\left( \sum_j e^{z_j} \right)
$$

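Used in practice as `log_softmax()` for **numerical stability**: stable implementations subtract $\max_j z_j$ inside the log-sum-exp so the exponentials cannot overflow, and the log is never applied to a probability that has underflowed to zero.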
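
A sketch of the identity in code. The max shift inside `log_sum_exp` is a standard stabilization trick added here as an assumption; it is not spelled out in the identity itself. The names `log_sum_exp` and `log_softmax` are illustrative and mirror the math, not any specific library's internals.

```python
import math

def log_sum_exp(logits):
    # log(sum_j exp(z_j)) computed as m + log(sum_j exp(z_j - m))
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def log_softmax(logits):
    # z_i - log(sum_j exp(z_j)) for every i
    lse = log_sum_exp(logits)
    return [z - lse for z in logits]

print(log_softmax([2.0, 1.0, 0.1]))

# Stays finite here; softmax-then-log would hit exp overflow or log(0)
print(log_softmax([1000.0, 0.0]))
```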

---

#### ✅ Derivative of Softmax + Cross-Entropy

Combined, the gradient becomes extremely clean:

$$
\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i
$$

- Just the **difference between predicted and actual class**
- Avoids needing to separately backprop through softmax and log
- Efficient and stable, which is why frameworks such as PyTorch and TensorFlow provide it as a **single combined op** that takes logits
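
The formula can be checked numerically. The sketch below compares $\hat{y} - y$ against a finite-difference gradient of the loss. The logits, the target index, and all function names are made up for illustration.

```python
import math

def cross_entropy_from_logits(logits, target):
    # L = -z_target + log(sum_j exp(z_j)), with a max shift for stability
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

def analytic_grad(logits, target):
    # softmax(z) - one_hot(target)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total - (1.0 if i == target else 0.0)
            for i, e in enumerate(exps)]

def numeric_grad(logits, target, h=1e-6):
    # Central differences, one logit at a time
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += h
        down[k] -= h
        diff = (cross_entropy_from_logits(up, target)
                - cross_entropy_from_logits(down, target))
        grads.append(diff / (2 * h))
    return grads

z, t = [0.5, -1.2, 2.0, 0.3], 2
print(analytic_grad(z, t))
print(numeric_grad(z, t))
```

The two printed lists should agree to several decimal places, and the analytic gradient sums to zero because both $\hat{\mathbf{y}}$ and $\mathbf{y}$ sum to 1.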

---

### 🧠 Visual Insights

- **Softmax**:
  - With small logit differences → soft probability distribution
  - With large logit differences → sharp confidence spike

- **Cross-Entropy**:
  - Loss is **low** when the predicted probability for the true class is **close to 1**
  - Loss is **high** when the model is confident **but wrong**


## 🔥 Log-Softmax: Full Derivation + Gradient

---

### 🔹 1. Recall Softmax

Given logits:

$$
z = [z_1, z_2, \dots, z_n]
$$

The softmax function is:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

---

### 🔹 2. Derive Log-Softmax

Take the log of softmax:

$$
\log(\hat{y}_i) = \log\left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
= z_i - \log \left( \sum_j e^{z_j} \right)
$$

This is the **log-softmax identity**:

$$
\boxed{
\log(\text{softmax}(z_i)) = z_i - \log\left( \sum_j e^{z_j} \right)
}
$$

---

### 🔹 3. Log-Softmax as a Function

Let’s define:

$$
\ell_i = \log(\text{softmax}(z_i)) = z_i - \log \left( \sum_j e^{z_j} \right)
$$

We want to compute the derivative:

$$
\frac{\partial \ell_i}{\partial z_k}
$$

---

### 🔸 Case 1: $i = k$


$$
\frac{\partial \ell_i}{\partial z_i} =
\frac{\partial}{\partial z_i} \left( z_i - \log \sum_j e^{z_j} \right)
$$


Split it:

- First term: $\frac{\partial z_i}{\partial z_i} = 1$
- Second term:

$$
\frac{\partial}{\partial z_i} \log \left( \sum_j e^{z_j} \right)
= \frac{e^{z_i}}{\sum_j e^{z_j}} = \hat{y}_i
$$

So:

$$
\frac{\partial \ell_i}{\partial z_i} = 1 - \hat{y}_i
$$

---

### 🔸 Case 2: $i \ne k$

Only the second term depends on $z_k$:

$$
\frac{\partial \ell_i}{\partial z_k} = -\frac{\partial}{\partial z_k} \log \left( \sum_j e^{z_j} \right)
= -\frac{e^{z_k}}{\sum_j e^{z_j}} = -\hat{y}_k
$$

---

### 🔹 4. Final Result — Jacobian of Log-Softmax

$$
\frac{\partial \ell_i}{\partial z_k} =
\begin{cases}
1 - \hat{y}_i & \text{if } i = k \\
-\hat{y}_k & \text{if } i \ne k
\end{cases}
$$

This is the **Jacobian matrix** of log-softmax, and it's used in general backprop.
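
Both cases can be written compactly as $\frac{\partial \ell_i}{\partial z_k} = \delta_{ik} - \hat{y}_k$. A small sketch that builds this matrix for a sample logit vector follows; the function names and the sample logits are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax_jacobian(logits):
    # J[i][k] = d l_i / d z_k = (1 if i == k else 0) - y_hat[k]
    y_hat = softmax(logits)
    n = len(logits)
    return [[(1.0 if i == k else 0.0) - y_hat[k] for k in range(n)]
            for i in range(n)]

J = log_softmax_jacobian([2.0, 1.0, 0.1])
for row in J:
    print([round(v, 3) for v in row])
```

Every row sums to zero, since $\sum_k (\delta_{ik} - \hat{y}_k) = 1 - 1 = 0$. That reflects the fact that adding the same constant to all logits leaves log-softmax unchanged.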

---

### 🔹 5. Cross-Entropy with Log-Softmax (Combined)

Cross-entropy loss:

$$
L = -\sum_i y_i \log(\hat{y}_i)
= -\sum_i y_i \cdot \ell_i
$$

Now take derivative of $L$ w.r.t. logits $z_k$:

$$
\frac{\partial L}{\partial z_k}
= -\sum_i y_i \cdot \frac{\partial \ell_i}{\partial z_k}
$$

Now plug in the two cases from above:

- When $i = k$: contributes $y_k (1 - \hat{y}_k)$
- When $i \ne k$: contributes $y_i (-\hat{y}_k)$

So total derivative:

$$
\frac{\partial L}{\partial z_k} =
- \left( y_k (1 - \hat{y}_k) + \sum_{i \ne k} y_i (-\hat{y}_k) \right)
$$

Factor out $\hat{y}_k$:

$$
= -y_k (1 - \hat{y}_k) + \hat{y}_k \sum_{i \ne k} y_i
$$

Since $\sum_i y_i = 1$, we know $\sum_{i \ne k} y_i = 1 - y_k$. Substituting, the $y_k \hat{y}_k$ terms cancel:

$$
\frac{\partial L}{\partial z_k} =
-y_k + y_k \hat{y}_k + \hat{y}_k (1 - y_k) = -y_k + \hat{y}_k
$$

Rewritten:

$$
\boxed{
\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k
}
$$

---

### ✅ Final Takeaway

- This is why softmax + cross-entropy are **usually combined** into one operation
- The gradient simplifies to:

$$
\nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y}
$$

- No need to manually backprop through softmax or log
- Frameworks like PyTorch rely on this simplification when `nn.CrossEntropyLoss` is called on raw logits and class labels

---

### 📌 Summary

| Component | Expression |
|-----------|------------|
| Log-Softmax | $\log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j}$ |
| $\frac{\partial \ell_i}{\partial z_k}$ | $1 - \hat{y}_i$ if $i=k$, else $-\hat{y}_k$ |
| Cross-Entropy | $L = -\sum_i y_i \log(\hat{y}_i)$ |
| Final Gradient | $\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$ |



## 🧠 Log of Softmax — What It Actually Means (Step-by-Step Breakdown)

---

### 🔹 Goal:

Understand the identity:

$$
\log(\text{softmax}(z_i)) = z_i - \log \left( \sum_j e^{z_j} \right)
$$

---

### 🧪 Example Logits:

Let:

$$
z = [2.0, 1.0, 0.1]
$$

---

### ✅ Step 1: Compute Softmax

The softmax formula is:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

Numerically:

- $e^2 \approx 7.39$
- $e^1 \approx 2.72$
- $e^{0.1} \approx 1.105$

So:

$$
\sum_j e^{z_j} \approx 7.39 + 2.72 + 1.105 = 11.215
$$

Softmax outputs:

- $\hat{y}_0 = \frac{7.39}{11.215} \approx 0.659$
- $\hat{y}_1 = \frac{2.72}{11.215} \approx 0.242$
- $\hat{y}_2 = \frac{1.105}{11.215} \approx 0.099$

---

### ✅ Step 2: Take the Log of Softmax

Now we compute:

$$
\log(\text{softmax}(z_i)) = \log \left( \frac{e^{z_i}}{\sum_j e^{z_j}} \right)
= \log(e^{z_i}) - \log\left( \sum_j e^{z_j} \right)
= z_i - \log \sum_j e^{z_j}
$$

---

### 🔍 Numerical Example for $z_0 = 2.0$

$$
\log(\text{softmax}(2.0)) = 2.0 - \log(11.215) \approx 2.0 - 2.417 = -0.417
$$

Other entries:

- $\log(\text{softmax}(1.0)) \approx 1.0 - 2.417 = -1.417$
- $\log(\text{softmax}(0.1)) \approx 0.1 - 2.417 = -2.317$
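
The numbers above can be reproduced in a few lines. This sketch computes log-softmax both ways (log applied to softmax, and the identity) for the same example logits; variable names are illustrative.

```python
import math

z = [2.0, 1.0, 0.1]

exps = [math.exp(v) for v in z]                         # e^{z_i}
total = sum(exps)                                       # sum_j e^{z_j}
probs = [e / total for e in exps]                       # softmax
log_probs_direct = [math.log(p) for p in probs]         # log applied to softmax
log_probs_identity = [v - math.log(total) for v in z]   # z_i - log(sum)

print([round(p, 3) for p in probs])
print([round(lp, 3) for lp in log_probs_direct])
print([round(lp, 3) for lp in log_probs_identity])
```

The last two lines print the same values, which is the identity in action.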

---

### 🧨 Common Mistake Explained

If you ended up with $2.417$ instead of $-0.417$:

$$
2 - (-0.417) = 2.417 \quad \text{(not the log-softmax)}
$$

You probably rearranged the identity and computed:

$$
z_0 - \log(\text{softmax}(z_0)) = 2 - \log(0.659) \approx 2 - (-0.417) = 2.417
$$

That quantity is $\log \sum_j e^{z_j}$, the normalizing term, **not** the log-softmax. Log-softmax subtracts this term from the logit: $2 - 2.417 = -0.417$.

---

### 🧠 Final Takeaway:

To compute log-softmax **correctly**:

$$
\log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j}
$$

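This form is:
- **Numerically stable** once the maximum logit is subtracted inside the log-sum-exp
- **Mathematically equivalent** to taking the log of softmax
- **Available as a built-in in major deep learning frameworks** (e.g., `log_softmax()` in PyTorch)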

---

## 🔥 Combined Log-Softmax + Cross-Entropy — Why Gradient is $\hat{y} - y$

---

### 🔹 1. What Happens Separately

Logits:

$$
z = [2.0, 1.0, 0.1]
$$

Softmax:

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}} \Rightarrow \hat{y} \approx [0.659, 0.242, 0.099]
$$

Cross-entropy loss with true label $y = [1, 0, 0]$:

$$
L = -\sum_i y_i \log(\hat{y}_i) = -\log(0.659) \approx 0.417
$$

---

### 🔹 2. What Happens When Combined

Instead of computing softmax and log separately:

$$
\log(\text{softmax}(z_i)) = z_i - \log \sum_j e^{z_j}
$$

So cross-entropy becomes:

$$
L = -\sum_i y_i \cdot \left(z_i - \log \sum_j e^{z_j}\right)
= -\sum_i y_i z_i + \log \sum_j e^{z_j}
$$
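
As a quick arithmetic check with the example logits $z = [2.0, 1.0, 0.1]$ and true label $y = [1, 0, 0]$, the combined form gives the loss straight from the logits:

$$
L = -z_0 + \log \sum_j e^{z_j}
\approx -2.0 + \log(7.39 + 2.72 + 1.105)
\approx -2.0 + 2.417 = 0.417
$$

This matches $-\log(\hat{y}_0)$ with $\hat{y}_0 = 7.39 / 11.215 \approx 0.659$.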

---

### 🔹 3. Derivative of Cross-Entropy w.r.t. Logits $z_k$

Take derivative of:

$$
L = -\sum_i y_i z_i + \log \sum_j e^{z_j}
$$

Split into two parts:

#### (1) First term:

$$
\frac{\partial}{\partial z_k} \left( -\sum_i y_i z_i \right) = -y_k
$$

#### (2) Second term:

$$
\frac{\partial}{\partial z_k} \left( \log \sum_j e^{z_j} \right) = \frac{e^{z_k}}{\sum_j e^{z_j}} = \hat{y}_k
$$

---

### ✅ Final Result:

Add both parts:

$$
\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k
$$

This gives you **clean, simple gradients** straight from raw logits, with no need to apply softmax manually first.
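
For the running example ($z = [2.0, 1.0, 0.1]$, $y = [1, 0, 0]$), with softmax values $\hat{y} \approx [0.659, 0.242, 0.099]$ obtained from $e^{z_i} / \sum_j e^{z_j}$, the gradient works out to:

$$
\nabla_{\mathbf{z}} L = \hat{\mathbf{y}} - \mathbf{y}
\approx [0.659 - 1,\; 0.242 - 0,\; 0.099 - 0]
= [-0.341,\; 0.242,\; 0.099]
$$

The true-class logit gets a negative gradient (so gradient descent raises it), the other logits get positive gradients (so they are lowered), and the components sum to zero.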

---

### 📌 Summary

| Concept                     | Formula |
|----------------------------|---------|
| Softmax                    | $\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ |
| Log-Softmax                | $\log(\hat{y}_i) = z_i - \log \sum_j e^{z_j}$ |
| Cross-Entropy Loss         | $L = -\sum_i y_i \log(\hat{y}_i)$ |
| Combined Derivative        | $\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$ |

---

