# 🧠 Deep Learning Mathematics Cheat Sheet

---

## ✅ 1. Derivative Rules

- **Power Rule:**  
  $$ \frac{d}{dx} x^n = nx^{n-1} $$

- **Product Rule:**  
  $$ \frac{d}{dx}[uv] = u'v + uv' $$

- **Quotient Rule:**  
  $$ \frac{d}{dx} \left( \frac{u}{v} \right) = \frac{u'v - uv'}{v^2} $$

- **Chain Rule (single variable):**  
  $$ \frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x) $$
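
The rules above can be sanity-checked numerically. The sketch below compares the power and chain rules against a central-difference estimate; the helper name `numerical_derivative`, the test point, and the step size `eps` are illustrative assumptions, not part of the cheat sheet.

```python
def numerical_derivative(f, x, eps=1e-6):
    """Central-difference estimate of f'(x) (illustrative helper)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 2.0

# Power rule: d/dx x^3 = 3x^2
power_numeric = numerical_derivative(lambda t: t ** 3, x)
power_analytic = 3 * x ** 2

# Chain rule: d/dx (3x + 1)^2 = 2(3x + 1) * 3
chain_numeric = numerical_derivative(lambda t: (3 * t + 1) ** 2, x)
chain_analytic = 2 * (3 * x + 1) * 3

print(power_numeric, power_analytic)
print(chain_numeric, chain_analytic)
```

The numeric and analytic values should agree to several decimal places; a large gap usually points to a mistake in the hand-derived formula.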

---

## ✅ 2. Common Function Derivatives

- $$ \frac{d}{dx} \sin(x) = \cos(x) $$
- $$ \frac{d}{dx} \cos(x) = -\sin(x) $$
- $$ \frac{d}{dx} e^x = e^x $$
- $$ \frac{d}{dx} \log(x) = \frac{1}{x} $$ (natural logarithm, $$ x > 0 $$)

---

## ✅ 3. Activation Function Derivatives

- **Sigmoid:**  
  $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
  $$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$

- **Tanh:**  
  $$ \frac{d}{dx} \tanh(x) = 1 - \tanh^2(x) $$

- **ReLU:**  
  $$ \text{ReLU}(x) = \max(0, x) $$
  $$ \text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} $$
  > Strictly, ReLU is not differentiable at $$ x = 0 $$; using 0 there is a convention.

- **GELU (approx):**  
  $$ \text{GELU}(x) \approx 0.5x \left(1 + \tanh\left( \sqrt{\frac{2}{\pi}}(x + 0.044715x^3) \right) \right) $$
  > The derivative has a closed form but is lengthy, so it is usually left to automatic differentiation.
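
As a minimal sketch, the sigmoid, tanh, and ReLU derivatives above can be written as scalar functions and checked against a finite difference. The function names are illustrative, and the plain `math.exp(-x)` form of sigmoid is assumed to be called only on moderate inputs (it overflows for very negative `x`).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # 1 for x > 0, 0 otherwise (including x == 0, as in the formula above)
    return 1.0 if x > 0 else 0.0

def finite_diff(f, x, eps=1e-6):
    # Central-difference estimate used only as a check
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x0 = 1.3
sigmoid_error = abs(sigmoid_grad(x0) - finite_diff(sigmoid, x0))
tanh_error = abs(tanh_grad(x0) - finite_diff(math.tanh, x0))
print(sigmoid_error, tanh_error)
```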

---

## ✅ 4. Partial Derivatives

- For $$ f(x, y) = (3x + 4y - 5)^2 $$:
  - $$ \frac{\partial f}{\partial x} = 6(3x + 4y - 5) $$
  - $$ \frac{\partial f}{\partial y} = 8(3x + 4y - 5) $$

- **Gradient Vector:**  
  $$ \nabla f = \left[ \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \dots \right] $$
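
To illustrate, the partial derivatives of the example function can be collected into a gradient and compared with finite differences. The helper `numerical_grad` and the evaluation point `(1, 1)` are assumptions chosen for the demonstration.

```python
def f(x, y):
    return (3 * x + 4 * y - 5) ** 2

def grad_f(x, y):
    # Analytic partials: df/dx = 6r, df/dy = 8r, with r = 3x + 4y - 5
    r = 3 * x + 4 * y - 5
    return (6 * r, 8 * r)

def numerical_grad(func, x, y, eps=1e-6):
    # Perturb one input at a time, holding the other fixed
    dfdx = (func(x + eps, y) - func(x - eps, y)) / (2 * eps)
    dfdy = (func(x, y + eps) - func(x, y - eps)) / (2 * eps)
    return (dfdx, dfdy)

analytic = grad_f(1.0, 1.0)       # r = 2, so the gradient is (12, 16)
numeric = numerical_grad(f, 1.0, 1.0)
print(analytic, numeric)
```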

---

## ✅ 5. Chain Rule (Multivariable / Backprop)

For the composite function:

$$
L = (\hat{y} - y)^2,\quad \hat{y} = h^2,\quad h = w_1x + b
$$

Then:

$$
\frac{dL}{dw_1} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dh} \cdot \frac{dh}{dw_1} = 2(\hat{y} - y) \cdot 2h \cdot x
$$
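
The same chain can be traced by hand in code. This is a minimal sketch of one forward and backward pass for the scalar model above; the function name `forward_backward` and the input values are illustrative, and the bias gradient is an addition that uses the same chain with $$ \frac{dh}{db} = 1 $$.

```python
def forward_backward(w1, b, x, y):
    # Forward pass
    h = w1 * x + b
    y_hat = h ** 2
    loss = (y_hat - y) ** 2

    # Backward pass: multiply local derivatives from the loss back to w1
    dL_dyhat = 2 * (y_hat - y)
    dyhat_dh = 2 * h
    dL_dh = dL_dyhat * dyhat_dh
    dL_dw1 = dL_dh * x      # dh/dw1 = x
    dL_db = dL_dh * 1.0     # dh/db = 1 (not in the formula above)
    return loss, dL_dw1, dL_db

# With w1 = 0.5, b = 1, x = 2, y = 3: h = 2, y_hat = 4, loss = 1
loss, dL_dw1, dL_db = forward_backward(0.5, 1.0, 2.0, 3.0)
print(loss, dL_dw1, dL_db)
```

Here `dL_dw1` works out to 2(4 - 3) · 2(2) · 2 = 16, matching the closed-form expression.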

---

## ✅ 6. Softmax Function

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

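- Output is a probability distribution: every $$ \hat{y}_i > 0 $$ and $$ \sum_i \hat{y}_i = 1 $$
- Sensitive to large logits: $$ e^{z_i} $$ overflows floating point for large $$ z_i $$, so implementations subtract $$ \max(z) $$ from every logit first, which leaves the output unchanged
- Used in the final layer of multi-class classification models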
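
A minimal sketch of softmax on a plain Python list, assuming the common practice of subtracting the largest logit before exponentiating (the shift cancels in the ratio, so the result is unchanged):

```python
import math

def softmax(z):
    # Shift by the max so every exponent is <= 0 and math.exp cannot overflow
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 1000.0])   # math.exp(1000.0) alone would overflow
print(probs, sum(probs))
print(large)
```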

---

## ✅ 7. Log-Softmax Identity

$$
\log(\text{softmax}(z_i)) = z_i - \log \left( \sum_j e^{z_j} \right)
$$

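- Helps with numerical stability: avoids taking $$ \log $$ of a probability that has underflowed to 0
- Efficient loss computation: frameworks typically fuse it into their cross-entropy loss, which then takes raw logits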
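
To see why the identity matters, the sketch below compares it with the naive route of computing softmax first and taking the log afterwards. The function names and the logits `[0.0, -800.0]` are illustrative choices that force an underflow.

```python
import math

def log_softmax(z):
    # z_i - log(sum_j exp(z_j)), with the max subtracted inside the sum
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_sum for v in z]

def naive_log_softmax(z):
    # Computes the probabilities first, then their logs
    total = sum(math.exp(v) for v in z)
    return [math.log(math.exp(v) / total) for v in z]

z = [0.0, -800.0]
stable = log_softmax(z)
try:
    naive = naive_log_softmax(z)
except ValueError:
    # exp(-800) underflows to 0.0, and math.log(0.0) raises ValueError
    naive = None
print(stable, naive)
```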

---

## ✅ 8. Cross-Entropy Loss

Given:
- $$ y = [0, 1, 0] $$ (one-hot target)
- $$ \hat{y} = \text{softmax}(z) $$

Cross-entropy:

$$
L = -\sum_i y_i \log(\hat{y}_i) = -\log(\hat{y}_{\text{true class}})
$$
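
As a small worked example, the loss for the one-hot target above reduces to the negative log of the probability assigned to the true class. The predicted probabilities are made-up values for illustration, and terms with $$ y_i = 0 $$ are skipped under the usual convention $$ 0 \cdot \log 0 = 0 $$.

```python
import math

def cross_entropy(y, y_hat):
    # Only classes with nonzero target weight contribute to the sum
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)

y = [0, 1, 0]
y_hat = [0.1, 0.7, 0.2]          # illustrative softmax output
loss = cross_entropy(y, y_hat)
expected = -math.log(0.7)        # -log of the true-class probability
print(loss, expected)
```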

---

## ✅ 9. Gradient of Cross-Entropy w.r.t. Logits

$$
\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k
$$

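- Comes from combining softmax with cross-entropy, assuming $$ \sum_i y_i = 1 $$
- Follows by differentiating $$ L = -\sum_i y_i z_i + \log \sum_j e^{z_j} $$, which is the loss after substituting the log-softmax identity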
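
The result can be checked numerically. This sketch compares $$ \hat{y} - y $$ with a finite-difference gradient of the loss with respect to each logit; the helper names and the example logits are assumptions for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    # Cross-entropy of softmax(z) against target y
    return -sum(t * math.log(p) for t, p in zip(y, softmax(z)) if t > 0)

def grad_logits(z, y):
    # Analytic gradient: y_hat - y
    return [p - t for p, t in zip(softmax(z), y)]

def numerical_grad(z, y, eps=1e-6):
    grads = []
    for k in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[k] += eps
        z_minus[k] -= eps
        grads.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * eps))
    return grads

z = [2.0, 1.0, 0.1]
y = [0, 1, 0]
analytic = grad_logits(z, y)
numeric = numerical_grad(z, y)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_error)
```

The gradient entries sum to zero, because both $$ \hat{y} $$ and $$ y $$ sum to 1.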

---

## ✅ 10. Jacobian of Softmax

$$
\frac{\partial \hat{y}_i}{\partial z_j} =
\begin{cases}
\hat{y}_i(1 - \hat{y}_i) & \text{if } i = j \\
-\hat{y}_i \hat{y}_j & \text{if } i \neq j
\end{cases}
$$

Matrix form:

$$
J = \text{diag}(\hat{y}) - \hat{y} \hat{y}^\top
$$
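
A minimal sketch of the matrix form, built as a nested list so that entry `[i][j]` is $$ \hat{y}_i(\delta_{ij} - \hat{y}_j) $$. The function names and input are illustrative.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    # J[i][j] = p_i * (1 - p_i) on the diagonal, -p_i * p_j off it
    p = softmax(z)
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

z = [2.0, 1.0, 0.1]
p = softmax(z)
J = softmax_jacobian(z)
row_sums = [sum(row) for row in J]
print(row_sums)
```

Each row sums to (approximately) zero, since the outputs always sum to 1 and a change in one logit only moves probability between classes.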

---

## ✅ 11. Log-Sum-Exp Trick

To avoid overflow in:

$$
\log \left( \sum_j e^{z_j} \right)
$$

Use:

$$
\log \left( \sum_j e^{z_j} \right) = \max(z) + \log \left( \sum_j e^{z_j - \max(z)} \right)
$$

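- Prevents overflow: every shifted exponent $$ z_j - \max(z) $$ is at most 0
- Prevents underflow of the whole sum: the largest term is $$ e^0 = 1 $$, so the sum is at least 1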
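
The effect of the shift is easy to demonstrate. In the sketch below, the naive expression fails on large logits while the shifted version returns the expected value; the function name `logsumexp` and the inputs are illustrative.

```python
import math

def logsumexp(z):
    # max(z) + log(sum_j exp(z_j - max(z)))
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

z = [1000.0, 1000.0]
try:
    naive = math.log(sum(math.exp(v) for v in z))
except OverflowError:
    # math.exp(1000.0) is too large for a float
    naive = None

stable = logsumexp(z)   # equals 1000 + log(2) up to rounding
print(naive, stable)
```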

---

## ✅ 12. Final Summary: Backprop Essentials

| Component                  | Gradient Form                                                                            |
|----------------------------|------------------------------------------------------------------------------------------|
| Linear layer $$ z = Wx + b $$ | $$ \frac{dL}{dW} = \delta x^\top $$, with $$ \delta = \frac{dL}{dz} $$                |
| Activation (sigmoid, tanh) | Multiply the upstream gradient elementwise by the activation derivative                  |
| Softmax + Cross-Entropy    | $$ \frac{dL}{dz_k} = \hat{y}_k - y_k $$                                                  |
| Hidden layers              | Apply the multivariable chain rule backwards, layer by layer                             |
| Jacobian                   | Needed for vector-valued functions such as softmax                                       |
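
As a sketch of the linear-layer row, the weight gradient is the outer product of the upstream gradient and the input. The code assumes $$ \delta = \frac{dL}{dz} $$ for $$ z = Wx $$ (bias omitted) and uses a hypothetical loss $$ L = \frac{1}{2}\lVert Wx \rVert^2 $$, chosen only because it makes $$ \delta = z $$ easy to verify.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss(W, x):
    # Hypothetical loss for the check: L = 0.5 * ||W x||^2
    z = matvec(W, x)
    return 0.5 * sum(v * v for v in z)

def grad_W(W, x):
    # delta = dL/dz = z for this loss; dL/dW = outer(delta, x)
    delta = matvec(W, x)
    return [[d * xi for xi in x] for d in delta]

W = [[1.0, 2.0], [0.0, -1.0]]
x = [3.0, 1.0]
dW = grad_W(W, x)   # z = [5, -1], so dW = [[15, 5], [-3, -1]]

# Finite-difference check on a single weight, W[0][1]
eps = 1e-6
W_plus = [[1.0, 2.0 + eps], [0.0, -1.0]]
W_minus = [[1.0, 2.0 - eps], [0.0, -1.0]]
numeric_01 = (loss(W_plus, x) - loss(W_minus, x)) / (2 * eps)
print(dW, numeric_01)
```

Note that `dW` has the same shape as `W`, which is a quick way to catch a transposed outer product.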

---
