<a href="https://colab.research.google.com/github/karankulshrestha/ai-notebooks/blob/main/Neural_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Optimization?
The process of making your model learn by finding the parameters that minimize error.

# The Core Problem
Neural networks are non-convex functions, meaning their loss surface (the graph of loss vs. parameters) is bumpy and full of hills and valleys, not a smooth bowl shape.

There are many peaks and dips (local minima and maxima) instead of one single lowest point.

Key Challenges:

*   Local Minima vs Global Minima: Non-convex functions have many local minima
*   Saddle Points: More common than local minima in high dimensions
*   Poor Conditioning: Loss surface shaped like long, narrow valleys
*   Vanishing/Exploding Gradients: When gradients become very small, the weights barely change and the network learns very slowly or not at all. When gradients become very large, the weights change too much and training becomes unstable or blows up.
*   Cliffs: Extremely steep regions that can "catapult" parameters far away
*   Modern Reality: For large networks, most local minima are actually "good enough" - we don't need the global minimum


<img src="https://miro.medium.com/v2/format:webp/1*f9a162GhpMbiTVTAua_lLQ.png" alt="Loss surface of a neural network" width="500" height="400">


x-axis → θ₀

y-axis → θ₁
These are the parameters (weights) of your model.

z-axis → J(θ₀, θ₁)
This is the loss (cost function) — how wrong your model’s predictions are for that combination of parameters.

# Stochastic Gradient Descent (SGD)

<img src="https://julienharbulot.com/images/gradient-descent/gradient-vector.png" alt="Gradient vector illustration" width="500" height="400">

## How It Works
- **Sample:** Pick a small random batch of data  
- **Compute:** Calculate gradient (direction of steepest increase)  
- **Update:** Move parameters in the opposite direction (downhill)

### The Math
$$
\theta \leftarrow \theta - \varepsilon \times \nabla J(\theta)
$$
**Where:**  
- $\theta$ = parameters  
- $\varepsilon$ = learning rate (step size)  
- $\nabla J(\theta)$ = gradient (slope)
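
The three steps and the update rule above can be sketched in NumPy on a toy linear regression problem. The synthetic data, the batch size of 32, and the helper name `sgd_step` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): y = 3x + 2 plus a little noise
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=1000)

def sgd_step(theta, X_batch, y_batch, lr):
    """One SGD update for mean squared error on a linear model."""
    w, b = theta
    err = (w * X_batch[:, 0] + b) - y_batch
    # Gradient of mean(err^2) with respect to w and b
    grad = np.array([2 * np.mean(err * X_batch[:, 0]), 2 * np.mean(err)])
    return theta - lr * grad  # move against the gradient

theta = np.zeros(2)
for step in range(500):
    idx = rng.choice(len(X), size=32, replace=False)  # sample a mini-batch
    theta = sgd_step(theta, X[idx], y[idx], lr=0.1)

print(theta)  # should land close to the true values [3, 2]
```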

---

### Learning Rate Rules
- **Too High:** Oscillates wildly, may never converge  
- **Too Low:** Learns very slowly, may get stuck  
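- **Must Decrease Over Time:** Mini-batch gradients stay noisy even near a minimum, so the step size has to shrink for SGD to settle.  
  A simple schedule is linear decay, $\varepsilon_k = \varepsilon_0 \times (1 - k / \tau)$ for $k < \tau$, usually held at a small constant afterwards instead of reaching zero.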
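
A linear decay schedule can be sketched in a few lines. The function name `linear_decay`, the default values, and the floor `eps_min` (which keeps the rate from hitting exactly zero at step $\tau$) are assumptions for illustration.

```python
def linear_decay(k, eps0=0.1, tau=1000, eps_min=0.001):
    """Interpolate linearly from eps0 to eps_min over tau steps, then hold."""
    alpha = min(k / tau, 1.0)
    return (1 - alpha) * eps0 + alpha * eps_min

for k in (0, 500, 1000, 5000):
    print(k, linear_decay(k))
```

With `eps_min=0` and `k < tau` this reduces to $\varepsilon_0 (1 - k/\tau)$.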


## Momentum - The Speed Booster

**What It Does:** Helps SGD move faster through flat regions and narrow valleys

### How It Works:
- **Accumulate:** Keep track of past gradients (like velocity)
- **Accelerate:** Continue moving in the same direction if gradients point the same way
- **Dampen:** Slow down if gradients change direction

### The Math:
$$
\begin{align}
v &\leftarrow \alpha v - \varepsilon \nabla J(\theta) \quad \text{(velocity update)} \\
\theta &\leftarrow \theta + v \quad \text{(parameter update)}
\end{align}
$$

**Where:**
- $\alpha$ = momentum coefficient (usually 0.9)
- $v$ = velocity vector
- $\varepsilon$ = learning rate
- $\theta$ = parameters
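
A small sketch of the two-line update above, run on an ill-conditioned quadratic (a long, narrow valley). The test function, the learning rate of 0.015, and the helper name `run` are illustrative assumptions.

```python
import numpy as np

def grad(theta):
    # Gradient of J = 0.5 * (x^2 + 100 * y^2), a long narrow valley
    return np.array([1.0, 100.0]) * theta

def run(alpha, lr=0.015, steps=200):
    """Return the distance to the minimum at the origin after `steps` updates."""
    theta = np.array([10.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = alpha * v - lr * grad(theta)  # velocity update
        theta = theta + v                 # parameter update
    return np.linalg.norm(theta)

print("no momentum (alpha=0.0):", run(alpha=0.0))
print("momentum    (alpha=0.9):", run(alpha=0.9))
```

Setting `alpha=0.0` recovers plain gradient descent, so the two printed distances compare the same learning rate with and without momentum.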

### Physical Intuition:
Like a ball rolling down a hill — gains speed going downhill, slows when going uphill

### When to Use:
When SGD is too slow or gets stuck in narrow valleys

---
## AdaGrad - The Smart Scaler

**What It Does:** Adapts learning rate for each parameter individually

### How It Works:
- **History Track:** Keep running sum of squared gradients for each parameter
- **Smart Scaling:** Parameters with large gradients get smaller learning rates
- **Gentle Progress:** Parameters with small gradients keep larger learning rates

### The Math:
$$
\begin{align}
r &\leftarrow r + g \odot g \quad \text{(accumulate squared gradients)} \\
\Delta\theta &\leftarrow -\frac{\varepsilon}{\sqrt{r + \delta}} \odot g \quad \text{(adaptive update)} \\
\theta &\leftarrow \theta + \Delta\theta \quad \text{(parameter update)}
\end{align}
$$

**Where:**
- $r$ = accumulated squared gradients (element-wise)
- $g$ = current gradient $\nabla J(\theta)$
- $\varepsilon$ = global learning rate
- $\delta$ = small constant for numerical stability (typically $10^{-7}$)
- $\odot$ = element-wise multiplication
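
The AdaGrad update above, sketched in NumPy on a quadratic whose second coordinate is 100 times steeper than the first. The test function, the starting point, and the learning rate of 1.0 are assumptions chosen for illustration.

```python
import numpy as np

def adagrad(grad_fn, theta, lr=1.0, delta=1e-7, steps=500):
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        r = r + g * g                                # accumulate squared gradients
        theta = theta - lr / np.sqrt(r + delta) * g  # per-parameter scaled step
    return theta, r

def grad_fn(theta):
    # Gradient of J = 0.5 * (x^2 + 100 * y^2)
    return np.array([1.0, 100.0]) * theta

theta, r = adagrad(grad_fn, np.array([10.0, 2.0]))
effective_lr = 1.0 / np.sqrt(r + 1e-7)

print(theta)         # both coordinates end up near 0
print(effective_lr)  # the steep coordinate gets the smaller effective rate
```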

### When to Use:
Problems where different parameters need different learning rates

---
## RMSprop - The Adaptive Memory

**What It Does:** Like AdaGrad but uses exponential moving average instead of full history

### Why Better Than AdaGrad:
- **Problem:** AdaGrad accumulates all past squared gradients, so the effective learning rate keeps shrinking and can become too small
- **Solution:** RMSprop "forgets" old gradients using exponential decay

### The Math:
$$
\begin{align}
r &\leftarrow \rho r + (1-\rho) g \odot g \quad \text{(exponential moving average)} \\
\Delta\theta &\leftarrow -\frac{\varepsilon}{\sqrt{r + \delta}} \odot g \quad \text{(adaptive update)} \\
\theta &\leftarrow \theta + \Delta\theta \quad \text{(parameter update)}
\end{align}
$$

**Where:**
- $r$ = exponential moving average of squared gradients
- $g$ = current gradient $\nabla J(\theta)$
- $\rho$ = decay rate (typically 0.9)
- $\varepsilon$ = global learning rate
- $\delta$ = small constant for numerical stability (typically $10^{-7}$)
- $\odot$ = element-wise multiplication
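
To see the difference from AdaGrad, the sketch below feeds both accumulators the same constant gradient and compares the resulting effective step sizes. The constant gradient is an artificial assumption that isolates the accumulator behaviour.

```python
import numpy as np

lr, rho, delta = 0.01, 0.9, 1e-7
g = np.array([1.0])  # pretend the gradient stays constant (assumption)

r_ada = np.zeros(1)
r_rms = np.zeros(1)
for _ in range(1000):
    r_ada = r_ada + g * g                    # AdaGrad: the sum keeps growing
    r_rms = rho * r_rms + (1 - rho) * g * g  # RMSprop: moving average levels off

step_ada = lr / np.sqrt(r_ada + delta)
step_rms = lr / np.sqrt(r_rms + delta)
print("AdaGrad effective step:", step_ada)
print("RMSprop effective step:", step_rms)
```

AdaGrad's step keeps shrinking like $1/\sqrt{k}$, while RMSprop's settles near $\varepsilon / |g|$.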

### When to Use:
One of the go-to adaptive methods in deep learning, especially when AdaGrad's learning rate shrinks too quickly

---
## Adam - The Adaptive Champion

**What It Does:** Combines momentum and adaptive learning rates with bias correction

### Key Features:
- **First Moment:** Uses momentum-like term (exponential average of gradients)
- **Second Moment:** Uses adaptive learning rate (exponential average of squared gradients)
- **Bias Correction:** Fixes initialization bias in both moments

### The Math:
$$
\begin{align}
s &\leftarrow \rho_1 s + (1-\rho_1) g \quad \text{(first moment)} \\
r &\leftarrow \rho_2 r + (1-\rho_2) g \odot g \quad \text{(second moment)} \\
\hat{s} &\leftarrow \frac{s}{1-\rho_1^t} \quad \text{(bias correction)} \\
\hat{r} &\leftarrow \frac{r}{1-\rho_2^t} \quad \text{(bias correction)} \\
\Delta\theta &\leftarrow -\frac{\varepsilon}{\sqrt{\hat{r}} + \delta} \odot \hat{s} \quad \text{(update)} \\
\theta &\leftarrow \theta + \Delta\theta \quad \text{(parameter update)}
\end{align}
$$

**Where:**
- $s$ = first moment estimate (momentum)
- $r$ = second moment estimate (adaptive learning rate)
- $\hat{s}$ = bias-corrected first moment
- $\hat{r}$ = bias-corrected second moment
- $g$ = current gradient $\nabla J(\theta)$
- $t$ = timestep
- $\odot$ = element-wise multiplication

### Hyperparameters:
- $\rho_1 = 0.9$ (first moment decay)
- $\rho_2 = 0.999$ (second moment decay)
- $\varepsilon = 0.001$ (learning rate)
- $\delta = 10^{-8}$ (numerical stability constant)
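
A minimal NumPy sketch of the six update lines above, using the listed default hyperparameters. The quadratic test function and the names `adam`, `J`, and `grad_fn` are assumptions for illustration.

```python
import numpy as np

def adam(grad_fn, theta, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8, steps=1):
    s = np.zeros_like(theta)  # first moment
    r = np.zeros_like(theta)  # second moment
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        s = rho1 * s + (1 - rho1) * g
        r = rho2 * r + (1 - rho2) * g * g
        s_hat = s / (1 - rho1 ** t)  # bias correction
        r_hat = r / (1 - rho2 ** t)  # bias correction
        theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta

scale = np.array([1.0, 100.0])

def J(theta):
    return 0.5 * np.sum(scale * theta ** 2)

def grad_fn(theta):
    return scale * theta

start = np.array([10.0, 2.0])
first = adam(grad_fn, start, steps=1)
print(start - first)  # each coordinate moves by about lr, whatever the gradient scale

final = adam(grad_fn, start, lr=0.1, steps=500)
print(J(start), J(final))
```

The first printout shows why bias correction matters: on step one the corrected moments give an update of roughly `lr` per coordinate instead of a tiny step caused by the zero initialization.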

### When to Use:
Generally robust default choice for most problems

---
## Newton's Method - The Quadratic Approximation

**What It Does:** Uses second derivatives (Hessian) for faster convergence

### How It Works:
- **Local Approximation:** Approximate function as quadratic curve near current point
- **Direct Jump:** Jump directly to minimum of that quadratic approximation

### The Math:
$$
\theta^* = \theta_0 - H^{-1} \nabla J(\theta_0) \quad \text{(Newton's update)}
$$

**Where:**
- $\theta^*$ = new parameter values
- $\theta_0$ = current parameter values
- $H$ = Hessian matrix (matrix of all second derivatives)
- $H^{-1}$ = inverse of the Hessian
- $\nabla J(\theta_0)$ = gradient at current point
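
As a sketch, one Newton update on a small convex quadratic lands exactly on the minimum. The matrix `A`, vector `b`, and starting point are made-up values for illustration.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, -1.0])

def grad(theta):
    # Gradient of J = 0.5 * theta^T A theta - b^T theta
    return A @ theta - b

def hessian(theta):
    return A  # constant for a quadratic

theta0 = np.array([5.0, -4.0])
# Solve H * delta = grad instead of forming the inverse explicitly
theta_star = theta0 - np.linalg.solve(hessian(theta0), grad(theta0))

print(theta_star)
print(np.linalg.solve(A, b))  # exact minimizer, for comparison
```

Using `np.linalg.solve` avoids computing $H^{-1}$ directly, which is the usual practice for numerical stability.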

### Visual Comparison:

<img src="https://upload.wikimedia.org/wikipedia/commons/d/da/Newton_optimization_vs_grad_descent.svg"
     alt="Newton's Method vs Gradient Descent"
     width="200"
     height="auto">

### Problems:
- **Expensive:** Requires computing inverse Hessian ($O(n^3)$ for $n$ parameters)
- **Hessian Issues:** Needs positive definite Hessian, fails at saddle points

---

## Simple Explanation

Think of optimization methods as ways to find the bottom of a valley:

**Regular gradient descent** is like a blindfolded person taking small steps downhill. They feel which way is steepest and take a step in that direction. Simple, but slow.

**Newton's Method** is like having a detailed map of the nearby terrain. Instead of just knowing "which way is down," you know the shape of the valley around you. This lets you calculate where the bottom should be and jump toward it. If the valley really is a perfect quadratic bowl, a single step lands exactly on the minimum; otherwise the jump is repeated from the new point.

### The Catch:

1. **Expensive to compute:** Creating that detailed map (the Hessian) requires measuring the curvature in every direction. For a model with millions of parameters, this is like measuring millions × millions of curvature values, then solving a huge system of equations. It's computationally impractical.

2. **Can be fooled:** The method assumes the valley is bowl-shaped (positive definite Hessian). But at saddle points (like a mountain pass), the terrain curves up in some directions and down in others. Newton's Method gets confused and might jump in the wrong direction.
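
The saddle-point failure can be reproduced on the textbook saddle $J(x, y) = x^2 - y^2$. The starting point and the gradient-descent step size of 0.1 are assumptions for illustration.

```python
import numpy as np

def J(theta):
    x, y = theta
    return x ** 2 - y ** 2

def grad(theta):
    x, y = theta
    return np.array([2 * x, -2 * y])

H = np.array([[2.0, 0.0], [0.0, -2.0]])  # indefinite: curves up in x, down in y

theta = np.array([0.5, 1.0])
newton = theta - np.linalg.solve(H, grad(theta))  # Newton step
gd = theta - 0.1 * grad(theta)                    # gradient descent step

print("start :", theta, J(theta))
print("newton:", newton, J(newton))  # jumps onto the saddle and the loss goes up
print("gd    :", gd, J(gd))          # keeps moving downhill
```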

### When to Use:
Only practical for small problems (few hundred parameters at most). In deep learning, we use approximations like L-BFGS or adaptive methods like Adam instead.

---
## Quasi-Newton Methods - The Practical Second-Order

**What They Do:** Approximate Newton's method without computing full Hessian

### How They Work:
- **Smart Approximation:** Build an approximation of the Hessian using only gradient information
- **Iterative Updates:** Update the Hessian approximation with each step instead of computing it from scratch
- **Balance:** Get most benefits of Newton's method at a fraction of the cost

### The Math:

**BFGS (Broyden-Fletcher-Goldfarb-Shanno):**
$$
\begin{align}
\theta_{t+1} &= \theta_t - \alpha_t H_t^{-1} \nabla J(\theta_t) \\
H_{t+1} &= H_t + \text{correction based on gradient changes}
\end{align}
$$

**L-BFGS (Limited-memory BFGS):**
- Stores only recent gradient history (typically last 10-20 steps)
- Memory: $O(nm)$ instead of $O(n^2)$ where $m$ is history size

**Where:**
- $H_t$ = approximate Hessian at step $t$
- $\alpha_t$ = learning rate (often from line search)
- $\nabla J(\theta_t)$ = gradient at step $t$
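
A compact BFGS sketch in NumPy, under a few assumptions: it tracks the inverse Hessian approximation directly (called `B_inv` here, playing the role of $H_t^{-1}$), picks $\alpha_t$ with a simple backtracking line search, and runs on a made-up convex quadratic. Production implementations use more careful line searches.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 20.0]])
b = np.array([1.0, -1.0])

def J(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):
    return A @ theta - b

def bfgs(theta, iters=50, tol=1e-10):
    n = len(theta)
    I = np.eye(n)
    B_inv = np.eye(n)  # running estimate of the inverse Hessian
    g = grad(theta)
    for _ in range(iters):
        if np.linalg.norm(g) < tol:
            break
        p = -B_inv @ g  # quasi-Newton search direction
        step = 1.0
        # Backtracking line search: halve the step until the loss drops enough
        while step > 1e-10 and J(theta + step * p) > J(theta) + 1e-4 * step * (g @ p):
            step *= 0.5
        s = step * p
        g_new = grad(theta + s)
        y = g_new - g  # how the gradient changed along the step
        sy = s @ y
        if sy > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / sy
            B_inv = (I - rho * np.outer(s, y)) @ B_inv @ (I - rho * np.outer(y, s)) \
                    + rho * np.outer(s, s)
        theta, g = theta + s, g_new
    return theta

result = bfgs(np.array([5.0, -4.0]))
print(result)
print(np.linalg.solve(A, b))  # exact minimizer, for comparison
```

Only gradients are evaluated; the curvature estimate is built from the pairs `(s, y)`. L-BFGS would keep just the last few such pairs instead of the full `B_inv` matrix.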

### Visual: L-BFGS Optimization Process

<img src="https://www.researchgate.net/profile/Zhiwen-Hu-4/publication/349537968/figure/fig2/AS:11431281176031871@1689988798867/The-gradient-ascent-process-of-by-L-BFGS-optimization-algorithm.png" alt="L-BFGS Optimization Process" width="400"
     height="auto">

### Popular Variants:
- **BFGS:** Full Hessian approximation, good for small to medium problems
- **L-BFGS:** Memory-efficient version, practical for large problems
- **SR1 (Symmetric Rank-1):** Simpler update rule, less stable

### Advantages:
- Faster convergence than first-order methods (gradient descent, Adam)
- More practical than Newton's method (no need to compute or invert Hessian)
- Works well on smooth optimization problems

### Disadvantages:
- Still more expensive per iteration than gradient descent
- Requires storing history (though L-BFGS minimizes this)
- Less effective on non-smooth or stochastic problems

---

## Simple Explanation

Remember how **Newton's Method** needs an expensive detailed map (the Hessian) of the terrain? And **gradient descent** only knows which way is downhill at each step?

**Quasi-Newton methods** are the clever middle ground:

Imagine you're hiking down a mountain and keeping notes. Instead of measuring every curve in the terrain (Newton's expensive map), you:
1. Take a step and note how the slope changed
2. Use those notes to guess what the terrain looks like
3. Your guess gets better with each step

It's like building a map as you go, learning from experience rather than measuring everything upfront.

### Real-World Analogy:

Think of learning to ride a bike on different hills:
- **Gradient descent:** You only feel if you're going up or down right now
- **Newton's method:** You have a complete topographic map (expensive to make)
- **Quasi-Newton:** You remember how the hills felt on previous rides and use that experience to predict what's coming

### The L-BFGS Trick:

Regular BFGS remembers everything, which gets expensive. L-BFGS is like only keeping your last 10-20 notes instead of a massive journal. You lose some detail, but it's much more practical and still works great.

### When to Use:
- **BFGS:** Medium-sized problems (thousands of parameters) with smooth objectives
- **L-BFGS:** Larger problems (tens of thousands to millions of parameters) where you need better convergence than gradient descent but can't afford full Newton's method
- **Deep Learning:** Rarely used for training neural networks (mini-batch stochasticity breaks the assumptions), but sometimes used for fine-tuning or on specific layers

---
## Initialization Strategies

### Weight Initialization - The Foundation

**Why It Matters:** Initial parameters determine whether training succeeds or fails

#### Glorot (Xavier) Initialization:

$$
W_{ij} \sim U\left(-\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}}\right)
$$

**Where:**
- $W_{ij}$ = weight connecting input $i$ to output $j$
- $U(a, b)$ = uniform distribution between $a$ and $b$
- $m$ = number of input dimensions
- $n$ = number of output dimensions

**For:** Linear and tanh/sigmoid-style networks; keeps activation variance and gradient variance balanced across layers
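
A short NumPy sketch of the formula above. The helper name `glorot_uniform` and the layer sizes are assumptions for illustration. A uniform distribution on $(-a, a)$ has variance $a^2/3$, so the sampled weights should have variance close to $2/(m+n)$.

```python
import numpy as np

def glorot_uniform(m, n, rng):
    """Sample an (m, n) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

rng = np.random.default_rng(0)
m, n = 256, 128
W = glorot_uniform(m, n, rng)

print(W.var())        # empirical variance
print(2.0 / (m + n))  # theoretical variance 2 / (m + n)
```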

---

#### He Initialization:

$$
W_{ij} \sim U\left(-\sqrt{\frac{6}{m}}, \sqrt{\frac{6}{m}}\right)
$$

**Where:**
- $W_{ij}$ = weight connecting input $i$ to output $j$
- $U(a, b)$ = uniform distribution between $a$ and $b$
- $m$ = number of input dimensions

**For:** ReLU networks, accounts for activation function properties
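
The sketch below pushes random inputs through 20 ReLU layers and compares the He formula above with a baseline that has half the weight variance. The baseline `half_scale`, the depth, the width, and the batch size are assumptions for illustration.

```python
import numpy as np

def he_uniform(m, n, rng):
    limit = np.sqrt(6.0 / m)  # weight variance 2 / m
    return rng.uniform(-limit, limit, size=(m, n))

def half_scale(m, n, rng):
    limit = np.sqrt(3.0 / m)  # hypothetical baseline: weight variance 1 / m
    return rng.uniform(-limit, limit, size=(m, n))

def final_mean_square(init_fn, depth=20, width=512, seed=0):
    """Mean squared activation after `depth` ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(1000, width))
    for _ in range(depth):
        h = np.maximum(0.0, h @ init_fn(width, width, rng))  # linear layer + ReLU
    return np.mean(h ** 2)

he_ms = final_mean_square(he_uniform)
half_ms = final_mean_square(half_scale)
print("He init:      ", he_ms)    # stays on the order of 1
print("half variance:", half_ms)  # roughly halves at every layer
```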

---

#### Orthogonal Initialization:

**Approach:** Initialize weights as orthogonal matrices

**Benefits:**
- Maintains gradient flow through deep networks
- Preserves norm of activations during forward pass
- Helps prevent vanishing/exploding gradients

**Scaling:** Often needs gain factor adjustment based on activation function
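
One common way to build such a matrix is the QR decomposition of a random Gaussian matrix, sketched below. The helper name `orthogonal` and the size 64 are assumptions for illustration; deep learning frameworks ship their own versions.

```python
import numpy as np

def orthogonal(n, rng, gain=1.0):
    """Return gain times a random n x n orthogonal matrix."""
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs so the result is not biased by QR conventions
    return gain * q

rng = np.random.default_rng(0)
W = orthogonal(64, rng)
x = rng.normal(size=64)

print(np.linalg.norm(x), np.linalg.norm(W @ x))  # the two norms match
```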

---

### Bias Initialization

**General Rule:** Usually set to small constants (0 or small positive values)

#### Special Cases:
- **Output Units:** Set to match desired output statistics
- **ReLU Units:** Sometimes set to 0.1 to avoid initial saturation (dead neurons)
- **Gates (LSTM):** Set forget gates to 1 to keep information flowing initially

---

## Simple Explanation

Think of training a neural network like teaching a group of students. If they all start with wildly different skill levels (bad initialization), some will excel while others give up immediately. Good initialization gives everyone a fair starting point.

### Why Initialization Matters:

**Too Large:** Activations grow with every layer → gradients explode → training fails

**Too Small:** Activations shrink with every layer → gradients vanish → network learns nothing

**Just Right:** Signals flow smoothly through the network, gradients stay reasonable, training succeeds

### The Different Strategies:

**Glorot (Xavier) Initialization** is like giving students study materials proportional to both how much they need to learn (inputs) and how much they need to teach (outputs). It balances information flow in both directions.

**He Initialization** is optimized for ReLU networks. ReLU zeroes out negative inputs, which is roughly half of the activations at initialization, so He initialization compensates by doubling the weight variance relative to what a linear layer would need. It's like giving students extra resources knowing half will be filtered out.

**Orthogonal Initialization** ensures that information doesn't get distorted as it passes through layers. Think of it like keeping the volume constant on a stereo system as music passes through multiple components—no amplification or dampening at each stage.

### Real-World Analogy:

Imagine a relay race:
- **Bad initialization:** Runners start at random speeds (some sprinting, some walking) → chaos
- **Good initialization:** Everyone starts at a sustainable pace → smooth handoffs → team finishes strong

### Bias Initialization Tips:

**Starting at zero** works for most cases—it's neutral and lets the network learn what it needs.

**Small positive values for ReLU** (like 0.1) prevent "dead neurons" that never activate. It's like giving neurons a small head start so they don't get stuck at zero forever.

**Forget gate bias = 1 in LSTMs** means "remember everything initially." The network can learn to forget later, but starting with memory helps learning.

### When to Use What:

- **Sigmoid/Tanh activations:** Use Glorot initialization
- **ReLU/LeakyReLU activations:** Use He initialization  
- **Very deep networks (100+ layers):** Consider orthogonal initialization
- **Recurrent networks (RNN/LSTM):** Orthogonal for recurrent weights, He/Glorot for input weights

# KEY INSIGHTS TO REMEMBER

**The Core Principle:** Optimization is about finding good (not perfect) parameters efficiently

**The Big Secret:** For large networks, local minima aren't the main problem - it's about finding any good minimum quickly

**The Modern Reality:**
* Adam + BatchNorm + Learning Rate Scheduling covers most everyday deep learning needs
* Good initialization often matters more than perfect optimization

**The Expert Mindset:**
* Start simple (SGD/Momentum)
* Add complexity only when needed
* Monitor gradients, not just loss
* Use adaptive methods for convenience, not perfection

# OPTIMIZATION VS REGULARIZATION CONNECTION

**Remember:** These two chapters are deeply connected:
* Optimization: How to find good parameters
* Regularization: How to make sure those parameters generalize
* Both crucial: You need both for successful deep learning

**Common Combinations:**
* SGD + L2 Regularization (classic combination)
* Adam + Dropout (modern deep learning)
* Any Optimizer + Early Stopping (automatic regularization)