# Lecture Notes: Variance Reduction in Stochastic Gradient Descent (SGD)

## Recap: What is Stochastic Gradient Descent?

<br>

<img src="./images/391.png" width="500" style="display: block; margin: auto;">

<br>

Stochastic Gradient Descent (SGD) is a widely used optimization algorithm in deep learning.

- It updates model parameters **after computing the gradient of the loss** on a **single sample or a small batch**, rather than on the full dataset.
- This introduces **noise** into the optimization path, which can be helpful for escaping local minima but also slows convergence.
- SGD trades off precision in gradient estimation for **faster updates** and **lower memory cost** compared to full-batch gradient descent.

## The Problem: High Gradient Variance

The convergence speed of SGD depends heavily on the variance of its gradient estimates: the higher the variance, the slower and less stable the convergence.

- If the gradients computed on different samples are all pointing in **roughly the same direction**, the optimizer moves efficiently — this is **low variance**.
- If those gradients **disagree widely**, the optimizer "jumps around" and converges more slowly — this is **high variance**.

This variance limits how quickly we can train deep networks and how stable the training process is.
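
To make the two cases concrete, here is a minimal sketch on a hypothetical one-parameter least-squares problem (the toy data, the `per_sample_grads` helper, and the noise levels are illustrative assumptions, not part of the lecture). It measures how widely per-sample gradients spread and how many of them agree in sign with the average gradient.

```python
import random
import statistics

def per_sample_grads(noise_std, w=0.0, n=1000, seed=0):
    """Per-sample gradients of (w*x - y)**2 for toy data y = 3x + noise."""
    rng = random.Random(seed)
    grads = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + rng.gauss(0.0, noise_std)
        grads.append(2.0 * (w * x - y) * x)
    return grads

def summarize(grads):
    """Return (variance, share of gradients agreeing in sign with the mean)."""
    mean = statistics.fmean(grads)
    agree = sum(1 for g in grads if g * mean > 0) / len(grads)
    return statistics.pvariance(grads), agree

for label, noise_std in [("clean data", 0.1), ("noisy data", 5.0)]:
    variance, agree = summarize(per_sample_grads(noise_std))
    print(f"{label}: variance={variance:.2f}, agreeing with mean direction={agree:.0%}")
```

On the noisier data the per-sample gradients should spread much more widely and agree with the average direction less often, which is the high-variance regime described above.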

## Goal: Reduce Variance to Speed Up and Stabilize Training

To address the variance issue in SGD, two main techniques are commonly used:

1. **Mini-Batch Gradient Descent**
2. **Momentum**

Each reduces variance in a different way — and they **stack** nicely when used together.

---

## 1. Mini-Batch Gradient Descent

<br>

<img src="./images/392.png" width="500" style="display: block; margin: auto;">

<br>

<br>

<img src="./images/393.png" width="500" style="display: block; margin: auto;">

<br>

<br>

<img src="./images/394.png" width="500" style="display: block; margin: auto;">

<br>

<br>

<img src="./images/395.png" width="500" style="display: block; margin: auto;">

<br>

Instead of computing the gradient on **a single training example**, we compute it on a **small batch** and average the gradients.

This:
- **Reduces variance** in the gradient estimate.
- Makes the optimization path **smoother and more stable**.
- **Speeds up convergence** while retaining much of the computational efficiency of SGD.
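
As a rough sketch of the idea, the loop below averages per-sample gradients over each batch before taking a step. The toy data, the `minibatch_grad` and `train` helpers, and the learning rate are assumptions chosen for illustration; a real model would compute gradients with a framework such as PyTorch.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 3x + noise, fitted with a single weight w.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]

def minibatch_grad(w, batch):
    """Average the per-sample gradients of (w*x - y)**2 over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(batch_size, lr=0.1, epochs=5):
    """Plain mini-batch SGD: shuffle, slice into batches, step on each batch."""
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * minibatch_grad(w, batch)
    return w

w_fit = train(batch_size=32)
print(f"fitted w: {w_fit:.2f}")  # should land near the true slope of 3
```

Calling `train(batch_size=1)` gives single-sample SGD, and `train(batch_size=len(xs))` gives full-batch gradient descent with the same code.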

### Batch Size

How big should the batch be? The empirical answer: as big as you can fit on a single GPU. The limiting factor is usually GPU memory rather than computation, so a batch that fits in memory will rarely be too slow to process.

For larger batches, choosing a power of two (for example 64, 128, or 256) can help, since GPU kernels tend to be tuned for such sizes.
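
In PyTorch, the batch size is typically set on the data loader. The sketch below assumes PyTorch is installed and uses a made-up random dataset; the value 256 is an arbitrary power of two, not a recommendation from the lecture.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 1024 samples with 10 features each.
features = torch.randn(1024, 10)
targets = torch.randn(1024, 1)
dataset = TensorDataset(features, targets)

# batch_size is the hyperparameter discussed here; shuffle draws a new order each epoch.
loader = DataLoader(dataset, batch_size=256, shuffle=True)

for batch_features, batch_targets in loader:
    print(batch_features.shape, batch_targets.shape)
```

If a batch size triggers an out-of-memory error on your GPU, halving it is a common next step.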

<br>

<img src="./images/396.png" width="500" style="display: block; margin: auto;">

<br>

#### Trade-offs
- A **larger batch** provides a better estimate of the true gradient (lower variance), but is **more expensive** to compute.
- A **smaller batch** is cheaper but noisier (higher variance).

#### Hyperparameter
- The **batch size** becomes a critical tuning parameter.

#### Special Cases:
- Batch size = 1 → equivalent to vanilla SGD.
- Batch size = dataset size → equivalent to full Gradient Descent.

### Why It Works (Intuition)

Averaging over multiple samples means the **random noise in individual gradients cancels out**, leaving a clearer signal of the "true" direction to descend in.

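Mathematically, let $g$ be the true full-dataset gradient, $\hat{g}_{\text{SGD}}$ the gradient from a single sample, and $\hat{g}_{\text{minibatch}}$ the average gradient over a mini-batch of $B$ samples. Both are unbiased estimates of $g$ when samples are drawn uniformly at random, but the mini-batch estimate is less noisy:

$\text{Var}[\hat{g}_{\text{minibatch}}] \leq \text{Var}[\hat{g}_{\text{SGD}}]$

For independently drawn samples, the variance shrinks in proportion to the batch size: $\text{Var}[\hat{g}_{\text{minibatch}}] = \text{Var}[\hat{g}_{\text{SGD}}] / B$.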
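
The inequality can be checked empirically. The sketch below is an illustration under stated assumptions: a hypothetical one-parameter least-squares problem, batches drawn with replacement so the samples are independent, and an illustrative helper named `estimator_variance`. Under those assumptions the variance is expected to fall roughly in proportion to the batch size.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy data: y = 3x + noise, with per-sample gradients taken at w = 0.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]
w = 0.0
grads = [2.0 * (w * x - y) * x for x, y in zip(xs, ys)]

def estimator_variance(batch_size, trials=2000):
    """Empirical variance of the averaged gradient over randomly drawn batches."""
    estimates = [
        statistics.fmean(random.choices(grads, k=batch_size))  # sampling with replacement
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)

var_1 = estimator_variance(1)
var_32 = estimator_variance(32)
print(f"batch size 1:  {var_1:.4f}")
print(f"batch size 32: {var_32:.4f}")
print(f"ratio: {var_1 / var_32:.1f} (roughly 32 expected for independent samples)")
```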

### **Always Use Mini-Batches**

---

## 2. Momentum

<br>

<img src="./images/397.png" width="500" style="display: block; margin: auto;">

<br>

While mini-batches average across **samples**, momentum averages across **time** (steps).

Momentum keeps one additional quantity per parameter: a running average of the gradients, which is updated every time a new gradient is computed.

### Motivation:
- Gradient directions from batch to batch can still be noisy.
- Momentum helps **"smooth out" fluctuations** over time by incorporating the direction of previous gradients.

### Mechanism

In regular SGD, the model parameters are updated based **only on the current gradient**.  
With **momentum**, we improve this by keeping a **running average** of the previous gradients — like adding "inertia" to our updates.

Let:
- $g_t$ be the current gradient at time step $t$
- $v_t$ be the **velocity**, an exponentially decaying running sum of past gradients (a running average up to a constant scale factor)

The momentum update rule is:

- $v_t = \mu v_{t-1} + g_t$  → update the velocity by combining past and current gradients  
- $\theta_t = \theta_{t-1} - \epsilon v_t$  → update model parameters using the velocity

Where:
- $\mu$ is the **momentum coefficient** (typically 0.9), which controls how much of the previous gradient history to keep
- $\epsilon$ is the **learning rate**

Instead of following just the current gradient, momentum follows a **blend of the current and previous gradients**, resulting in faster and smoother convergence.
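
The two update equations translate almost line for line into code. The following is a minimal sketch in plain Python; the `momentum_step` helper and the quadratic test function are illustrative assumptions, and the argument `lr` plays the role of $\epsilon$.

```python
def momentum_step(theta, velocity, grad, lr=0.01, mu=0.9):
    """One momentum update: v_t = mu * v_{t-1} + g_t, then theta_t = theta_{t-1} - lr * v_t."""
    velocity = [mu * v + g for v, g in zip(velocity, grad)]
    theta = [p - lr * v for p, v in zip(theta, velocity)]
    return theta, velocity

# Hypothetical example: minimize f(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta = [1.0, -2.0]
velocity = [0.0, 0.0]  # the velocity starts at zero
for _ in range(200):
    grad = [2.0 * p for p in theta]
    theta, velocity = momentum_step(theta, velocity, grad)

print(theta)  # both entries should end up very close to the minimum at 0
```

Setting `mu=0.0` makes the velocity equal to the current gradient, which recovers the plain SGD update.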

### Benefits of Momentum

- **Dampens oscillations** in directions with noisy gradients.
- **Accelerates convergence** in consistently downhill directions.
- Can help the optimizer roll through **shallow local minima** and flat regions, although too much momentum can also cause overshooting.
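
The dampening effect can be illustrated with a synthetic stream of noisy gradients. This is a sketch under assumptions: the gradient stream is made up (a constant true gradient of 1.0 plus unit-variance Gaussian noise), and the velocity is rescaled by $(1 - \mu)$ only so that it is directly comparable to a single gradient.

```python
import random
import statistics

random.seed(0)
mu = 0.9

# Hypothetical noisy gradient stream: true gradient 1.0 plus unit-variance noise.
raw = [1.0 + random.gauss(0.0, 1.0) for _ in range(20000)]

velocity = 0.0
smoothed = []
for g in raw:
    velocity = mu * velocity + g            # same rule as v_t = mu * v_{t-1} + g_t
    smoothed.append((1.0 - mu) * velocity)  # rescaled to the size of one gradient

burn_in = 200  # skip the warm-up while the velocity builds up from zero
raw_var = statistics.pvariance(raw[burn_in:])
smoothed_var = statistics.pvariance(smoothed[burn_in:])
print(f"raw gradient variance:       {raw_var:.3f}")
print(f"smoothed direction variance: {smoothed_var:.3f}")
```

The smoothed direction keeps the same average as the raw gradients while fluctuating far less from step to step.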

### Hyperparameter:
- **Momentum coefficient** ($\mu$): typically set to **0.9** by default in practice.
- PyTorch's `SGD` **does not set this automatically** — you must provide `momentum=0.9` manually.
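
A quick way to confirm which value an optimizer is using is to inspect its `defaults` dictionary. The sketch below assumes PyTorch is installed; the `Linear` layer is only a placeholder that supplies some parameters.

```python
import torch

model = torch.nn.Linear(4, 1)  # placeholder model, only here to supply parameters

plain = torch.optim.SGD(model.parameters(), lr=0.01)
with_momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

print(plain.defaults["momentum"])          # momentum was not passed, so the default applies
print(with_momentum.defaults["momentum"])  # the value passed explicitly
```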

## Visualization: What Do These Look Like?

<br>

<img src="./images/399.png" width="500" style="display: block; margin: auto;">

<br>

<br>

<img src="./images/398.png" width="500" style="display: block; margin: auto;">

<br>

- **SGD**: erratic jumps, often zig-zagging and slow.
- **Mini-Batch SGD**: smoother path, less jagged.
- **Momentum**: smoother, faster convergence — often visually close to full gradient descent.

Even mini-batches may produce spikes due to:
- Loss evaluated on a hard sample
- Taking a step in an imprecise direction

## Momentum in Practice

<br>

<img src="./images/3911.png" width="500" style="display: block; margin: auto;">

<br>

- **Set it to 0.9 and leave it** — it works well in almost all practical cases.
- If you forget to set it, **PyTorch will default to 0**, i.e., no momentum.
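- Adam uses a momentum-like running average of gradients by default (`betas=(0.9, 0.999)` in PyTorch). RMSProp instead keeps a running average of *squared* gradients, and its `momentum` argument in PyTorch also defaults to 0.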

---

## Comparison Table

|               | Mini-Batch SGD              | Momentum                          |
|---------------|-----------------------------|-----------------------------------|
| Averages over | Multiple samples             | Gradient history                  |
| Reduces       | Sample-based variance        | Temporal variance (oscillations)  |
| Hyperparam    | Batch size                   | Momentum factor $\mu$             |
| Cost          | Higher per update (forward/backward over more samples) | Low (stores one extra buffer the size of the parameters) |

---

## Final Summary

<br>

<img src="./images/3912.png" width="500" style="display: block; margin: auto;">

<br>

- SGD works, but has **high variance**, especially with small batches or noisy data.
- Use **mini-batches** to reduce sample-wise variance.
- Use **momentum** to reduce noise over time and stabilize updates.
- Both techniques are **simple**, **complementary**, and **widely used** in practice.
- Always train your networks with **mini-batch SGD and momentum** — they're the default baseline for a reason.

```python
# Example in PyTorch
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```