# A Simple Weight Decay can Improve Generalization

# Weight Decay and Generalization – Full Notebook Notes

---

## 1. Key Concepts

- **Generalization:** The network's ability to perform well on unseen data, not just memorize training examples.
- **Linear network:** Output is a weighted sum of inputs:  
  $$f(x) = w^T x$$
- **Non-linear network:** Networks with activations like ReLU, sigmoid, or tanh; the output is no longer a linear function of the weights, so the error surface is curved rather than a simple quadratic bowl.
- **Weight vector $w$:** All trainable parameters collected into one vector.
- **Hidden layers:** Layers between input and output that transform data.
- **Static noise:** Random variation in inputs or targets, usually with zero mean.

**Intuition:**  
- Linear networks = a quadratic error surface in weight space, with the same shape everywhere.  
- Non-linear networks = a twisted, curved error surface that only looks quadratic close to a solution.

---

## 2. Why Weight Decay?

- **Problem:** Large weights → network can memorize noise → poor generalization.
- Noise amplification for input noise $\eta$ whose components are independent with zero mean and variance $\sigma^2$:  
$$\text{Var}(w^T \eta) = \sigma^2 \|w\|^2$$  
- **Solution:** Add a penalty for large weights:
$$E(w) = E_0(w) + \frac{\lambda}{2} \sum_i w_i^2$$  
- Gradient descent on $E$ (continuous time; the learning rate is absorbed into the proportionality):
$$\dot{w}_i \propto -\frac{\partial E_0}{\partial w_i} - \lambda w_i$$  

**Analogy:**  
- Think of weights as a volume knob. Large weights → amplify noise.  
- Weight decay → turns down the gain to prevent noise spikes.
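
The noise amplification formula above can be checked numerically. The following is a minimal sketch, assuming Gaussian input noise with independent components; the weight vectors, sample count, and the helper name `output_noise_variance` are illustrative choices, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5            # assumed standard deviation of each noise component
n_samples = 200_000    # number of Monte Carlo noise draws

def output_noise_variance(w):
    """Empirical variance of w^T eta over many independent noise draws."""
    eta = rng.normal(0.0, sigma, size=(n_samples, w.size))
    return np.var(eta @ w)

w_small = np.array([0.1, -0.2, 0.1])   # illustrative small weights
w_large = 10.0 * w_small               # same direction, 10x larger norm

var_small = output_noise_variance(w_small)
var_large = output_noise_variance(w_large)

# Prediction from the formula: sigma^2 * ||w||^2
pred_small = sigma**2 * np.sum(w_small**2)
pred_large = sigma**2 * np.sum(w_large**2)

print(f"small weights: empirical {var_small:.4f}, predicted {pred_small:.4f}")
print(f"large weights: empirical {var_large:.4f}, predicted {pred_large:.4f}")
```

Scaling the weights by 10 scales the output noise variance by roughly 100, which is the quadratic dependence on $\|w\|$ that motivates the penalty term.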

---

## 3. Feed-Forward Networks

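- **Network output:** $f_w(e)$ for input vector $e$ and weights $w$  
- **Teacher network:** The network $f_u$ with weights $u$ that generates the targets  
- **Cost function (squared error over the $p$ training examples $e^\mu$):**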
$$E_0(w) = \frac{1}{2} \sum_{\mu=1}^p [f_u(e^\mu) - f_w(e^\mu)]^2$$

- **Gradient descent with decay** (in this update $\eta$ is the learning rate, not a noise term):
$$w_i \leftarrow w_i + \eta \sum_{\mu=1}^{p} [f_u(e^\mu) - f_w(e^\mu)] \frac{\partial f_w(e^\mu)}{\partial w_i} - \eta \lambda w_i$$

**Explanation:**  
- Two forces on weights:
  1. **Data force:** reduce error.  
  2. **Decay force:** shrink weights → simpler network.
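
The two forces can be seen in a small simulation. This is a sketch under stated assumptions: a linear teacher and student, noise-free targets, and hypothetical sizes and step size. The learning rate is called `lr` in the code to avoid confusion with noise.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 5, 40                       # assumed: 5 weights, 40 training examples
u = rng.standard_normal(N)         # teacher weights
E = rng.standard_normal((p, N))    # rows are the training inputs e^mu
targets = E @ u                    # teacher outputs f_u(e^mu), no noise here

def train(lam, lr=0.005, steps=5000):
    """Batch gradient descent on squared error plus (lam/2) * ||w||^2."""
    w = np.zeros(N)
    for _ in range(steps):
        err = targets - E @ w          # f_u(e^mu) - f_w(e^mu) for every mu
        data_force = E.T @ err         # sum over mu of err * df_w/dw_i
        decay_force = -lam * w         # pulls every weight toward zero
        w = w + lr * (data_force + decay_force)
    return w

w_plain = train(lam=0.0)
w_decay = train(lam=5.0)

print("||u||       =", np.linalg.norm(u))
print("||w_plain|| =", np.linalg.norm(w_plain))
print("||w_decay|| =", np.linalg.norm(w_decay))
```

Without decay the student recovers the teacher weights; with decay it settles where the data force and the decay force balance, which for this linear case is the ridge-regression solution with a smaller norm.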

---

## 4. Learning with Noisy Targets

- Targets with noise:  
$$\text{Target} = f_u(e) + \eta, \quad \langle \eta \rangle = 0, \quad \langle \eta^2 \rangle = \sigma^2$$

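- Weight update for a linear network $f_w(e) = \frac{1}{\sqrt{N}} \sum_i w_i e_i$ with $N$ inputs, written in terms of the deviation from the teacher, $v_j = u_j - w_j$:
$$\dot{w}_i \propto \sum_{\mu} \left( \frac{1}{N} \sum_j v_j e_j^\mu + \frac{1}{\sqrt{N}} \eta^\mu \right) e_i^\mu - \lambda w_i$$

- Asymptotic solution, where $r$ labels the eigen-directions of $A_{ij} = \frac{1}{N} \sum_\mu e_i^\mu e_j^\mu$ and $A_r$ are its eigenvalues:
$$v_r = \frac{\lambda u_r - \frac{1}{\sqrt{N}} \sum_\mu \eta^\mu e_r^\mu}{\lambda + A_r}$$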

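- **Optimal weight decay** (noise variance divided by the mean squared teacher weight, given the $1/\sqrt{N}$ normalization of the network output):
$$\lambda_{\text{optimal}} = \frac{\sigma^2}{\frac{1}{N} |u|^2}$$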

**Intuition:**  
- Noise pushes weights randomly → weight decay acts as a brake.  
- Stronger noise → stronger decay needed.  
- Analogy: sliders on a soundboard, decay prevents noisy spikes.
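
The location of the optimum can be illustrated by evaluating the expected squared deviation from the teacher on a grid of $\lambda$ values. This is a sketch under explicit assumptions: the asymptotic solution is averaged over the target noise and over an isotropic random teacher, giving an expected error (up to a constant factor) of $\frac{1}{N} \sum_r \frac{\lambda^2 \bar{u}^2 + \sigma^2 A_r}{(\lambda + A_r)^2}$, where $\bar{u}^2 = \frac{1}{N}|u|^2$ is the mean squared teacher weight. The sizes, the grid, and the names `u_bar2` and `expected_error` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 30          # assumed: 50 inputs, 30 training examples
sigma2 = 0.25          # assumed target-noise variance
u_bar2 = 1.0           # assumed mean squared teacher weight, (1/N) * |u|^2

inputs = rng.standard_normal((p, N))     # rows are the training inputs
A = inputs.T @ inputs / N                # A_ij = (1/N) sum_mu e_i e_j
A_r = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # eigenvalues, tiny negatives clipped

def expected_error(lam):
    """Noise- and teacher-averaged squared deviation (1/N) * sum_r <v_r^2>."""
    return np.mean((lam**2 * u_bar2 + sigma2 * A_r) / (lam + A_r) ** 2)

lams = np.linspace(0.01, 1.0, 100)       # grid with spacing 0.01
errors = np.array([expected_error(lam) for lam in lams])

lam_best = lams[np.argmin(errors)]
lam_theory = sigma2 / u_bar2             # noise variance over mean squared weight

print(f"grid minimum at lambda = {lam_best:.2f}")
print(f"predicted optimum      = {lam_theory:.2f}")
```

Each eigen-direction with $A_r > 0$ contributes a term that is minimized at the same $\lambda$, which is why the optimum does not depend on the particular training inputs in this averaged setting.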

---

## 5. Non-Linear Networks

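- Exact analysis is generally intractable → use **local linearization** (zoom in on the neighborhood of a solution).  
- For realizable functions ($f = f_u$) with fewer training examples than weights ($p < W$, where $W$ is the number of weights) → **manifold of solutions** (valley) with zero training error.  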
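- Linear expansion around a solution $u$, with $v_i = u_i - w_i$ (the decay acts on $w_i = u_i - v_i$, not on $v_i$ itself):
$$\dot{v}_i \propto - \sum_j A_{ij} v_j + \lambda (u_i - v_i)$$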

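- **Matrix $A$:**  
  - Sum of outer products of the output gradients, $A_{ij} = \sum_\mu \frac{\partial f_u(e^\mu)}{\partial w_i} \frac{\partial f_u(e^\mu)}{\partial w_j}$ → approximates the curvature of $E_0$ near the solution.  
  - Rank $R \leq \min(p, W)$ → when $p < W$ there are at least $W - p$ flat directions (zero eigenvalues) → valley/rain gutter.  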

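- **Among the zero-error solutions in the valley, weight decay picks the one with the smallest norm** → simpler network, better generalization (Ockham’s Razor).  
- For small errors on the targets, the argument from the linear case carries over: decay damps the noise-driven weight components and reduces overfitting.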
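
The smallest-norm claim can be illustrated in the locally linear picture. The sketch below assumes a linear model with fewer examples than weights ($p < W$) and noise-free targets; the matrix `X` stands in for the output gradients at the solution, and all sizes are hypothetical. It compares the weight-decay solution for a very small $\lambda$ with the minimum-norm interpolating solution from the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(2)
p, W = 5, 20                        # assumed: 5 examples, 20 weights
X = rng.standard_normal((p, W))     # rows stand in for the gradients per example
u = rng.standard_normal(W)          # teacher weights (one point in the valley)
y = X @ u                           # realizable, noise-free targets

def ridge(lam):
    """Minimizer of 0.5*||y - Xw||^2 + 0.5*lam*||w||^2, in dual form."""
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(p), y)

w_min_norm = np.linalg.pinv(X) @ y  # smallest-norm solution with zero error
w_ridge = ridge(1e-8)               # weight decay with a very small lambda

print("training residual:", np.linalg.norm(X @ w_min_norm - y))
print("||u||          =", np.linalg.norm(u))
print("||w_min_norm|| =", np.linalg.norm(w_min_norm))
print("||w_ridge - w_min_norm|| =", np.linalg.norm(w_ridge - w_min_norm))
```

Both the teacher $u$ and the minimum-norm solution fit the training data exactly, but the latter has no component along the flat directions, so its norm is never larger than the teacher's.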

