# Lecture Notes: Why Very Deep Networks Fail to Train

## Motivation

What happens if we try to train a **very deep network**, like one with 100 layers?

- Answer: **It won’t train**
- In this section, we explore:
  - Why deep networks struggle to train
  - How to analyze the problem using simple models
  - Techniques that **enable training** of very deep networks

<br>

---

## Case Study: A Deep Linear Network

<br>

<img src="./images/411.png" width="500" style="display: block; margin: auto;">

<br>

To simplify, consider:
- A network of $n$ **linear layers**
- Each layer has:
  - A **single scalar weight** $w$
  - A **bias** $b$
- Input $x \in \mathbb{R}$, output $y \in \mathbb{R}$

This is a highly simplified network, but it reveals key behaviors that extend to real-world models.

<br>

---

## Forward Pass Dynamics

The output of layer 1:  
$y_1 = wx + b$

The output of layer 2:  
$y_2 = w(wx + b) + b = w^2x + wb + b$

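In general, at the $n$th layer (assuming every layer shares the same weight and bias, $w_i = w$ and $b_i = b$, with $w \neq 1$):  
$y_n = w^n x + \left( \frac{1 - w^n}{1 - w} \right) b$

The bias term is the geometric sum $b\,(1 + w + \dots + w^{n-1})$. For $w = 1$ it reduces to $nb$, so $y_n = x + nb$.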
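
As a quick sanity check, the closed form can be compared against applying the layer update $n$ times. This is a minimal sketch: the function names and the sample values of `x`, `w`, `b`, and `n` are illustrative assumptions, not part of the lecture.

```python
def forward_iterative(x, w, b, n):
    """Apply y <- w*y + b a total of n times, starting from y = x."""
    y = x
    for _ in range(n):
        y = w * y + b
    return y


def forward_closed_form(x, w, b, n):
    """Closed form w^n x + b (1 - w^n)/(1 - w); uses x + n*b when w == 1."""
    if w == 1:
        return x + n * b
    return w**n * x + (1 - w**n) / (1 - w) * b


# Sample values chosen only for illustration.
x, w, b, n = 2.0, 0.9, 0.5, 10
y_iter = forward_iterative(x, w, b, n)
y_closed = forward_closed_form(x, w, b, n)
print(y_iter, y_closed)  # the two values should agree up to rounding error
```

The `w == 1` branch is needed because the geometric-series formula divides by $1 - w$.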

<br>

---

## Activation Behavior by Weight Magnitude

<br>

<img src="./images/412.png" width="500" style="display: block; margin: auto;">

<br>

### Case 1: $|w| < 1$
- $w^n \rightarrow 0$ as $n \rightarrow \infty$
- The input's contribution vanishes: the network **forgets the input**
- The output is dominated by the bias term, converging to $\frac{b}{1 - w}$ regardless of $x$
- Known as **vanishing input**

### Case 2: $w = 1$
- $y_n = x + nb$ (ideal)
- The input is preserved, but only if $w$ sits exactly at 1, so this is sensitive to initialization
- Extremely rare in practice

### Case 3: $|w| > 1$
- $|w|^n \rightarrow \infty$ as $n$ increases
- **Activations explode**
- Outputs eventually overflow to `inf`, and later operations on them can produce `nan`
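
The three regimes can be reproduced numerically with the scalar model. The sketch below assumes illustrative values `x = 1.0` and `b = 0.1` and uses Python's 64-bit floats. In that precision, `w = 1.5` is still finite at 100 layers and only overflows to `inf` at much larger depth; lower-precision formats overflow sooner.

```python
import math


def forward(x, w, b, n):
    """Apply y <- w*y + b a total of n times, starting from y = x."""
    y = x
    for _ in range(n):
        y = w * y + b
    return y


x, b = 1.0, 0.1  # illustrative values, not from the lecture
for w in (0.5, 1.0, 1.5):
    for n in (10, 100, 2000):
        print(f"w={w}, n={n}: y_n = {forward(x, w, b, n):.4g}")

# With w = 0.5 the output settles at b / (1 - w) no matter what x was.
forgets_input = abs(forward(1.0, 0.5, b, 100) - forward(50.0, 0.5, b, 100)) < 1e-12
# With w = 1.5 the repeated multiplication eventually overflows float64.
overflowed = math.isinf(forward(x, 1.5, b, 2000))
print(forgets_input, overflowed)
```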

<br>

---

## Backward Pass Dynamics (Gradients)

<br>

<img src="./images/413.png" width="500" style="display: block; margin: auto;">

<br>

During backpropagation, gradients behave similarly:

- By the chain rule, $\frac{\partial y_n}{\partial x} = w^n$: the gradient is multiplied by $w$ once per layer on the way back
- So gradients **vanish** if $|w| < 1$
- They **explode** if $|w| > 1$

This leads to:
- **Vanishing gradients**: nothing updates
- **Exploding gradients**: model diverges

In practice, activations usually explode before the gradients do: the network output becomes `nan`, the gradients computed from it are `nan` as well, and the next update sets every weight in the network to `nan`. Exploding gradients do still occur on their own, however, and they are hard to control if the network was not initialized properly.
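
The repeated multiplication by $w$ in the backward pass can be sketched directly for the scalar model. The function name `backprop_factor` and the depth of 100 are assumptions made for illustration.

```python
def backprop_factor(w, n):
    """Gradient of y_n with respect to x in the scalar model.

    Start from dy_n/dy_n = 1 and multiply by w once per layer
    while walking back from layer n to the input.
    """
    g = 1.0
    for _ in range(n):
        g *= w
    return g


n = 100  # illustrative depth
factors = {w: backprop_factor(w, n) for w in (0.5, 1.0, 1.5)}
for w, g in factors.items():
    print(f"w={w}: dy_n/dx = {g:.3e}")
```

At this depth, the factor for `w = 0.5` is far too small for a typical learning rate to produce a visible update, while the factor for `w = 1.5` is large enough to make a single step diverge.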

<br>

---

## Summary of Effects

<br>

<img src="./images/414.png" width="500" style="display: block; margin: auto;">

<br>

| Weight magnitude | Activations | Gradients | Outcome |
|------------------|-------------|-----------|---------|
| $\lvert w \rvert < 1$ | Vanish | Vanish | Network is stable, but learns nothing |
| $w = 1$ | Stable | Stable | Ideal but unlikely |
| $\lvert w \rvert > 1$ | Explode | Explode | Training diverges to `inf` or `nan` |

<br>

---

## Handling Exploding Gradients

### General Info

Exploding gradients occur when gradients grow exponentially during backpropagation, often due to large weights or overly large learning rates.  
This leads to unstable training, and eventually `inf` or `nan` values in the weights.

- Vanishing/exploding gradients were first a major problem in **recurrent networks (RNNs)**
- They also occur in **very deep feedforward networks**
- Most modern frameworks (e.g., PyTorch) ship default initialization schemes that reduce the risk

<br>

<img src="./images/415.png" width="500" style="display: block; margin: auto;">

<br>

<br>

<img src="./images/416.png" width="500" style="display: block; margin: auto;">

<br>

### Key Symptoms

- Sudden spikes in loss or rapid increase in loss to `inf`
- Model output becomes `nan` or `inf`
- Network crashes or throws numerical errors

### Diagnosis

- **Plot gradient norms** and **weight norms** per layer
- Check for:
  - Extremely large values
  - Appearance of `inf` or `nan`
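- Try training with **learning rate = 0**:
  - The weights never change, so any fluctuation in the loss comes from batch-to-batch variation in the data, not from learning
  - If the loss stays finite at learning rate 0 but becomes `nan` with the real learning rate, the weight updates are responsible, and exploding gradients are the likely cause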
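
The per-layer check described above can be sketched as a small helper that scans a list of gradient norms. The helper `flag_layers` and its threshold `large=1e3` are hypothetical: the threshold is an arbitrary illustrative choice, and the norms here are generated from the scalar model rather than from a real network.

```python
import math


def flag_layers(grad_norms, large=1e3):
    """Return (layer_index, norm, reason) for suspicious layers.

    A layer is flagged when its gradient norm is inf/nan, or when it
    exceeds `large` (an arbitrary threshold chosen for illustration).
    """
    flagged = []
    for i, g in enumerate(grad_norms):
        if math.isnan(g) or math.isinf(g):
            flagged.append((i, g, "non-finite"))
        elif g > large:
            flagged.append((i, g, "large"))
    return flagged


# Scalar model with w = 3 and 10 layers: the gradient reaching layer i
# (0-indexed) has magnitude w^(n - 1 - i), so the earliest layers are largest.
w, n = 3.0, 10
grad_norms = [abs(w) ** (n - 1 - i) for i in range(n)]
flagged = flag_layers(grad_norms)
for i, g, reason in flagged:
    print(f"layer {i}: norm={g:.3g} ({reason})")
```

In a real training loop the same check would run on the norms collected from each layer's gradients after the backward pass.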

### Remedy

- Lower the **learning rate**
- Use **gradient clipping** (not covered yet)
- Check or improve **weight initialization**  
- Avoid very large initial weights or overly deep unregularized layers
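
Gradient clipping is covered later. As a preview, one common variant rescales the whole gradient vector when its L2 norm exceeds a limit. The sketch below is plain Python with a hypothetical helper name, and `max_norm=1.0` is an illustrative value. PyTorch exposes a similar operation on tensors as `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    Gradients whose norm is already within the limit are returned unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Clipping bounds gradients that are large but finite. Once a gradient is already `inf` or `nan`, rescaling cannot recover it, so clipping complements a lower learning rate and better initialization rather than replacing them.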

<br>

---

## Handling Vanishing Gradients

### General Info

Vanishing gradients occur when gradients shrink exponentially during backpropagation.  
As a result, early layers learn very slowly or not at all.

This is especially common in deep networks whose per-layer weight norms are below 1, because each layer then shrinks the gradient passing through it.

<br>

<img src="./images/417.png" width="500" style="display: block; margin: auto;">

<br>

<br>

<img src="./images/418.png" width="500" style="display: block; margin: auto;">

<br>

### Key Symptoms

- Loss decreases slightly at first, then plateaus
- Network output depends mainly on the bias terms of the final layers, not on the input
- Gradients in earlier layers approach zero
- No meaningful learning beyond shallow layers

### Diagnosis

- Plot **gradient norms** per layer: early layers will have gradients near zero
- Train with a **learning rate of 0** to measure the baseline loss fluctuation, then check whether the real run improves beyond it
- Compare weight norms and gradient norms across layers
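
In the scalar model, the per-layer gradient profile can be written down exactly: the gradient of $y_n$ with respect to the output of layer $k$ picks up one factor of $w$ for every layer between $k$ and $n$. The sketch below assumes `w = 0.5` and 100 layers, and the function name is hypothetical.

```python
def gradient_reaching_layer(w, n, k):
    """|d y_n / d y_k| in the scalar model: one factor of |w| per layer from k to n."""
    return abs(w) ** (n - k)


w, n = 0.5, 100  # illustrative values
profile = {k: gradient_reaching_layer(w, n, k) for k in (1, 25, 50, 75, 100)}
for k, g in profile.items():
    print(f"layer {k:3d}: gradient magnitude = {g:.3e}")
```

A plot of real per-layer gradient norms with this shape, shrinking steadily toward the input, is the signature of vanishing gradients.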

### Remedy

This affects all but the shallowest networks, so there is no single direct remedy.

- Change the **network structure**: plain (vanilla) deep networks will not work
- Slightly increase learning rate if gradients are extremely small
- Use **normalization techniques** (e.g., BatchNorm)
- Add **residual connections** to preserve gradient flow
- Adjust weight **initialization** (e.g., Xavier or He initialization)
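
To see why residual connections help, the scalar model can be extended with a skip path. This is a sketch under an assumption that goes beyond the notes so far: each layer computes $y_{k+1} = y_k + (w y_k + b)$, so the per-layer gradient factor becomes $1 + w$ instead of $w$. The function names and sample values are illustrative.

```python
def plain_grad(w, n):
    """dy_n/dx for n plain layers y <- w*y + b: one factor of w per layer."""
    return w ** n


def residual_grad(w, n):
    """dy_n/dx for n residual layers y <- y + (w*y + b): one factor of (1 + w) per layer."""
    return (1.0 + w) ** n


w, n = 0.01, 100  # small branch weight, illustrative depth
print(f"plain:    {plain_grad(w, n):.3e}")
print(f"residual: {residual_grad(w, n):.3e}")
```

With a small branch weight the plain network's gradient collapses toward zero, while the residual network's gradient stays of order one. A large `w` would still make $(1 + w)^n$ explode, which is one reason initialization and normalization remain relevant alongside residual connections.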

<br>

---

## Final Notes

<br>

<img src="./images/419.png" width="500" style="display: block; margin: auto;">

<br>

- Vanishing gradients: common, require architectural changes
- Exploding gradients: rarer, often fixed with learning rate reduction
- Solutions (covered next):
  - **Normalization**
  - **Residual connections**
  - Better initialization
