# Causes of Vanishing / Exploding Gradients

## The Core Problem

During backpropagation, the chain rule makes the gradient at an early layer a **product** of one factor per later layer. In deep networks, this repeated multiplication causes the gradient magnitude to either:
- **Shrink exponentially** → Vanishing gradients
- **Grow exponentially** → Exploding gradients
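
To make the product explicit, here is a sketch of the chain rule for a plain feed-forward stack. The symbols are assumptions introduced for illustration and do not come from the text above: `h_k` is the output of layer `k`, `W_k` its weight matrix, `\phi` the activation function, `L` the loss, and gradients are written as row vectors.

```latex
h_k = \phi(z_k), \qquad z_k = W_k h_{k-1}, \qquad k = 1, \dots, n

\frac{\partial L}{\partial h_0}
  = \frac{\partial L}{\partial h_n}
    \prod_{k=n}^{1} \operatorname{diag}\!\big(\phi'(z_k)\big)\, W_k

\left\lVert \frac{\partial L}{\partial h_0} \right\rVert
  \le \left\lVert \frac{\partial L}{\partial h_n} \right\rVert
      \prod_{k=1}^{n} \big\lVert \operatorname{diag}(\phi'(z_k)) \big\rVert \, \lVert W_k \rVert
```

If every factor in the bound is below 1, the bound shrinks geometrically with `n`, so the gradient must vanish. Exploding gradients are only possible when the factors exceed 1.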

---

## Main Causes

### 1. **Activation Function Choice**

| Function | Problem | Why |
|----------|---------|-----|
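| Sigmoid | Vanishing | Derivative is at most 0.25 (at x = 0) and approaches 0 as the output saturates toward 0 or 1 |
| Tanh | Vanishing | Derivative reaches 1 only at x = 0 and approaches 0 as the output saturates toward ±1 |
| ReLU | Can cause "dead neurons" | Derivative is 1 for positive inputs but exactly 0 for negative inputs, so a unit whose input stays negative stops learning |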

```
Sigmoid derivative: σ'(x) = σ(x)(1 - σ(x)) → max = 0.25 at x = 0
```
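
A minimal Python sketch of the sigmoid bound, using only the standard library. The function names are illustrative, and the 10-layer figure is a best-case upper bound that assumes every unit sits at the peak of the derivative and ignores the weights entirely.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)       # the maximum, exactly 0.25
saturated = sigmoid_grad(5.0)  # far smaller once the unit saturates

# Best case for the activation factors alone across 10 sigmoid layers:
# even at the peak, the product is 0.25 ** 10, below one in a million.
best_case_10_layers = peak ** 10

print(peak, saturated, best_case_10_layers)
```

In practice units are rarely all at x = 0, so the real product of derivatives is smaller still.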

### 2. **Poor Weight Initialization**

- **Weights too small** → each layer scales the backpropagated gradient down → Vanishing
- **Weights too large** → each layer scales the backpropagated gradient up → Exploding
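
The effect of the weight scale can be sketched with a toy experiment. The setup below is an assumption chosen for simplicity: a stack of square linear layers with no activation function, Gaussian weights, and a hypothetical helper `backprop_norm` that pushes a gradient vector backward through the stack. With a standard deviation of `1 / sqrt(width)` the expected squared gradient norm is preserved per layer, which is what Xavier initialization reduces to for square layers. Halving or doubling that scale shrinks or grows the gradient geometrically with depth.

```python
import math
import random


def backprop_norm(std: float, width: int = 32, depth: int = 20, seed: int = 0) -> float:
    """Norm of a gradient after backpropagating through `depth` random linear layers."""
    rng = random.Random(seed)
    grad = [1.0] * width
    for _ in range(depth):
        # Fresh weight matrix for each layer, entries drawn from N(0, std^2).
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        # Backward pass through a linear layer: grad <- W^T grad.
        grad = [sum(W[i][j] * grad[i] for i in range(width)) for j in range(width)]
    return math.sqrt(sum(g * g for g in grad))


width = 32
balanced = backprop_norm(1.0 / math.sqrt(width))  # variance-preserving scale
too_small = backprop_norm(0.5 / math.sqrt(width))  # half that scale
too_large = backprop_norm(2.0 / math.sqrt(width))  # double that scale

print(f"too small: {too_small:.3e}")
print(f"balanced:  {balanced:.3e}")
print(f"too large: {too_large:.3e}")
```

Because the same seed is reused, the three runs differ only in scale, so the gap between them is a factor of 2 per layer in each direction.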

### 3. **Network Depth**

```
Gradient ∝ (weight × activation_derivative)^n
```
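- If the per-layer factor |weight × activation_derivative| is < 1 → the gradient shrinks toward 0 as the number of layers n grows
- If the per-layer factor |weight × activation_derivative| is > 1 → the gradient grows without bound as n grows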
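
A short numeric illustration of how quickly this happens. It assumes, as the formula does, that every layer contributes the same scalar factor. The factors 0.9 and 1.1 are arbitrary example values.

```python
def gradient_scale(per_layer_factor: float, n_layers: int) -> float:
    """Magnitude of the gradient multiplier after n_layers identical factors."""
    return abs(per_layer_factor) ** n_layers


for n in (10, 50, 100):
    shrinking = gradient_scale(0.9, n)  # factor slightly below 1
    growing = gradient_scale(1.1, n)    # factor slightly above 1
    print(f"n={n:3d}  0.9^n={shrinking:.3e}  1.1^n={growing:.3e}")

# At 50 layers, 0.9^n is roughly 0.005 and 1.1^n is roughly 117.
```

Even a factor only 10% away from 1 changes the gradient by orders of magnitude within a few dozen layers.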

### 4. **Recurrent Neural Networks (RNNs)**

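- The same recurrent weight matrix is applied at every time step, so the gradient through T steps contains T factors of it
- Long sequences mean many such factors, which makes the problem severe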
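
The sketch below illustrates this with a deliberately simplified recurrent model. The assumptions are a linear recurrence `h_t = W h_{t-1}` with no activation, a 2x2 weight matrix that is a scaled rotation (so every step changes the gradient norm by exactly `scale`), and a hypothetical helper name `rnn_grad_norm`. Real RNNs also multiply by activation derivatives at each step, which this toy omits.

```python
import math


def rnn_grad_norm(scale: float, steps: int, theta: float = 0.3) -> float:
    """Norm of a unit gradient after backpropagating `steps` time steps."""
    c, s = math.cos(theta), math.sin(theta)
    # Shared recurrent matrix: a rotation by theta, scaled by `scale`.
    W = [[scale * c, -scale * s],
         [scale * s, scale * c]]
    grad = [1.0, 0.0]
    for _ in range(steps):
        # The same W is reused at every time step: grad <- W^T grad.
        grad = [W[0][0] * grad[0] + W[1][0] * grad[1],
                W[0][1] * grad[0] + W[1][1] * grad[1]]
    return math.hypot(grad[0], grad[1])


for scale in (0.9, 1.0, 1.1):
    print(f"scale={scale}: norm after 100 steps = {rnn_grad_norm(scale, 100):.3e}")
```

For a general recurrent matrix the long-run behavior is governed by its largest eigenvalue magnitude rather than a single scale, but the geometric shrinkage or growth over time steps is the same.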

---

## Solutions

| Problem | Solutions |
|---------|-----------|
| **Vanishing** | ReLU/Leaky ReLU, proper initialization (Xavier/He), Batch Normalization, Skip connections (ResNets), LSTM/GRU for RNNs |
| **Exploding** | Gradient clipping, proper initialization, lower learning rate, Batch Normalization |
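
Of the remedies listed, gradient clipping is the simplest to show in isolation. The following is a minimal sketch of clipping by global norm, written against a flat list of floats rather than any particular framework. The function name and the threshold of 1.0 are illustrative assumptions.

```python
import math


def clip_by_global_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale grads so their L2 norm is at most max_norm, keeping the direction."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm or total == 0.0:
        return list(grads)  # already within the limit, leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]


exploding = [3.0, 4.0]                        # L2 norm is 5
clipped = clip_by_global_norm(exploding, 1.0)  # rescaled to norm 1
small = clip_by_global_norm([0.3, 0.4], 1.0)   # norm 0.5, returned unchanged

print(clipped, small)
```

Clipping only caps the update size. It does not address vanishing gradients, which is why the table lists it under exploding gradients only.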

---