# Week 5: Backpropagation - The Engine of Deep Learning
**IME775: Data Driven Modeling and Optimization**
📖 **Reference**: Krishnendu Chaudhury. *Math and Architectures of Deep Learning*, Chapter 6
---
## Learning Objectives
- Understand backpropagation as reverse-mode automatic differentiation
- Derive gradients for common layers
- Implement backpropagation from scratch
- Verify gradients numerically


In [None]:
import numpy as np
import matplotlib.pyplot as plt

## 5.1 Why Backpropagation?
**Problem**: Computing gradients for millions of parameters
**Naive approach**: Numerical gradients with central differences require $2n$ forward passes for $n$ parameters
**Solution**: Backpropagation computes all $n$ gradients with one forward pass and one backward pass!
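
To make the $2n$ count concrete, here is a minimal sketch that counts loss evaluations for a central-difference gradient. The quadratic `toy_loss`, the `calls` counter, and the helper name `central_difference_gradient` are illustrative assumptions, not part of the course code.

```python
import numpy as np

def central_difference_gradient(loss_fn, theta, eps=1e-5):
    """Estimate d(loss)/d(theta) with two loss evaluations per parameter."""
    theta = theta.astype(float).copy()  # work on a private 1-D copy
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        original = theta[i]
        theta[i] = original + eps
        loss_plus = loss_fn(theta)
        theta[i] = original - eps
        loss_minus = loss_fn(theta)
        theta[i] = original  # restore the parameter
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

calls = {"count": 0}

def toy_loss(theta):
    """Assumed toy loss 0.5 * ||theta||^2; its exact gradient is theta."""
    calls["count"] += 1
    return 0.5 * np.sum(theta ** 2)

theta = np.linspace(-1.0, 1.0, 50)
numerical = central_difference_gradient(toy_loss, theta)
print(f"{theta.size} parameters -> {calls['count']} loss evaluations")
# 50 parameters -> 100 loss evaluations
```

Each evaluation of `toy_loss` stands in for one forward pass, so the count grows linearly with the number of parameters.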


In [None]:
# Demonstrate computational cost
def numerical_gradient_cost(n_params, n_samples):
    forward_passes = 2 * n_params  # Central difference

## 5.2 Computational Graphs
Neural network computation can be represented as a **directed acyclic graph (DAG)**:
- Nodes: Operations or variables
- Edges: Data flow
**Forward pass**: Compute outputs (left → right)
**Backward pass**: Compute gradients (right → left)
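
As a small illustration of the two passes, the sketch below evaluates a two-node graph by hand. The function $f(x, y) = (x + y) \cdot y$ and the input values are assumptions chosen only because $y$ feeds two nodes, so its gradient has to be accumulated over both paths.

```python
# Graph for f(x, y) = (x + y) * y (illustrative choice, not from the course code).
x, y = 2.0, 3.0

# Forward pass (left -> right): evaluate nodes in topological order.
a = x + y  # node "add"
f = a * y  # node "mul"

# Backward pass (right -> left): visit nodes in reverse order.
df_df = 1.0           # gradient of the output with respect to itself
df_da = df_df * y     # mul node: d(a * y)/da = y
df_dy = df_df * a     # mul node: d(a * y)/dy = a (first path into y)
df_dx = df_da * 1.0   # add node: d(x + y)/dx = 1
df_dy += df_da * 1.0  # add node: d(x + y)/dy = 1 (second path into y)

print(f, df_dx, df_dy)
# 15.0 3.0 8.0
```

The result agrees with differentiating $f = xy + y^2$ directly: $\partial f/\partial x = y = 3$ and $\partial f/\partial y = x + 2y = 8$.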


## 5.3 The Chain Rule in Action
For $L = f(g(h(x)))$:
$$\frac{\partial L}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$
Each node:
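1. Computes its **local gradient** (the derivative of its own operation with respect to its inputs)
2. Multiplies it by the **upstream gradient** (the gradient of $L$ with respect to the node's output) and passes the product to its inputs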
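
The two steps above can be sketched on the scalar network used in this section, `y = sigmoid(w2 * relu(w1 * x + b1) + b2)`. The squared-error loss $L = \frac{1}{2}(y - t)^2$, the parameter values, and the function name `forward_backward` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, t, w1, b1, w2, b2):
    """Return the loss and its gradients for a one-hidden-unit network."""
    # Forward pass: cache every intermediate value.
    z1 = w1 * x + b1
    h = max(0.0, z1)            # ReLU
    z2 = w2 * h + b2
    y = sigmoid(z2)
    loss = 0.5 * (y - t) ** 2   # assumed squared-error loss

    # Backward pass: local gradient times upstream gradient at each node.
    dL_dy = y - t
    dL_dz2 = dL_dy * y * (1.0 - y)              # sigmoid'(z2) = y * (1 - y)
    dL_dw2 = dL_dz2 * h
    dL_db2 = dL_dz2
    dL_dh = dL_dz2 * w2
    dL_dz1 = dL_dh * (1.0 if z1 > 0 else 0.0)   # ReLU gate
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1
    grads = {"w1": dL_dw1, "b1": dL_db1, "w2": dL_dw2, "b2": dL_db2}
    return loss, grads

params = {"w1": 0.5, "b1": 0.1, "w2": -0.3, "b2": 0.2}
loss, grads = forward_backward(x=2.0, t=1.0, **params)
for name, grad in grads.items():
    print(f"dL/d{name} = {grad:.6f}")
```

Every backward line reuses a value cached in the forward pass, which is why a single backward sweep is enough to obtain all four gradients.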


In [None]:
# Visualize backpropagation on a simple network
# y = sigmoid(w2 * relu(w1 * x + b1) + b2)
class ComputeNode:
    def __init__(self, name):
        self.name = name
        self.output = None
        self.grad = None
    def __repr__(self):

In [None]:
# Print the gradients
print("Gradients computed via backpropagation:")
print("-" * 40)
for name, grad in grads.items():
    print(f"∂L/∂{name} = {grad:.6f}")

## 5.4 Layer-by-Layer Gradients
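### Linear Layer: $z = Wx + b$
- $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T$
- $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z}$
- $\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial z}$

For a batch stored as rows, $Z = XW + b$ with $W$ of shape (in_features, out_features) as in the `Linear` class in this section, the same results read $\frac{\partial L}{\partial W} = X^T \cdot \frac{\partial L}{\partial Z}$, $\frac{\partial L}{\partial b} = \sum_{\text{batch}} \frac{\partial L}{\partial Z}$, and $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} \cdot W^T$.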
### ReLU: $h = \max(0, z)$
- $\frac{\partial L}{\partial z} = \frac{\partial L}{\partial h} \cdot \mathbb{1}[z > 0]$
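
The layer gradients can be sketched as two small functions. This sketch assumes the batched row-vector form `z = x @ W + b`, matching the (in_features, out_features) weight shape of the `Linear` class in this section. The function names and the toy loss $L = \sum h$ are illustrative assumptions.

```python
import numpy as np

def linear_backward(x, W, grad_z):
    """Gradients for z = x @ W + b, with x: (batch, in) and W: (in, out)."""
    grad_W = x.T @ grad_z        # (in, out), same shape as W
    grad_b = grad_z.sum(axis=0)  # (out,), summed over the batch
    grad_x = grad_z @ W.T        # (batch, in), same shape as x
    return grad_x, grad_W, grad_b

def relu_backward(z, grad_h):
    """Pass the upstream gradient only where the pre-activation was positive."""
    return grad_h * (z > 0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
b = np.zeros(2)

z = x @ W + b
h = np.maximum(0.0, z)

grad_h = np.ones_like(h)  # upstream gradient of the assumed loss L = sum(h)
grad_z = relu_backward(z, grad_h)
grad_x, grad_W, grad_b = linear_backward(x, W, grad_z)
print(grad_x.shape, grad_W.shape, grad_b.shape)
# (4, 3) (3, 2) (2,)
```

A useful sanity check is that each gradient has the same shape as the quantity it differentiates with respect to.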


In [None]:
# Full backprop implementation
class Layer:
    def forward(self, x):
        raise NotImplementedError
    def backward(self, grad_output):
        raise NotImplementedError
class Linear(Layer):
    def __init__(self, in_features, out_features):
        # He initialization
        self.W = np.random.randn(in_features, out_features) * np.sqrt(2.0 / in_features)
        self.b = np.zeros(out_features)
        self.grad_W = None
        self.grad_b = None
    def forward(self, x):
        self.x = x  # Cache for backward

In [None]:
# Test the implementation
class SimpleNet:
    def __init__(self, layer_sizes):
        self.layers = []
        for i in range(len(layer_sizes) - 1):
            self.layers.append(Linear(layer_sizes[i], layer_sizes[i+1]))
            if i < len(layer_sizes) - 2:  # No activation after last layer
                self.layers.append(ReLU())
    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x  # assumed completion: return the final layer's output

## 5.5 Gradient Checking
**Always verify analytical gradients numerically!**
$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$$
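The relative error $\|g_a - g_n\| / (\|g_a\| + \|g_n\|)$ between the analytical gradient $g_a$ and the numerical gradient $g_n$ should be $< 10^{-5}$.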
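
The check above can be sketched independently of any layer class. The cubic test loss and the helper names `numerical_gradient` and `relative_error` are assumptions for illustration. The deliberately wrong gradient shows what a failed check looks like.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Central-difference estimate of df/dtheta for an array of any shape."""
    theta = theta.astype(float).copy()
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        original = theta[idx]
        theta[idx] = original + eps
        f_plus = f(theta)
        theta[idx] = original - eps
        f_minus = f(theta)
        theta[idx] = original  # restore the entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-8)

def loss(theta):
    return np.sum(theta ** 3)  # assumed toy loss; exact gradient is 3 * theta^2

rng = np.random.default_rng(0)
theta = rng.standard_normal((3, 4))

numerical = numerical_gradient(loss, theta)
correct = 3 * theta ** 2
buggy = 2 * theta ** 2  # intentionally wrong analytical gradient

err_correct = relative_error(correct, numerical)
err_buggy = relative_error(buggy, numerical)
print(f"correct gradient: relative error = {err_correct:.2e}")
print(f"buggy gradient:   relative error = {err_buggy:.2e}")
```

A correct gradient lands far below the $10^{-5}$ threshold, while the buggy one gives a relative error of order $10^{-1}$.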


In [None]:
# Gradient checking
def gradient_check(layer, x, upstream_grad, eps=1e-5):
    """Check if analytical gradient matches numerical gradient."""
    # Analytical gradient
    output = layer.forward(x)
    layer.backward(upstream_grad)
    results = []
    if hasattr(layer, 'W'):
        # Check W gradient
        analytical_W = layer.grad_W.copy()
        numerical_W = np.zeros_like(layer.W)
        for i in range(layer.W.shape[0]):
            for j in range(layer.W.shape[1]):
                original = layer.W[i, j]
                layer.W[i, j] = original + eps
                out_plus = layer.forward(x)
                loss_plus = np.sum(out_plus * upstream_grad)
                layer.W[i, j] = original - eps
                out_minus = layer.forward(x)
                loss_minus = np.sum(out_minus * upstream_grad)
                layer.W[i, j] = original
                numerical_W[i, j] = (loss_plus - loss_minus) / (2 * eps)
        rel_error_W = np.linalg.norm(analytical_W - numerical_W) / (
            np.linalg.norm(analytical_W) + np.linalg.norm(numerical_W) + 1e-8)
        results.append(('W', rel_error_W))
    return results  # assumed completion: list of (parameter name, relative error) pairs

## 5.6 Vanishing and Exploding Gradients
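In a deep network with $N$ layers, gradients can become very small (vanishing) or very large (exploding), because the gradient at the first layer is a product of one Jacobian per layer:
$$\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial h^{(N)}} \left( \prod_{l=2}^{N} \frac{\partial h^{(l)}}{\partial h^{(l-1)}} \right) \frac{\partial h^{(1)}}{\partial W^{(1)}}$$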
**Solutions**: ReLU, proper initialization, skip connections, normalization
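
The effect of the activation choice can be sketched by backpropagating a unit-norm gradient through a stack of randomly initialized layers. The width, depth, seed, and weight scaling are assumptions: He scaling for ReLU and $1/\sqrt{\text{width}}$ for sigmoid. The helper name `backward_norms` is hypothetical.

```python
import numpy as np

def backward_norms(n_layers, activation, width=64, seed=0):
    """Return the gradient norm after each layer of a backward pass."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)
    cache = []
    # Forward pass: store each weight matrix and local activation derivative.
    for _ in range(n_layers):
        scale = np.sqrt(2.0 / width) if activation == "relu" else np.sqrt(1.0 / width)
        W = rng.standard_normal((width, width)) * scale
        z = W @ h
        if activation == "relu":
            h = np.maximum(0.0, z)
            local = (z > 0).astype(float)
        else:  # sigmoid
            h = 1.0 / (1.0 + np.exp(-z))
            local = h * (1.0 - h)  # at most 0.25
        cache.append((W, local))
    # Backward pass: apply the chain rule from the last layer to the first.
    g = np.ones(width) / np.sqrt(width)  # unit-norm upstream gradient
    norms = [np.linalg.norm(g)]
    for W, local in reversed(cache):
        g = W.T @ (g * local)
        norms.append(np.linalg.norm(g))
    return norms

sigmoid_norms = backward_norms(20, "sigmoid")
relu_norms = backward_norms(20, "relu")
print(f"sigmoid: gradient norm after 20 layers = {sigmoid_norms[-1]:.2e}")
print(f"relu:    gradient norm after 20 layers = {relu_norms[-1]:.2e}")
```

Because the sigmoid derivative never exceeds 0.25, each layer shrinks the gradient, while ReLU with He scaling keeps its norm at roughly the same order of magnitude.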


In [None]:
# Demonstrate vanishing gradients with sigmoid vs ReLU
def simulate_gradient_flow(n_layers, activation='sigmoid'):
    np.random.seed(42)
    gradients = [1.0]  # Start with upstream gradient of 1
    for l in range(n_layers):
        # Random weights
        W = np.random.randn(100, 100) * 0.1
        # Random activations (for derivative computation)
        if activation == 'sigmoid':
            h = 1 / (1 + np.exp(-np.random.randn(100)))
            derivative = np.mean(h * (1 - h))  # sigmoid derivative
        else:  # relu
            z = np.random.randn(100)
            derivative = np.mean(z > 0)  # relu derivative
        # Gradient magnitude after this layer
        grad_magnitude = gradients[-1] * np.abs(W).mean() * derivative
        gradients.append(grad_magnitude)
    return gradients  # assumed completion: initial 1.0 followed by one magnitude per layer

## Summary
| Concept | Key Point |
|---------|-----------|
| **Backpropagation** | Compute all gradients in one backward pass |
| **Chain Rule** | Multiply local gradient by upstream gradient |
| **Gradient Checking** | Verify analytical vs numerical gradients |
| **Vanishing Gradients** | Mitigate with ReLU, proper initialization, skip connections, normalization |
---
## References
- **Primary**: Krishnendu Chaudhury. *Math and Architectures of Deep Learning*, Chapter 6.
- **Classic**: Rumelhart, Hinton & Williams (1986). "Learning representations by back-propagating errors."
## Connection to ML Refined Curriculum
Backpropagation supplies the gradients used by:
- The gradient descent methods (Weeks 2-3)
- Any differentiable supervised learning model (Weeks 4-8)
