# Lesson 3A: Neural Networks Theory

<a name="introduction"></a>
## Introduction

Neural networks can be understood by thinking about how you learned to recognize handwritten digits as a child.

When you first saw the number '7', you didn't memorize every possible way to write it. Instead, your brain learned patterns: a horizontal line at the top, a diagonal stroke going down-right, sometimes with a small horizontal dash through the middle. After seeing dozens of examples - some neat, some messy, some stylized - your brain built an internal representation that could recognize '7' even when written by someone you'd never met.

Neural networks learn in a similar way: they build up hierarchical patterns from data. The first layer might detect simple edges and curves. The next layer might combine these into more complex shapes like circles or corners. Deeper layers recognize complete digits by combining these shapes. After training on thousands of examples, the network can recognize handwritten digits it has never seen before.

In this lesson, we'll:

1. Understand the theory behind neural networks and how they differ from logistic regression
2. Build a multi-layer neural network from scratch to deeply understand each component
3. Implement forward propagation and backpropagation by hand
4. Apply it to the MNIST handwritten digit dataset
5. Visualize what the network learns and how it makes decisions

Then in the next lesson (3B), we'll:

1. Use PyTorch to implement the same network more efficiently
2. Examine modern architectures and optimization techniques
3. Learn best practices for training deep neural networks in production


## Table of contents

1. [Introduction](#introduction)
2. [Required libraries](#required-libraries)
3. [What is a neural network?](#what-is-a-neural-network)
4. [From logistic regression to neural networks](#from-logistic-regression-to-neural-networks)
   - [The limitation of linear models](#the-limitation-of-linear-models)
   - [Adding hidden layers](#adding-hidden-layers)
   - [Why multiple layers matter](#why-multiple-layers-matter)
5. [The building blocks of neural networks](#the-building-blocks-of-neural-networks)
   - [The artificial neuron](#the-artificial-neuron)
   - [Activation functions](#activation-functions)
   - [Why we need non-linearity](#why-we-need-non-linearity)
   - [Common activation functions](#common-activation-functions)
6. [Forward propagation: Making predictions](#forward-propagation-making-predictions)
   - [Single neuron example](#single-neuron-example)
   - [Full network example](#full-network-example)
   - [Implementing forward propagation](#implementing-forward-propagation)
7. [The loss function: Measuring error](#the-loss-function-measuring-error)
   - [Cross-entropy loss for classification](#cross-entropy-loss-for-classification)
   - [Understanding the loss landscape](#understanding-the-loss-landscape)
8. [Backpropagation: Learning from mistakes](#backpropagation-learning-from-mistakes)
   - [The chain rule intuition](#the-chain-rule-intuition)
   - [Computing gradients layer by layer](#computing-gradients-layer-by-layer)
   - [The calculus of backpropagation](#the-calculus-of-backpropagation)
   - [Implementing backpropagation](#implementing-backpropagation)
9. [Gradient descent: Updating the weights](#gradient-descent-updating-the-weights)
   - [Batch vs mini-batch vs stochastic](#batch-vs-mini-batch-vs-stochastic)
   - [Learning rate and convergence](#learning-rate-and-convergence)
10. [Building a neural network from scratch](#building-a-neural-network-from-scratch)
    - [Network architecture](#network-architecture)
    - [Complete implementation](#complete-implementation)
    - [Training loop](#training-loop)
11. [Training on MNIST: Recognizing handwritten digits](#training-on-mnist-recognizing-handwritten-digits)
    - [Loading and exploring the dataset](#loading-and-exploring-the-dataset)
    - [Preprocessing the data](#preprocessing-the-data)
    - [Training the network](#training-the-network)
    - [Visualizing the training process](#visualizing-the-training-process)
12. [Evaluating our network](#evaluating-our-network)
    - [Accuracy and confusion matrix](#accuracy-and-confusion-matrix)
    - [Analyzing mistakes](#analyzing-mistakes)
    - [Visualizing learned features](#visualizing-learned-features)
13. [Understanding what the network learned](#understanding-what-the-network-learned)
    - [First layer: Edge detectors](#first-layer-edge-detectors)
    - [Hidden layer activations](#hidden-layer-activations)
    - [Output layer: Digit probabilities](#output-layer-digit-probabilities)
14. [Common challenges and solutions](#common-challenges-and-solutions)
    - [Overfitting and underfitting](#overfitting-and-underfitting)
    - [Vanishing and exploding gradients](#vanishing-and-exploding-gradients)
    - [Initialization strategies](#initialization-strategies)
15. [Conclusion: Our guide to neural networks](#conclusion-our-journey-through-neural-networks)
    - [Looking ahead to lesson 3B](#looking-ahead-to-lesson-3b)
    - [Further reading](#further-reading)


<a name="required-libraries"></a>
## Required libraries

This lesson uses the following libraries:
<table style="margin-left:0">
<tr>
<th align="left">Library</th>
<th align="left">Purpose</th>
</tr>
<tr>
<td>Pandas</td>
<td>Data tables and data manipulation</td>
</tr>
<tr>
<td>Numpy</td>
<td>Numerical computing and matrix operations</td>
</tr>
<tr>
<td>Matplotlib</td>
<td>Plotting and visualization</td>
</tr>
<tr>
<td>Seaborn</td>
<td>Statistical visualization</td>
</tr>
<tr>
<td>Scikit-learn</td>
<td>Dataset loading, preprocessing, and evaluation metrics</td>
</tr>
<tr>
<td>Typing</td>
<td>Type hints for better code documentation</td>
</tr>
</table>

In [None]:
# Standard library imports
from typing import List, Tuple, Dict, Optional
import warnings
warnings.filterwarnings('ignore')

# Third party imports
import numpy as np
import pandas as pd
from numpy.typing import NDArray

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report
)

# Set random seeds for reproducibility
np.random.seed(42)

# Configure plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

<a name="what-is-a-neural-network"></a>
## What is a neural network?

A neural network is a computational model inspired by how biological neurons work in the brain. Just as your brain contains billions of interconnected neurons that process information, an artificial neural network consists of layers of interconnected artificial neurons that learn to recognize patterns.

Formally, a neural network is a function that:
1. Takes input data (like pixel values of an image)
2. Passes it through multiple layers of processing
3. Produces an output (like "this is a 7" or "this is cancer")

What makes neural networks powerful is their ability to learn **hierarchical representations**:
- **Layer 1** might detect basic edges and curves in an image
- **Layer 2** combines edges into shapes like circles or corners
- **Layer 3** combines shapes into parts of digits
- **Output Layer** combines parts into complete digit recognition

This hierarchical learning happens automatically during training - you don't manually specify what each layer should detect. The network learns the most useful representations for the task through examples.

Let's build this understanding step by step, starting from what we already know: logistic regression.

<a name="from-logistic-regression-to-neural-networks"></a>
## From logistic regression to neural networks

In Lesson 1, we learned about logistic regression. Let's recall how it works and see why we need something more powerful.

<a name="the-limitation-of-linear-models"></a>
### The limitation of linear models

Logistic regression computes a weighted sum of inputs and passes it through a sigmoid function:

### $z = w_1x_1 + w_2x_2 + ... + w_nx_n + b$
### $\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$

This works well for **linearly separable** problems - cases where you can draw a straight line (or hyperplane) to separate classes.

But what about problems that aren't linearly separable? Consider the XOR problem:

| x₁ | x₂ | Output |
|----|----|--------|
| 0  | 0  | 0      |
| 0  | 1  | 1      |
| 1  | 0  | 1      |
| 1  | 1  | 0      |

Try as you might, you cannot draw a single straight line to separate the 1s from the 0s. Logistic regression will fail on this problem.
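A quick brute-force check makes this concrete. The sketch below is illustrative only: it tries every linear decision rule $w_1x_1 + w_2x_2 + b > 0$ on a coarse grid (the grid range and step are arbitrary choices, not part of the lesson) and records the best accuracy it finds on the four XOR points.

```python
import itertools

import numpy as np

# The four XOR points and their labels
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Hypothetical search grid for w1, w2 and b (chosen only for illustration)
grid = np.linspace(-2, 2, 21)

best_accuracy = 0.0
for w1, w2, b in itertools.product(grid, repeat=3):
    # A linear classifier predicts 1 on one side of the line and 0 on the other
    predictions = (X_xor @ np.array([w1, w2]) + b > 0).astype(int)
    best_accuracy = max(best_accuracy, float(np.mean(predictions == y_xor)))

print(f"Best accuracy of any linear rule on the grid: {best_accuracy:.2f}")  # 0.75
```

No setting gets more than three of the four points right, which is what "not linearly separable" means in practice.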

**Real-world implications:**
- Recognizing handwritten digits requires detecting curves, intersections, and spatial relationships
- Diagnosing diseases requires understanding complex interactions between symptoms
- Most interesting problems involve non-linear patterns

We need a model that can learn **non-linear decision boundaries**.

<a name="adding-hidden-layers"></a>
### Adding hidden layers

The key insight: **stack multiple layers of logistic-regression-like units, each followed by a non-linear activation function**.

A simple neural network with one hidden layer looks like this:

```
Input Layer → Hidden Layer → Output Layer
    x₁  \      / h₁  \      / ŷ₁
    x₂  ─────→  h₂  ─────→  ŷ₂
    x₃  /      \ h₃  /      \ ŷ₃
```

**Forward propagation through the network:**

1. **Input to Hidden Layer:**
   - For each hidden neuron $j$: $z_j^{[1]} = \sum_i w_{ij}^{[1]} x_i + b_j^{[1]}$
   - Apply activation: $h_j = \sigma(z_j^{[1]})$

2. **Hidden to Output Layer:**
   - For each output neuron $k$: $z_k^{[2]} = \sum_j w_{jk}^{[2]} h_j + b_k^{[2]}$
   - Apply activation: $\hat{y}_k = \sigma(z_k^{[2]})$

The superscripts $[1]$ and $[2]$ indicate which layer's parameters we're using.
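To see these equations handle a problem that a single line cannot, here is a minimal sketch of the forward pass above with hand-picked (not learned) weights that compute XOR. The weight values and the names `W1_xor`, `b1_xor`, `W2_xor`, `b2_xor` are assumptions made purely for illustration: the two hidden units behave roughly like OR and NAND, and the output unit like AND.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked parameters (illustrative assumption, not trained values).
# W1_xor[i, j] is the weight from input i to hidden neuron j, matching w_ij above.
W1_xor = np.array([[20.0, -20.0],
                   [20.0, -20.0]])
b1_xor = np.array([-10.0, 30.0])   # hidden unit 1 ~ OR, hidden unit 2 ~ NAND
W2_xor = np.array([[20.0],
                   [20.0]])
b2_xor = np.array([-30.0])         # output unit ~ AND of the two hidden units

# Input -> hidden: z_j = sum_i w_ij x_i + b_j, then h_j = sigmoid(z_j)
h_xor = sigmoid(X_xor @ W1_xor + b1_xor)
# Hidden -> output: z_k = sum_j w_jk h_j + b_k, then y_hat_k = sigmoid(z_k)
y_hat_xor = sigmoid(h_xor @ W2_xor + b2_xor)

print(np.round(y_hat_xor.ravel(), 3))  # [0. 1. 1. 0.]
```

A trained network would find its own weights; the point here is only that one hidden layer is enough to represent XOR.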

<a name="why-multiple-layers-matter"></a>
### Why multiple layers matter

**The Universal Approximation Theorem** states that a neural network with:
- Just one hidden layer
- Enough neurons
- A suitable non-linear activation function (such as sigmoid or ReLU)

can approximate **any continuous function** on a bounded (compact) input domain to arbitrary accuracy.

This is a remarkable theoretical result, but in practice:
- Deeper networks (more layers) often need **fewer total neurons** to achieve the same accuracy
- Deeper networks can learn **more efficient representations** of hierarchical patterns
- Modern deep learning uses networks with dozens or even hundreds of layers

**Intuition:** 
- Shallow network: Might need millions of neurons to memorize every possible handwritten '7'
- Deep network: Layer 1 learns edges, Layer 2 learns shapes, Layer 3 combines them - much more efficient!

Now let's understand the fundamental building block: the artificial neuron.

<a name="the-building-blocks-of-neural-networks"></a>
## The building blocks of neural networks

<a name="the-artificial-neuron"></a>
### The artificial neuron

An artificial neuron (also called a unit; the classic *perceptron* is an early version that uses a step function as its activation) is the basic computational unit in a neural network. It performs two simple operations:

**1. Weighted Sum (Linear Combination):**
### $z = \sum_{i=1}^{n} w_i x_i + b = w^T x + b$

Where:
- $x_i$ are the inputs (features or outputs from previous layer)
- $w_i$ are the weights (learned parameters)
- $b$ is the bias term (also learned)

**2. Activation Function:**
### $a = f(z)$

Where $f$ is a non-linear function (we'll examine these next).

**Biological inspiration:**
- Biological neurons receive inputs through dendrites
- They sum up these signals
- If the sum exceeds a threshold, the neuron "fires" (sends a signal)
- Artificial neurons mimic this behavior with weighted sums and activation functions

**Visual representation:**
```
     x₁ ──w₁──┐
     x₂ ──w₂──┤
     x₃ ──w₃──┼──→ Σ (z) ──→ f(z) ──→ a (output)
        ...   │
     xₙ ──wₙ──┘
        + b
```


<a name="activation-functions"></a>
### Activation functions

<a name="why-we-need-non-linearity"></a>
#### Why we need non-linearity

Here's a crucial insight: **without non-linear activation functions, a deep neural network is no more powerful than a single linear layer** (and, with a sigmoid on the output, no more powerful than logistic regression).

Why? Because stacking linear functions just gives you another linear function:

If we had two layers with just linear operations:
- Layer 1: $h = W^{[1]}x + b^{[1]}$
- Layer 2: $y = W^{[2]}h + b^{[2]}$

Substituting:
- $y = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]}$
- $y = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]})$
- $y = W_{combined}x + b_{combined}$

This is just a single linear layer! We gain nothing from depth.

**Non-linear activation functions** break this collapse, allowing each layer to learn genuinely new representations.
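The collapse derived above is easy to confirm numerically. The sketch below uses random parameters with arbitrary toy shapes (3 inputs, 4 hidden units, 2 outputs); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear layers with random parameters (toy sizes chosen arbitrarily)
W1_lin, b1_lin = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2_lin, b2_lin = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2
x_lin = rng.normal(size=3)

# Pass x through both layers in sequence, with no activation in between
two_layers = W2_lin @ (W1_lin @ x_lin + b1_lin) + b2_lin

# Collapse into one layer: W_combined = W2 W1, b_combined = W2 b1 + b2
W_combined = W2_lin @ W1_lin
b_combined = W2_lin @ b1_lin + b2_lin
one_layer = W_combined @ x_lin + b_combined

print(np.allclose(two_layers, one_layer))  # True
```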

<a name="common-activation-functions"></a>
#### Common activation functions

Let's examine the most commonly used activation functions:

**1. Sigmoid Function** (what we used in logistic regression)
### $\sigma(z) = \frac{1}{1 + e^{-z}}$

- **Range:** (0, 1)
- **Use case:** Output layer for binary classification
- **Pros:** Smooth, differentiable, outputs probabilities
- **Cons:** Vanishing gradients (derivatives very small for large |z|), not zero-centered

**2. Hyperbolic Tangent (tanh)**
### $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{2}{1 + e^{-2z}} - 1$

- **Range:** (-1, 1)
- **Use case:** Hidden layers (better than sigmoid)
- **Pros:** Zero-centered (unlike sigmoid), smooth
- **Cons:** Still has vanishing gradient problem

**3. ReLU (Rectified Linear Unit)** - Most popular!
### $\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$

- **Range:** [0, ∞)
- **Use case:** Hidden layers in deep networks
- **Pros:** 
  - Computationally efficient
  - No vanishing gradient for positive values
  - Networks train faster
  - Sparsity (many neurons output 0)
- **Cons:** 
  - "Dying ReLU" problem (neurons can get stuck at 0)
  - Not zero-centered

**4. Leaky ReLU** (fixes dying ReLU)
### $\text{LeakyReLU}(z) = \max(0.01z, z) = \begin{cases} z & \text{if } z > 0 \\ 0.01z & \text{if } z \leq 0 \end{cases}$

- Small negative slope (0.01) prevents neurons from dying

**5. Softmax** (for multi-class classification output)
### $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$

- **Range:** (0, 1) and all outputs sum to 1
- **Use case:** Output layer for multi-class classification
- **Interpretation:** Converts scores into probabilities
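The vanishing-gradient remarks in the list above can be made concrete by evaluating each derivative at a few inputs. The sketch below uses the standard closed-form derivatives, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\tanh'(z) = 1 - \tanh^2(z)$; the helper names and the test points are illustrative choices.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, else 0 (0 is used at z = 0)."""
    return (np.asarray(z) > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: 1 where z > 0, else alpha."""
    return np.where(np.asarray(z) > 0, 1.0, alpha)

z_points = np.array([-10.0, 0.0, 10.0])
for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                   ("ReLU", relu_grad), ("Leaky ReLU", leaky_relu_grad)]:
    print(f"{name:<11} gradient at z = -10, 0, 10: {grad(z_points)}")
```

The sigmoid gradient peaks at 0.25 (at $z = 0$), and both sigmoid and tanh gradients shrink toward zero as $|z|$ grows, whereas the ReLU gradient stays at 1 for any positive input.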

Let's visualize these activation functions:


In [None]:
def sigmoid(z: NDArray) -> NDArray:
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

def tanh(z: NDArray) -> NDArray:
    """Hyperbolic tangent activation function."""
    return np.tanh(z)

def relu(z: NDArray) -> NDArray:
    """ReLU activation function."""
    return np.maximum(0, z)

def leaky_relu(z: NDArray, alpha: float = 0.01) -> NDArray:
    """Leaky ReLU activation function."""
    return np.where(z > 0, z, alpha * z)

# Generate input values
z = np.linspace(-5, 5, 100)

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Common Activation Functions', fontsize=16, fontweight='bold')

# Plot sigmoid
axes[0, 0].plot(z, sigmoid(z), 'b-', linewidth=2, label='sigmoid(z)')
axes[0, 0].axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
axes[0, 0].axvline(x=0, color='r', linestyle='--', alpha=0.5)
axes[0, 0].set_xlabel('z', fontsize=12)
axes[0, 0].set_ylabel('σ(z)', fontsize=12)
axes[0, 0].set_title('Sigmoid: σ(z) = 1/(1 + e⁻ᶻ)', fontsize=13)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].legend(fontsize=11)
axes[0, 0].text(2, 0.2, 'Range: (0, 1)', fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Plot tanh
axes[0, 1].plot(z, tanh(z), 'g-', linewidth=2, label='tanh(z)')
axes[0, 1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
axes[0, 1].axvline(x=0, color='r', linestyle='--', alpha=0.5)
axes[0, 1].set_xlabel('z', fontsize=12)
axes[0, 1].set_ylabel('tanh(z)', fontsize=12)
axes[0, 1].set_title('Hyperbolic Tangent', fontsize=13)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].legend(fontsize=11)
axes[0, 1].text(2, -0.6, 'Range: (-1, 1)\nZero-centered ✓', fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Plot ReLU
axes[1, 0].plot(z, relu(z), 'r-', linewidth=2, label='ReLU(z)')
axes[1, 0].axhline(y=0, color='k', linestyle='--', alpha=0.5)
axes[1, 0].axvline(x=0, color='k', linestyle='--', alpha=0.5)
axes[1, 0].set_xlabel('z', fontsize=12)
axes[1, 0].set_ylabel('ReLU(z)', fontsize=12)
axes[1, 0].set_title('ReLU: max(0, z)', fontsize=13)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].legend(fontsize=11)
axes[1, 0].text(2, 1, 'Range: [0, ∞)\nMost popular! ⭐', fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Plot Leaky ReLU
axes[1, 1].plot(z, leaky_relu(z), 'm-', linewidth=2, label='Leaky ReLU(z)')
axes[1, 1].axhline(y=0, color='k', linestyle='--', alpha=0.5)
axes[1, 1].axvline(x=0, color='k', linestyle='--', alpha=0.5)
axes[1, 1].set_xlabel('z', fontsize=12)
axes[1, 1].set_ylabel('Leaky ReLU(z)', fontsize=12)
axes[1, 1].set_title('Leaky ReLU: max(0.01z, z)', fontsize=13)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].legend(fontsize=11)
axes[1, 1].text(2, 1, 'Range: (-∞, ∞)\nFixes dying ReLU', fontsize=10, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

print("\n📊 Activation Function Properties:")
print("\nSigmoid:      Good for output layer (probabilities), but has vanishing gradients")
print("Tanh:         Better than sigmoid (zero-centered), still has vanishing gradients")
print("ReLU:         ⭐ Most popular! Fast, no vanishing gradient for positive inputs, but can 'die'")
print("Leaky ReLU:   Fixes dying ReLU problem with small negative slope")

**Key takeaway:** For most modern neural networks, we use:
- **ReLU** (or Leaky ReLU) in hidden layers
- **Sigmoid** for binary classification output
- **Softmax** for multi-class classification output

In this lesson, we'll use ReLU for hidden layers and softmax for the output layer since we're classifying 10 digit classes.

<a name="forward-propagation-making-predictions"></a>
## Forward propagation: Making predictions

Forward propagation is the process of passing input data through the network to generate predictions. Let's build this understanding with concrete examples.

<a name="single-neuron-example"></a>
### Single neuron example

Consider a single neuron in a hidden layer receiving 3 inputs:

**Given:**
- Inputs: $x = [0.5, 0.3, 0.8]$
- Weights: $w = [0.4, -0.2, 0.6]$
- Bias: $b = 0.1$

**Step 1: Compute weighted sum**
### $z = w^T x + b = (0.4)(0.5) + (-0.2)(0.3) + (0.6)(0.8) + 0.1$
### $z = 0.2 - 0.06 + 0.48 + 0.1 = 0.72$

**Step 2: Apply activation function** (let's use ReLU)
### $a = \text{ReLU}(0.72) = \max(0, 0.72) = 0.72$

This output (0.72) becomes an input to the next layer!
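The two steps above can be checked in a few lines of NumPy. This is just the same arithmetic written as code; the variable names are illustrative.

```python
import numpy as np

x_single = np.array([0.5, 0.3, 0.8])    # inputs
w_single = np.array([0.4, -0.2, 0.6])   # weights
b_single = 0.1                          # bias

z_single = w_single @ x_single + b_single   # Step 1: weighted sum
a_single = max(0.0, z_single)               # Step 2: ReLU activation

print(f"z = {z_single:.2f}, a = {a_single:.2f}")  # z = 0.72, a = 0.72
```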

Now let's see how this scales to a full network.

<a name="full-network-example"></a>
### Full network example

Let's work through a complete forward pass for a tiny network:
- **Input layer:** 3 features
- **Hidden layer:** 2 neurons (ReLU activation)
- **Output layer:** 2 neurons (softmax activation)

**Network parameters:**
```
Input: x = [0.5, 0.3, 0.8]

Hidden layer weights W[1] (2×3):
    [[0.4, -0.2, 0.6],
     [0.1,  0.5, -0.3]]
    
Hidden layer biases b[1]: [0.1, -0.1]

Output layer weights W[2] (2×2):
    [[0.7, -0.4],
     [0.3,  0.8]]
     
Output layer biases b[2]: [0.2, -0.2]
```

**Forward pass:**

**Layer 1 (Input → Hidden):**

Neuron 1:
- $z_1^{[1]} = 0.4(0.5) - 0.2(0.3) + 0.6(0.8) + 0.1 = 0.72$
- $h_1 = \text{ReLU}(0.72) = 0.72$

Neuron 2:
- $z_2^{[1]} = 0.1(0.5) + 0.5(0.3) - 0.3(0.8) - 0.1 = -0.14$
- $h_2 = \text{ReLU}(-0.14) = 0$ (ReLU sets the negative value to zero)

Hidden layer output: $h = [0.72, 0]$

**Layer 2 (Hidden → Output):**

Output 1:
- $z_1^{[2]} = 0.7(0.72) - 0.4(0) + 0.2 = 0.704$

Output 2:
- $z_2^{[2]} = 0.3(0.72) + 0.8(0) - 0.2 = 0.016$

**Apply softmax:**
- $e^{z_1^{[2]}} = e^{0.704} = 2.022$
- $e^{z_2^{[2]}} = e^{0.016} = 1.016$
- Sum = 3.038

Final probabilities:
- $p_1 = 2.022 / 3.038 = 0.666$ (66.6% probability for class 1)
- $p_2 = 1.016 / 3.038 = 0.334$ (33.4% probability for class 2)

**Prediction:** Class 1 (since 0.666 > 0.334)

This is what happens every time the network makes a prediction! Now let's implement this in code.

<a name="implementing-forward-propagation"></a>
### Implementing forward propagation

Let's implement the forward pass in code. We'll create helper functions that we'll use in our complete network later.

In [None]:
def softmax(z: NDArray) -> NDArray:
    """Softmax activation function with numerical stability.
    
    Args:
        z: Input array of shape (n_samples, n_classes)
    
    Returns:
        Probabilities for each class, same shape as input
    """
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)


def relu_derivative(z: NDArray) -> NDArray:
    """Derivative of ReLU function.
    
    Args:
        z: Input array
    
    Returns:
        Gradient: 1 where z > 0, else 0
    """
    return (z > 0).astype(float)


# Test our forward propagation with the example from before
x_test = np.array([[0.5, 0.3, 0.8]])

# Hidden layer
W1 = np.array([[0.4, -0.2, 0.6],
               [0.1,  0.5, -0.3]])
b1 = np.array([[0.1, -0.1]])

z1 = x_test @ W1.T + b1
h1 = relu(z1)
print(f"Hidden layer activations: {h1}")
print(f"Expected: [[0.72, 0]]\n")

# Output layer
W2 = np.array([[0.7, -0.4],
               [0.3,  0.8]])
b2 = np.array([[0.2, -0.2]])

z2 = h1 @ W2.T + b2
output = softmax(z2)
print(f"Output probabilities: {output}")
print(f"Expected: [[0.666, 0.334]] (approximately)")
print(f"\nPredicted class index: {np.argmax(output)} (0-based, so this is 'Class 1' in the worked example)")

<a name="the-loss-function-measuring-error"></a>
## The loss function: Measuring error

Once we make a prediction, we need to measure how wrong we were. This measurement is called the **loss** (or cost).

<a name="cross-entropy-loss-for-classification"></a>
### Cross-entropy loss for classification

For multi-class classification, we use **categorical cross-entropy loss**:

### $L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(\hat{y}_{ij})$

Where:
- $N$ is the number of samples
- $K$ is the number of classes
- $y_{ij}$ is 1 if sample $i$ belongs to class $j$, else 0 (one-hot encoded)
- $\hat{y}_{ij}$ is the predicted probability for sample $i$ being class $j$

**Intuition:** 
- If the true class is "7" and we predict 90% probability for "7": $-\log(0.9) = 0.105$ (small loss, good!)
- If the true class is "7" but we predict only 10% for "7": $-\log(0.1) = 2.303$ (large loss, bad!)
- The loss is 0 only when we predict 100% probability for the correct class
- The loss approaches infinity as we approach 0% for the correct class

**Why logarithm?** 
- Logarithm heavily penalizes confident wrong predictions
- It's the negative log-likelihood, so minimizing it is equivalent to maximum likelihood estimation
- The derivative has a nice form for backpropagation
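The "nice form" in the last bullet is that, when softmax feeds into cross-entropy, the gradient of the loss with respect to the pre-softmax scores $z$ is simply $\hat{y} - y$. The sketch below checks this numerically for a single sample using central finite differences. The example scores, the step size `eps`, and the helper names are arbitrary choices for illustration.

```python
import numpy as np

def softmax_1d(z):
    """Softmax for a single vector of scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce_from_scores(z, y_onehot):
    """Cross-entropy of softmax(z) against a one-hot target."""
    return -np.sum(y_onehot * np.log(softmax_1d(z)))

scores = np.array([2.0, -1.0, 0.5])   # arbitrary example scores
target = np.array([0.0, 0.0, 1.0])    # true class is index 2

# Claimed closed form: dL/dz = softmax(z) - y
analytic_grad = softmax_1d(scores) - target

# Independent estimate: nudge each score up and down and measure the loss change
eps = 1e-6
numeric_grad = np.zeros_like(scores)
for k in range(len(scores)):
    step = np.zeros_like(scores)
    step[k] = eps
    numeric_grad[k] = (ce_from_scores(scores + step, target)
                       - ce_from_scores(scores - step, target)) / (2 * eps)

print("analytic:", np.round(analytic_grad, 6))
print("numeric: ", np.round(numeric_grad, 6))
```

The two rows should agree to several decimal places.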

Let's implement it:

In [None]:
def cross_entropy_loss(y_true: NDArray, y_pred: NDArray) -> float:
    """Compute categorical cross-entropy loss.
    
    Args:
        y_true: True labels, one-hot encoded (n_samples, n_classes)
        y_pred: Predicted probabilities (n_samples, n_classes)
    
    Returns:
        Average loss across all samples
    """
    n_samples = y_true.shape[0]
    # Clip predictions to avoid log(0)
    y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
    # Compute loss
    loss = -np.sum(y_true * np.log(y_pred_clipped)) / n_samples
    return loss


# Example: Predict digit "7" (class 7)
y_true = np.array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]])  # True class: 7

# Good prediction: 90% confidence for class 7 (probabilities sum to 1)
y_pred_good = np.array([[0.02, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.90, 0.01, 0.01]])
loss_good = cross_entropy_loss(y_true, y_pred_good)

# Bad prediction: Only 10% confidence for class 7
y_pred_bad = np.array([[0.15, 0.15, 0.15, 0.15, 0.05, 0.05, 0.05, 0.10, 0.10, 0.05]])
loss_bad = cross_entropy_loss(y_true, y_pred_bad)

print(f"Loss with good prediction (90% correct): {loss_good:.4f}")
print(f"Loss with bad prediction (10% correct):  {loss_bad:.4f}")
print(f"\nThe bad prediction has {loss_bad/loss_good:.1f}x higher loss!")

<a name="backpropagation-learning-from-mistakes"></a>
## Backpropagation: Learning from mistakes

Backpropagation is the algorithm that allows neural networks to learn. It computes how much each weight contributed to the error, so we know how to adjust them.

<a name="the-chain-rule-intuition"></a>
### The chain rule intuition

Imagine you're late to work. To figure out why:
1. **You were late** ← because traffic was slow
2. **Traffic was slow** ← because you left late
3. **You left late** ← because your alarm didn't go off

You traced the problem backwards through a chain of causes. Backpropagation does the same thing mathematically.

**The chain rule** from calculus lets us compute derivatives through compositions of functions:

### $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$

In words: *How much does loss change with weight = (loss → output) × (output → pre-activation) × (pre-activation → weight)*
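As a minimal worked instance of this product, consider a single sigmoid neuron with binary cross-entropy loss. That setup is an assumption made here to keep the example small (the network in this lesson uses softmax), and the input, weights, and label are made up for illustration.

```python
import numpy as np

# Made-up input, weights, bias and label for one sigmoid neuron
x_cr = np.array([0.5, 0.3, 0.8])
w_cr = np.array([0.4, -0.2, 0.6])
b_cr = 0.1
y_cr = 1.0

def neuron_loss(w):
    """Binary cross-entropy of a single sigmoid neuron with weights w."""
    y_hat = 1.0 / (1.0 + np.exp(-(w @ x_cr + b_cr)))
    return -(y_cr * np.log(y_hat) + (1 - y_cr) * np.log(1 - y_hat))

# Forward pass
z_cr = w_cr @ x_cr + b_cr
y_hat_cr = 1.0 / (1.0 + np.exp(-z_cr))

# The three chain-rule factors
dL_dyhat = -(y_cr / y_hat_cr) + (1 - y_cr) / (1 - y_hat_cr)   # loss -> output
dyhat_dz = y_hat_cr * (1 - y_hat_cr)                           # output -> pre-activation
dz_dw = x_cr                                                   # pre-activation -> weight

chain_rule_grad = dL_dyhat * dyhat_dz * dz_dw

# Independent check with central finite differences
eps = 1e-6
numeric_grad = np.array([
    (neuron_loss(w_cr + eps * np.eye(3)[k]) - neuron_loss(w_cr - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print("chain rule:", np.round(chain_rule_grad, 6))
print("numeric:   ", np.round(numeric_grad, 6))
```

For this particular pairing of sigmoid and binary cross-entropy, the product simplifies to $(\hat{y} - y)\,x$.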

<a name="computing-gradients-layer-by-layer"></a>
### Computing gradients layer by layer

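For our two-layer network, with each weight matrix laid out as (outputs × inputs) as in the worked example above, and with one row per sample in $X$, $h^{[1]}$, $\hat{y}$ and $y$:

**Output layer gradients:**
- $\frac{\partial L}{\partial W^{[2]}} = \frac{1}{N}(\hat{y} - y)^T h^{[1]}$
- $\frac{\partial L}{\partial b^{[2]}} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)$

**Hidden layer gradients:**
- $\frac{\partial L}{\partial W^{[1]}} = \frac{1}{N}\left[(\hat{y} - y)W^{[2]} \odot \text{ReLU}'(z^{[1]})\right]^T X$
- $\frac{\partial L}{\partial b^{[1]}} = \frac{1}{N}\sum_{i=1}^{N}\left[(\hat{y} - y)W^{[2]} \odot \text{ReLU}'(z^{[1]})\right]_i$

Where $\odot$ is element-wise multiplication, $h^{[1]}$ is the matrix of hidden-layer activations, and each sum runs over the $N$ samples (rows). The implementation later in this lesson stores each weight matrix as (inputs × outputs), so its gradients are the transposes of these.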

**Key insight:** The gradient for earlier layers depends on gradients from later layers - we *propagate backwards*!
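A common way to gain confidence in backpropagation formulas is a gradient check: compare the analytic gradients against finite-difference estimates on a tiny random problem. The sketch below does this for a two-layer ReLU/softmax network, with each weight matrix stored as (outputs × inputs) as in the worked example and each gradient laid out in the same shape as its parameter. The toy sizes, the `_gc` names, and the step size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, K = 5, 4, 3, 2   # toy sizes: samples, inputs, hidden units, classes

X_gc = rng.normal(size=(N, D))
Y_gc = np.eye(K)[rng.integers(0, K, size=N)]                  # one-hot labels
W1_gc, b1_gc = rng.normal(size=(H, D)), rng.normal(size=H)    # (outputs x inputs)
W2_gc, b2_gc = rng.normal(size=(K, H)), rng.normal(size=K)

def forward_gc():
    """Forward pass with the current parameters; returns z1, h1, y_hat."""
    z1 = X_gc @ W1_gc.T + b1_gc
    h1 = np.maximum(0, z1)
    z2 = h1 @ W2_gc.T + b2_gc
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    return z1, h1, e / e.sum(axis=1, keepdims=True)

def loss_gc():
    """Mean cross-entropy loss with the current parameters."""
    _, _, y_hat = forward_gc()
    return -np.sum(Y_gc * np.log(y_hat)) / N

def numerical_gradient(param, eps=1e-6):
    """Central-difference gradient of loss_gc w.r.t. param (perturbed in place)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = loss_gc()
        param[idx] = original - eps
        loss_minus = loss_gc()
        param[idx] = original          # restore the parameter
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Analytic gradients, propagated backwards from the output layer
z1_gc, h1_gc, y_hat_gc = forward_gc()
delta2 = y_hat_gc - Y_gc                        # (N, K) output-layer error
dW2_gc = delta2.T @ h1_gc / N                   # (K, H)
db2_gc = delta2.sum(axis=0) / N
delta1 = (delta2 @ W2_gc) * (z1_gc > 0)         # (N, H); ReLU'(z1) is 1 where z1 > 0
dW1_gc = delta1.T @ X_gc / N                    # (H, D)
db1_gc = delta1.sum(axis=0) / N

for name, analytic, param in [("W1", dW1_gc, W1_gc), ("b1", db1_gc, b1_gc),
                              ("W2", dW2_gc, W2_gc), ("b2", db2_gc, b2_gc)]:
    max_diff = np.max(np.abs(analytic - numerical_gradient(param)))
    print(f"{name}: max |analytic - numeric| = {max_diff:.2e}")
```

Differences many orders of magnitude smaller than the gradients themselves suggest the analytic expressions are right; a large mismatch usually points to a transposed matrix or a missing factor.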

<a name="building-a-neural-network-from-scratch"></a>
## Building a neural network from scratch

<a name="network-architecture"></a>
### Network architecture

We'll build a two-layer neural network:
- **Input layer:** 784 neurons (28×28 pixel images flattened)
- **Hidden layer:** 128 neurons with ReLU activation
- **Output layer:** 10 neurons with softmax activation (digits 0-9)
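To get a feel for the size of this architecture, we can count its learnable parameters: one weight per connection plus one bias per neuron. The variable names below are illustrative.

```python
input_units, hidden_units, output_units = 784, 128, 10

# Each layer has (inputs x outputs) weights plus one bias per output neuron
hidden_params = input_units * hidden_units + hidden_units
output_params = hidden_units * output_units + output_units

print(f"Hidden layer parameters: {hidden_params:,}")                  # 100,480
print(f"Output layer parameters: {output_params:,}")                  # 1,290
print(f"Total parameters:        {hidden_params + output_params:,}")  # 101,770
```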

<a name="complete-implementation"></a>
### Complete implementation

Here's our complete neural network class with forward and backward propagation:

In [None]:
class NeuralNetwork:
    """Two-layer neural network with ReLU and softmax."""
    
    def __init__(self, input_size: int, hidden_size: int, output_size: int, learning_rate: float = 0.01):
        """Initialize network with random weights.
        
        Args:
            input_size: Number of input features
            hidden_size: Number of hidden layer neurons
            output_size: Number of output classes
            learning_rate: Learning rate for gradient descent
        """
        self.lr = learning_rate
        
        # He initialization for weights (good for ReLU)
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        
        # Store activations for backprop
        self.cache = {}
        
    def forward(self, X: NDArray) -> NDArray:
        """Forward propagation.
        
        Args:
            X: Input data (n_samples, n_features)
        
        Returns:
            Output probabilities (n_samples, n_classes)
        """
        # Hidden layer
        self.cache['X'] = X
        self.cache['z1'] = X @ self.W1 + self.b1
        self.cache['h1'] = relu(self.cache['z1'])
        
        # Output layer
        self.cache['z2'] = self.cache['h1'] @ self.W2 + self.b2
        self.cache['y_pred'] = softmax(self.cache['z2'])
        
        return self.cache['y_pred']
    
    def backward(self, y_true: NDArray) -> None:
        """Backward propagation and weight update.
        
        Args:
            y_true: True labels, one-hot encoded (n_samples, n_classes)
        """
        n_samples = y_true.shape[0]
        
        # Output layer gradients
        dz2 = self.cache['y_pred'] - y_true
        dW2 = (self.cache['h1'].T @ dz2) / n_samples
        db2 = np.sum(dz2, axis=0, keepdims=True) / n_samples
        
        # Hidden layer gradients
        dh1 = dz2 @ self.W2.T
        dz1 = dh1 * relu_derivative(self.cache['z1'])
        dW1 = (self.cache['X'].T @ dz1) / n_samples
        db1 = np.sum(dz1, axis=0, keepdims=True) / n_samples
        
        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
    
    def train(self, X: NDArray, y: NDArray) -> float:
        """Perform one training step.
        
        Args:
            X: Training data
            y: True labels (one-hot encoded)
        
        Returns:
            Loss value
        """
        # Forward pass
        y_pred = self.forward(X)
        
        # Compute loss
        loss = cross_entropy_loss(y, y_pred)
        
        # Backward pass
        self.backward(y)
        
        return loss
    
    def predict(self, X: NDArray) -> NDArray:
        """Make predictions.
        
        Args:
            X: Input data
        
        Returns:
            Predicted class labels
        """
        y_pred = self.forward(X)
        return np.argmax(y_pred, axis=1)

print("✅ Neural network class implemented!")

<a name="training-on-mnist-recognizing-handwritten-digits"></a>
## Training on MNIST: Recognizing handwritten digits

<a name="loading-and-exploring-the-dataset"></a>
### Loading and exploring the dataset

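MNIST is a dataset of 70,000 handwritten digit images (28×28 pixels), conventionally split into 60,000 training images and 10,000 test images. The OpenML copy we load below comes as a single set of 70,000 images, so we'll make our own split later.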

Let's load and examine it:


In [None]:
print("Downloading MNIST dataset... (this may take a minute)\n")

# Load MNIST
mnist = fetch_openml('mnist_784', version=1, parser='auto')
X, y = mnist.data.values, mnist.target.values.astype(int)

print(f"Dataset shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Pixel values range: [{X.min():.1f}, {X.max():.1f}]")
print(f"\nUnique digits: {np.unique(y)}")
print(f"Samples per digit:\n{pd.Series(y).value_counts().sort_index()}")

Let's visualize some examples:

In [None]:
# Plot 16 random digits
fig, axes = plt.subplots(4, 4, figsize=(10, 10))
fig.suptitle('Sample MNIST Handwritten Digits', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
    # Pick a random image
    idx = np.random.randint(len(X))
    image = X[idx].reshape(28, 28)
    
    ax.imshow(image, cmap='gray')
    ax.set_title(f'Label: {y[idx]}', fontsize=12)
    ax.axis('off')

plt.tight_layout()
plt.show()

<a name="preprocessing-the-data"></a>
### Preprocessing the data

Before training, we need to:
1. **Normalize** pixel values to [0, 1] range
2. **Split** into train/validation/test sets
3. **One-hot encode** labels for cross-entropy loss

In [None]:
# For faster training, use subset of data
# (Remove these lines to train on full dataset)
n_samples = 10000  # Use 10k samples for faster demo
indices = np.random.choice(len(X), n_samples, replace=False)
X, y = X[indices], y[indices]

# Normalize to [0, 1]
X = X / 255.0

# Split into train/validation/test (60/20/20)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

print(f"Training set:   {X_train.shape[0]:,} samples")
print(f"Validation set: {X_val.shape[0]:,} samples")
print(f"Test set:       {X_test.shape[0]:,} samples")

# One-hot encode labels
def one_hot_encode(y: NDArray, n_classes: int = 10) -> NDArray:
    """Convert integer labels to one-hot encoding."""
    one_hot = np.zeros((len(y), n_classes))
    one_hot[np.arange(len(y)), y] = 1
    return one_hot

y_train_oh = one_hot_encode(y_train)
y_val_oh = one_hot_encode(y_val)
y_test_oh = one_hot_encode(y_test)

print(f"\nOne-hot encoded label shape: {y_train_oh.shape}")
print(f"Example: digit {y_train[0]} → {y_train_oh[0]}")
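
One-hot encoding is easy to reverse, which is how class predictions are read back from the network's output. Here is a small sketch using made-up labels; it builds the encoding with `np.eye` as an alternative to the indexing approach above (both assume integer labels in the range 0-9):

```python
import numpy as np

labels = np.array([3, 0, 9])   # hypothetical integer labels

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing with the labels selects one row per sample.
one_hot = np.eye(10)[labels]

# argmax along the class axis recovers the integer labels.
decoded = one_hot.argmax(axis=1)

print(one_hot.shape)   # (3, 10)
print(decoded)         # [3 0 9]
```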

<a name="training-the-network"></a>
### Training the network

Now let's train our neural network! We'll train for 50 epochs using mini-batches of 128 samples, and track loss and accuracy on both the training and validation sets.

In [None]:
# Initialize network
input_size = 784  # 28×28 pixels
hidden_size = 128
output_size = 10  # digits 0-9
learning_rate = 0.1

nn = NeuralNetwork(input_size, hidden_size, output_size, learning_rate)

# Training loop
n_epochs = 50
batch_size = 128
n_batches = len(X_train) // batch_size  # integer division skips the final partial batch each epoch

train_losses = []
val_losses = []
train_accuracies = []
val_accuracies = []

print(f"Training neural network for {n_epochs} epochs...\n")
print(f"{'Epoch':<6} {'Train Loss':<12} {'Val Loss':<12} {'Train Acc':<12} {'Val Acc':<12}")
print("-" * 60)

for epoch in range(n_epochs):
    # Shuffle training data
    indices = np.random.permutation(len(X_train))
    X_train_shuffled = X_train[indices]
    y_train_shuffled = y_train_oh[indices]
    
    # Mini-batch training
    epoch_losses = []
    for i in range(n_batches):
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        
        X_batch = X_train_shuffled[start_idx:end_idx]
        y_batch = y_train_shuffled[start_idx:end_idx]
        
        loss = nn.train(X_batch, y_batch)
        epoch_losses.append(loss)
    
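    # Compute metrics. The training loss is averaged over the epoch's batches
    # (while the weights are still changing); validation loss is measured afterwards.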
    train_loss = np.mean(epoch_losses)
    val_pred = nn.forward(X_val)
    val_loss = cross_entropy_loss(y_val_oh, val_pred)
    
    train_acc = accuracy_score(y_train, nn.predict(X_train))
    val_acc = accuracy_score(y_val, nn.predict(X_val))
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    train_accuracies.append(train_acc)
    val_accuracies.append(val_acc)
    
    # Print every 5 epochs
    if (epoch + 1) % 5 == 0:
        print(f"{epoch+1:<6} {train_loss:<12.4f} {val_loss:<12.4f} {train_acc:<12.3f} {val_acc:<12.3f}")

print("\n✅ Training complete!")
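
The loop above uses `len(X_train) // batch_size` batches, so any leftover samples are skipped in each epoch. With the 6,000-sample training set and a batch size of 128, that is 46 batches covering 5,888 samples. Since the data is reshuffled every epoch, different samples are skipped each time, so this is mostly harmless. If you'd rather use every sample, here is a minimal sketch of one way to do it; `iterate_minibatches` is a hypothetical helper, not part of the lesson's `NeuralNetwork` class:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs, including a final short batch."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = order[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X_demo = np.arange(10).reshape(10, 1)   # hypothetical data: 10 samples
y_demo = np.arange(10)

batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X_demo, y_demo, 4, rng)]
print(batch_sizes)  # [4, 4, 2]
```

Slicing past the end of an array simply returns the remaining elements, so the last batch comes out shorter without any special handling.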

<a name="visualizing-the-training-process"></a>
### Visualizing the training process

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot loss
ax1.plot(train_losses, 'b-', label='Training Loss', linewidth=2)
ax1.plot(val_losses, 'r-', label='Validation Loss', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Training and Validation Loss', fontsize=14, fontweight='bold')
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot accuracy
ax2.plot(train_accuracies, 'b-', label='Training Accuracy', linewidth=2)
ax2.plot(val_accuracies, 'r-', label='Validation Accuracy', linewidth=2)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy', fontsize=12)
ax2.set_title('Training and Validation Accuracy', fontsize=14, fontweight='bold')
ax2.legend(fontsize=11)
ax2.grid(True, alpha=0.3)
ax2.set_ylim([0, 1])

plt.tight_layout()
plt.show()

print(f"\nFinal training accuracy:   {train_accuracies[-1]:.2%}")
print(f"Final validation accuracy: {val_accuracies[-1]:.2%}")
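
Beyond eyeballing the curves, two numbers are often worth pulling out: the epoch with the lowest validation loss, and the gap between validation and training loss at the end. A widening gap is a common sign of overfitting. Here is a small sketch using made-up histories (`train_hist` and `val_hist` are hypothetical stand-ins for `train_losses` and `val_losses`):

```python
import numpy as np

# Hypothetical loss histories, one value per epoch.
train_hist = [0.90, 0.50, 0.30, 0.20, 0.12, 0.08]
val_hist = [0.95, 0.60, 0.42, 0.40, 0.43, 0.47]

best_epoch = int(np.argmin(val_hist))   # 0-based index of the lowest validation loss
gap = val_hist[-1] - train_hist[-1]     # final validation-minus-training gap

print(f"Best validation loss at epoch {best_epoch + 1}")   # epoch 4
print(f"Final val - train loss gap: {gap:.2f}")            # 0.39
```

In this made-up example, validation loss bottoms out at epoch 4 while training loss keeps falling, which is the pattern to watch for in the real plots.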

<a name="evaluating-our-network"></a>
## Evaluating our network

<a name="accuracy-and-confusion-matrix"></a>
### Accuracy and confusion matrix

Let's evaluate on the test set (data the network has never seen):

In [None]:
# Test set predictions
y_pred_test = nn.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)

print(f"\n🎯 Test Set Accuracy: {test_accuracy:.2%}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred_test, digits=3))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True, 
            xticklabels=range(10), yticklabels=range(10))
plt.xlabel('Predicted Digit', fontsize=12)
plt.ylabel('True Digit', fontsize=12)
plt.title('Confusion Matrix - MNIST Digit Classification', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
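
The heatmap can also be summarized numerically. Here is a sketch on a made-up 3-class matrix (`cm_demo` is hypothetical) that computes per-class recall and finds the most frequent confusion. It assumes the scikit-learn convention used above: rows are true labels and columns are predictions.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows are true labels,
# columns are predicted labels.
cm_demo = np.array([
    [50,  2,  1],
    [ 4, 40,  6],
    [ 0,  3, 47],
])

# Per-class recall: correct predictions divided by the row total.
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)

# Most frequent confusion: the largest off-diagonal entry.
off_diag = cm_demo.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(np.round(recall, 3))
print(f"Most common error: true {true_cls} predicted as {pred_cls}")
```

The same lines should work on the 10×10 matrix `cm` from the cell above.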

<a name="analyzing-mistakes"></a>
### Analyzing mistakes

Let's look at some examples where our network made mistakes:

In [None]:
# Find misclassified examples
misclassified_idx = np.where(y_pred_test != y_test)[0]

if len(misclassified_idx) > 0:
    # Plot 12 misclassified examples
    n_examples = min(12, len(misclassified_idx))
    fig, axes = plt.subplots(3, 4, figsize=(12, 9))
    fig.suptitle('Misclassified Examples', fontsize=16, fontweight='bold')
    
    for i, ax in enumerate(axes.flat):
        if i < n_examples:
            idx = misclassified_idx[i]
            image = X_test[idx].reshape(28, 28)
            true_label = y_test[idx]
            pred_label = y_pred_test[idx]
            
            # Get prediction confidence
            probs = nn.forward(X_test[idx:idx+1])[0]
            confidence = probs[pred_label]
            
            ax.imshow(image, cmap='gray')
            ax.set_title(f'True: {true_label}, Pred: {pred_label}\nConf: {confidence:.2%}', 
                        fontsize=10, color='red')
            ax.axis('off')
        else:
            ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nTotal misclassified: {len(misclassified_idx)} out of {len(y_test)} ({100*len(misclassified_idx)/len(y_test):.1f}%)")
else:
    print("Perfect accuracy! No misclassifications.")
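
The cell above shows the first 12 mistakes in index order. It can be more revealing to look at the mistakes the network was most confident about. Here is a minimal sketch of that ranking on made-up probabilities; `probs_demo` and `y_true_demo` are hypothetical stand-ins for `nn.forward(X_test)` and `y_test`:

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples over 3 classes.
probs_demo = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.85, 0.05],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
])
y_true_demo = np.array([0, 2, 1, 2])

y_pred_demo = probs_demo.argmax(axis=1)   # predicted class per sample
confidence = probs_demo.max(axis=1)       # probability of the predicted class

wrong = np.where(y_pred_demo != y_true_demo)[0]
# Sort the mistakes from most to least confident.
ranked = wrong[np.argsort(-confidence[wrong])]

print(ranked)  # [1 2]
```

Sample 1 comes first because the network assigned 85% probability to the wrong class, while sample 2 was a much closer call.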

<a name="visualizing-learned-features"></a>
### Visualizing learned features

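Let's visualize what the first layer learned. Each hidden neuron has one weight per input pixel, so its weights can be drawn as a 28×28 image. These often resemble simple stroke and edge detectors.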

In [None]:
# Visualize first layer weights (randomly select 16 neurons)
n_neurons_to_show = 16
neuron_indices = np.random.choice(hidden_size, n_neurons_to_show, replace=False)

fig, axes = plt.subplots(4, 4, figsize=(10, 10))
fig.suptitle('First Layer Weights (What the neurons detect)', fontsize=16, fontweight='bold')

for i, ax in enumerate(axes.flat):
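    # Get weights for this neuron and reshape to image
    weights = nn.W1[:, neuron_indices[i]].reshape(28, 28)
    
    # Center the color scale on zero so red is positive and blue is negative
    limit = np.abs(weights).max()
    ax.imshow(weights, cmap='RdBu_r', vmin=-limit, vmax=limit, interpolation='nearest')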
    ax.set_title(f'Neuron {neuron_indices[i]}', fontsize=10)
    ax.axis('off')

plt.tight_layout()
plt.show()

print("\nThese patterns show what each neuron is 'looking for' in the input.")
print("Red areas have positive weights (activate the neuron).")
print("Blue areas have negative weights (inhibit the neuron).")
print("You can often see rough edge, curve, and stroke detectors, though they may look noisy.")
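
To see why a weight image acts like a template, recall that a neuron's pre-activation is the dot product of its weights with the input pixels. Here is a toy sketch on a 3×3 grid (the arrays are made up, and the bias is omitted):

```python
import numpy as np

# Hypothetical 3x3 "image" containing a vertical stroke.
image = np.array([[0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0]])

# Hypothetical weights for one neuron: positive where a vertical stroke
# is expected, negative on either side.
weights = np.array([[-1.0, 2.0, -1.0],
                    [-1.0, 2.0, -1.0],
                    [-1.0, 2.0, -1.0]])

# Pre-activation = dot product of the flattened arrays.
z_match = image.ravel() @ weights.ravel()
z_mismatch = image.T.ravel() @ weights.ravel()   # horizontal stroke instead

print(z_match, z_mismatch)  # 6.0 0.0
```

The vertical stroke lines up with the positive weights and produces a large response, while the horizontal stroke overlaps positive and negative weights equally and cancels out.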

<a name="conclusion-our-journey-through-neural-networks"></a>
## Conclusion: Our guide to neural networks

Congratulations! You've just built a neural network from scratch and trained it to recognize handwritten digits with impressive accuracy.

**What we learned:**

1. **Neural networks extend logistic regression** by stacking multiple layers with non-linear activations
2. **Activation functions** (especially ReLU) are crucial for learning non-linear patterns
3. **Forward propagation** computes predictions by passing data through layers
4. **Backpropagation** uses the chain rule to compute gradients efficiently
5. **Gradient descent** updates weights to minimize loss
6. **Hidden layers learn hierarchical features** automatically from data

**Key insights:**
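- Our simple 2-layer network (one hidden layer plus an output layer) reached roughly 95% accuracy on a 10,000-image subset of MNIST; the exact figure varies from run to run
- The first layer learned simple stroke and edge detectors automatically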
- We implemented everything from scratch to understand the fundamentals
- The same principles scale to much deeper networks
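
To illustrate that last point, here is a hedged sketch of a forward pass written for any number of layers. It is not the lesson's `NeuralNetwork` class: the `forward` function, the layer sizes, and the small random initialization are all assumptions chosen for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, layers):
    """Apply ReLU after every layer except the last, which uses softmax."""
    a = X
    for W, b in layers[:-1]:
        a = relu(a @ W + b)
    W, b = layers[-1]
    return softmax(a @ W + b)

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]   # hypothetical: two hidden layers instead of one
layers = [(rng.normal(0, 0.01, (m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

out = forward(rng.random((5, 784)), layers)
print(out.shape)   # (5, 10)
```

Adding depth only means adding entries to `sizes`; the loop itself doesn't change. Each output row is a probability distribution over the 10 digits.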

<a name="looking-ahead-to-lesson-3b"></a>
### Looking ahead to lesson 3B

In the next lesson, we'll examine:
1. **PyTorch implementation** - Industry-standard deep learning framework
2. **Modern optimizers** - Adam and RMSprop, which go beyond vanilla gradient descent
3. **Regularization techniques** - Dropout, batch normalization, L2 regularization
4. **Deeper architectures** - 3, 4, 5+ layer networks
5. **Learning rate scheduling** - Adjusting the learning rate during training for better convergence
6. **Data augmentation** - Improving generalization with synthetic examples
7. **Model checkpointing** - Saving and loading trained models
8. **GPU acceleration** - Training networks 10-100x faster

<a name="further-reading"></a>
### Further reading

**Books:**
- *Deep Learning* by Goodfellow, Bengio, and Courville - The definitive textbook
- *Neural Networks and Deep Learning* by Michael Nielsen - Free online book with interactive examples

**Papers:**
- LeCun et al. (1998) - "Gradient-Based Learning Applied to Document Recognition" (Original MNIST paper)
- Rumelhart et al. (1986) - "Learning representations by back-propagating errors" (Backpropagation paper)

**Online Resources:**
- Stanford CS231n - Convolutional Neural Networks for Visual Recognition
- 3Blue1Brown - Neural Networks video series on YouTube
- Distill.pub - Beautiful visual explanations of neural network concepts

---

**🎉 You've completed Lesson 3A! You now understand neural networks from first principles. Ready for 3B?**
