# Introduction to Backpropagation in Neural Networks

## Introduction

Welcome to this interactive guide on **Backpropagation**! You already have a foundational understanding of neural networks, linear and logistic regression, and SVMs. This notebook will build upon that knowledge to provide a deep and intuitive understanding of the backpropagation algorithm, which is the cornerstone of training most neural networks. [1]

**What is Backpropagation?**

Backpropagation, short for "backward propagation of errors," is an algorithm used to train artificial neural networks. [1] It is a supervised learning algorithm that works by calculating the gradient of the loss function with respect to the network's weights and biases. This gradient is then used to update the weights and biases in the direction that minimizes the loss. [1, 2]

Think of it as a methodical way to assign blame for the network's overall error to each of its individual connections (weights). Once we know which weights are most responsible for the error, we can adjust them accordingly. This process is repeated iteratively until the network's performance on the training data is satisfactory.

Some resources:
- Backprop visualized: <https://xnought.github.io/backprop-explainer/>
- 3Blue1Brown: <https://www.youtube.com/watch?v=Ilg3gGewQ5U>
- Foundations of computer vision: <https://visionbook.mit.edu/backpropagation.html>

## The Core Idea: Gradient Descent

Before diving into backpropagation, let's quickly recap **Gradient Descent**. Imagine you are on a mountain in a thick fog and you want to get to the lowest point. What would you do? You would look at the ground beneath your feet and take a step in the steepest downhill direction. You would repeat this process until you could no longer step down.

This is exactly what gradient descent does. In the context of machine learning, the "mountain" is the **loss function**, and the "position" is the set of weights and biases of our model. The goal is to find the weights and biases that result in the lowest possible loss (error).

The **gradient** of the loss function is a vector that points in the direction of the steepest ascent. Therefore, to move downhill, we take a step in the **negative** direction of the gradient. The size of this step is determined by the **learning rate** (α).

The update rule for a weight *w* is:
$$ w_{new} = w_{old} - \alpha \frac{\partial L}{\partial w} $$

where *L* is the loss function. Backpropagation is the algorithm that allows us to efficiently compute this gradient, $\frac{\partial L}{\partial w}$, for all the weights and biases in the network.
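To make the update rule concrete, here is a minimal sketch of gradient descent on a one-dimensional toy loss. The function name `gradient_descent` and the loss $L(w) = (w - 3)^2$ are illustrative assumptions, not part of the network built later in this notebook.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly apply w <- w - alpha * dL/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Hypothetical toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_final)  # very close to the minimizer w = 3
```

Each step shrinks the distance to the minimum by a constant factor here; with a learning rate that is too large (above 1.0 for this loss) the iterates would diverge instead.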

## The Mathematics of Backpropagation

Let's consider a simple neural network with one hidden layer to understand the mathematics. Our goal is to derive the gradients of the loss function with respect to the weights and biases.

### The Forward Pass

First, the input data is fed forward through the network to compute the output. For a single training example:

1.  **Input Layer to Hidden Layer:**
    $$ z^{[1]} = W^{[1]}x + b^{[1]} $$
    $$ a^{[1]} = g(z^{[1]}) $$

2.  **Hidden Layer to Output Layer:**
    $$ z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} $$
    $$ \hat{y} = a^{[2]} = \sigma(z^{[2]}) $$

Where:
-  *x* is the input vector.
-  *W<sup>[1]</sup>* and *b<sup>[1]</sup>* are the weight matrix and bias vector for the hidden layer.
-  *g* is the activation function for the hidden layer (e.g., ReLU or tanh).
-  *W<sup>[2]</sup>* and *b<sup>[2]</sup>* are the weight matrix and bias vector for the output layer.
-  *σ* is the activation function for the output layer (e.g., sigmoid for binary classification).
-  *ŷ* is the predicted output.
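The forward-pass equations above can be sketched in NumPy for a single example. The layer sizes, the random initial values, and the choice of tanh for *g* are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2, g=np.tanh):
    """Forward pass for one example x stored as a column vector (n_x, 1)."""
    z1 = W1 @ x + b1      # hidden pre-activation, shape (n_h, 1)
    a1 = g(z1)            # hidden activation, shape (n_h, 1)
    z2 = W2 @ a1 + b2     # output pre-activation, shape (1, 1)
    y_hat = sigmoid(z2)   # predicted probability, shape (1, 1)
    return z1, a1, z2, y_hat

# Illustrative sizes: 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
n_x, n_h = 2, 3
x = rng.normal(size=(n_x, 1))
W1, b1 = rng.normal(size=(n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.normal(size=(1, n_h)), np.zeros((1, 1))

z1, a1, z2, y_hat = forward(x, W1, b1, W2, b2)
print(z1.shape, a1.shape, z2.shape, y_hat.shape)
```

Returning the intermediate values `z1` and `a1` is deliberate: the backward pass needs them, which is why implementations usually cache them during the forward pass.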

### The Loss Function

We need a way to measure how wrong our network's prediction is. For binary classification, we often use the **Binary Cross-Entropy Loss**:

$$ L(y, \hat{y}) = - (y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})) $$

### The Backward Pass and the Chain Rule

This is where the magic happens. We need to compute the derivatives of the loss with respect to all weights and biases. The key to this is the **Chain Rule** from calculus.

**Chain Rule:** If *z* depends on *y*, and *y* depends on *x*, then the derivative of *z* with respect to *x* is:
$$ \frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx} $$
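As a quick sanity check of the chain rule, the sketch below compares the analytic derivative with a central finite difference. The functions $y = x^2$ and $z = \sin(y)$ and the evaluation point are arbitrary choices for illustration.

```python
import math

def f(x):
    y = x ** 2          # inner function y(x)
    return math.sin(y)  # outer function z(y)

x = 1.3
analytic = math.cos(x ** 2) * 2 * x               # dz/dy * dy/dx
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # central difference
print(analytic, numeric)
```

The two values should agree to many decimal places.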

We start from the end of the network and work our way backward.

**Step 1: Output Layer Gradients**

We first compute the derivative of the loss with respect to the output layer's weighted sum, *z<sup>[2]</sup>*:

$$ \frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \frac{\partial a^{[2]}}{\partial z^{[2]}} = (a^{[2]} - y) $$

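(This simplification is specific to the combination of sigmoid and binary cross-entropy: $\frac{\partial L}{\partial a^{[2]}} = \frac{a^{[2]} - y}{a^{[2]}(1 - a^{[2]})}$ and $\frac{\partial a^{[2]}}{\partial z^{[2]}} = a^{[2]}(1 - a^{[2]})$, so the factor $a^{[2]}(1 - a^{[2]})$ cancels.)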
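The result $\partial L / \partial z^{[2]} = a^{[2]} - y$ can be checked numerically. The following is a small sketch with an arbitrarily chosen value of `z` and label `y`; the helper names `sigmoid` and `bce` are local to this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat):
    """Binary cross-entropy for a single prediction."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 0.7, 1.0, 1e-6
analytic = sigmoid(z) - y   # the simplified gradient (a - y)
numeric = (bce(y, sigmoid(z + eps)) - bce(y, sigmoid(z - eps))) / (2 * eps)
print(analytic, numeric)
```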

Now we can compute the gradients for *W<sup>[2]</sup>* and *b<sup>[2]</sup>*:

$$ \frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial W^{[2]}} = (a^{[2]} - y) \cdot (a^{[1]})^T $$
$$ \frac{\partial L}{\partial b^{[2]}} = \frac{\partial L}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial b^{[2]}} = (a^{[2]} - y) $$

**Step 2: Hidden Layer Gradients**

Next, we propagate the error backward to the hidden layer:

$$ \frac{\partial L}{\partial z^{[1]}} = \frac{\partial L}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial a^{[1]}} \frac{\partial a^{[1]}}{\partial z^{[1]}} = (W^{[2]})^T (a^{[2]} - y) \ast g'(z^{[1]}) $$

where *g'* is the derivative of the hidden layer's activation function and *∗* denotes element-wise multiplication.

Finally, we can compute the gradients for *W<sup>[1]</sup>* and *b<sup>[1]</sup>*:

$$ \frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial W^{[1]}} = \frac{\partial L}{\partial z^{[1]}} \cdot x^T $$
$$ \frac{\partial L}{\partial b^{[1]}} = \frac{\partial L}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial b^{[1]}} = \frac{\partial L}{\partial z^{[1]}} $$

Once we have these gradients, we can use them to update the weights and biases using the gradient descent rule.
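A common way to gain confidence in hand-derived gradients is a numerical gradient check. The sketch below implements the equations above for a single example and compares $\partial L / \partial W^{[1]}$ with central finite differences. It assumes tanh as the hidden activation *g* and uses small random parameters; the function name `loss_and_grads` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    # Forward pass (g = tanh is an assumption for this sketch)
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    loss = (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()
    # Backward pass, following the equations in this section
    dz2 = a2 - y                         # dL/dz2
    dW2 = dz2 @ a1.T                     # dL/dW2
    db2 = dz2                            # dL/db2
    dz1 = (W2.T @ dz2) * (1 - a1 ** 2)   # tanh'(z1) = 1 - tanh(z1)^2
    dW1 = dz1 @ x.T                      # dL/dW1
    db1 = dz1                            # dL/db1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 1)), 1.0
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

_, (dW1, db1, dW2, db2) = loss_and_grads(x, y, W1, b1, W2, b2)

# Perturb each entry of W1 and measure the change in the loss.
eps = 1e-6
numeric_dW1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        lp, _ = loss_and_grads(x, y, Wp, b1, W2, b2)
        lm, _ = loss_and_grads(x, y, Wm, b1, W2, b2)
        numeric_dW1[i, j] = (lp - lm) / (2 * eps)

max_diff = np.max(np.abs(dW1 - numeric_dW1))
print(max_diff)  # should be tiny if the derivation is correct
```

Note the cost difference: the finite-difference check needs two forward passes per weight, whereas backpropagation obtains all gradients from one forward and one backward pass.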

### Mini-Exercise 1:

If the hidden layer activation function *g(z)* is the Rectified Linear Unit (ReLU), where *g(z) = max(0, z)*, what is its derivative *g'(z)*? How does this simplify the calculation for the hidden layer gradient?

## Implementing Backpropagation from Scratch

Now, let's implement a simple neural network in Python using only NumPy to see backpropagation in action.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Sigmoid activation function and its derivative
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    return sigmoid(z) * (1 - sigmoid(z))

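# Loss function (binary cross-entropy)
def compute_loss(y_true, y_pred):
    # Clip predictions away from 0 and 1 so np.log never receives 0
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).mean()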

class SimpleNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights and biases
        self.W1 = np.random.randn(hidden_size, input_size) * 0.01
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * 0.01
        self.b2 = np.zeros((output_size, 1))
        
    def forward_pass(self, X):
        self.Z1 = np.dot(self.W1, X) + self.b1
        self.A1 = sigmoid(self.Z1)
        self.Z2 = np.dot(self.W2, self.A1) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

    def backward_pass(self, X, Y):
        m = Y.shape[1]
        
        # Gradients for the output layer
        dZ2 = self.A2 - Y
        self.dW2 = (1 / m) * np.dot(dZ2, self.A1.T)
        self.db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
        
        # Gradients for the hidden layer
        dZ1 = np.dot(self.W2.T, dZ2) * sigmoid_derivative(self.Z1)
        self.dW1 = (1 / m) * np.dot(dZ1, X.T)
        self.db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

    def update_parameters(self, learning_rate):
        self.W1 -= learning_rate * self.dW1
        self.b1 -= learning_rate * self.db1
        self.W2 -= learning_rate * self.dW2
        self.b2 -= learning_rate * self.db2

    def train(self, X, Y, epochs, learning_rate):
        losses = []
        for i in range(epochs):
            y_pred = self.forward_pass(X)
            loss = compute_loss(Y, y_pred)
            losses.append(loss)
            
            self.backward_pass(X, Y)
            self.update_parameters(learning_rate)
            
            if i % 1000 == 0:
                print(f"Epoch {i}, Loss: {loss}")
        
        return losses

## Visualizing the Training Process

Let's use the `make_moons` dataset from scikit-learn, which is a simple dataset for binary classification that is not linearly separable.

In [None]:
# Generate and prepare data
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Transpose data for our network's expected input shape
X_train_t = X_train.T
y_train_t = y_train.reshape(1, -1)

# Initialize and train the network
net = SimpleNeuralNetwork(input_size=2, hidden_size=4, output_size=1)
losses = net.train(X_train_t, y_train_t, epochs=10000, learning_rate=0.1)

# Plot the loss over time
plt.figure()
plt.plot(losses)
plt.title("Loss During Training")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.grid(True)
plt.show()

### Visualizing the Decision Boundary

A great way to see what the network has learned is to visualize its decision boundary.

In [None]:
def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))
    
    Z = model.forward_pass(np.c_[xx.ravel(), yy.ravel()].T)
    Z = Z > 0.5
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdYlBu)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.RdYlBu, edgecolors='k')
    plt.title("Decision Boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()

plot_decision_boundary(net, X_train, y_train)

### Mini-Exercise 2:

When constructing the `SimpleNeuralNetwork`, change the `hidden_size` argument from 4 to other values (e.g., 1, 2, 20, 50) and retrain. How does this affect the decision boundary and the final loss? What do you observe about potential overfitting or underfitting?

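## Computational Considerations

**Vanishing Gradient Problem**

In deep networks, the gradient that reaches the early layers can become exponentially small. By the chain rule, it contains a product with one factor ∂aᵢ/∂aᵢ₋₁ per layer, and each factor includes the derivative of the activation function. For sigmoid, σ'(x) ≤ 0.25, so unless the weights are large enough to compensate, the product shrinks exponentially with depth.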

Solutions:
- Use ReLU activations (gradient is 1 or 0)
- Use careful weight initialization (e.g., Xavier/Glorot or He initialization)
- Use residual connections
- Batch normalization
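To get a feel for how quickly the sigmoid bound shrinks, the sketch below computes the best-case product of activation derivatives for a few depths. It deliberately ignores the weight matrices, which also enter the true gradient, so it illustrates the bound rather than simulating a real network.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# sigma'(z) is largest at z = 0, where it equals 0.25.
max_slope = sigmoid_prime(0.0)

# Best case: every layer operates exactly at that peak.
depths = [1, 5, 10, 20]
bounds = [max_slope ** n for n in depths]
for n, bound in zip(depths, bounds):
    print(f"{n:2d} layers: product of activation derivatives <= {bound:.3e}")
```

For comparison, an active ReLU unit contributes a factor of exactly 1, so the corresponding product along a fully active path does not shrink.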

**Exploding Gradient Problem**

Conversely, gradients can grow exponentially large.

Solutions:
- Gradient clipping: if $\|g\| > \text{threshold}$, rescale $g \leftarrow g \cdot \text{threshold} / \|g\|$
- Careful weight initialization
- Learning rate scheduling
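Gradient clipping by norm is short enough to sketch directly. The function name `clip_by_norm` and the example values are illustrative assumptions; deep learning frameworks provide their own equivalents.

```python
import numpy as np

def clip_by_norm(g, threshold):
    """Rescale g so its L2 norm is at most threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return g * (threshold / norm)
    return g

g = np.array([3.0, 4.0])         # L2 norm is 5
clipped = clip_by_norm(g, 1.0)   # rescaled to norm 1
print(clipped, np.linalg.norm(clipped))
```

Because only the length changes, the update still points in the same direction; a gradient whose norm is already below the threshold is returned unchanged.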

## Backpropagation in Other ML Models

The core idea of backpropagation—calculating a gradient to minimize a loss—is not unique to complex neural networks.

### Linear Regression

In linear regression, we want to find the line of best fit by minimizing the Mean Squared Error (MSE) loss function:
$$ L(w, b) = \frac{1}{m} \sum_{i=1}^{m} (y_i - (wx_i + b))^2 $$

The gradients are:
$$ \frac{\partial L}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} -2x_i(y_i - (wx_i + b)) $$
$$ \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} -2(y_i - (wx_i + b)) $$

This is a very simple form of backpropagation, where the "network" is just a single node with a linear activation function.
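The two gradient formulas translate almost line by line into NumPy. This is a minimal sketch on synthetic data; the true slope 2 and intercept 1, the noise level, and the learning rate are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=100)   # synthetic data

w, b, alpha = 0.0, 0.0, 0.1
for _ in range(2000):
    residual = y - (w * x + b)
    dw = np.mean(-2 * x * residual)   # dL/dw
    db = np.mean(-2 * residual)       # dL/db
    w -= alpha * dw
    b -= alpha * db

print(w, b)  # should end up close to the slope and intercept used above
```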

### Logistic Regression

Logistic regression can be viewed as a single-neuron neural network with a sigmoid activation function. The loss function is the Binary Cross-Entropy loss, just like in our example network. The process of using gradient descent to find the optimal weights for logistic regression is exactly backpropagation on a network with no hidden layers.
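As a rough illustration, logistic regression trained by gradient descent reuses the same $(a - y)$ error term that appears at the output layer of the network. The toy dataset, learning rate, and number of steps below are assumptions for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable toy labels

w, b, alpha = np.zeros(2), 0.0, 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)
    dz = p - y                          # same (a - y) term as the output layer
    w -= alpha * (X.T @ dz) / len(y)    # dL/dw averaged over examples
    b -= alpha * dz.mean()              # dL/db averaged over examples

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(accuracy)
```

Here examples are stored as rows of `X`, which is the transpose of the column convention used by `SimpleNeuralNetwork`.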

### Support Vector Machines (SVMs)

While standard SVMs are often solved using different optimization techniques (like quadratic programming), they can also be trained using gradient descent, especially for linear SVMs. The loss function for an SVM is the **Hinge Loss**:

$$ L = \max(0, 1 - y(w^T x - b)) $$

Here the labels are $y \in \{-1, +1\}$. The hinge loss is not differentiable at the kink where $y(w^T x - b) = 1$, so in practice one computes a subgradient with respect to *w* and *b* and uses it in gradient descent to fit the hyperplane. This approach is particularly useful for large datasets, where quadratic-programming solvers become too slow.
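A minimal sketch of training a linear classifier with the hinge loss above by subgradient descent follows. It uses the same $w^T x - b$ convention, assumes labels in {-1, +1}, and omits the regularization term that a full soft-margin SVM objective would add; the data and hyperparameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1.0, -1.0)   # labels in {-1, +1}

w, b, alpha = np.zeros(2), 0.0, 0.1
for _ in range(500):
    margins = y * (X @ w - b)
    active = margins < 1   # examples with nonzero hinge loss
    # Subgradient of the mean hinge loss: -y * x for w, +y for b
    dw = -(y[active, None] * X[active]).sum(axis=0) / len(y)
    db = y[active].sum() / len(y)
    w -= alpha * dw
    b -= alpha * db

accuracy = np.mean(np.sign(X @ w - b) == y)
print(accuracy)
```

Only examples that violate the margin contribute to the update, which is the piecewise behavior of the hinge loss showing up in the gradient.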

### Final comparison
| Method | How It Uses Gradients | Backprop Connection |
|-------|------------------------|---------------------|
| Linear Regression | ∂L/∂W = -2X(y - y_pred) | Same idea: gradient of loss w.r.t. weights |
| Logistic Regression | ∂L/∂W = X^T(σ(WX+b) - y) | Uses chain rule on sigmoid |
| SVM | Uses hinge loss + gradient | Gradient flow is similar, but loss is different |
| **Neural Networks** | Chain rule through many layers | **Backprop is the generalization** of all three |

## Application in Basic Sciences: Classifying Phases of Matter

Neural networks are powerful tools in the basic sciences. For example, they can be used to identify the phases of matter (e.g., ordered vs. disordered) from raw simulation data, a task that often requires deep physical insight.

Let's simulate a simple dataset for a 2D Ising model, a model of magnetism. We'll have two phases: an ordered (ferromagnetic) phase at low temperatures and a disordered (paramagnetic) phase at high temperatures. The input to our network will be the spin configurations (grids of +1s and -1s).

*This is a simplified example for demonstration purposes.*

In [None]:
def generate_ising_data(L, n_samples):
    # Simple function to generate fake Ising model configurations
    # L: grid size (L x L)
    X = []
    y = []
    for _ in range(n_samples // 2):
        # Ordered phase (low temp): mostly aligned spins
        config = np.random.choice([-1, 1], size=(L, L), p=[0.1, 0.9])
        X.append(config.flatten())
        y.append(0) # Label for ordered
        
    for _ in range(n_samples // 2):
        # Disordered phase (high temp): random spins
        config = np.random.choice([-1, 1], size=(L, L), p=[0.5, 0.5])
        X.append(config.flatten())
        y.append(1) # Label for disordered
        
    return np.array(X), np.array(y).reshape(-1, 1)

L = 4
X_ising, y_ising = generate_ising_data(L, 1000)

# Shuffle the data
permutation = np.random.permutation(X_ising.shape[0])
X_ising = X_ising[permutation, :]
y_ising = y_ising[permutation, :]

# Split and transpose
X_train_ising, y_train_ising = X_ising.T[:, :800], y_ising.T[:, :800]
X_test_ising, y_test_ising = X_ising.T[:, 800:], y_ising.T[:, 800:]

print(f"Input data shape: {X_train_ising.shape}")
print(f"Labels shape: {y_train_ising.shape}")

# Train a network to classify the phases
ising_net = SimpleNeuralNetwork(input_size=L*L, hidden_size=8, output_size=1)
ising_losses = ising_net.train(X_train_ising, y_train_ising, epochs=5000, learning_rate=0.05)

# Plot the loss
plt.figure()
plt.plot(ising_losses)
plt.title("Ising Model Phase Classification Loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.show()

# Evaluate accuracy on the test set
y_pred_ising = ising_net.forward_pass(X_test_ising)
predictions = (y_pred_ising > 0.5).astype(int)
accuracy = np.mean(predictions == y_test_ising) * 100
print(f"Accuracy on the test set: {accuracy:.2f}%")

## Final Exercises

Here are some exercises to solidify your understanding of backpropagation.

**Exercise 1:**
Modify the `SimpleNeuralNetwork` class to use the **tanh** activation function in the hidden layer instead of the sigmoid function. You will need to find the derivative of tanh and use it in the `backward_pass` method. The tanh function and its derivative are:
$$ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} $$
$$ \frac{d}{dz}\tanh(z) = 1 - \tanh^2(z) $$
Retrain the model on the `make_moons` dataset and visualize the new decision boundary. Does it perform better or worse?

**Exercise 2:**
Add **L2 regularization** to the training process. This involves adding a penalty term to the loss function that is proportional to the sum of the squared weights. The new loss function will be:
$$ L_{reg} = L + \frac{\lambda}{2m} (||W^{[1]}||^2_F + ||W^{[2]}||^2_F) $$
This will add a term to the gradient updates for the weights:
$$ \frac{\partial L_{reg}}{\partial W} = \frac{\partial L}{\partial W} + \frac{\lambda}{m} W $$
Add a `reg_lambda` parameter to the `train` method and modify the `backward_pass` or `update_parameters` method to include this regularization term. Observe its effect on the decision boundary for a model with a large hidden layer (e.g., 50 units).

**Exercise 3:**
Our current implementation uses batch gradient descent (it processes all training examples at once). Modify the training loop to implement **Stochastic Gradient Descent (SGD)**, where the forward and backward passes are performed for one training example at a time. How does the loss curve change compared to batch gradient descent? (It should be much noisier.)

**Exercise 4:**
Create a neural network to solve a simple regression problem. For example, try to fit the function *y = sin(x)*. You will need to:
1. Generate training data (e.g., `X = np.linspace(-np.pi, np.pi, 100)`, `y = np.sin(X)`).
2. Change the output layer's activation to be linear (i.e., no activation function).
3. Change the loss function to Mean Squared Error (MSE).
4. Modify the backpropagation equations to account for the new loss and activation.

Plot the network's predictions against the true `sin(x)` curve.

**Exercise 5 (Conceptual):**
Explain in your own words how backpropagation would work in a neural network with **two hidden layers**. Write down the equations for the gradient of the loss with respect to the weights of the *first* hidden layer (*W<sup>[1]</sup>*). How does the error signal from the final layer propagate back to this first layer?

**Exercise 6: Complete Backpropagation Calculation**
Given a 2-layer neural network:

- Input: x = [1, 2]
- W₁ = [[0.5, 0.3], [0.2, 0.8]], b₁ = [0.1, 0.4]
- W₂ = [[0.9, 0.7]], b₂ = [0.2]
- Activation: sigmoid for all layers
- Loss: MSE
- Target: y = 0.8

Calculate:

- a) Forward pass (all z and a values)
- b) Loss value
- c) All gradients (∂L/∂W₁, ∂L/∂b₁, ∂L/∂W₂, ∂L/∂b₂)
- d) Updated weights after one gradient descent step (α = 0.1)

**Exercise 7: Vanishing Gradient Analysis**
Consider a 5-layer network with sigmoid activations. If the initial gradient at the output layer has magnitude 1.0:

- a) Estimate the gradient magnitude at each layer working backwards
- b) What happens to the gradient magnitude at the input layer?
- c) Suggest three techniques to mitigate this problem
- d) Recalculate assuming ReLU activations (assume 50% of neurons are active)

**Exercise 8: Scientific Application Design**
Design a neural network to predict chemical reaction rates from molecular descriptors:

- a) Define input features (at least 5 molecular properties)
- b) Design the network architecture (number of layers, neurons, activations)
- c) Choose an appropriate loss function and justify your choice
- d) Describe how backpropagation would optimize this network
- e) Identify potential challenges and solutions

**Exercise 9: Custom Activation Function**
Create a new activation function: f(x) = x × tanh(x)

- a) Derive its derivative f'(x)
- b) Implement the backpropagation equations for a layer using this activation
- c) Analyze its properties: range, monotonicity, gradient behavior
- d) Compare advantages/disadvantages vs ReLU and sigmoid

**Exercise 10: Multi-Output Regression**
Design backpropagation for a network predicting multiple outputs:

- Input: Environmental measurements `[temperature, humidity, pressure, wind_speed]`
- Outputs: `[pm2.5_concentration, ozone_level, no2_level]`

Tasks:

- a) Write the forward pass equations
- b) Define an appropriate loss function for multiple outputs
- c) Derive the backpropagation equations
- d) Discuss how to handle outputs with different scales/importance
- e) Implement gradient updates for the multi-output case

## Summary
Backpropagation is the fundamental algorithm enabling neural network training. Key takeaways:

- Chain Rule Application: Systematically applies calculus chain rule to compute gradients
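- Efficiency: Computes all gradients in a single backward pass whose cost is proportional to that of the forward pass, by reusing intermediate results instead of differentiating each weight separately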
- Universal Algorithm: Works for any differentiable network architecture
- Scientific Applications: Enables neural networks to learn complex patterns in physics, chemistry, biology
- Optimization Foundation: Provides gradients for sophisticated optimization algorithms

Understanding backpropagation deeply allows you to:

- Debug training problems
- Design custom architectures
- Apply neural networks to novel scientific problems
- Optimize training procedures

The algorithm's elegance lies in its simplicity: compute errors backward, update weights to reduce errors. Yet this simple principle powers the most sophisticated AI systems solving complex scientific and engineering problems.

## Further Reading

- "Deep Learning" by Goodfellow, Bengio, and Courville (Chapter 6)
- "Pattern Recognition and Machine Learning" by Bishop (Chapter 5)
- Original backpropagation papers by Rumelhart, Hinton, and Williams (1986)
- Modern optimization techniques: Adam, RMSprop, etc.