# Day 10: Backpropagation Basics

Welcome to Day 10!

Today you will learn:
- What backpropagation actually is (not magic)
- Why the chain rule is unavoidable
- How gradients flow backward
- Manual derivative calculations
- A small NumPy example that mirrors real training

This is the heart of deep learning.

---

# What Is Backpropagation?

Backpropagation is NOT an optimization algorithm.

It is:
> An algorithm that computes the gradient of the loss function with respect to every parameter of a neural network by applying the chain rule backward through the network, reusing intermediate results so the computation stays efficient.


### Why Backpropagation Exists

A neural network may have millions of parameters.  
To train it, we must know how the loss changes with respect to every weight and bias.

Formally, we need:
$$
\frac{\partial L}{\partial w_i} \quad \text{and} \quad \frac{\partial L}{\partial b_i}
$$

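Computing each derivative independently, for example by perturbing one parameter at a time and rerunning the forward pass, would cost roughly one forward pass per parameter, which is impractical at this scale.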

Backpropagation solves this by reusing intermediate derivatives.
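
A minimal sketch of that reuse, assuming a single sigmoid neuron with three inputs and a squared-error loss. The input values and the name `delta` are illustrative and not part of the lesson. The product $\frac{dL}{da} \cdot \frac{da}{dz}$ is computed once and then shared by every weight gradient and by the bias gradient.

```python
import numpy as np

# Illustrative values (assumptions, not taken from the lesson)
x = np.array([1.0, -2.0, 0.5])   # three inputs
w = np.array([0.2, 0.4, -0.1])   # one weight per input
b = 0.0
y = 1.0

# Forward pass
z = w @ x + b
a = 1.0 / (1.0 + np.exp(-z))

# Shared intermediate derivative: dL/dz = dL/da * da/dz, computed once
delta = 2.0 * (a - y) * a * (1.0 - a)

# Every parameter gradient reuses delta
dL_dw = delta * x   # dz/dw_i = x_i
dL_db = delta       # dz/db = 1

print("delta:", delta)
print("dL/dw:", dL_dw)
print("dL/db:", dL_db)
```

In a deeper network the same pattern repeats per layer: the gradient arriving from the layer above plays the role of `delta` for everything below it.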

### Core Idea (Chain Rule)

Each layer’s output depends on the previous layer:
$$
a^{[l]} = f(z^{[l]}), \quad z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}
$$

Using the chain rule:

$$
\frac{\partial L}{\partial W^{[l]}}
=
\frac{\partial L}{\partial a^{[l]}}
\cdot
\frac{\partial a^{[l]}}{\partial z^{[l]}}
\cdot
\frac{\partial z^{[l]}}{\partial W^{[l]}}
$$

This allows gradients to flow from the output layer back to the input layer.

### What Backpropagation Actually Does

- Starts at the loss function
- Moves backward layer by layer
- Computes gradients for:
  - Weights
  - Biases
  - Activations
- Stores gradients so they can be reused efficiently
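
The steps above can be sketched for a small fully connected network. This is an illustration under stated assumptions, not a reference implementation: every layer uses a sigmoid, the loss is the summed squared error, and the layer widths, the random weights, and the helper names `forward` and `backward` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Run the forward pass and cache every activation for the backward pass."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def backward(activations, weights, y):
    """Walk from the loss back to the first layer, reusing the upstream gradient."""
    grads_W, grads_b = [], []
    dL_da = 2.0 * (activations[-1] - y)          # start at the loss
    for l in reversed(range(len(weights))):
        a_out, a_in = activations[l + 1], activations[l]
        dL_dz = dL_da * a_out * (1.0 - a_out)    # through the sigmoid
        grads_W.append(np.outer(dL_dz, a_in))    # dL/dW for this layer
        grads_b.append(dL_dz)                    # dL/db for this layer
        dL_da = weights[l].T @ dL_dz             # pass the gradient to the previous layer
    return grads_W[::-1], grads_b[::-1]

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                # illustrative layer widths
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0])

acts = forward(x, weights, biases)
grads_W, grads_b = backward(acts, weights, y)
for l, (gW, gb) in enumerate(zip(grads_W, grads_b), start=1):
    print(f"layer {l}: dL/dW shape {gW.shape}, dL/db shape {gb.shape}")
```

Each gradient has the same shape as the parameter it belongs to, and `dL_da` is the only quantity carried from one layer to the next.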

### What Backpropagation Does Not Do

- It does not update weights  
- It does not choose the learning rate  

Those are handled by optimizers (SGD, Adam, RMSProp).
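
A small sketch of that division of labor, assuming a single sigmoid neuron with squared-error loss. The function names `compute_gradients` and `sgd_update` and the learning rates are hypothetical and chosen only for illustration. The gradient function never changes a parameter; the update rule decides what to do with the result.

```python
import numpy as np

def compute_gradients(w, b, x, y):
    """Backpropagation: returns gradients only, changes nothing."""
    a = 1.0 / (1.0 + np.exp(-(w * x + b)))
    delta = 2.0 * (a - y) * a * (1.0 - a)
    return delta * x, delta

def sgd_update(param, grad, lr):
    """Optimizer: decides how to use the gradient."""
    return param - lr * grad

w, b = 0.5, 0.0          # illustrative starting values
x, y = 2.0, 1.0

dL_dw, dL_db = compute_gradients(w, b, x, y)   # backprop step

# The same gradients can be handed to different learning rates or update rules
for lr in (0.01, 0.1, 1.0):
    print(lr, sgd_update(w, dL_dw, lr), sgd_update(b, dL_db, lr))
```

Swapping plain SGD for another optimizer would only replace `sgd_update`; the gradient computation stays the same.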

# Why Chain Rule?

In a neural network, the loss does not depend on weights directly.  
Each weight influences the loss through a sequence of intermediate variables.

Example dependency:
$$
w \;\rightarrow\; z \;\rightarrow\; a \;\rightarrow\; L
$$

Where:
- $w$ = weight  
- $z = wx + b$ = linear combination  
- $a = f(z)$ = activation output  
- $L$ = loss  

### Applying the Chain Rule

Because of this indirect dependency, we cannot compute $\frac{dL}{dw}$ in one step.

Instead, we break it into pieces:
$$
\frac{dL}{dw} =
\frac{dL}{da}
\cdot
\frac{da}{dz}
\cdot
\frac{dz}{dw}
$$

Each term answers a specific question:
- $\frac{dL}{da}$ → How sensitive is the loss to the neuron’s output?
- $\frac{da}{dz}$ → How does the activation respond to its input?
- $\frac{dz}{dw}$ → How does the weight affect the neuron input?

### Why This Matters

- Neural networks are deep chains of such dependencies.
- The chain rule lets us compute gradients layer by layer, starting from the loss.
- These gradients tell each weight how much it contributed to the error.

Backpropagation is just the repeated application of the chain rule across all layers of the network. Without it, computing the gradients of a deep network would be prohibitively expensive.
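
One way to convince yourself that the three-factor product is correct is to compare it with a numerical estimate. The sketch below assumes a tanh activation and a squared-error loss, with made-up values for $w$, $b$, $x$, and $y$; none of these choices come from the lesson.

```python
import numpy as np

# Illustrative values and choices (assumptions): tanh activation, squared error
w, b, x, y = 0.8, -0.3, 1.5, 0.2

def loss(w_):
    a_ = np.tanh(w_ * x + b)
    return (a_ - y) ** 2

# Chain rule: three local derivatives multiplied together
z = w * x + b
a = np.tanh(z)
dL_da = 2.0 * (a - y)
da_dz = 1.0 - a ** 2      # derivative of tanh
dz_dw = x
chain = dL_da * da_dz * dz_dw

# Central finite difference as an independent estimate
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print("chain rule:", chain)
print("numeric   :", numeric)
```

The two numbers should agree to several decimal places. The finite-difference estimate needs two extra forward passes per parameter, which is why it is used for checking and not for training.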


# Computational Graph (Single Neuron)

Given:
$$
z = wx + b
$$
$$
a = \sigma(z)
$$
$$
L = (a - y)^2
$$

We will compute gradients step by step:
- $\frac{dL}{da}$
- $\frac{da}{dz}$
- $\frac{dz}{dw}$


### Step 1: Gradient of Loss

Loss:
$$
L = (a - y)^2
$$

Derivative:
$$
\boxed{\frac{dL}{da} = 2(a - y)}
$$

This tells us:
> How much the loss changes when the activation changes


### Step 2: Sigmoid Derivative

Given sigmoid activation:
$$
a = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Rewrite the sigmoid as a power so the chain rule applies directly:
$$
a = (1 + e^{-z})^{-1}
$$

Differentiate w.r.t. $z$. Using the chain rule:
$$
\frac{da}{dz}
= -1 \cdot (1 + e^{-z})^{-2} \cdot \frac{d}{dz}(1 + e^{-z})
$$

Differentiate the inner term:
$$
\frac{d}{dz}(1 + e^{-z}) = -e^{-z}
$$

Substitute back:
$$
\frac{da}{dz}
= - (1 + e^{-z})^{-2} \cdot (-e^{-z})
$$

$$
= \frac{e^{-z}}{(1 + e^{-z})^2}
$$


**Express derivative in terms of $a$**

Recall:
$$
a = \frac{1}{1 + e^{-z}}
$$

So:
$$
1 - a = \frac{e^{-z}}{1 + e^{-z}}
$$

Multiply:
$$
a(1 - a)
= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
$$

$$
= \frac{e^{-z}}{(1 + e^{-z})^2}
$$

Final result:
$$
\boxed{
\frac{da}{dz} = a(1 - a)
}
$$
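
The boxed identity can be checked numerically. This is a quick sanity check on an arbitrary grid of $z$ values, not part of the derivation; the grid and the step size `eps` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)   # illustrative sample points
a = sigmoid(z)

closed_form = a * (1.0 - a)                          # boxed result
raw_form = np.exp(-z) / (1.0 + np.exp(-z)) ** 2      # before simplification
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.allclose(closed_form, raw_form))
print(np.allclose(closed_form, numeric, atol=1e-8))
print("largest slope:", closed_form.max(), "at z =", z[closed_form.argmax()])
```

The slope peaks at $0.25$ when $z = 0$ and shrinks toward zero for large $|z|$, so the $a(1 - a)$ factor scales down any gradient that passes through a saturated sigmoid.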


### Step 3: Linear Part

Linear equation:
$$
z = wx + b
$$

Derivatives:
$$
\boxed{\frac{dz}{dw} = x}
$$
$$
\boxed{\frac{dz}{db} = 1}
$$


### Step 4: Putting It All Together

Using chain rule:
$$
\frac{dL}{dw} =
\frac{dL}{da} \cdot
\frac{da}{dz} \cdot
\frac{dz}{dw}
$$

Substitute:
$$
\boxed{\frac{dL}{dw} =
2(a - y) \cdot a(1 - a) \cdot x}
$$

This is backpropagation for one neuron.
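
The bias gradient follows from the same substitution, using $\frac{dz}{db} = 1$ from Step 3:

$$
\frac{dL}{db} =
\frac{dL}{da} \cdot
\frac{da}{dz} \cdot
\frac{dz}{db}
= 2(a - y) \cdot a(1 - a) \cdot 1
$$

Both gradients share the factor $2(a - y) \cdot a(1 - a)$, so for this neuron:

$$
\frac{dL}{dw} = x \cdot \frac{dL}{db}
$$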


# Manual Example (Numbers)

Given

- Input: $x = 2$
- Weight: $w = 0.5$
- Bias: $b = 0$
- True label: $y = 1$
- Activation: Sigmoid
- Loss: MSE

### 1. Forward Pass

Linear step
$$
z = wx + b = (0.5)(2) + 0 = 1
$$

Activation
$$
a = \sigma(z) = \frac{1}{1 + e^{-1}} \approx 0.731
$$

### 2. Loss Calculation

MSE loss
$$
L = (a - y)^2 = (0.731 - 1)^2
$$

$$
L = (-0.269)^2 \approx 0.072
$$

### 3. Backward Pass (Gradients)

We apply the chain rule:

$$
\frac{dL}{dw}
= \frac{dL}{da}
\cdot \frac{da}{dz}
\cdot \frac{dz}{dw}
$$


Step 1: Loss gradient w.r.t. activation
$$
\frac{dL}{da} = 2(a - y)
$$

$$
\frac{dL}{da} = 2(0.731 - 1) = -0.538
$$


Step 2: Sigmoid derivative
$$
\frac{da}{dz} = a(1 - a)
$$

$$
\frac{da}{dz} = 0.731 \cdot (1 - 0.731)
$$

$$
\frac{da}{dz} \approx 0.197
$$

Step 3: Linear derivative
$$
\frac{dz}{dw} = x = 2
$$

Step 4: Final Gradient w.r.t. Weight

Multiply all terms:
$$
\frac{dL}{dw}
= (-0.538) \cdot (0.197) \cdot 2
$$

$$
\frac{dL}{dw} \approx -0.212
$$


Step 5: Gradient w.r.t. Bias (for completeness)

Since:
$$
\frac{dz}{db} = 1
$$

$$
\frac{dL}{db}
= \frac{dL}{da} \cdot \frac{da}{dz}
$$

$$
\frac{dL}{db}
= (-0.538)(0.197)
\approx -0.106
$$


### Interpretation

- The prediction ($a \approx 0.731$) is below the target $y = 1$
- Both gradients are negative, so increasing $w$ and $b$ lowers the loss
- Gradient descent subtracts the gradient, which pushes both parameters **upward**

### Code

```python
import numpy as np

# Given values
x = 2
w = 0.5
b = 0
y = 1

# ---------- Forward pass ----------
z = w * x + b
a = 1 / (1 + np.exp(-z))
L = (a - y)**2

# ---------- Backward pass ----------
# dL/da
dL_da = 2 * (a - y)

# da/dz (sigmoid derivative)
da_dz = a * (1 - a)

# dz/dw and dz/db
dz_dw = x
dz_db = 1

# Gradients
dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db

# Notebook cell output: loss, dL/dw, dL/db
L, dL_dw, dL_db
```

Output:

```
(0.07232948812851325, -0.21150837113706686, -0.10575418556853343)
```

# Gradient Descent Update

Gradient descent updates parameters by moving them in the direction opposite to the gradient, because the gradient points in the direction of steepest increase of the loss.

### Weight Update

Using learning rate $\eta$:

$$
w_{\text{new}} = w - \eta \frac{dL}{dw}
$$

- If $\frac{dL}{dw} > 0$ → decrease $w$
- If $\frac{dL}{dw} < 0$ → increase $w$

This is how the model corrects its mistakes.

### Bias Update

Bias is updated the same way:

$$
b_{\text{new}} = b - \eta \frac{dL}{db}
$$

- The bias shifts the neuron's activation curve left or right along the input axis
- Updating $b$ therefore changes the input level at which the neuron activates

### Key Points

- Gradients tell which direction to move
- Learning rate controls how far to move
- Repeating these updates gradually minimizes the loss

### One Training Step

1. Forward pass → compute prediction
2. Compute loss
3. Backward pass → compute gradients
4. Update $w$ and $b$
5. Repeat steps 1 to 4 until the loss stops improving

This loop is the core engine of neural network learning.


```python
# Continues from the previous cell: w, b, dL_dw, dL_db are already defined
lr = 0.1
w_new = w - lr * dL_dw
b_new = b - lr * dL_db

print("updated w:", w_new)
print("updated b:", b_new)
```

Output:

```
updated w: 0.5211508371137067
updated b: 0.010575418556853344
```
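
Repeating the same update many times gives a complete, if tiny, training loop. The sketch below reuses the starting values of the manual example; the number of steps (`n_steps = 50`) is an arbitrary choice for illustration.

```python
import numpy as np

# Same starting point as the manual example
x, y = 2.0, 1.0
w, b = 0.5, 0.0
lr = 0.1
n_steps = 50               # illustrative choice, not from the lesson

losses = []
for step in range(n_steps):
    # Forward pass and loss
    a = 1.0 / (1.0 + np.exp(-(w * x + b)))
    losses.append((a - y) ** 2)
    # Backward pass
    delta = 2.0 * (a - y) * a * (1.0 - a)
    dL_dw, dL_db = delta * x, delta
    # Update
    w -= lr * dL_dw
    b -= lr * dL_db

print("first loss:", losses[0])
print("last loss :", losses[-1])
print("w, b after training:", w, b)
```

Because the sigmoid output stays below the target $y = 1$, both gradients remain negative and the loss falls at every step. Progress slows as the neuron saturates and the $a(1 - a)$ factor shrinks.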


# Extending to Multiple Layers

We extend the single-neuron example to a tiny 2-layer network to see how backprop works step by step.

**Network Architecture**

- Inputs: $x_1, x_2$
- Hidden layer: **2 neurons (ReLU)**
- Output layer: **1 neuron (linear)**, so the output $\hat{y}$ is equal to the pre-activation $Z_2$
- Loss: **Mean Squared Error**

**Given Values**

- input $x_1 = 1$
- input $x_2 = 2$
- target $y = 3$

**Parameters (initial)**

Hidden layer (column $j$ of $W_1$ holds the weights of hidden neuron $j$, so $Z_1 = W_1^\top x + b_1$):
$$
W_1 =
\begin{bmatrix}
0.5 & -0.4 \\
0.3 & 0.1
\end{bmatrix},
\quad
b_1 = [0, 0]
$$

Output layer:
$$
W_2 = [0.6, -0.2],
\quad
b_2 = 0.1
$$

### 1. Forward Pass

Hidden layer pre-activation
$$
z_{h1} = 0.5(1) + 0.3(2) = 1.1
$$
$$
z_{h2} = -0.4(1) + 0.1(2) = -0.2
$$

So:
$$
Z_1 = [1.1, -0.2]
$$

Hidden layer activation (ReLU)
$$
A_1 = [\max(0,1.1), \max(0,-0.2)] = [1.1, 0]
$$

Output layer pre-activation
$$
\hat{y} = z_2 = 0.6(1.1) + (-0.2)(0) + 0.1 = 0.76
$$

Loss
$$
L = (\hat{y} - y)^2 = (0.76 - 3)^2 \approx 5.02
$$


### 2. Backward Pass (Chain Rule)

Output layer gradient
$$
\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y) = 2(-2.24) = -4.48
$$

Gradients w.r.t output parameters

* $Z_2 = A_1 \cdot W_2 + b_2$  
* $\frac{\partial Z_2}{\partial W_2} = A_1$
* Here, a linear output means $\hat{y}$ is directly equal to the pre-activation $Z_2$.
* No sigmoid, ReLU, tanh, or other function is applied.
* Because of that, the derivative of the activation is $\frac{d\hat{y}}{dZ_2} = 1$.

$$
\boxed{\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{d\hat{y}}{dZ_2} \cdot \frac{\partial Z_2}{\partial W_2}}
$$

$$
\frac{\partial L}{\partial W_2}
= A_1 \cdot \frac{\partial L}{\partial \hat{y}}
= [1.1, 0] \cdot (-4.48)
\approx [-4.93, 0]
$$


The output neuron is linear:

$$
\hat{y} = Z_2 = W_2 \cdot A_1 + b_2
$$

By the chain rule:

$$
\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b_2}
$$

We already computed:

$$
\frac{\partial L}{\partial \hat{y}} = 2 (\hat{y} - y) = 2(0.76 - 3) = -4.48
$$

Since $\hat{y} = Z_2 = W_2 \cdot A_1 + b_2$, the derivative w.r.t $b_2$ is:

$$
\frac{\partial \hat{y}}{\partial b_2} = 1
$$

So:

$$
\frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial \hat{y}} \cdot 1 = -4.48
$$


### 3. Backprop into Hidden Layer

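We can summarize the gradient for the hidden layer weights as a chain of four factors (written schematically here; the steps that follow work out each product with the correct shapes):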

$$
\boxed{
\frac{\partial L}{\partial W_1} = 
\frac{\partial L}{\partial \hat{y}} \;\cdot\; 
\frac{\partial \hat{y}}{\partial A_1} \;\cdot\; 
\frac{\partial A_1}{\partial Z_1} \;\cdot\; 
\frac{\partial Z_1}{\partial W_1}
}
$$

Where each term is:

- $\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)$ → derivative of the loss  
- $\frac{\partial \hat{y}}{\partial A_1} = W_2$ → output layer weights  
- $\frac{\partial A_1}{\partial Z_1} = \text{ReLU}'(Z_1)$ → hidden layer activation derivative  
- $\frac{\partial Z_1}{\partial W_1} = x$ → inputs to the hidden layer neurons

Gradient flowing into hidden activations
$$
\frac{\partial L}{\partial A_1}
= W_2 \cdot \frac{\partial L}{\partial \hat{y}}
= [0.6, -0.2] \cdot (-4.48)
\approx [-2.69, 0.90]
$$

Apply the ReLU derivative:

$\text{ReLU}'(z) = 1$ if $z > 0$, else $0$

Since:
$$
Z_1 = [1.1, -0.2]
$$

ReLU mask:
$$
[1, 0]
$$

So:
$$
\frac{\partial L}{\partial Z_1}
= [-2.69, 0.90] \odot [1, 0]
= [-2.69, 0]
$$


**Gradients w.r.t Hidden Weights**

Hidden neuron 1:
$$
\frac{\partial L}{\partial w_{11}} = x_1 \cdot (-2.69) = -2.69
$$
$$
\frac{\partial L}{\partial w_{21}} = x_2 \cdot (-2.69) = -5.38
$$

Hidden neuron 2:
$$
\frac{\partial L}{\partial w_{12}} = 0
\quad
\frac{\partial L}{\partial w_{22}} = 0
$$

Bias gradients:
$$
\frac{\partial L}{\partial b_1} = [-2.69, 0]
$$
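
The element-wise results above can be collected into one compact form. The shorthand $\delta_1$ is introduced here for convenience, and the expression assumes the convention implied by the forward pass, where column $j$ of $W_1$ holds the weights of hidden neuron $j$:

$$
\delta_1 = \frac{\partial L}{\partial Z_1}
= \left( W_2 \, \frac{\partial L}{\partial \hat{y}} \right) \odot \text{ReLU}'(Z_1),
\qquad
\frac{\partial L}{\partial W_1} = x \, \delta_1^{\top},
\qquad
\frac{\partial L}{\partial b_1} = \delta_1
$$

With the numbers of this example:

$$
\frac{\partial L}{\partial W_1}
=
\begin{bmatrix}
1 \\
2
\end{bmatrix}
\begin{bmatrix}
-2.69 & 0
\end{bmatrix}
=
\begin{bmatrix}
-2.69 & 0 \\
-5.38 & 0
\end{bmatrix}
$$

Entry $(i, j)$ is $x_i$ times the $j$-th component of $\delta_1$, which matches $\frac{\partial L}{\partial w_{ij}}$ computed one weight at a time.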


### 4. Gradient Descent Update (η = 0.1)

Hidden layer

$$
W_1^{new} =
\begin{bmatrix}
0.5 & -0.4 \\
0.3 & 0.1
\end{bmatrix}
-
0.1
\begin{bmatrix}
-2.69 & 0 \\
-5.38 & 0
\end{bmatrix}
=
\begin{bmatrix}
0.769 & -0.4 \\
0.838 & 0.1
\end{bmatrix}
$$

$$
b_1^{new} = [0.269, 0]
$$

Output layer
$$
W_2^{new} = [0.6, -0.2] - 0.1[-4.93, 0] = [1.093, -0.2]
$$

$$
b_2^{new} = 0.1 - 0.1(-4.48) = 0.548
$$
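
A quick way to see what this single update achieved is to rerun the forward pass with the new parameters. The sketch below is a check added for illustration; the helper name `forward_loss` is hypothetical, and the updated parameters are entered with four decimals (the values before rounding to three).

```python
import numpy as np

def forward_loss(W1, b1, W2, b2, x, y):
    """Forward pass of the 2-layer network; returns prediction and squared error."""
    A1 = np.maximum(0, W1.T @ x + b1)   # columns of W1 are hidden neurons
    y_hat = W2 @ A1 + b2
    return y_hat, (y_hat - y) ** 2

x, y = np.array([1.0, 2.0]), 3.0

# Initial parameters from the worked example
W1 = np.array([[0.5, -0.4], [0.3, 0.1]])
b1 = np.array([0.0, 0.0])
W2 = np.array([0.6, -0.2])
b2 = 0.1

# Updated parameters from the step above (unrounded values)
W1_new = np.array([[0.7688, -0.4], [0.8376, 0.1]])
b1_new = np.array([0.2688, 0.0])
W2_new = np.array([1.0928, -0.2])
b2_new = 0.548

print("before:", forward_loss(W1, b1, W2, b2, x, y))
print("after :", forward_loss(W1_new, b1_new, W2_new, b2_new, x, y))
```

The prediction moves from $0.76$ to about $3.51$, and the loss drops from about $5.02$ to about $0.26$. The step overshoots the target $y = 3$ slightly, which shows how the learning rate controls the size of each move.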


### Code

```python
import numpy as np

# Inputs and target
x = np.array([1, 2])
y = 3

# Parameters (initial)
W1 = np.array([[0.5, -0.4],
               [0.3,  0.1]])
b1 = np.array([0.0, 0.0])

W2 = np.array([0.6, -0.2])
b2 = 0.1

# Learning rate
eta = 0.1

# ---------- Forward Pass ----------
# Hidden layer pre-activation (columns of W1 are hidden neurons)
Z1 = W1.T @ x + b1  # shape (2,)
# ReLU activation
A1 = np.maximum(0, Z1)

# Output layer (linear)
Z2 = W2 @ A1 + b2
y_hat = Z2

# Loss (MSE)
loss = (y_hat - y)**2
print("Forward Pass:")
print("Z1:", Z1)
print("A1:", A1)
print("Z2 / y_hat:", y_hat)
print("Loss:", loss)

# ---------- Backward Pass ----------
# Output layer gradient
dL_dyhat = 2 * (y_hat - y)  # dL/dy_hat

# Gradients w.r.t output weights and bias
dL_dW2 = dL_dyhat * A1
dL_db2 = dL_dyhat * 1

# Backprop into hidden layer
dL_dA1 = W2 * dL_dyhat
# ReLU derivative
relu_mask = (Z1 > 0).astype(float)
dL_dZ1 = dL_dA1 * relu_mask

# Gradients w.r.t hidden weights and biases
dL_dW1 = np.outer(x, dL_dZ1)  # shape (2,2)
dL_db1 = dL_dZ1

print("\nBackward Pass:")
print("dL/dW2:", dL_dW2)
print("dL/db2:", dL_db2)
print("dL/dW1:\n", dL_dW1)
print("dL/db1:", dL_db1)

# ---------- Gradient Descent Update ----------
W1_new = W1 - eta * dL_dW1
b1_new = b1 - eta * dL_db1

W2_new = W2 - eta * dL_dW2
b2_new = b2 - eta * dL_db2

print("\nUpdated Parameters:")
print("W1_new:\n", W1_new)
print("b1_new:", b1_new)
print("W2_new:", W2_new)
print("b2_new:", b2_new)
```

Output:

```
Forward Pass:
Z1: [ 1.1 -0.2]
A1: [1.1 0. ]
Z2 / y_hat: 0.76
Loss: 5.017600000000001

Backward Pass:
dL/dW2: [-4.928 -0.   ]
dL/db2: -4.48
dL/dW1:
 [[-2.688  0.   ]
 [-5.376  0.   ]]
dL/db1: [-2.688  0.   ]

Updated Parameters:
W1_new:
 [[ 0.7688 -0.4   ]
 [ 0.8376  0.1   ]]
b1_new: [0.2688 0.    ]
W2_new: [ 1.0928 -0.2   ]
b2_new: 0.548
```
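
A common way to validate hand-derived gradients is a finite-difference check. The sketch below is self-contained and repeats the analytic formulas from the code above for $W_1$ only; the step size `eps` and the helper name `loss_fn` are assumptions made for this illustration.

```python
import numpy as np

x, y = np.array([1.0, 2.0]), 3.0
W1 = np.array([[0.5, -0.4], [0.3, 0.1]])
b1 = np.array([0.0, 0.0])
W2 = np.array([0.6, -0.2])
b2 = 0.1

def loss_fn(W1_):
    A1 = np.maximum(0, W1_.T @ x + b1)
    return (W2 @ A1 + b2 - y) ** 2

# Analytic gradient, same formulas as the code above
Z1 = W1.T @ x + b1
A1 = np.maximum(0, Z1)
dL_dyhat = 2 * (W2 @ A1 + b2 - y)
dL_dZ1 = (W2 * dL_dyhat) * (Z1 > 0)
analytic = np.outer(x, dL_dZ1)

# Numerical gradient: perturb one weight at a time
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        step = np.zeros_like(W1)
        step[i, j] = eps
        numeric[i, j] = (loss_fn(W1 + step) - loss_fn(W1 - step)) / (2 * eps)

print("analytic:\n", analytic)
print("numeric:\n", numeric)
print("match:", np.allclose(analytic, numeric, atol=1e-5))
```

The second column is zero in both results because the second hidden neuron has a negative pre-activation, so ReLU blocks its gradient. The check would be unreliable for a pre-activation very close to zero, where ReLU is not differentiable.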


# Key Takeaways from Day 10

- Backpropagation is **chain rule in action**
- Gradients flow backward through the network
- Learning = gradient descent using backprop gradients
- Backprop for one neuron uses the same logic as backprop for deep networks

---

<p style="text-align:center; font-size:18px;">
© 2025 Mostafizur Rahman
</p>
