<a href="https://www.kaggle.com/code/mrafraim/dl-day-11-backpropagation-in-numpy?scriptVersionId=287027922" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 11: Backpropagation in NumPy

Welcome to Day 11!

Today you will:
- Build a tiny neural network from scratch
- Perform forward propagation
- Compute loss
- Apply backpropagation manually
- Update weights step by step

This notebook mirrors what deep learning frameworks do internally.

---

# Tiny Neural Network Architecture

We use a 2-layer network:

$$\text{Input (2)} \rightarrow \text{Hidden (2, ReLU)} \rightarrow \text{Output (1, Sigmoid)}$$

Mathematically:

Layer 1:
$$
Z^{(1)} = W^{(1)} X + b^{(1)}
$$
$$
A^{(1)} = ReLU(Z^{(1)})
$$

Layer 2:
$$
Z^{(2)} = W^{(2)} A^{(1)} + b^{(2)}
$$
$$
A^{(2)} = \sigma(Z^{(2)})
$$


In [1]:
# Import numpy library
import numpy as np

In [2]:
# Input (2 features, 1 sample)
X = np.array([[1.0],
              [2.0]])

# True label
y = np.array([[1.0]])

# Initialize weights
W1 = np.array([[0.4, -0.2],
               [0.1,  0.6]])
b1 = np.zeros((2, 1))

W2 = np.array([[0.7, -0.3]])
b2 = np.zeros((1, 1))

In [3]:
# Activation functions

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Forward Propagation

We compute:
- Hidden layer output
- Final prediction
- Loss

Loss used (squared error on the single sample, with $\hat{y} = A^{(2)}$):
$$
L = (\hat{y} - y)^2
$$


In [4]:
# Layer 1
Z1 = np.dot(W1, X) + b1
A1 = relu(Z1)

# Layer 2
Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)

# Loss (squared error; same as MSE for a single sample)
loss = (A2 - y)**2

Z1, A1, Z2, A2, loss

(array([[0. ],
        [1.3]]),
 array([[0. ],
        [1.3]]),
 array([[-0.39]]),
 array([[0.4037173]]),
 array([[0.35555306]]))

# Backpropagation Flow

Backward order:
1. Output layer gradients
2. Hidden layer gradients
3. Parameter gradients
4. Weight updates

Chain rule governs everything.


# Output Layer Gradients

Loss:
$$
L = (A^{(2)} - y)^2
$$

Gradients:
$$
\frac{dL}{dA^{(2)}} = 2(A^{(2)} - y)
$$
$$
\frac{dA^{(2)}}{dZ^{(2)}} = A^{(2)}(1 - A^{(2)})
$$


In [5]:
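# Output layer gradients
dL_dA2 = 2 * (A2 - y)        # derivative of the squared error w.r.t. the output
dA2_dZ2 = A2 * (1 - A2)      # sigmoid derivative, written in terms of A2

dZ2 = dL_dA2 * dA2_dZ2       # chain rule: dL/dZ2

dW2 = np.dot(dZ2, A1.T)      # shape (1, 2), same as W2
db2 = dZ2                    # shape (1, 1), same as b2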


# Hidden Layer Gradients

Backpropagate error:
$$
\frac{dL}{dA^{(1)}} = W^{(2)T} \frac{dL}{dZ^{(2)}}
$$

Apply ReLU derivative:
$$
\frac{dA^{(1)}}{dZ^{(1)}} =
\begin{cases}
1 & Z^{(1)} > 0 \\
0 & \text{otherwise}
\end{cases}
$$


In [6]:
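# Hidden layer gradients
dA1 = np.dot(W2.T, dZ2)              # push dL/dZ2 back through W2
dZ1 = dA1 * relu_derivative(Z1)      # ReLU gate: zero wherever Z1 <= 0

dW1 = np.dot(dZ1, X.T)               # shape (2, 2), same as W1
db1 = dZ1                            # shape (2, 1), same as b1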


# Gradient Check (Sanity)

Each gradient must have the same shape as the parameter it updates. This checks shapes only, not values:
- dW1 → W1
- db1 → b1
- dW2 → W2
- db2 → b2


In [7]:
dW1.shape, db1.shape, dW2.shape, db2.shape


((2, 2), (2, 1), (1, 2), (1, 1))
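
Matching shapes do not prove the gradient values are right. One common way to test the values is to compare them with finite differences of the loss. The sketch below is a minimal version of that idea, not part of the original notebook: `loss_fn`, `analytic_grads`, `numerical_grads`, and `eps` are illustrative names, and the input `X = [1, 3]` is an assumption chosen because the notebook's own input puts one `Z1` entry exactly at 0, the ReLU kink, where finite differences are unreliable.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(params, X, y):
    """Forward pass of the 2-2-1 network, returning the squared-error loss."""
    W1, b1, W2, b2 = params
    A1 = relu(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return ((A2 - y) ** 2).item()

def analytic_grads(params, X, y):
    """Backpropagation, following the same steps as the cells above."""
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1
    A1 = relu(Z1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = 2 * (A2 - y) * A2 * (1 - A2)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    return [dZ1 @ X.T, dZ1, dZ2 @ A1.T, dZ2]

def numerical_grads(params, X, y, eps=1e-6):
    """Central-difference estimate of each gradient, one entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus = loss_fn(params, X, y)
            p[idx] = old - eps
            loss_minus = loss_fn(params, X, y)
            p[idx] = old                      # restore the entry
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

# Illustrative input (an assumption): no Z1 entry sits exactly at 0
X = np.array([[1.0], [3.0]])
y = np.array([[1.0]])
params = [np.array([[0.4, -0.2], [0.1, 0.6]]), np.zeros((2, 1)),
          np.array([[0.7, -0.3]]), np.zeros((1, 1))]

analytic = analytic_grads(params, X, y)
numeric = numerical_grads(params, X, y)
for name, a, n in zip(["W1", "b1", "W2", "b2"], analytic, numeric):
    print(name, "max abs difference:", np.max(np.abs(a - n)))
```

If backpropagation is implemented correctly, every printed difference should be tiny compared with the gradients themselves.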

# Gradient Descent Update

Update rule:
$$
\theta = \theta - \alpha \nabla_\theta L
$$


In [8]:
lr = 0.1  # learning rate (alpha in the update rule)

W1 = W1 - lr * dW1
b1 = b1 - lr * db1

W2 = W2 - lr * dW2
b2 = b2 - lr * db2

W1, W2


(array([[ 0.4       , -0.2       ],
        [ 0.09138742,  0.58277485]]),
 array([[ 0.7       , -0.26267884]]))

# Result of One Training Step

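- The loss drops from about 0.3556 to 0.3298, as the forward pass below confirms (a decrease is only expected when the learning rate is small enough)
- Weights are updated using the gradients
- Frameworks run the same forward, backward, update cycle internally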

Repeating this loop = training a neural network.


In [9]:
# Forward pass again to see loss change
Z1 = np.dot(W1, X) + b1
A1 = relu(Z1)

Z2 = np.dot(W2, A1) + b2
A2 = sigmoid(Z2)

new_loss = (A2 - y)**2
loss, new_loss

(array([[0.35555306]]), array([[0.32975951]]))
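
To see what repeating this step looks like, here is a minimal training-loop sketch. It assumes the same initial parameters, input, and learning rate as the cells above; the step count of 200 and the `losses` list are illustrative choices, not part of the original notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Same data and initial parameters as the notebook
X = np.array([[1.0], [2.0]])
y = np.array([[1.0]])
W1 = np.array([[0.4, -0.2], [0.1, 0.6]])
b1 = np.zeros((2, 1))
W2 = np.array([[0.7, -0.3]])
b2 = np.zeros((1, 1))

lr = 0.1
losses = []

for step in range(200):  # 200 is an arbitrary choice for illustration
    # Forward pass
    Z1 = W1 @ X + b1
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    losses.append(((A2 - y) ** 2).item())

    # Backward pass (same formulas as cells 5 and 6)
    dZ2 = 2 * (A2 - y) * A2 * (1 - A2)
    dW2 = dZ2 @ A1.T
    db2 = dZ2
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T
    db1 = dZ1

    # Gradient descent update
    W1 = W1 - lr * dW1
    b1 = b1 - lr * db1
    W2 = W2 - lr * dW2
    b2 = b2 - lr * db2

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

The first two entries of `losses` correspond to `loss` and `new_loss` from the cell above; later entries show how the loss keeps changing as the loop continues.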

# Deep Dive


### 0. Freeze the Graph (This Matters)

Before any math, lock this dependency graph in your head:

$$
X,\; W^{(1)},\; b^{(1)}
\;\rightarrow\;
Z^{(1)}
\;\rightarrow\;
A^{(1)}
\;\xrightarrow{\;W^{(2)},\; b^{(2)}\;}\;
Z^{(2)}
\;\rightarrow\;
A^{(2)}
\;\rightarrow\;
L
$$

$W^{(2)}$ and $b^{(2)}$ enter at $Z^{(2)}$ alongside $A^{(1)}$; they do not depend on $A^{(1)}$.

Backpropagation is nothing but walking this graph backward.


### 1. Output Layer Gradients

Loss function:

$$
L = (A^{(2)} - y)^2
$$

Ask the only valid first question:

> If the output changes slightly, how does the loss change?

That is:

$$
\frac{\partial L}{\partial A^{(2)}} = 2(A^{(2)} - y)
$$

This is called the **output layer gradient** because:

- Loss depends directly on $A^{(2)}$
- No chain rule is needed yet

**Output Is Not a Parameter**

The output comes from:

$$
A^{(2)} = \sigma(Z^{(2)})
$$

So the gradient must pass through the sigmoid:

$$
\frac{\partial A^{(2)}}{\partial Z^{(2)}} = A^{(2)}(1 - A^{(2)})
$$

Now apply the chain rule:

$$
\boxed{
\frac{\partial L}{\partial Z^{(2)}} =
\frac{\partial L}{\partial A^{(2)}}
\cdot
\frac{\partial A^{(2)}}{\partial Z^{(2)}}
}
$$

This quantity is often denoted as:

$$
\delta^{(2)} \equiv \frac{\partial L}{\partial Z^{(2)}}
$$
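
As a worked instance, here are the same steps with this notebook's forward-pass values ($A^{(2)} \approx 0.4037$, $y = 1$), rounded to four decimals:

$$
\frac{\partial L}{\partial A^{(2)}} = 2(0.4037 - 1) \approx -1.1926,
\qquad
\frac{\partial A^{(2)}}{\partial Z^{(2)}} = 0.4037 \times (1 - 0.4037) \approx 0.2407
$$

$$
\delta^{(2)} \approx (-1.1926) \times 0.2407 \approx -0.2871
$$

The negative sign says that increasing $Z^{(2)}$ would lower the loss, which is expected because the prediction is below the target.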


### 2. Parameter Gradients at the Output Layer

Ask the next forced question:

> *Which parameters directly created $Z^{(2)}$?*

From the model:

$$
Z^{(2)} = W^{(2)} A^{(1)} + b^{(2)}
$$

So the local derivatives are:

$$
\frac{\partial Z^{(2)}}{\partial W^{(2)}} = A^{(1)},
\quad
\frac{\partial Z^{(2)}}{\partial b^{(2)}} = 1
$$

Apply the chain rule:

$$
\boxed{
\frac{\partial L}{\partial W^{(2)}} =
\frac{\partial L}{\partial Z^{(2)}} \, \left(A^{(1)}\right)^T
}
$$

$$
\boxed{
\frac{\partial L}{\partial b^{(2)}} =
\frac{\partial L}{\partial Z^{(2)}}
}
$$

**Important observation**:

> No new gradient was created.  
> The same gradient simply touched a parameter.
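
To make this step concrete, the sketch below plugs in the values from the forward pass earlier in this notebook. The name `delta2` is illustrative (the notebook's cells call the same quantity `dZ2`), and the hard-coded numbers are rounded copies of the notebook's outputs.

```python
import numpy as np

# Values taken from the forward pass earlier in this notebook
A1 = np.array([[0.0], [1.3]])      # hidden activations, shape (2, 1)
A2 = np.array([[0.4037173]])       # network output, shape (1, 1)
y = np.array([[1.0]])

# delta2 = dL/dZ2: the loss gradient passed through the sigmoid
delta2 = 2 * (A2 - y) * A2 * (1 - A2)

# The same delta2 "touches" each parameter of the output layer
dW2 = delta2 @ A1.T    # shape (1, 2), same as W2
db2 = delta2           # shape (1, 1), same as b2

print("dW2 =", dW2)
print("db2 =", db2)
```

The first entry of `dW2` is zero because `A1[0]` is zero: a weight attached to an inactive input receives no gradient.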

### 3. Hidden Layer Gradients

The hidden layer never sees the loss directly.

So we must ask:

> *If the hidden activation changes, how does the loss change?*

Chain rule forces:

$$
\frac{\partial L}{\partial A^{(1)}} =
\frac{\partial L}{\partial Z^{(2)}}
\cdot
\frac{\partial Z^{(2)}}{\partial A^{(1)}}
$$

From:

$$
Z^{(2)} = W^{(2)} A^{(1)} + b^{(2)}
$$

We get:

$$
\boxed{
\frac{\partial L}{\partial A^{(1)}} =
(W^{(2)})^T \frac{\partial L}{\partial Z^{(2)}}
}
$$

This is what people casually call the **hidden layer gradient**.

But note:

- It is not new
- It is just the output gradient pushed backward


### 4. ReLU Gate: The Gradient Filter

Hidden activation:

$$
A^{(1)} = \text{ReLU}(Z^{(1)})
$$

Derivative of ReLU:

$$
\frac{\partial A^{(1)}}{\partial Z^{(1)}} =
\begin{cases}
1 & Z^{(1)} > 0 \\
0 & Z^{(1)} \le 0
\end{cases}
$$

So:

$$
\boxed{
\frac{\partial L}{\partial Z^{(1)}} =
\frac{\partial L}{\partial A^{(1)}}
\cdot
\frac{\partial A^{(1)}}{\partial Z^{(1)}}
}
$$

This is often denoted as:

$$
\delta^{(1)} \equiv \frac{\partial L}{\partial Z^{(1)}}
$$

**This is where gradients can die**: wherever $Z^{(1)} \le 0$ the ReLU derivative is 0, so no gradient flows further back through that unit.
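
The gate is visible in this notebook's own numbers. The sketch below reuses values from the earlier cells (with `dL/dZ2` hard-coded as a rounded constant); `gate` and `delta1` are illustrative names that do not appear in the original code.

```python
import numpy as np

# Values from this notebook's forward and backward pass
Z1 = np.array([[0.0], [1.3]])        # hidden pre-activations
X = np.array([[1.0], [2.0]])         # input
W2 = np.array([[0.7, -0.3]])
delta2 = np.array([[-0.2870858]])    # dL/dZ2, rounded

dA1 = W2.T @ delta2                  # gradient arriving at the hidden layer
gate = (Z1 > 0).astype(float)        # ReLU derivative: 1 where Z1 > 0, else 0
delta1 = dA1 * gate                  # gradient that survives the gate

dW1 = delta1 @ X.T
print("gate:", gate.ravel())         # [0. 1.]
print("dW1:\n", dW1)
```

The first hidden unit has $Z^{(1)}_1 = 0$, so its gate is closed and the first row of `dW1` is exactly zero. This is why the first row of `W1` stays at `[0.4, -0.2]` after the update step.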


### 5. Parameter Gradients at the Hidden Layer

From:

$$
Z^{(1)} = W^{(1)} X + b^{(1)}
$$

We get:

$$
\frac{\partial Z^{(1)}}{\partial W^{(1)}} = X,
\quad
\frac{\partial Z^{(1)}}{\partial b^{(1)}} = 1
$$

Thus:

$$
\boxed{
\frac{\partial L}{\partial W^{(1)}} =
\frac{\partial L}{\partial Z^{(1)}} \, X^T
}
$$

$$
\boxed{
\frac{\partial L}{\partial b^{(1)}} =
\frac{\partial L}{\partial Z^{(1)}}
}
$$


### 6. One Complete Chain-Rule Expression

For hidden-layer weights $W^{(1)}$:

$$
\frac{\partial L}{\partial W^{(1)}}
=
\frac{\partial L}{\partial A^{(2)}}
\cdot
\frac{\partial A^{(2)}}{\partial Z^{(2)}}
\cdot
\frac{\partial Z^{(2)}}{\partial A^{(1)}}
\cdot
\frac{\partial A^{(1)}}{\partial Z^{(1)}}
\cdot
\frac{\partial Z^{(1)}}{\partial W^{(1)}}
$$


**Substituting Each Local Derivative**

Loss derivative (Mean Squared Error)

$$
\frac{\partial L}{\partial A^{(2)}} = 2(A^{(2)} - y)
$$

Sigmoid activation derivative

$$
\frac{\partial A^{(2)}}{\partial Z^{(2)}} = A^{(2)}(1 - A^{(2)})
$$

Output layer linear transformation

$$
\frac{\partial Z^{(2)}}{\partial A^{(1)}} = (W^{(2)})^T
$$

ReLU activation gate

$$
\frac{\partial A^{(1)}}{\partial Z^{(1)}} =
\mathbb{1}(Z^{(1)} > 0)
$$

Hidden-layer linear transformation

$$
\frac{\partial Z^{(1)}}{\partial W^{(1)}} = X
$$


Final Combined Gradient (Hidden-Layer Weights)

Written with shapes that match the code, where $\odot$ is element-wise multiplication:

$$
\boxed{
\frac{\partial L}{\partial W^{(1)}} =
\left[
\left(
(W^{(2)})^T \, 2(A^{(2)} - y) \, A^{(2)}(1 - A^{(2)})
\right)
\odot
\mathbb{1}(Z^{(1)} > 0)
\right]
X^T
}
$$

Each factor is a **local derivative**: the derivative of one node's output with respect to one of its direct inputs in the dependency graph.


### 7. The Core Insight (This Is the Point)

> “Output gradients” and “hidden gradients” are not separate equations.

They are:

- Partial products inside one long chain-rule
- Checkpoints humans name for convenience

Mathematically:

> There is only one gradient signal, repeatedly transformed.


Final Sentence to Remember

> Backpropagation is one gradient flowing backward, multiplied by local derivatives until it reaches a parameter.



# General Backpropagation Formula (Optional)

Here we state the complete layer-wise backpropagation formulas for any fully connected neural network of depth $L$.

### 0. Network Definition

Consider a neural network with $L$ layers. For each layer $l = 1, 2, ..., L$, we define:

$$
Z^{(l)} = W^{(l)} A^{(l-1)} + b^{(l)}
$$

$$
A^{(l)} = f^{(l)}(Z^{(l)})
$$

Here:

- $A^{(0)} = X$ is the input layer
- $f^{(l)}$ is the activation function for layer $l$ (ReLU, Sigmoid, Tanh, etc.)
- $W^{(l)}$ and $b^{(l)}$ are the trainable weights and biases for layer $l$
- $Z^{(l)}$ is the pre-activation (weighted input) of layer $l$

The loss function is:

$$
L = \mathcal{L}(A^{(L)}, y)
$$

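where $y$ is the ground truth and $A^{(L)}$ is the output of the network. Note that $L$ is overloaded here: as a superscript it is the index of the last layer, and on its own it is the scalar loss.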


### 1. Single Gradient Source

The only point where the gradient originates is at the **loss**:

$$
\frac{\partial L}{\partial A^{(L)}} = \frac{\partial \mathcal{L}}{\partial A^{(L)}}
$$

Everything else in backpropagation is just **propagating this gradient** backward through the network using local derivatives.


### 2. Define the Error Signal

For clarity, define a layer-wise error signal for each layer $l$ as:

$$
\delta^{(l)} \equiv \frac{\partial L}{\partial Z^{(l)}}
$$

This $\delta^{(l)}$ is not a new gradient, but simply a checkpoint representing the gradient of the loss with respect to the pre-activation of layer $l$. These checkpoints are what is commonly called the “output gradient” ($l = L$) and the “hidden gradient” ($l < L$).


### 3. Output Layer Backpropagation

For the output layer $L$, the error signal is computed as:

$$
\delta^{(L)} = \frac{\partial L}{\partial A^{(L)}} \;\odot\; f'^{(L)}(Z^{(L)})
$$

Here:

- $\odot$ denotes element-wise multiplication
- $f'^{(L)}$ is the derivative of the activation function at the output layer

This step is the base case for backpropagation. The gradient is born at the loss and scaled by the derivative of the output activation.
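
For the network in this notebook (sigmoid output, squared-error loss, $L = 2$ layers), this base case reduces to the expression used in the code cells earlier:

$$
\delta^{(2)} =
\underbrace{2\,(A^{(2)} - y)}_{\partial L / \partial A^{(2)}}
\;\odot\;
\underbrace{A^{(2)} \odot (1 - A^{(2)})}_{\sigma'(Z^{(2)})}
$$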


### 4. Hidden Layer Backpropagation (Recursive Rule)

For any hidden layer $l = L-1, L-2, ..., 1$, the error signal propagates backward recursively:

$$
\delta^{(l)} = \left( (W^{(l+1)})^T \delta^{(l+1)} \right) \odot f'^{(l)}(Z^{(l)})
$$

Explanation:

- $(W^{(l+1)})^T \delta^{(l+1)}$ propagates the gradient from the next layer
- $f'^{(l)}(Z^{(l)})$ scales it according to the local slope of the activation function
- This captures the **core insight** of backpropagation: one gradient signal flows backward, repeatedly transformed by local derivatives

These $\delta^{(l)}$ values are the **checkpoints** that engineers inspect to understand gradient health, detect vanishing/exploding gradients, or debug networks.
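
As a rough illustration of inspecting these checkpoints, the sketch below applies the recursive rule to a small, randomly initialized network and prints the norm of each $\delta^{(l)}$. Everything about the setup is an assumption made for this example: the layer sizes, the seed, sigmoid activations at every layer, and the squared-error loss.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [4, 4, 4, 4, 4, 1]   # hypothetical: input, four hidden layers, output
Ws = [rng.normal(0, 1, (n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

X = rng.normal(0, 1, (sizes[0], 1))
y = np.array([[1.0]])

# Forward pass, caching every activation (sigmoid at every layer)
As = [X]
for W, b in zip(Ws, bs):
    As.append(sigmoid(W @ As[-1] + b))

# Base case: delta at the output layer for squared-error loss
delta = 2 * (As[-1] - y) * As[-1] * (1 - As[-1])
deltas = [delta]

# Recursive rule: delta(l) = (W(l+1).T @ delta(l+1)) * f'(Z(l))
for l in range(len(Ws) - 2, -1, -1):
    A = As[l + 1]                       # activation of the layer being processed
    delta = (Ws[l + 1].T @ delta) * A * (1 - A)
    deltas.insert(0, delta)

for i, d in enumerate(deltas, start=1):
    print(f"layer {i}: ||delta|| = {np.linalg.norm(d):.2e}")
```

With sigmoid activations the norms may shrink toward the earlier layers, since the sigmoid derivative never exceeds 0.25, but the exact numbers depend on the random weights.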


### 5. Compute Parameter Gradients

Once $\delta^{(l)}$ is known, computing the gradients for the parameters is straightforward.

**Weights:**

$$
\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \left(A^{(l-1)}\right)^T
$$

Explanation:

- Entry $(i, j)$ of this gradient is $\delta^{(l)}_i A^{(l-1)}_j$: the error at unit $i$ scaled by the input $j$ that the weight multiplies
- A weight whose input is 0 therefore receives no gradient

**Biases:**

$$
\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}
$$

- Each bias receives the same error signal as the pre-activation it feeds into
- This is because the derivative of $Z^{(l)}$ w.r.t. $b^{(l)}$ is 1


### 6. Full Layer-Wise Backpropagation Algorithm

```text
Forward pass:
for l = 1 to L:
    Z(l) = W(l) @ A(l-1) + b(l)
    A(l) = f(l)(Z(l))

Backward pass:
δ(L) = dL/dA(L) ⊙ f'(L)(Z(L))

for l = L-1 down to 1:
    δ(l) = (W(l+1).T @ δ(l+1)) ⊙ f'(l)(Z(l))

Parameter gradients:
for l = 1 to L:
    dW(l) = δ(l) @ A(l-1).T
    db(l) = δ(l)
```
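
One possible NumPy translation of this algorithm is sketched below. It is not the notebook's original code: the function `backprop`, its argument layout (lists of weights, biases, activations, and activation derivatives), and the hard-coded squared-error loss are assumptions made for illustration. It handles a single column-vector sample, like the rest of this notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(Ws, bs, acts, act_primes, X, y):
    """Return (dWs, dbs) for squared-error loss on one column-vector sample."""
    # Forward pass: cache Z(l) and A(l) for every layer
    As, Zs = [X], []
    for W, b, f in zip(Ws, bs, acts):
        Zs.append(W @ As[-1] + b)
        As.append(f(Zs[-1]))

    # Base case: delta(L) = dL/dA(L) * f'(Z(L))
    delta = 2 * (As[-1] - y) * act_primes[-1](Zs[-1])

    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        dWs[l] = delta @ As[l].T        # As[l] is the input to layer l (0-based)
        dbs[l] = delta
        if l > 0:
            # Recursive rule: push delta back through W, then through f'
            delta = (Ws[l].T @ delta) * act_primes[l - 1](Zs[l - 1])
    return dWs, dbs

# The 2-2-1 network from this notebook
Ws = [np.array([[0.4, -0.2], [0.1, 0.6]]), np.array([[0.7, -0.3]])]
bs = [np.zeros((2, 1)), np.zeros((1, 1))]
X = np.array([[1.0], [2.0]])
y = np.array([[1.0]])

dWs, dbs = backprop(Ws, bs, [relu, sigmoid], [relu_prime, sigmoid_prime], X, y)

# One gradient descent step with the notebook's learning rate
lr = 0.1
print(Ws[0] - lr * dWs[0])
print(Ws[1] - lr * dWs[1])
```

The two printed matrices should agree with the updated `W1` and `W2` shown after the manual update step earlier in the notebook.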


# Key Takeaways from Day 11

- Backpropagation = chain rule + matrix math
- Gradients flow backward layer by layer
- Each update moves the parameters against the gradient, which reduces the loss when the learning rate is small enough
- NumPy implementation matches framework logic
- Understanding this removes “black box” fear

---

<p style="text-align:center; font-size:18px;">
© 2025 Mostafizur Rahman
</p>
