# Neural Networks - Backpropagation

In this example, we walk through the forward propagation of a simple neural network and then, step by step, through backpropagation using the computational graph and the chain rule.

Suppose we have the following neural network for binary classification:

**Input**:<br>
2 samples with 2 features

**Network Architecture**:<br>
- Layer 1: Fully connected layer with 3 neurons and ReLU activation function.
- Layer 2: Fully connected layer with 2 neurons and ReLU activation function.
- Layer 3 (output): Fully connected layer with 1 neuron (output) and Sigmoid activation function.

**Loss function**:<br>
Binary Cross Entropy

![](https://pooya.io/ai/images/nn_backpropagation.svg)
For more details see [Neural Networks Propagation](https://pooya.io/ai/neural-networks-backpropagation/).




In this example we use a batch with 2 samples. The input $X$ and target $\vec{\mathbf{y}}$ are defined as follows:

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
$$\vec{\mathbf{y}} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

This means that for sample 1, $x_1 = 1$, $x_2 = 2$, and the target class is $y = 0$.

Recall that we maintain each sample in **rows** and features in **columns**. So, each row of $X$ and $\vec{\mathbf{y}}$ is associated with one sample.

In [1]:
import torch

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([[0.0], [1.0]])

## Define the Neural Network

Let's create our neural network

In [2]:
import torch.nn as nn
import torch.nn.functional as F


class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()

        # Define the model architecture (Layers and nodes)
        self.linear1 = nn.Linear(in_features=2, out_features=3)
        self.linear2 = nn.Linear(in_features=3, out_features=2)
        self.linear3 = nn.Linear(in_features=2, out_features=1)

    def forward(self, x):
        # Forward Propagation happens here.
        # It takes the input tensor x and returns the output tensor for each
        # layer by applying the linear transformation first and then the
        # activation function.
        # It starts from layer 1 and goes forward layer by layer to the output
        # layer.

        # Layer 1 linear transformation
        Z1 = self.linear1(x)
        # Layer 1 activation
        A1 = F.relu(Z1)

        # Layer 2 linear transformation
        Z2 = self.linear2(A1)
        # Layer 2 activation
        A2 = F.relu(Z2)

        # Layer 3 (output layer) linear transformation
        Z3 = self.linear3(A2)
        # Layer 3 activation
        A3 = F.sigmoid(Z3)

        # Output of the model A3, along with the intermediate results
        return A3, {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2, "Z3": Z3}

In practice, for classification problems that use a Sigmoid or Softmax activation function in the output layer, we usually defer that activation to outside the model. In other words, the output layer only performs the linear transformation $Z$ and outputs the [logits](). We then apply the activation function (Sigmoid or Softmax) to the logits outside the model to get the predicted probabilities.

This approach is the same for both inference and training.

However, to keep this example simple and focused on backpropagation, we include the Sigmoid activation function in the output layer. So, in this example, the output layer outputs the predicted probabilities.

In [3]:
model = NeuralNet()
print(model)

NeuralNet(
  (linear1): Linear(in_features=2, out_features=3, bias=True)
  (linear2): Linear(in_features=3, out_features=2, bias=True)
  (linear3): Linear(in_features=2, out_features=1, bias=True)
)


Let's see the initial weights and biases of our neural network.

In [4]:
def print_model_parameters(model):
    for i, child in enumerate(model.children()):
        print(f"Layer {i+1}: {type(child).__name__}")
        child_parameters = dict(child.named_parameters())

        for name, param in child_parameters.items():
            print(f"\n{name}: {param.size()} {param}")
            print(f"{name}.grad:\n{param.grad}")

        print("-" * 80)


print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[ 0.5395,  0.6596],
        [-0.6296, -0.4057],
        [ 0.4303,  0.0345]], requires_grad=True)
weight.grad:
None

bias: torch.Size([3]) Parameter containing:
tensor([0.6199, 0.6223, 0.1412], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.2129, -0.3841,  0.3624],
        [-0.4972,  0.2257,  0.5195]], requires_grad=True)
weight.grad:
None

bias: torch.Size([2]) Parameter containing:
tensor([-0.4477,  0.2886], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parameter containing:
tensor([[-0.3017, -0.4939]], requires_grad=True)
weight.grad:
None

bias: torch.Size([1]) Parameter containing:
tensor([-0.3963], requires_grad=True)
bias.grad:
None
----------------

As we expect, gradients of parameters are `None` since we haven't computed any gradients yet.

To keep this example simple and reproducible, we'll set the weights and biases manually.

We follow the same layout that PyTorch uses for weight matrices:
- Each row of $W^{[l]}$ is associated with one neuron in layer $l$. For example, layer 1 has 3 neurons, so $W^{[1]}$ has 3 rows.
- Each column of $W^{[l]}$ is associated with one feature of input values. For example, the number of columns in $W^{[1]}$ is equal to the number of features in the input layer $X$. We have 2 features in the input layer $X$, so we have 2 columns in $W^{[1]}$.


**Layer 1 (3 neurons):**
$$W^{[1]} = \begin{bmatrix} -1 & 2 \\ 3 & 0.5 \\ -0.1 & -4\end{bmatrix} \quad {\vec{\mathbf{b}}}^{[1]} = \begin{bmatrix} 1 & -2 & 0.3 \end{bmatrix}$$


**Layer 2 (2 neurons):**
$$W^{[2]} = \begin{bmatrix} 0.5 & 1 & -2 \\ 0.7 & 0.1 & 0.3\end{bmatrix} \quad {\vec{\mathbf{b}}}^{[2]} = \begin{bmatrix} -4 & 5 \end{bmatrix}$$

**Layer 3 (output):**
$$W^{[3]} = \begin{bmatrix} 0.5 & -0.3 \end{bmatrix} \quad {\vec{\mathbf{b}}}^{[3]} = \begin{bmatrix} 0.1 \end{bmatrix}$$ 

Note: The number of weights and biases is independent of the number of training samples (in any batch or in the entire dataset). The whole point of training is to optimize these parameters by exposing them to the dataset through cycles of forward and backward propagation. So, no matter how large the dataset is, the number of parameters in the model is fixed and defined by the architecture of the neural network.
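
To make the note above concrete, here is a minimal sketch that counts the parameters of this 2-3-2-1 architecture from the layer sizes alone. The names `layer_sizes` and `count_parameters` are hypothetical helpers introduced only for this illustration; notice that the batch size never enters the calculation.

```python
# Layer sizes of the example network: 2 inputs -> 3 neurons -> 2 neurons -> 1 output.
layer_sizes = [2, 3, 2, 1]


def count_parameters(sizes):
    """Count the weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = n_out * n_in  # W has shape (n_out, n_in)
        biases = n_out          # one bias per neuron
        total += weights + biases
    return total


total_params = count_parameters(layer_sizes)
print(total_params)  # 20 = (3*2 + 3) + (2*3 + 2) + (1*2 + 1)
```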


In [5]:
# Layer 1
W1 = torch.tensor([[-1.0, 2.0], [3.0, 0.5], [-0.1, -4.0]], requires_grad=True)
b1 = torch.tensor([1.0, -2.0, 0.3], requires_grad=True)

# Layer 2
W2 = torch.tensor([[0.5, 1.0, -2.0], [0.7, 0.1, 0.3]], requires_grad=True)
b2 = torch.tensor([-4.0, 5.0], requires_grad=True)

# Layer 3 (Output layer)
W3 = torch.tensor([[0.5, -0.3]], requires_grad=True)
b3 = torch.tensor([0.1], requires_grad=True)

Now we set these weights and biases in our model.

In [6]:
model.linear1.weight.data.copy_(W1)
model.linear1.bias.data.copy_(b1)

model.linear2.weight.data.copy_(W2)
model.linear2.bias.data.copy_(b2)

model.linear3.weight.data.copy_(W3)
model.linear3.bias.data.copy_(b3)

print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[-1.0000,  2.0000],
        [ 3.0000,  0.5000],
        [-0.1000, -4.0000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([3]) Parameter containing:
tensor([ 1.0000, -2.0000,  0.3000], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.5000,  1.0000, -2.0000],
        [ 0.7000,  0.1000,  0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([2]) Parameter containing:
tensor([-4.,  5.], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parameter containing:
tensor([[ 0.5000, -0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([1]) Parameter containing:
tensor([0.1000], requires_grad=True)
bias.grad:
None
----------------------

## Step 1: Forward Propagation

Now let's run the forward propagation using the current weights and biases, and the input $X$.

In [7]:
# Forward Propagation
output, model_results = model(X)

# Print the intermediate results
print(
    "Intermediate results:\n"
    # Single quotes inside the f-strings keep this valid before Python 3.12
    f"Z1:\n{model_results['Z1']}\n"
    f"A1:\n{model_results['A1']}\n"
    f"Z2:\n{model_results['Z2']}\n"
    f"A2:\n{model_results['A2']}\n"
    f"Z3:\n{model_results['Z3']}\n"
    f"A3 (Model Output):\n{output}"
)

Intermediate results:
Z1:
tensor([[  4.0000,   2.0000,  -7.8000],
        [  6.0000,   9.0000, -16.0000]], grad_fn=<AddmmBackward0>)
A1:
tensor([[4., 2., 0.],
        [6., 9., 0.]], grad_fn=<ReluBackward0>)
Z2:
tensor([[ 0.0000,  8.0000],
        [ 8.0000, 10.1000]], grad_fn=<AddmmBackward0>)
A2:
tensor([[ 0.0000,  8.0000],
        [ 8.0000, 10.1000]], grad_fn=<ReluBackward0>)
Z3:
tensor([[-2.3000],
        [ 1.0700]], grad_fn=<AddmmBackward0>)
A3 (Model Output):
tensor([[0.0911],
        [0.7446]], grad_fn=<SigmoidBackward0>)


We'll follow the steps manually to understand the computational graph and forward propagation.

![](https://pooya.io/ai/images/nn_computational_graph.svg)

As shown in the graph above, we start from the first node and move through the graph from left to right.

In [Forward Propagation]() we feed the input $X$ (which could be a single sample, a batch of samples, or the entire dataset) to the model and then compute the output of first layer, then give that output to the next layer (as input) and compute the output of the next layer, and so on until we reach the output layer.

In each layer, we have two steps of computation:

**1. Linear Transformation:**<br>
$$Z^{[l]} = A^{[l-1]} \cdot {W^{[l]}}^\top + {\vec{\mathbf{b}}}^{[l]}$$

**2. Activation Function:**<br>
$$A^{[l]} = g^{[l]}(Z^{[l]})$$

By convention, we consider $X$ as the layer $0$. So, $A^{[0]} = X$.
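
The two steps above can be sketched as a short loop over the layers. This is only an illustrative NumPy version, not how PyTorch implements it; the names `forward`, `example_layers`, `X_example` and `probabilities` are assumptions made for this sketch, and the numbers are the weights, biases and input of this example.

```python
import numpy as np


def relu(Z):
    return np.maximum(0, Z)


def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))


def forward(A0, layers):
    """Run Z = A_prev @ W.T + b followed by A = g(Z) for every layer."""
    A = A0  # by convention A^[0] is the input
    for W, b, g in layers:
        Z = A @ W.T + b  # b is broadcast over the rows (one row per sample)
        A = g(Z)
    return A


# One (W, b, activation) tuple per layer, using the values of this example.
example_layers = [
    (np.array([[-1.0, 2.0], [3.0, 0.5], [-0.1, -4.0]]), np.array([1.0, -2.0, 0.3]), relu),
    (np.array([[0.5, 1.0, -2.0], [0.7, 0.1, 0.3]]), np.array([-4.0, 5.0]), relu),
    (np.array([[0.5, -0.3]]), np.array([0.1]), sigmoid),
]

X_example = np.array([[1.0, 2.0], [3.0, 4.0]])
probabilities = forward(X_example, example_layers)
print(np.round(probabilities, 4))
```

The printed probabilities should agree with the model output `A3` shown earlier (about 0.0911 and 0.7446).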

Let's calculate the output of the layer $1$.

**Layer 1:**
$$Z^{[1]} = X \cdot {W^{[1]}}^\top + {\vec{\mathbf{b}}}^{[1]}$$

$$Z^{[1]} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \cdot \begin{bmatrix} -1 & 3 & -0.1 \\ 2 & 0.5 & -4 \end{bmatrix} + \begin{bmatrix} 1 & -2 & 0.3 \end{bmatrix}$$


$$Z^{[1]} = \begin{bmatrix} 3 & 4 & -8.1 \\ 5 & 11 & -16.3 \end{bmatrix} + \begin{bmatrix} 1 & -2 & 0.3 \end{bmatrix}$$

We broadcast the bias vector to the shape $(2, 3)$ and add it to the matrix product of $X$ and ${W^{[1]}}^\top$.

$$Z^{[1]} = \begin{bmatrix} 3 & 4 & -8.1 \\ 5 & 11 & -16.3 \end{bmatrix} + \begin{bmatrix} 1 & -2 & 0.3 \\ 1 & -2 & 0.3 \end{bmatrix} = \begin{bmatrix} 4 & 2 & -7.8 \\ 6 & 9 & -16 \end{bmatrix}$$

Let's verify our calculation using numpy.

In [8]:
import numpy as np

X_np = X.numpy()
W1_np = W1.detach().numpy()
b1_np = b1.detach().numpy()

In [9]:
Z1 = np.dot(X_np, W1_np.T) + b1_np

print(f"Z1 (using numpy):\n{Z1}")

Z1 (using numpy):
[[  4.         2.        -7.8     ]
 [  6.         9.       -15.999999]]


> Note: due to how floating point arithmetic works in computer hardware, the result of $-16.3 + 0.3$ is not exactly $-16$ but a number very close to $-16$ like $-15.9999...$.
> For example, `print(0.1 + 0.2)` will output `0.30000000000000004` instead of `0.3`. This is not a bug, it's a limitation of floating point arithmetic in computers and it's not specific to Python or PyTorch.
> 
> However, this is not a problem in practice in machine learning and deep learning. 
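
A small standard-library illustration of the note above: exact equality on floating point results can fail, so results are usually compared within a tolerance. `math.isclose` is used here as one possible way to do that.

```python
import math

total = 0.1 + 0.2
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False: exact comparison fails
print(math.isclose(total, 0.3))   # True: comparison within a tolerance
```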

Now let's calculate the activation of layer 1 using the [ReLU]() activation function.

$$A^{[1]} = \text{ReLU}(Z^{[1]})$$

$$A^{[1]} = \begin{bmatrix} \text{ReLU}(4) & \text{ReLU}(2) & \text{ReLU}(-7.8) \\ \text{ReLU}(6) & \text{ReLU}(9) & \text{ReLU}(-16) \end{bmatrix}$$

We know that the ReLU function is defined as:
$$\text{ReLU}(z) = \max(0, z)$$

We apply ReLU element-wise to the matrix $Z^{[1]}$. So, the output of the layer 1 is: 

$$A^{[1]} = \begin{bmatrix} 4 & 2 & 0 \\ 6 & 9 & 0 \end{bmatrix}$$

Let's verify our calculation using numpy.

In [10]:
def relu(Z):
    return np.maximum(0, Z)

In [11]:
A1 = relu(Z1)

print(f"A1 (using numpy):\n{A1}")

A1 (using numpy):
[[4. 2. 0.]
 [6. 9. 0.]]


Now if we compare this result with the PyTorch output, we see that they are the same.


Now let's calculate the output of the layer 2.

**Layer 2:**

$$Z^{[2]} = A^{[1]} \cdot {W^{[2]}}^\top + {\vec{\mathbf{b}}}^{[2]}$$

$$Z^{[2]} = \begin{bmatrix} 4 & 2 & 0 \\ 6 & 9 & 0 \end{bmatrix} \cdot \begin{bmatrix} 0.5 & 0.7 \\ 1 & 0.1 \\ -2 & 0.3 \end{bmatrix} + \begin{bmatrix} -4 & 5 \end{bmatrix}$$

Which equals:

$$Z^{[2]} = \begin{bmatrix} 0 & 8 \\ 8 & 10.1 \end{bmatrix}$$

And then applying the ReLU activation function:

$$A^{[2]} = \begin{bmatrix} 0 & 8 \\ 8 & 10.1 \end{bmatrix}$$

In the same way, we can keep going **forward** and compute the outputs (linear transformations and activations) layer by layer until we reach the output layer.

The output of the output layer is the **prediction** of the model which in this case is the predicted probability of binary classification.
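
For completeness, here is a minimal sketch of the remaining step (layer 3) written out per sample with plain Python. The variable names (`A2_rows`, `w3`, `b3_value`, `layer3_results`) are hypothetical; the numbers are $A^{[2]}$, $W^{[3]}$ and ${\vec{\mathbf{b}}}^{[3]}$ from this example.

```python
import math

# Values taken from the example: A2 (one row per sample), W3 and b3.
A2_rows = [[0.0, 8.0], [8.0, 10.1]]
w3 = [0.5, -0.3]
b3_value = 0.1

layer3_results = []
for a2 in A2_rows:
    # Linear transformation: z3 = a2 . w3 + b3
    z3 = sum(a * w for a, w in zip(a2, w3)) + b3_value
    # Sigmoid activation: a3 = 1 / (1 + exp(-z3))
    a3 = 1.0 / (1.0 + math.exp(-z3))
    layer3_results.append((z3, a3))
    print(f"z3 = {z3:.4f}, a3 = {a3:.4f}")
```

The values should agree with `Z3` and `A3` printed by the PyTorch model above.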

## Step 2: Compute the Loss and Cost

Computing the cost gives us the error of our model with respect to the labels (target values $\vec{\mathbf{y}}$). To calculate it, we continue moving forward (left to right) in the computational graph.

The [cost]() function is usually the average of the [loss]() function over all the samples in the batch that passed through the forward propagation.



The loss function for binary classification is the binary cross-entropy loss which is defined as:

$$L_{BCE}(\theta)^{(i)} = -y^{(i)} \log(f_{\theta}(x^{(i)})) - (1-y^{(i)}) \log(1 - f_{\theta}(x^{(i)}))$$

Where:
- $i$ is the index of the sample in the batch.
- $y^{(i)}$ is the target value (label) of the $i$-th sample.
- $\theta$ is the model's parameters (weights and biases).
- $f_{\theta}$ is the model's function, which produces the predicted probability from the input $x^{(i)}$ and the model's parameters $\theta$.


The [cost]() function is the average of the loss function over all the samples in the batch.

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(\theta)^{(i)}$$

Where:
- $m$ is the number of samples in the batch.

For more details see [Loss and Cost Functions in Machine Learning]()


Using PyTorch's built-in function, we can calculate the cost as follows:

In [12]:
cost = F.binary_cross_entropy(output, y)

print(f"Cost: {cost}")

Cost: 0.19522885978221893


As we discussed earlier, for more stable computation, in practice, we usually don't include the activation function in the output layer. So, the output of the model is the linear transformation $Z$ of the output layer (logits).  

In this example, for simplicity, we include the Sigmoid activation function in the output layer, so `output` contains the predicted probabilities and we use the `binary_cross_entropy()` function to calculate the cost. If we had deferred the activation to outside the model, the output would be the logits of the output layer, and we would use `binary_cross_entropy_with_logits()` instead.
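
To illustrate why working with logits is more stable, here is a minimal standard-library sketch. The function names are hypothetical, and `bce_from_logit` uses a common textbook formulation of the loss on logits; it is not claimed to be PyTorch's exact implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(p, y):
    """Binary cross-entropy computed from a predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Same loss computed directly from the logit z.

    Uses the identity max(z, 0) - z * y + log(1 + exp(-|z|)),
    which never takes log() of a probability that has rounded to 0 or 1.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# Logits Z3 and targets of the two samples in this example.
for z, y_true in [(-2.3, 0.0), (1.07, 1.0)]:
    from_prob = bce_from_probability(sigmoid(z), y_true)
    from_logit = bce_from_logit(z, y_true)
    print(f"{from_prob:.4f} {from_logit:.4f}")
```

Both columns should match the per-sample losses of this example. For an extreme logit such as `40.0` with target `0.0`, the probability rounds to exactly `1.0` in double precision, so `bce_from_probability` fails on `log(0)`, while `bce_from_logit` still returns a finite value of about 40.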


Now let's see how this cost is calculated. We'll use numpy to calculate the cost manually.

> Recall that all of our calculations are matrix operations where each row is associated with one sample in the batch.  

This is the loss for the output matrix $A^{[3]}$, where the products are element-wise (denoted by $\odot$):

$$L(\theta) = -\vec{\mathbf{y}} \odot \log(A^{[3]}) - (1-\vec{\mathbf{y}}) \odot \log(1 - A^{[3]})$$


$$L(\theta) = -\begin{bmatrix} 0 \\ 1 \end{bmatrix} \odot \log(\begin{bmatrix} 0.0911 \\ 0.7446 \end{bmatrix}) - \begin{bmatrix} 1 \\ 0 \end{bmatrix} \odot \log(\begin{bmatrix} 1 - 0.0911 \\ 1 - 0.7446 \end{bmatrix})$$


Element-wise operations:

$$L(\theta) = -\begin{bmatrix} 0 \\ 1 \end{bmatrix} \odot \begin{bmatrix} -2.3955 \\ -0.2949 \end{bmatrix} - \begin{bmatrix} 1 \\ 0 \end{bmatrix} \odot \begin{bmatrix} -0.0955 \\ -1.3649 \end{bmatrix}$$

> Note: $\odot$ is the element-wise multiplication.

$$L(\theta) = \begin{bmatrix} 0 \\ 0.2949 \end{bmatrix} - \begin{bmatrix} -0.0955 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.0955 \\ 0.2949 \end{bmatrix}$$


Now, in order to calculate the cost, we take the average of the loss of all samples in the batch.

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{2} L(\theta)^{(i)} = \frac{0.0955 + 0.2949}{2} = 0.1952$$ 


Let's verify our calculation using numpy.

In [13]:
y_np = y.numpy()
A3 = output.detach().numpy()

In [14]:
L_np = -1 * (y_np * np.log(A3) + (1 - y_np) * np.log(1 - A3))

print(f"Loss (using numpy):\n{L_np}")

Loss (using numpy):
[[0.09554543]
 [0.29491228]]


Now, calculating the cost by taking the average of the loss of all samples in the batch.

In [15]:
cost_np = np.mean(L_np)

print(f"Cost (using numpy):\n{cost_np}")

Cost (using numpy):
0.19522884488105774


Now we can see we have reached the same value for the cost as PyTorch. 

## Step 3: Backpropagation

So far, we have calculated the output (inference) of the model and its cost (error) relative to the target values. Now we need to optimize the model's parameters by minimizing the cost. To do this, we calculate the gradients of the cost function with respect to the model's parameters (weights and biases) and then update the parameters using those gradients.


The backpropagation algorithm is a method for calculating the gradients of the cost function with respect to the model's parameters (weights and biases) using the chain rule of calculus and the computational graph of the neural network. 

In backprop, we calculate the gradients of the loss function with respect to each parameter of the model. As we discussed in [Backward Propagation](), we start from the last node of the computational graph (the cost node) and calculate the partial derivative (gradient) of the loss with respect to each part of the graph, step by step in the backward direction, until we have reached all the parameters of the model. Hence the name **backpropagation** or **backward pass**.

**Right-to-Left**<br>
![](https://pooya.io/ai/images/nn_computational_graph.svg)


**Using Chain Rule:**<br>
We can see the whole model as one large composite function made of many smaller functions (the linear transformation and activation function of each layer). These functions are composed layer by layer like a chain. So, in simple terms, we use the chain rule to calculate the gradient of the loss with respect to each parameter, from the outermost function (the cost) to the innermost functions (the parameters of the model).
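
As a minimal sketch of this idea, consider a single neuron with one weight. The toy values and the squared-error loss below are assumptions chosen only to keep the example short; they are not part of the network in this example. The chain-rule gradient is compared against a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy values (hypothetical, not taken from the network above).
x_toy, y_toy = 2.0, 1.0
w_toy, b_toy = 0.5, -0.3


def toy_loss(w):
    """Composite function: z = w * x + b, a = sigmoid(z), loss = (a - y)^2."""
    a = sigmoid(w * x_toy + b_toy)
    return (a - y_toy) ** 2


# Chain rule: multiply the local derivatives from the outermost function inward.
z_toy = w_toy * x_toy + b_toy
a_toy = sigmoid(z_toy)
dloss_da = 2.0 * (a_toy - y_toy)  # derivative of (a - y)^2 with respect to a
da_dz = a_toy * (1.0 - a_toy)     # derivative of sigmoid(z) with respect to z
dz_dw = x_toy                     # derivative of w * x + b with respect to w
grad_chain = dloss_da * da_dz * dz_dw

# Independent check: central finite difference on the whole composite function.
eps = 1e-6
grad_numeric = (toy_loss(w_toy + eps) - toy_loss(w_toy - eps)) / (2 * eps)

print(f"chain rule: {grad_chain:.6f}, finite difference: {grad_numeric:.6f}")
```

The two numbers should agree to several decimal places, which is the same kind of check we later perform against PyTorch's gradients.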

**Gradient of Loss vs Cost**:<br>
We calculate the partial derivative of the **loss** with respect to each parameter of the model. That gives us the gradient of the loss with respect to each parameter for **one single sample**. Then we calculate the average of these gradients (mean gradient) over all the samples in the batch. In that case, we can say we have calculated the gradient of the **cost** with respect to each parameter of the model.

Let's start the backward propagation by first defining our optimizer. We use [Adam](), a variation of the [SGD (Stochastic Gradient Descent)]() algorithm. The optimizer does not compute the gradients itself (that is done by `backward()`); it uses the computed gradients to update the parameters of the model.

In [16]:
import torch.optim as optim

# Define Adam optimizer with learning rate of 0.01
optimizer = optim.Adam(model.parameters(), lr=0.01)

We usually reset all the gradients before computing new ones. PyTorch stores the gradients in the `grad` attribute of each parameter, and by default `backward()` adds new gradients to whatever is already stored there. Calling `zero_grad()` prevents gradients from a previous iteration from leaking into the current one.

In this particular example, we haven't calculated any gradients yet, so the gradients are still `None` and resetting them has no effect.

In [17]:
optimizer.zero_grad()

In [18]:
print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[-1.0000,  2.0000],
        [ 3.0000,  0.5000],
        [-0.1000, -4.0000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([3]) Parameter containing:
tensor([ 1.0000, -2.0000,  0.3000], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.5000,  1.0000, -2.0000],
        [ 0.7000,  0.1000,  0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([2]) Parameter containing:
tensor([-4.,  5.], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parameter containing:
tensor([[ 0.5000, -0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([1]) Parameter containing:
tensor([0.1000], requires_grad=True)
bias.grad:
None
----------------------

We can see that each parameter (which has `requires_grad=True`) has a `grad` attribute that stores the gradient of the loss with respect to that parameter. All of our parameters currently have `None` as their gradients.

Now let's run the backpropagation by calling the `backward()` method on the last node of the computational graph (the cost node). This will start the backward step by step calculation of the gradients of the cost with respect to each parameter of the model.

In [19]:
# Backpropagation (compute the gradients)
cost.backward()

In [20]:
print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[-1.0000,  2.0000],
        [ 3.0000,  0.5000],
        [-0.1000, -4.0000]], requires_grad=True)
weight.grad:
tensor([[-0.0249, -0.0396],
        [-0.1814, -0.2428],
        [ 0.0000,  0.0000]])

bias: torch.Size([3]) Parameter containing:
tensor([ 1.0000, -2.0000,  0.3000], requires_grad=True)
bias.grad:
tensor([-0.0147, -0.0614,  0.0000])
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.5000,  1.0000, -2.0000],
        [ 0.7000,  0.1000,  0.3000]], requires_grad=True)
weight.grad:
tensor([[-0.3831, -0.5747,  0.0000],
        [ 0.1752,  0.3175,  0.0000]])

bias: torch.Size([2]) Parameter containing:
tensor([-4.,  5.], requires_grad=True)
bias.grad:
tensor([-0.0639,  0.0246])
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parame

Let's simulate the backpropagation manually step by step. 

Remember that we start with the final node and walk backward through the computational graph. So, we start with the cost node and calculate the gradient of the cost with respect to the output of the output layer.

Also, recall that in this example we use the terms **partial derivative** and **gradient** interchangeably.

As we discussed earlier, we calculate the gradient of **loss** with respect to each parameter of the model. We do this for all of the calculations. At the end, we calculate the average of these gradients (mean gradient) over all the samples in the batch to give the gradient of the **cost** with respect to each parameter of the model.

So, all the following steps are computing the gradient of the **loss** with respect to each parameter of the model.

### 1. Gradient of $L$ with respect to $A^{[3]}$

We are at the end of the computational graph (the loss node). Now we start our way back to the beginning of the computational graph and calculate the gradient (partial derivative) of the loss with respect to each node step by step. The first step is to calculate the gradient of the loss with respect to the output of the output layer $A^{[3]}$.

$$\frac{\partial L(\theta)}{\partial {A^{[3]}}}=?$$

> Recall that all of our calculations are matrix operations where each row is associated with one sample in the batch.  

This is the loss of the output layer (the products are element-wise):

$$L(\theta) = -\vec{\mathbf{y}} \odot \log(A^{[3]}) - (1-\vec{\mathbf{y}}) \odot \log(1 - A^{[3]})$$


Partial derivative of the loss with respect to the output of the output layer:

$$\frac{\partial L(\theta)}{\partial {A^{[3]}}} = -\frac{\vec{\mathbf{y}}}{A^{[3]}} + \frac{1-\vec{\mathbf{y}}}{1 - A^{[3]}}$$

> Note: we use chain rule for the derivative of $\log(1 - x)$ which is $-\frac{1}{1 - x}$.
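
One way to gain confidence in this derivative is to compare it with a numerical estimate. The sketch below uses only the standard library; the function names and the step size `eps` are assumptions for this illustration, and the two probabilities are recomputed from the logits $-2.3$ and $1.07$ of this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(a, y):
    """Binary cross-entropy loss for one prediction a and one target y."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def dbce_da(a, y):
    """Analytical derivative of the loss with respect to a."""
    return -y / a + (1 - y) / (1 - a)


eps = 1e-6
checks = []
# (logit, target) pairs of this example; a = sigmoid(logit)
for z, target in [(-2.3, 0.0), (1.07, 1.0)]:
    a = sigmoid(z)
    # Central finite difference of the loss around a
    numeric = (bce(a + eps, target) - bce(a - eps, target)) / (2 * eps)
    checks.append((dbce_da(a, target), numeric))
    print(f"analytical = {checks[-1][0]:.4f}, numerical = {numeric:.4f}")
```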


So, now if we substitute the values, we get:

$$\frac{\partial L(\theta)}{\partial {A^{[3]}}} = \frac{\begin{bmatrix} 0 \\ -1 \end{bmatrix}}{\begin{bmatrix} 0.0911 \\ 0.7446 \end{bmatrix}} + \frac{\begin{bmatrix} 1 - 0 \\ 1 - 1 \end{bmatrix}}{\begin{bmatrix} 1 - 0.0911 \\ 1 - 0.7446 \end{bmatrix}}$$


Element-wise operation:

$$\frac{\partial L(\theta)}{\partial {A^{[3]}}} = \begin{bmatrix} \frac{0}{0.0911} \\ \frac{-1}{0.7446} \end{bmatrix} + \begin{bmatrix} \frac{1}{0.9089} \\ \frac{0}{0.2554} \end{bmatrix} = \begin{bmatrix} 0 \\ -1.3430 \end{bmatrix} + \begin{bmatrix} 1.1003 \\ 0 \end{bmatrix} = \begin{bmatrix} 1.1003 \\ -1.3430 \end{bmatrix}$$ 

In [21]:
dL_dA3 = -1 * (y_np / A3 - (1 - y_np) / (1 - A3))

print(f"dL_dA3:\n{dL_dA3}")

dL_dA3:
[[ 1.1002588]
 [-1.3430085]]


**$X$ and $\vec{\mathbf{y}}$ are constants**:<br>
In backpropagation, the goal is to find the gradient of the loss with respect to the parameters of the model. So, the input value $X$ and the target value $\vec{\mathbf{y}}$ are treated as constants in our computation.

### 2. Gradient of $L$ with respect to $Z^{[3]}$

Now in this step we go one step back in the computational graph and calculate the gradient of the loss with respect to the linear transformation of the output layer.

$$\frac{\partial L(\theta)}{\partial {Z^{[3]}}}=?$$


$A^{[3]}$ is a function of $Z^{[3]}$ through the Sigmoid activation function.

$$A^{[3]} = \sigma(Z^{[3]})$$

Where $\sigma$ is the Sigmoid activation function.


So, using the chain rule, we can write:

$$\frac{\partial L(\theta)}{\partial {Z^{[3]}}} = \frac{\partial L(\theta)}{\partial {A^{[3]}}} \cdot \frac{\partial {A^{[3]}}}{\partial {Z^{[3]}}}$$

We already calculated $\frac{\partial L(\theta)}{\partial {A^{[3]}}}$ in the previous step. So, we need to calculate $\frac{\partial {A^{[3]}}}{\partial {Z^{[3]}}}$.


For the Sigmoid function $\sigma(x)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The [Derivative of Sigmoid function](https://pooya.io/math/derivatives/#derivative-of-sigmoid-function) is as follows:

$$\frac{d\sigma(x)}{dx} = \sigma(x) \cdot (1 - \sigma(x))$$
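
The identity above can be checked numerically. This is a minimal standard-library sketch; `sigmoid_derivative`, `derivative_checks` and the step size `eps` are hypothetical names and values chosen for the illustration, evaluated at the two logits of this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Analytical derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


eps = 1e-6
derivative_checks = []
for z in (-2.3, 1.07):  # the logits Z3 of the two samples
    # Central finite difference of the sigmoid around z
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    derivative_checks.append((sigmoid_derivative(z), numeric))
    print(f"z = {z}: analytical = {sigmoid_derivative(z):.6f}, numerical = {numeric:.6f}")
```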

So, in our case, we can write the derivative of the Sigmoid function as: 

$$\frac{\partial {A^{[3]}}}{\partial {Z^{[3]}}} = \sigma(Z^{[3]}) \odot (1 - \sigma(Z^{[3]}))$$

We know that $\sigma(Z^{[3]}) = A^{[3]}$, which is the output of the model. So, we can write:

$$\frac{\partial {A^{[3]}}}{\partial {Z^{[3]}}} = A^{[3]} \odot (1 - A^{[3]})$$


$$\frac{\partial {A^{[3]}}}{\partial {Z^{[3]}}} = \begin{bmatrix} 0.0911 \\ 0.7446 \end{bmatrix} \odot \begin{bmatrix} 1 - 0.0911 \\ 1 - 0.7446 \end{bmatrix}$$

Which equals:

$$\frac{\partial {A^{[3]}}}{\partial {Z^{[3]}}} = \begin{bmatrix} 0.0828 \\ 0.1901 \end{bmatrix}$$

In [22]:
dA3_dZ3 = A3 * (1 - A3)

print(dA3_dZ3)

[[0.08281956]
 [0.19017236]]


Now, we can calculate the gradient.

$$\frac{\partial L(\theta)}{\partial {Z^{[3]}}} = \frac{\partial L(\theta)}{\partial {A^{[3]}}}\odot \frac{\partial {A^{[3]}}}{\partial {Z^{[3]}}}= \begin{bmatrix} 1.1003 \\ -1.3430 \end{bmatrix} \odot \begin{bmatrix} 0.0828 \\ 0.1901 \end{bmatrix}$$

Which equals:

$$\frac{\partial L(\theta)}{\partial {Z^{[3]}}} = \begin{bmatrix} 0.0911 \\ -0.2554 \end{bmatrix}$$

In [23]:
dL_dZ3 = dL_dA3 * dA3_dZ3

print(f"dL_dZ3:\n{dL_dZ3}")

dL_dZ3:
[[ 0.09112295]
 [-0.2554031 ]]
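
As a cross-check, the two factors can also be multiplied symbolically. The sketch below works element-wise and uses the shorthand $A = A^{[3]}$ and $y = \vec{\mathbf{y}}$ (introduced here only for readability). It shows that, for a Sigmoid output combined with binary cross-entropy, the product collapses to a simple difference:

$$\frac{\partial L(\theta)}{\partial {Z^{[3]}}} = \left(-\frac{y}{A} + \frac{1 - y}{1 - A}\right) \odot A \odot (1 - A) = -y \odot (1 - A) + (1 - y) \odot A = A - y$$

With the values of this example:

$$A^{[3]} - \vec{\mathbf{y}} = \begin{bmatrix} 0.0911 \\ 0.7446 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.0911 \\ -0.2554 \end{bmatrix}$$

which agrees with the value of `dL_dZ3` printed above.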


### 3. Gradient of $L$ with respect to $W^{[3]}$ and ${\vec{\mathbf{b}}}^{[3]}$

Now we again go one step back in the computational graph to calculate the gradient of the loss with respect to the weights and biases of the output layer (layer 3).

**Gradient of $L$ with respect to $\vec{\mathbf{b}}^{[3]}$:**

$$\frac{\partial L(\theta)}{\partial {\vec{\mathbf{b}}}^{[3]}} = ?$$

The linear transformation of the output layer is:

$$Z^{[3]} = A^{[2]} \cdot {W^{[3]}}^\top + {\vec{\mathbf{b}}}^{[3]}$$

So, we can write the chain rule as:

$$\frac{\partial L(\theta)}{\partial {\vec{\mathbf{b}}}^{[3]}} = \frac{\partial L(\theta)}{\partial {Z^{[3]}}} \cdot \frac{\partial {Z^{[3]}}}{\partial {\vec{\mathbf{b}}}^{[3]}}$$

We already calculated $\frac{\partial L(\theta)}{\partial {Z^{[3]}}}$ in the previous step. So, we just need to calculate $\frac{\partial {Z^{[3]}}}{\partial {\vec{\mathbf{b}}}^{[3]}}$.


$$\frac{\partial {Z^{[3]}}}{\partial {\vec{\mathbf{b}}}^{[3]}} = 1$$

So, we can write:
$$\frac{\partial L(\theta)}{\partial {\vec{\mathbf{b}}}^{[3]}} = \frac{\partial L(\theta)}{\partial {Z^{[3]}}} \cdot 1 = \begin{bmatrix} 0.0911 \\ -0.2554 \end{bmatrix}$$

In [24]:
dL_db3 = dL_dZ3 * 1
print(f"dL_db3:\n{dL_db3}")

dL_db3:
[[ 0.09112295]
 [-0.2554031 ]]


At this point (when reaching model parameters), we calculate the average of these gradients over all the samples in the batch to get the gradient of the cost with respect to the biases of the output layer.

$$\frac{\partial J(\theta)}{\partial {\vec{\mathbf{b}}}^{[3]}} = \frac{1}{2} \sum_{i=1}^{2} \frac{\partial L(\theta)^{(i)}}{\partial {\vec{\mathbf{b}}}^{[3]}} = \frac{0.09112 - 0.25540}{2} \approx -0.0821$$

In [25]:
# Gradient of the cost: average the per-sample gradients of the loss
dJ_db3 = np.mean(dL_db3, axis=0)

print(f"dJ_db3:\n{dJ_db3}")

dJ_db3:
[-0.08214007]


We can see that our calculated gradient is the same as the PyTorch calculated gradient for bias of the layer 3.

In [26]:
print(f"Gradient for b3:\n{model.linear3.bias.grad}")

Gradient for b3:
tensor([-0.0821])


**Gradient of $L$ with respect to $W^{[3]}$:**

$$\frac{\partial L(\theta)}{\partial {W^{[3]}}}= ?$$

Using the chain rule we can write:

$$\frac{\partial L(\theta)}{\partial {W^{[3]}}} = \frac{\partial L(\theta)}{\partial {Z^{[3]}}} \cdot \frac{\partial {Z^{[3]}}}{\partial {W^{[3]}}}$$

Again here, we already calculated $\frac{\partial L(\theta)}{\partial {Z^{[3]}}}$ in the previous step. So, we just need to calculate $\frac{\partial {Z^{[3]}}}{\partial {W^{[3]}}}$.

$$\frac{\partial {Z^{[3]}}}{\partial {W^{[3]}}} = A^{[2]}$$

In [27]:
# Use calculated intermediate results during forward propagation
A2 = model_results["A2"].detach().numpy()

print(f"A2:\n{A2}")

A2:
[[ 0.        8.      ]
 [ 8.       10.099999]]


So, we can write:

$$\frac{\partial L(\theta)}{\partial {W^{[3]}}} = \begin{bmatrix} 0.0911 \\ -0.2554 \end{bmatrix} \odot \begin{bmatrix} 0 & 8 \\ 8 & 10.1 \end{bmatrix}$$

Broadcasting the first vector to the shape $(2, 2)$ and then multiplying element-wise:

$$\frac{\partial L(\theta)}{\partial {W^{[3]}}} = \begin{bmatrix} 0 & 0.7289 \\ -2.0432 & -2.5795 \end{bmatrix}$$

In [28]:
dL_dW3 = dL_dZ3 * A2

print(f"dL_dW3:\n{dL_dW3}")

dL_dW3:
[[ 0.         0.7289836]
 [-2.0432248 -2.5795712]]


Now we take the average of gradients over all the samples. Each row is associated with one example in the batch. In other words, each row is the gradient of the loss with respect to the weights of the output layer for one sample. 

The output layer (layer 3) has only one neuron. So, we have:
$$W^{[3]} = \begin{bmatrix} w_{11}^{[3]} & w_{12}^{[3]} \end{bmatrix}$$


For the first weight (first column):
- $\begin{bmatrix} 0 \\ -2.0432 \end{bmatrix}$ are the gradients of the loss with respect to $w_{11}^{[3]}$ for the first and second samples in the batch, respectively.

For the second weight (second column):
- $\begin{bmatrix} 0.7289 \\ -2.5795 \end{bmatrix}$ are the gradients of the loss with respect to $w_{12}^{[3]}$ for the first and second samples in the batch, respectively.

So, the cost gradient with respect to the weights of the output layer is the mean of these gradients over all the samples in the batch.

$$\frac{\partial J(\theta)}{\partial {W^{[3]}}} = \frac{1}{2} \sum_{i=1}^{2} \frac{\partial L(\theta)^{(i)}}{\partial {W^{[3]}}} = \begin{bmatrix} \frac{0 -2.0432}{2} & \frac{0.7289 - 2.5795}{2} \end{bmatrix} = \begin{bmatrix} -1.0216 & -0.9253 \end{bmatrix}$$


In [29]:
dJ_dW3 = np.mean(dL_dW3, axis=0)

print(f"dJ_dW3:\n{dJ_dW3}")

dJ_dW3:
[-1.0216124 -0.9252938]
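
The broadcast-then-average steps above can also be written as a single matrix product. The following is a minimal sketch, assuming the rounded values printed earlier; the variable names `per_sample`, `dJ_dW3_mean` and `dJ_dW3_matmul` are introduced here for illustration only.

```python
import numpy as np

# Rounded values copied from the outputs above (one row per sample).
dL_dZ3 = np.array([[0.0911], [-0.2554]])   # shape (2, 1)
A2 = np.array([[0.0, 8.0], [8.0, 10.1]])   # shape (2, 2)

m = A2.shape[0]  # batch size

# Step-by-step version: per-sample gradients via broadcasting, then the batch mean.
per_sample = dL_dZ3 * A2                               # shape (2, 2)
dJ_dW3_mean = per_sample.mean(axis=0, keepdims=True)   # shape (1, 2)

# Equivalent single matrix product, already in the (out_features, in_features) layout.
dJ_dW3_matmul = (dL_dZ3.T @ A2) / m                    # shape (1, 2)

print(dJ_dW3_mean)
print(dJ_dW3_matmul)
```

Both expressions should agree up to floating-point rounding. The matrix-product form is usually preferred in practice because it avoids materializing the per-sample gradients.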


And here is the gradient that PyTorch calculated for the weights of layer 3:

In [30]:
print(f"Gradient of W3:\n{model.linear3.weight.grad}")

Gradient of W3:
tensor([[-1.0216, -0.9253]])
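
Besides comparing against PyTorch, a finite-difference check is another common way to validate a hand-derived gradient. The sketch below is one possible version in NumPy. It assumes a layer 3 bias of `0.1`, which is not shown in this section and is inferred from the `dL_dZ3` values above. The helper `cost` and the step size `eps` are hypothetical names chosen for illustration.

```python
import numpy as np

A2 = np.array([[0.0, 8.0], [8.0, 10.1]])  # layer 2 activations from above
y = np.array([[0.0], [1.0]])              # targets
W3 = np.array([[0.5, -0.3]])              # layer 3 weights, shape (1, 2)
b3 = 0.1                                  # ASSUMPTION: inferred, not shown in this section


def cost(W):
    """Mean binary cross entropy of the output layer for weights W."""
    z = A2 @ W.T + b3
    a = 1.0 / (1.0 + np.exp(-z))  # sigmoid
    return float(np.mean(-(y * np.log(a) + (1 - y) * np.log(1 - a))))


# Central differences: nudge one weight at a time and measure the change in cost.
eps = 1e-6
num_grad = np.zeros_like(W3)
for j in range(W3.shape[1]):
    step = np.zeros_like(W3)
    step[0, j] = eps
    num_grad[0, j] = (cost(W3 + step) - cost(W3 - step)) / (2 * eps)

print(num_grad)
```

Under these assumptions the printed values should be close to the analytical result $\begin{bmatrix} -1.0216 & -0.9253 \end{bmatrix}$.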


### 4. Gradient of $L$ with respect to $A^{[2]}$
Now, we go one step back in the computational graph to calculate the gradient of the loss with respect to the output of the layer 2.

$$\frac{\partial L(\theta)}{\partial {A^{[2]}}} = ?$$

For the linear transformation of layer 3 we had:

$$Z^{[3]} = A^{[2]} \cdot {W^{[3]}}^\top + {\vec{\mathbf{b}}}^{[3]}$$

So, using the chain rule we can write:

$$\frac{\partial L(\theta)}{\partial {A^{[2]}}} = \frac{\partial L(\theta)}{\partial {Z^{[3]}}} \cdot \frac{\partial {Z^{[3]}}}{\partial {A^{[2]}}}$$

We already calculated $\frac{\partial L(\theta)}{\partial {Z^{[3]}}}$ in the previous step. So, we just need to calculate $\frac{\partial {Z^{[3]}}}{\partial {A^{[2]}}}$.

$$\frac{\partial {Z^{[3]}}}{\partial {A^{[2]}}} = W^{[3]}$$

Each $z^{[3]}$ depends on the activation $a_j^{[2]}$ only through the weight $w_{1j}^{[3]}$, so the gradient for sample $i$ and neuron $j$ of layer 2 is $\frac{\partial L^{(i)}}{\partial z^{[3](i)}} \cdot w_{1j}^{[3]}$. For the whole batch, this is the matrix product of $\frac{\partial L(\theta)}{\partial {Z^{[3]}}}$ with shape $(2, 1)$ and $W^{[3]}$ with shape $(1, 2)$. The result has the same shape as $A^{[2]}$, which is $(2, 2)$: one row per sample and one column per neuron of layer 2.

In [31]:
W3_np = W3.detach().numpy()

print(f"W3:\n{W3_np}")

W3:
[[ 0.5 -0.3]]


So, we can write:
$$\frac{\partial L(\theta)}{\partial {A^{[2]}}}= \begin{bmatrix} 0.0911 \\ -0.2554 \end{bmatrix} \cdot \begin{bmatrix} 0.5 & -0.3 \end{bmatrix} = \begin{bmatrix} 0.0456 & -0.0273 \\ -0.1277 & 0.0766 \end{bmatrix}$$

In [32]:
dL_dA2 = dL_dZ3 @ W3_np

print(f"dL_dA2:\n{dL_dA2}")

dL_dA2:
[[ 0.04556147 -0.02733689]
 [-0.12770155  0.07662094]]


### 5. Gradient of $L$ with respect to $Z^{[2]}$
Now we go one step back in the computational graph to calculate the gradient of the loss with respect to the linear transformation of the layer 2.


$A^{[2]}$ is a function of $Z^{[2]}$ through the ReLU activation function.

$$A^{[2]} = \text{ReLU}(Z^{[2]})$$

Using the chain rule we can write:

$$\frac{\partial L(\theta)}{\partial {Z^{[2]}}} = \frac{\partial L(\theta)}{\partial {A^{[2]}}} \cdot \frac{\partial {A^{[2]}}}{\partial {Z^{[2]}}}$$


We already calculated $\frac{\partial L(\theta)}{\partial {A^{[2]}}}$ in the previous step. So, we just need to calculate $\frac{\partial {A^{[2]}}}{\partial {Z^{[2]}}}$.

The ReLU function is defined as:

$$\text{ReLU}(x) = \max(0, x)$$



The derivative of the ReLU function is:
$$\frac{d}{dx}\text{ReLU}(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ 1 & \text{if } x > 0 \end{cases}$$

Note: The derivative of the ReLU function is not defined at $x=0$, but in practice we can set it to $0$ or $1$. In this example we set it to $0$.

We calculated $Z^{[2]}$ in the forward propagation step.

$$Z^{[2]} = \begin{bmatrix} 0 & 8 \\ 8 & 10.1 \end{bmatrix}$$


In [33]:
Z2 = model_results["Z2"].detach().numpy()

print(f"Z2:\n{Z2}")

Z2:
[[ 0.        8.      ]
 [ 8.       10.099999]]



So, we can write:
$$\frac{\partial {A^{[2]}}}{\partial {Z^{[2]}}} = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$

Now we can calculate the gradient:

$$\frac{\partial L(\theta)}{\partial {Z^{[2]}}} = \frac{\partial L(\theta)}{\partial {A^{[2]}}} \cdot \frac{\partial {A^{[2]}}}{\partial {Z^{[2]}}} = \begin{bmatrix} 0.0456 & -0.0273 \\ -0.1277 & 0.0766 \end{bmatrix} \odot \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$


Both matrices have the shape $(2, 2)$, so this is a plain element-wise multiplication:

$$\frac{\partial L(\theta)}{\partial {Z^{[2]}}} = \begin{bmatrix} 0 & -0.0273 \\ -0.1277 & 0.0766 \end{bmatrix}$$

In [34]:
# Calculate the gradient of A2 with respect to Z2
dA2_dZ2 = (Z2 > 0).astype(np.float32)

print(f"dA2_dZ2:\n{dA2_dZ2}")

dA2_dZ2:
[[0. 1.]
 [1. 1.]]


In [35]:
# Calculate the gradient of the loss with respect to Z2
dL_dZ2 = dL_dA2 * dA2_dZ2

print(f"dL_dZ2:\n{dL_dZ2}")

dL_dZ2:
[[ 0.         -0.02733689]
 [-0.12770155  0.07662094]]
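
Because $Z^{[2]}$ contains an exact $0$ for the first sample, the convention chosen for the ReLU derivative at $x = 0$ changes the mask in this example. The following is a small sketch of a helper that makes that choice explicit. The function name `relu_grad` and its `at_zero` parameter are hypothetical and are not part of NumPy or PyTorch.

```python
import numpy as np


def relu_grad(z, at_zero=0.0):
    """Element-wise derivative of ReLU; `at_zero` is the value used where z == 0."""
    z = np.asarray(z, dtype=np.float32)
    grad = (z > 0).astype(np.float32)  # 1 where z > 0, otherwise 0
    grad[z == 0] = at_zero             # apply the chosen convention at exactly 0
    return grad


Z2 = np.array([[0.0, 8.0], [8.0, 10.1]], dtype=np.float32)

mask_zero = relu_grad(Z2)               # convention used in this example
mask_one = relu_grad(Z2, at_zero=1.0)   # alternative convention

print(mask_zero)
print(mask_one)
```

With `at_zero=0.0` the mask matches `dA2_dZ2` above. With `at_zero=1.0` the first entry becomes $1$, so the first sample would also pass a gradient through the first neuron of layer 2.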


### 6. Gradient of $L$ with respect to $W^{[2]}$ and ${\vec{\mathbf{b}}}^{[2]}$