# Neural Networks - Backpropagation

In this example we walk through forward propagation and then the step-by-step details of backpropagation using PyTorch's built-in [computational graph]().

Our problem is a binary classification task. We have 2 input features and our dataset has 2 samples. We define our neural network with the following architecture:
- Layer 0 (input layer): 2 features
- Layer 1: Fully connected layer with 3 neurons and ReLU activation function.
- Layer 2: Fully connected layer with 2 neurons and ReLU activation function.
- Layer 3 (output): Fully connected layer with 1 neuron (output) and Sigmoid activation function.

In this example we use a single batch containing 2 samples. The input $X$ and target $Y$ are defined as follows:

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$
$$Y = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

This means that for sample 1, $x_1 = 1$, $x_2 = 2$, and the target class is $y = 0$.

Recall that we keep samples in **rows** and features in **columns**. So, each row of $X$ and $Y$ is associated with one sample.

In [2]:
import torch

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
Y = torch.tensor([[0.0], [1.0]])

## Define the Neural Network

Let's create our neural network.

In [3]:
import torch.nn as nn
import torch.nn.functional as F


class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()

        # Define the model architecture (Layers and nodes)
        self.linear1 = nn.Linear(in_features=2, out_features=3)
        self.linear2 = nn.Linear(in_features=3, out_features=2)
        self.linear3 = nn.Linear(in_features=2, out_features=1)

    def forward(self, x):
        # Forward Propagation happens here.
        # It takes the input tensor x and returns the output tensor for each
        # layer by applying the linear transformation first and then the
        # activation function.
        # It starts from layer 1 and goes forward layer by layer to the output
        # layer.

        # Layer 1 linear transformation
        Z1 = self.linear1(x)
        # Layer 1 activation
        A1 = F.relu(Z1)

        # Layer 2 linear transformation
        Z2 = self.linear2(A1)
        # Layer 2 activation
        A2 = F.relu(Z2)

        # Layer 3 (output layer) linear transformation
        Z3 = self.linear3(A2)
        # Layer 3 activation
        A3 = F.sigmoid(Z3)

        # Print the intermediate results
        print(
            f"Z1:\n{Z1}\nA1:\n{A1}\nZ2:\n{Z2}\nA2:\n{A2}\nZ3:\n{Z3}\nA3:\n{A3}"
        )

        # Output of the model (prediction)
        return A3

In practice, for classification problems where the output layer uses the Sigmoid or Softmax activation function, we defer that activation to outside the model. In other words, the output layer just does the linear transformation $Z$ and outputs the [logits](). Then we apply the activation function (Sigmoid or Softmax) to the logits outside the model to get the predicted probabilities.

This approach is the same for both inference and training.

However, in this example, for simplicity and focus on the backpropagation, we will include the Sigmoid activation function in the output layer. So, in this example, the output layer will output the predicted probabilities.
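
For comparison, here is a minimal sketch of the deferred-activation approach described above. The class name `LogitNet` and the variables `X_demo`, `logits`, and `probs` are made up for this illustration and are not used in the rest of the example. The weights are PyTorch's random defaults, so the printed numbers will differ from run to run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogitNet(nn.Module):
    """Hypothetical variant of NeuralNet whose forward pass returns logits."""

    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(in_features=2, out_features=3)
        self.linear2 = nn.Linear(in_features=3, out_features=2)
        self.linear3 = nn.Linear(in_features=2, out_features=1)

    def forward(self, x):
        A1 = F.relu(self.linear1(x))
        A2 = F.relu(self.linear2(A1))
        # No Sigmoid here: the output layer returns the raw logits Z3
        return self.linear3(A2)


logit_model = LogitNet()
X_demo = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# The model outputs logits (any real number)
logits = logit_model(X_demo)

# The Sigmoid is applied outside the model to get probabilities
probs = torch.sigmoid(logits)

print(f"logits:\n{logits}\nprobabilities:\n{probs}")
```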

In [4]:
model = NeuralNet()
print(model)

NeuralNet(
  (linear1): Linear(in_features=2, out_features=3, bias=True)
  (linear2): Linear(in_features=3, out_features=2, bias=True)
  (linear3): Linear(in_features=2, out_features=1, bias=True)
)


Let's see the initial weights and biases of our neural network.

In [5]:
def print_model_parameters(model):
    for i, child in enumerate(model.children()):
        print(f"Layer {i+1}: {type(child).__name__}")
        child_parameters = dict(child.named_parameters())

        for name, param in child_parameters.items():
            print(f"\n{name}: {param.size()} {param}")
            print(f"{name}.grad:\n{param.grad}")

        print("-" * 80)


print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[ 0.1059,  0.2517],
        [-0.0551, -0.5693],
        [ 0.3489,  0.0364]], requires_grad=True)
weight.grad:
None

bias: torch.Size([3]) Parameter containing:
tensor([ 0.6874, -0.2069, -0.2842], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.1931, -0.0434,  0.5419],
        [ 0.5230, -0.5720,  0.4680]], requires_grad=True)
weight.grad:
None

bias: torch.Size([2]) Parameter containing:
tensor([ 0.5208, -0.4686], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parameter containing:
tensor([[-0.0407, -0.1357]], requires_grad=True)
weight.grad:
None

bias: torch.Size([1]) Parameter containing:
tensor([-0.0221], requires_grad=True)
bias.grad:
None
-------------

As we expect, gradients of parameters are `None` since we haven't computed any gradients yet.

For simplicity and reproducible results in this example, we'll set the weights and biases manually to the values listed below.

The weight matrices follow the same layout that PyTorch uses:
- Each row of $W^{[l]}$ is associated with one neuron in the layer $l$. For example, in layer 1, we have 3 neurons, so we have 3 rows in $W^{[1]}$.
- Each column of $W^{[l]}$ is associated with one feature of input values. For example, the number of columns in $W^{[1]}$ is equal to the number of features in the input layer $X$. We have 2 features in the input layer $X$, so we have 2 columns in $W^{[1]}$.


**Layer 1 (3 neurons):**
$$W^{[1]} = \begin{bmatrix} -1 & 2 \\ 3 & 0.5 \\ -0.1 & -4\end{bmatrix} \quad {\vec{\mathbf{b}}}^{[1]} = \begin{bmatrix} 1 & -2 & 0.3 \end{bmatrix}$$


**Layer 2 (2 neurons):**
$$W^{[2]} = \begin{bmatrix} 0.5 & 1 & -2 \\ 0.7 & 0.1 & 0.3\end{bmatrix} \quad {\vec{\mathbf{b}}}^{[2]} = \begin{bmatrix} -4 & 5 \end{bmatrix}$$

**Layer 3 (output):**
$$W^{[3]} = \begin{bmatrix} 0.5 & -0.3 \end{bmatrix} \quad {\vec{\mathbf{b}}}^{[3]} = \begin{bmatrix} 0.1 \end{bmatrix}$$ 

Note: The number of weights and biases is independent of the number of training samples (in any batch or in the entire dataset). The whole point of training is to optimize these parameters by exposing them to the dataset through repeated cycles of forward and backward propagation. So, no matter how large the dataset is, the number of parameters in the model is fixed and defined by the architecture of the neural network.
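
As a quick check of this point, the sketch below counts the parameters implied by the architecture alone, without touching any data. It builds three standalone `nn.Linear` layers with the same sizes as our network; the names `layers` and `total_params` are only for this illustration.

```python
import torch.nn as nn

# Same layer sizes as the network in this example (2 -> 3 -> 2 -> 1)
layers = [nn.Linear(2, 3), nn.Linear(3, 2), nn.Linear(2, 1)]

total_params = 0
for i, layer in enumerate(layers, start=1):
    # weight has shape (out_features, in_features), bias has shape (out_features,)
    n_weights = layer.weight.numel()
    n_biases = layer.bias.numel()
    total_params += n_weights + n_biases
    print(
        f"Layer {i}: weight {tuple(layer.weight.shape)}, "
        f"bias {tuple(layer.bias.shape)}, parameters: {n_weights + n_biases}"
    )

print(f"Total parameters: {total_params}")
```

The total is $(3 \times 2 + 3) + (2 \times 3 + 2) + (1 \times 2 + 1) = 20$, regardless of how many samples we feed to the model.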


In [6]:
# Layer 1
W_1 = torch.tensor([[-1.0, 2.0], [3.0, 0.5], [-0.1, -4.0]], requires_grad=True)
b_1 = torch.tensor([1.0, -2.0, 0.3], requires_grad=True)

# Layer 2
W_2 = torch.tensor([[0.5, 1.0, -2.0], [0.7, 0.1, 0.3]], requires_grad=True)
b_2 = torch.tensor([-4.0, 5.0], requires_grad=True)

# Layer 3 (Output layer)
W_3 = torch.tensor([[0.5, -0.3]], requires_grad=True)
b_3 = torch.tensor([0.1], requires_grad=True)

Now we set these weights and biases in our model.

In [7]:
model.linear1.weight.data.copy_(W_1)
model.linear1.bias.data.copy_(b_1)

model.linear2.weight.data.copy_(W_2)
model.linear2.bias.data.copy_(b_2)

model.linear3.weight.data.copy_(W_3)
model.linear3.bias.data.copy_(b_3)

print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[-1.0000,  2.0000],
        [ 3.0000,  0.5000],
        [-0.1000, -4.0000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([3]) Parameter containing:
tensor([ 1.0000, -2.0000,  0.3000], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.5000,  1.0000, -2.0000],
        [ 0.7000,  0.1000,  0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([2]) Parameter containing:
tensor([-4.,  5.], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parameter containing:
tensor([[ 0.5000, -0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([1]) Parameter containing:
tensor([0.1000], requires_grad=True)
bias.grad:
None
----------------------

## Step 1: Forward Propagation

Now let's run the forward propagation using the current weights and biases, and the input $X$.

In [8]:
# Forward Propagation
output = model(X)

Z1:
tensor([[  4.0000,   2.0000,  -7.8000],
        [  6.0000,   9.0000, -16.0000]], grad_fn=<AddmmBackward0>)
A1:
tensor([[4., 2., 0.],
        [6., 9., 0.]], grad_fn=<ReluBackward0>)
Z2:
tensor([[ 0.0000,  8.0000],
        [ 8.0000, 10.1000]], grad_fn=<AddmmBackward0>)
A2:
tensor([[ 0.0000,  8.0000],
        [ 8.0000, 10.1000]], grad_fn=<ReluBackward0>)
Z3:
tensor([[-2.3000],
        [ 1.0700]], grad_fn=<AddmmBackward0>)
A3:
tensor([[0.0911],
        [0.7446]], grad_fn=<SigmoidBackward0>)


In [Forward Propagation]() we feed the input $X$ (which could be a single sample, a batch of samples, or the entire dataset) to the model and compute the output of the first layer. We then pass that output to the next layer as its input and compute the output of that layer, and so on until we reach the output layer.

In each layer, we have two steps of computation:

**1. Linear Transformation:**<br>
$$Z^{[l]} = A^{[l-1]} \cdot {W^{[l]}}^\top + {\vec{\mathbf{b}}}^{[l]}$$

**2. Activation Function:**<br>
$$A^{[l]} = g^{[l]}(Z^{[l]})$$

By convention, we consider $X$ as the layer $0$. So, $A^{[0]} = X$.

Let's calculate the output of the layer $1$.

**Layer 1:**
$$Z^{[1]} = X \cdot {W^{[1]}}^\top + {\vec{\mathbf{b}}}^{[1]}$$

$$Z^{[1]} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \cdot \begin{bmatrix} -1 & 3 & -0.1 \\ 2 & 0.5 & -4 \end{bmatrix} + \begin{bmatrix} 1 & -2 & 0.3 \end{bmatrix}$$


$$Z^{[1]} = \begin{bmatrix} 3 & 4 & -8.1 \\ 5 & 11 & -16.3 \end{bmatrix} + \begin{bmatrix} 1 & -2 & 0.3 \end{bmatrix}$$

We broadcast the bias vector to the shape of $(2, 3)$ and add it to the matrix product of $X$ and ${W^{[1]}}^\top$.

$$Z^{[1]} = \begin{bmatrix} 3 & 4 & -8.1 \\ 5 & 11 & -16.3 \end{bmatrix} + \begin{bmatrix} 1 & -2 & 0.3 \\ 1 & -2 & 0.3 \end{bmatrix} = \begin{bmatrix} 4 & 2 & -7.8 \\ 6 & 9 & -16 \end{bmatrix}$$
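
The broadcasting step can be sketched in numpy as follows. The names `XW_demo` and `b1_demo` are only for this illustration; the values are copied from the matrices above. The sketch shows that adding the 1-D bias directly gives the same result as adding an explicitly expanded $(2, 3)$ copy of it.

```python
import numpy as np

# Matrix product of X and the transpose of W1, copied from the step above
XW_demo = np.array([[3.0, 4.0, -8.1], [5.0, 11.0, -16.3]])
b1_demo = np.array([1.0, -2.0, 0.3])

# Explicitly repeat the bias for every row (sample) of the batch
b1_expanded = np.broadcast_to(b1_demo, XW_demo.shape)

# numpy does the same expansion implicitly when the shapes are compatible
Z1_implicit = XW_demo + b1_demo
Z1_explicit = XW_demo + b1_expanded

print(f"Expanded bias:\n{b1_expanded}")
print(f"Same result: {np.array_equal(Z1_implicit, Z1_explicit)}")
```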

Let's verify our calculation using numpy.

In [9]:
import numpy as np

X_np = X.numpy()
W_1_np = W_1.detach().numpy()
b_1_np = b_1.detach().numpy()

In [10]:
Z_1_np = np.dot(X_np, W_1_np.T) + b_1_np

print(f"Z_1 (using numpy):\n{Z_1_np}")

Z_1 (using numpy):
[[  4.         2.        -7.8     ]
 [  6.         9.       -15.999999]]


> Note: due to how floating point arithmetic works in computer hardware, the result of $-16.3 + 0.3$ is not exactly $-16$ but a number very close to $-16$ like $-15.9999...$.
> For example, `print(0.1 + 0.2)` will output `0.30000000000000004` instead of `0.3`. This is not a bug, it's a limitation of floating point arithmetic in computers and it's not specific to Python or PyTorch.
> 
> However, this is not a problem in practice in machine learning and deep learning. 
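
Because of this, it is usually safer to compare floating point results with a tolerance instead of exact equality. A minimal sketch, using the values printed above (the variable names are only for this illustration):

```python
import math

import numpy as np

# Exact comparison fails because 0.1 + 0.2 is not exactly 0.3 in binary
exact_equal = (0.1 + 0.2) == 0.3
# Tolerance-based comparison treats the two values as equal
close_enough = math.isclose(0.1 + 0.2, 0.3)
print(f"exact: {exact_equal}, with tolerance: {close_enough}")

# The same idea for arrays: values as printed by numpy vs. the hand calculation
Z1_printed = np.array([[4.0, 2.0, -7.8], [6.0, 9.0, -15.999999]], dtype=np.float32)
Z1_by_hand = np.array([[4.0, 2.0, -7.8], [6.0, 9.0, -16.0]], dtype=np.float32)

arrays_identical = np.array_equal(Z1_printed, Z1_by_hand)
arrays_close = np.allclose(Z1_printed, Z1_by_hand)
print(f"identical: {arrays_identical}, close: {arrays_close}")
```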

Now let's calculate the activation of layer 1 using the [ReLU]() activation function.

$$A^{[1]} = \text{ReLU}(Z^{[1]})$$

$$A^{[1]} = \begin{bmatrix} \text{ReLU}(4) & \text{ReLU}(2) & \text{ReLU}(-7.8) \\ \text{ReLU}(6) & \text{ReLU}(9) & \text{ReLU}(-16) \end{bmatrix}$$

We know that the ReLU function is defined as:
$$\text{ReLU}(z) = \max(0, z)$$

We apply ReLU element-wise to the matrix $Z^{[1]}$. So, the output of the layer 1 is: 

$$A^{[1]} = \begin{bmatrix} 4 & 2 & 0 \\ 6 & 9 & 0 \end{bmatrix}$$

Let's verify our calculation using numpy.

In [11]:
def relu(Z):
    return np.maximum(0, Z)

In [12]:
A_1_np = relu(Z_1_np)

print(f"A_1 (using numpy):\n{A_1_np}")

A_1 (using numpy):
[[4. 2. 0.]
 [6. 9. 0.]]


Now if we compare this result with the PyTorch output, we see that they are the same.


Now let's calculate the output of the layer 2.

**Layer 2:**

$$Z^{[2]} = A^{[1]} \cdot {W^{[2]}}^\top + {\vec{\mathbf{b}}}^{[2]}$$

$$Z^{[2]} = \begin{bmatrix} 4 & 2 & 0 \\ 6 & 9 & 0 \end{bmatrix} \cdot \begin{bmatrix} 0.5 & 0.7 \\ 1 & 0.1 \\ -2 & 0.3 \end{bmatrix} + \begin{bmatrix} -4 & 5 \end{bmatrix}$$

Which equals:

$$Z^{[2]} = \begin{bmatrix} 0 & 8 \\ 8 & 10.1 \end{bmatrix}$$

Applying the ReLU activation function leaves these values unchanged, since none of them is negative:

$$A^{[2]} = \begin{bmatrix} 0 & 8 \\ 8 & 10.1 \end{bmatrix}$$

In the same way, we can keep going **forward** and compute the outputs (linear transformations and activations) layer by layer until we reach the output layer.

The output of the output layer is the **prediction** of the model which in this case is the predicted probability of binary classification.
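
To make the "keep going forward" idea concrete, here is a sketch of the whole forward pass in plain numpy, using the weights and biases we set manually. The helper names (`relu_np`, `sigmoid_np`, `layer_params`, `A3_manual`) are made up for this illustration.

```python
import numpy as np


def relu_np(Z):
    return np.maximum(0, Z)


def sigmoid_np(Z):
    return 1.0 / (1.0 + np.exp(-Z))


X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])

# (W, b, activation) for each layer, using the manually chosen values
layer_params = [
    (np.array([[-1.0, 2.0], [3.0, 0.5], [-0.1, -4.0]]),
     np.array([1.0, -2.0, 0.3]), relu_np),
    (np.array([[0.5, 1.0, -2.0], [0.7, 0.1, 0.3]]),
     np.array([-4.0, 5.0]), relu_np),
    (np.array([[0.5, -0.3]]), np.array([0.1]), sigmoid_np),
]

A = X_demo  # By convention, layer 0 activations are the input X
for layer_idx, (W, b, g) in enumerate(layer_params, start=1):
    Z = A @ W.T + b  # 1. Linear transformation
    A = g(Z)         # 2. Activation function
    print(f"Z{layer_idx}:\n{Z}\nA{layer_idx}:\n{A}")

A3_manual = A  # Output of the last layer (predicted probabilities)
```

Up to rounding, the final activations should agree with the `A3` tensor printed by PyTorch in the forward pass above.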

## Step 2: Compute the Loss and Cost

Computing the cost gives us the error of our model with respect to the labels (target value $Y$).

The [cost]() function is usually the average of the [loss]() function over all the samples in the batch (which pass through in the forward propagation).



As we discussed [here](), the loss function for binary classification is the binary cross-entropy loss which is defined as:

$$L(\theta)^{(i)} = -y^{(i)} \log(f_{\theta}(x^{(i)})) - (1-y^{(i)}) \log(1 - f_{\theta}(x^{(i)}))$$

Where:
- $i$ is the index of the sample in the batch.
- $y^{(i)}$ is the target value (label) of the $i$-th sample.
- $\theta$ is the model's parameters (weights and biases).
- $f_{\theta}$ is the model's function which produces the predicted probability based on the input $x$ and the model's parameters $\theta$. 

The [cost]() function is the average of the loss function over all the samples in the batch.

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(\theta)^{(i)}$$

Where:
- $m$ is the number of samples in the batch.


Using PyTorch's built-in cost function, we can calculate the cost as follows:

In [13]:
cost = F.binary_cross_entropy(output, Y)

print(f"Cost: {cost}")

Cost: 0.19522885978221893


As we discussed earlier, for more stable computation, in practice, we usually don't include the activation function in the output layer. So, the output of the model is the linear transformation $Z$ of the output layer (logits).  

In this example, for simplicity, we include the Sigmoid activation function in the output layer, so `output` contains the predicted probabilities and we use the `binary_cross_entropy()` loss function. If we had deferred the activation function to outside the model, the output would be the logits of the output layer, and we would use the `binary_cross_entropy_with_logits()` loss function instead.
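
The two options should give the same cost. As a sketch, we can check this with the logits $Z^{[3]}$ printed in the forward pass (the tensor names `logits_demo` and `targets_demo` are only for this illustration):

```python
import torch
import torch.nn.functional as F

# Z3 from the forward pass above and the targets Y
logits_demo = torch.tensor([[-2.3], [1.07]])
targets_demo = torch.tensor([[0.0], [1.0]])

# Option 1: apply Sigmoid first, then binary cross-entropy on probabilities
cost_from_probs = F.binary_cross_entropy(torch.sigmoid(logits_demo), targets_demo)

# Option 2: pass the logits directly; the Sigmoid is folded into the loss
cost_from_logits = F.binary_cross_entropy_with_logits(logits_demo, targets_demo)

print(f"Cost from probabilities: {cost_from_probs}")
print(f"Cost from logits:        {cost_from_logits}")
```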


Now let's see how this cost is calculated. We'll use numpy to calculate the cost manually.

**Loss for example 1:**

The input and target of the first sample are:
$$x^{(1)} = \begin{bmatrix} 1 & 2 \end{bmatrix}, \quad y^{(1)} = 0$$

The predicted probability of the first sample is:
$$f_{\theta}(x^{(1)}) = {a^{[3]}}^{(1)} = 0.0911$$

So, the loss of the first sample is:

$$L(\theta)^{(1)} = -0 \times \log(0.0911) - (1-0) \times \log(1 - 0.0911) = 0.0955$$


In [14]:
Y_np = Y.numpy()
A3_np = output.detach().numpy()

In [15]:
# Get the first row of the input X
X_1 = X_np[0]

# Get the first row of the output Y
Y_1 = Y_np[0]

# Get the output of the model for the first row
A3_1 = A3_np[0]

# Calculate the loss for the first example
loss_1 = -Y_1 * np.log(A3_1) - (1 - Y_1) * np.log(1 - A3_1)

print(f"Example 1:\nX: {X_1}\nY: {Y_1}\nOutput: {A3_1}\nLoss: {loss_1}")

Example 1:
X: [1. 2.]
Y: [0.]
Output: [0.09112295]
Loss: [0.09554543]


**Loss for example 2:**

The input and target of the second sample are:
$$x^{(2)} = \begin{bmatrix} 3 & 4 \end{bmatrix}, \quad y^{(2)} = 1$$

The predicted probability of the second sample is:
$$f_{\theta}(x^{(2)}) = {a^{[3]}}^{(2)} = 0.7446$$

So, the loss of the second sample is:

$$L(\theta)^{(2)} = -1 \times \log(0.7446) - (1-1) \times \log(1 - 0.7446) = 0.2949$$



In [16]:
X_2 = X_np[1]
Y_2 = Y_np[1]
A3_2 = A3_np[1]

loss_2 = -Y_2 * np.log(A3_2) - (1 - Y_2) * np.log(1 - A3_2)

print(f"Example 2:\nX: {X_2}\nY: {Y_2}\nOutput: {A3_2}\nloss: {loss_2}")

Example 2:
X: [3. 4.]
Y: [1.]
Output: [0.7445969]
loss: [0.29491228]


Now let's calculate the cost using numpy. As we discussed, the cost is the average of the loss over all the samples in the batch. In this case, we have only 2 samples in the batch.

In [17]:
cost_np = (loss_1 + loss_2) / 2
print(f"Cost (manual): {cost_np}")

Cost (manual): [0.19522884]


Now we can see we have reached the same value for the cost as PyTorch. 
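
The per-sample calculation above can also be written in vectorized form, which is closer to how the cost formula $J(\theta)$ reads. This is a sketch; `bce_cost` is a hypothetical helper and the input values are the model outputs printed above.

```python
import numpy as np


def bce_cost(A, Y):
    """Mean binary cross-entropy over all samples in the batch."""
    losses = -Y * np.log(A) - (1 - Y) * np.log(1 - A)
    return losses.mean()


# Predicted probabilities and targets, copied from the outputs above
A3_demo = np.array([[0.09112295], [0.7445969]])
Y_demo = np.array([[0.0], [1.0]])

cost_vectorized = bce_cost(A3_demo, Y_demo)
print(f"Cost (vectorized): {cost_vectorized}")
```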

## Step 3: Backward Propagation

In this step, we calculate the gradients of the loss function with respect to each parameter of the model. As we discussed in the [Backward Propagation](), we start from the last node of the computational graph (the cost node) and then calculate the partial derivative (gradient) of the loss with respect to each part of the graph, step by step in the backward direction, until we have reached all the parameters of the model.

**Using Chain Rule:**<br>
We can see the whole model as a huge composite function which is made of many smaller functions (the linear transformation and activation function of each layer). These functions are composed together layer by layer like a chain. So, in simple terms, we use the chain rule to calculate the gradient of the loss with respect to each parameter, from the outermost function (the cost) to the innermost functions (the ones that directly use the parameters of the model).
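
Here is a tiny sketch of the chain rule on a toy composite function, unrelated to our network: $y = u^2$ with $u = 3x + 1$. The names `x_demo`, `u_demo`, and `y_demo` are only for this illustration. Autograd should give the same derivative as multiplying the two local derivatives by hand.

```python
import torch

x_demo = torch.tensor(2.0, requires_grad=True)

u_demo = 3 * x_demo + 1  # inner function, du/dx = 3
y_demo = u_demo ** 2     # outer function, dy/du = 2u

# Autograd walks the graph backward: dy/dx = dy/du * du/dx
y_demo.backward()

# The same derivative by hand using the chain rule
manual_grad = 2 * u_demo.item() * 3

print(f"autograd: {x_demo.grad.item()}, manual: {manual_grad}")
```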

**Gradient of Loss vs Cost**:<br>
We calculate the partial derivative of the **loss** with respect to each parameter of the model. That gives us the gradient of the loss with respect to each parameter for **one single sample**. Then we calculate the average of these gradients (mean gradient) over all the samples in the batch. In that case, we can say we have calculated the gradient of the **cost** with respect to each parameter of the model.
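
The sketch below checks this relationship on a deliberately small model (a single Sigmoid neuron, not our 3-layer network). The weights `w_demo` and `b_demo` are arbitrary values chosen for the illustration. The gradient of the cost should match the mean of the per-sample loss gradients.

```python
import torch
import torch.nn.functional as F

X_demo = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
Y_demo = torch.tensor([[0.0], [1.0]])

# Arbitrary parameters of a single-neuron model (illustration only)
w_demo = torch.tensor([[0.5, -0.3]], requires_grad=True)
b_demo = torch.tensor([0.1])


def predict(x):
    return torch.sigmoid(x @ w_demo.T + b_demo)


# Gradient of the cost (mean loss over the batch) with respect to w
cost_demo = F.binary_cross_entropy(predict(X_demo), Y_demo)
(grad_of_cost,) = torch.autograd.grad(cost_demo, w_demo)

# Gradient of the loss for each sample separately, then averaged
per_sample_grads = []
for i in range(X_demo.shape[0]):
    loss_i = F.binary_cross_entropy(predict(X_demo[i:i + 1]), Y_demo[i:i + 1])
    (grad_i,) = torch.autograd.grad(loss_i, w_demo)
    per_sample_grads.append(grad_i)

mean_of_grads = torch.stack(per_sample_grads).mean(dim=0)

print(f"Gradient of cost:\n{grad_of_cost}")
print(f"Mean of per-sample gradients:\n{mean_of_grads}")
```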

Let's start the backward propagation by first defining our optimizer. We use [Adam](), a variation of the [SGD (Stochastic Gradient Descent)]() algorithm. The optimizer does not compute the gradients itself: `backward()` computes them, and the optimizer then uses them to update the parameters of the model.

In [18]:
import torch.optim as optim

# Define Adam optimizer with learning rate of 0.01
optimizer = optim.Adam(model.parameters(), lr=0.01)

By convention, we usually set all the gradients to zero before starting any new computation. Since PyTorch accumulates gradients in each parameter's `grad` attribute across `backward()` calls, resetting them (using `zero_grad()`) is good practice before calculating new gradients.

In this particular example, as we haven't calculated any gradients yet, the gradients are `None`. So, resetting them has no effect. 
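
To see why resetting matters in general, here is a minimal sketch on a single scalar tensor (`w_acc` is a made-up name, not part of our model). Calling `backward()` twice without resetting adds the second gradient on top of the first.

```python
import torch

w_acc = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(3w)/dw = 3
(3 * w_acc).backward()
grad_first = w_acc.grad.item()

# Second backward pass without resetting: the new gradient is added
(3 * w_acc).backward()
grad_second = w_acc.grad.item()

# Reset the stored gradient in place
w_acc.grad.zero_()
grad_after_reset = w_acc.grad.item()

print(f"first: {grad_first}, second: {grad_second}, after reset: {grad_after_reset}")
```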

In [19]:
optimizer.zero_grad()

In [20]:
print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[-1.0000,  2.0000],
        [ 3.0000,  0.5000],
        [-0.1000, -4.0000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([3]) Parameter containing:
tensor([ 1.0000, -2.0000,  0.3000], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.5000,  1.0000, -2.0000],
        [ 0.7000,  0.1000,  0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([2]) Parameter containing:
tensor([-4.,  5.], requires_grad=True)
bias.grad:
None
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parameter containing:
tensor([[ 0.5000, -0.3000]], requires_grad=True)
weight.grad:
None

bias: torch.Size([1]) Parameter containing:
tensor([0.1000], requires_grad=True)
bias.grad:
None
----------------------

We can see that each parameter (which has `requires_grad=True`) has a `grad` attribute which stores the gradient of the loss with respect to that parameter. We can see that all of our parameters currently have `None` as their gradients.

Now let's run the backpropagation by calling the `backward()` method on the last node of the computational graph (the cost node). This will start the backward step by step calculation of the gradients of the cost with respect to each parameter of the model.

In [22]:
# Backpropagation (compute the gradients)
cost.backward()

In [23]:
print_model_parameters(model)

Layer 1: Linear

weight: torch.Size([3, 2]) Parameter containing:
tensor([[-1.0000,  2.0000],
        [ 3.0000,  0.5000],
        [-0.1000, -4.0000]], requires_grad=True)
weight.grad:
tensor([[-0.0249, -0.0396],
        [-0.1814, -0.2428],
        [ 0.0000,  0.0000]])

bias: torch.Size([3]) Parameter containing:
tensor([ 1.0000, -2.0000,  0.3000], requires_grad=True)
bias.grad:
tensor([-0.0147, -0.0614,  0.0000])
--------------------------------------------------------------------------------
Layer 2: Linear

weight: torch.Size([2, 3]) Parameter containing:
tensor([[ 0.5000,  1.0000, -2.0000],
        [ 0.7000,  0.1000,  0.3000]], requires_grad=True)
weight.grad:
tensor([[-0.3831, -0.5747,  0.0000],
        [ 0.1752,  0.3175,  0.0000]])

bias: torch.Size([2]) Parameter containing:
tensor([-4.,  5.], requires_grad=True)
bias.grad:
tensor([-0.0639,  0.0246])
--------------------------------------------------------------------------------
Layer 3: Linear

weight: torch.Size([1, 2]) Parame

Let's simulate the backpropagation manually step by step. 

Remember that we start with the final node and walk backward through the computational graph. So, we start with the cost node and calculate the gradient of the cost with respect to the output of the output layer.

Also, recall that we use the terms **partial derivative** and **gradient** interchangeably here. Strictly speaking, the gradient is the collection of all the partial derivatives.

As we discussed earlier, we calculate the gradient of **loss** with respect to each parameter of the model. We do this for all of the calculations. At the end, we calculate the average of these gradients (mean gradient) over all the samples in the batch to give the gradient of the **cost** with respect to each parameter of the model.

So, all the following steps are computing the gradient of the **loss** with respect to each parameter of the model.

### 1. Partial Derivative of the Loss with respect to the Output of Layer 3

**Loss for example 1:**

$$L(\theta)^{(1)} = -y^{(1)} \log(f_{\theta}(x^{(1)})) - (1-y^{(1)}) \log(1 - f_{\theta}(x^{(1)}))$$

The output of the model $f_{\theta}(x^{(1)})$ is the output of the output layer $A^{[3]}$. So:

$$L(\theta)^{(1)} = -y^{(1)} \log({a^{[3]}}^{(1)}) - (1-y^{(1)}) \log(1 - {a^{[3]}}^{(1)})$$

Now let's calculate partial derivative of $L(\theta)^{(1)}$ with respect to ${a^{[3]}}^{(1)}$.

$$\frac{\partial L(\theta)^{(1)}}{\partial {a^{[3]}}^{(1)}} = -\frac{y^{(1)}}{{a^{[3]}}^{(1)} } + \frac{1-y^{(1)}}{1 - {a^{[3]}}^{(1)} }$$
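
As a sanity check, we can compare this formula with what autograd computes for sample 1. This is a sketch: the value of ${a^{[3]}}^{(1)}$ is copied from the output printed earlier, and the names `a3_s1` and `y_s1` are only for this illustration.

```python
import torch

# Output of layer 3 and target for sample 1 (values from the outputs above)
a3_s1 = torch.tensor(0.09112295, requires_grad=True)
y_s1 = torch.tensor(0.0)

# Binary cross-entropy loss for this single sample
loss_s1 = -y_s1 * torch.log(a3_s1) - (1 - y_s1) * torch.log(1 - a3_s1)
loss_s1.backward()

# Closed-form derivative: -y / a + (1 - y) / (1 - a)
a_value = a3_s1.detach()
closed_form = -y_s1 / a_value + (1 - y_s1) / (1 - a_value)

print(f"autograd: {a3_s1.grad.item()}, closed form: {closed_form.item()}")
```

Since $y^{(1)} = 0$, only the second term contributes, so the derivative reduces to $\frac{1}{1 - {a^{[3]}}^{(1)}}$.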



> Note: For calculating the derivative of $\log(1 - x)$, we can also use the chain rule.
> $$f(x) = \log(1 - x)$$
> $$u = 1 - x$$
> $$f(u) = \log(u)$$
> $$\frac{df(x)}{dx} = \frac{df(u)}{du} \cdot \frac{du}{dx}$$
> $$\frac{df(x)}{dx} = \frac{1}{u} \cdot (-1)$$