# Network Architecture

- Input Layer: 2 neurons (Let's denote them as x1 and x2).
- Hidden Layer: 2 neurons (Let's denote the activation of these neurons as h1 and h2).
- Output Layer: 1 neuron (Let's denote its activation as o1).

## 1. Weights and Biases Initialization

In [1]:
import torch
import torch.nn.functional as F

In [13]:
# device = "cuda" if torch.cuda.is_available() else "cpu"
# tensor = tensor.to(device)

Let's denote the weights from the input layer to the hidden layer as W₁, and the weights from the hidden layer to the output layer as W₂. The biases are denoted as b₁ for the hidden layer and b₂ for the output layer.

- W₁: A matrix where each element W₁ᵢⱼ represents the weight from input neuron i to hidden neuron j. Since there are 2 input neurons and 2 hidden neurons, W₁ will be a 2x2 matrix.
- b₁: A vector of biases for the hidden layer, so it will have 2 elements.
- W₂: A matrix for weights from the hidden layer to the output layer. Here, since there are 2 hidden neurons and 1 output neuron, W₂ will be a 2x1 matrix.
- b₂: The bias for the output neuron, a scalar since there is only one output neuron.

In [192]:
torch.manual_seed(0)  # For reproducibility

# Define the weights and biases with real numbers
W1 = torch.tensor([[0.1, 0.2], [0.3, 0.4]], requires_grad=True)  # 2x2 matrix for input to hidden layer
W1_bef = W1.clone()
b1 = torch.tensor([0.1, 0.2], requires_grad=True)              # 2 elements vector for hidden layer biases
b1_bef = b1.clone()
W2 = torch.tensor([[0.5], [0.6]], requires_grad=True)           # 2x1 matrix for hidden to output layer
W2_bef = W2.clone()
b2 = torch.tensor([0.3], requires_grad=True)                    # 1-element tensor for the single output layer bias
b2_bef = b2.clone()

# Define the input
X = torch.tensor([1.0, 2.0], requires_grad=False)  # 2 elements vector for input      

In [193]:
W1

tensor([[0.1000, 0.2000],
        [0.3000, 0.4000]], requires_grad=True)

## 2. Forward Pass

Let's denote the input vector as X = [x1, x2].

From Input to Hidden Layer
Linear Transformation: The inputs are multiplied by the weights and added to the biases to compute the input to the hidden layer neurons.

Z₁ = XW₁ + b₁

Here, Z₁ is a 1x2 vector (since there are two neurons in the hidden layer).

Activation Function: An activation function (let's use ReLU for simplicity, which is defined as f(x) = max(0, x)) is applied element-wise to Z₁ to obtain the activations of the hidden layer:

H = f(Z₁)

Here, H is also a 1x2 vector. In the code below it is stored as a1.

From Hidden to Output Layer
Linear Transformation: Similar to the previous layer, we compute the input to the output neuron:

Z₂ = HW₂ + b₂

Z₂ is a scalar since there is only one output neuron.

Activation Function: If this were a binary classification problem, we might use the sigmoid activation function here. But let's keep it simple and assume it's a regression problem, so no activation function is applied, meaning the output is just Z₂.

Output
The final output of the network, O, for the input vector X is just Z₂ after the second linear transformation (since we're not applying another activation function for this example).

In [194]:
# Forward pass from input to hidden layer
Z1 = torch.matmul(X, W1) + b1  # Linear transformation
a1 = F.relu(Z1)  # Activation function (ReLU)

# Forward pass from hidden to output layer
Z2 = torch.matmul(a1, W2) + b2  # Linear transformation
a2 = Z2  # Identity function for the output

# Print the output
print("Output of the network:", a2.item())

Output of the network: 1.4200000762939453
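To see where 1.42 comes from, the same forward pass can be written out with plain Python arithmetic. This is only an illustrative sketch: the `_py` names are hypothetical and chosen so they do not overwrite the tensors defined above.

```python
# Same numbers as the tensors above, written as plain Python lists.
x_py = [1.0, 2.0]
w1_py = [[0.1, 0.2], [0.3, 0.4]]  # w1_py[i][j]: weight from input i to hidden neuron j
b1_py = [0.1, 0.2]
w2_py = [0.5, 0.6]                # w2_py[j]: weight from hidden neuron j to the output
b2_py = 0.3

# Hidden pre-activations: z1[j] = sum_i x[i] * w1[i][j] + b1[j]
z1_py = [sum(x_py[i] * w1_py[i][j] for i in range(2)) + b1_py[j] for j in range(2)]

# ReLU applied element-wise
h_py = [max(0.0, z) for z in z1_py]

# Output: z2 = sum_j h[j] * w2[j] + b2 (no output activation)
z2_py = sum(h_py[j] * w2_py[j] for j in range(2)) + b2_py

print(z1_py)  # approximately [0.8, 1.2]
print(z2_py)  # approximately 1.42
```

The small difference from the printed 1.4200000762939453 comes from PyTorch using 32-bit floats by default, while Python floats are 64-bit.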


Matmul: Under the hood, if you multiply a matrix A with dimensions m×n by a matrix B with dimensions n×p, the resulting matrix C has dimensions m×p. Each element Cᵢⱼ is the dot product of row i of A with column j of B. Here X is a 1-D tensor of length 2, which torch.matmul treats as a 1×2 row vector, so XW₁ yields one value per hidden neuron.
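The row-by-column rule can be sketched in a few lines of plain Python. The helper name `matmul_rows` is hypothetical and matrices are assumed to be lists of rows; this illustrates the definition, not how PyTorch implements it.

```python
def matmul_rows(A, B):
    """Multiply an m x n matrix A by an n x p matrix B, both given as lists of rows."""
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must match"
    # C[i][j] is the dot product of row i of A with column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

X_row = [[1.0, 2.0]]                # 1 x 2 (the input as a row vector)
W1_rows = [[0.1, 0.2], [0.3, 0.4]]  # 2 x 2
XW1 = matmul_rows(X_row, W1_rows)   # 1 x 2 result
print(XW1)                          # approximately [[0.7, 1.0]]
```

Adding the biases [0.1, 0.2] to this result gives the hidden pre-activations [0.8, 1.2].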

# Backpropagation

Backpropagation is the method used to calculate the gradient of the loss function with respect to each weight and bias in the network, by applying the chain rule repeatedly from the output layer back towards the input layer. This gradient information is then used to update the parameters in a way that reduces the loss.

**Conceptual Foundations:**

*   **Gradient:** This represents the direction and rate at which a function increases most rapidly. For our purposes, we actually want to move in the opposite direction to minimize our loss function, which means we're looking for the path that decreases the function most rapidly.
    
*   **Chain Rule of Calculus:** This rule allows us to compute the derivative of composite functions. If you have a function h(x) = g(f(x)), then the derivative h′(x) (which represents how h changes as x changes) is given by g′(f(x))⋅f′(x). In simpler terms, the rate of change of h with respect to x is the product of the rate of change of g with respect to f and the rate of change of f with respect to x. This is useful for understanding how changes in x affect the output of the function h.
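As a small numeric illustration of the chain rule, take the made-up functions f(x) = 3x + 1 and g(u) = u² (they are not part of the network; the names `inner`, `outer` and `composite` are hypothetical). The chain-rule result can be compared with a finite-difference estimate of the slope.

```python
def inner(x):        # f(x) = 3x + 1, so f'(x) = 3
    return 3.0 * x + 1.0

def outer(u):        # g(u) = u^2, so g'(u) = 2u
    return u ** 2

def composite(x):    # h(x) = g(f(x))
    return outer(inner(x))

x0 = 2.0
# Chain rule: h'(x0) = g'(f(x0)) * f'(x0) = 2 * f(x0) * 3
analytic = 2.0 * inner(x0) * 3.0

# Central finite difference as an independent check
eps = 1e-6
numeric = (composite(x0 + eps) - composite(x0 - eps)) / (2 * eps)

print(analytic)  # 42.0
print(numeric)   # close to 42
```

Backpropagation applies exactly this product of local derivatives, one factor per layer operation.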

For simplicity, let's use the squared error L = ½(a2 − y)² as the loss function. For a single sample this is the **mean squared error (MSE)** scaled by ½, which makes the derivative simply a2 − y. The output layer uses the identity function as its activation, while the hidden layer uses ReLU as before.

In [195]:
# Assume some target value y (the true label)
y = torch.tensor([0.7])  # For example purposes

#### To understand the effect of W2[0]=0.5 on the loss, we perform the backpropagation:

In [196]:
W2

tensor([[0.5000],
        [0.6000]], requires_grad=True)

### 1. Calculate the loss

In [197]:
# Compute loss
loss = 0.5 * (a2 - y) ** 2
print(f"Initial loss: {loss.item()}")

Initial loss: 0.2592000663280487


### 2. Compute the Gradient of the Loss with Respect to Output:

The output layer applies no activation, so a2 = Z2. With L = ½(a2 − y)², the derivative of the loss with respect to the network output a2 is:

∂L/∂a2 = a2 − y

This tells us how much the loss changes per unit change in the output.

In [198]:
a2 - y

tensor([0.7200], grad_fn=<SubBackward0>)

In [199]:
import sympy as sp

# Define the variables
a2_, y_ = sp.symbols('a2 y')

# Define the function
f = 0.5 * (a2_ - y_) ** 2

# Compute the derivative with respect to a2
derivative_f = sp.diff(f, a2_)
derivative_f

1.0*a2 - 1.0*y

### 3. Derivative of Output (Z2) w.r.t. W2[0]

In [200]:
# Forward pass from hidden to output layer
# Z2 = torch.matmul(a1, W2) + b2

∂Z2/∂W2[0] = a1[0]

Z2 depends linearly on W2[0], and the coefficient is a1[0], the first element of a1.

In [201]:
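# Z2 restricted to the terms that involve W2[0]; the a12 * w22 term is
# constant with respect to w21 and drops out of the derivative.
a11_, w21_, b2_ = sp.symbols('a11 w21 b2')

# Define the function
z2_ = (a11_ * w21_) + b2_

# Compute the derivative with respect to w21 (i.e. W2[0])
derivative_f = sp.diff(z2_, w21_)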
derivative_f

a11

### 4. Chain Rule for W2[0] 

Combine the derivatives using the chain rule to find how the loss changes with respect to W2[0]:

∂L/∂W2[0] = ∂L/∂a2 * ∂a2/∂Z2 * ∂Z2/∂W2[0] = (a2 − y) * 1 * a1[0]

In [202]:
a2 - y

tensor([0.7200], grad_fn=<SubBackward0>)

In [203]:
a1[0]

tensor(0.8000, grad_fn=<SelectBackward0>)

In [204]:
# Backward pass calculations for W2[0]
# dL/da2
dL_da2 = a2 - y # output - target

# Since a2 = Z2 directly and dZ2/dW2[0] = first element of A1 (as a2 = Z2 = A1 * W2 + b2)
dZ2_dw2_1 = a1[0] # first element of calculated values after first activation function in the hidden layer

# Chain rule to get dL/dw2_1
dL_dw2_1 = dL_da2 * dZ2_dw2_1

# Values for illustration
print(loss)
print(a2)
print(dL_da2)
print(y)
print()
print(f"""{dL_dw2_1}: 
This value indicates that if W2[0] is increased by a small amount,
the loss L is expected to also increase (since the gradient is positive),
suggesting the model would be getting worse.""")

tensor([0.2592], grad_fn=<MulBackward0>)
tensor([1.4200], grad_fn=<AddBackward0>)
tensor([0.7200], grad_fn=<SubBackward0>)
tensor([0.7000])

tensor([0.5760], grad_fn=<MulBackward0>): 
This value indicates that if W2[0] is increased by a small amount,
the loss L is expected to also increase (since the gradient is positive),
suggesting the model would be getting worse.
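The sign interpretation can be made concrete by nudging W2[0] a little in each direction and recomputing the loss. The sketch below uses plain Python with the values printed above (a1 = [0.8, 1.2], W2[1] = 0.6, b2 = 0.3, y = 0.7, gradient 0.576); the function name `loss_for_w2_0` is hypothetical.

```python
def loss_for_w2_0(w2_0):
    """Loss as a function of W2[0] only, with every other value held fixed."""
    a1_vals = (0.8, 1.2)           # hidden activations from the forward pass
    w2_1, b2_val, target = 0.6, 0.3, 0.7
    z2 = a1_vals[0] * w2_0 + a1_vals[1] * w2_1 + b2_val
    return 0.5 * (z2 - target) ** 2

base_w = 0.5
grad_w2_0 = 0.576                  # dL/dW2[0] computed above
base_loss = loss_for_w2_0(base_w)

results = []
for step in (0.01, -0.01):
    actual = loss_for_w2_0(base_w + step)
    predicted = base_loss + grad_w2_0 * step   # first-order (linear) estimate
    results.append((step, actual, predicted))
    print(f"step {step:+.2f}: actual loss {actual:.6f}, linear estimate {predicted:.6f}")
```

Increasing W2[0] raises the loss and decreasing it lowers the loss, by roughly gradient × step in both cases, which is why gradient descent moves against the gradient.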


#### To understand the effect of W1[0][1]= 0.2 on the loss, we perform the backpropagation:

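1. Gradient of the loss with respect to the output: ∂L/∂a2 = a2 − y.
2. Derivative of the output (Z2) w.r.t. A12 (the second hidden activation) = W2[1].
3. Derivative of A12 w.r.t. its pre-activation Z12, i.e. the derivative of ReLU (1 when Z12 > 0, otherwise 0).
4. Derivative of the pre-activation Z12 w.r.t. W1[0][1] = X1.
5. Chain rule for W1[0][1]: multiply the four factors above.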
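The ReLU factor in this chain is the derivative of ReLU, not the ReLU output itself. A minimal plain-Python sketch of the difference follows, using the numbers from this notebook (Z1[1] = 1.2, a2 − y = 0.72, W2[1] = 0.6, X[0] = 1.0). The helper names are hypothetical, and taking the derivative to be 0 at exactly z = 0 is a convention assumed here.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Slope of ReLU: 1 for positive inputs, 0 otherwise (0 at z == 0 by convention here).
    return 1.0 if z > 0 else 0.0

z1_second = 1.2                  # pre-activation of the second hidden neuron
print(relu(z1_second))           # 1.2 -> the activation value
print(relu_grad(z1_second))      # 1.0 -> the factor that belongs in the chain rule

# The four chain-rule factors for W1[0][1]
dL_da2_val = 0.72                # a2 - y
dz2_da12 = 0.6                   # W2[1]
da12_dz12 = relu_grad(z1_second) # ReLU derivative
dz12_dw = 1.0                    # X[0]
grad_w1_01 = dL_da2_val * dz2_da12 * da12_dz12 * dz12_dw
print(grad_w1_01)                # approximately 0.432
```

The product agrees with the W1.grad[0][1] entry (0.4320) that autograd reports later in the notebook.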

In [205]:
# Forward pass from hidden to output layer
# Z2 = torch.matmul(a1, W2) + b2

Z2 depends linearly on the second hidden activation a1[1] (A12), and the coefficient is W2[1].

In [206]:
# Derivative of the output (Z2) w.r.t. A12 (second hidden activation) = W2[1] (w22_)
# Only the a12 * w22 term of Z2 depends on a12, so the a11 * w21 term is omitted.
a12_, w22_, b2_ = sp.symbols('a12 w22 b2')

# Define the function
z2_ = (a12_ * w22_) + b2_

# Compute the derivative with respect to a12
derivative_f = sp.diff(z2_, a12_)
derivative_f

w22

In [207]:
# Derivative of the pre-activation Z12 (i.e. Z1[1]) w.r.t. W1[0][1] (w12_)
# Only the x1 * w12 term depends on w12, so the x2 * W1[1][1] term of Z12 is omitted.
x1_, w12_, b12_ = sp.symbols('x1 w12 b12')

# Define the function
z12_ = (x1_ * w12_) + b12_

# Compute the derivative with respect to w12
derivative_f = sp.diff(z12_, w12_)
derivative_f

x1

In [208]:
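# The four chain-rule factors for W1[0][1]
print(a2 - y)                 # dL/da2
print(W2[1])                  # dZ2/dA12
print((Z1[1] > 0).float())    # dA12/dZ12: ReLU derivative (1 where Z1[1] > 0, else 0)
print(X[0])                   # dZ12/dW1[0][1]

tensor([0.7200], grad_fn=<SubBackward0>)
tensor([0.6000], grad_fn=<SelectBackward0>)
tensor(1.)
tensor(1.)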


In [209]:
# Backward pass calculations for W1[0][1] (w1_2)
# dL/da2
dL_da2 = a2 - y # output - target

# Since a2 = Z2 directly, dZ2/da1_2 = W2[1]:
# the derivative of the output Z2 w.r.t. the activation A12 of the second neuron in the hidden layer.
dZ2_da1_2 = W2[1] 

# Derivative of ReLU at the pre-activation Z1[1] of the second neuron:
# 1 where Z1[1] > 0, else 0 (not the activation value F.relu(Z1[1]) itself).
da1_2_d_z1_2 = (Z1[1] > 0).float()

# derivative of the pre-activation Z1[1] with respect to the weight W1[0][1]
d_z1_2_d_w1_2 = X[0]

In [210]:
# Chain rule to get dL/dw1_2
dL_dw1_2 = dL_da2 * dZ2_da1_2 * da1_2_d_z1_2 * d_z1_2_d_w1_2

In [211]:
# Values for illustration
print(loss)
print(a2)
print(dL_da2)
print(y)
print()
print(f"""{dL_dw1_2}: 
This value indicates that if W1[0][1] is increased by a small amount,
the loss L is expected to also increase (since the gradient is positive),
suggesting the model would be getting worse.""")

tensor([0.2592], grad_fn=<MulBackward0>)
tensor([1.4200], grad_fn=<AddBackward0>)
tensor([0.7200], grad_fn=<SubBackward0>)
tensor([0.7000])

tensor([0.4320], grad_fn=<MulBackward0>): 
This value indicates that if W1[0][1] is increased by a small amount,
the loss L is expected to also increase (since the gradient is positive),
suggesting the model would be getting worse.


#### To understand the effect of b1[0]= 0.1 on the loss, we perform the backpropagation:

In [212]:
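# Derivative of the pre-activation Z11 (i.e. Z1[0]) w.r.t. the bias b11 (b1[0])
# Only terms involving b11 matter, so the x2 * W1[1][0] term of Z11 is omitted.
x1_, w11_, b11_ = sp.symbols('x1 w11 b11')

# Define the function
z11_ = (x1_ * w11_) + b11_

# Compute the derivative with respect to b11
derivative_f = sp.diff(z11_, b11_)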
derivative_f

1

**NOTE**: The bias enters a neuron's pre-activation only through addition, so the derivative of the pre-activation with respect to its bias is 1: if you increase the bias by a tiny amount, the pre-activation increases by the same amount.

The partial derivative of f = ab + c with respect to c is simply 1. This indicates that for any change in c, f changes by the same amount, independent of the values of a and b.
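A quick numeric check of this statement, as a plain-Python sketch with arbitrary made-up values for a and b (the function name `affine` is hypothetical):

```python
def affine(a, b, c):
    return a * b + c

delta = 1e-3
ratios = []
for a, b in [(1.0, 0.1), (-3.0, 7.5), (100.0, 0.0)]:
    # Change in f caused by nudging c, divided by the size of the nudge
    change = affine(a, b, 0.1 + delta) - affine(a, b, 0.1)
    ratios.append(change / delta)

print(ratios)  # each entry is approximately 1, whatever a and b are
```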

In [213]:
# Backward pass calculations for b1[0]
# dL/da2
dL_da2 = a2 - y # output - target

# Since a2 = Z2 directly, dZ2/da1_1 = W2[0]:
# the derivative of the output Z2 w.r.t. the activation A11 of the first neuron in the hidden layer.
dZ2_da1_1 = W2[0] 

# Derivative of ReLU at the pre-activation Z1[0] of the first neuron:
# 1 where Z1[0] > 0, else 0 (not the activation value F.relu(Z1[0]) itself).
da1_1_d_z1_1 = (Z1[0] > 0).float()

# derivative of the pre-activation Z1[0] with respect to the bias b11
z1_1_d_b1_1 = 1

In [214]:
# Chain rule to get dL/db1_1
dL_db1_1 = dL_da2 * dZ2_da1_1 * da1_1_d_z1_1 * z1_1_d_b1_1

In [215]:
# Values for illustration
print(loss)
print(a2)
print()
print(f"""{dL_db1_1}: 
This value indicates that if b1[0] is increased by a small amount,
the loss L is expected to also increase (since the gradient is positive),
suggesting the model would be getting worse.""")

tensor([0.2592], grad_fn=<MulBackward0>)
tensor([1.4200], grad_fn=<AddBackward0>)

tensor([0.3600], grad_fn=<MulBackward0>): 
This value indicates that if b1[0] is increased by a small amount,
the loss L is expected to also increase (since the gradient is positive),
suggesting the model would be getting worse.


In [216]:
# Backward pass
loss.backward()  # PyTorch computes all gradients automatically

# Manually updating the parameters (weights and biases), normally we'd do this with an optimizer
learning_rate = 0.01
with torch.no_grad():  # Updates should not be part of the computational graph
    W1 -= learning_rate * W1.grad
    b1 -= learning_rate * b1.grad
    W2 -= learning_rate * W2.grad
    b2 -= learning_rate * b2.grad

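    # Gradients would normally be zeroed after the update; that step is left
    # commented out here so the .grad values can be inspected in the cells below.
    # W1.grad.zero_()
    # b1.grad.zero_()
    # W2.grad.zero_()
    # b2.grad.zero_()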

# Print updated parameters for verification
print(f"Updated weights1: {W1}")
print(f"Updated biases1: {b1}")
print(f"Updated weights2: {W2}")
print(f"Updated biases2: {b2}")

Updated weights1: tensor([[0.0964, 0.1957],
        [0.2928, 0.3914]], requires_grad=True)
Updated biases1: tensor([0.0964, 0.1957], requires_grad=True)
Updated weights2: tensor([[0.4942],
        [0.5914]], requires_grad=True)
Updated biases2: tensor([0.2928], requires_grad=True)
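Each updated value is just old value − learning rate × gradient. The plain-Python sketch below redoes that arithmetic for three parameters, using gradients from this notebook (0.36 for W1[0][0], 0.576 for W2[0][0], and 0.72 for b2, which equals a2 − y because the bias has derivative 1). The names `lr_demo`, `examples` and `updated` are hypothetical.

```python
lr_demo = 0.01  # same learning rate as the cell above

# name: (initial value, gradient of the loss with respect to it)
examples = {
    "W1[0][0]": (0.1, 0.36),
    "W2[0][0]": (0.5, 0.576),
    "b2[0]": (0.3, 0.72),
}

# One gradient descent step: move each parameter against its gradient.
updated = {name: value - lr_demo * grad for name, (value, grad) in examples.items()}

for name, value in updated.items():
    print(f"{name}: {value:.4f}")
```

The results (0.0964, 0.4942, 0.2928) line up with the corresponding entries in the updated tensors printed above.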


In [220]:
# W1_bef[0][1] = 0.2
print(W1_bef)
print(W1.grad)
print(W1)

tensor([[0.1000, 0.2000],
        [0.3000, 0.4000]], grad_fn=<CloneBackward0>)
tensor([[0.3600, 0.4320],
        [0.7200, 0.8640]])
tensor([[0.0964, 0.1957],
        [0.2928, 0.3914]], requires_grad=True)


In [224]:
# W2_bef[0] = 0.5
print(W2_bef)
print(W2.grad)
print(W2)

tensor([[0.5000],
        [0.6000]], grad_fn=<CloneBackward0>)
tensor([[0.5760],
        [0.8640]])
tensor([[0.4942],
        [0.5914]], requires_grad=True)


In [226]:
# b1_bef[0] = 0.1
print(b1_bef[0])
print(b1.grad)
print(b1)

tensor(0.1000, grad_fn=<SelectBackward0>)
tensor([0.3600, 0.4320])
tensor([0.0964, 0.1957], requires_grad=True)
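As an independent check on the gradients, every partial derivative can also be estimated numerically with central finite differences. The sketch below re-implements the tiny network in plain Python (an assumption made so that it runs without PyTorch; the names `forward_loss`, `initial_params` and `numeric_grads` are hypothetical) and uses the initial parameter values from the start of the notebook.

```python
def forward_loss(params, x=(1.0, 2.0), target=0.7):
    """Loss of the 2-2-1 network for a flat parameter list:
    [W1[0][0], W1[0][1], W1[1][0], W1[1][1], b1[0], b1[1], W2[0], W2[1], b2]."""
    w00, w01, w10, w11, b1_0, b1_1, v0, v1, b2_val = params
    z1_0 = x[0] * w00 + x[1] * w10 + b1_0
    z1_1 = x[0] * w01 + x[1] * w11 + b1_1
    h0, h1 = max(0.0, z1_0), max(0.0, z1_1)   # ReLU
    z2 = h0 * v0 + h1 * v1 + b2_val           # identity output
    return 0.5 * (z2 - target) ** 2

param_names = ["W1[0][0]", "W1[0][1]", "W1[1][0]", "W1[1][1]",
               "b1[0]", "b1[1]", "W2[0]", "W2[1]", "b2"]
initial_params = [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.5, 0.6, 0.3]

eps = 1e-6
numeric_grads = []
for k in range(len(initial_params)):
    plus = list(initial_params)
    minus = list(initial_params)
    plus[k] += eps
    minus[k] -= eps
    # Central difference: (L(p + eps) - L(p - eps)) / (2 * eps)
    numeric_grads.append((forward_loss(plus) - forward_loss(minus)) / (2 * eps))

for name, grad in zip(param_names, numeric_grads):
    print(f"{name}: {grad:.4f}")
```

The estimates should agree with the W1.grad, W2.grad and b1.grad tensors printed above, for example 0.4320 for W1[0][1] and 0.3600 for b1[0].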
