# Exercise 1: Calculate the updated weights in 1 epoch
- Prepared by: Hieng MAO
- Date: 20 / 03 / 2025
- Source code: [GitHub](https://github.com/maohieng/learn_ai/blob/main/advance_ml/Backpropagation_ex1.ipynb)

![Exercise 1](./Screenshot%202025-03-19%20105410.png)


## What we have
- Inputs:  
  $ x_1 = 0.35 $, $ x_2 = 0.7 $  

- Weights:  
  - Input to Hidden Layer:  
    $ w_{1,1} = 0.2 $, $ w_{1,2} = 0.3 $  
    $ w_{2,1} = 0.2 $, $ w_{2,2} = 0.3 $  
  - Hidden to Output Layer:  
    $ w_{1,3} = 0.3 $, $ w_{2,3} = 0.9 $  

- Activation Function: **Sigmoid**  
  $$
  \sigma(x) = \frac{1}{1 + e^{-x}}
  $$
  $$
  \sigma'(x) = \sigma(x) (1 - \sigma(x))
  $$

- Learning rate: **Assume $ \eta = 0.5 $**  

- **Target output** $ o_{3, \text{true}} $   
  I assume $ o_{3, \text{true}} = 1.0 $.
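
As a quick sanity check on the derivative identity above, the following sketch compares $ \sigma(x)(1 - \sigma(x)) $ with a central finite difference. The helper names `sigmoid`, `sigmoid_prime`, `numeric_derivative`, the test point, and the step size `eps` are illustrative assumptions, not part of the exercise.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Sigmoid derivative, expressed through the sigmoid's own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x = 0.21  # arbitrary test point (assumption)
analytic = sigmoid_prime(x)
numeric = numeric_derivative(sigmoid, x)
print(abs(analytic - numeric) < 1e-8)
```

Note that `sigmoid_prime` takes the net input `x` but evaluates the derivative from the activation `s`; substituting the net input itself into `s * (1 - s)` would give a different number.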

In [2]:
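import math

# Inputs
x1 = 0.35
x2 = 0.7

# Input-to-hidden weights: wi_j connects input i to hidden unit j
w1_1 = 0.2
w1_2 = 0.3
w2_1 = 0.2
w2_2 = 0.3

# Hidden-to-output weights
w1_3 = 0.3
w2_3 = 0.9

# Learning rate (written as eta in the formulas)
alpha = 0.5

# Assumed target output
y_true = 1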


## Forward Pass
### Hidden Layer Activations
For $ h_1 $:

$$
S_{h1} = (x_1 \cdot w_{1,1}) + (x_2 \cdot w_{2,1})
$$

In [3]:
h1_net = (x1 * w1_1) + (x2 * w2_1)
print("h1_net", h1_net)

h1_net 0.20999999999999996


$$
h_1 = \sigma(S_{h1}) = \frac{1}{1 + e^{-S_{h1}}} 
$$

In [4]:
h1_out = 1 / (1 + math.exp(-h1_net))
print("h1_out", h1_out)

h1_out 0.5523079095743253


Calculate the derivative of $ h_1 $, since we will use it later. The sigmoid derivative is evaluated from the activation $ h_1 = \sigma(S_{h1}) $, not from the net input $ S_{h1} $ itself:
$$
h_1^{'} = \frac{\partial{h_1}}{\partial{S_{h1}}} = \sigma^{'}(S_{h1}) = \sigma(S_{h1}) \cdot (1 - \sigma(S_{h1})) = h_1 \cdot (1 - h_1)
$$

In [5]:
derive_h1_out = h1_out * (1 - h1_out)
print("derive_h1_out", round(derive_h1_out, 6))

derive_h1_out 0.247264


For $ h_2 $:

$$
S_{h2} = (x_1 \cdot w_{1,2}) + (x_2 \cdot w_{2,2})
$$

In [6]:
h2_net = (x1 * w1_2) + (x2 * w2_2)
print("h2_net", h2_net)

h2_net 0.315


$$
h_2 = \sigma(S_{h2}) = \frac{1}{1 + e^{-S_{h2}}}
$$

In [7]:
h2_out = 1 / (1 + math.exp(-h2_net))
print("h2_out", h2_out)

h2_out 0.5781052328843092


Calculate the derivative of $ h_2 $, since we will use it later. As for $ h_1 $, it is evaluated from the activation $ h_2 = \sigma(S_{h2}) $:
$$
h_2^{'} = \frac{\partial{h_2}}{\partial{S_{h2}}} = \sigma^{'}(S_{h2}) = \sigma(S_{h2}) \cdot (1 - \sigma(S_{h2})) = h_2 \cdot (1 - h_2)
$$

In [8]:
derive_h2_out = h2_out * (1 - h2_out)
round(derive_h2_out, 6)

0.2439

### Output Layer Activation
For $ o_3 $:

$$
S_{o3} = (h_1 \cdot w_{1,3}) + (h_2 \cdot w_{2,3})
$$

In [9]:
o3_net = (h1_out * w1_3) + (h2_out * w2_3)
print("o3_net", o3_net)

o3_net 0.6859870824681757


$$
o_3 = \sigma(S_{o3}) = \frac{1}{1 + e^{-S_{o3}}}
$$

In [10]:
o3_out = 1 / (1 + math.exp(-o3_net))
print("o3_out", o3_out)

o3_out 0.6650736395247564


Calculate the derivative of $ o_3 $, since we will use it later. It is evaluated from the activation $ o_3 = \sigma(S_{o3}) $:
$$
o_3^{'} = \frac{\partial{o_3}}{\partial{S_{o3}}} = \sigma^{'}(S_{o3}) = \sigma(S_{o3}) \cdot (1 - \sigma(S_{o3})) = o_3 \cdot (1 - o_3)
$$

In [11]:
derive_o3_out = o3_out * (1 - o3_out)
round(derive_o3_out, 6)

0.222751
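
The forward pass above can also be written as one small function over weight lists. This is only a sketch: the name `forward_pass` and the list layout (`w_hidden[i][j]` for the weight from input `i` to hidden unit `j`) are assumptions made for illustration, not the notebook's own structure.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_pass(x, w_hidden, w_out):
    """Return (hidden activations, output activation) for one sample.

    x        : list of inputs [x1, x2]
    w_hidden : w_hidden[i][j] is the weight from input i to hidden unit j
    w_out    : w_out[j] is the weight from hidden unit j to the output
    """
    hidden = []
    for j in range(len(w_out)):
        net = sum(x[i] * w_hidden[i][j] for i in range(len(x)))
        hidden.append(sigmoid(net))
    o_net = sum(h * w for h, w in zip(hidden, w_out))
    return hidden, sigmoid(o_net)

x = [0.35, 0.7]
w_hidden = [[0.2, 0.3],   # w_{1,1}, w_{1,2}
            [0.2, 0.3]]   # w_{2,1}, w_{2,2}
w_out = [0.3, 0.9]        # w_{1,3}, w_{2,3}

hidden, o3 = forward_pass(x, w_hidden, w_out)
print(hidden, o3)
```

With the exercise's inputs and weights, `hidden` and `o3` should reproduce the `h1_out`, `h2_out`, and `o3_out` values printed in the cells above.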

## Computing the Error (total)
$$
E = \frac{1}{2} (o_3 - o_{3, \text{true}})^2
= \frac{1}{2} (o_3^2 - 2 \cdot o_3 \cdot o_{3, \text{true}} + o_{3, \text{true}}^2)
$$

In [12]:
err_total = 0.5 * (y_true - o3_out) ** 2
print("Total error", err_total)

Total error 0.05608783347059641


## Backward Pass

Before adjusting the weights, we look at the derivative of the total error with respect to a single weight.

For example, to adjust $ w_{1,3} $ according to its effect on the total error, we start from the **chain rule**:
$$
\frac{\partial{E}}{\partial{w_{1,3}}} = \frac{\partial{E}}{\partial{o_3}} \cdot \frac{\partial{o_3}}{\partial{S_{o3}}} \cdot \frac{\partial{S_{o3}}}{\partial{w_{1,3}}}
$$
where
$$
\frac{\partial{E}}{\partial{o_3}} = \frac{\partial{}}{\partial{o_3}}[\frac{1}{2} (o_3^2 - 2 \cdot o_3 \cdot o_{3, \text{true}} + o_{3, \text{true}}^2)] = o_3 - o_{3, \text{true}}
$$
 
$$
\frac{\partial{o_3}}{\partial{S_{o3}}} = o_3^{'} = o_3 (1 - o_3)
$$
 
$$
\frac{\partial{S_{o3}}}{\partial{w_{1,3}}} = \frac{\partial{}}{\partial{w_{1,3}}}(h_1 \cdot w_{1,3} + h_2 \cdot w_{2,3}) = h_1
$$

Define the error gradient of the output unit, $ \delta_{o3} $, as:
$$
\delta_{o3} = \frac{\partial{E}}{\partial{o_3}} \cdot \frac{\partial{o_3}}{\partial{S_{o3}}} = (o_3 - o_{3, \text{true}}) \cdot o_3^{'} = (o_3 - o_{3, \text{true}}) \cdot o_3 \cdot (1 - o_3)
$$

In [13]:
delta_o = (o3_out - y_true) * o3_out * (1 - o3_out)
print("delta_o", delta_o)

delta_o -0.074605079078696


So
$$
\frac{\partial{E}}{\partial{w_{1,3}}} = \delta_{o3} \cdot h_1
$$

In [14]:
step_w1_3 = delta_o * h1_out
step_w1_3

-0.04120497526958182

Similarly for $ w_{2,3} $:

$$
\frac{\partial{E}}{\partial{w_{2,3}}} = \delta_{o3} \cdot h_2
$$

In [15]:
step_w2_3 = delta_o * h2_out
step_w2_3

-0.04312958661514185
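
One way to gain confidence in the two output-layer gradients is to compare them with a central finite difference of the loss. The sketch below recomputes everything from the exercise's inputs so that it runs on its own; the function name `loss` and the step size `eps` are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(w13, w23, h1, h2, target):
    """Squared error E = 0.5 * (o3 - target)^2 as a function of the output weights."""
    o3 = sigmoid(h1 * w13 + h2 * w23)
    return 0.5 * (o3 - target) ** 2

# Hidden activations from the forward pass (inputs 0.35, 0.7)
h1 = sigmoid(0.35 * 0.2 + 0.7 * 0.2)
h2 = sigmoid(0.35 * 0.3 + 0.7 * 0.3)
w13, w23, target = 0.3, 0.9, 1.0

# Analytic gradients: delta_o3 * h_j
o3 = sigmoid(h1 * w13 + h2 * w23)
delta_o3 = (o3 - target) * o3 * (1 - o3)
analytic = (delta_o3 * h1, delta_o3 * h2)

# Numeric gradients: perturb one weight at a time
eps = 1e-6
numeric = (
    (loss(w13 + eps, w23, h1, h2, target) - loss(w13 - eps, w23, h1, h2, target)) / (2 * eps),
    (loss(w13, w23 + eps, h1, h2, target) - loss(w13, w23 - eps, h1, h2, target)) / (2 * eps),
)
print(analytic)
print(numeric)
```

The two printed pairs should agree to many decimal places; a visible mismatch would point to an error in the chain-rule derivation or in the code.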

### Update Output Layer Weights (gradient descent)
Using:

$$
w' = w - \eta \frac{\partial E}{\partial w}
$$

In [16]:
new_w1_3 = w1_3 - alpha * step_w1_3
new_w1_3

0.3206024876347909

In [17]:
new_w2_3 = w2_3 - alpha * step_w2_3
new_w2_3

0.9215647933075709

## Compute Hidden Layer Errors

![Exercise 1](./Screenshot%202025-03-19%20105410.png)

Again, we want to know how the error changes with respect to each hidden-layer weight. For $ w_{1,1} $, for example, the chain rule gives:

$$
\frac{\partial{E}}{\partial{w_{1,1}}} = \frac{\partial{E}}{\partial{o_3}} \cdot \frac{\partial{o_3}}{\partial{S_{o3}}} \cdot \frac{\partial{S_{o3}}}{\partial{h_1}} \cdot \frac{\partial{h_1}}{\partial{S_{h1}}} \cdot \frac{\partial{S_{h1}}}{\partial{w_{1,1}}}
$$

where
$$
\frac{\partial{E}}{\partial{o_3}} \cdot \frac{\partial{o_3}}{\partial{S_{o3}}} = \delta_{o3}
$$

$$
\frac{\partial{S_{o3}}}{\partial{h_1}} = \frac{\partial{}}{\partial{h_1}}(h_1 \cdot w_{1,3} + h_2 \cdot w_{2,3}) = w_{1,3}
$$

$$
\frac{\partial{h_1}}{\partial{S_{h1}}} = h_1^{'} = h_1 (1 - h_1)
$$

$$
\frac{\partial{S_{h1}}}{\partial{w_{1,1}}} = \frac{\partial{}}{\partial{w_{1,1}}} (x_1 \cdot w_{1,1} + x_2 \cdot w_{2,1}) = x_1
$$

So
$$
\frac{\partial E}{\partial{w_{1,1}}} = \delta_{o3} \cdot w_{1,3} \cdot h_1^{'} \cdot x_1 
$$

In [18]:
# h_1' is the sigmoid derivative evaluated from the activation: h1 * (1 - h1)
step_w1_1 = delta_o * w1_3 * h1_out * (1 - h1_out) * x1
round(step_w1_1, 6)

-0.001937

The same applies to $ w_{1,2}, w_{2,1}, w_{2,2} $:
$$
\frac{\partial E}{\partial{w_{1,2}}} = \delta_{o3} \cdot w_{2,3} \cdot h_2^{'} \cdot x_1 
$$

$$
\frac{\partial E}{\partial{w_{2,1}}} = \delta_{o3} \cdot w_{1,3} \cdot h_1^{'} \cdot x_2
$$

$$
\frac{\partial E}{\partial{w_{2,2}}} = \delta_{o3} \cdot w_{2,3} \cdot h_2^{'} \cdot x_2
$$

In [19]:
step_w1_2 = delta_o * w2_3 * h2_out * (1 - h2_out) * x1
round(step_w1_2, 6)

-0.005732

In [20]:
step_w2_1 = delta_o * w1_3 * h1_out * (1 - h1_out) * x2
round(step_w2_1, 6)

-0.003874

In [21]:
step_w2_2 = delta_o * w2_3 * h2_out * (1 - h2_out) * x2
round(step_w2_2, 6)

-0.011464

### Update Hidden Layer Weights (gradient descent)

In [22]:
new_w1_1 = w1_1 - alpha * step_w1_1
print(round(new_w1_1, 6))

new_w1_2 = w1_2 - alpha * step_w1_2
print(round(new_w1_2, 6))

new_w2_1 = w2_1 - alpha * step_w2_1
print(round(new_w2_1, 6))

new_w2_2 = w2_2 - alpha * step_w2_2
print(round(new_w2_2, 6))

0.200968
0.302866
0.201937
0.305732
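
The four hidden-layer gradients can be checked the same way as any backpropagation result: by comparing the chain-rule expressions with central finite differences of the loss. This is a minimal sketch under the exercise's inputs and initial weights; `loss`, `numeric_grad`, the weight-tuple layout, and `eps` are assumptions introduced for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(w, x1=0.35, x2=0.7, target=1.0):
    """Squared error for weights w = (w11, w12, w21, w22, w13, w23)."""
    w11, w12, w21, w22, w13, w23 = w
    h1 = sigmoid(x1 * w11 + x2 * w21)
    h2 = sigmoid(x1 * w12 + x2 * w22)
    o3 = sigmoid(h1 * w13 + h2 * w23)
    return 0.5 * (o3 - target) ** 2

def numeric_grad(w, k, eps=1e-6):
    """Central finite difference of the loss with respect to w[k]."""
    up = list(w)
    up[k] += eps
    down = list(w)
    down[k] -= eps
    return (loss(up) - loss(down)) / (2 * eps)

w = (0.2, 0.3, 0.2, 0.3, 0.3, 0.9)
x1, x2, target = 0.35, 0.7, 1.0
h1 = sigmoid(x1 * w[0] + x2 * w[2])
h2 = sigmoid(x1 * w[1] + x2 * w[3])
o3 = sigmoid(h1 * w[4] + h2 * w[5])
delta_o3 = (o3 - target) * o3 * (1 - o3)

# Chain-rule gradients; the sigmoid derivative uses the activations h1, h2
analytic = [
    delta_o3 * w[4] * h1 * (1 - h1) * x1,  # dE/dw11
    delta_o3 * w[5] * h2 * (1 - h2) * x1,  # dE/dw12
    delta_o3 * w[4] * h1 * (1 - h1) * x2,  # dE/dw21
    delta_o3 * w[5] * h2 * (1 - h2) * x2,  # dE/dw22
]
numeric = [numeric_grad(w, k) for k in range(4)]
for a, n in zip(analytic, numeric):
    print(f"{a:.8f} {n:.8f}")
```

Each printed row should show two matching numbers. If the derivative were computed from the net input instead of the activation, the analytic column would disagree with the numeric one.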


## Final Updated Weights (for this epoch)


In [25]:
# Commit the updated weights so later cells use them
w1_1 = new_w1_1
w1_2 = new_w1_2
w2_1 = new_w2_1
w2_2 = new_w2_2
w1_3 = new_w1_3
w2_3 = new_w2_3

print("W_1_1", round(new_w1_1, 6))
print("W_1_2", round(new_w1_2, 6))
print("W_2_1", round(new_w2_1, 6))
print("W_2_2", round(new_w2_2, 6))
print("W_1_3", round(new_w1_3, 6))
print("W_2_3", round(new_w2_3, 6))

W_1_1 0.200968
W_1_2 0.302866
W_2_1 0.201937
W_2_2 0.305732
W_1_3 0.320602
W_2_3 0.921565


## Forward Pass - One more time

Since we have new weights, we want to perform the forward pass one more time to see if the error is decreasing or increasing.

In [27]:
def forward(x1, x2):
    h1_net = (x1 * w1_1) + (x2 * w2_1)
    h1_out = 1 / (1 + math.exp(-h1_net))
    h2_net = (x1 * w1_2) + (x2 * w2_2)
    h2_out = 1 / (1 + math.exp(-h2_net))
    o3_net = (h1_out * w1_3) + (h2_out * w2_3)
    o3_out = 1 / (1 + math.exp(-o3_net))
    return o3_out

In [28]:
new_out = forward(x1, x2)
round(new_out, 6)

0.670643

In [29]:
def error(y_true, y_pred):
    return 0.5 * (y_true - y_pred) ** 2

Calculate the new error and compare it with the previous one. If $ E_{old} - E_{new} > 0 $, the update reduced the error and the model improved; otherwise it did not.

In [30]:
new_err = error(y_true, new_out)
round(new_err, 6)

0.054238

In [31]:
diff_err = err_total - new_err
round(diff_err, 6)

0.00185
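
To see whether the error keeps falling beyond one epoch, the whole procedure can be wrapped in a single update function and repeated. This is a sketch only: `train_step`, the weight-tuple layout, and the choice of 5 epochs are assumptions made for illustration, using the same single training sample and learning rate as above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(w, x1, x2, target, eta):
    """One gradient-descent update.

    w is (w11, w12, w21, w22, w13, w23).
    Returns (new weights, error measured before the update).
    """
    w11, w12, w21, w22, w13, w23 = w
    # Forward pass
    h1 = sigmoid(x1 * w11 + x2 * w21)
    h2 = sigmoid(x1 * w12 + x2 * w22)
    o3 = sigmoid(h1 * w13 + h2 * w23)
    err = 0.5 * (o3 - target) ** 2
    # Backward pass: every gradient uses the weights from before the update
    delta_o3 = (o3 - target) * o3 * (1 - o3)
    d_h1 = delta_o3 * w13 * h1 * (1 - h1)
    d_h2 = delta_o3 * w23 * h2 * (1 - h2)
    grads = (d_h1 * x1, d_h2 * x1, d_h1 * x2, d_h2 * x2,
             delta_o3 * h1, delta_o3 * h2)
    new_w = tuple(wi - eta * gi for wi, gi in zip(w, grads))
    return new_w, err

w = (0.2, 0.3, 0.2, 0.3, 0.3, 0.9)
errors = []
for epoch in range(5):  # 5 epochs is an arbitrary choice
    w, err = train_step(w, 0.35, 0.7, 1.0, 0.5)
    errors.append(err)
    print(epoch, round(err, 6))
```

The first printed error is the one computed before any update, and the second corresponds to the error after one epoch; on this single sample the sequence should decrease from one epoch to the next.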