# Submission

Jacob (Jake) Toronto

10991530

Spring 2024

CS 6480 Advanced Machine Learning

Dr. Larry Zeng

# Assignment

In this assignment, I use gradient descent to learn the weights and biases of the neural network from Assignment 1, which represents the XOR function.



# Background

First, we show the formulas that define the network.

### Inputs

The inputs $x_1$ and $x_2$ are real-valued. A step function first converts them to binary values:

$$u_1 = step(x_1)$$

$$u_2 = step(x_2)$$

### Middle Layer

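The two hidden units apply the sigmoid function $\sigma(a) = \frac{1}{1 + e^{-a}}$ to a weighted sum of the binary inputs:

$$v_1 = \sigma(u_1 * w_1 + u_2 * w_2 + b_1)$$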

$$v_2 = \sigma(u_1 * w_3 + u_2 * w_4 + b_2)$$


### Output Layer

$$\hat{z} = \sigma(v_1 * w_5 + v_2 * w_6 + b_3)$$
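As a minimal sketch, the three layers above can be written as a single Python function. The names `step`, `sigmoid`, and `forward` are illustrative, and the convention that `step` returns 1 only for strictly positive inputs is an assumption, since the threshold is not stated here. The parameter values are one of the trained sets listed in the Results section.

```python
import math


def step(x):
    # Assumed convention: 1 for strictly positive inputs, 0 otherwise
    return 1 if x > 0 else 0


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x1, x2, p):
    """Compute z-hat for real-valued inputs x1, x2 and a parameter dict p."""
    u1, u2 = step(x1), step(x2)                            # inputs -> binary
    v1 = sigmoid(u1 * p["w1"] + u2 * p["w2"] + p["b1"])    # middle layer, top
    v2 = sigmoid(u1 * p["w3"] + u2 * p["w4"] + p["b2"])    # middle layer, bottom
    return sigmoid(v1 * p["w5"] + v2 * p["w6"] + p["b3"])  # output layer


# One trained parameter set, copied from the Results section
params = {"w1": 4.2, "w2": 4.2, "b1": -6.5,
          "w3": 6.1, "w4": 6.2, "b2": -2.7,
          "w5": -9.3, "w6": 8.6, "b3": -3.9}

# Arbitrary real-valued test inputs covering all four sign combinations
for x1, x2 in [(-1.5, -0.2), (-0.3, 2.0), (0.7, -1.0), (3.1, 0.4)]:
    print(x1, x2, round(forward(x1, x2, params), 2))
```

Rounding the printed value to the nearest integer gives the XOR of the two binarised inputs for this parameter set.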


# Derivatives

These are the partial derivatives of the squared-error loss $E = \frac{1}{2}(\hat{z} - z)^2$ with respect to each weight and bias, where $z$ is the target output. They follow from the chain rule and the sigmoid identity $\sigma'(a) = \sigma(a)(1 - \sigma(a))$.

### Middle-Top Layer

$$\frac{\partial E}{\partial w_{1}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * w_{5} * v_{1} * (1 - v_{1}) * u_{1}$$

$$\frac{\partial E}{\partial w_{2}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * w_{5} * v_{1} * (1 - v_{1}) * u_{2}$$

$$\frac{\partial E}{\partial b_{1}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * w_{5} * v_{1} * (1 - v_{1}) * 1$$

### Middle-Bottom Layer

$$\frac{\partial E}{\partial w_{3}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * w_{6} * v_{2} * (1 - v_{2}) * u_{1}$$

$$\frac{\partial E}{\partial w_{4}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * w_{6} * v_{2} * (1 - v_{2}) * u_{2}$$

$$\frac{\partial E}{\partial b_{2}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * w_{6} * v_{2} * (1 - v_{2}) * 1$$

### Output Layer

$$\frac{\partial E}{\partial w_{5}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * v_{1}$$

$$\frac{\partial E}{\partial w_{6}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * v_{2}$$

$$\frac{\partial E}{\partial b_{3}} = (\hat{z} - z) * \hat{z} * (1 - \hat{z}) * 1$$
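One way to sanity-check the nine formulas above is to compare them with central finite differences. The sketch below assumes the loss is $\frac{1}{2}(\hat{z} - z)^2$, which is the loss whose gradient matches these formulas. The helper names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(p, u1, u2):
    v1 = sigmoid(u1 * p["w1"] + u2 * p["w2"] + p["b1"])
    v2 = sigmoid(u1 * p["w3"] + u2 * p["w4"] + p["b2"])
    z_hat = sigmoid(v1 * p["w5"] + v2 * p["w6"] + p["b3"])
    return v1, v2, z_hat


def loss(p, u1, u2, z):
    # Assumed loss: half the squared error
    return 0.5 * (forward(p, u1, u2)[2] - z) ** 2


def analytic_grads(p, u1, u2, z):
    """Gradients computed with the closed-form formulas."""
    v1, v2, z_hat = forward(p, u1, u2)
    dz = (z_hat - z) * z_hat * (1 - z_hat)
    top = dz * p["w5"] * v1 * (1 - v1)
    bot = dz * p["w6"] * v2 * (1 - v2)
    return {"w1": top * u1, "w2": top * u2, "b1": top,
            "w3": bot * u1, "w4": bot * u2, "b2": bot,
            "w5": dz * v1, "w6": dz * v2, "b3": dz}


def numeric_grads(p, u1, u2, z, eps=1e-6):
    """Central finite-difference estimate of each partial derivative."""
    grads = {}
    for name in p:
        plus, minus = dict(p), dict(p)
        plus[name] += eps
        minus[name] -= eps
        grads[name] = (loss(plus, u1, u2, z) - loss(minus, u1, u2, z)) / (2 * eps)
    return grads


# Hypothetical parameter values, used only to exercise the check
params = {"w1": 0.3, "w2": -0.7, "b1": 0.1,
          "w3": 0.8, "w4": 0.5, "b2": -0.4,
          "w5": 1.2, "w6": -0.9, "b3": 0.2}

analytic = analytic_grads(params, 1, 1, 0)
numeric = numeric_grads(params, 1, 1, 0)
max_diff = max(abs(analytic[k] - numeric[k]) for k in params)
print("largest disagreement:", max_diff)
```

The row $(u_1, u_2) = (1, 1)$ is used because every input-dependent gradient is nonzero there; a row with a zero input would trivially zero out the corresponding weight gradients.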


# Gradient Descent

Here I run the gradient descent algorithm, updating each weight and bias by subtracting the learning rate times its derivative.

This approach is stochastic, not batch: the loop visits the four training rows one at a time and updates every weight and bias after each row.

```python
import math
import random as r

# Truth table for XOR: binary inputs (u1, u2) and target output zt
u1_all = [0, 0, 1, 1]
u2_all = [0, 1, 0, 1]
zt_all = [0, 1, 1, 0]

# Initialise every weight and bias uniformly at random in [0, 1)
w1 = r.random()
w2 = r.random()
b1 = r.random()
w3 = r.random()
w4 = r.random()
b2 = r.random()
w5 = r.random()
w6 = r.random()
b3 = r.random()


def s(x):
    """Sigmoid activation."""
    return 1 / (1 + math.exp(-x))


def results():
    """Return the current weights and biases keyed by name."""
    return {
        "w1": w1,
        "w2": w2,
        "b1": b1,
        "w3": w3,
        "w4": w4,
        "b2": b2,
        "w5": w5,
        "w6": w6,
        "b3": b3
    }


epochs = 20000
learning_rate = 0.1
for i in range(epochs):
    for u1, u2, zt in zip(u1_all, u2_all, zt_all):

        # build up the components of the network

        # middle layer
        v1 = s(w1*u1 + w2*u2 + b1)
        v2 = s(w3*u1 + w4*u2 + b2)

        # output (z-hat or z-predicted)
        zp = s(w5*v1 + w6*v2 + b3)

        # signed error between prediction and target
        raw_loss = zp - zt

        # base gradient shared by every parameter
        dz = raw_loss * zp * (1 - zp)

        # middle top (uses w5 before it is updated below)
        top_grad = dz * w5 * v1 * (1 - v1)
        w1 -= learning_rate * top_grad * u1
        w2 -= learning_rate * top_grad * u2
        b1 -= learning_rate * top_grad

        # middle bottom (uses w6 before it is updated below)
        bot_grad = dz * w6 * v2 * (1 - v2)
        w3 -= learning_rate * bot_grad * u1
        w4 -= learning_rate * bot_grad * u2
        b2 -= learning_rate * bot_grad

        # output layer adjustments
        w5 -= learning_rate * dz * v1
        w6 -= learning_rate * dz * v2
        b3 -= learning_rate * dz

    # progress report every 999 epochs, based on the last row of the epoch
    if i % 999 == 0:
        print(i, u1, u2, zt, round(zp, 2), round(raw_loss, 2), round(raw_loss**2, 2), {k: round(v, 1) for k, v in results().items()})
print({k: round(v, 1) for k, v in results().items()})
```


Each progress line shows the epoch, the last training row of that epoch (`u1`, `u2`, target), the prediction, the error, the squared error, and the current parameters rounded to one decimal place:

```
0 1 1 0 0.8 0.8 0.63 {'w1': 0.0, 'w2': 0.8, 'b1': 0.3, 'w3': 0.4, 'w4': 0.9, 'b2': 0.9, 'w5': 0.9, 'w6': 0.4, 'b3': 0.3}
999 1 1 0 0.52 0.52 0.27 {'w1': 0.3, 'w2': 0.7, 'b1': 0.3, 'w3': 0.4, 'w4': 0.9, 'b2': 0.9, 'w5': 0.5, 'w6': 0.1, 'b3': -0.4}
1998 1 1 0 0.54 0.54 0.29 {'w1': 0.6, 'w2': 0.9, 'b1': 0.3, 'w3': 0.6, 'w4': 1.0, 'b2': 0.9, 'w5': 0.5, 'w6': 0.3, 'b3': -0.6}
2997 1 1 0 0.56 0.56 0.32 {'w1': 1.3, 'w2': 1.4, 'b1': 0.4, 'w3': 0.9, 'w4': 1.1, 'b2': 0.8, 'w5': 0.9, 'w6': 0.4, 'b3': -1.1}
3996 1 1 0 0.63 0.63 0.39 {'w1': 2.6, 'w2': 2.6, 'b1': -0.1, 'w3': 1.2, 'w4': 1.4, 'b2': 0.8, 'w5': 2.1, 'w6': 0.3, 'b3': -1.9}
4995 1 1 0 0.66 0.66 0.43 {'w1': 3.8, 'w2': 3.9, 'b1': -0.6, 'w3': 1.2, 'w4': 1.4, 'b2': 0.7, 'w5': 3.4, 'w6': -0.4, 'b3': -2.4}
5994 1 1 0 0.61 0.61 0.37 {'w1': 4.6, 'w2': 4.7, 'b1': -0.8, 'w3': 0.9, 'w4': 1.1, 'b2': -0.3, 'w5': 4.3, 'w6': -1.6, 'b3': -2.5}
6993 1 1 0 0.3 0.3 0.09 {'w1': 5.2, 'w2': 5.2, 'b1': -1.4, 'w3': 2.1, 'w4': 2.1, 'b2': -3.0, 'w5': 5.3, 'w6': -4
```
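
For contrast with the stochastic updates used here, a batch variant would sum the gradients over all four rows and update each parameter once per epoch. The following is only a sketch of that alternative, not the method used in this assignment. The function names and the starting values are hypothetical, and the loss is assumed to be half the squared error summed over the rows.

```python
import math

ROWS = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # (u1, u2, target)


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(p, u1, u2):
    v1 = sigmoid(u1 * p["w1"] + u2 * p["w2"] + p["b1"])
    v2 = sigmoid(u1 * p["w3"] + u2 * p["w4"] + p["b2"])
    zp = sigmoid(v1 * p["w5"] + v2 * p["w6"] + p["b3"])
    return v1, v2, zp


def row_gradients(p, u1, u2, zt):
    """Gradient of one row's loss with respect to every parameter."""
    v1, v2, zp = forward(p, u1, u2)
    dz = (zp - zt) * zp * (1 - zp)
    top = dz * p["w5"] * v1 * (1 - v1)
    bot = dz * p["w6"] * v2 * (1 - v2)
    return {"w1": top * u1, "w2": top * u2, "b1": top,
            "w3": bot * u1, "w4": bot * u2, "b2": bot,
            "w5": dz * v1, "w6": dz * v2, "b3": dz}


def total_loss(p):
    # Assumed loss: half the squared error, summed over the four rows
    return sum(0.5 * (forward(p, u1, u2)[2] - zt) ** 2 for u1, u2, zt in ROWS)


def batch_step(p, learning_rate=0.1):
    """One batch update: accumulate gradients over all rows, then step once."""
    summed = {name: 0.0 for name in p}
    for u1, u2, zt in ROWS:
        for name, g in row_gradients(p, u1, u2, zt).items():
            summed[name] += g
    return {name: p[name] - learning_rate * summed[name] for name in p}


# Hypothetical starting point, chosen only for illustration
start = {"w1": 0.0, "w2": 0.8, "b1": 0.3,
         "w3": 0.4, "w4": 0.9, "b2": 0.9,
         "w5": 0.9, "w6": 0.4, "b3": 0.3}

before = total_loss(start)
after = total_loss(batch_step(start))
print("loss before:", before, "loss after one batch step:", after)
```

The difference from the stochastic loop is that all four gradients are evaluated at the same parameter values, so the order of the rows does not affect the update.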

# Results

I used `20,000` epochs to train the weights.

In this run, the error (rounded to two decimal places) reached `0.0` before the 8,000th epoch.

However, other runs needed more than 10,000 epochs to reach `0.0` error, which is why I chose `20,000` epochs.

Each run results in different values for the weights and biases.

Here are a few:

```
{'w1': 7.0, 'w2': -1.9, 'b1': 1.1, 'w3': 7.9, 'w4': 5.0, 'b2': -1.0, 'w5': -5.0, 'w6': 6.4, 'b3': -1.5}
{'w1': 4.2, 'w2': 4.2, 'b1': -6.5, 'w3': 6.1, 'w4': 6.2, 'b2': -2.7, 'w5': -9.3, 'w6': 8.6, 'b3': -3.9}
{'w1': 4.2, 'w2': 4.2, 'b1': -6.4, 'w3': 6.1, 'w4': 6.2, 'b2': -2.7, 'w5': -9.2, 'w6': 8.5, 'b3': -3.9}
{'w1': 4.2, 'w2': 4.2, 'b1': -6.5, 'w3': 6.1, 'w4': 6.1, 'b2': -2.7, 'w5': -9.2, 'w6': 8.6, 'b3': -3.9}
{'w1': 6.1, 'w2': 6.1, 'b1': -2.6, 'w3': 4.2, 'w4': 4.2, 'b2': -6.4, 'w5': 8.5, 'w6': -9.2, 'b3': -3.9}
{'w1': 6.1, 'w2': 6.1, 'b1': -2.7, 'w3': 4.2, 'w4': 4.2, 'b2': -6.4, 'w5': 8.5, 'w6': -9.2, 'b3': -3.9}
{'w1': 6.6, 'w2': 6.6, 'b1': -2.8, 'w3': -4.2, 'w4': -4.2, 'b2': 6.3, 'w5': 6.5, 'w6': 6.8, 'b3': -9.7}
```

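As the sets above show, quite different weights can produce a working network. Several runs also land on essentially the same solution: the fifth and sixth sets match the second through fourth with the two hidden units swapped ($w_1, w_2, b_1, w_5$ exchanged with $w_3, w_4, b_2, w_6$). The first set looks like an exception. With those rounded values both hidden units saturate near 1 for the inputs $(1, 0)$ and $(1, 1)$, so the output is roughly 0.5 for both, which suggests that run had not fully converged.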
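
A quick way to check a reported parameter set is to run the forward pass over the XOR truth table. The sketch below does this for three of the sets listed above (the second, fifth, and seventh), using the rounded values exactly as printed. The function and variable names are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def predict(p, u1, u2):
    """Forward pass for binary inputs u1, u2 with parameter dict p."""
    v1 = sigmoid(u1 * p["w1"] + u2 * p["w2"] + p["b1"])
    v2 = sigmoid(u1 * p["w3"] + u2 * p["w4"] + p["b2"])
    return sigmoid(v1 * p["w5"] + v2 * p["w6"] + p["b3"])


# Three of the parameter sets reported above, with values as printed
candidates = [
    {'w1': 4.2, 'w2': 4.2, 'b1': -6.5, 'w3': 6.1, 'w4': 6.2, 'b2': -2.7, 'w5': -9.3, 'w6': 8.6, 'b3': -3.9},
    {'w1': 6.1, 'w2': 6.1, 'b1': -2.6, 'w3': 4.2, 'w4': 4.2, 'b2': -6.4, 'w5': 8.5, 'w6': -9.2, 'b3': -3.9},
    {'w1': 6.6, 'w2': 6.6, 'b1': -2.8, 'w3': -4.2, 'w4': -4.2, 'b2': 6.3, 'w5': 6.5, 'w6': 6.8, 'b3': -9.7},
]

truth_table = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # (u1, u2, XOR)

for p in candidates:
    outputs = [round(predict(p, u1, u2), 2) for u1, u2, _ in truth_table]
    print(outputs)
```

Each printed list holds the network output for the rows $(0,0)$, $(0,1)$, $(1,0)$, $(1,1)$ in that order; values close to 0, 1, 1, 0 indicate that the set implements XOR.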