# Implementing a Multilayer Perceptron

We discussed how a single perceptron, i.e., a neuron with just two weighted inputs $w_1 x_1$ and $w_2 x_2$ and a bias $b$, is a linear classifier in the $(x_1, x_2)$ plane. Its output

$$
y = \begin{cases}1, \quad\text{if}\ w_1 x_1 + w_2 x_2 + b > 0 \\ 0,\quad\text{otherwise}\end{cases}
$$

separates the $(x_1, x_2)$ plane into two half-planes, as shown on the left of the figure below. The half above the blue line corresponds to $x_1$ and $x_2$ values that yield $y=1$. The other half contains the values that yield $y=0$.
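The decision rule above can be sketched in a few lines of Python. The weights `W1 = 1`, `W2 = 1` and bias `B = -1.5` are illustrative assumptions, not values taken from the figure; with them the boundary is the line $x_1 + x_2 = 1.5$.

```python
def perceptron(x1: float, x2: float, w1: float, w2: float, b: float) -> int:
    """Step-activated neuron: 1 if w1*x1 + w2*x2 + b > 0, otherwise 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


# Illustrative (assumed) parameters: the boundary is the line x1 + x2 = 1.5.
W1, W2, B = 1.0, 1.0, -1.5

for point in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]:
    print(point, perceptron(*point, W1, W2, B))
```

On binary inputs, these particular values make the neuron behave like a logical AND: only $(1,1)$ lies on the $y=1$ side of the line.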

![](/workspaces/comp-460-f25-week-05/images/linearClassifier.drawio.png)

The right part of the figure above shows the values of the XOR function for $x_1, x_2 \in\{0,1\}$. No single line can split the $(x_1, x_2)$ plane into two parts so that each part contains only one value of the XOR function. To separate them, we need two lines. These two lines are effectively implemented by the neurons in the hidden layer below. Their results are then combined by the neuron in the output layer to separate the inputs whose bits match ($(0,0)$ and $(1,1)$) from the inputs whose bits differ ($(1,0)$ and $(0,1)$).

![](/workspaces/comp-460-f25-week-05/images/XOR.png)
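The claim that one line is not enough can be illustrated, though not proved, with a small brute-force search. The sketch below tries every combination of $w_1$, $w_2$, and $b$ on an assumed grid of values and reports whether any single step neuron reproduces a given truth table. The function names and the grid are hypothetical choices made for this example.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}


def step_neuron(x1, x2, w1, w2, b):
    """Single perceptron with a step activation."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def find_perceptron(table, grid):
    """Return the first (w1, w2, b) on the grid that reproduces `table`, or None."""
    for w1, w2, b in product(grid, repeat=3):
        if all(step_neuron(x1, x2, w1, w2, b) == y for (x1, x2), y in table.items()):
            return (w1, w2, b)
    return None


# Assumed search grid: -5.0, -4.5, ..., 5.0 for each parameter.
GRID = [k / 2 for k in range(-10, 11)]

print("OR :", find_perceptron(OR, GRID))
print("XOR:", find_perceptron(XOR, GRID))
```

The search finds parameters for OR but returns `None` for XOR. A finite grid cannot rule out every possible line, so this is only a sanity check of the geometric argument, not a substitute for it.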

In the notation below, a superscript labels the current layer. A single subscript labels the neuron. A double subscript labels a pair of neurons; the first subscript points to a neuron in the current layer and the second subscript to a neuron in the previous layer. In general,
$\textsf{relation}^{l}_{jk}$ describes a relation between the $j$-th neuron in the $l$-th layer and the $k$-th neuron in the $(l-1)$-th layer while $\textsf{property}^l_j$ describes a property of the $j$-th neuron in the $l$-th layer.

In mathematical typography, we use bold fonts to denote a vector. In the classroom, we use a little arrow over the same variable. For example, a vector $\vec{x}$ written on the whiteboard is the same as $\mathbf x$ appearing in typography. For matrices, the typographical convention is to print them in upper-case letters and bold fonts. On the whiteboard, a matrix variable is implied from context.

Using this notation, we can write the output of the first neuron in the second layer, $\textsf{neuron}^2_1$, as:

$$
\begin{align*}
  a^2_1 & = \sigma( w^2_{11} x_1 + w^2_{12} x_2 + b^2_1 ) \\ & =
  \sigma\left(
    \begin{bmatrix} w^2_{11} & w^2_{12}\end{bmatrix} \cdot
    \begin{bmatrix}x_1 \\ x_2\end{bmatrix} + b^2_1 \right) \\ &=
    \sigma\left( \mathbf W^2_1\cdot\mathbf x + b^2_1\right) 
\end{align*}
$$

Similarly, the output of the second neuron in the second layer, $\textsf{neuron}^2_2$ is:

$$
\begin{align*}
  a^2_2 & = \sigma\left( \mathbf W^2_2\cdot\mathbf x + b^2_2 \right) \\
\end{align*}
$$

Stacking the two outputs into a single vector gives the output of the whole layer:

$$
\begin{align*}
  \begin{bmatrix}a^2_1 \\ \\a^2_2 \end{bmatrix} & =
  \sigma \left(
    \begin{bmatrix}\mathbf W^2_1 \\ \\ \mathbf W^2_2 \end{bmatrix} \cdot\mathbf x +
    \begin{bmatrix}b^2_1 \\ \\ b^2_2 \end{bmatrix}
  \right) \Rightarrow \\ \\
  \mathbf a^2 & = \sigma \left(\mathbf W^2 \cdot \mathbf x + \mathbf b^2 \right) \\
  & = \sigma\left(\mathbf z^2\right)
\end{align*}
$$

As we compact the notation, $\mathbf a^2$ is the output vector for the hidden layer in response to $\mathbf z^2=\mathbf W^2 \cdot \mathbf x + \mathbf b^2 $.
Matrix $\mathbf W^2$ contains the input weights for that layer.

$$
\mathbf W^2 =
\begin{bmatrix}
  w^2_{11} & w^2_{12} \\
  w^2_{21} & w^2_{22}
\end{bmatrix}
$$
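As a quick check that the compact form and the per-neuron form agree, the sketch below computes $\sigma(\mathbf W^2 \cdot \mathbf x + \mathbf b^2)$ row by row and compares it with the two neurons written out explicitly. The numeric values and the helper name `layer` are hypothetical and chosen only to exercise the code.

```python
import math


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


def layer(W, b, x):
    """Compute sigma(W . x + b); each row of W holds one neuron's input weights."""
    return [
        sigmoid(sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j)
        for row, b_j in zip(W, b)
    ]


# Hypothetical parameters and input.
W2 = [[0.5, -1.0],
      [2.0, 0.25]]
b2 = [0.1, -0.3]
x = [1.0, 2.0]

# Matrix form: the whole layer at once.
a2 = layer(W2, b2, x)

# Per-neuron form: a^2_1 and a^2_2 written out term by term.
a2_1 = sigmoid(W2[0][0] * x[0] + W2[0][1] * x[1] + b2[0])
a2_2 = sigmoid(W2[1][0] * x[0] + W2[1][1] * x[1] + b2[1])

print(a2)
print([a2_1, a2_2])
```

Row $j$ of `W2` plays the role of $\mathbf W^2_j$, so the first index selects the neuron in the current layer and the second selects the input from the previous layer, matching the subscript convention above.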


For the neuron in the third layer, the output is

$$
\begin{align*}
a^3_1 & = \sigma \left( w^3_{11} a^2_1 + w^3_{12} a^2_2 + b^3_1 \right) \\
      & = \sigma \left( \mathbf W^3 \cdot \mathbf a^2 + b^3_1 \right) \\
      & = \sigma \left( \mathbf W^3 \cdot \sigma(\mathbf z^2) + b^3_1 \right) \\
      & = \sigma \left( \mathbf W^3 \cdot \sigma(\mathbf W^2 \cdot \mathbf x + \mathbf b^2) + b^3_1 \right)
\end{align*}
$$

In vector form, $\mathbf a^3 = \sigma \left( \mathbf W^3 \cdot \sigma\left(\mathbf W^2 \cdot \mathbf x + \mathbf b^2\right) + \mathbf b^3 \right)$,
where $\mathbf a^3 = \begin{bmatrix} a^3_1 \end{bmatrix}$, $\mathbf b^3 = \begin{bmatrix} b^3_1 \end{bmatrix}$, and $\mathbf W^3 = \begin{bmatrix} w^3_{11} & w^3_{12} \end{bmatrix}$.
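The nested expression is just one layer applied to the output of another. A minimal sketch of that composition follows; the names `layer` and `forward` and the `(W, b)` pair layout are assumptions made for this example.

```python
import math


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


def layer(W, b, a_prev):
    """One layer: sigma(W . a_prev + b), with one row of W per neuron."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, a_prev)) + b_j)
        for row, b_j in zip(W, b)
    ]


def forward(x, params):
    """Feed x through every layer; params is a list of (W, b) pairs."""
    a = x
    for W, b in params:
        a = layer(W, b, a)
    return a


# Hypothetical 2-2-1 network: (W2, b2) for the hidden layer, (W3, b3) for the output.
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]

print(forward([1.0, 0.0], params))
```

Because each layer only needs the previous layer's activations, adding more layers means appending more `(W, b)` pairs to the list.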


Consider the following weights and biases for the second and third layers:

$$
\begin{align*}
\mathbf W^2 &= \begin{bmatrix} 20 & 20 \\ -20 & -20  \end{bmatrix}\qquad
& \mathbf W^3 &= \begin{bmatrix} 20 & 20  \end{bmatrix}  \\
\mathbf b^2 &= \begin{bmatrix}  -10 \\ 30  \end{bmatrix} \qquad
& \mathbf b^3 &= \begin{bmatrix} -30  \end{bmatrix}
\end{align*}
$$


```python
import math


# Sigmoid activation
def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


# Forward pass through fixed weights
def mystery(x1: int, x2: int) -> int:
    # Hard-coded weights for the hidden layer (2 neurons), one row per neuron
    w2 = [
        [20, 20],  # hidden neuron 1
        [-20, -20],  # hidden neuron 2
    ]
    b2 = [-10, 30]  # biases for hidden neurons

    # Hidden activations
    hidden = []
    for j in range(2):
        z = w2[j][0] * x1 + w2[j][1] * x2 + b2[j]
        hidden.append(sigmoid(z))

    # Output neuron combines them: 20*h1 + 20*h2 - 30 is positive only when
    # both hidden activations are close to 1
    w3 = [20, 20]
    b3 = -30

    z_out = w3[0] * hidden[0] + w3[1] * hidden[1] + b3
    output = sigmoid(z_out)

    return round(output)  # round to 0 or 1


# Test XOR
print("\na  b   mystery\n-------------")
for a in [0, 1]:
    for b in [0, 1]:
        print(f"{a}  {b}     {mystery(a, b)}")
```
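One way to see why these weights work is to print the rounded hidden activations next to the output. The sketch below reuses the weights and biases given in the text; the helper name `hidden_and_output` is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


# Weights and biases from the text.
W2 = [[20, 20], [-20, -20]]
B2 = [-10, 30]
W3 = [20, 20]
B3 = -30


def hidden_and_output(x1: int, x2: int):
    """Return the two hidden activations and the output activation."""
    h = [sigmoid(W2[j][0] * x1 + W2[j][1] * x2 + B2[j]) for j in range(2)]
    y = sigmoid(W3[0] * h[0] + W3[1] * h[1] + B3)
    return h, y


print("x1 x2   h1  h2   y")
for x1 in (0, 1):
    for x2 in (0, 1):
        h, y = hidden_and_output(x1, x2)
        print(f"{x1}  {x2}    {round(h[0])}   {round(h[1])}   {round(y)}")
```

With these values the first hidden neuron behaves like OR, the second like NAND, and the output neuron fires only when both are active, which together reproduce XOR.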

# Your assignment

# Reading

- **PDF:** 

## The sigmoid function

The sigmoid function

$$
\sigma(z) = \frac{1}{1+e^{-z}}
$$

is a smooth function that behaves similarly to the step function. For large positive values of $z$, $\sigma(z) \rightarrow 1$, and for large negative values of $z$, $\sigma(z) \rightarrow 0$. For values in between, $\sigma(z)$ changes more smoothly than the step function and, more importantly, can be differentiated:
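To make the comparison with the step function concrete, the sketch below evaluates both at a few sample points. The sample values of $z$ are arbitrary choices for illustration.

```python
import math


def step(z: float) -> int:
    """Step activation used by the perceptron."""
    return 1 if z > 0 else 0


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


# Arbitrary sample points on both sides of zero.
for z in [-10, -2, -0.5, 0, 0.5, 2, 10]:
    print(f"z={z:6.1f}  step={step(z)}  sigmoid={sigmoid(z):.6f}")
```

Far from zero the two functions nearly agree, while near zero the sigmoid passes gradually through $\sigma(0) = 0.5$ instead of jumping.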

$$
\begin{align*}
\frac{d}{dz}\sigma(z) = \sigma(z)\left(1-\sigma(z)\right)
\end{align*}
$$

The derivative above is obtained with the chain rule for $\sigma(z) = f(u(z))$ where $u(z) = 1+e^{-z}$ and $f(u) = u^{-1}$:

$$
\begin{align*}
\frac{d}{dz}\sigma(z) & = \frac{d}{dz} f(u(z)) \\
   & = \frac{df}{du}\frac{du}{dz} \\
\end{align*}
$$

with $df/du = -u^{-2}$ and $du/dz = -e^{-z}$, so that

$$
\begin{align*}
  \frac{df}{du}\frac{du}{dz} & = -(1+e^{-z})^{-2} (-e^{-z}) \\
                             & = \frac{e^{-z}}{(1+e^{-z})^{2}} \\
                             & = \left(\frac{1}{1+e^{-z}}\right) \left(\frac{e^{-z}}{1+e^{-z}}\right) \\
                             & = \left(\frac{1}{1+e^{-z}}\right) \left(\frac{1+e^{-z}-1}{1+e^{-z}}\right) \\
                             & = \left(\frac{1}{1+e^{-z}}\right) \left(\frac{1+e^{-z}}{1+e^{-z}}-\frac{1}{1+e^{-z}}\right) \\
                             & = \left(\frac{1}{1+e^{-z}}\right) \left(1-\frac{1}{1+e^{-z}}\right) \\
                             & = \sigma(z)(1-\sigma(z))
\end{align*}
$$


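```python
import math


def sigmoid(z: float) -> float:
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))


def sigmoid_derivative(z: float) -> float:
    """Derivative of the sigmoid, using sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)
```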
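The closed-form derivative can be sanity-checked numerically. The sketch below compares it with a central finite difference; the helper `central_difference`, the step size `h = 1e-5`, and the sample points are assumptions made for this check.

```python
import math


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


def sigmoid_derivative(z: float) -> float:
    s = sigmoid(z)
    return s * (1 - s)


def central_difference(f, z: float, h: float = 1e-5) -> float:
    """Numerical derivative: (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2 * h)


for z in [-2.0, 0.0, 0.1, 3.0]:
    analytic = sigmoid_derivative(z)
    numeric = central_difference(sigmoid, z)
    print(f"z={z:5.1f}  analytic={analytic:.8f}  numeric={numeric:.8f}")
```

The two columns should agree to several decimal places. The derivative peaks at $z=0$, where $\sigma(0)(1-\sigma(0)) = 0.25$.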

--- 

# Backpropagation Example (2-2-1 Network, XOR input)

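We demonstrate **one training step** of backpropagation for input $(x_1,x_2)=(1,0)$ with target $y=1$. The hidden neurons have weights $w_{jk}$ and biases $b_j$; the output neuron has weights $v_1, v_2$ and bias $b_3$. The initial parameters (the pre-update values in the weight-update step) are $w_{11}=0.10$, $w_{12}=0.20$, $w_{21}=-0.10$, $w_{22}=0.10$, $v_1=v_2=0.10$, and $b_1=b_2=b_3=0$.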

---

## Forward pass

$$
\begin{align*}
z_1 &= w_{11}x_1+w_{12}x_2+b_1 = 0.10, & a_1 &= \sigma(z_1)\approx 0.524979, \\
z_2 &= w_{21}x_1+w_{22}x_2+b_2 = -0.10, & a_2 &= \sigma(z_2)\approx 0.475021, \\
z_3 &= v_1a_1+v_2a_2+b_3 = 0.10, & \hat y &= \sigma(z_3)\approx 0.524979.
\end{align*}
$$

Loss:
$$
L = \tfrac12(y-\hat y)^2 \approx 0.112822.
$$
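The forward pass and loss can be reproduced with a short script. The initial parameter values below are inferred from the pre-update values in the weight-update step (with every bias starting at zero), so treat them as an assumption rather than part of the original statement of the example.

```python
import math


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


# Assumed initial parameters, inferred from the weight-update step.
w = [[0.10, 0.20],    # w11, w12
     [-0.10, 0.10]]   # w21, w22
b = [0.0, 0.0]        # b1, b2
v = [0.10, 0.10]      # v1, v2
b3 = 0.0

x = [1, 0]  # input (x1, x2)
y = 1       # target

# Hidden layer: z_j = w_j1 * x1 + w_j2 * x2 + b_j, a_j = sigmoid(z_j)
z_hidden = [w[j][0] * x[0] + w[j][1] * x[1] + b[j] for j in range(2)]
a_hidden = [sigmoid(z) for z in z_hidden]

# Output neuron: z3 = v1 * a1 + v2 * a2 + b3
z3 = v[0] * a_hidden[0] + v[1] * a_hidden[1] + b3
y_hat = sigmoid(z3)

# Squared-error loss
loss = 0.5 * (y - y_hat) ** 2

print(f"z1={z_hidden[0]:.2f}  a1={a_hidden[0]:.6f}")
print(f"z2={z_hidden[1]:.2f}  a2={a_hidden[1]:.6f}")
print(f"z3={z3:.2f}  y_hat={y_hat:.6f}  loss={loss:.6f}")
```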

---

## Backward pass

Output error term:
$$
\delta^{(3)} = (\hat y - y)\,\hat y(1-\hat y) \approx -0.118459.
$$

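Hidden error terms, obtained by propagating $\delta^{(3)}$ back through the output weight $v_j$ of hidden neuron $j$:
$$
\delta^{(j)} = \delta^{(3)}\, v_j\, a_j(1-a_j), \qquad
\delta^{(1)} \approx -0.002954, \qquad \delta^{(2)} \approx -0.002954.
$$

The two values coincide because $v_1 = v_2$ and $a_2 = 1 - a_1$, so $a_1(1-a_1) = a_2(1-a_2)$.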

---

## Gradients

Output layer, using $\partial L/\partial v_j = \delta^{(3)} a_j$ and $\partial L/\partial b_3 = \delta^{(3)}$:
$$
\frac{\partial L}{\partial v_1} \approx -0.062188, \quad
\frac{\partial L}{\partial v_2} \approx -0.056270, \quad
\frac{\partial L}{\partial b_3} \approx -0.118459.
$$

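Hidden layer, using $\partial L/\partial w_{jk} = \delta^{(j)} x_k$ and $\partial L/\partial b_j = \delta^{(j)}$ (the gradients for $w_{12}$ and $w_{22}$ vanish since $x_2=0$):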
$$
\frac{\partial L}{\partial w_{11}} \approx -0.002954, \quad
\frac{\partial L}{\partial w_{12}} = 0, \quad
\frac{\partial L}{\partial b_1} \approx -0.002954,
$$
$$
\frac{\partial L}{\partial w_{21}} \approx -0.002954, \quad
\frac{\partial L}{\partial w_{22}} = 0, \quad
\frac{\partial L}{\partial b_2} \approx -0.002954.
$$
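The error terms and gradients above can be recomputed as follows. This is a sketch under the same assumption about the initial parameters (values inferred from the weight-update step, biases starting at zero); the variable names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


# Assumed initial parameters, inferred from the weight-update step.
w = [[0.10, 0.20], [-0.10, 0.10]]  # hidden weights w_jk
b = [0.0, 0.0]                     # hidden biases
v = [0.10, 0.10]                   # output weights
b3 = 0.0
x, y = [1, 0], 1

# Forward pass
a = [sigmoid(w[j][0] * x[0] + w[j][1] * x[1] + b[j]) for j in range(2)]
y_hat = sigmoid(v[0] * a[0] + v[1] * a[1] + b3)

# Output error term: (y_hat - y) * sigma'(z3), with sigma' = y_hat * (1 - y_hat)
delta3 = (y_hat - y) * y_hat * (1 - y_hat)

# Hidden error terms: delta3 propagated back through v_j, times sigma'(z_j)
delta_h = [delta3 * v[j] * a[j] * (1 - a[j]) for j in range(2)]

# Gradients
grad_v = [delta3 * a[j] for j in range(2)]   # dL/dv_j
grad_b3 = delta3                             # dL/db3
grad_w = [[delta_h[j] * x[k] for k in range(2)] for j in range(2)]  # dL/dw_jk
grad_b = list(delta_h)                       # dL/db_j

print(f"delta3  = {delta3:.6f}")
print(f"delta_h = {[round(d, 6) for d in delta_h]}")
print(f"grad_v  = {[round(g, 6) for g in grad_v]}")
print(f"grad_w  = {[[round(g, 6) for g in row] for row in grad_w]}")
```

Every gradient is an error term multiplied by the activation (or input) that fed the corresponding weight, which is why the entries for $w_{12}$ and $w_{22}$ are zero here.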

---

## Weight updates (learning rate $\eta=0.5$)

$$
\begin{align*}
v_1 &\leftarrow 0.10 - 0.5(-0.062188) = 0.131094, \\
v_2 &\leftarrow 0.10 - 0.5(-0.056270) = 0.128135, \\
b_3 &\leftarrow 0.00 - 0.5(-0.118459) = 0.059229, \\
w_{11} &\leftarrow 0.10 - 0.5(-0.002954) = 0.101477, \\
w_{12} &\leftarrow 0.20 - 0.5(0) = 0.20, \\
b_1 &\leftarrow 0.00 - 0.5(-0.002954) = 0.001477, \\
w_{21} &\leftarrow -0.10 - 0.5(-0.002954) = -0.098523, \\
w_{22} &\leftarrow 0.10 - 0.5(0) = 0.10, \\
b_2 &\leftarrow 0.00 - 0.5(-0.002954) = 0.001477.
\end{align*}
$$

---

## Improvement

Forward again with updated parameters:

$$
\hat y \approx 0.547137, \quad L \approx 0.102543.
$$

✅ The prediction moved closer to the target $y=1$ (from $0.524979$ to $0.547137$) and the loss decreased (from $0.112822$ to $0.102543$).
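The whole training step, from forward pass to parameter update, fits in one small function. The sketch below is one possible arrangement, not a definitive implementation: the dictionary layout and the names `forward`, `loss`, and `train_step` are hypothetical, and the initial parameters are the values inferred from the weight-update step.

```python
import math


def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))


def forward(p, x):
    """Return the hidden activations and the prediction for parameters p."""
    a = [sigmoid(p["w"][j][0] * x[0] + p["w"][j][1] * x[1] + p["b"][j]) for j in range(2)]
    y_hat = sigmoid(p["v"][0] * a[0] + p["v"][1] * a[1] + p["b3"])
    return a, y_hat


def loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2


def train_step(p, x, y, eta):
    """One gradient-descent step; returns a new parameter dictionary."""
    a, y_hat = forward(p, x)
    delta3 = (y_hat - y) * y_hat * (1 - y_hat)
    # Hidden error terms use the output weights from before the update.
    delta_h = [delta3 * p["v"][j] * a[j] * (1 - a[j]) for j in range(2)]
    return {
        "w": [[p["w"][j][k] - eta * delta_h[j] * x[k] for k in range(2)] for j in range(2)],
        "b": [p["b"][j] - eta * delta_h[j] for j in range(2)],
        "v": [p["v"][j] - eta * delta3 * a[j] for j in range(2)],
        "b3": p["b3"] - eta * delta3,
    }


# Assumed initial parameters, inferred from the weight-update step.
params = {"w": [[0.10, 0.20], [-0.10, 0.10]], "b": [0.0, 0.0], "v": [0.10, 0.10], "b3": 0.0}
x, y, eta = [1, 0], 1, 0.5

_, y_before = forward(params, x)
new_params = train_step(params, x, y, eta)
_, y_after = forward(new_params, x)

print(f"before: y_hat={y_before:.6f}  loss={loss(y, y_before):.6f}")
print(f"after:  y_hat={y_after:.6f}  loss={loss(y, y_after):.6f}")
```

Calling `train_step` repeatedly, and over all four XOR inputs rather than just $(1,0)$, is what an actual training loop would do; a single step on one example only shows the direction of improvement.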
