# Implementing a Multilayer Perceptron


## The sigmoid function

The sigmoid function
$$
\sigma(z) = \frac{1}{1+e^{-z}}
$$ 
is a smooth function that behaves much like the step function. For large positive values of $z$, $\sigma(z) \rightarrow 1$, and for large negative values of $z$, $\sigma(z) \rightarrow 0$. In between, $\sigma(z)$ changes gradually instead of jumping as the step function does and, more importantly, it can be differentiated:
$$
\begin{align*}
\frac{d}{dz}\sigma(z) = \sigma(z)\left(1-\sigma(z)\right)
\end{align*}
$$

The derivative above is obtained with the chain rule for $\sigma(z) = f(u(z))$ where $u(z) = 1+e^{-z}$ and $f(u) = u^{-1}$:

$$
\begin{align*}
\frac{d}{dz}\sigma(z) & = \frac{d}{dz} f(u(z)) \\
   & = \frac{df}{du}\frac{du}{dz} \\
\end{align*}
$$
with $df/du = -u^{-2}$ and $du/dz = -e^{-z}$, so that

$$
\begin{align*}
  \frac{df}{du}\frac{du}{dz} & = -(1+e^{-z})^{-2} (-e^{-z}) \\
                             & = \frac{e^{-z}}{(1+e^{-z})^{2}} \\
                             & = \left(\frac{1}{1+e^{-z}}\right) \left(\frac{e^{-z}}{1+e^{-z}}\right) \\
                             &  = \left(\frac{1}{1+e^{-z}}\right)   \left(\frac{1+e^{-z}-1}{1+e^{-z}}\right) \\
                            &  = \left(\frac{1}{1+e^{-z}}\right)   \left(\frac{1+e^{-z}}{1+e^{-z}}-\frac{1}{1+e^{-z}}\right) \\
                            & =  \left(\frac{1}{1+e^{-z}}\right)   \left(1-\frac{1}{1+e^{-z}}\right) \\ 
                            & = \sigma(z)(1-\sigma(z))
\end{align*}
$$
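
As a quick sanity check of this result (a worked value added for illustration, not part of the derivation itself), evaluate both expressions at $z = 0$:

$$
\begin{align*}
\sigma(0) & = \frac{1}{1+e^{0}} = \frac{1}{2} \\
\left.\frac{d}{dz}\sigma(z)\right|_{z=0} & = \sigma(0)\left(1-\sigma(0)\right) = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}
\end{align*}
$$

This is also the largest value the derivative can take, because $s(1-s)$ is maximized at $s = 1/2$.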

In [None]:
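import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, 1 / (1 + e^(-z))."""
    # Branch on the sign of z so that math.exp never receives a large
    # positive argument, which would raise OverflowError.
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)

def sigmoid_derivative(z: float) -> float:
    """Derivative of the sigmoid, sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)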
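
One way to gain confidence in the closed-form derivative is to compare it with a numerical estimate. The sketch below redefines `sigmoid` and `sigmoid_derivative` so that it runs on its own; the helper `central_difference`, the step size `h`, and the sample points are illustrative assumptions, not part of the original notebook.

```python
import math

def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(z: float) -> float:
    s = sigmoid(z)
    return s * (1 - s)

def central_difference(f, z: float, h: float = 1e-5) -> float:
    """Hypothetical helper: estimate f'(z) as (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Compare the analytic derivative with the numerical estimate at a few points.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    analytic = sigmoid_derivative(z)
    numeric = central_difference(sigmoid, z)
    print(f"z = {z:+.1f}   analytic = {analytic:.6f}   numeric = {numeric:.6f}")
```

If the two columns agree to several decimal places, the formula $\sigma(z)(1-\sigma(z))$ is very likely implemented correctly.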

![](/workspaces/comp-460-f25-week-05/images/XOR.png)

$$
\begin{align*}
  a^2_1 & = w^2_{11} x_1 + w^2_{12} x_2 + b^2_1 = \begin{bmatrix} w^2_{11} & w^2_{12}\end{bmatrix} \cdot \begin{bmatrix}x_1 \\ x_2\end{bmatrix} + b^2_1 = \mathbf w^2_1\cdot\mathbf x + b^2_1\\
  a^2_2 & = w^2_{21} x_1 + w^2_{22} x_2 + b^2_2  = \begin{bmatrix} w^2_{21} & w^2_{22}\end{bmatrix} \cdot \begin{bmatrix}x_1 \\ x_2\end{bmatrix} + b^2_2 = \mathbf w^2_2\cdot\mathbf x + b^2_2
\end{align*}
$$

$$
\begin{align*}
  \begin{bmatrix}a^2_1 \\ a^2_2 \end{bmatrix} & =
  \begin{bmatrix}\mathbf w^2_1 \\ \mathbf w^2_2 \end{bmatrix} \cdot\mathbf x + \begin{bmatrix}b^2_1 \\ b^2_2 \end{bmatrix} \Rightarrow \\
  \mathbf a^2 & = \mathbf w^2 \cdot \mathbf x + \mathbf b^2
\end{align*}
$$
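
The equations above can be sketched in plain Python to show that the neuron-by-neuron form and the matrix form give the same numbers. The weights, biases, and input below are made-up values chosen only for illustration, and the helpers `dot` and `affine_layer` are hypothetical names, not part of the original notebook.

```python
# Hypothetical parameters for the two neurons of layer 2 (illustrative values only).
W2 = [[0.5, -0.25],   # w^2_11, w^2_12
      [1.0, 0.75]]    # w^2_21, w^2_22
b2 = [0.1, -0.2]      # b^2_1, b^2_2
x = [1.0, 0.0]        # x_1, x_2

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def affine_layer(W, x, b):
    """Compute W . x + b, one row of W (one neuron) at a time."""
    return [dot(row, x) + bias for row, bias in zip(W, b)]

# Neuron by neuron, following the first pair of equations.
a2_1 = W2[0][0] * x[0] + W2[0][1] * x[1] + b2[0]
a2_2 = W2[1][0] * x[0] + W2[1][1] * x[1] + b2[1]

# Both neurons at once, following the matrix form.
a2 = affine_layer(W2, x, b2)

print(a2_1, a2_2)
print(a2)
```

Writing the layer as a single function of `W`, `x`, and `b` means the same code works for any number of neurons and inputs, which is the practical benefit of the matrix notation.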

# Your assignment



# Reading

* **PDF:** [An algorithm for the machine calculation of complex Fourier series](https://www.ams.org/mcom/1965-19-090/S0025-5718-1965-0178586-1/S0025-5718-1965-0178586-1.pdf): the original paper on FFT by Cooley and Tukey.

* **PDF:** [The Design and Implementation of FFTW3](http://fftw.org/fftw-paper-ieee.pdf) discusses why faster algorithms sometimes slow down a bit.

* **PDF:** [FFT material](https://jeffe.cs.illinois.edu/teaching/algorithms/notes/A-fft.pdf) from Jeff Erickson's book. As much as I like Jeff's book, I think this chapter is a bit dense or scattered. With patience, you may find good information but it's not an easy read.