# Neural Network Theory

The basic idea behind neural networks is to first set up the architecture
and then randomly initialize all the weights. We then feed data through the model to make predictions $\hat{y}$,
compare these to the actual values $y$, and measure how bad the
predictions are using a cost function $J$.

We then use the "backpropagation" algorithm to update the network weights so as to
minimize the cost (error).

# Neural Network Equations

$$
z^{(2)} = XW^{(1)} \tag{1}
$$
$$
a^{(2)} = f(z^{(2)}) \tag{2}
$$
$$
z^{(3)} = a^{(2)}W^{(2)} \tag{3}
$$
$$
\hat{y} = f(z^{(3)}) \tag{4}
$$
$$
J = \sum \frac{1}{2}(y-\hat{y})^2 \tag{5}
$$
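
Equations (1) to (5) can be sketched in NumPy as below. This is only an illustration under stated assumptions: the text does not say which activation $f$ is used, so a sigmoid is assumed, and the layer sizes (2 inputs, 3 hidden units, 1 output), the random data, and the function names `sigmoid`, `forward`, and `cost` are all made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation f; the text leaves f unspecified."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    """Forward pass following equations (1)-(4)."""
    z2 = X @ W1          # (1)
    a2 = sigmoid(z2)     # (2)
    z3 = a2 @ W2         # (3)
    y_hat = sigmoid(z3)  # (4)
    return z2, a2, z3, y_hat

def cost(y, y_hat):
    """Sum-of-squares cost, equation (5)."""
    return 0.5 * np.sum((y - y_hat) ** 2)

# Hypothetical data: 4 examples, 2 input features, 1 target value each.
rng = np.random.default_rng(0)
X = rng.random((4, 2))
y = rng.random((4, 1))

# Randomly initialized weights for an assumed 2-3-1 architecture.
W1 = rng.standard_normal((2, 3))
W2 = rng.standard_normal((3, 1))

z2, a2, z3, y_hat = forward(X, W1, W2)
J = cost(y, y_hat)
print("predictions shape:", y_hat.shape)
print("cost J:", J)
```

Each row of `X` is one example, so the matrix products process all examples at once.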

# Backpropagation
Backpropagation asks the question: "If each weight changes a little bit, how does the cost change?"
$$
\frac{\partial J}{\partial W^{(2)}} =
\frac{\partial \sum \frac{1}{2}(y-\hat{y})^2}{\partial W^{(2)}}=
-(y-\hat{y}) \frac{\partial \hat{y}}{\partial W^{(2)}}=
-(y-\hat{y}) f^\prime(z^{(3)}) \frac{\partial z^{(3)}}{\partial W^{(2)}}=
(a^{(2)})^T\delta^{(3)}\tag{6}
$$
$$
\delta^{(3)} = -(y-\hat{y}) f^\prime(z^{(3)})
$$
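
One way to sanity-check equation (6) is to compare it with a finite-difference estimate, which perturbs each entry of $W^{(2)}$ slightly and measures the change in $J$. The sketch below assumes a sigmoid for $f$ (the text does not specify it), a made-up 2-3-1 architecture with random data, and hypothetical helper names `cost` and `grad_W2`.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation f."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the assumed sigmoid activation."""
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(X, y, W1, W2):
    """Forward pass (1)-(4) followed by the cost (5)."""
    a2 = sigmoid(X @ W1)
    y_hat = sigmoid(a2 @ W2)
    return 0.5 * np.sum((y - y_hat) ** 2)

def grad_W2(X, y, W1, W2):
    """Analytic gradient dJ/dW2 from equation (6)."""
    a2 = sigmoid(X @ W1)
    z3 = a2 @ W2
    y_hat = sigmoid(z3)
    delta3 = -(y - y_hat) * sigmoid_prime(z3)  # element-wise product
    return a2.T @ delta3                       # matrix product sums over examples

# Hypothetical data and weights.
rng = np.random.default_rng(0)
X = rng.random((4, 2))
y = rng.random((4, 1))
W1 = rng.standard_normal((2, 3))
W2 = rng.standard_normal((3, 1))

analytic = grad_W2(X, y, W1, W2)

# Central finite differences: nudge one weight at a time.
eps = 1e-5
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        step = np.zeros_like(W2)
        step[i, j] = eps
        numeric[i, j] = (cost(X, y, W1, W2 + step)
                         - cost(X, y, W1, W2 - step)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print("largest difference between analytic and numeric gradient:", max_diff)
```

If the derivation is implemented correctly, the two gradients should agree up to small numerical error.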

$$
\frac{\partial J}{\partial W^{(1)}} = -(y-\hat{y})
\frac{\partial \hat{y}}{\partial W^{(1)}}
$$

$$
\frac{\partial J}{\partial W^{(1)}} = -(y-\hat{y})
\frac{\partial \hat{y}}{\partial z^{(3)}}
\frac{\partial z^{(3)}}{\partial W^{(1)}}
$$

$$
\frac{\partial J}{\partial W^{(1)}} = -(y-\hat{y}) f^\prime(z^{(3)}) \frac{\partial z^{(3)}}{\partial W^{(1)}}
$$

$$
\frac{\partial z^{(3)}}{\partial W^{(1)}} = \frac{\partial z^{(3)}}{\partial a^{(2)}}\frac{\partial a^{(2)}}{\partial W^{(1)}}
$$

There is still a linear relationship along each synapse, but now we are interested in the rate of change of $z^{(3)}$ with respect to $a^{(2)}$. Here the slope is simply the weight value for that synapse, so we can propagate the error backward by multiplying by $(W^{(2)})^{T}$.

$$
\frac{\partial J}{\partial W^{(1)}} = \delta^{(3)}
(W^{(2)})^{T}
\frac{\partial a^{(2)}}{\partial W^{(1)}}
$$

$$
\frac{\partial J}{\partial W^{(1)}} = \delta^{(3)}
(W^{(2)})^{T}
\frac{\partial a^{(2)}}{\partial z^{(2)}}
\frac{\partial z^{(2)}}{\partial W^{(1)}}
$$

$$
\frac{\partial J}{\partial W^{(1)}} = \delta^{(3)}
(W^{(2)})^{T}
f^\prime(z^{(2)})
\frac{\partial z^{(2)}}{\partial W^{(1)}}
$$

Our final computation is $\frac{\partial z^{(2)}}{\partial W^{(1)}}$. This is very similar to the $\frac{\partial z^{(3)}}{\partial W^{(2)}}$ computation: there is a simple linear relationship along the synapses between $z^{(2)}$ and $W^{(1)}$, but in this case the slope is the input value $X$. We can use the same technique as before and multiply by $X^{T}$, which applies the derivative and sums the $\frac{\partial J}{\partial W^{(1)}}$ contributions across all examples.

$$
\frac{\partial J}{\partial W^{(1)}} =
X^{T}
\delta^{(3)}
(W^{(2)})^{T}
f^\prime(z^{(2)})
$$
or
$$
\frac{\partial J}{\partial W^{(1)}} =
X^{T}\delta^{(2)} \tag{7}
$$
where (with the multiplication by $f^\prime(z^{(2)})$ taken element-wise)
$$
\delta^{(2)} = \delta^{(3)}
(W^{(2)})^{T}
f^\prime(z^{(2)})
$$
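
Putting equations (6) and (7) together, a minimal sketch of the full gradient computation followed by plain gradient descent could look like this. Several things here are assumptions rather than part of the derivation: the sigmoid activation, the 2-3-1 architecture, the random data, the function name `cost_and_grads`, and the learning rate and iteration count, which were chosen arbitrarily for illustration.

```python
import numpy as np

def sigmoid(z):
    """Assumed activation f."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the assumed sigmoid activation."""
    s = sigmoid(z)
    return s * (1.0 - s)

def cost_and_grads(X, y, W1, W2):
    """Return J and both gradients using equations (1)-(7)."""
    # Forward pass.
    z2 = X @ W1
    a2 = sigmoid(z2)
    z3 = a2 @ W2
    y_hat = sigmoid(z3)
    J = 0.5 * np.sum((y - y_hat) ** 2)

    # Backward pass.
    delta3 = -(y - y_hat) * sigmoid_prime(z3)     # element-wise
    dJdW2 = a2.T @ delta3                         # equation (6)
    delta2 = (delta3 @ W2.T) * sigmoid_prime(z2)  # element-wise with f'(z2)
    dJdW1 = X.T @ delta2                          # equation (7)
    return J, dJdW1, dJdW2

# Hypothetical data and randomly initialized weights.
rng = np.random.default_rng(0)
X = rng.random((4, 2))
y = rng.random((4, 1))
W1 = rng.standard_normal((2, 3))
W2 = rng.standard_normal((3, 1))

learning_rate = 0.1  # arbitrary illustrative value
costs = []
for _ in range(500):
    J, dJdW1, dJdW2 = cost_and_grads(X, y, W1, W2)
    costs.append(J)
    # Step each weight matrix against its gradient to reduce the cost.
    W1 -= learning_rate * dJdW1
    W2 -= learning_rate * dJdW2

print("cost before training:", costs[0])
print("cost after training: ", costs[-1])
```

Note the shapes: `dJdW1` matches `W1` and `dJdW2` matches `W2`, which is what makes the update step well defined.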