## Backpropagation Derivation

We will use the following notation in the derivation.

- $w_{ij}^l$ : weight between neuron $i$ at the $(l-1)^{th}$ layer and neuron $j$ at the $l^{th}$ layer
- $x_i$      : $i^{th}$ input variable
- $z_i^l$    : input of neuron $i$ at layer $l$
- $a_i^l$    : output of neuron $i$ at layer $l$
- $b_i^l$    : bias for neuron $i$ at layer $l$
- $\hat{y}$    : predicted output of the network

![Illustration of a fully connected neural network architecture](ANN.png)



Let's start with the feed-forward equations that predict the output $\hat{y}$.

$$z_1^1 = x_1*w_{11}^1 + x_2*w_{21}^1 + b_1^1$$
$$z_2^1 = x_1*w_{12}^1 + x_2*w_{22}^1 + b_2^1$$
$$z_3^1 = x_1*w_{13}^1 + x_2*w_{23}^1 + b_3^1$$
<br>
$$a_1^1 = g(z_1^1)$$
$$a_2^1 = g(z_2^1)$$
$$a_3^1 = g(z_3^1)$$
<br>

$$z_1^2 = a_1^1*w_{11}^2 + a_2^1*w_{21}^2 + a_3^1*w_{31}^2 + b_1^2$$
$$z_2^2 = a_1^1*w_{12}^2 + a_2^1*w_{22}^2 + a_3^1*w_{32}^2 + b_2^2$$

<br>
$$a_1^2 = g(z_1^2)$$
$$a_2^2 = g(z_2^2)$$

<br>

$$z_1^3 = a_1^2*w_{11}^3 + a_2^2*w_{21}^3 + b_1^3$$

<br>

$$\hat{y} = f(z_1^3)$$

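Here $g$ is the sigmoid activation used in the hidden layers and $f$ is the identity activation used at the output:

$$g(x) = \frac{1}{1 + e^{-x}}$$
$$f(x) = x$$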
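
The derivation that follows uses the derivatives $g'$ and $f'$ without writing them out. For the sigmoid and identity activations, a short worked step gives:

$$g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = g(x)\left(1 - g(x)\right)$$
$$f'(x) = 1$$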

We can write these equations in matrix form for all observations at once, where each row of $X$ is one observation and $W^l$ is the matrix whose $(i, j)$ entry is $w_{ij}^l$:

$$Z^1 = X*W^1 + b^1$$
$$A^1 = g(Z^1)$$
$$Z^2 = A^1*W^2 + b^2$$
$$A^2 = g(Z^2)$$
$$Z^3 = A^2*W^3 + b^3$$
$$\hat{y} = f(Z^3)$$
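
The matrix form maps almost line for line onto NumPy. The following is a minimal sketch, assuming the 2-3-2-1 architecture from the scalar equations; the function name `forward`, the random parameter values, and the sample matrix `X` are illustrative assumptions rather than part of the derivation.

```python
import numpy as np

def g(x):
    """Sigmoid activation used in the hidden layers."""
    return 1.0 / (1.0 + np.exp(-x))

def f(x):
    """Identity activation used at the output."""
    return x

def forward(X, params):
    """Forward pass; X holds one observation per row."""
    W1, b1, W2, b2, W3, b3 = params
    Z1 = X @ W1 + b1      # shape (n, 3)
    A1 = g(Z1)
    Z2 = A1 @ W2 + b2     # shape (n, 2)
    A2 = g(Z2)
    Z3 = A2 @ W3 + b3     # shape (n, 1)
    y_hat = f(Z3)
    return Z1, A1, Z2, A2, Z3, y_hat

# Hypothetical parameters: W^l[i-1, j-1] stores w_ij^l, biases are row vectors.
rng = np.random.default_rng(0)
params = (
    rng.normal(size=(2, 3)), rng.normal(size=(1, 3)),
    rng.normal(size=(3, 2)), rng.normal(size=(1, 2)),
    rng.normal(size=(2, 1)), rng.normal(size=(1, 1)),
)

X = np.array([[0.5, -1.0],
              [2.0, 0.3]])   # two made-up observations
Z1, A1, Z2, A2, Z3, y_hat = forward(X, params)
print(y_hat.shape)  # (2, 1)
```

Storing the bias as a row vector lets NumPy broadcasting add it to every observation, which is what the `+ b^l` in the matrix equations implies.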

We define the loss function for a single data point with target $y$ as
$$L = \frac{1}{2}(y-\hat{y})^2$$

Let's find the derivatives of the loss with respect to the weights, starting with a weight in the last layer.

$$\frac{\partial L}{\partial w_{21}^3} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_1^3} \frac{\partial z_1^3}{\partial w_{21}^3}  = (\hat{y} - y)f'(z_1^3)a_2^2$$
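
As a sanity check, this expression can be compared against a finite-difference estimate of the same derivative. The sketch below isolates the last layer and uses made-up values for $a^2$, $w^3$, $b_1^3$ and $y$; all of these numbers are assumptions chosen only for illustration.

```python
import numpy as np

a2 = np.array([0.3, 0.7])     # a_1^2, a_2^2 (assumed values)
w3 = np.array([0.5, -1.2])    # w_11^3, w_21^3 (assumed values)
b3 = 0.1                      # b_1^3 (assumed value)
y = 1.0                       # target (assumed value)

def loss(w):
    y_hat = a2 @ w + b3       # f is the identity, so y_hat = z_1^3
    return 0.5 * (y - y_hat) ** 2

# Analytic derivative: (y_hat - y) * f'(z_1^3) * a_2^2, with f' = 1
y_hat = a2 @ w3 + b3
analytic = (y_hat - y) * 1.0 * a2[1]

# Central finite difference with respect to w_21^3 only
eps = 1e-6
step = np.array([0.0, eps])
numeric = (loss(w3 + step) - loss(w3 - step)) / (2 * eps)

print(analytic, numeric)      # both are approximately -1.113
```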

We define the error term of node $i$ at layer $l$, denoted $\delta_i^l$, as

$$\delta_i^l = \frac{\partial L}{\partial z_i^l}$$

So the derivative with respect to a weight in the last layer can be written as

$$\frac{\partial L}{\partial w_{ij}^{l_{last}}} = \delta_j^{l_{last}} a_i^{l_{last}-1} $$ 

Moving one layer back, the derivative with respect to the second-layer weight $w_{32}^2$ is

$$\frac{\partial L}{\partial w_{32}^2} =\underbrace{ \overbrace{\frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_1^3}}^{\delta_1^3} \frac{\partial z_1^3}{\partial a_2^2} \frac{\partial a_2^2}{\partial z_2^2}}_{\delta_2^2} \frac{\partial z_2^2}{\partial w_{32}^2}  = \underbrace{\overbrace{(\hat{y} - y)f'(z_1^3)}^{\delta_1^3}w_{21}^3g'(z_2^2)}_{\delta_2^2}a_3^1$$

For a first-layer weight such as $w_{23}^1$, the output $a_3^1$ feeds both second-layer neurons, so the chain rule sums over both paths:



\begin{equation} 
\begin{split}
\frac{\partial L}{\partial w_{23}^1} & = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_1^3} \left( \frac{\partial z_1^3}{\partial a_1^2} \frac{\partial a_1^2}{\partial z_1^2} \frac{\partial z_1^2}{\partial a_3^1} + \frac{\partial z_1^3}{\partial a_2^2} \frac{\partial a_2^2}{\partial z_2^2} \frac{\partial z_2^2}{\partial a_3^1}\right) \frac{\partial a_3^1}{\partial z_3^1} \frac{\partial z_3^1}{\partial w_{23}^1} \\
     & = (\hat{y} - y)f'(z_1^3)\left(w_{11}^3g'(z_1^2)w_{31}^2 + w_{21}^3g'(z_2^2)w_{32}^2 \right)g'(z_3^1)x_2 \\
     & = \overbrace{\left(\underbrace{(\hat{y} - y)f'(z_1^3)w_{11}^3g'(z_1^2)}_{\delta_1^2}w_{31}^2g'(z_3^1) + \underbrace{(\hat{y} - y)f'(z_1^3)w_{21}^3g'(z_2^2)}_{\delta_2^2}w_{32}^2g'(z_3^1)\right)}^{\delta_3^1}x_2
\end{split}
\end{equation}



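Reading the error term off the last line, and noting that the same pattern holds for any hidden node, we get

$${\delta_3^1} = \sum_i\delta_i^2w_{3i}^2g'(z_3^1)$$
$${\delta_i^l} = \sum_k\delta_k^{l+1}w_{ik}^{l+1}g'(z_i^l)$$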



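Putting everything together, the gradient with respect to any weight is

$$\frac{\partial L}{\partial w_{ij}^{l}} = \delta_j^{l} a_i^{l-1} $$

where $a_i^0 = x_i$ and the error terms are computed backwards from the last layer:

$$\delta_1^{l_{last}} = \frac{\partial L}{\partial \hat{y}}f'(z_1^{l_{last}})$$
$${\delta_i^l} = \sum_k\delta_k^{l+1}w_{ik}^{l+1}g'(z_i^l)$$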
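
These recursions can be turned into a short NumPy routine for a single data point. The following is a sketch under stated assumptions: the names `forward`, `backward` and `loss`, the random parameters, and the sample point are all hypothetical. The bias gradient $\partial L / \partial b_j^l = \delta_j^l$ is not derived in the text; it follows from the same chain rule because $\partial z_j^l / \partial b_j^l = 1$.

```python
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

def forward(x, Ws, bs):
    """Return the lists of z^l and a^l for one data point; acts[0] is x."""
    a, zs, acts = x, [], [x]
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = a @ W + b
        a = z if l == len(Ws) - 1 else g(z)   # identity f at the last layer
        zs.append(z)
        acts.append(a)
    return zs, acts

def loss(x, y, Ws, bs):
    _, acts = forward(x, Ws, bs)
    return 0.5 * (y - acts[-1][0]) ** 2

def backward(x, y, Ws, bs):
    """Gradients of the loss with respect to every weight and bias."""
    zs, acts = forward(x, Ws, bs)
    delta = (acts[-1] - y) * 1.0              # last layer: dL/dy_hat * f'(z), f' = 1
    grads_W = [None] * len(Ws)
    grads_b = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        # Ws[l] holds W^{l+1}; entry (i, j) of its gradient is a_i^l * delta_j^{l+1}
        grads_W[l] = np.outer(acts[l], delta)
        grads_b[l] = delta
        if l > 0:
            # delta_i^l = sum_k delta_k^{l+1} w_ik^{l+1} g'(z_i^l)
            delta = (Ws[l] @ delta) * g_prime(zs[l - 1])
    return grads_W, grads_b

# Hypothetical 2-3-2-1 network and data point
rng = np.random.default_rng(0)
sizes = [2, 3, 2, 1]
Ws = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=n) for n in sizes[1:]]
x, y = np.array([0.5, -1.0]), 0.8

grads_W, grads_b = backward(x, y, Ws, bs)

# Finite-difference check of dL/dw_23^1, stored at row 1, column 2 of W^1
eps = 1e-6
Ws_plus = [W.copy() for W in Ws]
Ws_plus[0][1, 2] += eps
Ws_minus = [W.copy() for W in Ws]
Ws_minus[0][1, 2] -= eps
numeric = (loss(x, y, Ws_plus, bs) - loss(x, y, Ws_minus, bs)) / (2 * eps)
print(grads_W[0][1, 2], numeric)   # the two values should agree closely
```

The loop runs over the layers in reverse so that each $\delta^l$ is computed exactly once and reused, which is the main saving over expanding the chain rule separately for every weight.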

