# Back Propagation
Back propagation is essentially iterating through every weight in the network and adjusting it to reduce the error. After forward propagation gives us a predicted score, we use a cost function to calculate the difference between our actual values ($ y $) and our fitted values ($ \hat{y} $). Based on the summed errors, we perform gradient descent and adjust the weights in the network accordingly.

### Gradient Descent

$ \dfrac{\delta E_{total}}{\delta W} $

Our cost function is the sum of squared errors. While this gives us the difference between our actual and fitted values, we still need to figure out how to minimize the error. One solution is to estimate the minimum numerically: evaluate the cost at many candidate weight values and keep the one with the smallest cost. The problem is that, as your neural network becomes more complex and the cost surface gains more dimensions, iterating to find the smallest value is slow. It is also prone to the non-convex problem, where we accidentally land in a local minimum instead of the global one. However, this problem can be easily optimized...
$$ J = \sum\limits_{} \dfrac{1}{2} (y - \hat{y})^2 $$
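
To make the cost function and the brute-force idea concrete, here is a minimal sketch in NumPy. The arrays `y`, `y_hat`, `x`, and `targets`, and the one-weight linear model `x * w`, are hypothetical stand-ins chosen for illustration; they are not the network discussed in these notes.

```python
import numpy as np

def cost(y, y_hat):
    """Sum of squared errors scaled by 1/2: J = sum(0.5 * (y - y_hat)^2)."""
    return 0.5 * np.sum((y - y_hat) ** 2)

# Hypothetical actual values and fitted values for three examples.
y = np.array([[0.75], [0.82], [0.93]])
y_hat = np.array([[0.70], [0.90], [0.90]])
J = cost(y, y_hat)

# Brute-force search over ONE weight of a hypothetical model y_hat = x * w.
# Every candidate weight costs one full evaluation of the cost function.
x = np.array([1.0, 2.0, 3.0])
targets = np.array([2.0, 4.0, 6.0])
candidates = np.linspace(-5.0, 5.0, 1001)
costs = np.array([cost(targets, x * w) for w in candidates])
best_w = candidates[np.argmin(costs)]

print("J =", J)
print("best single weight found by search:", best_w)
```

With 1001 candidates per weight, a model with $ k $ weights would need $ 1001^k $ evaluations on the same grid, which is why searching like this stops being practical very quickly.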

Taking the partial derivative of the cost function gives us the rate of change of J with respect to W (the weights). That slope tells us which direction is downhill, so by stepping in the direction of the negative slope we move toward a minimum of our cost function without evaluating every candidate weight.
$$ \dfrac{\delta J}{\delta W_{(2)}} = \dfrac{\delta \sum\limits_{} \dfrac{1}{2} (y - \hat{y})^2}{\delta W_{(2)}}$$

So, we don't exactly want to take the partial derivative of a summation. By the sum rule in differentiation, the derivative of a sum is the sum of the derivatives, so we can move the summation outside the partial derivative.
$$ \dfrac{\delta J}{\delta W_{(2)}} = \sum\limits_{} \dfrac{\delta \dfrac{1}{2} (y - \hat{y})^2}{\delta W_{(2)}}$$

Let's drop the summation for now and work with a single example. The per-example terms get added back together later.
$$ \dfrac{\delta J}{\delta W_{(2)}} = \dfrac{\delta \dfrac{1}{2} (y - \hat{y})^2}{\delta W_{(2)}}$$

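Power rule to get rid of the 1/2: the exponent 2 comes down and cancels it, leaving $ (y - \hat{y}) $ times the derivative of what was inside the square.
$$ \dfrac{\delta J}{\delta W_{(2)}} = (y - \hat{y}) \dfrac{\delta (y - \hat{y})}{\delta W_{(2)}}$$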

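Now, let's perform the chain rule. $ y $ holds our actual outputs; it does not depend on the weights, so it is a constant and its derivative is zero. $ \hat{y} $ is our fitted equation, built from our activation function, weights, and inputs, so differentiating $ -\hat{y} $ leaves a factor of $ -\dfrac{\delta \hat{y}}{\delta W_{(2)}} $.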
$$ \dfrac{\delta J}{\delta W_{(2)}} = (y - \hat{y}) * -(\dfrac{\delta \hat{y}}{\delta W_{(2)}}) $$

$$\dfrac{\delta J}{\delta W_{(2)}} = -(y - \hat{y})(\dfrac{\delta \hat{y}}{\delta W_{(2)}})$$

Since $ \hat{y} $ is our activation function f applied to the weighted sum $ z_{(3)} $, that is $ \hat{y} = f(z_{(3)}) $, we need to perform the chain rule again and break down $ \dfrac{\delta \hat{y}}{\delta W_{(2)}} $.
$$ \dfrac{\delta J}{\delta W_{(2)}} = -(y - \hat{y}) \dfrac{\delta \hat{y}}{\delta z_{(3)}} \dfrac{\delta z_{(3)}}{\delta W_{(2)}}$$

To find the rate of change of $\hat{y}$ with respect to $ z_{(3)} $, we need to differentiate the activation function. In our case, that is the sigmoid function.
$$ f(z) = \dfrac{1}{1 + e^{-z}} $$
$$ f'(z) = \dfrac{e^{-z}}{(1+e^{-z})^2} $$

$$ \dfrac{\delta \hat{y}}{\delta z_{(3)}} = f'(z_{(3)}) = \dfrac{e^{-z_{(3)}}}{(1+e^{-z_{(3)}})^2} $$
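
A quick way to sanity-check the sigmoid derivative is to compare it with a finite-difference slope. The sketch below assumes NumPy; the function names `sigmoid` and `sigmoid_prime` and the sample points in `z` are illustrative choices, not something fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: f'(z) = e^(-z) / (1 + e^(-z))^2."""
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

# Hypothetical sample points at which to compare the two slopes.
z = np.array([-2.0, 0.0, 2.0])

# Central finite difference: (f(z + eps) - f(z - eps)) / (2 * eps).
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_prime(z)

print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two rows agree to several decimal places, the formula for $ f'(z) $ is behaving as expected.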

$$ \dfrac{\delta J}{\delta W_{(2)}} = -(y - \hat{y}) f'(z_{(3)}) \dfrac{\delta z_{(3)}}{\delta W_{(2)}}$$

In the last part of the equation, $ \dfrac{\delta z_{(3)}}{\delta W_{(2)}} $ represents the change of $ z_{(3)} $ (the input to the output layer) with respect to the weights from the second layer. This is the activity of each synapse. In other words, $ z_{(3)} = a_{(2)}W_{(2)} $, the total of all outputs from the hidden layer times the weights attached to those outputs. Because this relationship is linear in $ W_{(2)} $, $ a_{(2)} $ is the slope: $ \dfrac{\delta z_{(3)}}{\delta W_{(2)}} = a_{(2)} $. The summation we dropped earlier is taken care of by the matrix multiplication further down.

Backpropagating error, with one row per example and the two column vectors multiplied element by element. Here the subscripts 1 to 3 index examples, while $ \delta_{(3)} $ is the whole error vector for layer 3:

$$
-\begin{bmatrix}
    y_1 - \hat{y}_1 \\
    y_2 - \hat{y}_2 \\
    y_3 - \hat{y}_3 \\
\end{bmatrix}
\odot
\begin{bmatrix}
    f'(z_{(3)})_1 \\
    f'(z_{(3)})_2 \\
    f'(z_{(3)})_3 \\
\end{bmatrix}
=
\begin{bmatrix}
    \delta_1 \\
    \delta_2 \\
    \delta_3 \\
\end{bmatrix}
= \delta_{(3)}
$$

For a single example $ i $:
$$\delta_i = -(y_i - \hat{y}_i) * f'(z_{(3)})_i $$
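
The backpropagating error is an element-wise product, which NumPy expresses with `*` on arrays of the same shape. This is a minimal sketch under assumed values: the pre-activations in `z3` and the actual values in `y` are hypothetical, and the name `delta3` is just a label for the layer-3 error vector.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: f'(z) = e^(-z) / (1 + e^(-z))^2."""
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

# Hypothetical output-layer inputs (one row per example) and actual values.
z3 = np.array([[0.2], [-0.4], [1.1]])
y = np.array([[0.75], [0.82], [0.93]])

y_hat = sigmoid(z3)  # fitted values, y_hat = f(z3)

# Element-wise product: one error term per example.
delta3 = -(y - y_hat) * sigmoid_prime(z3)

print(delta3)
```

Each row of `delta3` is the error for one example scaled by how steep the activation function is at that example's $ z_{(3)} $.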

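By multiplying our backpropagating error by $ \dfrac{\delta z_{(3)}}{\delta W_{(2)}} $, which is $ a_{(2)} $, transposed, we not only get a result with the same shape as $ W_{(2)} $ (hidden layer size by output size), but the matrix multiplication also performs our summation over the examples for us.
$$ \dfrac{\delta J}{\delta W_{(2)}} =  (a_{(2)})^T \delta_{(3)}$$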

Here $ a_{ij}^{(2)} $ is the activity of hidden unit $ j $ for example $ i $, and each row of the result sums over the three examples:

$$
\begin{bmatrix}
    a_{11}^{(2)} & a_{21}^{(2)} & a_{31}^{(2)} \\
    a_{12}^{(2)} & a_{22}^{(2)} & a_{32}^{(2)} \\
    a_{13}^{(2)} & a_{23}^{(2)} & a_{33}^{(2)}
\end{bmatrix}
\begin{bmatrix}
    \delta_1 \\
    \delta_2 \\
    \delta_3 \\
\end{bmatrix}
=
\begin{bmatrix}
    a_{11}^{(2)}\delta_1 + a_{21}^{(2)}\delta_2 + a_{31}^{(2)}\delta_3 \\
    a_{12}^{(2)}\delta_1 + a_{22}^{(2)}\delta_2 + a_{32}^{(2)}\delta_3 \\
    a_{13}^{(2)}\delta_1 + a_{23}^{(2)}\delta_2 + a_{33}^{(2)}\delta_3
\end{bmatrix}
$$
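
Putting the pieces together, the gradient with respect to the second-layer weights can be computed and checked against a finite-difference estimate. This is a sketch under stated assumptions rather than a definitive implementation: the inputs `X`, the first-layer weights `W1`, the hidden-layer step `a2 = sigmoid(X @ W1)`, the layer sizes, and `learning_rate` are all hypothetical choices made so the example runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

def forward(X, W1, W2):
    """Forward pass; returns hidden activity, output-layer input, fitted values."""
    a2 = sigmoid(X @ W1)   # assumed hidden layer: a2 = f(X W1)
    z3 = a2 @ W2           # z3 = a2 W2, as in the text
    return a2, z3, sigmoid(z3)

def cost(X, y, W1, W2):
    """J = sum(0.5 * (y - y_hat)^2) for the given weights."""
    y_hat = forward(X, W1, W2)[2]
    return 0.5 * np.sum((y - y_hat) ** 2)

# Hypothetical data: 3 examples, 2 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
X = np.array([[0.3, 1.0], [0.5, 0.2], [1.0, 0.4]])
y = np.array([[0.75], [0.82], [0.93]])
W1 = rng.standard_normal((2, 3))
W2 = rng.standard_normal((3, 1))

# Analytic gradient: dJ/dW2 = a2^T delta3.
a2, z3, y_hat = forward(X, W1, W2)
delta3 = -(y - y_hat) * sigmoid_prime(z3)
dJdW2 = a2.T @ delta3

# Finite-difference check: nudge each entry of W2 and watch the cost.
eps = 1e-5
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        step = np.zeros_like(W2)
        step[i, j] = eps
        numeric[i, j] = (cost(X, y, W1, W2 + step)
                         - cost(X, y, W1, W2 - step)) / (2 * eps)

# One gradient descent step on W2 only: move against the slope.
learning_rate = 0.1
cost_before = cost(X, y, W1, W2)
W2_updated = W2 - learning_rate * dJdW2
cost_after = cost(X, y, W1, W2_updated)

print("analytic dJ/dW2:\n", dJdW2)
print("numeric  dJ/dW2:\n", numeric)
print("cost before / after one step:", cost_before, cost_after)
```

The transpose is what lines up the shapes: `a2.T` is (hidden units by examples) and `delta3` is (examples by outputs), so the product has the shape of `W2` and sums over the examples along the way.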


### Deriving $ \delta J / \delta W_{(1)} $
Let's begin deriving:
$$ \dfrac{\delta J}{\delta W_{(1)}} = -(y - \hat{y}) \dfrac{\delta \hat{y}}{\delta z_{(3)}} \dfrac{\delta z_{(3)}}{\delta W_{(1)}}$$