# BACKPROPAGATION CALCULUS:

 **Backpropagation is an algorithm used to calculate derivatives quickly. Artificial neural networks use backpropagation as a learning algorithm to compute the gradient of the cost with respect to the weights, which gradient descent then uses. ... The algorithm gets its name because the weights are updated backwards, from output towards input.**


* Backpropagation is the backwards movement through a neural network’s nodes to gauge how much each weight in the network contributes to deciding the overall cost.


* By finding out how each weight contributes to the cost, we can adjust the weights accordingly to most efficiently lower cost.



* The best way to see how much a weight’s value changes the overall cost would be changing the value of a particular weight (say, θ1) and seeing the resulting change in the cost.


### Notation

* Stupidly (in hindsight) I use both subscripts and superscripts to signify layer, but it’s pretty clear what I’m referring to.


* I use J() to signify the cost function and C to represent the output of the cost function (the cost itself).


* yhat is our prediction. It’s also equal to the last a value in the neural network.


<img src="https://miro.medium.com/max/2000/1*w0kiMknr9bevLUYiKpFmNw.png"/>

<center>A very simple neural network, with two hidden layers, three weights, and three biases. Typo: the first weight is supposed to be θ1.</center>


* The ratio of the resulting change in C to the change in θ1 would be the perfect indicator of that weight’s impact on the cost. Weights with higher ratios, where the resulting change in C is much larger than the change in θ, have a large sway on the cost. Weights with lower ratios are less important in deciding the cost.



* For those who know calculus (and if you’re reading this, that should be all of you!) this sounds a lot like a derivative.


<img src="https://miro.medium.com/max/2000/1*wEQFOKXh0yGDWzoraslibQ.png"/>

<center>You could show this ratio of changes (how C changes with changes to θ2) as a derivative. One caveat is when you’re differentiating a function with multiple inputs, you use the partial derivative ∂ (‘partial’) operator instead of d.</center>


* We need to find a way to cycle through every weight and bias in our network, and calculate the partial derivative of the cost with respect to each one.


$$\frac{\partial C}{\partial \theta_{1} },\frac{\partial C}{\partial b_{1} },\frac{\partial C}{\partial \theta_{2} },\frac{\partial C}{\partial b_{2} },\frac{\partial C}{\partial \theta_{3} },\frac{\partial C}{\partial b_{3} }$$
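
Before any calculus, the "nudge a weight and watch the cost" idea can be tried directly. The sketch below is not backpropagation; it estimates the six partial derivatives with finite differences. It assumes a scalar network with sigmoid activations in every layer and a squared error cost `(y_hat - y) ** 2`; the parameter values, `x`, `y`, and the helper names `cost` and `numerical_gradient` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(params, x, y):
    """Forward pass of the assumed scalar network, returning squared error."""
    a1 = sigmoid(params["theta1"] * x + params["b1"])
    a2 = sigmoid(params["theta2"] * a1 + params["b2"])
    y_hat = sigmoid(params["theta3"] * a2 + params["b3"])
    return (y_hat - y) ** 2

def numerical_gradient(params, x, y, eps=1e-6):
    """Estimate dC/dp for each parameter p by nudging it up and down."""
    grads = {}
    for name in params:
        up = dict(params)
        up[name] += eps
        down = dict(params)
        down[name] -= eps
        # Ratio of the change in cost to the change in the parameter
        grads[name] = (cost(up, x, y) - cost(down, x, y)) / (2 * eps)
    return grads

# Hypothetical parameter values and a single training example
params = {"theta1": 0.5, "b1": 0.1, "theta2": -0.3, "b2": 0.2, "theta3": 0.8, "b3": -0.1}
grads = numerical_gradient(params, x=1.0, y=1.0)
for name, g in grads.items():
    print(f"dC/d{name} ~ {g:.6f}")
```

This needs two full forward passes per parameter, which is why it does not scale to large networks and why the chain rule approach in the rest of this section matters.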


## Using chain rule in the context of backpropagation:


* It is hard to overstate the importance of the chain rule in backpropagation. In fact, backpropagation is just the chain rule executed in sequence.


$$\frac{\mathrm{d}y }{\mathrm{d} x} = \frac{\mathrm{d} y}{\mathrm{d} u}* \frac{\mathrm{d} u}{\mathrm{d} x}$$


* The chain rule finds the derivative of a composite of functions, a function inside of another function, such as f(g(x)). Even though it is most commonly seen with a pair of functions, there’s nothing stopping the chain rule from staying relevant in functions with many compositions: a(b(c(d(e(f(g(x))))))), for example. We would just have to multiply more derivatives.


* Finding the derivatives of a cost function is like finding the derivatives of a big composite function. And this is where the idea of backpropagation comes from.



* Just as we start from the outside when differentiating a composite function (from a, the function closest to the output we are looking for), we start from the “outside” of our neural network function. This is the end of the network, where we have J(), which encompasses all the little functions inside and takes every weight and bias as input.


$$\frac{\mathrm{d}}{\mathrm{d} x}\, a(b(c(d(e(f(g(x)))))))$$


$$ \frac{\mathrm{d} a}{\mathrm{d} x} = \frac{\mathrm{d} a}{\mathrm{d} b} * \frac{\mathrm{d} b}{\mathrm{d} c} * \frac{\mathrm{d} c}{\mathrm{d} d} * \frac{\mathrm{d} d}{\mathrm{d} e} * \frac{\mathrm{d} e}{\mathrm{d} f} * \frac{\mathrm{d} f}{\mathrm{d} g} * \frac{\mathrm{d} g}{\mathrm{d} x}$$ 
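
To see that multiplying local derivatives really gives the derivative of the whole composite, here is a small sketch with three nested functions instead of seven. The functions chosen (`x ** 2`, `exp`, `sin`) and the names `layers`, `forward`, and `chain_rule_derivative` are arbitrary picks for illustration. The product of local derivatives is compared against a finite-difference estimate.

```python
import math

# Each entry is (function, its derivative), listed from innermost to outermost.
layers = [
    (lambda x: x ** 2, lambda x: 2 * x),  # c(x) = x^2
    (math.exp, math.exp),                 # b(c) = e^c
    (math.sin, math.cos),                 # a(b) = sin(b)
]

def forward(x):
    """Return the input seen by every function, plus the final output."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    return inputs, x

def chain_rule_derivative(x):
    """Multiply the local derivatives, starting from the outermost function."""
    inputs, _ = forward(x)
    product = 1.0
    for (_, df), u in zip(reversed(layers), reversed(inputs)):
        product *= df(u)  # local derivative evaluated at that function's input
    return product

x0 = 0.5
eps = 1e-6
numeric = (forward(x0 + eps)[1] - forward(x0 - eps)[1]) / (2 * eps)
analytic = chain_rule_derivative(x0)
print(analytic, numeric)  # the two values should agree closely
```

Note that each local derivative is evaluated at the value that function received during the forward pass, which is why the forward values are stored first.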


* We can think of our neural network as one big composite function. Our cost is the final output (the a, if you will, if we’re thinking about the above super-function). In between, we have many smaller functions.


* If we think back to forward propagation, we are essentially calculating a string of composite functions, composing z’s and g’s inside of each other. Each layer of the neural network depends on the outputs of the previous one.
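
As a rough sketch of that composition, here is forward propagation for a scalar network with three weights and three biases, written once step by step and once as a single nested expression. Using sigmoid for every activation g is an assumption made to keep the example short, and the function names and input values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, theta1, b1, theta2, b2, theta3, b3):
    """Forward propagation: each z and a is built from the previous layer's output."""
    z1 = theta1 * x + b1
    a1 = sigmoid(z1)
    z2 = theta2 * a1 + b2
    a2 = sigmoid(z2)
    z3 = theta3 * a2 + b3
    y_hat = sigmoid(z3)
    return {"z1": z1, "a1": a1, "z2": z2, "a2": a2, "z3": z3, "y_hat": y_hat}

def forward_nested(x, theta1, b1, theta2, b2, theta3, b3):
    """The same computation written as one composite function."""
    return sigmoid(theta3 * sigmoid(theta2 * sigmoid(theta1 * x + b1) + b2) + b3)

# Hypothetical input and parameter values
values = forward(1.0, 0.5, 0.1, -0.3, 0.2, 0.8, -0.1)
print(values["y_hat"], forward_nested(1.0, 0.5, 0.1, -0.3, 0.2, 0.8, -0.1))
```

Keeping the intermediate z and a values around, as `forward` does, is what makes the backward pass cheap later on.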


<img src="https://miro.medium.com/max/700/1*vQUKXNkcEYrYaG3Mlnpstw.png"/>


* So, we can backpropagate through our network to find all the derivatives. Let’s return to our simple example to understand this more intuitively.


<img src="https://miro.medium.com/max/2000/1*GU54FF0xLDDNmlpsOlB2Mg.png"/>



* We can rewrite the neural network as a concrete set of multiplications, which will help us better understand and execute backpropagation.


* Let’s write the derivatives, going backward, from C, to x.


<img src="https://miro.medium.com/max/2000/1*Y4r8aNMcyTw7Y-Rex1Ky2A.png"/>


* By following the a’s back through the algorithm, we can get from C to x, since each layer builds on the one before. On the way back to x, we can calculate, in the z layers, the derivatives for biases and weights, kind of like “turning off” our main derivative track of a’s and logging the derivatives of the z’s with respect to the weights as well.


* Remember that we’re multiplying our derivatives as we go back.


* It’s kind of like using the chain rule to do something like this. This is far from a correct representation of what we’re doing, but it can give a bit more intuition.

<img src="https://miro.medium.com/max/1400/1*X007PDS0ZqoyGfDZ7t0plQ.png"/>


* How would you compute this simplified neural network, in terms of actual derivatives? How would you, with pen and paper, backpropagate? Let’s see.


### Computing simple backprop with squared error cost function


* So, looking at our simple neural network, (now with cost defined as squared error cost) how will we find our derivatives of all 6 weights and biases? Let’s see how we can recursively use the chain rule to compute these.


<img src="https://miro.medium.com/max/2000/1*XrmWZWnNl-6OSO6YaXJfHg.png"/>



* Looking at our backpropagation plan, we can identify the specific derivative ‘paths’ we have to take to get to each of our six weights and biases, starting at the cost. We always traverse these paths backwards, keeping the chain rule intact.


* Here’s an example path that we would follow to get the derivative with respect to θ2.


<img src="https://miro.medium.com/max/2000/1*A6T4pZVEd8zCq4HfQFDR3g.png"/>


* By following similar paths, we can find the series of derivatives we need to multiply to reach all six of our weights and biases.


$$\frac{\partial C}{\partial \theta_{1} } = \frac{\partial C}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z^{3}} * \frac{\partial z^{3}}{\partial a^{2}} * \frac{\partial a^{2}}{\partial z^{2}} * \frac{\partial z^{2}}{\partial a^{1}} * \frac{\partial a^{1}}{\partial z^{1}} * \frac{\partial z^{1}}{\partial \theta_{1}}$$


$$\frac{\partial C}{\partial b_{1} } = \frac{\partial C}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z^{3}} * \frac{\partial z^{3}}{\partial a^{2}} * \frac{\partial a^{2}}{\partial z^{2}} * \frac{\partial z^{2}}{\partial a^{1}} * \frac{\partial a^{1}}{\partial z^{1}} * \frac{\partial z^{1}}{\partial b_{1}}$$



$$\frac{\partial C}{\partial \theta_{2} } = \frac{\partial C}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z^{3}} * \frac{\partial z^{3}}{\partial a^{2}} * \frac{\partial a^{2}}{\partial z^{2}} * \frac{\partial z^{2}}{\partial \theta_{2}}$$


$$\frac{\partial C}{\partial b_{2} } = \frac{\partial C}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z^{3}} * \frac{\partial z^{3}}{\partial a^{2}} * \frac{\partial a^{2}}{\partial z^{2}} * \frac{\partial z^{2}}{\partial b_{2}}$$



$$\frac{\partial C}{\partial \theta_{3} } = \frac{\partial C}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z^{3}} * \frac{\partial z^{3}}{\partial \theta_{3}}$$


$$ \frac{\partial C}{\partial b_{3} } = \frac{\partial C}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z^{3}} * \frac{\partial z^{3}}{\partial b_{3}}$$
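
The six products above share long prefixes, so in code we can compute each shared piece once and extend it as we move backwards. The sketch below does that under the same assumptions used for illustration here: a scalar network, sigmoid in every layer, and cost `(y_hat - y) ** 2`. The names `cost` and `backward` and all numeric values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(p, x, y):
    """Squared error of the assumed scalar network (forward pass only)."""
    a1 = sigmoid(p["theta1"] * x + p["b1"])
    a2 = sigmoid(p["theta2"] * a1 + p["b2"])
    y_hat = sigmoid(p["theta3"] * a2 + p["b3"])
    return (y_hat - y) ** 2

def backward(p, x, y):
    """Return dC/dparam for all six parameters by walking the paths backwards."""
    # Forward pass, keeping the activations we will need
    a1 = sigmoid(p["theta1"] * x + p["b1"])
    a2 = sigmoid(p["theta2"] * a1 + p["b2"])
    y_hat = sigmoid(p["theta3"] * a2 + p["b3"])

    # Shared start of every path: dC/dy_hat * dy_hat/dz3
    dC_dz3 = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    grads = {"theta3": dC_dz3 * a2, "b3": dC_dz3 * 1.0}

    # Extend the path by dz3/da2 * da2/dz2
    dC_dz2 = dC_dz3 * p["theta3"] * a2 * (1 - a2)
    grads["theta2"] = dC_dz2 * a1
    grads["b2"] = dC_dz2 * 1.0

    # Extend again by dz2/da1 * da1/dz1
    dC_dz1 = dC_dz2 * p["theta2"] * a1 * (1 - a1)
    grads["theta1"] = dC_dz1 * x
    grads["b1"] = dC_dz1 * 1.0
    return grads

# Hypothetical parameter values and a single training example
p = {"theta1": 0.5, "b1": 0.1, "theta2": -0.3, "b2": 0.2, "theta3": 0.8, "b3": -0.1}
grads = backward(p, x=1.0, y=1.0)
for name in sorted(grads):
    print(f"dC/d{name} = {grads[name]:.6f}")
```

Each `dC_dz` variable is the running product of one path; the weight and bias gradients are the short "branches" off that main track.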


* The actual computation is pretty repetitive for all of them, so let’s see how we compute one of them and I’ll leave the rest for you to infer, since it’s pretty much the exact same calculation.


### Partial derivative of C w.r.t θ2:


* Let’s continue our example. The ‘path’ to know the value of the partial derivative w.r.t θ2 is as follows:



$$\frac{\partial C}{\partial \theta_{2} } = \frac{\partial C}{\partial \hat{y}} * \frac{\partial \hat{y}}{\partial z^{3}} * \frac{\partial z^{3}}{\partial a^{2}} * \frac{\partial a^{2}}{\partial z^{2}} * \frac{\partial z^{2}}{\partial \theta_{2}}$$


* In traditional backpropagation fashion, let’s go backwards from C.


<img src="https://miro.medium.com/max/2000/1*VfxLfEyPxa4lbd1ivAm6Cg.png"/>



* First, we have to figure out the derivative of our cost w.r.t our prediction, yhat. This means differentiating our cost function J() = C, as that’s what gives us cost.


* Let’s go ahead and differentiate J() (or C) w.r.t yhat.



<img src="https://miro.medium.com/max/1400/1*tliJnn5eLZNfW1X97eyuAg.png"/>



* The question is: what goes in the place of this question mark? The answer is whatever input into yhat can help us travel farther back into the network.


<img src="https://miro.medium.com/max/2000/1*6l8wowfMAbHHoKo_1AE1Fw.png"/>


* yhat has just one input: z3. We get yhat by putting z3 through the sigmoid function, so we must find the derivative of the sigmoid function with respect to z3. This looks a bit intimidating, but it simplifies nicely: the derivative of sigmoid(z) is sigmoid(z) * (1 − sigmoid(z)).
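
For a quick sanity check of this step, the sketch below writes the sigmoid derivative in its standard closed form, sigmoid(z) * (1 - sigmoid(z)), and compares it with a finite-difference estimate. The value of `z3` is a made-up number, and `sigmoid_prime` is just an illustrative helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """d(sigmoid)/dz, written in terms of the sigmoid's own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

z3 = 0.3  # hypothetical pre-activation value
eps = 1e-6
numeric = (sigmoid(z3 + eps) - sigmoid(z3 - eps)) / (2 * eps)
print(sigmoid_prime(z3), numeric)  # the two values should agree closely
```

Because the derivative only needs the sigmoid's output, yhat from the forward pass can be reused here without recomputing anything.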


<img src="https://miro.medium.com/max/2000/1*Gve5QDWUiDk2HNBH8XFrrQ.png"/>



* Now we’ve done two derivatives in our network. Next up, we can compute the derivative of z3 with respect to a2.


<img src="https://miro.medium.com/max/2000/1*Kx1P2qqsghpFh46MQOpG1A.png"/>



* Here, we could “branch off” and calculate the partial derivatives of z3 w.r.t θ3 and b3, but since we’re calculating the derivative w.r.t θ2, we skip these and continue down the main “derivative path”. So, instead, we continue backpropagating through the network.



<img src="https://miro.medium.com/max/1400/1*tx3T1BE_NIhhCFH6CTKx2A.png"/>


* Just to keep track of what derivatives we’ve calculated so far:


<img src="https://miro.medium.com/max/700/1*L2T_2YWcL4t6KwFjUaFhig.png"/>


* As we continue down the line, we find that up next we’re passing through our second activation function, with ∂ a2 / ∂ z2.



* Technically, this would be the ReLU function, as sigmoid is generally only used for the final layer, but the derivation of ReLU is a bit annoying to deal with so I’ll just be wishy washy and not derive it.*


<img src="https://miro.medium.com/max/700/1*_BHI6cAJOUSvQKbHmst7KA.png"/>

<center>* if you’ve ever seen the graph of ReLU you can tell that it isn’t differentiable in the way we are familiar with. ReLU itself is continuous, but its derivative is not, and is described piecewise (d = 0 if z < 0, d = 1 if z > 0). At exactly 0 it’s not differentiable, but we are still able to use it since landing on exactly 0.000… is rare, and if that does happen, implementations simply pick a value for the derivative there (commonly 0).</center>
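
For completeness, the piecewise derivative of ReLU is short to write down even though the walkthrough sticks with sigmoid. This is only a sketch: returning 0 at exactly z = 0 is a convention chosen here as an assumption, and `relu_prime` is an illustrative name.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    """Piecewise derivative of ReLU: 0 for z < 0, 1 for z > 0.

    At z == 0 the derivative is undefined; returning 0 there is a
    convention picked for this sketch.
    """
    return 1.0 if z > 0 else 0.0

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), relu_prime(z))
```

Swapping this in for the hidden layers would only change the ∂a/∂z factors in the chain; every other factor stays the same.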
    
    
    
* Now that we’ve found our derivative w.r.t z2, we can finally “branch off” and calculate our derivative w.r.t θ2. Let’s take a look at our progress map of the backpropagation.
    
    
    
<img src="https://miro.medium.com/max/2000/1*chmKp-ptMBZkajHutm5ouw.png"/> 
    
    
    
* Since we have ∂a2/∂z2, we can find the derivative for the adjacent weight (as well as the bias). All we need to do is differentiate our definition of z2 with respect to θ2, which is, after all the hard work getting here, underwhelmingly easy.
    
    
    
<img src="https://miro.medium.com/max/1400/1*WyGfG-FlV0Aj3k_Nu_Zbsg.png"/>    
    
    
    
* It’s also interesting to note that the derivative of z2 with respect to the weight is just the activation of the previous layer, showing that the weights are only important in proportion to the activations.
    
    
* If you were wondering, it’s pretty easy to also get the derivative for the bias, which would just be 1.
    
    
<img src="https://miro.medium.com/max/700/1*05Xo4Jmz9A1y6Uw1CN-DKw.png"/> 
    

    
* So, we can lay out the backpropagation mathematics for our weight.
    
    
<img src="https://miro.medium.com/max/700/1*FT3fpyYRu6euodEdKsyETw.png"/>
    <center> We did it</center>  
    
    
    
* Remember what this is doing. This line of mathematics backpropagates through the neural network, from the cost towards the input. This product is the result of using chain rule to go further and further through the network by using previously computed numbers.
    
    
* You can easily follow this same type of logic to find the derivative for any weight or bias in the entire system, by applying the chain rule to reach weights that are farther and farther in (or, should I say, farther and farther out) away from the cost.
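
For reference, here is the θ2 product written out with each factor filled in. This is a sketch under two assumptions: the cost is exactly $C = (\hat{y} - y)^{2}$ (a version with a factor of ½ would drop the 2), and both activations on the path are sigmoids. Superscripts denote layers, not powers.

$$\frac{\partial C}{\partial \theta_{2}} = \underbrace{2(\hat{y} - y)}_{\partial C / \partial \hat{y}} * \underbrace{\hat{y}(1 - \hat{y})}_{\partial \hat{y} / \partial z^{3}} * \underbrace{\theta_{3}}_{\partial z^{3} / \partial a^{2}} * \underbrace{a^{2}(1 - a^{2})}_{\partial a^{2} / \partial z^{2}} * \underbrace{a^{1}}_{\partial z^{2} / \partial \theta_{2}}$$

Replacing the last factor with 1 gives the derivative for the bias $b_{2}$ instead.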