<h1>How the backpropagation algorithm works</h1>

In this chapter we'll explain a fast algorithm for computing the gradient of the cost function, an algorithm known as backpropagation.

At the heart of backpropagation is an expression for the partial derivative $\frac{\partial C}{\partial w}$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network. The expression tells us how quickly the cost changes when we change the weights and biases.

<h2>A fast matrix-based approach to computing the output from a neural network</h2>

Let's warm up with a fast matrix-based algorithm to compute the output from a neural network. We'll use $w^l_{jk}$ to denote the weight for the connection from the $k^{th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer.

![Figure 2.1](imgs/network1.PNG)

We use $b^l_j$ for the bias of the $j^{th}$ neuron in the $l^{th}$ layer. And we use $a^l_j$ for the activation of the $j^{th}$ neuron in the $l^{th}$ layer. For example,

![Figure 2.2](imgs/network2.PNG)

The activation $a^l_j$ of the $j^{th}$ neuron in the $l^{th}$ layer is related to the activations in the $(l-1)^{\rm th}$ layer by the equation

$$
\begin{eqnarray} 
  a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\tag{23}\end{eqnarray}
$$

where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer.

To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l^{th}$ layer of neurons, that is, the entry in the $j^{th}$ row and $k^{th}$ column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector, $b^l$. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$.

The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as $\sigma$. We want to apply a function such as $\sigma$ to every element in a vector $v$. We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function.

So, equation (23) can be written as

$$
\begin{eqnarray} 
  a^{l} = \sigma(w^l a^{l-1}+b^l).
\tag{25}\end{eqnarray}
$$

Before we apply the sigmoid function, we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$. We call this the *weighted input* to the neurons in layer $l$.

In terms of the weighted input, Equation (25) is sometimes written as $a^l = \sigma(z^l)$.
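As a minimal sketch of equations (23) and (25), the snippet below computes $z^l$ and $a^l$ layer by layer in plain Python. The function names (`sigmoid_vec`, `weighted_input`, `feedforward`) and the small 2-3-1 network with made-up parameters are assumptions for illustration only; each weight matrix is stored as a list of rows, so `w[j][k]` plays the role of $w^l_{jk}$.

```python
import math

def sigmoid(z):
    """Scalar logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_vec(v):
    """Vectorized sigma: apply sigmoid to every element of v."""
    return [sigmoid(x) for x in v]

def weighted_input(w, a_prev, b):
    """z^l = w^l a^{l-1} + b^l, with w stored as a list of rows."""
    return [sum(w_jk * a_k for w_jk, a_k in zip(row, a_prev)) + b_j
            for row, b_j in zip(w, b)]

def feedforward(weights, biases, a):
    """Apply a^l = sigma(w^l a^{l-1} + b^l) one layer at a time."""
    for w, b in zip(weights, biases):
        a = sigmoid_vec(weighted_input(w, a, b))
    return a

# Hypothetical 2-3-1 network; all numbers are made up for illustration.
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],  # 3 rows (j), 2 columns (k)
    [[0.3, -0.1, 0.2]],                       # 1 row, 3 columns
]
biases = [[0.0, 0.1, -0.1], [0.05]]
output = feedforward(weights, biases, [1.0, 0.5])
```

Each row of a weight matrix produces one component of $z^l$, which is exactly the sum over $k$ in equation (23).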

<h2>The two assumptions we need about the cost function</h2>

The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network.

*$\partial C / \partial w$ can be interpreted as the instantaneous rate of change of $C$ with respect to a weight $w$. We compute these derivatives because we ultimately want to find the values of the weights that make $C$ as small as possible.*

We need to make two main assumptions about the cost function for backpropagation to work. We will use the quadratic cost function as an example in the form:

$$
\begin{eqnarray}
  C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
$$

* $n$ is the total number of training examples <br>
* the sum is over individual training examples, $x$ <br>
* $y=y(x)$ is the corresponding desired output <br>
* $L$ denotes the number of layers <br>
* $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input

<h4>First Assumption</h4>

**The cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$.**

For the quadratic cost function, the cost for a single training example is $C_x = \frac{1}{2} \|y-a^L \|^2$.

Backpropagation computes the partial derivatives $\frac{\partial C_x}{\partial w}$ and $\frac{\partial C_x}{\partial b}$ for a single training example. We then recover $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$ by averaging over training examples.
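To make the first assumption concrete, here is a small sketch of the quadratic cost written as an average of per-example costs. The function names and the target and output vectors are made up for illustration; in practice the outputs would come from the network.

```python
def example_cost(y, a_out):
    """C_x = 1/2 * ||y - a^L||^2 for one training example."""
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a_out))

def total_cost(targets, outputs):
    """C = (1/n) * sum_x C_x, the average of the per-example costs."""
    n = len(targets)
    return sum(example_cost(y, a) for y, a in zip(targets, outputs)) / n

# Two hypothetical training examples with two output neurons each.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.2], [0.4, 0.6]]
cost = total_cost(targets, outputs)  # approximately 0.1
```

Here the per-example costs are about 0.04 and 0.16, and their average gives the total cost of equation (26).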

<h4>Second Assumption</h4>

**The cost can be written as a function of the outputs from the neural network.**

![Network Output](imgs/cost_function_network.PNG)

Remember (in the case of the quadratic cost function) that a training example $x$ is fixed, so the desired output $y$ is also fixed. So, the cost is a function of the output activations only.

<h2>The Hadamard product</h2>

Suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors.

For example,

$$
\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right] 
  \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
$$
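The same example can be checked with a short sketch in plain Python. The helper name `hadamard` is an assumption for illustration; numerical libraries typically provide elementwise multiplication directly.

```python
def hadamard(s, t):
    """Elementwise product of two vectors of the same dimension."""
    if len(s) != len(t):
        raise ValueError("vectors must have the same dimension")
    return [s_j * t_j for s_j, t_j in zip(s, t)]

result = hadamard([1, 2], [3, 4])  # [3, 8], matching equation (28)
```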

<h2>The four fundamental equations behind backpropagation</h2>

The four fundamental equations of backpropagation let us compute the error in every layer of the network and, from it, the gradient of the cost function. They also show that a weight will learn slowly if either the input neuron is low-activation, or if the output neuron has saturated, i.e., is either high- or low-activation.

![Equations](imgs/equations.PNG)
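For readers who cannot see the figure, the four equations are usually stated as below. This is a transcription of the standard form, on the assumption that the figure shows the same equations. The symbol $\delta^l_j \equiv \partial C / \partial z^l_j$ (the error of the $j^{th}$ neuron in layer $l$) and the labels BP1 to BP4 are notation introduced here, not defined earlier in these notes.

$$
\begin{eqnarray}
  \delta^L = \nabla_a C \odot \sigma'(z^L)
\tag{BP1}\end{eqnarray}
$$

$$
\begin{eqnarray}
  \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)
\tag{BP2}\end{eqnarray}
$$

$$
\begin{eqnarray}
  \frac{\partial C}{\partial b^l_j} = \delta^l_j
\tag{BP3}\end{eqnarray}
$$

$$
\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j
\tag{BP4}\end{eqnarray}
$$

The slow-learning observations follow from these: in (BP4) a small input activation $a^{l-1}_k$ makes the gradient small, and in (BP1) and (BP2) a saturated neuron has $\sigma'(z) \approx 0$, which shrinks $\delta$.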

<h2>The backpropagation algorithm</h2>

Backpropagation tells us how much we need to change the weights and biases of the network to get our desired output.

Suppose we wanted to classify a handwritten digit as the number two.

![Example](imgs/backprop_example.PNG)

We need to increase the activation of the neuron associated with a two and decrease the activation of all other output neurons. But we can't directly change the activations. We can only adjust the weights and biases. However, we want to keep track of how much we want each neuron's activation to change. Then, we can figure out how much we want the activations in the previous layer to change to get our desired output. In turn, every earlier layer has to change to produce those adjustments in the second to last layer. **We will move backwards through the neural network and keep track of how much each layer needs to change to get the desired output of the next layer.**

![Example 2](imgs/backprop_example2.PNG)

The desired adjustments from all the output neurons are added together to figure out how much we need to change the second to last layer. We then recursively apply the same process to every layer before the second to last layer. That is, we propagate backwards.

We apply the backpropagation algorithm to every training example and find an average for how much we want to adjust each weight. The average amount we want to change each weight is the negative gradient of the cost function (multiplied by the learning rate).
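The procedure above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: it assumes sigmoid neurons and the quadratic cost, and the names `backprop`, `gradient_step`, `cost`, the tiny 2-2-1 network, and the learning rate `eta=0.5` are all made up for the example. Weight matrices are stored as lists of rows, so `weights[l][j][k]` connects neuron `k` of one layer to neuron `j` of the next.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Gradients of C_x = 1/2 ||y - a^L||^2 for one example (x, y)."""
    # Forward pass: store every weighted input z and activation a.
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, activations[-1])) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(zj) for zj in z])
    # Output error: (a^L - y) times sigma'(z^L), elementwise.
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    grad_w, grad_b = [None] * len(weights), [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta
        grad_w[l] = [[dj * ak for ak in activations[l]] for dj in delta]
        if l > 0:
            # Move the error one layer back through the transposed weights.
            delta = [sum(weights[l][j][k] * delta[j]
                         for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k])
                     for k in range(len(zs[l - 1]))]
    return grad_w, grad_b

def gradient_step(weights, biases, batch, eta):
    """Average the per-example gradients and step against them."""
    n = len(batch)
    grads = [backprop(weights, biases, x, y) for x, y in batch]
    new_w = [[[w[j][k] - eta * sum(g[0][l][j][k] for g in grads) / n
               for k in range(len(w[j]))] for j in range(len(w))]
             for l, w in enumerate(weights)]
    new_b = [[b[j] - eta * sum(g[1][l][j] for g in grads) / n
              for j in range(len(b))]
             for l, b in enumerate(biases)]
    return new_w, new_b

def cost(weights, biases, batch):
    """Quadratic cost averaged over the batch."""
    total = 0.0
    for x, y in batch:
        a = x
        for w, b in zip(weights, biases):
            a = [sigmoid(sum(wjk * ak for wjk, ak in zip(row, a)) + bj)
                 for row, bj in zip(w, b)]
        total += 0.5 * sum((t - o) ** 2 for t, o in zip(y, a))
    return total / len(batch)

# Hypothetical 2-2-1 network and two made-up training examples.
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.3, -0.1]]]
biases = [[0.0, 0.1], [0.05]]
batch = [([1.0, 0.5], [1.0]), ([0.0, 1.0], [0.0])]
before = cost(weights, biases, batch)
weights2, biases2 = gradient_step(weights, biases, batch, eta=0.5)
after = cost(weights2, biases2, batch)
```

Comparing `before` and `after` shows whether the averaged step reduced the cost on this batch. Returning new lists instead of updating in place keeps the original parameters available for that comparison.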

![Example_3](imgs/backprop_example3.PNG)

Notes taken from: https://www.youtube.com/watch?v=Ilg3gGewQ5U&t=41s