# Back Propagation

Let's first define the cost function that we will be working with:

$$ C = \frac{1}{2m}\sum_{i=1}^{m}||y(x_{i}) - a^{L}(x_{i})||^{2} $$

Where,

$L$ -> number of layers in the network

$y(x)$ -> desired (target) output for input $x$

$a^{L}(x)$ -> the output our network actually produces for input $x$

$m$ -> number of training examples


This is the cost over the entire training set.

Now there are two assumptions to be made:

1) The cost can be written as an average over the costs $C_{x_i}$ of the individual training examples:
$$ C = \frac{1}{m}\sum_{i=1}^{m}C_{x_i} $$

2) The cost of each training example can be written as a function of the network's output $a^L$:
$$ C_x = \frac{1}{2} \|y-a^L \|^2$$
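The two assumptions can be illustrated with a small sketch. The function names `example_cost` and `total_cost` and the sample numbers are made up for illustration; the code only shows that the total cost is the average of the per-example quadratic costs.

```python
def example_cost(y, a):
    """Per-example quadratic cost C_x = 1/2 * ||y - a^L||^2."""
    return 0.5 * sum((yj - aj) ** 2 for yj, aj in zip(y, a))


def total_cost(targets, outputs):
    """Total cost C: the average of C_x over the m training examples."""
    m = len(targets)
    return sum(example_cost(y, a) for y, a in zip(targets, outputs)) / m


# Hypothetical targets y(x_i) and network outputs a^L(x_i) for m = 2 examples.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.6]]

print(total_cost(targets, outputs))
```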

### The four fundamental equations of Back Propagation

Backpropagation helps us understand how the cost function changes when any of the weights or biases in the network changes.

The partial derivatives $\large\frac{\partial{C}}{\partial{w^{l}_{jk}}}$ and $\large\frac{\partial{C}}{\partial{b^{l}_{j}}}$ measure the rate at which the cost function changes with respect to ${w^{l}_{jk}}$ and ${b^{l}_{j}}$.

Where

${w^{l}_{jk}}$ -> weight from node $k$ in layer $l-1$ to node $j$ in layer $l$

${b^{l}_{j}}$ -> bias for node $j$ in layer $l$

Let us say that node $j$ in layer $l$ receives the weighted input $z_{j}^{l}$.

Now, **instead of measuring the error in the output of the node, we say that its input carries an error $\delta^{l}_{j}$, which causes the error in the output of node $j$ in layer $l$**.

Therefore we can say that
$$\delta_{j}^{l} = \frac{\partial{C}}{\partial{z^{l}_{j}}}$$

which means that the error of node $j$ in layer $l$ is the sensitivity of the cost to that node's weighted input ${z^{l}_{j}}$.

Our main aim is to express $\large\frac{\partial{C}}{\partial{w^{l}_{jk}}}$ and $\large\frac{\partial{C}}{\partial{b^{l}_{j}}}$ in terms of $\delta_{j}^{l}$.

Let's expand $\delta^{L}_{j}$ for the output layer using the chain rule:

$$\delta^L_j = \frac{\partial{C}}{\partial{z^{L}_{j}}} = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$$

Where

$\delta^L_j$ -> the error of node $j$ in the output layer

$\large\frac{\partial C}{\partial a^L_j}$ -> the rate at which the cost function changes w.r.t. a change in the output of the $j^{th}$ node of the final layer $L$

$\sigma'(z^L_j)$ -> the rate at which the activation function output changes w.r.t. a change in the input $z^L_j$
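As a minimal sketch of the output-layer error, assume that $\sigma$ is the logistic sigmoid (the notes do not state which activation is used) and that the cost is the quadratic cost defined earlier, so that $\frac{\partial C}{\partial a^L_j} = a^L_j - y_j$. The helper names are illustrative only.

```python
import math


def sigmoid(z):
    # Assumed activation: the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z)).
    s = sigmoid(z)
    return s * (1.0 - s)


def output_error(z_L, y):
    """delta^L_j = dC/da^L_j * sigma'(z^L_j), with dC/da^L_j = a^L_j - y_j."""
    a_L = [sigmoid(z) for z in z_L]
    return [(a - t) * sigmoid_prime(z) for a, t, z in zip(a_L, y, z_L)]


# Hypothetical weighted inputs and targets for a two-node output layer.
delta_L = output_error([0.0, 2.0], [1.0, 0.0])
print(delta_L)
```

Note how a saturated node (large $|z^L_j|$) has a small $\sigma'(z^L_j)$, which shrinks its error term even when the output is far from the target.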


Check this link for the derivations:

http://neuralnetworksanddeeplearning.com/chap2.html

So we come to the conclusion that:

1) $$\frac{\partial C}{\partial b^l_j} = \delta^l_j$$

2) $$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$
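One way to convince yourself of these two results is a numerical gradient check. The sketch below assumes a single sigmoid neuron with the quadratic cost; all names and numbers are made up for illustration. It compares $a^{l-1}_k \delta^l_j$ and $\delta^l_j$ against central finite differences of the cost.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, a_prev, y):
    # C = 1/2 * (y - a)^2 for one neuron with a = sigma(w . a_prev + b).
    z = sum(wk * ak for wk, ak in zip(w, a_prev)) + b
    return 0.5 * (y - sigmoid(z)) ** 2


def analytic_gradients(w, b, a_prev, y):
    z = sum(wk * ak for wk, ak in zip(w, a_prev)) + b
    a = sigmoid(z)
    delta = (a - y) * a * (1.0 - a)         # delta = dC/dz
    grad_w = [ak * delta for ak in a_prev]  # dC/dw_k = a_prev_k * delta
    grad_b = delta                          # dC/db = delta
    return grad_w, grad_b


def numeric_gradients(w, b, a_prev, y, eps=1e-6):
    # Central differences: (C(p + eps) - C(p - eps)) / (2 * eps).
    grad_w = []
    for k in range(len(w)):
        w_plus, w_minus = w[:], w[:]
        w_plus[k] += eps
        w_minus[k] -= eps
        grad_w.append((cost(w_plus, b, a_prev, y) - cost(w_minus, b, a_prev, y)) / (2 * eps))
    grad_b = (cost(w, b + eps, a_prev, y) - cost(w, b - eps, a_prev, y)) / (2 * eps)
    return grad_w, grad_b


# Hypothetical weights, bias, previous-layer activations and target.
w, b, a_prev, y = [0.5, -0.3], 0.1, [0.2, 0.9], 1.0
gw, gb = analytic_gradients(w, b, a_prev, y)
nw, nb = numeric_gradients(w, b, a_prev, y)
print(gw, gb)
print(nw, nb)
```

The two printed lines should agree to several decimal places; a mismatch usually points to an indexing or sign error in the analytic formula.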

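Putting these equations together, the backpropagation algorithm combined with gradient descent is:

1) **Input** a set of training examples.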

2) **For each training example $x$:**
Set the corresponding input activation $a^{x,1}$, and perform the following steps:

**Feedforward:** For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$.

**Output error $\delta^{x,L}$:** Compute the vector $$\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$$

**Backpropagate the error:** For each $l = L-1, L-2, \ldots, 2$ compute $$\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1})
  \odot \sigma'(z^{x,l})$$


3) **Gradient descent:** For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $\large w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $\large b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.
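The steps above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: it assumes a sigmoid activation and the quadratic cost, stores matrices as nested lists, and uses a made-up 2-3-1 network with XOR-style data. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feedforward(weights, biases, x):
    """Return the activations [a^1, ..., a^L], where a^1 = x."""
    activations = [x]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        z = [sum(w * a for w, a in zip(row, a_prev)) + bj for row, bj in zip(W, b)]
        activations.append([sigmoid(zj) for zj in z])
    return activations


def backprop(weights, biases, x, y):
    """Return (grads_w, grads_b) of C_x for one training example."""
    activations = feedforward(weights, biases, x)
    a_L = activations[-1]
    # Output error: delta^L = (a^L - y) * sigma'(z^L), with sigma'(z) = a * (1 - a).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(a_L, y)]
    grads_w, grads_b = [], []
    for i in range(len(weights) - 1, -1, -1):
        a_prev = activations[i]
        grads_w.insert(0, [[d * a for a in a_prev] for d in delta])  # delta (a_prev)^T
        grads_b.insert(0, delta)
        if i > 0:
            # Backpropagate: (W^T delta) * sigma'(z) for the previous layer.
            W = weights[i]
            delta = [
                sum(W[j][k] * delta[j] for j in range(len(delta)))
                * a_prev[k] * (1.0 - a_prev[k])
                for k in range(len(a_prev))
            ]
    return grads_w, grads_b


def gradient_descent_step(weights, biases, batch, eta):
    """Update weights and biases in place using the batch-averaged gradients."""
    m = len(batch)
    all_grads = [backprop(weights, biases, x, y) for x, y in batch]
    for gw, gb in all_grads:
        for i in range(len(weights)):
            for j in range(len(biases[i])):
                biases[i][j] -= eta / m * gb[i][j]
                for k in range(len(weights[i][j])):
                    weights[i][j][k] -= eta / m * gw[i][j][k]


def cost(weights, biases, batch):
    """Quadratic cost averaged over the batch."""
    total = 0.0
    for x, y in batch:
        a = feedforward(weights, biases, x)[-1]
        total += 0.5 * sum((t - o) ** 2 for t, o in zip(y, a))
    return total / len(batch)


random.seed(0)
sizes = [2, 3, 1]  # hypothetical layer sizes
weights = [[[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[random.gauss(0, 1) for _ in range(n_out)] for n_out in sizes[1:]]
batch = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
         ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]

before = cost(weights, biases, batch)
for _ in range(200):
    gradient_descent_step(weights, biases, batch, eta=0.5)
after = cost(weights, biases, batch)
print(before, after)
```

All gradients for a batch are computed before any parameter is changed, so every example in the batch sees the same weights, matching the sum over $x$ in the update rule. In practice the nested lists would be replaced by matrix operations from a numerical library.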