# Notes on Backpropagation
2017-04-28 jkang  

These notes summarize the notation and fundamentals of backpropagation in neural networks, for simpler understanding and easier implementation in code.

Ref:  
- http://neuralnetworksanddeeplearning.com/
- https://cs224d.stanford.edu/

---

## Notations

> ### Data
> * $X$ is an input matrix; size = $(n\_inputs)\ \times \ (n\_features)$  
> &nbsp;&nbsp;&nbsp;&nbsp;Inputs are stacked row-wise in $X$  
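> * $Y$ is an output matrix; size = $(n\_outputs)\ \times \ (n\_classes)$  
> &nbsp;&nbsp;&nbsp;&nbsp;Outputs are stacked row-wise in $Y$, one row per input, so $n\_outputs = n\_inputs$  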

> ### Network  

> * $W^k$ is a weight matrix which maps $(k-1)$th layer to $k$th layer
> * $b^k$ is a bias vector at $k$th layer

> ### Processes

> * $C$ is a cost function, e.g. cross-entropy or mean squared error (MSE).
> * $\sigma(X)$ is the sigmoid function, applied element-wise to $X$  
> &nbsp;&nbsp;&nbsp;&nbsp; cf. $\sigma'(X)$ is the derivative of $\sigma(X)$
> * $z^l$ is the weighted sum of inputs to the $l$th layer (before the activation function is applied); size = $(n\_inputs)\ \times \ (n\_hidden\_units)$  
> &nbsp;&nbsp;&nbsp;&nbsp; cf. $z^L$ is the weighted sum at the final layer
> * $a^l$ is the activation of the $l$th layer, i.e. $z^l$ passed through the activation function: $a^l = \sigma(z^l)$; size( $z^l$ ) = size( $a^l$ )
> * $\delta^L$ is the final output error at $z^L$; i.e. $\frac{\partial(C)}{\partial(z^L)}$; size( $\delta^L$ ) = size( $Y$ )
> * $\delta^l$ is the error at the $l$th layer, i.e. $\frac{\partial C}{\partial z^l}$; size( $\delta^l$ ) = size( $a^l$ )  
>> Why are they called '**error**'? Short answer: this 'error' tells us how sensitive the cost is to the weighted input of each layer. This sensitivity matters because it tells us how much to change the weights and biases to reduce the cost. The bottom line is that the 'error' appears in the calculation of $\frac{\partial C}{\partial w}$, so it is worth making it explicit for later calculations. See [Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html)
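
The sizes listed above can be made concrete with a small forward pass. This is a minimal NumPy sketch, not the notes' own implementation: it assumes that $W^k$ has shape (units in layer $k-1$) $\times$ (units in layer $k$), so that the weighted sum is computed as $z^k = a^{k-1}W^k + b^k$ with inputs stacked row-wise. All sizes and variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Element-wise derivative of the sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
n_inputs, n_features, n_hidden_units, n_classes = 5, 4, 3, 2

X = rng.normal(size=(n_inputs, n_features))  # inputs stacked row-wise

# Assumed convention: W^k has shape (units in layer k-1, units in layer k)
W1 = rng.normal(size=(n_features, n_hidden_units))
b1 = np.zeros(n_hidden_units)
W2 = rng.normal(size=(n_hidden_units, n_classes))
b2 = np.zeros(n_classes)

z1 = X @ W1 + b1   # weighted sum, shape (n_inputs, n_hidden_units)
a1 = sigmoid(z1)   # activation, same shape as z1
z2 = a1 @ W2 + b2  # z^L, shape (n_inputs, n_classes)
a2 = sigmoid(z2)   # network output, same shape as Y

print(z1.shape, a1.shape, z2.shape, a2.shape)
```

The bias vectors are broadcast across rows, so every input shares the same $b^k$.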

## Goal

> Understand how much the network parameters ( $W$ and $b$ ) affect $C$, and calculate the proper amount for each parameter update (i.e. the derivatives of $C$ with respect to the parameters)  
> Calculate: $$\frac{\partial C}{\partial W}\ \text{and}\ \frac{\partial C}{\partial b}$$  
> Update $k$th weights and biases:
> $$W^k = W^k - \eta\frac{\partial C}{\partial W^k}$$
> $$b^k = b^k - \eta\frac{\partial C}{\partial b^k}$$
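
The update rule translates almost directly into code. A minimal sketch, assuming the gradients have already been computed; the helper name `gradient_descent_step` and all numeric values are hypothetical, and `eta` plays the role of the learning rate $\eta$.

```python
import numpy as np

def gradient_descent_step(params, grads, eta):
    """Return updated parameters: p <- p - eta * dC/dp for each parameter."""
    return [p - eta * g for p, g in zip(params, grads)]

# Hypothetical parameters and gradients for one layer
W = np.array([[0.5, -0.2],
              [0.1,  0.3]])
b = np.array([0.0, 0.1])
dC_dW = np.array([[ 0.2, 0.0],
                  [-0.4, 0.1]])
dC_db = np.array([0.05, -0.05])

W_new, b_new = gradient_descent_step([W, b], [dC_dW, dC_db], eta=0.1)
print(W_new)
print(b_new)
```

Returning new arrays rather than updating in place keeps the old parameters available for comparison; an in-place `W -= eta * dC_dW` would work equally well.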

## Fundamentals for Backpropagation

> ### BP rules (=Back-Propagation)  

> ### <p style="color:blue;font-weight:bold">BP 1: How much does a small change in $z^L$ affect $C$?</p>
> $$\delta^L = \frac{\partial C}{\partial z^L} = \frac{\partial C}{\partial a^L} \odot \sigma '(z^L)$$
> ### <p style="color:blue;font-weight:bold">BP 2: What's the relationship between $\delta^l$ and $\delta^{l+1}$?</p>
> $$\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma '(z^l)$$
> ### <p style="color:blue;font-weight:bold">BP 3: How much does the bias $b$ affect $C$?</p>
> $$\frac{\partial C}{\partial b^l} = \delta^l$$
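
The three rules can be checked numerically against finite differences. The sketch below is a minimal NumPy example under stated assumptions, not the notes' own implementation: a two-layer sigmoid network, a quadratic cost $C = \frac{1}{2}\sum (a^L - Y)^2$, and the row-wise convention $z^l = a^{l-1}W^l + b^l$. Under that convention the standard rules (as given in Nielsen, chapter 2) read $\delta^L = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L)$, $\delta^l = (\delta^{l+1}(W^{l+1})^T) \odot \sigma'(z^l)$, and $\frac{\partial C}{\partial b^l}$ is $\delta^l$ summed over the inputs (rows). All sizes and variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Element-wise derivative of the sigmoid."""
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(X, W1, b1, W2, b2):
    """Forward pass; returns z and a for both layers."""
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def cost(X, Y, W1, b1, W2, b2):
    """Assumed cost: quadratic, summed over inputs and classes."""
    a2 = forward(X, W1, b1, W2, b2)[-1]
    return 0.5 * np.sum((a2 - Y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 inputs, 4 features
Y = rng.uniform(size=(5, 2))  # 5 targets, 2 classes
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)

z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)

# BP 1: output error; for the quadratic cost, dC/da^L = a^L - Y
delta2 = (a2 - Y) * sigmoid_prime(z2)
# BP 2: propagate the error back one layer (row-wise form)
delta1 = (delta2 @ W2.T) * sigmoid_prime(z1)
# BP 3: bias gradient is the error, summed over the stacked inputs
dC_db2 = delta2.sum(axis=0)
dC_db1 = delta1.sum(axis=0)

# Central finite-difference estimate of dC/db1 for comparison
eps = 1e-6
num_db1 = np.zeros_like(b1)
for j in range(b1.size):
    step = np.zeros_like(b1)
    step[j] = eps
    num_db1[j] = (cost(X, Y, W1, b1 + step, W2, b2)
                  - cost(X, Y, W1, b1 - step, W2, b2)) / (2 * eps)

print(np.allclose(dC_db1, num_db1, atol=1e-6))
```

Because `dC_db1` depends on `delta1`, which in turn depends on `delta2`, agreement with the finite-difference estimate exercises all three rules at once.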
