### Notes

From http://neuralnetworksanddeeplearning.com/chap1.html

### Sigmoid Activation Function

$$ \sigma(z) \equiv {1 \over{(1 + e ^ {-z})}} $$
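As a minimal sketch, the sigmoid and its derivative can be written in plain Python. The function names `sigmoid` and `sigmoid_prime` are chosen here for illustration, and the derivative identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is a standard result rather than something stated in these notes.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-z)).

    Note: math.exp can overflow for very large negative z;
    this sketch does not guard against that.
    """
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_prime(0.0))  # 0.25
```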

### Layer Output

Where $x$ is the layer input, $w$ is this layer's weights, and $b$ is this layer's biases, the following is the layer output before the activation function is applied.

$$ \sum_j w_j x_j + b = w \cdot x + b $$

Where $a$ is the vector of the previous layer's activations and $a'$ is this layer's activations.

$$ a' = \sigma(w \cdot a + b) $$
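The layer computation above might be sketched as follows. This assumes the weights are stored as a list of rows, one row per neuron in this layer; the name `layer_forward` and the example values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(w, b, a):
    """Compute z = w . a + b and a' = sigmoid(z) for one layer.

    w: list of rows, one row of weights per neuron in this layer
    b: list of biases, one per neuron in this layer
    a: list of activations from the previous layer
    """
    z = [sum(w_jk * a_k for w_jk, a_k in zip(row, a)) + b_j
         for row, b_j in zip(w, b)]
    return z, [sigmoid(z_j) for z_j in z]

# Illustrative values only.
w = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, -1.0]
a = [1.0, 1.0]

z, a_next = layer_forward(w, b, a)
print(z)       # [0.0, 0.0]
print(a_next)  # [0.5, 0.5]
```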

### Cost

Where $w$ is all weights, $b$ is all biases, $C(w,b)$ is the always positive cost function.

Where $x$ represents a training input set, $n$ is the count of training input sets in this batch.

Where $y(x)$ is the expected output for input set $x$, the correct labels for the input data $x$.

Where $a$ is the output activations of the layer, the actual calculated output with $x$ as input.

Where $m$ is the count of elements in $y(x)$ and $a$.

$$ C(w, b) \equiv {1 \over{ 2 n }} \sum_x \mid\mid y(x) - a \mid\mid ^ 2 \equiv {1 \over{ 2 n }} \sum_x m * MSE(y(x), a) $$

Cost when training with one input at a time (on-line)...

$$ C(w, b) \equiv {1 \over{ 2 }} \mid\mid y(x) - a \mid\mid ^ 2 = {m \over{ 2 }} MSE(y(x), a) $$

Vector length here is denoted by operator $ \mid\mid ... \mid\mid $.

$$ \mid\mid v \mid\mid = \sqrt{\sum_{i=1}^m v_i ^ 2} = \sqrt{v_1 ^ 2 + v_2 ^ 2 + ... + v_m ^2 } $$

$$ \mid\mid v \mid\mid ^ 2 = \sqrt{v_1 ^ 2 + v_2 ^ 2 + ... + v_m ^2 }^2 = v_1 ^ 2 + v_2 ^ 2 + ... + v_m ^2 = \sum_{i=1}^m v_i ^ 2 = v \cdot v $$

For a single training input ($n = 1$), the cost is $m / 2$ times the mean squared error between the $ y(x) $ observations and the $a$ predictions. (Note that the $ \sqrt{...} $ from the vector length formula and the $ ...^2 $ from the cost function cancel each other out.)

$$ MSE(y(x), a) \equiv {1 \over{ m }} \sum_{i=1}^m {(y(x)_i - a_i)^2} \equiv {1 \over{ m }} ((y(x)_1 - a_1)^2 + (y(x)_2 - a_2)^2 +  ... + (y(x)_m - a_m)^2) $$

or

$$ m * MSE(y(x), a) \equiv \sum_{i=1}^m {(y(x)_i - a_i)^2} \equiv ((y(x)_1 - a_1)^2 + (y(x)_2 - a_2)^2 +  ... + (y(x)_m - a_m)^2) $$
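The relationship between the quadratic cost and the MSE can be checked numerically. The sketch below uses hypothetical helper names (`squared_norm`, `mse`, `quadratic_cost`) and made-up expected outputs and activations.

```python
def squared_norm(v):
    """||v||^2 = v . v = sum of squared components."""
    return sum(v_i * v_i for v_i in v)

def mse(y, a):
    """Mean squared error between expected y and predicted a."""
    return squared_norm([y_i - a_i for y_i, a_i in zip(y, a)]) / len(y)

def quadratic_cost(ys, activations):
    """C = 1/(2n) * sum over inputs of ||y(x) - a||^2."""
    n = len(ys)
    total = sum(squared_norm([y_i - a_i for y_i, a_i in zip(y, a)])
                for y, a in zip(ys, activations))
    return total / (2 * n)

# Two training inputs (n = 2), each with m = 2 output elements.
ys = [[1.0, 0.0], [0.0, 1.0]]
activations = [[0.5, 0.5], [0.0, 1.0]]

print(quadratic_cost(ys, activations))     # 0.125
print(2 * mse(ys[0], activations[0]))      # 0.5, same as ||y - a||^2 for the first input
```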

### Gradient Descent

When $y(x)$ and $a$ are close, cost is low. The goal is to minimize $C(w,b)$, i.e. to find the smallest value of $C(w,b)$ by changing weights $w$ and biases $b$. This trains the model to predict $a$ closer to our expected $y(x)$.


Where $ \nabla C $ is the gradient vector.

Where $ v $ represents input to $ C $ and $ \Delta v $ represents change in inputs to $ C $.

Where $ \eta $ is a small positive value for the learning rate.

Where $ \Delta C $ is the change in value of the cost function.

$$ \Delta C \approx \nabla C \cdot \Delta v = - \eta \nabla C \cdot \nabla C = - \eta \mid\mid \nabla C \mid\mid ^2 $$

... because ...

$$ \Delta v = \acute{v} - v = - \eta \nabla C $$

Small fixed $ \epsilon $ is the constrained size of the move.

$$ \mid\mid\Delta v\mid\mid  = \epsilon $$

Where $ k $ indexes the weights and $ l $ indexes the biases in the network.

$$ v \rightarrow \acute{v} = v - \eta \nabla C $$

$$ w_k \rightarrow \acute{w_k} = w_k - \eta { \partial C \over{\partial w_k}} $$

$$ b_l \rightarrow \acute{b_l} = b_l - \eta { \partial C \over{\partial b_l}} $$
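A single gradient descent step can be illustrated on a toy cost. The cost $C(v) = v_1^2 + v_2^2$ used below is an assumption chosen because its gradient is easy to write by hand; it is not the network cost from these notes.

```python
def cost(v):
    """Toy cost C(v) = sum of v_i^2 (an assumption for illustration)."""
    return sum(x * x for x in v)

def gradient(v):
    """Gradient of the toy cost: dC/dv_i = 2 * v_i."""
    return [2.0 * x for x in v]

def gradient_step(v, eta):
    """One update v -> v' = v - eta * grad C."""
    return [x - eta * g for x, g in zip(v, gradient(v))]

v = [1.0, 2.0]
eta = 0.1
v_new = gradient_step(v, eta)

actual_change = cost(v_new) - cost(v)
# First-order prediction: delta C ~ -eta * ||grad C||^2
predicted_change = -eta * sum(g * g for g in gradient(v))
print(actual_change, predicted_change)
```

Here the actual change is about -1.8 while the first-order prediction is about -2.0; the two get closer as $\eta$ gets smaller.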

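For a mini-batch of $ m $ training inputs $ X_1, X_2, \ldots, X_m $, where $ X_j $ is the $j$th input in the batch and $ C_{X_j} $ is the cost for that input alone (note that $ m $ here is the batch size, not the output vector length used in the Cost section)...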

$$ w_k \rightarrow \acute{w_k} = w_k - {\eta \over{m}} \sum_j { \partial C_{X_j} \over{\partial w_k}} $$

$$ b_l \rightarrow \acute{b_l} = b_l - {\eta \over{m}} \sum_j { \partial C_{X_j} \over{\partial b_l}} $$
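The mini-batch update averages the per-input gradients before taking a step. As a rough sketch, the example below uses a hypothetical one-weight linear model $a = w x$ with per-input cost $C_x = (y - w x)^2 / 2$, which is simpler than the sigmoid network in these notes.

```python
def grad_w(w, x, y):
    """dC_x/dw for the per-input cost C_x = (y - w*x)^2 / 2."""
    return -(y - w * x) * x

def minibatch_step(w, batch, eta):
    """w -> w - (eta / m) * sum_j dC_{X_j}/dw, with m = len(batch)."""
    m = len(batch)
    total = sum(grad_w(w, x, y) for x, y in batch)
    return w - (eta / m) * total

# (input, expected output) pairs; the data follows y = 2x.
batch = [(1.0, 2.0), (2.0, 4.0)]

w = 0.0
for _ in range(50):
    w = minibatch_step(w, batch, eta=0.1)
print(w)  # approaches 2.0
```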

From here on, $ l $ denotes a layer and $ j $ indicates the $j$th neuron in layer $ l $. Here $\Delta z{^l}{_j}$ is a small change to the weighted input $ z $ of neuron $j$ in layer $l$, so that the neuron now outputs...

$$ \sigma(z{_j^l} + \Delta z{_j^l}) $$

instead of

$$ \sigma(z{_j^l}) $$

Error in the neuron $ j $ in layer $ l $ is denoted by $ \delta{_j}{^l} $. It measures how much the cost changes when that neuron's weighted input changes.

$$ \delta{_j^l} \equiv { \partial C \over{\partial z{_j^l}}}  $$

### Error in the Output Layer:

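The error in the *output layer* is $\delta^L$. Here $\sigma'(z{_j^L})$ represents how fast the activation function output is changing at $z{_j^L}$, and $\partial C / \partial a{_j^L}$ represents how fast the cost is changing as a function of the $j$th output activation.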

$$ \delta{_j^L} = { \partial C \over{\partial a{_j^L}}} \sigma'(z{_j^L}) $$

In vector form for the whole output layer $ L $, where $ \odot $ is the elementwise (Hadamard) product.

$$ \delta{^L} = \nabla_a C \odot \sigma'(z{^L}) $$
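The output-layer error might be computed as below. This sketch assumes the single-input quadratic cost $C = {1 \over 2} \mid\mid y - a \mid\mid^2$, for which $\nabla_a C = a - y$; the function name `output_error` and the values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_error(z_L, y):
    """delta^L = grad_a C (elementwise*) sigmoid'(z^L).

    Assumes the quadratic cost, so grad_a C = a - y.
    """
    a_L = [sigmoid(z_j) for z_j in z_L]
    return [(a_j - y_j) * sigmoid_prime(z_j)
            for a_j, y_j, z_j in zip(a_L, y, z_L)]

z_L = [0.0, 0.0]  # weighted inputs to the output layer
y = [1.0, 0.0]    # expected output
print(output_error(z_L, y))  # [-0.125, 0.125]
```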

### Error in Hidden Layers:

The error in layer $ l $ in terms of the error in the next layer, $ l+1 $. Note that $ T $ indicates the transpose operation.

$$ \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) $$
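Propagating the error back one layer could look like the following sketch. It assumes `w_next` stores one row per neuron in layer $l+1$ and one column per neuron in layer $l$; the name `hidden_error` and the numbers are hypothetical.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def hidden_error(w_next, delta_next, z):
    """delta^l = ((w^{l+1})^T delta^{l+1}) (elementwise*) sigmoid'(z^l).

    w_next[j][k] is the weight from neuron k in layer l
    to neuron j in layer l+1.
    """
    # (w^T delta)_k = sum over j of w_next[j][k] * delta_next[j]
    back = [sum(w_next[j][k] * delta_next[j] for j in range(len(delta_next)))
            for k in range(len(z))]
    return [b_k * sigmoid_prime(z_k) for b_k, z_k in zip(back, z)]

w_next = [[1.0, 2.0], [3.0, 4.0]]
delta_next = [1.0, -1.0]
z = [0.0, 0.0]
print(hidden_error(w_next, delta_next, z))  # [-0.5, -0.5]
```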

### Rate of Change of Error With Respect to Weights and Biases:

$$ { \partial C \over{\partial b{_j^l}}} = \delta{_j^l} ~~~~~ \text{and} ~~~~~ { \partial C \over{\partial w{_{jk}^l}}} = a{_k^{l-1}} \delta{_j^l} $$

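Which gives the less index-heavy forms, where $ a_{in} $ is the activation of the neuron feeding into weight $ w $ and $ \delta_{out} $ is the error of the neuron that weight feeds into.

$$ { \partial C \over{\partial b}} = \delta ~~~~~ \text{and} ~~~~~ { \partial C \over{\partial w}} = a_{in} \delta_{out} $$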
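Given a layer's error and the previous layer's activations, the gradients follow directly. In this sketch the weight gradient is laid out as one row per neuron $j$ and one column per input $k$, which is an assumed storage convention; `layer_gradients` is a hypothetical name.

```python
def layer_gradients(delta, a_prev):
    """Return (dC/db, dC/dw) for one layer.

    dC/db_j  = delta_j
    dC/dw_jk = a_prev_k * delta_j
    """
    grad_b = list(delta)
    grad_w = [[a_k * d_j for a_k in a_prev] for d_j in delta]
    return grad_b, grad_w

delta = [0.5, -1.0]         # error of the two neurons in this layer
a_prev = [1.0, 2.0, 3.0]    # activations of the previous layer

grad_b, grad_w = layer_gradients(delta, a_prev)
print(grad_b)  # [0.5, -1.0]
print(grad_w)  # [[0.5, 1.0, 1.5], [-1.0, -2.0, -3.0]]
```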
