# Training a neural network

### Introduction

As we know, a neural network takes a set of inputs, passes them through layers of linear functions and activation functions, and ends with a classifier (like a softmax function).

This is our hypothesis function.  In this lesson we'll see how we train this neural network.

### Forward then Backward

The process of training a neural network is similar to that of training any other machine learning algorithm.  We use a prediction function to make predictions on our training data, and then we correct the parameters based on how far off those predictions are.

With something like a logistic regression function, this means that we change our parameters to minimize a cost function.  With a neural network, it's the same thing.  

As we saw in logistic regression, we can look to gradient descent to see how to update our parameters.  

That is, we see how much a change in any of the parameters affects the cost function, and our algorithm alters the parameters proportionate to their impact in reducing our cost function.
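As a minimal sketch of that update rule, the following uses a made-up one-parameter cost function (not the network's real cost) whose minimum sits at `theta = 3`. The names `cost`, `cost_gradient`, `gradient_descent`, and the learning rate of `0.1` are all illustrative assumptions.

```python
def cost(theta):
    # Hypothetical one-parameter cost, smallest when theta equals 3.
    return (theta - 3) ** 2


def cost_gradient(theta):
    # Derivative of the hypothetical cost with respect to theta.
    return 2 * (theta - 3)


def gradient_descent(theta, learning_rate=0.1, steps=100):
    for _ in range(steps):
        # Step against the gradient: a larger gradient means a larger change.
        theta = theta - learning_rate * cost_gradient(theta)
    return theta


theta_final = gradient_descent(theta=0.0)
print(theta_final, cost(theta_final))
```

Each step moves the parameter in the direction that lowers the cost, by an amount proportional to how strongly the cost responds to that parameter.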

Now, because our neural network consists of layers of functions, with each layer dependent on the one that preceded it, discovering the impact of changing one of the entries in a weight matrix can appear a little more complicated than with simple linear regression.

<img src="./multilayer.png" width="60%">

After all, how can our algorithm know how much to alter a parameter back in layer two?  To do so, it would need to calculate the impact that this parameter has down the chain on the hypothesis, and through the hypothesis, on the cost function.  So how do we calculate this?

The key, perhaps, is to rewrite our network as a composition of functions, where $\theta$ stands for the parameters of layer 2:

$J(h(L_4(L_3(L_2(x; \theta)))))$

Note that we can restate our question as asking for the following:

$\frac{\delta J}{\delta \theta}$

That is, we want to discover the change in the cost as we change the parameters of layer 2.  And we *can* in fact discover this: it's just the chain rule.
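Spelled out in the same notation, and treating each layer's output as an intermediate variable, the chain rule gives one factor per link in the chain. This is a sketch that assumes $\theta$ only enters through layer 2:

$\frac{\delta J}{\delta \theta} = \frac{\delta J}{\delta h}\frac{\delta h}{\delta L_4}\frac{\delta L_4}{\delta L_3}\frac{\delta L_3}{\delta L_2}\frac{\delta L_2}{\delta \theta}$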

### A concrete example

Let's think about what our neural network looks like in practice.  Generally, our cost function is something like the following: 

$J = (y - \hat{y})^2$

Our prediction $\hat{y}$ is the output of the final layer, which we can write as:

$a_{last} = \sigma(x*w + b)$

So our cost formula is really something like:

$J = (y - \hat{y})^2 = (y - \sigma(x*w + b))^2$

Notice that $x$ represents the inputs to the layer.  For the first layer, this would just be our data.  But for any other layer, these inputs come from the output of the previous layer.  So we can say that the output of any layer is:

$\sigma(a^{L - 1}*w + b)$

Here $a^{L-1}$ is the output of the previous layer.  So we can rewrite our cost formula as: 

$J = (y - \sigma(a^{L -1}*w + b))^2$
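To make the "output of one layer is the input to the next" idea concrete, here is a minimal sketch of a forward pass through two layers. The input `x`, the weights `w_1` and `w_2`, the biases, and the helper names `sigmoid` and `layer_output` are all made-up values and names for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def layer_output(a_prev, w, b):
    # sigma(a^{L-1} * w + b): the previous layer's output feeds this layer.
    return sigmoid(a_prev @ w + b)


# Hypothetical data and parameters, chosen only for illustration.
x = np.array([1.0, 2.0])                     # input to the first layer
w_1 = np.array([[0.1, -0.2], [0.3, 0.4]])    # 2 inputs -> 2 units
b_1 = np.array([0.0, 0.1])
w_2 = np.array([0.5, -0.5])                  # 2 units -> 1 output
b_2 = 0.2

a_1 = layer_output(x, w_1, b_1)    # first layer sees the raw data
a_2 = layer_output(a_1, w_2, b_2)  # last layer sees the first layer's output

y = 1.0
J = (y - a_2) ** 2                 # cost, with the last layer's output as y-hat
print(a_1, a_2, J)
```

Changing any entry of `w_1` changes `a_1`, which changes `a_2`, which changes `J`. That dependency chain is what the rest of the lesson untangles.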

### Revisualizing our problem

So when we think about changing our weights, we could visualize it as the following:

<img src="./chained.png" width="30%">

Here, a change to the parameters of a layer trickles down to influence our cost function.  We could also see this in code.

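In [1]:
import numpy as np

def cost():
    # Squared error between the true label and the prediction.
    actual = 1
    return (actual - prediction())**2

In [6]:
def prediction():
    # Sigmoid activation applied to z.
    return 1/(1 + np.exp(-z()))

In [9]:
def z():
    # Weighted sum of the previous layer's outputs, plus the bias.
    a_l_1, a_l_2 = 1, 2
    w_1, w_2, b = 3, 4, 5
    return a_l_1*w_1 + a_l_2*w_2 + b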

So when we calculate the gradient, the goal is to see how much a parameter like `w_1` would affect our overall cost function.  Mathematically, this means asking: what is $\frac{\delta C}{\delta w}$, where $C$ is the cost (written $J$ above)?  In other words, how does $C$ change as we nudge $w$?

$C(a^L(z(w)))$
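One way to get a feel for this question is to literally nudge a weight and watch the cost. The sketch below does that with a central finite difference. The parameter values are made up (smaller than the ones in the cells above, so the sigmoid is not saturated), and the single `cost` function here folds the weighted sum, the sigmoid, and the squared error together for convenience.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(w_1, w_2=-0.4, b=0.1, a_l_1=1.0, a_l_2=2.0, actual=1.0):
    # Hypothetical values; the cost is computed as C(a(z(w))).
    z = a_l_1 * w_1 + a_l_2 * w_2 + b
    prediction = sigmoid(z)
    return (actual - prediction) ** 2


w_1 = 0.3
nudge = 1e-6

# Change in cost per unit change in w_1, estimated by nudging both ways.
approx_gradient = (cost(w_1 + nudge) - cost(w_1 - nudge)) / (2 * nudge)
print(approx_gradient)
```

With these values the estimate is negative, meaning that increasing `w_1` lowers the cost. Nudging every weight this way would be slow for a real network, which is why an exact formula is worth deriving.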

To calculate $\frac{\delta C}{\delta w}$, we can break the problem into three questions.  How much does:

1. $z$ change as $w$ changes
2. $a^L$ change as $z$ changes
3. $C$ change as $a^L$ changes

So $\frac{\delta C}{\delta w} = \frac{\delta z}{\delta w}\frac{\delta a^L}{\delta z}\frac{\delta C}{\delta a^L}$

<img src="./chained.png" width="30%">

So we saw that to calculate $\frac{\delta C}{\delta w}$, we simply apply the chain rule.

$\frac{\delta C}{\delta w} = \frac{\delta z}{\delta w}\frac{\delta a^L}{\delta z}\frac{\delta C}{\delta a^L}$

Now let's see how to solve this given the definitions of $z$, $a^L$, and $C$, which are:

* $z = (a^{L -1}*w + b)$
* $a^L = \sigma(z)$
* $C = (y - a^L)^2$

<img src="./backprop-slide.png" width="50%">

So $\frac{\delta z}{\delta w} = a^{L - 1}$

$\frac{\delta a^L}{\delta z} = \sigma'(z)$

$\frac{\delta C}{\delta a^L} = -2(y - a^L) = 2(a^L - y)$

Multiplying these together:

$\frac{\delta C}{\delta w} = a^{L - 1}*\sigma'(z)*2(a^L - y)$
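As a check on the derivation, the sketch below computes the three local derivatives for a single weight and multiplies them, then compares the result against a numerical estimate. The values of `a_prev`, `w`, `b`, and `y` are made up. It uses the standard identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, and since $C = (y - a^L)^2$, differentiating with respect to $a^L$ gives $-2(y - a^L)$.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Hypothetical values for one weight feeding one output unit.
a_prev, w, b, y = 2.0, 0.5, -0.3, 1.0

# Forward pass: z, then a^L.
z = a_prev * w + b
a = sigmoid(z)

# The three local derivatives from the chain rule.
dz_dw = a_prev            # z = a^{L-1} * w + b
da_dz = a * (1 - a)       # sigma'(z) = sigma(z) * (1 - sigma(z))
dC_da = -2 * (y - a)      # C = (y - a^L)^2

dC_dw = dz_dw * da_dz * dC_da


def cost(w_value):
    return (y - sigmoid(a_prev * w_value + b)) ** 2


# Numerical estimate of the same derivative, for comparison.
eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
print(dC_dw, numeric)
```

The two numbers should agree to several decimal places, which is a useful sanity check whenever a gradient is derived by hand.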

### Resources

[Backpropagation calculus - 3blue1brown](https://www.youtube.com/watch?v=tIeHLnjs5U8)

[backprop](https://www.youtube.com/watch?v=ZIYOSdioBCE)

