# Taking Stock

### Introduction

At this point, we have learned a lot of the content needed to train a single neuron.  As we'll see, our technique of using gradient descent to find the parameters of a hypothesis function is a large part of how we'll train an entire neural network. 

Let's take a moment to recap what we've learned.

### Hypothesis Function

We saw that we can think of the hypothesis function for our neuron as two components: a linear component and an activation function.  We can even think of these as occurring in layers, where we first feed our inputs into the linear component $z(x)$, which produces an output that is then fed into the activation function:

* $z(x) = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$
* $a(z) = \frac{1}{1 + e^{-z}}$

Our linear function $z(x)$ produces an output between negative and positive infinity, and our activation function, the sigmoid function, maps this to a value between 0 and 1.  The output from our activation function expresses a degree of confidence in the prediction.  For example, the closer the output is to 0, the more confident the hypothesis function is that the observation has a value of 0, and the closer the output is to 1, the more confident it is that the observation has a value of 1.
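As a minimal sketch of these two components in code, the block below composes a linear function with the sigmoid.  The function names (`linear`, `sigmoid`, `neuron`) and the weights, bias, and inputs are hypothetical, chosen only for illustration.

```python
import math

def linear(xs, ws, b):
    """Linear component z(x) = w_1*x_1 + ... + w_n*x_n + b."""
    return sum(w * x for w, x in zip(ws, xs)) + b

def sigmoid(z):
    """Activation a(z) = 1 / (1 + e^(-z)), which maps any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def neuron(xs, ws, b):
    """Hypothesis function: feed the linear output into the activation."""
    return sigmoid(linear(xs, ws, b))

# Hypothetical weights, bias, and inputs, chosen only for illustration.
weights = [0.5, -1.0]
bias = 0.25
observation = [2.0, 3.0]

z = linear(observation, weights, bias)   # 0.5*2 - 1.0*3 + 0.25 = -1.75
a = neuron(observation, weights, bias)   # below 0.5, so the neuron leans toward 0
```

Because $z$ is negative here, the sigmoid output falls below 0.5, so this neuron leans toward predicting 0 for the observation.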


### Training Our Hypothesis Function

We then moved on to training our hypothesis function for a neuron.  That is, we spoke about the procedure for finding the parameters of our hypothesis function.

This procedure was gradient descent, which we described as initializing a parameter at a random value, and then repeatedly updating the parameter according to the formula:

$\theta_{next} = \theta_{current} -\eta*slope\_at(\theta_{current})$

Or in code:

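```python
# sse_slope(weight) is assumed to come from an earlier lesson: it returns
# the slope of the sum-of-squared-errors cost curve at `weight`.
weight = -2
learning_rate = 0.01
for _ in range(10):
    # step against the slope, scaled by the learning rate
    weight = weight - learning_rate * sse_slope(weight)
weight
```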
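To run the loop above on its own, we need some definition of `sse_slope`.  The sketch below supplies one under stated assumptions: the data (`xs`, `ys`), the one-parameter model $\hat{y} = weight * x$, and the helper `sse` are all hypothetical and may differ from the versions used in earlier lessons.

```python
# Hypothetical data generated by y = 3x, so the best weight is 3.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

def sse(weight):
    """Sum of squared errors for the one-parameter model y_hat = weight * x."""
    return sum((y - weight * x) ** 2 for x, y in zip(xs, ys))

def sse_slope(weight):
    """Derivative of sse with respect to weight: sum of -2 * x * (y - weight * x)."""
    return sum(-2 * x * (y - weight * x) for x, y in zip(xs, ys))

weight = -2
learning_rate = 0.01
costs = [sse(weight)]            # track the cost so we can watch it fall
for _ in range(10):
    weight = weight - learning_rate * sse_slope(weight)
    costs.append(sse(weight))
```

With this data, each update moves `weight` from its starting value of -2 toward 3, and every entry in `costs` is smaller than the one before it.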

### Calculating the Rate of Change

We then described how we can calculate this slope.  The slope of a function at a specific value is equal to the derivative of the function at that value.  The derivative of our function is the instantaneous rate of change of our function, or:

$\frac{\delta y}{\delta x} = \lim_{\delta x \to 0}\frac{y_1 - y_0}{x_1 - x_0}$, where $\delta x = x_1 - x_0$.

When we use the derivative in gradient descent, we are calculating the instantaneous rate of change in our cost function with respect to our parameter.  In other words, as we nudge a parameter, how much does our cost function change?  If our cost function changes a lot, then we are likely not close to the minimum and should move the parameter a lot to descend along the cost curve.
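We can see the "nudge" idea numerically by computing $\frac{y_1 - y_0}{x_1 - x_0}$ for smaller and smaller nudges.  This is only a sketch: the cost function `cost` and the helper `rate_of_change` are hypothetical, not the cost function from earlier lessons.

```python
def cost(theta):
    """Hypothetical cost curve with its minimum at theta = 3."""
    return (theta - 3) ** 2

def rate_of_change(f, x0, delta_x):
    """Slope between x0 and x0 + delta_x: (y_1 - y_0) / (x_1 - x_0)."""
    x1 = x0 + delta_x
    return (f(x1) - f(x0)) / (x1 - x0)

# Shrinking delta_x moves the estimate toward the exact derivative 2 * (theta - 3).
estimates_far = [rate_of_change(cost, -2, d) for d in (1.0, 0.1, 0.001)]

# The same calculation near the minimum gives a much smaller slope.
estimate_near = rate_of_change(cost, 2.9, 0.001)
```

Far from the minimum (at $\theta = -2$) the estimates approach a slope of -10, while close to the minimum (at $\theta = 2.9$) the slope is much smaller in size, which is why gradient descent takes smaller steps as it nears the bottom of the cost curve.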

We updated our gradient descent formula to use derivative notation, letting $J$ equal our cost function, so that:

$\theta_{next} = \theta_{current} -\eta*J'(\theta_{current})$

or 

$\theta_{next} = \theta_{current} -\eta*\frac{\delta J}{\delta \theta}(\theta_{current})$

Here $\theta$ represents any parameter (in previous lessons, a weight $w$), and $\eta$ is the learning rate, which is just a small number like .01.

### Moving to Two Parameters

After seeing how we could use gradient descent to find the parameter of a hypothesis function with only one parameter, we then moved on to using gradient descent with two parameters.

With two parameters, our cost curve is now in three dimensions, and we need to work out how much to update each parameter.  We did this by finding the partial derivative with respect to each parameter.  In other words, we calculate the instantaneous rate of change in the cost function with respect to one parameter while holding all others constant, and then with respect to the next parameter while holding all others constant.

We update each parameter in proportion to how much changing it reduces the output of the cost function.  That is, we update our parameters so that we get the most bang for our buck.

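```python
# paired_data, dj_dw, and dj_db are assumed to come from earlier lessons:
# paired_data holds (x, y) observations, and dj_dw / dj_db return the partial
# derivatives of the cost with respect to w and b.
w = 0.5
b = 0.5
eta = 0.01

for _ in range(150):
    for x, y in paired_data:
        # compute both partial derivatives before updating either parameter
        dj_dw_calc = dj_dw(w, b, x, y)
        dj_db_calc = dj_db(w, b, x, y)
        w -= eta * dj_dw_calc
        b -= eta * dj_db_calc
```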

By following that procedure, we saw how we can find the parameters that produce the lowest output of our cost function.  
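The two-parameter loop above depends on `paired_data`, `dj_dw`, and `dj_db`.  As a self-contained sketch, the block below fills these in under assumptions that may not match earlier lessons: a linear hypothesis $wx + b$, a squared-error cost for each observation, and made-up data.

```python
# Hypothetical observations generated by y = 2x + 1.
paired_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

def dj_dw(w, b, x, y):
    """Partial derivative of (y - (w*x + b))**2 with respect to w, holding b constant."""
    return -2 * x * (y - (w * x + b))

def dj_db(w, b, x, y):
    """Partial derivative of (y - (w*x + b))**2 with respect to b, holding w constant."""
    return -2 * (y - (w * x + b))

w = 0.5
b = 0.5
eta = 0.01

for _ in range(150):
    for x, y in paired_data:
        # compute both partial derivatives before updating either parameter
        dj_dw_calc = dj_dw(w, b, x, y)
        dj_db_calc = dj_db(w, b, x, y)
        w -= eta * dj_dw_calc
        b -= eta * dj_db_calc
```

Since the made-up data lies exactly on the line $y = 2x + 1$, `w` and `b` should end up close to 2 and 1.  Note that the parameters are updated after each observation rather than once per pass over the data.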

### Moving Forward

So over the last several lessons, we learned about both our hypothesis function and the training procedure for a single neuron.  There is one component that we left out: the chain rule.  So, we'll close by learning how the chain rule allows us to find the parameters for an entire neural network, through something called back propagation.
