# Understanding what's next

### Introduction

In the last lesson, we saw the PyTorch code needed to train a neural network that can determine the labels of images of handwritten digits.  But what we have focused on so far is the *hypothesis function* of a neural network.  In other words, we focused on the construction of our different layers, and saw how we could pass the features of an observation through them to ultimately get a prediction.  

But we have yet to learn some of the crucial ways that a neural network learns its parameter values -- the numbers in our linear layers -- that determine the network's predictions.  In other words, we still have more to learn about what occurs in training a neural network.  

Now we have learned a bit about training a neural network, but it's a bit more complicated than we've seen so far.  In this lesson, we'll review what we've learned so far about how a neural network trains through gradient descent, and then we'll see what makes gradient descent so challenging when working with a neural network with many parameters, and many linear layers. 

## Loading the Data

Now so far we have learned how to train a single neuron that has a single parameter.  Let's jog our memory by returning to our example of trying to predict if a cell is cancerous.  We can begin by loading up our data.

In [3]:
import pandas as pd
url = 'https://raw.githubusercontent.com/jigsawlabs-student/pytorch-intro-curriculum/main/5-training-mathematically/cell_data-2.csv'
df = pd.read_csv(url, index_col = 0)
df[:3]

Unnamed: 0,mean_area,is_cancerous
0,1.001,1
1,1.326,1
2,1.203,1


And then convert this into our PyTorch tensors.

In [4]:
import torch
mean_area = df[['mean_area']].values # select just the mean area data

X_tensor = torch.tensor(mean_area).float() # convert to tensors
X_tensor[:2]

tensor([[1.0010],
        [1.3260]])

In [5]:
X_tensor.shape

torch.Size([569, 1])

And then let's do the same for our target data.

In [6]:
y_tensor = torch.tensor(df['is_cancerous']).float()
y_tensor[:2]

tensor([1., 1.])

## Reviewing the hypothesis function

So we'll want to feed this data into a neuron that can predict whether each slide has cells that are cancerous or not.  How do we do this?  Well, we'll start with our linear function and feed the outputs from this linear function into an activation function.

* $z(mean\_area) = w_1*mean\_area $
* $a(z) =  \frac{1}{1 + e^{-z}} $

So above, we multiply the mean area by a weight, $w_1$, and pass this output to our sigmoid function, which gives us a value between 0 and 1.  The closer the output is to one, the more confident the prediction of cancer.  So now, we can define our neuron with the following:

In [41]:
W = torch.tensor([[1.]], requires_grad = True)

def linear_layer(X):
    z = X @ W
    return z
    
def activation_layer(z):
    a = torch.sigmoid(z)
    return a

So in the `linear_layer` above, we created a weight matrix `W` that has a single neuron (one column) with a single weight (one row), whose initial value we arbitrarily set to the number `1`.  After initially setting this value, we'll need to learn the value that produces predictions that best match the labels in the training data. 

Before doing this training let's remember how this works.

> We can select a single observation.

In [42]:
first_obs = X_tensor[:1]

first_obs

tensor([[1.0010]])

> Pass it through the linear layer.

In [43]:
z = linear_layer(first_obs) # z = X @ w
z

tensor([[1.0010]], grad_fn=<MmBackward>)

And then pass this output through the activation layer to get a prediction.

In [44]:
activation_layer(z)

tensor([[0.7313]], grad_fn=<SigmoidBackward>)

Which we can then compare with the actual value.

In [45]:
y_tensor[:1]

tensor([1.])
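As a sanity check on that `0.7313` prediction, here is a minimal sketch that redoes the two steps with plain Python instead of tensors.  The helper name `sigmoid` is just an illustrative choice, and the numbers are the initial weight and the first `mean_area` value from above.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w_1 = 1.0          # the initial weight used in this lesson
mean_area = 1.001  # the first observation from the data above

z = w_1 * mean_area   # linear step: weight times feature
a = sigmoid(z)        # activation step: turn z into a value between 0 and 1

print(round(a, 4))    # 0.7313
```

So the tensor code is doing nothing more mysterious than a multiplication followed by the sigmoid formula.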

## Evaluating the predictions

Ok, so now let's review how we perform training. Above we made our prediction with an arbitrary initial parameter value of $W = [[1]]$.  Our next step is to see if that parameter produces predictions that are anywhere close to the actual labels.  And then from there we update the parameter value.

So our next step is to check our predictions against the actual labels in the training data.  Our predictions look like the following:

In [49]:
y_hats = activation_layer(linear_layer(X_tensor)) # predictions for every observation

y_hats.shape # torch.Size([569, 1])

y_hats[:2]

tensor([[0.7313],
        [0.7902]], grad_fn=<SliceBackward>)

But our actual labels take on a different shape.

In [50]:
y_tensor.shape # torch.Size([569])

y_tensor[:2]

tensor([1., 1.])

So let's transform the data so that it takes the same shape, and then we can perform the subtraction.

In [51]:
y_actual = y_tensor.view(-1, 1)

y_actual.shape # torch.Size([569, 1])

y_actual[:2]

tensor([[1.],
        [1.]])
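To see why matching the shapes matters, here is a small sketch with made-up stand-in tensors (five observations rather than 569).  Subtracting a tensor of shape `(5, 1)` from one of shape `(5,)` does not raise an error.  Instead, broadcasting quietly produces a `(5, 5)` grid of differences, which is not what we want.

```python
import torch

y_hats = torch.rand(5, 1)   # made-up predictions, shape (5, 1)
y = torch.ones(5)           # made-up labels, shape (5,)

wrong = y - y_hats               # broadcasts every label against every prediction
right = y.view(-1, 1) - y_hats   # one error per observation

print(wrong.shape)   # torch.Size([5, 5])
print(right.shape)   # torch.Size([5, 1])
```

Reshaping the labels with `view(-1, 1)` first keeps the result at one error per observation.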

Ok, so now what's pretty cool is that we can calculate $e_i = (y\_actual - y\_hat)$ to get a vector of errors at each position.

In [52]:
errors = (y_actual - y_hats)

errors[:2]

tensor([[0.2687],
        [0.2098]], grad_fn=<SliceBackward>)

Then we square each of these errors, and take the sum: $SSE = \sum  (y\_actual - y\_hat)^2 $.

In [53]:
squared_errors = (y_actual - y_hats)**2

sse = torch.sum(squared_errors)

In [54]:
sse

tensor(152.0503, grad_fn=<SumBackward0>)

> So this is how we can calculate how well our initial parameter value of $w_1 = 1$ performs against the entire training set.

To convert this to a mean squared error, we can simply divide by the number of observations like so.

In [55]:
squared_errors = (y_actual - y_hats)**2
n = len(y_actual) # 569 observations

mse = torch.sum(squared_errors)/n

mse # mse = sse/n => average squared error

tensor(0.2672, grad_fn=<DivBackward0>)
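If it helps, the hand-rolled calculation can be checked against PyTorch's built-in `torch.nn.functional.mse_loss`.  This is only a sketch on three made-up predictions and labels, not the lesson's data, but the two numbers should agree.

```python
import torch
import torch.nn.functional as F

y_actual = torch.tensor([[1.], [0.], [1.]])    # made-up labels
y_hats = torch.tensor([[0.7], [0.4], [0.9]])   # made-up predictions

# by hand: sum of squared errors divided by the number of observations
manual_mse = torch.sum((y_actual - y_hats) ** 2) / len(y_actual)

# built in: the default reduction is 'mean', the average squared error
builtin_mse = F.mse_loss(y_hats, y_actual)

print(manual_mse, builtin_mse)
```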

### Adding some Math

Now, we would probably like to keep moving through this review, but things will soon become difficult for us, if we do not get some math notation out of the way.  Let's go through some notation for better expressing our hypothesis function and our cost function.

1. The hypothesis function

Ok, so we just defined our hypothesis function as the linear layer followed by our activation layer.

In [62]:
x = X_tensor[0]

z = linear_layer(x)
activation_layer(z)

tensor([0.7313], grad_fn=<SigmoidBackward>)

And if we want we can package both of these functions up into a hypothesis function `h(x)`.

In [63]:
def h(x):
    z = linear_layer(x)
    return activation_layer(z)

In [64]:
h(x)

tensor([0.7313], grad_fn=<SigmoidBackward>)

The function $h(x)$ is how we represent our function mathematically as well.  And if we want, we can express our hypothesis function as:

> $h(x) = \sigma(z(x))$

Read the function above from the inside out.  So we pass $x$ to our linear function $z$, and we pass this output to our activation function $\sigma$.

Finally, to make this even more explicit, we may add a little letter $i$ underneath the $x$:

$h(x_i) = \sigma(z(x_i))$

This signifies that we are passing through a single observation $x$ into our hypothesis function, and thus are getting a single output.  This is as opposed to passing through all of the observations at once, to get back a list of outputs.

In [66]:
h(X_tensor)[:10]

tensor([[0.7313],
        [0.7902],
        [0.7691],
        [0.5953],
        [0.7853],
        [0.6171],
        [0.7388],
        [0.6406],
        [0.6271],
        [0.6168]], grad_fn=<SliceBackward>)

> See the difference?

To specify that we are just passing through one input, we add our $x_i$: $h(x_i) = \sigma(z(x_i))$

2. The cost function

So we previously expressed our cost function with something like the following:

$SSE = \sum  (y - \hat{y})^2 $

Where $\hat{y}$ was our predicted value, and $y$ was our actual value.  But if you think about it, our predicted value is the value that is outputted from our hypothesis function, $h(x_i)$, so we can rewrite the squared error above as:

> $squared\_error = (y_i - h(x_i))^2 $

So now we are indicating that the squared error of an individual observation is an actual value $y_i$ minus the output from our hypothesis function squared.

And to then indicate we are adding the squared error for each individual observation, we add back in the summation function: 

$SSE = \sum_{i = 1}^n  (y_i - h(x_i))^2 $

This time under the summation function we added an $i = 1$ on the bottom and an $n$ on top.  This is because in math, like in programming, there is the concept of an index for a list of elements.  So here, it is saying from the first element (we begin at index 1 in math) to the last element $n$, add up the squared error of each.  Finally, for mean squared error, we divide by the number of elements to get the average squared error.

$MSE = \frac{1}{n}\sum_{i = 1}^n  (y_i - h(x_i))^2 $

One more thing before we move on.  We saw that our hypothesis function is given the notation $h(x)$.  We'll use $J(x)$ to indicate our cost function.  So now we have:

$J(x) = \frac{1}{n}\sum_{i = 1}^n  (y_i - h(x_i))^2 $

> And notice that the $x$ in $J(x)$ does not have a little $i$ underneath.  This is to indicate that the *total cost* for our set of observations is the summation of all of the individual squared errors, divided by the number of observations.

Maybe read that sentence above twice so that it sinks in.
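The notation maps almost symbol for symbol onto code.  Below is a minimal sketch in plain Python: `h` handles a single observation $x_i$, and `J` loops over all $n$ observations.  Passing the weight `w_1` in as an explicit argument is an assumption made here for illustration, so that we can watch the cost change as the weight changes.

```python
import math

def h(x_i, w_1):
    """Hypothesis for one observation: sigmoid of the linear output."""
    return 1.0 / (1.0 + math.exp(-w_1 * x_i))

def J(xs, ys, w_1):
    """Mean squared error: average of (y_i - h(x_i))^2 over all n observations."""
    n = len(xs)
    return sum((y_i - h(x_i, w_1)) ** 2 for x_i, y_i in zip(xs, ys)) / n

xs = [1.001, 1.326, 1.203]   # first three mean_area values from the data
ys = [1, 1, 1]               # their labels

for w_1 in [0.0, 1.0, 2.0]:
    print(w_1, round(J(xs, ys, w_1), 4))
```

Each weight gives one number for the whole set of observations, and plotting that number against the weight gives the cost curve.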

### Performing Gradient Descent

Ok, so the math review is over.

Now remember what comes next.  We want to find the parameter value for $w_1$ that minimizes the cost curve.  Now we could try every value, but instead it's more efficient to use the slope of the cost curve to update our parameter value such that it will move towards the minimum.

<img src="cost-curve-slopes.png" width="70%">

So we started with an arbitrary value for $w_1$, and our approach will be to calculate the slope of the cost curve for that value of $w_1$.  And we'll descend along the cost curve by taking the negative value of the slope, and multiplying it by a learning rate:

$next\_w_1 = w_1 - learning\_rate*slope\_at(w_1)$

Or, more generally we write this as:

$\theta_{next} = \theta_{current} -\eta*slope\_at(\theta_{current})$

Where $\theta$ represents a parameter, and $\eta$ is the learning rate.  This process is called gradient descent.  It's called gradient descent because that *slope* is referred to as the **gradient**, and we use the gradient to descend along our cost curve. 

In summary, with gradient descent we take the current parameter value, find the slope of the cost curve at that value, and update our parameter by the negative slope multiplied by a learning rate.  Then we repeat.  
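To make the update rule concrete, here is a minimal sketch in plain Python.  The cost curve is a made-up bowl, $(w - 3)^2$, standing in for the real one, and `cost`, `slope_at` and `history` are hypothetical names chosen for this illustration.

```python
def cost(w):
    """A made-up, bowl-shaped cost curve whose minimum sits at w = 3."""
    return (w - 3) ** 2

def slope_at(w):
    """Slope of the made-up cost curve at w."""
    return 2 * (w - 3)

learning_rate = 0.1
w = 1.0            # arbitrary starting value
history = [w]      # keep each value of w so we can inspect the descent

for _ in range(100):
    w = w - learning_rate * slope_at(w)   # step against the slope
    history.append(w)

print(round(w, 4))   # 3.0
```

Each pass through the loop moves `w` a little closer to the bottom of the bowl, and the steps shrink as the slope flattens out.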

Remember, this is the training loop we saw in the last lesson.

```python
for (x, y) in zip(X_train_tensor_gpu, y_train_tensor_gpu): # loop through observations and labels
    net.zero_grad() # explained below
    y_hat = net(x)                    # 1. start with current weights in our neural net and make a prediction
    loss = criterion(y_hat, y)        # 2. See how off the prediction is according to the cost function
    loss.backward()                   # 3. Calculate the slope of the cost curve at that weight
    opt.step()                        # 4. Update the parameters based on learning rate and the calculated slope 
```

### Behind the scenes

Now the key component of neural networks which we still need to understand is what occurs when we call `loss.backward()`.  When we call `loss.backward()`, PyTorch calculates the slopes we see in our cost curve picture.

<img src="cost-curve-slopes.png" width="60%">

That is, it calculates how nudging the current parameter value of $w_1$ will alter our cost.  We can see this if we look at `W.grad`.

> Press shift + return, and **nothing happens**.

In [56]:
W.grad

So **nothing happens** when we call `W.grad`: no gradient has been calculated yet, so `W.grad` is `None`.  But if we first call `sse.backward()`, this will trigger PyTorch to calculate this slope.

In [57]:
sse.backward()

So notice that this time PyTorch has calculated the gradient.

In [58]:
W.grad

tensor([[27.4926]])
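One way to convince ourselves that `.grad` really is the slope of the cost curve is to estimate that slope directly, by nudging the weight slightly in each direction and measuring how the cost changes.  The sketch below does this on three made-up observations (not the lesson's data), and `sse_at` is a hypothetical helper.  It uses 64-bit floats so that the tiny nudge is not lost to rounding.

```python
import torch

# made-up stand-ins for X_tensor and y_actual
X = torch.tensor([[1.001], [1.326], [0.5]], dtype=torch.float64)
y = torch.tensor([[1.], [1.], [0.]], dtype=torch.float64)

def sse_at(w):
    """Sum of squared errors for a given weight tensor w."""
    return torch.sum((y - torch.sigmoid(X @ w)) ** 2)

W = torch.tensor([[1.]], dtype=torch.float64, requires_grad=True)
print(W.grad)    # None, because backward() has not been called yet

sse = sse_at(W)
sse.backward()   # autograd fills in W.grad

eps = 1e-6       # size of the nudge
with torch.no_grad():
    numeric_slope = (sse_at(W + eps) - sse_at(W - eps)) / (2 * eps)

print(W.grad.item(), numeric_slope.item())   # the two should nearly match
```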

So our procedure would be to repeatedly update this parameter value by the negative gradient (i.e. the slope of the cost curve) multiplied by a learning rate.

$\theta_{next} = \theta_{current} -\eta*gradient\_at(\theta_{current})$

```python
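with torch.no_grad():      # the update itself should not be tracked by autograd
    W -= .01 * W.grad      # step against the slope, scaled by the learning rate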
```
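Putting the pieces together, a few rounds of this update can be written by hand.  This is a sketch on three made-up observations rather than the lesson's data, and `losses` is a hypothetical list added so we can confirm that the cost goes down.

```python
import torch

# made-up stand-ins for X_tensor and y_actual
X = torch.tensor([[1.001], [1.326], [0.5]])
y = torch.tensor([[1.], [1.], [0.]])

W = torch.tensor([[1.]], requires_grad=True)
learning_rate = 0.01
losses = []

for step in range(5):
    sse = torch.sum((y - torch.sigmoid(X @ W)) ** 2)   # predict and score
    losses.append(sse.item())
    sse.backward()                                     # slope of the cost at the current W
    with torch.no_grad():
        W -= learning_rate * W.grad                    # step against the slope
    W.grad.zero_()                                     # clear the slope before the next round

print(losses)
```

Each pass recomputes the cost, asks for its slope, and nudges `W` a small step downhill.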

So now see if that training loop makes even more sense.

```python
for (x, y) in zip(X_train_tensor_gpu, y_train_tensor_gpu): # loop through observations and labels
    net.zero_grad() # explained later
    y_hat = net(x)                    # 1. start with current weights in our neural net and make a prediction
    loss = criterion(y_hat, y)        # 2. See how off the prediction is according to the cost function
    loss.backward()                   # 3. Calculate the slope of the cost curve at that weight
    opt.step()                        # 4. Update the parameters based on learning rate and the calculated slope 
```

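The `zero_grad` line simply clears the calculated gradient.  We need it because PyTorch adds each newly calculated gradient to whatever is already stored, so without clearing, the slopes from earlier iterations would pile up.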
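Here is a small sketch of why clearing matters, using a single made-up weight and input.  Calling `backward()` twice without clearing in between leaves the sum of both slopes in `.grad`.

```python
import torch

w = torch.tensor([[1.]], requires_grad=True)
x = torch.tensor([[2.]])     # made-up input

(x @ w).sum().backward()
first = w.grad.clone()       # the slope of x @ w with respect to w is x, so 2

(x @ w).sum().backward()     # second call, without clearing first
accumulated = w.grad.clone() # 2 + 2 = 4: the new slope was added to the old one

w.grad.zero_()               # clear the stored slope by hand

print(first.item(), accumulated.item(), w.grad.item())   # 2.0 4.0 0.0
```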

### A moment of appreciation

So now we saw what that `sse.backward()` call does.  It calculates how the cost curve changes as we change our parameter $w_1$.  This may seem pretty simple, but there's a lot behind it.  

1. There are a lot of parameters

First, while in this example we calculated how changing *a single parameter* changes the cost, with neural networks we need to calculate how each of our thousands of parameters changes the cost.  

But it's not just the number of parameters that makes this difficult.  

2. There are multiple layers

Let's take another look at what we did above.

In [59]:
w = torch.tensor([[1.]], requires_grad = True)

z = X_tensor @ w # layer 1
a = torch.sigmoid(z) # layer 2

mse = torch.sum((y_tensor.view(-1, 1) - a)**2)/len(y_tensor) # layer 3, total cost

When we called `sse.backward()` earlier, PyTorch had to calculate the effect of changing $w$ back in layer 1 on the cost down in layer 3.  So to calculate the effect of changing a parameter in layer 1, we need to consider how $w$ affects the output of the linear layer, which changes the output of the activation layer, which then changes the output of the cost function.  Changing this parameter back in layer 1 causes a chain reaction, and PyTorch needs to account for it.
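As a preview only, the chain reaction can be spelled out for our single-weight neuron by multiplying together how each layer responds to the one before it.  The sketch below uses three made-up observations and the sum of squared errors as the cost.  The `d_..._d_...` names are hypothetical, and the details behind each line are left for the coming lessons.

```python
import torch

# made-up stand-ins for X_tensor and y_actual
X = torch.tensor([[1.001], [1.326], [0.5]])
y = torch.tensor([[1.], [1.], [0.]])
w = torch.tensor([[1.]], requires_grad=True)

z = X @ w                       # layer 1
a = torch.sigmoid(z)            # layer 2
sse = torch.sum((y - a) ** 2)   # layer 3, total cost
sse.backward()                  # PyTorch works backwards through all three

with torch.no_grad():
    d_sse_d_a = -2 * (y - a)    # how the cost responds to each activation
    d_a_d_z = a * (1 - a)       # how each activation responds to z (slope of the sigmoid)
    d_z_d_w = X                 # how each z responds to w
    by_hand = torch.sum(d_sse_d_a * d_a_d_z * d_z_d_w)

print(w.grad.item(), by_hand.item())   # the two should nearly match
```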

So this is what we'll need to cover in the coming lessons.  First, how neural networks descend a cost curve when there is more than one parameter.  And second, how neural networks calculate the indirect effect that changing a parameter has on the cost curve through multiple layers. 

### Summary

In this lesson, we reviewed the key ideas behind training the parameters for a single neuron.  We first defined our hypothesis function as the following:

* $z(mean\_area) = w_1*mean\_area $
* $a(z) =  \frac{1}{1 + e^{-z}} $

And coded it like so:

In [102]:
W = torch.tensor([[1.]], requires_grad = True) # defined once, outside the function, so training can update it

def linear_layer(X):
    z = X @ W
    return z
    
def activation_layer(z):
    a = torch.sigmoid(z)
    return a

From there, we moved to training our hypothesis function, and saw that we do this with gradient descent.  In PyTorch, that looks like the following:

```python
for (x, y) in zip(X_train_tensor_gpu, y_train_tensor_gpu): # loop through observations and labels
    net.zero_grad() # explained later
    y_hat = net(x)                    # 1. start with current weights in our neural net and make a prediction
    loss = criterion(y_hat, y)        # 2. See how off the prediction is according to the cost function
    loss.backward()                   # 3. Calculate the slope of the cost curve at that weight
    opt.step()                        # 4. Update the parameters based on learning rate and the calculated slope 
```

We then saw that the `loss.backward()` step calculates the gradient -- that is the slope of our cost curve at the current value of our parameter.  And this is what we use to guide how we repeatedly update our parameter.

$\theta_{next} = \theta_{current} -\eta*slope\_at(\theta_{current})$

Finally, we discussed what's next.  First, we need to learn how gradient descent works when we move beyond a single parameter.  And second, we need to see how we can calculate the effect of changing a parameter on multiple layers of our neural network, until it ultimately affects our cost function.