* Notes

### Introduction

So in the last lesson we learned about both forward propagation and backward propagation.  With forward propagation we saw the output from our neural network at each layer.

* $z(x_i) = w_1*x_i $
* $\sigma(z) =  \frac{1}{1 + e^{-z}} $
* $ J(\hat{y}, y) = \sum  (y - \hat{y})^2 $
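
As a quick refresher, the three steps above can be sketched in plain Python for a single observation. The function names and the sample values (`w_1 = 2.0`, `x_i = 0.5`, `y = 1.0`) are illustrative assumptions, not values taken from the previous lesson.

```python
import math

def linear(w_1, x_i):
    # Linear layer with a single weight: z = w_1 * x_i
    return w_1 * x_i

def sigmoid(z):
    # Activation layer: 1 / (1 + e^{-z})
    return 1 / (1 + math.exp(-z))

def squared_error(y_hat, y):
    # Cost for one observation: (y - y_hat)^2
    return (y - y_hat) ** 2

# Illustrative values (assumptions, chosen only for this sketch)
w_1, x_i, y = 2.0, 0.5, 1.0

z = linear(w_1, x_i)            # output of the linear layer
y_hat = sigmoid(z)              # output of the activation layer
cost = squared_error(y_hat, y)  # how far the prediction is from the target
print(z, y_hat, cost)
```

Each line of the forward pass feeds the next, which is the same order we reverse when we propagate backward.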

And with backward propagation, we moved backward through each of our layers, each time calculating how a change in that layer's output affected the cost function.  Remember that with backpropagation, we perform this calculation by multiplying the local derivative by the upstream derivative we calculated in the previous layer.

* $\frac{\delta J}{\delta \sigma}$

* $\frac{\delta J}{\delta z} = \frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$

* $\frac{\delta J}{\delta w} = \frac{\delta z}{\delta w}*\frac{\delta J}{\delta z}$
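
To make the chain of multiplications concrete, here is a minimal sketch of the backward pass for a single-parameter neuron, followed by a finite-difference check. The sample values and the helper `cost_from_w` are assumptions introduced only for illustration.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def cost_from_w(w_1, x_i, y):
    # Full forward pass, written as a function of the weight alone
    return (y - sigmoid(w_1 * x_i)) ** 2

# Illustrative values (assumptions)
w_1, x_i, y = 2.0, 0.5, 1.0

# Forward pass
z = w_1 * x_i
y_hat = sigmoid(z)

# Backward pass: local derivative times upstream derivative at each layer
dJ_dsigma = 2 * (y_hat - y)      # derivative of (y - y_hat)^2 with respect to y_hat
dsigma_dz = y_hat * (1 - y_hat)  # local derivative of the sigmoid
dJ_dz = dsigma_dz * dJ_dsigma
dz_dw = x_i                      # local derivative of z = w_1 * x_i
dJ_dw = dz_dw * dJ_dz

# Numerical check: nudge the weight slightly and watch the cost change
eps = 1e-6
numeric = (cost_from_w(w_1 + eps, x_i, y) - cost_from_w(w_1 - eps, x_i, y)) / (2 * eps)
print(dJ_dw, numeric)
```

The two printed numbers should agree closely, which suggests the chain rule above is wired up correctly.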

But in the last lesson, we did this with only a single parameter in our neuron.  And we updated our neural network only one observation at a time.  In this lesson, we'll see how we can use matrix algebra to build more complex neural networks.

### Getting Set Up

Now we'll need the linear and activation layers to make a prediction.  And we'll need the derivatives we calculated in the previous lessons to update the parameters of our hypothesis function.

Remember that we'll need each of the components used to calculate $\frac{\delta J}{\delta w}$ and $\frac{\delta J}{\delta b}$.

* $\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w}$

* $\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta b}$

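And as we calculated in previous lessons, these component derivatives are the following, where $\hat{y} = \sigma(z)$:

* $\frac{\delta J}{\delta \sigma} = 2(\hat{y} - y)$

* $\frac{\delta \sigma}{\delta z} = \sigma(z)(1 - \sigma(z))$

* $\frac{\delta z}{\delta w} = x$

* $\frac{\delta z}{\delta b} = 1$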

### Understanding the Gradient

### Multiple Features

In [105]:
import pandas as pd

df = pd.read_csv('./cell_data.csv')
df[:2]

| Unnamed: 0 | mean_area | mean_concavity | is_cancerous |
|---|---|---|---|
| 0 | 1.001 | 0.3001 | 0 |
| 1 | 1.326 | 0.0869 | 0 |


And we convert our data into tensors like so:

In [109]:
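import torch
# Features: one row per observation, one column per feature
X_tensor = torch.tensor(df[['mean_area', 'mean_concavity']].values).float()
# Note: as written, the label is 1.0 where is_cancerous == 0
y_tensor = torch.tensor((df['is_cancerous'] == 0).values).float()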

Now when using multiple features, the same procedure of forward and backward propagation holds.  We start with forward propagation, where we pass our data through multiple layers.

In [112]:
first_x = X_tensor[0] 
first_y = y_tensor[0]


first_y # tensor(1.)
first_x # tensor([1.0010, 0.3001])

tensor([1.0010, 0.3001])

And then we'll define our hypothesis function.

In [48]:
def linear_fn(w, x, b):
    return x @ w + b 

In [49]:
def activation_fn(z):
    return torch.sigmoid(z)

And then we perform forward propagation.

In [115]:
w = torch.tensor([2., 2.])
b = torch.tensor(-2.)

z = linear_fn(w, first_x, b)
z

tensor(0.6022)

> So, in the linear layer above, we multiply $x_{1 \times 2}$ by $w_{2 \times 1}$, giving us a single output.

And from there, we pass this output through our activation function, which again returns a single value.

In [119]:
y_hat = activation_fn(z)
y_hat

tensor(0.6462)
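
Because the linear layer is a matrix multiplication, the same two functions can also process several observations at once. The sketch below is an illustration rather than part of the original walkthrough: it hard-codes the two rows shown in the table earlier, and the names `X_batch`, `z_batch`, and `y_hat_batch` are assumptions.

```python
import torch

def linear_fn(w, x, b):
    # x has shape (n, 2) and w has shape (2,), so x @ w has shape (n,)
    return x @ w + b

def activation_fn(z):
    return torch.sigmoid(z)

# The two observations displayed earlier, one row per observation
X_batch = torch.tensor([[1.001, 0.3001],
                        [1.326, 0.0869]])
w = torch.tensor([2., 2.])
b = torch.tensor(-2.)

z_batch = linear_fn(w, X_batch, b)    # one z per observation
y_hat_batch = activation_fn(z_batch)  # one prediction per observation
print(z_batch.shape, y_hat_batch)
```

The first entry of `z_batch` matches the single-observation result above, since each row is multiplied by the same weight vector.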

### Backward propagation

Now what's more interesting is backward propagation.  It turns out that all of our same formulas hold.

In [120]:
import torch
def delta_J_delta_sigma(y_hat, y):
    return torch.sum(2*(y_hat - y))

In [121]:
def delta_sigma_delta(z):
    return torch.sigmoid(z)*(1 - torch.sigmoid(z))

In [122]:
def delta_z_delta_w(x):
    return x

In [123]:
def delta_z_delta_b():
    return 1

So really, even though we are using a multiparameter hypothesis function, we can apply the same rules as above.

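In [128]:
dj_dsig = delta_J_delta_sigma(y_hat, first_y)

In [129]:
# Upstream derivative times the local derivative of the sigmoid
dJ_dz = delta_sigma_delta(z)*dj_dsig

In [130]:
# One partial derivative per weight
dJ_dw = delta_z_delta_w(first_x)*dJ_dz

In [131]:
dJ_dw

tensor([-0.1620, -0.0486])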
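
One way to sanity-check these hand-derived gradients is to compare them with PyTorch's autograd. The sketch below hard-codes the first observation and the starting parameters from this lesson; the variable names `manual_dJ_dw` and `manual_dJ_db` are illustrative assumptions. It also computes the bias gradient, which multiplies the upstream derivative by $\frac{\delta z}{\delta b} = 1$.

```python
import torch

# First observation and starting parameters from this lesson
x = torch.tensor([1.0010, 0.3001])
y = torch.tensor(1.)
w = torch.tensor([2., 2.], requires_grad=True)
b = torch.tensor(-2., requires_grad=True)

# Forward pass, tracked by autograd
z = x @ w + b
y_hat = torch.sigmoid(z)
J = (y - y_hat) ** 2

# Let autograd perform backward propagation
J.backward()

# The same gradients from the chain rule, computed by hand
with torch.no_grad():
    dJ_dsigma = 2 * (y_hat - y)
    dsigma_dz = y_hat * (1 - y_hat)
    dJ_dz = dsigma_dz * dJ_dsigma
    manual_dJ_dw = x * dJ_dz  # dz/dw = x
    manual_dJ_db = 1 * dJ_dz  # dz/db = 1

print(w.grad, manual_dJ_dw)
print(b.grad, manual_dJ_db)
```

If the manual formulas are right, each printed pair should match up to floating-point rounding.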

Now the most interesting part is what happens at the bottom-most layer -- $\frac{\delta z}{\delta w}$.  Remember, at this layer, our hypothesis function looks like the following:

$z = w_1*x_1 + w_2*x_2 + \dots + w_n*x_n + b$

And remember, with gradient descent, we want to determine how to nudge each of our weights, $w_1, w_2, \dots, w_n$, and our bias term $b$.  So when we find $\frac{\delta z}{\delta w_1}$, we treat every term but the first as a constant, and so we just get $x_1$.

$\frac{\delta z}{\delta w_1} = x_1 + 0 + 0 + \dots + 0 = x_1$

And for the second weight, we treat every term but the second as a constant:

$\frac{\delta z}{\delta w_2} = x_2$
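
Stacking these partial derivatives gives the vector $[x_1, x_2]$, which is why `delta_z_delta_w` simply returns `x`. As a small sketch, we can ask autograd for the derivative of $z$ alone and compare it with the input. The values are the first observation and starting parameters used earlier in this lesson.

```python
import torch

x = torch.tensor([1.0010, 0.3001])
w = torch.tensor([2., 2.], requires_grad=True)
b = torch.tensor(-2., requires_grad=True)

# Only the linear layer: z = w_1*x_1 + w_2*x_2 + b
z = x @ w + b
z.backward()

print(w.grad)  # one partial derivative per weight: dz/dw_1, dz/dw_2
print(b.grad)  # dz/db
```

Here `w.grad` holds the same values as `x`, and `b.grad` is 1, matching the partial derivatives worked out above.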