# Backpropagation Lab

### Introduction

In this lesson, we'll code out both the hypothesis function and the training mechanism for a single neuron with a sigmoid activation.  We'll do so to walk through the chain rule once more and, hopefully, to place it in context.

### Our Starting Function

Remember that when we make a prediction with a single neuron, we do so with the following hypothesis function:

$h(x) = \sigma(w x + b)$

Ok, let's start by translating this into two functions `z(x)` and our sigmoid function.

> Let z be a function of $w$, $b$ and $x$.

In [64]:
def z(w, b, x):
    return w*x + b

In [65]:
z(1.5, .5, 1)

# 2.0

2.0

In fact, let's have our parameters be stored in a dictionary with keys of $w$ and $b$.

In [66]:
def z(params, x):
    return params['w']*x + params['b']

In [67]:
params = {'w': -1.0, 'b': .5}

z(params, -1.808)

# 2.308

2.308

Next, let's write our sigmoid function, which translates this into a value between 0 and 1.

In [68]:
import numpy as np
def sigmoid(z):
    return 1/(1 + np.exp(-z))

In [69]:
params = {'w': 1.5, 'b': .5}
x = 1

sigmoid(z(params, x))

# 0.8807970779778823

0.8807970779778823
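As a quick sanity check on the claim that the sigmoid maps any input to a value between 0 and 1, here is a minimal sketch. The test inputs in `zs` are arbitrary values chosen for illustration, and `sigmoid` is redefined so the snippet runs on its own.

```python
import numpy as np

def sigmoid(z):
    # same definition as in the lesson
    return 1 / (1 + np.exp(-z))

# arbitrary inputs chosen for illustration
zs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(zs)

# every output lies strictly between 0 and 1
all_in_range = bool(np.all((outputs > 0) & (outputs < 1)))

# sigmoid(0) sits at the midpoint
midpoint = sigmoid(0.0)

# symmetry: sigmoid(-z) equals 1 - sigmoid(z)
is_symmetric = bool(np.allclose(sigmoid(-zs), 1 - sigmoid(zs)))

print(outputs.round(4), all_in_range, midpoint, is_symmetric)
```

The symmetry property is why a large negative `z` gives a prediction near 0 and a large positive `z` gives a prediction near 1.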

Ok, now that's starting to feel like a hypothesis.  Finally, let's wrap this in a function called `h(params, x)`.

In [70]:
def h(params, x):
    return sigmoid(z(params, x))

In [71]:
h(params, 1)

# 0.8807970779778823

0.8807970779778823

### Calculating the cost

Now that we've written out our hypothesis function, let's begin writing the code for our cost function.  We can begin by loading up some data from the breast cancer dataset.

In [72]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
cancer_dataset = load_breast_cancer()

X = pd.DataFrame(cancer_dataset.data[:, :1], columns = cancer_dataset.feature_names[:1])
y = pd.Series(cancer_dataset.target)

In [73]:
mean_radiuses = X.iloc[:, 0].values

mean_radiuses[:4]

array([17.99, 20.57, 19.69, 11.42])

> Ok, let's write a function called `loss`, which calculates the loss for a single observation given a set of parameters.

In [74]:
def loss(x, y, params):
    return (y - h(params, x))**2

In [75]:
first_mean_radius = mean_radiuses[0]

In [76]:
loss(first_mean_radius, y[0], params)

# 0.9999999999976854

0.9999999999976854

Let's move to our cost function, which calculates the total losses.

$J_{w, b}(X) = \sum_{i = 1}^n (y_i - h(x_i))^2$

In [79]:
def total_cost(X, y, params):
    # sum the single-observation loss over every (x_i, y_i) pair
    return np.sum([loss(x_i, y_i, params) for x_i, y_i in zip(X, y)])

In [81]:
total_cost(mean_radiuses, y, params)

# 211.99999962013467

211.99999962013467
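The same sum can also be computed without a Python loop by letting NumPy apply the hypothesis to every observation at once. The sketch below is only an illustration: `total_cost_vectorized` is a hypothetical helper name, and `xs` and `ys` are made-up toy values rather than the breast cancer data.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def total_cost_vectorized(xs, ys, params):
    # hypothetical helper: predictions for all observations at once
    predictions = sigmoid(params['w'] * xs + params['b'])
    return np.sum((ys - predictions) ** 2)

# made-up toy data, not the breast cancer dataset
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = np.array([1, 1, 0, 0, 0])
params = {'w': 1.5, 'b': .5}

# loop version, mirroring the structure used in the lesson
loop_cost = sum((y_i - sigmoid(params['w'] * x_i + params['b'])) ** 2
                for x_i, y_i in zip(xs, ys))
vector_cost = total_cost_vectorized(xs, ys, params)

print(loop_cost, vector_cost)
```

Both versions should agree up to floating point rounding; the vectorized one tends to be faster on larger arrays.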

### Calculating the Gradient

Now that we have some initial parameters and have calculated our first total cost, the next task is to work on updating the parameters to minimize this cost.  To do so, we need to determine the amount that we should update the parameters by to approach the minimum step by step.  

As we know, we do this by moving in the negative direction of the gradient.  So if our hypothesis function is the following:

$h(x) = \sigma(z(w, x, b))$

Then the question is how much does nudging our parameter $w$:

1. Change the output of $z$,
2. Which then changes the output of $\sigma$
3. Which then changes the total cost $J(X)$
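To make this chain of effects concrete, the sketch below nudges $w$ by a small amount and reports how much each intermediate value moves. The starting values and the size of the nudge are assumptions chosen for illustration, and `forward_trace` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_trace(w, b, x, y):
    # hypothetical helper: returns each intermediate value on the way to the loss
    z_val = w * x + b
    sig_val = sigmoid(z_val)
    loss_val = (y - sig_val) ** 2
    return z_val, sig_val, loss_val

# assumed example values
w, b, x, y = 3.0, -1.0, 2.0, 0.0
nudge = 0.001

before = forward_trace(w, b, x, y)
after = forward_trace(w + nudge, b, x, y)

for name, v0, v1 in zip(['z', 'sigma', 'loss'], before, after):
    print(f"{name}: changes by {v1 - v0:.6f}")
```

Here $z$ moves by the nudge times $x$, while $\sigma$ and the loss move by much less because the sigmoid is nearly flat at this value of $z$.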

Ok, let's break this down into steps.

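1. Calculate the change in the loss as the output of our hypothesis function changes

> Here, let's calculate the change in the loss for a single observation, as the output of the hypothesis function changes.

Our loss for a single observation is:

$\ell = (y - h(x))^2$

* Find $\frac{\delta \ell}{\delta \sigma}$.  Because $h(x) = \sigma(z)$, differentiating the loss with respect to $\sigma$ gives:

$\frac{\delta \ell}{\delta \sigma} = -2(y - h(x)) = 2(h(x) - y)$

After that, we'll still need $\frac{\delta \sigma}{\delta z}$, $\frac{\delta z}{\delta w}$ and $\frac{\delta z}{\delta b}$, which we work through in the steps that follow.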

In [82]:
params = {'w': 1.0, 'b': -1}

In [83]:
def dl_dsig(params, x, y):
    # the derivative of (y - h)^2 with respect to h is 2*(h - y)
    error = h(params, x) - y
    return 2*error

In [84]:
# x = 1, y = 0
params = {'w': -1.0, 'b': .5}

dl_dsig(params, 1, y[0])
# 0.7550813375962908

0.7550813375962908

So we can see that, with the current parameters, a small change in the output of $h(x) = \sigma(z)$ changes the loss at this single observation at a rate of about `0.755`.

Now this derivative is only useful in that we'll need it later on.  Let's keep going.

$h(x) = \sigma(z(w, x, b))$

2. Finding $\frac{\delta \ell}{\delta z}$ 

To start, we know that a change in the output of $z$ will change the output of $\sigma$, which will then change our loss $\ell$.  So we get:

* $\frac{\delta \ell}{\delta z} = \frac{\delta \ell}{\delta \sigma} \frac{\delta \sigma}{\delta z}$

Now we already wrote a function for calculating $\frac{\delta \ell}{\delta \sigma}$ above, so what's left is to calculate $\frac{\delta \sigma}{\delta z}$.

> Now the derivative $\frac{\delta \sigma}{\delta z}$ is the following:

$\sigma'(z(x)) = \sigma(z(x))*(1 - \sigma(z(x)))$

Now translate this into code.

In [43]:
def dsig_dz(params, x, y):
    # y is unused here; it is kept so the signature matches dl_dsig
    sig = sigmoid(z(params, x))
    return sig*(1 - sig)

In [44]:
x = 1  #y[0]
params = {'w': -1.0, 'b': .5}
dsig_dz(params, x, y[0])

# 0.2350037122015945

0.2350037122015945
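One way to gain confidence in the formula for $\sigma'(z)$ is to compare it against a finite-difference estimate. The sketch below does this at $z = -0.5$, the value of $z$ for the parameters used above. The step size `eps` and the helper name `sigmoid_prime` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    # analytic derivative from the lesson
    return sigmoid(z) * (1 - sigmoid(z))

z_val = -0.5   # z for w = -1.0, b = .5, x = 1
eps = 1e-6     # assumed step size

# central difference approximation of the slope at z_val
numeric = (sigmoid(z_val + eps) - sigmoid(z_val - eps)) / (2 * eps)
analytic = sigmoid_prime(z_val)

print(analytic, numeric)
```

The two numbers should agree to many decimal places, which suggests that the analytic derivative is coded correctly.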

3. Finding $\frac{\delta \ell}{\delta w}$

* Now the change in the loss with respect to a change in $w$ is the following:

* $\frac{\delta \ell}{\delta w} = \frac{\delta \ell}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w}$

And we already wrote functions to calculate $\frac{\delta \ell}{\delta \sigma}$ and $\frac{\delta \sigma}{\delta z}$.  So now it's time to write a function that calculates $\frac{\delta z}{\delta w}$.  Since $z = wx + b$, this derivative is simply $x$.

In [45]:
def dz_dw(x):
    return x

In [46]:
dz_dw(x)
# 1

1

4. Finding $\frac{\delta \ell}{\delta b}$

* Similarly, the change in the loss with respect to a change in $b$ is the following:

* $\frac{\delta \ell}{\delta b} = \frac{\delta \ell}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b}$

Since $z = wx + b$, a unit change in $b$ changes $z$ by exactly one.  So now, let's just write a function that returns $\frac{\delta z}{\delta b} = 1$.

In [47]:
def dz_db():
    return 1
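Multiplying the factors from the steps above gives closed-form expressions for both partial derivatives. This is just the chain rule written out with the pieces already derived:

$\frac{\delta \ell}{\delta w} = 2(h(x) - y) \cdot \sigma(z)(1 - \sigma(z)) \cdot x$

$\frac{\delta \ell}{\delta b} = 2(h(x) - y) \cdot \sigma(z)(1 - \sigma(z)) \cdot 1$

As a worked check with the values computed earlier ($w = -1.0$, $b = .5$, $x = 1$, $y = 0$):

$\frac{\delta \ell}{\delta w} \approx 0.7551 \cdot 0.2350 \cdot 1 \approx 0.1774$

Because $\frac{\delta z}{\delta b} = 1$ and $x = 1$ here, $\frac{\delta \ell}{\delta b}$ takes the same value for this observation.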

### Putting it together with backpropagation

So we've now written functions that will allow us to calculate how much the loss function changes as we change $w$ and $b$.

In [None]:
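def forward(params, x, y):
    # forward pass: compute and keep each intermediate value
    z_val = params['w']*x + params['b']
    h_val = sigmoid(z_val)
    l = (y - h_val)**2
    return z_val, h_val, l

def backwards(x, y, h_val):
    # backward pass: reuse each upstream derivative in the next one
    dl_dh = 2*(h_val - y)
    dl_dz = h_val*(1 - h_val)*dl_dh  # sigma'(z) * dl/dh
    dl_dw = x*dl_dz                  # dz/dw = x
    dl_db = 1*dl_dz                  # dz/db = 1
    return dl_dw, dl_db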

In [48]:
def dJ_dw_and_dJ_db(params, x, y):
    
    dJ_dsig_result = dl_dsig(params, x, y)
    dJ_dz_result = dJ_dsig_result*dsig_dz(params, x, y)
    # the upstream derivatives above, are shared
    # by the partial derivatives below
    
    dJ_dw = dJ_dz_result*dz_dw(x)
    dJ_db = dJ_dz_result*dz_db()
    return (dJ_dw.round(6), dJ_db.round(6))

In [55]:
params = {'w': 3, 'b': -1}
x, y_val = 2, 0
dJ_dw_and_dJ_db(params, x, y_val)

# (0.026414, 0.013207)
# y[0]

(0.026414, 0.013207)

So we should move our parameters in the negative direction of the gradient.  Here, that means decreasing $w$ and $b$ by some fraction (the learning rate) of $0.026$ and $0.013$, respectively.
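A common way to sanity-check a gradient like this one is to compare it with a finite-difference estimate, nudging one parameter at a time. The sketch below is a self-contained illustration at the same point ($w = 3$, $b = -1$, $x = 2$, $y = 0$). The helper names `single_loss`, `analytic_gradient` and `numeric_gradient`, and the step size `eps`, are assumptions rather than part of the lesson's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def single_loss(w, b, x, y):
    return (y - sigmoid(w * x + b)) ** 2

def analytic_gradient(w, b, x, y):
    # chain rule: dl/dsigma * dsigma/dz, then times dz/dw (x) or dz/db (1)
    h_val = sigmoid(w * x + b)
    dl_dz = 2 * (h_val - y) * h_val * (1 - h_val)
    return dl_dz * x, dl_dz

def numeric_gradient(w, b, x, y, eps=1e-6):
    # central differences, nudging one parameter at a time
    dw = (single_loss(w + eps, b, x, y) - single_loss(w - eps, b, x, y)) / (2 * eps)
    db = (single_loss(w, b + eps, x, y) - single_loss(w, b - eps, x, y)) / (2 * eps)
    return dw, db

w, b, x, y = 3.0, -1.0, 2.0, 0.0
analytic = analytic_gradient(w, b, x, y)
numeric = numeric_gradient(w, b, x, y)

print(analytic, numeric)
```

If the chain rule was applied correctly, the two pairs should match closely, and both should round to the values shown above.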

Ok, so the function `dJ_dw_and_dJ_db` calculates the gradient for w and b.

In [256]:
def step(gradient, current_params, learning_rate):
    new_w = current_params['w'] - learning_rate*gradient[0] 
    new_b = current_params['b'] - learning_rate*gradient[1]
    return {'w': new_w, 'b': new_b}
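To see what a single update looks like, here is a small usage sketch. It applies `step` to the gradient computed above. The learning rate of `0.1` is an assumed value chosen only so that the change is easy to see.

```python
def step(gradient, current_params, learning_rate):
    # same update rule as in the lesson
    new_w = current_params['w'] - learning_rate * gradient[0]
    new_b = current_params['b'] - learning_rate * gradient[1]
    return {'w': new_w, 'b': new_b}

params = {'w': 3, 'b': -1}
gradient = (0.026414, 0.013207)   # gradient computed earlier in the lesson
learning_rate = 0.1               # assumed value for illustration

updated = step(gradient, params, learning_rate)
print(updated)
```

Both parameters move down slightly (to roughly `2.99736` and `-1.00132`), since both partial derivatives are positive.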

### Trying it out

So we just wrote the components to calculate the hypothesis of our neuron as well as how to update our neuron.  Let's see how our code performs.

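As a first step, we should scale our data so that the feature has a mean of zero and a standard deviation of one.  This keeps the inputs to the sigmoid in a reasonable range and helps gradient descent make steady progress with a single learning rate.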

In [257]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_transformed = scaler.fit_transform(X)
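For intuition, standard scaling of a single column amounts to subtracting the mean and dividing by the standard deviation. The sketch below reproduces that by hand with NumPy on the first four mean radius values shown earlier. It is an illustration of the arithmetic, not a replacement for `StandardScaler`.

```python
import numpy as np

# the first four mean radius values shown earlier in the lesson
feature = np.array([17.99, 20.57, 19.69, 11.42])

mean = feature.mean()
std = feature.std()   # population standard deviation (ddof=0), which StandardScaler also uses
scaled = (feature - mean) / std

print(scaled.round(3), scaled.mean().round(6), scaled.std().round(6))
```

After scaling, the values have a mean of approximately zero and a standard deviation of approximately one.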

Then let's initialize some parameters, set a learning rate, and repeatedly update our parameters according to our calculated gradient.

> One way that our neural network is not quite right is that we are making an update at each instance instead of calculating the gradient of the total cost.  

Still let's give it a shot.

In [294]:
# -3, 1
params = {'w': 0, 'b': 0}
learning_rate = .001

# stochastic gradient descent  
for i in range(1000):
    for (x, y_value) in zip(X_transformed[:, 0], y):
        gradient = dJ_dw_and_dJ_db(params, x, y_value)
        params = step(gradient, params, learning_rate)

In [295]:
params

{'w': -3.4708154290001905, 'b': 0.6869705380000174}
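For contrast with the per-observation updates above, here is a minimal sketch of full-batch gradient descent, where each update uses the gradient of the total cost. Everything in it is an assumption for illustration: the synthetic data, the helper names `total_cost` and `batch_gradient`, the learning rate and the number of iterations.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def total_cost(xs, ys, w, b):
    return np.sum((ys - sigmoid(w * xs + b)) ** 2)

def batch_gradient(xs, ys, w, b):
    # sum the per-observation gradients over the whole dataset
    h_vals = sigmoid(w * xs + b)
    dl_dz = 2 * (h_vals - ys) * h_vals * (1 - h_vals)
    return np.sum(dl_dz * xs), np.sum(dl_dz)

# synthetic, made-up data: larger x tends to mean y = 0
rng = np.random.default_rng(0)
xs = rng.normal(size=200)
ys = (xs + 0.3 * rng.normal(size=200) < 0.2).astype(float)

w, b, learning_rate = 0.0, 0.0, 0.01
initial_cost = total_cost(xs, ys, w, b)
for _ in range(500):
    dw, db = batch_gradient(xs, ys, w, b)
    w, b = w - learning_rate * dw, b - learning_rate * db
final_cost = total_cost(xs, ys, w, b)

print(w, b, initial_cost, final_cost)
```

The structure is the same as the stochastic loop. The only difference is that the gradient is accumulated over all observations before each parameter update.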

In [108]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_transformed, y)

In [297]:
model.coef_, model.intercept_ 

(array([[-3.31967618]]), array([0.64237044]))

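So we see that our learned parameters land close to those of sklearn's logistic regression model.  The match is not exact, in part because `LogisticRegression` minimizes log loss with L2 regularization by default, while our neuron minimizes squared error.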
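Another way to compare the two sets of parameters is through the decision boundary, the value of the scaled feature where the prediction crosses 0.5. The sketch below computes it from the parameter values reported above. `decision_boundary` is a hypothetical helper name.

```python
# parameters reported above in the lesson
ours = {'w': -3.4708154290001905, 'b': 0.6869705380000174}
sklearn_fit = {'w': -3.31967618, 'b': 0.64237044}

def decision_boundary(params):
    # sigmoid(w*x + b) equals 0.5 exactly when w*x + b equals 0
    return -params['b'] / params['w']

boundary_ours = decision_boundary(ours)
boundary_sklearn = decision_boundary(sklearn_fit)

print(round(boundary_ours, 4), round(boundary_sklearn, 4))
```

Both boundaries fall at roughly 0.2 standard deviations above the average mean radius, so the two models would classify most observations the same way.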

### Summary

In this lesson, we went through the complete steps of coding out a single neuron, and then we compared how this performed to an equivalent model in sklearn.

We started by coding out the hypothesis function:

$h(x) = \sigma(wx + b)$

And then moved onto the loss function:

$\ell = (y - h(x))^2$

Finally, to perform gradient descent, we needed to write a function in charge of calculating the gradient of the loss with respect to both $w$ and $b$.  To do so, we used the chain rule to calculate the impact that altering either parameter has on our cost function.  And we used backpropagation so that we could reuse the intermediary derivatives in our calculation of the gradient.

> Problems here:
>
> 1. X and y values are not explicit
> 2. Changed between y and targets
> 3. The loss function is reversed  mean_radiuses, and