# Applying the chain rule

### Introduction

At this point, we would like to apply gradient descent directly to our cost curve, rather than to arbitrary example functions.  But doing so is not so simple.

When we ask how the cost changes as we change our parameters $w_1$ and $w_2$, the relationship is more indirect than what we have seen so far.

### Reviewing our Sigmoid Neuron

Remember that our sigmoid neuron is really the application of two functions:

The first is our *weighted input*:

$f(w) = w_1x_1 + w_2x_2 + w_3x_3 + b$

And the second is the *sigmoid function*:

$g(x) = \frac{1}{1 + e^{-x}}$

Where $x$ is the output of our weighted input $f(w)$.  So our sigmoid neuron is $g(f(w))$, given the functions defined above.

So part of the problem of gradient descent with our sigmoid neuron is seeing how changing one of the weights in $f(w)$ alters the output of $g(f(w))$.
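A minimal sketch of this composition, assuming the standard sigmoid $\frac{1}{1 + e^{-z}}$. The feature values, weights, and bias below are hypothetical and chosen only for illustration:

```python
import math

def weighted_input(w, x, b):
    # f(w) = w_1*x_1 + w_2*x_2 + w_3*x_3 + b
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def sigmoid(z):
    # standard sigmoid: 1 / (1 + e^(-z))
    return 1 / (1 + math.exp(-z))

def neuron(w, x, b):
    # the sigmoid neuron is the composition g(f(w))
    return sigmoid(weighted_input(w, x, b))

x = [1.0, 2.0, 3.0]     # hypothetical feature values
w = [0.5, -0.25, 0.1]   # hypothetical weights
b = 0.0                 # hypothetical bias

before = neuron(w, x, b)

# nudge only the first weight and see how the neuron's output responds
w_nudged = [w[0] + 0.1, w[1], w[2]]
after = neuron(w_nudged, x, b)

print(before, after)
```

Nudging $w_1$ first changes the weighted input, and that change then passes through the sigmoid before it reaches the output.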

### Welcome the chain rule

When we have nested functions like this, the problem of finding the change in output given a change in input becomes even more complicated.  Mathematically, the task is the following:

Given a function $g(f(x))$, find the change in output of $g$ given a change in $x$. 

To understand the problem, we can think of changing the value of $x$ as causing a chain reaction.  Altering $x$ has an effect on the output of $f(x)$, and altering $f(x)$ has an effect on the output of $g(f(x))$.

Let's see this problem of a chain reaction with a different example than that of our sigmoid neuron.  

$$h(x) = (3x + 1)^2$$

We can take the function $h(x)$ above, and express it as two separate functions, where:

$$f(x) =  3x + 1$$

$$g(y) =  y^2$$

And we can then represent $h(x)$ as

$$h(x) = g(f(x))$$
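As a quick check of this decomposition, the sketch below confirms that composing $f$ and $g$ gives the same values as $h$ at a few sample inputs (the sample inputs are arbitrary):

```python
def f(x):
    # inner function
    return 3 * x + 1

def g(y):
    # outer function
    return y ** 2

def h(x):
    # the original function, written directly
    return (3 * x + 1) ** 2

# h(x) and g(f(x)) agree at every sample input
for x in [-2, 0, 1, 3]:
    assert h(x) == g(f(x))

print(f(3), g(f(3)))  # 10 100
```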

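Now to find the derivative of $h$ with respect to $x$, the chain rule tells us to multiply the derivative of the outer function by the derivative of the inner function: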

$\frac{\delta h}{\delta x} = \frac{\delta g}{\delta f}*\frac{\delta f}{\delta x}$

Let's solve these individually.

$g(f) =  f^2$ and $\frac{\delta g}{\delta f} =  2f$

$f(x) =  3x + 1$, and $\frac{\delta f}{\delta x} =  3$

Now we can use these components to solve for $\frac{\delta h}{\delta x}$.

$\frac{\delta h}{\delta x} = \frac{\delta g}{\delta f}*\frac{\delta f}{\delta x} = 2f*3$, and because $f = 3x + 1$, substituting we get:

$\frac{\delta h}{\delta x} = 3*2*(3x + 1) = 18x + 6$

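We can check this result numerically by measuring the rate of change of $g(f(x))$ near $x = 3$:

```python
def f(x):
    return 3*x + 1

def g(y):
    return y**2

# average rate of change of g(f(x)) between x = 3 and x = 3.1
(g(f(3.1)) - g(f(3)))/(3.1 - 3)
# 60.90000000000012
```

This is close to the value the chain rule gives at $x = 3$:

```python
# 18x + 6 evaluated at x = 3
18*3 + 6
# 60
```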
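The small gap between the numerical estimate and the chain rule result comes from the size of the step. A minimal sketch, using hypothetical helper names `dh_dx` and `rate_of_change`, shows the estimate approaching $18x + 6$ as the step shrinks:

```python
def f(x):
    return 3 * x + 1

def g(y):
    return y ** 2

def dh_dx(x):
    # chain rule: dg/df * df/dx = 2*f(x) * 3
    return 2 * f(x) * 3

def rate_of_change(x, step):
    # average rate of change of g(f(x)) over [x, x + step]
    return (g(f(x + step)) - g(f(x))) / step

x = 3
for step in [0.1, 0.01, 0.001]:
    # the estimate gets closer to dh_dx(x) as step shrinks
    print(step, rate_of_change(x, step))

print(dh_dx(x))  # 60
```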

### The chain rule with the gradient

Now let's apply the chain rule to our cost function.  We'll continue to skip the sigmoid function for now, and instead assume that the weighted input directly predicts whether a cell is cancerous or not:

* $g(x) =  w_1x_1 + w_2x_2 + w_3x_3 + w_0$

So $g(x)$ is directly our hypothesis function, and our loss function is still the squared error:

* $J(\theta) = (actual - predicted)^2$, or to put it another way:

* $J(\theta) = \sum (y_i - g(x_i))^2$

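Now for the derivative we can ignore the summation.  The derivative of a sum is the sum of the derivatives, so we can work out the derivative for a single observation and add up the results afterwards.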

* $J(\theta) = (y - g(x))^2$

And remember that $g(x)$ is our weighted input, so let's plug that back in:

$J(\theta) = (y - (w_1x_1 + w_2x_2 + w_3x_3 + w_0) )^2 $
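The cost for a single observation can be sketched in code as follows. The function names `hypothesis` and `cost`, along with the feature values, label, and weights, are assumptions made for illustration:

```python
def hypothesis(w, w_0, x):
    # g(x) = w_1*x_1 + w_2*x_2 + w_3*x_3 + w_0
    return w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + w_0

def cost(w, w_0, x, y):
    # J = (y - g(x))^2 for a single observation
    return (y - hypothesis(w, w_0, x)) ** 2

x = [1.0, 2.0, 3.0]   # hypothetical features of one observation
y = 1.0               # hypothetical actual value
w = [0.5, 0.5, 0.5]   # hypothetical weights
w_0 = 0.0             # hypothetical bias

print(hypothesis(w, w_0, x))  # 3.0
print(cost(w, w_0, x, y))     # 4.0
```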

Now remember that we want to start at a random set of parameters $w$ and then update each parameter so that we descend along our cost curve $J$.  To do so, we move in the direction of the negative gradient, which points in the direction of steepest descent along the cost curve.

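So to move in the direction of the negative gradient, we need to find the partial derivative with respect to each parameter $w$.  Here the chain rule applies: the outer function squares its input, and the inner function is $y - g(x)$, whose derivative with respect to $w_1$ is $-x_1$.

$\frac{\delta J}{\delta w_1} = 2*(y - (w_1x_1 + w_2x_2 + w_3x_3 + w_0))*(-x_1) $

$\frac{\delta J}{\delta w_2} = 2*(y - (w_1x_1 + w_2x_2 + w_3x_3 + w_0))*(-x_2) $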
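These partial derivatives can be checked against a numerical estimate. This is a sketch under assumed values; the helper names `dJ_dw` and `numeric_dJ_dw` and the data are hypothetical:

```python
def hypothesis(w, w_0, x):
    # g(x) = w_1*x_1 + w_2*x_2 + w_3*x_3 + w_0
    return w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + w_0

def cost(w, w_0, x, y):
    # J = (y - g(x))^2 for a single observation
    return (y - hypothesis(w, w_0, x)) ** 2

def dJ_dw(w, w_0, x, y, i):
    # chain rule: 2 * (y - g(x)) * (-x_i)
    return 2 * (y - hypothesis(w, w_0, x)) * (-x[i])

def numeric_dJ_dw(w, w_0, x, y, i, step=1e-6):
    # nudge only weight i and measure how the cost responds
    w_plus = list(w)
    w_plus[i] += step
    return (cost(w_plus, w_0, x, y) - cost(w, w_0, x, y)) / step

x = [1.0, 2.0, 3.0]   # hypothetical features
y = 1.0               # hypothetical actual value
w = [0.5, 0.5, 0.5]   # hypothetical weights
w_0 = 0.0             # hypothetical bias

for i in range(3):
    print(dJ_dw(w, w_0, x, y, i), numeric_dJ_dw(w, w_0, x, y, i))
```

With these values the chain rule gives 4.0, 8.0, and 12.0 for the three weights, and the numerical estimates land very close to them.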

And because we move along the negative gradient, we reverse the signs of the partial derivatives to get:

$-\frac{\delta J}{\delta w_1} = 2*(y - (w_1x_1 + w_2x_2 + w_3x_3 + w_0))*x_1$

$-\frac{\delta J}{\delta w_2} = 2*(y - (w_1x_1 + w_2x_2 + w_3x_3 + w_0))*x_2 $
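To see the negative gradient in action, here is a minimal sketch of repeated update steps on a single observation. The learning rate, the data, and the choice to hold $w_0$ fixed are all assumptions made for illustration, not part of the derivation above:

```python
def hypothesis(w, w_0, x):
    # g(x) = w_1*x_1 + w_2*x_2 + w_3*x_3 + w_0
    return w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + w_0

def cost(w, w_0, x, y):
    # J = (y - g(x))^2 for a single observation
    return (y - hypothesis(w, w_0, x)) ** 2

def step(w, w_0, x, y, learning_rate):
    # move each weight along the negative gradient: 2 * (y - g(x)) * x_i
    error = y - hypothesis(w, w_0, x)
    return [w_i + learning_rate * 2 * error * x_i for w_i, x_i in zip(w, x)]

x = [1.0, 2.0, 3.0]    # hypothetical features
y = 1.0                # hypothetical actual value
w = [0.5, 0.5, 0.5]    # hypothetical starting weights
w_0 = 0.0              # bias, held fixed in this sketch
learning_rate = 0.01   # hypothetical step size

costs = [cost(w, w_0, x, y)]
for _ in range(20):
    w = step(w, w_0, x, y, learning_rate)
    costs.append(cost(w, w_0, x, y))

print(costs[0], costs[-1])
```

Each step lowers the cost here because the step size is small enough for this data. A much larger learning rate could overshoot the minimum.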