### Introduction

In the last lesson, we filled in more detail on how to find the "best fit" regression line with gradient descent.  Namely, we change the y-intercept or slope of the regression line more carefully to minimize the residual sum of squares.  We do this by calibrating both the direction and the size of our change in $b$ or $m$ to the slope of the tangent line to the cost curve at that point.

![](./tangent-lines.png)

So when the slope is negative we increase our $m$ or $b$ variable, and vice versa.  Also, the larger the absolute value of the slope, the larger the change in our variable -- that is, the larger our step size.  So a lot of our gradient descent technique depends on finding the slope of a curve at a specific point.  In this lesson, we'll learn different approaches for finding this slope.

## Talking about derivatives

The slope of the line tangent to a function at a given point is called the **derivative** of the function at that point.  A derivative answers questions about change.  For example, if you look at our blue curve above, the various slopes indicate how much our output (in this case, our RSS) changes as we increase our input (here, our value of $b$).  At the point $b = 70$, the cost curve decreases a lot as you move forward.  At the point $b = 90$, the cost curve is still decreasing as you move $b$ forward, but significantly less.  Thus at both $b = 70$ and $b = 90$ the derivative is negative, as the change in cost is downward.  But the magnitude of change at the two points differs: when $b = 70$ the derivative is -146.17, versus a derivative of -21.07 when $b = 90$, reflecting the smaller rate of change there.
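To tie these two derivative values back to gradient descent, here is a minimal sketch of how a slope could be turned into a step.  The function name `updated_value` and the `learning_rate` of 0.01 are assumptions made for illustration, not values from the lesson.

```python
def updated_value(current_value, slope, learning_rate=0.01):
    """Move against the slope: a negative slope increases the value.

    learning_rate is a hypothetical scaling factor chosen for this sketch.
    """
    return current_value - learning_rate * slope

# Slopes quoted in the text for the cost curve at b = 70 and b = 90.
step_at_70 = updated_value(70, -146.17) - 70
step_at_90 = updated_value(90, -21.07) - 90

print(round(step_at_70, 4))
print(round(step_at_90, 4))
```

Both steps are positive because both slopes are negative, and the step at $b = 70$ is larger because the slope there has the larger absolute value.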

Ok, so the derivative of a function is the rate of change of a function.  But how do we calculate the rate of change of a function?

### Calculating a derivative

> For this section, let's move away from talking about derivatives just in the context of our cost curve, and talk about the derivatives of functions in general.  So instead of talking about our input as a value of $b$, let it be a value of $x$.  And instead of our output being a value of RSS, let it be a value of $y$.

By way of explaining how to calculate a derivative, it's worth pointing out the notation for a derivative: $\frac{dy}{dx}$.  You can read that as the change in $y$ with respect to a change in $x$.  So if we want to denote that we are taking the derivative of a function, like the function $y = 3x$, we would write $\frac{dy}{dx}(y)$ where $y = 3x$.  Ok, so now to actually calculate the derivative of that function, we have to answer the question: what is the change in $y$ for a given change in $x$?

In [1]:
import plotly
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

def y(x):
    return 3*x

def line_function_data(line_function, x_values):
    y_values = list(map(lambda x: line_function(x), x_values))
    return {'x': x_values, 'y': y_values}

def plot(traces):
    plotly.offline.iplot(traces)

y_3x_trace = line_function_data(y, list(range(0, 11)))

plot([y_3x_trace])

Ok, to measure $\frac{dy}{dx}(y)$ where $y = 3x$, we can simply choose a value of $x$, increase that value, and then see how $y$ changes as we increase $x$.  That is the amount that $y$ changes given a change in $x$.  Let's do it by choosing $x = 4$.

* $ y_{x=4} = 3*4 = 12 $
* Then we change our value of $x$, so let's change $x$ from 4 to 5
* $ y_{x=5} = 3*5 = 15 $

Now our derivative, $\frac{dy}{dx}$ of the function, is the ratio of the change in $y$ to the change in $x$.  So here, we have change in $y$ over change in $x$:

$(\Delta y)/(\Delta x) = (15 - 12)/(5 - 4) = 3/1 = 3 $

This makes sense: with the formula $y = 3x$, for every unit that we increase $x$, $y$ increases by 3.  So this means $\frac{dy}{dx}(3x) = 3$.
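The same arithmetic can be sketched in a few lines of Python.  The helper name `rate_of_change` is an assumption introduced here for illustration; `y` is the same function defined in the cell above.

```python
def y(x):
    return 3 * x

def rate_of_change(function, x_start, x_end):
    """Change in output divided by change in input between two x values.

    Hypothetical helper for this sketch, not part of the lesson's code.
    """
    return (function(x_end) - function(x_start)) / (x_end - x_start)

print(rate_of_change(y, 4, 5))    # 3.0
print(rate_of_change(y, 4, 10))   # 3.0
print(rate_of_change(y, -2, 7))   # 3.0
```

For a straight line, the result does not depend on which two $x$ values we pick.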



In [12]:
def build_tangent_line(original_function, x, line_length = 5, delta = .01):
    curve_at_point = derivative_at(original_function, x, delta)
    slope = curve_at_point['slope']
    x_minus = x - line_length
    x_plus = x + line_length
    y = original_function(x)
    y_minus = y - slope * line_length
    y_plus = y + slope * line_length
    text = '    slope:' + format(slope, '.2f')
    return {'x': [x_minus, x, x_plus], 'y': [y_minus, y, y_plus], 'mode': 'lines+text', 'text': [text], 'textposition': 'right'}

def derivative_at(original_function, x, delta = .01):
    numerator = original_function(x + delta) - original_function(x)
    slope = numerator/delta
    return {'value': x, 'slope': slope}

tangent_at_four = build_tangent_line(y, 4, .5, .01)
tangent_at_eight = build_tangent_line(y, 8, .5, .01)
plot([y_3x_trace, tangent_at_four, tangent_at_eight])

As you can see above, the derivative of a function at a certain point is really the slope of the function at that point.  And when our function is $y = 3x$, that slope is the same for every $x$ value of the function.

### Derivatives of non-linear functions

But things quickly become trickier when working with more complicated functions.  And we will run into these functions.  For example, let's consider how to take the derivative of something that resembles our cost curve.  After all, figuring out the slope at a given point of a cost curve is what led us here.

In [19]:
import plotly
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

def fake_cost_curve(x):
    return (300*x - 300)**2

def line_function_data(line_function, x_values):
    y_values = list(map(lambda x: line_function(x), x_values))
    return {'x': x_values, 'y': y_values}

def plot(traces):
    plotly.offline.iplot(traces)


unscaled_values = list(range(-30, 50, 1))
x_values = list(map(lambda point: point/10, unscaled_values))
y_x_squared_trace = line_function_data(fake_cost_curve, x_values)

plot([y_x_squared_trace])

This is the graph of the function $y = (300*x - 300)^2 $.  How do we take the derivative when our function looks like this? Let's start by using our earlier technique to calculate the derivative at the point $x = 0$.

* when $x = 0$ then $y = 90,000 $
* when $x = 1 $ then $y = 0 $
* $(\Delta y)/(\Delta x) = (0 - 90,000)/(1- 0) = -90,000 $

In [20]:
y_x_squared_trace = line_function_data(fake_cost_curve, x_values)
tangent_at_zero = build_tangent_line(fake_cost_curve, 0, 1, 1)
plot([y_x_squared_trace, tangent_at_zero])

Take a look at the straight line in the graph above.  The straight line is supposed to have the same slope as the blue curve at the point $x = 0$.  But it doesn't seem to be doing a good job.  The line should point more steeply downwards.  How did that happen?  After all, the slope of the line equals the slope from our calculation above.  Let's take another look at our calculation of the derivative.

* when $x = 0$ then $y = 90,000 $
* when $x = 1 $ then $y = 0 $
* $(\Delta y)/(\Delta x) = (0 - 90,000)/(1- 0) = -90,000 $

The problem is that if we calculate $\Delta y / \Delta x$ and change $x$ by one, then we are really calculating the average rate of change of the function from zero to one.  But what we **want** to do is calculate the rate of change at just the point $x = 0$, and that is a different matter.  Unlike with our line $y = 3x$, here the rate at which $y$ decreases or increases is always changing.  The larger our delta, the less our calculation reflects the rate of change at just that point.


So what we need to do is shrink our delta ($\Delta$) until it is as close to zero as possible.  We can't set it to exactly zero: it's meaningless to calculate the amount of change in $y$ when the change in our input is zero, because then there is no change at all (and we would be dividing by zero).  So we use imagination.  We calculate the derivative with a $\Delta$ of .1, then calculate it again with a $\Delta$ of .01, then again with a $\Delta$ of .001.  Our calculation should converge on a single number as our $\Delta$ approaches zero, and that number is our derivative.  **The derivative of a function is the change in the function's output relative to the change in its input, as the change in the input approaches zero**.
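The shrinking-delta procedure can be sketched numerically before plotting anything.  The helper name `difference_quotient` is an assumption made for this sketch; `fake_cost_curve` is the function defined earlier in this lesson.

```python
def fake_cost_curve(x):
    return (300 * x - 300) ** 2

def difference_quotient(function, x, delta):
    """Change in output over change in input, starting at x (hypothetical helper)."""
    return (function(x + delta) - function(x)) / delta

# Estimate the slope at x = 0 with smaller and smaller deltas.
for delta in [1, .1, .01, .001, .0001]:
    slope = difference_quotient(fake_cost_curve, 0, delta)
    print(f"{delta:<8} {slope:,.0f}")
```

Each smaller delta changes the estimate by less than the one before, which is the convergence described above.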

In [57]:
y_x_squared_trace = line_function_data(fake_cost_curve, x_values)
# delta = 1
# delta = .1
# delta = .01
delta = .001
tangent_at_delta = build_tangent_line(fake_cost_curve, 0, 1, delta)
plot([y_x_squared_trace, tangent_at_delta])

In the cell above, you can change the tangent line simply by changing the value of `delta`.  Give it a shot!  What you'll see is the following:

| $ \Delta x $        | $ \Delta y/\Delta x $|
| ------------- |:-------------:|
| .1      | -171,000      |
| .01 | -179,100     |
| .001 | -179,910      |
| .0001 | -179,991      |


As you can see, as $\Delta x$ approaches zero, $\Delta y / \Delta x$ approaches $-180,000$.  This convergence on one number as we change another number is called the **limit**.  So to describe the above, we would say that the limit of $\Delta y / \Delta x$ (that is, the number $\Delta y / \Delta x$ approaches) as $\Delta x$ approaches zero is -180,000.  We can abbreviate this into the following expression:

When $x = 0$, $\lim_{\Delta x \to 0} \frac{\Delta y}{\Delta x} = -180,000$.
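One way to see why the limit is exactly -180,000 is to expand $\Delta y$ by hand at $x = 0$.  The following is a short worked derivation that uses only the function $y = (300*x - 300)^2$ from this section:

$\Delta y = (300*\Delta x - 300)^2 - (300*0 - 300)^2 = 90,000*(\Delta x)^2 - 180,000*\Delta x$

$\Delta y / \Delta x = 90,000*\Delta x - 180,000$

As $\Delta x$ shrinks toward zero, the first term vanishes and only $-180,000$ remains.  Plugging in $\Delta x = .1$ gives $9,000 - 180,000 = -171,000$, matching the first row of the table.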

### Our rules for calculating the derivative

From the above section you know that the derivative is equal to the slope of the tangent line along a graph.  The importance of the derivative is that it tells us the rate of change at a given point.  Or in the context of our cost curve, how much will our error change with a nudge of one of our values.

The derivative is the change in the value of an output with an infinitesimally small increase of an input.  Notice that the first estimate we calculated above was not very good.  The reason is that we used a change in $x$ of 1, which is way too large.  The change in $x$ should be much less than one; .0001 is more like it.

Luckily for us, we can follow some simple rules for calculating the derivative.

##### The power rule

The first rule for us to learn is the power rule.  The power rule states that given a function $f(x) = x^r$ then the derivative of $f(x)$, denoted $f'(x)$ is:

$ f'(x) = r*x^{r-1} $

So for example, take the function $f(x) = x^2$.  The power rule tells us that $f'(x) = 2*x^{2-1} = 2*x^1 = 2*x$.  Another way to read this is that a small increase in $x$ will produce an increase in $y$ of roughly $2*x$ times that small increase.  So when $x = 2$, we solve for the derivative at that point by simply plugging in 2 wherever we see $x$.  This gives us $f'(2) = 2*2 = 4$.  And when $x = 10$, then $f'(10) = 2*10 = 20$.  Estimating the slope numerically with a small $\Delta x$ gets close to these values, but the power rule gives them exactly.  The derivative of $f(x) = x^2$ is $2*x$.
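As a quick sanity check, here is a minimal sketch comparing the power rule with a numerical estimate for $f(x) = x^2$.  The names `power_rule`, `numerical_derivative`, and `square`, and the default `delta` of `1e-6`, are assumptions made for illustration.

```python
def power_rule(r):
    """Return the derivative of x**r according to the power rule."""
    return lambda x: r * x ** (r - 1)

def numerical_derivative(function, x, delta=1e-6):
    """Estimate the slope at x using a small, hypothetical delta."""
    return (function(x + delta) - function(x)) / delta

def square(x):
    return x ** 2

derivative_of_square = power_rule(2)

for x in [2, 10]:
    exact = derivative_of_square(x)
    estimate = numerical_derivative(square, x)
    print(x, exact, round(estimate, 3))
```

The numerical estimates agree with $2*x$ once rounded, which is what we would expect if the rule is right.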

We won't prove the power rule here.  But hopefully you can see that it does seem to fit our view of $f(x) = x^2$ well.  It seems reasonable that the slope of the line tangent to that curve is $2*x$: the curve gets steeper as $x$ grows.  Now let's look at a simpler case.  Let's assume that our error changes in the following way as we change $b$: $e = 3b$.  A plot of $e = 3b$ looks like the following:

In [3]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)

# Error values for the line e = 3b, for b from 0 to 19
b_values = list(range(0, 20))
error_values = [3 * b_value for b_value in b_values]

layout = dict(title = 'Error Values For line f(b) = 3*b')
trace0 = go.Scatter(
    x = b_values,
    y = error_values,
    name = 'markers'
)

fig = dict(data=[trace0], layout=layout)
iplot(fig)

Notice that for the line $e = 3b$, the derivative is constant.  That is, the rate of change of our function is the same for all values of $b$.  Unlike the case of $e = b^2$, where the slope constantly changes, here a nudge in the $b$ direction will produce the same increase in the output, regardless of where we are on the line.  For example, when $b = 8$, we can see the following:

$ \frac{de}{db} = \frac{24.0003 - 24}{8.0001 - 8} = \frac{.0003}{.0001} = 3 $

Note that our power rule also gives us a derivative of 3.

$f(b) = 3b = 3b^1$ 

$f'(b) = 1*3b^{1-1} = 3b^{0} = 3$

So our power rule shows that a change in b should produce a proportional increase of 3 times that change in our error.  And this is always the case for this error curve.

One more example.  Let's consider that our function is the following:

$f(b) = 1000$

In that case, our function is simply a horizontal line: no matter what the value of $b$, our error is always 1000.

In [4]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)

# Error values for the constant function f(b) = 1000, for b from 0 to 19
b_values = list(range(0, 20))
error_values = [1000 for b_value in b_values]

layout = dict(title = 'Error Values For line f(b) = 1000')
trace0 = go.Scatter(
    x = b_values,
    y = error_values,
    name = 'markers'
)

fig = dict(data=[trace0], layout=layout)
iplot(fig)

Note that here, the change in the error as we change b is always the same: 0.

So when f(b) = 1000, f'(b) = 0.  In fact, if the function is any constant, then the derivative of that function is zero. 

##### The constant factor rule

The above made use of the constant factor rule.  The constant factor rule addresses how to take the derivative of a function multiplied by a constant.  In the above example, we had the function $f(b) = 3*b$.  The derivative of that function is the same as $3 * \frac{d}{db}(b)$, leading to $3*1 = 3$ as we simply apply the power rule to $b$.

In the general case, consider the function $a*f(b)$ where $a$ is a number.  Then the derivative is $\frac{d}{db}(a*f(b)) = a * \frac{d}{db}(f(b))$.

Don't let the fancy equations confuse you.  The rule simply says to focus on taking the derivative of the part containing the variable, and if that part is multiplied by a number, then multiply its derivative by the same number.

So if $f(b) = 2*b^2$, this means that $f'(b) = 2*2*b = 4*b$.  The constant factor rule in action.
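The rule can also be checked numerically.  The sketch below assumes a hypothetical helper `numerical_derivative` with a small `delta`, and compares the derivative of $2*b^2$ with 2 times the derivative of $b^2$ at $b = 3$.

```python
def numerical_derivative(function, x, delta=1e-6):
    """Estimate the slope at x using a small, hypothetical delta."""
    return (function(x + delta) - function(x)) / delta

def f(b):
    return b ** 2

def scaled_f(b):
    return 2 * f(b)

b = 3
derivative_of_scaled = numerical_derivative(scaled_f, b)
scaled_derivative = 2 * numerical_derivative(f, b)

# Both estimates should be close to 2*2*b = 12.
print(round(derivative_of_scaled, 3), round(scaled_derivative, 3), 2 * 2 * b)
```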

##### The addition rule

Now consider that we receive a function like the following: 
    
$ f(b) = 4b^3 - b^2 + 3b $

First, we say that this function has three terms.  A term is a constant, a variable, or a product of the two, separated from the other terms by a plus or minus sign.  Ok, so to take the derivative of a function that has multiple terms, simply take the derivative of each of the terms individually.  So $f'(b) = 12b^2 - 2b + 3$.  Do you see what we did there?  We simply applied our previous rules to each of the terms individually and continued to add or subtract the terms accordingly.
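The three rules together are enough to differentiate any function built from terms like these.  Here is a minimal sketch, assuming each term is stored as a hypothetical `(coefficient, exponent)` pair; the function names are illustrative and not from the lesson.

```python
def term_derivative(coefficient, exponent):
    """Power rule plus constant factor rule for a single term a*b**n."""
    if exponent == 0:
        return (0, 0)  # the derivative of a constant is zero
    return (coefficient * exponent, exponent - 1)

def derivative(terms):
    """Addition rule: differentiate each term separately."""
    return [term_derivative(c, n) for c, n in terms]

def evaluate(terms, b):
    """Plug a value of b into a list of (coefficient, exponent) terms."""
    return sum(c * b ** n for c, n in terms)

# f(b) = 4b^3 - b^2 + 3b
f_terms = [(4, 3), (-1, 2), (3, 1)]
f_prime_terms = derivative(f_terms)

print(f_prime_terms)               # [(12, 2), (-2, 1), (3, 0)]
print(evaluate(f_prime_terms, 2))  # 12*4 - 2*2 + 3 = 47
```

The resulting pairs read as $12b^2 - 2b + 3$, the same derivative we found by hand.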

### Summary

In this section we saw that we can find the minimum error by following the line tangent to a graph.  And we can move along by following the line tangent to the spot where we are currently located.  We then saw how this holds for a two-dimensional graph, by considering how our error changes with respect to a change in $b$.  We identified this change in output from an infinitesimally small change in input as our derivative.

Then we considered three rules that allow us to calculate our derivative.  The trickiest of these is the power rule, which says that if $f(b) = b^n$, then $f'(b) = n * b^{n-1}$.  We still haven't seen how derivatives give us a way to understand gradient descent, but we will shortly when we consider how to take derivatives of functions with multiple variables, like an error function that is dependent on both $m$ and $b$.

But first, let's practice what we know about derivatives in a lab.