### Introduction

As you know, we entered our discussion of derivatives to understand the speed and direction with which to move along a cost curve.  This led us to learn how to find the derivative of a single-variable function, and then of a multivariable function using partial derivatives.  However, we have not yet explicitly shown how partial derivatives apply to gradient descent.

Well, that's what we hope to show in this lesson: how partial derivatives reveal the path that minimizes our cost function, and thus lead us to our "best fit" regression line.

### Finding the steepest path

Now gradient descent literally means that we are taking the steepest path to *descend* towards our minimum.  However, it is somewhat easier to understand gradient ascent than descent, and the two are closely related, so that's where we'll begin.  Gradient ascent simply means that we want to move in the direction of steepest ascent.

Now moving in the direction of greatest ascent for a function $f(x,y)$ means that our next step combines some distance in the $x$ direction with some distance in the $y$ direction, chosen so that the step climbs most steeply from that point.

![](./Denali.jpg)

Note how this is a different task from what we have previously worked on for multivariable functions.  So far, we have used partial derivatives to calculate the **gain** from moving directly in either the $x$ direction or the $y$ direction.  In gradient ascent, our task is not to calculate the gain from a move in just the $x$ or just the $y$ direction.  Instead, our task is to find the combination of changes in $x$ and $y$ that brings the largest change in output.

In finding that direction of largest change, we do use partial derivatives.  As we know, the partial derivative $\frac{df}{dx}$ calculates the change in output from moving a little bit in the $x$ direction, and the partial derivative $\frac{df}{dy}$ calculates the change in output from moving a little bit in the $y$ direction.  The direction of steepest ascent is the combined change in $x$ and $y$ that produces the greatest change in output, so if $\frac{df}{dy} > \frac{df}{dx}$, we should move more in the $y$ direction than the $x$ direction, and vice versa.  That is, we want to get the biggest bang for our buck.

In fact, the direction of greatest ascent for a function $f(x, y)$ is the step where we move in proportion to the partial derivatives: $\frac{df}{dy}$ units in the $y$ direction for every $\frac{df}{dx}$ units in the $x$ direction.  So if $\frac{df}{dy} = 5$ and $\frac{df}{dx} = 1$, our next step should be five times larger in the $y$ direction than in the $x$ direction.
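To make that proportion concrete, here is a minimal sketch in Python.  The helper name `step_direction` is hypothetical (it is not from this lesson), and scaling the step to length one is an assumption made only so that different directions are comparable.

```python
import math

def step_direction(df_dx, df_dy):
    """Return a unit-length step (dx, dy) proportional to the partial derivatives."""
    length = math.hypot(df_dx, df_dy)
    if length == 0:
        # Both partial derivatives are zero: the surface is flat here.
        return (0.0, 0.0)
    return (df_dx / length, df_dy / length)

# The example from the text: df/dy = 5 and df/dx = 1.
dx, dy = step_direction(df_dx=1, df_dy=5)
print(dx, dy)
print(dy / dx)  # how many times further we move in y than in x
```

Whatever length we choose for the step, the ratio between the $y$ and $x$ components stays the same as the ratio between the two partial derivatives.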

### Applying Gradient Descent 

Let's see this in an example.  Here is a plot of a function:
$$f(x,y) = 2x + 3y $$

![](./3dx3y.png)

Now if you imagine being at the bottom left of the graph at the point $x = 1$, $y = 1$, what would be the direction of steepest ascent?  It seems, just sizing it up visually, that we should move diagonally in the positive $y$ direction and positive $x$ direction.  Looking more carefully, it seems we should move more in the $y$ direction than the $x$ direction.  Let's see what our technique of taking the partial derivative indicates.   

The partial derivatives of the function $f(x,y) = 2x + 3y $ are the following: 

$\frac{df}{dx}(2x + 3y) = 2 $ and $\frac{df}{dy}(2x + 3y) = 3 $.

And what this tells us is that for the function above, to move in the direction of greatest ascent, we should move up three and to the right two.  So we would expect our path of greatest ascent to look like the following.

![](./DirectionGradientAscent.png)

So this path seems to match up pretty well with what we saw visually.  That is the idea behind gradient ascent, and by reversing it, gradient descent.  By now you know that we can take the derivative of a single-variable function to calculate how a small change in the input will change the output of the function.  And we know that the partial derivative allows us to determine how a change along a specific dimension or axis will affect our output.  The gradient is the collection of partial derivatives, one with respect to each variable, in this case $x$ and $y$.  And the import of the gradient is that its direction is the direction of steepest ascent.  The negative gradient, that is the negative of each of the partial derivatives, is the direction of steepest descent.  So our direction of steepest descent is a move of $-2$ in the $x$ direction for every $-3$ in the $y$ direction.
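As a rough check of this claim, the sketch below approximates the gradient of $f(x,y) = 2x + 3y$ numerically and compares the gain from a unit-length step in a few directions.  The helper names `numerical_gradient` and `gain`, the step size `h`, and the choice of comparison directions are all assumptions made for illustration.

```python
import math

def f(x, y):
    return 2 * x + 3 * y

def numerical_gradient(func, x, y, h=1e-6):
    """Approximate (df/dx, df/dy) at (x, y) with central differences."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy

def gain(direction, x=1, y=1):
    """Change in f after a unit-length step from (x, y) in the given direction."""
    length = math.hypot(direction[0], direction[1])
    dx, dy = direction[0] / length, direction[1] / length
    return f(x + dx, y + dy) - f(x, y)

gradient = numerical_gradient(f, 1, 1)
descent = (-gradient[0], -gradient[1])
print(gradient)  # approximately (2, 3)
print(descent)   # approximately (-2, -3)

# Steps of the same length gain the most along the gradient direction.
print(gain((2, 3)), gain((1, 1)), gain((0, 1)))
```

A step along $(2, 3)$ gains more than an equally long step along the diagonal $(1, 1)$ or straight along the $y$ axis, which is what we would expect if the gradient points in the direction of steepest ascent.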

### Applying Gradient Descent to our Cost Function

Ok, so now that we know how to use the gradient to find the direction of steepest descent, we can apply this to our cost curve.  That is, we can see how changes in our slope value and y-intercept value change our cost.  And then, given a regression line, we can figure out how to change that line next.

![](./gradientdescent.png)

Ok, remember that for our cost function, we use the formula $RSS = \sum(guess - actual)^2 = \sum(\overline{y} - y)^2 = \sum(mx + b - y)^2$, summed over all of our data points, where $\overline{y} = mx + b$ is the value our regression line predicts.  Let's call our cost function $J$.  Because our error, RSS, is a function of our slope and our y-intercept, we write it as $J(m,b)$.  So we say:

$J(m, b) = \sum(mx + b - y)^2$
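Here is a minimal sketch of this cost function in Python.  The function name `rss` and the three data points are made up for illustration; they are not from this lesson's dataset.

```python
def rss(m, b, xs, ys):
    """Residual sum of squares J(m, b) for the line y = m*x + b."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points (x, y): (1, 2), (2, 4), (3, 7)
xs = [1, 2, 3]
ys = [2, 4, 7]

print(rss(2, 0, xs, ys))  # errors are 0, 0, -1, so the cost is 1
print(rss(1, 1, xs, ys))  # errors are 0, -1, -3, so the cost is 10
```

Each choice of $m$ and $b$ gives a different cost, and the line with the lower cost fits these points better.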

Ok, so remember we want to find the values of $m$ and $b$ that minimize our RSS.  So we should have an idea of how to do that: use gradient descent to tell us the direction of steepest descent towards our minimum.  So, as we know, to find the gradient of our function $J(m,b)$, we take the partial derivative with respect to each variable of the function, that is $\frac{dJ}{dm}$ and $\frac{dJ}{db}$.  Because the derivative of a sum is the sum of the derivatives, we won't change the result if we ignore the summation while we differentiate a single term and then add it back at the end, so that's what we'll do to make our lives easier.

Ok, so let's take the partial derivative with respect to $m$ first.   $\frac{dJ}{dm}J(m, b) = \frac{dJ}{dm}(mx + b - y)^2$.

Now to take a derivative of a function like this, we can use functional composition followed by the chain rule.  Using functional composition, we can rewrite our function $J$ as two functions: 

$$g(m,b) = mx +b - y$$
$$J(g(m,b)) = (g(m,b))^2$$

So using the chain rule, to find the partial derivative with respect to a change in the slope we have: $$\frac{dJ}{dm}J(g) = \frac{dJ}{dg}J(g)*\frac{dg}{dm}g(m,b)$$

Ok, now solving our derivatives individually we have: 
* $\frac{dJ}{dg}J(g) = \frac{dJ}{dg}g^2 = 2*g$
* $\frac{dg}{dm}g(m,b) = \frac{dg}{dm}mx +\frac{dg}{dm}b - \frac{dg}{dm}y = x $

Now plugging these back into our chain rule we have: 

$\frac{dJ}{dg}J(g)*\frac{dg}{dm}g(m,b) = (2*g(m,b))*x = 2*(mx + b -y)*x $

Ok, now let's calculate the partial derivative with respect to a change in the y-intercept.  Using the chain rule again, we have:

$$\frac{dJ}{db}J(g) = \frac{dJ}{dg}J(g)*\frac{dg}{db}g(m,b)$$

Once again, we view our cost function as the same two functions $g(m,b)$ and $J(g(m,b))$.  From earlier, we know that $\frac{dJ}{dg}J(g) = \frac{dJ}{dg}g^2 = 2*g$.  The only thing left to calculate is $\frac{dg}{db}g(m,b)$.

$\frac{dg}{db}g(m,b) = \frac{dg}{db}mx +\frac{dg}{db}b - \frac{dg}{db}y = 1$

Now we plug our terms into our chain rule and get: 

$$ \frac{dJ}{dg}J(g)*\frac{dg}{db}g(m,b) = 2*g*1 = 2*(mx + b -y) $$
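One way to sanity-check the two chain-rule results is to compare them with a numerical approximation for a single data point.  The sketch below does that; the function names, the step size `h`, and the values chosen for `m`, `b`, `x` and `y` are all hypothetical.

```python
def squared_error(m, b, x, y):
    """Cost contribution of one point: (m*x + b - y)**2."""
    return (m * x + b - y) ** 2

def dJ_dm(m, b, x, y):
    """Chain-rule result for the slope: 2 * (m*x + b - y) * x."""
    return 2 * (m * x + b - y) * x

def dJ_db(m, b, x, y):
    """Chain-rule result for the y-intercept: 2 * (m*x + b - y)."""
    return 2 * (m * x + b - y)

def central_difference(func, value, h=1e-6):
    """Approximate the derivative of a one-argument function at value."""
    return (func(value + h) - func(value - h)) / (2 * h)

# A made-up line and data point, so g = m*x + b - y = 2.
m, b, x, y = 3, 1, 2, 5

approx_dm = central_difference(lambda slope: squared_error(slope, b, x, y), m)
approx_db = central_difference(lambda intercept: squared_error(m, intercept, x, y), b)

print(dJ_dm(m, b, x, y), approx_dm)  # the formula gives 2 * 2 * 2 = 8
print(dJ_db(m, b, x, y), approx_db)  # the formula gives 2 * 2 = 4
```

If the formulas are right, each pair of printed numbers should agree up to a small rounding error.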

Ok, so now we have our two partial derivatives.  But as we know, to point us in the direction of greatest descent we have to reverse the direction of each of these derivatives by reversing the sign, giving us:

* $-\frac{dJ}{dm}J(m,b) = -2*x(mx + b -y)$
* $-\frac{dJ}{db}J(m,b) = -2*(mx + b -y)$

And as $mx + b$ is just our regression line's prediction, $\overline{y}$, we can simplify these formulas to be:

* $-\frac{dJ}{dm}J(m,b) = -2*x(\overline{y} - y)$
* $-\frac{dJ}{db}J(m,b) = -2*(\overline{y} - y)$

and adding back in our summations we have:

* $-\frac{dJ}{dm}J(m,b) = -2*\sum x(\overline{y} - y)$
* $-\frac{dJ}{db}J(m,b) = -2*\sum(\overline{y} - y)$
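The summed formulas above can be sketched in a few lines of Python.  The function name `descent_direction` and the data points are assumptions made for illustration; the function returns the two sign-reversed partial derivatives, which together point in the direction of steepest descent.

```python
def descent_direction(m, b, xs, ys):
    """Return the sign-reversed partial derivatives of J(m, b) = sum((m*x + b - y)**2)."""
    errors = [m * x + b - y for x, y in zip(xs, ys)]       # predicted minus actual
    step_m = -2 * sum(x * e for x, e in zip(xs, errors))   # -2 * sum(x * (prediction - y))
    step_b = -2 * sum(errors)                              # -2 * sum(prediction - y)
    return step_m, step_b

# Hypothetical data points (x, y): (1, 2), (2, 4), (3, 7)
xs = [1, 2, 3]
ys = [2, 4, 7]

print(descent_direction(1, 1, xs, ys))  # (22, 8)
```

For the line $m = 1$, $b = 1$ both values are positive, which suggests that increasing the slope and the y-intercept a little should lower the cost for these points.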

So that is what we'll do to find our "best fit" regression line.  We'll start with an initial regression line with values of $m$ and $b$.  Then we'll go through our dataset, and use the above formulas to tell us how to update our regression line so that it keeps reducing our cost function.  We'll spend the next lesson walking through this technique.
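As a preview, here is a rough sketch of that procedure.  The `learning_rate` (how far to move along the descent direction on each update), the number of `steps`, the starting line, and the data points are all assumptions chosen for illustration, not values from this lesson.

```python
def gradient_step(m, b, xs, ys, learning_rate):
    """Move m and b a small distance in the direction of steepest descent."""
    errors = [m * x + b - y for x, y in zip(xs, ys)]
    step_m = -2 * sum(x * e for x, e in zip(xs, errors))
    step_b = -2 * sum(errors)
    return m + learning_rate * step_m, b + learning_rate * step_b

def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Repeat the update many times, starting from the line m = 0, b = 0."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        m, b = gradient_step(m, b, xs, ys, learning_rate)
    return m, b

# Hypothetical data points (x, y): (1, 2), (2, 4), (3, 7)
xs = [1, 2, 3]
ys = [2, 4, 7]

m, b = fit_line(xs, ys)
print(round(m, 3), round(b, 3))
```

For these three points the least-squares line has slope $2.5$ and y-intercept $-\frac{2}{3}$, so the printed values should land close to those.  A learning rate that is too large can make the updates overshoot and diverge, which is why a small value is assumed here.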

### Summary

In this section we developed some intuition for why the gradient of a function is the direction of steepest ascent and the negative gradient of a function is the direction of steepest descent.  Essentially, the gradient uses the partial derivatives to see what change will result from a move along any of the function's dimensions, and then moves in a direction weighted towards the partial derivative with the larger magnitude.

We also practiced calculating some gradients, and ultimately calculated the gradient for our cost function.  This gave us two formulas which tell us how to update our regression line so that it descends along our cost function and approaches a "best fit line".