# Applying Gradient Descent

### Introduction

In the last lesson, we learned how to calculate the partial derivatives of a function.  In calculating the partial derivative, we hold every variable constant, except for the variable we are differentiating with respect to.

So we saw that for a function like $f(x, y) = 3x^4y^2$, we could calculate the partial derivative with respect to $x$, $\frac{\delta f}{\delta x}$, which is the instantaneous rate of change of our function $f$ as we nudge our value of $x$ while holding $y$ fixed.

$\frac{\delta f}{\delta x} = 3*4*x^3y^2 = 12x^3y^2$.  So because we were differentiating with respect to $x$, we treated $y^2$ like a constant.
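One way to sanity-check a partial derivative is to compare it against a finite-difference estimate, which literally nudges $x$ while keeping $y$ fixed. The snippet below is only an illustrative sketch: the helper names `df_dx` and `numerical_df_dx`, the nudge size `h`, and the test point $(2, 2)$ are assumptions made for this example.

```python
def f(x, y):
    return 3 * x**4 * y**2

def df_dx(x, y):
    # analytic partial derivative with respect to x (y is treated as a constant)
    return 12 * x**3 * y**2

def numerical_df_dx(x, y, h=1e-6):
    # central difference: nudge x by a small amount h in each direction, y stays fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

analytic = df_dx(2, 2)
estimate = numerical_df_dx(2, 2)
print(analytic, estimate)  # the two values should agree closely
```

If the two numbers disagree noticeably, the hand-derived formula is the first thing to re-check.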

### Calculating the gradient

Believe it or not, we have already done all of the work necessary to calculate the gradient of the function $f(x,y) = 3x^4y^2$.  

Remember that in gradient descent we stand at a given point and look for the direction of steepest descent from that point.  We do this by finding the slope in each direction.  So to figure out how to descend along the function $f(x,y) = 3x^4y^2$, we look at our two partial derivatives:

$$\frac{\delta f}{\delta y} = 6x^4y$$

$$\frac{\delta f}{\delta x} = 12x^3y^2$$



So for example, if we were standing on our surface at the point $x = 2$, $y = 2$, then the partial derivatives tell us in what proportions to change our values of $x$ and $y$:

$$\frac{\delta f}{\delta y}(2, 2) = 6x^4y = 6*2^4*2  = 192$$

$$\frac{\delta f}{\delta x}(2, 2) =  12x^3y^2 = 12*2^3*2^2 = 384$$

So, when standing at the point $(2, 2)$, we should move twice as much in the x direction as we do in the y direction.  

There is just one twist.  What we have found above, by taking the two partial derivatives, is the direction of **greatest increase**.  So for the direction of **greatest decrease**, we simply move in the opposite direction.  This means we move in proportion to -192 in the $y$ direction and -384 in the $x$ direction.
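The arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function name `partial_derivatives` and the variable names are assumptions introduced for illustration.

```python
def partial_derivatives(x, y):
    # partial derivatives of f(x, y) = 3x^4 y^2, returned as (df/dx, df/dy)
    return (12 * x**3 * y**2, 6 * x**4 * y)

df_dx, df_dy = partial_derivatives(2, 2)

# flipping both signs gives the direction of greatest decrease
descent_direction = (-df_dx, -df_dy)

print(df_dx, df_dy)        # 384 192
print(descent_direction)   # (-384, -192)
print(df_dx / df_dy)       # 2.0, i.e. twice as much movement in x as in y
```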

The values that we found above can be represented as a vector, one that holds the rate of change in each direction.  We denote this vector with the symbol nabla, $\nabla$, which represents our gradient.  So our gradient looks like the following:

$$\nabla f(x,y) = \begin{bmatrix}
\frac{\delta f}{\delta x} \\
  \frac{\delta f}{\delta y} 
\end{bmatrix}$$

Or, applying this to the gradient of our function $f(x, y) = 3x^4y^2$:

$$\nabla f(x,y) = \begin{bmatrix}
12x^3y^2 \\
  6x^4y 
\end{bmatrix}$$

And for the direction of greatest descent, we move in the direction of the negative gradient.

$$ - \nabla f(x,y) = \begin{bmatrix}
 -12x^3y^2 \\
  -6x^4y 
\end{bmatrix}$$
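To build some intuition for why the negative gradient is the direction to move in, we can take a step of the same small length in many directions and compare the resulting values of $f$. The code below is a rough sketch under stated assumptions: the step length `step`, the 10-degree spacing of the sampled directions, and the helper names are all choices made for this illustration.

```python
import math

def f(x, y):
    return 3 * x**4 * y**2

def gradient(x, y):
    # (df/dx, df/dy) for f(x, y) = 3x^4 y^2
    return (12 * x**3 * y**2, 6 * x**4 * y)

x0, y0 = 2.0, 2.0
step = 1e-4  # assumed small step length, so the local slope dominates

# step along the negative gradient, rescaled to have length `step`
gx, gy = gradient(x0, y0)
norm = math.hypot(gx, gy)
f_neg_gradient = f(x0 - step * gx / norm, y0 - step * gy / norm)

# take a step of the same length in 36 evenly spaced directions
f_other_directions = [
    f(x0 + step * math.cos(math.radians(angle)),
      y0 + step * math.sin(math.radians(angle)))
    for angle in range(0, 360, 10)
]

# compare the negative-gradient step with the best of the sampled directions
print(f(x0, y0), f_neg_gradient, min(f_other_directions))
```

The comparison is only meaningful for small steps, since the gradient describes the surface locally.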

> The **gradient** of a function is a vector whose entries are the partial derivatives of the function.  It is the direction of fastest increase.  For gradient descent, we move in the direction of the negative gradient.

### From gradient to gradient descent

Ok, now let's use our calculation of the gradient in our gradient descent procedure.  Let's use gradient descent to approach the minimum of the function we've been working with in this lesson: 

$$f(x,y) = 3x^4y^2$$

In [23]:
def f(x, y):
    return 3*x**4*(y**2)

In [116]:
import plotly.graph_objects as go
import json
with open('./three_x_y_squared.json') as file:
    data = json.load(file)
multi_param_fig = go.Figure(data)
multi_param_fig

So our plot is a cone (or a paraboloid, if you prefer) that is steeper in the $x$ direction than the $y$ direction.

Now for our gradient descent procedure, we pick a starting point, like $(2, 2)$, and then move in the direction of the negative gradient.

$$f(x,y) = 3x^4y^2$$

$$ - \nabla f(x,y) = \begin{bmatrix}
 -12x^3y^2 \\
  -6x^4y 
\end{bmatrix}$$

We want to keep our step size small, because the gradient only tells us the direction of greatest descent at our current point, and a large step can carry us far from where that direction applies.  We do this by multiplying our gradient by a small learning rate, like $0.01$.  We'll represent our learning rate by the Greek letter $\eta$ (eta).
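To see the effect of the learning rate, we can compare a single step from $(2, 2)$ taken with a small and a large value of $\eta$. This is a hedged sketch: the helper `step` and the two learning rates `0.001` and `0.1` are assumptions picked for illustration.

```python
def f(x, y):
    return 3 * x**4 * y**2

def step(x, y, eta):
    # one move against the gradient, with both partials evaluated at the current point
    return (x - eta * 12 * x**3 * y**2, y - eta * 6 * x**4 * y)

start = (2.0, 2.0)
small = step(*start, eta=0.001)  # cautious step
large = step(*start, eta=0.1)    # overly aggressive step

print(f(*start))
print(small, f(*small))  # f goes down
print(large, f(*large))  # f goes up: the step overshoots
```

With the large learning rate the point is thrown far past the low region of the surface, so the function value increases instead of decreasing.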

So for a function $f(x, y)$, our procedure for gradient descent is to repeatedly update $x$ and $y$ by: 
    
$$ x = x - \eta \frac{\delta f}{\delta x}$$

$$ y = y - \eta \frac{\delta f}{\delta y}$$
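As a worked illustration of one update for $f(x, y) = 3x^4y^2$ (the starting point $(2, 2)$ and the learning rate $\eta = 0.001$ are chosen here only as an example):

$$ x = 2 - 0.001*12*2^3*2^2 = 2 - 0.384 = 1.616$$

$$ y = 2 - 0.001*6*2^4*2 = 2 - 0.192 = 1.808$$

Note that both partial derivatives are evaluated at the current point $(2, 2)$ before either coordinate changes.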

So now for gradient descent we do the following:

In [139]:
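# starting point
x = 1.5
y = 1.5

iterations = 20000
eta = .0001  # learning rate

for iteration in range(iterations):
    x = x - eta*12*(x**3)*(y**2)  # step against df/dx
    # note: the next line uses the x that was just updated above
    y = y - eta*6*(x**4)*(y)      # step against df/dy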

In [140]:
x, y

(0.13269008745477753, 1.0697499501309853)
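After 20,000 iterations $x$ has shrunk considerably, while $y$ has moved much less. One way to see why is to watch the size of the gradient, since each step is proportional to it. The following is a minimal sketch rather than the lesson's own code: the `gradient` helper and variable names are assumptions, and it evaluates both partial derivatives at the same point before updating, so its final values may differ slightly from the ones shown.

```python
import math

def gradient(x, y):
    # (df/dx, df/dy) for f(x, y) = 3x^4 y^2
    return (12 * x**3 * y**2, 6 * x**4 * y)

x, y = 1.5, 1.5
eta = 0.0001
start_norm = math.hypot(*gradient(x, y))

for _ in range(20000):
    gx, gy = gradient(x, y)            # both partials use the same (x, y)
    x, y = x - eta * gx, y - eta * gy  # update the two coordinates together

end_norm = math.hypot(*gradient(x, y))
print(x, y)
print(start_norm, end_norm)  # the gradient is far smaller at the end
```

Because both partial derivatives contain a high power of $x$, they become tiny once $x$ is small, and the updates slow to a crawl.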

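> Here, we are having some trouble getting even closer to the minimum.  Because $x$ and $y$ are raised to very different powers, both partial derivatives ($12x^3y^2$ and $6x^4y$) shrink quickly as $x$ gets small, so each step moves us less and less and $y$ barely changes.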

This becomes clearer if we plot each of these steps.

In [113]:
initial_x = 2
initial_y = 2

iterations = 100
x_vals_d = [initial_x]
y_vals_d = [initial_y]
eta = .001

for iteration in range(0, iterations):
    current_x_val = x_vals_d[-1]
    current_y_val = y_vals_d[-1]
    x_val = current_x_val -eta*12*(current_x_val**3)*(current_y_val**2)
    y_val = current_y_val -eta*6*(current_x_val**4)*(current_y_val)
    x_vals_d.append(x_val)
    y_vals_d.append(y_val)

In [114]:
descent_vals = [(f(x_val,y_val)) for x_val, y_val in zip(x_vals_d, y_vals_d)]

In [141]:
descent_three_d = go.Scatter3d(
        x=x_vals_d,
        y=y_vals_d,
        z=descent_vals,
    )
multi_param_fig.add_trace(descent_three_d)

### Summary

In this lesson, we saw how we can use the partial derivatives to first calculate the gradient and then descend along the cost curve by repeatedly taking steps in the direction of the negative gradient.

We can express this mathematically, for a function $f(x, y)$, as calculating each update to $x$ and $y$ as:

$$ x = x - \eta \frac{\delta f}{\delta x}$$

$$ y = y - \eta \frac{\delta f}{\delta y}$$

And we were able to translate our gradient descent procedure into code, with the following: 

In [89]:
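x = 1.5
y = 1.5

iterations = 100
eta = .01  # learning rate

for iteration in range(iterations):
    x = x - eta*12*(x**3)*(y**2)  # step against df/dx
    # note: the next line uses the x that was just updated above
    y = y - eta*6*(x**4)*(y)      # step against df/dy
(x, y)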

(0.13620829648389088, 1.4424943413061868)