## Gradient descent

* Derivatives
* Gradients and partial derivatives
* Gradient descent

In [None]:
%pylab inline

## Linear models

$f(x) = \alpha x + \beta$ 

In [None]:
def f(x):
    return 1.5 * x + 10

xs = np.arange(0, 100)
plt.plot(xs, f(xs))

In [None]:
plt.ylabel('Murders per million people')
plt.xlabel('Growth rate in %')
plt.plot(xs, f(xs))

* How many murders do we get if our growth rate is 2%?

In [None]:
f(2)

In [None]:
plt.ylabel('Murders per million people')
plt.xlabel('Growth rate in %')
plt.plot(xs, f(xs))

* What is the minimum number of murders we can get if the growth rate cannot be negative?

In [None]:
f(0)

## Non-linear models


\begin{align} y =\tfrac{1}{4}x^3+\tfrac{3}{4}x^2-\tfrac{3}{2}x + 10 \end{align} 

In [None]:
def g(x):
    return x ** 3 / 4 + 3 * x ** 2 / 4 - 3 * x / 2 + 10

plt.ylabel('Murders')
plt.xlabel('Growth rate')

xs = np.arange(-5, 5, 0.1)
plt.plot(xs, g(xs))

* How many murders do we get if our growth rate is 2%?

In [None]:
g(2)

In [None]:
plt.plot(xs, g(xs))

* What is the minimum number of murders we can get if the growth rate cannot be negative?

In [None]:
g(0)

In [None]:
g(1)

## Optimum

* Global: The best value over the **entire** model
* Local: The best value within a small neighbourhood of the model


* **Very** important term
  * Models exist to be optimised: deaths, cancer, stock prices...

## Loss functions and optimum

* Your model has a loss function
  * Remember the MAE or RMSE
* You want to keep the loss function *low*

* Think about yourself as a human. You want to minimise pain and maximise pleasure

* New life purpose: How do we find the local or global optimum?
  * Video: [CCC: The ghost in the machine](https://media.ccc.de/v/35c3-10030-the_ghost_in_the_machine)
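
As a reminder of what "loss" means in practice, here is a minimal sketch of the MAE and RMSE mentioned above. The function names `mae` and `rmse` and the numbers are made up for illustration; they are not taken from any dataset in this notebook.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: punishes large errors harder than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Made-up values, for illustration only
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 19.0]

print(mae(actual, predicted))
print(rmse(actual, predicted))
```

Both numbers are zero only when every prediction is exact, which is why a *low* loss means a good model.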

## Function gradients

\begin{align} y =\tfrac{1}{4}x^3+\tfrac{3}{4}x^2-\tfrac{3}{2}x + 10 \end{align} 


\begin{align} y' =\tfrac{3}{4}x^2+\tfrac{6}{4}x-\tfrac{3}{2} \end{align} 
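
The derivative is useful because an optimum can only sit where the derivative is zero. As a rough sketch (the variable name `stationary` is just chosen for this example), we can solve the quadratic above with the quadratic formula and look at the function value at each solution:

```python
import math

def g(x):
    return x ** 3 / 4 + 3 * x ** 2 / 4 - 3 * x / 2 + 10

def g_prime(x):
    return 3 / 4 * x ** 2 + 6 / 4 * x - 3 / 2

# Solve a*x^2 + b*x + c = 0 with the quadratic formula
a, b, c = 3 / 4, 6 / 4, -3 / 2
root = math.sqrt(b ** 2 - 4 * a * c)
stationary = [(-b - root) / (2 * a), (-b + root) / (2 * a)]

for x in stationary:
    print(f"x = {x:.3f}, g(x) = {g(x):.3f}, g'(x) = {g_prime(x):.1e}")
```

The two solutions are $x = -1 \pm \sqrt{3}$. The positive one is a *local* minimum and the negative one a *local* maximum; neither is global, because a cubic keeps growing in one direction and falling in the other.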

In [None]:
def g_prime(x):
    return 3 / 4 * x ** 2 + 6 / 4 * x - 3 / 2

plt.plot(xs, g(xs))
plt.plot(xs, g_prime(xs))

In [None]:
plt.plot(xs, g(xs))
plt.plot(xs, g_prime(xs))
plt.plot(np.arange(-5, 5), np.repeat(0, 10))

In [None]:
X, Y = np.meshgrid(np.arange(-5, 5), np.arange(-5, 5))
# Create a 3D axis; passing projection to plt.gca() no longer works in recent Matplotlib
ax = plt.figure().add_subplot(projection='3d')
ax.plot_surface(X, Y, X ** 2 + Y ** 2)

## Estimating gradients

* Goal: figure out how $f(x)$ changes when $x$ changes

In [None]:
def difference_quotient(f, x, delta):
    return (f(x + delta) - f(x)) / delta

In [None]:
plt.plot(xs, f(xs))
plt.plot(xs, difference_quotient(f, xs, 0.001), 'x')

In [None]:
difference_quotient(f, 0, 0.001)

In [None]:
# Plot actual and estimated derivative of f(x) = x^2

def square(x):
    return x * x

def derivative(x):
    return 2 * x

def derivative_estimate(x):
    return difference_quotient(square, x, delta=0.00001)

xs = np.arange(-10,10)
plt.plot(xs, square(xs))
plt.plot(xs, derivative(xs), 'rx')           # red  x
plt.plot(xs, derivative_estimate(xs), 'b+')  # blue +
plt.show() # purple *, hopefully

## Differentiation

$f(x) = x^2$

$f'(x) = 2x$

* We are looking for how much $f(x)$ changes when we change $x$ by a small amount $\delta$
  * Remember: `(f(x + delta) - f(x)) / delta`

${df \over dx} = f'(x)$
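
To see why a small `delta` matters, here is a small sketch that reuses the `difference_quotient` idea from above on $f(x) = x^2$ at one point, with a few arbitrarily chosen values of `delta`:

```python
def difference_quotient(f, x, delta):
    return (f(x + delta) - f(x)) / delta

def square(x):
    return x * x

x = 3.0  # the exact derivative here is 2 * x = 6
for delta in [1.0, 0.1, 0.01, 0.001]:
    estimate = difference_quotient(square, x, delta)
    print(f"delta = {delta:<6} estimate = {estimate:.4f}")
```

For $x^2$ the quotient works out to exactly $2x + \delta$, so the estimate approaches $f'(x) = 2x$ as `delta` shrinks.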

## Linear models

$\alpha x_1 + \beta x_2 + c$ 

... But how do we take the derivative with respect to $x_1$ **and** $x_2$?


* Answer: By differentiating with respect to one variable at a time

## Gradients and partial derivatives

$f(x, y) = x^2 + xy + y^2 = z$


![](https://upload.wikimedia.org/wikipedia/commons/2/2d/Partial_func_eg.svg)

With $y$ fixed at 1, $z$ depends only on $x$:
![](https://upload.wikimedia.org/wikipedia/commons/f/fe/X2%2BX%2B1.svg)

If we hold $y$ fixed, our function $f(x, y) = x^2 + xy + y^2$ becomes a function of $x$ alone: $f_y(x) = x^2 + xy + y^2$

In Python this is the same as

In [None]:
def fy(y):
    def fx(x): 
        return x ** 2 + x * y + y ** 2
    return fx

In [None]:
fy(1)

Since $y$ is now a constant, the derivative of 

$f_y(x) = x^2 + xy + y^2$ 
is

$f_y'(x) = 2x + y$

In math, this is written

${\partial f \over \partial x}(x, y) = 2x + y$

## Partial derivative

$\frac{df_{a_1,\ldots,a_{i-1},a_{i+1},\ldots,a_n}}{dx_i}(a_i) = \frac{\partial f}{\partial x_i}(a_1,\ldots,a_n).$

Taking the derivative of a function with many inputs does not give you just *one* derivative, but *a list* of **partial derivatives**, one per input.
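
The same difference quotient trick works here, as long as we nudge only one input at a time. This is a sketch for the example function $x^2 + xy + y^2$ from above; the names `surface`, `partial_x` and `partial_y` are made up for this illustration.

```python
def surface(x, y):
    return x ** 2 + x * y + y ** 2

def partial_x(f, x, y, delta=1e-6):
    """Nudge x only; y is held constant."""
    return (f(x + delta, y) - f(x, y)) / delta

def partial_y(f, x, y, delta=1e-6):
    """Nudge y only; x is held constant."""
    return (f(x, y + delta) - f(x, y)) / delta

x, y = 1.0, 2.0
print(partial_x(surface, x, y), "vs exact", 2 * x + y)
print(partial_y(surface, x, y), "vs exact", x + 2 * y)
```

The estimates should land very close to the exact values $2x + y$ and $x + 2y$.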

## Gradients

$\nabla f(a) = \left(\frac{\partial f}{\partial x_1}(a), \ldots, \frac{\partial f}{\partial x_n}(a)\right)$

At a point $a$, this gives us a list of partial derivatives that tells us how steeply the function rises or falls along each dimension at exactly that point.

This is called a **gradient**.
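
A minimal sketch of estimating a gradient numerically, assuming the function takes its inputs as a list. `estimate_gradient` and `surface` are names invented for this example.

```python
def estimate_gradient(f, point, delta=1e-6):
    """Return one partial difference quotient per coordinate of `point`."""
    gradient = []
    for i in range(len(point)):
        nudged = list(point)
        nudged[i] += delta  # change only coordinate i
        gradient.append((f(nudged) - f(point)) / delta)
    return gradient

def surface(p):
    x, y = p
    return x ** 2 + x * y + y ** 2

# The exact gradient at (1, 2) is (2x + y, x + 2y) = (4, 5)
print(estimate_gradient(surface, [1.0, 2.0]))
```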


## Gradient descent

Forest analogy: You're lost on a mountain and have to find your way down

![](https://upload.wikimedia.org/wikipedia/commons/c/c7/Okanogan-Wenatchee_National_Forest%2C_morning_fog_shrouds_trees_%2837171636495%29.jpg)

Math analogy: You have a function, and the only thing you know is the *direction* it moves right where you stand

With several inputs, that *direction* has several components. So we don't just need the derivative of the function in one dimension, we need it in every dimension! 

We need the gradients! That's why it's called gradient descent.

![](https://upload.wikimedia.org/wikipedia/commons/f/ff/Gradient_descent.svg)

## Gradient descent: summary

* We have a loss function for our models
  * We like our loss to be **small**
* That loss function can be in multiple dimensions
  * It's **hard** to predict where the loss function is small
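* Gradients give us an idea of the *direction* in which the function is growing
  * Going the opposite way is good because it means a smaller loss

* Gradient descent steps in the direction opposite to the gradient, where the loss decreases
  * Until you reach a point where the loss stops getting smaller (a minimum, possibly only a local one)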
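
The summary above can be sketched in a few lines. This is a bare-bones illustration on $f(x, y) = x^2 + xy + y^2$, not how `sklearn` implements it; the step size, the number of steps and the starting point are arbitrary choices.

```python
def gradient(p):
    """Exact gradient of f(x, y) = x^2 + x*y + y^2."""
    x, y = p
    return [2 * x + y, x + 2 * y]

def gradient_descent(gradient, start, step_size=0.1, steps=100):
    point = list(start)
    for _ in range(steps):
        grad = gradient(point)
        # Step against the gradient: that is the downhill direction
        point = [p - step_size * g for p, g in zip(point, grad)]
    return point

# The function has its minimum at (0, 0)
print(gradient_descent(gradient, start=[3.0, -4.0]))
```

Each step moves a little downhill, so the point creeps towards $(0, 0)$. A step size that is too large would make it jump past the minimum instead.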

## Gradient descent in `sklearn`

In [None]:
from sklearn.linear_model import SGDRegressor
SGDRegressor?

## Gradient descent example with science spending data

In [None]:
import pandas as pd
data = pd.read_csv("science.csv")
data

In [None]:
xs = np.array(data['US science spending']).reshape(-1, 1)
ys = np.array(data['Suicides'])  # targets stay 1-dimensional, as SGDRegressor expects

In [None]:
data.plot.scatter(x = 1, y = 2)

In [None]:
model = SGDRegressor()
model.fit(xs, ys)

In [None]:
model.predict([[100], [10000]])

In [None]:
data.plot.scatter(x = 1, y = 2)
plt.plot(xs, model.predict(xs))

## Scaling data

Gradient descent takes steps proportional to the gradient of the loss, and that gradient grows with the size of the values in x.
If that gradient is very large, the steps we take are laaaarge and can overshoot the minimum.

What can we do to fix that?
Scale the data to be smaller!
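
As a rough idea of what standardising does (and, as far as the defaults go, what `scale` computes): subtract the mean and divide by the standard deviation. The sketch below uses plain Python and made-up spending figures; `standardise` is a name invented for this example.

```python
import statistics

def standardise(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up spending figures, for illustration only
spending = [18000.0, 20000.0, 22000.0, 24000.0, 26000.0]
scaled = standardise(spending)

print(scaled)
print(statistics.mean(scaled), statistics.pstdev(scaled))
```

After scaling, the values have mean 0 and standard deviation 1, so the gradients stay at a manageable size.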

In [None]:
from sklearn.preprocessing import scale

In [None]:
xs_scaled = scale(xs)
xs_scaled

In [None]:
model = SGDRegressor()
model.fit(xs_scaled, ys)

In [None]:
plt.plot(xs_scaled, ys)
plt.plot(xs_scaled, model.predict(xs_scaled))

## Why don't we get any good results?!

To get the optimal solution, we have to take many steps towards the correct solution. 

Challenge:

* How many steps is the model taking now? (read the documentation)
* Can you make it take 100 steps? 1000 steps? 10000 steps?
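
To get a feeling for why the number of steps matters, here is a hand-rolled sketch that fits a line with plain (full-batch) gradient descent on the mean squared error. Note that this is *not* the stochastic variant that `SGDRegressor` uses, and the data is synthetic: `toy_xs`, `toy_ys` and `fit_line` are made up for this example.

```python
def fit_line(xs, ys, steps, step_size=0.1):
    """Fit y = a * x + b by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        a -= step_size * grad_a
        b -= step_size * grad_b
    return a, b

# Synthetic data lying exactly on y = 2x + 1
toy_xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
toy_ys = [2 * x + 1 for x in toy_xs]

for steps in [1, 10, 100]:
    a, b = fit_line(toy_xs, toy_ys, steps)
    print(f"{steps:>4} steps: a = {a:.4f}, b = {b:.4f}")
```

With only a handful of steps the line is still far from $y = 2x + 1$; with more steps it gets closer and closer.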

## Recap

* Derivatives
  * The **rate of change** of a function $f$
  * Normally defined at points like $x$: $f'(x)$
* Partial derivatives
  * Derivatives in multiple dimensions

* Gradient
  * A vector of derivatives, one for each dimension in a function $f$
* Gradient descent
  * A way to optimise a function by moving towards an optimum
  * For instance minimising a loss function

* Scaling data
  * Standardising data into a smaller space