<br>
<div style="font-size:87px; font-weight:bold; text-align: center;"> Gradient Descent! </div>
<br>

`whoami`

`stu`  
Machine Learning Engineer @Opendoor  
@mstewart141  


# Initial goal: Puzzle through the gradient descent algorithm to give us a better understanding of how neural nets work

Per Wikipedia:
> Gradient descent is a first-order iterative optimization algorithm for finding the __minimum__ of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point

## Ok, but what is a gradient?

Per Khan Academy:
> The gradient stores all the partial derivative information of a multivariable function.
  
The gradient is a vector-valued function: a vector of partial derivatives.
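To make "a vector of partial derivatives" concrete, here is a minimal sketch using `sympy`. The function `f(x, y) = x**2 + 3*y**2` and the evaluation point `(1, 2)` are assumptions chosen only for illustration.

```python
import sympy

# Hypothetical example function, not from the talk: f(x, y) = x**2 + 3*y**2
x, y = sympy.symbols("x y")
f = x**2 + 3 * y**2

# The gradient collects one partial derivative per input variable.
gradient = [sympy.diff(f, var) for var in (x, y)]

# Evaluating the gradient at a point gives an ordinary vector of numbers.
gradient_at_point = [g.subs({x: 1, y: 2}) for g in gradient]

print(gradient)           # symbolic partial derivatives
print(gradient_at_point)  # numeric values at (1, 2)
```

Each entry says how fast `f` changes if you nudge that one variable and hold the others fixed.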

![title](gd.png)
source: [Wikipedia](https://en.wikipedia.org/wiki/Gradient_descent)

In [None]:
%matplotlib inline

import numpy as np
import sympy

from numpy.linalg import inv
from scipy.special import expit
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.datasets import make_classification
from sympy import diff
from sympy.solvers import solve
from sympy.plotting import plot
from toolz import compose, pipe

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

RS = 47
np.random.seed(RS)  # call the seeding function; assigning to it would overwrite it

## Time out! (_Python familiarity check_)

## Any initial questions? _(Please ask questions!!)_

## Let's look at an example

In [None]:
#plot

In [None]:
#diff

In [None]:
#gradient

In [None]:
#solve

### You told us to "step proportional to the negative of the gradient"?

What happens to the step if `x > 0`? What if `x < 0`?
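A small sketch of why the sign matters, assuming the illustrative function `f(x) = x**2` (derivative `2*x`) and a hypothetical learning rate of `0.1`. Neither choice comes from the talk.

```python
def f_prime(x):
    """Derivative of the illustrative function f(x) = x**2."""
    return 2 * x


def step(x, learning_rate=0.1):
    """Move against the gradient: subtract a fraction of the slope."""
    return x - learning_rate * f_prime(x)


# Right of the minimum the slope is positive, so the step moves left.
right = step(3.0)
# Left of the minimum the slope is negative, so the step moves right.
left = step(-3.0)

# Repeating the step walks x toward the minimum at 0.
x = 3.0
for _ in range(100):
    x = step(x)

print(right, left, x)
```

Either way, subtracting the gradient pushes `x` downhill, which is why the same update rule works on both sides of the minimum.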

# Now, we'll do gradient descent live

## Starting with linear regression

In [None]:
X, Y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2, 
                           random_state=RS)

#### We need to introduce a `bias` term:

In [None]:
#bias_column

#X
#Y
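One possible way to build the bias column, sketched on a tiny stand-in array rather than the real `X`. The names `X_demo` and `X_with_bias` are hypothetical.

```python
import numpy as np

# Stand-in feature matrix: 2 samples, 3 features (assumed values).
X_demo = np.array([[0.5, -1.2, 3.0],
                   [1.5, 0.3, -0.7]])

# A column of ones lets the first coefficient act as the intercept.
bias_column = np.ones((X_demo.shape[0], 1))

# Prepend the bias column so the matrix grows from n to n + 1 columns.
X_with_bias = np.hstack([bias_column, X_demo])

print(X_with_bias.shape)
```

The same two lines should work on the `X` from `make_classification`, turning its 3 columns into 4.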

### Aside: do we even need gradient descent?

# $$X_{m \times n}, Y_{m \times 1}$$

### Meet the `'normal'` equation:

# $$\begin{equation}Y_{m \times 1} = X_{m \times n}\space\beta_{n \times 1} + \epsilon_{m \times 1}\end{equation}$$
# $$\begin{equation}\beta_{n \times 1} = (X^{T}X)^{-1}_{n \times n}\space X^{T}_{n \times m}\space Y_{m \times 1}\end{equation}$$

#### This gives us a way to compute our `beta` vector:

In [None]:
#betas_normal_eq
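A hedged sketch of the normal equation in NumPy. It uses synthetic data (`X_demo`, `Y_demo`, `true_betas` are assumptions made up for this example) so that the recovered coefficients can be checked against known values.

```python
import numpy as np

rng = np.random.default_rng(47)
m = 200

# Synthetic design matrix: a bias column of ones plus 3 random features.
features = rng.normal(size=(m, 3))
X_demo = np.hstack([np.ones((m, 1)), features])

# Assumed "true" coefficients, used only to generate the targets.
true_betas = np.array([1.0, 2.0, -3.0, 0.5])
Y_demo = X_demo @ true_betas + 0.01 * rng.normal(size=m)

# Normal equation, written exactly as in the formula: (X^T X)^-1 X^T Y
betas_demo = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ Y_demo

# Same solution without forming an explicit inverse.
betas_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ Y_demo)

print(betas_demo)
```

The explicit inverse mirrors the equation; `np.linalg.solve` is generally the more numerically stable way to get the same answer.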

## Sanity check: Scikit-learn

In [None]:
#linreg

In [None]:
#betas_sklearn

In [None]:
betas_normal_eq
betas_sklearn

## But where are the gradients???

### Spoiler: linear and logistic regression aren't so different to optimize

#### To implement gradient descent for linear regression, we will use the identity function.  
  
#### For logistic regression, we will use the `sigmoid` function:

# $$\begin{equation} F(z) = z \end{equation}$$

# $$\begin{equation} F(z) = \dfrac{1}{1+e^{-z}} \end{equation}$$

### These two functions, let's write 'em up

In [None]:
#identity

#sigmoid

#sigmoid
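A minimal sketch of the two functions, written directly from the formulas for $F(z)$. `scipy.special.expit` computes the same sigmoid and could be used instead.

```python
import numpy as np


def identity(z):
    """F(z) = z: returns its input unchanged (linear regression)."""
    return z


def sigmoid(z):
    """F(z) = 1 / (1 + e^-z): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


z = np.array([-2.0, 0.0, 2.0])
print(identity(z))
print(sigmoid(z))
```

The sigmoid output can be read as a probability, which is what makes it suitable for a two-class problem.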

### Why do these $F(z)$ equations matter? What is our hypothesis?

#### Our two hypotheses will be identical, except for the aforementioned functions!

#### Recall the normal equation?

In [None]:
#hypothesis
#hypothesis_shape

#hypothesis_shape
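One way the shared hypothesis could look: apply $F$ to the product $X\beta$. The signature `hypothesis(X, betas, activation)` and the demo values are assumptions, not necessarily what gets typed live.

```python
import numpy as np


def sigmoid(z):
    """F(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))


def hypothesis(X, betas, activation):
    """Predictions are F(X @ betas); only the choice of F differs."""
    return activation(X @ betas)


# Tiny stand-in data: first column is the bias, second is one feature.
X_demo = np.array([[1.0, 2.0],
                   [1.0, -1.0],
                   [1.0, 0.0]])
betas_demo = np.array([0.5, 1.0])

linear_predictions = hypothesis(X_demo, betas_demo, lambda z: z)
logistic_predictions = hypothesis(X_demo, betas_demo, sigmoid)

print(linear_predictions.shape, logistic_predictions.shape)
```

Both calls return one prediction per row of `X_demo`; the logistic version just maps those values into the interval (0, 1).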

## Gradient Descent!

> Gradient descent is a first-order iterative optimization algorithm

We must define an update step that moves us closer to the solution each iteration.

In [None]:
#update_step

#gradient_descent
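A sketch of one possible update step and loop, under stated assumptions. The gradient used here, $\frac{1}{m} X^{T}(F(X\beta) - Y)$, is the gradient of half the mean squared error when $F$ is the identity and of the mean log loss when $F$ is the sigmoid. The learning rate, iteration count, zero initialization, and synthetic data are all hypothetical choices.

```python
import numpy as np


def identity(z):
    return z


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def update_step(X, Y, betas, activation, learning_rate):
    """Take one step against the gradient (1/m) * X^T (F(X betas) - Y)."""
    m = X.shape[0]
    errors = activation(X @ betas) - Y
    gradient = X.T @ errors / m
    return betas - learning_rate * gradient


def gradient_descent(X, Y, activation, learning_rate=0.1, n_iterations=5000):
    """Start from zeros and apply the update step a fixed number of times."""
    betas = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        betas = update_step(X, Y, betas, activation, learning_rate)
    return betas


# Synthetic linear data (assumed) so the result can be checked.
rng = np.random.default_rng(47)
m = 500
X_demo = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 3))])
true_betas = np.array([1.0, 2.0, -3.0, 0.5])
Y_demo = X_demo @ true_betas + 0.1 * rng.normal(size=m)

betas_gd_demo = gradient_descent(X_demo, Y_demo, identity)
betas_closed_form = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ Y_demo)

print(betas_gd_demo)
print(betas_closed_form)
```

Passing `sigmoid` instead of `identity`, with 0/1 labels as `Y`, turns the same loop into logistic regression. A fixed iteration count keeps the sketch short; a convergence check on the gradient norm would be a natural refinement.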

### Does it work for linear regression?

In [None]:
#betas_gd

In [None]:
betas_gd
betas_normal_eq
betas_sklearn

### Great! How about for logistic regression?

### What saith Scikit?

In [None]:
#logr

In [None]:
#betas_logr
#betas_gdl

In [None]:
betas_logr
betas_gdl

# OK, now your turn!

<br>
<div style="font-size:80px; font-weight:bold; text-align: center;"> Questions? </div>
<br>
  
  
_@[mstewart141](https://twitter.com/mstewart141) [twitter, github, linkedin]_

# Now, what could go wrong with the above plan?

## How can we make the algorithm faster?

## Can someone summarize _why_ that was faster?

## Even if it were _slower_, why might it still be useful?