# P-ai AI/ML Workshop: Session 3

Welcome to P-ai's third session of the AI/ML workshop series! Today we'll learn about
- Gradient descent
- A host of new machine learning algorithms to try out
- How to use Git and GitHub

<img src="https://images.squarespace-cdn.com/content/5d5aca05ce74150001a5af3e/1580018583262-NKE94RECI46GRULKS152/Screen+Shot+2019-12-05+at+11.18.53+AM.png?content-type=image%2Fpng" width="200px">

## 0. Session 2 Review

Here are some key points from last week's session:
- Supervised learning, unsupervised learning, reinforcement learning
    - Supervised: have labeled data, want model to predict $y$ from $X$ (e.g. predict house price from picture)
    - Unsupervised: don't have labeled data, want to learn patterns directly in data (e.g. clustering)
    - Reinforcement: want to train agent to perform actions in an environment that optimize some reward (e.g. play Mario)
- Linear regression
    - Used to predict a value from ≥1 input variable(s)
    - Expects linear relationship between input and output
- Logistic regression
    - Used to predict a binary class (0 or 1) from ≥1 input variable(s)
    - Finds linear decision boundary to separate data from different classes

## 1. Gradient descent
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/28/Chocolate_Hills_overview.JPG/1200px-Chocolate_Hills_overview.JPG" width="500px">

What's gradient descent, and why are we talking about it now? This is actually one answer to a question that was posed in the previous two workshops, but which we didn't have time to answer in detail:

> *How* do machine learning models find the optimal parameters?

There's an asterisk here, because not all machine learning algorithms use gradient descent, like decision trees (which we'll cover later today). That being said, it's pretty pervasive in machine learning and absolutely crucial in deep learning. Let's get started!

Imagine you're blind, and you're standing on the side of a hill. You want to go downward, so how do you know which way to go? You can feel with your feet *which way the ground slopes*, and then go in the opposite direction of the upward slope. You then *walk some distance* in that direction, and *repeat the process*. That's pretty much it! Gradient descent is a fairly intuitive process.

> So what do the hills represent?

The hills represent **loss** (reminder: loss is a synonym for cost; the thing you want to minimize). If we're sticking with three dimensions for the sake of visualizing, that means that if you pick two coordinates (say, $x$ and $y$), there's a height (loss) associated with those coordinates. Also recall that in the case of 2D linear regression, we want to find a **weight** and a **bias**; let's use $w$ and $b$ as our variables. Also, we'll call the loss $J$ (it's just convention). That means that we want to find:
$$
\min_{w, b} J(w, b)
$$

**\[WARNING\] CALCULUS INCOMING**  
(If you're not comfortable with calculus, don't worry about this next part, but if you want to get as concrete as possible, this is for you.)

Mathematically, slope can be found by taking the *derivative* of a function. When a function has several inputs, you can take the partial derivative with respect to each one; collecting those partial derivatives into a vector gives the *gradient*. The gradient of a function is a *vector* that points in the **direction of steepest ascent**. So, to go *down* the hill, we find the gradient of the loss function with respect to the weights, take the *negative of the gradient* (the direction of steepest descent), and update the weights in that direction.

Here's that same statement, written in more math-y terms:
$$
w \leftarrow w - \alpha \nabla J
$$

Here $\leftarrow$ means "set equal to", $\nabla$ is the symbol for taking the gradient, and alpha ($\alpha$) is what's called the **learning rate**. The learning rate decides *how much* to update the weight based on the gradient. Let's try to visualize this a little better; here's an example where the loss ($J$) is a function of only one variable:

<img src="https://miro.medium.com/max/1200/1*iNPHcCxIvcm7RwkRaMTx1g.jpeg" width="400px">
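To make the update rule concrete, here's a minimal sketch in plain Python. The loss $J(w) = (w - 3)^2$, the starting point, and the learning rate are all made-up values for illustration; nothing here comes from real data.

```python
def gradient_descent_1d(grad, w0, alpha=0.1, steps=50):
    """Repeatedly apply w <- w - alpha * grad(w) and record every point visited."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - alpha * grad(w)  # step against the slope
        history.append(w)
    return history


# Hypothetical loss J(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
path = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0)
print("start:", path[0], "after one step:", path[1], "end:", path[-1])
```

Each step moves `w` a little closer to 3, where the slope is zero, so the updates shrink as we approach the bottom.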

Okay! Enough of that for now, let's summarize what all of this was about:

- Loss is a function of the model parameters, like weights and biases
    - That is, there are certain combinations of parameters that minimize loss $\approx$ maximize "learning"
- Gradient is like slope, but it has a direction (always points in the direction of steepest ascent)
- To find a minimum (global, hopefully, but not necessarily), we "descend" the "gradient"
- At each step, the model makes a prediction and uses the difference between the predicted and actual value to estimate the gradient
- The weights are then updated accordingly to incrementally decrease the loss, and the process repeats
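The summary above can be sketched as a short training loop for the $\hat{y} = wx + b$ case. This is only one possible version: it assumes NumPy, mean squared error as the loss, and synthetic data generated from a made-up line ($w = 2$, $b = 1$); the learning rate and step count are arbitrary choices.

```python
import numpy as np

# Synthetic data from a made-up line y = 2x + 1 plus a little noise (assumption).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, size=100)

w, b = 0.0, 0.0   # initial guesses for weight and bias
alpha = 0.5       # learning rate (arbitrary choice)

for step in range(2000):
    y_hat = w * x + b                 # 1. make a prediction
    error = y_hat - y                 # 2. compare with the actual values
    grad_w = 2 * np.mean(error * x)   # 3. gradient of MSE with respect to w
    grad_b = 2 * np.mean(error)       #    gradient of MSE with respect to b
    w -= alpha * grad_w               # 4. step against the gradient
    b -= alpha * grad_b

print(f"learned w = {w:.3f}, b = {b:.3f}")
```

The learned values should land close to the line used to generate the data, though not exactly on it because of the noise.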

## Taking a deeper look at gradient descent

### The simplified case

A couple of sessions ago we talked about a way to quantify the error in our predictions: the cost function. We opted to use mean squared error (MSE), defined to be:

$$\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

We mentioned that MSE is useful because squaring the error gives a loss that is differentiable everywhere. Now we'll see why that matters!
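As a quick illustration, here's how that formula might look in NumPy. The function name `mse_loss` and the sample numbers are made up for this sketch.

```python
import numpy as np


def mse_loss(y_hat, y):
    """Mean squared error: the average of the squared prediction errors."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)


# Made-up predictions and targets, just to exercise the function.
mse = mse_loss([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
print(mse)
```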

Let's look at the simple 2D case first. We'll call back to the regression case of $\hat{y} = wx + b$. Remember, we are looking for some $w$ and some $b$ that minimize our error. Let's imagine our error can be quantified as some quadratic function, which we'll approximate as $J(\tilde{Y}) = \tilde{Y}^2$. Note that this is an unrealistic example. When we perform machine learning, our error usually cannot be quantified by a simple function, and even if it could, we wouldn't know it in advance. However, let's just play with this for a second.

This is basically a parabolic function. Visually, it looks like this:

<img src = https://miro.medium.com/max/720/1*wOcqaaLlNo7X56PJ-lYFqQ.png>

We are trying to make it to the minimum. How do we find that point with the tools we have? By computing the derivative:

$$\frac{d}{d\tilde{Y}} J(\tilde{Y})$$

This basically defines the steepness of the parabola at some arbitrary point. How do we find the minimum? Notice that at the base of the parabola, where the error is lowest, the slope is actually zero! So we can take:

$$
\frac{d}{d\tilde{Y}} J(\tilde{Y}) = \frac{d}{d\tilde{Y}} \tilde{Y}^2 = 2\tilde{Y} = 0
$$

Solving this gives $\tilde{Y} = 0$, so for a parabola with its vertex at the point (0, 0), the point of minimum error is (0, 0).
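If you'd like to sanity-check that derivative without doing the calculus, a central finite difference gives a numerical estimate of the slope. This is just a sketch; the step size `h` and the sample points are arbitrary choices.

```python
def J(y):
    """The toy error function from above: J(y) = y^2."""
    return y ** 2


def numerical_derivative(f, y, h=1e-6):
    """Estimate the slope of f at y using a central difference."""
    return (f(y + h) - f(y - h)) / (2 * h)


# Estimated slopes at a few arbitrary points; the calculus says these should be 2y.
slopes = {y: numerical_derivative(J, y) for y in (-2.0, 0.0, 2.0)}
for y, slope in slopes.items():
    print(f"slope at {y:+.1f} is approximately {slope:+.4f}")
```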

Let's think about this from the gradient descent point of view. How would we be able to arrive at (0, 0) by going through a bunch of trials? Consider this parabola shape again. Let's hypothesize that our error is in this parabolic shape, but we DON'T KNOW IT! 

#### Discussion Question
What can we do to figure it out? What is the derivative telling us?


### Concerning Problems
However, life is not so simple... Let's look at the following 2D error function:

<img src= images/false_error.png>

Note that there are two local minima in this plot... Why might this be troubling?
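One way to explore this yourself is to run gradient descent from two different starting points and compare where each run ends up. The function below, $f(x) = x^4 - 3x^2 + x$, is a hypothetical stand-in with two dips; it is not the function in the plot, and the learning rate and step count are arbitrary.

```python
def f(x):
    """Hypothetical error curve with two dips (not the one in the figure)."""
    return x ** 4 - 3 * x ** 2 + x


def f_prime(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(x, alpha=0.01, steps=500):
    """Plain gradient descent starting from x."""
    for _ in range(steps):
        x = x - alpha * f_prime(x)
    return x


left = descend(-2.0)   # start on the left side of the curve
right = descend(2.0)   # start on the right side of the curve
print(f"from -2.0: x = {left:.3f}, error = {f(left):.3f}")
print(f"from +2.0: x = {right:.3f}, error = {f(right):.3f}")
```

Compare the two final errors: the algorithm and the settings are identical, and only the starting point differs.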

#### Getting Unstuck: Some Examples

- Stochastic Gradient Descent (SGD)
    - Injects some randomness
- Momentum
    - Aggregates our gradient: analogous to physical concept of momentum

We'll talk about these later.
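As a small preview, here is one common way these two ideas are written, combined in a single loop. Treat it as a sketch under assumptions: the data is synthetic (a made-up line with $w = 2$, $b = 1$), and `alpha`, `beta`, and `batch_size` are illustrative names and values rather than recommended settings.

```python
import numpy as np

# Synthetic data from a made-up line y = 2x + 1 plus noise (assumption).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.05, size=200)

w, b = 0.0, 0.0
v_w, v_b = 0.0, 0.0                       # running "velocity" for each parameter
alpha, beta, batch_size = 0.02, 0.9, 16   # illustrative settings

for step in range(2000):
    # Stochastic part: estimate the gradient from a random mini-batch only.
    idx = rng.choice(len(x), size=batch_size, replace=False)
    error = (w * x[idx] + b) - y[idx]
    grad_w = 2 * np.mean(error * x[idx])
    grad_b = 2 * np.mean(error)

    # Momentum part: blend the new gradient into the accumulated velocity.
    v_w = beta * v_w + grad_w
    v_b = beta * v_b + grad_b

    w -= alpha * v_w
    b -= alpha * v_b

print(f"learned w = {w:.3f}, b = {b:.3f}")
```

Because each step sees a different random batch, the path toward the minimum is noisier than with the full dataset.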

### The Learning Rate Question

How quickly we move down the curve is dictated by a parameter we can set called the learning rate. Let's think about two cases:
- Learning rate is too big
- Learning rate is too small

<img src = https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png width = 800px>
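You can try both cases yourself on the toy loss $J(w) = w^2$, whose derivative is $2w$. The three learning rates below are arbitrary values picked to show the different behaviors.

```python
def run(alpha, w0=1.0, steps=20):
    """Gradient descent on J(w) = w^2 (derivative 2w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w


# Arbitrary learning rates chosen to illustrate each regime.
final = {
    "too small": run(alpha=0.001),
    "reasonable": run(alpha=0.4),
    "too large": run(alpha=1.1),
}
for name, w in final.items():
    print(f"{name:>10}: w after 20 steps = {w:.6g}")
```

With the tiny rate, `w` has barely moved from its starting value of 1; with the large rate, every step overshoots the minimum and `w` ends up farther away than where it started.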

- You can also set an adaptive learning rate that changes based on the iteration number
- This can have its own problems. Why is that?
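One simple schedule of this kind shrinks the learning rate as the iteration number grows. The formula and the constants `alpha0` and `decay` below are just one illustrative choice among many.

```python
def decayed_learning_rate(step, alpha0=0.5, decay=0.1):
    """Learning rate that shrinks as the iteration number grows (illustrative schedule)."""
    return alpha0 / (1 + decay * step)


rates = [decayed_learning_rate(t) for t in (0, 10, 100)]
print(rates)
```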

### The 3D Case

- It's a similar story, but our mathematical framework must change slightly
- We now need to figure out the slope in 3-D... or, the gradient $\nabla$

Think of it as replacing the 2-D slope with a vector that describes the direction and rate at which you are headed down the hill.

This is an increase in complexity! Wandering to the correct minimum might be a bit harder now... why is that?

<img src = https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/3d-gradient-cos.svg/525px-3d-gradient-cos.svg.png width = 400px>
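Here's a minimal sketch of the same update with two parameters, where the gradient is a vector with one entry per parameter. The bowl-shaped loss $J(w, b) = w^2 + 3b^2$, the starting point, and the learning rate are all made up for illustration.

```python
import numpy as np


def J(params):
    """Hypothetical bowl-shaped loss J(w, b) = w^2 + 3 * b^2."""
    w, b = params
    return w ** 2 + 3 * b ** 2


def grad_J(params):
    """Gradient of J: one partial derivative per parameter."""
    w, b = params
    return np.array([2 * w, 6 * b])


params = np.array([4.0, -2.0])   # arbitrary starting point (w, b)
alpha = 0.1                      # arbitrary learning rate

for step in range(100):
    params = params - alpha * grad_J(params)   # same rule, now with vectors

print("final (w, b):", params, "loss:", J(params))
```

The update line is identical to the one-variable case; the only change is that the slope is now a vector, so each parameter gets its own step.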

We will encounter the same problem as the 2-D case... misleading local minima:

<img src="https://www.fromthegenesis.com/wp-content/uploads/2018/06/Gradie_Desce.jpg">

### Getting out: Stochastic Gradient Descent

