## Gradient Descent Code-Along

Let's walk through how gradient descent works using code.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Set a random seed.
np.random.seed(42)

In [3]:
# Randomly generate data from a Poisson(45) distribution.
temp = np.random.poisson(45, 100)

In [4]:
# View array.
temp

array([42, 50, 37, 47, 52, 38, 41, 44, 47, 41, 44, 38, 47, 47, 41, 49, 36,
       40, 41, 46, 58, 47, 34, 29, 43, 52, 40, 37, 51, 49, 51, 42, 53, 42,
       41, 50, 55, 36, 50, 51, 45, 41, 56, 43, 39, 41, 57, 48, 52, 55, 41,
       39, 43, 36, 59, 45, 63, 45, 40, 47, 30, 56, 37, 48, 39, 42, 48, 34,
       41, 49, 45, 48, 49, 58, 42, 40, 52, 46, 55, 42, 48, 47, 35, 46, 48,
       49, 41, 48, 48, 34, 40, 55, 51, 46, 38, 40, 48, 56, 44, 41])

In [5]:
# Calculate mean and sample variance of array.
print(np.mean(temp))
print(np.var(temp, ddof = 1))

45.18
45.07838383838384


**Ohio State Fun Facts:**
1. Ohio Stadium can seat 102,082 people. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Ohio_Stadium).)
2. Ohio Stadium's record attendance is 110,045 people. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Ohio_Stadium).)
3. Ohio State is better than Michigan. (Source: It's just a fact.)
4. Ohio State students enjoy soda. (Source: first-hand knowledge.)

In [6]:
# sodas ~ Normal(mean = 200000 + 1000 * temp, standard deviation = 20000)
sodas_sold = 200000 + 1000*temp + np.round(np.random.normal(0,20000, 100))

In [7]:
sodas_sold

array([233070., 267128., 241282., 222085., 255464., 245706., 223323.,
       247075., 248164., 218141., 251156., 249216., 268661., 268076.,
       213447., 230243., 246301., 250276., 251301., 323055., 269418.,
       269711., 253080., 242028., 236695., 267179., 224543., 232264.,
       241293., 250637., 297293., 204655., 266725., 209746., 231561.,
       271779., 256286., 214445., 235694., 264592., 230393., 245329.,
       256911., 229968., 281879., 253678., 216497., 251729., 238764.,
       272049., 225150., 236705., 253100., 253315., 234994., 238310.,
       253501., 231933., 275309., 255100., 204782., 274357., 279443.,
       268649., 208613., 232315., 273338., 219847., 249876., 264493.,
       226461., 246809., 184175., 237512., 236949., 215044., 284648.,
       217397., 246199., 244615., 276825., 218283., 258263., 246205.,
       228370., 258242., 244981., 235996., 249396., 226294., 242270.,
       268243., 282720., 221244., 280661., 200958., 244964., 267766.,
       249620., 2285

$$ \text{sodas_sold}_i = 200000 + 1000 * \text{temp}_i + \varepsilon_i $$

In [8]:
# Create dataframe with temp and sodas_sold.
df = pd.DataFrame({'temp':temp,
                  'sodas':sodas_sold})

In [9]:
# Check the first five rows.
df.head()

|   | temp | sodas |
|---|------|-------|
| 0 | 42 | 233070.0 |
| 1 | 50 | 267128.0 |
| 2 | 37 | 241282.0 |
| 3 | 47 | 222085.0 |
| 4 | 52 | 255464.0 |


#### Our goal is to fit a model here.
- You and I know that our $y$-intercept $\beta_0$ is 200,000.
- You and I know that our slope $\beta_1$ is 1,000.
- However, our computer does not know that. Our computer has to estimate $\hat{\beta}_0$ and $\hat{\beta}_1$ from the data.
    - We might say that our **machine** has to... **learn**.
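
As a point of reference for what the machine should learn, here is a minimal sketch that fits both coefficients with ordinary least squares via `np.polyfit`. This is not gradient descent, and using `np.polyfit` is an assumption of this sketch rather than part of the walkthrough; it only gives a benchmark to compare against. The data are regenerated with the same seed and calls as above.

```python
import numpy as np

# Regenerate the data exactly as in the cells above.
np.random.seed(42)
temp = np.random.poisson(45, 100)
sodas_sold = 200000 + 1000 * temp + np.round(np.random.normal(0, 20000, 100))

# For deg=1, np.polyfit returns the coefficients as [slope, intercept].
slope_hat, intercept_hat = np.polyfit(temp, sodas_sold, deg=1)

print(f"Estimated slope (beta_1 hat):     {slope_hat:.2f}")
print(f"Estimated intercept (beta_0 hat): {intercept_hat:.2f}")
```

Because the noise has a standard deviation of 20,000, these estimates will not land exactly on 1,000 and 200,000.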

#### Our workflow:
1. Instantiate model.
2. Select a learning rate $\alpha$.
3. Select a starting point $\hat{\beta}_{1,0}$.
4. Calculate the gradient of the loss function.
5. Calculate $\hat{\beta}_{1,i+1} = \hat{\beta}_{1,i} - \alpha * \frac{\partial L}{\partial \beta_1}$.
6. Check value of $\left|\hat{\beta}_{1,i+1} - \hat{\beta}_{1,i}\right|$.
7. Repeat steps 4 through 6 until "stopping condition" is met.
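
Before applying this workflow to the soda data, it may help to see the loop on a toy problem. The sketch below minimizes $f(b) = (b - 3)^2$, whose gradient is $2(b - 3)$. The helper name `minimize_1d`, the toy loss, and the tolerance are all assumptions made for illustration.

```python
def minimize_1d(gradient_fn, start, alpha=0.1, tolerance=1e-6, max_iter=1_000):
    """Run gradient descent on a single parameter (hypothetical helper)."""
    beta = start
    for _ in range(max_iter):
        # Step 5: move against the gradient, scaled by the learning rate.
        new_beta = beta - alpha * gradient_fn(beta)
        # Step 6: stop once the update is smaller than the tolerance.
        if abs(new_beta - beta) < tolerance:
            return new_beta
        beta = new_beta
    # Stopping condition never met; return the last value anyway.
    return beta

# Toy loss f(b) = (b - 3)^2 has gradient 2 * (b - 3) and its minimum at b = 3.
result = minimize_1d(lambda b: 2 * (b - 3), start=20)
print(result)
```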

#### Step 1. Instantiate model.

Our model takes on the form:
$$ Y = \beta_0 + \beta_1 X + \varepsilon$$

#### Step 2. Select a learning rate $\alpha$.

$$\alpha = 0.1$$

In [10]:
alpha = 0.1

#### Step 3. Select a starting point.
The zeroth iteration of $\hat{\beta}_1$ is going to start at, say, 20.
$$\hat{\beta}_{1,0} = 20$$

Two points:
- You and I know that the true value of $\beta_1$ is 1000. The computer does not, so the machine has to learn it!
- We're going to pretend that the computer already knows the value of $\beta_0$. In reality, we'd have to estimate $\beta_0$ and $\beta_1$ at the same time.
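
For the curious, updating both parameters at the same time might look like the sketch below. It standardizes `temp` first, uses `alpha = 0.1`, and runs a fixed 1,000 iterations; all three are assumptions of this sketch, not choices made in this walkthrough.

```python
import numpy as np

# Regenerate the data exactly as in the cells above.
np.random.seed(42)
temp = np.random.poisson(45, 100).astype(float)
sodas_sold = 200000 + 1000 * temp + np.round(np.random.normal(0, 20000, 100))

# Assumption of this sketch: standardize temp so both parameters share one scale.
temp_mean, temp_sd = temp.mean(), temp.std()
z = (temp - temp_mean) / temp_sd

b0, b1 = 0.0, 0.0  # coefficients on the standardized scale
alpha = 0.1
for _ in range(1000):
    error = sodas_sold - (b0 + b1 * z)
    grad_b0 = -2 * np.mean(error)      # partial derivative of MSE w.r.t. b0
    grad_b1 = -2 * np.mean(z * error)  # partial derivative of MSE w.r.t. b1
    # Update both parameters from the same error vector.
    b0, b1 = b0 - alpha * grad_b0, b1 - alpha * grad_b1

# Convert back to the original temperature scale.
beta_1_hat = b1 / temp_sd
beta_0_hat = b0 - beta_1_hat * temp_mean
print(beta_0_hat, beta_1_hat)
```

The standardization matters because, on the raw scale, the two gradients differ by orders of magnitude, so a single alpha that is safe for $\beta_1$ moves $\beta_0$ very slowly.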

In [None]:
beta_1 = 20

#### Step 4. Calculate the gradient of the loss function with respect to parameter $\beta_1$.

The loss function, $L$, is our mean squared error.

$$L = \frac{1}{n}\sum_{i = 1} ^ n (y_i - \hat{y}_i)^2 $$

$$\Rightarrow L = \frac{1}{n}\sum_{i = 1} ^ n \left(y_i - \left(\hat{\beta}_0 + \hat{\beta}_1x_i\right)\right)^2 $$

The gradient of this loss function with respect to $\beta_1$ is:

$$\frac{\partial L}{\partial \beta_1} = \frac{2}{n} \sum_{i=1}^n -x_i\left(y_i - \left(\hat{\beta}_1x_i + \hat{\beta}_0\right)\right) $$
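
One way to see where this comes from is to apply the chain rule to a single term of the sum and then average over all $n$ terms:

$$\frac{\partial}{\partial \beta_1}\left(y_i - \left(\hat{\beta}_0 + \hat{\beta}_1x_i\right)\right)^2 = 2\left(y_i - \left(\hat{\beta}_0 + \hat{\beta}_1x_i\right)\right) \cdot \left(-x_i\right)$$

$$\Rightarrow \frac{\partial L}{\partial \beta_1} = \frac{1}{n}\sum_{i=1}^n 2\left(-x_i\right)\left(y_i - \left(\hat{\beta}_0 + \hat{\beta}_1x_i\right)\right) = \frac{2}{n} \sum_{i=1}^n -x_i\left(y_i - \left(\hat{\beta}_1x_i + \hat{\beta}_0\right)\right)$$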

In [11]:
# Calculate the gradient of the loss with respect to beta_1.
def beta_1_gradient(x, y, beta_1, beta_0):
    # Determine the number of observations to sum over.
    n = len(x)
    # Start the running sum at 0.
    gradient = 0
    # Begin summation.
    for i in range(n):
        # Calculate the predicted value and its error.
        pred = beta_0 + beta_1 * x[i]
        error = y[i] - pred
        gradient += -1 * x[i] * error
    # Multiply the sum by 2/n.
    gradient *= (2 / n)
    return gradient
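
The same quantity can be computed without an explicit loop. The sketch below is a vectorized variant, checked against a central finite difference of the loss on three made-up points. The names `beta_1_gradient_vectorized` and `mse` and the tiny data set are assumptions for illustration only.

```python
import numpy as np

def mse(x, y, beta_1, beta_0):
    """Mean squared error of the line beta_0 + beta_1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y - (beta_0 + beta_1 * x)) ** 2)

def beta_1_gradient_vectorized(x, y, beta_1, beta_0):
    """Partial derivative of the MSE with respect to beta_1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    error = y - (beta_0 + beta_1 * x)
    # (2/n) * sum(-x_i * error_i) is the same as -2 * mean(x * error).
    return -2 * np.mean(x * error)

# Made-up points, used only to sanity-check the formula.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 7.0])

analytic = beta_1_gradient_vectorized(x, y, beta_1=1.0, beta_0=0.5)

# Central finite difference of the loss around beta_1 = 1.0.
h = 1e-6
numeric = (mse(x, y, 1.0 + h, 0.5) - mse(x, y, 1.0 - h, 0.5)) / (2 * h)

print(analytic, numeric)
```

If the formula is right, the two printed values should agree to several decimal places.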

#### Step 5. Calculate $\hat{\beta}_{1,i+1} = \hat{\beta}_{1,i} - \alpha * \frac{\partial L}{\partial \beta_1}$.

In [12]:
# Define function to calculate new value of beta_1.
def update_beta_1(beta_1, alpha, gradient):
    beta_1 -= alpha * gradient
    return beta_1

#### Step 6. Check value of $\left|\hat{\beta}_{1,i+1} - \hat{\beta}_{1,i}\right|$.

In [14]:
def check_update(beta_1, updated_beta_1, tolerance = 0.1):
    # True once the latest update moved beta_1 by less than the tolerance.
    return abs(beta_1 - updated_beta_1) < tolerance

#### Step 7. Repeat steps 4 through 6 until the stopping condition is met, then save the final value of $\hat{\beta}_1$.

#### Putting it all together...

In [30]:
def gradient_descent(x, y, beta_1 = 0, alpha = 0.01, max_iter = 100):
    # Track whether the stopping condition has been met.
    converged = False

    # Take at most max_iter gradient steps.
    for i in range(max_iter):

        # Calculate gradient, treating beta_0 = 200000 as known (see Step 3).
        new_gradient = beta_1_gradient(x=x, y=y, beta_1=beta_1, beta_0=200000)

        # Update beta_1.
        new_updated_beta_1 = update_beta_1(beta_1=beta_1, alpha=alpha, gradient=new_gradient)

        # Check for convergence.
        converged = check_update(beta_1=beta_1, updated_beta_1=new_updated_beta_1)

        # Overwrite beta_1.
        beta_1 = new_updated_beta_1

        # If we've converged, let us know!
        if converged:
            print(f'Our algorithm converged after {i + 1} iterations with a beta_1 value of {beta_1}.')
            break
        print(f'Iteration {i} with beta_1 value of {beta_1}.')

    # If we didn't converge by the end of the loop, please let us know!
    if not converged:
        print("Our algorithm did not converge. Kindly try not to trust the values I provided for beta_1.")

    # Return beta_1.
    return beta_1

In [45]:
# Call gradient_descent with an initial beta_1 of 200000, alpha of 0.00001, and up to 100000 iterations.
gradient_descent(x= df['temp'], 
                 y=df['sodas'], 
                 beta_1 = 200000, 
                 alpha = 0.00001, 
                 max_iter = 100000)

Iteration 0 with beta_1 value of 191698.8098804.
Iteration 1 with beta_1 value of 183743.92216925736.
Iteration 2 with beta_1 value of 176120.89009973803.
Iteration 3 with beta_1 value of 168815.86958366923.
Iteration 4 with beta_1 value of 161815.5940694734.
Iteration 5 with beta_1 value of 155107.35044895834.
Iteration 6 with beta_1 value of 148678.95596920906.
Iteration 7 with beta_1 value of 142518.73610765036.
Iteration 8 with beta_1 value of 136615.50337010028.
Iteration 9 with beta_1 value of 130958.53697330914.
Iteration 10 with beta_1 value of 125537.5633750862.
Iteration 11 with beta_1 value of 120342.73761665505.
Iteration 12 with beta_1 value of 115364.62544335354.
Iteration 13 with beta_1 value of 110594.18617120806.
Iteration 14 with beta_1 value of 106022.75626826653.
Iteration 15 with beta_1 value of 101642.033620872.
Iteration 16 with beta_1 value of 97444.06245630335.
Iteration 17 with beta_1 value of 93421.21889440126.
Iteration 18 with beta_1 value of 89566.19710193

1015.0473028933842

<details><summary>What should we do?</summary>

- We **should not** adjust our maximum iterations. It doesn't look like we'll converge.
- We should adjust our alpha!
</details>
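
To get a feel for how small alpha has to be, note that with $\beta_0$ held fixed, the loss is a quadratic in $\beta_1$ whose second derivative is $\frac{2}{n}\sum x_i^2$. For a quadratic like this, gradient descent with a fixed step moves toward the minimum only when alpha is below 2 divided by that second derivative. The sketch below computes that bound for the simulated temperatures, regenerated with the same seed; the variable names are illustrative.

```python
import numpy as np

# Regenerate the temperatures exactly as in the cells above.
np.random.seed(42)
temp = np.random.poisson(45, 100)

# Second derivative of the MSE with respect to beta_1 (beta_0 held fixed).
curvature = 2 * np.mean(temp.astype(float) ** 2)

# Fixed-step gradient descent on a quadratic converges only if alpha < 2 / curvature.
max_stable_alpha = 2 / curvature
print(f"Largest stable alpha: {max_stable_alpha:.6f}")

for alpha in (0.1, 0.01, 0.00001):
    # Each step multiplies the distance to the minimum by this factor.
    factor = 1 - alpha * curvature
    status = "converges" if abs(factor) < 1 else "diverges"
    print(f"alpha={alpha}: factor={factor:.4f} -> {status}")
```

With temperatures near 45, the curvature is in the thousands. That is why an alpha of 0.1 or 0.01 overshoots, while an alpha of 0.00001 shrinks the remaining error by roughly 4% per iteration, in line with the printed iterations above.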

In [40]:
converged

NameError: name 'converged' is not defined
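
The `NameError` arises because `converged` is a local variable inside `gradient_descent`; it stops existing once the function returns. One option, sketched here with a hypothetical `gradient_descent_with_status` helper and a vectorized gradient (neither is part of the code above), is to return the flag alongside the estimate.

```python
import numpy as np

def gradient_descent_with_status(x, y, beta_0, beta_1=0.0, alpha=1e-5,
                                 tolerance=0.1, max_iter=100_000):
    """Return (beta_1, converged) so the caller can inspect both."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    for _ in range(max_iter):
        # Gradient of the MSE with respect to beta_1, with beta_0 held fixed.
        gradient = -2 * np.mean(x * (y - (beta_0 + beta_1 * x)))
        new_beta_1 = beta_1 - alpha * gradient
        # Stop once the update is smaller than the tolerance.
        if abs(new_beta_1 - beta_1) < tolerance:
            return new_beta_1, True
        beta_1 = new_beta_1
    return beta_1, False

# Small noiseless example: y = 5 + 2x, so beta_1 should approach 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 5 + 2 * x
beta_1_hat, converged = gradient_descent_with_status(
    x, y, beta_0=5, alpha=0.01, tolerance=1e-8
)
print(beta_1_hat, converged)
```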