## Introduction

In this mission and the next, we'll discuss the two most common ways of finding the optimal parameter values for a linear regression model. Each unique combination of parameter values defines a different linear regression model, and the process of finding the optimal values is known as **model fitting**. In both approaches to model fitting, we'll aim to minimize the following function, the mean squared error (MSE):

$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}^{(i)} - y^{(i)})^2$

We'll start with a model that has a single parameter, $a_1$:

$\widehat{\text{SalePrice}} = a_1 * \text{Gr Liv Area}$

![Jupyter](./single_var_operation.gif)
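
To make the cost function concrete, here is a minimal sketch that evaluates the MSE of the single-parameter model for a few candidate values of $a_1$. The `mse` helper and the toy data are assumptions made for illustration; they are not part of the original notebook or the Ames housing data.

```python
def mse(a1, x_values, y_values):
    """Mean squared error of the model y_hat = a1 * x."""
    n = len(x_values)
    total = 0.0
    for x, y in zip(x_values, y_values):
        prediction = a1 * x
        total += (prediction - y) ** 2
    return total / n


# Hypothetical toy data: y is exactly 3 * x, so a1 = 3 gives zero error.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 6.0, 9.0, 12.0]

for candidate in [1.0, 2.0, 3.0, 4.0]:
    print(candidate, mse(candidate, toy_x, toy_y))
```

The error is lowest at the candidate that matches the slope used to build the toy data and grows as the candidate moves away from it in either direction. Model fitting is the search for that lowest point.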

## Single Variable Gradient Descent

Here's an overview of the gradient descent algorithm for a single parameter linear regression model:

* select an initial value for the parameter: $a_1$
* repeat until convergence (usually implemented with a max number of iterations):
    - calculate the error (MSE) of the model that uses the current parameter value: $MSE(a_1) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}^{(i)} - y^{(i)} ) ^2$
    - calculate the derivative of the error (MSE) at the current parameter value: $\frac{d}{da_1} MSE(a_1)$
    - update the parameter value by subtracting the derivative times a constant ($\alpha$, called the learning rate): $a_1 := a_1 - \alpha \frac{d}{da_1} MSE(a_1)$

Selecting an appropriate initial parameter value and learning rate reduces the number of iterations required to converge. Choosing these values is part of hyperparameter optimization.
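
The effect of the learning rate is easy to see on a small example. The sketch below runs the update rule above on hypothetical toy data with three different learning rates. The names `run_descent`, `toy_x`, and `toy_y` and the specific rates are assumptions chosen for illustration, and the `derivative` helper uses the same formula as the function defined later in this notebook.

```python
def derivative(a1, x_values, y_values):
    """Derivative of the MSE with respect to a1 for the model y_hat = a1 * x."""
    n = len(x_values)
    return 2 * sum(x * (a1 * x - y) for x, y in zip(x_values, y_values)) / n


def run_descent(x_values, y_values, alpha, a1_initial, max_iterations):
    """Apply the update a1 := a1 - alpha * derivative a fixed number of times."""
    a1 = a1_initial
    for _ in range(max_iterations):
        a1 = a1 - alpha * derivative(a1, x_values, y_values)
    return a1


# Hypothetical toy data: y is exactly 3 * x, so the optimal a1 is 3.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 6.0, 9.0, 12.0]

small = run_descent(toy_x, toy_y, alpha=0.001, a1_initial=0.0, max_iterations=20)
moderate = run_descent(toy_x, toy_y, alpha=0.05, a1_initial=0.0, max_iterations=20)
too_large = run_descent(toy_x, toy_y, alpha=0.2, a1_initial=0.0, max_iterations=20)

print(small, moderate, too_large)
```

With these toy values, the smallest rate is still far from the optimum of 3 after 20 iterations, the moderate rate lands essentially on it, and the largest rate overshoots further on every step and diverges.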

## Derivative Of The Cost Function

In mathematical optimization, a function that we optimize through minimization is known as a **cost function** or sometimes as the [loss function](https://en.wikipedia.org/wiki/Loss_function).
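
For the single-parameter model, the derivative of the cost function follows from the chain rule. Writing $x^{(i)}$ for the `Gr Liv Area` value of row $i$ (a symbol introduced here for brevity), so that $\hat{y}^{(i)} = a_1 x^{(i)}$, a short derivation is:

$$
\begin{aligned}
MSE(a_1) &= \frac{1}{n} \sum_{i=1}^{n} \left(a_1 x^{(i)} - y^{(i)}\right)^2 \\
\frac{d}{da_1} MSE(a_1) &= \frac{1}{n} \sum_{i=1}^{n} 2 \left(a_1 x^{(i)} - y^{(i)}\right) x^{(i)} \\
&= \frac{2}{n} \sum_{i=1}^{n} x^{(i)} \left(a_1 x^{(i)} - y^{(i)}\right)
\end{aligned}
$$

The last line is the expression that the `derivative` function in the following code computes.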

In [3]:
import pandas as pd
data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]

In [5]:
def derivative(a1, xi_list, yi_list):
    # Derivative of the MSE with respect to a1 for the model y_hat = a1 * x:
    # (2 / n) * sum(x_i * (a1 * x_i - y_i))
    n = len(xi_list)
    cost_err = 0
    for i in range(n):
        cost_err += (a1*xi_list[i] - yi_list[i])*xi_list[i]
    cost_err = 2*cost_err/n
    return cost_err

def gradient_descent(xi_list, yi_list, max_iterations, alpha, a1_initial):
    a1_list = [a1_initial]

    for i in range(0, max_iterations):
        a1 = a1_list[i]
        deriv = derivative(a1, xi_list, yi_list)
        a1_new = a1 - alpha*deriv
        a1_list.append(a1_new)
    return a1_list

# 20 iterations, learning rate 3e-7, starting from a1 = 150.
param_iterations = gradient_descent(train['Gr Liv Area'], train['SalePrice'], 20, .0000003, 150)
final_param = param_iterations[-1]

In [6]:
final_param

120.14219147202738

In [7]:
param_iterations

[150,
 106.24258269493151,
 126.61281661731272,
 117.12993450021699,
 121.54446668425497,
 119.48938531096931,
 120.44607998998796,
 120.00071333893449,
 120.20804328256295,
 120.11152571569237,
 120.15645719327628,
 120.13554040327286,
 120.1452777216869,
 120.14074474268385,
 120.14285496418101,
 120.14187260031741,
 120.14232991665213,
 120.142117023815,
 120.14221613105579,
 120.14216999401657,
 120.14219147202738]

## Understanding Multi Parameter Gradient Descent

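Now let's add an intercept term, $a_0$, so that the model has two parameters:

$\widehat{\text{SalePrice}} = a_1 * \text{Gr Liv Area} + a_0$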

In the image below, we've generated a 3D scatter plot with:
* $a_0$ on the x-axis
* $a_1$ on the y-axis
* $MSE$ on the z-axis

![Jupyter](./surface_plot.gif)
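
Each point in a plot like this is one $(a_0, a_1)$ pair together with the MSE of the model it defines. As a rough sketch of how such points could be computed, the code below evaluates the MSE over a small grid of parameter values. The `mse` helper, the grid values, and the toy data are assumptions for illustration and are not the values behind the image above.

```python
def mse(a0, a1, x_values, y_values):
    """Mean squared error of the model y_hat = a1 * x + a0."""
    n = len(x_values)
    return sum((a0 + a1 * x - y) ** 2 for x, y in zip(x_values, y_values)) / n


# Hypothetical toy data: y is exactly 2 * x + 3.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [5.0, 7.0, 9.0, 11.0]

a0_grid = [1.0, 2.0, 3.0, 4.0, 5.0]
a1_grid = [0.0, 1.0, 2.0, 3.0, 4.0]

# One (a0, a1, MSE) triple per grid point: the x, y, and z coordinates.
surface = [(a0, a1, mse(a0, a1, toy_x, toy_y)) for a0 in a0_grid for a1 in a1_grid]

# The lowest point of the surface is the best-fitting model on the grid.
best = min(surface, key=lambda point: point[2])
print(best)
```

Gradient descent can be read as a way of walking downhill on this surface without having to evaluate every grid point.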

## Gradient Of The Cost Function
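
With two parameters, the cost function has one derivative per parameter, and together they form the gradient. Writing $x^{(i)}$ for the `Gr Liv Area` value of row $i$ (a symbol introduced here for brevity), so that $\hat{y}^{(i)} = a_1 x^{(i)} + a_0$, the chain rule gives:

$$
\begin{aligned}
\frac{d}{da_1} MSE(a_0, a_1) &= \frac{2}{n} \sum_{i=1}^{n} x^{(i)} \left(a_0 + a_1 x^{(i)} - y^{(i)}\right) \\
\frac{d}{da_0} MSE(a_0, a_1) &= \frac{2}{n} \sum_{i=1}^{n} \left(a_0 + a_1 x^{(i)} - y^{(i)}\right)
\end{aligned}
$$

These are the two expressions that `a1_derivative` and `a0_derivative` compute in the code that follows.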

In [9]:
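def a1_derivative(a0, a1, xi_list, yi_list):
    # Derivative of the MSE with respect to the slope a1:
    # (2 / n) * sum(x_i * (a0 + a1 * x_i - y_i))
    len_data = len(xi_list)
    error = 0
    for i in range(0, len_data):
        error += xi_list[i]*(a0 + a1*xi_list[i] - yi_list[i])
    deriv = 2*error/len_data
    return deriv

def a0_derivative(a0, a1, xi_list, yi_list):
    # Derivative of the MSE with respect to the intercept a0:
    # (2 / n) * sum(a0 + a1 * x_i - y_i)
    len_data = len(xi_list)
    error = 0
    for i in range(0, len_data):
        error += (a0 + a1*xi_list[i] - yi_list[i])
    deriv = 2*error/len_data
    return deriv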

def gradient_descent(xi_list, yi_list, max_iterations, alpha, a1_initial, a0_initial):
    a1_list = [a1_initial]
    a0_list = [a0_initial]

    for i in range(0, max_iterations):
        a1 = a1_list[i]
        a0 = a0_list[i]
        
        a1_deriv = a1_derivative(a0, a1, xi_list, yi_list)
        a0_deriv = a0_derivative(a0, a1, xi_list, yi_list)
        
        a1_new = a1 - alpha*a1_deriv
        a0_new = a0 - alpha*a0_deriv
        
        a1_list.append(a1_new)
        a0_list.append(a0_new)
    return a0_list, a1_list


In [10]:
# 20 iterations, learning rate 3e-7, starting from a1 = 150 and a0 = 1000.
a0_params, a1_params = gradient_descent(train['Gr Liv Area'], train['SalePrice'], 20, .0000003, 150, 1000)

In [11]:
a0_params

[1000,
 999.9729797812329,
 999.985903701066,
 999.980232547139,
 999.9832179015052,
 999.9821734177915,
 999.983004932363,
 999.9829631191217,
 999.9833278635107,
 999.98350334434,
 999.9837669324418,
 999.9839895042135,
 999.9842311701743,
 999.9844639472566,
 999.9847008623329,
 999.9849358510428,
 999.9851717365096,
 999.9854072044933,
 999.985642866808,
 999.9858784386378,
 999.986114052572]

In [12]:
a1_params

[150,
 105.34801721547944,
 126.13471917628125,
 116.45794862200977,
 120.96274606972909,
 118.86564116059868,
 119.84189984026605,
 119.38742488614261,
 119.59899502291616,
 119.50050320781361,
 119.54635359313434,
 119.52500879150305,
 119.53494516153384,
 119.53031930255781,
 119.53247255390217,
 119.53146994657168,
 119.53193647656232,
 119.53171908350993,
 119.53182007507831,
 119.53177285001942,
 119.53179462379771]

## Gradient Descent For Higher Dimensions

What if we want to use many parameters in our model? Gradient descent scales to as many parameters as you want. Each parameter needs its own update rule, and each rule closely matches the update rule for $a_1$:


$$
\begin{aligned}
a_0 &:= a_0 - \alpha \dfrac{d}{da_0} MSE \\
a_1 &:= a_1 - \alpha \dfrac{d}{da_1} MSE \\
a_2 &:= a_2 - \alpha \dfrac{d}{da_2} MSE \\
&\ \ \vdots \\
a_n &:= a_n - \alpha \dfrac{d}{da_n} MSE
\end{aligned}
$$

Apart from the derivative of the MSE with respect to the intercept $a_0$, which has no feature value multiplying the error inside the sum, the derivatives for the other parameters all have the same form:

$$
\begin{aligned}
\dfrac{d}{da_1} MSE &= \dfrac{2}{n} \sum_{i=1}^{n} x_1^{(i)}\left(\hat{y}^{(i)} - y^{(i)}\right) \\
\dfrac{d}{da_2} MSE &= \dfrac{2}{n} \sum_{i=1}^{n} x_2^{(i)}\left(\hat{y}^{(i)} - y^{(i)}\right) \\
&\ \ \vdots \\
\dfrac{d}{da_n} MSE &= \dfrac{2}{n} \sum_{i=1}^{n} x_n^{(i)}\left(\hat{y}^{(i)} - y^{(i)}\right)
\end{aligned}
$$
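
Because every non-intercept derivative has the same form, one loop can compute the whole gradient. The sketch below is one possible pure-Python generalization of the two-parameter code above, not part of the original notebook. The names `predict` and `gradient`, the list-of-rows data layout, and the toy data are assumptions for illustration. Unlike the earlier `gradient_descent` functions, this version returns only the final parameter values.

```python
def predict(params, row):
    """params[0] is the intercept a0; params[j] multiplies feature row[j - 1]."""
    return params[0] + sum(a * x for a, x in zip(params[1:], row))


def gradient(params, rows, targets):
    """Derivatives of the MSE with respect to every parameter, as a list."""
    n = len(rows)
    errors = [predict(params, row) - y for row, y in zip(rows, targets)]
    grad = [2 * sum(errors) / n]  # intercept: no feature value in the sum
    for j in range(len(params) - 1):
        grad.append(2 * sum(row[j] * e for row, e in zip(rows, errors)) / n)
    return grad


def gradient_descent(rows, targets, alpha, initial_params, max_iterations):
    """Update all parameters together, using derivatives from the same step."""
    params = list(initial_params)
    for _ in range(max_iterations):
        grad = gradient(params, rows, targets)
        params = [a - alpha * g for a, g in zip(params, grad)]
    return params


# Hypothetical toy data with two features: y = 1 + 2 * x1 + 3 * x2.
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

fitted = gradient_descent(rows, targets, alpha=0.1,
                          initial_params=[0.0, 0.0, 0.0], max_iterations=2000)
print(fitted)
```

On this toy data the fitted values approach the intercept and slopes used to generate the targets. Real data with features on very different scales generally needs a much smaller learning rate, as the `.0000003` used with `Gr Liv Area` earlier suggests.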

## Summarization

In this mission, we explored how to fit a linear regression model using the gradient descent algorithm. The main challenges with gradient descent include:

* Choosing good initial parameter values
* Choosing a good learning rate (falls under the domain of hyperparameter optimization)