## Gradient Descent in Linear Regression

We want to model the grades of students based on the number of hours they study for the course. Let's define our data:

In [1]:
import numpy as np

In [2]:
hours  = np.array([10,  9,  7,  2,  3,  6,  2,  8, 11, 8])
grades = np.array([98, 91, 73, 24, 35, 57, 18, 93, 87, 83])

The linear model can be written as:
<br>
$$
grade = \beta_0 + \beta_1 \cdot hours
$$
<br>
During training, the mean squared error (MSE) over the $n$ data points can be written as:
<br>
$$
MSE = \frac{1}{n}\sum_{i=1}^{n}{(grade_i - \beta_0 - \beta_1 \cdot hours_i)}^2
$$
<br>
Differentiating the MSE with respect to $\beta_0$ and $\beta_1$ yields:
<br>
$$
\frac{\partial MSE}{\partial \beta_0} = \frac{-2}{n} \sum_{i=1}^{n}(grade_i - \beta_0 - \beta_1 \cdot hours_i)\\
\frac{\partial MSE}{\partial \beta_1} = \frac{-2}{n} \sum_{i=1}^{n}(grade_i - \beta_0 - \beta_1 \cdot hours_i) \cdot hours_i
$$
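
Because the MSE is quadratic in $\beta_0$ and $\beta_1$, setting both partial derivatives to zero gives a linear system with an exact solution. That solution is a useful reference point for judging where gradient descent should end up. The following is a minimal sketch using `np.linalg.lstsq`; the names `X`, `beta_0_exact`, and `beta_1_exact` are illustrative and not part of the original notebook.

```python
import numpy as np

hours  = np.array([10,  9,  7,  2,  3,  6,  2,  8, 11, 8])
grades = np.array([98, 91, 73, 24, 35, 57, 18, 93, 87, 83])

# Design matrix: a column of ones for the intercept and a column of hours
X = np.column_stack([np.ones(len(hours)), hours])

# Least-squares solution of X @ [beta_0, beta_1] ~ grades
(beta_0_exact, beta_1_exact), *_ = np.linalg.lstsq(X, grades, rcond=None)

# Residuals at the least-squares solution; both partial derivatives vanish here
residuals = grades - beta_0_exact - beta_1_exact * hours
print("beta0:%.2f, beta1:%.2f MSE: %.2f" % (beta_0_exact, beta_1_exact, np.mean(residuals**2)))
```

At this point both partial derivatives are zero up to floating-point error, so no gradient step can lower the MSE further.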

In [9]:
def MSE(grades, hours, beta_0, beta_1):
    # Mean squared error of the linear model on the data
    return np.mean(np.square(grades - beta_0 - beta_1*hours))

def partial_derivative_beta_0(grades, hours, beta_0, beta_1):
    # dMSE/dbeta_0 = -2 * mean(residuals)
    return -2*np.mean(grades - beta_0 - beta_1*hours)

def partial_derivative_beta_1(grades, hours, beta_0, beta_1):
    # dMSE/dbeta_1 = -2 * mean(residuals * hours)
    return -2*np.mean(np.multiply(grades - beta_0 - beta_1*hours, hours))

def gradient_mag(part_der_beta_0, part_der_beta_1):
    # Euclidean norm of the gradient vector
    return np.sqrt(np.sum(np.square(part_der_beta_0)) + np.sum(np.square(part_der_beta_1)))
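
A quick way to gain confidence in hand-derived partial derivatives is to compare them with central finite differences of the loss. The sketch below is self-contained and uses hypothetical helper names (`mse`, `analytic_gradient`, `numeric_gradient`) that are not part of the notebook; the step size `h=1e-5` is an arbitrary choice.

```python
import numpy as np

hours  = np.array([10,  9,  7,  2,  3,  6,  2,  8, 11, 8])
grades = np.array([98, 91, 73, 24, 35, 57, 18, 93, 87, 83])

def mse(beta_0, beta_1):
    # Mean squared error of the linear model
    return np.mean(np.square(grades - beta_0 - beta_1 * hours))

def analytic_gradient(beta_0, beta_1):
    # Partial derivatives of the MSE, as derived above
    residuals = grades - beta_0 - beta_1 * hours
    return np.array([-2 * np.mean(residuals), -2 * np.mean(residuals * hours)])

def numeric_gradient(beta_0, beta_1, h=1e-5):
    # Central differences: (f(x + h) - f(x - h)) / (2h) in each coordinate
    d_beta_0 = (mse(beta_0 + h, beta_1) - mse(beta_0 - h, beta_1)) / (2 * h)
    d_beta_1 = (mse(beta_0, beta_1 + h) - mse(beta_0, beta_1 - h)) / (2 * h)
    return np.array([d_beta_0, d_beta_1])

# Compare the two at an arbitrary test point
print(analytic_gradient(0.0, 10.0))
print(numeric_gradient(0.0, 10.0))
```

If the two printed vectors agree to several decimal places, the analytical derivatives are very likely correct.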

Now we define the parameters $\alpha$ (the learning rate) and $\epsilon$ (the gradient-magnitude threshold used as the stopping criterion):

In [10]:
alpha   = 1e-2
epsilon = 1e-1
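
The choice of $\alpha$ matters. For a quadratic loss such as the MSE, gradient descent with a fixed step converges only when $\alpha < 2/\lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of the Hessian of the loss. This is a standard result for quadratic objectives rather than something derived in this notebook. The sketch below computes that bound for our data; the names `hessian` and `alpha_max` are illustrative.

```python
import numpy as np

hours = np.array([10,  9,  7,  2,  3,  6,  2,  8, 11, 8])

# Hessian of the MSE with respect to (beta_0, beta_1); it does not depend on the grades
hessian = 2 * np.array([[1.0,            np.mean(hours)],
                        [np.mean(hours), np.mean(hours**2)]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
lambda_min, lambda_max = np.linalg.eigvalsh(hessian)

# Fixed-step gradient descent on a quadratic is stable only below this value
alpha_max = 2 / lambda_max
print("largest stable learning rate: %.4f" % alpha_max)
```

For this data the bound is a little under 0.02, so `alpha = 1e-2` sits inside the stable range. The large gap between `lambda_max` and `lambda_min` also suggests that convergence along the flatter direction will be slow.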

The optimization loop would be:

In [18]:
# Initial guess for the parameters
beta_0 = 0
beta_1 = 10

while True:
    MSE_ = MSE(grades, hours, beta_0, beta_1)

    print("beta0:%.2f, beta1:%.2f MSE: %.2f" % (beta_0, beta_1, MSE_))

    # Evaluate both partial derivatives at the current parameters
    partial_derivative_beta_0_ = partial_derivative_beta_0(grades, hours, beta_0, beta_1)
    partial_derivative_beta_1_ = partial_derivative_beta_1(grades, hours, beta_0, beta_1)

    # Step against the gradient, scaled by the learning rate
    beta_0 = beta_0 - alpha * partial_derivative_beta_0_
    beta_1 = beta_1 - alpha * partial_derivative_beta_1_

    # Stop once the gradient magnitude falls below epsilon
    if gradient_mag(partial_derivative_beta_0_, partial_derivative_beta_1_) < epsilon:
        break

beta0:0.00, beta1:10.00 MSE: 77.50
beta0:-0.00, beta1:9.77 MSE: 75.07
beta0:0.03, beta1:9.79 MSE: 74.99
beta0:0.05, beta1:9.78 MSE: 74.92
beta0:0.08, beta1:9.78 MSE: 74.86
beta0:0.10, beta1:9.78 MSE: 74.79
beta0:0.13, beta1:9.77 MSE: 74.72
beta0:0.15, beta1:9.77 MSE: 74.66
beta0:0.18, beta1:9.77 MSE: 74.59
beta0:0.20, beta1:9.76 MSE: 74.53
beta0:0.23, beta1:9.76 MSE: 74.46
beta0:0.25, beta1:9.76 MSE: 74.40
beta0:0.28, beta1:9.75 MSE: 74.34
beta0:0.30, beta1:9.75 MSE: 74.27
beta0:0.33, beta1:9.75 MSE: 74.21
beta0:0.35, beta1:9.74 MSE: 74.15
beta0:0.38, beta1:9.74 MSE: 74.09
beta0:0.40, beta1:9.74 MSE: 74.03
beta0:0.43, beta1:9.74 MSE: 73.97
beta0:0.45, beta1:9.73 MSE: 73.91
beta0:0.48, beta1:9.73 MSE: 73.85
beta0:0.50, beta1:9.73 MSE: 73.79
beta0:0.52, beta1:9.72 MSE: 73.73
beta0:0.55, beta1:9.72 MSE: 73.67
beta0:0.57, beta1:9.72 MSE: 73.61
beta0:0.60, beta1:9.71 MSE: 73.55
beta0:0.62, beta1:9.71 MSE: 73.50
beta0:0.64, beta1:9.71 MSE: 73.44
beta0:0.67, beta1:9.71 MSE: 73.38
beta0:0.69, 