# Mathematics for Data Science

## Final Exam (Part 2) - Coding Problem 

<i>Note: This constitutes 10% of your Final Exam grade.</i>

This notebook should be submitted <b><u>individually</u></b>.   

Name:

<b><u>Read the instructions and questions carefully.</u></b>

Do <b><u>NOT</u></b> import any other libraries aside from those below.

In [7]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the diabetes dataset
x, y = datasets.load_diabetes(return_X_y=True)

# Use three features
x = x[:, [2, 3, 9]]
x

array([[ 0.06169621,  0.02187235, -0.01764613],
       [-0.05147406, -0.02632783, -0.09220405],
       [ 0.04445121, -0.00567061, -0.02593034],
       ...,
       [-0.01590626,  0.01728186,  0.01549073],
       [ 0.03906215,  0.00121513, -0.02593034],
       [-0.0730303 , -0.08141377,  0.00306441]])

---

## Fitting a Ridge Regression Using Gradient Descent (10 pts.)

Implement batch gradient descent to fit a multiple linear regression model to the diabetes dataset,

\begin{align*}
    y &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3
\end{align*}

with squared loss (the sum of squared errors, not the MSE; the two differ only by the constant factor $1/n$) and ridge regularization (also known as L2 regularization),

\begin{align*}
    J(\hat{\theta}_0, \hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3) &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{3} (\hat{\theta}_j)^2
\end{align*}

Note that $\hat{\theta}_0$ is NOT included in the penalty term.
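
For reference, differentiating $J$ term by term gives the gradient that batch gradient descent follows. In this sketch, $x_{ij}$ denotes the $j$-th feature of sample $i$, $\alpha$ the learning rate, and $\hat{\theta}_{k}$ the full parameter vector at iteration $k$; the symbols $x_{ij}$ and $\alpha$ are notation assumed here rather than taken from the problem statement.

\begin{align*}
    \frac{\partial J}{\partial \hat{\theta}_0} &= -2 \sum_{i=1}^{n} (y_i - \hat{y}_i) \\
    \frac{\partial J}{\partial \hat{\theta}_j} &= -2 \sum_{i=1}^{n} x_{ij} (y_i - \hat{y}_i) + 2 \lambda \hat{\theta}_j, \quad j = 1, 2, 3 \\
    \hat{\theta}_{k+1} &= \hat{\theta}_{k} - \alpha \nabla J(\hat{\theta}_{k})
\end{align*}

The intercept derivative has no $\lambda$ term because $\hat{\theta}_0$ is excluded from the penalty.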

---

### Interlude

One of the goals of ridge regression (and regularization in general) is to reduce "overfitting". For linear regression, this is especially important when the features are highly correlated with each other.

From a classical statistics perspective, it turns out that we can decompose the MSE statistic into <b><i>bias</i></b> and <b><i>variance</i></b> terms. Ridge regularization allows us to achieve a lower MSE by introducing a "slight" amount of bias in order to get a lower variance.
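
The variance reduction can be illustrated with a small simulation. The sketch below is not part of the exam problem: the two-feature design, the true coefficients, the noise level, and the omission of an intercept are all assumptions chosen to keep the example short. It refits ordinary least squares and ridge on the same highly correlated features with fresh noise each time, then compares how much the estimates scatter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, n_trials = 30, 1.0, 500

# Two highly correlated features (fixed design); theta_true is an assumed ground truth
z = rng.normal(size=n)
X = np.column_stack([z, z + 0.05 * rng.normal(size=n)])
theta_true = np.array([1.0, 1.0])

G = X.T @ X
ols, ridge = [], []
for _ in range(n_trials):
    y = X @ theta_true + rng.normal(size=n)  # fresh noise in every trial
    ols.append(np.linalg.solve(G, X.T @ y))                      # least squares
    ridge.append(np.linalg.solve(G + lam * np.eye(2), X.T @ y))  # ridge, no intercept
ols, ridge = np.array(ols), np.array(ridge)

# Total variance of the coefficient estimates across trials
print("variance  OLS:", ols.var(axis=0).sum(), " ridge:", ridge.var(axis=0).sum())
# Mean squared error of the coefficient estimates against theta_true
print("MSE       OLS:", ((ols - theta_true) ** 2).sum(axis=1).mean(),
      " ridge:", ((ridge - theta_true) ** 2).sum(axis=1).mean())
```

For a fixed design, the ridge estimate is a linear contraction of the least-squares estimate, so its total variance across trials cannot exceed that of least squares; how the MSE compares depends on the bias this shrinkage introduces.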

We will revisit this concept in ACS and you will be using Ridge (and Lasso) as a regularization tool in ML! Very cool.

---

For this problem, set the regularization parameter $\lambda = 1$.

Print a diagnostic output for the first five iterations ($k = 0, 1, 2, 3, 4$), then print the output every 1000 iterations until convergence ($k = 1000, 2000, 3000, \ldots$).

The diagnostic output should contain the following: the iteration $k$, parameters $\hat{\theta}_{k}$, and loss $J$.

Your solution should be <u>stable</u> and <u>exact</u> up to 4 decimal places.
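
The problem leaves the meaning of "until convergence" open. As one possible reading, the sketch below stops once the gradient norm falls below a tolerance, and it follows the requested print schedule. The names `ridge_loss`, `ridge_grad`, `gradient_descent`, `tol`, and `max_iter` are hypothetical, the stopping rule is an assumption, and the data are synthetic stand-ins (centered, unit-norm columns) rather than the diabetes features.

```python
import numpy as np

def ridge_loss(theta, X, y, lam):
    # Sum of squared errors plus L2 penalty; theta[0] (intercept) is not penalized
    residual = y - X @ theta
    return residual @ residual + lam * np.sum(theta[1:] ** 2)

def ridge_grad(theta, X, y, lam):
    residual = y - X @ theta
    penalty = 2.0 * lam * theta
    penalty[0] = 0.0  # no penalty on the intercept
    return -2.0 * (X.T @ residual) + penalty

def gradient_descent(X, y, lam=1.0, lr=0.001, tol=1e-8, max_iter=100_000):
    theta = np.ones(X.shape[1])
    for k in range(max_iter):
        grad = ridge_grad(theta, X, y, lam)
        # Diagnostics for the first five iterations, then every 1000 iterations
        if k < 5 or k % 1000 == 0:
            print(f"k = {k} -- theta = {np.round(theta, 4)} -- J = {ridge_loss(theta, X, y, lam):.4f}")
        if np.linalg.norm(grad) < tol:  # assumed stopping rule
            break
        theta = theta - lr * grad
    return theta, k

# Synthetic stand-in data: 50 samples, 3 centered features with unit column norm
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
features -= features.mean(axis=0)
features /= np.linalg.norm(features, axis=0)
y_demo = 3.0 + features @ np.array([20.0, -10.0, 5.0]) + rng.normal(size=50)
X_demo = np.column_stack([np.ones(50), features])  # leading column of ones for the intercept

theta_hat, n_iter = gradient_descent(X_demo, y_demo)
print("stopped at k =", n_iter, "with theta =", np.round(theta_hat, 4))
```

Because the loss is a sum rather than a mean, the largest stable learning rate shrinks as the number of samples grows, so `lr` may need adjusting for other datasets.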

In [8]:
# Use the following variables
theta_k = np.array([1, 1, 1, 1])
lam = 1
lr = 0.001

# Your code here

# Work on a transposed view so that features[j] holds the j-th feature for all samples
# (x itself is left untouched, so re-running this cell does not transpose it twice)
features = x.T

def residuals(theta, features, y):
    # y_i - y_hat_i for every sample
    return y - (theta[0] + theta[1]*features[0] + theta[2]*features[1] + theta[3]*features[2])

def J(theta, features, y):
    # Squared loss plus ridge penalty; theta[0] is not penalized
    return np.sum(residuals(theta, features, y)**2) + lam*np.sum(theta[1:]**2)

def grad_J(theta, features, y):
    # Partial derivatives of J with respect to theta[0], ..., theta[3]
    r = residuals(theta, features, y)
    return np.array([
        -2*np.sum(r),
        -2*np.sum(features[0]*r) + 2*lam*theta[1],
        -2*np.sum(features[1]*r) + 2*lam*theta[2],
        -2*np.sum(features[2]*r) + 2*lam*theta[3],
    ])

for k in range(0, 10000):
    # Diagnostics for the first five iterations, then every 1000 iterations
    if k < 5 or k % 1000 == 0:
        print("k = %d -- theta_k[0] = %.4f -- theta_k[1] = %.4f -- theta_k[2] = %.4f -- theta_k[3] = %.4f -- J = %.4f"
              % (k, *theta_k, J(theta_k, features, y)))
    # Batch gradient descent update
    theta_k = theta_k - lr*grad_J(theta_k, features, y)



k = 0 -- theta_k[0] = 1.0000 -- theta_k[1] = 1.0000 -- theta_k[2] = 1.0000 -- theta_k[3] = 1.0000 -- J = 12712318.5496
k = 1 -- theta_k[0] = 134.6020 -- theta_k[1] = 2.8933 -- theta_k[2] = 2.4239 -- theta_k[3] = 2.2329 -- J = 2745188.1244
k = 2 -- theta_k[0] = 150.0998 -- theta_k[1] = 4.7769 -- theta_k[2] = 3.8397 -- theta_k[3] = 3.4583 -- J = 2604131.4706
k = 3 -- theta_k[0] = 151.8976 -- theta_k[1] = 6.6510 -- theta_k[2] = 5.2473 -- theta_k[3] = 4.6762 -- J = 2595370.4540
k = 4 -- theta_k[0] = 152.1061 -- theta_k[1] = 8.5155 -- theta_k[2] = 6.6469 -- theta_k[3] = 5.8866 -- J = 2588464.7590
k = 1000 -- theta_k[0] = 152.1335 -- theta_k[1] = 384.6549 -- theta_k[2] = 244.0755 -- theta_k[3] = 188.5413 -- J = 1960916.9497
k = 2000 -- theta_k[0] = 152.1335 -- theta_k[1] = 390.1390 -- theta_k[2] = 243.9078 -- theta_k[3] = 186.2808 -- J = 1960852.2453
k = 3000 -- theta_k[0] = 152.1335 -- theta_k[1] = 390.3207 -- theta_k[2] = 243.8628 -- theta_k[3] = 186.1551 -- J = 1960852.1567
k = 4000 -- th

In [10]:
optimal_theta = np.array([152.1335, 390.3281, 243.8608, 186.1498])
J = 1960852.1565
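
Because the ridge objective is quadratic, its minimizer also has a closed form, which offers an independent check on gradient-descent results such as `optimal_theta`. The sketch below is a minimal illustration under assumptions: `ridge_closed_form` is a hypothetical helper, and the three-point dataset is made up so the arithmetic can be followed by hand. The matrix `D` is the identity with its first diagonal entry set to zero, so the intercept stays unpenalized.

```python
import numpy as np

def ridge_closed_form(features, y, lam):
    # Solve (X^T X + lam * D) theta = X^T y, where X has a leading column of ones
    X = np.column_stack([np.ones(len(y)), features])
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# Tiny made-up dataset: one centered feature, three samples
features_toy = np.array([[-1.0], [0.0], [1.0]])
y_toy = np.array([1.0, 2.0, 6.0])

theta_ridge = ridge_closed_form(features_toy, y_toy, lam=1.0)
theta_ols = ridge_closed_form(features_toy, y_toy, lam=0.0)
print(theta_ridge)  # intercept 3, slope 5/3
print(theta_ols)    # intercept 3, slope 5/2
```

Here $X^\top X + \lambda D = \mathrm{diag}(3, 3)$ and $X^\top y = (9, 5)$, so the ridge slope shrinks from $5/2$ to $5/3$ while the intercept stays at the mean of $y$. Passing the three diabetes features and `lam=1` to the same helper gives values that can be compared against `optimal_theta` above.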