# Course: Introduction to Machine Learning
# Assignment I
### Instructor: Tan Bui-Thanh
### TA: Hai Nguyen
#### Due date: ??? Sep 2022


## **Question 1** (Linear regression for 2 linear features)

Load the UCI Diabetes Dataset and consider the *BMI* and *BP* features, denoted $x_1$ and $x_2$, respectively. The diabetes risk score is the target, denoted $y$. We assume the linear model has the form

$$ \bar{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2.$$

The mean squared error loss function is

$$J(\theta)=\frac{1}{2}\sum_{i=1}^n(y^{(i)}-\bar{y}^{(i)})^2$$
where $n$ is the number of patients.
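
For reference, differentiating the loss above with respect to each parameter gives the gradient that gradient descent follows. The step size $\alpha$ and the convention $x_0^{(i)} = 1$ are notation assumed here for illustration; neither is defined in the assignment text.

$$\frac{\partial J}{\partial \theta_j} = \sum_{i=1}^n \left(\bar{y}^{(i)} - y^{(i)}\right) x_j^{(i)}, \quad j = 0, 1, 2, \qquad \theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}.$$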

- (1.a) (+25 pt) Use the gradient descent algorithm to find the optimal $\theta_1$ and $\theta_2$.

- (1.b) (+5 pt) Plot the contour map of the loss function.

- (1.c) (+10 pt) Plot the gradient descent path on the contour map starting from two initial guesses, $(\theta_1, \theta_2) = (4,4)$ and $(\theta_1, \theta_2) = (0,0)$. (Plot one point every 100 iterations.)
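
One possible way to prepare the data for the contour map and the two descent paths is sketched below. It is a minimal illustration, not the expected solution. It uses synthetic stand-in data rather than the diabetes data, holds the intercept fixed so that the loss depends only on $(\theta_1, \theta_2)$, and every name in it (`loss`, `gradient`, `gd_path`, `X_demo`, `y_demo`, `theta0_demo`) is hypothetical.

```python
import numpy as np

# Synthetic stand-in data: an assumption for illustration, not the diabetes data.
rng = np.random.default_rng(0)
X_demo = rng.normal(scale=0.05, size=(20, 2))
y_demo = 0.5 + X_demo @ np.array([2.0, 1.0]) + rng.normal(scale=0.01, size=20)
theta0_demo = y_demo.mean()  # intercept held fixed (assumption)


def loss(theta, X, y, theta0):
    """J(theta) = 1/2 * sum of squared residuals, as in the problem statement."""
    residual = y - (theta0 + X @ theta)
    return 0.5 * np.sum(residual ** 2)


def gradient(theta, X, y, theta0):
    """Gradient of J with respect to (theta_1, theta_2)."""
    return X.T @ (theta0 + X @ theta - y)


def gd_path(theta_init, X, y, theta0, step_size=1.0, n_iters=2000, record_every=100):
    """Run gradient descent; keep the start point and every `record_every`-th iterate."""
    theta = np.asarray(theta_init, dtype=float)
    path = [theta.copy()]
    for k in range(1, n_iters + 1):
        theta = theta - step_size * gradient(theta, X, y, theta0)
        if k % record_every == 0:
            path.append(theta.copy())
    return np.array(path)


# Loss on a grid of (theta_1, theta_2) values; rows follow t2, columns follow t1.
t1 = np.linspace(-1.0, 5.0, 61)
t2 = np.linspace(-1.0, 5.0, 61)
J_grid = np.array([[loss(np.array([a, b]), X_demo, y_demo, theta0_demo)
                    for a in t1] for b in t2])

path_a = gd_path([4.0, 4.0], X_demo, y_demo, theta0_demo)
path_b = gd_path([0.0, 0.0], X_demo, y_demo, theta0_demo)

# With matplotlib, the figure could then be drawn as:
#   plt.contour(t1, t2, J_grid, levels=30)
#   plt.plot(path_a[:, 0], path_a[:, 1], "o-")
#   plt.plot(path_b[:, 0], path_b[:, 1], "s-")
```

Recording only every 100th iterate keeps the plotted path readable. The step size and iteration count here are tuned to the synthetic data and would likely need adjusting for the real features.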

In [38]:
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

import pandas as pd
from sklearn import datasets

def f(X, theta):
    """The linear model we are trying to fit.
    
    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional data matrix
    
    Returns:
    y_pred (np.array): n-dimensional vector of predicted targets
    """
    # The constant 0.4638 acts as a fixed intercept theta_0; it is hard-coded, not learned.
    return X.dot(theta) + 0.4638

def mean_squared_error(theta, X, y):
    """The cost function, J, describing the goodness of fit.
    
    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets
    """
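    # Uses the mean instead of the sum in J; this rescales the loss by 1/n without moving the minimizer.
    return 0.5*np.mean((y-f(X, theta))**2)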

def mse_gradient(theta, X, y):
    """The gradient of the cost function.
    
    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets
    
    Returns:
    grad (np.array): d-dimensional gradient of the MSE
    """
    return np.mean((f(X, theta) - y) * X.T, axis=1)

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# # add an extra column of ones
# X['one'] = 1

# Use the last 20 data points and only the bmi and bp features
X_train = X.iloc[-20:].loc[:, ['bmi', 'bp']]
# Rescale the targets by a constant factor of 300
y_train = y.iloc[-20:] / 300

threshold = 1e-6
step_size = 4e-1
theta, theta_prev = np.array([0,0]), np.array([1,1])
opt_pts = [theta]
opt_grads = []
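n_iter = 0  # iteration counter (named to avoid shadowing the built-in iter)

# Iterate until successive parameter vectors differ by less than the threshold
while np.linalg.norm(theta - theta_prev) > threshold:
    if n_iter % 100 == 0:
        print('Iteration %d. MSE: %.6f' % (n_iter, mean_squared_error(theta, X_train, y_train)))
        print(theta)
    theta_prev = theta
    gradient = mse_gradient(theta, X_train, y_train)
    theta = theta_prev - step_size * gradient
    # Keep the history of iterates and gradients for plotting
    opt_pts += [theta]
    opt_grads += [gradient]
    n_iter += 1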

Iteration 0. MSE: 0.027770
[0 0]
Iteration 100. MSE: 0.024186
bmi    0.312557
bp     0.213375
dtype: float64
Iteration 200. MSE: 0.021491
bmi    0.587644
bp     0.392295
dtype: float64
Iteration 300. MSE: 0.019460
bmi    0.830233
bp     0.541622
dtype: float64
Iteration 400. MSE: 0.017926
bmi    1.044618
bp     0.665551
dtype: float64
Iteration 500. MSE: 0.016762
bmi    1.234510
bp     0.767699
dtype: float64
Iteration 600. MSE: 0.015876
bmi    1.403118
bp     0.851184
dtype: float64
Iteration 700. MSE: 0.015199
bmi    1.553217
bp     0.918693
dtype: float64
Iteration 800. MSE: 0.014677
bmi    1.687205
bp     0.972544
dtype: float64
Iteration 900. MSE: 0.014272
bmi    1.807156
bp     1.014731
dtype: float64
Iteration 1000. MSE: 0.013956
bmi    1.914866
bp     1.046973
dtype: float64
Iteration 1100. MSE: 0.013706
bmi    2.011887
bp     1.070753
dtype: float64
Iteration 1200. MSE: 0.013507
bmi    2.099564
bp     1.087343
dtype: float64
Iteration 1300. MSE: 0.013346
bmi    2.179060
bp    

In [32]:
theta

bmi    1.308136
bp     2.050266
dtype: float64

## **Question 2** (Linear regression for 2 nonlinear features, Multivariate Polynomial Regression)

We still use the *BMI* and *BP* features ($x_1, x_2$). However, the model is now

$$ \bar{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2$$

The mean squared error loss function is

$$J(\theta)=\frac{1}{2}\sum_{i=1}^n(y^{(i)}-\bar{y}^{(i)})^2$$
where $n$ is the number of patients.

- (2.a) (+30 pt) Use the gradient descent algorithm to find the optimal $\theta_i$, $i = 0, \dots, 5$.
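
The model above is still linear in $\theta$, so the same gradient descent loop applies once the two features are expanded into six columns. The sketch below illustrates that idea on synthetic stand-in data. The helper names (`poly_features`, `gradient_descent`), the coefficients in `theta_true`, and the step size are all assumptions for illustration. The gradient is averaged over $n$, which only rescales the step size.

```python
import numpy as np


def poly_features(x1, x2):
    """Stack the columns [1, x1, x2, x1*x2, x1^2, x2^2] into an (n, 6) matrix."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])


def squared_error(theta, Phi, y):
    """J(theta) = 1/2 * sum of squared residuals."""
    return 0.5 * np.sum((y - Phi @ theta) ** 2)


def gradient_descent(Phi, y, step_size=0.3, n_iters=5000):
    """Gradient descent on J, with the gradient averaged over the n samples."""
    n = Phi.shape[0]
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ theta - y) / n
        theta = theta - step_size * grad
    return theta


# Synthetic stand-in data (assumption): features in [-1, 1], made-up coefficients.
rng = np.random.default_rng(0)
x1_demo = rng.uniform(-1.0, 1.0, size=50)
x2_demo = rng.uniform(-1.0, 1.0, size=50)
Phi_demo = poly_features(x1_demo, x2_demo)
theta_true = np.array([0.5, 1.0, -2.0, 0.3, 0.7, -0.4])
y_demo = Phi_demo @ theta_true

theta_gd = gradient_descent(Phi_demo, y_demo)
print(theta_gd)
```

On the real diabetes features, which are small in magnitude, the product and squared columns are much smaller still. Convergence may therefore be considerably slower than in this toy setting, and the step size and stopping rule would need tuning.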

## **Question 3** (Linear regression for 2 nonlinear features)

Now, instead of using gradient descent as in Question 2, use the ordinary least squares approach to

- (3.a) (+30 pt) derive the optimal solution $$\theta^* = ( \theta_0, \theta_1,\theta_2,\theta_3,\theta_4,\theta_5)$$

$\text{Hint}^*:$ Consider the data matrix whose rows are the feature vectors $\bar{x}^{(i)}$:

$$ X = \begin{bmatrix}
- & \bar{x}^{(1)} & - \\
- & \bar{x}^{(2)} & -  \\
- & \vdots  & - \\
- & \bar{x}^{(n)} & - 
\end{bmatrix}
$$

where $\bar{x} = (1, x_1, x_2, x_1x_2, x^2_1, x^2_2)$ is a row vector and $n$ is the number of patients.

- (3.b) (Bonus +5) Is your $\theta^*$ exactly the same as in Question 2? Why (Yes/No)?
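
With the data matrix $X$ from the hint, setting the gradient of $J$ to zero leads to the normal equations $X^\top X \theta = X^\top y$. The sketch below solves them numerically on synthetic stand-in data and cross-checks the result against `np.linalg.lstsq`. It assumes $X$ has full column rank. The helper names and the coefficients in `theta_true` are hypothetical.

```python
import numpy as np


def poly_features(x1, x2):
    """Rows are x_bar = (1, x1, x2, x1*x2, x1^2, x2^2), one per sample."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])


def ols_normal_equations(X, y):
    """Solve (X^T X) theta = X^T y; assumes X has full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Synthetic, noise-free stand-in data (assumption), so the true coefficients are known.
rng = np.random.default_rng(0)
x1_demo = rng.uniform(-1.0, 1.0, size=50)
x2_demo = rng.uniform(-1.0, 1.0, size=50)
X_demo = poly_features(x1_demo, x2_demo)
theta_true = np.array([0.5, 1.0, -2.0, 0.3, 0.7, -0.4])
y_demo = X_demo @ theta_true

theta_ols = ols_normal_equations(X_demo, y_demo)

# Cross-check with NumPy's least-squares solver, which avoids forming X^T X.
theta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(theta_ols)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse. When $X^\top X$ is poorly conditioned, `np.linalg.lstsq` is usually the more robust choice.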