# Simple Linear versus Ridge Regression 

## Step 1:  Getting, understanding, and preprocessing the dataset

We first import the standard libaries and some libraries that will help us scale the data and perform some "feature engineering" by transforming the data into $\Phi_2({\bf x})$

In [14]:
import numpy as np
import sklearn
from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import sklearn.linear_model
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

In [15]:
###  Importing the dataset

In [16]:
# Import the boston dataset from sklearn
# Load dataset to some variable 
# boston_data = .....

boston_data = load_boston()



In [17]:
#  Create X and Y variables - X holding the .data and Y holding .target 
# X = boston_data.....
# y = boston_data.....
X, y = load_boston(return_X_y=True)
y = y.reshape((-1,1))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print('The number of features is: ', X.shape[1])
print('The features: ', boston_data.feature_names)
print('The number of exampels in our dataset: ', X.shape[0])
print(X[0:2])
#  Reshape Y to be a rank 2 matrix using y.reshape()

# Observe the number of features and the number of labels
# print('The number of features is: ', X.shape[1])
# Printing out the features
# print('The features: ', boston_data.feature_names)
# The number of examples
# print('The number of exampels in our dataset: ', X.shape[0])
# Observing the first 2 rows of the data
# print(X[0:2])


(404, 13)
(404, 1)
The number of features is:  13
The features:  ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
The number of exampels in our dataset:  506
[[6.3200e-03 1.8000e+01 2.3100e+00 0.0000e+00 5.3800e-01 6.5750e+00
  6.5200e+01 4.0900e+00 1.0000e+00 2.9600e+02 1.5300e+01 3.9690e+02
  4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 0.0000e+00 4.6900e-01 6.4210e+00
  7.8900e+01 4.9671e+00 2.0000e+00 2.4200e+02 1.7800e+01 3.9690e+02
  9.1400e+00]]


We will also create polynomial feeatures for the dataset to test linear and ridge regression on data with d = 1 and data with d = 2. Feel free to increase the # of degress and see what effect it has on the training and test error. 

In [18]:
# Create a PolynomialFeatures object with degree = 2. Using PolynomialFeatures(degree=2)
# Transform X and save it into X_2 using poly.fit_transform(X)
# Simply copy Y into Y_2 

# X_2 = ....
# y_2 = ....

poly = PolynomialFeatures(2)
X_2 = poly.fit_transform(X)
y_2 = y


In [19]:
# the shape of X_2 and Y_2 - should be (506, 105) and (506, 1) respectively
print(X_2.shape)
print(y_2.shape)
# print(X[:,0])
# plt.figure()
# plt.scatter(X[:,0],y)
# plt.show()

(506, 105)
(506, 1)


# Your code goes here

In [20]:
# Define the get_coeff_ridge_normaleq function. Use the normal equation method.
# Return w values

def get_coeff_ridge_normaleq(X_train, y_train, alpha):
    # use np.linalg.pinv(...)
    m,n = X_train.shape
    I = np.eye(n)
    w = np.dot(np.linalg.pinv(np.dot(X_train.T, X_train)+alpha*I), np.dot(X_train.T, y_train))

    return w

In [21]:
# Define the get_coeff_ridge_normaleq function. Use the normal equation method.
# Return w values

def get_coeff_linear_normaleq(X_train, y_train):
    # use np.linalg.pinv(...)
    w = np.dot(np.linalg.pinv(np.dot(X_train.T, X_train)), np.dot(X_train.T, y_train))
    return w



In [22]:
# Define the evaluate_err_ridge function.
# Return the train_error and test_error values
#     return train_error, test_error

def evaluate_err(X_train, X_test, y_train, y_test, w): 

    y_pred=np.dot(X_train,w)
    MSE_train = np.mean((y_pred-y_train)**2)
    
    y_pred=np.dot(X_test,w)
    MSE_test = np.mean((y_pred-y_test)**2)
    return MSE_test , MSE_train
    


In [23]:
# Finish writting the k_fold_cross_validation function. 
# Returns the average training error and average test error from the k-fold cross validation
# Sklearns K-Folds cross-validator: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

def k_fold_cross_validation(k, X, y, alpha=None):
    kf = KFold(n_splits=k, random_state=21, shuffle=True)
    total_E_val_test = 0
    total_E_val_train = 0
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        # Centering the data so we do not need the intercept term (we could have also chose w_0=average y value)
        
        # Subtract y_train_mean from y_train and y_test
        y_train_mean=np.mean(y_train)
        # y_train_mean = ...
        # y_train = ...
        # y_test = ...
        y_train= y_train-y_train_mean
        y_test=y_test-y_train_mean
        # Scaling the data matrix
        # Using scaler=preprocessing.StandardScaler().fit(...)
        # And scaler.transform(...)
        # X_train = 
        # X_test =
        scaler = preprocessing.StandardScaler().fit(X_train)
        X_train=scaler.transform(X_train)
        X_test=scaler.transform(X_test)
        # Determine the training error and the test error
        # Use get_coeff_linear_normaleq or get_coeff_ridge_normaleq to get w
        # And use evaluate_err()
        if alpha==None:
            w=get_coeff_linear_normaleq(X_train,y_train)
        else:
#             print(alpha,'alpha testing')
            w=get_coeff_ridge_normaleq(X_train, y_train, alpha)
        
        total_E_val_test, total_E_val_train=evaluate_err(X_train, X_test, y_train, y_test, w)
        
       ##############
    return  total_E_val_test, total_E_val_train
    


In [24]:
# print the error for the both linear regression and ridge regression
# the error should include both training error and testing error

In [25]:
# use the model to predict the new test case.
def predict(X_test):
    w_simple= get_coeff_linear_normaleq(X_train,y_train)
    w_ridge = get_coeff_ridge_normaleq(X_train, y_train, 10)
    predict_simple = np.dot(X_test,w_simple)
    predict_ridge = np.dot(X_test,w_ridge)
    return predict_simple,predict_ridge

In [26]:
# print(predict(X_test))
alpha =  np.logspace(1, 7, num=13)
print('Ridge Linear Regression Error')
print()
for i in alpha:
    print('                         Alpha Value: ',i)
    print()
    total_E_val_test, total_E_val_train = k_fold_cross_validation(5, X, y, i)
    print('Test Error: ',total_E_val_test, '             Training Error: ',total_E_val_train)
    print()
    

print('Simple Linear Regression Error')
print()
total_E_val_test, total_E_val_train = k_fold_cross_validation(5, X, y)
print('Test Error: ',total_E_val_test, '             Training Error: ',total_E_val_train)

print()
print('                            Polynomial features')
print()

print('Ridge Linear Regression Error')
print()
for i in alpha:
    print('                         Alpha Value: ',i)
    print()
    total_E_val_test, total_E_val_train = k_fold_cross_validation(5, X_2, y_2, i)
    print('Test Error: ',total_E_val_test, '             Training Error: ',total_E_val_train)
    print()
    

print('Simple Linear Regression Error')
print()
total_E_val_test, total_E_val_train = k_fold_cross_validation(5, X_2, y_2)
print('Test Error: ',total_E_val_test, '             Training Error: ',total_E_val_train)

Ridge Linear Regression Error

                         Alpha Value:  10.0

Test Error:  28.51372751066697              Training Error:  20.72314093498285

                         Alpha Value:  31.622776601683793

Test Error:  29.00483225377633              Training Error:  21.138364993622243

                         Alpha Value:  100.0

Test Error:  30.484127136312544              Training Error:  22.682471987114713

                         Alpha Value:  316.22776601683796

Test Error:  34.91174932308663              Training Error:  27.496108333619837

                         Alpha Value:  1000.0

Test Error:  45.35910577785945              Training Error:  38.547243209998605

                         Alpha Value:  3162.2776601683795

Test Error:  61.023468357328404              Training Error:  54.47066705725698

                         Alpha Value:  10000.0

Test Error:  75.87357279068969              Training Error:  69.33026881273167

                         Alpha Value:  3



# Linear Regression Report

If given a choice to implement one of the above models to predict housing prices, I would definitely choose to transform my data using polynomial features. This allows the model to identify non-linear patterns. My test error was the least when I tested my model with polynomial regression. In terms of simple linear or ridge linear regression, for most of the alpha values my model was overfitting. Therefore I concluded that its better to use simple linear regression because it had the lowest test error. 

## To run the program:

1. Run the last block of code to display all testing and training erros using Kfold validation.
2. Run the predict function to predict all the house prices for the testing set. Function returns predicted values using simple and ridge regression.

## Deatiled description:

1. I import tha dataset and and split it.
2. I make a instance of the data with polynomial features.
3. In the kfold cross validation function, I use sklearn's built in kfold function. I run the simple regression and the ridge regression closed form solutions on the outputs.
4. I run evauate error to return the error.
5. I run the kfold function and iterate through every alpha value and print out the errors.


