## 2 Regression Models

In this problem, your goal is predicting the final price of a house based on a number of explanatory variables.

(a) In many settings, our data contains variables with missing values. Search online and find suitable ways to impute missing values in a dataset. Use one of these method to treat such variables in the data.

(b) Name two categorical features in the data, and choose ten features from the dataset that you think will be most predictive of the outcome.

(c) Transform the categorical features in your chosen set to make them suitable for modeling.

(d) Apply ridge regression to the data and find the best value of the regu- larization parameter using cross-validation. Report the RMSE of your best model both in training and validation sets. Do you see any over- fitting in your model?

(e) Applyk-nearestneighborsregression,findthebestkusingcross-validation, and report the RMSE of your best model in training set and from cross- validation. Do you see any overfitting in your model?

(f) Compute the RMSE of your best model from previous steps in the test set.

In [143]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics

np.random.seed(0)

In [144]:
test_data = pd.read_csv("./test.csv")
train_data = pd.read_csv("./train.csv")

In [145]:
test_data.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [146]:
train_data.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


### PART A

In [147]:
# Mean Imputation: Used for numerical variable to replace missing values with the average
# Mode Imputation: Used for categorical variables to replace missing values with the most frequent value

# Mean imputation for numerical columns in train data
num_columns_train = train_data.select_dtypes(include='number').columns
for col in num_columns_train:
    train_data[col] = train_data[col].fillna(train_data[col].mean())

# Mean imputation for numerical columns in test data
num_columns_test = test_data.select_dtypes(include='number').columns
for col in num_columns_test:
    test_data[col] = test_data[col].fillna(test_data[col].mean())

# Mode imputation for categorical columns in train data
cat_columns_train = train_data.select_dtypes(include='object').columns
for col in cat_columns_train:
    train_data[col] = train_data[col].fillna(train_data[col].mode()[0])

# Mode imputation for categorical columns in test data
cat_columns_test = test_data.select_dtypes(include='object').columns
for col in cat_columns_test:
    test_data[col] = test_data[col].fillna(test_data[col].mode()[0])

In [148]:
# Test data post imputing missing values
test_data.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,Grvl,Reg,Lvl,AllPub,...,120,0,Ex,MnPrv,Shed,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,Grvl,IR1,Lvl,AllPub,...,0,0,Ex,MnPrv,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,Grvl,IR1,Lvl,AllPub,...,0,0,Ex,MnPrv,Shed,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,Grvl,IR1,Lvl,AllPub,...,0,0,Ex,MnPrv,Shed,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,Grvl,IR1,HLS,AllPub,...,144,0,Ex,MnPrv,Shed,0,1,2010,WD,Normal


In [149]:
# Train data post imputing missing values
train_data.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,Grvl,Reg,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,Grvl,Reg,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,Grvl,IR1,Lvl,AllPub,...,0,Gd,MnPrv,Shed,0,12,2008,WD,Normal,250000


### PART B
2 Categorical Features: 

*     CentralAir
*     Neighborhood

10 Other Features: 

*     OverallQual: Overall material and finish quality.
*     GrLivArea: Above grade (ground) living area square feet.
*     GarageCars: Size of garage in car capacity.
*     TotalBsmtSF: Total square feet of basement area.
*     YearBuilt: Original construction date.
*     FullBath: Number of full bathrooms above grade.
*     YearRemodAdd: Remodel date (same as construction date if no remodeling or additions).
*     GarageArea: Size of garage in square feet.
*     BedroomAbvGr: Number of bedrooms above grade.
*     LotArea: Lot size in square feet.


In [150]:
# Filter test and train data for variables specified above
features_train = ['Neighborhood', 'CentralAir', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF',
            'YearBuilt', 'FullBath', 'YearRemodAdd', 'GarageArea', 'LotArea', 'BedroomAbvGr', 'SalePrice']
features_test = ['Neighborhood', 'CentralAir', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF',
            'YearBuilt', 'FullBath', 'YearRemodAdd', 'GarageArea', 'LotArea', 'BedroomAbvGr']

filtered_train_data = train_data[features_train]
filtered_test_data = test_data[features_test]

filtered_train_data.head(5)

Unnamed: 0,Neighborhood,CentralAir,OverallQual,GrLivArea,GarageCars,TotalBsmtSF,YearBuilt,FullBath,YearRemodAdd,GarageArea,LotArea,BedroomAbvGr,SalePrice
0,CollgCr,Y,7,1710,2,856,2003,2,2003,548,8450,3,208500
1,Veenker,Y,6,1262,2,1262,1976,2,1976,460,9600,3,181500
2,CollgCr,Y,7,1786,2,920,2001,2,2002,608,11250,3,223500
3,Crawfor,Y,7,1717,3,756,1915,1,1970,642,9550,3,140000
4,NoRidge,Y,8,2198,3,1145,2000,2,2000,836,14260,4,250000


In [151]:
# Convert categorical features to 0-1 indicator variables
filtered_train_data = pd.get_dummies(
    filtered_train_data,
    columns = ["Neighborhood", "CentralAir"],
    dtype = int,
    drop_first = True
)

filtered_test_data = pd.get_dummies(
    filtered_test_data,
    columns = ["Neighborhood", "CentralAir"],
    dtype = int,
    drop_first = True
)

X = filtered_train_data.drop("SalePrice", axis = 1)
y = filtered_train_data.SalePrice

print("Shape of X is: ", X.shape)
print("Shape of y is: ", y.shape)

filtered_train_data.head(5)

print(len(filtered_test_data))
print(len(filtered_train_data))

Shape of X is:  (1460, 35)
Shape of y is:  (1460,)
1459
1460


### PART D

In [124]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV

# Split the data in training and test sets for cross validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    shuffle = True, random_state = 0)

# Find the best alpha, regularization parameter
alphas = np.logspace(-3, 3, 100)
ridge_cv = RidgeCV(alphas = alphas, cv = 5)
ridge_cv.fit(X_train, y_train)
best_alpha = ridge_cv.alpha_
print("Best value of regularization parameter (alpha):", best_alpha)

# Ridge Regression using best alpha
ridge_model = Ridge(alpha = best_alpha)
ridge_model.fit(X_train, y_train)

#Predictions on train set
preds_y_train = ridge_model.predict(X_train)

# Compute RMSE on train set
rmse_train = np.sqrt(np.mean((preds_y_train - y_train)**2))
print("RMSE on train set (Ridge Regression):", rmse_train)

# Predictions on test set
preds_y_test = ridge_model.predict(X_test)

# Compute RMSE on test set
rmse_test = np.sqrt(np.mean((preds_y_test - y_test)**2))
print("RMSE on test set (Ridge Regression):", rmse_test)

Best value of regularization parameter (alpha): 1.873817422860385
RMSE on train set (Ridge Regression): 31059.456472807098
RMSE on test set (Ridge Regression): 45666.7648393596


To check for overfitting we compare the RMSE on the train set and the RMSE on the test set. In this case, the rmse_train value is much lower than the rmse_test value, which indicates overfitting, so the model fits the current data well, but does not generalize well to new data points.  

### PART E

In [125]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Split the data in training and test sets for cross validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    shuffle = True, random_state = 0)

# Define a range of k values
k_values = range(1, 50)

# Define the parameter grid
param_grid = {'n_neighbors': k_values}

# Instantiate the KNN regressor
knn = KNeighborsRegressor()

# Grid search with cross-validation to find the best k
grid_search = GridSearchCV(knn, param_grid, cv = 5)
grid_search.fit(X_train, y_train)

# Best k value
best_k = grid_search.best_params_['n_neighbors']
print("Best value of n neighbors (k):", best_k)

# KNeighbors using best k value
knn_model = KNeighborsRegressor(n_neighbors = best_k)
knn_model.fit(X_train, y_train)

#Predictions on train set
preds_y_train = knn_model.predict(X_train)

# Compute RMSE on train set
rmse_train = np.sqrt(np.mean((preds_y_train - y_train)**2))
print("RMSE on train set (KNN):", rmse_train)

# Predictions on test set
preds_y_test = ridge_model.predict(X_test)

# Compute RMSE on test set
rmse_test = np.sqrt(np.mean((preds_y_test - y_test)**2))
print("RMSE on test set (KNN):", rmse_test)

Best value of n neighbors (k): 4
RMSE on train set (KNN): 35221.76194139594
RMSE on test set (KNN): 45666.7648393596


To check for overfitting we compare the RMSE on the train set and the RMSE on the test set. In this case, the rmse_train value is much lower than the rmse_test value, which indicates overfitting, so the model fits the current data well, but does not generalize well to new data points. Although the model is still overfitting, there is an improvement in using KNN vs. Ridge Regression. 

### PART F

In [160]:
# Split the data in training and test sets for cross validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    shuffle = True, random_state = 0)

# Apply Ridge Regression to test data
# Train model on entire training set and use this to make predictions for test set
ridge_model.fit(X_train, y_train)
test_preds_ridge = ridge_model.predict(filtered_test_data)
print("Predictions for test data using Ridge Regression: ", test_preds_ridge)

rmse_test_ridge = np.sqrt(np.mean((test_preds_ridge - (y[:-1]))**2))
print("RMSE on test set (KNN):", rmse_test_ridge)

#Apply KNN Regression to test data
# Train model on entire training set and use this to make predictions for test set
knn_model.fit(X_train, y_train)
test_preds_knn = knn_model.predict(filtered_test_data)
print("Predictions for test data using KNN Regression: ", test_preds_knn)

rmse_test_knn = np.sqrt(np.mean((test_preds_knn - (y[:-1]))**2))
print("RMSE on test set (KNN):", rmse_test_knn)

Predictions for test data using Ridge Regression:  [127875.12524657 159423.70302033 175086.68843113 ... 149532.12677116
 109118.87456701 221650.57678359]
RMSE on test set (KNN): 107658.26113713578
Predictions for test data using KNN Regression:  [143825.  199883.  179575.  ... 202742.5 107750.  250325. ]
RMSE on test set (KNN): 102023.67660987082
