<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2 - Ames Housing Data and Kaggle Challenge

# Part 3: Model Validation

<h3 style='text-align: justify;'> This section validates 5 different models. The train dataset is split into a training set and a holdout set for model validation. Data preprocessing consists of one hot encoding the nominal features and standard scaling the continuous and ordinal features, and is applied before fitting the models. Each model is cross-validated on the training set before being validated against the holdout set. The root mean square error (RMSE) is the metric used to compare the performance of the models. </h3>

### Contents:
* [Organisation of Notebooks](#Organisation-of-Notebooks)
* [Data Dictionary](#Data-Dictionary)
* [Import Libraries and Dataset](#Import-Libraries-and-Dataset)
* [Functions for One Hot Encoding, Standard Scaling and Subsetting Dataframe into Feature Type](#Functions-for-One-Hot-Encoding,-Standard-Scaling-and-Subsetting-Dataframe-into-Feature-Type)
* [One Hot Encoding on Nominal Features](#One-Hot-Encoding-on-Nominal-Features)
* [Standard Scale on the Continuous and Ordinal Features](#Standard-Scale-on-the-Continuous-and-Ordinal-Features)
* [Model Validation](#Model-Validation)
    * [Simple Model](#Simple-Model)
    * [Linear Regression](#Linear-Regression)
    * [Ridge Regression](#Ridge-Regression)
    * [Lasso Regression](#Lasso-Regression)
    * [ElasticNet Regression](#ElasticNet-Regression)
* [Conclusion](#Conclusion)
* [Summary](#Summary)

## Organisation of Notebooks
1. [Introduction](./01_Introduction.ipynb)
2. [Data Preprocessing and EDA](./02_EDA_DataPreprocessing_FeatureEngineering.ipynb)
3. Model Validation
4. [Model Testing with Kaggle Dataset](./04_ModelTesting.ipynb)

## Data Dictionary

| Feature | Feature Type | Data Type | Dataset | Description |
| --- | --- | --- | --- | --- |
| Total Bsmt SF | Continuous | Float | Train/Test | Total Basement Area in Square Feet |
| Gr Liv Area | Continuous | Integer | Train/Test | Total Living Area above Ground in Square Feet |
| Overall Qual | Ordinal | Integer | Train/Test | Rating of Overall Material and Finish of the House |
| Exter Qual | Ordinal | Integer | Train/Test | Quality of Exterior Material |
| Heating QC | Ordinal | Integer | Train/Test | Quality and Condition of Heating |
| Kitchen Qual | Ordinal | Integer | Train/Test | Kitchen Quality |
| Garage Finish | Ordinal | Integer | Train/Test | Interior Finish of Garage |
| MS SubClass | Nominal | Object | Train/Test | Type of Dwelling Sold |
| MS Zoning | Nominal | Object | Train/Test | General Zoning Classification |
| Lot Config | Nominal | Object | Train/Test | Lot Configuration |
| Neighborhood | Nominal | Object | Train/Test | Physical Locations within Ames City |
| House Style | Nominal | Object | Train/Test | Style of Dwelling |
| Roof Style | Nominal | Object | Train/Test | Roof Type |
| Exterior 1st | Nominal | Object | Train/Test | Exterior Covering of the House |
| Exterior 2nd | Nominal | Object | Train/Test | Exterior covering the House if more than 1 Material |
| Mas Vnr Type | Nominal | Object | Train/Test | Masonry Veneer Type |
| Foundation | Nominal | Object | Train/Test | Foundation Type |
| Garage Type | Nominal | Object | Train/Test | Garage Location |
| SalePrice | Continuous | Integer | Train | Sale Price ($$) |

## Import Libraries and Dataset

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [2]:
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

In [3]:
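# load the train set that was filtered in Part 2 (Data Preprocessing and EDA)
# na_filter=False disables NaN detection, so strings such as 'NA' are kept as categories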
train = pd.read_csv('./datasets/train_filtered.csv', na_filter=False)

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Total_Bsmt_SF  2051 non-null   float64
 1   Gr_Liv_Area    2051 non-null   int64  
 2   Overall_Qual   2051 non-null   int64  
 3   Exter_Qual     2051 non-null   int64  
 4   Heating_QC     2051 non-null   int64  
 5   Kitchen_Qual   2051 non-null   int64  
 6   Garage_Finish  2051 non-null   int64  
 7   MS_SubClass    2051 non-null   object 
 8   MS_Zoning      2051 non-null   object 
 9   Lot_Config     2051 non-null   object 
 10  Neighborhood   2051 non-null   object 
 11  House_Style    2051 non-null   object 
 12  Roof_Style     2051 non-null   object 
 13  Exterior_1st   2051 non-null   object 
 14  Exterior_2nd   2051 non-null   object 
 15  Mas_Vnr_Type   2051 non-null   object 
 16  Foundation     2051 non-null   object 
 17  Garage_Type    2051 non-null   object 
 18  SalePric

In [5]:
train.isna().sum()

Total_Bsmt_SF    0
Gr_Liv_Area      0
Overall_Qual     0
Exter_Qual       0
Heating_QC       0
Kitchen_Qual     0
Garage_Finish    0
MS_SubClass      0
MS_Zoning        0
Lot_Config       0
Neighborhood     0
House_Style      0
Roof_Style       0
Exterior_1st     0
Exterior_2nd     0
Mas_Vnr_Type     0
Foundation       0
Garage_Type      0
SalePrice        0
dtype: int64

In [6]:
# list the columns that contain at least one null value; wrap in len() to count them
# .any() flags columns where any row is null, .all() flags columns where every row is null
train.columns[train.isnull().any()]

Index([], dtype='object')
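Because the CSV is read with `na_filter=False`, pandas keeps empty fields as empty strings instead of converting them to NaN, so `isna()` and `isnull()` report zero missing values for text columns regardless of their content. A minimal sketch of a complementary check for blank strings, run on a made-up two-column CSV rather than the Ames data:

```python
import io

import pandas as pd

# Hypothetical CSV snippet (not the Ames data): the second row has an empty Garage_Type field
csv_text = "Garage_Type,Gr_Liv_Area\nAttchd,1358\n,1973\n"

# With na_filter=False the empty field is kept as an empty string, not NaN
toy = pd.read_csv(io.StringIO(csv_text), na_filter=False)

# isna() sees nothing missing
nan_counts = toy.isna().sum()

# Count blank (empty or whitespace-only) strings in the text column instead
blank_count = toy["Garage_Type"].str.strip().eq("").sum()

print(nan_counts)
print(blank_count)
```

The same `str.strip().eq("")` check could be applied to each nominal column of `train` if blank entries are a concern.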

In [7]:
cont_features = [
    'Total_Bsmt_SF', 
    'Gr_Liv_Area'
]

ordinal_features = [
    'Overall_Qual', 
    'Exter_Qual', 
    'Heating_QC', 
    'Kitchen_Qual', 
    'Garage_Finish'
]

nominal_features = [
    'MS_SubClass', 
    'MS_Zoning', 
    'Lot_Config', 
    'Neighborhood',
    'House_Style',
    'Roof_Style',
    'Exterior_1st',
    'Exterior_2nd',
    'Mas_Vnr_Type',
    'Foundation',
    'Garage_Type'
]

In [8]:
selected_features = [cont_features, ordinal_features, nominal_features]

In [9]:
selected_features = list(np.concatenate(selected_features).flat)
selected_features

['Total_Bsmt_SF',
 'Gr_Liv_Area',
 'Overall_Qual',
 'Exter_Qual',
 'Heating_QC',
 'Kitchen_Qual',
 'Garage_Finish',
 'MS_SubClass',
 'MS_Zoning',
 'Lot_Config',
 'Neighborhood',
 'House_Style',
 'Roof_Style',
 'Exterior_1st',
 'Exterior_2nd',
 'Mas_Vnr_Type',
 'Foundation',
 'Garage_Type']

## Functions for One Hot Encoding, Standard Scaling and Subsetting Dataframe into Feature Type

### Define helper functions to one hot encode features, standard scale features and subset the dataframe by feature type

In [10]:
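def ohe_features(df_train, df_test, features, ohe):
    # fit the encoder on the training data only, then reuse it on the test data
    encoded_train = ohe.fit_transform(df_train[features])
    # get_feature_names_out (scikit-learn >= 1.0) replaces get_feature_names,
    # which was removed in scikit-learn 1.2
    encoded_features = ohe.get_feature_names_out(features)
    encoded_test = ohe.transform(df_test[features])
    return encoded_train, encoded_features, encoded_test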

In [11]:
def subset_features(df, feature_type):
    return df[feature_type]

In [12]:
def standard_scale_features(df_train, df_test, features, ss):
    # fit the scaler on the training data only, then apply the same scaling to the test data
    df_train_sc = ss.fit_transform(df_train[features])
    df_test_sc = ss.transform(df_test[features])
    return df_train_sc, df_test_sc

## One Hot Encoding on Nominal Features

### Train test split on the train dataset before one hot encoding 

In [13]:
X = train[selected_features]
y = train['SalePrice']

In [14]:
# X.info()

In [15]:
# test size set as 20% of train data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# X_train.head()

### One hot encode on the train and test data

In [17]:
# instantiate the encoder
# sparse=False -> return a dense array (the argument is named sparse_output from scikit-learn 1.2)
# handle_unknown='ignore' -> categories not seen during fit are encoded as all zeros
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
# one hot encode the nominal features
# returns encoded nominal train data, encoded nominal feature names, encoded nominal test data
encoded_nominal_data, encoded_nominal_features, encoded_nominal_data_test = ohe_features(X_train, X_test, nominal_features, ohe)
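To illustrate what `handle_unknown='ignore'` does, here is a minimal sketch on toy data (hypothetical values, not the Ames rows). A category that appears only in the test frame is encoded as a row of zeros instead of raising an error, and the encoded columns are defined by the training frame alone.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (hypothetical values): 'Shed' appears only in the test frame
toy_train = pd.DataFrame({"Roof_Style": ["Gable", "Hip", "Gable"]})
toy_test = pd.DataFrame({"Roof_Style": ["Hip", "Shed"]})

# The default sparse output is converted with .toarray(), which avoids the
# sparse / sparse_output argument whose name depends on the scikit-learn version
toy_ohe = OneHotEncoder(handle_unknown="ignore")
toy_train_enc = toy_ohe.fit_transform(toy_train).toarray()
toy_test_enc = toy_ohe.transform(toy_test).toarray()

print(toy_ohe.categories_)  # categories learned from the training frame only
print(toy_test_enc)         # the 'Shed' row contains only zeros
```

This is why the train and test frames built below share the same encoded columns even if a category is missing from one of them.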

In [18]:
encoded_nominal_data, encoded_nominal_features, encoded_nominal_data_test

(array([[1., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 1., 0.],
        [0., 0., 1., ..., 0., 0., 0.]]),
 array(['MS_SubClass_high', 'MS_SubClass_low', 'MS_SubClass_mid',
        'MS_Zoning_A (agr)', 'MS_Zoning_C (all)', 'MS_Zoning_FV',
        'MS_Zoning_RH', 'MS_Zoning_RL', 'MS_Zoning_RM',
        'Lot_Config_Corner', 'Lot_Config_CulDSac', 'Lot_Config_FR2',
        'Lot_Config_FR3', 'Lot_Config_Inside', 'Neighborhood_high',
        'Neighborhood_low', 'Neighborhood_mid', 'House_Style_1.5Fin',
        'House_Style_1.5Unf', 'House_Style_1Story', 'House_Style_2.5Fin',
        'House_Style_2.5Unf', 'House_Style_2Story', 'House_Style_SFoyer',
        'House_Style_SLvl', 'Roof_Style_Flat', 'Roof_Style_Gable',
        'Roof_Style_Gambrel', 'Roof_Style_Hip', 'Roof_Style_Mansard',
        'Roof_Style_Shed', 'Exterior_1st_AsbShng', 'Exterior_1st_AsphSh

In [19]:
# put the encoded nominal data and encoded nominal features into a df
train_nominal_encoded = pd.DataFrame(data=encoded_nominal_data , columns=encoded_nominal_features)
test_nominal_encoded = pd.DataFrame(data=encoded_nominal_data_test , columns=encoded_nominal_features)

In [20]:
train_nominal_encoded.head()

Unnamed: 0,MS_SubClass_high,MS_SubClass_low,MS_SubClass_mid,MS_Zoning_A (agr),MS_Zoning_C (all),MS_Zoning_FV,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Lot_Config_Corner,Lot_Config_CulDSac,Lot_Config_FR2,Lot_Config_FR3,Lot_Config_Inside,Neighborhood_high,Neighborhood_low,Neighborhood_mid,House_Style_1.5Fin,House_Style_1.5Unf,House_Style_1Story,House_Style_2.5Fin,House_Style_2.5Unf,House_Style_2Story,House_Style_SFoyer,House_Style_SLvl,Roof_Style_Flat,Roof_Style_Gable,Roof_Style_Gambrel,Roof_Style_Hip,Roof_Style_Mansard,Roof_Style_Shed,Exterior_1st_AsbShng,Exterior_1st_AsphShn,Exterior_1st_BrkComm,Exterior_1st_BrkFace,Exterior_1st_CBlock,Exterior_1st_CemntBd,Exterior_1st_HdBoard,Exterior_1st_ImStucc,Exterior_1st_MetalSd,Exterior_1st_Plywood,Exterior_1st_Stone,Exterior_1st_Stucco,Exterior_1st_VinylSd,Exterior_1st_Wd Sdng,Exterior_1st_WdShing,Exterior_2nd_AsbShng,Exterior_2nd_AsphShn,Exterior_2nd_Brk Cmn,Exterior_2nd_BrkFace,Exterior_2nd_CBlock,Exterior_2nd_CmentBd,Exterior_2nd_HdBoard,Exterior_2nd_ImStucc,Exterior_2nd_MetalSd,Exterior_2nd_Plywood,Exterior_2nd_Stone,Exterior_2nd_Stucco,Exterior_2nd_VinylSd,Exterior_2nd_Wd Sdng,Exterior_2nd_Wd Shng,Mas_Vnr_Type_BrkCmn,Mas_Vnr_Type_BrkFace,Mas_Vnr_Type_None,Mas_Vnr_Type_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Garage_Type_2Types,Garage_Type_Attchd,Garage_Type_Basment,Garage_Type_BuiltIn,Garage_Type_CarPort,Garage_Type_Detchd,Garage_Type_NA
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [21]:
test_nominal_encoded.head()

Unnamed: 0,MS_SubClass_high,MS_SubClass_low,MS_SubClass_mid,MS_Zoning_A (agr),MS_Zoning_C (all),MS_Zoning_FV,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Lot_Config_Corner,Lot_Config_CulDSac,Lot_Config_FR2,Lot_Config_FR3,Lot_Config_Inside,Neighborhood_high,Neighborhood_low,Neighborhood_mid,House_Style_1.5Fin,House_Style_1.5Unf,House_Style_1Story,House_Style_2.5Fin,House_Style_2.5Unf,House_Style_2Story,House_Style_SFoyer,House_Style_SLvl,Roof_Style_Flat,Roof_Style_Gable,Roof_Style_Gambrel,Roof_Style_Hip,Roof_Style_Mansard,Roof_Style_Shed,Exterior_1st_AsbShng,Exterior_1st_AsphShn,Exterior_1st_BrkComm,Exterior_1st_BrkFace,Exterior_1st_CBlock,Exterior_1st_CemntBd,Exterior_1st_HdBoard,Exterior_1st_ImStucc,Exterior_1st_MetalSd,Exterior_1st_Plywood,Exterior_1st_Stone,Exterior_1st_Stucco,Exterior_1st_VinylSd,Exterior_1st_Wd Sdng,Exterior_1st_WdShing,Exterior_2nd_AsbShng,Exterior_2nd_AsphShn,Exterior_2nd_Brk Cmn,Exterior_2nd_BrkFace,Exterior_2nd_CBlock,Exterior_2nd_CmentBd,Exterior_2nd_HdBoard,Exterior_2nd_ImStucc,Exterior_2nd_MetalSd,Exterior_2nd_Plywood,Exterior_2nd_Stone,Exterior_2nd_Stucco,Exterior_2nd_VinylSd,Exterior_2nd_Wd Sdng,Exterior_2nd_Wd Shng,Mas_Vnr_Type_BrkCmn,Mas_Vnr_Type_BrkFace,Mas_Vnr_Type_None,Mas_Vnr_Type_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Garage_Type_2Types,Garage_Type_Attchd,Garage_Type_Basment,Garage_Type_BuiltIn,Garage_Type_CarPort,Garage_Type_Detchd,Garage_Type_NA
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [22]:
# combine the continuous and ordinal features for standard scaling
cont_ordinal_features = list(np.concatenate([cont_features, ordinal_features]).flat)
cont_ordinal_features

['Total_Bsmt_SF',
 'Gr_Liv_Area',
 'Overall_Qual',
 'Exter_Qual',
 'Heating_QC',
 'Kitchen_Qual',
 'Garage_Finish']

In [23]:
# subset the continuous and ordinal features
# they are used twice: unscaled, merged with the encoded nominal features for linear regression,
# and standard scaled, then merged with the encoded nominal features
# for the ridge, lasso and enet regression models
# reset the index so the rows align with the encoded nominal features when merging
train_cont_ordinal = subset_features(X_train, cont_ordinal_features).reset_index(drop=True)
test_cont_ordinal = subset_features(X_test, cont_ordinal_features).reset_index(drop=True)

In [24]:
# X_train and X_test to be used for linear regression model
X_train = train_cont_ordinal.merge(train_nominal_encoded, how='left', left_index=True, right_index=True)
X_test = test_cont_ordinal.merge(test_nominal_encoded, how='left', left_index=True, right_index=True)
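The index reset matters because `train_test_split` keeps the shuffled row labels of the original dataframe, while a dataframe built from the encoder's array starts again at 0. A minimal sketch with toy stand-ins (hypothetical values) shows what an index-aligned merge does with and without the reset, and that `pd.concat` along columns is an equivalent alternative once the indexes match:

```python
import pandas as pd

# Toy stand-ins (hypothetical values): the numeric rows keep shuffled labels from a split,
# while the encoded frame built from a NumPy array starts again at 0
numeric_part = pd.DataFrame({"Gr_Liv_Area": [1358, 1973, 1356]}, index=[532, 7, 1204])
encoded_part = pd.DataFrame({"Roof_Style_Hip": [0.0, 1.0, 0.0]})

# Without resetting the index, no labels match and the left join fills NaN
misaligned = numeric_part.merge(encoded_part, how="left", left_index=True, right_index=True)

# After resetting, the rows line up by position
aligned = numeric_part.reset_index(drop=True).merge(
    encoded_part, how="left", left_index=True, right_index=True
)

# pd.concat along columns gives the same values once both indexes match
concatenated = pd.concat([numeric_part.reset_index(drop=True), encoded_part], axis=1)

print(misaligned)
print(aligned)
```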

In [25]:
X_train.head()

Unnamed: 0,Total_Bsmt_SF,Gr_Liv_Area,Overall_Qual,Exter_Qual,Heating_QC,Kitchen_Qual,Garage_Finish,MS_SubClass_high,MS_SubClass_low,MS_SubClass_mid,MS_Zoning_A (agr),MS_Zoning_C (all),MS_Zoning_FV,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Lot_Config_Corner,Lot_Config_CulDSac,Lot_Config_FR2,Lot_Config_FR3,Lot_Config_Inside,Neighborhood_high,Neighborhood_low,Neighborhood_mid,House_Style_1.5Fin,House_Style_1.5Unf,House_Style_1Story,House_Style_2.5Fin,House_Style_2.5Unf,House_Style_2Story,House_Style_SFoyer,House_Style_SLvl,Roof_Style_Flat,Roof_Style_Gable,Roof_Style_Gambrel,Roof_Style_Hip,Roof_Style_Mansard,Roof_Style_Shed,Exterior_1st_AsbShng,Exterior_1st_AsphShn,Exterior_1st_BrkComm,Exterior_1st_BrkFace,Exterior_1st_CBlock,Exterior_1st_CemntBd,Exterior_1st_HdBoard,Exterior_1st_ImStucc,Exterior_1st_MetalSd,Exterior_1st_Plywood,Exterior_1st_Stone,Exterior_1st_Stucco,Exterior_1st_VinylSd,Exterior_1st_Wd Sdng,Exterior_1st_WdShing,Exterior_2nd_AsbShng,Exterior_2nd_AsphShn,Exterior_2nd_Brk Cmn,Exterior_2nd_BrkFace,Exterior_2nd_CBlock,Exterior_2nd_CmentBd,Exterior_2nd_HdBoard,Exterior_2nd_ImStucc,Exterior_2nd_MetalSd,Exterior_2nd_Plywood,Exterior_2nd_Stone,Exterior_2nd_Stucco,Exterior_2nd_VinylSd,Exterior_2nd_Wd Sdng,Exterior_2nd_Wd Shng,Mas_Vnr_Type_BrkCmn,Mas_Vnr_Type_BrkFace,Mas_Vnr_Type_None,Mas_Vnr_Type_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Garage_Type_2Types,Garage_Type_Attchd,Garage_Type_Basment,Garage_Type_BuiltIn,Garage_Type_CarPort,Garage_Type_Detchd,Garage_Type_NA
0,1358.0,1358,6,4,5,4,2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1935.0,1973,8,4,4,4,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1244.0,1356,6,3,3,3,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1141.0,2263,8,4,4,4,3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,528.0,1855,6,4,5,3,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [26]:
X_test.head()

Unnamed: 0,Total_Bsmt_SF,Gr_Liv_Area,Overall_Qual,Exter_Qual,Heating_QC,Kitchen_Qual,Garage_Finish,MS_SubClass_high,MS_SubClass_low,MS_SubClass_mid,MS_Zoning_A (agr),MS_Zoning_C (all),MS_Zoning_FV,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Lot_Config_Corner,Lot_Config_CulDSac,Lot_Config_FR2,Lot_Config_FR3,Lot_Config_Inside,Neighborhood_high,Neighborhood_low,Neighborhood_mid,House_Style_1.5Fin,House_Style_1.5Unf,House_Style_1Story,House_Style_2.5Fin,House_Style_2.5Unf,House_Style_2Story,House_Style_SFoyer,House_Style_SLvl,Roof_Style_Flat,Roof_Style_Gable,Roof_Style_Gambrel,Roof_Style_Hip,Roof_Style_Mansard,Roof_Style_Shed,Exterior_1st_AsbShng,Exterior_1st_AsphShn,Exterior_1st_BrkComm,Exterior_1st_BrkFace,Exterior_1st_CBlock,Exterior_1st_CemntBd,Exterior_1st_HdBoard,Exterior_1st_ImStucc,Exterior_1st_MetalSd,Exterior_1st_Plywood,Exterior_1st_Stone,Exterior_1st_Stucco,Exterior_1st_VinylSd,Exterior_1st_Wd Sdng,Exterior_1st_WdShing,Exterior_2nd_AsbShng,Exterior_2nd_AsphShn,Exterior_2nd_Brk Cmn,Exterior_2nd_BrkFace,Exterior_2nd_CBlock,Exterior_2nd_CmentBd,Exterior_2nd_HdBoard,Exterior_2nd_ImStucc,Exterior_2nd_MetalSd,Exterior_2nd_Plywood,Exterior_2nd_Stone,Exterior_2nd_Stucco,Exterior_2nd_VinylSd,Exterior_2nd_Wd Sdng,Exterior_2nd_Wd Shng,Mas_Vnr_Type_BrkCmn,Mas_Vnr_Type_BrkFace,Mas_Vnr_Type_None,Mas_Vnr_Type_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Garage_Type_2Types,Garage_Type_Attchd,Garage_Type_Basment,Garage_Type_BuiltIn,Garage_Type_CarPort,Garage_Type_Detchd,Garage_Type_NA
0,911.0,954,5,3,4,3,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,924.0,2157,7,3,5,4,3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,948.0,2088,8,5,5,4,3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,806.0,1647,6,3,3,2,1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,851.0,1737,7,4,5,4,2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


## Standard Scale on the Continuous and Ordinal Features

### Standard scale the continuous and ordinal features to be used for the Ridge, Lasso and ElasticNet regression models
#### Preprocessing is required for the Ridge, Lasso and ElasticNet regression models because they add a penalty on the size of the coefficients to the squared-error loss, with the strength of the penalty set by a hyperparameter. The size of a coefficient depends on the scale of its feature, so the continuous and ordinal features are scaled to let the penalty treat all of them equally.
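As a worked illustration of what standard scaling computes, the sketch below reproduces it by hand with NumPy for a single feature, using a few living-area values from the `head()` outputs above. The mean and standard deviation come from the training values only and are then reused for the test values, which mirrors the fit-on-train, transform-on-test pattern of `standard_scale_features`.

```python
import numpy as np

# A few Gr_Liv_Area values taken from the head() outputs above, for illustration only
train_area = np.array([1358.0, 1973.0, 1356.0, 2263.0, 1855.0])
test_area = np.array([954.0, 2157.0])

# Statistics are learned from the training values only
mu = train_area.mean()
sigma = train_area.std()  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train_area - mu) / sigma
test_scaled = (test_area - mu) / sigma  # reuse the training mean and std

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1
print(test_scaled)
```

The scaled test values are not expected to have mean 0 and standard deviation 1, since they are expressed relative to the training distribution.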

In [27]:
# instantiate standard scaler
ss = StandardScaler()
# scale X_train and X_test using the function defined previously
X_train_sc, X_test_sc = standard_scale_features(X_train, X_test, cont_ordinal_features, ss)

In [28]:
# after ss, X_train_sc and X_test_sc become array
# need to convert them back into df
X_train_sc = pd.DataFrame(data=X_train_sc , columns=cont_ordinal_features)
X_test_sc = pd.DataFrame(data=X_test_sc , columns=cont_ordinal_features)

# merge the scaled continuous and ordinal features back with the encoded nominal features
X_train_sc = X_train_sc.merge(train_nominal_encoded, how='left', left_index=True, right_index=True)
X_test_sc = X_test_sc.merge(test_nominal_encoded, how='left', left_index=True, right_index=True)
X_train_sc.head()

Unnamed: 0,Total_Bsmt_SF,Gr_Liv_Area,Overall_Qual,Exter_Qual,Heating_QC,Kitchen_Qual,Garage_Finish,MS_SubClass_high,MS_SubClass_low,MS_SubClass_mid,MS_Zoning_A (agr),MS_Zoning_C (all),MS_Zoning_FV,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Lot_Config_Corner,Lot_Config_CulDSac,Lot_Config_FR2,Lot_Config_FR3,Lot_Config_Inside,Neighborhood_high,Neighborhood_low,Neighborhood_mid,House_Style_1.5Fin,House_Style_1.5Unf,House_Style_1Story,House_Style_2.5Fin,House_Style_2.5Unf,House_Style_2Story,House_Style_SFoyer,House_Style_SLvl,Roof_Style_Flat,Roof_Style_Gable,Roof_Style_Gambrel,Roof_Style_Hip,Roof_Style_Mansard,Roof_Style_Shed,Exterior_1st_AsbShng,Exterior_1st_AsphShn,Exterior_1st_BrkComm,Exterior_1st_BrkFace,Exterior_1st_CBlock,Exterior_1st_CemntBd,Exterior_1st_HdBoard,Exterior_1st_ImStucc,Exterior_1st_MetalSd,Exterior_1st_Plywood,Exterior_1st_Stone,Exterior_1st_Stucco,Exterior_1st_VinylSd,Exterior_1st_Wd Sdng,Exterior_1st_WdShing,Exterior_2nd_AsbShng,Exterior_2nd_AsphShn,Exterior_2nd_Brk Cmn,Exterior_2nd_BrkFace,Exterior_2nd_CBlock,Exterior_2nd_CmentBd,Exterior_2nd_HdBoard,Exterior_2nd_ImStucc,Exterior_2nd_MetalSd,Exterior_2nd_Plywood,Exterior_2nd_Stone,Exterior_2nd_Stucco,Exterior_2nd_VinylSd,Exterior_2nd_Wd Sdng,Exterior_2nd_Wd Shng,Mas_Vnr_Type_BrkCmn,Mas_Vnr_Type_BrkFace,Mas_Vnr_Type_None,Mas_Vnr_Type_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Garage_Type_2Types,Garage_Type_Attchd,Garage_Type_Basment,Garage_Type_BuiltIn,Garage_Type_CarPort,Garage_Type_Detchd,Garage_Type_NA
0,0.656487,-0.279707,-0.081747,1.016168,0.884799,0.723473,0.309155,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.929778,0.936761,1.322072,1.016168,-0.146523,0.723473,1.423473,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.404919,-0.283663,-0.081747,-0.693079,-1.177845,-0.782232,1.423473,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.177624,1.51038,1.322072,1.016168,-0.146523,0.723473,1.423473,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,-1.17511,0.703357,-0.081747,1.016168,0.884799,-0.782232,0.309155,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


## Model Validation

### Simple Model
#### Linear regression model using 2 continuous features, the total basement area in square feet (Total_Bsmt_SF) and the above-ground living area in square feet (Gr_Liv_Area), as a baseline

In [29]:
# simple model: predict the sale price from the basement and above-ground living areas (sq ft)
# subset the X_train using the 2 continuous features for X_train_simple
X_train_simple = X_train[['Total_Bsmt_SF', 'Gr_Liv_Area']]
# subset the X_test using the 2 continuous features for X_test_simple
X_test_simple = X_test[['Total_Bsmt_SF', 'Gr_Liv_Area']]

In [30]:
# instantiate model
lr = LinearRegression()

# fit the model
# y_train and y_test remain unchanged as SalePrice is still the target to predict
lr.fit(X_train_simple, y_train)

LinearRegression()

In [31]:
# 5-fold cross validation for the simple model
# scored with the negative mean squared error (closer to 0 is better)
cross_val_score(
            lr,
            X_train_simple,
            y_train,
            cv=5,
            scoring='neg_mean_squared_error'
).mean()

-2666496473.9900713
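The score above is a negated mean squared error, in squared dollars, so it is not directly comparable with the RMSE values reported for the train and holdout sets. Taking the square root of its absolute value gives roughly 51,638. The sketch below, on synthetic data (not the Ames data), shows the conversion and also that the root of the mean fold MSE is not the same quantity as the mean of the per-fold RMSEs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical, not the Ames data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

neg_mse_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign, then take the square root to return to the units of the target
rmse_from_mean_mse = np.sqrt(-neg_mse_scores.mean())
mean_of_fold_rmse = np.sqrt(-neg_mse_scores).mean()

print(rmse_from_mean_mse, mean_of_fold_rmse)
```

Whichever of the two is reported, using the same one for every model keeps the comparison consistent.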

In [32]:
# predict the sale price using the simple model with 2 continuous features for train set
y_pred_simple = lr.predict(X_train_simple)
# compute the train root mean squared error (RMSE)
metrics.mean_squared_error(y_train, y_pred_simple)**0.5

50740.399827510584

In [33]:
# predict the sale price using the simple model with 2 continuous features for test set
y_pred_simple_test = lr.predict(X_test_simple)
# compute the test root mean squared error (RMSE)
metrics.mean_squared_error(y_test, y_pred_simple_test)**0.5

44562.217701691356

In [34]:
lr.coef_

array([66.20616689, 82.43503232])

In [35]:
lr.intercept_

-12097.519169844483
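The two coefficients and the intercept define the fitted baseline equation: predicted price = intercept + 66.21 x Total_Bsmt_SF + 82.44 x Gr_Liv_Area. Each extra square foot of basement adds about 66 dollars to the prediction and each extra square foot of living area about 82 dollars, holding the other feature fixed. A small sketch that applies the equation by hand, with `predict_simple` as an illustrative helper that is not part of the notebook:

```python
# Coefficients and intercept copied from the outputs above
coef_bsmt, coef_gr_liv = 66.20616689, 82.43503232
intercept = -12097.519169844483


def predict_simple(total_bsmt_sf, gr_liv_area):
    """Predicted sale price from the two-feature baseline equation."""
    return intercept + coef_bsmt * total_bsmt_sf + coef_gr_liv * gr_liv_area


# First training row shown earlier: 1358 sq ft basement, 1358 sq ft living area
print(round(predict_simple(1358, 1358), 2))
```

For that row the equation gives about 189,757. The negative intercept corresponds to a house with zero area, which lies outside the range of the data and should not be read as a price.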

### Linear Regression
#### Linear Regression with the 18 selected continuous, ordinal and nominal features

In [36]:
X_train.head()

Unnamed: 0,Total_Bsmt_SF,Gr_Liv_Area,Overall_Qual,Exter_Qual,Heating_QC,Kitchen_Qual,Garage_Finish,MS_SubClass_high,MS_SubClass_low,MS_SubClass_mid,MS_Zoning_A (agr),MS_Zoning_C (all),MS_Zoning_FV,MS_Zoning_RH,MS_Zoning_RL,MS_Zoning_RM,Lot_Config_Corner,Lot_Config_CulDSac,Lot_Config_FR2,Lot_Config_FR3,Lot_Config_Inside,Neighborhood_high,Neighborhood_low,Neighborhood_mid,House_Style_1.5Fin,House_Style_1.5Unf,House_Style_1Story,House_Style_2.5Fin,House_Style_2.5Unf,House_Style_2Story,House_Style_SFoyer,House_Style_SLvl,Roof_Style_Flat,Roof_Style_Gable,Roof_Style_Gambrel,Roof_Style_Hip,Roof_Style_Mansard,Roof_Style_Shed,Exterior_1st_AsbShng,Exterior_1st_AsphShn,Exterior_1st_BrkComm,Exterior_1st_BrkFace,Exterior_1st_CBlock,Exterior_1st_CemntBd,Exterior_1st_HdBoard,Exterior_1st_ImStucc,Exterior_1st_MetalSd,Exterior_1st_Plywood,Exterior_1st_Stone,Exterior_1st_Stucco,Exterior_1st_VinylSd,Exterior_1st_Wd Sdng,Exterior_1st_WdShing,Exterior_2nd_AsbShng,Exterior_2nd_AsphShn,Exterior_2nd_Brk Cmn,Exterior_2nd_BrkFace,Exterior_2nd_CBlock,Exterior_2nd_CmentBd,Exterior_2nd_HdBoard,Exterior_2nd_ImStucc,Exterior_2nd_MetalSd,Exterior_2nd_Plywood,Exterior_2nd_Stone,Exterior_2nd_Stucco,Exterior_2nd_VinylSd,Exterior_2nd_Wd Sdng,Exterior_2nd_Wd Shng,Mas_Vnr_Type_BrkCmn,Mas_Vnr_Type_BrkFace,Mas_Vnr_Type_None,Mas_Vnr_Type_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Garage_Type_2Types,Garage_Type_Attchd,Garage_Type_Basment,Garage_Type_BuiltIn,Garage_Type_CarPort,Garage_Type_Detchd,Garage_Type_NA
0,1358.0,1358,6,4,5,4,2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1935.0,1973,8,4,4,4,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1244.0,1356,6,3,3,3,3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,1141.0,2263,8,4,4,4,3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,528.0,1855,6,4,5,3,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [37]:
# merge all the feature names into a single list
# so that we can later check which features have the 3 highest coefficients
features = list(np.concatenate([cont_features, ordinal_features, encoded_nominal_features]).flat)
len(features)

85

In [38]:
# fit the model
# use the original X_train and y_train from the split
# contains the selected features
lr.fit(X_train, y_train)

LinearRegression()

In [39]:
# cross validation with k-folds=5 for the lr model with 18 selected features
# using negative mean squared error as the metric
cross_val_score(
            lr,
            X_train,
            y_train,
            cv=5,
            scoring='neg_mean_squared_error'
).mean()

-1234065608.6005664
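
`cross_val_score` with `scoring='neg_mean_squared_error'` returns the negated MSE, so the value above is not directly comparable with the RMSE figures computed further down. A minimal sketch of the conversion, using the mean score printed above (the variable names are illustrative and do not appear in the notebook):

```python
import math

# Mean negated MSE copied from the cross_val_score output above
mean_neg_mse = -1234065608.6005664

# Flip the sign to recover the MSE, then take the square root for the RMSE
cv_rmse = math.sqrt(-mean_neg_mse)
print(round(cv_rmse, 2))
```

Note that this is the square root of the mean fold MSE, which is generally not the same number as the mean of the per-fold RMSEs.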

In [40]:
lr.coef_

array([ 1.12394870e+01,  5.10188476e+01,  1.23069503e+04,  1.16134421e+04,
        2.36594175e+03,  1.32770702e+04,  5.31233048e+03, -7.31350582e+03,
        2.78438798e+03,  4.52911784e+03, -6.75598859e+02, -7.09648990e+03,
        5.67469640e+03,  3.51067410e+02,  5.81322704e+03, -4.06690210e+03,
        1.59807073e+02,  1.79114570e+04, -1.01443795e+04, -9.50779250e+03,
        1.58090798e+03,  3.51255004e+04, -2.29588570e+04, -1.21666434e+04,
       -1.02330764e+04,  4.05762453e+03, -5.12392225e+03,  2.88482192e+04,
       -9.10375082e+03, -1.03219900e+04,  8.20381742e+03, -6.32692157e+03,
        1.10998831e+04,  2.01366078e+02, -9.86243880e+03,  1.45653012e+04,
       -3.78627812e+04,  2.18586696e+04,  1.30175961e+04, -3.38444940e+04,
       -3.73308269e+04,  1.96966684e+04, -1.15452628e+04,  6.11222127e+04,
        4.65661897e+03, -8.37723375e+03,  1.07392086e+03,  5.69270871e+03,
       -5.29577312e+02, -1.60537106e+04, -4.90232938e+03,  2.17991502e+03,
        5.14379406e+03, -

In [41]:
lr.intercept_

-93264.31514665132

In [42]:
# check for top 3 features that influence the sale price
sorted(zip(lr.coef_, features), reverse=True)[:3]

[(61122.212674562456, 'Exterior_1st_CemntBd'),
 (51507.67333579246, 'Exterior_2nd_AsphShn'),
 (35125.50039072676, 'Neighborhood_high')]
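
Sorting the signed coefficients only surfaces the largest positive effects, while a large negative coefficient moves the predicted price just as strongly. A minimal sketch of ranking by absolute value instead, using a handful of coefficients copied from the `lr.coef_` output above. The export truncates the full array, so this subset is illustrative only, and the feature names attached to the two negative values assume that `features` is in the same order as `lr.coef_`, as the notebook's own `zip` does:

```python
# Subset of (feature, coefficient) pairs taken from the outputs above
coefs = {
    "Exterior_1st_CemntBd": 61122.2127,
    "Exterior_2nd_AsphShn": 51507.6733,
    "Neighborhood_high": 35125.5004,
    "Roof_Style_Mansard": -37862.7812,
    "Exterior_1st_BrkComm": -37330.8269,
}

# Rank by magnitude so strong negative effects are not hidden
top3 = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
for name, value in top3:
    print(f"{name}: {value:.2f}")
```

Because this model is fitted on unscaled features, coefficient sizes also depend on each feature's units, so such a ranking is more meaningful for the scaled Ridge, Lasso and ElasticNet models.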

In [43]:
# predict the sale price using the lr model with 18 selected features for the train set
y_pred_train_lr_selected = lr.predict(X_train)
# compute the train root mean squared error (RMSE)
metrics.mean_squared_error(y_train, y_pred_train_lr_selected)**0.5

32275.2734147699

In [44]:
# predict the sale price using the lr model with 18 selected features for the test set
y_pred_test_lr_selected = lr.predict(X_test)
# compute the test root mean squared error (RMSE)
metrics.mean_squared_error(y_test, y_pred_test_lr_selected)**0.5

30525.47573532446

### Ridge Regression
#### Ridge Regression with selected continuous, ordinal and nominal features (18)

In [45]:
# Set up a list of ridge alphas to check.
r_alphas = np.logspace(0, 10, 100)
# Cross-validate over our list of ridge alphas.
ridge_selected = RidgeCV(
                    alphas=r_alphas,
                    cv=5,
                    scoring='neg_mean_squared_error'
                )
# Fit model using best ridge alpha!
ridge_selected.fit(X_train_sc, y_train)

RidgeCV(alphas=array([1.00000000e+00, 1.26185688e+00, 1.59228279e+00, 2.00923300e+00,
       2.53536449e+00, 3.19926714e+00, 4.03701726e+00, 5.09413801e+00,
       6.42807312e+00, 8.11130831e+00, 1.02353102e+01, 1.29154967e+01,
       1.62975083e+01, 2.05651231e+01, 2.59502421e+01, 3.27454916e+01,
       4.13201240e+01, 5.21400829e+01, 6.57933225e+01, 8.30217568e+01,
       1.04761575e+02, 1.32194115e+0...
       4.75081016e+07, 5.99484250e+07, 7.56463328e+07, 9.54548457e+07,
       1.20450354e+08, 1.51991108e+08, 1.91791026e+08, 2.42012826e+08,
       3.05385551e+08, 3.85352859e+08, 4.86260158e+08, 6.13590727e+08,
       7.74263683e+08, 9.77009957e+08, 1.23284674e+09, 1.55567614e+09,
       1.96304065e+09, 2.47707636e+09, 3.12571585e+09, 3.94420606e+09,
       4.97702356e+09, 6.28029144e+09, 7.92482898e+09, 1.00000000e+10]),
        cv=5, scoring='neg_mean_squared_error')

In [46]:
# cross validation to compare different models
cross_val_score(
            ridge_selected,
            X_train_sc,
            y_train,
            cv=5,
            scoring='neg_mean_squared_error'
).mean()

-1165952306.308704

In [47]:
# get the optimal value of alpha
ridge_selected.alpha_

20.565123083486515

In [48]:
# get the coefficients for the respective features
ridge_selected.coef_

array([  6249.43648478,  24990.94943677,  18216.58091453,   7150.18417412,
         2530.55824694,   8886.05736768,   4862.12901082,  -5952.21067924,
         1621.32094986,   4330.88972937,    703.79217106,  -4458.59884381,
         3228.5542181 ,   -559.74367289,   5368.76973143,  -4282.77360388,
        -2352.54574309,  13305.16970822,  -8086.14837858,  -1965.94984543,
         -900.52574113,  30483.4507845 , -19589.59096908, -10893.85981542,
        -4659.34222597,   2533.10598678,  -1501.50584729,   4434.23569915,
        -1482.65435086,  -5889.486124  ,   7648.09604162,  -1082.44917944,
         2778.53079804,  -3376.23069583,  -3297.95310137,  10194.85451413,
        -7844.63990589,   1545.43839092,  -1940.67749017,    571.10444857,
        -3368.3813197 ,   7355.72118974,   -290.58298488,   8969.89880833,
         -281.03277554,   -642.48804316,   1485.54578136,   1100.22437992,
         -633.55452303, -10546.61445189,   -568.12571244,  -1954.26505619,
          743.22774906,  

In [49]:
# check for top 3 features that influence the sale price
sorted(zip(ridge_selected.coef_, features), reverse=True)[:3]

[(30483.450784500223, 'Neighborhood_high'),
 (24990.949436766696, 'Gr_Liv_Area'),
 (18216.580914525504, 'Overall_Qual')]

In [50]:
# predict the sale price using the ridge model with selected features for train set
y_pred_train_ridge_selected = ridge_selected.predict(X_train_sc)
# compute the train root mean squared error (RMSE)
metrics.mean_squared_error(y_train, y_pred_train_ridge_selected)**0.5

32601.176795493197

In [51]:
# predict the sale price using the ridge model with selected features for test set
y_pred_test_ridge_selected = ridge_selected.predict(X_test_sc)
# compute the test root mean squared error (RMSE)
metrics.mean_squared_error(y_test, y_pred_test_ridge_selected)**0.5

29580.4903716658

### Lasso Regression 
#### Lasso Regression Model with selected continuous, ordinal and nominal features (18)

In [52]:
# Set up a list of Lasso alphas to check.
l_alphas = np.logspace(-3, 10, 100)
# Cross-validate over our list of Lasso alphas.
lasso_selected = LassoCV(
                alphas=l_alphas,
                cv=5,
                # LassoCV has no `scoring` parameter; alpha is selected by mean squared error
                max_iter=50000
            )
# Fit model using best lasso alpha!
lasso_selected.fit(X_train_sc, y_train)

LassoCV(alphas=array([1.00000000e-03, 1.35304777e-03, 1.83073828e-03, 2.47707636e-03,
       3.35160265e-03, 4.53487851e-03, 6.13590727e-03, 8.30217568e-03,
       1.12332403e-02, 1.51991108e-02, 2.05651231e-02, 2.78255940e-02,
       3.76493581e-02, 5.09413801e-02, 6.89261210e-02, 9.32603347e-02,
       1.26185688e-01, 1.70735265e-01, 2.31012970e-01, 3.12571585e-01,
       4.22924287e-01, 5.72236766e-0...
       9.54548457e+06, 1.29154967e+07, 1.74752840e+07, 2.36448941e+07,
       3.19926714e+07, 4.32876128e+07, 5.85702082e+07, 7.92482898e+07,
       1.07226722e+08, 1.45082878e+08, 1.96304065e+08, 2.65608778e+08,
       3.59381366e+08, 4.86260158e+08, 6.57933225e+08, 8.90215085e+08,
       1.20450354e+09, 1.62975083e+09, 2.20513074e+09, 2.98364724e+09,
       4.03701726e+09, 5.46227722e+09, 7.39072203e+09, 1.00000000e+10]),
        cv=5, max_iter=50000)

In [53]:
# cross validation to compare different models
cross_val_score(
            lasso_selected,
            X_train_sc,
            y_train,
            cv=5,
            scoring='neg_mean_squared_error'
).mean()

-1165686197.2396388

In [54]:
# get the optimal value of alpha
lasso_selected.alpha_

132.19411484660313

In [55]:
# get the coefficients for the respective features
lasso_selected.coef_

array([  6207.03765972,  24912.54289384,  18671.44648106,   6729.65590866,
         2664.36200762,   8741.8248811 ,   5231.06166657,  -6122.39521279,
           -0.        ,   3744.94507964,      0.        ,     -0.        ,
         1013.00770083,     -0.        ,   4148.05232073,  -4667.39040357,
           -0.        ,  16002.89972346,  -5105.56397851,     -0.        ,
          854.86904397,  44533.12625541, -10310.81428754,      0.        ,
        -1804.74559948,      0.        ,      0.        ,      0.        ,
           -0.        ,  -4721.75213379,   8398.08038093,      0.        ,
            0.        ,     -0.        ,     -0.        ,  13082.25982262,
           -0.        ,      0.        ,     -0.        ,      0.        ,
           -0.        ,   3497.54912003,     -0.        ,  11649.90284741,
         -144.82651187,     -0.        ,    246.44677519,      0.        ,
           -0.        , -14854.79643739,      0.        ,     -0.        ,
            0.        ,  

In [56]:
# check number of features for lasso
lasso_features = [(feature, coe) for coe, feature in zip(lasso_selected.coef_, features) if coe != 0]
# len(lasso_features)
lasso_features

[('Total_Bsmt_SF', 6207.037659715106),
 ('Gr_Liv_Area', 24912.54289384445),
 ('Overall_Qual', 18671.44648105852),
 ('Exter_Qual', 6729.655908659805),
 ('Heating_QC', 2664.3620076161037),
 ('Kitchen_Qual', 8741.824881096421),
 ('Garage_Finish', 5231.061666573944),
 ('MS_SubClass_high', -6122.395212788793),
 ('MS_SubClass_mid', 3744.94507963885),
 ('MS_Zoning_FV', 1013.0077008309977),
 ('MS_Zoning_RL', 4148.052320727696),
 ('MS_Zoning_RM', -4667.390403570852),
 ('Lot_Config_CulDSac', 16002.899723460909),
 ('Lot_Config_FR2', -5105.56397851444),
 ('Lot_Config_Inside', 854.8690439726412),
 ('Neighborhood_high', 44533.12625540694),
 ('Neighborhood_low', -10310.814287535686),
 ('House_Style_1.5Fin', -1804.7455994803129),
 ('House_Style_2Story', -4721.752133793714),
 ('House_Style_SFoyer', 8398.080380925832),
 ('Roof_Style_Hip', 13082.259822618273),
 ('Exterior_1st_BrkFace', 3497.549120028711),
 ('Exterior_1st_CemntBd', 11649.902847405554),
 ('Exterior_1st_HdBoard', -144.82651186861432),
 ('Ex

In [57]:
# check for top 3 features that influence the sale price
sorted(zip(lasso_selected.coef_, features), reverse=True)[:3]

[(44533.12625540694, 'Neighborhood_high'),
 (24912.54289384445, 'Gr_Liv_Area'),
 (18671.44648105852, 'Overall_Qual')]

In [58]:
# predict the sale price using the lasso model with selected features for train set
y_pred_train_lasso_selected = lasso_selected.predict(X_train_sc)
# compute the train root mean squared error (RMSE)
metrics.mean_squared_error(y_train, y_pred_train_lasso_selected)**0.5

32764.561681152452

In [59]:
# predict the sale price using the lasso model with selected features for the test set
y_pred_test_lasso_selected = lasso_selected.predict(X_test_sc)
# compute the test root mean squared error (RMSE)
metrics.mean_squared_error(y_test, y_pred_test_lasso_selected)**0.5

29669.9543984543

### ElasticNet Regression
#### ElasticNet Regression with selected continuous, ordinal and nominal features (18)

In [60]:
# Set up a list of alphas to check.
enet_alphas = np.logspace(-3, 10, 100)

# Set up list of l1 ratio
enet_ratio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Instantiate model.
enet_model_selected = ElasticNetCV(
                            alphas=enet_alphas, 
                            l1_ratio=enet_ratio, 
                            cv=5, 
                            max_iter=10000
                        )

# Fit model using optimal alpha.
enet_model_selected.fit(X_train_sc, y_train)

ElasticNetCV(alphas=array([1.00000000e-03, 1.35304777e-03, 1.83073828e-03, 2.47707636e-03,
       3.35160265e-03, 4.53487851e-03, 6.13590727e-03, 8.30217568e-03,
       1.12332403e-02, 1.51991108e-02, 2.05651231e-02, 2.78255940e-02,
       3.76493581e-02, 5.09413801e-02, 6.89261210e-02, 9.32603347e-02,
       1.26185688e-01, 1.70735265e-01, 2.31012970e-01, 3.12571585e-01,
       4.22924287e-01, 5.722367...
       3.19926714e+07, 4.32876128e+07, 5.85702082e+07, 7.92482898e+07,
       1.07226722e+08, 1.45082878e+08, 1.96304065e+08, 2.65608778e+08,
       3.59381366e+08, 4.86260158e+08, 6.57933225e+08, 8.90215085e+08,
       1.20450354e+09, 1.62975083e+09, 2.20513074e+09, 2.98364724e+09,
       4.03701726e+09, 5.46227722e+09, 7.39072203e+09, 1.00000000e+10]),
             cv=5, l1_ratio=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
             max_iter=10000)

In [61]:
# cross validation to compare different models
cross_val_score(
            enet_model_selected,
            X_train_sc,
            y_train,
            cv=5,
            scoring='neg_mean_squared_error'
).mean()

-1167278973.4733353

In [62]:
# get the optimal value of alpha
enet_model_selected.alpha_

0.037649358067924694

In [63]:
# get the optimal value l1 ratio
enet_model_selected.l1_ratio_

0.6

In [64]:
# get the coefficients for the respective features
enet_model_selected.coef_

array([  6419.31285917,  24853.30270399,  18250.51116201,   7248.87109378,
         2541.50476885,   8914.92936661,   4897.22122516,  -5735.49137886,
         1461.04557921,   4272.9045725 ,    638.69917901,  -4047.92056831,
         2933.13773892,   -603.7076709 ,   5354.10676207,  -4274.31748062,
        -2406.15057169,  12820.76485972,  -7732.12952537,  -1667.35211977,
        -1010.6324601 ,  29763.63038051, -19097.27697541, -10664.77922605,
        -4277.19615678,   2244.55722164,  -1362.50697771,   3845.2729775 ,
        -1227.51052093,  -5608.68420123,   7238.93901473,   -849.87154808,
         2423.93431738,  -3672.95729131,  -3031.24558824,   9785.92825995,
        -6785.44414344,   1279.78366107,  -1960.31746026,    489.43479329,
        -2878.85444071,   6701.63090331,   -220.06894203,   8401.30737563,
         -402.88076482,   -534.07715029,   1337.26379898,    964.62374863,
         -540.45928576,  -9807.13707307,   -359.05000446,  -1907.83136913,
          720.91684202,  

In [65]:
# check for top 3 features that influence the sale price
sorted(zip(enet_model_selected.coef_, features), reverse=True)[:3]

[(29763.630380514045, 'Neighborhood_high'),
 (24853.302703988353, 'Gr_Liv_Area'),
 (18250.511162005703, 'Overall_Qual')]

In [66]:
# predict the sale price using the enet model with selected features for train set
y_pred_train_enet_selected = enet_model_selected.predict(X_train_sc)
# compute the train root mean squared error (RMSE)
metrics.mean_squared_error(y_train, y_pred_train_enet_selected)**0.5

32647.067946915817

In [67]:
# predict the sale price using the enet model with selected features for test set
y_pred_test_enet_selected = enet_model_selected.predict(X_test_sc)
# compute the test root mean squared error (RMSE)
metrics.mean_squared_error(y_test, y_pred_test_enet_selected)**0.5

29527.186226927584

## Conclusion

The table below shows the details and results of the 5 different models.

|  | Description | Hyperparameters | Number of Features | CV RMSE | Holdout RMSE|
| :-: | :-: | :-: | :-: | :-: |:-:|
| Model 1 | Linear Regression | - | 2 | 51638.13 | 44562.22 |
| Model 2 | Linear Regression | - | 85 | 35129.27 | 30525.48 |
| Model 3 | Ridge Regression | alpha = 20.6 | 85 | 34146.04 | 29580.49 |
| Model 4 | Lasso Regression | alpha = 132.2 | 35 | 34142.15 | 29669.95 |
| Model 5 | ElasticNet Regression | alpha = 0.04, l1_ratio = 0.6 | 85 | 34165.46 | 29527.19 |

From the table above, it can be observed that the ElasticNet and Lasso regression models are among the best performers: Lasso has the lowest CV RMSE and ElasticNet has the lowest holdout RMSE. However, the Lasso regression model reduces the number of features required to predict the sale price of a house by more than 50%. Therefore, I would recommend using the Lasso regression model to predict the sale price, since it needs fewer features and is cheaper to compute while performing comparably.
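
The "more than 50%" figure follows directly from the feature counts in the table. A quick check, with illustrative variable names that are not part of the notebook:

```python
# Feature counts taken from the table above
n_all_features = 85    # Linear, Ridge and ElasticNet models
n_lasso_features = 35  # features kept by the Lasso model

# Fraction of features dropped by Lasso
reduction = (n_all_features - n_lasso_features) / n_all_features
print(f"{reduction:.1%}")
```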

Another point to note is that the holdout root mean square error (RMSE) is lower than the cross-validation (CV) RMSE for every model. This might be because the modelling choices are partly tailored to the holdout dataset: the holdout dataset is part of the train dataset that was used for exploratory data analysis (EDA), so it is not completely unseen data.

In the next section, the full train dataset will be used only to fit the models, and an unseen test dataset will be used to predict sale prices for submission to Kaggle, to test how each of the models performs before any further conclusion is made.

## Summary

In this section, the train dataset is split into training data and holdout data to cross-validate the performance of the 5 different models in terms of their RMSE. Before fitting the models, the nominal features of the training and testing datasets need to be one hot encoded, because scikit-learn estimators only accept numeric input. Ordinal and continuous features also need to be standard scaled before fitting the Ridge, Lasso and ElasticNet models, so that the regularisation penalty controlled by the hyperparameters alpha and l1 ratio applies to the coefficients of all features on a comparable scale.
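
To illustrate the scaling step, the sketch below standardises the five `Total_Bsmt_SF` values shown by `X_train.head()` in plain Python. It mirrors what scikit-learn's `StandardScaler` computes (subtract the mean, divide by the population standard deviation), but it is only an illustration on five rows and not the notebook's actual preprocessing function:

```python
from statistics import mean, pstdev

# Total_Bsmt_SF values from the first five rows of X_train
total_bsmt_sf = [1358.0, 1935.0, 1244.0, 1141.0, 528.0]

mu = mean(total_bsmt_sf)
sigma = pstdev(total_bsmt_sf)  # population std (ddof=0), as StandardScaler uses

# z-score: each value expressed in standard deviations from the mean
scaled = [(x - mu) / sigma for x in total_bsmt_sf]
print([round(z, 3) for z in scaled])
```

After scaling, the values have mean 0 and unit variance, so a single alpha penalises the coefficients of differently sized features on a comparable scale.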

The ElasticNet and Lasso regression models are among the best performers. However, the Lasso regression model reduces the number of features required to predict the sale price of a house by more than 50%. Therefore, the Lasso regression model is recommended for predicting the sale price, as it has fewer features to compute while achieving performance comparable to the ElasticNet regression model.

The holdout root mean square error (RMSE) is lower than the cross-validation (CV) RMSE. This might be due to the holdout dataset being part of the train dataset that was used for exploratory data analysis (EDA), so the holdout dataset is not completely unseen data.

In the next section, the full train dataset will be used only to fit the models, and an unseen test dataset will be used to predict sale prices for submission to Kaggle, to test how each of the models performs before any further conclusion is made.