# Pre-Processing: Housing Price Prediction

In the preprocessing phase we accomplish two things. First, we separate the discrete and continuous features from one another. Second, we scale the continuous features using StandardScaler() and one-hot encode the discrete features. We then concatenate them back together into one dataset and save the preprocessed dataset for modeling. We do this four separate times, with different combinations of encoding and scaling that we think may have an effect on the feature importance portion prior to modeling.
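
The workflow described above can be sketched on a toy frame. This is only an illustration under stated assumptions: the values are made up, and the names `toy`, `num_cols`, and `cat_cols` are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the engineered training data (values are made up).
toy = pd.DataFrame({
    "SalePrice": [200000, 150000, 320000, 180000],
    "LotArea": [8450, 9600, 11250, 7000],
    "GrLivArea": [1710, 1262, 1786, 1100],
    "LotShape": ["Reg", "Reg", "IR1", "IR1"],
})

target = "SalePrice"
cat_cols = ["LotShape"]
# Everything that is neither the target nor categorical is treated as continuous.
num_cols = [c for c in toy.columns if c not in cat_cols + [target]]

# Scale the continuous columns to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy[num_cols]),
    index=toy.index,
    columns=num_cols,
)

# One-hot encode the discrete columns.
dummies = pd.get_dummies(toy[cat_cols])

# Reattach the unscaled target and combine everything column-wise.
preprocessed = pd.concat([toy[target], scaled, dummies], axis=1)
print(preprocessed.shape)
```

Passing the original index to the scaled frame keeps the rows aligned when the pieces are concatenated with `axis=1`.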

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder
from library.sb_utils import save_file

In [2]:
pd.set_option('display.max_rows', 500)

In [3]:
data = pd.read_csv("Data Files/train_data_engineered.csv")

In [4]:
data.shape

(1137, 62)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1137 entries, 0 to 1136
Data columns (total 62 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SalePrice      1137 non-null   int64  
 1   LotFrontage    1137 non-null   float64
 2   LotArea        1137 non-null   int64  
 3   MasVnrArea     1137 non-null   float64
 4   BsmtFinSF1     1137 non-null   int64  
 5   TotalBsmtSF    1137 non-null   int64  
 6   1stFlrSF       1137 non-null   int64  
 7   2ndFlrSF       1137 non-null   int64  
 8   GrLivArea      1137 non-null   int64  
 9   BsmtFullBath   1137 non-null   int64  
 10  FullBath       1137 non-null   int64  
 11  HalfBath       1137 non-null   int64  
 12  TotRmsAbvGrd   1137 non-null   int64  
 13  Fireplaces     1137 non-null   int64  
 14  GarageArea     1137 non-null   int64  
 15  WoodDeckSF     1137 non-null   int64  
 16  OpenPorchSF    1137 non-null   int64  
 17  MSZoning       1137 non-null   object 
 18  Alley   

Let's drop some categorical variables based on common sense. For example, SaleCondition is not known before a home is on the market, so it is not a good predictor of home value at that point.

In [9]:
data0= data.drop(['SaleCondition','SaleType','Alley','Condition1','BldgType','Exterior1st','Exterior2nd','Foundation','Electrical','MSSubClass','MSZoning','BsmtFinType1','BsmtFinType2','Neighborhood'],axis=1)

In [10]:
data0.columns

Index(['SalePrice', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType', 'ExterQual',
       'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating',
       'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
       'Fence', 'MiscFeature', 'OverallQual', 'OverallCond', 'House_Age',
       'Remod_Age', 'Garage_Age', 'Remod_Age_Avg'],
      dtype='object')

In [11]:
catfeatures = ['LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType',
       'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating', 'HeatingQC', 'CentralAir',
       'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'Fence',
       'MiscFeature']
# Everything that is neither the target nor categorical is treated as numeric
numfeatures = data0.drop(columns=['SalePrice'] + catfeatures)

In [16]:
dummies0 = pd.get_dummies(data0[catfeatures])
dummies0.head().T

Unnamed: 0,0,1,2,3,4
LotShape_IR1,0,0,1,1,0
LotShape_IR2,0,0,0,0,0
LotShape_IR3,0,0,0,0,0
LotShape_Reg,1,1,0,0,1
LandContour_Bnk,0,0,0,0,0
LandContour_HLS,0,0,0,0,0
LandContour_Low,0,0,0,0,0
LandContour_Lvl,1,1,1,1,1
LotConfig_Corner,0,0,0,0,0
LotConfig_CulDSac,0,0,0,0,0


In [18]:
scaler = StandardScaler()
scaler.fit(numfeatures)
scaled0 = scaler.transform(numfeatures)
scaled0 = pd.DataFrame(scaled0, index=numfeatures.index, columns=numfeatures.columns)

In [19]:
scaled0.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,...,Fireplaces,GarageArea,WoodDeckSF,OpenPorchSF,OverallQual,OverallCond,House_Age,Remod_Age,Garage_Age,Remod_Age_Avg
0,0.279857,-0.217666,0.888197,0.682376,-0.393202,-0.754347,1.344658,0.692021,1.167737,0.875418,...,-0.875318,0.460518,-0.820323,0.436367,0.762179,-0.488014,-1.024963,-0.869248,-0.821371,-1.058828
1,0.760132,0.078136,-0.619443,1.339277,0.679639,0.478393,-0.786663,-0.376695,-0.781921,0.875418,...,0.764281,0.024408,1.98501,-0.768407,0.011878,2.236397,-0.106294,0.423131,0.219722,0.125345
2,0.375912,0.502548,0.626668,0.151058,-0.224084,-0.560023,1.374606,0.873321,1.167737,0.875418,...,0.764281,0.757865,-0.820323,0.06111,0.762179,-0.488014,-0.956913,-0.821382,-0.744253,-0.993041
3,0.888205,1.276779,2.072771,0.559207,0.37047,0.123145,1.8413,1.856159,1.167737,0.875418,...,0.764281,1.887784,0.98714,0.890627,1.512479,-0.488014,-0.922889,-0.72565,-0.705694,-0.927253
4,0.60004,0.202631,0.811277,2.283573,1.800044,1.790077,-0.786663,0.653853,1.167737,0.875418,...,0.764281,0.896627,1.580214,0.357366,1.512479,-0.488014,-1.058988,-0.96498,-0.85993,-1.124615


In [20]:
data_preprocessed0 = pd.concat([scaled0,dummies0],axis=1)

In [21]:
data_preprocessed0 = pd.concat([data['SalePrice'],data_preprocessed0],axis=1)

In [22]:
data_preprocessed0.head()

Unnamed: 0,SalePrice,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,...,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MiscFeature_None,MiscFeature_Shed
0,208500,0.279857,-0.217666,0.888197,0.682376,-0.393202,-0.754347,1.344658,0.692021,1.167737,...,0,0,1,0,0,0,0,1,1,0
1,181500,0.760132,0.078136,-0.619443,1.339277,0.679639,0.478393,-0.786663,-0.376695,-0.781921,...,0,0,1,0,0,0,0,1,1,0
2,223500,0.375912,0.502548,0.626668,0.151058,-0.224084,-0.560023,1.374606,0.873321,1.167737,...,0,0,1,0,0,0,0,1,1,0
3,250000,0.888205,1.276779,2.072771,0.559207,0.37047,0.123145,1.8413,1.856159,1.167737,...,0,0,1,0,0,0,0,1,1,0
4,307000,0.60004,0.202631,0.811277,2.283573,1.800044,1.790077,-0.786663,0.653853,1.167737,...,0,0,1,0,0,0,0,1,1,0


In [23]:
data_preprocessed0.shape

(1137, 140)

In [24]:
data_preprocessed0.head().T

Unnamed: 0,0,1,2,3,4
SalePrice,208500.0,181500.0,223500.0,250000.0,307000.0
LotFrontage,0.279857,0.760132,0.375912,0.888205,0.60004
LotArea,-0.217666,0.078136,0.502548,1.276779,0.202631
MasVnrArea,0.888197,-0.619443,0.626668,2.072771,0.811277
BsmtFinSF1,0.682376,1.339277,0.151058,0.559207,2.283573
TotalBsmtSF,-0.393202,0.679639,-0.224084,0.37047,1.800044
1stFlrSF,-0.754347,0.478393,-0.560023,0.123145,1.790077
2ndFlrSF,1.344658,-0.786663,1.374606,1.8413,-0.786663
GrLivArea,0.692021,-0.376695,0.873321,1.856159,0.653853
BsmtFullBath,1.167737,-0.781921,1.167737,1.167737,1.167737


In [26]:
#Saving preprocessed data to new csv
# Raw string so the backslashes in the Windows path are not read as escape sequences
datapath = r'C:\Springboard_\CapstoneTwo\Data Files'
save_file(data_preprocessed0, 'data_preprocessed.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "C:\Springboard_\CapstoneTwo\Data Files\data_preprocessed.csv"


## Second Dataset with ordinal encoding

In [27]:
data1 = data.drop(['SaleCondition','SaleType','Alley','Condition1','BldgType','Exterior1st','Exterior2nd','Foundation','Electrical','MSSubClass','MSZoning','BsmtFinType1','BsmtFinType2','Neighborhood'],axis=1)

In [28]:
data1.columns

Index(['SalePrice', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType', 'ExterQual',
       'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating',
       'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
       'Fence', 'MiscFeature', 'OverallQual', 'OverallCond', 'House_Age',
       'Remod_Age', 'Garage_Age', 'Remod_Age_Avg'],
      dtype='object')

In [29]:
ordfeatures= ['OverallQual','OverallCond']
catfeatures = ['LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType',
       'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating', 'HeatingQC', 'CentralAir',
       'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'Fence',
       'MiscFeature']
# Numeric features exclude the target, the categorical columns, and the ordinal columns
numfeatures = data1.drop(columns=['SalePrice'] + catfeatures + ordfeatures)

In [30]:
dummies1 = pd.get_dummies(data1[catfeatures])
dummies1.head().T

Unnamed: 0,0,1,2,3,4
LotShape_IR1,0,0,1,1,0
LotShape_IR2,0,0,0,0,0
LotShape_IR3,0,0,0,0,0
LotShape_Reg,1,1,0,0,1
LandContour_Bnk,0,0,0,0,0
LandContour_HLS,0,0,0,0,0
LandContour_Low,0,0,0,0,0
LandContour_Lvl,1,1,1,1,1
LotConfig_Corner,0,0,0,0,0
LotConfig_CulDSac,0,0,0,0,0


In [31]:
encoder = OrdinalEncoder()
ordinalencoded1=encoder.fit_transform(data1[ordfeatures])

In [32]:
ordinalencoded1

array([[6., 4.],
       [5., 7.],
       [6., 4.],
       ...,
       [6., 4.],
       [5., 4.],
       [5., 5.]])

In [33]:
ord_encoded1 = pd.DataFrame(ordinalencoded1, index=data1[ordfeatures].index,columns=data1[ordfeatures].columns)

In [34]:
data1[ordfeatures].head(20)

Unnamed: 0,OverallQual,OverallCond
0,7,5
1,6,8
2,7,5
3,8,5
4,8,5
5,5,6
6,5,5
7,7,5
8,6,5
9,7,8


In [35]:
ord_encoded1.head(20)

Unnamed: 0,OverallQual,OverallCond
0,6.0,4.0
1,5.0,7.0
2,6.0,4.0
3,7.0,4.0
4,7.0,4.0
5,4.0,5.0
6,4.0,4.0
7,6.0,4.0
8,5.0,4.0
9,6.0,7.0
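
Comparing the two previews above, each encoded value is one lower than the original rating (7 becomes 6.0, 8 becomes 7.0). That is consistent with how OrdinalEncoder works by default: it maps the sorted unique values seen during fitting to the positions 0, 1, 2, and so on. A minimal sketch with made-up ratings (the `ratings` frame is hypothetical) shows that the shift is not always a uniform minus one when some rating levels are absent:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ratings; levels below 5 and above 8 never occur in this toy column.
ratings = pd.DataFrame({"OverallQual": [7, 6, 7, 8, 5]})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(ratings)

# categories_ holds the sorted unique values seen during fit.
print(encoder.categories_[0])
# Each rating is replaced by its position in that sorted list.
print(encoded.ravel())
```

Because OverallQual and OverallCond are already integers with a natural order, the encoder mainly re-indexes them here.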


In [36]:
scaler = StandardScaler()
scaler.fit(numfeatures)
scaled1 = scaler.transform(numfeatures)
scaled1 = pd.DataFrame(scaled1, index=numfeatures.index, columns=numfeatures.columns)

In [37]:
scaled1.head()

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,HalfBath,TotRmsAbvGrd,Fireplaces,GarageArea,WoodDeckSF,OpenPorchSF,House_Age,Remod_Age,Garage_Age,Remod_Age_Avg
0,0.279857,-0.217666,0.888197,0.682376,-0.393202,-0.754347,1.344658,0.692021,1.167737,0.875418,1.27376,1.20827,-0.875318,0.460518,-0.820323,0.436367,-1.024963,-0.869248,-0.821371,-1.058828
1,0.760132,0.078136,-0.619443,1.339277,0.679639,0.478393,-0.786663,-0.376695,-0.781921,0.875418,-0.73493,-0.203655,0.764281,0.024408,1.98501,-0.768407,-0.106294,0.423131,0.219722,0.125345
2,0.375912,0.502548,0.626668,0.151058,-0.224084,-0.560023,1.374606,0.873321,1.167737,0.875418,1.27376,-0.203655,0.764281,0.757865,-0.820323,0.06111,-0.956913,-0.821382,-0.744253,-0.993041
3,0.888205,1.276779,2.072771,0.559207,0.37047,0.123145,1.8413,1.856159,1.167737,0.875418,1.27376,1.914232,0.764281,1.887784,0.98714,0.890627,-0.922889,-0.72565,-0.705694,-0.927253
4,0.60004,0.202631,0.811277,2.283573,1.800044,1.790077,-0.786663,0.653853,1.167737,0.875418,-0.73493,0.502307,0.764281,0.896627,1.580214,0.357366,-1.058988,-0.96498,-0.85993,-1.124615


In [41]:
data_preprocessed1 = pd.concat([data['SalePrice'],scaled1,dummies1,ord_encoded1],axis=1)

In [42]:
data_preprocessed1.head().T

Unnamed: 0,0,1,2,3,4
SalePrice,208500.0,181500.0,223500.0,250000.0,307000.0
LotFrontage,0.279857,0.760132,0.375912,0.888205,0.60004
LotArea,-0.217666,0.078136,0.502548,1.276779,0.202631
MasVnrArea,0.888197,-0.619443,0.626668,2.072771,0.811277
BsmtFinSF1,0.682376,1.339277,0.151058,0.559207,2.283573
TotalBsmtSF,-0.393202,0.679639,-0.224084,0.37047,1.800044
1stFlrSF,-0.754347,0.478393,-0.560023,0.123145,1.790077
2ndFlrSF,1.344658,-0.786663,1.374606,1.8413,-0.786663
GrLivArea,0.692021,-0.376695,0.873321,1.856159,0.653853
BsmtFullBath,1.167737,-0.781921,1.167737,1.167737,1.167737


In [43]:
data_preprocessed1.shape

(1137, 140)

In [44]:
data_preprocessed1.head().T

Unnamed: 0,0,1,2,3,4
SalePrice,208500.0,181500.0,223500.0,250000.0,307000.0
LotFrontage,0.279857,0.760132,0.375912,0.888205,0.60004
LotArea,-0.217666,0.078136,0.502548,1.276779,0.202631
MasVnrArea,0.888197,-0.619443,0.626668,2.072771,0.811277
BsmtFinSF1,0.682376,1.339277,0.151058,0.559207,2.283573
TotalBsmtSF,-0.393202,0.679639,-0.224084,0.37047,1.800044
1stFlrSF,-0.754347,0.478393,-0.560023,0.123145,1.790077
2ndFlrSF,1.344658,-0.786663,1.374606,1.8413,-0.786663
GrLivArea,0.692021,-0.376695,0.873321,1.856159,0.653853
BsmtFullBath,1.167737,-0.781921,1.167737,1.167737,1.167737


In [45]:
#Saving preprocessed data to new csv
# Raw string so the backslashes in the Windows path are not read as escape sequences
datapath = r'C:\Springboard_\CapstoneTwo\Data Files'
save_file(data_preprocessed1, 'data_preprocessed_ordinal.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "C:\Springboard_\CapstoneTwo\Data Files\data_preprocessed_ordinal.csv"


## Third Dataset with ordinal encoding and all scaled

In [46]:
data2 = data.drop(['SaleCondition','SaleType','Alley','Condition1','BldgType','Exterior1st','Exterior2nd','Foundation','Electrical','MSSubClass','MSZoning','BsmtFinType1','BsmtFinType2','Neighborhood'],axis=1)

In [47]:
data2.columns

Index(['SalePrice', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType', 'ExterQual',
       'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating',
       'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
       'Fence', 'MiscFeature', 'OverallQual', 'OverallCond', 'House_Age',
       'Remod_Age', 'Garage_Age', 'Remod_Age_Avg'],
      dtype='object')

In [48]:
ordfeatures= ['OverallQual','OverallCond']
catfeatures = ['LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType',
       'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating', 'HeatingQC', 'CentralAir',
       'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'Fence',
       'MiscFeature']
# Use data2 (not data1) so this section depends only on its own frame
numfeatures = data2.drop(columns=['SalePrice'] + catfeatures + ordfeatures)

In [49]:
dummies2 = pd.get_dummies(data2[catfeatures])
dummies2.head()

Unnamed: 0,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,...,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MiscFeature_None,MiscFeature_Shed
0,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
1,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
2,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
3,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
4,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0


In [50]:
encoder = OrdinalEncoder()
ordinalencoded2=encoder.fit_transform(data2[ordfeatures])

In [51]:
ordinalencoded2

array([[6., 4.],
       [5., 7.],
       [6., 4.],
       ...,
       [6., 4.],
       [5., 4.],
       [5., 5.]])

In [52]:
ord_encoded2 = pd.DataFrame(ordinalencoded2, index=data2[ordfeatures].index,columns=data2[ordfeatures].columns)

In [53]:
data2[ordfeatures].head(20)

Unnamed: 0,OverallQual,OverallCond
0,7,5
1,6,8
2,7,5
3,8,5
4,8,5
5,5,6
6,5,5
7,7,5
8,6,5
9,7,8


In [54]:
ord_encoded2.head(20)

Unnamed: 0,OverallQual,OverallCond
0,6.0,4.0
1,5.0,7.0
2,6.0,4.0
3,7.0,4.0
4,7.0,4.0
5,4.0,5.0
6,4.0,4.0
7,6.0,4.0
8,5.0,4.0
9,6.0,7.0


In [55]:
data2_prescaled = pd.concat([numfeatures,dummies2,ord_encoded2],axis=1)
data2_prescaled

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,...,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MiscFeature_None,MiscFeature_Shed,OverallQual,OverallCond
0,65.0,8450,196.0,706,856,856,854,1710,1,2,...,1,0,0,0,0,1,1,0,6.0,4.0
1,80.0,9600,0.0,978,1262,1262,0,1262,0,2,...,1,0,0,0,0,1,1,0,5.0,7.0
2,68.0,11250,162.0,486,920,920,866,1786,1,2,...,1,0,0,0,0,1,1,0,6.0,4.0
3,84.0,14260,350.0,655,1145,1145,1053,2198,1,2,...,1,0,0,0,0,1,1,0,7.0,4.0
4,75.0,10084,186.0,1369,1686,1694,0,1694,1,2,...,1,0,0,0,0,1,1,0,7.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1132,35.0,3675,80.0,547,547,1072,0,1072,1,1,...,1,0,0,0,0,1,1,0,4.0,4.0
1133,90.0,17217,0.0,0,1140,1140,0,1140,0,1,...,1,0,0,0,0,1,1,0,4.0,4.0
1134,62.0,7500,0.0,410,1221,1221,0,1221,1,2,...,1,0,0,0,0,1,1,0,6.0,4.0
1135,62.0,7917,0.0,0,953,953,694,1647,0,2,...,1,0,0,0,0,1,1,0,5.0,4.0


In [56]:
scaler = StandardScaler()
scaler.fit(data2_prescaled)
scaled2 = scaler.transform(data2_prescaled)
scaled2 = pd.DataFrame(scaled2, index=data2_prescaled.index, columns=data2_prescaled.columns)

In [57]:
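# Note: this previews scaled1 from the second dataset; scaled2.head() would show the all-scaled frame
scaled1.head()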

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,HalfBath,TotRmsAbvGrd,Fireplaces,GarageArea,WoodDeckSF,OpenPorchSF,House_Age,Remod_Age,Garage_Age,Remod_Age_Avg
0,0.279857,-0.217666,0.888197,0.682376,-0.393202,-0.754347,1.344658,0.692021,1.167737,0.875418,1.27376,1.20827,-0.875318,0.460518,-0.820323,0.436367,-1.024963,-0.869248,-0.821371,-1.058828
1,0.760132,0.078136,-0.619443,1.339277,0.679639,0.478393,-0.786663,-0.376695,-0.781921,0.875418,-0.73493,-0.203655,0.764281,0.024408,1.98501,-0.768407,-0.106294,0.423131,0.219722,0.125345
2,0.375912,0.502548,0.626668,0.151058,-0.224084,-0.560023,1.374606,0.873321,1.167737,0.875418,1.27376,-0.203655,0.764281,0.757865,-0.820323,0.06111,-0.956913,-0.821382,-0.744253,-0.993041
3,0.888205,1.276779,2.072771,0.559207,0.37047,0.123145,1.8413,1.856159,1.167737,0.875418,1.27376,1.914232,0.764281,1.887784,0.98714,0.890627,-0.922889,-0.72565,-0.705694,-0.927253
4,0.60004,0.202631,0.811277,2.283573,1.800044,1.790077,-0.786663,0.653853,1.167737,0.875418,-0.73493,0.502307,0.764281,0.896627,1.580214,0.357366,-1.058988,-0.96498,-0.85993,-1.124615


In [58]:
data_preprocessed2 = pd.concat([data['SalePrice'],scaled2],axis=1)

In [59]:
data_preprocessed2.head()

Unnamed: 0,SalePrice,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,...,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MiscFeature_None,MiscFeature_Shed,OverallQual,OverallCond
0,208500,0.279857,-0.217666,0.888197,0.682376,-0.393202,-0.754347,1.344658,0.692021,1.167737,...,0.298463,-0.190953,-0.195847,-0.348284,-0.078706,0.478737,0.172891,-0.172891,0.762179,-0.488014
1,181500,0.760132,0.078136,-0.619443,1.339277,0.679639,0.478393,-0.786663,-0.376695,-0.781921,...,0.298463,-0.190953,-0.195847,-0.348284,-0.078706,0.478737,0.172891,-0.172891,0.011878,2.236397
2,223500,0.375912,0.502548,0.626668,0.151058,-0.224084,-0.560023,1.374606,0.873321,1.167737,...,0.298463,-0.190953,-0.195847,-0.348284,-0.078706,0.478737,0.172891,-0.172891,0.762179,-0.488014
3,250000,0.888205,1.276779,2.072771,0.559207,0.37047,0.123145,1.8413,1.856159,1.167737,...,0.298463,-0.190953,-0.195847,-0.348284,-0.078706,0.478737,0.172891,-0.172891,1.512479,-0.488014
4,307000,0.60004,0.202631,0.811277,2.283573,1.800044,1.790077,-0.786663,0.653853,1.167737,...,0.298463,-0.190953,-0.195847,-0.348284,-0.078706,0.478737,0.172891,-0.172891,1.512479,-0.488014
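
In this third variant the one-hot columns pass through StandardScaler as well, which is why the preview above shows values such as 0.298463 for PavedDrive_Y instead of 0 or 1. The following sketch, using a made-up indicator column, illustrates the effect: the column still takes only two distinct values, but they are shifted and rescaled so the column has zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A made-up one-hot column: three rows have the category, one does not.
indicator = np.array([[1.0], [1.0], [1.0], [0.0]])

scaled_indicator = StandardScaler().fit_transform(indicator).ravel()

# Still only two distinct values, now centred on zero with unit variance.
print(np.unique(scaled_indicator))
```

Whether scaling indicator columns helps depends on the model; it matters mostly for regularized or distance-based methods, so treating it as one of several variants to compare seems reasonable.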


In [60]:
data_preprocessed2.shape

(1137, 140)

In [61]:
data_preprocessed2.head().T

Unnamed: 0,0,1,2,3,4
SalePrice,208500.0,181500.0,223500.0,250000.0,307000.0
LotFrontage,0.279857,0.760132,0.375912,0.888205,0.60004
LotArea,-0.217666,0.078136,0.502548,1.276779,0.202631
MasVnrArea,0.888197,-0.619443,0.626668,2.072771,0.811277
BsmtFinSF1,0.682376,1.339277,0.151058,0.559207,2.283573
TotalBsmtSF,-0.393202,0.679639,-0.224084,0.37047,1.800044
1stFlrSF,-0.754347,0.478393,-0.560023,0.123145,1.790077
2ndFlrSF,1.344658,-0.786663,1.374606,1.8413,-0.786663
GrLivArea,0.692021,-0.376695,0.873321,1.856159,0.653853
BsmtFullBath,1.167737,-0.781921,1.167737,1.167737,1.167737


In [62]:
#Saving preprocessed data to new csv
# Raw string so the backslashes in the Windows path are not read as escape sequences
datapath = r'C:\Springboard_\CapstoneTwo\Data Files'
save_file(data_preprocessed2, 'data_preprocessed_allscaled.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "C:\Springboard_\CapstoneTwo\Data Files\data_preprocessed_allscaled.csv"


## Fourth Dataset with no scaling, only one-hot encoding

In [63]:
data3 = data.drop(['SaleCondition','SaleType','Alley','Condition1','BldgType','Exterior1st','Exterior2nd','Foundation','Electrical','MSSubClass','MSZoning','BsmtFinType1','BsmtFinType2','Neighborhood'],axis=1)

In [64]:
data3.columns

Index(['SalePrice', 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType', 'ExterQual',
       'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating',
       'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
       'Fence', 'MiscFeature', 'OverallQual', 'OverallCond', 'House_Age',
       'Remod_Age', 'Garage_Age', 'Remod_Age_Avg'],
      dtype='object')

In [65]:
catfeatures = ['LotShape', 'LandContour', 'LotConfig',
       'LandSlope', 'HouseStyle', 'RoofStyle', 'MasVnrType',
       'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'Heating', 'HeatingQC', 'CentralAir',
       'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'Fence',
       'MiscFeature']
# Everything that is neither the target nor categorical is treated as numeric
numfeatures = data3.drop(columns=['SalePrice'] + catfeatures)

In [66]:
dummies3 = pd.get_dummies(data3[catfeatures])
dummies3.head()

Unnamed: 0,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,LotConfig_Corner,LotConfig_CulDSac,...,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MiscFeature_None,MiscFeature_Shed
0,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
1,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
2,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
3,1,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
4,0,0,0,1,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0


In [67]:
data3_prescaled = pd.concat([numfeatures,dummies3],axis=1)
data3_prescaled

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,...,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MiscFeature_None,MiscFeature_Shed
0,65.0,8450,196.0,706,856,856,854,1710,1,2,...,0,0,1,0,0,0,0,1,1,0
1,80.0,9600,0.0,978,1262,1262,0,1262,0,2,...,0,0,1,0,0,0,0,1,1,0
2,68.0,11250,162.0,486,920,920,866,1786,1,2,...,0,0,1,0,0,0,0,1,1,0
3,84.0,14260,350.0,655,1145,1145,1053,2198,1,2,...,0,0,1,0,0,0,0,1,1,0
4,75.0,10084,186.0,1369,1686,1694,0,1694,1,2,...,0,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1132,35.0,3675,80.0,547,547,1072,0,1072,1,1,...,0,0,1,0,0,0,0,1,1,0
1133,90.0,17217,0.0,0,1140,1140,0,1140,0,1,...,0,0,1,0,0,0,0,1,1,0
1134,62.0,7500,0.0,410,1221,1221,0,1221,1,2,...,0,0,1,0,0,0,0,1,1,0
1135,62.0,7917,0.0,0,953,953,694,1647,0,2,...,0,0,1,0,0,0,0,1,1,0


In [68]:
data_preprocessed3 = pd.concat([data['SalePrice'],data3_prescaled],axis=1)

In [69]:
data_preprocessed3.head()

Unnamed: 0,SalePrice,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,TotalBsmtSF,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,...,PavedDrive_N,PavedDrive_P,PavedDrive_Y,Fence_GdPrv,Fence_GdWo,Fence_MnPrv,Fence_MnWw,Fence_None,MiscFeature_None,MiscFeature_Shed
0,208500,65.0,8450,196.0,706,856,856,854,1710,1,...,0,0,1,0,0,0,0,1,1,0
1,181500,80.0,9600,0.0,978,1262,1262,0,1262,0,...,0,0,1,0,0,0,0,1,1,0
2,223500,68.0,11250,162.0,486,920,920,866,1786,1,...,0,0,1,0,0,0,0,1,1,0
3,250000,84.0,14260,350.0,655,1145,1145,1053,2198,1,...,0,0,1,0,0,0,0,1,1,0
4,307000,75.0,10084,186.0,1369,1686,1694,0,1694,1,...,0,0,1,0,0,0,0,1,1,0


In [70]:
data_preprocessed3.shape

(1137, 140)

In [71]:
data_preprocessed3.head().T

Unnamed: 0,0,1,2,3,4
SalePrice,208500.0,181500.0,223500.0,250000.0,307000.0
LotFrontage,65.0,80.0,68.0,84.0,75.0
LotArea,8450.0,9600.0,11250.0,14260.0,10084.0
MasVnrArea,196.0,0.0,162.0,350.0,186.0
BsmtFinSF1,706.0,978.0,486.0,655.0,1369.0
TotalBsmtSF,856.0,1262.0,920.0,1145.0,1686.0
1stFlrSF,856.0,1262.0,920.0,1145.0,1694.0
2ndFlrSF,854.0,0.0,866.0,1053.0,0.0
GrLivArea,1710.0,1262.0,1786.0,2198.0,1694.0
BsmtFullBath,1.0,0.0,1.0,1.0,1.0


In [74]:
#Saving preprocessed data to new csv
# Raw string so the backslashes in the Windows path are not read as escape sequences
datapath = r'C:\Springboard_\CapstoneTwo\Data Files'
save_file(data_preprocessed3, 'data_preprocessed_notscaled.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "C:\Springboard_\CapstoneTwo\Data Files\data_preprocessed_notscaled.csv"
