# Data Cleaning & Pre-processing

**Goal**: Maximize meaningful information capture, account for multicollinearity, and minimize the number of dummified columns ahead of the regression phase.

#### 1. Feature Classification
- **Numeric**: continuous, converted to float
- **Ordinal**: label/integer encoded to preserve the information carried by the category ordering
- **Categorical**: certain categories with very skewed distributions are simplified to binary values (0 or 1)
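
As a rough illustration of the ordinal and binary treatments, here is a minimal sketch on a toy frame. The column names mirror the Ames data, but the rows are made up, the `quality_scale` mapping is an assumed worst-to-best ordering that should be checked against the data dictionary, and `ExterQual_ord` and `Street_paved` are hypothetical names.

```python
import pandas as pd

# Toy frame: column names mirror the Ames data, values are invented for illustration.
toy = pd.DataFrame({
    "ExterQual": ["TA", "Gd", "Ex", "Fa", "TA"],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave"],
})

# Assumed quality ordering (worst to best); confirm against the data dictionary.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
toy["ExterQual_ord"] = toy["ExterQual"].map(quality_scale)

# Collapse a heavily skewed categorical into a single 0/1 flag.
toy["Street_paved"] = (toy["Street"] == "Pave").astype(int)

print(toy)
```

The integer codes keep the ordering in one column, whereas dummifying the same feature would spend one column per level.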

#### 2. Feature combining & ‘new’ features:
- Combine features that carry overlapping (“repeated”) information into a single feature
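
To make the idea concrete, here is a small sketch of one way several overlapping columns could be folded into one. The `TotalBath` feature and the 0.5 weight for half baths are hypothetical choices for illustration, not necessarily the combinations used in this notebook.

```python
import pandas as pd

# Toy rows using the four bathroom columns from the Ames data.
toy = pd.DataFrame({
    "FullBath": [1, 2],
    "HalfBath": [1, 0],
    "BsmtFullBath": [0.0, 1.0],
    "BsmtHalfBath": [0.0, 1.0],
})

# Hypothetical combined feature: each half bath counts as 0.5 of a bath.
toy["TotalBath"] = (
    toy["FullBath"] + toy["BsmtFullBath"]
    + 0.5 * (toy["HalfBath"] + toy["BsmtHalfBath"])
)

# Drop the source columns so the overlapping information appears only once.
combined = toy.drop(columns=["FullBath", "HalfBath", "BsmtFullBath", "BsmtHalfBath"])
print(combined)
```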

#### 3. Summary: 
- Made ordinal: 16 features
- Binary classification: 5 features
- Combined features: 16 → 5 features
- ***50 features pre-dummification → 110 'features' post-dummification***
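
The jump in column count comes from each categorical feature expanding into one column per level. The sketch below shows the effect on a toy frame with `pd.get_dummies`. Using `drop_first=True` is an assumption here: it is one common way to avoid perfectly collinear dummy columns, and may not be the setting used later in this project.

```python
import pandas as pd

# Toy frame with two categorical columns (three levels each) and one numeric column.
toy = pd.DataFrame({
    "LotConfig": ["Corner", "Inside", "CulDSac", "Inside"],
    "Alley": ["NA", "Grvl", "Pave", "NA"],
    "GrLivArea": [856, 1394, 1610, 1573],
})

# drop_first=True removes one level per feature, so k levels yield k - 1 columns.
dummies = pd.get_dummies(toy, columns=["LotConfig", "Alley"], drop_first=True)

print(toy.shape[1], "->", dummies.shape[1])
print(list(dummies.columns))
```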

#### 4. Next: Exploratory Data Analysis

In [58]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

#load the raw data; .iloc[:, 1:] drops the first column (the saved index)
data = pd.read_csv('../data/raw/Ames_HousePrice.csv').iloc[:, 1:]

In [59]:
data.shape

(2580, 81)

In [60]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2580 entries, 0 to 2579
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PID            2580 non-null   int64  
 1   GrLivArea      2580 non-null   int64  
 2   SalePrice      2580 non-null   int64  
 3   MSSubClass     2580 non-null   int64  
 4   MSZoning       2580 non-null   object 
 5   LotFrontage    2118 non-null   float64
 6   LotArea        2580 non-null   int64  
 7   Street         2580 non-null   object 
 8   Alley          168 non-null    object 
 9   LotShape       2580 non-null   object 
 10  LandContour    2580 non-null   object 
 11  Utilities      2580 non-null   object 
 12  LotConfig      2580 non-null   object 
 13  LandSlope      2580 non-null   object 
 14  Neighborhood   2580 non-null   object 
 15  Condition1     2580 non-null   object 
 16  Condition2     2580 non-null   object 
 17  BldgType       2580 non-null   object 
 18  HouseSty

In [61]:
data.describe()

Unnamed: 0,PID,GrLivArea,SalePrice,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,2580.0,2580.0,2580.0,2580.0,2118.0,2580.0,2580.0,2580.0,2580.0,2580.0,2566.0,2579.0,2579.0,2579.0,2579.0,2580.0,2580.0,2580.0,2578.0,2578.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2451.0,2579.0,2579.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0,2580.0
mean,714830000.0,1486.039922,178059.623256,57.69186,68.516053,10120.153488,6.046124,5.618605,1970.313953,1983.751938,99.308262,444.346258,53.238852,539.10159,1036.6867,1144.975194,336.820155,4.244574,0.435221,0.062064,1.550775,0.378295,2.850388,1.04031,6.387209,0.604264,1976.982048,1.747577,466.842575,95.919767,46.085271,23.214341,2.51124,16.200388,1.662016,48.731395,6.150775,2007.838372
std,188662600.0,488.650181,75031.089374,42.802105,22.835831,8126.937892,1.36759,1.122008,29.719705,20.490242,175.87233,429.334957,174.42392,425.199639,418.555417,375.958955,424.072452,44.403603,0.518827,0.244513,0.545825,0.499237,0.822863,0.20255,1.535186,0.648604,24.659801,0.738678,207.476842,129.052016,66.060664,64.107825,25.293935,56.824783,30.339396,498.725058,2.670647,1.312333
min,526301100.0,334.0,12789.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1895.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,531363000.0,1112.0,129975.0,20.0,57.0,7406.75,5.0,5.0,1953.0,1965.0,0.0,0.0,0.0,215.0,792.0,871.75,0.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1960.0,1.0,318.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0
50%,535454600.0,1436.0,159900.0,50.0,68.0,9391.0,6.0,5.0,1972.0,1992.0,0.0,384.0,0.0,448.0,979.0,1071.0,0.0,0.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1978.0,2.0,474.0,0.0,25.5,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,907181100.0,1733.0,209625.0,70.0,80.0,11494.0,7.0,6.0,1999.0,2003.0,158.0,732.0,0.0,784.0,1266.5,1364.0,703.0,0.0,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2000.0,2.0,576.0,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,1007100000.0,4676.0,755000.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,2288.0,1526.0,2336.0,3206.0,3820.0,1872.0,1064.0,3.0,2.0,4.0,2.0,6.0,3.0,13.0,4.0,2010.0,5.0,1488.0,1424.0,742.0,1012.0,508.0,576.0,800.0,15500.0,12.0,2010.0


# Part 1 — handling null values

In [62]:
#investigate nulls
data.isnull().sum().sort_values(ascending = False).head(15)

#number of nulls and corresponding feature
# PoolQC           2571
# MiscFeature      2483
# Alley            2412
# Fence            2055
# FireplaceQu      1241
# LotFrontage       462
# GarageFinish      129
# GarageQual        129
# GarageYrBlt       129
# GarageCond        129
# GarageType        127
# BsmtExposure       71
# BsmtFinType2       70
# BsmtFinType1       69
# BsmtCond           69
# BsmtQual           69
# MasVnrArea         14
# MasVnrType         14
# BsmtHalfBath        2
# BsmtFullBath        2
# GarageArea          1
# GarageCars          1
# Electrical          1
# BsmtUnfSF           1
# BsmtFinSF2          1
# BsmtFinSF1          1
# TotalBsmtSF         1

PoolQC          2571
MiscFeature     2483
Alley           2412
Fence           2055
FireplaceQu     1241
LotFrontage      462
GarageFinish     129
GarageQual       129
GarageYrBlt      129
GarageCond       129
GarageType       127
BsmtExposure      71
BsmtFinType2      70
BsmtFinType1      69
BsmtCond          69
dtype: int64

### LotFrontage

In [40]:
#Lot Frontage
data['LotFrontage'].value_counts().head()

#Examining only the null values
lot_nulls = data[data['LotFrontage'].isna()] #many properties have no recorded linear feet of street connected to the property

#These seem to be real properties, so let's set the NaNs to the average LotFrontage of the corresponding LotConfig type.
lot_nulls.head()

Unnamed: 0.1,Unnamed: 0,PID,GrLivArea,SalePrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,909176150,856,126000,30,RL,,7890,Pave,,Reg,Lvl,AllPub,Corner,Gtl,SWISU,Norm,Norm,1Fam,1Story,6,6,1939,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,CBlock,TA,TA,No,Rec,238.0,Unf,0.0,618.0,856.0,GasA,TA,Y,SBrkr,856,0,0,1.0,0.0,1,0,2,1,TA,4,Typ,1,Gd,Detchd,1939.0,Unf,2.0,399.0,TA,TA,Y,0,0,0,0,166,0,,,,0,3,2010,WD,Normal
13,14,535105100,1394,159000,20,RL,,9500,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609.0,Unf,0.0,785.0,1394.0,GasA,Gd,Y,SBrkr,1394,0,0,1.0,0.0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2.0,514.0,TA,TA,Y,0,76,0,0,185,0,,,,0,7,2009,WD,Normal
18,19,534152050,1610,205000,20,RL,,10603,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,1Story,6,7,1977,2001,Gable,CompShg,Plywood,Plywood,BrkFace,28.0,TA,TA,PConc,TA,TA,Mn,ALQ,1200.0,Unf,0.0,410.0,1610.0,GasA,Gd,Y,SBrkr,1610,0,0,1.0,0.0,2,0,3,1,Gd,6,Typ,2,TA,Attchd,1977.0,RFn,2.0,480.0,TA,TA,Y,168,68,0,0,0,0,,,,0,2,2010,WD,Normal
27,28,533221090,1573,177500,160,FV,,2117,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,Twnhs,2Story,6,5,2000,2000,Gable,CompShg,MetalSd,MetalSd,BrkFace,216.0,Gd,TA,PConc,Gd,TA,No,GLQ,378.0,Unf,0.0,378.0,756.0,GasA,Ex,Y,SBrkr,769,804,0,0.0,0.0,2,1,3,1,Gd,5,Typ,0,,Detchd,2000.0,Unf,2.0,440.0,TA,TA,Y,0,32,0,0,0,0,,,,0,6,2010,WD,Normal
28,29,534128010,2090,200000,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859.0,BLQ,32.0,216.0,1107.0,GasA,Ex,Y,SBrkr,1107,983,0,1.0,0.0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973.0,RFn,2.0,484.0,TA,TA,Y,235,204,228,0,0,0,,,Shed,350,11,2009,WD,Normal


In [65]:
# Group lots by configuration
data.groupby(['LotConfig']).agg({'LotFrontage' : 'mean'})

LotConfig,LotFrontage
Corner,81.468023
CulDSac,56.45679
FR2,59.413793
FR3,79.3
Inside,66.633846


In [72]:
#impute NaNs with the (rounded) mean LotFrontage of the matching LotConfig type, taken from the groupby above

data.loc[(data['LotFrontage'].isna()) & (data['LotConfig'] == 'Corner'), 'LotFrontage'] = 81.47
data.loc[(data['LotFrontage'].isna()) & (data['LotConfig'] == 'CulDSac'), 'LotFrontage'] = 56.46
data.loc[(data['LotFrontage'].isna()) & (data['LotConfig'] == 'FR2'), 'LotFrontage'] = 59.41
data.loc[(data['LotFrontage'].isna()) & (data['LotConfig'] == 'FR3'), 'LotFrontage'] = 79.30
data.loc[(data['LotFrontage'].isna()) & (data['LotConfig'] == 'Inside'), 'LotFrontage'] = 66.63

data['LotFrontage'].isna().sum()

0
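
The same group-wise imputation can be written without hard-coding the means. The sketch below uses `groupby(...).transform("mean")` on a toy frame. It is an alternative formulation, not the code that was run above, and it fills with the unrounded group means, so the imputed values would differ slightly from the two-decimal constants.

```python
import numpy as np
import pandas as pd

# Toy frame with missing LotFrontage values in two LotConfig groups.
toy = pd.DataFrame({
    "LotConfig": ["Corner", "Corner", "Inside", "Inside", "Inside"],
    "LotFrontage": [80.0, np.nan, 60.0, 70.0, np.nan],
})

# transform("mean") returns a Series aligned to the original rows,
# holding each row's group mean (NaNs are skipped when averaging).
group_mean = toy.groupby("LotConfig")["LotFrontage"].transform("mean")

# Fill only the missing entries with their group's mean.
toy["LotFrontage"] = toy["LotFrontage"].fillna(group_mean)

print(toy)
```

This version adapts automatically if the data or the set of `LotConfig` levels changes.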

In [73]:
#taking a look at outliers
data['LotFrontage'].sort_values().tail()

74      168.0
1146    174.0
527     195.0
2008    200.0
981     313.0
Name: LotFrontage, dtype: float64

In [75]:
#note the 313 ft outlier as a candidate to drop later
LF_maybe_drop = data[data["LotFrontage"] == 313]

### Alley

In [81]:
data['Alley'].isnull().sum() #2412 null values
data['Alley'].value_counts()

Grvl    105
Pave     63
Name: Alley, dtype: int64

In [82]:
#make note to one-hot encode this later as having an alley or no alley
data['Alley'] = data['Alley'].replace(np.nan, 'NA')

In [83]:
data['Alley'].value_counts()

NA      2412
Grvl     105
Pave      63
Name: Alley, dtype: int64

### MasVnrType & MasVnrArea

In [88]:
data[data['MasVnrType'].isna()].head()

Unnamed: 0,PID,GrLivArea,SalePrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition


In [87]:
data['MasVnrType'] = data['MasVnrType'].replace(np.nan, 'None')
data['MasVnrArea'] = data['MasVnrArea'].replace(np.nan, 0)

### Basement related features

In [96]:
#count nulls per basement column; isnull must be called, i.e. isnull(), before .sum()
data[['BsmtExposure', 'BsmtFinType2', 'BsmtFinType1', 'BsmtCond', 'BsmtQual', 'BsmtHalfBath', 'BsmtFullBath', 'BsmtUnfSF', 
      'BsmtFinSF2', 'BsmtFinSF1', 'TotalBsmtSF']].isnull().sum()
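
Following the pattern used above for Alley and the masonry veneer columns, one possible next step for the basement columns is sketched below on a toy frame. It assumes a null means the house has no basement, which has not been verified here. The `'NA'` label and the `bsmt_cat` / `bsmt_num` lists are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy rows; the middle row stands in for a house with no basement data.
toy = pd.DataFrame({
    "BsmtQual": ["TA", np.nan, "Gd"],
    "BsmtCond": ["TA", np.nan, "TA"],
    "TotalBsmtSF": [856.0, np.nan, 1394.0],
    "BsmtFullBath": [1.0, np.nan, 0.0],
})

bsmt_cat = ["BsmtQual", "BsmtCond"]          # categorical basement columns
bsmt_num = ["TotalBsmtSF", "BsmtFullBath"]   # numeric basement columns

# Assumption: a null means "no basement", so label categoricals and zero out numerics.
toy[bsmt_cat] = toy[bsmt_cat].fillna("NA")
toy[bsmt_num] = toy[bsmt_num].fillna(0)

print(toy.isnull().sum())
```

Before applying this to the real data, it would be worth checking that rows with null categoricals also have zero or null basement square footage.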