# 15. AMES HOUSING: FEATURE ENGINEERING
---

The goal of this chapter is to:
- deal with missing values
- create new features
- scale numerical features
- encode categorical features
- create a preprocessing pipeline ready for model testing

## 1. Introducing the Data

In [1]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", 99)
pd.set_option("display.max_rows", 999)
pd.set_option('display.precision', 3)

ames = pd.read_csv('data/Ames_Housing1_train')
print(ames.shape)
ames.head()

(2344, 80)


Unnamed: 0,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,20,RL,80.0,10400.0,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,1Story,7,5,1976,1976,Gable,CompShg,HdBoard,HdBoard,BrkFace,189.0,TA,TA,CBlock,Gd,TA,No,Unf,0.0,Unf,0.0,1090.0,1090.0,GasA,TA,Y,SBrkr,1370.0,0.0,0.0,1370.0,0.0,0.0,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2.0,479.0,TA,TA,Y,0.0,0.0,0.0,0.0,0.0,0.0,,MnPrv,,0.0,6,2009,WD,Family,152000.0
1,60,RL,,28698.0,Pave,,IR2,Low,AllPub,CulDSac,Sev,ClearCr,Norm,Norm,1Fam,2Story,5,5,1967,1967,Flat,Tar&Grv,Plywood,Plywood,,0.0,TA,TA,PConc,TA,Gd,Gd,LwQ,249.0,ALQ,764.0,0.0,1013.0,GasA,TA,Y,SBrkr,1160.0,966.0,0.0,2126.0,0.0,1.0,2,1,3,1,TA,7,Min2,0,,Attchd,1967.0,Fin,2.0,538.0,TA,TA,Y,486.0,0.0,0.0,0.0,225.0,0.0,,,,0.0,6,2009,WD,Abnorml,185000.0
2,90,RL,70.0,9842.0,Pave,,Reg,Lvl,AllPub,FR2,Gtl,NAmes,Norm,Norm,Duplex,1Story,4,5,1962,1962,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,TA,Slab,,,,,0.0,,0.0,0.0,0.0,GasA,TA,Y,SBrkr,1224.0,0.0,0.0,1224.0,0.0,0.0,2,0,2,2,TA,6,Typ,0,,CarPort,1962.0,Unf,2.0,462.0,TA,TA,Y,0.0,0.0,0.0,0.0,0.0,0.0,,,,0.0,3,2007,WD,Normal,101800.0
3,90,RL,60.0,7200.0,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,Duplex,1Story,4,5,1949,1950,Gable,CompShg,BrkFace,Stone,,0.0,TA,TA,Slab,,,,,0.0,,0.0,0.0,0.0,Wall,Fa,N,FuseF,1040.0,0.0,0.0,1040.0,0.0,0.0,2,0,2,2,TA,6,Typ,0,,Detchd,1956.0,Unf,2.0,420.0,TA,TA,Y,0.0,0.0,0.0,0.0,0.0,0.0,,,,0.0,6,2009,WD,Normal,90000.0
4,190,RM,63.0,7627.0,Pave,,Reg,Lvl,AllPub,Corner,Gtl,OldTown,Artery,Norm,2fmCon,2Story,4,6,1920,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,Fa,TA,BrkTil,Fa,Po,No,Unf,0.0,Unf,0.0,600.0,600.0,GasA,Gd,N,SBrkr,1101.0,600.0,0.0,1701.0,0.0,0.0,2,0,4,2,Fa,8,Typ,0,,,,,0.0,0.0,,,N,0.0,0.0,148.0,0.0,0.0,0.0,,,,0.0,10,2009,WD,Normal,94550.0


## 2. Dealing with Continuous Missing Values

In [2]:
continuous = ['Lot Frontage', 'Lot Area', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',
              'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Gr Liv Area',
              'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Low Qual Fin SF', 'Enclosed Porch', 
              '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val', 'SalePrice']
cont_df = ames[continuous].copy()
cont_null_counts = cont_df.isnull().sum()
cont_null = cont_df[cont_null_counts[cont_null_counts!=0].index]
cont_null = cont_null.isnull().sum().sort_values()
cont_null

BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Garage Area        1
Mas Vnr Area      19
Lot Frontage     393
dtype: int64

In [3]:
print(ames['Lot Frontage'].mean())
ames.groupby(['MS SubClass'])['Lot Frontage'].mean()

69.2075858534085


MS SubClass
20     77.476
30     60.874
40     53.000
45     57.286
50     63.883
60     78.693
70     64.010
75     75.789
80     78.657
85     72.296
90     71.425
120    44.248
150       NaN
160    27.556
180    27.000
190    68.935
Name: Lot Frontage, dtype: float64

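For `Lot Frontage`, which is missing 393 values, we group by `MS SubClass` and fill each missing value with the mean of its group. If the group mean doesn't exist (subclass 150 has no recorded `Lot Frontage` values, so its mean is `NaN`), we fall back to the mean of the entire column. The other six columns are missing at most 19 values each, so the column mean is enough for them.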
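
As a minimal sketch of the group-then-fallback idea, the toy frame below (made-up `group` and `value` columns, not the Ames data) has one group with no observed values at all, so its missing entry ends up with the overall mean instead of a group mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "c" has no observed value at all.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "c"],
    "value": [10.0, np.nan, 20.0, 4.0, np.nan, np.nan],
})

# One group mean per row, aligned to the original index (NaN for group "c").
group_mean = toy.groupby("group")["value"].transform("mean")

# First try the group mean, then fall back to the mean of the observed values.
filled = toy["value"].fillna(group_mean).fillna(toy["value"].mean())
print(filled)
```

Groups `a` and `b` are filled with their own means, and only the row in group `c` receives the column mean.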

In [4]:
cont_miss_cols = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF',
                 'Total Bsmt SF', 'Garage Area', 'Mas Vnr Area']
for col in cont_miss_cols:
    cont_df[col] = cont_df[col].fillna(cont_df[col].mean())
    
cont_df[cont_miss_cols].isnull().sum().sum()

0

In [5]:
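# transform('mean') returns one subclass mean per row, aligned to the original index
subclass_mean = ames.groupby('MS SubClass')['Lot Frontage'].transform('mean')
cont_df['Lot Frontage'] = ames['Lot Frontage'].fillna(subclass_mean)
# subclasses with no frontage data at all fall back to the column mean
cont_df['Lot Frontage'] = cont_df['Lot Frontage'].fillna(
    cont_df['Lot Frontage'].mean())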
cont_df['Lot Frontage'].isnull().sum()

0

## 3. Continuous Missing Values Transformers

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

class ContMissFiller(BaseEstimator, TransformerMixin):    
    def __init__(self):
        pass  # no hyperparameters to store
        
    def fit(self, df, y = None):               
        return self

    def transform(self, df):        
        df = df.copy()
        miss_cols = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF',
                 'Total Bsmt SF', 'Garage Area', 'Mas Vnr Area']
        for col in miss_cols:
            df[col] = df[col].fillna(df[col].mean())        
        return df
    
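class LotFrontFiller(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass  # no hyperparameters to store

    def fit(self, df, y=None):
        return self

    def transform(self, df):
        df = df.copy()
        # column mean before any filling, used as the fallback below
        mean = df['Lot Frontage'].mean()
        # transform('mean') returns one subclass mean per row, aligned to df's index
        group_mean = df.groupby('MS SubClass')['Lot Frontage'].transform('mean')
        df['Lot Frontage'] = df['Lot Frontage'].fillna(group_mean)
        # subclasses with no frontage data fall back to the column mean
        df['Lot Frontage'] = df['Lot Frontage'].fillna(mean)
        return df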

In [7]:
# pipeline and testing
from sklearn.pipeline import Pipeline

test_pipe = Pipeline(steps = [
    ('filler1', ContMissFiller()),
    ('filler2', LotFrontFiller())
])
test_df = ames[continuous + ['MS SubClass']]
test_df_ = test_pipe.fit_transform(test_df)
test_df_.isnull().sum()

Lot Frontage       0
Lot Area           0
Mas Vnr Area       0
BsmtFin SF 1       0
BsmtFin SF 2       0
Bsmt Unf SF        0
Total Bsmt SF      0
1st Flr SF         0
2nd Flr SF         0
Gr Liv Area        0
Garage Area        0
Wood Deck SF       0
Open Porch SF      0
Low Qual Fin SF    0
Enclosed Porch     0
3Ssn Porch         0
Screen Porch       0
Pool Area          0
Misc Val           0
SalePrice          0
MS SubClass        0
dtype: int64
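
One caveat: both transformers above compute their means inside `transform`, so a pipeline applied to held-out data would use that data's own statistics. A possible variant, sketched below under assumptions, learns the means in `fit` and reuses them in `transform`. The class name `GroupMeanImputer`, its parameters, and the toy frames are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learn group means in fit, reuse them in transform."""

    def __init__(self, group_col="MS SubClass", target_col="Lot Frontage"):
        self.group_col = group_col
        self.target_col = target_col

    def fit(self, df, y=None):
        # Statistics come from the frame passed to fit (the training data) only.
        self.group_means_ = df.groupby(self.group_col)[self.target_col].mean()
        self.overall_mean_ = df[self.target_col].mean()
        return self

    def transform(self, df):
        df = df.copy()
        # Look up each row's learned group mean; unseen groups map to NaN.
        learned = df[self.group_col].map(self.group_means_)
        df[self.target_col] = (
            df[self.target_col].fillna(learned).fillna(self.overall_mean_)
        )
        return df


# Made-up training and held-out frames for illustration.
train = pd.DataFrame({"MS SubClass": [20, 20, 30, 30],
                      "Lot Frontage": [70.0, 80.0, 60.0, np.nan]})
new = pd.DataFrame({"MS SubClass": [20, 30, 150],
                    "Lot Frontage": [np.nan, np.nan, np.nan]})

imputer = GroupMeanImputer().fit(train)
new_filled = imputer.transform(new)
print(new_filled)
```

Here the held-out rows are filled from the training means, and the subclass that never appeared in training falls back to the training column mean.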

## 4. Dealing with Discrete Missing Values

In [8]:
discrete = ['Year Built', 'Year Remod/Add', 'Garage Yr Blt', 'Garage Cars', 'Bsmt Full Bath',
           'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr',
           'TotRms AbvGrd', 'Fireplaces', 'Mo Sold', 'Yr Sold']
disc_df = ames[discrete].copy()
disc_null_counts = disc_df.isnull().sum()
disc_null = disc_df[disc_null_counts[disc_null_counts!=0].index]
disc_null = disc_null.isnull().sum().sort_values()
disc_null

Garage Cars         1
Bsmt Full Bath      1
Bsmt Half Bath      1
Garage Yr Blt     122
dtype: int64

For the three columns missing just one value each, we can simply fill in the column median. For `Garage Yr Blt`, which is missing 122 values, we can instead group by house type, `MS SubClass`, and fill in the median of each class. Let's first see what the data looks like:

In [17]:
df_GYB = pd.DataFrame()
df_GYB['MS_SC_Classes'] = ames.groupby(['MS SubClass'])['Garage Yr Blt'].median().astype(int).index
df_GYB['GYB_median'] = ames.groupby(['MS SubClass'])['Garage Yr Blt'].median().astype(int).values
df_GYB['GYB_Count'] = ames.groupby(['MS SubClass'])['Garage Yr Blt'].count().values
df_GYB['GYB_Null_Count'] = ames['Garage Yr Blt'].isnull().groupby(ames['MS SubClass']).sum().astype(int).values
df_GYB

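|   | MS_SC_Classes | GYB_median | GYB_Count | GYB_Null_Count |
|---|---|---|---|---|
| 0 | 20 | 1975 | 849 | 22 |
| 1 | 30 | 1938 | 92 | 21 |
| 2 | 40 | 1949 | 5 | 0 |
| 3 | 45 | 1950 | 11 | 3 |
| 4 | 50 | 1950 | 210 | 15 |
| 5 | 60 | 1999 | 445 | 1 |
| 6 | 70 | 1932 | 104 | 8 |
| 7 | 75 | 1941 | 19 | 2 |
| 8 | 80 | 1976 | 94 | 0 |
| 9 | 85 | 1977 | 39 | 2 |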


In [20]:
print(df_GYB['GYB_median'].median())
print(ames['Garage Yr Blt'].median())

1975.5
1978.0


- First, the two medians above differ, and that is expected: `1975.5` is the median of the per-subclass medians, which counts every subclass once regardless of its size, while `1978` is the median of all non-null values in the column. Nulls are not the cause, since pandas skips them when computing a median.
- Second, imputing the column median, `1978`, into every null would distort the data more than imputing by subclass. Take `MS SubClass=30`: it is missing 21 of its 113 values (about 19%), and replacing them with `1978` would fit far worse than replacing them with `1938`, the subclass median.
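
A small made-up example can show why a median of group medians need not match the overall median, even when no nulls are involved. The frame and its `group` and `year` columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: one large group of recent years, one tiny group with an old year.
toy = pd.DataFrame({
    "group": ["a"] * 5 + ["b"],
    "year": [2000, 2001, 2002, 2003, 2004, 1950],
})

# Each group counts once here, no matter how many rows it has.
group_medians = toy.groupby("group")["year"].median()
median_of_medians = group_medians.median()

# Every row counts once here.
overall_median = toy["year"].median()

print(median_of_medians, overall_median)

# pandas skips NaN by default, so a null does not shift the median by itself.
with_null = pd.Series([1.0, float("nan"), 3.0])
print(with_null.median())
```

The tiny group pulls the median of medians far below the overall median, which is the same effect seen between the two values printed above for `Garage Yr Blt`.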

In [32]:
df_2 = ames[ames['MS SubClass']==30]
df_2a = df_2['Garage Yr Blt'].fillna(df_2['Garage Yr Blt'].median())
df_2a.median()

1938.5
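
Putting the strategy from this section together, a minimal sketch on a made-up frame could look like the following. The toy values are invented and only the column names come from the dataset. The column-median fallback for subclasses with no garage years is an assumption carried over from the `Lot Frontage` approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using the dataset's column names.
toy = pd.DataFrame({
    "MS SubClass":   [20, 20, 20, 30, 30, 30],
    "Garage Cars":   [2.0, 2.0, np.nan, 1.0, 1.0, 2.0],
    "Garage Yr Blt": [1975.0, 1980.0, np.nan, 1936.0, np.nan, 1940.0],
})

# Column median for a column that is missing only a value or two.
toy["Garage Cars"] = toy["Garage Cars"].fillna(toy["Garage Cars"].median())

# Subclass median for the garage year, with the column median as a fallback.
overall = toy["Garage Yr Blt"].median()
by_class = toy.groupby("MS SubClass")["Garage Yr Blt"].transform("median")
toy["Garage Yr Blt"] = toy["Garage Yr Blt"].fillna(by_class).fillna(overall)
print(toy)
```

A median over an even number of years can land on a half year (1977.5 in this toy frame), so rounding may be wanted before treating the column as an integer year.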
