### Ames Data Exploration


In [11]:
import numpy as np
import pandas as pd
import itertools
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from amesHousing import cardinality_check, segment_features
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [16]:
ames = pd.read_csv('Data/train.csv')
ames.drop('Id', axis=1, inplace=True)
ames_target = ames.pop('SalePrice')

I've imported two helper functions that I wrote. cardinality_check builds a summary table of each feature's data type and number of unique values. segment_features uses that summary to separate features into numeric, categorical, and ordinal. For now, I'll group the ordinal variables with the categorical ones.

In [17]:
data_check = cardinality_check(ames)
data_check.head(5)

| | n_unique | pct_unique | data_type |
|---|---|---|---|
| LotArea | 1073 | 0.734932 | int64 |
| GrLivArea | 861 | 0.589726 | int64 |
| BsmtUnfSF | 780 | 0.534247 | int64 |
| 1stFlrSF | 753 | 0.515753 | int64 |
| TotalBsmtSF | 721 | 0.493836 | int64 |
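
The body of cardinality_check isn't shown in this notebook, so here is a minimal sketch of how a summary like the one above could be built. The name cardinality_check_sketch and the toy dataframe are hypothetical, and I'm assuming pct_unique is the number of distinct values divided by the number of rows, which is consistent with the output above.

```python
import pandas as pd

def cardinality_check_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for cardinality_check: one summary row per column."""
    n_unique = df.nunique()  # distinct non-null values per column
    summary = pd.DataFrame({
        'n_unique': n_unique,
        'pct_unique': n_unique / len(df),     # assumed definition: distinct values / rows
        'data_type': df.dtypes.astype(str),
    })
    # Highest-cardinality features first, as in the table above
    return summary.sort_values('n_unique', ascending=False)

# Toy data for illustration only (not the Ames data)
toy = pd.DataFrame({
    'area': [100, 250, 310, 475],
    'quality': [5, 5, 7, 7],
    'zone': ['RL', 'RM', 'FV', 'RL'],
})
toy_check = cardinality_check_sketch(toy)
print(toy_check)
```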


All features with an "object" data type will be classified as categorical. For int or float variables that are actually categorical, I defined a cutoff on the percent of unique values for that feature (pct_unique above), i.e. features with a low share of distinct values will be classified as ordinal and treated as categorical for this stage of the analysis. I went with a rather arbitrary cutoff of 1.5%, which works out to roughly 22 distinct values across the 1,460 training rows.

In [18]:
features = segment_features(ames, .015)
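
Since segment_features lives in my amesHousing module and isn't shown here, the rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: the name segment_features_sketch is hypothetical, I treat any non-numeric column as categorical, and I assume the cutoff is a strict "less than" comparison. The dict of numpy arrays mirrors how features['num'], features['cat'] and features['ord'] are used later on.

```python
import numpy as np
import pandas as pd

def segment_features_sketch(df: pd.DataFrame, cutoff: float) -> dict:
    """Hypothetical stand-in for segment_features."""
    pct_unique = df.nunique() / len(df)
    num, cat, ordinal = [], [], []
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            cat.append(col)        # non-numeric columns are categorical
        elif pct_unique[col] < cutoff:
            ordinal.append(col)    # numeric, but few distinct values (assumed strict <)
        else:
            num.append(col)        # numeric with many distinct values
    return {
        'num': np.array(num, dtype=object),
        'cat': np.array(cat, dtype=object),
        'ord': np.array(ordinal, dtype=object),
    }

# Toy data for illustration only: 200 rows, so a 1.5% cutoff means fewer than 3 distinct values
toy = pd.DataFrame({
    'area': np.arange(200),
    'rating': np.tile([1, 2], 100),
    'zone': np.tile(['RL', 'RM'], 100),
})
toy_features = segment_features_sketch(toy, 0.015)
print(toy_features)
```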

Next, I define a pipeline to prep the features for a model. I'll also take the extra step of recovering feature names from several of sklearn's transformers, which can be a bit of a process.

In [19]:
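# Numeric features: model-based imputation plus missing-value indicator columns
numeric_transform = Pipeline(steps=[
    ('impute', IterativeImputer(add_indicator=True)),
])

# Categorical and ordinal features: fill with the most frequent value, then one-hot encode
categorical_transform = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent', add_indicator=True)),
    ('one_hot', OneHotEncoder(handle_unknown='ignore')),
])

# Route each feature group to its pipeline (ordinal features are treated as categorical for now)
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transform, features['num']),
    ('categorical', categorical_transform,
     np.concatenate((features['cat'], features['ord']))),
])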

In [20]:
X = preprocessor.fit_transform(ames)

In [36]:
num_missing = preprocessor.named_transformers_['numeric'].named_steps['impute'].indicator_
cat_missing = preprocessor.named_transformers_['categorical'].named_steps['impute'].indicator_
print(num_missing.features_, cat_missing.features_)

[ 8 14 15] [ 9 13 14 19 20 22 24 25 27 28 30 31 32 37 38 41]


Scikit-learn's missing indicator reports the features that had missing values by column index (its features_ attribute) rather than by name. As a sanity check, I'll take a look at the indices returned and cross-reference them with the original data set to make sure everything is working properly.
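
To make the index-based reporting concrete, here is a small sketch on toy data (the array below is made up for illustration). With its default settings, MissingIndicator only tracks columns that contained missing values during fit, and features_ holds their positions.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy data for illustration only: columns 1 and 2 contain a NaN, column 0 does not
X_toy = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

indicator = MissingIndicator()   # default features='missing-only'
mask = indicator.fit_transform(X_toy)

print(indicator.features_)       # positions of the columns that had missing values
print(mask.shape)                # one boolean column per flagged feature
```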

In [32]:
num_feature_df = ames[features['num']]
num_feature_df.isna().sum()

LotArea            0
GrLivArea          0
BsmtUnfSF          0
1stFlrSF           0
TotalBsmtSF        0
BsmtFinSF1         0
GarageArea         0
2ndFlrSF           0
MasVnrArea         8
WoodDeckSF         0
OpenPorchSF        0
BsmtFinSF2         0
EnclosedPorch      0
YearBuilt          0
LotFrontage      259
GarageYrBlt       81
ScreenPorch        0
YearRemodAdd       0
LowQualFinSF       0
dtype: int64

In [34]:
features['num'][num_missing.features_]

array(['MasVnrArea', 'LotFrontage', 'GarageYrBlt'], dtype=object)

As you can see, the indices from the missing indicator match the three numeric features that actually have missing values.

According to scikit-learn's documentation, when the "add_indicator" keyword arg is set on one of its imputers, a MissingIndicator transform will "stack onto" the output of that imputer. So for the numeric variables, we should get three indicator columns appended to the imputed numeric data.
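
A quick way to convince yourself of the stacking behaviour is to try it on a toy array. The sketch below uses SimpleImputer rather than IterativeImputer so the imputed values are easy to predict; the column names and the "_missing" suffix are hypothetical, chosen only to show one way of naming the appended columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data for illustration only; names are hypothetical
names = np.array(['a', 'b', 'c'], dtype=object)
X_toy = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
])

imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X_toy)

# The imputed columns come first, then one indicator column per feature in features_
indicator_names = [f'{name}_missing' for name in names[imputer.indicator_.features_]]
out_names = list(names) + indicator_names

print(X_out.shape)
print(out_names)
```

If the same logic holds for the numeric pipeline above, the 19 numeric features would become 22 columns, with the last three flagging missing MasVnrArea, LotFrontage and GarageYrBlt values.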