This notebook builds a reliable preprocessing pipeline so that experiments can be set up properly.

Custom transformers and imputers are implemented here. The proposed transformations come from the Cleaning and Transformation phase.

In [1]:
import pandas as pd
import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer

import warnings

import sys
sys.path.append("..")
import source.utility as ut

pd.set_option('display.max_columns', 500)  # full option name; the short form is ambiguous in recent pandas

In [3]:
df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')

train_set, test_set = ut.make_test(df_train, 
                                test_size=0.2, random_state=654, 
                                strat_feat='Neighborhood')

The main problem I have with the available tools is that we lose the DataFrame structure in favor of lighter, more memory-efficient numpy arrays. For example, this is what a standard scaler does to a simple DataFrame:

In [4]:
tmp = train_set[['GrLivArea', 'TotRmsAbvGrd']].copy()
tmp.head()

Unnamed: 0,GrLivArea,TotRmsAbvGrd
68,747,4
1097,1088,5
219,1248,5
901,1306,5
505,1960,10


In [5]:
tmp = StandardScaler().fit_transform(tmp)

tmp

array([[-1.48066336, -1.558394  ],
       [-0.83100519, -0.94573692],
       [-0.52618025, -0.94573692],
       ...,
       [ 0.28541617,  0.89223433],
       [-0.89197018, -0.94573692],
       [-1.27681168, -0.94573692]])

However, to inspect, modify, or automate the pipeline efficiently, it is very useful to preserve the DataFrame structure. To achieve that, we can wrap the scaler in a class that takes care of the DataFrame structure:

In [6]:
class df_scaler(TransformerMixin):
    def __init__(self, method='standard'):
        self.scl = None
        self.scale_ = None
        self.method = method
        if self.method == 'standard':
            self.mean_ = None
        elif self.method == 'robust':
            self.center_ = None

    def fit(self, X, y=None):
        if self.method == 'standard':
            self.scl = StandardScaler()
            self.scl.fit(X)
            self.mean_ = pd.Series(self.scl.mean_, index=X.columns)
        elif self.method == 'robust':
            self.scl = RobustScaler()
            self.scl.fit(X)
            self.center_ = pd.Series(self.scl.center_, index=X.columns)
        self.scale_ = pd.Series(self.scl.scale_, index=X.columns)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xscl = self.scl.transform(X)
        Xscaled = pd.DataFrame(Xscl, index=X.index, columns=X.columns)
        return Xscaled

In [7]:
tmp = train_set[['GrLivArea', 'TotRmsAbvGrd']].copy()

scaler = df_scaler()

tmp = scaler.fit_transform(tmp)

tmp.head()

Unnamed: 0,GrLivArea,TotRmsAbvGrd
68,-1.480663,-1.558394
1097,-0.831005,-0.945737
219,-0.52618,-0.945737
901,-0.415681,-0.945737
505,0.830291,2.117548


The advantage of this approach is that the next element of the pipeline can wrangle the data without much extra work. Moreover, as we see below, it is not difficult to preserve the fitted attributes of the original sklearn scaler:

In [8]:
scaler.mean_

GrLivArea       1524.187500
TotRmsAbvGrd       6.543664
dtype: float64

In [9]:
scaler.scale_

GrLivArea       524.891424
TotRmsAbvGrd      1.632234
dtype: float64
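
Because the fitted statistics are stored as pandas Series indexed by column name, they can be combined directly with the original DataFrame. The following is a minimal, self-contained sketch on toy data (the frame `toy` and its columns `a` and `b` are made up for illustration) that reproduces the scaling by hand and compares it with the output of `StandardScaler`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, not taken from the housing dataset
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

scl = StandardScaler().fit(toy)

# Wrap the fitted statistics in Series, as df_scaler does
mean_ = pd.Series(scl.mean_, index=toy.columns)
scale_ = pd.Series(scl.scale_, index=toy.columns)

# Wrap the numpy output back into a DataFrame
scaled = pd.DataFrame(scl.transform(toy), index=toy.index, columns=toy.columns)

# pandas aligns each Series with the DataFrame columns by name
manual = (toy - mean_) / scale_

# Compare the manual computation with the scaler output
print(np.allclose(manual, scaled))
```

Keeping the statistics labeled by column also makes it easy to look up the mean or scale of a single feature by name.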

By using this concept, we can build all the other elements of the pipeline.

In [10]:
class feat_sel(BaseEstimator, TransformerMixin):
    '''
    This transformer selects either numerical or categorical features.
    In this way we can build separate pipelines for separate data types.
    '''
    def __init__(self, dtype='numeric'):
        self.dtype = dtype
 
    def fit( self, X, y=None ):
        return self 
    
    def transform(self, X, y=None):
        if self.dtype == 'numeric':
            num_cols = X.columns[X.dtypes != object].tolist()
            return X[num_cols]
        elif self.dtype == 'category':
            cat_cols = X.columns[X.dtypes == object].tolist()
            return X[cat_cols]


class df_imputer(TransformerMixin, BaseEstimator):
    '''
    Just a wrapper for the SimpleImputer that keeps the dataframe structure
    '''
    def __init__(self, strategy='mean'):
        self.strategy = strategy
        self.imp = None
        self.statistics_ = None

    def fit(self, X, y=None):
        self.imp = SimpleImputer(strategy=self.strategy)
        self.imp.fit(X)
        self.statistics_ = pd.Series(self.imp.statistics_, index=X.columns)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Ximp = self.imp.transform(X)
        Xfilled = pd.DataFrame(Ximp, index=X.index, columns=X.columns)
        return Xfilled

    
class df_scaler(TransformerMixin, BaseEstimator):
    '''
    Wrapper of StandardScaler or RobustScaler
    '''
    def __init__(self, method='standard'):
        self.scl = None
        self.scale_ = None
        self.method = method
        if self.method == 'standard':
            self.mean_ = None
        elif self.method == 'robust':
            self.center_ = None
        self.columns = None  # this is useful when it is the last step of a pipeline before the model

    def fit(self, X, y=None):
        if self.method == 'standard':
            self.scl = StandardScaler()
            self.scl.fit(X)
            self.mean_ = pd.Series(self.scl.mean_, index=X.columns)
        elif self.method == 'robust':
            self.scl = RobustScaler()
            self.scl.fit(X)
            self.center_ = pd.Series(self.scl.center_, index=X.columns)
        self.scale_ = pd.Series(self.scl.scale_, index=X.columns)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xscl = self.scl.transform(X)
        Xscaled = pd.DataFrame(Xscl, index=X.index, columns=X.columns)
        self.columns = X.columns
        return Xscaled

    def get_feature_names(self):
        return list(self.columns)
    
    
class dummify(TransformerMixin, BaseEstimator):
    '''
    Wrapper for get dummies
    '''
    def __init__(self, drop_first=False, match_cols=True):
        self.drop_first = drop_first
        self.columns = []  # needed to behave well with FeatureUnion
        self.match_cols = match_cols

    def fit(self, X, y=None):
        return self
    
    def match_columns(self, X):
        miss_train = list(set(X.columns) - set(self.columns))
        miss_test = list(set(self.columns) - set(X.columns))
        
        err = 0
        
        if len(miss_test) > 0:
            for col in miss_test:
                X[col] = 0  # insert a column for the missing dummy
                err += 1
        if len(miss_train) > 0:
            for col in miss_train:
                del X[col]  # delete the column of the extra dummy
                err += 1
                
        if err > 0:
            warnings.warn('The dummies in this set do not match the ones in the train set, we corrected the issue.',
                         UserWarning)
            
        return X
        
        
    
    def transform(self, X):
        X = pd.get_dummies(X, drop_first=self.drop_first)
        if (len(self.columns) > 0):
            if self.match_cols:
                X = self.match_columns(X)
        else:
            self.columns = X.columns
        return X
    
    def get_features_name(self):
        return self.columns

    
class general_cleaner(BaseEstimator, TransformerMixin):
    '''
    This class applies what we know from the documentation.
    It cleans some known missing values.
    It flags the missing values.

    This process is supposed to happen as the first step of any pipeline.

    TODO: decide how to drop the outliers as the target is created before this point
    '''
    def __init__(self, train=True):
        self.train = train  # same name as the argument, as sklearn's get_params expects
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        #LotFrontage
        X.loc[X.LotFrontage.isnull(), 'LotFrontage'] = 0
        #Alley
        X.loc[X.Alley.isnull(), 'Alley'] = "NoAlley"
        #MSSubClass
        X['MSSubClass'] = X['MSSubClass'].astype(str)
        #MissingBasement
        fil = ((X.BsmtQual.isnull()) & (X.BsmtCond.isnull()) & (X.BsmtExposure.isnull()) &
              (X.BsmtFinType1.isnull()) & (X.BsmtFinType2.isnull()))
        fil1 = ((X.BsmtQual.notnull()) | (X.BsmtCond.notnull()) | (X.BsmtExposure.notnull()) |
              (X.BsmtFinType1.notnull()) | (X.BsmtFinType2.notnull()))
        X.loc[fil1, 'MisBsm'] = 0
        X.loc[fil, 'MisBsm'] = 1 # made explicit for safety
        #BsmtQual
        X.loc[fil, 'BsmtQual'] = "NoBsmt" #missing basement
        #BsmtCond
        X.loc[fil, 'BsmtCond'] = "NoBsmt" #missing basement
        #BsmtExposure
        X.loc[fil, 'BsmtExposure'] = "NoBsmt" #missing basement
        #BsmtFinType1
        X.loc[fil, 'BsmtFinType1'] = "NoBsmt" #missing basement
        #BsmtFinType2
        X.loc[fil, 'BsmtFinType2'] = "NoBsmt" #missing basement
        #BsmtFinSF1
        X.loc[fil, 'BsmtFinSF1'] = 0 # No bsmt
        #BsmtFinSF2
        X.loc[fil, 'BsmtFinSF2'] = 0 # No bsmt
        #BsmtUnfSF
        X.loc[fil, 'BsmtUnfSF'] = 0 # No bsmt
        #TotalBsmtSF
        X.loc[fil, 'TotalBsmtSF'] = 0 # No bsmt
        #BsmtFullBath
        X.loc[fil, 'BsmtFullBath'] = 0 # No bsmt
        #BsmtHalfBath
        X.loc[fil, 'BsmtHalfBath'] = 0 # No bsmt
        #FireplaceQu
        X.loc[(X.Fireplaces == 0) & (X.FireplaceQu.isnull()), 'FireplaceQu'] = "NoFire" #missing
        #MisGarage
        fil = ((X.GarageYrBlt.isnull()) & (X.GarageType.isnull()) & (X.GarageFinish.isnull()) &
              (X.GarageQual.isnull()) & (X.GarageCond.isnull()))
        fil1 = ((X.GarageYrBlt.notnull()) | (X.GarageType.notnull()) | (X.GarageFinish.notnull()) |
              (X.GarageQual.notnull()) | (X.GarageCond.notnull()))
        X.loc[fil1, 'MisGarage'] = 0
        X.loc[fil, 'MisGarage'] = 1
        #GarageYrBlt
        X.loc[X.GarageYrBlt > 2200, 'GarageYrBlt'] = 2007 #correct mistake
        X.loc[fil, 'GarageYrBlt'] = 0
        #GarageType
        X.loc[fil, 'GarageType'] = "NoGrg" #missing garage
        #GarageFinish
        X.loc[fil, 'GarageFinish'] = "NoGrg" #missing
        #GarageQual
        X.loc[fil, 'GarageQual'] = "NoGrg" #missing
        #GarageCond
        X.loc[fil, 'GarageCond'] = "NoGrg" #missing
        #Fence
        X.loc[X.Fence.isnull(), 'Fence'] = "NoFence" #missing fence
        #Pool
        fil = ((X.PoolArea == 0) & (X.PoolQC.isnull()))
        X.loc[fil, 'PoolQC'] = 'NoPool' 
        
        del X['Id']
        del X['MiscFeature']
        #del X['MSSubClass']
        #del X['Neighborhood']  # this should be useful
        del X['Condition1']
        del X['Condition2']
        del X['ExterCond']  # maybe ordinal
        del X['Exterior1st']
        del X['Exterior2nd']
        del X['Functional']
        del X['Heating']
        del X['PoolQC']
        del X['RoofMatl']
        del X['RoofStyle']
        del X['SaleCondition']
        del X['SaleType']
        del X['Utilities']
        del X['BsmtCond']
        del X['Electrical']
        del X['Foundation']
        del X['Street']
        del X['Fence']
        del X['LandSlope']
        
        return X
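
The column matching done by `dummify.match_columns` can also be expressed with `DataFrame.reindex`. The sketch below is an alternative formulation on made-up data (the `color` column and both frames are hypothetical), not the method used in this notebook. It shows the two cases handled above: a dummy seen in training but missing in the new data is added as a column of zeros, and a dummy that only appears in the new data is dropped.

```python
import pandas as pd

# Hypothetical train and test sets with different category levels
train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['red', 'green']})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Keep exactly the training columns: 'color_blue' is added and filled
# with 0, 'color_green' is dropped because it was never seen in training
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(aligned)
```

Unlike `match_columns`, this version does not warn when the columns differ, so the explicit warning in the class above remains useful for spotting mismatches.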

In [11]:
class tr_numeric(BaseEstimator, TransformerMixin):
    def __init__(self, SF_room=True, bedroom=True, bath=True, lot=True, service=True):
        self.columns = []  # needed to behave well with FeatureUnion
        self.SF_room = SF_room
        self.bedroom = bedroom
        self.bath = bath
        self.lot = lot
        self.service = service
     

    def fit(self, X, y=None):
        return self
    

    def remove_skew(self, X, column):
        X[column] = np.log1p(X[column])
        return X


    def SF_per_room(self, X):
        if self.SF_room:
            X['sf_per_room'] = X['GrLivArea'] / X['TotRmsAbvGrd']
        return X


    def bedroom_prop(self, X):
        if self.bedroom:
            X['bedroom_prop'] = X['BedroomAbvGr'] / X['TotRmsAbvGrd']
            del X['BedroomAbvGr'] # the new feature makes it redundant and it is not important
        return X


    def total_bath(self, X):
        if self.bath:
            X['total_bath'] = (X[[col for col in X.columns if 'FullBath' in col]].sum(axis=1) +
                             0.5 * X[[col for col in X.columns if 'HalfBath' in col]].sum(axis=1))
            del X['FullBath']  # redundant 

        del X['HalfBath']  # not useful anyway
        del X['BsmtHalfBath']
        del X['BsmtFullBath']
        return X


    def lot_prop(self, X):
        if self.lot:
            X['lot_prop'] = X['LotArea'] / X['GrLivArea']
        return X 


    def service_area(self, X):
        if self.service:
            X['service_area'] = X['TotalBsmtSF'] + X['GarageArea']
            del X['TotalBsmtSF']
            del X['GarageArea']
        return X
    

    def transform(self, X, y=None):
        for col in ['GrLivArea', '1stFlrSF', 'LotArea']:
            X = self.remove_skew(X, col)

        X = self.SF_per_room(X)
        X = self.bedroom_prop(X)
        X = self.total_bath(X)
        X = self.lot_prop(X)
        X = self.service_area(X)

        self.columns = X.columns
        return X
    

    def get_features_name(self):
        return self.columns
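
`remove_skew` relies on `np.log1p`, which computes log(1 + x) and is therefore defined at zero. Its inverse is `np.expm1`, which maps a log-transformed quantity back to its original scale. A small sketch on made-up values (the series `raw` is hypothetical and only meant to mimic a right-skewed feature):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, including a zero
raw = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 100.0])

logged = np.log1p(raw)        # log(1 + x), safe for x = 0
restored = np.expm1(logged)   # exp(x) - 1, the inverse transformation

# Sample skewness before and after the transformation
print(raw.skew(), logged.skew())

# The round trip recovers the original values up to floating point error
print(np.allclose(restored, raw))
```

The same pair of functions applies to a target built with `np.log1p`: predictions made on the log scale would need `np.expm1` to be read in the original units.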

If we want to transform the numeric features, we just need to do the following

In [12]:
numeric_pipe = Pipeline([('fs', feat_sel('numeric')),
                         ('imputer', df_imputer(strategy='median')),
                         ('transf', tr_numeric()),
                         ('scl', df_scaler(method='standard'))])

full_pipe = Pipeline([('gen_cl', general_cleaner()), ('num_pipe', numeric_pipe)])

In [13]:
tmp = train_set.copy()
tmp = full_pipe.fit_transform(tmp)

tmp.head()

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,MisBsm,MisGarage,sf_per_room,bedroom_prop,total_bath,lot_prop,service_area
68,-0.293799,-1.288747,-1.536497,0.388314,-0.854633,-1.687632,-0.572721,-0.951802,-0.289111,0.394816,-1.22902,-0.809847,-0.11994,-1.969786,-0.217078,-1.558394,-0.955475,0.166254,-1.023099,-0.74392,-0.700903,-0.360845,-0.114758,-0.275037,-0.070993,-0.08847,-0.09025,1.648787,-1.265526,-0.162364,-0.240772,1.837275,0.597661,-1.517596,0.300226,-0.971933
1097,-1.633589,-1.713701,1.365085,-0.520605,0.486653,0.054226,-0.572721,-0.951802,-0.289111,1.135134,-0.047277,-0.809847,-0.11994,-0.843589,-0.217078,-0.945737,-0.955475,0.259402,0.312074,-0.74392,0.408792,1.862782,-0.114758,-0.275037,-0.070993,-0.08847,1.384042,-0.612407,-0.14531,-0.162364,-0.240772,0.862474,-0.456184,-0.886121,-1.075097,0.007796
219,-0.407824,-2.109283,0.639689,-0.520605,1.108225,1.021925,-0.482072,-0.917099,-0.289111,1.492841,0.384011,-0.809847,-0.11994,-0.432573,-0.217078,-0.945737,-0.955475,0.299323,0.312074,0.101091,-0.700903,-0.360845,-0.114758,-0.275037,-0.070993,-0.08847,-1.195969,-1.366138,-0.179663,-0.162364,-0.240772,0.966974,-0.456184,-0.254645,-1.781626,0.268252
901,0.190806,-0.061418,-0.811101,1.297233,-0.462061,0.731615,-0.572721,0.913518,-0.289111,-0.997524,0.52682,-0.809847,-0.11994,-0.296476,-0.217078,-0.945737,-0.955475,0.217264,-1.023099,-0.74392,-0.700903,-0.360845,-0.114758,-0.275037,-0.070993,-0.08847,-0.458823,0.895056,-0.356906,-0.162364,-0.240772,1.001577,-0.456184,-0.254645,0.157175,0.375194
505,0.076781,-0.325582,-0.811101,-0.520605,-0.625633,-1.590862,1.466898,-0.951802,-0.289111,0.877042,-0.440678,1.464495,-0.11994,0.919993,4.15442,2.117548,-0.955475,0.181779,0.312074,-0.74392,-0.700903,-0.360845,-0.114758,-0.275037,-0.070993,-0.08847,0.278323,0.895056,-0.711641,-0.162364,-0.240772,-1.578772,-0.456184,-0.254645,-1.022071,-0.294057


On the other hand, categorical features require a different approach.
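
One part of that approach is turning ordered quality grades into integers. Here is a minimal sketch on a made-up series (the `grades` values are hypothetical; the mapping is the one used by `make_ordinal` in this notebook), in which a missing grade becomes 0:

```python
import pandas as pd

# Quality grades from worst to best, as in make_ordinal
mapping = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# Hypothetical column with one missing grade
grades = pd.Series(['Ex', 'TA', None, 'Po'])

# Values that are not in the mapping become NaN, then 0
encoded = grades.map(mapping).fillna(0)

print(encoded.tolist())
```

Encoding the grades this way preserves their order, which a set of dummy columns would not.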

In [14]:
class make_ordinal(BaseEstimator, TransformerMixin):
    '''
    Transforms ordinal features in order to have them as numeric (preserving the order)
    If unsure whether to convert a feature (maybe making dummies is better), make use of
    extra_cols and include_extra
    '''
    def __init__(self, cols, extra_cols=None, include_extra=True):
        self.cols = cols
        self.extra_cols = extra_cols
        self.mapping = {'Po':1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
        self.include_extra = include_extra
    

    def fit(self, X, y=None):
        return self
    

    def transform(self, X, y=None):
        cols = list(self.cols)  # work on a copy so repeated calls do not grow self.cols
        if self.extra_cols:
            if self.include_extra:
                cols += self.extra_cols
            else:
                for col in self.extra_cols:
                    del X[col]
        
        for col in cols:
            X.loc[:, col] = X[col].map(self.mapping).fillna(0)
        return X


class recode_cat(BaseEstimator, TransformerMixin):        
    '''
    Recodes some categorical variables according to the insights gained from the
    data exploration phase.
    '''
    def __init__(self, mean_weight=10, te_neig=True, te_mssc=True):
        self.mean_tot = 0
        self.mean_weight = mean_weight
        self.smooth_neig = {}
        self.smooth_mssc = {}
        self.te_neig = te_neig
        self.te_mssc = te_mssc
    
    
    def smooth_te(self, data, target, col):
        tmp_data = data.copy()
        tmp_data['target'] = target
        mean_tot = tmp_data['target'].mean()
        means = tmp_data.groupby(col)['target'].mean()
        counts = tmp_data.groupby(col)['target'].count()

        smooth = ((counts * means + self.mean_weight * mean_tot) / 
                       (counts + self.mean_weight))
        return mean_tot, smooth
    
    def fit(self, X, y=None):
        if self.te_neig:
            self.mean_tot, self.smooth_neig = self.smooth_te(data=X, target=y, col='Neighborhood')

        if self.te_mssc:
            self.mean_tot, self.smooth_mssc = self.smooth_te(X, y, 'MSSubClass')
            
        return self
    
    
    def tr_GrgType(self, data):
        data['GarageType'] = data['GarageType'].map({'Basment': 'Attchd',
                                                     'CarPort': 'Detchd',
                                                     '2Types': 'Attchd' }).fillna(data['GarageType'])
        return data
    
    
    def tr_LotShape(self, data):
        fil = (data.LotShape != 'Reg')
        data['LotShape'] = 1
        data.loc[fil, 'LotShape'] = 0
        return data
    
    
    def tr_LandCont(self, data):
        fil = (data.LandContour == 'HLS') | (data.LandContour == 'Low')
        data['LandContour'] = 0
        data.loc[fil, 'LandContour'] = 1
        return data
    
    
    def tr_LandSlope(self, data):
        fil = (data.LandSlope != 'Gtl')
        data['LandSlope'] = 0
        data.loc[fil, 'LandSlope'] = 1
        return data
    
    
    def tr_MSZoning(self, data):
        data['MSZoning'] = data['MSZoning'].map({'RH': 'RM', # medium and high density
                                                 'C (all)': 'RM', # commercial and medium density
                                                 'FV': 'RM'}).fillna(data['MSZoning'])
        return data
    
    
    def tr_Alley(self, data):
        fil = (data.Alley != 'NoAlley')
        data['Alley'] = 0
        data.loc[fil, 'Alley'] = 1
        return data
    
    
    def tr_LotConfig(self, data):
        data['LotConfig'] = data['LotConfig'].map({'FR3': 'Corner', # corners have 2 or 3 free sides
                                                   'FR2': 'Corner'}).fillna(data['LotConfig'])
        return data
    
    
    def tr_BldgType(self, data):
        data['BldgType'] = data['BldgType'].map({'Twnhs' : 'TwnhsE',
                                                 '2fmCon': 'Duplex'}).fillna(data['BldgType'])
        return data
    
    
    def tr_MasVnrType(self, data):
        data['MasVnrType'] = data['MasVnrType'].map({'BrkCmn': 'BrkFace'}).fillna(data['MasVnrType'])
        return data


    def tr_HouseStyle(self, data):
        data['HouseStyle'] = data['HouseStyle'].map({'1.5Fin': '1.5Unf',
                                                     '2.5Fin': '2Story',
                                                     '2.5Unf': '2Story',
                                                     'SLvl': 'SFoyer'}).fillna(data['HouseStyle'])
        return data


    def tr_Neighborhood(self, data):
        if self.te_neig:
            data['Neighborhood'] = data['Neighborhood'].map(self.smooth_neig).fillna(self.mean_tot)
        return data
    
    def tr_MSSubClass(self, data):
        if self.te_mssc:
            data['MSSubClass'] = data['MSSubClass'].map(self.smooth_mssc).fillna(self.mean_tot)
        return data
    
    
    def transform(self, X, y=None):
        X = self.tr_GrgType(X)
        X = self.tr_LotShape(X)
        X = self.tr_LotConfig(X)
        X = self.tr_MSZoning(X)
        X = self.tr_Alley(X)
        X = self.tr_LandCont(X)
        X = self.tr_BldgType(X)
        X = self.tr_MasVnrType(X)
        X = self.tr_HouseStyle(X)
        X = self.tr_Neighborhood(X)
        X = self.tr_MSSubClass(X)
        return X
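
The smoothing in `smooth_te` shrinks each category mean toward the global mean, with `mean_weight` acting as the number of pseudo-observations assigned to the global mean: smooth = (count * mean + mean_weight * global_mean) / (count + mean_weight). A self-contained sketch on toy data (the frame, the column names `cat` and `target`, and the weight of 2 are chosen only for illustration):

```python
import pandas as pd

# Toy data: category A has three rows, category B only one
toy = pd.DataFrame({'cat': ['A', 'A', 'A', 'B'],
                    'target': [10.0, 20.0, 30.0, 40.0]})
mean_weight = 2  # illustrative value, the class above defaults to 10

global_mean = toy['target'].mean()
stats = toy.groupby('cat')['target'].agg(['count', 'mean'])

# Weighted average of the category mean and the global mean
smooth = ((stats['count'] * stats['mean'] + mean_weight * global_mean)
          / (stats['count'] + mean_weight))

print(smooth)
```

With a global mean of 25, category A moves only from 20 to 22, while category B, which has a single row, moves from 40 to 30. Rare categories are therefore pulled more strongly toward the global mean, which limits overfitting to small groups.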

In [15]:
tmp = train_set.copy()

y = np.log1p(tmp.SalePrice)

cat_pipe = Pipeline([('fs', feat_sel('category')),
                     ('imputer', df_imputer(strategy='most_frequent')), 
                     ('ord', make_ordinal(['BsmtQual', 'KitchenQual','GarageQual',
                                           'GarageCond', 'ExterQual', 'HeatingQC'])), 
                     ('recode', recode_cat()), 
                     ('dummies', dummify())])

full_pipe = Pipeline([('gen_cl', general_cleaner()), ('cat_pipe', cat_pipe)])

tmp = full_pipe.fit_transform(tmp, y)

tmp.head()

Unnamed: 0,MSSubClass,Alley,LotShape,LandContour,Neighborhood,ExterQual,BsmtQual,HeatingQC,KitchenQual,GarageQual,GarageCond,MSZoning_RL,MSZoning_RM,LotConfig_Corner,LotConfig_CulDSac,LotConfig_Inside,BldgType_1Fam,BldgType_Duplex,BldgType_TwnhsE,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2Story,HouseStyle_SFoyer,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,BsmtExposure_Av,BsmtExposure_Gd,BsmtExposure_Mn,BsmtExposure_No,BsmtExposure_NoBsmt,BsmtFinType1_ALQ,BsmtFinType1_BLQ,BsmtFinType1_GLQ,BsmtFinType1_LwQ,BsmtFinType1_NoBsmt,BsmtFinType1_Rec,BsmtFinType1_Unf,BsmtFinType2_ALQ,BsmtFinType2_BLQ,BsmtFinType2_GLQ,BsmtFinType2_LwQ,BsmtFinType2_NoBsmt,BsmtFinType2_Rec,BsmtFinType2_Unf,CentralAir_N,CentralAir_Y,FireplaceQu_Ex,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_NoFire,FireplaceQu_Po,FireplaceQu_TA,GarageType_Attchd,GarageType_BuiltIn,GarageType_Detchd,GarageType_NoGrg,GarageFinish_Fin,GarageFinish_NoGrg,GarageFinish_RFn,GarageFinish_Unf,PavedDrive_N,PavedDrive_P,PavedDrive_Y
68,11.517338,0,1,0,11.740415,3,3.0,3,3,3.0,3.0,0,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1
1097,12.157053,0,1,0,12.414149,4,4.0,5,4,3.0,3.0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,1
219,12.157053,0,1,0,12.11845,4,4.0,5,4,3.0,3.0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1
901,12.055077,0,0,0,11.879519,3,3.0,3,3,3.0,3.0,1,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1
505,11.833054,1,1,0,11.740415,3,3.0,4,3,3.0,3.0,0,1,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0


Now the issue is to put everything back together; this is what the next class is all about.

In [16]:
class FeatureUnion_df(TransformerMixin, BaseEstimator):
    '''
    Wrapper of FeatureUnion but returning a Dataframe, 
    the column order follows the concatenation done by FeatureUnion

    transformer_list: list of Pipelines

    '''
    def __init__(self, transformer_list, n_jobs=None, transformer_weights=None, verbose=False):
        self.transformer_list = transformer_list
        self.n_jobs = n_jobs
        self.transformer_weights = transformer_weights
        self.verbose = verbose  # these are necessary to work inside of GridSearch or similar
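        # n_jobs, transformer_weights and verbose are keyword-only in recent sklearn versions
        self.feat_un = FeatureUnion(self.transformer_list, 
                                    n_jobs=self.n_jobs, 
                                    transformer_weights=self.transformer_weights, 
                                    verbose=self.verbose)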
        
    def fit(self, X, y=None):
        self.feat_un.fit(X, y)
        return self

    def transform(self, X, y=None):
        X_tr = self.feat_un.transform(X)
        columns = []
        
        for trsnf in self.transformer_list:
            cols = trsnf[1].steps[-1][1].get_features_name()
            columns += list(cols)

        X_tr = pd.DataFrame(X_tr, index=X.index, columns=columns)
        
        return X_tr

    def get_params(self, deep=True):  # needed to behave well in GridSearch
        return self.feat_un.get_params(deep=deep)


In [17]:
numeric_pipe = Pipeline([('fs', feat_sel('numeric')),
                         ('imputer', df_imputer(strategy='median')),
                         ('transf', tr_numeric())])


cat_pipe = Pipeline([('fs', feat_sel('category')),
                     ('imputer', df_imputer(strategy='most_frequent')), 
                     ('ord', make_ordinal(['BsmtQual', 'KitchenQual','GarageQual',
                                           'GarageCond', 'ExterQual', 'HeatingQC'])), 
                     ('recode', recode_cat()), 
                     ('dummies', dummify())])


processing_pipe = FeatureUnion_df(transformer_list=[('cat_pipe', cat_pipe),
                                                 ('num_pipe', numeric_pipe)])


full_pipe = Pipeline([('gen_cl', general_cleaner()), 
                      ('processing', processing_pipe), 
                      ('scaler', df_scaler())])

tmp = df_train.copy()

y = np.log1p(df_train.SalePrice)

full_pipe.fit_transform(tmp, y)

Unnamed: 0,MSSubClass,Alley,LotShape,LandContour,Neighborhood,ExterQual,BsmtQual,HeatingQC,KitchenQual,GarageQual,GarageCond,MSZoning_RL,MSZoning_RM,LotConfig_Corner,LotConfig_CulDSac,LotConfig_Inside,BldgType_1Fam,BldgType_Duplex,BldgType_TwnhsE,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2Story,HouseStyle_SFoyer,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,BsmtExposure_Av,BsmtExposure_Gd,BsmtExposure_Mn,BsmtExposure_No,BsmtExposure_NoBsmt,BsmtFinType1_ALQ,BsmtFinType1_BLQ,BsmtFinType1_GLQ,BsmtFinType1_LwQ,BsmtFinType1_NoBsmt,BsmtFinType1_Rec,BsmtFinType1_Unf,BsmtFinType2_ALQ,BsmtFinType2_BLQ,BsmtFinType2_GLQ,BsmtFinType2_LwQ,BsmtFinType2_NoBsmt,BsmtFinType2_Rec,BsmtFinType2_Unf,CentralAir_N,CentralAir_Y,FireplaceQu_Ex,FireplaceQu_Fa,FireplaceQu_Gd,FireplaceQu_NoFire,FireplaceQu_Po,FireplaceQu_TA,GarageType_Attchd,GarageType_BuiltIn,GarageType_Detchd,GarageType_NoGrg,GarageFinish_Fin,GarageFinish_NoGrg,GarageFinish_RFn,GarageFinish_Unf,PavedDrive_N,PavedDrive_P,PavedDrive_Y,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,MisBsm,MisGarage,sf_per_room,bedroom_prop,total_bath,lot_prop,service_area
0,1.443734,-0.257821,0.760512,-0.250182,0.525472,1.052302,0.583168,0.891179,0.735994,0.262542,0.265618,0.518133,-0.518133,-0.523447,-0.262324,0.622762,0.443533,-0.245512,-0.347118,-0.360598,-0.994535,1.465112,-0.274063,1.474420,-1.217782,-0.309994,-0.422338,-0.317893,-0.291025,0.728285,-0.16125,-0.421212,-0.335864,1.578868,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,-0.195977,0.401865,-0.263813,0.263813,-0.129279,-0.152071,-0.593171,1.056382,-0.117851,-0.522385,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,1.568348,-0.841191,-0.256307,-0.144841,0.299253,0.212877,-0.133270,0.651479,-0.517200,1.050994,0.878668,0.514104,0.575425,-0.288653,-0.944591,-0.803645,1.161852,-0.120242,0.529194,-0.211454,0.912210,-0.951226,0.296026,0.311725,-0.752176,0.216503,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,0.138777,0.347273,-0.16125,-0.24236,-0.933787,-0.710813,1.642256,-0.553276,-0.220303
1,0.087193,-0.257821,0.760512,-0.250182,0.670891,-0.689604,0.583168,0.891179,-0.771091,0.262542,0.265618,0.518133,-0.518133,1.910414,-0.262324,-1.605749,0.443533,-0.245512,-0.347118,-0.360598,1.005495,-0.682542,-0.274063,-0.678233,0.821165,-0.309994,-0.422338,3.145715,-0.291025,-1.373090,-0.16125,2.374103,-0.335864,-0.633365,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,-0.195977,0.401865,-0.263813,0.263813,-0.129279,-0.152071,-0.593171,-0.946628,-0.117851,1.914298,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,1.568348,-0.841191,-0.256307,-0.144841,0.299253,0.645747,0.113413,-0.071836,2.179628,0.156734,-0.429577,-0.570750,1.171992,-0.288653,-0.641228,0.418479,-0.795163,-0.120242,-0.381965,-0.211454,-0.318683,0.600495,0.236495,0.311725,1.626195,-0.704483,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-0.489110,-0.614439,0.007288,-0.16125,-0.24236,0.055147,0.560445,0.368581,0.399682,0.333898
2,1.443734,-0.257821,-1.314904,-0.250182,0.525472,1.052302,0.583168,0.891179,0.735994,0.262542,0.265618,0.518133,-0.518133,-0.523447,-0.262324,0.622762,0.443533,-0.245512,-0.347118,-0.360598,-0.994535,1.465112,-0.274063,1.474420,-1.217782,-0.309994,-0.422338,-0.317893,3.436134,-1.373090,-0.16125,-0.421212,-0.335864,1.578868,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,-0.195977,0.401865,-0.263813,0.263813,-0.129279,-0.152071,-0.593171,-0.946628,-0.117851,1.914298,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,1.568348,-0.841191,-0.256307,-0.144841,0.299253,0.299451,0.420049,0.651479,-0.517200,0.984752,0.830215,0.325915,0.092907,-0.288653,-0.301643,-0.576677,1.189351,-0.120242,0.659631,-0.211454,-0.318683,0.600495,0.291616,0.311725,-0.752176,-0.070361,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,0.990891,0.138777,0.536154,-0.16125,-0.24236,0.275488,0.560445,1.642256,-0.125912,-0.004199
3,-0.302159,-0.257821,-1.314904,-0.250182,0.611598,-0.689604,-0.558153,-0.151386,0.735994,0.262542,0.265618,0.518133,-0.518133,1.910414,-0.262324,-1.605749,0.443533,-0.245512,-0.347118,-0.360598,-0.994535,1.465112,-0.274063,-0.678233,0.821165,-0.309994,-0.422338,-0.317893,-0.291025,0.728285,-0.16125,2.374103,-0.335864,-0.633365,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,-0.195977,0.401865,-0.263813,0.263813,-0.129279,-0.152071,1.685854,-0.946628,-0.117851,-0.522385,-1.258599,-0.253259,1.639167,-0.24236,-0.563640,-0.24236,-0.637614,1.188791,-0.256307,-0.144841,0.299253,0.068587,0.103317,0.651479,-0.517200,-1.863632,-0.720298,-0.570750,-0.499274,-0.288653,-0.061670,-0.439421,0.937276,-0.120242,0.541448,-0.211454,0.296763,0.600495,0.285002,1.650307,-0.752176,-0.176048,4.092524,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,-1.367655,-0.515281,-0.16125,-0.24236,-0.425130,-0.165988,-0.268257,-0.337161,-0.230760
4,1.443734,-0.257821,-1.314904,-0.250182,2.078658,1.052302,0.583168,0.891179,0.735994,0.262542,0.265618,0.518133,-0.518133,1.910414,-0.262324,-1.605749,0.443533,-0.245512,-0.347118,-0.360598,-0.994535,1.465112,-0.274063,1.474420,-1.217782,-0.309994,2.367770,-0.317893,-0.291025,-1.373090,-0.16125,-0.421212,-0.335864,1.578868,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,-0.195977,0.401865,-0.263813,0.263813,-0.129279,-0.152071,-0.593171,-0.946628,-0.117851,1.914298,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,1.568348,-0.841191,-0.256307,-0.144841,0.299253,0.761179,0.878431,1.374795,-0.517200,0.951632,0.733308,1.366489,0.463568,-0.288653,-0.174865,0.112127,1.617877,-0.120242,1.282295,-0.211454,1.527656,0.600495,0.289412,1.650307,0.780197,0.563760,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,2.100892,0.138777,0.869843,-0.16125,-0.24236,-1.221473,-0.004558,1.642256,-0.163986,0.785276
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1.443734,-0.257821,0.760512,-0.250182,0.470503,-0.689604,0.583168,0.891179,-0.771091,0.262542,0.265618,0.518133,-0.518133,-0.523447,-0.262324,0.622762,0.443533,-0.245512,-0.347118,-0.360598,-0.994535,1.465112,-0.274063,-0.678233,0.821165,-0.309994,-0.422338,-0.317893,-0.291025,0.728285,-0.16125,-0.421212,-0.335864,-0.633365,-0.231065,-0.16125,-0.316585,1.547691,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,-0.195977,0.401865,-0.263813,0.263813,-0.129279,-0.152071,-0.593171,-0.946628,-0.117851,1.914298,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,1.568348,-0.841191,-0.256307,-0.144841,0.299253,0.126303,-0.259231,-0.071836,-0.517200,0.918511,0.733308,-0.570750,-0.973018,-0.288653,0.873321,-0.465737,0.795198,-0.120242,0.416598,-0.211454,0.296763,0.600495,0.287207,0.311725,-0.752176,-0.100558,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,0.620891,-0.614439,-0.074560,-0.16125,-0.24236,-0.447768,-0.165988,0.368581,-0.589525,-0.204618
1456,0.087193,-0.257821,0.760512,-0.250182,0.378758,-0.689604,0.583168,-1.193952,-0.771091,0.262542,0.265618,0.518133,-0.518133,-0.523447,-0.262324,0.622762,0.443533,-0.245512,-0.347118,-0.360598,1.005495,-0.682542,-0.274063,-0.678233,-1.217782,3.225872,-0.422338,-0.317893,-0.291025,0.728285,-0.16125,2.374103,-0.335864,-0.633365,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,5.102650,-2.488397,-0.263813,0.263813,-0.129279,-0.152071,-0.593171,-0.946628,-0.117851,1.914298,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,-0.637614,1.188791,-0.256307,-0.144841,0.299253,0.790037,0.725429,-0.071836,0.381743,0.222975,0.151865,0.087911,0.759659,0.722112,0.049262,1.981524,-0.795163,-0.120242,1.106648,-0.211454,0.296763,2.152216,0.240904,0.311725,2.033231,-0.704483,-0.359325,-0.116339,-0.270208,-0.068692,-0.087688,-1.599111,1.645210,0.366161,-0.16125,-0.24236,-0.322647,-0.165988,1.005418,-0.175460,0.891585
1457,-0.302159,-0.257821,0.760512,-0.250182,0.611598,2.794208,-0.558153,0.891179,0.735994,0.262542,0.265618,0.518133,-0.518133,-0.523447,-0.262324,0.622762,0.443533,-0.245512,-0.347118,-0.360598,-0.994535,1.465112,-0.274063,-0.678233,0.821165,-0.309994,-0.422338,-0.317893,-0.291025,0.728285,-0.16125,-0.421212,-0.335864,1.578868,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,-0.195977,0.401865,-0.263813,0.263813,-0.129279,-0.152071,1.685854,-0.946628,-0.117851,-0.522385,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,1.568348,-0.841191,-0.256307,-0.144841,0.299253,0.241735,-0.002359,0.651479,3.078570,-1.002492,1.024029,-0.570750,-0.369871,-0.288653,0.701265,0.228208,1.844744,-0.120242,1.470102,-0.211454,1.527656,2.152216,0.159324,-1.026858,-0.752176,0.201405,-0.359325,-0.116339,-0.270208,-0.068692,4.953112,-0.489110,1.645210,1.077611,-0.16125,-0.24236,-1.194987,-0.004558,-0.268257,-1.106562,-0.220303
1458,0.087193,-0.257821,0.760512,-0.250182,-0.581342,-0.689604,-0.558153,-0.151386,0.735994,0.262542,0.265618,0.518133,-0.518133,-0.523447,-0.262324,0.622762,0.443533,-0.245512,-0.347118,-0.360598,1.005495,-0.682542,-0.274063,-0.678233,0.821165,-0.309994,-0.422338,-0.317893,3.436134,-1.373090,-0.16125,-0.421212,-0.335864,1.578868,-0.231065,-0.16125,-0.316585,-0.646124,-0.114827,-0.152071,-0.098397,-0.180366,-0.16125,5.102650,-2.488397,-0.263813,0.263813,-0.129279,-0.152071,-0.593171,1.056382,-0.117851,-0.522385,0.794534,-0.253259,-0.610066,-0.24236,-0.563640,-0.24236,-0.637614,1.188791,-0.256307,-0.144841,0.299253,0.299451,0.136833,-0.795151,0.381743,-0.704406,0.539493,-0.570750,-0.865548,6.092188,-1.284176,-0.077721,-0.795163,-0.120242,-0.854536,-0.211454,-0.934130,-0.951226,0.179168,-1.026858,2.168910,-0.704483,1.473789,-0.116339,-0.270208,-0.068692,-0.087688,-0.859110,1.645210,-0.488523,-0.16125,-0.24236,0.841981,-0.456562,-0.268257,0.820423,-0.370181


In [18]:
full_pipe.get_params()

{'memory': None,
 'steps': [('gen_cl', general_cleaner(train=None)),
  ('processing', FeatureUnion_df(n_jobs=None,
                   transformer_list=[('cat_pipe',
                                      Pipeline(memory=None,
                                               steps=[('fs',
                                                       feat_sel(dtype='category')),
                                                      ('imputer',
                                                       df_imputer(strategy='most_frequent')),
                                                      ('ord',
                                                       make_ordinal(cols=['BsmtQual',
                                                                          'KitchenQual',
                                                                          'GarageQual',
                                                                          'GarageCond',
                                                         

# The custom dummifier

This wrapper overcomes the common problem of the train and test sets not containing the same categories.

For example, let's take `RoofMatl`

In [19]:
tmp = df_train[['RoofMatl']].copy()

tmp.head()

Unnamed: 0,RoofMatl
0,CompShg
1,CompShg
2,CompShg
3,CompShg
4,CompShg


If we fit the dummifier on the training set, we notice several very rare categories, some of which do not show up in the test set at all.

In [20]:
dummifier = dummify()

dummifier.fit_transform(tmp).sum()

RoofMatl_ClyTile       1
RoofMatl_CompShg    1434
RoofMatl_Membran       1
RoofMatl_Metal         1
RoofMatl_Roll          1
RoofMatl_Tar&Grv      11
RoofMatl_WdShake       5
RoofMatl_WdShngl       6
dtype: int64

The transformer takes care of this: when transforming the test set, it adds a column full of 0's for every category seen during the fit but missing from the test set.

In [21]:
tmp = df_test[['RoofMatl']].copy()

dummifier.transform(tmp).sum()



RoofMatl_CompShg    1442
RoofMatl_Tar&Grv      12
RoofMatl_WdShake       4
RoofMatl_WdShngl       1
RoofMatl_Roll          0
RoofMatl_Membran       0
RoofMatl_ClyTile       0
RoofMatl_Metal         0
dtype: int64

If, for some reason, we switch the roles of the train and test sets, we have extra columns when we call transform instead. The transformer drops the columns for categories it did not see during the fit.

In [22]:
dummifier = dummify()

dummifier.fit_transform(tmp).sum()

RoofMatl_CompShg    1442
RoofMatl_Tar&Grv      12
RoofMatl_WdShake       4
RoofMatl_WdShngl       1
dtype: int64

In [23]:
tmp = df_train[['RoofMatl']].copy()

dummifier.transform(tmp).sum()



RoofMatl_CompShg    1434
RoofMatl_Tar&Grv      11
RoofMatl_WdShake       5
RoofMatl_WdShngl       6
dtype: int64