# Sklearn Pipeline setup (Henry's comments) 

In this notebook, we set up all the feature engineering steps within a Scikit-learn pipeline, using open-source transformers plus custom transformer classes developed in house that are "scikit-learn compatible". The in-house transformers cover:
   - Categorical variable encoding (mapping categorical variables that already have ordered levels from strings to numbers)
   - Computing the elapsed time between YrSold and the year (temporal) variables

## From Notebook 6

- We now have several classes (scaler, encoders, etc...) with parameters learned from the train dataset. 
- We can store and retrieve these objects at a later stage (with joblib). 
- We can reuse these objects when scoring a new/more recent dataset

**Even so, this requires manual work:**

- Need to save each transformer class (have lots of pickles!)
- Load each class (when scoring a new dataset)
- Apply each transformation to the new dataset.

The good news is: we can set up all the transformations within a single sklearn pipeline, which is the focus of notebook 7.
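As a rough illustration of the payoff, here is a minimal sketch with a toy two-step pipeline. The step names, toy data, and file name are assumptions made for the example, not part of this project. The whole fitted pipeline is saved and reloaded as one object instead of one pickle per transformer.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, MinMaxScaler

# Toy training data (illustrative only)
X_toy_train = np.array([[0.0], [10.0]])

# Two transformers chained in one object instead of two separate pickles
toy_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("binarizer", Binarizer(threshold=0.5)),
])
toy_pipe.fit(X_toy_train)

# Persist and reload the fitted pipeline as a single file
toy_path = os.path.join(tempfile.mkdtemp(), "toy_pipe.joblib")
joblib.dump(toy_pipe, toy_path)
reloaded_pipe = joblib.load(toy_path)

# The reloaded pipeline applies the same learned transformations to new data
X_toy_new = np.array([[2.0], [8.0]])
reloaded_output = reloaded_pipe.transform(X_toy_new)
same_output = np.array_equal(toy_pipe.transform(X_toy_new), reloaded_output)
print(same_output)
```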

# Reproducibility: Setting the seed

To ensure reproducibility between runs of the same notebook, and between the research and production environments, it is extremely important that we **set the seed** for every step that involves randomness.
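A small sketch of what setting the seed buys us, using made-up arrays (`X_demo` and `y_demo` are assumptions for illustration): two calls to `train_test_split` with the same `random_state` return exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same random_state -> identical split on every call
_, X_test_a, _, y_test_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

splits_match = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(splits_match)
```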

In [81]:
# data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# for saving the pipeline
import joblib

# from Scikit-learn
from sklearn.pipeline import Pipeline # Scikit-learn pipeline
from sklearn.feature_selection import SelectFromModel # Feature selection
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, Binarizer

# from feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer,
)

from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder,
)

from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer,
)

from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

# MODULE WRITTEN BY ME (custom scikit-learn compatible Transformer classes)
# Ref: https://blog.finxter.com/python-how-to-import-modules-from-another-folder/#:~:text=The%20most%20Pythonic%20way%20to,import%20module%20.
import sys
sys.path.append('../src/')
import preprocessors as pp

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

## Import data

In [82]:
# load dataset
data = pd.read_csv('../data/train.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


In [83]:
# Cast MSSubClass as object

data['MSSubClass'] = data['MSSubClass'].astype('O')

# Separate dataset into train and test

It is important to separate our data into training and testing sets so that we can estimate the model's test error on the test set.

**Data leakage:**
When we engineer features, some techniques learn parameters from data. It is important to learn these parameters **only from the train set**, and then apply the transformations that use these parameters to both the train and test datasets. Otherwise, there will be "data leakage", which leads to an overestimation of model performance.

**In notebook 7, we introduce an easier way to avoid data leakage (via scikit-learn pipelines)**
- Basically, we use ONLY the train dataset to fit the pipeline (only train data is passed to the .fit() method to learn parameters), and then apply .transform() to both the train and test datasets.


The engineered features will use the following learnt parameters from the train set:

- mean
- mode
- exponents for the yeo-johnson
- category frequency
- and category to number mappings
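The "learn on train, apply to both" idea can be sketched in plain pandas for one of the parameters above, the mean. The toy values below are made up for illustration and are not taken from the house price data.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
train_demo = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
test_demo = pd.DataFrame({"LotFrontage": [np.nan, 200.0]})

# "fit": learn the imputation value from the train set only
learned_mean = train_demo["LotFrontage"].mean()

# "transform": apply the same learned value to both sets
train_imputed = train_demo.fillna({"LotFrontage": learned_mean})
test_imputed = test_demo.fillna({"LotFrontage": learned_mean})

print(learned_mean)
print(test_imputed)
```

Note that the large test value (200.0) never influences the imputation value; computing the mean over train and test together is exactly the leakage we want to avoid.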

In [84]:
X_train,X_test,y_train,y_test = train_test_split(data.drop(['Id','SalePrice'],axis=1), # features
                                                 data['SalePrice'], # response variable
                                                 test_size=0.1, # 90-10 train-test split
                                                 random_state=0) # setting seed since there is randomness in the split

## Response variable
- Apply log transformation

In [85]:
y_train = np.log(y_train)
y_test = np.log(y_test)
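Because the response is modelled on the log scale, anything expressed in log units has to be passed through `np.exp` to get back to sale prices. A minimal round-trip check, using the first three SalePrice values shown in the data preview above:

```python
import numpy as np

# First three SalePrice values from the data preview
sale_prices = np.array([208500.0, 181500.0, 223500.0])

log_prices = np.log(sale_prices)       # scale used for modelling
recovered_prices = np.exp(log_prices)  # back to the original price scale

print(np.allclose(recovered_prices, sale_prices))
```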

# Config
- We usually put all this in a config file (yaml) and read variables from the yaml file
- Contains user-specified settings/variables
    - For example, categorical variables with >10% missing values (impute with missing level)

In [86]:
# categorical variables with missing values in the train set

# categorical variables with <10% missing values in the train set that are imputed with the mode of the variable
CATEGORICAL_VARS_WITH_NA_FREQUENT = ['MasVnrType',
                                     'BsmtQual',
                                     'BsmtCond',
                                     'BsmtExposure',
                                     'BsmtFinType1',
                                     'BsmtFinType2',
                                     'Electrical',
                                     'GarageType',
                                     'GarageFinish',
                                     'GarageQual',
                                     'GarageCond']

# categorical variables with >10% missing values in the train set that are imputed with a new 'Missing' level
CATEGORICAL_VARS_WITH_NA_MISSING = [
    'Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']


# numerical variables with missing values in the train set
NUMERICAL_VARS_WITH_NA = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']


# Year variables
TEMPORAL_VARS = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']
# "Reference" year variable
REF_VAR = "YrSold"


# continuous variables to log transform
NUMERICALS_LOG_VARS = ["LotFrontage", "1stFlrSF", "GrLivArea"]

# continuous variables to perform Yeo-Johnson transformation
NUMERICALS_YEO_VARS = ['LotArea']

# "very skewed" continuous variables that we binarize
BINARIZE_VARS = [
    'BsmtFinSF2', 'LowQualFinSF', 'EnclosedPorch',
    '3SsnPorch', 'ScreenPorch', 'MiscVal'
]

# categorical variables with ordered levels that we map from string encoding to numeric encoding
QUAL_VARS = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
             'HeatingQC', 'KitchenQual', 'FireplaceQu',
             'GarageQual', 'GarageCond',
             ]

EXPOSURE_VARS = ['BsmtExposure']

FINISH_VARS = ['BsmtFinType1', 'BsmtFinType2']

GARAGE_VARS = ['GarageFinish']

FENCE_VARS = ['Fence']

# "other" categorical variables that we transform to ordinal variables
CATEGORICAL_VARS = [
    'MSZoning',
    'Street',
    'Alley',
    'LotShape',
    'LandContour',
    'Utilities',
    'LotConfig',
    'LandSlope',
    'Neighborhood',
    'Condition1',
    'Condition2',
    'BldgType',
    'HouseStyle',
    'RoofStyle',
    'RoofMatl',
    'Exterior1st',
    'Exterior2nd',
    'MasVnrType',
    'Foundation',
    'Heating',
    'CentralAir',
    'Electrical',
    'Functional',
    'GarageType',
    'PavedDrive',
    'PoolQC',
    'MiscFeature',
    'SaleType',
    'SaleCondition',
    'MSSubClass']


# Encoding mapping dictionaries for categorical variables with ordered level 
# that we encode from string encoding to numeric encoding
QUAL_MAPPINGS = {'Po': 1, 'Fa': 2, 'TA': 3,
                 'Gd': 4, 'Ex': 5, 'Missing': 0, 'NA': 0}

EXPOSURE_MAPPINGS = {'No': 1, 'Mn': 2, 'Av': 3, 'Gd': 4}

FINISH_MAPPINGS = {'Missing': 0, 'NA': 0, 'Unf': 1,
                   'LwQ': 2, 'Rec': 3, 'BLQ': 4, 'ALQ': 5, 'GLQ': 6}

GARAGE_MAPPINGS = {'Missing': 0, 'NA': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

FENCE_MAPPINGS = {'Missing': 0, 'NA': 0,
                  'MnWw': 1, 'GdWo': 2, 'MnPrv': 3, 'GdPrv': 4}

# Scikit-learn pipeline for feature engineering

Now we are ready to set up the pipeline: we instantiate a Pipeline object with a list of steps.

**Our steps are:** 
- Imputing missing values in categorical variables (with a new "Missing" level)
- Imputing missing values in categorical variables (with the mode of the categorical variable)
- Imputing missing values in continuous variables: create missing indicator variables
- Imputing missing values in continuous variables: mean imputation
- Handling Year (Temporal) variables: compute the elapsed-time variables and replace the original year variables
- Handling Year (Temporal) variables: drop the "reference" year variable (YrSold)
- Transformation of continuous variables: log transform several continuous variables
- Transformation of continuous variables: perform Yeo-Johnson transformation on a continuous variable
- Transformation of continuous variables: perform binary transformation on a set of very skewed variables
- Recode categorical variables with ordered levels: map from string encoding to numeric encoding
- Recode categorical variables: group rare labels to a "Rare" level
- Recode categorical variables: convert a set of "other_cat" categorical variables to ordinal variables



**What are pipelines?**
- Pipelines are a way to "streamline" repetitive processes (e.g., raw data -> transformation 1 -> ... -> transformation n -> train model -> predict)
- We can encapsulate all the steps in the feature engineering process into one function call, so that we don't have to copy and paste a bunch of code to apply the same sequence of feature engineering steps to a new/more recent dataset.
- Pipelines also help to avoid data leakage.
- Another example can be found here:  
    - /Users/hfung/Documents/PycharmProjects/practice_projects/key_concepts/scikit-learn-pipeline.ipynb
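To make the "one function call" point concrete, here is a small sketch on made-up data. The step names and the choice of `SimpleImputer` and `MinMaxScaler` are assumptions for the example. It checks that a two-step pipeline gives the same result as fitting and applying the two transformers by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up data with missing values
X_fit = np.array([[1.0], [np.nan], [3.0], [5.0]])
X_new = np.array([[np.nan], [2.0]])

# Manual approach: fit and apply each transformer in turn
imputer = SimpleImputer(strategy="mean").fit(X_fit)
scaler = MinMaxScaler().fit(imputer.transform(X_fit))
manual_result = scaler.transform(imputer.transform(X_new))

# Pipeline approach: one fit() and one transform() for the whole sequence
demo_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", MinMaxScaler()),
])
demo_pipe.fit(X_fit)
pipeline_result = demo_pipe.transform(X_new)

print(np.allclose(manual_result, pipeline_result))
```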

### We use custom transformers (from the `preprocessors` module in src/, imported as `pp`) for the following steps in the Pipeline:
- Handling Year (Temporal) variables: compute the elapsed-time variables and replace the original year variables
- Recode categorical variables with ordered levels: map from string encoding to numeric encoding
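The source of the in-house module is not shown in this notebook, so the classes below are hypothetical stand-ins (`ElapsedYearsSketch` and `MapperSketch` are names invented here) that only illustrate what "scikit-learn compatible" means: inherit from `BaseEstimator` and `TransformerMixin`, and implement `fit()` and `transform()`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ElapsedYearsSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.TemporalVariableTransformer."""

    def __init__(self, variables, reference_variable):
        self.variables = variables
        self.reference_variable = reference_variable

    def fit(self, X, y=None):
        # Nothing to learn; fit() exists so the class works inside a Pipeline
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's dataframe
        for var in self.variables:
            X[var] = X[self.reference_variable] - X[var]
        return X


class MapperSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for pp.Mapper."""

    def __init__(self, variables, mappings):
        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].map(self.mappings)
        return X


# Two rows taken from the data preview (YearBuilt, YrSold, ExterQual)
houses = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "YrSold": [2008, 2007],
    "ExterQual": ["Gd", "TA"],
})

houses = ElapsedYearsSketch(["YearBuilt"], "YrSold").fit_transform(houses)
houses = MapperSketch(["ExterQual"], {"TA": 3, "Gd": 4}).fit_transform(houses)
print(houses)
```

Keeping the constructor arguments as plain attributes with the same names lets `BaseEstimator` provide `get_params()` and `set_params()` for free.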

## Pipeline setup

In [87]:
# price_pipe is a Pipeline built from a list of steps. Each element in the list is
# a tuple of (name of the Pipeline step, instance of the transformer/estimator)

price_pipe = Pipeline([
    
    # ===== IMPUTATION =====
    
    # Imputing missing values in categorical variables (with a new "Missing" level)
    # CategoricalImputer from feature_engine
    ('missing_imputation', CategoricalImputer(imputation_method='missing',
                                             variables=CATEGORICAL_VARS_WITH_NA_MISSING)),
    
    
    # Imputing missing values in categorical variables (with the mode of the categorical variable)
    # CategoricalImputer from feature_engine
    ('frequent_imputation', CategoricalImputer(imputation_method = 'frequent',
                                              variables=CATEGORICAL_VARS_WITH_NA_FREQUENT)),
    
    # Imputing missing values in continuous variables: create missing indicator variables
    # AddMissingIndicator from feature_engine
    ('missing_indicator', AddMissingIndicator(variables=NUMERICAL_VARS_WITH_NA)),
    
    # Imputing missing values in continuous variables: mean imputation
    # MeanMedianImputer from feature_engine
    ('mean_imputation', MeanMedianImputer(imputation_method='mean',
                                         variables = NUMERICAL_VARS_WITH_NA)),
    
    
    # ==== TEMPORAL (YEAR) VARIABLES ====
    
    # Handling Year (Temporal) variables: compute the elapsed-time variables and replace the original year variables
    # Use the custom class TemporalVariableTransformer from the 'preprocessors' module (imported as pp)
    ('elapsed_time', pp.TemporalVariableTransformer(variables = TEMPORAL_VARS,
                                                    reference_variable = REF_VAR)),
    
    
    # Handling Year (Temporal) variables: drop the "reference" year variable (YrSold)
    # Use DropFeatures from feature_engine
    ('drop_features', DropFeatures(features_to_drop=[REF_VAR])),
    
    
    # ==== VARIABLE TRANSFORMATION =====
    
    # Transformation of continuous variables: log transform several continuous variables
    # Use LogTransformer from feature_engine
    ('log', LogTransformer(variables=NUMERICALS_LOG_VARS)),
    
    # Transformation of continuous variables: perform Yeo-Johnson transformation on a continuous variable
    # Use YeoJohnsonTransformer from feature_engine
    ('yeojohnson', YeoJohnsonTransformer(variables=NUMERICALS_YEO_VARS)),
    
    
    # Transformation of continuous variables: perform binary transformation on a set of very skewed variables
    # Use Binarizer from sklearn. Needs to be wrapped by SklearnTransformerWrapper from feature-engine
    # so that I can apply Binarizer to a data subset (can choose variables)
    ('binarizer', SklearnTransformerWrapper(transformer = Binarizer(threshold=0),
                                           variables =BINARIZE_VARS)),
    
    
    # === Recoding categorical variables with ordered level ===
    
    # Recode categorical variables with ordered levels: map from string encoding to numeric encoding
    # Use the custom class Mapper from the 'preprocessors' module (imported as pp)
    ('mapper_qual',pp.Mapper(
        variables=QUAL_VARS, mappings=QUAL_MAPPINGS)),
    
    ('mapper_exposure', pp.Mapper(
        variables=EXPOSURE_VARS, mappings=EXPOSURE_MAPPINGS)),

    ('mapper_finish', pp.Mapper(
        variables=FINISH_VARS, mappings=FINISH_MAPPINGS)),

    ('mapper_garage', pp.Mapper(
        variables=GARAGE_VARS, mappings=GARAGE_MAPPINGS)),
    
    ('mapper_fence', pp.Mapper(
        variables=FENCE_VARS, mappings=FENCE_MAPPINGS)),
    

    # === Recoding categorical variables ===
    
    # Recode categorical variables: group rare labels to a "Rare" level 
    # Use RareLabelEncoder from feature_engine
    
    ('rare_label_encoder', RareLabelEncoder(tol=0.01,
                                           n_categories=1,
                                           replace_with ='Rare', # default
                                           variables=CATEGORICAL_VARS)),
    
    
    # Recode categorical variables: convert a set of "other_cat" categorical variables to ordinal variables
    # Use OrdinalEncoder from feature_engine
    ('categorical_encoder', OrdinalEncoder(encoding_method='ordered',
                                           variables=CATEGORICAL_VARS)),
    
])
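The last two steps of the pipeline can be sketched in plain pandas on made-up data. This is an approximation of the idea, not feature-engine's actual implementation, and the tolerance used here is deliberately larger than the `tol=0.01` in the pipeline so that the effect is visible on ten rows. It also shows why the pipeline needs `y_train` at fit time: the "ordered" encoding ranks categories by the mean of the target.

```python
import pandas as pd

# Made-up data: one categorical feature and a log-price target
demo = pd.DataFrame({
    "Neighborhood": ["A"] * 5 + ["B"] * 4 + ["C"],
    "log_price": [12.0] * 5 + [11.0] * 4 + [13.0],
})

# 1) Rare-label grouping: categories below a frequency tolerance become "Rare"
tol = 0.2  # illustrative value
freqs = demo["Neighborhood"].value_counts(normalize=True)
frequent_labels = freqs[freqs >= tol].index
demo["Neighborhood"] = demo["Neighborhood"].where(
    demo["Neighborhood"].isin(frequent_labels), "Rare"
)

# 2) Ordered ordinal encoding: number categories by ascending mean target
ordered_labels = demo.groupby("Neighborhood")["log_price"].mean().sort_values().index
encoding = {label: rank for rank, label in enumerate(ordered_labels)}
demo["Neighborhood"] = demo["Neighborhood"].map(encoding)

print(encoding)
```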

## Fit pipeline to X_train and y_train
- I can use 'price_pipe' (the Pipeline object that we initialized & setup in the last cell) as though it is a single transformer/estimator. I can fit it to X_train and y_train to learn the parameters (basically run the fit() method for all classes in the Pipeline) 

In [88]:
# train the pipeline (learn parameters for the transformers)
price_pipe.fit(X_train,y_train)


Pipeline(steps=[('missing_imputation',
                 CategoricalImputer(variables=['Alley', 'FireplaceQu', 'PoolQC',
                                               'Fence', 'MiscFeature'])),
                ('frequent_imputation',
                 CategoricalImputer(imputation_method='frequent',
                                    variables=['MasVnrType', 'BsmtQual',
                                               'BsmtCond', 'BsmtExposure',
                                               'BsmtFinType1', 'BsmtFinType2',
                                               'Electrical', 'GarageType',
                                               'GarageFinish', 'GarageQual',
                                               'GarageCon...
                 OrdinalEncoder(variables=['MSZoning', 'Street', 'Alley',
                                           'LotShape', 'LandContour',
                                           'Utilities', 'LotConfig',
                                           'Lan

## Use the fitted Pipeline object to transform data
- I fitted the Pipeline object price_pipe to X_train and y_train (learnt parameters like column means for imputation)
- I use the fitted Pipeline object to transform the train and test data (run all transform() methods for all classes in the pipeline) 

In [89]:
# Transform the train dataset X_train
X_train = price_pipe.transform(X_train)
# Transform the test dataset X_test
X_test = price_pipe.transform(X_test)

## Sanity check: check for missing values in the transformed train and test sets

In [90]:
# check absence of na in the transformed (processed) train and test set
[var for var in X_train.columns if X_train[var].isnull().sum()>0]

[]

In [91]:
[var for var in X_test.columns if X_test[var].isnull().sum()>0]

[]

## I can examine the parameters (learnt and stored during fit) of each step in the fitted pipeline

In [92]:
# use named_steps['name of step'] of price_pipe
price_pipe.named_steps # a dictionary of all the steps in the pipeline

{'missing_imputation': CategoricalImputer(variables=['Alley', 'FireplaceQu', 'PoolQC', 'Fence',
                               'MiscFeature']),
 'frequent_imputation': CategoricalImputer(imputation_method='frequent',
                    variables=['MasVnrType', 'BsmtQual', 'BsmtCond',
                               'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
                               'Electrical', 'GarageType', 'GarageFinish',
                               'GarageQual', 'GarageCond']),
 'missing_indicator': AddMissingIndicator(variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt']),
 'mean_imputation': MeanMedianImputer(imputation_method='mean',
                   variables=['LotFrontage', 'MasVnrArea', 'GarageYrBlt']),
 'elapsed_time': TemporalVariableTransformer(reference_variable='YrSold',
                             variables=['YearBuilt', 'YearRemodAdd',
                                        'GarageYrBlt']),
 'drop_features': DropFeatures(features_to_drop=['YrSold']),

In [93]:
# Here, we access the imputer_dict_ attribute of the CategoricalImputer class that is associated with the
# 'missing_imputation' step
price_pipe.named_steps['missing_imputation'].imputer_dict_

{'Alley': 'Missing',
 'FireplaceQu': 'Missing',
 'PoolQC': 'Missing',
 'Fence': 'Missing',
 'MiscFeature': 'Missing'}

In [94]:
price_pipe.named_steps['frequent_imputation'].imputer_dict_

{'MasVnrType': 'None',
 'BsmtQual': 'TA',
 'BsmtCond': 'TA',
 'BsmtExposure': 'No',
 'BsmtFinType1': 'Unf',
 'BsmtFinType2': 'Unf',
 'Electrical': 'SBrkr',
 'GarageType': 'Attchd',
 'GarageFinish': 'Unf',
 'GarageQual': 'TA',
 'GarageCond': 'TA'}

In [95]:
price_pipe.named_steps['rare_label_encoder'].encoder_dict_

{'MSZoning': Index(['RL', 'RM', 'FV', 'RH'], dtype='object'),
 'Street': Index(['Pave'], dtype='object'),
 'Alley': Index(['Missing', 'Grvl', 'Pave'], dtype='object'),
 'LotShape': Index(['Reg', 'IR1', 'IR2'], dtype='object'),
 'LandContour': Index(['Lvl', 'Bnk', 'HLS', 'Low'], dtype='object'),
 'Utilities': Index(['AllPub'], dtype='object'),
 'LotConfig': Index(['Inside', 'Corner', 'CulDSac', 'FR2'], dtype='object'),
 'LandSlope': Index(['Gtl', 'Mod'], dtype='object'),
 'Neighborhood': Index(['NAmes', 'CollgCr', 'OldTown', 'Edwards', 'Somerst', 'NridgHt',
        'Gilbert', 'Sawyer', 'NWAmes', 'BrkSide', 'SawyerW', 'Crawfor',
        'Mitchel', 'Timber', 'NoRidge', 'IDOTRR', 'ClearCr', 'SWISU', 'StoneBr',
        'Blmngtn', 'MeadowV', 'BrDale'],
       dtype='object'),
 'Condition1': Index(['Norm', 'Feedr', 'Artery', 'RRAn', 'PosN'], dtype='object'),
 'Condition2': Index(['Norm'], dtype='object'),
 'BldgType': Index(['1Fam', 'TwnhsE', 'Duplex', 'Twnhs', '2fmCon'], dtype='object'),
 'H

## Check the transformed datasets

In [96]:
X_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
930,9,3,4.290459,0.079663,1,2,1,3,1,0,0,19,2,1,3,3,8,5,2,2,0,0,10,10,1,0.0,4,3,4,4,3,3,6,16,1,0,1450,1466,2,5,1,3,7.290293,0,0,7.290293,0,0,2,0,3,1,4,7,4,0,0,3,2.0,3,3,610,3,3,2,100,18,0,0,0,0,0,0,2,0,7,2,3,0,0,0
656,9,3,4.276666,0.079663,1,2,1,1,1,0,0,8,2,1,3,3,5,7,49,2,0,0,6,6,2,54.0,4,3,2,3,3,1,5,806,1,0,247,1053,2,5,1,3,6.959399,0,0,6.959399,1,0,1,1,3,1,4,5,4,0,0,3,49.0,2,1,312,3,3,2,0,0,0,0,0,0,0,3,2,0,8,2,3,0,0,0
45,11,3,4.110874,0.079663,1,2,0,1,1,0,0,21,2,1,4,3,9,5,5,5,2,0,3,2,2,412.0,5,3,4,5,3,1,6,456,1,0,1296,1752,2,5,1,3,7.468513,0,0,7.468513,1,0,2,0,2,1,5,6,4,1,4,3,5.0,2,2,576,3,3,2,196,82,0,0,0,0,0,0,2,0,2,2,3,0,0,0
1348,9,3,4.246776,0.079663,1,2,2,2,1,0,0,10,2,1,3,3,7,5,9,9,0,0,10,10,1,0.0,4,3,4,4,3,4,6,1443,1,0,39,1482,2,5,1,3,7.309212,0,0,7.309212,1,0,2,0,3,1,4,5,4,1,2,3,9.0,2,2,514,3,3,2,402,25,0,0,0,0,0,0,2,0,8,2,3,1,0,0
55,9,3,4.60517,0.079663,1,2,1,1,1,0,0,8,2,1,3,3,6,5,44,44,0,0,6,7,2,272.0,3,3,2,3,3,1,4,490,1,0,935,1425,2,4,1,3,7.261927,0,0,7.261927,0,0,2,0,3,1,3,7,4,1,4,3,44.0,2,2,576,3,3,2,0,0,0,1,0,0,0,0,2,0,7,2,3,0,0,0


In [97]:
X_test.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,SaleType,SaleCondition,LotFrontage_na,MasVnrArea_na,GarageYrBlt_na
529,9,3,4.246776,0.079663,1,2,1,1,1,4,0,16,2,1,3,3,6,3,50,32,2,0,1,5,1,103.797401,4,3,4,3,3,1,3,1219,1,0,816,2035,2,3,1,3,7.830028,0,0,7.830028,1,0,3,0,4,2,3,9,0,2,3,3,32.0,2,2,484,3,3,2,0,0,1,0,0,0,0,0,2,0,3,2,0,1,1,0
491,5,3,4.369448,0.079663,1,2,0,1,1,0,0,8,0,1,3,1,6,7,65,56,0,0,1,1,1,0.0,3,3,2,3,3,1,4,403,3,1,238,806,2,3,1,2,6.864848,620,0,7.363914,1,0,1,0,3,1,2,5,4,2,3,3,65.0,1,1,240,3,3,2,0,0,1,0,0,0,0,3,2,0,8,2,3,0,0,0
459,5,3,4.246776,0.079663,1,2,1,0,1,2,0,4,2,1,3,1,5,4,59,59,0,0,3,2,0,161.0,3,3,2,3,3,1,2,185,1,0,524,709,2,3,1,3,6.886532,224,0,7.092574,1,0,1,0,3,1,4,5,4,1,3,1,59.0,1,1,352,3,3,2,0,0,1,0,0,0,0,0,2,0,7,2,3,1,0,0
279,12,3,4.418841,0.079663,1,2,0,1,1,0,0,17,2,1,3,5,7,5,31,31,2,0,7,7,2,299.0,3,3,2,4,3,1,4,392,1,0,768,1160,2,5,1,3,7.052721,866,0,7.611842,0,0,2,1,4,1,3,8,4,1,3,3,31.0,3,2,505,3,3,2,288,117,0,0,0,0,0,0,2,0,3,2,3,0,0,0
655,4,1,3.044522,0.079663,1,2,0,1,1,0,0,2,2,1,2,5,6,5,39,39,0,0,6,5,2,381.0,3,3,2,3,3,1,1,0,1,0,525,525,2,3,1,3,6.263398,567,0,6.995766,0,0,1,1,3,1,3,6,4,0,0,1,39.0,1,1,264,3,3,2,0,0,0,0,0,0,0,0,2,0,3,2,2,0,0,0


## Conclusion
Now we have all the feature engineering steps in one pipeline.

The next steps are:
- Add the scaler and model training to the pipeline, and produce a final pipeline with only the selected features.
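As a preview of that next step, here is a rough sketch of a pipeline that ends in a scaler and a model. The synthetic data, the step names, and the Lasso settings (`alpha=0.001`, `random_state=0`) are assumptions for illustration, not the final configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic numeric features and target (illustrative only)
rng = np.random.default_rng(0)
X_num = rng.normal(size=(50, 3))
y_num = 2.0 * X_num[:, 0] + rng.normal(scale=0.1, size=50)

# When the last step is an estimator, the pipeline exposes predict()
model_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("lasso", Lasso(alpha=0.001, random_state=0)),
])
model_pipe.fit(X_num, y_num)
predictions = model_pipe.predict(X_num)
print(predictions.shape)
```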