# Combining feature engineering and model fitting (ColumnTransformer vs. Pipeline)

<span>Photo by <a href="https://unsplash.com/@spacexuan?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Crystal Kwok</a> on <a href="https://unsplash.com/s/photos/pipes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

In the previous post, we learned about various missing data imputation strategies using scikit-learn. Before we dive into how we automatically determine the best imputation method for a given problem, I would like to first touch on two scikit-learn classes, `ColumnTransformer` and `Pipeline`, and the differences between them. 

Both `ColumnTransformer` and `Pipeline` are used to combine different feature engineering steps (e.g. missing data imputation, categorical variable encoding, feature scaling, etc.) to transform data before feeding it into a model, but there are two major differences between them:


**1. `ColumnTransformer` is only for transformers, whereas `Pipeline` is for both transformers and an estimator**   
**2. `ColumnTransformer` is parallel, whereas `Pipeline` is sequential**


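First of all, what do I mean by transformer and estimator? A transformer is an object that learns something from the data in `fit` and then changes the data in `transform` (e.g. `SimpleImputer`, `StandardScaler`, `OneHotEncoder`). By estimator I mean the model itself, an object that learns from the data in `fit` and makes predictions in `predict` (e.g. `Lasso`). Strictly speaking, scikit-learn calls any object with a `fit` method an estimator, but in this post I use the word for the final model only.

Second, what do I mean by parallel vs. sequential? In a `Pipeline`, each step receives the output of the previous step. In a `ColumnTransformer`, every transformer receives its own columns of the original input, and the results are concatenated side by side. We will see both in action in the sections that follow.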
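To make the transformer vs. estimator distinction concrete, here is a minimal sketch on toy data. The arrays `X` and `y` are made up for illustration and are not part of the house price data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# toy data, made up for illustration: one feature with a missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# transformer: learns the column mean in fit() and fills it in during transform()
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)

# estimator (model): learns coefficients in fit() and returns predictions in predict()
model = LinearRegression()
model.fit(X_filled, y)
y_pred = model.predict(X_filled)

# the imputer cannot predict, and the model cannot transform
print(hasattr(imputer, 'predict'), hasattr(model, 'transform'))
```

The last line prints `False False`: each kind of object only offers the methods that match its role.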

# Prepare data

Let's first prepare the [house price data from Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) we will be using in this post. It is the same data used in the previous post as well. Do not forget to split the data into train and test sets before performing any feature engineering steps!

In [6]:
import pandas as pd 

# preparing data 
from sklearn.model_selection import train_test_split

# feature engineering: imputation, scaling, encoding
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# putting together in pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# model selection
from sklearn.linear_model import Lasso, LinearRegression

In [4]:
# import house price data 
df = pd.read_csv('../data/house_price/train.csv', index_col='Id')

# numerical columns vs. categorical columns 
num_cols = df.drop('SalePrice', axis=1).select_dtypes('number').columns
cat_cols = df.drop('SalePrice', axis=1).select_dtypes('object').columns

# split train and test dataset 
X_train, X_test, y_train, y_test = train_test_split(df.drop('SalePrice', axis=1), 
                                                    df['SalePrice'], 
                                                    test_size=0.3, 
                                                    random_state=0)

# check the size of train and test data
X_train.shape, X_test.shape

((1022, 79), (438, 79))
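Why split first? A rough sketch of the idea, using a made-up toy frame (the names `toy`, `toy_imputer`, etc. are hypothetical): the imputer should learn its statistics from the training rows only and then reuse them on the test rows, so that no information from the test set leaks into the feature engineering.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# toy frame with made-up values, standing in for the house price data
toy = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, 70.0, np.nan, 90.0],
                    'SalePrice': [200, 180, 250, 210, 190, 300]})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy[['LotFrontage']], toy['SalePrice'], test_size=0.33, random_state=0)

# learn the mean from the training rows only ...
toy_imputer = SimpleImputer(strategy='mean')
toy_imputer.fit(toy_X_train)

# ... and reuse that same mean to fill gaps in the test rows
toy_X_test_filled = toy_imputer.transform(toy_X_test)

# the learned statistic is the training mean, not the mean of all rows
print(toy_imputer.statistics_)
```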

# Pipeline

Now that we have the data ready, let's say we want to train a Lasso regression model that predicts `SalePrice` from the variables we have. Instead of using all 79 variables, let's use only the numerical variables this time (we will include the categorical variables in the next section). As we saw in the previous post, some columns have plenty of missing data (e.g. `LotFrontage`, `MasVnrArea`, and `GarageYrBlt` among the numerical columns), so we definitely want to perform missing data imputation before fitting a model. Let's say we also want to scale the data using `StandardScaler`. The following is what we would normally do:

In [9]:
# take only numerical data
X_temp = X_train[num_cols].copy()

# missing data imputation
imputer = SimpleImputer(strategy='mean')
X_impute = imputer.fit_transform(X_temp)  # np.ndarray
X_impute = pd.DataFrame(X_impute, columns=X_temp.columns)  # pd.DataFrame


# scale data 
scaler = StandardScaler()
X_scale = scaler.fit_transform(X_impute)  # np.ndarray
X_scale = pd.DataFrame(X_scale, columns=X_temp.columns)  # pd.DataFrame

# fit model 
lasso = Lasso()
lasso.fit(X_scale, y_train)
lasso.score(X_scale, y_train)

0.8419801151434141

This is great but there are manual steps for moving transformed data: we have to pass the output of the first step (`SimpleImputer`) to the second step (`StandardScaler`) as an input. And then, the output of the second step (`StandardScaler`) is passed to the third step (`Lasso`) as an input. If we have more feature engineering steps, it will be more complex to handle different inputs and outputs. So, here comes Pipeline to the rescue!

**`Pipeline` is a class with which you can put transformers and an estimator (model) together as sequential steps**. You just need to pass a list of tuples, one per step, in the form (step_name, transformer or estimator object). The estimator, if there is one, must be the last step. Let's rewrite the same flow above with `Pipeline`.

In [10]:
pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                 ('scaler', StandardScaler()),
                 ('lasso', Lasso())])

pipe.fit(X_temp, y_train)
pipe.score(X_temp, y_train)

0.8419801151434141

Great! We saved a lot of lines and the code looks much cleaner! As you can see, **Pipeline passes the first step's output to the next step as its input, meaning Pipeline is sequential**.
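The sequential behaviour can be checked on synthetic data. The sketch below uses made-up arrays (`X_demo`, `y_demo`) and an arbitrary `alpha=0.1`, neither of which comes from the house price example. It compares the manual chain with the pipeline and then looks inside a fitted step through `named_steps`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, made up for illustration
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=50)
X_demo[::10, 0] = np.nan  # knock out a few values in the first column

# manual version: hand each output to the next step ourselves
imp = SimpleImputer(strategy='mean')
sc = StandardScaler()
X_manual = sc.fit_transform(imp.fit_transform(X_demo))
manual_score = Lasso(alpha=0.1).fit(X_manual, y_demo).score(X_manual, y_demo)

# pipeline version: the same steps, chained automatically
demo_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                      ('scaler', StandardScaler()),
                      ('lasso', Lasso(alpha=0.1))])
demo_pipe.fit(X_demo, y_demo)
pipe_score = demo_pipe.score(X_demo, y_demo)

# both routes should give the same score
print(np.isclose(manual_score, pipe_score))

# fitted steps stay accessible by the names we gave them
learned_means = demo_pipe.named_steps['imputer'].statistics_
print(learned_means)
```

Naming the steps is not just cosmetic: the names are how you reach a fitted step afterwards, as `named_steps['imputer']` does here.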

# ColumnTransformer

Now let's go back to our original dataset, where we have both numerical and categorical variables. Because we cannot apply mean imputation to categorical variables, we decide to use mode imputation for the categorical variables and mean imputation for the numerical variables. Okay, then can we do something like this?

In [16]:
pipe = Pipeline([('num_imputer', SimpleImputer(strategy='mean')),
                 ('cat_imputer', SimpleImputer(strategy='most_frequent')),
                 ('lasso', Lasso())])

# pipe.fit throws an error
# ------
# ValueError: Cannot use mean strategy with non-numeric data:
# could not convert string to float: 'RL'
# ------

# pipe.fit(X_train, y_train)


Unfortunately, no! When `pipe` fits `SimpleImputer(strategy='mean')` to `X_train`, which has both numerical and categorical columns, it fails because `SimpleImputer(strategy='mean')` can only be applied to numerical variables. So, we need to let our pipeline know that we want to apply mean imputation to the numerical variables only and most_frequent imputation to the categorical variables.

How do we do that? With `ColumnTransformer`! 

`ColumnTransformer` is similar to `Pipeline` in the sense that you pass a list of tuples, but this time each tuple has a third element: (step_name, transformer object, columns), where columns are the names of the columns you want to apply the transformer to.

In [None]:
transformer = ColumnTransformer([('numerical', SimpleImputer(strategy='mean'), num_cols)])

# fit
X_train_transformed = transformer.fit_transform(X_train)
X_train_transformed = pd.DataFrame(X_train_transformed, columns=num_cols)

X_train_transformed.head()

However, you may have noticed that the output does not contain all the columns of the DataFrame you passed in. It contains only the columns you listed for the transformer, because `ColumnTransformer` drops every other column by default (`remainder='drop'`).
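If you would rather keep the untouched columns, `ColumnTransformer` accepts `remainder='passthrough'`. Here is a small sketch on a made-up two-column frame (`toy` is hypothetical; the column names are borrowed from the house price data only for flavour).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# toy frame with made-up values
toy = pd.DataFrame({'LotArea': [8450.0, np.nan, 11250.0],
                    'MSZoning': ['RL', 'RM', 'RL']})

# default remainder='drop': only the listed column comes back
drop_rest = ColumnTransformer(
    [('numerical', SimpleImputer(strategy='mean'), ['LotArea'])])
out_drop = drop_rest.fit_transform(toy)
print(out_drop.shape)  # (3, 1)

# remainder='passthrough': unlisted columns are appended unchanged
keep_rest = ColumnTransformer(
    [('numerical', SimpleImputer(strategy='mean'), ['LotArea'])],
    remainder='passthrough')
out_keep = keep_rest.fit_transform(toy)
print(out_keep.shape)  # (3, 2)
```

Note that the passed-through columns are placed after the transformed ones, so the column order can still differ from the input.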

This time, let's put both transformers together.

In [None]:
# applying different transformers to different columns 
transformer = ColumnTransformer(
    [('numerical', SimpleImputer(strategy='mean'), num_cols), 
     ('categorical', SimpleImputer(strategy='most_frequent'), cat_cols)])


transformer.fit(X_train)
X_train_transformed = transformer.transform(X_train)
X_train_transformed = pd.DataFrame(X_train_transformed, 
                                   columns=list(num_cols) + list(cat_cols))

In [None]:
X_train_transformed.head()

Now the output columns are `list(num_cols) + list(cat_cols)`, which can differ from the original column order. **Remember the output columns are the concatenated outputs of each step in ColumnTransformer**. 

Still, it is very handy that a few lines of code are enough to perform multiple feature engineering steps.

But what if we want a second step for the same columns, e.g. `StandardScaler` after imputing the numerical columns? Adding it as another entry in the `ColumnTransformer` does not do what we want: each entry is fitted on the original input independently, so the scaler would receive the un-imputed columns, and its output would be concatenated next to the imputed columns instead of replacing them.

Then how do we apply multiple steps to one set of columns and different steps to another set of columns? 

We use `Pipeline` again!



In [None]:

# numerical columns: impute with the mean, then scale
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                         ('scaler', StandardScaler())])

# categorical columns: impute with the mode, then one-hot encode
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                         ('encoder', OneHotEncoder(handle_unknown='ignore'))])

# route each sub-pipeline to its own set of columns
preprocessor_pipe = ColumnTransformer([("num_pipeline", num_pipeline, num_cols),
                                       ("cat_pipeline", cat_pipeline, cat_cols)])

preprocessor_pipe.fit(X_train)
X_train_imputed = preprocessor_pipe.transform(X_train)

In [None]:
X_train_imputed
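Since a `ColumnTransformer` is itself a transformer, it can serve as the first step of an outer `Pipeline` that ends with the model. The following is a minimal sketch on a made-up frame. The names `toy`, `toy_y`, `pre` and `full` are hypothetical, and the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data with made-up values
toy = pd.DataFrame({
    'LotArea': [8450.0, 9600.0, np.nan, 9550.0, 14260.0, 14115.0],
    'MSZoning': ['RL', 'RL', 'RM', np.nan, 'RL', 'RM'],
})
toy_y = pd.Series([208500, 181500, 223500, 140000, 250000, 143000])

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                     ('scaler', StandardScaler())])
cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                     ('encoder', OneHotEncoder(handle_unknown='ignore'))])

# parallel part: one sub-pipeline per column group
pre = ColumnTransformer([('num', num_pipe, ['LotArea']),
                         ('cat', cat_pipe, ['MSZoning'])])

# sequential part: preprocessing first, then the model
full = Pipeline([('preprocessor', pre),
                 ('lasso', Lasso())])

full.fit(toy, toy_y)
preds = full.predict(toy)

# an unseen category is encoded as all zeros thanks to handle_unknown='ignore'
new_row = pd.DataFrame({'LotArea': [10000.0], 'MSZoning': ['FV']})
new_pred = full.predict(new_row)
print(preds.shape, new_pred.shape)
```

With this layout, a single `fit` call learns every imputation value, scaling parameter, category list and model coefficient from the training data, and a single `predict` call applies them all to new data.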

Differences between `Pipeline` and `ColumnTransformer`:

Pipeline
- Passes the output of each step to the next step as its input
- Sequential
- No option to specify which columns each step applies to
- Output shape is determined by the steps themselves (the same as the input for imputers and scalers, but not for e.g. `OneHotEncoder`)
- Can end with an estimator
- Depending on the last step: fit, fit_transform, fit_predict, predict, score, etc.

ColumnTransformer
- Treats each transformer independently
- Parallel
- Can specify which columns each transformer applies to
- Output is the concatenation of each transformer's output, so its size can differ from the input
- The output column order follows the order in which the transformers are listed
- Only for column transformation, no prediction
- fit, transform, fit_transform (no predict or score)
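The parallel vs. sequential difference is easy to see by giving the same two transformers to each class. This is a small sketch on a made-up array (`X_toy` is hypothetical), with both transformers pointed at the same two columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy array with made-up values and two missing entries
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# parallel: both transformers receive the original two columns
parallel = ColumnTransformer([('imputer', SimpleImputer(strategy='mean'), [0, 1]),
                              ('scaler', StandardScaler(), [0, 1])])
out_parallel = parallel.fit_transform(X_toy)

# sequential: the scaler receives the imputer's output
sequential = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                       ('scaler', StandardScaler())])
out_sequential = sequential.fit_transform(X_toy)

print(out_parallel.shape, np.isnan(out_parallel).any())      # (3, 4) True
print(out_sequential.shape, np.isnan(out_sequential).any())  # (3, 2) False
```

The `ColumnTransformer` output has four columns (two imputed, two scaled) and still contains missing values, because `StandardScaler` ignores NaNs when fitting and leaves them in place when transforming. The `Pipeline` output has two columns that are both imputed and scaled.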
