In [1]:

import os
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.filterwarnings('ignore')

### Why Do You Need a Pipeline?

Data cleaning and preparation is easily the most time-consuming and tedious part of machine learning. ML algorithms are fussy: some want normalized or standardized features, some want encoded variables, and some want both. Then there is the ever-present issue of missing values.

Dealing with all this is no fun, especially since the same cleaning operations have to be repeated on the training, validation and test sets. Fortunately, Scikit-learn's `Pipeline` is a major productivity tool for this process: it cleans up the code and collapses all preprocessing and modeling steps into a single line of code.

### Intro to Scikit-learn Pipelines

In this and the coming sections, we will build a Lasso regression pipeline step by step for the [Ames Housing dataset](https://www.kaggle.com/c/home-data-for-ml-course/data), which is used for an [InClass competition](https://www.kaggle.com/c/home-data-for-ml-course/overview) on Kaggle. The dataset contains 81 variables on almost every aspect of a house, and the task is to predict the house's price from them. Let's load the training set:

In [3]:
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
PATH = r'C:\GitHub\pythonPrograms\machineLearningModels2023\datasets\Housing_Prices_Competition'

train = pd.read_csv(os.path.join(PATH,'train.csv'))
train.iloc[:, 70:]

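|  | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 0 | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 0 | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |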


In [4]:
print(f"Id unique values: {train.loc[:,'Id'].nunique()}")
print(f"# of observations: {train.loc[:,'Id'].shape[0]}")

# Id is unique per row, so it carries no predictive information: drop it
train.drop(columns=['Id'], inplace=True)

Id unique values: 1460
# of observations: 1460


In [5]:

from sklearn.model_selection import train_test_split

X = train.drop('SalePrice', axis=1)
y = train.SalePrice

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.3, random_state=42)

In [6]:
col = 'SaleType'
print(set(X_train[col]))
print(set(X_valid[col]))

{'ConLI', 'Con', 'ConLD', 'ConLw', 'WD', 'COD', 'New', 'CWD', 'Oth'}
{'ConLI', 'ConLD', 'ConLw', 'WD', 'COD', 'New', 'Oth'}


In [7]:
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if set(X_train[col]) == set(X_valid[col])]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))

# Drop categorical columns that will not be encoded
X_train = X_train.drop(bad_label_cols, axis=1)
X_valid = X_valid.drop(bad_label_cols, axis=1)
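The check above drops every categorical column whose set of values differs between the two splits. As a point of comparison, here is a minimal sketch of what the `handle_unknown='ignore'` option of `OneHotEncoder` does with a category it never saw during fitting. The toy frames `toy_train` and `toy_valid` and their values are hypothetical stand-ins, not the real splits.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy stand-ins for the real splits
toy_train = pd.DataFrame({"SaleType": ["WD", "New", "COD", "WD"]})
toy_valid = pd.DataFrame({"SaleType": ["WD", "Con"]})  # "Con" never appears in toy_train

# handle_unknown="ignore" encodes unseen categories as an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(toy_train)

# The default output is a sparse matrix; .toarray() makes it dense for printing
encoded_valid = encoder.transform(toy_valid).toarray()

print(encoder.categories_[0])  # categories learned from the training frame
print(encoded_valid)           # second row is all zeros because "Con" is unknown
```

With this option the encoder does not raise on unseen values, so dropping the mismatched columns is a conservative choice rather than a strict requirement.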

Now, let's do basic exploration of the training set:

In [8]:
X_train.describe().T.iloc[:10] # All numerical cols

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MSSubClass | 1022.0 | 57.059687 | 42.669715 | 20.0 | 20.0 | 50.0 | 70.0 | 190.0 |
| LotFrontage | 832.0 | 70.375 | 25.533607 | 21.0 | 59.0 | 70.0 | 80.0 | 313.0 |
| LotArea | 1022.0 | 10745.437378 | 11329.753423 | 1300.0 | 7564.25 | 9600.0 | 11692.5 | 215245.0 |
| OverallQual | 1022.0 | 6.12818 | 1.371391 | 1.0 | 5.0 | 6.0 | 7.0 | 10.0 |
| OverallCond | 1022.0 | 5.564579 | 1.110557 | 1.0 | 5.0 | 5.0 | 6.0 | 9.0 |
| YearBuilt | 1022.0 | 1970.995108 | 30.748816 | 1872.0 | 1953.0 | 1972.0 | 2001.0 | 2010.0 |
| YearRemodAdd | 1022.0 | 1984.757339 | 20.747109 | 1950.0 | 1966.0 | 1994.0 | 2004.0 | 2010.0 |
| MasVnrArea | 1019.0 | 105.26104 | 172.707705 | 0.0 | 0.0 | 0.0 | 170.0 | 1378.0 |
| BsmtFinSF1 | 1022.0 | 446.176125 | 459.971174 | 0.0 | 0.0 | 390.0 | 724.0 | 5644.0 |
| BsmtFinSF2 | 1022.0 | 42.368885 | 151.210531 | 0.0 | 0.0 | 0.0 | 0.0 | 1127.0 |


In [10]:
X_train.describe(include="all").T.iloc[:10] # First 10 columns, numeric and categorical

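|  | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MSSubClass | 1022.0 | NaN | NaN | NaN | 57.059687 | 42.669715 | 20.0 | 20.0 | 50.0 | 70.0 | 190.0 |
| MSZoning | 1022.0 | 5.0 | RL | 807.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LotFrontage | 832.0 | NaN | NaN | NaN | 70.375 | 25.533607 | 21.0 | 59.0 | 70.0 | 80.0 | 313.0 |
| LotArea | 1022.0 | NaN | NaN | NaN | 10745.437378 | 11329.753423 | 1300.0 | 7564.25 | 9600.0 | 11692.5 | 215245.0 |
| Street | 1022.0 | 2.0 | Pave | 1018.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Alley | 66.0 | 2.0 | Grvl | 42.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LotShape | 1022.0 | 4.0 | Reg | 638.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LandContour | 1022.0 | 4.0 | Lvl | 928.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LotConfig | 1022.0 | 5.0 | Inside | 710.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LandSlope | 1022.0 | 3.0 | Gtl | 965.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |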


In [11]:
above_0_missing = X_train.isnull().sum() > 0

X_train.isnull().sum()[above_0_missing]

LotFrontage     190
Alley           956
MasVnrType        3
MasVnrArea        3
BsmtQual         26
BsmtCond         26
BsmtExposure     26
BsmtFinType1     26
BsmtFinType2     26
FireplaceQu     487
GarageType       54
GarageYrBlt      54
GarageFinish     54
Fence           820
dtype: int64

14 features have NaNs.
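Raw counts are easier to judge as a share of the rows. A minimal sketch, using a hypothetical toy frame (the column names are borrowed from the dataset, the values are made up), of how the fraction of missing values per column could be computed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# isnull() gives booleans, so the column mean is the fraction of missing values
missing_share = toy.isnull().mean()

# Keep only columns that have missing values, largest share first
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty may be a weaker candidate for imputation than one with only a few gaps.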

### Split Numerical and Categorical Columns

In [12]:
numerical_features = X_train.select_dtypes(include='number').columns.tolist()

print(f'There are {len(numerical_features)} numerical features:', '\n')
print(numerical_features)

There are 36 numerical features: 

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


In [13]:
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()

print(f'There are {len(categorical_features)} categorical features:', '\n')
print(categorical_features)

There are 24 categorical features: 

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 'HouseStyle', 'MasVnrType', 'ExterQual', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'CentralAir', 'KitchenQual', 'FireplaceQu', 'GarageType', 'GarageFinish', 'PavedDrive', 'Fence']


Now, on to preprocessing. For numeric columns, we first fill the missing values with the column mean using `SimpleImputer` and then scale the features with `MinMaxScaler`. For categorical columns, we use `SimpleImputer` to fill the missing values with the mode of each column and then one-hot encode them with `OneHotEncoder`. Most importantly, we do all of this in a pipeline. Let's import everything:

In [14]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline

We create two small pipelines for both numeric and categorical features:

In [15]:
numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])

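categorical_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and removed in 1.4;
    # on older versions, use sparse=False instead
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])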

The [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class takes a list of tuples for its `steps` argument. Each tuple should have this pattern:

```
('name_of_transformer', transformer)
```

Each tuple is called a *step*: it pairs an arbitrary name with a transformer such as `SimpleImputer`. The steps are chained and applied to the passed DataFrame in the given order.
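To make the chaining concrete, here is a minimal sketch on a hypothetical one-column toy array (`X_toy` and `toy_pipeline` are illustrative names, not part of the notebook). It checks that the pipeline gives the same result as applying the two steps by hand in the same order.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column with one missing value
X_toy = np.array([[1.0], [np.nan], [3.0], [5.0]])

toy_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
piped = toy_pipeline.fit_transform(X_toy)

# The same two steps applied by hand, in the same order
imputed = SimpleImputer(strategy="mean").fit_transform(X_toy)  # NaN -> mean of 1, 3, 5
manual = MinMaxScaler().fit_transform(imputed)                 # (x - min) / (max - min)

print(piped.ravel())
print(np.allclose(piped, manual))

# Fitted steps remain accessible through the names given in the tuples
print(toy_pipeline.named_steps["impute"].statistics_)
```

The step names are what make the fitted transformers, and later their hyperparameters, addressable from outside the pipeline.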

But these two pipelines are useless if we don't tell them which columns they should be applied to. For that, we will use another transformer, `ColumnTransformer`.

### Column Transformer

By default, all `Pipeline` objects have `fit` and `transform` methods which can be used to transform the input array like this:

In [16]:
numeric_pipeline.fit_transform(X_train.select_dtypes(include='number'))

array([[0.        , 0.20205479, 0.0425343 , ..., 0.        , 0.36363636,
        0.5       ],
       [0.94117647, 0.04794521, 0.01110098, ..., 0.        , 0.36363636,
        0.        ],
       [0.23529412, 0.17465753, 0.03430788, ..., 0.        , 0.45454545,
        1.        ],
       ...,
       [0.        , 0.13356164, 0.0321204 , ..., 0.        , 0.27272727,
        0.        ],
       [0.17647059, 0.11643836, 0.02964313, ..., 0.        , 0.45454545,
        0.25      ],
       [0.58823529, 0.10958904, 0.01114305, ..., 0.        , 0.45454545,
        0.75      ]])

Above, we apply the new numeric preprocessor to `X_train` using `fit_transform`, selecting the columns with `select_dtypes`. But using the pipelines this way means calling each pipeline separately on its own columns, which is not what we want. What we want is a single preprocessor that performs both the numeric and the categorical transformations in a single line of code, like this:

```python
full_processor.fit_transform(X_train)
```

To achieve this, we will use `ColumnTransformer` class:

In [17]:
from sklearn.compose import ColumnTransformer

full_processor = ColumnTransformer(
        transformers=[('number', numeric_pipeline, numerical_features),
                     ('category', categorical_pipeline, categorical_features)]
)

Similar to the `Pipeline` class, `ColumnTransformer` takes a list of tuples. Each tuple should contain an arbitrary step name, the transformer itself and the list of column names that the transformer should be applied to. Here, we create a column transformer with 2 steps from our numeric and categorical preprocessing pipelines. Now, we can use it to fully transform `X_train`:

In [18]:
full_processor.fit_transform(X_train)

array([[0.        , 0.20205479, 0.0425343 , ..., 0.        , 1.        ,
        0.        ],
       [0.94117647, 0.04794521, 0.01110098, ..., 0.        , 1.        ,
        0.        ],
       [0.23529412, 0.17465753, 0.03430788, ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.        , 0.13356164, 0.0321204 , ..., 0.        , 1.        ,
        0.        ],
       [0.17647059, 0.11643836, 0.02964313, ..., 0.        , 0.        ,
        0.        ],
       [0.58823529, 0.10958904, 0.01114305, ..., 0.        , 1.        ,
        0.        ]])
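The real output is too wide to inspect by eye, so here is a minimal sketch of the same idea on a hypothetical two-column toy frame (the names `toy` and `toy_processor` and the values are assumptions for illustration). It uses `sparse_threshold=0` so that the result is a dense array regardless of how the encoder is configured.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 12000.0, 10000.0],
    "Street": ["Pave", "Grvl", np.nan, "Pave"],
})

num_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
cat_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("one-hot", OneHotEncoder(handle_unknown="ignore")),
])

toy_processor = ColumnTransformer(
    transformers=[
        ("number", num_pipe, ["LotArea"]),
        ("category", cat_pipe, ["Street"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

out = toy_processor.fit_transform(toy)
print(out.shape)  # 1 scaled numeric column + 2 one-hot columns
print(out)
```

The output columns follow the order of the `transformers` list: first the scaled numeric column, then one column per category of `Street`.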

### Final Pipeline With an Estimator

Adding an estimator (model) to a pipeline is as easy as creating a new pipeline which contains the above column transformer and the model itself. Let's import and instantiate `Lasso` and add it to a new pipeline with the `full_processor`:

In [17]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

lasso = Lasso(alpha=0.1)

lasso_pipeline = Pipeline(steps=[
    ('preprocess', full_processor),
    ('model', lasso)
])

That's it! We can now use `lasso_pipeline` just like any other model. When we call `.fit`, the pipeline applies all the transformations before fitting the estimator:

In [18]:
_ = lasso_pipeline.fit(X_train, y_train)
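One consequence of fitting everything through the pipeline is that the preprocessing statistics come from the training rows only and are then reused unchanged at prediction time. A minimal sketch on hypothetical toy arrays (`X_tr`, `X_va` and `toy_pipe` are illustrative names):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; the validation value lies outside the training range
X_tr = np.array([[0.0], [5.0], [10.0]])
y_tr = np.array([1.0, 2.0, 3.0])
X_va = np.array([[20.0]])

toy_pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", Lasso(alpha=0.01)),
])
toy_pipe.fit(X_tr, y_tr)

scaler = toy_pipe.named_steps["scale"]
print(scaler.data_min_, scaler.data_max_)  # min and max learned from X_tr only
print(scaler.transform(X_va))              # scaled with the training min and max

# predict() is the fitted transform followed by the model's predict
manual = toy_pipe.named_steps["model"].predict(scaler.transform(X_va))
print(np.allclose(toy_pipe.predict(X_va), manual))
```

Because the scaler is never refit on the validation rows, no information from them leaks into the preprocessing.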

Let's evaluate our base model on the validation set (Remember, we have a separate testing set which we haven't touched so far):

In [19]:
preds = lasso_pipeline.predict(X_valid)
mean_absolute_error(y_valid, preds)

20699.227383783924

Great, our base pipeline works. Another great thing about pipelines is that they can be treated like any other model. In other words, we can plug one in anywhere we would use a Scikit-learn estimator. So, in the next section, we will use the pipeline in a grid search to find the optimal hyperparameters.

### Using Your Pipeline Everywhere

The main hyperparameter for `Lasso` is `alpha`, which can range from 0 to infinity. For simplicity, we will first cross-validate only over values from 0 up to (but not including) 1, in steps of 0.05:

In [20]:
from sklearn.model_selection import GridSearchCV

param_dict = {'model__alpha': np.arange(0, 1, 0.05)}

search = GridSearchCV(lasso_pipeline, param_dict, 
                      cv=10, 
                      scoring='neg_mean_absolute_error',
                      n_jobs=-1)

_ = search.fit(X_train, y_train)
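The key `model__alpha` in the parameter grid follows the pipeline's naming scheme: the step name, a double underscore, then the parameter name. A minimal sketch with a hypothetical two-step pipeline (`toy_pipe` is an illustrative name) showing where that key comes from:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

toy_pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", Lasso(alpha=0.1)),
])

# Parameters of nested steps are exposed as <step name>__<parameter name>
alpha_keys = [key for key in toy_pipe.get_params() if key.endswith("__alpha")]
print(alpha_keys)

# The same key works with set_params, which is what a grid search relies on
toy_pipe.set_params(model__alpha=0.5)
print(toy_pipe.named_steps["model"].alpha)
```

Printing `get_params().keys()` on the real pipeline is a quick way to look up the exact key for a parameter nested deeper, for example inside the column transformer.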

Now, we can get the best score and parameters for `Lasso`:

In [21]:
print('Best score:', abs(search.best_score_))

Best score: 19890.264374381448


In [22]:
print('Best alpha:', search.best_params_)

Best alpha: {'model__alpha': 0.9500000000000001}
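The trailing digits in `0.9500000000000001` are a floating-point artifact of building the grid with `np.arange` and a fractional step, not a meaningful difference from 0.95. A minimal sketch of two ways to get a cleaner grid; which one to prefer is a matter of taste:

```python
import numpy as np

# The grid used above: 20 values, the last one close to 0.95
grid = np.arange(0, 1, 0.05)
print(len(grid), grid[-1])

# Option 1: round the values to the intended number of decimals
rounded = np.round(grid, 2)
print(rounded[-1] == 0.95)

# Option 2: give the end points and the number of values instead of a step
linspace_grid = np.linspace(0, 0.95, 20)
print(linspace_grid[-1] == 0.95)
```

Either grid can be passed as the value of `model__alpha` in the parameter dictionary.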


As you can see, the best `alpha` is 0.95, which is the very end of our given interval, i.e. \[0, 1) with a step of 0.05. We need to search again in case the best parameter lies in a bigger interval:

In [23]:
param_dict = {'model__alpha': np.arange(1, 200, 5)}

search = GridSearchCV(lasso_pipeline, param_dict, 
                      cv=10, 
                      scoring='neg_mean_absolute_error',
                      n_jobs=-1)

_ = search.fit(X_train, y_train)

In [24]:
print('Best score:', abs(search.best_score_))

Best score: 18768.49506250806


In [25]:
print('Best alpha:', search.best_params_)

Best alpha: {'model__alpha': 106}


With the best hyperparameters, the cross-validated MAE drops from about 19,890 to about 18,768. Let's redefine our pipeline with `Lasso` set to the best `alpha`:

In [26]:
lasso = Lasso(alpha=106)  # best alpha reported by the grid search above

final_lasso_pipe = Pipeline(steps=[
    ('preprocess', full_processor),
    ('model', lasso)
])
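As an alternative to copying the best value by hand, `GridSearchCV` refits the best configuration on the whole training data by default (`refit=True`) and exposes it as `best_estimator_`. A minimal sketch on synthetic data (`X_toy`, `y_toy`, `toy_pipe` and `toy_search` are hypothetical names, and the small grid is chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=60)

toy_pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("model", Lasso()),
])
toy_search = GridSearchCV(
    toy_pipe,
    {"model__alpha": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
toy_search.fit(X_toy, y_toy)

best_alpha = toy_search.best_params_["model__alpha"]

# With the default refit=True, this pipeline is already fitted on all of X_toy
final_pipe = toy_search.best_estimator_
print(best_alpha, final_pipe.named_steps["model"].alpha)
print(final_pipe.predict(X_toy[:2]))
```

Reading the value from `best_params_` keeps the final pipeline in sync with the search if the grid or the data changes later.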