- Source Blog: Bex T.'s [Blog](https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d)
- Dataset: [Ames Housing dataset](https://www.kaggle.com/c/home-data-for-ml-course/data?select=train.csv)

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error

In [2]:
train = pd.read_csv('data/train.csv')
X_test = pd.read_csv('data/test.csv')

In [3]:
X = train.drop('SalePrice', axis=1)
y = train.SalePrice

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.3, random_state=42)

> `numerical_features` and `categorical_features` hold the names of the numeric and non-numeric columns of `X_train`, respectively

In [4]:
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(exclude='number').columns.tolist()
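
To see how the dtype-based split behaves, here is a minimal sketch on a hypothetical three-column frame (the `toy` DataFrame and its values are made up for illustration and are not taken from the Ames data):

```python
import pandas as pd

# Hypothetical toy frame standing in for X_train
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],        # integer column
    "LotFrontage": [65.0, None, 68.0],     # float column with a missing value
    "Street": ["Pave", "Grvl", "Pave"],    # string column
})

# Same calls as in the cell above, applied to the toy frame
toy_numerical = toy.select_dtypes(include="number").columns.tolist()
toy_categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(toy_numerical)    # ['LotArea', 'LotFrontage']
print(toy_categorical)  # ['Street']
```

Note that `include='number'` also picks up identifier-like integer columns, so it is worth checking the resulting lists before feeding them to a transformer.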

> Transformations for the numerical and categorical columns

In [5]:
numeric_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', MinMaxScaler())
])

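# handle_unknown='ignore' encodes categories that were not seen during fit as all-zero rows.
# Without it, OneHotEncoder raises an error when the validation or test set contains
# a category that is absent from the training set.
# Note: scikit-learn >= 1.2 names the dense-output flag `sparse_output`; older versions use `sparse=False`.
categorical_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])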
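
As a rough illustration of what the two branches do, the sketch below runs them on tiny made-up inputs (the arrays and the category names `Pave`, `Grvl`, `Dirt` are hypothetical). It keeps the encoder's default sparse output and calls `.toarray()`, so it should not depend on which name the dense-output flag has in your scikit-learn version:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Numeric branch: fill the NaN with the column mean (2.0), then rescale to [0, 1]
toy_numeric = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", MinMaxScaler()),
])
scaled = toy_numeric.fit_transform(np.array([[1.0], [np.nan], [3.0]]))
print(scaled.ravel())  # values 0.0, 0.5, 1.0

# Categorical branch: a category unseen during fit becomes an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Pave"], ["Grvl"]]))
encoded = encoder.transform(np.array([["Pave"], ["Dirt"]])).toarray()
print(encoded)  # first row marks 'Pave', second row ('Dirt') is all zeros
```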

`ColumnTransformer`: applies each pipeline to its own list of columns and concatenates the results

In [6]:
full_processor = ColumnTransformer(transformers=[
    ('number', numeric_pipeline, numerical_features),
    ('category', categorical_pipeline, categorical_features)
])
full_processor.fit_transform(X_train)



array([[0.09252913, 0.        , 0.20205479, ..., 0.        , 1.        ,
        0.        ],
       [0.99520219, 0.94117647, 0.04794521, ..., 0.        , 1.        ,
        0.        ],
       [0.52227553, 0.23529412, 0.17465753, ..., 0.        , 1.        ,
        0.        ],
       ...,
       [0.88690884, 0.        , 0.13356164, ..., 0.        , 1.        ,
        0.        ],
       [0.58944483, 0.17647059, 0.11643836, ..., 0.        , 1.        ,
        0.        ],
       [0.77176148, 0.58823529, 0.10958904, ..., 0.        , 1.        ,
        0.        ]])
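
The width of the transformed array is the number of numeric columns plus one column per category found during fit. A small sketch on a hypothetical frame (the `toy` data and the `toy_processor` name are assumptions for illustration) makes the bookkeeping visible, including the fact that columns not listed in any transformer are dropped by default:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame; 'Id' is deliberately left out of both column lists
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "Id": [1, 2, 3],
})

toy_processor = ColumnTransformer(transformers=[
    ("number", MinMaxScaler(), ["LotArea"]),
    ("category", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
])
out = toy_processor.fit_transform(toy)

# 1 scaled numeric column + 2 one-hot columns (Grvl, Pave);
# 'Id' is dropped because remainder defaults to 'drop'
print(out.shape)  # (3, 3)
```

Passing `remainder='passthrough'` would keep the unlisted columns instead of dropping them.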

In [7]:
lasso = Lasso(alpha=0.1)

`sklearn.pipeline.Pipeline` class: each step is a `('name_of_transformer', transformer)` tuple

In [8]:
# Warning! The order of the steps matters: the estimator must always be the last step for the pipeline to work correctly
lasso_pipeline = Pipeline(steps=[
    ('preprocess', full_processor),
    ('model', lasso)
])
_ = lasso_pipeline.fit(X_train, y_train)
preds = lasso_pipeline.predict(X_valid)
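
The step names are more than labels: they let you reach into a fitted pipeline and address nested parameters. Below is a minimal sketch on made-up data (`X_toy`, `y_toy` and `toy_pipeline` are hypothetical stand-ins, with a plain `MinMaxScaler` in place of the full preprocessor):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data where y = 2 * x exactly
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 2.0, 4.0, 6.0])

toy_pipeline = Pipeline(steps=[
    ("preprocess", MinMaxScaler()),
    ("model", Lasso(alpha=0.1)),
])
toy_pipeline.fit(X_toy, y_toy)

# Steps are reachable by the name given in the tuple
fitted_lasso = toy_pipeline.named_steps["model"]
print(fitted_lasso.coef_)

# Nested parameters follow the '<step name>__<parameter>' convention
toy_pipeline.set_params(model__alpha=0.01)
print(toy_pipeline.get_params()["model__alpha"])  # 0.01
```

The same `model__alpha` naming is what you would pass in a parameter grid if you later tune the pipeline with a search such as `GridSearchCV`.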

> Evaluate our model

In [9]:
preds = lasso_pipeline.predict(X_valid)
print(f"MAE: {mean_absolute_error(y_valid, preds)}")
lasso_pipeline.score(X_valid, y_valid)  # R^2 on the validation set

MAE: 17925.609664388747


0.8948472358936015
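
The two numbers above measure different things: MAE is the average absolute error in the units of the target, while `score` on a pipeline that ends in a regressor returns the coefficient of determination R^2. As a quick sanity check, both can be computed by hand on a few made-up values (the `y_true` and `y_pred` arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE: mean of |error| = (10 + 10 + 30) / 3
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # 100 + 100 + 900 = 1100
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 10000 + 0 + 10000 = 20000
r2_manual = 1 - ss_res / ss_tot                  # 1 - 0.055 = 0.945

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```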