A pipeline automates the preprocessing steps and returns a clean feature matrix that can be passed directly to a machine learning algorithm. scikit-learn provides a class named ColumnTransformer, which takes the operations to apply to each type of column (for example numeric or categorical) and combines their results into a single transformed output.

In [34]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

np.random.seed(0)

In [35]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True, parser='auto')

In [36]:
numeric_features = ["age", "fare"]
categorical_features = ["embarked", "sex", "pclass"]

In [37]:
# Numeric columns: fill missing values with the column median, then standardize.
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
numeric_transformer

In [38]:
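# Categorical columns: one-hot encode, then keep the top 50% of the encoded
# columns ranked by their chi-squared score against the target.
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)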
categorical_transformer

In [39]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
preprocessor
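
A ColumnTransformer like the one above can also serve as the first step of a larger Pipeline, so that an estimator is fit directly on its output. The following is only a minimal sketch under assumptions: the small `toy_X` frame and `toy_y` labels are made up for illustration (they are not the Titanic data), and LogisticRegression is just one possible choice of estimator.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real dataset (values are illustrative).
toy_X = pd.DataFrame(
    {
        "age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
        "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
        "sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
        "embarked": ["S", "C", "S", "S", "S", "S", "S", "C"],
    }
)
toy_y = np.array([0, 1, 1, 1, 0, 0, 1, 1])

# Same idea as the notebook's preprocessor: one branch per column type.
toy_preprocessor = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            ["age", "fare"],
        ),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "embarked"]),
    ]
)

# Chain preprocessing and the estimator; fit() runs both in order.
clf = Pipeline(
    steps=[("preprocessor", toy_preprocessor), ("classifier", LogisticRegression())]
)
clf.fit(toy_X, toy_y)
predictions = clf.predict(toy_X)
```

Keeping the preprocessing inside the same Pipeline as the estimator means the imputation, scaling and encoding are learned only from the data passed to `fit`, which helps avoid leaking information when the pipeline is later cross-validated.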

Now we can check that the pipelines work as expected by fitting the preprocessor and inspecting its output.

In [40]:
import pandas as pd

df = pd.DataFrame(preprocessor.fit_transform(X, y))

df

,0,1,2,3,4,5
0,-0.039005,3.442584,1.0,0.0,1.0,0.0
1,-2.215952,2.286639,0.0,1.0,1.0,0.0
2,-2.131977,2.286639,1.0,0.0,1.0,0.0
3,0.038512,2.286639,0.0,1.0,1.0,0.0
4,-0.349075,2.286639,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...
1304,-1.163009,-0.364003,1.0,0.0,0.0,1.0
1305,-0.116523,-0.364003,1.0,0.0,0.0,1.0
1306,-0.232799,-0.503774,0.0,1.0,0.0,1.0
1307,-0.194040,-0.503774,0.0,1.0,0.0,1.0
