# Getting started with Pipelines

This notebook is a small illustrative example of how [scikit-learn's](https://scikit-learn.org/stable/) pipelines can be used for feature engineering and data transformation in one go. The aim is to show a few ways pipelines can be used; it should not be considered a full tutorial. For further reading, see the Acknowledgement section.

The main advantage of pipelines is that exactly the same transformations can be applied to all datasets (train, validation and test) easily, in a compact manner and without a large amount of overhead code. Error handling can also be done in a single place rather than being distributed over several classes, functions and interfaces.
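A minimal sketch of this idea, using a small made-up array rather than the Titanic data: the pipeline learns its statistics from the training rows only and then applies exactly the same transformation to any other split. The toy values and the `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): one numeric feature with a missing entry.
toy_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
toy_test = np.array([[np.nan], [10.0]])

toy_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# fit() learns the median and the scaling parameters from the training split only.
toy_pipeline.fit(toy_train)

# transform() reuses those learned values for every other split.
toy_test_transformed = toy_pipeline.transform(toy_test)

learned_median = toy_pipeline.named_steps["imputer"].statistics_[0]
print(learned_median)  # 2.0
print(toy_test_transformed)
```

The missing test value is filled with the training median, so it ends up at exactly zero after scaling, regardless of what the test split itself contains.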

In this example we are going to apply the following common steps via pipelines:
- Correcting missing data via `SimpleImputer`
- Scaling the data using `StandardScaler`
- Feature engineering by implementing two custom transformers


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List, Any

In [2]:
URL_TO_DATA = (
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
TEST_SIZE = 0.2
VALID_SIZE = 0.25
RANDOM_STATE = 42
NUMERIC_TRANSFORMER_REPLACEMENT = "median"

As an example we use the classic Titanic survival dataset. For a detailed description of what the data looks like and what each column represents, please have a look at [Kaggle](https://www.kaggle.com/competitions/titanic/data). As common features we add the family size and extract the title from each name; the titles are corrected later inside the pipeline.

In [3]:
# in case of CERTIFICATE_VERIFY_FAILED run Install Certificates.command
# see also https://stackoverflow.com/questions/50236117/scraping-ssl-certificate-verify-failed-error-for-http-en-wikipedia-org
df = pd.read_csv(filepath_or_buffer=URL_TO_DATA, index_col=0)


df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
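# Extract the title, i.e. the first word directly followed by a period (e.g. "Mr", "Miss").
# The raw string avoids an invalid-escape warning for "\.".
df["Title"] = df["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)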
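To see what the title pattern captures, here is a small standalone sketch applied to three names from the dataset. The variable names are hypothetical and only used for this illustration.

```python
import pandas as pd

# Three passenger names from the Titanic data.
names = pd.Series(
    [
        "Anderson, Mr. Harry",
        "Davies, Mr. Charles Henry",
        'Brown, Miss. Amelia "Mildred"',
    ]
)

# The pattern captures the first run of letters that is followed by a period.
# expand=False returns a Series instead of a one-column DataFrame.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mr', 'Miss']
```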

Splitting the data into test, train and validation datasets:

In [4]:
y = df["Survived"]
X = df.drop(columns=["Survived"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=VALID_SIZE, random_state=RANDOM_STATE
)  # 0.25 x 0.8 = 0.2
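The two-step split yields a 60/20/20 partition. A quick sketch with 100 dummy rows (hypothetical data, `toy_` names chosen so they do not overwrite the real splits) makes the arithmetic visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows make the proportions easy to read.
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.zeros(100)

# First split: 20 % of all rows go to the test set.
toy_X_rest, toy_X_test, toy_y_rest, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
# Second split: 25 % of the remaining 80 rows go to the validation set.
toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X_rest, toy_y_rest, test_size=0.25, random_state=42
)

print(len(toy_X_train), len(toy_X_val), len(toy_X_test))  # 60 20 20
```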

In [5]:
X_train.dtypes
X_train.head()

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | FamilySize | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 461 | 1 | Anderson, Mr. Harry | male | 48.0 | 0 | 0 | 19952 | 26.55 | E12 | S | 1 | Mr |
| 302 | 3 | McCoy, Mr. Bernard | male | NaN | 2 | 0 | 367226 | 23.25 | NaN | Q | 3 | Mr |
| 386 | 2 | Davies, Mr. Charles Henry | male | 18.0 | 0 | 0 | S.O.C. 14879 | 73.5 | NaN | S | 1 | Mr |
| 321 | 3 | Dennis, Mr. Samuel | male | 22.0 | 0 | 0 | A/5 21172 | 7.25 | NaN | S | 1 | Mr |
| 346 | 2 | Brown, Miss. Amelia "Mildred" | female | 24.0 | 0 | 0 | 248733 | 13.0 | F33 | S | 1 | Miss |


As input for our pipeline we need to group our columns by type and by the transformations we want to apply to them.

In [6]:
numeric_features = ["Age", "Fare"]
categorical_features = [
    "Pclass",
    "Sex",
    "SibSp",
    "Parch",
    "Embarked",
    "Title",
    "FamilySize",
]
discretized_features = ["FamilySize"]
corrector_features = ["Title"]
BINS = [0, 1, 2, 4, np.inf]  # np.Inf was removed in NumPy 2.0, np.inf works in all versions
LABELS = ["ALONE", "SMALL", "MED", "LARGE"]
KNOWN_PROBLEMS = [
    "Mlle",
    "Mme",
    "Ms",
    "Dr",
    "Major",
    "Lady",
    "Countess",
    "Jonkheer",
    "Col",
    "Rev",
    "Capt",
    "Sir",
    "Don",
]

KNOWN_CORRECTIONS = [
    "Miss",
    "Miss",
    "Miss",
    "Mr",
    "Mr",
    "Mrs",
    "Mrs",
    "Other",
    "Other",
    "Other",
    "Mr",
    "Mr",
    "Mr",
]
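As a sketch of how the bin edges and labels defined above are meant to be used, the snippet below applies `pd.cut` to a few hypothetical family sizes. The bins and labels are repeated under `demo_` names so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as in the cell above.
demo_bins = [0, 1, 2, 4, np.inf]
demo_labels = ["ALONE", "SMALL", "MED", "LARGE"]

# Hypothetical family sizes.
family_sizes = pd.Series([1, 2, 3, 4, 5, 11])

# pd.cut uses right-closed intervals by default: (0, 1], (1, 2], (2, 4], (4, inf].
size_groups = pd.cut(family_sizes, bins=demo_bins, labels=demo_labels)
print(size_groups.tolist())  # ['ALONE', 'SMALL', 'MED', 'MED', 'LARGE', 'LARGE']
```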

First we take care of the numeric inputs. To fill missing values we can use [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) and provide a strategy for how they should be replaced. In this example we use the median; the other strategies provided are `mean`, `most_frequent` and `constant`.
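To compare the strategies, the following sketch fills the single missing entry of a toy column (hypothetical values) with each of them. The fill value `0.0` for the `constant` strategy is an arbitrary choice for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing entry in the last row.
toy_column = np.array([[1.0], [1.0], [4.0], [10.0], [np.nan]])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "constant": SimpleImputer(strategy="constant", fill_value=0.0),
}

# Record the value each strategy puts in place of the missing entry.
filled = {
    name: imputer.fit_transform(toy_column)[-1, 0]
    for name, imputer in imputers.items()
}
print(filled)
```

The median (2.5 here) is less sensitive to the outlier 10.0 than the mean (4.0), which is one common reason to prefer it for skewed columns such as fares.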

Scaling is another common practice when training machine learning models, since for many algorithms learning tends to be much harder, or even impossible, if features are on completely different scales.
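The sketch below shows what `StandardScaler` computes, using two hypothetical features on very different scales: each column is centred on its mean and divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
toy_features = np.array([[20.0, 7.0], [30.0, 70.0], [40.0, 500.0]])

scaled = StandardScaler().fit_transform(toy_features)

# Manual equivalent: subtract the column mean and divide by the
# population standard deviation (ddof=0, the NumPy default).
manual = (toy_features - toy_features.mean(axis=0)) / toy_features.std(axis=0)

print(np.allclose(scaled, manual))  # True
```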

In [9]:
# fmt: off
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy=NUMERIC_TRANSFORMER_REPLACEMENT)),
        ("scaler", StandardScaler()),
    ]
)
numeric_transformer
# fmt: on

In [13]:
class TitleCorrector(BaseEstimator, TransformerMixin):
    """
    Transformer that corrects the Title column and optionally one-hot encodes it.
    """

    def __init__(
        self,
        known_problems: List,
        known_corrections: List,
        encoder: object = None,
    ):
        self.known_corrections = known_corrections
        self.known_problems = known_problems
        self.encoder = encoder
        self.feature_names = None

    def _correct(self, X):
        """Replace each known problematic title with its correction."""
        return X.apply(
            lambda x: x.replace(self.known_problems, self.known_corrections)
        )

    def fit(self, X, y=None) -> object:
        """Fit the optional encoder on the corrected training data only."""
        X_corrected = self._correct(X)
        self.feature_names = X_corrected.columns.tolist()

        if self.encoder is not None:
            self.encoder.fit(X_corrected)
            self.feature_names = self.encoder.get_feature_names_out()

        return self

    def transform(self, X) -> object:
        """Correct the titles. If an encoder is provided, also encode with the fitted encoder."""
        X_corrected = self._correct(X)

        if self.encoder is not None:
            # Only transform here: refitting on validation or test data could
            # change the number and order of the output columns.
            X_corrected = self.encoder.transform(X_corrected)

        return X_corrected

    def get_feature_names_out(self, input_features=None) -> list:
        return self.feature_names
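The correction step relies on `Series.replace` being called with two index-aligned lists. A short standalone sketch, using an excerpt of the mapping lists and a hypothetical `demo_titles` frame:

```python
import pandas as pd

# Excerpt of the two mapping lists, index-aligned.
problems = ["Mlle", "Dr", "Rev"]
corrections = ["Miss", "Mr", "Other"]

demo_titles = pd.DataFrame({"Title": ["Mr", "Mlle", "Dr", "Rev", "Mrs"]})

# Series.replace with two equal-length lists maps problems[i] to corrections[i];
# values that are not listed stay unchanged.
corrected = demo_titles.apply(lambda col: col.replace(problems, corrections))
print(corrected["Title"].tolist())  # ['Mr', 'Miss', 'Mr', 'Other', 'Mrs']
```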

In [21]:
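class Discretizer(BaseEstimator, TransformerMixin):
    """
    Transformer that discretizes numeric data. Thin interface to `pandas.cut`;
    additional keyword arguments are passed through to it.
    """

    def __init__(self, bins: Any, labels: Any = None, encoder: object = None, **kwargs):
        self.bins = bins
        self.labels = labels
        self.kwargs = kwargs
        self.encoder = encoder
        self.feature_names = None

    def _cut(self, X):
        """Bin every column of X with pandas.cut."""
        return X.apply(
            lambda x: pd.cut(x, bins=self.bins, labels=self.labels, **self.kwargs)
        )

    def fit(self, X, y=None):
        """Fit the optional encoder on the discretized training data only."""
        X_binned = self._cut(X)
        self.feature_names = X_binned.columns.tolist()

        if self.encoder is not None:
            self.encoder.fit(X_binned)
            self.feature_names = self.encoder.get_feature_names_out()

        return self

    def transform(self, X):
        """Discretize X. If an encoder is provided, also encode with the fitted encoder."""
        X_binned = self._cut(X)

        if self.encoder is not None:
            # Only transform here so that every split gets the same columns.
            X_binned = self.encoder.transform(X_binned)

        return X_binned

    def get_feature_names_out(self, input_features=None) -> list:
        return self.feature_names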

In [23]:
# fmt: off
one_hot_enc = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    transformers=[
        ("discretize", Discretizer(BINS, LABELS, one_hot_enc), discretized_features),
        ("correct",    TitleCorrector(KNOWN_PROBLEMS, KNOWN_CORRECTIONS, one_hot_enc), corrector_features),
        ("num", numeric_transformer, numeric_features),
        ("onehot", one_hot_enc, categorical_features),
    ]
)
preprocessor
# fmt: on
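The encoder above is created with `handle_unknown="ignore"`. The following sketch, with hypothetical category values, shows the effect: a category that was not seen during fitting is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category values.
seen = np.array([["S"], ["C"], ["Q"]])
unseen = np.array([["X"]])

demo_encoder = OneHotEncoder(handle_unknown="ignore")
demo_encoder.fit(seen)

# The unknown category "X" becomes a row of zeros.
encoded_unseen = demo_encoder.transform(unseen).toarray()
print(encoded_unseen)  # [[0. 0. 0.]]
```

This matters for rare titles or family sizes that may appear in the validation or test split but not in the training split.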

In [None]:
preprocess = Pipeline(steps=[("preprocessor", preprocessor)])

df_to_inspect = pd.DataFrame.sparse.from_spmatrix(
    preprocess.named_steps["preprocessor"].fit_transform(X_train)
)

df_to_inspect.columns = preprocess["preprocessor"].get_feature_names_out()
# df_to_inspect.head()

In [None]:
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        # A fixed seed makes the reported scores reproducible between runs.
        ("classifier", RandomForestClassifier(random_state=RANDOM_STATE)),
    ]
)

In [None]:
clf.fit(X_train, y_train)
print("test score: %.3f" % clf.score(X_test, y_test))
print("validation score: %.3f" % clf.score(X_val, y_val))

Acknowledgement:
- Gunes Evitan's Kaggle Notebook on [Titanic - Advanced Feature Engineering Tutorial](https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial/notebook)
- Ashwini Swain's Kaggle Notebook [EDA To Prediction(DieTanic)](https://www.kaggle.com/ash316/eda-to-prediction-dietanic)
- Pedro Morales's scikit-learn tutorial on [Column Transformer with Mixed Types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html?highlight=standardscaler)

In [None]:
(preprocess["preprocessor"].get_feature_names_out())
