# Getting started with Pipelines

Here is a short illustrative example of how [scikit-learn's](https://scikit-learn.org/stable/) pipelines can be used for feature engineering and data transformation in one go. The aim of this notebook is to show some examples of how pipelines can be used; it should *not* be considered a tutorial on machine learning.

The main advantage of pipelines is that the exact same transformations can be applied to all datasets (train, validation and test) easily, in a compact manner and without a large amount of overhead code. In addition, error handling can be done in a single place rather than being distributed over several classes, functions and interfaces.

In this example we are going to apply the following common steps via pipelines:
- Filling in missing data via `SimpleImputer`
- Scaling the data using `StandardScaler`
- Feature engineering by implementing two custom transformers


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from typing import List, Any

In [None]:
URL_TO_DATA = (
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
TEST_SIZE = 0.2
VALID_SIZE = 0.25
RANDOM_STATE = 42
NUMERIC_TRANSFORMER_REPLACEMENT = "median"

As an example we use the classic Titanic survival dataset. For a detailed description of the data and what each column represents, please have a look at [Kaggle](https://www.kaggle.com/competitions/titanic/data). As commonly used features we add the family size and the title of each passenger; the titles are corrected later on.

In [None]:
# in case of CERTIFICATE_VERIFY_FAILED run Install Certificates.command
# see also https://stackoverflow.com/questions/50236117/scraping-ssl-certificate-verify-failed-error-for-http-en-wikipedia-org
df = pd.read_csv(filepath_or_buffer=URL_TO_DATA, index_col=0)


df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
# The title is the word directly in front of the first period in the name, e.g. "Mr" or "Miss"
df["Title"] = df["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
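
To make the title extraction more tangible, here is a minimal sketch on a few sample names written in the "Surname, Title. Given names" format of the `Name` column. The variable names `names` and `titles` are only used for this illustration.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format (for illustration only).
names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    ]
)

# The capture group grabs the run of letters directly in front of the first period.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame, which is convenient when assigning to a single column.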

Splitting the data into test, train and validation datasets:

In [None]:
y = df["Survived"]
X = df.drop(columns=["Survived"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=VALID_SIZE, random_state=RANDOM_STATE
)  # 0.25 x 0.8 = 0.2
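
The two-step split can be checked on stand-in data. The sketch below assumes 1000 made-up rows (not the Titanic data) so that the resulting 60/20/20 proportions are easy to read; all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 1000 rows, so the resulting split sizes are easy to read.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# First hold out 20 % for testing, then 25 % of the remaining 80 % for validation.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))
```

Because the second split is taken from the remaining 80 %, a fraction of 0.25 there corresponds to 20 % of the full dataset.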

In [None]:
print(X_train.dtypes)
print(X_train.head())

As input for our pipeline we need to group our columns by type and by the transformations we want to apply to them.

In [None]:
numeric_features = ["Age", "Fare"]
categorical_features = [
    "Pclass",
    "Sex",
    "SibSp",
    "Parch",
    "Embarked",
    "Title",
    "FamilySize",
]
cutting_features = ["FamilySize"]
corrector_features = ["Title"]

BINS = [0, 1, 2, 4, np.inf]  # np.Inf was removed in NumPy 2.0, np.inf works in all versions
LABELS = ["ALONE", "SMALL", "MED", "LARGE"]


CORRECTIONS = {
    "Mlle": "Miss",
    "Mme": "Miss",
    "Ms": "Miss",
    "Dr": "Mr",
    "Major": "Mr",
    "Lady": "Mrs",
    "Countess": "Mrs",
    "Jonkheer": "Other",
    "Col": "Other",
    "Rev": "Other",
    "Capt": "Mr",
    "Sir": "Mr",
    "Don": "Mr",
}

First we take care of the numeric inputs. To fill missing values we can use [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) and provide a strategy for how they should be replaced. In this example we use the median; the other available strategies are mean, most frequent and constant.

Scaling is another common practice for training machine learning models since the learning tends to be much harder or might be impossible if features are on completely different scales. 

In [None]:
# fmt: off
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy=NUMERIC_TRANSFORMER_REPLACEMENT)),
        ("scaler", StandardScaler()),
    ]
)
numeric_transformer
# fmt: on
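
As a rough sketch of what such a numeric pipeline does, the example below fits the same two steps on a made-up training column and then transforms a made-up validation column. The data and the names `train_demo`, `valid_demo` and `demo_pipe` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for a training and a validation split (values are made up).
train_demo = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
valid_demo = pd.DataFrame({"Age": [np.nan, 50.0]})

demo_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# fit() learns the median and the scaling statistics from the training split only ...
demo_pipe.fit(train_demo)
learned_median = demo_pipe.named_steps["imputer"].statistics_[0]
learned_mean = demo_pipe.named_steps["scaler"].mean_[0]

# ... and transform() reuses those statistics on the validation split.
valid_scaled = demo_pipe.transform(valid_demo)
print(learned_median, learned_mean)
print(valid_scaled.ravel())
```

The missing validation value is filled with the training median and is then scaled with the training mean and standard deviation, so no statistic is ever computed on the validation data.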

For the categorical variables it is even simpler: we just need to pick one of the encoders provided in the [sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) package and can use it directly as a pipeline step. In our case we use a one-hot encoder:

In [None]:
  
ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_features)]
)


In his advanced feature engineering tutorial on Kaggle, Gunes Evitan suggests correcting the titles to a smaller set of values. The easiest way would be to use the `replace` function, but we can also do it via a transformer.

Scikit-learn makes it possible to write customized transformers by creating a class that inherits from `BaseEstimator` and `TransformerMixin`. The class must implement a `fit` and a `transform` method. The title corrections are a fixed mapping, so nothing has to be learned from the data for the correction itself.

The `transform` method takes the known corrections passed when initializing the class and applies them to all columns handed to the transformer.

Additionally we include two features that should increase usability: first, the option to encode the result directly by passing one of the encoders from scikit-learn's preprocessing package, and second, the `get_feature_names_out` method, which needs to be implemented to retrieve the column names after using the custom transformer.

In [None]:
class TitleCorrector(BaseEstimator, TransformerMixin):
    """
    Use transformer to correct the Title column and, optionally, do One Hot Encoding.
    """

    def __init__(
        self,
        corrections: dict,
        encoder: object = None,
    ):
        self.corrections = corrections
        self.encoder = encoder

    def _correct(self, X) -> pd.DataFrame:
        """Replaces every known title variant by its corrected value."""
        return X.apply(lambda x: x.replace(self.corrections))

    def fit(self, X, y=None) -> object:
        """Stores the output column names and fits the encoder, if one is provided."""
        X_corrected = self._correct(X)
        self.feature_names_ = X_corrected.columns.tolist()

        # Fit the encoder here and not in transform, so that validation and test data
        # are encoded with the categories learned from the training data.
        if self.encoder is not None:
            self.encoder.fit(X_corrected)
            self.feature_names_ = self.encoder.get_feature_names_out()

        return self

    def transform(self, X) -> object:
        """Does the transform step. In case an encoder is provided, encoding will be done as well."""
        X_corrected = self._correct(X)

        if self.encoder is not None:
            X_corrected = self.encoder.transform(X_corrected)

        return X_corrected

    def get_feature_names_out(self, input_features=None) -> list:
        return self.feature_names_
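
For comparison, the plain `replace` route mentioned above can also be wrapped in scikit-learn's `FunctionTransformer`, which avoids writing a class when no encoding or feature names are needed. This is only a sketch: the helper `replace_titles` and the shortened mapping `corrections_demo` are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# A small subset of the corrections, repeated here so the sketch is self-contained.
corrections_demo = {"Mlle": "Miss", "Mme": "Miss", "Capt": "Mr"}


def replace_titles(frame: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Hypothetical helper: swap every key of `mapping` for its value in all columns."""
    return frame.replace(mapping)


title_fixer = FunctionTransformer(replace_titles, kw_args={"mapping": corrections_demo})

titles_demo = pd.DataFrame({"Title": ["Mr", "Mlle", "Capt", "Mrs"]})
fixed = title_fixer.fit_transform(titles_demo)
print(fixed["Title"].tolist())
```

The class-based transformer remains the more flexible choice here, since it can also carry an encoder and report its output column names.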

Another customized transformer can categorize numerical variables, for example sorting the family size into alone, small, medium and large. Again we can write a customized transformer to do this.

In [None]:
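class Cutter(BaseEstimator, TransformerMixin):
    """
    Use transformer to cut numeric data. Interface to :func:`pandas.cut`.
    """

    def __init__(self, bins: Any, labels: Any = None, encoder: object = None, **kwargs):
        self.bins = bins
        self.labels = labels
        self.kwargs = kwargs
        self.encoder = encoder

    def _cut(self, X) -> pd.DataFrame:
        """Bins every column of X with pandas.cut."""
        return X.apply(
            lambda x: pd.cut(x, bins=self.bins, labels=self.labels, **self.kwargs)
        )

    def fit(self, X, y=None):
        """Stores the output column names and fits the encoder, if one is provided."""
        X_corrected = self._cut(X)
        self.feature_names_ = X_corrected.columns.tolist()

        # Fit the encoder here and not in transform, so that validation and test data
        # are encoded with the categories learned from the training data.
        if self.encoder is not None:
            self.encoder.fit(X_corrected)
            self.feature_names_ = self.encoder.get_feature_names_out()

        return self

    def transform(self, X):
        """Cuts the data into bins. In case an encoder is provided, encoding will be done as well."""
        X_corrected = self._cut(X)

        if self.encoder is not None:
            X_corrected = self.encoder.transform(X_corrected)

        return X_corrected

    def get_feature_names_out(self, input_features=None) -> list:
        return self.feature_names_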

As an alternative, a user could simply call
`X_test['X_corrected'] = pd.cut(X_test["FamilySize"], bins=BINS, labels=LABELS)` or, in our case, since we are using a closed dataset anyhow, apply the change before splitting. Here this approach would be valid since the bins are fixed. Scaling, by contrast, must be fitted on the training data only and then applied unchanged to the validation and test data; otherwise information from the validation and test data would leak into the training.
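
To illustrate what the direct `pd.cut` call produces, here is a small sketch on made-up family sizes. The bin edges and labels repeat the values defined earlier in the notebook so that the block runs on its own; `bins_demo`, `labels_demo` and `family_size` are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as defined earlier, repeated so the sketch runs on its own.
bins_demo = [0, 1, 2, 4, np.inf]
labels_demo = ["ALONE", "SMALL", "MED", "LARGE"]

# Made-up family sizes.
family_size = pd.Series([1, 2, 3, 4, 5, 11])

# pd.cut uses right-closed intervals by default: (0, 1], (1, 2], (2, 4], (4, inf].
size_class = pd.cut(family_size, bins=bins_demo, labels=labels_demo)
print(size_class.tolist())
```

Since the intervals are closed on the right, a family size of 4 still falls into the "MED" class and only sizes of 5 and above count as "LARGE".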

To see the beauty of this customized-transformer approach, imagine a production environment where new data arrives every day and retraining takes place several times. For both use cases we can apply the same code in a compact and easy way.

As a next step we can put all our previous steps together via [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) as one pipeline: 

In [None]:
# fmt: off
one_hot_enc = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    transformers=[
        ("discretize", Cutter(BINS, LABELS, one_hot_enc), cutting_features),
        ("correct",    TitleCorrector(CORRECTIONS, one_hot_enc), corrector_features),
        ("num", numeric_transformer, numeric_features),
        ("onehot", one_hot_enc, categorical_features),
    ]
)
preprocessor 
# fmt: on

We can also take a look at what our data looks like after applying all preprocessing steps:

In [None]:
inspect_data = Pipeline(steps=[("preprocessor", preprocessor)])

df_to_inspect = pd.DataFrame.sparse.from_spmatrix(
    inspect_data.named_steps["preprocessor"].fit_transform(X_train)
)

df_to_inspect.columns = inspect_data["preprocessor"].get_feature_names_out()
df_to_inspect.info()

In [None]:
df_to_inspect.head()

Fitting the model is now rather easy: we just add a step to our pipeline, and all transformers will be executed before the model gets fitted. In our case it's a simple random forest model:

In [None]:
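clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        # A fixed random_state makes the scores below reproducible between runs
        ("classifier", RandomForestClassifier(random_state=RANDOM_STATE)),
    ]
)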

In [None]:
clf.fit(X_train, y_train)
print("validation score: %.3f" % clf.score(X_val, y_val))
print("test score: %.3f" % clf.score(X_test, y_test))
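
Once fitted, such a pipeline accepts raw rows directly, because the preprocessing is part of the model object. The following sketch uses made-up toy data in which the label is fully determined by the `Sex` column; the names `toy_X`, `toy_y` and `toy_clf` and the forest settings are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data in which the label is fully determined by the "Sex" column.
toy_X = pd.DataFrame({"Sex": ["male", "female"] * 10, "Fare": [7.0, 70.0] * 10})
toy_y = pd.Series([0, 1] * 10)

toy_preprocessing = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["Sex"])],
    remainder="passthrough",  # keep the numeric "Fare" column unchanged
)
toy_clf = Pipeline(
    steps=[
        ("preprocessor", toy_preprocessing),
        ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
)
toy_clf.fit(toy_X, toy_y)

# New raw rows go through the same preprocessing before the forest predicts.
new_rows = pd.DataFrame({"Sex": ["female", "male"], "Fare": [70.0, 7.0]})
predictions = toy_clf.predict(new_rows)
print(predictions)
```

The same holds for `score`: the call transforms the given features with the fitted preprocessing steps and only then evaluates the classifier.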

#### Conclusion:

Pipelines offer a nice and compact way to execute transformations, standard scaling and feature engineering for all datasets at once. This approach makes particular sense in a production environment with a complex code base. When many steps are necessary, it provides a clear structure and, via the HTML snippets, ready-made output for documentation.

Nevertheless, it might be overkill for some small use cases with closed datasets or only a few transformation steps.

#### Acknowledgement: 
- Gunes Evitan's Kaggle Notebook on [Titanic - Advanced Feature Engineering Tutorial](https://www.kaggle.com/code/gunesevitan/titanic-advanced-feature-engineering-tutorial/notebook)
- Ashwini Swain's Kaggle Notebook [EDA To Prediction(DieTanic)](https://www.kaggle.com/ash316/eda-to-prediction-dietanic)
- Petro Morales's sklearn Tutorial on [Column Transformer with Mixed Types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html?highlight=standardscaler)

___
