# Making Your Data Flow With Sklearn Pipelines

## Introduction

Sklearn's [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) are an elegant way to organize your modeling workflow.  They also provide an "at-a-glance" picture of what is going into the current model, something your future self will thank you for when you read that notebook back in six months.

## Getting Toy Data To Play With

Let's import all the libraries we're working with (don't worry if you don't know what some of these do; we'll get to it!) and get some toy data to work with.  We'll be working with the cute [Penguins](https://github.com/allisonhorst/palmerpenguins) dataset, which ``seaborn`` can load.

**Note that I will be emphasizing type hints and style quite a bit!**

**Goal**: We'll try to predict the sex, given the rest of the features.

In [101]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [5]:
# For mypy users, as of 2022-01-08, seaborn does not use typing.
# We have to wrap the load in `pd.DataFrame` to make mypy
# understand that it is a dataframe.
# See: https://github.com/mwaskom/seaborn/issues/2212

df = pd.DataFrame(sns.load_dataset("penguins"))  # type: ignore

Great, let's do some quick EDA to see what we're working with.

In [20]:
df.head(5)

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


In [21]:
df.describe()

|   | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


In [22]:
df.isna().sum(axis=0)

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

For the missing ``sex`` values, let's drop those rows for now, since we're trying to predict on ``sex``.  

If we were to simply do 
```python
df.dropna(subset=["sex"], inplace=True)
``` 
in a cell, we might forget that we applied this and get messed up down the line!  Because Jupyter Notebooks are pretty easy to mess up when you've got code in a bunch of different cells, we're going to make a data import function that does basic importing and cleaning.

In [87]:
def get_and_clean_penguin_data() -> tuple[pd.DataFrame, pd.Series]:
    """
    Get and clean ``Penguins`` data.

    Loads ``Penguins`` data, removes rows with null values for ``sex``.
    Returns (df_features, targets): a dataframe of features and a series of 0/1 labels.

    Returns
    -------
    tuple[pd.DataFrame, pd.Series]
    """

    df = pd.DataFrame(sns.load_dataset("penguins"))  # type: ignore
    df.dropna(subset=["sex"], inplace=True)

    # Transform Male/Female into 0/1.
    targets: pd.Series = df["sex"].apply(
        lambda x: 0 if x == "Male" else 1
    )  # type: ignore

    return (df.drop("sex", axis=1), targets)
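
To see the two cleaning steps in isolation without downloading anything, here is a minimal sketch on a made-up frame.  The ``toy`` data and the variable names are assumptions for illustration only, and ``.map`` with a dictionary is just one alternative way to write the same 0/1 encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins frame: one row lacks a label.
toy = pd.DataFrame(
    {
        "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
        "sex": ["Male", "Female", np.nan, "Female"],
    }
)

# Non-mutating drop: `toy` itself is left untouched.
labelled = toy.dropna(subset=["sex"])

# Same Male -> 0 / Female -> 1 encoding, written as a dictionary lookup.
targets = labelled["sex"].map({"Male": 0, "Female": 1})
features = labelled.drop(columns="sex")

print(len(toy), len(labelled))  # 4 3
print(targets.tolist())  # [0, 1, 1]
```

Because nothing here is done in place, re-running the cell always starts from the same ``toy`` frame.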

Great, now let's look at our numeric data.  There are a few things to do:

- We'd like to impute the missing values,
- We'd like to standardize the columns so every feature ends up with mean 0 and unit variance.

Let's use a ``Pipeline`` to do this.

A ``Pipeline`` takes a list of 2-tuples ``(name, transform)``, where a ``transform`` in Sklearn is anything that implements the ``fit`` and ``transform`` methods.
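
To make "anything with ``fit``/``transform``" concrete, here is a minimal sketch of a hand-written transform dropped into a ``Pipeline``.  The ``Log1pTransformer`` class and the ``toy_*`` names are hypothetical and only there for illustration; the base classes come from ``sklearn.base``.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Log1pTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transform: applies log(1 + x) element-wise."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self lets calls be chained.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


toy_pipeline = Pipeline(
    [
        ("log1p", Log1pTransformer()),
        ("scale", StandardScaler()),
    ]
)

toy_values = np.array([[0.0], [9.0], [99.0]])
toy_result = toy_pipeline.fit_transform(toy_values)
print(toy_result.shape)  # (3, 1)
```

``TransformerMixin`` supplies ``fit_transform`` for free, so the class only has to define the two methods itself.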

In [31]:
pipeline_numeric = Pipeline(
    [
        ("impute_w_mean", SimpleImputer(strategy="mean")),
        ("scale_normal", StandardScaler()),
    ]
)

We note here that the name of the first step of the pipeline is ``impute_w_mean`` and the associated transform is ``SimpleImputer``.  Similarly, ``scale_normal`` is associated to ``StandardScaler``.

For kicks, let's run this through some data and see what happens.

In [38]:
# Running some fake data through ``pipeline_numeric`` for fun.
fake_data = np.array([1, 2, 2, np.nan, 4, 3, 1, 2, np.nan]).reshape(-1, 1)
pipeline_numeric.fit_transform(fake_data)

array([[-1.30930734],
       [-0.16366342],
       [-0.16366342],
       [ 0.        ],
       [ 2.12762443],
       [ 0.98198051],
       [-1.30930734],
       [-0.16366342],
       [ 0.        ]])

Interesting!  We see here that this replaced our ``NaN`` values with the mean of the observed values, then standardized the data, which sent the mean to 0.  That's why the two imputed entries come out as exactly 0.  Cool.
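
If you'd like to convince yourself of that, here is a small sketch that redoes both steps by hand with NumPy and compares against the pipeline.  The names ``check_pipeline``, ``raw``, ``filled`` and ``by_hand`` are made up for this check; ``named_steps`` and ``statistics_`` are the Sklearn attributes for looking up a fitted step and the imputer's learned fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

check_pipeline = Pipeline(
    [
        ("impute_w_mean", SimpleImputer(strategy="mean")),
        ("scale_normal", StandardScaler()),
    ]
)
raw = np.array([1, 2, 2, np.nan, 4, 3, 1, 2, np.nan]).reshape(-1, 1)
from_pipeline = check_pipeline.fit_transform(raw)

# Fitted steps can be looked up by the names given in the 2-tuples.
learned_mean = check_pipeline.named_steps["impute_w_mean"].statistics_[0]

# The same two steps by hand: fill NaNs with the observed mean, then
# subtract the mean and divide by the population standard deviation (ddof=0).
filled = np.where(np.isnan(raw), np.nanmean(raw), raw)
by_hand = (filled - filled.mean()) / filled.std()

print(learned_mean)  # 15 / 7, roughly 2.14
print(np.allclose(from_pipeline, by_hand))  # True
```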


What about the categorical data?  Can we do anything with that?  Since there are only a few islands (3) and a few species (3), we might try ``OneHotEncoder`` and see what we get from that.  Let's make a similar pipeline, imputing with the most frequent value if necessary.

In [51]:
pipeline_categorical = Pipeline(
    [
        ("impute_w_most_frequent", SimpleImputer(strategy="most_frequent")),
        # `sparse_output` is the scikit-learn >= 1.2 name; older releases call it `sparse`.
        ("one_hot_encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

Let's try this one on some fake data as well.

In [53]:
# Running some fake data through ``pipeline_categorical`` for fun.
fake_data = np.array(["a", "a", "b", np.nan, np.nan], dtype=object).reshape(-1, 1)
pipeline_categorical.fit_transform(fake_data)

array([[1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])

Might be a bit harder to tell, but this imputed the missing values as "a", and then converted "a" and "b" to ``[1, 0]`` and ``[0, 1]`` respectively.
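
What about ``handle_unknown="ignore"``?  Here is a quick sketch of what it does with a category that never showed up during fitting.  The ``cat_pipeline``, ``seen`` and ``unseen`` names are made up for this example, and the encoder is left at its default sparse output (converted with ``.toarray()``) so the sketch doesn't depend on how your Sklearn version spells the dense-output argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline(
    [
        ("impute_w_most_frequent", SimpleImputer(strategy="most_frequent")),
        ("one_hot_encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Fit on data that only ever contains "a" and "b".
seen = np.array(["a", "a", "b", np.nan], dtype=object).reshape(-1, 1)
cat_pipeline.fit(seen)

# "c" never appeared during fitting.
unseen = np.array(["b", "c"], dtype=object).reshape(-1, 1)
encoded = cat_pipeline.transform(unseen).toarray()
print(encoded)
# [[0. 1.]
#  [0. 0.]]
```

The unknown "c" becomes a row of zeros instead of raising an error, which is handy when the test set contains a category the training set didn't.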

## Preprocessing: Putting It All Together

So far, we've made our numeric and categorical pipelines for the loaded data.  We need to tell Sklearn which pipeline each column should go into.  This is where ``ColumnTransformer`` comes in.  This time, we pass a list of 3-tuples of the form ``(name, pipeline, column names to use)``.

Our preprocessing code, excluding the helper function we made, should look something like this:

In [60]:
# All preprocessing code, excluding helper functions.

pipeline_numeric = Pipeline(
    [
        ("impute_w_mean", SimpleImputer(strategy="mean")),
        ("scale_normal", StandardScaler()),
    ]
)

pipeline_categorical = Pipeline(
    [
        ("impute_w_most_frequent", SimpleImputer(strategy="most_frequent")),
        # `sparse_output` is the scikit-learn >= 1.2 name; older releases call it `sparse`.
        ("one_hot_encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

numeric_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
categorical_cols = ["species", "island"]

preprocessing_transformer = ColumnTransformer(
    [
        ("numeric", pipeline_numeric, numeric_cols),
        ("categorical", pipeline_categorical, categorical_cols),
    ]
)
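
As a sanity check on how the columns get routed, here is a minimal sketch that runs a smaller ``ColumnTransformer`` on a made-up four-row frame.  The ``toy_frame`` values and the ``toy_*`` names are assumptions for illustration; ``sparse_threshold=0`` is used only so the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the penguins frame.
toy_frame = pd.DataFrame(
    {
        "bill_length_mm": [39.1, np.nan, 46.5, 50.0],
        "body_mass_g": [3750.0, 3800.0, np.nan, 5700.0],
        "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    }
)

toy_numeric = Pipeline(
    [
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]
)

toy_transformer = ColumnTransformer(
    [
        ("numeric", toy_numeric, ["bill_length_mm", "body_mass_g"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["island"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_out = toy_transformer.fit_transform(toy_frame)
print(toy_out.shape)  # (4, 5): 2 scaled numeric columns + 3 one-hot island columns
```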

## A Simple Model

Let's make a simple model for this data.  A Random Forest might be a nice one, so let's try that.

In [65]:
rf_clf = RandomForestClassifier()

Our classifier has ``fit``/``predict`` rather than ``fit``/``transform``, but a ``Pipeline`` only needs its final step to implement ``fit``, so the classifier can sit at the end of one.  Let's take our _entire preprocessing transformer_ and make that the first step, then push that into the random forest classifier.

In [66]:
preprocess_model_pipeline = Pipeline(
    [("preprocessing", preprocessing_transformer), ("random_forest_classifier", rf_clf)]
)

## Time to Train

At this point, we'll break our original data into a training and test set and pass the training set through our pipeline.  Then we'll evaluate how we did!

In [114]:
# Set up the Data.

df_features, df_target = get_and_clean_penguin_data()

x_train, x_test, y_train, y_test = train_test_split(
    df_features, df_target, test_size=0.33, random_state=1234
)

pmp = preprocess_model_pipeline.fit(x_train, y_train)

# Predict!
y_predicted = pmp.predict(x_test)

# Score!
scores = np.array(
    [
        ("accuracy", accuracy_score(y_test, y_predicted)),
        ("precision", precision_score(y_test, y_predicted)),
        ("recall", recall_score(y_test, y_predicted)),
        ("f1", f1_score(y_test, y_predicted)),
    ]
)
df_scores = pd.DataFrame(scores[:, 1], index=scores[:, 0], columns=["value"])
df_scores

|   | value |
|---|---|
| accuracy | 0.9454545454545454 |
| precision | 0.981132075471698 |
| recall | 0.912280701754386 |
| f1 | 0.9454545454545454 |


Not too bad!
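
One thing to watch if you reuse a scores table like this: putting names and numbers into a single NumPy array turns every entry into a string.  Here is a small sketch of an alternative that keeps the values as floats.  The ``y_true``/``y_pred`` labels are made up just to have something to score, and ``float_scores`` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels and predictions, just to have something to score.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Mixing names and numbers in one NumPy array coerces everything to strings.
as_strings = np.array([("accuracy", accuracy_score(y_true, y_pred))])
print(as_strings.dtype.kind)  # U (unicode string)

# A dict of floats keeps the values numeric, so they can be rounded or compared.
float_scores = pd.Series(
    {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    },
    name="value",
).to_frame()
print(float_scores.round(3))
```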


## The Complete Code

Here's the code in one big chunk:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def get_and_clean_penguin_data() -> tuple[pd.DataFrame, pd.Series]:
    """
    Get and clean ``Penguins`` data.

    Loads ``Penguins`` data, removes rows with null values for ``sex``.
    Returns (df_features, targets): a dataframe of features and a series of 0/1 labels.

    Returns
    -------
    tuple[pd.DataFrame, pd.Series]
    """

    df = pd.DataFrame(sns.load_dataset("penguins"))  # type: ignore
    df.dropna(subset=["sex"], inplace=True)

    # Transform Male/Female into 0/1.
    targets: pd.Series = df["sex"].apply(
        lambda x: 0 if x == "Male" else 1
    )  # type: ignore

    return (df.drop("sex", axis=1), targets)


# PREPROCESSING PIPELINES
pipeline_numeric = Pipeline(
    [
        ("impute_w_mean", SimpleImputer(strategy="mean")),
        ("scale_normal", StandardScaler()),
    ]
)

pipeline_categorical = Pipeline(
    [
        ("impute_w_most_frequent", SimpleImputer(strategy="most_frequent")),
        # `sparse_output` is the scikit-learn >= 1.2 name; older releases call it `sparse`.
        ("one_hot_encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]
)

numeric_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
categorical_cols = ["species", "island"]

preprocessing_transformer = ColumnTransformer(
    [
        ("numeric", pipeline_numeric, numeric_cols),
        ("categorical", pipeline_categorical, categorical_cols),
    ]
)

# MODEL PIPELINES
rf_clf = RandomForestClassifier()

preprocess_model_pipeline = Pipeline(
    [("preprocessing", preprocessing_transformer), ("random_forest_classifier", rf_clf)]
)

# TRAINING AND SCORING
df_features, df_target = get_and_clean_penguin_data()

x_train, x_test, y_train, y_test = train_test_split(
    df_features, df_target, test_size=0.33, random_state=1234
)

pmp = preprocess_model_pipeline.fit(x_train, y_train)
y_predicted = pmp.predict(x_test)

scores = np.array(
    [
        ("accuracy", accuracy_score(y_test, y_predicted)),
        ("precision", precision_score(y_test, y_predicted)),
        ("recall", recall_score(y_test, y_predicted)),
        ("f1", f1_score(y_test, y_predicted)),
    ]
)
df_scores = pd.DataFrame(scores[:, 1], index=scores[:, 0], columns=["value"])
df_scores

Note that, with this structure, we could make different pipeline "pieces" to try out different classifiers, different params, etc.  The code is still a bit messy, but for EDA it's easy to read through and easy to modify as needed.
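
As one possible way to do that, here is a sketch that uses ``GridSearchCV`` to try two classifiers and a couple of params each.  It runs on synthetic data from ``make_classification`` rather than the penguins, and the step names, candidate values and ``search_*`` names are assumptions picked for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

search_pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("classifier", RandomForestClassifier(random_state=0)),
    ]
)

# Each dict is one family of candidates.  Step parameters use the
# "<step name>__<parameter>" form, and a whole step can be swapped by name.
param_grid = [
    {
        "classifier": [RandomForestClassifier(random_state=0)],
        "classifier__n_estimators": [25, 50],
    },
    {
        "classifier": [LogisticRegression(max_iter=1000)],
        "classifier__C": [0.1, 1.0],
    },
]

search = GridSearchCV(search_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler lives inside the pipeline, it is refit on each training fold, so the cross-validation scores aren't contaminated by the held-out fold.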