# Pipelines: Or How I Learned to Stop Duplicating Work and Love the Convenience of Storing Multiple Steps in a Single Object

Date: October 6, 2019

Written by: Cristian E. Nuno

![kid transformer](visuals/transformer.gif)

### Definition of a pipeline

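A pipeline bundles a sequence of data preparation steps, and usually a final model, so they can be run as one unit. We'll store these steps in a `Pipeline` object. From the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn-pipeline-pipeline):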

> The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. [The `Pipeline` object] sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’ (i.e. transformers), that is, they must implement fit and transform methods. 
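
To make the definition concrete, here is a minimal sketch of a two-step pipeline: one intermediate transformer followed by a final estimator. The toy arrays `X_toy` and `y_toy` and the step names are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# toy data (made-up values): one missing entry in the first feature
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [np.nan, 30.0],
                  [4.0, 40.0]])
y_toy = np.array([0, 0, 1, 1])

# the intermediate step is a transformer; the last step is an estimator
toy_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# one call fits the imputer, transforms X_toy, then fits the tree
toy_pipe.fit(X_toy, y_toy)

# one call imputes the missing value and predicts
toy_preds = toy_pipe.predict(X_toy)
```

The imputer never has to be called by hand: `fit` and `predict` on the pipeline run every step in order.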

### Broad generalization of a pipeline

In general, each preprocessing step needs three things: the column to work on, its "transformer" (e.g. `SimpleImputer`, `OneHotEncoder`, `StandardScaler`), and a decision on whether the transformed values belong in their own new column or should overwrite the input column.

### Benefits of the `Pipeline`

From the [User Guide](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators):
> * **Convenience and encapsulation**
>     + You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> * **Joint parameter selection**
>     + You can grid search over parameters of all estimators in the pipeline at once.
> * **Safety**
>     + Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors. 
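
As a rough illustration of joint parameter selection, the sketch below grid searches over one parameter from each step at the same time. It assumes the unmodified iris data; the names `demo_pipe` and `param_grid` and the candidate values are hypothetical choices, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

demo_pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# parameters are addressed as <step name>__<parameter name>
param_grid = {
    "impute__strategy": ["mean", "median"],
    "model__max_depth": [2, 3],
}

# each candidate refits the whole pipeline inside every cross-validation fold
search = GridSearchCV(demo_pipe, param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

best_params = search.best_params_
```

Because the imputer is refit inside each fold, the search also benefits from the safety point above: no fold's held-out rows contribute to the imputation statistics.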

Here, we'll create a few `Pipeline` objects to do the following:

* flag which records contain `NaN` values in the `sepal_length` and `sepal_width` features;
* impute the median value for `NaN` values in the `sepal_length` and `sepal_width` features; and
* return all other columns - `petal_length` & `petal_width` - as is

## Load necessary modules

In [1]:
import numpy as np
import pandas as pd
import random
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

## Load necessary data

In [2]:
# load iris data set
iris = load_iris()

# load feature matrix
X = iris["data"]
# load target vector
y = iris["target"]

# standardize feature names spelling and casing
feature_names = [col.replace(" ", "_").replace("_(cm)", "") 
                 for col in (iris["feature_names"] + ["species"])]

# transform X and y into data frame
iris_df = pd.DataFrame(np.column_stack((X, y.reshape(-1, 1)))
                       , columns=feature_names)

# convert species from numeric to categorical
class_names = {0.0: "setosa", 1.0: "versicolor", 2.0: "virginica"}
iris_df["species"] = iris_df["species"].map(class_names)

# show first few records
iris_df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


### For educational purposes only, let's replace some values with `NaN`

In [3]:
# initialize random number generator
random.seed(2019)

# generate list of random integers 
random_ints = [random.randrange(0, len(iris_df)) for _ in range(20)]

# for these random index values, replace their real values with NaN
iris_df.loc[random_ints, "sepal_length"] = np.nan
iris_df.loc[random_ints, "sepal_width"] = np.nan

# manually force the first "sepal_width" value to also be NaN
iris_df.loc[0, "sepal_width"] = np.nan

iris_df.head(10)

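| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | NaN | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 8 | NaN | NaN | 1.4 | 0.2 | setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |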


## Split `iris_df` into training and testing sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(iris_df.drop("species", axis=1),
                                                    iris_df["species"],
                                                    test_size=0.3,
                                                    random_state=2019)

## Create custom transformer that identifies records that have `NaN` values

Sometimes we need to add new features to our existing feature space. In this case, we can't rely on importing a built-in transformer (e.g. `StandardScaler`, `OneHotEncoder`, etc.).

Instead, we'll need to create our own custom transformer. We will do this by creating a new class that implements both `.fit()` and `.transform()` methods. 

_Shoutout to [Sebastian Raschka](https://sebastianraschka.com/) for helping me out on [Twitter](https://twitter.com/cenuno_/status/1179855832374099968) to figure this out!_

In [5]:
class IsMissing(BaseEstimator):
    """Creates a new column flagging if any values from one column are missing
    
    Note: this class will be used inside a scikit-learn Pipeline
    
    Attributes:
        col_name (str): name of a column
        
    Methods:
        _is_missing(): returns 1 if a value is NaN; 0 otherwise
        
        fit(): learns nothing and returns self, so the class can sit
               inside a Pipeline
               
        transform(): returns a copy of X with a new "<col_name>_missing" column
    """
    
    def __init__(self, col_name):
        self.col_name = col_name
    
    
    def fit(self, X, y=None):
        return self
    
    
    def _is_missing(self, X):
        """Flag if a record has a NaN value"""
        if pd.isna(X):
            return 1
        else:
            return 0
    
    
    def transform(self, X, y=None):
        """Copies X and creates a new column before returning X_new"""
        new_col = self.col_name + "_missing"
        X_new = X.copy()
        X_new[new_col] = X_new[self.col_name].apply(self._is_missing)
        return X_new
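
The `IsMissing` class above checks one value at a time through `.apply()`. As a possible variation, the sketch below does the same flagging with pandas' vectorized `isna()` and also inherits from `TransformerMixin`, which supplies `fit_transform()` for free. The class name `IsMissingVectorized` and the data frame `toy_df` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class IsMissingVectorized(BaseEstimator, TransformerMixin):
    """Hypothetical variant of IsMissing: flags NaN values in one column."""

    def __init__(self, col_name):
        self.col_name = col_name

    def fit(self, X, y=None):
        # nothing to learn: the flag depends only on each row's own value
        return self

    def transform(self, X):
        X_new = X.copy()
        # isna() checks the whole column at once; astype(int) maps True/False to 1/0
        X_new[self.col_name + "_missing"] = X_new[self.col_name].isna().astype(int)
        return X_new


# made-up data with one missing value
toy_df = pd.DataFrame({"sepal_length": [5.1, np.nan, 4.7]})
flagged = IsMissingVectorized(col_name="sepal_length").fit_transform(toy_df)
```

Since `transform()` works on a copy, `toy_df` itself is left untouched.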
    
    

## Create first `Pipeline`

In [6]:
missing_mapper = Pipeline(steps=[
    ("missing_sl", IsMissing(col_name="sepal_length")),
    ("missing_sw", IsMissing(col_name="sepal_width"))
])

## Inspect results

In [7]:
missing_mapper.fit(X_train).transform(X_train).head(15)

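| | sepal_length | sepal_width | petal_length | petal_width | sepal_length_missing | sepal_width_missing |
|---|---|---|---|---|---|---|
| 73 | 6.1 | 2.8 | 4.7 | 1.2 | 0 | 0 |
| 30 | 4.8 | 3.1 | 1.6 | 0.2 | 0 | 0 |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | 0 | 0 |
| 36 | 5.5 | 3.5 | 1.3 | 0.2 | 0 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | 0 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 0 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | 0 |
| 100 | 6.3 | 3.3 | 6.0 | 2.5 | 0 | 0 |
| 64 | 5.6 | 2.9 | 3.6 | 1.3 | 0 | 0 |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | 0 | 0 |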


## Create a pipeline to impute the median values for `sepal_length` and `sepal_width`

In [8]:
print(f"Median value for sepal length: {iris_df['sepal_length'].median()}")
print(f"Median value for sepal width: {iris_df['sepal_width'].median()}")

Median value for sepal length: 5.75
Median value for sepal width: 3.0


In [9]:
impute_mapper = ColumnTransformer(
    transformers=[
        ("impute",
         SimpleImputer(missing_values=np.nan, strategy="median"),
         ["sepal_length", "sepal_width"])
    ],
    remainder="passthrough")

_Note: we're setting `remainder` to "passthrough" so that every column other than `sepal_length` and `sepal_width` is returned without any transformation._
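
One thing to keep in mind is that a `ColumnTransformer` returns a NumPy array by default, with the transformed columns first and the passthrough columns after them. The sketch below, on a made-up `toy_df` whose column order is deliberately shuffled, shows one way to put the column names back. The names `toy_mapper`, `imputed_cols`, and `passthrough_cols` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# made-up data: the passthrough column comes first on purpose
toy_df = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.7],
    "sepal_length": [5.0, np.nan, 6.0],
    "sepal_width": [3.0, 3.4, np.nan],
})

imputed_cols = ["sepal_length", "sepal_width"]
toy_mapper = ColumnTransformer(
    transformers=[("impute", SimpleImputer(strategy="median"), imputed_cols)],
    remainder="passthrough",
)

# the result is a plain array with no column names
toy_array = toy_mapper.fit_transform(toy_df)

# transformed columns come first; remainder columns keep their relative order
passthrough_cols = [col for col in toy_df.columns if col not in imputed_cols]
toy_imputed = pd.DataFrame(toy_array,
                           columns=imputed_cols + passthrough_cols,
                           index=toy_df.index)
```

Here `petal_length` moves from the first position to the last, so relabeling by hand requires knowing this ordering rule.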

In [10]:
impute_mapper.fit(X_train).transform(X_train)[0:15]

array([[6.1, 2.8, 4.7, 1.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.1, 2.5, 3. , 1.1],
       [5.5, 3.5, 1.3, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [5.9, 3. , 5.1, 1.8],
       [5. , 3.6, 1.4, 0.2],
       [6.3, 3.3, 6. , 2.5],
       [5.6, 2.9, 3.6, 1.3],
       [7.7, 3. , 6.1, 2.3],
       [5.4, 3.9, 1.7, 0.4],
       [5.3, 3.7, 1.5, 0.2],
       [6.4, 2.8, 5.6, 2.2],
       [5.7, 3. , 1.5, 0.2],
       [5.7, 3. , 3.9, 1.1]])

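In the output above, look at the 14th and 15th records: their `sepal_length` and `sepal_width` values were previously `NaN`. The `SimpleImputer` transformer replaced them with the medians it learned from `X_train` (5.7 and 3.0). That is why the imputed sepal length differs slightly from the 5.75 printed earlier, which was computed on all of `iris_df`.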
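
The imputer learns its medians during `.fit()` and reuses them in `.transform()`, which is how a pipeline keeps test-set statistics out of the preprocessing. A small sketch with made-up one-column arrays (`train` and `test` are hypothetical names):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# made-up data: the test set has a very different scale on purpose
train = np.array([[1.0], [3.0], [np.nan], [5.0]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)

# the median of the observed training values (1, 3, 5) is stored on the imputer
train_median = imputer.statistics_[0]

# the missing test value is filled with the training median, not a test statistic
test_filled = imputer.transform(test)
```

The `100.0` in `test` has no influence on the fill value, because `transform()` only applies what `fit()` stored.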

## Combine the two pipelines into one

In [11]:
data_prep_mapper = Pipeline(steps=[
    ("missing", missing_mapper),
    ("impute", impute_mapper)
])

In [12]:
data_prep_mapper.fit(X_train).transform(X_train)[0:15]

array([[6.1, 2.8, 4.7, 1.2, 0. , 0. ],
       [4.8, 3.1, 1.6, 0.2, 0. , 0. ],
       [5.1, 2.5, 3. , 1.1, 0. , 0. ],
       [5.5, 3.5, 1.3, 0.2, 0. , 0. ],
       [4.9, 3. , 1.4, 0.2, 0. , 0. ],
       [5.9, 3. , 5.1, 1.8, 0. , 0. ],
       [5. , 3.6, 1.4, 0.2, 0. , 0. ],
       [6.3, 3.3, 6. , 2.5, 0. , 0. ],
       [5.6, 2.9, 3.6, 1.3, 0. , 0. ],
       [7.7, 3. , 6.1, 2.3, 0. , 0. ],
       [5.4, 3.9, 1.7, 0.4, 0. , 0. ],
       [5.3, 3.7, 1.5, 0.2, 0. , 0. ],
       [6.4, 2.8, 5.6, 2.2, 0. , 0. ],
       [5.7, 3. , 1.5, 0.2, 1. , 1. ],
       [5.7, 3. , 3.9, 1.1, 1. , 1. ]])
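
As an aside, `SimpleImputer` can append missing-value flags on its own through its `add_indicator` parameter, which may be enough when a custom transformer is not needed. This is an alternative to the two-step approach above, not the method used in this notebook, and the array `X_toy` is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# made-up data: one missing value in each of the two columns
X_toy = np.array([[5.0, 3.0],
                  [np.nan, 3.4],
                  [6.0, np.nan]])

# add_indicator=True appends one flag column per feature
# that had missing values when the imputer was fit
flagging_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = flagging_imputer.fit_transform(X_toy)

imputed_part = X_out[:, :2]   # the two original columns, with NaN filled in
flag_part = X_out[:, 2:]      # 1.0 where the original value was missing
```

The layout matches the combined pipeline's output: imputed features first, flag columns last.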

## Now let's take this a step further by performing all the preprocessing steps prior to building a `DecisionTreeClassifier` model

This model will classify the species of each flower in `X_test`.

In [13]:
# build classifier
dt_clf = DecisionTreeClassifier(random_state=2019,
                                min_samples_leaf=30,
                                criterion="gini",
                                min_samples_split=2)

In [14]:
# build pipeline
pipe = Pipeline(steps=[
    ("dataprep", data_prep_mapper),
    ("model", dt_clf)
])

Fit the `pipe` object on `X_train` and `y_train`

In [15]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('dataprep',
                 Pipeline(memory=None,
                          steps=[('missing',
                                  Pipeline(memory=None,
                                           steps=[('missing_sl',
                                                   IsMissing(col_name='sepal_length')),
                                                  ('missing_sw',
                                                   IsMissing(col_name='sepal_width'))],
                                           verbose=False)),
                                 ('impute',
                                  ColumnTransformer(n_jobs=None,
                                                    remainder='passthrough',
                                                    sparse_threshold=0.3,
                                                    transformer_weights=None,
                                                    transformers=[('i...
                                

Use the `pipe` object to make predictions on `X_test`

In [16]:
y_pred = pipe.predict(X_test)

Examine the first 10 predictions of our multi-class classifier!

In [17]:
y_pred[0:10]

array(['setosa', 'setosa', 'virginica', 'versicolor', 'virginica',
       'setosa', 'virginica', 'setosa', 'versicolor', 'virginica'],
      dtype=object)

Compare with `y_test`

In [18]:
y_test[0:10]

42         setosa
20         setosa
124     virginica
87     versicolor
107     virginica
40         setosa
111     virginica
46         setosa
84     versicolor
147     virginica
Name: species, dtype: object
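
Comparing ten rows by eye only goes so far; accuracy over the whole test set is one common summary. The sketch below is a simplified stand-in rather than the notebook's `pipe`: it assumes the unmodified iris data and skips the `IsMissing` step, so its score will not necessarily match what the pipeline above would give. The names `demo_pipe` and `demo_accuracy` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3,
                                          random_state=2019)

demo_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("model", DecisionTreeClassifier(random_state=2019, min_samples_leaf=30)),
])
demo_pipe.fit(X_tr, y_tr)

# share of test records whose predicted class matches the true class
demo_accuracy = accuracy_score(y_te, demo_pipe.predict(X_te))
```

For a classifier pipeline, `demo_pipe.score(X_te, y_te)` returns the same number in a single call.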

## Conclusion

You now know how to take advantage of `Pipeline` objects to stop duplicating your preprocessing steps in the training and testing sets. I hope this helps you take advantage of yet another module built into `scikit-learn` to help improve your machine learning workflow!