# Demo notebook for pipeline functions

This notebook demonstrates the usage of pipelines. See also Scikit-learn's [pipeline documentation](http://scikit-learn.org/stable/modules/pipeline.html). The sklearn `Pipeline` class is a tool in machine learning that simplifies the workflow by encapsulating all the steps involved in a single object. It offers advantages such as simplicity, reproducibility, efficiency, flexibility, and integration.

The `Pipeline` is built from a list of (key, value) pairs, where the key is a string containing the name you want to give the step and the value is an estimator object (the transformer or model to be executed).
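As a minimal sketch of this (name, estimator) structure, the cell below builds a small pipeline and looks its steps up by name. The step names `scale` and `pca` are chosen here for illustration only and are not used elsewhere in this notebook.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Each step is a (name, estimator) pair; the names are illustrative
demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=2)),
])

# The step names can be listed and used to retrieve a single step
step_names = [name for name, _ in demo_pipe.steps]
scaler = demo_pipe.named_steps["scale"]

# Parameters of a step are addressed as <step name>__<parameter name>
demo_pipe.set_params(pca__n_components=3)
print(step_names)
```

The `<step name>__<parameter name>` convention is the same one that the hyperparameter grid at the end of this notebook relies on.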

This notebook demonstrates several use cases, comparing the manual single steps with the pipeline approach.


In [1]:
import pandas as pd
import numpy as np
from time import time


from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV


## The data

For this notebook we use Lifelines data. Lifelines is a large, multi-generational, prospective cohort study that includes over 167,000 participants (10% of the population) from the northern Netherlands. Within this cohort study the participants are followed over a 30-year period. Every five years, participants visit one of the Lifelines sites in the northern part of the Netherlands for an assessment. During these visits, several physical measurements are taken and different biomaterials are collected.

In [2]:
df = pd.read_csv('lifelines.txt', sep = '\t').set_index('ZIPCODE')
#df.info()

In [3]:
df.shape

(480, 71)

In [4]:
#make a copy to compare manual way with pipeline way
data_manual = df.copy()

## Use case preprocessing

First we demonstrate the manual steps: the log transformation and the scaling. Then we demonstrate the pipeline way.

In [5]:
t0 = time()
# single step log transformation
data_manual = np.log1p(data_manual)
# single step scaling
mms = MinMaxScaler()
data_manual = mms.fit_transform(data_manual)
print(time() - t0)

0.0023238658905029297


In [6]:
t0 = time()
# The pipeline
prep = Pipeline([('log1p', FunctionTransformer(np.log1p)), 
                 ('minmaxscale', MinMaxScaler())]
               )

# Convert the original data
data_pipe = prep.fit_transform(df)
print(time() - t0)

0.0028982162475585938


Note that the `Pipeline` calls `fit_transform()` (or `fit()` followed by `transform()`) on each of its steps. `MinMaxScaler` has a `fit_transform()` method, but the plain NumPy function `np.log1p` does not. We therefore wrap it in the sklearn `FunctionTransformer`, a class in the `sklearn.preprocessing` module that lets you apply an arbitrary function to your data. It is commonly used as a preprocessing step in sklearn pipelines.

The FunctionTransformer constructor takes as input a callable, which can be any Python function or lambda expression. When applied to a dataset, the FunctionTransformer simply calls the specified function on the input data and returns the transformed output.
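A small sketch of this behaviour on a toy array (the array values and the use of `np.expm1` as `inverse_func` are assumptions made for illustration, not part of the Lifelines workflow):

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy data, made up for this example
X_toy = np.array([[0.0, 1.0], [9.0, 99.0]])

# Wrap np.log1p so it exposes fit / transform like any other transformer;
# inverse_func is optional and enables inverse_transform
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)

X_log = log_tf.fit_transform(X_toy)        # same values as np.log1p(X_toy)
X_back = log_tf.inverse_transform(X_log)   # recovers the original values
print(np.allclose(X_log, np.log1p(X_toy)))
```

Because the wrapped function is stateless, `fit` does not learn anything from the data here; it only makes the function usable as a pipeline step.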


`np.allclose` is a function in the NumPy library that compares two arrays element-wise and returns True if all corresponding pairs of elements are within a specified tolerance of each other, and False otherwise.
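To make the tolerance concrete, here is a short sketch with made-up arrays. With the default settings (`rtol=1e-05`, `atol=1e-08`) a difference of `1e-9` counts as equal, while a difference of `1e-3` does not unless the tolerance is loosened.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a + 1e-9   # tiny difference, e.g. floating point noise
c = a + 1e-3   # larger difference

tiny_diff_ok = np.allclose(a, b)             # within the default tolerance
large_diff_ok = np.allclose(a, c)            # outside the default tolerance
loosened_ok = np.allclose(a, c, atol=1e-2)   # accepted with a looser atol
print(tiny_diff_ok, large_diff_ok, loosened_ok)
```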


In [7]:
#check if values are the same, pipeline works same as the manual steps
np.allclose(data_pipe, data_manual)

True

## Extend pipeline with PCA

We can use the preprocessing pipeline as a step in a new pipeline.

In [8]:
data_manual = df.copy()
# single step log transformation
data_manual = np.log1p(data_manual)
# single step scaling
mms = MinMaxScaler()
data_manual = mms.fit_transform(data_manual)
#single step PCA
pca = PCA(2, random_state=42)
result_manual_pca = pca.fit_transform(data_manual)

In [9]:
#Use case PCA

pipe = Pipeline([
    ('prep', prep),
    ('pca', PCA(2, random_state=42))
])

results_pipe_pca = pipe.fit_transform(df)

In [10]:
np.allclose(result_manual_pca, results_pipe_pca)

True

## Use a pipeline for clustering

In the same way that we used the pipeline for preprocessing and PCA, we can make a pipeline for preprocessing and clustering with `KMeans`.

In [11]:
# Instantiate the KMeans object (default n_clusters=8)
kmeans = KMeans()

# Instantiate the Pipeline object with the preprocessing pipeline and the KMeans object
pipeline = Pipeline([
    ('prep', prep),
    ('kmeans', kmeans), 
])

# Fit the pipeline to the data
pipeline.fit_predict(df)

array([7, 5, 1, 4, 4, 5, 5, 5, 4, 7, 1, 0, 4, 5, 0, 5, 5, 5, 1, 4, 5, 5,
       7, 5, 4, 7, 0, 7, 5, 5, 5, 5, 4, 5, 5, 5, 7, 1, 1, 1, 1, 7, 1, 7,
       3, 1, 3, 5, 3, 3, 7, 3, 3, 3, 3, 3, 5, 3, 0, 1, 0, 1, 7, 3, 3, 7,
       4, 3, 3, 4, 4, 4, 7, 3, 7, 4, 4, 4, 4, 4, 4, 4, 7, 7, 3, 3, 5, 3,
       7, 4, 0, 4, 3, 1, 3, 1, 3, 4, 3, 3, 1, 1, 1, 0, 1, 1, 0, 1, 0, 7,
       0, 3, 3, 3, 0, 1, 4, 0, 4, 7, 7, 7, 0, 7, 7, 3, 4, 3, 7, 3, 3, 3,
       3, 3, 3, 3, 0, 7, 1, 7, 7, 7, 1, 1, 7, 6, 3, 6, 6, 6, 7, 6, 6, 7,
       6, 6, 6, 7, 0, 3, 3, 4, 3, 4, 3, 3, 0, 7, 4, 1, 4, 7, 3, 1, 0, 4,
       3, 4, 1, 1, 7, 7, 7, 7, 7, 2, 0, 1, 1, 0, 5, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 4, 3, 3, 7, 6, 6, 0, 3, 0, 7, 7, 7, 3, 1, 3,
       3, 0, 4, 3, 4, 4, 4, 0, 3, 4, 0, 0, 4, 4, 4, 4, 1, 1, 1, 1, 0, 1,
       6, 0, 6, 4, 3, 3, 6, 6, 5, 1, 1, 3, 3, 6, 6, 3, 3, 3, 7, 4, 4, 3,
       4, 3, 3, 3, 3, 3, 4, 1, 7, 4, 7, 4, 5, 4, 5, 3, 3, 3, 4, 3, 3, 3,
       1, 0, 0, 7, 0, 7, 1, 4, 3, 1, 0, 7, 7, 7, 0,

## Optimize with GridSearchCV and pipeline

`Pipeline` is most often used in combination with `GridSearchCV` for hyperparameter tuning, because it allows us to encapsulate all the preprocessing and modeling steps into a single object. This can be useful for several reasons:

1. Convenience: Instead of manually performing each preprocessing step and modeling step separately, we can combine them into a single object and pass them as an argument to `GridSearchCV`. This can make our code simpler and easier to read.

2. Reproducibility: By encapsulating all the preprocessing and modeling steps into a single object, we can ensure that the same steps are applied consistently to the data, regardless of how many times we run the code.

3. Avoiding data leakage: When performing hyperparameter tuning, it's important that each parameter combination is fitted using only the training data, not the validation data. By including the preprocessing steps in the `Pipeline` object, we ensure that within every cross-validation split the preprocessing is fitted on the training part only and then applied unchanged to the validation part, preventing data leakage.
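The data leakage point can be sketched as follows, using random data and a manual train/validation split as stand-ins (both are assumptions for illustration). After fitting the pipeline on the training part, the scaler inside it has only seen the training rows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data, split by hand into a training and a validation part
rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 3))
X_train, X_valid = X_all[:80], X_all[80:]

leak_demo = Pipeline([
    ("scale", MinMaxScaler()),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
leak_demo.fit(X_train)

# The scaler learned its minimum from the training rows only
fitted_min = leak_demo.named_steps["scale"].data_min_
print(np.allclose(fitted_min, X_train.min(axis=0)))

# The validation rows are only transformed with the already fitted scaler
valid_labels = leak_demo.predict(X_valid)
```

`GridSearchCV` repeats this fit-on-train, apply-to-validation pattern for every fold and every parameter combination.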

`GridSearchCV` searches over a grid of hyperparameters to find the combination that maximizes a performance metric (see `sklearn.metrics`). By using `Pipeline` in combination with `GridSearchCV`, we can search over a grid of hyperparameters for both the preprocessing and modeling steps simultaneously. This can be a powerful technique for optimizing the performance of machine learning models.

We can configure `GridSearchCV` with: 
- estimator: estimator object being used, can be the result of a pipeline
- param_grid: dictionary that contains all of the parameters to try
- scoring: evaluation metric to use when ranking results
- cv: cross-validation, the number of cv folds for each combination of parameters
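As a hedged sketch of the `scoring` option, the cell below passes a custom callable that ranks the candidates by silhouette score. The helper `silhouette_scorer`, the synthetic `make_blobs` data, and the small grid are all assumptions for illustration; this is not the configuration used in the next cell of this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def silhouette_scorer(estimator, X, y=None):
    """Hypothetical scorer: silhouette of the predicted labels on X."""
    labels = estimator.predict(X)
    # The silhouette score is undefined for fewer than 2 distinct labels
    if len(np.unique(labels)) < 2:
        return -1.0
    # Score in the preprocessed space that KMeans actually clustered
    X_prepared = estimator[:-1].transform(X)
    return silhouette_score(X_prepared, labels)


# Synthetic stand-in data
X_blobs, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

cluster_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("kmeans", KMeans(n_init=10, random_state=42)),
])

search = GridSearchCV(
    estimator=cluster_pipe,
    param_grid={"kmeans__n_clusters": [2, 3, 4]},
    scoring=silhouette_scorer,
    cv=3,
)
search.fit(X_blobs)
print(search.best_params_, search.best_score_)
```

A silhouette-based score does not automatically favour more clusters, which is one reason to consider it instead of the estimator's default score.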

In [12]:
hyperparams = {
    "kmeans__n_clusters": [2, 3, 4, 5, 6],
    "kmeans__random_state": [42],
    "kmeans__init": ['k-means++', 'random']
}

# Instantiate the KMeans object
kmeans = KMeans()

# Instantiate the Pipeline object with the preprocessing pipeline and the KMeans object
pipeline = Pipeline([
    ('prep', prep),
    ('kmeans', kmeans)
])

# Instantiate the GridSearchCV object with the Pipeline object and parameter grid
grid_search = GridSearchCV(estimator=pipeline, param_grid=hyperparams, cv=10, n_jobs=-1)

# Fit the GridSearchCV object to the data
grid_search.fit(df)

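# Access the best combination of hyperparameters and the corresponding mean cross-validated score.
# No `scoring` was passed, so the pipeline's own score() is used; for KMeans this is the negative inertia.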
print(grid_search.best_params_)
print(grid_search.best_score_)


{'kmeans__init': 'random', 'kmeans__n_clusters': 6, 'kmeans__random_state': 42}
-70.93532143198493
