The task is simple: find the best combination of preprocessing steps and predictive models with respect to an objective criterion. Logistically, this can be problematic: even a small example might involve three classification models and two data preprocessing steps with two possible variations each, for 12 combinations overall. For each of these combinations we would like to run a grid search over predefined hyperparameters on a fixed set of cross-validation folds, computing a performance metric (for example, ROC AUC) for each option. Clearly this can become complicated quickly. On the other hand, many of these combinations share substeps, and re-running such shared steps wastes compute time.

## 1. Defining Pipeline Steps and Grid Search Parameters

The researcher specifies the possible processing steps and the scikit-learn objects involved, then Reskit expands these steps into every possible pipeline. Reskit represents these pipelines in a convenient pandas dataframe, so the researcher can directly inspect and manipulate the experiments.

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

from reskit.core import Pipeliner

# Feature selection and feature extraction step variants (1st step)
feature_engineering = [('VT', VarianceThreshold()),
                       ('PCA', PCA())]

# Preprocessing step variants (2nd step)
scalers = [('standard', StandardScaler()),
           ('minmax', MinMaxScaler())]

# Models (3rd step)
classifiers = [('LR', LogisticRegression()),
               ('SVC', SVC()),
               ('SGD', SGDClassifier())]

# Reskit expects the steps to be defined in this manner
steps = [('feature_engineering', feature_engineering),
         ('scaler', scalers),
         ('classifier', classifiers)]

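# Grid search parameters for our models
# (note: on scikit-learn >= 0.22 the default 'lbfgs' solver rejects the 'l1'
#  penalty; LogisticRegression(solver='liblinear') supports both penalties)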
param_grid = {'LR': {'penalty': ['l1', 'l2']},
              'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
              'SGD': {'penalty': ['elasticnet'],
                      'l1_ratio': [0.1, 0.2, 0.3]}}

# Quality metric that we want to optimize
scoring='roc_auc'

pipe = Pipeliner(steps, param_grid=param_grid)
pipe.plan_table

Unnamed: 0,feature_engineering,scaler,classifier
0,VT,standard,LR
1,VT,standard,SVC
2,VT,standard,SGD
3,VT,minmax,LR
4,VT,minmax,SVC
5,VT,minmax,SGD
6,PCA,standard,LR
7,PCA,standard,SVC
8,PCA,standard,SGD
9,PCA,minmax,LR
10,PCA,minmax,SVC
11,PCA,minmax,SGD
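
The expansion itself can be illustrated with a Cartesian product over the variant names of each step. This is a minimal sketch using only the standard library, not Reskit's actual implementation; the `expand_plan` helper is a hypothetical name introduced for illustration.

```python
from itertools import product


def expand_plan(steps):
    """Return every pipeline as a tuple of variant names, one name per step.

    Hypothetical helper: illustrates the idea only, not Reskit's code.
    """
    variant_names = [list(variants) for _, variants in steps]
    return list(product(*variant_names))


# Same step and variant names as in the example above
steps = [('feature_engineering', ['VT', 'PCA']),
         ('scaler', ['standard', 'minmax']),
         ('classifier', ['LR', 'SVC', 'SGD'])]

plan = expand_plan(steps)
for index, combo in enumerate(plan):
    print(index, *combo, sep=',')
```

With 2 x 2 x 3 variants this yields 12 pipelines, listed in the same order as the plan table.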


## 2. Forbidden combinations

If you don't want to use the minmax scaler with SVC, you can define a banned combination:

In [2]:
banned_combos = [('minmax', 'SVC')]
pipe = Pipeliner(steps, param_grid=param_grid, banned_combos=banned_combos)
pipe.plan_table

Unnamed: 0,feature_engineering,scaler,classifier
0,VT,standard,LR
1,VT,standard,SVC
2,VT,standard,SGD
3,VT,minmax,LR
4,VT,minmax,SGD
5,PCA,standard,LR
6,PCA,standard,SVC
7,PCA,standard,SGD
8,PCA,minmax,LR
9,PCA,minmax,SGD
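
One way to picture the filtering is to drop every pipeline that contains all the names of a banned combination. The sketch below assumes this "subset" rule, which is consistent with the table above but is not taken from Reskit's source; `filter_banned` is a hypothetical helper.

```python
from itertools import product


def filter_banned(combos, banned_combos):
    """Keep only combos that do not contain every name of any banned combo.

    Hypothetical helper; the subset rule is an assumption for illustration.
    """
    return [combo for combo in combos
            if not any(set(banned) <= set(combo) for banned in banned_combos)]


all_combos = list(product(['VT', 'PCA'],
                          ['standard', 'minmax'],
                          ['LR', 'SVC', 'SGD']))
allowed = filter_banned(all_combos, banned_combos=[('minmax', 'SVC')])

for index, combo in enumerate(allowed):
    print(index, *combo, sep=',')
```

Banning `('minmax', 'SVC')` removes two of the twelve pipelines (one per feature engineering variant), leaving the ten rows shown in the table.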


## 3. Launching Experiment

Reskit then runs each experiment and returns the results as a pandas dataframe. For each pipeline, Reskit runs a cross-validated grid search to find the best classifier parameters, then reports the mean and standard deviation of the metric (ROC AUC in this case) for each tested pipeline.
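
For a single pipeline, the procedure described above can be approximated with plain scikit-learn. This is a sketch of the general technique, not Reskit's internals: the 3-fold stratified split, the fixed `random_state`, and the `liblinear` solver (needed for the `'l1'` penalty on recent scikit-learn versions) are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

# One row of the plan table: VT -> standard -> LR
pipeline = Pipeline([
    ('feature_engineering', VarianceThreshold()),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='liblinear')),
])

# Assumed cross-validation scheme: 3 stratified folds, fixed for both stages
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Grid search stage: pick the best classifier parameters by mean ROC AUC
search = GridSearchCV(pipeline,
                      param_grid={'classifier__penalty': ['l1', 'l2']},
                      scoring='roc_auc', cv=cv)
search.fit(X, y)
grid_mean = search.best_score_
grid_std = search.cv_results_['std_test_score'][search.best_index_]

# Evaluation stage: re-score the best configuration fold by fold
scores = cross_val_score(search.best_estimator_, X, y,
                         scoring='roc_auc', cv=cv)

print(search.best_params_, grid_mean, grid_std)
print(scores.mean(), scores.std(), scores)
```

Reskit performs the equivalent work for every row of the plan table with a single call, as in the next cell.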

In [3]:
from sklearn.datasets import make_classification


X, y = make_classification()
pipe.get_results(X, y, scoring=['roc_auc'])

Line: 1/10
Line: 2/10
Line: 3/10
Line: 4/10
Line: 5/10
Line: 6/10
Line: 7/10
Line: 8/10
Line: 9/10
Line: 10/10


Unnamed: 0,feature_engineering,scaler,classifier,grid_roc_auc_mean,grid_roc_auc_std,grid_roc_auc_best_params,eval_roc_auc_mean,eval_roc_auc_std,eval_roc_auc_scores
0,VT,standard,LR,0.968088,0.0342188,{'penalty': 'l1'},0.969714,0.0324149,[ 0.92387543 0.99307958 0.9921875 ]
1,VT,standard,SVC,0.944412,0.033099,{'kernel': 'linear'},0.945195,0.033238,[ 0.90311419 0.94809689 0.984375 ]
2,VT,standard,SGD,0.931103,0.0524624,"{'l1_ratio': 0.1, 'penalty': 'elasticnet'}",0.907543,0.0493952,[ 0.84429066 0.91349481 0.96484375]
3,VT,minmax,LR,0.958824,0.0573696,{'penalty': 'l1'},0.959631,0.0570905,[ 0.87889273 1. 1. ]
4,VT,minmax,SGD,0.947941,0.0571189,"{'l1_ratio': 0.2, 'penalty': 'elasticnet'}",0.949809,0.0576997,[ 0.86851211 0.99653979 0.984375 ]
5,PCA,standard,LR,0.965735,0.0350271,{'penalty': 'l1'},0.966254,0.0348754,[ 0.91695502 0.98961938 0.9921875 ]
6,PCA,standard,SVC,0.950662,0.00696092,{'kernel': 'sigmoid'},0.95048,0.00701078,[ 0.95155709 0.95847751 0.94140625]
7,PCA,standard,SGD,0.95,0.0352806,"{'l1_ratio': 0.1, 'penalty': 'elasticnet'}",0.884588,0.0701422,[ 0.78546713 0.93079585 0.9375 ]
8,PCA,minmax,LR,0.941176,0.0746251,{'penalty': 'l1'},0.94233,0.0743386,[ 0.83737024 0.98961938 1. ]
9,PCA,minmax,SGD,0.945147,0.0401865,"{'l1_ratio': 0.2, 'penalty': 'elasticnet'}",0.955765,0.0417254,[ 0.99653979 0.97231834 0.8984375 ]
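
The `eval_roc_auc_mean` and `eval_roc_auc_std` columns appear to be the mean and the population standard deviation (NumPy's default, `ddof=0`) of the per-fold values in `eval_roc_auc_scores`. That reading is inferred from the numbers in the table, not from Reskit's documentation; the short check below uses the scores from row 0.

```python
from statistics import mean, pstdev

# Per-fold ROC AUC scores copied from row 0 (VT, standard, LR)
eval_scores = [0.92387543, 0.99307958, 0.9921875]

eval_mean = mean(eval_scores)
eval_std = pstdev(eval_scores)  # population standard deviation (ddof=0)

print(round(eval_mean, 6), round(eval_std, 6))
```

The results agree with the reported 0.969714 and 0.0324149 to the printed precision.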


## 4. Caching intermediate steps

Reskit also allows you to cache interim calculations to avoid unnecessary recalculations.

In [4]:
from sklearn.preprocessing import Binarizer

# Simple binarization step that we want to cache
binarizer = [('binarizer', Binarizer())]

# Reskit expects the steps to be defined in this manner
steps = [('binarizer', binarizer),
         ('classifier', classifiers)]

pipe = Pipeliner(steps, param_grid=param_grid)
pipe.plan_table

Unnamed: 0,binarizer,classifier
0,binarizer,LR
1,binarizer,SVC
2,binarizer,SGD


In [5]:
pipe.get_results(X, y, caching_steps=['binarizer'])

Line: 1/3
Line: 2/3
Line: 3/3


Unnamed: 0,binarizer,classifier,grid_accuracy_mean,grid_accuracy_std,grid_accuracy_best_params,eval_accuracy_mean,eval_accuracy_std,eval_accuracy_scores
0,binarizer,LR,0.95,0.0274532,{'penalty': 'l1'},0.950368,0.0273067,[ 0.91176471 0.97058824 0.96875 ]
1,binarizer,SVC,0.95,0.0274532,{'kernel': 'linear'},0.950368,0.0273067,[ 0.91176471 0.97058824 0.96875 ]
2,binarizer,SGD,0.89,0.0579617,"{'l1_ratio': 0.1, 'penalty': 'elasticnet'}",0.911152,0.0827592,[ 0.79411765 0.97058824 0.96875 ]
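
The benefit of caching can be sketched with a small dictionary keyed by step name: the shared transformation is computed once and reused by every classifier that follows it. This is an illustration of the idea under simplifying assumptions, not Reskit's implementation; `binarize`, `get_cached`, and the toy data are hypothetical, and only the `'init'` and `'binarizer'` keys mirror what Reskit exposes.

```python
from collections import OrderedDict

transform_calls = 0  # counts how often the shared step is actually computed


def binarize(rows, threshold=0.0):
    """Map values greater than the threshold to 1.0, everything else to 0.0."""
    global transform_calls
    transform_calls += 1
    return [[1.0 if value > threshold else 0.0 for value in row] for row in rows]


def get_cached(cache, step_name, transform):
    """Compute the step from the initial data only if it is not cached yet."""
    if step_name not in cache:
        cache[step_name] = transform(cache['init'])
    return cache[step_name]


# Toy data standing in for X; the cache starts with the untransformed input
X_toy = [[-0.08, 0.04, -0.51],
         [-0.02, -0.69, 1.50]]
cache = OrderedDict([('init', X_toy)])

# Three pipelines share the same binarizer step
for classifier_name in ['LR', 'SVC', 'SGD']:
    X_binary = get_cached(cache, 'binarizer', binarize)

print(transform_calls)
print(list(cache))
```

Although three pipelines request the binarized data, the transformation runs only once; the other two requests are served from the cache.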


The last cached calculations are stored in _cached_X:

In [6]:
pipe._cached_X

OrderedDict([('init',
              array([[-0.08043891,  0.04414966, -0.50652287, ...,  0.64708285,
                       0.31735999,  0.37452212],
                     [-0.01761313, -0.69354915,  1.50059522, ..., -1.54449409,
                      -0.03444691, -1.45568281],
                     [-2.62348127, -0.2486129 , -0.82455361, ..., -2.01813262,
                      -1.48093549, -0.1112113 ],
                     ..., 
                     [ 0.27043193,  1.78441792, -0.57314278, ...,  0.30293066,
                       0.45134124,  1.3098357 ],
                     [-0.81090513,  0.1115994 , -0.45419453, ..., -0.03806367,
                       1.3780458 ,  0.01108369],
                     [-0.58389623, -0.36467139, -1.04641143, ...,  0.51293879,
                      -1.78587858, -0.62062718]])),
             ('binarizer', array([[ 0.,  1.,  0., ...,  1.,  1.,  1.],
                     [ 0.,  0.,  1., ...,  0.,  0.,  0.],
                     [ 0.,  0.,  0., ...,  0.,  0.,