This tutorial shows how to use the Pipeliner class, which generates pipelines from every combination of the candidate steps you provide. For each pipeline, hyperparameters are selected by grid search on one cross-validation, and the pipeline is then evaluated on a second, separate cross-validation.

Prepare the data in the required format, as described in the Transformers tutorial:

In [6]:
from sklearn.datasets import make_classification


X, y = make_classification()
data = {'X': X, 'y': y}

## 1. Defining Pipelines Steps

Here we define two steps. The first step is scaling, with the candidate scalers taken from scikit-learn. The variants of a step must be given as a list of tuples, each of the form ('name_of_transformer', transformer).

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scalers = [
    ('standard', StandardScaler()),
    ('minmax', MinMaxScaler())
]

The second step is classification:

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


classifiers = [
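    # liblinear supports both the 'l1' and 'l2' penalties searched in param_grid;
    # the default solver in scikit-learn >= 0.22 (lbfgs) rejects 'l1'
    ('LR', LogisticRegression(solver='liblinear')),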
    ('SVC', SVC())
]

The steps variable to pass to Pipeliner:

In [9]:
steps = [
    ('Scaler', scalers),
    ('Classifier', classifiers)
]

## 2. Defining Cross Validations

The grid search and evaluation cross-validations must be scikit-learn-style cross-validation objects. Here we use scikit-learn's StratifiedKFold with a different random_state for the grid search and for the evaluation, so the two procedures split the data differently.

In [10]:
from sklearn.model_selection import StratifiedKFold


grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
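
To make the roles of the two splitters concrete, here is a minimal sketch that does by hand, for a single pipeline, what the introduction describes: grid search on `grid_cv`, then a separate evaluation on `eval_cv`. It uses plain scikit-learn only and is not taken from Pipeliner's implementation; the fixed `random_state` in `make_classification` and the `C` grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; the seed is an assumption made for reproducibility.
X, y = make_classification(random_state=0)

grid_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
eval_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# One fixed combination of steps: scaler followed by classifier.
pipeline = Pipeline([
    ('Scaler', StandardScaler()),
    ('Classifier', LogisticRegression()),
])

# Stage 1: choose hyperparameters on the grid-search splits.
search = GridSearchCV(
    pipeline,
    param_grid={'Classifier__C': [0.1, 1.0, 10.0]},  # hypothetical grid
    scoring='roc_auc',
    cv=grid_cv,
)
search.fit(X, y)

# Stage 2: score the pipeline with the selected parameters on different splits.
eval_scores = cross_val_score(
    search.best_estimator_, X, y, scoring='roc_auc', cv=eval_cv
)

print(search.best_params_)
print(eval_scores.mean(), eval_scores.std())
```

Because the two splitters use different seeds, the evaluation folds do not coincide with the folds on which the hyperparameters were selected.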

## 2. Defining Grid Search Parameters

Grid search parameters are defined in a dictionary whose keys match the names in the Classifier column of plan_table. Each key maps to the parameter grid to search for that classifier, in the same form as the param_grid argument of scikit-learn's GridSearchCV class.

In [11]:
param_grid = {
    'LR': {
        'penalty': ['l1', 'l2']
    },
    'SVC': {
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
    }
}
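
An exhaustive grid search tries one candidate per combination of the listed values. The sketch below expands each classifier's grid into its explicit candidates using only the standard library; `expand` and `candidates` are illustrative names, not part of reskit or scikit-learn.

```python
from itertools import product

param_grid = {
    'LR': {'penalty': ['l1', 'l2']},
    'SVC': {'kernel': ['linear', 'poly', 'rbf', 'sigmoid']},
}


def expand(grid):
    """Return every parameter setting described by one classifier's grid."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Map each classifier name to its list of candidate settings.
candidates = {name: expand(grid) for name, grid in param_grid.items()}

for name, settings in candidates.items():
    print(name, len(settings), settings)
```

With the grid above, LR has two candidates and SVC has four; adding a second parameter to a classifier's grid multiplies its candidate count.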

## 3. Creating Pipeliner object and printing plan table

The variables created above are passed to the Pipeliner class. You can access and modify the plan table through the plan_table attribute.

In [13]:
from reskit.core import Pipeliner


pipe = Pipeliner(steps=steps, eval_cv=eval_cv, grid_cv=grid_cv, param_grid=param_grid)
pipe.plan_table

|   | Scaler | Classifier |
|---|---|---|
| 0 | standard | LR |
| 1 | standard | SVC |
| 2 | minmax | LR |
| 3 | minmax | SVC |
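
The four rows above are every combination of one scaler and one classifier. How Pipeliner builds the table internally is not shown in this tutorial; the following is only a minimal sketch of the same enumeration with `itertools.product`. The names `scaler_names`, `classifier_names`, and `rows` are illustrative and not part of reskit.

```python
from itertools import product

# Names of the step variants defined in section 1 (illustrative copies).
scaler_names = ['standard', 'minmax']
classifier_names = ['LR', 'SVC']

# Cartesian product: one row per (scaler, classifier) pair,
# with the last step varying fastest.
rows = list(product(scaler_names, classifier_names))

for index, (scaler, classifier) in enumerate(rows):
    print(index, scaler, classifier)
```

Adding a third scaler or classifier would grow the table multiplicatively, which is why restricting combinations can be useful.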


## 4. Setting banned steps

Let's say we don't want to try the first and last pipelines in the plan table. We can ban these combinations as follows:

In [17]:
banned_combos = [
    ('standard', 'LR'),
    ('minmax', 'SVC')
]

pipe = Pipeliner(steps=steps, eval_cv=eval_cv, grid_cv=grid_cv, param_grid=param_grid, banned_combos=banned_combos)
pipe.plan_table

|   | Scaler | Classifier |
|---|---|---|
| 0 | standard | SVC |
| 1 | minmax | LR |
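
The effect of banned_combos on the plan table can be reproduced with a simple filter. This sketch assumes each banned tuple names a complete row (one variant per step); how Pipeliner itself matches banned combinations is not specified here, and `remaining` is an illustrative name.

```python
from itertools import product

scaler_names = ['standard', 'minmax']
classifier_names = ['LR', 'SVC']

banned_combos = [
    ('standard', 'LR'),
    ('minmax', 'SVC'),
]

# Use a set for fast membership tests, then keep only the allowed rows.
banned = set(banned_combos)
remaining = [
    combo
    for combo in product(scaler_names, classifier_names)
    if combo not in banned
]

for index, (scaler, classifier) in enumerate(remaining):
    print(index, scaler, classifier)
```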


## 5. Launching Experiment

Now we launch the experiment using the get_results method.

In [19]:
pipe.get_results(data=data, scoring=['roc_auc'])

Line: 1/2
Line: 2/2


|   | Scaler | Classifier | grid_roc_auc_mean | grid_roc_auc_std | grid_roc_auc_best_params | eval_roc_auc_mean | eval_roc_auc_std | eval_roc_auc_scores |
|---|---|---|---|---|---|---|---|---|
| 0 | standard | SVC | 0.984 | 0.0149666 | {'kernel': 'rbf'} | 0.986 | 0.0135647 | [0.99 1. 0.99 0.99 0.96] |
| 1 | minmax | LR | 0.978 | 0.0203961 | {'penalty': 'l2'} | 0.984 | 0.0205913 | [1. 1. 1. 0.97 0.95] |
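
The eval_roc_auc_mean and eval_roc_auc_std columns summarize the per-fold values in eval_roc_auc_scores. For the results above, they agree with the mean and the population standard deviation (divisor n, not n - 1) of the listed scores, which can be checked with the standard library. The dictionary below simply copies the scores from the table; `eval_scores` and `summary` are illustrative names.

```python
from statistics import mean, pstdev

# Per-fold ROC AUC scores copied from the eval_roc_auc_scores column.
eval_scores = {
    ('standard', 'SVC'): [0.99, 1.0, 0.99, 0.99, 0.96],
    ('minmax', 'LR'): [1.0, 1.0, 1.0, 0.97, 0.95],
}

# Mean and population standard deviation for each pipeline.
summary = {
    combo: (mean(scores), pstdev(scores))
    for combo, scores in eval_scores.items()
}

for combo, (average, deviation) in summary.items():
    print(combo, round(average, 3), round(deviation, 7))
```

With only five folds, the two pipelines' means differ by much less than one standard deviation, so this run alone does not clearly separate them.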
