# Sklearn Pipeline Permuter Example

<div class="alert alert-block alert-info">
    
This example shows how to systematically evaluate different machine learning pipelines. 

This is useful, for instance, when combinations of different feature selection methods and different estimators should be evaluated in one step.
</div>

## Imports and Helper Functions

In [None]:
import pandas as pd
import numpy as np

# Utils
from sklearn.datasets import load_breast_cancer, load_diabetes

# Preprocessing & Feature Selection
from sklearn.feature_selection import SelectKBest, RFE
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor


# Cross-Validation
from sklearn.model_selection import KFold

from biopsykit.classification.model_selection import SklearnPipelinePermuter

%load_ext autoreload
%autoreload 2

## Classification

### Load Example Dataset

In [None]:
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

### Specify Estimator Combinations and Parameters for Hyperparameter Search

In [None]:
model_dict = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "clf": {
        "KNeighborsClassifier": KNeighborsClassifier(),
        "DecisionTreeClassifier": DecisionTreeClassifier(),
        # "SVC": SVC(),
        # "AdaBoostClassifier": AdaBoostClassifier(),
    },
}
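To make explicit which pipelines the dictionary above describes, the following minimal sketch enumerates one choice per step with `itertools.product`. It assumes that the permuter builds every combination of one entry per pipeline step. The plain-string dictionary `step_options` is a hypothetical stand-in for `model_dict`, so the snippet runs without scikit-learn.

```python
from itertools import product

# Hypothetical stand-in for model_dict: step name -> names of the candidate estimators
step_options = {
    "scaler": ["StandardScaler", "MinMaxScaler"],
    "reduce_dim": ["SelectKBest", "RFE"],
    "clf": ["KNeighborsClassifier", "DecisionTreeClassifier"],
}

# One entry per step, combined in every possible way (2 * 2 * 2 = 8 pipelines)
combinations = list(product(*step_options.values()))

for combination in combinations:
    print(" -> ".join(combination))
```

Uncommenting the additional classifiers in `model_dict` would raise the count to 2 * 2 * 4 = 16 pipelines, which is why the slower estimators are disabled in this example.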

In [None]:
params_dict = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4, None]},
    "KNeighborsClassifier": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
    "DecisionTreeClassifier": {"criterion": ["gini", "entropy"], "max_depth": [2, 4]},
    # "SVC": [
    #    {
    #        "kernel": ["linear"],
    #        "C": np.logspace(start=-2, stop=2, num=5)
    #    },
    #    {
    #        "kernel": ["rbf"],
    #        "C": np.logspace(start=-2, stop=2, num=5),
    #        "gamma": np.logspace(start=-2, stop=2, num=5)
    #    }
    # ],
    # "AdaBoostClassifier": {
    #    "base_estimator": [DecisionTreeClassifier(max_depth=1)],
    #    "n_estimators": np.arange(20, 110, 10),
    #    "learning_rate": np.arange(0.6, 1.1, 0.1)
    # },
}


# use randomized-search for decision tree classifier, use grid-search (the default) for all other estimators
hyper_search_dict = {"DecisionTreeClassifier": {"search_method": "random", "n_iter": 2}}
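The difference between the two search methods can be illustrated with the standard library alone. The sketch below assumes that grid search tries every combination of the listed values, whereas randomized search draws `n_iter` candidates from them. `tree_params` mirrors the `DecisionTreeClassifier` entry above, and the sampling with `random.Random` is only a simplified stand-in for what scikit-learn does internally.

```python
import random
from itertools import product

# Same search space as the "DecisionTreeClassifier" entry of params_dict
tree_params = {"criterion": ["gini", "entropy"], "max_depth": [2, 4]}

# Grid search: every combination of the listed values
grid_candidates = [
    dict(zip(tree_params, values)) for values in product(*tree_params.values())
]

# Randomized search (simplified): draw n_iter distinct candidates from the grid
n_iter = 2
rng = random.Random(42)
random_candidates = rng.sample(grid_candidates, n_iter)

print(len(grid_candidates), "grid candidates,", len(random_candidates), "random candidates")
```

With such a small search space the saving is minor, but the same setting keeps the number of fits bounded when a parameter grid grows.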

### Setup PipelinePermuter and Cross-Validations for Model Evaluation

Note: For further information please visit the documentation of [SklearnPipelinePermuter](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter).

In [None]:
pipeline_permuter = SklearnPipelinePermuter(model_dict, params_dict, hyper_search_dict=hyper_search_dict, random_state=42)

In [None]:
outer_cv = KFold(5)
inner_cv = KFold(5)

pipeline_permuter.fit(X, y, outer_cv=outer_cv, inner_cv=inner_cv)
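Passing both an outer and an inner cross-validation object suggests a nested cross-validation scheme. The sketch below shows, under that assumption, how the sample indices are partitioned: the inner folds are drawn only from the training part of each outer fold. The helper `kfold_indices` is hypothetical and mimics the fold sizes of an unshuffled `KFold`; it is not part of scikit-learn or BioPsyKit.

```python
def kfold_indices(indices, n_splits):
    """Split `indices` into consecutive train/test folds without shuffling."""
    n_samples = len(indices)
    # The first n_samples % n_splits folds get one extra sample
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


n_samples = 569  # number of samples in the breast cancer dataset
all_indices = list(range(n_samples))

outer_test_sizes = []
inner_val_sizes = []
for outer_train, outer_test in kfold_indices(all_indices, 5):
    outer_test_sizes.append(len(outer_test))
    sizes = []
    # The inner CV only ever sees the outer training indices
    for inner_train, inner_val in kfold_indices(outer_train, 5):
        assert not set(inner_val) & set(outer_test)
        sizes.append(len(inner_val))
    inner_val_sizes.append(sizes)

print(outer_test_sizes)
```

Because the outer test indices never appear in any inner fold, hyperparameters are tuned without access to the data used for the final performance estimate.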

### Display Results

#### Mean Performance Scores

The performance scores for each pipeline and parameter combination, averaged over all outer CV folds, are returned by [mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results).

In [None]:
pipeline_permuter.mean_pipeline_score_results()

#### Best Pipeline

The pipeline with the hyperparameter combination which achieved the highest average test score over all outer CV folds (i.e., the parameter combination which represents the first row of [mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results)).

In [None]:
pipeline_permuter.best_pipeline()

#### Metric Summary

A summary of all relevant metrics (performance scores, confusion matrix, true and predicted labels) of the **best pipelines** for each fold (i.e., the [best_pipeline()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_pipeline) parameter of each inner `cv` object), reported for each evaluated pipeline combination.

In [None]:
pipeline_permuter.metric_summary()

List of `Pipeline` objects for the **best pipeline** for each evaluated pipeline combination.

In [None]:
pipeline_permuter.best_estimator_summary()

## Regression

### Load Example Dataset

In [None]:
diabetes_data = load_diabetes()
X = diabetes_data.data
y = diabetes_data.target

### Specify Estimator Combinations and Parameters for Hyperparameter Search

In [None]:
model_dict = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVR(kernel="linear", C=1))},
    "clf": {
        "KNeighborsRegressor": KNeighborsRegressor(),
        "DecisionTreeRegressor": DecisionTreeRegressor(),
        # "SVR": SVR(),
        # "AdaBoostRegressor": AdaBoostRegressor(),
    },
}

In [None]:
params_dict = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4]},
    "KNeighborsRegressor": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
    "DecisionTreeRegressor": {"max_depth": [2, 4]},
    # "SVR": [
    #    {
    #        "kernel": ["linear"],
    #        "C": np.logspace(start=-2, stop=2, num=5)
    #    },
    #    {
    #        "kernel": ["rbf"],
    #        "C": np.logspace(start=-2, stop=2, num=5),
    #        "gamma": np.logspace(start=-2, stop=2, num=5)
    #    }
    # ],
    # "AdaBoostRegressor": {
    #    "base_estimator": [DecisionTreeRegressor(max_depth=1)],
    #    "n_estimators": np.arange(20, 110, 10),
    #    "learning_rate": np.arange(0.6, 1.1, 0.1)
    # },
}


# use randomized-search for decision tree regressor, use grid-search (the default) for all other estimators
hyper_search_dict = {"DecisionTreeRegressor": {"search_method": "random", "n_iter": 2}}

### Setup PipelinePermuter and Cross-Validations for Model Evaluation

Note: For further information please visit the documentation of [SklearnPipelinePermuter](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter).

In [None]:
pipeline_permuter = SklearnPipelinePermuter(model_dict, params_dict, hyper_search_dict=hyper_search_dict, random_state=42)

In [None]:
outer_cv = KFold(5)
inner_cv = KFold(5)

pipeline_permuter.fit(X, y, outer_cv=outer_cv, inner_cv=inner_cv, scoring="r2")
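The `scoring="r2"` argument selects the coefficient of determination as the performance metric. As a reminder of what that score measures, here is a minimal standard-library sketch of the usual definition, R² = 1 - SS_res / SS_tot. The function name `r2` and the four toy values are illustrative only and are not taken from the diabetes dataset.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values for illustration only
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(round(r2(y_true, y_pred), 3))
```

A perfect prediction yields 1.0, always predicting the mean of `y_true` yields 0.0, and worse models can produce negative values, so negative entries in the result tables are not necessarily an error.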

### Display Results

#### Mean Performance Scores

The performance scores for each pipeline and parameter combination, averaged over all outer CV folds, are returned by [mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results).

In [None]:
pipeline_permuter.mean_pipeline_score_results()

#### Best Pipeline

The pipeline with the hyperparameter combination which achieved the highest average test score over all outer CV folds (i.e., the parameter combination which represents the first row of [mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results)).

In [None]:
pipeline_permuter.best_pipeline()

#### Metric Summary

A summary of all relevant metrics (performance scores, confusion matrix, true and predicted labels) of the **best pipelines** for each fold (i.e., the [best_pipeline()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_pipeline) parameter of each inner `cv` object), reported for each evaluated pipeline combination.

In [None]:
pipeline_permuter.metric_summary()

List of `Pipeline` objects for the **best pipeline** for each evaluated pipeline combination.

In [None]:
pipeline_permuter.best_estimator_summary()