# Sklearn Pipeline Permuter Example

<div class="alert alert-block alert-info">
    
This example shows how to systematically evaluate different machine learning pipelines. 

This is useful, for instance, when combinations of different feature selection methods and different estimators should be evaluated in a single step.
</div>

## Imports and Helper Functions

In [None]:
from pathlib import Path
from shutil import rmtree

import pandas as pd
import numpy as np

# Utils
from sklearn.datasets import load_breast_cancer, load_diabetes

# Preprocessing & Feature Selection
from sklearn.feature_selection import SelectKBest, RFE
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor


# Cross-Validation
from sklearn.model_selection import KFold

from biopsykit.classification.model_selection import SklearnPipelinePermuter

%load_ext autoreload
%autoreload 2

## Classification

Create temporary directory

In [None]:
tmpdir = Path("tmpdir")
tmpdir.mkdir(exist_ok=True)

### Load Example Dataset

In [None]:
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

### Specify Estimator Combinations and Parameters for Hyperparameter Search

In [None]:
model_dict = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "clf": {
        "KNeighborsClassifier": KNeighborsClassifier(),
        "DecisionTreeClassifier": DecisionTreeClassifier(),
        # "SVC": SVC(),
        # "AdaBoostClassifier": AdaBoostClassifier(),
    },
}
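
Every pipeline combination picks one entry per step, so the number of combinations is the product of the step sizes. The following sketch only illustrates this counting with the standard library; the placeholder strings stand in for the estimator objects, and it is not meant to reflect how `SklearnPipelinePermuter` builds its pipelines internally.

```python
from itertools import product

# Placeholder names instead of estimator objects; only the structure matters here.
model_names = {
    "scaler": ["StandardScaler", "MinMaxScaler"],
    "reduce_dim": ["SelectKBest", "RFE"],
    "clf": ["KNeighborsClassifier", "DecisionTreeClassifier"],
}

# One pipeline combination = one choice per step, in step order.
combinations = list(product(*model_names.values()))

for combo in combinations:
    print(" -> ".join(combo))

print(f"{len(combinations)} pipeline combinations")
```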

In [None]:
params_dict = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4, None]},
    "KNeighborsClassifier": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
    "DecisionTreeClassifier": {"criterion": ["gini", "entropy"], "max_depth": [2, 4]},
    # "SVC": [
    #    {
    #        "kernel": ["linear"],
    #        "C": np.logspace(start=-2, stop=2, num=5)
    #    },
    #    {
    #        "kernel": ["rbf"],
    #        "C": np.logspace(start=-2, stop=2, num=5),
    #        "gamma": np.logspace(start=-2, stop=2, num=5)
    #    }
    # ],
    # "AdaBoostClassifier": {
    #    "base_estimator": [DecisionTreeClassifier(max_depth=1)],
    #    "n_estimators": np.arange(20, 110, 10),
    #    "learning_rate": np.arange(0.6, 1.1, 0.1)
    # },
}


# use randomized-search for decision tree classifier, use grid-search (the default) for all other estimators
hyper_search_dict = {"DecisionTreeClassifier": {"search_method": "random", "n_iter": 2}}
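
To make the difference between the two search methods concrete, the sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on the decision tree parameters from above. It assumes that `"search_method": "random"` with `"n_iter": 2` behaves like scikit-learn's randomized search, i.e., that only two of the four candidates are tried; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

tree_params = {"criterion": ["gini", "entropy"], "max_depth": [2, 4]}

# Grid search: every combination of the listed values is a candidate.
grid_candidates = list(ParameterGrid(tree_params))

# Randomized search: only n_iter candidates are drawn from the same space.
random_candidates = list(ParameterSampler(tree_params, n_iter=2, random_state=42))

print(f"grid search candidates:   {len(grid_candidates)}")
print(f"random search candidates: {len(random_candidates)}")
```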

### Setup PipelinePermuter and Cross-Validations for Model Evaluation

Note: For further information please visit the documentation of [SklearnPipelinePermuter](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter).

In [None]:
pipeline_permuter = SklearnPipelinePermuter(
    model_dict, params_dict, hyper_search_dict=hyper_search_dict, random_state=42
)

outer_cv = KFold(5)
inner_cv = KFold(5)

### Fit all Parameter Combinations

In [None]:
pipeline_permuter.fit(X=X, y=y, outer_cv=outer_cv, inner_cv=inner_cv)
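
For orientation, the nested cross-validation scheme behind `outer_cv` and `inner_cv` can be sketched for a single pipeline combination with plain scikit-learn. This is the generic nested-CV pattern, shown under the assumption that it matches the evaluation scheme conceptually; it is not the `SklearnPipelinePermuter` implementation, and the `*_demo` names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# One fixed combination: StandardScaler -> SelectKBest -> KNeighborsClassifier
pipeline_demo = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("reduce_dim", SelectKBest()),
        ("clf", KNeighborsClassifier()),
    ]
)
param_grid_demo = {
    "reduce_dim__k": [2, 4, "all"],
    "clf__n_neighbors": [2, 4],
    "clf__weights": ["uniform", "distance"],
}

# Inner loop: hyperparameter search on the training part of each outer fold.
search_demo = GridSearchCV(pipeline_demo, param_grid_demo, cv=KFold(5))

# Outer loop: the refitted best estimator is scored on the held-out outer fold.
outer_scores = cross_val_score(search_demo, X_demo, y_demo, cv=KFold(5))
print(outer_scores.mean())
```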

### Display Results

#### Metric Summary for Classification Pipelines

A summary of all relevant metrics (performance scores, confusion matrix, true and predicted labels) of the **best-performing pipelines** for each fold (i.e., the [best_pipeline()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_pipeline) parameter of each inner `cv` object), computed for each evaluated pipeline combination.

In [None]:
pipeline_permuter.metric_summary()

List of `Pipeline` objects for the **best pipeline** for each evaluated pipeline combination.

In [None]:
pipeline_permuter.best_estimator_summary()

#### Mean Performance Scores for Individual Hyperparameter Combinations

The performance scores for each pipeline and hyperparameter combination, averaged over all outer CV folds, are returned by [SklearnPipelinePermuter.mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results).

**NOTE**:
* The summary of these pipelines does not necessarily correspond to the best-performing pipeline as returned by
        [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or 
        [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) because the
        best-performing pipelines are determined by averaging the `best_estimator` instances, as determined by
        `scikit-learn`, over all folds. Hence, all `best_estimator` instances can have a **different** set of
        hyperparameters, whereas in this function, the scores are explicitly averaged over the **same** set of hyperparameters.
* Thus, this function should only be used if you want to gain a deeper understanding of the different hyperparameter
        combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper,
        use [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or 
        [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) instead.
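
The distinction in the note can be illustrated with a toy example. All numbers below are made up purely for illustration (two hypothetical hyperparameter sets "A" and "B", three outer folds), and the column names are assumptions, not the output format of `SklearnPipelinePermuter`.

```python
import pandas as pd

# Hypothetical outer-fold test scores for two hyperparameter sets "A" and "B".
scores = pd.DataFrame(
    {
        "fold": [0, 0, 1, 1, 2, 2],
        "params": ["A", "B", "A", "B", "A", "B"],
        "score": [0.90, 0.80, 0.70, 0.85, 0.80, 0.78],
    }
)

# Hypothetical winners of the inner search in each outer fold.
selected = pd.DataFrame({"fold": [0, 1, 2], "params": ["A", "B", "A"]})

# Per-fold best estimators: the chosen hyperparameters may differ between folds.
best_per_fold_mean = scores.merge(selected, on=["fold", "params"])["score"].mean()

# Fixed hyperparameters: each set is averaged over all folds.
fixed_params_mean = scores.groupby("params")["score"].mean()

print(best_per_fold_mean)
print(fixed_params_mean)
```

In this toy case, the mean over the per-fold winners is higher than the mean of either fixed hyperparameter set, which is why the two kinds of summaries need not agree.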

In [None]:
pipeline_permuter.mean_pipeline_score_results()

#### Best Hyperparameter Pipeline

The pipeline with the hyperparameter combination which achieved the highest average test score over all outer CV folds (i.e., the parameter combination which represents the first row of [mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results)).

**NOTE**:
* The summary of these pipelines does not necessarily correspond to the best-performing pipeline as returned by
        [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or 
        [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) because the
        best-performing pipelines are determined by averaging the `best_estimator` instances, as determined by
        `scikit-learn`, over all folds. Hence, all `best_estimator` instances can have a **different** set of
        hyperparameters, whereas in this function, the scores are explicitly averaged over the **same** set of hyperparameters.
* Thus, this function should only be used if you want to gain a deeper understanding of the different hyperparameter
        combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper,
        use [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or 
        [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) instead.

In [None]:
pipeline_permuter.best_hyperparameter_pipeline()

## Regression

### Load Example Dataset

In [None]:
diabetes_data = load_diabetes()
X_reg = diabetes_data.data
y_reg = diabetes_data.target

### Specify Estimator Combinations and Parameters for Hyperparameter Search

In [None]:
model_dict_reg = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVR(kernel="linear", C=1))},
    "clf": {
        "KNeighborsRegressor": KNeighborsRegressor(),
        "DecisionTreeRegressor": DecisionTreeRegressor(),
        # "SVR": SVR(),
        # "AdaBoostRegressor": AdaBoostRegressor(),
    },
}

In [None]:
params_dict_reg = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4]},
    "KNeighborsRegressor": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
    "DecisionTreeRegressor": {"max_depth": [2, 4]},
    # "SVR": [
    #    {
    #        "kernel": ["linear"],
    #        "C": np.logspace(start=-2, stop=2, num=5)
    #    },
    #    {
    #        "kernel": ["rbf"],
    #        "C": np.logspace(start=-2, stop=2, num=5),
    #        "gamma": np.logspace(start=-2, stop=2, num=5)
    #    }
    # ],
    # "AdaBoostRegressor": {
    #    "base_estimator": [DecisionTreeRegressor(max_depth=1)],
    #    "n_estimators": np.arange(20, 110, 10),
    #    "learning_rate": np.arange(0.6, 1.1, 0.1)
    # },
}


# use randomized-search for decision tree regressor, use grid-search (the default) for all other estimators
hyper_search_dict_reg = {"DecisionTreeRegressor": {"search_method": "random", "n_iter": 2}}

### Setup PipelinePermuter and Cross-Validations for Model Evaluation

Note: For further information please visit the documentation of [SklearnPipelinePermuter](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter).

In [None]:
pipeline_permuter_regression = SklearnPipelinePermuter(
    model_dict_reg, params_dict_reg, hyper_search_dict=hyper_search_dict_reg
)

In [None]:
outer_cv = KFold(5)
inner_cv = KFold(5)

pipeline_permuter_regression.fit(X_reg, y_reg, outer_cv=outer_cv, inner_cv=inner_cv, scoring="r2")

### Display Results

This works analogously to the classification example.

## Further Functions

### Export Results as LaTeX Table

In [None]:
print(pipeline_permuter.metric_summary_to_latex())

### Save and Load `PipelinePermuter` results

#### Save to Pickle File

In [None]:
pipeline_permuter.to_pickle(tmpdir.joinpath("test.pkl"))

#### Load from Pickle File

In [None]:
pipeline_permuter_load = SklearnPipelinePermuter.from_pickle(tmpdir.joinpath("test.pkl"))

### Fit pipeline combinations and save intermediate results

This saves the current state to `file_path` after each successfully evaluated pipeline combination.

In [None]:
pipeline_permuter.fit_and_save_intermediate(
    X=X, y=y, outer_cv=outer_cv, inner_cv=inner_cv, file_path=tmpdir.joinpath("test.pkl")
)
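
The general idea of saving after every evaluated combination can be sketched with the standard library alone. The helper `run_with_checkpoints` and the stand-in `fake_evaluate` are hypothetical names for this illustration; in particular, the resume step (skipping combinations already stored in the file) is an assumption of this sketch and not a documented behavior of `fit_and_save_intermediate`.

```python
import pickle
import tempfile
from pathlib import Path


def run_with_checkpoints(combinations, evaluate, checkpoint_path):
    """Evaluate each combination once and save all results after every step."""
    checkpoint_path = Path(checkpoint_path)
    results = {}
    if checkpoint_path.exists():
        # Resume: reuse whatever a previous (possibly interrupted) run stored.
        with checkpoint_path.open("rb") as f:
            results = pickle.load(f)
    for combo in combinations:
        if combo in results:
            continue  # already evaluated in an earlier run
        results[combo] = evaluate(combo)
        # Write the full state to disk after each finished combination.
        with checkpoint_path.open("wb") as f:
            pickle.dump(results, f)
    return results


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.pkl"
    calls = []

    def fake_evaluate(combo):
        # Stand-in for an expensive model evaluation.
        calls.append(combo)
        return len(combo)

    first = run_with_checkpoints([("a", "b")], fake_evaluate, path)
    second = run_with_checkpoints([("a", "b"), ("c", "d", "e")], fake_evaluate, path)
```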

### Merge multiple `PipelinePermuter` instances

If the evaluation of different classification pipelines has to be split (e.g., due to runtime constraints), the `PipelinePermuter` instances can be saved separately and afterwards merged back into one joint `PipelinePermuter` instance.

The following minimal working example consists of these steps:
* Initializing, fitting, and saving different `PipelinePermuter` instances
* Loading saved `PipelinePermuter` instances from disk
* Merging multiple `PipelinePermuter` instances into one instance for joint evaluation
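
Before splitting an evaluation, it can help to check that the separate runs together cover the intended set of pipeline combinations without overlap. The sketch below does this with placeholder names for the three runs used in this example; the helper `combos` is hypothetical and independent of how `merge_permuter_instances` works internally.

```python
from itertools import product


def combos(steps):
    """Return all pipeline combinations (one entry per step) as a set of tuples."""
    return set(product(*steps.values()))


reduce_dim = ["SelectKBest", "RFE"]

full = {
    "scaler": ["StandardScaler", "MinMaxScaler"],
    "reduce_dim": reduce_dim,
    "clf": ["KNeighborsClassifier", "DecisionTreeClassifier"],
}
run_01 = {"scaler": ["StandardScaler"], "reduce_dim": reduce_dim, "clf": ["KNeighborsClassifier"]}
run_02 = {"scaler": ["MinMaxScaler"], "reduce_dim": reduce_dim, "clf": ["KNeighborsClassifier"]}
run_03 = {
    "scaler": ["StandardScaler", "MinMaxScaler"],
    "reduce_dim": reduce_dim,
    "clf": ["DecisionTreeClassifier"],
}

parts = [combos(run_01), combos(run_02), combos(run_03)]
merged = set().union(*parts)

# No combination is evaluated twice if the part sizes add up to the merged size.
overlap_free = sum(len(p) for p in parts) == len(merged)

print(f"covers full grid: {merged == combos(full)}, overlap-free: {overlap_free}")
```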

#### Load Example Dataset

In [None]:
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

#### Fit and Save Different `PipelinePermuter` instances

In [None]:
model_dict_01 = {
    "scaler": {"StandardScaler": StandardScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "clf": {
        "KNeighborsClassifier": KNeighborsClassifier(),
    },
}
params_dict_01 = {
    "StandardScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4, None]},
    "KNeighborsClassifier": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
}

pipeline_permuter_01 = SklearnPipelinePermuter(model_dict_01, params_dict_01, random_state=42)

pipeline_permuter_01.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), verbose=0)
pipeline_permuter_01.to_pickle(tmpdir.joinpath("permuter_01.pkl"))

In [None]:
model_dict_02 = {
    "scaler": {"MinMaxScaler": MinMaxScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "clf": {
        "KNeighborsClassifier": KNeighborsClassifier(),
    },
}
params_dict_02 = {
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4, None]},
    "KNeighborsClassifier": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
}

pipeline_permuter_02 = SklearnPipelinePermuter(model_dict_02, params_dict_02, random_state=42)

pipeline_permuter_02.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), verbose=0)
pipeline_permuter_02.to_pickle(tmpdir.joinpath("permuter_02.pkl"))

In [None]:
model_dict_03 = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "clf": {
        "DecisionTreeClassifier": DecisionTreeClassifier(),
    },
}
params_dict_03 = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4, None]},
    "DecisionTreeClassifier": {"criterion": ["gini", "entropy"], "max_depth": [2, 4]},
}

pipeline_permuter_03 = SklearnPipelinePermuter(model_dict_03, params_dict_03, random_state=42)

pipeline_permuter_03.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), verbose=0)
pipeline_permuter_03.to_pickle(tmpdir.joinpath("permuter_03.pkl"))

#### Load and Merge `PipelinePermuter` instances

In [None]:
permuter_file_list = sorted(tmpdir.glob("permuter_*.pkl"))
print(permuter_file_list)

In [None]:
permuter_list = [SklearnPipelinePermuter.from_pickle(p) for p in permuter_file_list]
permuter_list

In [None]:
merged_permuter = SklearnPipelinePermuter.merge_permuter_instances(permuter_list)

Double-check that the permuters were merged correctly:

In [None]:
for p in permuter_list:
    display(p.best_estimator_summary())

In [None]:
merged_permuter.best_estimator_summary()

## Update Partially Fitted `SklearnPipelinePermuter` with Additional Parameters

For this example, we first run an experiment using a partial hyperparameter set. We save this object as a pickle file, load it in the next step, update the parameter sets, and continue with our experiments. This is useful for incremental experiments because it avoids running multiple separate experiments and merging different `SklearnPipelinePermuter` instances.

#### Do Partial Fitting

In [None]:
model_dict_partial = {
    "scaler": {"StandardScaler": StandardScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "clf": {
        "KNeighborsClassifier": KNeighborsClassifier(),
    },
}
params_dict_partial = {
    "StandardScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4, None]},
    "KNeighborsClassifier": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
}

pipeline_permuter_partial = SklearnPipelinePermuter(model_dict_partial, params_dict_partial, random_state=42)

pipeline_permuter_partial.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5))
pipeline_permuter_partial.to_pickle(tmpdir.joinpath("permuter_partial.pkl"))

#### Load Partially Fitted Model, Update with Total Parameter Dicts, and Fit the Remaining Combinations

In [None]:
model_dict_total = {
    "scaler": {"StandardScaler": StandardScaler(), "MinMaxScaler": MinMaxScaler()},
    "reduce_dim": {"SelectKBest": SelectKBest(), "RFE": RFE(SVC(kernel="linear", C=1))},
    "clf": {
        "KNeighborsClassifier": KNeighborsClassifier(),
        "DecisionTreeClassifier": DecisionTreeClassifier(),
    },
}

params_dict_total = {
    "StandardScaler": None,
    "MinMaxScaler": None,
    "SelectKBest": {"k": [2, 4, "all"]},
    "RFE": {"n_features_to_select": [2, 4, None]},
    "KNeighborsClassifier": {"n_neighbors": [2, 4], "weights": ["uniform", "distance"]},
    "DecisionTreeClassifier": {"criterion": ["gini", "entropy"], "max_depth": [2, 4]},
}

In [None]:
pipeline_permuter_total = SklearnPipelinePermuter.from_pickle(tmpdir.joinpath("permuter_partial.pkl"))
pipeline_permuter_total = pipeline_permuter_total.update_permuter(model_dict_total, params_dict_total)

In [None]:
pipeline_permuter_total.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5))

## Cleanup

In [None]:
rmtree(tmpdir)