# Tutorial: Cross-validating hierarchical shrinkage hyperparameters
Hyperparameters for (augmented) hierarchical shrinkage (i.e. `shrink_mode` and
`lmb`) can be tuned using cross-validation without retraining the underlying
model, because (augmented) hierarchical shrinkage is a **fully post-hoc**
procedure. As `ShrinkageClassifier` and `ShrinkageRegressor` are valid
scikit-learn estimators, you could simply tune these hyperparameters using
[`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) as you would with any other scikit-learn
model. However, this **will** retrain the decision tree or random forest for
every parameter combination, which wastes computation time. This notebook shows
how to use our cross-validation function to cross-validate `shrink_mode` and
`lmb` without this overhead.
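
To make "fully post-hoc" concrete, here is a minimal sketch of plain hierarchical shrinkage applied to an already fitted scikit-learn regression tree. It is not the `aughs` implementation: the helper `shrunk_node_values` is a hypothetical name, the synthetic dataset is only for illustration, and the formula assumed is the standard one in which each parent-to-child change in the node mean is divided by `1 + lmb / n_parent`. The tree is fitted once, and each new `lmb` only requires recomputing the node values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor


def shrunk_node_values(fitted_tree, lmb):
    """Return shrunk values for every node of a fitted regression tree.

    Hypothetical helper for illustration; the fitted tree is not modified.
    """
    t = fitted_tree.tree_
    orig = t.value[:, 0, 0]   # mean target value in each node
    n = t.n_node_samples      # number of training samples in each node
    shrunk = np.empty_like(orig)
    shrunk[0] = orig[0]       # the root value is left unchanged
    stack = [0]
    while stack:
        parent = stack.pop()
        for child in (t.children_left[parent], t.children_right[parent]):
            if child == -1:   # -1 marks "no child" (the parent is a leaf)
                continue
            # Shrink the parent-to-child increment by 1 + lmb / n_parent
            shrunk[child] = shrunk[parent] + (
                orig[child] - orig[parent]
            ) / (1 + lmb / n[parent])
            stack.append(child)
    return shrunk


X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)  # fitted once

leaves = reg.apply(X)  # leaf index reached by each sample
# Predictions for several lmb values, all derived from the same fitted tree
preds = {lmb: shrunk_node_values(reg, lmb)[leaves] for lmb in [0, 1, 10, 100]}
```

With `lmb=0` the shrunk predictions coincide with `reg.predict(X)`, and very large values of `lmb` pull every prediction towards the root mean.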

In [None]:
import sys
sys.path.append('../')  # Necessary to import aughs from parent directory

from aughs import ShrinkageClassifier, cross_val_shrinkage
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from imodels.util.data_util import get_clean_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from matplotlib import pyplot as plt
import numpy as np

First, we load the dataset and create a classifier.

In [None]:
X, y, feature_names = get_clean_dataset("breast_cancer", data_source="imodels")
clf = ShrinkageClassifier(DecisionTreeClassifier())

Next, we define a grid of hyperparameters to search over. This grid works
analogously to the `param_grid` in `GridSearchCV`, but only accepts values for
`shrink_mode` and `lmb`. Values for both variables **must** be present. For example,
if you want to use a fixed value for `shrink_mode` and only search over `lmb`:
```python
param_grid = {
    "shrink_mode": ["hs_entropy"],
    "lmb": [0, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]
}
```

In [None]:
# Define a grid of parameters
param_grid = {
    "shrink_mode": ["hs", "hs_entropy", "hs_log_cardinality"],
    "lmb": [0, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]
}
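
To see how many settings this grid produces, it can be expanded with scikit-learn's `ParameterGrid`. This is only a sketch for inspecting the grid: whether `cross_val_shrinkage` enumerates the combinations in the same way or in the same order is an assumption.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "shrink_mode": ["hs", "hs_entropy", "hs_log_cardinality"],
    "lmb": [0, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100],
}

# Each element is a dict with one value for "shrink_mode" and one for "lmb"
combinations = list(ParameterGrid(param_grid))
print(len(combinations))  # 3 shrink modes x 10 lmb values = 30
```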

The next cell runs 5-fold cross-validation for each combination of the hyperparameters. The combinations are evaluated in parallel in `n_jobs` processes (use `-1` to use all available CPU cores). The `verbose` parameter is passed to `joblib` for the parallel execution. If `n_jobs=1`, `joblib` isn't used; in that case, any `verbose` value other than 0 shows a progress bar.

In [None]:
# Run cross-validation for each parameter combination
scores, param_shrink_mode, param_lmb = cross_val_shrinkage(clf, X, y,
                                                           param_grid,
                                                           n_splits=5,
                                                           n_jobs=-1,
                                                           verbose=10)
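
For readers unfamiliar with `joblib`, the sketch below shows the general pattern of dispatching one task per parameter combination with `n_jobs` and `verbose`. It is not the internals of `cross_val_shrinkage`: `evaluate_setting` is a hypothetical stand-in that returns a dummy number instead of a cross-validated score, and threads are used here only to keep the example lightweight.

```python
from itertools import product

import numpy as np
from joblib import Parallel, delayed


def evaluate_setting(shrink_mode, lmb):
    # Hypothetical placeholder: a real implementation would cross-validate
    # this setting. The dummy value simply peaks at lmb == 10.
    return -abs(lmb - 10)


param_grid = {
    "shrink_mode": ["hs", "hs_entropy", "hs_log_cardinality"],
    "lmb": [0, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100],
}
combos = list(product(param_grid["shrink_mode"], param_grid["lmb"]))

# n_jobs sets the number of workers; verbose controls joblib's progress output.
# Parallel returns the results in the same order as the submitted tasks.
results = Parallel(n_jobs=2, verbose=0, prefer="threads")(
    delayed(evaluate_setting)(mode, lmb) for mode, lmb in combos
)

best_mode, best_lmb = combos[int(np.argmax(results))]
```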

The function returns 3 values:
- `scores`: the scores for each parameter setting
- `param_shrink_mode`: the shrink mode value for each setting
- `param_lmb`: the lambda value for each setting

For example, the score for setting 5 is `scores[5]`, and this score was achieved
with `shrink_mode=param_shrink_mode[5]` and `lmb=param_lmb[5]`. To get the best
score and parameter values, we can use `np.argmax` as shown below.

In [None]:
best_score_idx = np.argmax(scores)
best_score = scores[best_score_idx]
best_shrink_mode = param_shrink_mode[best_score_idx]
best_lmb = param_lmb[best_score_idx]
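
Because the three return values are aligned by index, they can also be sliced, for example to find the best `lmb` for each shrink mode separately. The sketch below uses made-up placeholder numbers rather than output from the run above, and it assumes the returned sequences can be converted to NumPy arrays.

```python
import numpy as np

# Made-up placeholder values standing in for the three returned sequences;
# these are NOT results from the breast cancer run above.
scores = np.asarray([0.90, 0.92, 0.91, 0.89, 0.93, 0.90])
param_shrink_mode = np.asarray(
    ["hs", "hs", "hs", "hs_entropy", "hs_entropy", "hs_entropy"]
)
param_lmb = np.asarray([0, 1, 10, 0, 1, 10])

best_per_mode = {}
for mode in np.unique(param_shrink_mode):
    idx = np.flatnonzero(param_shrink_mode == mode)  # settings using this mode
    best = idx[np.argmax(scores[idx])]               # best setting among them
    best_per_mode[str(mode)] = (param_lmb[best], scores[best])
```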

In [None]:
scores

In [None]:
best_score

In [None]:
best_shrink_mode

In [None]:
best_lmb