# Using hyperopt-sklearn with `Bench`


[hyperopt-sklearn](https://github.com/hyperopt/hyperopt-sklearn) is a popular library for tuning many "classical" ML models. Its main convenience is that it ships predefined hyperparameter search spaces, which are searched automatically when one calls `fit`.

One difficulty is that it currently does not have a convenient interface for providing custom validation sets (see [this issue](https://github.com/hyperopt/hyperopt-sklearn/issues/152)). We use this as another example of how to implement a model for benchmarking with `mofdscribe`.
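
To make the model interface concrete before the full example: judging from the class defined further down, a benchmark model needs a `fit(idx, structures, y)` and a `predict(idx, structures)` method. The following is a minimal sketch under that assumption; `MeanBaseline` is a hypothetical name and not part of `mofdscribe`, and a real model would likely return a NumPy array rather than a list.

```python
from statistics import fmean


class MeanBaseline:
    """Hypothetical baseline that always predicts the mean training target."""

    def fit(self, idx, structures, y):
        # idx: dataset indices of the training points
        # structures: ignored by this baseline
        # y: target values for the training points
        self.mean_ = fmean(y)

    def predict(self, idx, structures):
        # one prediction per requested index
        return [self.mean_] * len(idx)


baseline = MeanBaseline()
baseline.fit([0, 1, 2, 3], None, [1.0, 2.0, 3.0, 6.0])
predictions = baseline.predict([4, 5], None)
```

Such a baseline ignores the features entirely, which makes it a useful sanity check for the interface before plugging in a tuned model.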


In [1]:
from mofdscribe.bench import LogkHCO2IDBench
from mofdscribe.datasets.core_dataset import CoREDataset
from hpsklearn import (
    HyperoptEstimator,
    gaussian_process_regressor,
    lightgbm_regression,
    power_transformer,
    standard_scaler,
    xgboost_regression,
)

from mofdscribe.splitters import HashSplitter
import numpy as np
from copy import deepcopy


In [2]:
ds = CoREDataset()

FEATURES = list(ds.available_features)

TARGET = "outputs.logKH_CO2"


2022-08-07 16:25:09.726 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:127 - Dropped 3227 duplicate basenames. New length 2166
2022-08-07 16:25:09.762 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:133 - Dropped 62 duplicate graphs. New length 2104


In [7]:
class TunedXBoost:
    def __init__(self, features):
        self.model = HyperoptEstimator(regressor=xgboost_regression("mymodel"))
        self.features = features

    def tune(self, idx, y):
        tune_splitter = HashSplitter(self.ds.get_subset(idx))
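        # self.ds is not set in __init__; it is expected to be attached
        # by the benchmark (see patch_in_ds=True below).
        # we tune once per fold of a 5-fold split and afterwards keep
        # the model with the lowest validation loss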
        models = []
        for train_idx_, valid_idx_ in tune_splitter.k_fold(5):
            train_idx = idx[train_idx_]
            valid_idx = idx[valid_idx_]

            train_x, train_y = self.ds._df.iloc[train_idx][self.features], y[train_idx_]
            valid_x, valid_y = self.ds._df.iloc[valid_idx][self.features], y[valid_idx_]

            # we concatenate train and validation data
            # but make sure to turn off shuffling and use the last fraction of the data for validation
            # (new names are used so that the `y` argument is not overwritten between folds)
            fold_x = np.concatenate([train_x, valid_x])
            fold_y = np.concatenate([train_y, valid_y])

            valid_frac = len(valid_x) / len(fold_x)

            model = deepcopy(self.model)
            model.fit(fold_x, fold_y, cv_shuffle=False, n_folds=None, valid_size=valid_frac)

            models.append((model._best_loss, model, model._best_learner))

        models = sorted(models, key=lambda x: x[0])
        print(models)
        self.model = models[0][1]

    def fit(self, idx, structures, y):
        self.tune(idx, y)
        X = self.ds._df.iloc[idx][self.features]
        self.model.fit(X, y)

    def predict(self, idx, structures):
        X = self.ds._df.iloc[idx][self.features]
        pred = self.model.predict(X)
        return pred
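
The `tune` method above works around the missing validation-set interface by stacking the training rows in front of the validation rows and passing the matching validation fraction with shuffling turned off. The sketch below illustrates why this pins down the validation set. It is plain Python and does not call hyperopt-sklearn; `holdout_layout` and `unshuffled_split` are hypothetical helpers, and the splitting rule (the last fraction of rows is held out) is an assumption about how an unshuffled holdout split behaves, not the library's actual code.

```python
def holdout_layout(train_rows, valid_rows):
    """Stack train rows before validation rows and return the validation fraction."""
    rows = list(train_rows) + list(valid_rows)
    valid_frac = len(valid_rows) / len(rows)
    return rows, valid_frac


def unshuffled_split(rows, valid_frac):
    """Assumed rule: without shuffling, the last `valid_frac` of the rows is held out."""
    n_valid = round(valid_frac * len(rows))
    n_fit = len(rows) - n_valid
    return rows[:n_fit], rows[n_fit:]


train = [f"train_{i}" for i in range(13)]
valid = [f"valid_{i}" for i in range(3)]

rows, valid_frac = holdout_layout(train, valid)
fit_rows, held_out_rows = unshuffled_split(rows, valid_frac)
```

Under this assumption the held-out rows are exactly the custom validation set. If the estimator truncates instead of rounding when it computes the split point, the boundary could shift by one row for some fractions, so it is worth checking the split sizes on real data.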


In [8]:
model = TunedXBoost(FEATURES)


In [9]:
bench = LogkHCO2IDBench(model, name="xgboost-hyperopt", debug=True, patch_in_ds=True)


2022-08-07 16:26:45.829 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:127 - Dropped 3227 duplicate basenames. New length 2166
2022-08-07 16:26:45.871 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:133 - Dropped 62 duplicate graphs. New length 2104
2022-08-07 16:26:50.404 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:127 - Dropped 3227 duplicate basenames. New length 2166
2022-08-07 16:26:50.519 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:133 - Dropped 62 duplicate graphs. New length 2104
2022-08-07 16:26:50.533 | DEBUG    | mofdscribe.splitters.splitters:__init__:116 - Splitter settings | shuffle True, random state None, sample frac 0.01, q (0, 0.25, 0.5, 0.75, 1)


In [10]:
report = bench.bench()


2022-08-07 16:26:51.056 | DEBUG    | mofdscribe.bench.mofbench:_score:230 - K-fold round 0, 16 train points, 4 test points
2022-08-07 16:26:55.640 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:127 - Dropped 3227 duplicate basenames. New length 2166
2022-08-07 16:26:55.678 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:133 - Dropped 62 duplicate graphs. New length 2104
2022-08-07 16:26:55.686 | DEBUG    | mofdscribe.splitters.splitters:__init__:116 - Splitter settings | shuffle True, random state None, sample frac 1.0, q (0, 0.25, 0.5, 0.75, 1)


100%|██████████| 1/1 [00:02<00:00,  2.73s/trial, best loss: 1.0755024369365775]
100%|██████████| 2/2 [00:01<00:00,  1.32s/trial, best loss: 1.0755024369365775]
100%|██████████| 3/3 [00:03<00:00,  3.78s/trial, best loss: 1.0755024369365775]
100%|██████████| 4/4 [00:02<00:00,  2.64s/trial, best loss: 1.0755024369365775]
100%|██████████| 5/5 [00:01<00:00,  1.19s/trial, best loss: 1.0755024369365775]
100%|██████████| 6/6 [00:01<00:00,  1.73s/trial, best loss: 1.0755024369365775]
100%|██████████| 7/7 [00:01<00:00,  1.44s/trial, best loss: 1.0755024369365775]
100%|██████████| 8/8 [00:03<00:00,  3.36s/trial, best loss: 1.0755024369365775]
100%|██████████| 9/9 [00:01<00:00,  1.44s/trial, best loss: 1.0755024369365775]
100%|██████████| 10/10 [00:01<00:00,  1.34s/trial, best loss: 1.0755024369365775]
100%|██████████| 1/1 [00:01<00:00,  1.36s/trial, best loss: 111.0646452788255]
100%|██████████| 2/2 [00:01<00:00,  1.23s/trial, best loss: 74.36400610955863]
100%|██████████| 3/3 [00:01<00:00,  1.80

2022-08-07 16:29:38.787 | DEBUG    | mofdscribe.bench.mofbench:_score:230 - K-fold round 1, 16 train points, 4 test points
2022-08-07 16:29:43.768 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:127 - Dropped 3227 duplicate basenames. New length 2166
2022-08-07 16:29:43.803 | DEBUG    | mofdscribe.datasets.core_dataset:__init__:133 - Dropped 62 duplicate graphs. New length 2104
2022-08-07 16:29:43.814 | DEBUG    | mofdscribe.splitters.splitters:__init__:116 - Splitter settings | shuffle True, random state None, sample frac 1.0, q (0, 0.25, 0.5, 0.75, 1)


100%|██████████| 1/1 [00:01<00:00,  1.81s/trial, best loss: 1.0637203918052178]
100%|██████████| 2/2 [00:01<00:00,  1.57s/trial, best loss: 1.0637203918052178]
100%|██████████| 3/3 [00:01<00:00,  1.45s/trial, best loss: 1.0637203918052178]
100%|██████████| 4/4 [00:02<00:00,  2.42s/trial, best loss: 1.0637203918052178]
100%|██████████| 5/5 [00:01<00:00,  1.66s/trial, best loss: 1.0637203918052178]
100%|██████████| 6/6 [00:01<00:00,  1.39s/trial, best loss: 1.0637203918052178]
100%|██████████| 7/7 [00:01<00:00,  1.35s/trial, best loss: 1.0637203918052178]
100%|██████████| 8/8 [00:04<00:00,  4.34s/trial, best loss: 1.0637203918052178]
100%|██████████| 9/9 [00:05<00:00,  5.13s/trial, best loss: 1.0637203918052178]
100%|██████████| 10/10 [00:01<00:00,  1.44s/trial, best loss: 1.0637203918052178]
100%|██████████| 1/1 [00:01<00:00,  1.94s/trial, best loss: 2.0539555010466115]
100%|██████████| 2/2 [00:01<00:00,  1.38s/trial, best loss: 2.0539555010466115]
100%|██████████| 3/3 [00:03<00:00,  3.

2022-08-07 16:32:22.125 | DEBUG    | mofdscribe.bench.mofbench:_score:230 - K-fold round 2, 16 train points, 4 test points
