# Using H2OAutoML with Scikit-Learn

## Introduction

Since release `3.28.0.1`, `H2OAutoML` can be used as a `Scikit-Learn` estimator, either on its own or in combination with other `sklearn` components.

As with other `H2O-3` estimators (see the `H2O_estimators_as_sklearn_estimators` notebook in the current folder for more details), the new `h2o.sklearn` module exposes two wrappers for `H2OAutoML`:
- `H2OAutoMLClassifier`
- `H2OAutoMLRegressor`

These wrappers expose the standard API familiar to `sklearn` users (`fit`, `predict`, `fit_predict`, `score`, `get_params`, `set_params`) and accept input data in various formats (`H2OFrame`, `numpy` array, `pandas` DataFrame), which allows them to be combined with pure `sklearn` components in pipelines.

Finally, users can still access the original `H2OAutoML` estimator through the wrapper's `estimator` property: this gives access to all the methods and properties declared in `H2OAutoML`, such as `leaderboard`, `leader`, `event_log`, and `training_info`.

Again, for more details, it is recommended to have a look at the `H2O_estimators_as_sklearn_estimators` notebook in this folder.
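
To make the wrapper idea concrete, here is a minimal sketch of the general pattern: an `sklearn`-style facade that stores its parameters, builds the native estimator at `fit` time, and exposes it through an `estimator` property. `ToyModel` and `ToyWrapper` are hypothetical names invented for this illustration; this is not how `h2o.sklearn` is actually implemented.

```python
class ToyModel:
    """Hypothetical native estimator (plays the role of H2OAutoML here)."""

    def __init__(self, offset=0.0):
        self.offset = offset
        self.mean_ = None

    def train(self, xs, ys):
        # "Training" just memorises the mean target, shifted by `offset`.
        self.mean_ = sum(ys) / len(ys) + self.offset

    def predict(self, xs):
        return [self.mean_ for _ in xs]


class ToyWrapper:
    """Hypothetical sklearn-style facade around ToyModel."""

    def __init__(self, **params):
        self._params = dict(params)
        self._estimator = None

    def get_params(self, deep=True):
        return dict(self._params)

    def set_params(self, **params):
        self._params.update(params)
        return self

    def fit(self, X, y):
        # The native estimator is only created when fit is called.
        self._estimator = ToyModel(**self._params)
        self._estimator.train(X, y)
        return self

    def predict(self, X):
        return self._estimator.predict(X)

    @property
    def estimator(self):
        # Gives access to everything the native estimator declares.
        return self._estimator


wrapper = ToyWrapper(offset=1.0).fit([[1], [2]], [2.0, 4.0])
print(wrapper.predict([[5]]))   # sklearn-style call
print(wrapper.estimator.mean_)  # native attribute reached via `estimator`
```

Because `get_params` and `set_params` follow the `sklearn` contract, an object shaped like this can be cloned and reconfigured by tools such as pipelines and grid searches.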

## `H2OAutoMLClassifier` in practice 

### Requirements

In [1]:
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
from sklearn import datasets
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

from h2o.sklearn import H2OAutoMLClassifier

seed = 2020

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('polyfeat', PolynomialFeatures(degree=2)),
    ('featselect', SelectKBest(f_classif, k=5)),
    ('classifier', H2OAutoMLClassifier(max_models=10, seed=seed, sort_metric='aucpr'))
])
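
As a rough guide to what this pipeline feeds into AutoML, the sketch below counts the columns at each stage. It assumes the default `PolynomialFeatures` settings (`include_bias=True`, `interaction_only=False`) and the 30 features of the breast cancer dataset; `n_polynomial_features` is a helper written for this illustration, not an `sklearn` function.

```python
from math import comb


def n_polynomial_features(n_features, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


n_input = 30  # features in load_breast_cancer
n_expanded = n_polynomial_features(n_input, degree=2)  # after 'polyfeat'
n_selected = 5  # k in SelectKBest, after 'featselect'

print(n_input, "->", n_expanded, "->", n_selected)
```

So the classifier only ever sees the 5 columns that `SelectKBest` keeps out of the expanded set.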

In [3]:
from sklearn.metrics import classification_report

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


| | |
|---|---|
| H2O cluster uptime: | 32 mins 25 secs |
| H2O cluster timezone: | Europe/Prague |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.28.0.2 |
| H2O cluster version age: | 8 days |
| H2O cluster name: | H2O_from_python_seb_tdhiu9 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.218 Gb |
| H2O cluster total cores: | 8 |
| H2O cluster allowed cores: | 8 |


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%


In [4]:
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95        58
           1       0.96      0.96      0.96        85

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143
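
The last two rows of the report differ only in how the per-class values are averaged. The sketch below reproduces both averages for the precision column, using the rounded per-class figures printed above, so the results only match the report up to rounding.

```python
# (precision, support) per class, copied from the rounded report above
per_class = {0: (0.95, 58), 1: (0.96, 85)}

total_support = sum(support for _, support in per_class.values())

# macro avg: every class counts equally
macro = sum(p for p, _ in per_class.values()) / len(per_class)

# weighted avg: every class counts in proportion to its support
weighted = sum(p * s for p, s in per_class.values()) / total_support

print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

With classes this close in size and score, the two averages are nearly identical; they diverge on strongly imbalanced data.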



We can also access the `H2OAutoML` instance with the `estimator` property:

In [5]:
automl = pipeline.named_steps.classifier.estimator

automl.leaderboard

| model_id | aucpr | auc | logloss | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| GLM_1_AutoML_20200128_201119 | 0.98864 | 0.989257 | 0.11541 | 0.0351652 | 0.176139 | 0.0310248 |
| StackedEnsemble_BestOfFamily_AutoML_20200128_201119 | 0.984122 | 0.984172 | 0.125697 | 0.0370034 | 0.176479 | 0.0311449 |
| StackedEnsemble_AllModels_AutoML_20200128_201119 | 0.984022 | 0.9841 | 0.125147 | 0.0416587 | 0.176794 | 0.0312561 |
| XGBoost_3_AutoML_20200128_201119 | 0.981715 | 0.983802 | 0.135276 | 0.0522584 | 0.191452 | 0.0366538 |
| GBM_1_AutoML_20200128_201119 | 0.974896 | 0.973871 | 0.146192 | 0.0430672 | 0.193089 | 0.0372835 |
| GBM_3_AutoML_20200128_201119 | 0.968214 | 0.972641 | 0.150999 | 0.0560542 | 0.196219 | 0.0385017 |
| GBM_4_AutoML_20200128_201119 | 0.962027 | 0.973883 | 0.147488 | 0.0476031 | 0.193489 | 0.0374382 |
| GBM_2_AutoML_20200128_201119 | 0.960663 | 0.972044 | 0.156461 | 0.0494414 | 0.201593 | 0.0406398 |
| XGBoost_2_AutoML_20200128_201119 | 0.925624 | 0.983707 | 0.189011 | 0.0634072 | 0.213545 | 0.0456016 |
| XGBoost_1_AutoML_20200128_201119 | 0.861227 | 0.983635 | 0.145491 | 0.0574628 | 0.195643 | 0.0382763 |
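
The leaderboard above is ordered by `aucpr` because the classifier was created with `sort_metric='aucpr'`. The plain-Python sketch below takes four rows from that leaderboard (model ids shortened for readability) and shows how the ranking would change if the same rows were ordered by `auc` instead.

```python
# (model, aucpr, auc) copied from the leaderboard above, ids shortened
rows = [
    ("GLM_1", 0.98864, 0.989257),
    ("XGBoost_3", 0.981715, 0.983802),
    ("GBM_1", 0.974896, 0.973871),
    ("XGBoost_1", 0.861227, 0.983635),
]

# Higher is better for both metrics, so sort in descending order.
by_aucpr = [name for name, aucpr, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
by_auc = [name for name, _, auc in sorted(rows, key=lambda r: r[2], reverse=True)]

print("sorted by aucpr:", by_aucpr)
print("sorted by auc:  ", by_auc)
```

`XGBoost_1` is last by `aucpr` but would overtake `GBM_1` by `auc`, so the choice of `sort_metric` can change which models rank near the top.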




## `H2OAutoMLRegressor` example

The `H2OAutoMLRegressor` can be used in exactly the same way as its classifier counterpart.

Here, we run several AutoML trainings through a small grid search.
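
To get a sense of the cost involved, the sketch below counts how many AutoML trainings the grid search in the following cell triggers. It assumes the `GridSearchCV` default `refit=True`, which adds one final training on the full data with the best parameters; `expand` is a small helper written for this illustration.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV call of the next cell
param_grid = {
    "monotone_constraints": [None, {"AGE": 1}, {"PTRATIO": 1}, {"CRIM": -1}],
}
cv = 2


def expand(grid):
    """List every parameter combination described by the grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


candidates = expand(param_grid)
n_cv_fits = len(candidates) * cv  # one AutoML training per candidate and fold
n_total_fits = n_cv_fits + 1      # plus the final refit on all the data

print(len(candidates), "candidates,", n_total_fits, "AutoML trainings in total")
```

Each of these trainings is itself bounded by `max_models` and `max_runtime_secs_per_model`, so keeping both small matters when AutoML sits inside a grid search.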

In [6]:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

from h2o.sklearn import H2OAutoMLRegressor

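# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release
ds = datasets.load_boston()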

seed = 2020
regressor = H2OAutoMLRegressor(max_models=10, max_runtime_secs_per_model=30, seed=seed)

grid = GridSearchCV(regressor, cv=2, param_grid=dict(
    monotone_constraints=[None, dict(AGE=1), dict(PTRATIO=1), dict(CRIM=-1)],
))

# converting the data into a pandas DataFrame as we need the column names for the monotone_constraints parameter
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

grid.fit(X, y)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |███████████

GridSearchCV(cv=2, error_score=nan,
             estimator=H2OAutoMLRegressor(algo_parameters=None,
                                          balance_classes=False,
                                          class_sampling_factors=None,
                                          data_conversion='auto',
                                          exclude_algos=None,
                                          export_checkpoints_dir=None,
                                          include_algos=None,
                                          keep_cross_validation_fold_assignment=False,
                                          keep_cross_validation_models=False,
                                          keep_cross_validation_predictions=False,
                                          max_afte...
                                          monotone_constraints=None, nfolds=5,
                                          project_name=None, seed=2020,
                                          sort_met

In [7]:
best = grid.best_estimator_
grid.best_params_

{'monotone_constraints': {'AGE': 1}}

In [8]:
grid.cv_results_

{'mean_fit_time': array([13.52753139, 12.98746252, 13.68160176, 13.30000401]),
 'std_fit_time': array([0.29193354, 0.03823948, 0.21549892, 0.07853317]),
 'mean_score_time': array([0.38817024, 0.48703146, 0.40478611, 0.39410508]),
 'std_score_time': array([0.0828371 , 0.01597548, 0.15079379, 0.1262399 ]),
 'param_monotone_constraints': masked_array(data=[None, {'AGE': 1}, {'PTRATIO': 1}, {'CRIM': -1}],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'monotone_constraints': None},
  {'monotone_constraints': {'AGE': 1}},
  {'monotone_constraints': {'PTRATIO': 1}},
  {'monotone_constraints': {'CRIM': -1}}],
 'split0_test_score': array([0.78161713, 0.82788077, 0.76359332, 0.77081437]),
 'split1_test_score': array([0.62617193, 0.61747351, 0.56696796, 0.61541766]),
 'mean_test_score': array([0.70389453, 0.72267714, 0.66528064, 0.69311602]),
 'std_test_score': array([0.0777226 , 0.10520363, 0.09831268, 0.07769836]),
 'rank_test_
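
The `best_params_` value reported earlier can be recovered by hand from the per-split scores printed above. The sketch below averages the two splits for each candidate and picks the highest mean. The scores come from the estimator's `score` method (R² for standard `sklearn` regressors, assumed here to hold for the wrapper as well).

```python
# Values copied from the cv_results_ output above
params = [
    {"monotone_constraints": None},
    {"monotone_constraints": {"AGE": 1}},
    {"monotone_constraints": {"PTRATIO": 1}},
    {"monotone_constraints": {"CRIM": -1}},
]
split0 = [0.78161713, 0.82788077, 0.76359332, 0.77081437]
split1 = [0.62617193, 0.61747351, 0.56696796, 0.61541766]

# mean_test_score is the average over the cv=2 splits
mean_test_score = [(a + b) / 2 for a, b in zip(split0, split1)]

# The best candidate is the one with the highest mean score
best_index = max(range(len(params)), key=mean_test_score.__getitem__)

print([round(m, 4) for m in mean_test_score])
print(params[best_index])
```

Note that the gap between candidates is smaller than the spread between the two splits, so with only two folds this ranking should be read with some caution.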

In [9]:
automl = best.estimator

automl.leaderboard

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| XGBoost_3_AutoML_20200128_201323 | 8.68469 | 2.94698 | 8.68469 | 2.09068 | 0.134579 |
| StackedEnsemble_AllModels_AutoML_20200128_201323 | 8.93475 | 2.98911 | 8.93475 | 2.07317 | 0.134023 |
| StackedEnsemble_BestOfFamily_AutoML_20200128_201323 | 8.93475 | 2.98911 | 8.93475 | 2.07317 | 0.134023 |
| GBM_1_AutoML_20200128_201323 | 9.60047 | 3.09846 | 9.60047 | 2.20709 | 0.137438 |
| DRF_1_AutoML_20200128_201323 | 10.1783 | 3.19034 | 10.1783 | 2.14753 | 0.137721 |
| GBM_2_AutoML_20200128_201323 | 10.2626 | 3.20353 | 10.2626 | 2.20055 | 0.140183 |
| XGBoost_1_AutoML_20200128_201323 | 10.281 | 3.2064 | 10.281 | 2.16717 | 0.14362 |
| GBM_4_AutoML_20200128_201323 | 10.537 | 3.24608 | 10.537 | 2.22226 | 0.143879 |
| XGBoost_2_AutoML_20200128_201323 | 10.7641 | 3.28087 | 10.7641 | 2.21865 | 0.145428 |
| GBM_3_AutoML_20200128_201323 | 11.2836 | 3.35911 | 11.2836 | 2.26692 | 0.149402 |


