# Using H2OAutoML with Scikit-Learn

## Introduction

Since release `3.28.0.1`, `H2OAutoML` can be used as a `Scikit-Learn` estimator, either on its own or in combination with other `sklearn` components.

As with other `H2O-3` estimators (see the `H2O_estimators_as_sklearn_estimators` notebook in the current folder for more details), the new `h2o.sklearn` module exposes two wrappers for `H2OAutoML`:
- `H2OAutoMLClassifier`
- `H2OAutoMLRegressor`

These wrappers expose the standard API familiar to `sklearn` users (`fit`, `predict`, `fit_predict`, `score`, `get_params`, `set_params`) and accept several input formats (`H2OFrame`, `numpy` array, `pandas` DataFrame), which allows them to be combined with pure `sklearn` components in pipelines.

Finally, users can still access the original `H2OAutoML` estimator from the wrapper through its `estimator` property. This gives access to all the methods and properties declared in `H2OAutoML`, such as `leaderboard`, `leader`, `event_log` and `training_info`.

Again, for more details, see the `H2O_estimators_as_sklearn_estimators` notebook in this folder.
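
As a rough illustration of the points above, the sketch below uses the wrapper on its own, outside any pipeline, and then reaches the underlying `H2OAutoML` object through `estimator`. The helper name `fit_and_inspect` and the `max_models=5` setting are assumptions made for this example only. The `h2o` import is done inside the function, so defining the helper does not require an H2O installation, but calling it does.

```python
def fit_and_inspect(X_train, y_train, X_test, y_test, seed=2020):
    """Fit an H2OAutoMLClassifier directly and return a few things to inspect.

    Hypothetical helper written for this example; it is not part of h2o.
    """
    # Lazy import: h2o is only needed when the helper is actually called.
    from h2o.sklearn import H2OAutoMLClassifier

    clf = H2OAutoMLClassifier(max_models=5, seed=seed)

    # The inputs can be numpy arrays, pandas objects or H2OFrames.
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    score = clf.score(X_test, y_test)

    # The wrapped H2OAutoML instance exposes the usual AutoML properties.
    automl = clf.estimator
    leaderboard = automl.leaderboard.as_data_frame()  # H2OFrame -> pandas DataFrame

    return preds, score, leaderboard
```

With the iris split used later in this notebook, `fit_and_inspect(X_train, y_train, X_test, y_test)` would return the predictions, the value of `score`, and the leaderboard as a `pandas` DataFrame.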

## `H2OAutoMLClassifier` in practice 

### Requirements

In [1]:
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
from sklearn import datasets
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

from h2o.sklearn import H2OAutoMLClassifier

seed = 2020

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('polyfeat', PolynomialFeatures(degree=2)),
    ('featselect', SelectKBest(f_classif, k=5)),
    ('classifier', H2OAutoMLClassifier(max_models=10, max_runtime_secs_per_model=30, seed=seed))
])
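
To see what the two preprocessing steps hand over to the classifier, the following sketch runs them on the full iris data without H2O and prints the shape after each step. It is only an illustration of the `polyfeat` and `featselect` steps, not part of the original notebook. Warnings are silenced around the selection step because the constant bias column added by `PolynomialFeatures` triggers warnings in `f_classif`.

```python
import warnings

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures

X, y = datasets.load_iris(return_X_y=True)

# Degree-2 expansion of 4 features: 1 bias + 4 linear + 10 quadratic terms.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

with warnings.catch_warnings():
    # The constant bias column makes f_classif emit warnings; ignore them here.
    warnings.simplefilter("ignore")
    selector = SelectKBest(f_classif, k=5)
    X_selected = selector.fit_transform(X_poly, y)

print(X.shape, X_poly.shape, X_selected.shape)
# (150, 4) (150, 15) (150, 5)
```

The AutoML step in the pipeline therefore only ever sees five columns, whichever ones score highest on the training data.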

In [3]:
from sklearn.metrics import classification_report

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


| | |
|---|---|
| H2O cluster uptime: | 1 hour 17 mins |
| H2O cluster timezone: | Europe/Prague |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.28.0.2 |
| H2O cluster version age: | 22 hours and 54 minutes |
| H2O cluster name: | H2O_from_python_seb_ca9mh4 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.048 Gb |
| H2O cluster total cores: | 8 |
| H2O cluster allowed cores: | 8 |


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |██████████
17:56:49.537: Skipping training of model GBM_5_AutoML_20200121_175640 due to exception: water.exceptions.H2OModelBuilderIllegalArgumentException: Illegal argument(s) for GBM model: GBM_5_AutoML_20200121_175640.  Details: ERRR on field: _min_rows: The dataset size is too small to split for min_rows=100.0: must have at least 200.0 (weighted) rows, but have only 112.0.


██████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
deeplearning prediction progress: |███████████████████████████████████████| 100%


In [4]:
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      0.88      0.93         8
           2       0.94      1.00      0.97        17

    accuracy                           0.97        38
   macro avg       0.98      0.96      0.97        38
weighted avg       0.98      0.97      0.97        38



We can also access the `H2OAutoML` instance with the `estimator` property:

In [5]:
automl = pipeline.named_steps.classifier.estimator

automl.leaderboard

| model_id | mean_per_class_error | logloss | rmse | mse |
|---|---|---|---|---|
| DeepLearning_1_AutoML_20200121_175640 | 0.046176 | 0.109812 | 0.167587 | 0.0280855 |
| GLM_1_AutoML_20200121_175640 | 0.046176 | 0.0987429 | 0.177804 | 0.0316143 |
| XGBoost_1_AutoML_20200121_175640 | 0.046176 | 0.224027 | 0.223231 | 0.0498319 |
| StackedEnsemble_BestOfFamily_AutoML_20200121_175640 | 0.046176 | 0.269713 | 0.252601 | 0.063807 |
| XGBoost_3_AutoML_20200121_175640 | 0.046176 | 0.178241 | 0.200152 | 0.0400608 |
| StackedEnsemble_AllModels_AutoML_20200121_175640 | 0.0562771 | 0.255428 | 0.243153 | 0.0591232 |
| GBM_4_AutoML_20200121_175640 | 0.0562771 | 0.136659 | 0.195024 | 0.0380343 |
| GBM_2_AutoML_20200121_175640 | 0.0562771 | 0.140848 | 0.196959 | 0.0387929 |
| GBM_3_AutoML_20200121_175640 | 0.0562771 | 0.143808 | 0.197884 | 0.0391579 |
| GBM_1_AutoML_20200121_175640 | 0.0562771 | 0.164639 | 0.206113 | 0.0424824 |




## `H2OAutoMLRegressor` example

The `H2OAutoMLRegressor` can be used in exactly the same way as its classifier counterpart.

Here, we use a small grid search to run AutoML several times, each time with a different `monotone_constraints` setting.

In [6]:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

from h2o.sklearn import H2OAutoMLRegressor

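# note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell requires an older release
ds = datasets.load_boston()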

seed = 2020
regressor = H2OAutoMLRegressor(max_models=10, max_runtime_secs_per_model=30, seed=seed)

grid = GridSearchCV(regressor, cv=2, param_grid=dict(
    monotone_constraints=[None, dict(AGE=1), dict(PTRATIO=1), dict(CRIM=-1)],
))

# converting the data into a pandas DataFrame as we need the column names for the monotone_constraints parameter
X = pd.DataFrame(ds.data, columns=ds.feature_names)
y = ds.target

grid.fit(X, y)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |███████████

GridSearchCV(cv=2, error_score=nan,
             estimator=H2OAutoMLRegressor(algo_parameters=None,
                                          balance_classes=False,
                                          class_sampling_factors=None,
                                          data_conversion='auto',
                                          exclude_algos=None,
                                          export_checkpoints_dir=None,
                                          include_algos=None,
                                          keep_cross_validation_fold_assignment=False,
                                          keep_cross_validation_models=False,
                                          keep_cross_validation_predictions=False,
                                          max_afte...
                                          monotone_constraints=None, nfolds=5,
                                          project_name=None, seed=2020,
                                          sort_met
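
Because `monotone_constraints` is keyed by column name, a misspelled name is easy to miss before a long grid search starts. The sketch below is one possible pre-flight check using only `pandas`. The helper `unknown_constraint_columns` is hypothetical, and the frame holds placeholder zeros under the Boston feature names rather than the real data.

```python
import numpy as np
import pandas as pd

# Column names of the Boston housing dataset (`ds.feature_names` above).
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# Placeholder values: only the column names matter for this check.
X = pd.DataFrame(np.zeros((3, len(feature_names))), columns=feature_names)

# Same candidates as in the grid above.
candidates = [None, dict(AGE=1), dict(PTRATIO=1), dict(CRIM=-1)]


def unknown_constraint_columns(frame, constraints):
    """Return the constraint keys that are not columns of `frame` (hypothetical helper)."""
    if constraints is None:
        return []
    return [name for name in constraints if name not in frame.columns]


# Collect every candidate that references a column the frame does not have.
problems = {str(c): unknown_constraint_columns(X, c) for c in candidates}
bad = {k: v for k, v in problems.items() if v}
print(bad or "all constraint keys match a column")
```

Running such a check before `grid.fit(X, y)` costs nothing and avoids discovering a typo only after the first AutoML run has been launched.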

In [7]:
best = grid.best_estimator_
grid.best_params_

{'monotone_constraints': {'AGE': 1}}

In [8]:
grid.cv_results_

{'mean_fit_time': array([14.85659599, 16.13324833, 16.81081295, 49.23280656]),
 'std_fit_time': array([ 1.39146304,  1.06640244,  0.05238509, 33.95612943]),
 'mean_score_time': array([0.37637341, 0.51012123, 0.40543604, 1.01728797]),
 'std_score_time': array([0.1011554 , 0.00545013, 0.1186471 , 0.22575498]),
 'param_monotone_constraints': masked_array(data=[None, {'AGE': 1}, {'PTRATIO': 1}, {'CRIM': -1}],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'monotone_constraints': None},
  {'monotone_constraints': {'AGE': 1}},
  {'monotone_constraints': {'PTRATIO': 1}},
  {'monotone_constraints': {'CRIM': -1}}],
 'split0_test_score': array([0.78161713, 0.82788077, 0.76359332, 0.77081437]),
 'split1_test_score': array([0.62617193, 0.61747351, 0.56696796, 0.61541766]),
 'mean_test_score': array([0.70389453, 0.72267714, 0.66528064, 0.69311602]),
 'std_test_score': array([0.0777226 , 0.10520363, 0.09831268, 0.07769836]),
 'rank_t
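
The raw `cv_results_` dictionary is hard to read at a glance. As a small sketch, the visible part of the output above can be loaded into a `pandas` DataFrame and sorted by mean test score. The numbers are copied from that output; on a live grid, `pd.DataFrame(grid.cv_results_)` would serve the same purpose.

```python
import pandas as pd

# Values copied from the `grid.cv_results_` output shown above.
cv_results = {
    "monotone_constraints": [None, {"AGE": 1}, {"PTRATIO": 1}, {"CRIM": -1}],
    "split0_test_score": [0.78161713, 0.82788077, 0.76359332, 0.77081437],
    "split1_test_score": [0.62617193, 0.61747351, 0.56696796, 0.61541766],
    "mean_test_score": [0.70389453, 0.72267714, 0.66528064, 0.69311602],
    "std_test_score": [0.0777226, 0.10520363, 0.09831268, 0.07769836],
}

# One row per candidate, best mean test score first.
summary = pd.DataFrame(cv_results).sort_values("mean_test_score", ascending=False)
best_row = summary.iloc[0]

print(summary.to_string(index=False))
print("best candidate:", best_row["monotone_constraints"])
```

The top row matches `grid.best_params_`. The large gap between the two splits for every candidate is also easier to spot in this form.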

In [9]:
automl = best.estimator

automl.leaderboard

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| XGBoost_3_AutoML_20200121_180013 | 8.68469 | 2.94698 | 8.68469 | 2.09068 | 0.134579 |
| StackedEnsemble_AllModels_AutoML_20200121_180013 | 8.93475 | 2.98911 | 8.93475 | 2.07317 | 0.134023 |
| StackedEnsemble_BestOfFamily_AutoML_20200121_180013 | 8.93475 | 2.98911 | 8.93475 | 2.07317 | 0.134023 |
| GBM_1_AutoML_20200121_180013 | 9.60047 | 3.09846 | 9.60047 | 2.20709 | 0.137438 |
| DRF_1_AutoML_20200121_180013 | 10.1783 | 3.19034 | 10.1783 | 2.14753 | 0.137721 |
| GBM_2_AutoML_20200121_180013 | 10.2626 | 3.20353 | 10.2626 | 2.20055 | 0.140183 |
| XGBoost_1_AutoML_20200121_180013 | 10.281 | 3.2064 | 10.281 | 2.16717 | 0.14362 |
| XGBoost_2_AutoML_20200121_180013 | 10.7641 | 3.28087 | 10.7641 | 2.21865 | 0.145428 |
| GBM_4_AutoML_20200121_180013 | 12.455 | 3.52916 | 12.455 | 2.34559 | 0.151009 |
| GBM_3_AutoML_20200121_180013 | 13.1418 | 3.62516 | 13.1418 | 2.39588 | 0.157715 |


