# H2O-3 integration with Scikit-Learn

## Introduction

Most of the `H2O-3` estimators available in the `Python` client (module `h2o.estimators`) provide rudimentary support for the standard `sklearn` API, in particular the following methods:
- `fit(X, y)`
- `predict(X)`
- `get_params()`
- `set_params(**params)`
- `transform(X)` and `fit_transform(X, y)` for the transformers defined in module `h2o.transforms`

Those methods have several limitations however:
- they accept only `H2OFrame`s for the `X` and `y` parameters.
- due to the previous limitation, they can be used in a sklearn Pipeline only in combination with `H2O-3` transformers.
- naming collisions: for example, some estimators and transformers (e.g. `H2OPCA`, `H2OSVD`, `H2OAggregatorEstimator`, `H2OGeneralizedLowRankEstimator`, ...) accept a `transform` constructor param, which collides with the `transform` method when they are used in a sklearn Pipeline and ends up raising errors.
- `get_params` and `set_params` do not behave exactly as they should in a `sklearn` context.
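
The naming collision can be reproduced without `H2O-3` at all. The sketch below uses a hypothetical `ToyTransformer` class (not part of `h2o` or `sklearn`) and shows one plausible way such a collision arises: a constructor param stored under the name `transform` hides the `transform` method that a sklearn Pipeline expects to call. It is an illustration of the mechanism, not a description of how the `H2O-3` estimators are implemented.

```python
class ToyTransformer:
    """Hypothetical transformer, for illustration only."""

    def __init__(self, transform="NONE"):
        # The constructor param is stored under the same name as the method below.
        self.transform = transform

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


step = ToyTransformer(transform="STANDARDIZE")

# Attribute lookup finds the instance attribute (a string) before the method.
shadowed = step.transform
is_callable = callable(shadowed)

try:
    step.transform([[1.0, 2.0]])
    error = None
except TypeError as exc:
    # Calling the string fails, which is the kind of error a Pipeline would hit.
    error = str(exc)
```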

Together, these limitations made for a poor experience for users who wanted to combine the power of `H2O-3` with the flexibility of `sklearn`.

That's why since release `3.28.0.1`, our Python client comes with a new `h2o.sklearn` module that we hope will fulfill most of your needs.

## The `h2o.sklearn` module

This new support leaves the existing estimators and transformers untouched, for backwards compatibility.

Instead, the new `h2o.sklearn` module provides a collection of wrappers auto-generated on top of the original estimators (`h2o.estimators`) and transformers (`h2o.transforms`), as well as on top of `H2OAutoML`.

Those wrappers try to cover most of the use cases you may encounter when using `H2O-3` with `sklearn`:
- use the same naming convention as `sklearn`, e.g.:
  - `H2OGradientBoostingClassifier`, ..., `H2OAutoMLClassifier` for classifiers (on top of `H2OGradientBoostingEstimator`, ..., `H2OAutoML`) that will automatically ensure that the target is turned into categorical.
  - `H2OGradientBoostingRegressor`, ..., `H2OAutoMLRegressor` for regressors (on top of `H2OGradientBoostingEstimator`, ..., `H2OAutoML`).
  - `H2OGradientBoostingEstimator`, ..., `H2OAutoMLEstimator` for generic estimators (accepting an additional `estimator_type` param with value None, 'classifier' or 'regressor').
- expose only a `sklearn` API: 
  - a constructor with all params as keyword arguments and available for auto-completion in a Python environment (in a Jupyter notebook for example).
  - `get_params()`, `set_params(**params)` with the first returning all possible parameters, not only the ones that have been set (on the constructor or in any other way).
  - `fit(X, y)`, `predict(X)`, `fit_predict(X, y)` for all estimators (+ H2OAutoML).
  - `predict_proba(X)`, `predict_log_proba(X)` for estimators (classifiers) that support prediction probabilities.
  - `transform(X)`, `fit_transform(X, y)`, `inverse_transform(X)` for all transformers, and also estimators supporting transformations (e.g. H2OPrincipalComponentAnalysisEstimator).
  - `score(X, y)` for estimators, using, like in `sklearn`, `sklearn.metrics.accuracy_score` for classifiers, and `sklearn.metrics.r2_score` for regressors.
- also expose an `estimator` property (since `3.28.0.2`; in `3.28.0.1` it is only available as the `_estimator` "private" property) which is a reference to the original `H2OEstimator` (or `H2OAutoML`) instance: this way, the user can still access properties and methods from the estimator that are not exposed directly by the wrapper.
- data parameters `X` and `y` accept various kinds of data: `H2OFrame`, `numpy` arrays, `pandas.DataFrame`. The wrapper will try to return predictions or transformation in the same format as the input. Basic rules will be explained below, but first, this means that those wrappers can be chained with any kind of `sklearn` component in a `sklearn.pipeline.Pipeline`.
- for quickstarters, the wrappers can also automatically handle the connection to the local backend (auto-start, auto-connect, auto-shutdown when the wrapper is GC-ed). Note that this automatic connection management is disabled if the user first created a connection using `h2o.init()`.
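
To make the `score(X, y)` bullet above concrete, here is a plain-Python sketch of what the two metrics compute. The helper names `accuracy` and `r2` are made up for this example; according to the list above, the wrappers themselves rely on `sklearn.metrics.accuracy_score` and `sklearn.metrics.r2_score`, so this is only meant to show the quantities involved.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Classifier-style scoring: 3 of the 4 labels match.
labels_true = ["setosa", "virginica", "versicolor", "virginica"]
labels_pred = ["setosa", "virginica", "virginica", "virginica"]
acc = accuracy(labels_true, labels_pred)

# Regressor-style scoring: only the last prediction is off, by 1.0.
values_true = [1.0, 2.0, 3.0, 4.0]
values_pred = [1.0, 2.0, 3.0, 5.0]
r2_value = r2(values_true, values_pred)
```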

## `h2o.sklearn` in practice

The following examples should show you how easily you can now use `H2O-3` in combination with `Scikit-Learn`.
They will also point at some (minor) drawbacks that may appear, especially when alternating too many `sklearn` components with `H2O` components in the same pipeline.

### Requirements

### Mixing `sklearn` with `h2o.sklearn` components

As we're going to use the `sklearn.preprocessing` module, we will use `numpy` arrays as input for this section, and we will verify that we actually obtain `numpy` arrays in return.

In [None]:
import warnings
warnings.filterwarnings(action='ignore')

In [2]:
import numpy as np
import pandas as pd  # pandas >= 0.19.2 required
train = pd.read_csv("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_train.csv").values
test = pd.read_csv("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_test.csv").values

In [3]:
X_train = train[:,:-1]
y_train = train[:,-1]
X_test = test[:,:-1]
y_test = test[:,-1]

In [4]:
X_train[:10], y_train[:10]

(array([[6.0, 2.2, 4.0, 1.0],
        [5.2, 3.4, 1.4, 0.2],
        [6.9, 3.1, 5.4, 2.1],
        [7.3, 2.9, 6.3, 1.8],
        [7.6, 3.0, 6.6, 2.1],
        [5.6, 3.0, 4.5, 1.5],
        [5.4, 3.4, 1.7, 0.2],
        [6.4, 3.2, 5.3, 2.3],
        [4.5, 2.3, 1.3, 0.3],
        [6.2, 3.4, 5.4, 2.3]], dtype=object),
 array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica',
        'Iris-virginica', 'Iris-virginica', 'Iris-versicolor',
        'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-virginica'],
       dtype=object))

In [5]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from h2o.sklearn import H2OGradientBoostingClassifier

seed = 42

pipeline_mix = Pipeline([
    ("standardize", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=seed)),
    ("classifier", H2OGradientBoostingClassifier(seed=seed))
])

We can also set nested parameters directly from the pipeline object, especially useful when using `sklearn` cross-validation.

In [6]:
assert 'learn_rate' in pipeline_mix.named_steps.classifier.get_params()

pipeline_mix.set_params(classifier__learn_rate=0.01)

assert pipeline_mix.named_steps.classifier.learn_rate == 0.01
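
As a sketch of why the `step__param` naming matters for cross-validation, the example below runs a small grid search over a nested parameter. To keep it runnable without an `H2O-3` backend, it uses `sklearn.ensemble.GradientBoostingClassifier` as a stand-in for the H2O wrapper; this substitution is an assumption of the example. The stand-in calls its parameter `learning_rate`, whereas with `H2OGradientBoostingClassifier` the grid key would presumably be `classifier__learn_rate`, as in the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Stand-in pipeline: a pure sklearn classifier replaces the H2O wrapper here.
pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("classifier", GradientBoostingClassifier(n_estimators=10, random_state=42)),
])

# Keys follow the "<step name>__<param name>" convention used by set_params.
param_grid = {"classifier__learning_rate": [0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

best = search.best_params_
```

The grid search clones the pipeline for each candidate and calls `set_params` with these keys, which is why `get_params` and `set_params` need to behave as `sklearn` expects.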

Now that our pipeline is defined, we can train our model.

Note that since we haven't initialized `H2O-3` yet (normally done by calling `h2o.init()`), it will be started automatically by the first H2O component encountered in the pipeline.

Please also note the progress bars showing that the training `numpy` data are converted and uploaded to the `H2O` backend.
If you're annoyed by those progress bars and want to hide them, you can simply initialize `H2O` using `h2o.init(show_progress=False)`. This can also be done directly in the `h2o.sklearn` wrapper, using for example `H2OGradientBoostingClassifier(seed=seed, init_connection_args=dict(show_progress=False))`.

In [7]:
pipeline_mix.fit(X_train, y_train)

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_202"; Java(TM) SE Runtime Environment (build 1.8.0_202-b08); Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
  Starting server from /Users/seb/.pyenv/versions/3.7.5/envs/ve37-h2o/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/8j/1spy0dnn4pj3f018plmmbf200000gn/T/tmptnc_f88_
  JVM stdout: /var/folders/8j/1spy0dnn4pj3f018plmmbf200000gn/T/tmptnc_f88_/h2o_seb_started_from_python.out
  JVM stderr: /var/folders/8j/1spy0dnn4pj3f018plmmbf200000gn/T/tmptnc_f88_/h2o_seb_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime: 02 secs
H2O cluster timezone: Europe/Prague
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.2
H2O cluster version age: 7 days, 22 hours and 32 minutes
H2O cluster name: H2O_from_python_seb_g4b5vf
H2O cluster total nodes: 1
H2O cluster free memory: 3.556 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8


Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Pipeline(memory=None,
         steps=[('standardize',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=2,
                     random_state=42, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('classifier',
                 H2OGradientBoostingClassifier(balance_classes=None,
                                               build_tree_one_node=None,
                                               calibrate_model=None,
                                               calibration_frame=None,
                                               ca...
                                               fold_column=None,
                                               histogram_type=None,
                                               huber_alpha=None,
                                               ignore_const_cols=None,
                                     

In [8]:
preds = pipeline_mix.predict(X_test)

assert isinstance(preds, np.ndarray)

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


In [9]:
# get accuracy score (automatically calls `predict` on the estimator internally)
pipeline_mix.score(X_test, y_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


0.94

In [10]:
gbm_wrapper = pipeline_mix.named_steps.classifier
gbm_wrapper.estimator  # use gbm_wrapper._estimator in 3.28.0.1

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1580229230862_1


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
0,,50.0,150.0,17535.0,1.0,4.0,3.613333,2.0,6.0,4.626666




ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.18709890373026358
RMSE: 0.43254930786011386
LogLoss: 0.5659162612656915
Mean Per-Class Error: 0.0664488017429194

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


Unnamed: 0,Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
0,30.0,0.0,0.0,0.0,0 / 30
1,0.0,31.0,3.0,0.088235,3 / 34
2,0.0,4.0,32.0,0.111111,4 / 36
3,30.0,35.0,35.0,0.07,7 / 100



Top-3 Hit Ratios: 


Unnamed: 0,k,hit_ratio
0,1,0.93
1,2,1.0
2,3,1.0



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error
0,,2020-01-28 17:33:54,0.020 sec,0.0,0.666667,1.098612,0.72
1,,2020-01-28 17:33:55,0.177 sec,1.0,0.660893,1.081448,0.07
2,,2020-01-28 17:33:55,0.218 sec,2.0,0.655158,1.064701,0.07
3,,2020-01-28 17:33:55,0.238 sec,3.0,0.649463,1.048357,0.07
4,,2020-01-28 17:33:55,0.254 sec,4.0,0.643808,1.0324,0.07
5,,2020-01-28 17:33:55,0.266 sec,5.0,0.638194,1.016816,0.07
6,,2020-01-28 17:33:55,0.278 sec,6.0,0.632622,1.001594,0.07
7,,2020-01-28 17:33:55,0.291 sec,7.0,0.627091,0.986719,0.07
8,,2020-01-28 17:33:55,0.305 sec,8.0,0.621602,0.97218,0.07
9,,2020-01-28 17:33:55,0.321 sec,9.0,0.616156,0.957967,0.07



See the whole table with table.as_data_frame()

Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,C1,1747.859619,1.0,0.980498
1,C2,34.764946,0.01989,0.019502




### Using only `h2o.sklearn` components

In [11]:
from h2o import H2OFrame
from h2o.sklearn import H2OScaler, H2OPCA, H2OGradientBoostingClassifier

seed = 42

pipeline_h2o = Pipeline([
    ("standardize", H2OScaler()),
    ("pca", H2OPCA(k=2, seed=seed)),
    ("classifier", H2OGradientBoostingClassifier(learn_rate=0.05, seed=seed))
])

Here, as we are using only `H2O` components, we will look at the behaviour of the pipeline when we feed it with  `H2OFrame`s, and then when we feed it with `numpy` arrays.

#### with `H2OFrame`

In [12]:
X_train_h2o, y_train_h2o = H2OFrame(X_train), H2OFrame(y_train)
X_test_h2o, y_test_h2o = H2OFrame(X_test), H2OFrame(y_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [13]:
pipeline_h2o.fit(X_train_h2o, y_train_h2o)

pca Model Build progress: |███████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Pipeline(memory=None,
         steps=[('standardize',
                 H2OScaler(center=True, data_conversion='auto', scale=True)),
                ('pca',
                 H2OPCA(compute_metrics=True, data_conversion='auto',
                        ignore_const_cols=True, impute_missing=False, k=2,
                        max_iterations=None, model_id=None,
                        pca_impl='mtj_evd_symmmatrix', pca_method='GramSVD',
                        seed=42, transform='NONE',
                        use_all_factor_levels=False)),
                ('c...
                                               fold_column=None,
                                               histogram_type=None,
                                               huber_alpha=None,
                                               ignore_const_cols=None,
                                               ignored_columns=None,
                                               keep_cross_validation_fold_assignment=None,
    

In [14]:
preds = pipeline_h2o.predict(X_test_h2o)

assert isinstance(preds, H2OFrame)

pipeline_h2o.score(X_test_h2o, y_test_h2o)

pca prediction progress: |████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


0.98

As expected, our `sklearn` pipeline with only `H2O` components not only accepts `H2OFrame`s as input data, but also produces `H2OFrame` results.

#### with `numpy` arrays

In [15]:
pipeline_h2o.fit(X_train, y_train)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
pca Model Build progress: |███████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Pipeline(memory=None,
         steps=[('standardize',
                 H2OScaler(center=True, data_conversion='auto', scale=True)),
                ('pca',
                 H2OPCA(compute_metrics=True, data_conversion='auto',
                        ignore_const_cols=True, impute_missing=False, k=2,
                        max_iterations=None, model_id=None,
                        pca_impl='mtj_evd_symmmatrix', pca_method='GramSVD',
                        seed=42, transform='NONE',
                        use_all_factor_levels=False)),
                ('c...
                                               fold_column=None,
                                               histogram_type=None,
                                               huber_alpha=None,
                                               ignore_const_cols=None,
                                               ignored_columns=None,
                                               keep_cross_validation_fold_assignment=None,
    

In [16]:
preds = pipeline_h2o.predict(X_test)

assert isinstance(preds, H2OFrame)

pipeline_h2o.score(X_test, y_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%


0.98

In this call to `predict`, we might have expected the predictions to be returned as a `numpy` array as well.

However, the logic used by the `h2o.sklearn` wrappers to detect the data format cannot correctly guess the user's expectations here.
This logic is relatively simple: by default, for a given estimator wrapper, it returns objects of the same type as the input:
- `numpy` in -> `numpy` out
- `H2OFrame` in -> `H2OFrame` out
- `pd.DataFrame` in -> `numpy` out : small exception, but this is also `sklearn`'s behaviour.

It doesn't apply by default for transformer wrappers though, as we expect `H2O` transformers to be chained with other `H2O` transformers or estimators; therefore, the `transform` method doesn't convert the result back by default to avoid too many useless conversions in the pipeline.

There is still a possibility to control the output format of a wrapper by setting its `data_conversion` param.
This parameter accepts 3 possible values:
- `'auto'` (default for estimators): the result will be of the same type as the input, as discussed above.
- `True`: the result is always converted to a `numpy` array.
- `False` (default for transformers): the result is not converted, and therefore returned as an `H2OFrame`.

Here we have multiple `H2O` transformers before the estimator, so if we want to always obtain `numpy` arrays, we can slightly modify the pipeline as follows:

In [17]:
pipeline_h2o_to_numpy = Pipeline([
    ("standardize", H2OScaler()),
    ("pca", H2OPCA(k=2, seed=seed)),
    ("classifier", H2OGradientBoostingClassifier(learn_rate=0.05, seed=seed, data_conversion=True))
])
pipeline_h2o_to_numpy.fit(X_train, y_train)

preds = pipeline_h2o_to_numpy.predict(X_test)
assert isinstance(preds, np.ndarray)

pipeline_h2o_to_numpy.score(X_test, y_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
pca Model Build progress: |███████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
pca prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |███████████

0.98
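
The conversion rules described in this section can be summarised in a few lines of plain Python. This is only a sketch of the rules as stated in the text, not the wrappers' actual implementation: the function names and the `'numpy'`, `'pandas'`, `'h2o'` labels are invented for the example, and the transformers are given an explicit `False` to match the default behaviour described above.

```python
def output_kind(input_kind, data_conversion):
    """Return the container type a single wrapper is described as producing.

    input_kind: 'numpy', 'pandas' or 'h2o' (labels invented for this sketch).
    data_conversion: 'auto', True or False.
    """
    if data_conversion is True:
        return "numpy"  # always converted to a numpy array
    if data_conversion is False:
        return "h2o"  # never converted, stays an H2OFrame
    # 'auto': same type as the input, except that pandas input yields numpy.
    return "h2o" if input_kind == "h2o" else "numpy"


def pipeline_output_kind(input_kind, conversions):
    """Propagate the container type through successive pipeline steps."""
    kind = input_kind
    for data_conversion in conversions:
        kind = output_kind(kind, data_conversion)
    return kind


# Two transformers (no conversion) followed by an estimator in 'auto' mode:
# the estimator receives an H2OFrame, so it returns an H2OFrame.
default_result = pipeline_output_kind("numpy", [False, False, "auto"])

# Same steps, but the final estimator forces conversion.
forced_result = pipeline_output_kind("numpy", [False, False, True])
```

Under these assumptions, `default_result` is `'h2o'` and `forced_result` is `'numpy'`, which matches the types asserted in the two pipelines above when they are fed `numpy` arrays.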

### Demo

Now that you understand how those `H2O` wrappers work and how they combine with `sklearn` components, we can use them to enhance `H2O` with some unique `sklearn` features.

In [18]:
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

from h2o.sklearn import H2OGradientBoostingClassifier

seed = 2020

X, y = datasets.load_iris(return_X_y=True)

pipeline = Pipeline([
    ('polyfeat', PolynomialFeatures(degree=2)),
    ('classifier', H2OGradientBoostingClassifier(seed=seed))
])

cv = StratifiedKFold(n_splits=3)

cross_validate(pipeline, X, y, cv=cv)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |███████████

{'fit_time': array([3.2146287 , 0.79900289, 0.60141706]),
 'score_time': array([0.21829629, 0.2278161 , 0.44310427]),
 'test_score': array([0.98, 0.94, 0.96])}

## Future improvements

### Model persistence and deployment

For deployment or simple reuse, `Scikit-Learn` pipelines are commonly persisted using [pickle](https://docs.python.org/3.8/library/pickle.html) or [joblib](https://joblib.readthedocs.io/en/latest/persistence.html).

The current version of the `h2o.sklearn` module supports persisting configured but untrained pipelines (for reuse), but persisting trained pipelines fails (more precisely, loading them fails): the actual `H2O-3` model resides on the Java backend, so pickling an `H2O-3` estimator pickles the Python client object tree but ignores most of the information stored on the backend.

We plan to solve this soon, probably by automatically exporting the model to binary format when pickling the wrapper, and then re-importing this model when unpickling the wrapper.
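
To illustrate the idea, here is a toy sketch of the export-on-pickle, import-on-unpickle approach using Python's `__getstate__` and `__setstate__` hooks. Everything in it is hypothetical: `BACKEND` is a plain dict standing in for the Java backend, and `ToyWrapper` is not an `h2o.sklearn` class. It only shows the general mechanism, not how the feature will actually be implemented.

```python
import pickle

# Stand-in for the Java backend: model key -> model bytes. Purely hypothetical.
BACKEND = {}


class ToyWrapper:
    """Hypothetical wrapper whose trained model lives in BACKEND, not in the object."""

    def __init__(self, learn_rate=0.1):
        self.learn_rate = learn_rate
        self.model_key = None

    def fit(self):
        # "Training" only registers some bytes on the fake backend.
        self.model_key = "model_1"
        BACKEND[self.model_key] = b"trained-model-bytes"
        return self

    def __getstate__(self):
        state = self.__dict__.copy()
        # Export the backend model alongside the client-side state.
        if self.model_key is not None:
            state["exported_model"] = BACKEND[self.model_key]
        return state

    def __setstate__(self, state):
        exported = state.pop("exported_model", None)
        self.__dict__.update(state)
        # Re-import the model so the key is valid on the current backend.
        if exported is not None:
            BACKEND[self.model_key] = exported


payload = pickle.dumps(ToyWrapper(learn_rate=0.05).fit())
BACKEND.clear()  # simulate a fresh backend that knows no models
restored = pickle.loads(payload)
```

Without the two hooks, `restored.model_key` would point at a model that no longer exists on the backend, which is analogous to the loading failure described above.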