# H2OAutoML Plugin

Since H2O-3 `3.28.0.1`, users can customize the `H2OAutoML` model selection engine by writing their own training steps as a Java plugin.

## How to write a simple plugin

To create such a plugin, the user only needs a small project containing at least:
- an implementation of the `ai.h2o.automl.ModelingStepsProvider` interface.
- a file `META-INF/services/ai.h2o.automl.ModelingStepsProvider` with one entry for each implementation that must be exposed to the service provider of the main `H2O-3` jar.

This folder contains such a plugin example:
```text
.
├── Makefile
├── java_plugin.ipynb
└── src
    ├── META-INF
    │   └── services
    │       └── ai.h2o.automl.ModelingStepsProvider
    └── my
        └── automl
            ├── MyDRFStepsProvider.java
            └── MyGLMStepsProvider.java
```

with `src/META-INF/services/ai.h2o.automl.ModelingStepsProvider`:
```text
my.automl.MyDRFStepsProvider
my.automl.MyGLMStepsProvider
```

and for example `MyDRFStepsProvider.java`:
```java
package my.automl;

import ai.h2o.automl.*;
import hex.grid.Grid;
import hex.tree.drf.DRFModel;
import hex.tree.drf.DRFModel.DRFParameters;
import water.Job;

import java.util.HashMap;
import java.util.Map;
import java.util.stream.IntStream;

import static ai.h2o.automl.ModelingStep.ModelStep.DEFAULT_MODEL_TRAINING_WEIGHT;


public class MyDRFStepsProvider implements ModelingStepsProvider<MyDRFStepsProvider.DRFSteps> {

    public static class DRFSteps extends ModelingSteps {

        static abstract class DRFGridStep extends ModelingStep.GridStep<DRFModel> {

            DRFGridStep(String id, int weight, AutoML autoML) {
                super(Algo.DRF, id, weight, autoML);
            }

            DRFParameters prepareModelParameters() {
                DRFParameters drfParameters = new DRFParameters();
                drfParameters._sample_rate = 0.8;
                drfParameters._col_sample_rate_per_tree = 0.8;
                drfParameters._col_sample_rate_change_per_level = 0.9;
                return drfParameters;
            }
        }

        private ModelingStep[] grids = new ModelingStep[] {
                new DRFGridStep("grid_1", 10*DEFAULT_MODEL_TRAINING_WEIGHT, aml()) {
                    @Override
                    protected Job<Grid> startJob() {
                        DRFParameters drfParameters = prepareModelParameters();

                        Map<String, Object[]> searchParams = new HashMap<>();
                        searchParams.put("_ntrees", IntStream.rangeClosed(5, 1000).filter(i -> i % 50 == 0).boxed().toArray());
                        searchParams.put("_nbins", IntStream.of(5, 10, 15, 20, 30).boxed().toArray());
                        searchParams.put("_max_depth", IntStream.rangeClosed(3, 20).boxed().toArray());
                        searchParams.put("_min_rows", IntStream.of(3, 5, 10, 20, 50, 80, 100).boxed().toArray());

                        return hyperparameterSearch(makeKey("MyDRF", false), drfParameters, searchParams);
                    }
                },
        };

        public DRFSteps(AutoML autoML) {
            super(autoML);
        }

        @Override
        protected ModelingStep[] getGrids() {
            return grids;
        }
    }

    @Override
    public String getName() {
        return "MyDRF";
    }

    @Override
    public DRFSteps newInstance(AutoML aml) {
        return new DRFSteps(aml);
    }
}
```

As shown above, writing a `ModelingStepsProvider` only requires implementing two methods:
- `String getName()` returning the name of this provider, which should be unique among all the registered providers: default algo names like "GLM", "XGBoost", "GBM", "DRF" are already used by `H2O-3` and must be avoided.
- `T newInstance(AutoML aml)` returning an instance of `ai.h2o.automl.ModelingSteps`: this is the class defining the logic for the default models and/or the grids that the user wants to add to `H2O AutoML`.
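
Because provider names must be unique, it can help to check candidate names before building the plugin. The sketch below is only illustrative: the reserved set combines the four names quoted above with the `DeepLearning` and `StackedEnsemble` providers that appear in the `aml.modeling_steps` output later in this document, and the exact-match, case-sensitive comparison is an assumption rather than documented `H2O-3` behaviour. The helper name `check_provider_names` is hypothetical.

```python
# Names collected from this document only; the real list may be longer (assumption).
RESERVED_PROVIDER_NAMES = {
    "GLM", "XGBoost", "GBM", "DRF", "DeepLearning", "StackedEnsemble",
}


def check_provider_names(names, reserved=RESERVED_PROVIDER_NAMES):
    """Return the names that clash with a reserved name or with an earlier name."""
    clashes = []
    seen = set()
    for name in names:
        if name in reserved or name in seen:
            clashes.append(name)
        seen.add(name)
    return clashes


# The two providers of the example plugin do not clash with anything.
print(check_provider_names(["MyGLM", "MyDRF"]))
```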


## How to add the plugin to H2O-3

H2O AutoML plugins are simply discovered using [ServiceLoader](https://docs.oracle.com/javase/8/docs/api/java/util/ServiceLoader.html), so the only requirement is to make this plugin available on the classpath.

The simplest way is to create a jar, and add it to the classpath.
For example, from this directory, running
```bash
make dist
```
will create a jar for our plugin in the `./dist` subfolder.
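
Since discovery relies entirely on the `META-INF/services` entry, a quick sanity check of the built jar can save a debugging round trip. The following is a minimal sketch using only the Python standard library: it reads the provider-configuration file from the jar (one class name per line, `#` starting a comment, as in the `ServiceLoader` file format) and reports whether a matching `.class` entry exists. The function names are hypothetical helpers, not part of `H2O-3`.

```python
import zipfile

SERVICE_ENTRY = "META-INF/services/ai.h2o.automl.ModelingStepsProvider"


def parse_provider_file(text):
    """Return the class names listed in a provider-configuration file."""
    providers = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            providers.append(line)
    return providers


def list_registered_providers(jar_path):
    """Map each declared provider class to True if its .class file is in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        entries = set(jar.namelist())
        if SERVICE_ENTRY not in entries:
            return {}  # nothing registered: the plugin would not be discovered
        declared = parse_provider_file(jar.read(SERVICE_ENTRY).decode("utf-8"))
        return {
            cls: (cls.replace(".", "/") + ".class") in entries
            for cls in declared
        }
```

Calling `list_registered_providers("./dist/h2oautoml_plugin.jar")` after `make dist` should list both `my.automl` providers; an empty result or a `False` value would point to a packaging problem rather than to `H2O-3` itself.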

This jar can then be added to the classpath when starting `H2O-3`:
```bash
java -cp /path/to/h2o.jar:/path/to/automl/plugin.jar water.H2OApp
```
or directly from the clients:
- Python:
```python
import h2o
h2o.init(extra_classpath=["/path/to/automl/plugin.jar"])
```
- R:
```R
library("h2o")
h2o.init(extra_classpath=c("/path/to/automl/plugin.jar"))
```

In [1]:
# run this cell if you don't have h2o installed in your Python environment
!pip install h2o

Collecting h2o
  Downloading h2o-3.28.0.2.tar.gz (126.2 MB)
     |████████████████████████████████| 126.2 MB 1.6 MB/s eta 0:00:01
Installing collected packages: h2o
    Running setup.py install for h2o ... done
Successfully installed h2o-3.28.0.2


In [2]:
# let's build our plugin jar
!make dist

rm -Rf ./build ./dist
sources = ./src/my/automl/MyGLMStepsProvider.java ./src/my/automl/MyDRFStepsProvider.java
mkdir -p build
javac ./src/my/automl/MyGLMStepsProvider.java ./src/my/automl/MyDRFStepsProvider.java -cp "/Users/seb/.pyenv/versions/3.7.5/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar" -d ./build
cp -R ./src/META-INF ./build
mkdir -p dist
jar cvf ./dist/h2oautoml_plugin.jar -C ./build .
added manifest
ignoring entry META-INF/
adding: META-INF/services/(in = 0) (out= 0)(stored 0%)
adding: META-INF/services/ai.h2o.automl.ModelingStepsProvider(in = 59) (out= 40)(deflated 32%)
adding: my/(in = 0) (out= 0)(stored 0%)
adding: my/automl/(in = 0) (out= 0)(stored 0%)
adding: my/automl/MyGLMStepsProvider$GLMSteps$1.class(in = 2770) (out= 1282)(deflated 53%)
adding: my/automl/MyDRFStepsProvider$DRFSteps.class(in = 899) (out= 452)(deflated 49%)
adding: my/automl/MyDRFStepsProvider$DRFSteps$DRFGridStep.class(in = 1109) (out= 571)(deflated 48%)
adding: my/automl/MyDRFStepsProvider.c

In [3]:
# and start the Python client with our plugin
import h2o
h2o.init(extra_classpath=["./dist/h2oautoml_plugin.jar"])

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_202"; Java(TM) SE Runtime Environment (build 1.8.0_202-b08); Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
  Starting server from /Users/seb/.pyenv/versions/3.7.5/envs/ve37-h2o/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/8j/1spy0dnn4pj3f018plmmbf200000gn/T/tmpaufn47qd
  JVM stdout: /var/folders/8j/1spy0dnn4pj3f018plmmbf200000gn/T/tmpaufn47qd/h2o_seb_started_from_python.out
  JVM stderr: /var/folders/8j/1spy0dnn4pj3f018plmmbf200000gn/T/tmpaufn47qd/h2o_seb_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| | |
|---|---|
| H2O cluster uptime: | 01 secs |
| H2O cluster timezone: | Europe/Prague |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.28.0.2 |
| H2O cluster version age: | 7 days, 23 hours and 16 minutes |
| H2O cluster name: | H2O_from_python_seb_cy6gp3 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.556 Gb |
| H2O cluster total cores: | 8 |
| H2O cluster allowed cores: | 8 |


## How to use the custom steps

These new steps are not trained by default by `H2O AutoML`; however, the user can pass the `modeling_plan` argument in the `Python` or `R` clients to tell `AutoML` to use them.

Let's first run a simple AutoML job and look at the first modeling steps:

In [4]:
from h2o.automl import H2OAutoML

aml = H2OAutoML(project_name="without_plugin", max_models=15, seed=42)

In [5]:
fr = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
target = "CAPSULE"
train = fr
train[target] = train[target].asfactor()

In [7]:
aml.train(y=target, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [8]:
aml.modeling_steps

[{'name': 'XGBoost',
  'steps': [{'id': 'def_1', 'weight': 10},
   {'id': 'def_2', 'weight': 10},
   {'id': 'def_3', 'weight': 10}]},
 {'name': 'GLM', 'steps': [{'id': 'def_1', 'weight': 10}]},
 {'name': 'DRF', 'steps': [{'id': 'def_1', 'weight': 10}]},
 {'name': 'GBM',
  'steps': [{'id': 'def_1', 'weight': 10},
   {'id': 'def_2', 'weight': 10},
   {'id': 'def_3', 'weight': 10},
   {'id': 'def_4', 'weight': 10},
   {'id': 'def_5', 'weight': 10}]},
 {'name': 'DeepLearning', 'steps': [{'id': 'def_1', 'weight': 10}]},
 {'name': 'DRF', 'steps': [{'id': 'XRT', 'weight': 10}]},
 {'name': 'XGBoost', 'steps': [{'id': 'grid_1', 'weight': 100}]},
 {'name': 'GBM', 'steps': [{'id': 'grid_1', 'weight': 60}]},
 {'name': 'StackedEnsemble',
  'steps': [{'id': 'best', 'weight': 10}, {'id': 'all', 'weight': 10}]}]

As we can see, the default run doesn't contain any step defined in our plugin.
To tell AutoML to use our new steps, we will use the `modeling_plan` property.

In [9]:
# we can decide to add our new steps at the beginning: 
# by default, adding just the provider name will add both the default models and the grids.
new_plan = ["MyGLM", "MyDRF"] + aml.modeling_steps

# it is also possible to be more precise when defining the modeling sequence, 
# for example ensuring that default models are all trained before the grids:
another_plan = [
    ('XGBoost', 'defaults'),
    ('GLM', 'defaults'),
    ('DRF', 'defaults'),
    ('GBM', 'defaults'),
    ('DeepLearning', 'defaults'),
    ('MyGLM', 'grids'),
    ('MyDRF', 'grids'),
    ('XGBoost', 'grids'),
    ('GBM', 'grids'),
    ('DeepLearning', 'grids'),
    'StackedEnsemble'
]

# or even go into further detail,
# for example by tweaking the 'weight' property in the `modeling_plan` to produce more models from the `MyDRF` grid, relative to other grids:
# this currently applies only to grids, when using the `max_runtime_secs` and/or `max_models` constraints.
yet_another_plan = [
    ('XGBoost', 'defaults'),
    ('GLM', 'defaults'),
    ('DRF', 'defaults'),
    ('GBM', 'defaults'),
    ('DeepLearning', 'defaults'),
    ('MyGLM', 'grids'),
    dict(name='MyDRF', steps=[dict(id='grid_1', weight=100)]),
    dict(name='XGBoost', steps=[dict(id='grid_1', weight=60)]),
    ('GBM', 'grids'),
    ('DeepLearning', 'grids'),
    'StackedEnsemble'
]
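
To get a feel for what the weights in `yet_another_plan` mean, the sketch below computes relative shares under the assumption of a purely proportional split between the explicitly weighted grids. This is a simplification for intuition only: it is not taken from the `AutoML` implementation, and the other grids in the plan (whose weights are not set here) are left out. The helper `relative_shares` and the `provider/step` key format are hypothetical.

```python
def relative_shares(weights):
    """Return each step's share of the total weight (proportional assumption)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {step: weight / total for step, weight in weights.items()}


# Weights of the two grids set explicitly in `yet_another_plan`.
grid_weights = {"MyDRF/grid_1": 100, "XGBoost/grid_1": 60}

for step, share in relative_shares(grid_weights).items():
    print(f"{step}: {share:.1%} of the weighted budget")
```

Under this reading, raising the weight of `MyDRF/grid_1` relative to `XGBoost/grid_1` shifts the balance towards the custom grid, which matches the intent stated in the comments above.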

In [10]:
new_plan

['MyGLM',
 'MyDRF',
 {'name': 'XGBoost',
  'steps': [{'id': 'def_1', 'weight': 10},
   {'id': 'def_2', 'weight': 10},
   {'id': 'def_3', 'weight': 10}]},
 {'name': 'GLM', 'steps': [{'id': 'def_1', 'weight': 10}]},
 {'name': 'DRF', 'steps': [{'id': 'def_1', 'weight': 10}]},
 {'name': 'GBM',
  'steps': [{'id': 'def_1', 'weight': 10},
   {'id': 'def_2', 'weight': 10},
   {'id': 'def_3', 'weight': 10},
   {'id': 'def_4', 'weight': 10},
   {'id': 'def_5', 'weight': 10}]},
 {'name': 'DeepLearning', 'steps': [{'id': 'def_1', 'weight': 10}]},
 {'name': 'DRF', 'steps': [{'id': 'XRT', 'weight': 10}]},
 {'name': 'XGBoost', 'steps': [{'id': 'grid_1', 'weight': 100}]},
 {'name': 'GBM', 'steps': [{'id': 'grid_1', 'weight': 60}]},
 {'name': 'StackedEnsemble',
  'steps': [{'id': 'best', 'weight': 10}, {'id': 'all', 'weight': 10}]}]
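
A plan can mix bare provider names, `(name, alias)` tuples and full dictionaries, so it is easy to lose track of which providers it references. The helper below is a small sketch (the function `provider_names` is hypothetical, not part of the `h2o` client) that extracts the provider names in order, for example to confirm that `MyGLM` and `MyDRF` are present before passing the plan to `H2OAutoML`.

```python
def provider_names(plan):
    """Return the provider name of each plan entry, in plan order."""
    names = []
    for entry in plan:
        if isinstance(entry, str):        # e.g. "MyGLM"
            names.append(entry)
        elif isinstance(entry, tuple):    # e.g. ("GBM", "grids")
            names.append(entry[0])
        elif isinstance(entry, dict):     # e.g. {"name": "DRF", "steps": [...]}
            names.append(entry["name"])
        else:
            raise TypeError(f"unsupported plan entry: {entry!r}")
    return names


example_plan = [
    "MyGLM",
    ("GBM", "grids"),
    {"name": "DRF", "steps": [{"id": "def_1", "weight": 10}]},
]
print(provider_names(example_plan))
```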

In [11]:
aml_plugin = H2OAutoML(project_name="with_plugin", max_models=25, modeling_plan=new_plan, seed=42)

In [12]:
aml_plugin.train(y=target, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Verification

Let's now compare the two leaderboards.

The first one contains only models defined by `H2O AutoML`, whereas the second one contains a mix of models defined by both `H2O AutoML` and our plugin.

In [13]:
aml.leaderboard

| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| GLM_1_AutoML_20200128_181828 | 0.808816 | 0.523744 | 0.730139 | 0.273545 | 0.418759 | 0.175359 |
| StackedEnsemble_BestOfFamily_AutoML_20200128_181828 | 0.806858 | 0.528142 | 0.715206 | 0.246826 | 0.41849 | 0.175134 |
| StackedEnsemble_AllModels_AutoML_20200128_181828 | 0.805563 | 0.528472 | 0.709471 | 0.250022 | 0.418364 | 0.175028 |
| XGBoost_3_AutoML_20200128_181828 | 0.801992 | 0.534371 | 0.689133 | 0.239858 | 0.422096 | 0.178165 |
| XGBoost_1_AutoML_20200128_181828 | 0.801877 | 0.536284 | 0.673698 | 0.224511 | 0.422864 | 0.178814 |
| XGBoost_2_AutoML_20200128_181828 | 0.794175 | 0.544483 | 0.685078 | 0.244263 | 0.425622 | 0.181154 |
| DRF_1_AutoML_20200128_181828 | 0.789165 | 0.548811 | 0.686757 | 0.286617 | 0.426756 | 0.182121 |
| GBM_2_AutoML_20200128_181828 | 0.787251 | 0.552955 | 0.692255 | 0.280859 | 0.429891 | 0.184806 |
| GBM_grid__1_AutoML_20200128_181828_model_1 | 0.785206 | 0.552939 | 0.696239 | 0.266793 | 0.430176 | 0.185051 |
| GBM_4_AutoML_20200128_181828 | 0.784602 | 0.552572 | 0.684192 | 0.27248 | 0.431463 | 0.18616 |




In [14]:
aml_plugin.leaderboard.head(30)

| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| GLM_1_AutoML_20200128_181849 | 0.808816 | 0.523744 | 0.730139 | 0.273545 | 0.418759 | 0.175359 |
| StackedEnsemble_BestOfFamily_AutoML_20200128_181849 | 0.808183 | 0.525923 | 0.714758 | 0.252224 | 0.41704 | 0.173922 |
| MyGLM_grid__AutoML_20200128_181849_model_1 | 0.806974 | 0.526215 | 0.724679 | 0.261035 | 0.420071 | 0.17646 |
| XGBoost_3_AutoML_20200128_181849 | 0.801992 | 0.534371 | 0.689133 | 0.239858 | 0.422096 | 0.178165 |
| StackedEnsemble_AllModels_AutoML_20200128_181849 | 0.801992 | 0.537614 | 0.700143 | 0.248956 | 0.422697 | 0.178673 |
| XGBoost_1_AutoML_20200128_181849 | 0.801877 | 0.536284 | 0.673698 | 0.224511 | 0.422864 | 0.178814 |
| XGBoost_grid__1_AutoML_20200128_181849_model_3 | 0.79871 | 0.538929 | 0.68214 | 0.229838 | 0.423159 | 0.179064 |
| XGBoost_grid__1_AutoML_20200128_181849_model_4 | 0.796637 | 0.537556 | 0.703381 | 0.263237 | 0.424004 | 0.179779 |
| XGBoost_2_AutoML_20200128_181849 | 0.794175 | 0.544483 | 0.685078 | 0.244263 | 0.425622 | 0.181154 |
| DRF_1_AutoML_20200128_181849 | 0.789165 | 0.548811 | 0.686757 | 0.286617 | 0.426756 | 0.182121 |
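
To pick out the plugin's models programmatically rather than by eye, one option is to filter model ids by provider name. The sketch below assumes that ids start with the provider name followed by an underscore, as `MyGLM_grid__AutoML_20200128_181849_model_1` does in the leaderboard above; that naming pattern is inferred from this output, not from a documented guarantee. The helper `models_from_providers` is hypothetical.

```python
def models_from_providers(model_ids, providers):
    """Keep the model ids whose prefix matches one of the provider names."""
    prefixes = tuple(name + "_" for name in providers)
    return [model_id for model_id in model_ids if model_id.startswith(prefixes)]


# A few ids copied from the second leaderboard above.
leaderboard_ids = [
    "GLM_1_AutoML_20200128_181849",
    "MyGLM_grid__AutoML_20200128_181849_model_1",
    "XGBoost_3_AutoML_20200128_181849",
    "DRF_1_AutoML_20200128_181849",
]
print(models_from_providers(leaderboard_ids, ["MyGLM", "MyDRF"]))
```

With a live cluster and pandas installed, the ids could be obtained with something like `aml_plugin.leaderboard.as_data_frame()["model_id"].tolist()`. In the ten rows shown here only a `MyGLM` model appears, so `MyDRF` models, if any were trained, would have to be looked for further down the leaderboard.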


