# Gradient Boosted Trees and AutoML

Last updated: Aug 05th 2021

This Gradient Notebook is part of the project *Gradient Boosted Trees and AutoML* at https://github.com/gradient-ai/Gradient-Boosted-Trees-and-AutoML .

Business and other problems not amenable to deep learning are often best solved using well-tuned gradient-boosted decision trees. Like deep learning, these methods can solve arbitrarily complex problems via nonlinear mappings, but they do so without the large training sets and compute-intensive processing that deep learning can require.

This project shows that such methods are supported on Gradient by demonstrating training of **gradient-boosted decision trees** (GBT) using the well-known open source machine learning (ML) library H2O.

We also show H2O's **automated machine learning** (AutoML) capability, which can search the model hyperparameter space. This can both save the user the time required to tune manually and produce better results by finding hyperparameter combinations that the user may miss. AutoML used in this way can surpass even expert human data scientists in some situations.

H2O's AutoML includes within it another well-known GBT library, **XGBoost**.

This project does not aim to show extensive model tuning, large datasets, or specific business problems, but to show the **end-to-end** combination of data preparation, model training, and deployment to production of the H2O model that is enabled within Gradient. We therefore show the commonly used [Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/census+income) from the UCI ML repository.

## Setup
This Notebook runs on the Gradient container `tensorflow/tensorflow:2.4.1-gpu-jupyter`, and requires the installation of H2O, and hence Java.

In [None]:
# Install H2O
!pip install h2o==3.32.1.3

Collecting h2o==3.32.1.3
  Downloading h2o-3.32.1.3.tar.gz (164.8 MB)
     |████████████████████████████████| 164.8 MB 113 kB/s  eta 0:00:01
Collecting tabulate
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
     |████████████████████████████████| 829 kB 36.6 MB/s eta 0:00:01
Collecting colorama>=0.3.8
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Building wheels for collected packages: h2o, future
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.32.1.3-py2.py3-none-any.whl size=164854343 sha256=1540fb7695e05071c9f9bd487c737368bf50fc7480f4dc5d9b6afd5d43f7b8c1
  Stored in directory: /root/.cache/pip/wheels/72/00/18/d1ed0b56eb5efd5e96b48828c07bd131ff8829a6d16fcef39d
  Building wheel for future (setup.py) ... done
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491059 sha256=c79b75cd6b51769fb48f5b00c8c50989e6b8a0a

In [None]:
# Install Java using https://pypi.org/project/install-jdk/
!pip install install-jdk==0.3.0

Collecting install-jdk==0.3.0
  Downloading install-jdk-0.3.0.tar.gz (3.8 kB)
Building wheels for collected packages: install-jdk
  Building wheel for install-jdk (setup.py) ... done
  Created wheel for install-jdk: filename=install_jdk-0.3.0-py3-none-any.whl size=3739 sha256=1f8bc560917a8cd6f620d7aa85b2914ebbf713ff3bff69a081d060d55ca62d6a
  Stored in directory: /root/.cache/pip/wheels/3a/5f/ee/3ff795a99fbd5222097c94dff4535ea4b5c2a91a234daa6611
Successfully built install-jdk
Installing collected packages: install-jdk
Successfully installed install-jdk-0.3.0
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [None]:
# This may show an error if jdk is already installed from a previous run of the notebook,
# but it is OK to proceed

import jdk
jdk.install('11', jre=True)

'/root/.jre/jdk-11.0.11+9-jre'

Add Java to the path so that H2O can find it.

In [None]:
import os
import subprocess

os.environ['PATH'] = "/root/.jre/jdk-11.0.11+9-jre/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
subprocess.run('echo $PATH', shell=True, check=True, stdout=subprocess.PIPE, universal_newlines=True)

CompletedProcess(args='echo $PATH', returncode=0, stdout='/root/.jre/jdk-11.0.11+9-jre/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\n')
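
The cell above writes out the whole `PATH` string by hand. As a minimal sketch of an alternative, the Java `bin` directory can be prepended to whatever `PATH` already exists. The helper name `prepend_to_path` is hypothetical, and the JRE location is assumed to be the one printed by `jdk.install` in the earlier output; the demonstration runs on a plain dictionary so that it does not alter the real environment.

```python
import os

def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH unless it is already listed."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]

# JRE location as printed by jdk.install('11', jre=True) in the output above
java_bin = os.path.join("/root/.jre/jdk-11.0.11+9-jre", "bin")

# Demonstrate on a stand-in dictionary rather than the real os.environ
demo_env = {"PATH": os.pathsep.join(["/usr/local/bin", "/usr/bin"])}
print(prepend_to_path(java_bin, demo_env))
```

Calling `prepend_to_path(java_bin)` with no second argument would modify `os.environ` itself, which is what H2O needs in order to find Java.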

H2O runs as a server, so we start this up.

In [None]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.11" 2021-04-20; OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9); OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpq5ho0wi_
  JVM stdout: /tmp/tmpq5ho0wi_/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpq5ho0wi_/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Etc/GMT
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.3
H2O_cluster_version_age:,2 months and 1 day
H2O_cluster_name:,H2O_from_python_unknownUser_rzhigf
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,6.750 Gb
H2O_cluster_total_cores:,11
H2O_cluster_allowed_cores:,11


## Prepare data
We load the slightly modified version of the income dataset supplied with the repo. This avoids data cleaning steps that are not relevant to this project, such as removing the final empty line.

The original data is at the [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/census+income) .

H2O provides an `import_file` method that enables convenient import of a CSV file to a dataframe. This process is fine here because the data are small.

In [None]:
df = h2o.import_file(path = "../income.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


The data can be viewed. It consists of 14 columns of demographic information of mixed data type, and a binary ground-truth column `yearly-income`.

Our task is to build a binary supervised ML classification model to predict whether a person's income is low (`<=50K`) or high (`>50K`).

This has obvious potential business applications, such as deciding who to market cheap or expensive products to, but we will not explore those here.

In [None]:
df

age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,yearly-income
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K




We can also summarize the dataframe with various statistics particularly useful for the exploratory data science that we are performing, using H2O's `summary()` method. Information includes min/max/spread, but also data type, number of zeros, and number of missing values.

In [None]:
df.summary()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,yearly-income
type,int,enum,int,enum,int,enum,enum,enum,enum,enum,int,int,int,enum,enum
mins,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
mean,38.581646755320776,,189778.36651208502,,10.080679340315099,,,,,,1077.6488437087312,87.303829734959,40.437455852092995,,
maxs,90.0,,1484705.0,,16.0,,,,,,99999.0,4356.0,99.0,,
sigma,13.64043255358134,,105549.97769702224,,2.5727203320673877,,,,,,7385.29208484034,402.96021864899967,12.347428681731843,,
zeros,0,,0,,0,,,,,,29849,31042,0,,
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,39.0,State-gov,77516.0,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
2,38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K


We separate the feature columns (1-14) from the label in column 15 (`yearly-income`).

In [None]:
# Feature columns and label
y = "yearly-income"
x = df.columns
x.remove(y)  # drop the label by name so that x holds only the 14 feature columns
print(x)

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']


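We then split the data into training, validation, and testing sets. With `ratios = [0.6,0.2]`, roughly 60% of the rows go to training and 20% to validation, which leaves the remainder for testing.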

In H2O, the datasets are put into their *hex* format, which improves performance.

In [None]:
# Split
train, valid, test = df.split_frame(
    ratios = [0.6,0.2],
    seed = 123456,
    destination_frames=['train.hex','valid.hex','test.hex']
)
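
To illustrate why split sizes are approximate rather than exact, here is a minimal pure-Python sketch of a probabilistic split, in which each row is assigned to a partition by an independent random draw. This is an illustration of the idea only, not H2O's implementation; the function name `probabilistic_split` and the row count of 10,000 are assumptions made for the example.

```python
import random

def probabilistic_split(n_rows, ratios, seed):
    """Assign each row index to a partition by a uniform random draw."""
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. two boundaries near 0.6 and 0.8 for ratios [0.6, 0.2]
    bounds = []
    total = 0.0
    for r in ratios:
        total += r
        bounds.append(total)
    # One partition per ratio, plus a final one for the remainder
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        idx = sum(u >= b for b in bounds)  # number of boundaries at or below the draw
        parts[idx].append(row)
    return parts

# 10,000 rows is an arbitrary illustrative size, not the size of the income dataset
train_rows, valid_rows, test_rows = probabilistic_split(10_000, [0.6, 0.2], seed=123456)
print(len(train_rows), len(valid_rows), len(test_rows))
```

The three partitions always cover every row exactly once, but their sizes only approximate the requested 60/20/20 proportions.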

## Train the model using AutoML

Model training can then be performed using AutoML. Here we cap the search at 20 models. The training takes a few minutes to run, which can be measured by placing `%%time` as the first command in the cell.

In [None]:
%%time

# Run AutoML
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%
CPU times: user 12.9 s, sys: 416 ms, total: 13.3 s
Wall time: 9min 43s
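
Since the run above took close to ten minutes of wall time, it may be useful to bound the search by time as well as by model count. The following is a sketch of how such settings could be collected; `max_runtime_secs` and `sort_metric` are `H2OAutoML` arguments, but the values chosen for them here are illustrative assumptions, not settings used in this notebook. The dictionary itself runs without H2O, and the commented lines show where it would be passed.

```python
# Keyword arguments for H2OAutoML. Only max_models and seed match the cell above;
# the other values are illustrative assumptions.
automl_settings = {
    "max_models": 20,         # same cap on the number of models as above
    "max_runtime_secs": 600,  # assumed overall time budget of 10 minutes
    "sort_metric": "AUC",     # rank the leaderboard by AUC
    "seed": 1,
}

# With the H2O server running, these would be used as:
#   aml = H2OAutoML(**automl_settings)
#   aml.train(x=x, y=y, training_frame=train)
print(automl_settings)
```

When both limits are given, the search stops at whichever is reached first.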


We see from the searched models that a variety of configurations have been tried, including:

 - Regular GBT (aka. GBM, gradient boosting machine)
 - XGBoost model with grid of hyperparameter values
 - A deep learning model
 - Random forest
 - Stacked ensembles of models (stacking = feed model output into next model input)

For full details of the models searched in AutoML, see [H2O's AutoML documentation](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html).

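We also see in the table various metrics for model performance. Because no separate leaderboard frame is passed to `train()`, these are computed on cross-validation data, and the top model's `auc` here matches the cross-validation AUC reported for the leader further down. The leaderboard is ordered by `auc`, the area under the ROC curve, which plots the true positive rate against the false positive rate. Other [metrics](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html?#classification) shown include logarithmic loss, area under the precision-recall curve, and mean squared error.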
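
To make two of these metrics concrete, here is a minimal pure-Python sketch of AUC and logarithmic loss on a four-row toy example. The data and function names are made up for illustration, and this is not how H2O computes its metrics internally; AUC is written here in its pairwise-ranking form, which is equivalent to the area under the ROC curve.

```python
import math

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def log_loss(labels, scores):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, s in zip(labels, scores):
        total += y * math.log(s) + (1 - y) * math.log(1 - s)
    return -total / len(labels)

# Toy data: 1 = positive class, scores = predicted probability of the positive class
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc(labels, scores)       # 3 of the 4 pairs are ordered correctly: 0.75
logloss_value = log_loss(labels, scores)
print(auc_value, round(logloss_value, 4))
```

A higher AUC is better, with 1.0 meaning every positive is scored above every negative, whereas a lower log loss is better.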

Gradient includes support for [tracking model metrics](https://docs.paperspace.com/gradient/data/metrics-overview), both in model experimentation and production.


In [None]:
lb = h2o.automl.get_leaderboard(aml, extra_columns = 'ALL')
lb.head(rows=lb.nrows)

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse,training_time_ms,predict_time_per_row_ms,algo
StackedEnsemble_AllModels_AutoML_20210721_040517,0.927516,0.280334,0.828239,0.16976,0.298357,0.0890171,2791,0.043219,StackedEnsemble
StackedEnsemble_BestOfFamily_AutoML_20210721_040517,0.926968,0.281188,0.827822,0.175579,0.298627,0.0891782,1563,0.017421,StackedEnsemble
XGBoost_grid__1_AutoML_20210721_040517_model_4,0.925388,0.283678,0.824562,0.176857,0.300212,0.0901273,1132,0.003136,XGBoost
GBM_1_AutoML_20210721_040517,0.925145,0.285528,0.823719,0.172024,0.30061,0.0903661,1852,0.013197,GBM
XGBoost_3_AutoML_20210721_040517,0.925114,0.284771,0.822262,0.16712,0.301022,0.0906144,1428,0.00234,XGBoost
GBM_2_AutoML_20210721_040517,0.924923,0.286024,0.823077,0.173164,0.300807,0.0904848,1602,0.01191,GBM
GBM_3_AutoML_20210721_040517,0.92415,0.287323,0.821873,0.173311,0.301484,0.0908926,1586,0.011957,GBM
GBM_grid__1_AutoML_20210721_040517_model_1,0.922166,0.291503,0.816787,0.179959,0.303526,0.092128,1373,0.012912,GBM
XGBoost_grid__1_AutoML_20210721_040517_model_3,0.922007,0.290389,0.816119,0.168792,0.303951,0.0923859,1641,0.002229,XGBoost
GBM_4_AutoML_20210721_040517,0.921542,0.292358,0.816709,0.167862,0.304239,0.0925616,1693,0.010229,GBM




The best model is the stacked ensemble of all models, and we can see its properties in more detail. These include further metrics on model performance, such as the F-score (the harmonic mean of precision and recall), and the confusion matrix between predicted and ground-truth labels, showing true and false positives and negatives. The information is shown for the training data, and then for the cross-validation data.
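
As a worked check of how the F-score relates to the confusion matrix, the sketch below recomputes precision, recall, and F1 from the counts in the training-data confusion matrix printed in the output of the next cell, treating `>50K` as the positive class. The function name is an assumption made for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the training-data confusion matrix, with >50K as the positive class:
#   true positives  = actual >50K  predicted >50K  = 1899
#   false positives = actual <=50K predicted >50K  = 514
#   false negatives = actual >50K  predicted <=50K = 539
tp, fp, fn = 1899, 514, 539

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

The resulting F1 agrees with the `max f1` value of 0.7829314 listed in the training-data Maximum Metrics table, which is expected because the confusion matrix is reported at that threshold.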

In [None]:
aml.leader

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_AutoML_20210721_040517

No model summary for this model

ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.0741088222473043
RMSE: 0.2722293559616676
LogLoss: 0.23663590818870123
Null degrees of freedom: 10046
Residual degrees of freedom: 10040
Null deviance: 11134.613562297909
Residual deviance: 4754.961939143763
AIC: 4768.961939143763
AUC: 0.9520188734229607
AUCPR: 0.8804096756815729
Gini: 0.9040377468459213

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.42508909310298126: 


,<=50K,>50K,Error,Rate
<=50K,7095.0,514.0,0.0676,(514.0/7609.0)
>50K,539.0,1899.0,0.2211,(539.0/2438.0)
Total,7634.0,2413.0,0.1048,(1053.0/10047.0)



Maximum Metrics: Maximum metrics at their respective thresholds


metric,threshold,value,idx
max f1,0.4250891,0.7829314,188.0
max f2,0.2373109,0.8465888,263.0
max f0point5,0.5626342,0.8147846,138.0
max accuracy,0.4800965,0.8970837,166.0
max precision,0.9983993,1.0,0.0
max recall,0.0066038,1.0,390.0
max specificity,0.9983993,1.0,0.0
max absolute_mcc,0.4475590,0.7148985,179.0
max min_per_class_accuracy,0.3109446,0.8742279,233.0



Gains/Lift Table: Avg response rate: 24.27 %, avg score: 24.31 %


group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100528,0.9975005,4.1210008,4.1210008,1.0,0.9981390,1.0,0.9981390,0.0414274,0.0414274,312.1000820,312.1000820,0.0414274
2,0.0200060,0.9955189,4.1210008,4.1210008,1.0,0.9965840,1.0,0.9973654,0.0410172,0.0824446,312.1000820,312.1000820,0.0824446
3,0.0300587,0.9925287,4.1210008,4.1210008,1.0,0.9941371,1.0,0.9962857,0.0414274,0.1238720,312.1000820,312.1000820,0.1238720
4,0.0400119,0.9865375,4.1210008,4.1210008,1.0,0.9901623,1.0,0.9947625,0.0410172,0.1648893,312.1000820,312.1000820,0.1648893
5,0.0500647,0.9751929,4.1210008,4.1210008,1.0,0.9810366,1.0,0.9920064,0.0414274,0.2063167,312.1000820,312.1000820,0.2063167
6,0.1000299,0.7868145,3.8993534,4.0102874,0.9462151,0.8820496,0.9731343,0.9370827,0.1948318,0.4011485,289.9353366,301.0287365,0.3976001
7,0.1499950,0.6548474,3.1769468,3.7326915,0.7709163,0.7201090,0.9057731,0.8648061,0.1587367,0.5598852,217.6946848,273.2691519,0.5412230
8,0.2000597,0.5201886,2.7446029,3.4854236,0.6660040,0.5836761,0.8457711,0.7944537,0.1374077,0.6972929,174.4602932,248.5423579,0.6565516
9,0.2999900,0.3224050,1.6869834,2.8863413,0.4093625,0.4109287,0.7003981,0.6666969,0.1685808,0.8658737,68.6983404,188.6341318,0.7471984




ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.08901706244242294
RMSE: 0.2983572731515405
LogLoss: 0.2803344921150826
Null degrees of freedom: 19679
Residual degrees of freedom: 19673
Null deviance: 21801.12781023197
Residual deviance: 11033.965609649653
AIC: 11047.965609649653
AUC: 0.9275155885620863
AUCPR: 0.8282390204996729
Gini: 0.8550311771241725

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.37399972153697447: 


,<=50K,>50K,Error,Rate
<=50K,13297.0,1612.0,0.1081,(1612.0/14909.0)
>50K,1104.0,3667.0,0.2314,(1104.0/4771.0)
Total,14401.0,5279.0,0.138,(2716.0/19680.0)



Maximum Metrics: Maximum metrics at their respective thresholds


metric,threshold,value,idx
max f1,0.3739997,0.7297512,211.0
max f2,0.1484512,0.8111916,303.0
max f0point5,0.6492051,0.7655269,118.0
max accuracy,0.5068508,0.8718496,163.0
max precision,0.9981910,1.0,0.0
max recall,0.0010301,1.0,398.0
max specificity,0.9981910,1.0,0.0
max absolute_mcc,0.4183810,0.6423844,194.0
max min_per_class_accuracy,0.2818292,0.8434291,246.0



Gains/Lift Table: Avg response rate: 24.24 %, avg score: 24.25 %


group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100102,0.9966790,4.1249214,4.1249214,1.0,0.9977215,1.0,0.9977215,0.0412911,0.0412911,312.4921400,312.4921400,0.0412911
2,0.0200203,0.9942997,4.1249214,4.1249214,1.0,0.9954928,1.0,0.9966071,0.0412911,0.0825823,312.4921400,312.4921400,0.0825823
3,0.0300305,0.9909762,4.0830440,4.1109623,0.9898477,0.9928464,0.9966159,0.9953536,0.0408719,0.1234542,308.3044026,311.0962275,0.1233201
4,0.0400407,0.9826635,4.1039827,4.1092174,0.9949239,0.9873531,0.9961929,0.9933535,0.0410815,0.1645357,310.3982713,310.9217385,0.1643345
5,0.05,0.9680361,4.1038759,4.1081534,0.9948980,0.9760964,0.9959350,0.9899161,0.0408719,0.2054077,310.3875883,310.8153427,0.2051394
6,0.1,0.7752008,3.5925383,3.8503458,0.8709350,0.8652481,0.9334350,0.9275821,0.1796269,0.3850346,259.2538252,285.0345839,0.3762479
7,0.15,0.6401044,2.9302033,3.5436317,0.7103659,0.7091983,0.8590786,0.8547875,0.1465102,0.5315447,193.0203312,254.3631664,0.5036421
8,0.2,0.5108817,2.3349403,3.2414588,0.5660569,0.5750862,0.7858232,0.7848622,0.1167470,0.6482918,133.4940264,224.1458814,0.5917487
9,0.3,0.3186673,1.6202054,2.7010410,0.3927846,0.4087394,0.6548103,0.6594879,0.1620205,0.8103123,62.0205408,170.1041012,0.6736163







## Model performance on testing set
The measure of a model's likely performance in production is its performance on unseen data. It is therefore common to hold out a portion of the data as a testing set and to measure the model's performance against the ground truth for that set.

We can do this here by running the model's predictions on the testing data (class probabilities), and analyzing its performance via the `model_performance()` method. This shows similar information to the `aml.leader` output above. We see that the model generalizes quite well to the test data: the test AUC of about 0.934 is close to the cross-validation AUC of about 0.928.

In [None]:
model = aml.leader
predictions = model.predict(test)

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [None]:
predictions

predict,<=50K,>50K
<=50K,0.909669,0.090331
>50K,0.538777,0.461223
>50K,0.0122868,0.987713
<=50K,0.997832,0.00216811
<=50K,0.976717,0.0232834
<=50K,0.939607,0.0603929
<=50K,0.715487,0.284513
<=50K,0.998162,0.00183815
<=50K,0.711692,0.288308
<=50K,0.995555,0.00444468
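
Note that the second row above is labeled `>50K` even though its `>50K` probability is below 0.5, so the predicted label is evidently not a simple majority vote. The sketch below shows how a decision threshold turns a probability into a label. The threshold of 0.374 is an assumption, taken from the cross-validation `max f1` row earlier; the notebook does not state which threshold H2O applies, and the function name is hypothetical.

```python
def label_from_probability(p_high, threshold):
    """Return the predicted class given the probability of the >50K class."""
    return ">50K" if p_high >= threshold else "<=50K"

# (predicted label, probability of >50K) pairs copied from rows of the table above
rows = [
    ("<=50K", 0.090331),
    (">50K", 0.461223),
    (">50K", 0.987713),
    ("<=50K", 0.284513),
    ("<=50K", 0.288308),
]

# Assumption: the cross-validation max-F1 threshold (0.3739997, rounded)
threshold = 0.374

relabeled = [label_from_probability(p, threshold) for _, p in rows]
print(relabeled)
```

With this assumed threshold, the relabeled rows agree with the `predict` column for every row copied here. Any threshold between about 0.29 and 0.46 would do the same for these rows, so the agreement is consistent with the assumption but does not confirm it.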




In [None]:
model.model_performance(test)


ModelMetricsBinomialGLM: stackedensemble
** Reported on test data. **

MSE: 0.08495665870546716
RMSE: 0.29147325555780784
LogLoss: 0.26725783945281806
Null degrees of freedom: 6384
Residual degrees of freedom: 6378
Null deviance: 6997.844809115317
Residual deviance: 3412.8826098124864
AIC: 3426.8826098124864
AUC: 0.9337852820189617
AUCPR: 0.8371945381466177
Gini: 0.8675705640379234

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.39554547194115: 


,<=50K,>50K,Error,Rate
<=50K,4410.0,460.0,0.0945,(460.0/4870.0)
>50K,350.0,1165.0,0.231,(350.0/1515.0)
Total,4760.0,1625.0,0.1269,(810.0/6385.0)



Maximum Metrics: Maximum metrics at their respective thresholds


metric,threshold,value,idx
max f1,0.3955455,0.7420382,198.0
max f2,0.2023667,0.8112445,275.0
max f0point5,0.6903900,0.7737241,98.0
max accuracy,0.4918037,0.8815975,163.0
max precision,0.9984129,1.0,0.0
max recall,0.0084844,1.0,386.0
max specificity,0.9984129,1.0,0.0
max absolute_mcc,0.4748854,0.6613660,169.0
max min_per_class_accuracy,0.2891045,0.8475248,239.0



Gains/Lift Table: Avg response rate: 23.73 %, avg score: 24.08 %


group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0100235,0.9971625,4.2145215,4.2145215,1.0,0.9980595,1.0,0.9980595,0.0422442,0.0422442,321.4521452,321.4521452,0.0422442
2,0.0200470,0.9946995,4.2145215,4.2145215,1.0,0.9959285,1.0,0.9969940,0.0422442,0.0844884,321.4521452,321.4521452,0.0844884
3,0.0300705,0.9913300,4.2145215,4.2145215,1.0,0.9932932,1.0,0.9957604,0.0422442,0.1267327,321.4521452,321.4521452,0.1267327
4,0.0400940,0.9831519,4.2145215,4.2145215,1.0,0.9879531,1.0,0.9938086,0.0422442,0.1689769,321.4521452,321.4521452,0.1689769
5,0.0501175,0.9646815,4.2145215,4.2145215,1.0,0.9751623,1.0,0.9900793,0.0422442,0.2112211,321.4521452,321.4521452,0.2112211
6,0.1000783,0.7748916,3.6860548,3.9507016,0.8746082,0.8631609,0.9374022,0.9267194,0.1841584,0.3953795,268.6054812,295.0701643,0.3871660
7,0.1500392,0.6425027,2.9594132,3.6206171,0.7021944,0.7073270,0.8590814,0.8536649,0.1478548,0.5432343,195.9413183,262.0617072,0.5155136
8,0.2,0.5069073,2.5234282,3.3465347,0.5987461,0.5746836,0.7940486,0.7839742,0.1260726,0.6693069,152.3428205,234.6534653,0.6153028
9,0.3000783,0.3145347,1.6027053,2.7649548,0.3802817,0.4050825,0.6560543,0.6576111,0.1603960,0.8297030,60.2705341,176.4954836,0.6943847
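
The `lift` column in the Gains/Lift table above can be read as the response rate within a group divided by the overall response rate. As a sketch of that arithmetic, using counts from the test-data confusion matrix (1515 actual `>50K` rows out of 6385) and the response rate of 1.0 shown for the top group; the function name is an assumption made for this example:

```python
def lift(group_response_rate, overall_response_rate):
    """How many times more positives a group contains than a random sample would."""
    return group_response_rate / overall_response_rate

# From the test-data confusion matrix: 1515 actual >50K rows out of 6385 in total
positives, total = 1515, 6385
overall = positives / total

# The top-scored group in the table has a response rate of 1.0 (all rows are >50K)
top_group_lift = lift(1.0, overall)
print(round(overall, 4), round(top_group_lift, 4))
```

The overall rate corresponds to the "Avg response rate: 23.73 %" in the table heading, and the computed lift matches the 4.2145215 shown for the first group.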







## Save the model for deployment

Finally, for a model to be put into production, it needs to be saved in a form that can be accessed later. H2O has several model formats, but the one [preferred for production](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html) is MOJO (Model Object, Optimized). This allows the most general functionality and datatypes to be passed.

The model is output as a .zip file, and setting `get_genmodel_jar=True` also downloads its single Java dependency, `h2o-genmodel.jar`. Java knowledge is therefore required to proceed to production deployment, but the format allows significant flexibility in where it can be deployed.

The location that we save the model to is the Gradient-provided storage corresponding to this notebook, at `/storage`.

In the command line section of this project (refer back to https://github.com/gradient-ai/Gradient-Boosted-Trees-and-AutoML), we will deploy this model on Gradient as a REST endpoint, and send inference data to it.

In [None]:
modelfile = model.download_mojo(path="/storage", get_genmodel_jar=True)
print("Model saved to " + modelfile)

Model saved to /storage/StackedEnsemble_AllModels_AutoML_20210721_040517.zip


## Conclusions

We have shown how to:

 - Set up Java and H2O on Gradient
 - Load and prepare a small dataset (UCI Census Income)
 - Train gradient-boosted decision trees and other models using H2O's AutoML
 - Evaluate model performance on unseen testing data
 - Save the model so that it can be deployed to production

## Next Steps
To see the Workflow portion of this project, or to deploy the model using the command line, refer back to the project GitHub repo at https://github.com/gradient-ai/Gradient-Boosted-Trees-and-AutoML .