# H2O GBM Tuning Tutorial for Python
### Navdeep Gill, M.S., Hacker/Data Scientist, H2O.ai

In this tutorial, we show how to build a well-tuned H2O GBM model for a supervised classification task. We specifically don't focus on feature engineering and use a small dataset to allow you to reproduce these results in a few minutes on a laptop. This script can be directly transferred to datasets that are hundreds of GBs large and H2O clusters with dozens of compute nodes.

You can download the source [from H2O's github repository](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb).

## Installation of the H2O Python Package
Either download H2O from [H2O.ai's website](http://h2o.ai/download) or install the latest version of H2O into Python with the following set of commands:

In [None]:
#Install dependencies from command line (prepending with `sudo` if needed):
[sudo] pip install -U requests
[sudo] pip install -U tabulate
[sudo] pip install -U future
[sudo] pip install -U six

# The following command removes the H2O module for Python.
[sudo] pip uninstall h2o
# Next, use pip to install this version of the H2O Python module.
[sudo] pip install http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/6/Python/h2o-3.8.2.6-py2.py3-none-any.whl

## Launch an H2O cluster on localhost

In [294]:
import h2o
import numpy as np
import math
h2o.init(nthreads=-1,strict_version_check=False)
## optional: connect to a running H2O cluster
#h2o.init(ip="mycluster", port=55555) 



No instance found at ip and port: localhost:54321. Trying to start local jar...


JVM stdout: /var/folders/55/rj4cny_s29q4vn1wjt_x08sm0000gn/T/tmpDuptFg/h2o_navdeepgill_started_from_python.out
JVM stderr: /var/folders/55/rj4cny_s29q4vn1wjt_x08sm0000gn/T/tmpW95XBX/h2o_navdeepgill_started_from_python.err
Using ice_root: /var/folders/55/rj4cny_s29q4vn1wjt_x08sm0000gn/T/tmpos3v9A


Java Version: java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)


Starting H2O JVM and connecting: ................ Connection successful!


0,1
H2O cluster uptime:,1 seconds 832 milliseconds
H2O cluster version:,3.8.2.6
H2O cluster name:,H2O_started_from_python_navdeepgill_btt708
H2O cluster total nodes:,1
H2O cluster total free memory:,3.56 GB
H2O cluster total cores:,8
H2O cluster allowed cores:,8
H2O cluster healthy:,True
H2O Connection ip:,127.0.0.1
H2O Connection port:,54321


## Import the data into H2O 
Everything is scalable and distributed from now on. All processing is done on the fully multi-threaded and distributed H2O Java-based backend and can be scaled to large datasets on large compute clusters.
Here, we use a small public dataset ([Titanic](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html)), but you can use datasets that are hundreds of GBs large.

In [295]:
## 'path' can point to a local file, hdfs, s3, nfs, Hive, directories, etc.
df = h2o.import_file(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
print df.dim
print df.head
print df.tail
print df.describe

## pick a response for the supervised problem
response = "survived"

## the response variable is an integer, we will turn it into a categorical/factor for binary classification
df[response] = df[response].asfactor()           

## use all other columns (except for the name & the response column ("survived")) as predictors
predictors = df.columns
del predictors[1:3]
print predictors


Parse Progress: [##################################################] 100%
[1309, 14]


pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,Allen Miss. Elisabeth Walton,female,29.0,0,0,24160.0,211.338,B5,S,2.0,,St Louis MO
1,1,Allison Master. Hudson Trevor,male,0.9167,1,2,113781.0,151.55,C22 C26,S,11.0,,Montreal PQ / Chesterville ON
1,0,Allison Miss. Helen Loraine,female,2.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,0,Allison Mr. Hudson Joshua Creighton,male,30.0,1,2,113781.0,151.55,C22 C26,S,,135.0,Montreal PQ / Chesterville ON
1,0,Allison Mrs. Hudson J C (Bessie Waldo Daniels),female,25.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,1,Anderson Mr. Harry,male,48.0,0,0,19952.0,26.55,E12,S,3.0,,New York NY
1,1,Andrews Miss. Kornelia Theodosia,female,63.0,1,0,13502.0,77.9583,D7,S,10.0,,Hudson NY
1,0,Andrews Mr. Thomas Jr,male,39.0,0,0,112050.0,0.0,A36,S,,,Belfast NI
1,1,Appleton Mrs. Edward Dale (Charlotte Lamson),female,53.0,2,0,11769.0,51.4792,C101,S,,,Bayside Queens NY
1,0,Artagaveytia Mr. Ramon,male,71.0,0,0,,49.5042,,C,,22.0,Montevideo Uruguay


<bound method H2OFrame.head of >


pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,Allen Miss. Elisabeth Walton,female,29.0,0,0,24160.0,211.338,B5,S,2.0,,St Louis MO
1,1,Allison Master. Hudson Trevor,male,0.9167,1,2,113781.0,151.55,C22 C26,S,11.0,,Montreal PQ / Chesterville ON
1,0,Allison Miss. Helen Loraine,female,2.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,0,Allison Mr. Hudson Joshua Creighton,male,30.0,1,2,113781.0,151.55,C22 C26,S,,135.0,Montreal PQ / Chesterville ON
1,0,Allison Mrs. Hudson J C (Bessie Waldo Daniels),female,25.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,1,Anderson Mr. Harry,male,48.0,0,0,19952.0,26.55,E12,S,3.0,,New York NY
1,1,Andrews Miss. Kornelia Theodosia,female,63.0,1,0,13502.0,77.9583,D7,S,10.0,,Hudson NY
1,0,Andrews Mr. Thomas Jr,male,39.0,0,0,112050.0,0.0,A36,S,,,Belfast NI
1,1,Appleton Mrs. Edward Dale (Charlotte Lamson),female,53.0,2,0,11769.0,51.4792,C101,S,,,Bayside Queens NY
1,0,Artagaveytia Mr. Ramon,male,71.0,0,0,,49.5042,,C,,22.0,Montevideo Uruguay


<bound method H2OFrame.tail of >


pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,Allen Miss. Elisabeth Walton,female,29.0,0,0,24160.0,211.338,B5,S,2.0,,St Louis MO
1,1,Allison Master. Hudson Trevor,male,0.9167,1,2,113781.0,151.55,C22 C26,S,11.0,,Montreal PQ / Chesterville ON
1,0,Allison Miss. Helen Loraine,female,2.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,0,Allison Mr. Hudson Joshua Creighton,male,30.0,1,2,113781.0,151.55,C22 C26,S,,135.0,Montreal PQ / Chesterville ON
1,0,Allison Mrs. Hudson J C (Bessie Waldo Daniels),female,25.0,1,2,113781.0,151.55,C22 C26,S,,,Montreal PQ / Chesterville ON
1,1,Anderson Mr. Harry,male,48.0,0,0,19952.0,26.55,E12,S,3.0,,New York NY
1,1,Andrews Miss. Kornelia Theodosia,female,63.0,1,0,13502.0,77.9583,D7,S,10.0,,Hudson NY
1,0,Andrews Mr. Thomas Jr,male,39.0,0,0,112050.0,0.0,A36,S,,,Belfast NI
1,1,Appleton Mrs. Edward Dale (Charlotte Lamson),female,53.0,2,0,11769.0,51.4792,C101,S,,,Bayside Queens NY
1,0,Artagaveytia Mr. Ramon,male,71.0,0,0,,49.5042,,C,,22.0,Montevideo Uruguay


<bound method H2OFrame.describe of >
[u'pclass', u'sex', u'age', u'sibsp', u'parch', u'ticket', u'fare', u'cabin', u'embarked', u'boat', u'body', u'home.dest']


From now on, everything is generic and directly applies to most datasets. We assume that all feature engineering is done at this stage and focus on model tuning. For multi-class problems, you can use `h2o.logloss()` or `h2o.confusion_matrix()` instead of `h2o.auc()` and for regression problems, you can use `h2o.mean_residual_deviance()` or `h2o.mse()`.

## Split the data for Machine Learning
We split the data into three pieces: 60% for training, 20% for validation, 20% for final testing. 
Here, we use random splitting, but this assumes i.i.d. data. If this is not the case (e.g., when events span across multiple rows or data has a time structure), you'll have to sample your data non-randomly.

In [296]:
train, valid, test = df.split_frame(ratios=[0.6,0.2], seed=1234)

## Establish baseline performance
As the first step, we'll build some default models to see what accuracy we can expect. Let's use the [AUC metric](http://mlwiki.org/index.php/ROC_Analysis) for this demo, but you can use `h2o.logloss()` and `stopping_metric="logloss"` as well. It ranges from 0.5 for random models to 1 for perfect models.


The first model is a default GBM, trained on the 60% training split

In [297]:
#We only provide the required parameters, everything else is default
gbm = h2o.H2OGradientBoostingEstimator(distribution='bernoulli')
gbm.train(x=predictors, y=response, training_frame=train)

## Show a detailed model summary
print gbm


gbm Model Build Progress: [##################################################] 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Method
Model Key:  GBM_model_python_1464825551794_1

Model Summary: 


0,1,2,3,4,5,6,7,8
,number_of_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,50.0,27316.0,5.0,5.0,5.0,10.0,21.0,15.58




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.0347141508428
R^2: 0.853514801928
LogLoss: 0.135724386711
AUC: 0.990369609999
Gini: 0.980739219997

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.394257736927: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,461.0,18.0,0.0376,(18.0/479.0)
1,13.0,288.0,0.0432,(13.0/301.0)
Total,474.0,306.0,0.0397,(31.0/780.0)



Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.3942577,0.9489292,220.0
max f2,0.3942577,0.9536424,220.0
max f0point5,0.6059331,0.9682188,197.0
max accuracy,0.4073598,0.9602564,218.0
max precision,0.9945845,1.0,0.0
max recall,0.0608964,1.0,280.0
max specificity,0.9945845,1.0,0.0
max absolute_MCC,0.3942577,0.9164872,220.0
max min_per_class_accuracy,0.3942577,0.9568106,220.0



Gains/Lift Table: Avg response rate: 38.59 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0102564,0.9792763,2.5913621,2.5913621,1.0,1.0,0.0265781,0.0265781,159.1362126,159.1362126
,2,0.0205128,0.9772307,2.5913621,2.5913621,1.0,1.0,0.0265781,0.0531561,159.1362126,159.1362126
,3,0.0307692,0.9749052,2.5913621,2.5913621,1.0,1.0,0.0265781,0.0797342,159.1362126,159.1362126
,4,0.0410256,0.9735284,2.5913621,2.5913621,1.0,1.0,0.0265781,0.1063123,159.1362126,159.1362126
,5,0.0525641,0.9727573,2.5913621,2.5913621,1.0,1.0,0.0299003,0.1362126,159.1362126,159.1362126
,6,0.1,0.9705040,2.5913621,2.5913621,1.0,1.0,0.1229236,0.2591362,159.1362126,159.1362126
,7,0.15,0.9690187,2.5913621,2.5913621,1.0,1.0,0.1295681,0.3887043,159.1362126,159.1362126
,8,0.2,0.9666881,2.5913621,2.5913621,1.0,1.0,0.1295681,0.5182724,159.1362126,159.1362126
,9,0.3,0.9286900,2.5913621,2.5913621,1.0,1.0,0.2591362,0.7774086,159.1362126,159.1362126




Scoring History: 


0,1,2,3,4,5,6,7,8
,timestamp,duration,number_of_trees,training_MSE,training_logloss,training_AUC,training_lift,training_classification_error
,2016-06-01 16:59:31,0.020 sec,0.0,0.2369806,0.6668775,0.5,1.0,0.6141026
,2016-06-01 16:59:31,0.226 sec,1.0,0.2022753,0.5942842,0.9685634,2.5913621,0.0782051
,2016-06-01 16:59:31,0.333 sec,2.0,0.1744836,0.5361788,0.9760888,2.5913621,0.0692308
,2016-06-01 16:59:31,0.436 sec,3.0,0.1520389,0.4883150,0.9760888,2.5913621,0.0692308
,2016-06-01 16:59:31,0.526 sec,4.0,0.1335759,0.4476644,0.9767338,2.5913621,0.0692308
---,---,---,---,---,---,---,---,---
,2016-06-01 16:59:32,1.518 sec,46.0,0.0351867,0.1373187,0.9900159,2.5913621,0.0410256
,2016-06-01 16:59:32,1.535 sec,47.0,0.0350182,0.1368714,0.9901338,2.5913621,0.0410256
,2016-06-01 16:59:32,1.553 sec,48.0,0.0349213,0.1365776,0.9902794,2.5913621,0.0410256



See the whole table with table.as_data_frame()

Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
boat,608.0037231,1.0,0.5470799
home.dest,379.0013123,0.6233536,0.3410242
cabin,35.3481979,0.0581381,0.0318062
sex,28.0973225,0.0462124,0.0252819
ticket,21.2186050,0.0348988,0.0190924
embarked,14.9458570,0.0245819,0.0134482
fare,11.1052256,0.0182651,0.0099924
age,5.4856591,0.0090224,0.0049360
parch,3.7122171,0.0061056,0.0033402





The AUC is over 94%, so this model is highly predictive!

In [298]:
## Get the AUC on the validation set
perf = gbm.model_performance(valid)
print perf.auc()

0.943195266272


The second model is another default GBM, but trained on 80% of the data (here, we combine the training and validation splits to get more training data), and cross-validated using 4 folds.
Note that cross-validation takes longer and is not usually done for really large datasets.

In [299]:
## rbind() makes a copy here, so it's better to use split_frame with `ratios = c(0.8)` instead above
cv_gbm = h2o.H2OGradientBoostingEstimator(distribution='bernoulli',nfolds = 4, seed = 0xDECAF)
cv_gbm.train(x = predictors, y = response, training_frame = train.rbind(valid))

## Show a detailed summary of the cross validation metrics
## This gives you an idea of the variance between the folds
print cv_gbm.cross_validation_metrics_summary

## Get the cross-validated AUC by scoring the combined holdout predictions.
## (Instead of taking the average of the metrics across the folds)
perf_cv = cv_gbm.model_performance()
print perf_cv.auc()


gbm Model Build Progress: [##################################################] 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Method
Model Key:  GBM_model_python_1464825551794_105

Model Summary: 


0,1,2,3,4,5,6,7,8
,number_of_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,50.0,33300.0,5.0,5.0,5.0,11.0,22.0,18.92




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.0345774118897
R^2: 0.853993340225
LogLoss: 0.128413772379
AUC: 0.989262148027
Gini: 0.978524296053

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.339643844673: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,624.0,24.0,0.037,(24.0/648.0)
1,20.0,386.0,0.0493,(20.0/406.0)
Total,644.0,410.0,0.0417,(44.0/1054.0)



Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.3396438,0.9460784,193.0
max f2,0.2788154,0.9512195,200.0
max f0point5,0.5307025,0.9714747,179.0
max accuracy,0.3396438,0.9582543,193.0
max precision,0.9957889,1.0,0.0
max recall,0.0635094,1.0,252.0
max specificity,0.9957889,1.0,0.0
max absolute_MCC,0.3396438,0.9120532,193.0
max min_per_class_accuracy,0.3190823,0.9532020,195.0



Gains/Lift Table: Avg response rate: 38.52 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0113852,0.9955574,2.5960591,2.5960591,1.0,1.0,0.0295567,0.0295567,159.6059113,159.6059113
,2,0.0237192,0.9954868,2.5960591,2.5960591,1.0,1.0,0.0320197,0.0615764,159.6059113,159.6059113
,3,0.0303605,0.9952492,2.5960591,2.5960591,1.0,1.0,0.0172414,0.0788177,159.6059113,159.6059113
,4,0.0407970,0.9949399,2.5960591,2.5960591,1.0,1.0,0.0270936,0.1059113,159.6059113,159.6059113
,5,0.0502846,0.9947559,2.5960591,2.5960591,1.0,1.0,0.0246305,0.1305419,159.6059113,159.6059113
,6,0.1005693,0.9938667,2.5960591,2.5960591,1.0,1.0,0.1305419,0.2610837,159.6059113,159.6059113
,7,0.1499051,0.9847828,2.5960591,2.5960591,1.0,1.0,0.1280788,0.3891626,159.6059113,159.6059113
,8,0.2001898,0.9803967,2.5960591,2.5960591,1.0,1.0,0.1305419,0.5197044,159.6059113,159.6059113
,9,0.2998102,0.9459382,2.5960591,2.5960591,1.0,1.0,0.2586207,0.7783251,159.6059113,159.6059113




ModelMetricsBinomial: gbm
** Reported on cross-validation data. **

MSE: 0.0719141389606
R^2: 0.696335189756
LogLoss: 0.259459936663
AUC: 0.940343155142
Gini: 0.880686310284

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.527612429749: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,625.0,23.0,0.0355,(23.0/648.0)
1,68.0,338.0,0.1675,(68.0/406.0)
Total,693.0,361.0,0.0863,(91.0/1054.0)



Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.5276124,0.8813559,159.0
max f2,0.2332222,0.8820686,220.0
max f0point5,0.6511963,0.9241020,136.0
max accuracy,0.5333012,0.9136622,157.0
max precision,0.9968387,1.0,0.0
max recall,0.0055336,1.0,396.0
max specificity,0.9968387,1.0,0.0
max absolute_MCC,0.5333012,0.8174812,157.0
max min_per_class_accuracy,0.2692503,0.8842593,211.0



Gains/Lift Table: Avg response rate: 38.52 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0104364,0.9960152,2.5960591,2.5960591,1.0,1.0,0.0270936,0.0270936,159.6059113,159.6059113
,2,0.0218216,0.9956228,2.5960591,2.5960591,1.0,1.0,0.0295567,0.0566502,159.6059113,159.6059113
,3,0.0303605,0.9949225,2.5960591,2.5960591,1.0,1.0,0.0221675,0.0788177,159.6059113,159.6059113
,4,0.0407970,0.9943393,2.5960591,2.5960591,1.0,1.0,0.0270936,0.1059113,159.6059113,159.6059113
,5,0.0502846,0.9939988,2.5960591,2.5960591,1.0,1.0,0.0246305,0.1305419,159.6059113,159.6059113
,6,0.1005693,0.9830256,2.5960591,2.5960591,1.0,1.0,0.1305419,0.2610837,159.6059113,159.6059113
,7,0.1499051,0.9735360,2.5461349,2.5796284,0.9807692,0.9936709,0.1256158,0.3866995,154.6134900,157.9628359
,8,0.2001898,0.9564909,2.5470769,2.5714519,0.9811321,0.9905213,0.1280788,0.5147783,154.7076866,157.1451918
,9,0.2998102,0.7542109,2.4477129,2.5303361,0.9428571,0.9746835,0.2438424,0.7586207,144.7712878,153.0336098




Cross-Validation Metrics Summary: 


0,1,2,3,4,5,6
,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid
F0point5,0.9127705,0.0107794,0.9204546,0.9183673,0.8867521,0.9255079
F1,0.8876407,0.0054374,0.8756757,0.8959276,0.8924731,0.8864865
F2,0.8646252,0.0169603,0.8350515,0.8745583,0.8982684,0.8506224
accuracy,0.9174827,0.0030382,0.9151291,0.9118774,0.9230769,0.9198473
auc,0.9432912,0.0077477,0.9298538,0.9361525,0.9578804,0.9492781
err,0.0825173,0.0030382,0.0848708,0.0881226,0.0769231,0.0801527
err_count,21.75,0.9185587,23.0,23.0,20.0,21.0
lift_top_group,2.6130292,0.1474290,2.71,2.269565,2.826087,2.6464646
logloss,0.2594696,0.0162408,0.2549429,0.2922707,0.2278113,0.2628536



Scoring History: 


0,1,2,3,4,5,6,7,8
,timestamp,duration,number_of_trees,training_MSE,training_logloss,training_AUC,training_lift,training_classification_error
,2016-06-01 16:59:45,2.076 sec,0.0,0.2368208,0.6665521,0.5,1.0,0.6148008
,2016-06-01 16:59:45,2.085 sec,1.0,0.2017114,0.5930715,0.9709850,2.5960591,0.0721063
,2016-06-01 16:59:45,2.093 sec,2.0,0.1736422,0.5343418,0.9710363,2.5960591,0.0721063
,2016-06-01 16:59:45,2.101 sec,3.0,0.1510759,0.4861838,0.9710135,2.5960591,0.0721063
,2016-06-01 16:59:45,2.109 sec,4.0,0.1328107,0.4460070,0.9765516,2.5960591,0.0673624
---,---,---,---,---,---,---,---,---
,2016-06-01 16:59:45,2.648 sec,46.0,0.0350350,0.1309968,0.9892279,2.5960591,0.0426945
,2016-06-01 16:59:45,2.664 sec,47.0,0.0349514,0.1305076,0.9892127,2.5960591,0.0426945
,2016-06-01 16:59:45,2.683 sec,48.0,0.0348895,0.1297943,0.9892089,2.5960591,0.0426945



See the whole table with table.as_data_frame()

Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
boat,853.3854370,1.0,0.5441471
home.dest,561.1301270,0.6575342,0.3577953
cabin,44.6142235,0.0522791,0.0284475
sex,43.1090775,0.0505154,0.0274878
ticket,16.0098724,0.0187604,0.0102084
embarked,14.3029938,0.0167603,0.0091201
parch,10.1755810,0.0119238,0.0064883
fare,9.9404049,0.0116482,0.0063383
age,9.5454817,0.0111854,0.0060865


<bound method H2OGradientBoostingEstimator.cross_validation_metrics_summary of >
0.989262148027


Next, we train a GBM with "I feel lucky" parameters.
We'll use early stopping to automatically tune the number of trees using the validation AUC. 
We'll use a lower learning rate (lower is always better, just takes more trees to converge).
We'll also use stochastic sampling of rows and columns to (hopefully) improve generalization.

In [300]:
gbm_lucky = h2o.H2OGradientBoostingEstimator(
  ## more trees is better if the learning rate is small enough 
  ## here, use "more than enough" trees - we have early stopping
  ntrees = 10000,                                                            

  ## smaller learning rate is better (this is a good value for most datasets, but see below for annealing)
  learn_rate=0.01,                                                         

  ## early stopping once the validation AUC doesn't improve by at least 0.01% for 5 consecutive scoring events
  stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC", 

  ## sample 80% of rows per tree
  sample_rate = 0.8,                                                       

  ## sample 80% of columns per split
  col_sample_rate = 0.8,                                                   

  ## fix a random number generator seed for reproducibility
  seed = 1234,                                                             

  ## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
  score_tree_interval = 10)

In [301]:
gbm_lucky.train(x=predictors, y=response, training_frame=train, validation_frame=valid)
perf_lucky = gbm_lucky.model_performance(valid)
print perf_lucky.auc()


gbm Model Build Progress: [##################################################] 100%
0.93933502395


This model doesn't seem to be much better than the previous models with an AUC of ~.93.

## Hyper-Parameter Search

Next, we'll do real hyper-parameter optimization to see if we can beat the best AUC so far (around 94%).

The key here is to start tuning some key parameters first (i.e., those that we expect to have the biggest impact on the results). From experience with gradient boosted trees across many datasets, we can state the following "rules":

1. Build as many trees (`ntrees`) as it takes until the validation set error starts increasing.
2. A lower learning rate (`learn_rate`) is generally better, but will require more trees. Using `learn_rate=0.02 `and `learn_rate_annealing=0.995` (reduction of learning rate with each additional tree) can help speed up convergence without sacrificing accuracy too much, and is great to hyper-parameter searches. For faster scans, use values of 0.05 and 0.99 instead.
3. The optimum maximum allowed depth for the trees (`max_depth`) is data dependent, deeper trees take longer to train, especially at depths greater than 10.
4. Row and column sampling (`sample_rate` and `col_sample_rate`) can improve generalization and lead to lower validation and test set errors. Good general values for large datasets are around 0.7 to 0.8 (sampling 70-80 percent of the data) for both parameters. Column sampling per tree (`col_sample_rate_per_tree`) can also be tuned. Note that it is multiplicative with `col_sample_rate`, so setting both parameters to 0.8 results in 64% of columns being considered at any given node to split.
5. For highly imbalanced classification datasets (e.g., fewer buyers than non-buyers), stratified row sampling based on response class membership can help improve predictive accuracy.  It is configured with `sample_rate_per_class` (array of ratios, one per response class in lexicographic order).
6. Most other options only have a small impact on the model performance, but are worth tuning with a Random hyper-parameter search nonetheless, if highest performance is critical.

First we want to know what value of `max_depth` to use because it has a big impact on the model training time and optimal values depend strongly on the dataset.
We'll do a quick Cartesian grid search to get a rough idea of good candidate `max_depth` values. Each model in the grid search will use early stopping to tune the number of trees using the validation set AUC, as before.
We'll use learning rate annealing to speed up convergence without sacrificing too much accuracy.

In [302]:
## Depth 10 is usually plenty of depth for most datasets, but you never know
hyper_params = {'max_depth' : range(1,30,2)}
#hyper_params = {max_depth = c(4,6,8,12,16,20)} ##faster for larger datasets

#Build initial GBM Model
gbm_grid = h2o.H2OGradientBoostingEstimator(distribution='bernoulli',
                                    ## more trees is better if the learning rate is small enough 
                                    ## here, use "more than enough" trees - we have early stopping
                                    ntrees=10000,
                                    ## smaller learning rate is better
                                    ## since we have learning_rate_annealing, we can afford to start with a 
                                    #bigger learning rate
                                    learn_rate=0.05,
                                    ## sample 80% of rows per tree
                                    sample_rate = 0.8,
                                    ## sample 80% of columns per split
                                    col_sample_rate = 0.8,
                                    ## fix a random number generator seed for reproducibility
                                    seed = 1234,
                                    ## score every 10 trees to make early stopping reproducible 
                                    #(it depends on the scoring interval)
                                    score_tree_interval = 10, 
                                    ## early stopping once the validation AUC doesn't improve by at least 0.01% for 
                                    #5 consecutive scoring events
                                    stopping_rounds = 5,
                                    stopping_metric = "AUC",
                                    stopping_tolerance = 1e-4)

#Build grid search with previously made GBM and hyper parameters
grid = h2o.H2OGridSearch(gbm_grid,hyper_params,
                         grid_id = 'depth_grid',
                         search_criteria = {'strategy': "Cartesian"})


#Train grid search
grid.train(x=predictors, 
           y=response,
           ## learning rate annealing: learning_rate shrinks by 1% after every tree 
           ## (use 1.00 to disable, but then lower the learning_rate)
           learn_rate_annealing = 0.99,
           training_frame=train,
           validation_frame = valid)


gbm Grid Build Progress: [##################################################] 100%


In [303]:
## by default, display the grid search results sorted by increasing logloss (since this is a classification task)
print grid

      max_depth            model_ids   logloss
0            25  depth_grid_model_12  0.215749
1            23  depth_grid_model_11  0.216639
2            29  depth_grid_model_14  0.218503
3            13   depth_grid_model_6  0.218825
4            19   depth_grid_model_9  0.218849
5            27  depth_grid_model_13  0.219126
6            21  depth_grid_model_10  0.219794
7            17   depth_grid_model_8  0.220059
8            11   depth_grid_model_5  0.221247
9             9   depth_grid_model_4  0.222040
10           15   depth_grid_model_7  0.222911
11            7   depth_grid_model_3  0.225722
12            5   depth_grid_model_2  0.232193
13            3   depth_grid_model_1  0.264067
14            1   depth_grid_model_0  0.324154



In [304]:
## sort the grid models by decreasing AUC
sorted_grid = grid.sort_by('auc(valid=True)',increasing=False)
print(sorted_grid)


Grid Search Results for H2OGradientBoostingEstimator: 


0,1,2
Model Id,Hyperparameters: [max_depth],auc(valid=True)
depth_grid_model_13,[27],0.9565793
depth_grid_model_12,[25],0.9563539
depth_grid_model_14,[29],0.9562412
depth_grid_model_10,[21],0.9546633
depth_grid_model_9,[19],0.9544942
depth_grid_model_6,[13],0.9543815
depth_grid_model_11,[23],0.9540434
depth_grid_model_5,[11],0.9521837
depth_grid_model_7,[15],0.9517892





In [305]:
# find the range of the max_depth for the top ten models
top_depths = sorted_grid[0:5]
max_depths = top_depths['Hyperparameters: [max_depth]']
print max_depths

# get the max depths as a list
max_min_list = []
for element in max_depths:
    max_min_list.append(element[0])

([27], [25], [29], [21], [19])


It appears that `max_depth` values of 19 to 29 are best suited for this dataset, which is unusally deep!

In [306]:
new_max = max(max_min_list)
new_min = min(max_min_list)

print "MaxDepth", new_max
print "MinDepth", new_min

MaxDepth 29
MinDepth 19


Now that we know a good range for max_depth, we can tune all other parameters in more detail. Since we don't know what combinations of hyper-parameters will result in the best model, we'll use random hyper-parameter search to "let the machine get luckier than a best guess of any human".

In [307]:
# create hyperameter and search criteria lists
# variable used in the dictionary:
log_val = math.log(train.nrow,2)-1

hyper_params_tune = {'max_depth' : list(np.arange(new_min,new_max+1,1)),
                'sample_rate': list(np.arange(0.2,1.01,0.01)),
                'col_sample_rate' : list(np.arange(0.2,1,0.01)),
                'col_sample_rate_per_tree': list(np.arange(0.2,1.01,0.01)),
                'col_sample_rate_change_per_level': list(np.arange(0.9,1.10,0.01)),
                'min_rows': list(2**np.arange(0,log_val,1)),
                'nbins': list(2**np.arange(4,11,1)),
                'nbins_cats': list(2**np.arange(4,13,1)),
                'min_split_improvement': [0,1e-8,1e-6,1e-4],
                'histogram_type': ["UniformAdaptive","QuantilesGlobal","RoundRobin"]}
search_criteria_tune = {'strategy': "RandomDiscrete",
                   'max_runtime_secs': 36000,
                   'max_models' :50,
                   'seed' : 1234,
                   'stopping_rounds' : 5,
                   'stopping_metric' : "AUC",
                   }

In [308]:
gbm_grid_tune = h2o.H2OGradientBoostingEstimator(distribution='bernoulli',
                                    ## more trees is better if the learning rate is small enough 
                                    ## here, use "more than enough" trees - we have early stopping
                                    ntrees=10000,
                                    ## smaller learning rate is better
                                    ## since we have learning_rate_annealing, we can afford to start with a 
                                    #bigger learning rate
                                    learn_rate=0.05,
                                    ## fix a random number generator seed for reproducibility
                                    seed = 1234,
                                    ## score every 10 trees to make early stopping reproducible 
                                    #(it depends on the scoring interval)
                                    score_tree_interval = 10, 
                                    ## early stopping once the validation AUC doesn't improve by at least 0.01% for 
                                    #5 consecutive scoring events
                                    stopping_rounds = 5,
                                    stopping_metric = "AUC",
                                    stopping_tolerance = 1e-4)
            
#Build grid search with previously made GBM and hyper parameters
grid_tune = h2o.H2OGridSearch(gbm_grid_tune,hyper_params = hyper_params_tune,
                                    grid_id = 'grid_tune',
                                    search_criteria = search_criteria_tune)
#Train grid search
grid_tune.train(x=predictors, 
           y=response,
           ## learning rate annealing: learning_rate shrinks by 1% after every tree 
           ## (use 1.00 to disable, but then lower the learning_rate)
           learn_rate_annealing = 0.99,
           training_frame=train,
           validation_frame = valid)

print grid_tune


gbm Grid Build Progress: [##################################################] 100%
       histogram_type  nbins_cats  sample_rate  nbins  min_rows  \
0     UniformAdaptive          64         0.50   1024         4   
1          RoundRobin          64         0.73    256         4   
2          RoundRobin          32         0.37   1024         2   
3          RoundRobin         128         0.73     64         1   
4     UniformAdaptive          16         0.71    256         1   
5     UniformAdaptive          32         0.30    512        16   
6          RoundRobin          64         0.37    128         4   
7          RoundRobin         256         0.85     32         1   
8          RoundRobin         128         0.94    256        16   
9     UniformAdaptive         512         0.99   1024        32   
10         RoundRobin        4096         0.81   1024         4   
11    UniformAdaptive        1024         0.66    256         8   
12         RoundRobin        2048         0.3

We can see that the best models have even better validation AUCs than our previous best models, so the random grid search was successful!

In [309]:
## Sort the grid models by AUC
sorted_grid_tune = grid_tune.sort_by('auc(valid=True)',increasing=False)
print sorted_grid_tune


Grid Search Results for H2OGradientBoostingEstimator: 


0,1,2
Model Id,"Hyperparameters: [nbins, col_sample_rate, min_split_improvement, col_sample_rate_per_tree, min_rows, col_sample_rate_change_per_level, nbins_cats, sample_rate, histogram_type, max_depth]",auc(valid=True)
grid_tune_model_32,"[64, 0.3300000000000001, 0.0001, 0.8800000000000006, 1.0, 0.93, 128, 0.7300000000000004, u'RoundRobin', 29]",0.9703860
grid_tune_model_41,"[256, 0.4300000000000002, 1e-06, 0.8800000000000006, 4.0, 1.0500000000000003, 64, 0.7300000000000004, u'RoundRobin', 23]",0.9701606
grid_tune_model_43,"[1024, 0.7900000000000005, 0.0001, 0.9300000000000006, 4.0, 1.04, 64, 0.5000000000000002, u'UniformAdaptive', 25]",0.9699352
grid_tune_model_8,"[256, 0.34000000000000014, 0.0001, 0.8000000000000005, 16.0, 1.02, 128, 0.9400000000000006, u'RoundRobin', 24]",0.9686391
grid_tune_model_19,"[32, 0.3200000000000001, 0.0, 0.7000000000000004, 1.0, 1.0100000000000002, 256, 0.8500000000000005, u'RoundRobin', 21]",0.9683573
---,---,---
grid_tune_model_24,"[256, 0.9000000000000006, 1e-08, 0.5900000000000003, 256.0, 0.9400000000000001, 128, 0.7400000000000004, u'QuantilesGlobal', 27]",0.8131586
grid_tune_model_33,"[256, 0.46000000000000024, 1e-08, 0.4000000000000002, 256.0, 1.0300000000000002, 256, 0.6600000000000004, u'QuantilesGlobal', 19]",0.8127360
grid_tune_model_16,"[32, 0.9800000000000006, 0.0001, 0.3300000000000001, 256.0, 0.9900000000000001, 1024, 0.34000000000000014, u'UniformAdaptive', 27]",0.7916314



See the whole table with table.as_data_frame()



We can inspect the best 5 models from the grid search explicitly, and query their validation AUC:

In [310]:
print sorted_grid_tune["auc(valid=True)"][0:5]

(0.9703860242321781, 0.970160608622147, 0.9699351930121162, 0.9686390532544378, 0.9683572837418991)


## Model Inspection and Final Test Set Scoring

Let's see how well the best model of the grid search (as judged by validation set AUC) does on the held out test set:

In [311]:
#Get the best model from the list (the model name listed at the top of the table)
best_model = h2o.get_model('grid_tune_model_32')
test_performance_model = best_model.model_performance(test)

Good news. It does as well on the test set as on the validation set, so it looks like our best GBM model generalizes well to the unseen test set:

In [312]:
#Get the performance on the test model
print test_performance_model.auc()

0.975816043346


We can inspect the winning model's parameters:

In [313]:
for key, value in best_model.params.iteritems():
    print key,value['actual']

learn_rate 0.05
fold_column None
col_sample_rate_per_tree 0.88
learn_rate_annealing 0.99
score_tree_interval 10
sample_rate_per_class None
seed 1234
keep_cross_validation_predictions False
model_id {u'URL': u'/4/Models/grid_tune_model_32', u'type': u'Key<Model>', u'name': u'grid_tune_model_32', u'__meta': {u'schema_name': u'ModelKeyV3', u'schema_version': 3, u'schema_type': u'Key<Model>'}}
nfolds 0
max_abs_leafnode_pred 1.79769313486e+308
offset_column None
quantile_alpha 0.5
stopping_tolerance 0.0001
fold_assignment AUTO
training_frame {u'URL': u'/4/Frames/py_44', u'type': u'Key<Frame>', u'name': u'py_44', u'__meta': {u'schema_name': u'FrameKeyV3', u'schema_version': 3, u'schema_type': u'Key<Frame>'}}
max_runtime_secs 35916.603
checkpoint None
balance_classes False
r2_stopping 0.999999
validation_frame {u'URL': u'/4/Frames/py_45', u'type': u'Key<Frame>', u'name': u'py_45', u'__meta': {u'schema_name': u'FrameKeyV3', u'schema_version': 3, u'schema_type': u'Key<Frame>'}}
max_depth 29
res

Now we can confirm that these parameters are generally sound, by building a GBM model on the whole dataset (instead of the 60%) and using internal 5-fold cross-validation (re-using all other parameters including the seed):

In [314]:
gbm_best = h2o.H2OGradientBoostingEstimator(distribution='bernoulli',
                                    ntrees=10000,
                                    learn_rate=0.05,
                                    col_sample_rate = 0.33,
                                    col_sample_rate_change_per_level = 0.93,
                                    col_sample_rate_per_tree = 0.88,
                                    seed = 1234,
                                    sample_rate = 0.73,
                                    score_tree_interval = 10, 
                                    stopping_rounds = 5,
                                    stopping_metric = "AUC",
                                    stopping_tolerance = 1e-4,
                                    nbins_cats = 128,
                                    histogram_type = "RoundRobin",
                                    min_split_improvement = 0.0001,
                                    nfolds = 5)

In [315]:
gbm_best.train(x=predictors, y=response,learn_rate_annealing = 0.99, training_frame=df)


gbm Model Build Progress: [##################################################] 100%


In [316]:
print gbm_best.cross_validation_metrics_summary

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Method
Model Key:  GBM_model_python_1464825551794_7134

Model Summary: 


0,1,2,3,4,5,6,7,8
,number_of_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,238.0,67561.0,5.0,5.0,5.0,9.0,27.0,18.718487




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.024742035678
R^2: 0.895191574696
LogLoss: 0.0946858792667
AUC: 0.99643881335
Gini: 0.9928776267

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431132213513: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,802.0,7.0,0.0087,(7.0/809.0)
1,30.0,470.0,0.06,(30.0/500.0)
Total,832.0,477.0,0.0283,(37.0/1309.0)



Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.4311322,0.9621290,132.0
max f2,0.2400137,0.9648082,175.0
max f0point5,0.4967490,0.9830508,124.0
max accuracy,0.4967490,0.9717341,124.0
max precision,0.9996552,1.0,0.0
max recall,0.0988633,1.0,234.0
max specificity,0.9996552,1.0,0.0
max absolute_MCC,0.4967490,0.9408723,124.0
max min_per_class_accuracy,0.3115916,0.9653894,161.0



Gains/Lift Table: Avg response rate: 38.20 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0106952,0.9996445,2.618,2.618,1.0,1.0,0.028,0.028,161.8,161.8
,2,0.0206264,0.9995825,2.618,2.618,1.0,1.0,0.026,0.054,161.8,161.8
,3,0.0305577,0.9995164,2.618,2.618,1.0,1.0,0.026,0.08,161.8,161.8
,4,0.0404889,0.9994217,2.618,2.618,1.0,1.0,0.026,0.106,161.8,161.8
,5,0.0504202,0.9993512,2.618,2.618,1.0,1.0,0.026,0.132,161.8,161.8
,6,0.1000764,0.9989560,2.618,2.618,1.0,1.0,0.13,0.262,161.8,161.8
,7,0.1504966,0.9984189,2.618,2.618,1.0,1.0,0.132,0.394,161.8,161.8
,8,0.2001528,0.9974748,2.618,2.618,1.0,1.0,0.13,0.524,161.8,161.8
,9,0.3002292,0.9283516,2.618,2.618,1.0,1.0,0.262,0.786,161.8,161.8




ModelMetricsBinomial: gbm
** Reported on cross-validation data. **

MSE: 0.0548944822053
R^2: 0.767464394898
LogLoss: 0.190785731053
AUC: 0.96914091471
Gini: 0.938281829419

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.570170412897: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,793.0,16.0,0.0198,(16.0/809.0)
1,71.0,429.0,0.142,(71.0/500.0)
Total,864.0,445.0,0.0665,(87.0/1309.0)



Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.5701704,0.9079365,119.0
max f2,0.2341802,0.9030232,199.0
max f0point5,0.6942130,0.9494536,100.0
max accuracy,0.5701704,0.9335371,119.0
max precision,0.9998896,1.0,0.0
max recall,0.0157905,1.0,367.0
max specificity,0.9998896,1.0,0.0
max absolute_MCC,0.5701704,0.8597688,119.0
max min_per_class_accuracy,0.2814202,0.9072930,185.0



Gains/Lift Table: Avg response rate: 38.20 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0106952,0.9999194,2.618,2.618,1.0,1.0,0.028,0.028,161.8,161.8
,2,0.0206264,0.9998643,2.618,2.618,1.0,1.0,0.026,0.054,161.8,161.8
,3,0.0305577,0.9997939,2.618,2.618,1.0,1.0,0.026,0.08,161.8,161.8
,4,0.0404889,0.9996700,2.618,2.618,1.0,1.0,0.026,0.106,161.8,161.8
,5,0.0504202,0.9995073,2.618,2.618,1.0,1.0,0.026,0.132,161.8,161.8
,6,0.1000764,0.9986523,2.618,2.618,1.0,1.0,0.13,0.262,161.8,161.8
,7,0.1504966,0.9969341,2.618,2.618,1.0,1.0,0.132,0.394,161.8,161.8
,8,0.2001528,0.9933988,2.5777231,2.6080076,0.9846154,0.9961832,0.128,0.522,157.7723077,160.8007634
,9,0.3002292,0.8673018,2.5980153,2.6046768,0.9923664,0.9949109,0.26,0.782,159.8015267,160.4676845




Cross-Validation Metrics Summary: 


0,1,2,3,4,5,6,7
,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
F0point5,0.9464863,0.0028340,0.9423077,0.9466019,0.9447005,0.954023,0.9447983
F1,0.9115442,0.0094239,0.9158878,0.8914286,0.9010989,0.9222222,0.9270833
F2,0.879416,0.0171493,0.8909091,0.8423326,0.8613445,0.8924731,0.9100205
accuracy,0.9367068,0.0050753,0.9325843,0.9298893,0.9302326,0.9448819,0.9459459
auc,0.969806,0.0051657,0.9580933,0.9658036,0.9707908,0.9786164,0.975726
err,0.0632932,0.0050753,0.0674157,0.0701107,0.0697674,0.0551181,0.0540541
err_count,16.6,1.5231546,18.0,19.0,18.0,14.0,14.0
lift_top_group,2.6258688,0.0998947,2.3839285,2.8229167,2.632653,2.6736841,2.6161616
logloss,0.1903914,0.0178253,0.2237620,0.1981992,0.2069530,0.1559702,0.1670726



Scoring History: 


0,1,2,3,4,5,6,7,8
,timestamp,duration,number_of_trees,training_MSE,training_logloss,training_AUC,training_lift,training_classification_error
,2016-06-01 17:03:46,3.233 sec,0.0,0.2360691,0.6650208,0.5,1.0,0.6180290
,2016-06-01 17:03:46,3.260 sec,10.0,0.1436176,0.4688668,0.9713708,2.618,0.0733384
,2016-06-01 17:03:46,3.279 sec,20.0,0.0964926,0.3553781,0.9736193,2.618,0.0710466
,2016-06-01 17:03:46,3.302 sec,30.0,0.0763702,0.2958328,0.9764215,2.618,0.0672269
,2016-06-01 17:03:46,3.327 sec,40.0,0.0633225,0.2504598,0.9789098,2.618,0.0634072
---,---,---,---,---,---,---,---,---
,2016-06-01 17:03:47,4.054 sec,200.0,0.0280089,0.1064794,0.9948888,2.618,0.0320856
,2016-06-01 17:03:47,4.214 sec,210.0,0.0270032,0.1030859,0.9953053,2.618,0.0313216
,2016-06-01 17:03:47,4.284 sec,220.0,0.0263122,0.1002153,0.9956984,2.618,0.0290298



See the whole table with table.as_data_frame()

Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
boat,1276.1845703,1.0,0.5200953
sex,443.5411987,0.3475525,0.1807605
fare,116.3196335,0.0911464,0.0474048
cabin,104.0673981,0.0815457,0.0424116
ticket,102.6859894,0.0804633,0.0418486
age,92.1821823,0.0722326,0.0375679
home.dest,79.9298096,0.0626319,0.0325745
pclass,78.4117355,0.0614423,0.0319559
body,52.3205261,0.0409976,0.0213227


<bound method H2OGradientBoostingEstimator.cross_validation_metrics_summary of >


Keeping the same "best" model, we can make test set predictions as follows:

In [317]:
gbm = h2o.get_model('grid_tune_model_32')
preds = gbm.predict(test)
preds.head()
#gbm@model$validation_metrics@metrics$max_criteria_and_metric_scores


gbm prediction Progress: [##################################################] 100%


predict,p0,p1
0,0.97708,0.0229197
0,0.987267,0.0127335
0,0.755024,0.244976
1,0.0164609,0.983539
1,0.0164099,0.98359
0,0.77207,0.22793
1,0.0581372,0.941863
1,0.019667,0.980333
1,0.0273786,0.972621
0,0.962789,0.0372105




Note that the label (survived or not) is predicted as well (in the first predict column), and it uses the threshold with the highest F1 score (here: 0.4811222) to make labels from the probabilities for survival (p1). The probability for death (p0) is given for convenience, as it is just 1-p1.

In [318]:
gbm.model_performance

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Method
Model Key:  grid_tune_model_32

Model Summary: 


0,1,2,3,4,5,6,7,8
,number_of_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,130.0,350279.0,12.0,25.0,18.007692,64.0,329.0,222.06154




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.0103541333846
R^2: 0.956308097912
LogLoss: 0.0617033059212
AUC: 0.999868219366
Gini: 0.999736438732

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.393763950946: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,477.0,2.0,0.0042,(2.0/479.0)
1,1.0,300.0,0.0033,(1.0/301.0)
Total,478.0,302.0,0.0038,(3.0/780.0)



Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.3937640,0.9950249,173.0
max f2,0.3663009,0.9966887,177.0
max f0point5,0.4567333,0.9953240,170.0
max accuracy,0.3937640,0.9961538,173.0
max precision,0.9848078,1.0,0.0
max recall,0.3663009,1.0,177.0
max specificity,0.9848078,1.0,0.0
max absolute_MCC,0.3937640,0.9918937,173.0
max min_per_class_accuracy,0.3937640,0.9958246,173.0



Gains/Lift Table: Avg response rate: 38.59 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0102564,0.9845962,2.5913621,2.5913621,1.0,1.0,0.0265781,0.0265781,159.1362126,159.1362126
,2,0.0205128,0.9844180,2.5913621,2.5913621,1.0,1.0,0.0265781,0.0531561,159.1362126,159.1362126
,3,0.0307692,0.9841610,2.5913621,2.5913621,1.0,1.0,0.0265781,0.0797342,159.1362126,159.1362126
,4,0.0410256,0.9839015,2.5913621,2.5913621,1.0,1.0,0.0265781,0.1063123,159.1362126,159.1362126
,5,0.05,0.9836057,2.5913621,2.5913621,1.0,1.0,0.0232558,0.1295681,159.1362126,159.1362126
,6,0.1,0.9822774,2.5913621,2.5913621,1.0,1.0,0.1295681,0.2591362,159.1362126,159.1362126
,7,0.15,0.9798541,2.5913621,2.5913621,1.0,1.0,0.1295681,0.3887043,159.1362126,159.1362126
,8,0.2,0.9761270,2.5913621,2.5913621,1.0,1.0,0.1295681,0.5182724,159.1362126,159.1362126
,9,0.3,0.9319450,2.5913621,2.5913621,1.0,1.0,0.2591362,0.7774086,159.1362126,159.1362126




ModelMetricsBinomial: gbm
** Reported on validation data. **

MSE: 0.0504959021911
R^2: 0.786360645089
LogLoss: 0.190423970435
AUC: 0.970386024232
Gini: 0.940772048464

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.481122236956: 


0,1,2,3,4
,0.0,1.0,Error,Rate
0,167.0,2.0,0.0118,(2.0/169.0)
1,13.0,92.0,0.1238,(13.0/105.0)
Total,180.0,94.0,0.0547,(15.0/274.0)



Maximum Metrics: Maximum metrics at their respective thresholds



0,1,2,3
metric,threshold,value,idx
max f1,0.4811222,0.9246231,93.0
max f2,0.0857740,0.9082734,135.0
max f0point5,0.6209273,0.9619450,91.0
max accuracy,0.6209273,0.9452555,91.0
max precision,0.9845675,1.0,0.0
max recall,0.0173907,1.0,254.0
max specificity,0.9845675,1.0,0.0
max absolute_MCC,0.6209273,0.8861050,91.0
max min_per_class_accuracy,0.2501180,0.9142857,109.0



Gains/Lift Table: Avg response rate: 38.32 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0109489,0.9837571,2.6095238,2.6095238,1.0,1.0,0.0285714,0.0285714,160.9523810,160.9523810
,2,0.0218978,0.9832969,2.6095238,2.6095238,1.0,1.0,0.0285714,0.0571429,160.9523810,160.9523810
,3,0.0328467,0.9823665,2.6095238,2.6095238,1.0,1.0,0.0285714,0.0857143,160.9523810,160.9523810
,4,0.0401460,0.9796526,2.6095238,2.6095238,1.0,1.0,0.0190476,0.1047619,160.9523810,160.9523810
,5,0.0510949,0.9786444,2.6095238,2.6095238,1.0,1.0,0.0285714,0.1333333,160.9523810,160.9523810
,6,0.1021898,0.9751848,2.6095238,2.6095238,1.0,1.0,0.1333333,0.2666667,160.9523810,160.9523810
,7,0.1496350,0.9707483,2.6095238,2.6095238,1.0,1.0,0.1238095,0.3904762,160.9523810,160.9523810
,8,0.2007299,0.9386312,2.6095238,2.6095238,1.0,1.0,0.1333333,0.5238095,160.9523810,160.9523810
,9,0.2992701,0.8001349,2.6095238,2.6095238,1.0,1.0,0.2571429,0.7809524,160.9523810,160.9523810




Scoring History: 


0,1,2,3,4,5,6,7,8,9,10,11,12,13
,timestamp,duration,number_of_trees,training_MSE,training_logloss,training_AUC,training_lift,training_classification_error,validation_MSE,validation_logloss,validation_AUC,validation_lift,validation_classification_error
,2016-06-01 17:02:24,1 min 23.402 sec,0.0,0.2369806,0.6668775,0.5,1.0,0.6141026,0.2363677,0.6656298,0.5,1.0,0.6167883
,2016-06-01 17:02:24,1 min 23.475 sec,10.0,0.1143514,0.4046775,0.9983458,2.5913621,0.0166667,0.1411322,0.4617331,0.9578755,2.6095238,0.0729927
,2016-06-01 17:02:24,1 min 23.524 sec,20.0,0.0646629,0.2798682,0.9987516,2.5913621,0.0179487,0.0966420,0.3540958,0.9653142,2.6095238,0.0693431
,2016-06-01 17:02:24,1 min 23.579 sec,30.0,0.0428596,0.2123646,0.9992371,2.5913621,0.0128205,0.0783698,0.3006264,0.9689772,2.6095238,0.0620438
,2016-06-01 17:02:24,1 min 23.642 sec,40.0,0.0306994,0.1677444,0.9995006,2.5913621,0.0102564,0.0685441,0.2669445,0.9692026,2.6095238,0.0656934
,2016-06-01 17:02:24,1 min 23.705 sec,50.0,0.0236269,0.1368956,0.9996532,2.5913621,0.0089744,0.0615856,0.2411748,0.9708932,2.6095238,0.0620438
,2016-06-01 17:02:24,1 min 23.773 sec,60.0,0.0192380,0.1153113,0.9997226,2.5913621,0.0076923,0.0577490,0.2248451,0.9705551,2.6095238,0.0620438
,2016-06-01 17:02:24,1 min 23.845 sec,70.0,0.0166284,0.1005205,0.9997468,2.5913621,0.0076923,0.0551495,0.2134503,0.9702170,2.6095238,0.0583942
,2016-06-01 17:02:24,1 min 23.920 sec,80.0,0.0149023,0.0901975,0.9997885,2.5913621,0.0064103,0.0538631,0.2067295,0.9701606,2.6095238,0.0547445



Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
boat,716.0181274,1.0,0.3567185
sex,298.2000122,0.4164699,0.1485625
fare,177.8346710,0.2483662,0.0885968
ticket,174.5613251,0.2437946,0.0869660
age,155.2628021,0.2168420,0.0773516
home.dest,131.7727051,0.1840354,0.0656489
cabin,96.1860046,0.1343346,0.0479196
pclass,71.6239929,0.1000310,0.0356829
parch,54.4130287,0.0759939,0.0271084


<bound method H2OGradientBoostingEstimator.model_performance of >

The model and the predictions can be saved to file as follows:

In [319]:
h2o.save_model(gbm, "/tmp/bestModel.csv", force=True)
h2o.export_file(preds, "/tmp/bestPreds.csv", force=True)


Export File Progress: [##################################################] 100%


The model can also be exported as a plain old Java object (POJO) for H2O-independent (standalone/Storm/Kafka/UDF) scoring in any Java environment.

In [None]:
h2o.download_pojo(gbm)

## Summary
We learned how to build H2O GBM models for a binary classification task on a small but realistic dataset with numerical and categorical variables, with the goal to maximize the AUC (ranges from 0.5 to 1). We first established a baseline with the default model, then carefully tuned the remaining hyper-parameters without "too much" human guess-work. We used both Cartesian and Random hyper-parameter searches to find good models. We were able to get the AUC on a holdout test set from the low 94% range with the default model to the mid 97% after tuning.

Note that this script and the findings therein are directly transferrable to large datasets on distributed clusters including Spark/Hadoop environments.

More information can be found here [http://www.h2o.ai/docs/](http://www.h2o.ai/docs/).