# Gradient Boosting in H2O
## Jose M Albornoz
### December 2018

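This notebook shows how to train a gradient boosting machine (GBM) in H2O on the Iris dataset: importing the data, making a train-test split, building the model, and evaluating its predictions on the test set.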

In [1]:
import h2o

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from c:\users\albornoj\appdata\local\programs\python\python37\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\AlbornoJ\AppData\Local\Temp\tmpngu88mf1
  JVM stdout: C:\Users\AlbornoJ\AppData\Local\Temp\tmpngu88mf1\h2o_AlbornoJ_started_from_python.out
  JVM stderr: C:\Users\AlbornoJ\AppData\Local\Temp\tmpngu88mf1\h2o_AlbornoJ_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| | |
|---|---|
| H2O cluster uptime: | 06 secs |
| H2O cluster timezone: | Europe/London |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.22.0.2 |
| H2O cluster version age: | 12 days |
| H2O cluster name: | H2O_from_python_AlbornoJ_oapsnj |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.531 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |


# 1.- Import Iris dataset

In [3]:
url = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"

In [4]:
iris = h2o.import_file(url)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [5]:
iris.shape

(150, 5)

# 2.- Train-test split

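We will use an 80-20 train-test split. Note that `split_frame` does not guarantee exact proportions, so the resulting row counts are approximate.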

In [6]:
train, test = iris.split_frame([0.8])

In [7]:
train.shape

(119, 5)

In [8]:
test.shape

(31, 5)
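One way to see why the counts are 119 and 31 rather than exactly 120 and 30 is to assign each row to a subset at random instead of cutting at a fixed row number. The following is a plain-Python sketch of that idea, not H2O's actual implementation; the helper name `probabilistic_split` and the seed value are assumptions made for illustration.

```python
import random

def probabilistic_split(n_rows, ratio, seed=None):
    """Assign each row index to train with probability `ratio`, else to test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

# 150 rows, as in the Iris frame; the seed is arbitrary
train_idx, test_idx = probabilistic_split(150, 0.8, seed=42)

# The sizes are close to 120/30 but are not guaranteed to match exactly
print(len(train_idx), len(test_idx))
```

In H2O itself, `split_frame` accepts a `seed` argument, which makes the split reproducible across runs.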

In [9]:
train.summary()

| | sepal_len | sepal_wid | petal_len | petal_wid | class |
|---|---|---|---|---|---|
| type | real | real | real | real | enum |
| mins | 4.4 | 2.0 | 1.0 | 0.1 | |
| mean | 5.854621848739495 | 3.034453781512605 | 3.8226890756302523 | 1.2201680672268904 | |
| maxs | 7.7 | 4.4 | 6.9 | 2.5 | |
| sigma | 0.8227432306536366 | 0.42693963666333795 | 1.7322189636705823 | 0.7548240149517819 | |
| zeros | 0 | 0 | 0 | 0 | |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 1 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 2 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


# 3.- Model build

In [10]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [11]:
mGBM = H2OGradientBoostingEstimator()

In [12]:
mGBM.train(x=["sepal_len", "sepal_wid", "petal_len", "petal_wid"], y="class", training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [13]:
mGBM

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1543927455262_1


ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.002790592010464094
RMSE: 0.052826054276882105
LogLoss: 0.017476272904586188
Mean Per-Class Error: 0.0
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



| Iris-setosa | Iris-versicolor | Iris-virginica | Error | Rate |
|---|---|---|---|---|
| 37.0 | 0.0 | 0.0 | 0.0 | 0 / 37 |
| 0.0 | 42.0 | 0.0 | 0.0 | 0 / 42 |
| 0.0 | 0.0 | 40.0 | 0.0 | 0 / 40 |
| 37.0 | 42.0 | 40.0 | 0.0 | 0 / 119 |


Top-3 Hit Ratios: 


| k | hit_ratio |
|---|---|
| 1 | 1.0 |
| 2 | 1.0 |
| 3 | 1.0 |


Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_classification_error |
|---|---|---|---|---|---|
| 2018-12-04 12:44:27 | 0.150 sec | 0.0 | 0.6666667 | 1.0986123 | 0.6218487 |
| 2018-12-04 12:44:28 | 0.508 sec | 1.0 | 0.6031805 | 0.9246870 | 0.0336134 |
| 2018-12-04 12:44:28 | 0.653 sec | 2.0 | 0.5454041 | 0.7893513 | 0.0336134 |
| 2018-12-04 12:44:28 | 0.792 sec | 3.0 | 0.4927540 | 0.6797675 | 0.0336134 |
| 2018-12-04 12:44:28 | 0.830 sec | 4.0 | 0.4449532 | 0.5890979 | 0.0336134 |
| --- | --- | --- | --- | --- | --- |
| 2018-12-04 12:44:30 | 2.706 sec | 46.0 | 0.0600134 | 0.0214302 | 0.0 |
| 2018-12-04 12:44:30 | 2.718 sec | 47.0 | 0.0587942 | 0.0206212 | 0.0 |
| 2018-12-04 12:44:30 | 2.729 sec | 48.0 | 0.0575721 | 0.0197567 | 0.0 |
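The zero-tree row of the scoring history can be checked by hand. The sketch below rests on two assumptions that are consistent with the reported numbers but are not stated in the output: that the initial model gives each of the three classes probability 1/3, and that the per-row error behind the RMSE is one minus the probability assigned to the true class.

```python
import math

n_classes = 3
p_true = 1.0 / n_classes  # assumed uniform initial probability of the true class

# Log loss of a uniform prediction: -ln(1/3) = ln(3)
baseline_logloss = -math.log(p_true)

# Every row has the same assumed error (1 - 1/3), so the RMSE equals that error
baseline_rmse = math.sqrt((1.0 - p_true) ** 2)

print(round(baseline_logloss, 7))  # 1.0986123
print(round(baseline_rmse, 7))     # 0.6666667
```

Under those assumptions, both values match the `training_logloss` and `training_rmse` reported for zero trees.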


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| petal_len | 216.7495117 | 1.0 | 0.5932508 |
| petal_wid | 144.4012146 | 0.6662124 | 0.3952311 |
| sepal_wid | 2.6983480 | 0.0124492 | 0.0073855 |
| sepal_len | 1.5098836 | 0.0069660 | 0.0041326 |
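The three importance columns are related by simple arithmetic: the values shown are consistent with `scaled_importance` being the relative importance divided by the largest value, and `percentage` being the relative importance divided by the sum. A short sketch that recomputes both from the `relative_importance` column, with the numbers copied from the table above:

```python
# relative_importance values copied from the model output
relative_importance = {
    "petal_len": 216.7495117,
    "petal_wid": 144.4012146,
    "sepal_wid": 2.6983480,
    "sepal_len": 1.5098836,
}

largest = max(relative_importance.values())
total = sum(relative_importance.values())

# Divide by the maximum for the scaled value, by the sum for the share
scaled = {name: value / largest for name, value in relative_importance.items()}
percentage = {name: value / total for name, value in relative_importance.items()}

for name in relative_importance:
    print(f"{name}: scaled={scaled[name]:.7f}, percentage={percentage[name]:.7f}")
```

The two petal measurements together account for almost 99% of the total importance, which is why the sepal columns contribute so little to this model.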




# 4.- Predictions

In [14]:
p = mGBM.predict(test)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [15]:
p

| predict | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 0.998982 | 0.000503086 | 0.000515137 |
| Iris-setosa | 0.998809 | 0.000641481 | 0.000549225 |
| Iris-setosa | 0.998808 | 0.000641654 | 0.000550001 |
| Iris-setosa | 0.99881 | 0.000641202 | 0.000548929 |
| Iris-setosa | 0.997835 | 0.00177245 | 0.000392934 |
| Iris-setosa | 0.997863 | 0.0017717 | 0.000365617 |
| Iris-setosa | 0.998689 | 0.000796088 | 0.000514672 |
| Iris-setosa | 0.99835 | 0.00116748 | 0.000482465 |
| Iris-setosa | 0.998334 | 0.00116516 | 0.000500533 |
| Iris-setosa | 0.99871 | 0.000775037 | 0.000514883 |
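Each row of the prediction frame holds one probability per class, and in the rows shown the `predict` column is the class with the highest probability. The sketch below reproduces that reading for two of the rows above in plain Python; it illustrates how to interpret the output and is not H2O code.

```python
class_names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Class probabilities copied from the first and fifth rows of the output
rows = [
    [0.998982, 0.000503086, 0.000515137],
    [0.997835, 0.00177245, 0.000392934],
]

predicted = []
for probs in rows:
    best = probs.index(max(probs))  # position of the largest probability
    predicted.append(class_names[best])
    # The probabilities in a row sum to 1, up to display rounding
    print(class_names[best], round(sum(probs), 4))
```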




In [16]:
mGBM.model_performance(test)


ModelMetricsMultinomial: gbm
** Reported on test data. **

MSE: 0.03304693797784583
RMSE: 0.1817881678708651
LogLoss: 0.18288111583643185
Mean Per-Class Error: 0.041666666666666664
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



| Iris-setosa | Iris-versicolor | Iris-virginica | Error | Rate |
|---|---|---|---|---|
| 13.0 | 0.0 | 0.0 | 0.0 | 0 / 13 |
| 0.0 | 7.0 | 1.0 | 0.125 | 1 / 8 |
| 0.0 | 0.0 | 10.0 | 0.0 | 0 / 10 |
| 13.0 | 7.0 | 11.0 | 0.0322581 | 1 / 31 |


Top-3 Hit Ratios: 


| k | hit_ratio |
|---|---|
| 1 | 0.9677419 |
| 2 | 1.0 |
| 3 | 1.0 |
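The summary figures reported on the test data follow directly from the confusion matrix. As a check, this sketch recomputes the per-class errors, the mean per-class error, the overall error rate, and the top-1 hit ratio from the counts above (rows are actual classes, columns are predicted classes).

```python
# Test-set confusion matrix: rows = actual class, columns = predicted class
confusion = [
    [13, 0, 0],   # Iris-setosa
    [0, 7, 1],    # Iris-versicolor
    [0, 0, 10],   # Iris-virginica
]

# Error for each class: share of its rows that were predicted as another class
per_class_error = [1 - row[i] / sum(row) for i, row in enumerate(confusion)]
mean_per_class_error = sum(per_class_error) / len(per_class_error)

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
overall_error = 1 - correct / total
top1_hit_ratio = correct / total

print(per_class_error)
print(round(mean_per_class_error, 7), round(overall_error, 7), round(top1_hit_ratio, 7))
```

The single misclassified versicolor flower gives a per-class error of 1/8 for that class, so the mean per-class error (0.125 / 3) is higher than the overall error (1 / 31) because the versicolor class is the smallest in the test set.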




In [17]:
help(h2o.estimators.gbm.H2OGradientBoostingEstimator)

Help on class H2OGradientBoostingEstimator in module h2o.estimators.gbm:

class H2OGradientBoostingEstimator(h2o.estimators.estimator_base.H2OEstimator)
 |  H2OGradientBoostingEstimator(**kwargs)
 |  
 |  Gradient Boosting Machine
 |  
 |  Builds gradient boosted trees on a parsed data set, for regression or classification.
 |  The default distribution function will guess the model type based on the response column type.
 |  Otherwise, the response column must be an enum for "bernoulli" or "multinomial", and numeric
 |  for all other distributions.
 |  
 |  Method resolution order:
 |      H2OGradientBoostingEstimator
 |      h2o.estimators.estimator_base.H2OEstimator
 |      h2o.model.model_base.ModelBase
 |      h2o.utils.backward_compatibility.BackwardsCompatibleBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, **kwargs)
 |      Construct a new model instance.
 |  
 |  ----------------------------------------------------------------------
 |  Data d


