# Random Forests in H2O
## Jose M Albornoz
### December 2018

This notebook demonstrates how to build and evaluate a basic Random Forest model in H2O.

In [1]:
import h2o

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  Starting server from c:\users\albornoj\appdata\local\programs\python\python37\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\AlbornoJ\AppData\Local\Temp\tmpjxkuuqyf
  JVM stdout: C:\Users\AlbornoJ\AppData\Local\Temp\tmpjxkuuqyf\h2o_AlbornoJ_started_from_python.out
  JVM stderr: C:\Users\AlbornoJ\AppData\Local\Temp\tmpjxkuuqyf\h2o_AlbornoJ_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 05 secs
H2O cluster timezone: Europe/London
H2O data parsing timezone: UTC
H2O cluster version: 3.22.0.2
H2O cluster version age: 11 days
H2O cluster name: H2O_from_python_AlbornoJ_1mv3u4
H2O cluster total nodes: 1
H2O cluster free memory: 3.531 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4


# 1.- Import Iris dataset

In [3]:
url = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"

In [4]:
iris = h2o.import_file(url)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [5]:
iris.shape

(150, 5)

# 2.- Train-test split

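We will use an 80-20 train-test split. Note that `split_frame` assigns rows at random, so the resulting proportions are approximate rather than exact.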

In [6]:
train, test = iris.split_frame([0.8])

In [7]:
train.shape

(125, 5)

In [8]:
test.shape

(25, 5)
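
The split above produced 125 and 25 rows rather than exactly 120 and 30. A minimal pure-Python sketch of why a random split behaves this way follows. It assumes each row is sent to the training set independently with probability 0.8; this illustrates the idea and is not H2O's actual implementation. The function name `approximate_split` and the seed value are hypothetical.

```python
import random


def approximate_split(n_rows, ratio, seed):
    """Send each row index to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        # One independent random draw per row, so the final sizes fluctuate.
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


train_idx, test_idx = approximate_split(150, 0.8, seed=42)
print(len(train_idx), len(test_idx))  # the two sizes always add up to 150
```

Because the assignment is random, rerunning the notebook can give different train and test sizes. `split_frame` also accepts a `seed` argument, which makes the split repeatable.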

In [9]:
train.summary()

| | sepal_len | sepal_wid | petal_len | petal_wid | class |
|---|---|---|---|---|---|
| type | real | real | real | real | enum |
| mins | 4.4 | 2.0 | 1.0 | 0.1 | |
| mean | 5.823199999999997 | 3.0608 | 3.6920000000000006 | 1.1608 | |
| maxs | 7.9 | 4.2 | 6.9 | 2.5 | |
| sigma | 0.8376364597542769 | 0.42651929578557063 | 1.782604111785575 | 0.7596578177047872 | |
| zeros | 0 | 0 | 0 | 0 | |
| missing | 0 | 0 | 0 | 0 | 0 |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |


# 3.- Model build

In [10]:
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [11]:
mRF = H2ORandomForestEstimator()
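
The estimator above is created with its default settings. The sketch below shows one way to pass hyperparameters explicitly, including a `seed` so that repeated runs are more likely to give the same forest. The values in `rf_params` and the helper name `build_seeded_forest` are illustrative assumptions, not settings used in this notebook; `ntrees`, `max_depth` and `seed` are parameters of `H2ORandomForestEstimator`.

```python
# Illustrative hyperparameters (hypothetical choices, not taken from the notebook).
rf_params = {
    "ntrees": 50,      # number of trees to build
    "max_depth": 20,   # maximum depth of each tree
    "seed": 1234,      # fixes the random sampling used while growing trees
}


def build_seeded_forest(train_frame, features, target, params=None):
    """Train a Random Forest with explicit settings; needs a running H2O cluster."""
    # Imported lazily so the helper can be defined before h2o.init() is called.
    from h2o.estimators.random_forest import H2ORandomForestEstimator

    model = H2ORandomForestEstimator(**(params or rf_params))
    model.train(x=features, y=target, training_frame=train_frame)
    return model
```

With the frames from this notebook, a call would look like `build_seeded_forest(train, ["sepal_len", "sepal_wid", "petal_len", "petal_wid"], "class")`.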

In [12]:
mRF.train(x=["sepal_len", "sepal_wid", "petal_len", "petal_wid"], y="class", training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [13]:
mRF

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1543854642908_1


ModelMetricsMultinomial: drf
** Reported on train data. **

MSE: 0.03971863813397885
RMSE: 0.1992953540200545
LogLoss: 0.1385834489675426
Mean Per-Class Error: 0.05772357723577235
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



| Iris-setosa | Iris-versicolor | Iris-virginica | Error | Rate |
|---|---|---|---|---|
| 44.0 | 0.0 | 0.0 | 0.0 | 0 / 44 |
| 0.0 | 38.0 | 3.0 | 0.0731707 | 3 / 41 |
| 0.0 | 4.0 | 36.0 | 0.1 | 4 / 40 |
| 44.0 | 42.0 | 39.0 | 0.056 | 7 / 125 |


Top-3 Hit Ratios: 


| k | hit_ratio |
|---|---|
| 1 | 0.944 |
| 2 | 1.0 |
| 3 | 1.0 |


Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_classification_error |
|---|---|---|---|---|---|
| 2018-12-03 16:30:53 | 0.078 sec | 0.0 | | | |
| 2018-12-03 16:30:54 | 0.372 sec | 1.0 | 0.2511026 | 1.5271711 | 0.0638298 |
| 2018-12-03 16:30:54 | 0.402 sec | 2.0 | 0.2404941 | 1.4479987 | 0.0540541 |
| 2018-12-03 16:30:54 | 0.437 sec | 3.0 | 0.3292557 | 3.0362184 | 0.0537634 |
| 2018-12-03 16:30:54 | 0.456 sec | 4.0 | 0.2721837 | 2.0614883 | 0.0485437 |
| --- | --- | --- | --- | --- | --- |
| 2018-12-03 16:30:55 | 1.532 sec | 46.0 | 0.1970684 | 0.1378176 | 0.056 |
| 2018-12-03 16:30:55 | 1.593 sec | 47.0 | 0.1978101 | 0.1392634 | 0.056 |
| 2018-12-03 16:30:55 | 1.779 sec | 48.0 | 0.1980983 | 0.1395192 | 0.056 |


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| petal_len | 1704.6412354 | 1.0 | 0.4689258 |
| petal_wid | 1497.3769531 | 0.8784118 | 0.4119100 |
| sepal_len | 340.3966064 | 0.1996881 | 0.0936389 |
| sepal_wid | 92.7896881 | 0.0544336 | 0.0255253 |
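
The three importance columns carry the same information on different scales. As a sketch, assuming `scaled_importance` is the relative importance divided by the largest value and `percentage` is the relative importance divided by the column total, both can be recomputed from the `relative_importance` figures reported above. The variable names in the code are illustrative.

```python
# relative_importance values copied from the Variable Importances table
relative = {
    "petal_len": 1704.6412354,
    "petal_wid": 1497.3769531,
    "sepal_len": 340.3966064,
    "sepal_wid": 92.7896881,
}

largest = max(relative.values())
total = sum(relative.values())

# Scale so the most important variable equals 1.0
scaled = {name: value / largest for name, value in relative.items()}
# Share of the total importance, summing to 1.0 across variables
percentage = {name: value / total for name, value in relative.items()}

for name in relative:
    print(f"{name}: scaled={scaled[name]:.7f} percentage={percentage[name]:.7f}")
```

The recomputed values agree with the table: the two petal measurements together account for roughly 88% of the total importance.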




# 4.- Predictions

In [14]:
p = mRF.predict(test)

drf prediction progress: |████████████████████████████████████████████████| 100%


In [15]:
p

| predict | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
| Iris-setosa | 0.998573 | 0.0 | 0.00142653 |
| Iris-setosa | 0.998573 | 0.0 | 0.00142653 |
| Iris-setosa | 0.979021 | 0.0195804 | 0.0013986 |
| Iris-setosa | 0.979021 | 0.0195804 | 0.0013986 |
| Iris-setosa | 0.998573 | 0.0 | 0.00142653 |
| Iris-setosa | 0.998573 | 0.0 | 0.00142653 |
| Iris-versicolor | 0.0 | 0.977712 | 0.0222883 |
| Iris-versicolor | 0.0 | 0.967603 | 0.0323974 |
| Iris-versicolor | 0.0 | 0.956909 | 0.0430906 |
| Iris-versicolor | 0.0 | 0.958631 | 0.0413695 |
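
In the prediction frame, the three numeric columns are per-class probabilities and the `predict` column holds a class label. The rows shown are consistent with `predict` being the class with the highest probability. The sketch below checks this on a few rows copied from the output; the helper `most_probable` is a hypothetical name used only for illustration.

```python
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Probability rows copied from the prediction output, in column order
rows = [
    (0.998573, 0.0, 0.00142653),
    (0.979021, 0.0195804, 0.0013986),
    (0.0, 0.977712, 0.0222883),
    (0.0, 0.956909, 0.0430906),
]


def most_probable(probs):
    """Return the class label whose probability is largest."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return classes[best]


labels = [most_probable(r) for r in rows]
row_sums = [sum(r) for r in rows]
print(labels)
print(row_sums)  # each row sums to 1 up to display rounding
```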




In [16]:
mRF.model_performance(test)


ModelMetricsMultinomial: drf
** Reported on test data. **

MSE: 0.03267390943224702
RMSE: 0.18075925822000657
LogLoss: 0.10200513566819698
Mean Per-Class Error: 0.07037037037037037
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



| Iris-setosa | Iris-versicolor | Iris-virginica | Error | Rate |
|---|---|---|---|---|
| 6.0 | 0.0 | 0.0 | 0.0 | 0 / 6 |
| 0.0 | 8.0 | 1.0 | 0.1111111 | 1 / 9 |
| 0.0 | 1.0 | 9.0 | 0.1 | 1 / 10 |
| 6.0 | 9.0 | 10.0 | 0.08 | 2 / 25 |


Top-3 Hit Ratios: 


| k | hit_ratio |
|---|---|
| 1 | 0.92 |
| 2 | 1.0 |
| 3 | 1.0 |
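
Several of the test metrics above can be recovered by hand from the confusion matrix, which helps when checking how they relate. The sketch below assumes the first three rows of the matrix are the actual classes in the same order as the columns, and that the last row holds column totals; it recomputes the per-class error, the mean per-class error, the overall error, the top-1 hit ratio, and the RMSE from the reported MSE.

```python
import math

# Test-set confusion matrix: rows = actual class, columns = predicted class
confusion = [
    [6, 0, 0],   # Iris-setosa
    [0, 8, 1],   # Iris-versicolor
    [0, 1, 9],   # Iris-virginica
]

# Error for each class: fraction of that class's rows that were misclassified
per_class_error = [1 - row[i] / sum(row) for i, row in enumerate(confusion)]
mean_per_class_error = sum(per_class_error) / len(per_class_error)

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
overall_error = 1 - correct / total
top1_hit_ratio = correct / total

# RMSE is the square root of the reported MSE
rmse = math.sqrt(0.03267390943224702)

print(per_class_error, mean_per_class_error, overall_error, top1_hit_ratio, rmse)
```

The mean per-class error (about 0.0704) differs from the overall error (0.08) because it averages the three class error rates with equal weight instead of weighting by class size.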




In [17]:
help(h2o.estimators.random_forest.H2ORandomForestEstimator)

Help on class H2ORandomForestEstimator in module h2o.estimators.random_forest:

class H2ORandomForestEstimator(h2o.estimators.estimator_base.H2OEstimator)
 |  H2ORandomForestEstimator(**kwargs)
 |  
 |  Distributed Random Forest
 |  
 |  Method resolution order:
 |      H2ORandomForestEstimator
 |      h2o.estimators.estimator_base.H2OEstimator
 |      h2o.model.model_base.ModelBase
 |      h2o.utils.backward_compatibility.BackwardsCompatibleBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, **kwargs)
 |      Construct a new model instance.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  balance_classes
 |      Balance training data class counts via over/under-sampling (for imbalanced data).
 |      
 |      Type: ``bool``  (default: ``False``).
 |  
 |  binomial_double_trees
 |      For binary classification: Build 2x as many trees (one per class) - can lead to higher accuracy.


