### Create H2O cloud

In [18]:
### Task: Predict forest cover type from cartographic variables only.
###       The actual forest cover type for a given observation 
###       (30 x 30 meter cell) was determined from US Forest Service (USFS) data.

### Note: If run from plain R, execute R in the directory of this script. If run from RStudio, 
### be sure to setwd() to the location of this script. h2o.init() starts H2O in R's current 
### working directory. h2o.importFile() looks for files from the perspective of where H2O was 
### started.

## install.packages("h2o", lib="/opt/conda/lib/R/library", repos="http://cran.us.r-project.org")
library(h2o)
h2o.init(
  nthreads=-1,            ## -1: use all available threads
  max_mem_size = "2G")   
h2o.removeAll() # Clean slate - just in case the cluster was already running

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         22 hours 4 minutes 
    H2O cluster version:        3.10.3.6 
    H2O cluster version age:    1 month and 14 days  
    H2O cluster name:           H2O_started_from_R_jovyan_qjv229 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   1.35 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    R Version:                  R version 3.3.2 (2016-10-31) 

[1] 0


### Read data from disk 

In [2]:

df <- h2o.importFile(path = normalizePath("data/covtype.full.csv"))

## First, we will create three independent splits: train, valid, and test.
## We will train a model on the first set and use the others to check the
##  validity of the model by ensuring that it predicts accurately on data it
##  has not been shown.
## The second set will be used for validation most of the time. The third set will
##  be withheld until the end, to ensure that our validation accuracy is consistent
##  with data we have never seen during the iterative process. 
splits <- h2o.splitFrame(
  df,           ##  splitting the H2O frame we read above
  c(0.6,0.2),   ##  create splits of 60% and 20%; 
                ##  H2O will create one more split of 1-(sum of these parameters)
                ##  so we will get 0.6 / 0.2 / 1 - (0.6+0.2) = 0.6/0.2/0.2
  seed=1234)    ##  setting a seed will ensure reproducible results (not R's seed)
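
As a rough illustration of how ratio-based splitting behaves, the base-R sketch below assigns simulated rows to three groups using the same 0.6/0.2 ratios. It does not call H2O; the row count `n` and the names `ratios`, `breaks`, `assignment`, and `split_fractions` are made up for this example. `h2o.splitFrame()` is also probabilistic, so the frames it returns are close to, but not exactly, 60/20/20.

```r
set.seed(1234)                       # R's seed, used only for this simulation
ratios <- c(0.6, 0.2)                # same ratios passed to h2o.splitFrame()
n <- 10000                           # hypothetical number of rows

# Cut points on [0, 1]: 0, 0.6, 0.8, 1. The last group gets the remainder.
breaks <- c(0, cumsum(ratios), 1)
assignment <- cut(runif(n), breaks = breaks,
                  labels = c("train", "valid", "test"),
                  include.lowest = TRUE)

# Observed fractions are near 0.6 / 0.2 / 0.2 but not exact
split_fractions <- table(assignment) / n
print(split_fractions)
```

In the notebook itself, `sapply(splits, nrow)` is one way to check the actual sizes of the three H2O frames.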



### Look at variable types

In [3]:
h2o.getTypes(df)

### Create training, validation and test set 

In [4]:
train <- h2o.assign(splits[[1]], "train.hex")   
                ## assign the first result the R variable train
                ## and the H2O name train.hex
valid <- h2o.assign(splits[[2]], "valid.hex")   ## R valid, H2O valid.hex
test <- h2o.assign(splits[[3]], "test.hex")     ## R test, H2O test.hex

## take a look at the first few rows of the data set
train[1:5,]   ## rows 1-5, all columns

  Elevation Aspect Slope Horizontal_Distance_To_Hydrology
1      3136     32    20                              450
2      3217     80    13                               30
3      3119    293    13                               30
4      2679     48     7                              150
5      3261    322    13                               30
  Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways Hillshade_9am
1                            -38                            1290           211
2                              1                            3901           237
3                             10                            4810           182
4                             24                            1588           223
5                              5                            5701           186
  Hillshade_Noon Hillshade_3pm Horizontal_Distance_To_Fire_Points
1            193           111                               1112
2            217           109                

### Run the first Random Forest predictive model

In [10]:

rf1 <- h2o.randomForest(         ## h2o.randomForest function
  training_frame = train,        ## the H2O frame for training
  validation_frame = valid,      ## the H2O frame for validation (not required)
  x=1:12,                        ## the predictor columns, by column index
  y=13,                          ## the target index (what we are predicting)
  model_id = "rf_covType_v1",    ## name the model in H2O
                                 ##   not required, but helps use Flow
  ntrees = 200,                  ## use a maximum of 200 trees to create the
                                 ##  random forest model. The default is 50.
                                 ##  I have increased it because I will let 
                                 ##  the early stopping criteria decide when
                                 ##  the random forest is sufficiently accurate
  stopping_rounds = 2,           ## Stop fitting new trees when the 2-tree
                                 ##  average is within 0.001 (default) of 
                                 ##  the prior two 2-tree averages.
                                 ##  Can be thought of as a convergence setting
  score_each_iteration = T,      ## Predict against training and validation for
                                 ##  each tree. Default will skip several.
  seed = 1000000)                ## Set the random seed so that this can be
                                 ##  reproduced.
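
The stopping rule described in the comments above can be pictured with a small base-R sketch. This is a simplified reading of the moving-average idea, not H2O's actual implementation: it assumes the metric is "lower is better", and the helper names `moving_avg` and `should_stop` as well as the `logloss` values are hypothetical.

```r
# Trailing means of length k; the result has length(x) - k + 1 entries
moving_avg <- function(x, k) {
  sapply(k:length(x), function(i) mean(x[(i - k + 1):i]))
}

# Stop when the latest k-length average improves on the best of the
# previous k averages by less than a relative tolerance
should_stop <- function(history, k = 2, tol = 0.001) {
  if (length(history) < 2 * k) return(FALSE)   # not enough scoring events yet
  ma <- moving_avg(history, k)
  latest <- ma[length(ma)]
  prior_best <- min(ma[(length(ma) - k):(length(ma) - 1)])
  (prior_best - latest) / prior_best < tol
}

# Made-up validation logloss after each tree
logloss <- c(0.90, 0.60, 0.45, 0.40, 0.3999, 0.3998, 0.3998)
stop_flags <- sapply(seq_along(logloss), function(n) should_stop(logloss[1:n]))
print(stop_flags)
```

With `score_each_iteration = T` every tree produces a scoring event, so a rule like this is evaluated after each tree rather than after every few.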
###############################################################################




###############################################################################



### Model rf1 information

In [11]:
summary(rf1)                     ## View information about the model.
                                 ## Keys to look for are validation performance
                                 ##  and variable importance

Model Details:

H2OMultinomialModel: drf
Model Key:  rf_covType_v1 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              24                      168            11881288        17
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        20   19.95238        527      16103  5504.44630

H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

Training Set Metrics: 

Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.05607684
RMSE: (Extract with `h2o.rmse`) 0.2368055
Logloss: (Extract with `h2o.logloss`) 0.238433
Mean Per-Class Error: 0.1110219
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
Confusion Matrix: vertical: actual; across: predicted
        class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error
class_1  117176    9534       5       0      53      11     338 0.0782
class_2    5414  164066   

In [12]:
rf1@model$validation_metrics     ## A more direct way to access the validation 
                                 ##  metrics. Performance metrics depend on 
                                 ##  the type of model being built. With a
                                 ##  multinomial classification, we will primarily
                                 ##  look at the confusion matrix, and overall
                                 ##  accuracy via hit_ratio @ k=1.

H2OMultinomialMetrics: drf
** Reported on validation data. **

Validation Set Metrics: 

Extract validation frame with `h2o.getFrame("valid.hex")`
MSE: (Extract with `h2o.mse`) 0.05314141
RMSE: (Extract with `h2o.rmse`) 0.2305242
Logloss: (Extract with `h2o.logloss`) 0.2003041
Mean Per-Class Error: 0.1025131
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,valid = TRUE)`)
Confusion Matrix: vertical: actual; across: predicted
        class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error
class_1   39403    2998       0       0      15       2      82 0.0729
class_2    1589   54529     104       0      83      60      15 0.0328
class_3       0     131    6844      30       3     135       0 0.0419
class_4       1       1      61     479       0      20       0 0.1477
class_5      29     432      24       0    1377       8       0 0.2636
class_6       0     129     212      19       3    3101       0 0.1048
class_7     204      16       0       0       1       0    3878

In [13]:
h2o.hit_ratio_table(rf1,valid = T)[1,2]
                                 ## Even more directly, the hit_ratio @ k=1
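
The hit ratio at k=1 is the fraction of rows whose top prediction is correct, so it should correspond to the accuracy implied by the confusion matrix. The base-R sketch below rebuilds that number from the validation counts printed above; the variable names (`cm`, `accuracy`, `per_class_error`) are chosen for this example only.

```r
classes <- paste0("class_", 1:7)

# Validation confusion matrix copied from the output above
# (rows: actual, columns: predicted)
cm <- matrix(c(
  39403,  2998,    0,   0,   15,    2,   82,
   1589, 54529,  104,   0,   83,   60,   15,
      0,   131, 6844,  30,    3,  135,    0,
      1,     1,   61, 479,    0,   20,    0,
     29,   432,   24,   0, 1377,    8,    0,
      0,   129,  212,  19,    3, 3101,    0,
    204,    16,    0,   0,    1,    0, 3878),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes))

# Overall accuracy: correct predictions (diagonal) over all rows
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class error: share of each actual class that was misclassified
per_class_error <- 1 - diag(cm) / rowSums(cm)

print(round(accuracy, 4))
print(round(per_class_error, 4))
```

From these counts the overall accuracy works out to about 0.945, and the per-class errors match the `Error` column in the printed matrix.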

### Run the first GBM predictive model

In [7]:
## Now we will try GBM. 
## First we will use all default settings, then make some changes,
##  describing the parameters and their defaults as we go.

gbm1 <- h2o.gbm(
  training_frame = train,        ## the H2O frame for training
  validation_frame = valid,      ## the H2O frame for validation (not required)
  x=1:12,                        ## the predictor columns, by column index
  y=13,                          ## the target index (what we are predicting)
  model_id = "gbm_covType1",     ## name the model in H2O
  seed = 2000000)                ## Set the random seed for reproducibility

###############################################################################
summary(gbm1)                   ## View information about the model.

h2o.hit_ratio_table(gbm1,valid = T)[1,2]
                                ## Overall accuracy.

## This default GBM is much worse than our original random forest.
## The GBM is far from converging, so there are three primary knobs to adjust
##  to get our performance up if we want to keep a similar run time.
## 1: Adding trees will help. The default is 50.
## 2: Increasing the learning rate will also help. The contribution of each
##  tree will be stronger, so the model will move further away from the
##  overall mean.
## 3: Increasing the depth will help. This is the parameter that is the least
##  straightforward. Tuning trees and learning rate both have a direct impact
##  that is easy to understand. Changing the depth means you are adjusting
##  the "weakness" of each learner. Adding depth makes each tree fit the data
##  more closely. 
##
## The first configuration will attack depth the most, since we've seen the
##  random forest focus on a continuous variable (elevation) and 40-class factor
##  (soil type) the most.
##


###############################################################################

Model Details:

H2OMultinomialModel: gbm
Model Key:  gbm_covType1 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              50                      350              315921         5
  max_depth mean_depth min_leaves max_leaves mean_leaves
1         5    5.00000         22         32    31.14000

H2OMultinomialMetrics: gbm
** Reported on training data. **

Training Set Metrics: 

Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.1435669
RMSE: (Extract with `h2o.rmse`) 0.3789023
Logloss: (Extract with `h2o.logloss`) 0.4560168
Mean Per-Class Error: 0.2697606
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
Confusion Matrix: vertical: actual; across: predicted
        class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error
class_1   99691   26454      19       0      35      21     900 0.2158
class_2   21964  146151    1049       0     241     873      64 0.1420
class_3   

### Run the second GBM predictive model

In [9]:
gbm2 <- h2o.gbm(
  training_frame = train,     ##
  validation_frame = valid,   ##
  x=1:12,                     ##
  y=13,                       ## 
  ntrees = 20,                ## decrease the trees, mostly to allow for run time
                              ##  (from 50)
  learn_rate = 0.2,           ## increase the learning rate (from 0.1)
  max_depth = 10,             ## increase the depth (from 5)
  stopping_rounds = 2,        ## 
  stopping_tolerance = 0.01,  ##
  score_each_iteration = T,   ##
  model_id = "gbm_covType2",  ##
  seed = 2000000)             ##

###############################################################################

summary(gbm2)
h2o.hit_ratio_table(gbm1,valid = T)[1,2]    ## review the first model's accuracy
h2o.hit_ratio_table(gbm2,valid = T)[1,2]    ## review the new model's accuracy
###############################################################################

## This has moved us in the right direction, but still lower accuracy 
##  than the random forest.
## And it still has not converged, so we can make it more aggressive.
## We can now add the stochastic nature of random forest into the GBM
##  using some of the new H2O settings. This will help generalize 
##  and also provide a quicker runtime, so we can add a few more trees.

Model Details:

H2OMultinomialModel: gbm
Model Key:  gbm_covType2 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              20                      140             1154332        10
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        10   10.00000        167        855   600.65717

H2OMultinomialMetrics: gbm
** Reported on training data. **

Training Set Metrics: 

Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.05351069
RMSE: (Extract with `h2o.rmse`) 0.2313238
Logloss: (Extract with `h2o.logloss`) 0.1961796
Mean Per-Class Error: 0.05628537
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
Confusion Matrix: vertical: actual; across: predicted
        class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error
class_1  115685   11234       1       0      24       2     174 0.0900
class_2    7733  162200     139       0     171      75      24 0.0478
class_3 

### Run the third GBM predictive model

In [15]:

gbm3 <- h2o.gbm(
  training_frame = train,     ##
  validation_frame = valid,   ##
  x=1:12,                     ##
  y=13,                       ## 
  ntrees = 30,                ## add a few trees (from 20, though default is 50)
  learn_rate = 0.3,           ## increase the learning rate even further
  max_depth = 10,             ## 
  sample_rate = 0.7,          ## use a random 70% of the rows to fit each tree
  col_sample_rate = 0.7,      ## consider a random 70% of the columns at each split
  stopping_rounds = 2,        ## 
  stopping_tolerance = 0.01,  ##
  score_each_iteration = T,   ##
  model_id = "gbm_covType3",  ##
  seed = 2000000)             ##
###############################################################################

summary(gbm3)
h2o.hit_ratio_table(rf1,valid = T)[1,2]     ## review the random forest accuracy
h2o.hit_ratio_table(gbm1,valid = T)[1,2]    ## review the first model's accuracy
h2o.hit_ratio_table(gbm2,valid = T)[1,2]    ## review the second model's accuracy
h2o.hit_ratio_table(gbm3,valid = T)[1,2]    ## review the newest model's accuracy
###############################################################################


Model Details:

H2OMultinomialModel: gbm
Model Key:  gbm_covType3 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              30                      210             1460962        10
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        10   10.00000        160        800   496.31906

H2OMultinomialMetrics: gbm
** Reported on training data. **

Training Set Metrics: 

Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.02767722
RMSE: (Extract with `h2o.rmse`) 0.1663647
Logloss: (Extract with `h2o.logloss`) 0.1113528
Mean Per-Class Error: 0.02172058
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
Confusion Matrix: vertical: actual; across: predicted
        class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error
class_1  121209    5849       2       0      19       0      41 0.0465
class_2    3779  166343      77       0      86      45      12 0.0235
class_3 

### Modify Random Forest parameters

In [16]:

## Now the GBM is close to the initial random forest.
## However, we used a default random forest. 
## Random forest's primary strength is how well it runs with standard
##  parameters. And while there are only a few parameters to tune, we can 
##  experiment with those to see if it will make a difference.
## The main parameters to tune are the tree depth and mtries, which
##  is the number of predictors considered at each split.
## The default depth of trees is 20. It is common to increase this number,
##  to the point that in some implementations, the depth is unlimited.
##  We will increase ours from 20 to 30.
## Note that the default mtries depends on whether classification or regression
##  is being run. The default for classification is the square root of the
##  number of columns. The default for regression is one-third of the columns.

rf2 <- h2o.randomForest(        ##
  training_frame = train,       ##
  validation_frame = valid,     ##
  x=1:12,                       ##
  y=13,                         ##
  model_id = "rf_covType2",     ## 
  ntrees = 200,                 ##
  max_depth = 30,               ## Increase depth, from 20
  stopping_rounds = 2,          ##
  stopping_tolerance = 1e-2,    ##
  score_each_iteration = T,     ##
  seed=3000000)                 ##
###############################################################################
summary(rf2)
h2o.hit_ratio_table(gbm3,valid = T)[1,2]    ## review the newest GBM accuracy
h2o.hit_ratio_table(rf1,valid = T)[1,2]     ## original random forest accuracy
h2o.hit_ratio_table(rf2,valid = T)[1,2]     ## newest random forest accuracy
###############################################################################

Model Details:

H2OMultinomialModel: drf
Model Key:  rf_covType2 
Model Summary: 
  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1              22                      154            16259862        18
  max_depth mean_depth min_leaves max_leaves mean_leaves
1        30   28.03896        538      25147  8261.27900

H2OMultinomialMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **

Training Set Metrics: 

Extract training frame with `h2o.getFrame("train.hex")`
MSE: (Extract with `h2o.mse`) 0.04415075
RMSE: (Extract with `h2o.rmse`) 0.2101208
Logloss: (Extract with `h2o.logloss`) 0.2736372
Mean Per-Class Error: 0.1031423
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
Confusion Matrix: vertical: actual; across: predicted
        class_1 class_2 class_3 class_4 class_5 class_6 class_7  Error
class_1  119338    7439       3       0      43       9     284 0.0612
class_2    4032  165551    

In [17]:

## So we now have our accuracy up beyond 95%. 
## We have withheld an extra test set to check that, after all the parameter
##  tuning we have done against the validation data, our model produces
##  similar results on the third data set. 

## Create predictions using our latest RF model against the test set.
finalRf_predictions<-h2o.predict(
  object = rf2
  ,newdata = test)

## Glance at what that prediction set looks like
## We see a final prediction in the "predict" column,
##  and then the predicted probabilities per class.
finalRf_predictions

## Compare these predictions to the accuracy we got from our experimentation
h2o.hit_ratio_table(rf2,valid = T)[1,2]             ## validation set accuracy
mean(finalRf_predictions$predict==test$Cover_Type)  ## test set accuracy

## We have very similar error rates on both sets, so it would not seem
##  that we have overfit the validation set through our experimentation.




  predict   class_1   class_2 class_3 class_4   class_5 class_6 class_7
1 class_2 0.3000000 0.7000000       0       0 0.0000000       0       0
2 class_1 1.0000000 0.0000000       0       0 0.0000000       0       0
3 class_1 0.7777778 0.2222222       0       0 0.0000000       0       0
4 class_1 0.8450704 0.1549296       0       0 0.0000000       0       0
5 class_2 0.2052980 0.7947020       0       0 0.0000000       0       0
6 class_5 0.0000000 0.3333333       0       0 0.6666667       0       0

[115979 rows x 8 columns] 
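
To make the layout of the prediction frame concrete, the base-R sketch below re-enters the six rows printed above and checks that, for these rows, the `predict` column is the class with the highest predicted probability. It works on a plain `data.frame` rather than an H2O frame; the names `preds`, `prob_cols`, `probs`, and `top_class` are chosen for this example only.

```r
# The six rows shown above, typed in by hand
preds <- data.frame(
  predict = c("class_2", "class_1", "class_1", "class_1", "class_2", "class_5"),
  class_1 = c(0.3000000, 1.0000000, 0.7777778, 0.8450704, 0.2052980, 0.0000000),
  class_2 = c(0.7000000, 0.0000000, 0.2222222, 0.1549296, 0.7947020, 0.3333333),
  class_3 = 0, class_4 = 0,
  class_5 = c(0, 0, 0, 0, 0, 0.6666667),
  class_6 = 0, class_7 = 0,
  stringsAsFactors = FALSE)

# Probability columns are everything except the label column
prob_cols <- setdiff(names(preds), "predict")
probs <- as.matrix(preds[, prob_cols])

# Column with the largest probability in each row
top_class <- prob_cols[max.col(probs, ties.method = "first")]

# Side-by-side view: reported label, argmax label, and row sum of probabilities
print(cbind(preds["predict"], top_class, row_sum = rowSums(probs)))
```

Each row of class probabilities sums to 1 (up to rounding in the printed values), which is a quick sanity check on any prediction frame.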

### Shutdown H2O

In [None]:
h2o.shutdown(prompt=FALSE)