In [1]:
# load the h2o library and let it use all available threads (nthreads = -1)
library(h2o)
h2o.init(nthreads = -1)


----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------


Attaching package: ‘h2o’

The following objects are masked from ‘package:stats’:

    cor, sd, var

The following objects are masked from ‘package:base’:

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc




H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/tc/0ss1l73113j3wdyjsxmy1j2r0000gn/T//Rtmpt94xib/h2o_phall_started_from_r.out
    /var/folders/tc/0ss1l73113j3wdyjsxmy1j2r0000gn/T//Rtmpt94xib/h2o_phall_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 770 milliseconds 
    H2O cluster version:        3.12.0.1 
    H2O cluster version age:    29 days  
    H2O cluster name:           H2O_started_from_R_phall_vno587 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.56 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 3.3.2 (2016-10-31) 

In [2]:
# location of clean data file
path <- "/Users/phall/Documents/aetna/share/data/loan.csv"

In [3]:
# import file
frame <- h2o.importFile(path)

# strings are automatically parsed as enums (categorical)
# numbers are automatically parsed as numeric
# bad_loan is parsed as numeric but is a binary class label, so convert it to a factor
frame$bad_loan <- as.factor(frame$bad_loan)



In [4]:
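# find columns with missing values and impute them with the column median
# (h2o.impute modifies the frame in place, so no reassignment is needed)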
for (name in names(frame)) {
  if (any(is.na(frame[name]))) {
      h2o.impute(frame, name, "median")
  }
}
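
For readers who want to see what median imputation does without starting an H2O cluster, here is a minimal base-R sketch on a plain data.frame. The toy data and the helper name impute_median are assumptions made for illustration; they are not part of the original notebook or of the h2o package. Unlike the loop above, this sketch explicitly skips non-numeric columns, since a median is not defined for categorical values.

```r
# Hypothetical toy data standing in for the loan frame (not the real data).
toy <- data.frame(
  loan_amnt = c(5000, NA, 12000, 8000),
  int_rate  = c(10.5, 13.2, NA, 7.9),
  term      = factor(c("36 months", "60 months", NA, "36 months"))
)

# Illustrative helper: replace NAs in numeric columns with the column median.
impute_median <- function(df) {
  for (name in names(df)) {
    col <- df[[name]]
    if (is.numeric(col) && anyNA(col)) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
      df[[name]] <- col
    }
  }
  df  # base R copies on modify, so the result must be returned
}

imputed <- impute_median(toy)
print(imputed)
```

Note the design difference: the base-R helper returns a new data.frame, whereas the H2O call in the cell above changes the frame held by the cluster.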

In [5]:
h2o.describe(frame) # summarize table, check for missing

Label           Type  Missing   Zeros  PosInf  NegInf     Min        Max       Mean      Sigma  Cardinality
loan_amnt       int         0       0       0       0   500.0    35000.0   13074.17   7993.556
term            enum        0  129950       0       0     0.0        1.0  0.2075591  0.4055605          2.0
int_rate        real        0       0       0       0    5.42      26.06    13.7159    4.39194
emp_length      int         0   14248       0       0     0.0       10.0   5.695525   3.546671
home_ownership  enum        0       1       0       0     0.0        5.0                                6.0
annual_inc      real        0       0       0       0  1896.0  7141778.0    71915.4   59070.22
purpose         enum        0    2842       0       0     0.0       13.0                               14.0
addr_state      enum        0     413       0       0     0.0       49.0                               50.0
dti             real        0     270       0       0     0.0      39.99   15.88153   7.587668
delinq_2yrs     int         0  139488       0       0     0.0       29.0  0.2273168  0.6941131


In [5]:
# assign target and inputs
y <- 'bad_loan'
X <- names(frame)[names(frame) != y]
print(y)
print(X)

[1] "bad_loan"
 [1] "loan_amnt"             "term"                  "int_rate"             
 [4] "emp_length"            "home_ownership"        "annual_inc"           
 [7] "purpose"               "addr_state"            "dti"                  
[10] "delinq_2yrs"           "revol_util"            "total_acc"            
[13] "longest_credit_length" "verification_status"  


In [6]:
# split into training (70%) and holdout test (30%) sets; the test set scores the leaderboard
split <- h2o.splitFrame(frame, ratios = 0.7)
train <- split[[1]]
test <- split[[2]]
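
The split above is random, so row counts and downstream metrics change from run to run unless a seed is fixed (h2o.splitFrame accepts a seed argument for this purpose). The base-R sketch below shows the general idea of a seeded, probabilistic 70/30 split on a plain data.frame. The toy frame, the seed value, and the variable names are assumptions for illustration only.

```r
set.seed(12345)                 # arbitrary seed, chosen only for reproducibility

n   <- 1000
toy <- data.frame(id = seq_len(n))

# Assign each row to train with probability 0.7. Like any probabilistic
# split, the resulting sizes are close to 70/30 but not exact.
in_train  <- runif(n) < 0.7
train_toy <- toy[in_train, , drop = FALSE]
test_toy  <- toy[!in_train, , drop = FALSE]

cat("train rows:", nrow(train_toy), " test rows:", nrow(test_toy), "\n")
```

Every row lands in exactly one of the two sets, which is what keeps the test set a true holdout.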

In [8]:
# start automl process
# automl loosely based on: http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
auto <- h2o.automl(x = X, 
                   y = y,
                   training_frame = train, # training automatically split into 70% train, 30% validation
                   leaderboard_frame = test,
                   max_runtime_secs = 300) # will run for 300 seconds, then build a stacked ensemble



In [9]:
# view leaderboard
lb <- auto@leaderboard
lb

                                   model_id      auc  logloss
1  StackedEnsemble_0_AutoML_20170706_163428 0.703922 0.438078
2 GBM_grid_0_AutoML_20170706_163428_model_0 0.701623 0.449968
3 GLM_grid_0_AutoML_20170706_163428_model_1 0.697168 0.439673
4 GLM_grid_0_AutoML_20170706_163428_model_0 0.697168 0.439673
5              XRT_0_AutoML_20170706_163428 0.688596 0.443309
6              DRF_0_AutoML_20170706_163428 0.685044 0.446875

[6 rows x 3 columns] 
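
The leaderboard reports auc (higher is better) and logloss (lower is better) for each model on the leaderboard frame. As a rough sketch of what these two numbers measure, the base-R functions below compute them from scratch on a tiny made-up example. The function names, labels, and probabilities are illustrative assumptions; this is not how H2O computes the metrics internally.

```r
# AUC via the rank-sum (Mann-Whitney) formulation: the probability that a
# randomly chosen positive is scored higher than a randomly chosen negative.
auc_rank <- function(labels, scores) {
  r     <- rank(scores)                  # ties receive average ranks
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Binary log loss: mean negative log-likelihood of the true labels.
log_loss <- function(labels, probs) {
  -mean(labels * log(probs) + (1 - labels) * log(1 - probs))
}

# Hypothetical labels and predicted probabilities of class 1.
labels <- c(0, 0, 1, 1)
probs  <- c(0.1, 0.4, 0.35, 0.8)

auc_value     <- auc_rank(labels, probs)
logloss_value <- log_loss(labels, probs)
print(auc_value)      # 0.75: three of the four positive/negative pairs are ordered correctly
print(logloss_value)
```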

In [11]:
# view best model
best <- auto@leader
best # score new data with h2o.predict(); no MOJO for stacked ensembles yet

Model Details:

H2OBinomialModel: stackedensemble
Model ID:  StackedEnsemble_0_AutoML_20170706_163428 
NULL


H2OBinomialMetrics: stackedensemble
** Reported on training data. **

MSE:  0.0953438
RMSE:  0.3087779
LogLoss:  0.3221293
Mean Per-Class Error:  0.1791166
AUC:  0.9315942
Gini:  0.8631884

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
           0     1    Error         Rate
0      60674  5128 0.077931  =5128/65802
1       4077 10468 0.280303  =4077/14545
Totals 64751 15596 0.114566  =9205/80347

Maximum Metrics: Maximum metrics at their respective thresholds
                        metric threshold    value idx
1                       max f1  0.256958 0.694602 212
2                       max f2  0.183652 0.772241 265
3                 max f0point5  0.345865 0.731650 164
4                 max accuracy  0.315529 0.893375 179
5                max precision  0.906782 1.000000   0
6                   max recall  0.093797 1.000000 357
7           
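
The summary metrics printed above can be re-derived from the confusion matrix and the reported MSE and AUC, which is a useful sanity check when reading H2O output. The base-R sketch below does that arithmetic using only the numbers shown in this cell; the variable names are illustrative. Keep in mind that these figures are reported on training data, so the AUC of 0.93 is far more optimistic than the 0.703922 the same model scores on the leaderboard (test) frame.

```r
# Cell counts copied from the confusion matrix at the F1-optimal threshold.
tn <- 60674; fp <- 5128    # actual 0
fn <- 4077;  tp <- 10468   # actual 1

err0 <- fp / (tn + fp)                        # error rate for class 0
err1 <- fn / (fn + tp)                        # error rate for class 1
mean_per_class_error <- (err0 + err1) / 2
overall_error <- (fp + fn) / (tn + fp + fn + tp)

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

gini <- 2 * 0.9315942 - 1                     # Gini = 2 * AUC - 1
rmse <- sqrt(0.0953438)                       # RMSE = sqrt(MSE)

print(round(c(err0 = err0, err1 = err1, overall = overall_error,
              mean_per_class = mean_per_class_error,
              f1 = f1, gini = gini, rmse = rmse), 6))
```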

In [None]:
h2o.shutdown(prompt = FALSE)