### Start H2O

Import the **h2o** Python module and `H2OAutoML` class and initialize a local H2O cluster.

NOTE: The structure of this code comes primarily from the H2O AutoML regression tutorial notebook (ipynb), adapted to run with my data.

In [1]:
import pandas as pd
import numpy as np
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime:,42 mins 10 secs
H2O_cluster_timezone:,Europe/London
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.1
H2O_cluster_version_age:,19 days
H2O_cluster_name:,H2O_from_python_mackenzie_bbmmpt
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,7.773 Gb
H2O_cluster_total_cores:,12
H2O_cluster_allowed_cores:,12


### Load Data

In [4]:
# Group A is Black
# Group B is White
dataA_path = "/home/mackenzie/git_repositories/delayedimpact/data/simulated_data/simData_groupA_black.csv"
#dataB_path = "/home/mackenzie/git_repositories/delayedimpact/data/simulated_data/simData_groupB_black.csv"

# Load data into H2O
df_A = h2o.import_file(dataA_path)
#df_B = h2o.import_file(dataB_path)

# The cells below refer to the frame as `df`, so alias the Group A frame
df = df_A

Parse progress: |█████████████████████████████████████████████████████████| 100%


Let's take a look at the data.

In [5]:
df.describe()

Rows:1000
Cols:2




,score,repay_probability
type,real,real
mins,311.9047619047619,1.2000000000000028
mean,633.662900804343,70.83487000000005
maxs,841.2280701754386,99.05
sigma,130.08983997452646,34.00573837200606
zeros,0,0
missing,0,0
0,323.8095238095238,1.2000000000000028
1,323.8095238095238,1.2000000000000028
2,323.8095238095238,1.2000000000000028


In [None]:
# Q: Should I round the values to fewer decimal places? The extra precision could be why the regression models I was using before suffered.

Next, let's identify the response column and save the column name as `y`.  In this dataset, we will use all columns except the response as predictors, so we can skip setting the `x` argument explicitly.

In [6]:
y = "repay_probability"

Lastly, let's split the data into two frames, a `train` (80%) and a `test` frame (20%).  The `test` frame will be used to score the leaderboard and to demonstrate how to generate predictions using an AutoML leader model.

In [7]:
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
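
As far as I understand it, `split_frame` assigns rows to splits probabilistically, so the 80/20 ratio is approximate rather than exact. The sketch below illustrates that idea in plain Python with an independent random draw per row. It is not H2O's implementation, and `approximate_split` is a hypothetical helper written only for this illustration.

```python
import random

def approximate_split(n_rows, ratio, seed):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        # Each row lands in the training split with probability `ratio`.
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

train_idx, test_idx = approximate_split(n_rows=1000, ratio=0.8, seed=1)
# The sizes are close to 800 and 200, but usually not exactly equal to them.
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which is the role `seed = 1` plays in the cell above.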

## Run AutoML 

Run AutoML, stopping after 60 seconds.  The `max_runtime_secs` argument provides a way to limit the AutoML run by time.  When using a time-limited stopping criterion, the number of models trained will vary between runs.  If different hardware is used, or if the same machine is used but its available compute resources differ between runs, then AutoML may be able to train more models on one run than on another.

The `test` frame is passed explicitly to the `leaderboard_frame` argument here, which means that instead of using cross-validated metrics, we use test set metrics for generating the leaderboard.

In [8]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "repay_groupA_train")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |████████████████████████████████████████████████████████| 100%


For demonstration purposes, we will also execute a second AutoML run, this time providing the original, full dataset, `df` (without passing a `leaderboard_frame`).  This is a more efficient use of our data since we can use 100% of the data for training, rather than 80% like we did above.  This time our leaderboard will use cross-validated metrics.

*Note: Using an explicit `leaderboard_frame` for scoring may be useful in some cases, which is why the option is available.*  

In [9]:
aml2 = H2OAutoML(max_runtime_secs = 60, seed = 1, project_name = "repay_groupA_full_data")
aml2.train(y = y, training_frame = df)

AutoML progress: |████████████████████████████████████████████████████████| 100%
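
To make "cross-validated metrics" concrete, here is a minimal plain-Python sketch of k-fold cross-validation indices. The fold count (`k=5`) and the modulo assignment are assumptions chosen for simplicity, not settings read from this notebook, and `modulo_folds` is a hypothetical helper.

```python
def modulo_folds(n_rows, k):
    """Return k (train_idx, valid_idx) pairs; row i is held out in fold i % k."""
    folds = []
    for fold in range(k):
        valid_idx = [i for i in range(n_rows) if i % k == fold]
        train_idx = [i for i in range(n_rows) if i % k != fold]
        folds.append((train_idx, valid_idx))
    return folds

folds = modulo_folds(n_rows=1000, k=5)
for fold, (fold_train, fold_valid) in enumerate(folds):
    # Every fold trains on 800 rows and validates on the remaining 200.
    print(fold, len(fold_train), len(fold_valid))
```

Every row is held out exactly once, so all of the data contributes to both training and scoring. That is why no separate test frame is needed for the leaderboard in this run.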


## Leaderboard

Next, we will view the AutoML Leaderboard.  For the first run (`aml`) we passed a `leaderboard_frame` to `H2OAutoML.train()`, so its leaderboard ranks the models by their performance on that frame.

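The second run (`aml2`), trained on the full dataset, reports lower error: its leader has a mean residual deviance of 8.26 versus 9.12 for the first run.  The two leaderboards are scored differently (test set versus cross-validation), so this comparison is only indicative.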

In the case of regression, the default ranking metric is mean residual deviance.  A different ranking metric can be chosen with the `sort_metric` argument of `H2OAutoML`.

In [10]:
aml.leaderboard.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
StackedEnsemble_AllModels_AutoML_20210414_161300,9.1217,3.02022,9.1217,1.57776,0.112103
GBM_grid__1_AutoML_20210414_161300_model_5,9.15735,3.02611,9.15735,1.59303,0.113978
StackedEnsemble_BestOfFamily_AutoML_20210414_161300,9.17336,3.02876,9.17336,1.60447,0.112205
GBM_grid__1_AutoML_20210414_161300_model_1,9.30638,3.05064,9.30638,1.6044,0.107969
GBM_grid__1_AutoML_20210414_161300_model_6,9.34113,3.05633,9.34113,1.54901,0.108889
GBM_grid__1_AutoML_20210414_161300_model_7,9.47685,3.07845,9.47685,1.60668,0.1142
GBM_grid__1_AutoML_20210414_161300_model_4,9.56512,3.09275,9.56512,1.6562,0.119363
XGBoost_grid__1_AutoML_20210414_161300_model_9,9.56852,3.0933,9.56852,1.57901,0.113015
XGBoost_grid__1_AutoML_20210414_161300_model_5,9.66351,3.10862,9.66351,1.60561,0.101056
GBM_2_AutoML_20210414_161300,9.70189,3.11479,9.70189,1.56912,0.10384
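
To show how much the ranking depends on the chosen metric, the sketch below re-sorts a few rows copied from the leaderboard above in plain Python. It does not call H2O; `rows`, `COLUMNS`, and `rank_by` are names made up for this illustration.

```python
# A subset of the leaderboard above:
# (model_id, mean_residual_deviance, mae, rmsle)
rows = [
    ("StackedEnsemble_AllModels_AutoML_20210414_161300", 9.1217, 1.57776, 0.112103),
    ("GBM_grid__1_AutoML_20210414_161300_model_5", 9.15735, 1.59303, 0.113978),
    ("GBM_grid__1_AutoML_20210414_161300_model_6", 9.34113, 1.54901, 0.108889),
    ("XGBoost_grid__1_AutoML_20210414_161300_model_5", 9.66351, 1.60561, 0.101056),
]
COLUMNS = {"mean_residual_deviance": 1, "mae": 2, "rmsle": 3}

def rank_by(metric):
    """Return model ids sorted by the given metric, lowest (best) first."""
    col = COLUMNS[metric]
    return [row[0] for row in sorted(rows, key=lambda row: row[col])]

for metric in COLUMNS:
    # Print the best model under each metric.
    print(metric, "->", rank_by(metric)[0])
```

Among these rows, a different model comes out on top under each of the three metrics, so the "leader" is only the leader with respect to the ranking metric in use.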




Now we will view a snapshot of the top models from the second run.  The two Stacked Ensembles appear near the top of the leaderboard, although a single GBM ranks first here.  Stacked Ensembles usually outperform a single model, but not always.

In [11]:
aml2.leaderboard.head()

model_id,mean_residual_deviance,rmse,mse,mae,rmsle
GBM_grid__1_AutoML_20210414_162119_model_1,8.25813,2.8737,8.25813,1.53062,0.114514
StackedEnsemble_BestOfFamily_AutoML_20210414_162119,8.32766,2.88577,8.32766,1.54941,0.113013
GBM_grid__1_AutoML_20210414_162119_model_6,8.33682,2.88736,8.33682,1.48992,0.116533
StackedEnsemble_AllModels_AutoML_20210414_162119,8.354,2.89033,8.354,1.54947,0.114976
GBM_grid__1_AutoML_20210414_162119_model_5,8.38072,2.89495,8.38072,1.52868,0.125859
GBM_grid__1_AutoML_20210414_162119_model_7,8.38596,2.89585,8.38596,1.5274,0.125065
GBM_4_AutoML_20210414_162119,8.41215,2.90037,8.41215,1.4561,0.112324
GBM_3_AutoML_20210414_162119,8.4311,2.90364,8.4311,1.4655,0.112563
GBM_2_AutoML_20210414_162119,8.44393,2.90584,8.44393,1.48223,0.112353
GBM_grid__1_AutoML_20210414_162119_model_4,8.50258,2.91592,8.50258,1.54432,0.127694




## Predict Using Leader Model

If you need to generate predictions on a test set, you can make predictions on the `H2OAutoML` object directly, or on the leader model object.

In [12]:
pred = aml.predict(test)
pred.head()

stackedensemble prediction progress: |████████████████████████████████████| 100%


predict
3.91525
4.95684
8.37621
17.2936
19.3099
21.0827
21.0827
35.0734
35.7597
60.5727




If needed, the standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.

In [13]:
perf = aml.leader.model_performance(test)
perf


ModelMetricsRegressionGLM: stackedensemble
** Reported on test data. **

MSE: 9.12170436098709
RMSE: 3.020215946085162
MAE: 1.57775843435667
RMSLE: 0.11210334774997908
R^2: 0.9912216659230666
Mean Residual Deviance: 9.12170436098709
Null degrees of freedom: 205
Residual degrees of freedom: 194
Null deviance: 214883.8855244431
Residual deviance: 1879.0710983633408
AIC: 1065.9979493528472
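
As a quick sanity check on how these numbers relate, the sketch below recomputes two of them from the others using only the values printed above. It assumes the null degrees of freedom equal the number of test rows minus one (an intercept-only null model), which is my reading of the output rather than something stated in it.

```python
import math

# Values copied from the performance output above.
mse = 9.12170436098709
rmse = 3.020215946085162
residual_deviance = 1879.0710983633408
null_dof = 205

# Assumption: null degrees of freedom = n - 1, so the test frame has 206 rows.
n_rows = null_dof + 1

# RMSE should be the square root of MSE.
rmse_matches = math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9)

# Mean residual deviance is printed with the same value as MSE, so the
# residual deviance divided by the row count should reproduce it.
deviance_matches = math.isclose(residual_deviance / n_rows, mse, rel_tol=1e-9)

print(n_rows, rmse_matches, deviance_matches)
```

Under that assumption the test frame holds 206 of the 1000 rows, which is consistent with an approximate 20% split.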


