## Install H2O

### Install dependencies

```sh
pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future
```

Run the following command to remove any existing H2O module for Python.

```sh
pip uninstall h2o
```

Use pip to install the latest stable version of the H2O Python module.

```sh
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
```
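
Before moving on, it can help to confirm that the packages installed above are visible to the interpreter the notebook uses. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of H2O. H2O also needs a Java runtime, which this check does not cover.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Module names matching the pip packages installed above.
for name in ("requests", "tabulate", "colorama", "future", "h2o"):
    status = "ok" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

Any line reporting `MISSING` usually means pip installed into a different environment than the one running the notebook.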



Reference: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html

In [98]:
# imports
import h2o
from h2o.automl import H2OAutoML, get_leaderboard

import pandas as pd

In [5]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_231"; Java(TM) SE Runtime Environment (build 1.8.0_231-b11); Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
  Starting server from /opt/anaconda3/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/5k/k0w7dgvs3116zdkj7l2msgyr0000gn/T/tmpd58ircs4
  JVM stdout: /var/folders/5k/k0w7dgvs3116zdkj7l2msgyr0000gn/T/tmpd58ircs4/h2o_josearevalo_started_from_python.out
  JVM stderr: /var/folders/5k/k0w7dgvs3116zdkj7l2msgyr0000gn/T/tmpd58ircs4/h2o_josearevalo_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,06 secs
H2O cluster timezone:,America/Bogota
H2O data parsing timezone:,UTC
H2O cluster version:,3.28.0.2
H2O cluster version age:,"7 days, 15 hours and 24 minutes"
H2O cluster name:,H2O_from_python_josearevalo_ybfwus
H2O cluster total nodes:,1
H2O cluster free memory:,3.556 Gb
H2O cluster total cores:,0
H2O cluster allowed cores:,0


In [52]:
# Uploading a File
titanic_df = h2o.upload_file("data/titanic_data.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [53]:
titanic_df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803.0,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450.0,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877.0,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463.0,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909.0,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742.0,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736.0,30.0708,,C




In [54]:
# Select every column except Name
list_columns = titanic_df.columns
list_columns.remove("Name")

new_titanic_df = titanic_df[:,list_columns]

In [58]:
# Splitting Datasets into Training/Testing/Validating
train,test,valid = new_titanic_df.split_frame(ratios=[.7, .15])
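
`split_frame` returns one more frame than the number of ratios passed: the last frame receives whatever fraction is left over. The sketch below only illustrates that arithmetic; `implied_split_ratios` is a hypothetical helper, not an H2O function. Note that H2O splits rows probabilistically, so the actual row counts only approximate these fractions, and passing `seed=` to `split_frame` makes the split reproducible.

```python
def implied_split_ratios(ratios):
    """Return the given ratios plus the leftover fraction for the last frame."""
    remainder = 1.0 - sum(ratios)
    if remainder <= 0:
        raise ValueError("ratios must sum to less than 1")
    return list(ratios) + [remainder]


# ratios=[.7, .15] therefore yields three frames of roughly 70%, 15% and 15%.
print([round(r, 2) for r in implied_split_ratios([0.7, 0.15])])
```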

In [59]:
train

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,1,1,female,35,1,0,113803.0,53.1,C123,S
5,0,3,male,35,0,0,373450.0,8.05,,S
7,0,1,male,54,0,0,17463.0,51.8625,E46,S
8,0,3,male,2,3,1,349909.0,21.075,,S
9,1,3,female,27,0,2,347742.0,11.1333,,S
10,1,2,female,14,1,0,237736.0,30.0708,,C
11,1,3,female,4,1,1,,16.7,G6,S
13,0,3,male,20,0,0,,8.05,,S
14,0,3,male,39,1,5,347082.0,31.275,,S
15,0,3,female,14,0,0,350406.0,7.8542,,S




# Train Model

In [61]:
# Identify predictors and response

x = train.columns
y = "Survived"
x.remove(y)

In [62]:
# Run AutoML for 20 base models (limited to 1 hour max runtime by default)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%
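
Because `Survived` is parsed as a numeric column, AutoML treats this as a regression problem, which is why the leaderboard reports RMSE and MAE rather than classification metrics. If classification is wanted instead, the response can be converted to a categorical column with the H2OFrame method `asfactor()` before training. The helper below is a small sketch; its name is hypothetical, and it assumes it is given an H2OFrame.

```python
def as_classification_target(frame, response):
    """Convert the response column to categorical so the task is classification.

    `frame` is expected to be an H2OFrame; `asfactor()` turns a numeric
    column into an enum (categorical) column.
    """
    frame[response] = frame[response].asfactor()
    return frame


# Usage (assumes the `train`, `x`, `y` and `aml` objects from the cells above):
# train = as_classification_target(train, "Survived")
# aml.train(x=x, y=y, training_frame=train)
```

The outputs shown in this notebook come from the regression run, so they would differ after this change.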


In [67]:
# AutoML Leaderboard
lb = aml.leaderboard

# Optionally add extra model information to the leaderboard
lb = get_leaderboard(aml, extra_columns='ALL')

# Print all rows (instead of default 10 rows)
lb.head(rows=lb.nrows)

model_id,mean_residual_deviance,rmse,mse,mae,rmsle,training_time_ms,predict_time_per_row_ms
StackedEnsemble_BestOfFamily_AutoML_20200128_104543,0.129441,0.359779,0.129441,0.259778,0.253514,282,0.189275
StackedEnsemble_AllModels_AutoML_20200128_104543,0.12981,0.360292,0.12981,0.258552,0.253693,441,0.292856
XGBoost_grid__1_AutoML_20200128_104543_model_3,0.132524,0.364039,0.132524,0.274962,0.256731,157,0.004456
GBM_2_AutoML_20200128_104543,0.133766,0.365741,0.133766,0.260534,0.2597,96,0.00978
XGBoost_2_AutoML_20200128_104543,0.134051,0.36613,0.134051,0.268579,0.259326,379,0.005752
XGBoost_3_AutoML_20200128_104543,0.134363,0.366556,0.134363,0.26964,0.258806,173,0.004403
XGBoost_grid__1_AutoML_20200128_104543_model_1,0.136073,0.36888,0.136073,0.275143,0.261104,199,0.004226
DRF_1_AutoML_20200128_104543,0.136207,0.369062,0.136207,0.261817,0.263313,211,0.011009
XGBoost_grid__1_AutoML_20200128_104543_model_4,0.136816,0.369887,0.136816,0.260236,0.260428,175,0.003521
XGBoost_1_AutoML_20200128_104543,0.136943,0.370058,0.136943,0.269004,0.263258,337,0.006275
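
To read the leaderboard columns: `rmse` is the square root of `mse`, and in this table `mean_residual_deviance` is identical to `mse`. The short check below verifies the first relationship on values copied from the first three rows; it is a sanity check on the printed numbers, not part of the H2O API.

```python
import math

# (mse, rmse) pairs copied from the first three leaderboard rows.
rows = [
    (0.129441, 0.359779),
    (0.12981, 0.360292),
    (0.132524, 0.364039),
]

# Largest difference between sqrt(mse) and the reported rmse.
max_gap = max(abs(math.sqrt(mse) - rmse) for mse, rmse in rows)
print(f"largest gap between sqrt(mse) and rmse: {max_gap:.2e}")
```

The gap is within rounding of the six printed digits.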




In [68]:
aml.leader

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_BestOfFamily_AutoML_20200128_104543

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.059280959731899084
RMSE: 0.2434768155942144
MAE: 0.1909033935246831
RMSLE: 0.17274741676458505
R^2: 0.7515410560807042
Mean Residual Deviance: 0.059280959731899084
Null degrees of freedom: 617
Residual degrees of freedom: 612
Null deviance: 147.4514563106787
Residual deviance: 36.635633114313634
AIC: 21.669354101690367

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.1294408697593839
RMSE: 0.3597789178917852
MAE: 0.2597781488400081
RMSLE: 0.25351420949307085
R^2: 0.4574861482361269
Mean Residual Deviance: 0.1294408697593839
Null degrees of freedom: 617
Residual degrees of freedom: 612
Null deviance: 147.9408217315209
Residual deviance: 79.99445751129926
AIC: 504.2878033521391




In [84]:
xgboost_model = h2o.get_model("XGBoost_grid__1_AutoML_20200128_104543_model_3")

In [85]:
xgboost_model

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_grid__1_AutoML_20200128_104543_model_3


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees
0,,76.0




ModelMetricsRegression: xgboost
** Reported on train data. **

MSE: 0.08072526948100471
RMSE: 0.2841219271386929
MAE: 0.21374978737537914
RMSLE: 0.20022232068238466
Mean Residual Deviance: 0.08072526948100471

ModelMetricsRegression: xgboost
** Reported on cross-validation data. **

MSE: 0.13252414626725725
RMSE: 0.3640386604019651
MAE: 0.274961621147915
RMSLE: 0.25673095996180084
Mean Residual Deviance: 0.13252414626725725

Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,mae,0.27495655,0.024647804,0.2589342,0.31365514,0.25540826,0.2854079,0.26137733
1,mean_residual_deviance,0.13250831,0.018998807,0.11827999,0.16488928,0.12414639,0.13373649,0.121489376
2,mse,0.13250831,0.018998807,0.11827999,0.16488928,0.12414639,0.13373649,0.121489376
3,r2,0.4410094,0.06843915,0.49312344,0.33768615,0.4725408,0.40526178,0.49643484
4,residual_deviance,0.13250831,0.018998807,0.11827999,0.16488928,0.12414639,0.13373649,0.121489376
5,rmse,0.36331633,0.025237828,0.3439186,0.4060656,0.35234413,0.3657,0.34855327
6,rmsle,0.2565589,0.010335254,0.2508468,0.27130812,0.25091746,0.26321307,0.24650908



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
0,,2020-01-28 10:45:59,2.357 sec,0.0,0.5,0.5,0.25
1,,2020-01-28 10:45:59,2.364 sec,5.0,0.449273,0.446672,0.201846
2,,2020-01-28 10:45:59,2.370 sec,10.0,0.413017,0.404718,0.170583
3,,2020-01-28 10:45:59,2.376 sec,15.0,0.385776,0.370169,0.148823
4,,2020-01-28 10:45:59,2.382 sec,20.0,0.365731,0.341904,0.133759
5,,2020-01-28 10:45:59,2.389 sec,25.0,0.348754,0.316602,0.121629
6,,2020-01-28 10:45:59,2.397 sec,30.0,0.336917,0.296584,0.113513
7,,2020-01-28 10:45:59,2.405 sec,35.0,0.328477,0.281992,0.107897
8,,2020-01-28 10:45:59,2.413 sec,40.0,0.320476,0.26849,0.102705
9,,2020-01-28 10:45:59,2.422 sec,45.0,0.312659,0.254805,0.097756



Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,Sex.female,222.511795,1.0,0.299265
1,Ticket,105.42524,0.473796,0.141791
2,Age,102.114342,0.458917,0.137338
3,Fare,82.622505,0.371317,0.111122
4,Sex.male,49.023804,0.22032,0.065934
5,PassengerId,48.676464,0.218759,0.065467
6,Pclass,45.040035,0.202416,0.060576
7,Cabin.missing(NA),33.614361,0.151068,0.045209
8,SibSp,24.571133,0.110426,0.033047
9,Parch,14.974156,0.067296,0.020139
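
In the table above, `scaled_importance` appears to be `relative_importance` divided by the largest relative importance, so the top variable always scores 1.0. The sketch below reproduces that column from the first four rows; the dictionary is copied from the output, and the interpretation is inferred from the numbers rather than taken from the H2O documentation.

```python
# relative_importance values copied from the first four rows of the table.
relative_importance = {
    "Sex.female": 222.511795,
    "Ticket": 105.42524,
    "Age": 102.114342,
    "Fare": 82.622505,
}

# Divide every value by the largest one to get the scaled importance.
top = max(relative_importance.values())
scaled = {name: value / top for name, value in relative_importance.items()}

for name, value in scaled.items():
    print(f"{name}: {value:.6f}")
```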




In [104]:
# save model
leader_model = aml.leader

In [105]:
h2o.save_model(model=leader_model, path="models/leader_model", force=True)

'/Users/josearevalo/Documents/develop/pyspark_h2o/models/leader_model/StackedEnsemble_BestOfFamily_AutoML_20200128_104543'

## Or download a MOJO
The MOJO import functionality provides a means to use external, pre-trained models in H2O - mainly for the purpose of scoring. Depending on each external model, metrics and other model information might be obtained as well.

In [87]:
# Note: the stacked-ensemble leader triggers the alert "Unsupported MOJO model 'stackedensemble'", so the XGBoost model is used below.

In [88]:
xgboost_model.download_mojo("models/xgboost_model.zip")

'/Users/josearevalo/Documents/develop/pyspark_h2o/models/xgboost_model.zip'

# Predict data

In [106]:
# Load Model
my_model = h2o.load_model("models/leader_model/StackedEnsemble_BestOfFamily_AutoML_20200128_104543")

In [107]:
my_model

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_BestOfFamily_AutoML_20200128_104543

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 0.059280959731899084
RMSE: 0.2434768155942144
MAE: 0.1909033935246831
RMSLE: 0.17274741676458505
R^2: 0.7515410560807042
Mean Residual Deviance: 0.059280959731899084
Null degrees of freedom: 617
Residual degrees of freedom: 612
Null deviance: 147.4514563106787
Residual deviance: 36.635633114313634
AIC: 21.669354101690367

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.1294408697593839
RMSE: 0.3597789178917852
MAE: 0.2597781488400081
RMSLE: 0.25351420949307085
R^2: 0.4574861482361269
Mean Residual Deviance: 0.1294408697593839
Null degrees of freedom: 617
Residual degrees of freedom: 612
Null deviance: 147.9408217315209
Residual deviance: 79.99445751129926
AIC: 504.2878033521391




## Or import the MOJO

In [89]:
my_model_mojo = h2o.import_mojo("models/xgboost_model.zip")
# Note: a stacked-ensemble MOJO is rejected with "Unsupported MOJO model 'stackedensemble'"

generic Model Build progress: |███████████████████████████████████████████| 100%
Model Details
H2OGenericEstimator :  Import MOJO Model
Model Key:  Generic_model_python_1580225176177_3


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees
0,,76.0




ModelMetricsRegressionGeneric: generic
** Reported on train data. **

MSE: 0.08072526948100471
RMSE: 0.2841219271386929
MAE: 0.21374978737537914
RMSLE: 0.20022232068238466
Mean Residual Deviance: 0.08072526948100471

ModelMetricsRegressionGeneric: generic
** Reported on cross-validation data. **

MSE: 0.13252414626725725
RMSE: 0.3640386604019651
MAE: 0.274961621147915
RMSLE: 0.25673095996180084
Mean Residual Deviance: 0.13252414626725725

Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
0,mae,0.27495655,0.024647804,0.2589342,0.31365514,0.25540826,0.2854079,0.26137733
1,mean_residual_deviance,0.13250831,0.018998807,0.11827999,0.16488928,0.12414639,0.13373649,0.121489376
2,mse,0.13250831,0.018998807,0.11827999,0.16488928,0.12414639,0.13373649,0.121489376
3,r2,0.4410094,0.06843915,0.49312344,0.33768615,0.4725408,0.40526178,0.49643484
4,residual_deviance,0.13250831,0.018998807,0.11827999,0.16488928,0.12414639,0.13373649,0.121489376
5,rmse,0.36331633,0.025237828,0.3439186,0.4060656,0.35234413,0.3657,0.34855327
6,rmsle,0.2565589,0.010335254,0.2508468,0.27130812,0.25091746,0.26321307,0.24650908



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_mae,training_deviance
0,,2020-01-28 10:45:59,2.357 sec,0.0,0.5,0.5,0.25
1,,2020-01-28 10:45:59,2.364 sec,5.0,0.449273,0.446672,0.201846
2,,2020-01-28 10:45:59,2.370 sec,10.0,0.413017,0.404718,0.170583
3,,2020-01-28 10:45:59,2.376 sec,15.0,0.385776,0.370169,0.148823
4,,2020-01-28 10:45:59,2.382 sec,20.0,0.365731,0.341904,0.133759
5,,2020-01-28 10:45:59,2.389 sec,25.0,0.348754,0.316602,0.121629
6,,2020-01-28 10:45:59,2.397 sec,30.0,0.336917,0.296584,0.113513
7,,2020-01-28 10:45:59,2.405 sec,35.0,0.328477,0.281992,0.107897
8,,2020-01-28 10:45:59,2.413 sec,40.0,0.320476,0.26849,0.102705
9,,2020-01-28 10:45:59,2.422 sec,45.0,0.312659,0.254805,0.097756



Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,Sex.female,222.511795,1.0,0.299265
1,Ticket,105.42524,0.473796,0.141791
2,Age,102.114342,0.458917,0.137338
3,Fare,82.622505,0.371317,0.111122
4,Sex.male,49.023804,0.22032,0.065934
5,PassengerId,48.676464,0.218759,0.065467
6,Pclass,45.040035,0.202416,0.060576
7,Cabin.missing(NA),33.614361,0.151068,0.045209
8,SibSp,24.571133,0.110426,0.033047
9,Parch,14.974156,0.067296,0.020139





In [74]:
test

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
29,1,3,female,,0,0,330959.0,7.8792,,Q
38,0,3,male,21.0,0,0,,8.05,,S
40,1,3,female,14.0,1,0,2651.0,11.2417,,C
41,0,3,female,40.0,1,0,7546.0,9.475,,S
42,0,2,female,27.0,1,0,11668.0,21.0,,S
44,1,2,female,3.0,1,2,,41.5792,,C
46,0,3,male,,0,0,,8.05,,S
52,0,3,male,21.0,0,0,,7.8,,S
67,1,2,female,29.0,0,0,,10.5,F33,S
68,0,3,male,19.0,0,0,,8.1583,,S




In [90]:
# predict with h2o model
my_model.predict(test)

stackedensemble prediction progress: |████████████████████████████████████| 100%


predict
0.857748
0.213635
0.523079
0.429477
0.914368
0.879132
0.158652
0.218108
0.842457
0.276471




In [91]:
# predict with mojo model
my_model_mojo.predict(test)

generic prediction progress: |████████████████████████████████████████████| 100%


predict
0.785451
0.222045
0.582189
0.408003
0.939314
0.79231
0.0943139
0.221789
0.79291
0.220339




In [94]:
test["prediction_a"] = my_model.predict(test)

stackedensemble prediction progress: |████████████████████████████████████| 100%


In [95]:
test["prediction_b"] = my_model_mojo.predict(test)

generic prediction progress: |████████████████████████████████████████████| 100%


In [96]:
test

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,prediction_a,prediction_b
29,1,3,female,,0,0,330959.0,7.8792,,Q,0.857748,0.785451
38,0,3,male,21.0,0,0,,8.05,,S,0.213635,0.222045
40,1,3,female,14.0,1,0,2651.0,11.2417,,C,0.523079,0.582189
41,0,3,female,40.0,1,0,7546.0,9.475,,S,0.429477,0.408003
42,0,2,female,27.0,1,0,11668.0,21.0,,S,0.914368,0.939314
44,1,2,female,3.0,1,2,,41.5792,,C,0.879132,0.79231
46,0,3,male,,0,0,,8.05,,S,0.158652,0.0943139
52,0,3,male,21.0,0,0,,7.8,,S,0.218108,0.221789
67,1,2,female,29.0,0,0,,10.5,F33,S,0.842457,0.79291
68,0,3,male,19.0,0,0,,8.1583,,S,0.276471,0.220339
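
Since the models were trained as regressors, the predictions are scores rather than class labels. One simple way to compare them with `Survived` is to apply a cut-off. The sketch below uses the ten rows displayed above and assumes a threshold of 0.5, which is an illustrative choice and not something the notebook prescribes; the `accuracy` helper is hypothetical.

```python
# Values copied from the ten rows of the `test` frame shown above.
survived = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
prediction_a = [0.857748, 0.213635, 0.523079, 0.429477, 0.914368,
                0.879132, 0.158652, 0.218108, 0.842457, 0.276471]
prediction_b = [0.785451, 0.222045, 0.582189, 0.408003, 0.939314,
                0.79231, 0.0943139, 0.221789, 0.79291, 0.220339]


def accuracy(actual, scores, threshold=0.5):
    """Fraction of rows where the thresholded score matches the actual label."""
    labels = [1 if score >= threshold else 0 for score in scores]
    hits = sum(1 for a, p in zip(actual, labels) if a == p)
    return hits / len(actual)


print("stacked ensemble:", accuracy(survived, prediction_a))
print("XGBoost MOJO:    ", accuracy(survived, prediction_b))
```

On these ten rows both models agree on every label and miss only passenger 42. Ten rows are far too few to rank the models.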




# Save predictions

In [97]:
# save as CSV
h2o.export_file(test, path="data/prediction.csv")

Export File progress: |███████████████████████████████████████████████████| 100%


In [101]:
# inspect the csv
pd.read_csv("data/prediction.csv")

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,prediction_a,prediction_b
0,29,1,3,female,,0,0,330959.0,7.8792,,Q,0.857748,0.785451
1,38,0,3,male,21.0,0,0,,8.0500,,S,0.213635,0.222045
2,40,1,3,female,14.0,1,0,2651.0,11.2417,,C,0.523079,0.582189
3,41,0,3,female,40.0,1,0,7546.0,9.4750,,S,0.429477,0.408003
4,42,0,2,female,27.0,1,0,11668.0,21.0000,,S,0.914368,0.939314
...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,874,0,3,male,47.0,0,0,345765.0,9.0000,,S,0.076356,0.056570
140,877,0,3,male,20.0,0,0,7534.0,9.8458,,S,0.110282,0.117355
141,880,1,1,female,56.0,0,1,11767.0,83.1583,C50,C,0.859387,0.829094
142,882,0,3,male,33.0,0,0,349257.0,7.8958,,S,0.121145,0.090279


In [103]:
# or convert the H2OFrame to a pandas DataFrame
test.as_data_frame()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,prediction_a,prediction_b
0,29,1,3,female,,0,0,330959.0,7.8792,,Q,0.857748,0.785451
1,38,0,3,male,21.0,0,0,,8.0500,,S,0.213635,0.222045
2,40,1,3,female,14.0,1,0,2651.0,11.2417,,C,0.523079,0.582189
3,41,0,3,female,40.0,1,0,7546.0,9.4750,,S,0.429477,0.408003
4,42,0,2,female,27.0,1,0,11668.0,21.0000,,S,0.914368,0.939314
...,...,...,...,...,...,...,...,...,...,...,...,...,...
139,874,0,3,male,47.0,0,0,345765.0,9.0000,,S,0.076356,0.056570
140,877,0,3,male,20.0,0,0,7534.0,9.8458,,S,0.110282,0.117355
141,880,1,1,female,56.0,0,1,11767.0,83.1583,C50,C,0.859387,0.829094
142,882,0,3,male,33.0,0,0,349257.0,7.8958,,S,0.121145,0.090279


# References

### Installation

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html

### AutoML

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html