## Import required packages

We use H2O AutoML, the automated machine learning module of the h2o package. It automatically trains several kinds of models, including GBM, XGBoost, and DeepLearning, and also builds Stacked Ensembles from them.

In [1]:
import numpy as np
import pandas as pd
import h2o
from h2o.automl import H2OAutoML


#### Start an H2O cluster with a maximum memory size of 16 GB.

In [2]:
h2o.init(max_mem_size='16G')

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_212"; OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-2~deb9u1-b03); OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpli2eq1rw
  JVM stdout: /tmp/tmpli2eq1rw/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpli2eq1rw/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,02 secs
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.5
H2O cluster version age:,9 days
H2O cluster name:,H2O_from_python_unknownUser_5r1o2n
H2O cluster total nodes:,1
H2O cluster free memory:,14.22 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


#### Set the paths to the training and test datasets.

In [3]:
train_path = '../input/train.csv'
test_path = '../input/test.csv'

#### Our target variable is binary, so we set its column type to 'enum' (categorical) when loading it into an H2O frame.

In [4]:
col_types = {'target': 'enum'}

#### Load the training and test data into H2O frames.

In [5]:
train = h2o.import_file(path=train_path, col_types=col_types)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
test = h2o.import_file(path=test_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


#### Assign the target column 'target' to y, and use every other column except 'ID_code' as the predictors X.

In [7]:
y = 'target'
X = [name for name in train.columns if name not in ['ID_code', y]]
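
The column filter above is a plain list comprehension, so it can be tried without H2O. A minimal sketch, assuming a made-up list of column names (`var_0`, `var_1`, `var_2` are hypothetical stand-ins for the real feature columns):

```python
# Hypothetical column names standing in for train.columns.
columns = ['ID_code', 'target', 'var_0', 'var_1', 'var_2']

y = 'target'
# Keep every column that is neither the row identifier nor the target.
X = [name for name in columns if name not in ['ID_code', y]]

print(X)
```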

#### Make sure 'target' is a factor (categorical) column.

In [8]:
train[y] = train[y].asfactor()

## Train the model
H2OAutoML needs very little configuration. We cap the run at 50 models or 5 hours (18000 seconds), whichever comes first, and then train on the training data with the X and y defined above.

In [9]:
model = H2OAutoML(max_models=50,
                  max_runtime_secs=18000,
                  seed=12345)
model.train(x=X, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%
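
The two limits passed to H2OAutoML act as alternative stopping conditions. The sketch below only illustrates the "whichever comes first" idea in plain Python. `should_stop` is a hypothetical helper, not part of the H2O API, and it does not reproduce how H2O schedules models internally.

```python
MAX_MODELS = 50
MAX_RUNTIME_SECS = 5 * 60 * 60  # 5 hours expressed in seconds


def should_stop(models_built, elapsed_secs):
    """Return True once either budget is used up (hypothetical helper)."""
    return models_built >= MAX_MODELS or elapsed_secs >= MAX_RUNTIME_SECS


print(MAX_RUNTIME_SECS)
print(should_stop(models_built=9, elapsed_secs=18000))
print(should_stop(models_built=9, elapsed_secs=600))
```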


We list all the models that we built.

In [10]:
lb = model.leaderboard
lb.head(rows=lb.nrows)

model_id,auc,logloss,mean_per_class_error,rmse,mse
StackedEnsemble_AllModels_AutoML_20190628_162741,0.885675,0.21422,0.242116,0.246741,0.0608813
StackedEnsemble_BestOfFamily_AutoML_20190628_162741,0.88356,0.215487,0.24748,0.247551,0.0612816
XGBoost_1_AutoML_20190628_162741,0.8821,0.225393,0.250629,0.255693,0.0653787
XGBoost_2_AutoML_20190628_162741,0.882043,0.227577,0.252034,0.257723,0.0664212
XGBoost_3_AutoML_20190628_162741,0.863853,0.246898,0.266002,0.265731,0.0706128
GLM_grid_1_AutoML_20190628_162741_model_1,0.859696,0.316467,0.26647,0.297613,0.0885737
GBM_1_AutoML_20190628_162741,0.844846,0.25356,0.289414,0.269024,0.0723738
DRF_1_AutoML_20190628_162741,0.806128,0.278247,0.309462,0.282238,0.0796581
GBM_2_AutoML_20190628_162741,0.741154,0.305155,0.343406,0.292812,0.0857386
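
Two quick sanity checks can be run on the leaderboard by hand: the rows should be sorted by AUC in descending order, and each rmse should be the square root of the corresponding mse. The sketch below copies the values from the table above into plain Python tuples, with shortened model names. The 1e-5 tolerance is an assumption that allows for rounding in the displayed figures.

```python
import math

# (shortened model name, auc, rmse, mse) copied from the leaderboard above.
rows = [
    ('StackedEnsemble_AllModels', 0.885675, 0.246741, 0.0608813),
    ('StackedEnsemble_BestOfFamily', 0.88356, 0.247551, 0.0612816),
    ('XGBoost_1', 0.8821, 0.255693, 0.0653787),
    ('XGBoost_2', 0.882043, 0.257723, 0.0664212),
    ('XGBoost_3', 0.863853, 0.265731, 0.0706128),
    ('GLM_grid_1_model_1', 0.859696, 0.297613, 0.0885737),
    ('GBM_1', 0.844846, 0.269024, 0.0723738),
    ('DRF_1', 0.806128, 0.282238, 0.0796581),
    ('GBM_2', 0.741154, 0.292812, 0.0857386),
]

# Check 1: AUC never increases from one row to the next.
aucs = [auc for _, auc, _, _ in rows]
sorted_by_auc = aucs == sorted(aucs, reverse=True)

# Check 2: rmse agrees with sqrt(mse) up to display rounding.
rmse_consistent = all(
    math.isclose(rmse, math.sqrt(mse), abs_tol=1e-5)
    for _, _, rmse, mse in rows
)

print(sorted_by_auc, rmse_consistent)
```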




model.leader is the model with the highest AUC on the leaderboard. The AUC of 0.8857 shown there is the cross-validation value reported in the output below; the training AUC is 0.9844. model.leader.predict returns the predictions of 'target' for our test data, and we store them in a new frame 'result'.

In [11]:
model.leader

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_AutoML_20190628_162741
No model summary for this model


ModelMetricsBinomialGLM: stackedensemble
** Reported on train data. **

MSE: 0.021752293535060672
RMSE: 0.147486587644642
LogLoss: 0.09665825493700282
Null degrees of freedom: 199999
Residual degrees of freedom: 199993
Null deviance: 130463.31259094302
Residual deviance: 38663.30197480113
AIC: 38677.30197480113
AUC: 0.9844170899641925
pr_auc: 0.898728063094399
Gini: 0.968834179928385
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4568611272403024: 


,0.0,1.0,Error,Rate
0,178170.0,1732.0,0.0096,(1732.0/179902.0)
1,3151.0,16947.0,0.1568,(3151.0/20098.0)
Total,181321.0,18679.0,0.0244,(4883.0/200000.0)
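
The figures in this confusion matrix can be tied back to the "max f1" entry of the metrics table. A minimal sketch, assuming the usual definitions of precision, recall, and F1, with the four counts copied from the training-data matrix above:

```python
# Counts from the training-data confusion matrix (rows: actual, cols: predicted).
tn, fp = 178170, 1732   # actual 0
fn, tp = 3151, 16947    # actual 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Per-class and overall error rates, as shown in the Error column.
error_0 = fp / (tn + fp)
error_1 = fn / (fn + tp)
error_total = (fp + fn) / (tn + fp + fn + tp)

print(round(f1, 7))
print(round(error_0, 4), round(error_1, 4), round(error_total, 4))
```

The F1 value computed this way agrees with the reported max f1 of 0.8740748 at the threshold 0.4568611.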


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.4568611,0.8740748,155.0
max f2,0.2844986,0.8809430,203.0
max f0point5,0.6134227,0.9132291,117.0
max accuracy,0.5088979,0.97576,141.0
max precision,0.9993322,1.0,0.0
max recall,0.0042653,1.0,396.0
max specificity,0.9993322,1.0,0.0
max absolute_mcc,0.4568611,0.8612841,155.0
max min_per_class_accuracy,0.1853497,0.9414370,241.0


Gains/Lift Table: Avg response rate: 10.05 %, avg score: 12.88 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.01,0.9959870,9.9512389,9.9512389,1.0,0.9981963,1.0,0.9981963,0.0995124,0.0995124,895.1238929,895.1238929
,2,0.02,0.9881690,9.9512389,9.9512389,1.0,0.9925042,1.0,0.9953503,0.0995124,0.1990248,895.1238929,895.1238929
,3,0.03,0.9740164,9.9164096,9.9396292,0.9965,0.9817759,0.9988333,0.9908255,0.0991641,0.2981889,891.6409593,893.9629150
,4,0.04,0.9493058,9.9164096,9.9338243,0.9965,0.9627686,0.99825,0.9838112,0.0991641,0.3973530,891.6409593,893.3824261
,5,0.05,0.9076345,9.8069460,9.9084486,0.9855,0.9300725,0.9957,0.9730635,0.0980695,0.4954224,880.6945965,890.8448602
,6,0.1,0.3886872,7.4365609,8.6725047,0.7473,0.6723948,0.8715,0.8227291,0.3718280,0.8672505,643.6560852,767.2504727
,7,0.15,0.1786848,1.5225396,6.2891830,0.153,0.2581299,0.632,0.6345294,0.0761270,0.9433775,52.2539556,528.9183003
,8,0.2,0.1158177,0.4229277,4.8226192,0.0425,0.1428028,0.484625,0.5115977,0.0211464,0.9645238,-57.7072346,382.2619166
,9,0.3,0.0674019,0.1686735,3.2713039,0.01695,0.0876252,0.3287333,0.3702736,0.0168673,0.9813912,-83.1326500,227.1303944
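
The lift and gain columns follow from the response rates. A minimal sketch, assuming that lift is the group's response rate divided by the overall response rate and that gain is the lift expressed as a percentage above 1. The overall rate is taken from the confusion matrix (20098 positives in 200000 rows), and the three response rates are copied from groups 1, 6, and 8 of the training-data table above.

```python
positives, total = 20098, 200000
avg_response_rate = positives / total  # about 10.05 %


def lift(response_rate):
    """Response rate of a group relative to the overall response rate."""
    return response_rate / avg_response_rate


def gain(response_rate):
    """Lift expressed as a percentage above (or below) the baseline."""
    return (lift(response_rate) - 1) * 100


for group, rate in [(1, 1.0), (6, 0.7473), (8, 0.0425)]:
    print(group, round(lift(rate), 7), round(gain(rate), 5))
```

A negative gain, as in group 8, means that group contains fewer positives than a random sample of the same size would.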




ModelMetricsBinomialGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 0.06088127363251494
RMSE: 0.24674130913269254
LogLoss: 0.21422026621993717
Null degrees of freedom: 199999
Residual degrees of freedom: 199993
Null deviance: 130463.64873441865
Residual deviance: 85688.10648797489
AIC: 85702.10648797489
AUC: 0.8856750745429397
pr_auc: 0.5757144224554498
Gini: 0.7713501490858794
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.22445461673397699: 


,0.0,1.0,Error,Rate
0,169670.0,10232.0,0.0569,(10232.0/179902.0)
1,8589.0,11509.0,0.4274,(8589.0/20098.0)
Total,178259.0,21741.0,0.0941,(18821.0/200000.0)
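
Averaging the two per-class error rates of this cross-validation matrix gives the mean_per_class_error listed for the leader on the leaderboard. That match is an observation from the numbers in this notebook, not a statement about how H2O defines the leaderboard column in general. A minimal sketch:

```python
# Counts from the cross-validation confusion matrix (rows: actual, cols: predicted).
tn, fp = 169670, 10232  # actual 0
fn, tp = 8589, 11509    # actual 1

error_0 = fp / (tn + fp)  # share of actual 0 predicted as 1
error_1 = fn / (fn + tp)  # share of actual 1 predicted as 0
mean_per_class_error = (error_0 + error_1) / 2

print(round(error_0, 4), round(error_1, 4))
print(round(mean_per_class_error, 6))
```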


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.2244546,0.5501566,216.0
max f2,0.1100197,0.6200839,279.0
max f0point5,0.4873342,0.6018503,129.0
max accuracy,0.4873342,0.92179,129.0
max precision,0.9982226,1.0,0.0
max recall,0.0042180,1.0,398.0
max specificity,0.9982226,1.0,0.0
max absolute_mcc,0.2740027,0.4999402,197.0
max min_per_class_accuracy,0.0835569,0.8032929,298.0


Gains/Lift Table: Avg response rate: 10.05 %, avg score: 10.05 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.01,0.9418959,9.1451886,9.1451886,0.919,0.9751056,0.919,0.9751056,0.0914519,0.0914519,814.5188576,814.5188576
,2,0.02,0.8351910,8.0704548,8.6078217,0.811,0.8909154,0.865,0.9330105,0.0807045,0.1721564,707.0454772,760.7821674
,3,0.03,0.7149600,7.2096726,8.1417720,0.7245,0.7755185,0.8181667,0.8805132,0.0720967,0.2442532,620.9672604,714.1771984
,4,0.04,0.5948668,6.1996219,7.6562345,0.623,0.6533036,0.769375,0.8237108,0.0619962,0.3062494,519.9621853,665.6234451
,5,0.05,0.4993392,5.2343517,7.1718579,0.526,0.5445727,0.7207,0.7678832,0.0523435,0.3585929,423.4351677,617.1857896
,6,0.1,0.2448208,3.7794805,5.4756692,0.3798,0.3483853,0.55025,0.5581342,0.1889740,0.5475669,277.9480545,447.5669221
,7,0.15,0.1537950,2.3345607,4.4286330,0.2346,0.1931268,0.4450333,0.4364651,0.1167280,0.6642950,133.4560653,342.8633031
,8,0.2,0.1109986,1.5603543,3.7115633,0.1568,0.1300340,0.372975,0.3598573,0.0780177,0.7423127,56.0354264,271.1563340
,9,0.3,0.0694882,0.9637775,2.7956347,0.09685,0.0872926,0.2809333,0.2690024,0.0963777,0.8386904,-3.6222510,179.5634723
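
Several of the scalar cross-validation metrics above are linked by simple identities. The sketch below checks three of them with the reported values: Gini = 2 * AUC - 1, residual deviance = 2 * N * LogLoss for a binomial model, and AIC = residual deviance + 2 * k. The parameter count k is an assumption read off from the degrees of freedom (null minus residual, plus one for the intercept).

```python
import math

n_rows = 200000
logloss = 0.21422026621993717
auc = 0.8856750745429397
null_df, residual_df = 199999, 199993

gini = 2 * auc - 1
residual_deviance = 2 * n_rows * logloss
k = null_df - residual_df + 1        # assumed number of fitted parameters
aic = residual_deviance + 2 * k

print(gini)
print(residual_deviance)
print(k, aic)
```

The same identities hold for the training-data metrics, for example 2 * 200000 * 0.09665825493700282 for the residual deviance.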







In [12]:
result = model.leader.predict(test)
result

stackedensemble prediction progress: |████████████████████████████████████| 100%


predict,p0,p1
0,0.842633,0.157367
1,0.759715,0.240285
0,0.916508,0.083492
0,0.852454,0.147546
0,0.953305,0.0466947
0,0.994923,0.00507677
0,0.988653,0.0113474
0,0.875442,0.124558
0,0.995255,0.00474487
0,0.990004,0.00999596
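
The predict column is a class label, while p0 and p1 are the class probabilities. In the ten rows shown, the labels are consistent with cutting p1 at the cross-validation max-F1 threshold of about 0.2245. The sketch below checks this; the choice of threshold is an assumption inferred from these rows, not something the notebook states.

```python
# Cross-validation max-F1 threshold from the metrics table above.
THRESHOLD = 0.2244546  # assumption: labels were cut at this value

# p1 values and predicted labels copied from the ten rows shown above.
p1 = [0.157367, 0.240285, 0.083492, 0.147546, 0.0466947,
      0.00507677, 0.0113474, 0.124558, 0.00474487, 0.00999596]
shown = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Label a row 1 when its probability of class 1 reaches the threshold.
labels = [1 if p >= THRESHOLD else 0 for p in p1]

print(labels == shown)
```

With the training-data threshold of 0.4569, the second row would be labelled 0, which does not match the output.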




Now we append the first column of result, the predicted class, to our test data.

In [13]:
sub = test.cbind(result[0])

Then we select 'ID_code' and 'predict' from sub because those two columns are the only ones that we need in our submission file. We rename the 'predict' column to 'target' as required.

In [14]:
sub = sub[['ID_code','predict']]
sub = sub.rename(columns={'predict':'target'})

We convert the H2O frame sub to a pandas DataFrame and write it to a CSV file.

In [15]:
sub = sub.as_data_frame()
sub.to_csv('submission.csv',index=False)
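
For reference, the resulting file has a header row followed by one line per test row. The sketch below reproduces that layout with the standard csv module instead of pandas, writing to an in-memory buffer. The ID_code values are hypothetical placeholders.

```python
import csv
import io

# Hypothetical rows; the real ID_code values come from the test file.
rows = [('test_0', 0), ('test_1', 1), ('test_2', 0)]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['ID_code', 'target'])  # the two columns kept for submission
writer.writerows(rows)

submission_text = buffer.getvalue()
print(submission_text)
```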

## Shut down the h2o cluster

In [16]:
# prompt=False skips the interactive confirmation, which a notebook
# frontend without stdin support cannot answer.
h2o.cluster().shutdown(prompt=False)