## Model Building in H2O

I will go through four H2O models: GLM (generalized linear model), GBM (gradient boosting machine), DRF (Distributed Random Forest) and DL (Deep Learning neural network).

I used H2O Flow for the hyperparameter search (it's just easier than writing code), so only the best parameters found are posted here.
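For anyone who would rather script the search than use Flow, here is a minimal sketch based on `H2OGridSearch` from the H2O Python API. The hyperparameter values, `max_models`, the seed and the helper name `run_gbm_grid` are illustrative assumptions, not the ranges searched in Flow, and the function expects H2OFrames and a running cluster.

```python
from itertools import product

# Hypothetical search space: illustrative values only,
# not the ranges that were explored in Flow.
gbm_hyper_params = {
    "max_depth": [10, 15, 25],
    "learn_rate": [0.05, 0.1],
    "min_rows": [10, 50],
}

# Random search over the grid, capped at a fixed number of models.
search_criteria = {"strategy": "RandomDiscrete", "max_models": 8, "seed": 42}

# Size of the full Cartesian grid, to see how much max_models prunes it.
n_combinations = len(list(product(*gbm_hyper_params.values())))
print("full grid size:", n_combinations)


def run_gbm_grid(train, val, predictors, target):
    """Train the grid on existing H2OFrames and return it sorted by RMSE."""
    # Imported here so the definitions above can be inspected without h2o running.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    grid = H2OGridSearch(
        model=H2OGradientBoostingEstimator(ntrees=100, nbins=20),
        hyper_params=gbm_hyper_params,
        search_criteria=search_criteria,
    )
    grid.train(x=predictors, y=target, training_frame=train, validation_frame=val)
    # Lowest RMSE first
    return grid.get_grid(sort_by="rmse", decreasing=False)
```

With the frames and column lists defined later in this notebook, a call would look like `run_gbm_grid(train, val, predictors, target)`.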


# H2O - GLM, GBM, NN, RF

In [1]:
import pandas as pd
import numpy as np
import time
import csv
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 8
import math

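# Timestamp of the last tic() call
_start_time = time.time()

def tic():
    """Start (or restart) the timer."""
    global _start_time
    _start_time = time.time()

def tac():
    """Print the time elapsed since the last tic() as hours, minutes and seconds."""
    t_sec = round(time.time() - _start_time)
    (t_min, t_sec) = divmod(t_sec, 60)
    (t_hour, t_min) = divmod(t_min, 60)
    print('Time passed: {}hour:{}min:{}sec'.format(t_hour, t_min, t_sec))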

In [2]:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [4]:
# Connect to a cluster
h2o.init()

Connecting to H2O server at http://localhost:54321... successful!


H2O cluster uptime: 03 secs
H2O cluster version: 3.10.0.3
H2O cluster version age: 10 days
H2O cluster name: H2O_from_python_nobody_vjfhhh
H2O cluster total nodes: 1
H2O cluster free memory: 12.23 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster is healthy: True
H2O cluster is locked: False


In [5]:
# Load the modified train, validation and test sets
from h2o.utils.shared_utils import _locate # private helper that finds files within the h2o git project directory

tic()
train = h2o.upload_file(path=_locate("./input-data/train_modified.csv"))
val = h2o.upload_file(path=_locate("./input-data/val_modified_w9.csv"))
test = h2o.upload_file(path=_locate("./input-data/test_modified.csv"))
tac()




Time passed: 0hour:1min:52sec


In [7]:
# RMSLE - error function used on the leaderboard (LB):
# The H2O Python API recently (Jun 2016) added RMSE as a model performance metric, so we compute it directly
# on our target = log_target, which gives the RMSLE on the original demand

def modelfit(alg, dtrain, dval, dtest, predictors, target, IDcol, filename):
    # Fit the algorithm on the data
    alg.train(x=predictors, y=target, training_frame=dtrain, validation_frame=dval)

    # Performance on the training and validation sets
    print("\nModel Report")
    print('RMSLE TRAIN: ', alg.model_performance(dtrain).rmse())
    print('RMSLE VAL: ', alg.model_performance(dval).rmse())

    # Predict on the test data: revert to "Demanda_uni_equil" by applying expm1
    dtest[target] = alg.predict(dtest).expm1()

    print('NUM ROWS PREDICTED: ', dtest.shape[0])
    #print ('NUM NEGATIVES PREDICTED: ', dtest[target][dtest[target] < 0].nrow())
    print('MIN TARGET PREDICTED: ', dtest[target].min())
    print('MEAN TARGET PREDICTED: ', dtest[target].mean())
    print('MAX TARGET PREDICTED: ', dtest[target].max())

    # Export the submission file
    submission = dtest[[IDcol, target]].as_data_frame(use_pandas=True)
    submission[IDcol] = submission[IDcol].astype(int)
    submission.rename(columns={target: 'Demanda_uni_equil'}, inplace=True)
    submission.to_csv("./Submissions/" + filename, index=False)
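To make the metric concrete: assuming `log_target` is `log1p` of the demand (which is what the `expm1` call in `modelfit` implies), the RMSE measured on the log scale is the same number as the RMSLE on the original scale. The sketch below checks this with plain Python on made-up numbers. The demand values and predictions are hypothetical, and the final clipping step is not part of this notebook, only one optional way to handle negative values after `expm1`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    sq = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

def rmsle(actual, predicted):
    """Root mean squared logarithmic error on the original scale."""
    sq = [(math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical demand values and log-scale model outputs, for illustration only
demand = [0, 3, 10, 250]
log_pred = [0.1, 1.2, 2.5, 5.4]

log_demand = [math.log1p(d) for d in demand]      # what the model is trained on
pred_demand = [math.expm1(p) for p in log_pred]   # what goes into the submission

rmse_on_log = rmse(log_demand, log_pred)
rmsle_on_raw = rmsle(demand, pred_demand)
print(rmse_on_log, rmsle_on_raw)  # the two agree up to floating-point error

# Optional (assumption, not done in modelfit): a negative log-scale prediction
# becomes a negative demand after expm1, so it can be clipped at zero.
clipped = [max(0.0, math.expm1(p)) for p in [-0.26, 0.0, 1.0]]
```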

Now let's define the target and ID columns.

In [8]:
#Define target and ID columns:
target = 'log_target'
IDcol = 'id'

### Alg6 - GBM

Let's build our first GBM model.

In [9]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg6 = H2OGradientBoostingEstimator(ntrees=300,max_depth=25,learn_rate=0.1, min_rows=10, nbins=20, 
                                    ignored_columns=["Semana","pairs_mean"])
tic()
modelfit(alg6, train, val, test, predictors, target, IDcol, 'alg6.csv')
tac()

alg6.varimp(use_pandas=True)



Model Report
RMSLE TRAIN:  0.3707788560864561
RMSLE VAL:  0.3707048055394812

NUM ROWS PREDICTED:  6999251
MIN TARGET PREDICTED:  -0.23013072177354224
MEAN TARGET PREDICTED:  [5.798725018818779]
MAX TARGET PREDICTED:  3252.0560679150753
Time passed: 6hour:21min:56sec


| | variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|---|
| 0 | Log_Target_mean_lag1 | 49332960.0 | 1.0 | 0.528009 |
| 1 | Log_Target_mean_lag3 | 10523560.0 | 0.213317 | 0.112633 |
| 2 | Lags_sum | 6336896.0 | 0.128452 | 0.067824 |
| 3 | Producto_ID | 4954710.0 | 0.100434 | 0.05303 |
| 4 | cluster | 3772194.0 | 0.076464 | 0.040374 |
| 5 | Ruta_SAK | 3596937.0 | 0.072911 | 0.038498 |
| 6 | Log_Target_mean_lag4 | 3481056.0 | 0.070562 | 0.037258 |
| 7 | ZipCode | 2880712.0 | 0.058393 | 0.030832 |
| 8 | Cliente_ID | 2838599.0 | 0.05754 | 0.030381 |
| 9 | Log_Target_mean_lag2 | 2448448.0 | 0.049631 | 0.026206 |
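As a reading aid for the table above: the numbers are consistent with `scaled_importance` being each variable's `relative_importance` divided by the largest one. A small check in plain Python, using the values printed above (the tolerance is an assumption to allow for display rounding):

```python
# (variable, relative_importance, scaled_importance) copied from the table above
varimp_rows = [
    ("Log_Target_mean_lag1", 49332960.0, 1.0),
    ("Log_Target_mean_lag3", 10523560.0, 0.213317),
    ("Lags_sum", 6336896.0, 0.128452),
    ("Producto_ID", 4954710.0, 0.100434),
    ("cluster", 3772194.0, 0.076464),
    ("Ruta_SAK", 3596937.0, 0.072911),
    ("Log_Target_mean_lag4", 3481056.0, 0.070562),
    ("ZipCode", 2880712.0, 0.058393),
    ("Cliente_ID", 2838599.0, 0.05754),
    ("Log_Target_mean_lag2", 2448448.0, 0.049631),
]

# Largest relative importance among the rows shown
top = max(rel for _, rel, _ in varimp_rows)

# Recompute scaled importance as relative_importance / max(relative_importance)
recomputed = {name: rel / top for name, rel, _ in varimp_rows}

# Largest gap between the recomputed and the reported values
max_gap = max(abs(recomputed[name] - scaled) for name, _, scaled in varimp_rows)
print("max gap:", max_gap)
```

Read this way, `Log_Target_mean_lag1` is roughly five times as important to this GBM as the next variable, `Log_Target_mean_lag3`.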


### Alg7 - DRF

In [None]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg7 = H2ORandomForestEstimator(ntrees=200, max_depth=30)

tic()
modelfit(alg7, train, val, test, predictors, target, IDcol, 'alg7.csv')
tac()

alg7.varimp(use_pandas=True)

### Alg8 - GLM

In [None]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg8 = H2OGeneralizedLinearEstimator(Lambda=[1e-5], family="poisson")
  
tic()
modelfit(alg8, train, val, test, predictors, target, IDcol, 'alg8.csv')
tac()

alg8.varimp(use_pandas=True)

### Alg9 - DL

In [None]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg9 = H2ODeepLearningEstimator(hidden=[50,50,50,50], epochs=50)
    
tic()
modelfit(alg9, train, val, test, predictors, target, IDcol, 'alg9.csv')
tac()

alg9.varimp(use_pandas=True)