## Model Building in H2O

I will go through four H2O models: GLM (Generalized Linear Model), GBM (Gradient Boosting Machine), DRF (Distributed Random Forest) and DL (Deep Learning neural network).

I'll use H2O Flow for the hyperparameter search (it's just easier than writing code) and post the best parameters found here.
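For readers who prefer code to the Flow UI, a similar search can be sketched with the Python API's `H2OGridSearch`. The hyperparameter values below are illustrative assumptions, not the grid actually searched in Flow, and `grid_size` and `run_gbm_grid` are hypothetical helpers that expect H2O frames and column names like the ones defined later in this notebook.

```python
import itertools

# Illustrative search space (an assumption, not the grid used in Flow).
gbm_hyper_params = {
    "max_depth": [10, 20, 25],
    "learn_rate": [0.05, 0.1],
    "min_rows": [10, 50],
}

def grid_size(hyper_params):
    """Number of models a full Cartesian grid would train."""
    return len(list(itertools.product(*hyper_params.values())))

def run_gbm_grid(train, val, predictors, target, hyper_params=gbm_hyper_params):
    """Hypothetical helper: train a Cartesian GBM grid and rank it by RMSE."""
    # Imported here so grid_size can be used without a running H2O cluster.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.grid.grid_search import H2OGridSearch

    # ntrees is held fixed across the grid; only hyper_params vary.
    grid = H2OGridSearch(model=H2OGradientBoostingEstimator(ntrees=100),
                         hyper_params=hyper_params)
    grid.train(x=predictors, y=target, training_frame=train, validation_frame=val)
    # Sort so the lowest RMSE comes first.
    return grid.get_grid(sort_by="rmse", decreasing=False)

# Check how many models the grid implies before committing cluster time.
print(grid_size(gbm_hyper_params))
```

Counting the grid first is worthwhile here, because a single model on this dataset can take hours to train.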


# H2O - GLM, GBM, NN, RF

In [1]:
import pandas as pd
import numpy as np
import time
import csv
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 8
import math

_start_time = time.time()

def tic():
    global _start_time 
    _start_time = time.time()

def tac():
    t_sec = round(time.time() - _start_time)
    (t_min, t_sec) = divmod(t_sec,60)
    (t_hour,t_min) = divmod(t_min,60) 
    print('Time passed: {}hour:{}min:{}sec'.format(t_hour,t_min,t_sec))

In [2]:
import h2o
import time
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [3]:
# Connect to a cluster
h2o.init()

Connecting to H2O server at http://localhost:54321... successful!


H2O cluster uptime: 17 hours 19 mins
H2O cluster version: 3.10.0.3
H2O cluster version age: 22 days
H2O cluster name: LOCAL SERVICE
H2O cluster total nodes: 1
H2O cluster free memory: 47.16 Gb
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster is healthy: True
H2O cluster is locked: True


In [4]:
# Now we load our modified train, validation and test sets
from h2o.utils.shared_utils import _locate # private function. used to find files within h2o git project directory.

tic()
train = h2o.upload_file(path=_locate("./input-data/train_modified_5lags.csv"))
val = h2o.upload_file(path=_locate("./input-data/val_modified_5lags.csv"))
test = h2o.upload_file(path=_locate("./input-data/test_modified_5lags.csv"))
tac()




Time passed: 0.0hour:0.0min:53.0sec


In [5]:
# The H2O Python API recently (Jun 2016) added RMSE as a model performance metric. Since our
# target is log_target, computing RMSE directly on it gives us the RMSLE.

def modelfit(alg, dtrain, dval, dtest, predictors, target, IDcol, filename):   
    #Fit the algorithm on the data
    alg.train(x=predictors, y=target, training_frame=dtrain, validation_frame=dval)

    #Performance on Training and Val sets (use the frames passed in, not the globals):
    print ("\nModel Report")
    print ('RMSLE TRAIN: ', alg.model_performance(dtrain).rmse())
    print ('RMSLE VAL: ', alg.model_performance(dval).rmse())
    
    #Predict on testing data: we need to revert it back to "Demanda_uni_equil" by applying expm1 
    dtest[target] = alg.predict(dtest).expm1()
    
    print ('NUM ROWS PREDICTED: ', dtest.shape[0] )
    #print ('NUM NEGATIVES PREDICTED: ', dtest[target][dtest[target] < 0].nrow)
    print ('MIN TARGET PREDICTED: ', dtest[target].min())
    print ('MEAN TARGET PREDICTED: ', dtest[target].mean())
    print ('MAX TARGET PREDICTED: ', dtest[target].max())
    
    dtest[target] = np.maximum(dtest[target], 0) # we make all negative numbers = 0 since there cannot be a negative demand
   
    #Export submission file:
    submission = dtest[[IDcol,target]].as_data_frame(use_pandas=True)
    submission[IDcol] = submission[IDcol].astype(int)
    submission.rename(columns={target: 'Demanda_uni_equil'}, inplace=True)
    submission.to_csv("./Submissions/"+filename, index=False)
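Why the RMSE reported by `modelfit` can be read as RMSLE: assuming `log_target` was built as `log1p(Demanda_uni_equil)` (which the `expm1` inversion above suggests), the RMSE computed on the log scale is the RMSLE on the original scale. A minimal pure-Python sketch with made-up numbers, where `rmse` and `rmsle` are illustrative helpers and not H2O functions:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + x)."""
    n = len(actual)
    return math.sqrt(sum((math.log1p(a) - math.log1p(p)) ** 2
                         for a, p in zip(actual, predicted)) / n)

# Made-up demand values, purely for illustration.
demand = [0, 3, 7, 12]
forecast = [1, 2, 9, 10]

# The same values on the log1p scale, as the model sees them.
log_demand = [math.log1p(v) for v in demand]
log_forecast = [math.log1p(v) for v in forecast]

# RMSE on the log1p scale equals RMSLE on the original scale.
assert math.isclose(rmse(log_demand, log_forecast), rmsle(demand, forecast))
```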

Now let's define the target and ID columns.

In [None]:
#Define target and ID columns:
target = 'log_target'
IDcol = 'id'
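Because the model is trained on the log scale, `modelfit` maps predictions back with `expm1` and then clips negatives to zero. The sketch below shows, with made-up numbers, why clipping is needed: any log-scale prediction below zero turns into a negative demand after `expm1`. `to_demand` is a hypothetical scalar helper, not part of the notebook or of H2O.

```python
import math

def to_demand(log_prediction):
    """Invert log1p and clip at zero, since demand cannot be negative."""
    return max(math.expm1(log_prediction), 0.0)

# A positive log-scale prediction maps back to a positive demand.
print(to_demand(math.log1p(5.0)))

# A negative log-scale prediction gives a negative value after expm1 ...
print(math.expm1(-0.5))

# ... so the helper clips it to zero.
print(to_demand(-0.5))
```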

### Alg6 - GBM

Let's build our first GBM model.

In [None]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg6 = H2OGradientBoostingEstimator(ntrees=300,max_depth=25,learn_rate=0.1, min_rows=10, nbins=20, 
                                    ignored_columns=["Semana","pairs_mean"])
tic()
modelfit(alg6, train, val, test, predictors, target, IDcol, 'alg6.csv')
tac()

alg6.varimp(use_pandas=True)



Model Report
('RMSLE TRAIN: ', 0.3314411740600231)
('RMSLE VAL: ', 0.47747733301046835)

('NUM ROWS PREDICTED: ', 6999251)
('MIN TARGET PREDICTED: ', -0.4466447822087602)

## --> LB: 0.47600

A good improvement over the scikit-learn models. I still don't like the fact that the validation RMSLE differs from the test (LB) RMSLE.

### Alg7 - DRF

In [None]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg7 = H2ORandomForestEstimator(ntrees=300, max_depth=30)

tic()
modelfit(alg7, train, val, test, predictors, target, IDcol, 'alg7.csv')
tac()

alg7.varimp(use_pandas=True)



Model Report
RMSLE TRAIN:  0.4428664746471727
RMSLE VAL:  0.4416625096031187

NUM ROWS PREDICTED:  6999251
MIN TARGET PREDICTED:  0.020059765726462852
MEAN TARGET PREDICTED:  [5.6052535622997]
MAX TARGET PREDICTED:  2421.874184697988
Time passed: 14hour:31min:5sec


## --> LB: 0.46515

A good improvement; however, the 14-hour training time is killing me!

### Alg8 - GLM

In [8]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg8 = H2OGeneralizedLinearEstimator(Lambda=0.1, alpha=0.1, lambda_search=True, nlambdas=100,  family="poisson")
  
tic()
modelfit(alg8, train, val, test, predictors, target, IDcol, 'alg8.csv')
tac()

alg8.varimp(use_pandas=True)



Model Report
RMSLE TRAIN:  0.6208971783764934
RMSLE VAL:  0.6159225642176592

NUM ROWS PREDICTED:  6999251
MIN TARGET PREDICTED:  1.7312609680778603
MEAN TARGET PREDICTED:  [5.345971839793424]
MAX TARGET PREDICTED:  1397701.5977421915
Time passed: 0hour:1min:15sec


I'm not even going to submit this to the LB. We saw before with the scikit-learn models that linear regression does not perform well on this problem.

### Alg9 - DL

In [None]:
predictors = ['Agencia_ID','Canal_ID','Ruta_SAK','Cliente_ID','Producto_ID','Log_Target_mean_lag1',
                'Log_Target_mean_lag2','Log_Target_mean_lag3','Log_Target_mean_lag4','Lags_sum','brand','cluster',
                'Qty_Ruta_SAK_Bin','ZipCode']

alg9 = H2ODeepLearningEstimator(hidden=[50,50,50,50], epochs=50)
    
tic()
modelfit(alg9, train, val, test, predictors, target, IDcol, 'alg9.csv')
tac()

alg9.varimp(use_pandas=True)