# AutoML Grid Search

This notebook shows the grid searches that are performed in AutoML.

AutoML builds models for each algorithm in two steps:

1. Train models with parameter settings that have been found to work well on most datasets.
2. Run a grid search over the parameters that have been found to have a large impact on performance.

The two-step approach is meant to produce good models as quickly as possible. Step 1 should yield a strong baseline, and the grid search in step 2 can then improve on it if more time is available.
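
The two-step flow can be sketched in plain Python, independent of H2O. Everything in this sketch is hypothetical: `toy_score` stands in for "train a model and return its validation loss", and the parameter values and `extra_budget` are illustrative only.

```python
import itertools
import random

def toy_score(params):
    # Hypothetical stand-in for training a model and returning its validation loss.
    return abs(params["max_depth"] - 7) + abs(params["min_rows"] - 10) / 10

# Step 1: a short, fixed list of configurations that tend to work well.
best_guesses = [{"max_depth": 6, "min_rows": 1}, {"max_depth": 8, "min_rows": 10}]
results = [(toy_score(p), p) for p in best_guesses]

# Step 2: if budget remains, sample extra configurations from a larger grid.
grid = {"max_depth": [3, 5, 7, 9, 11], "min_rows": [1, 5, 10, 30]}
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
extra_budget = 5  # hypothetical number of additional models we can afford
rng = random.Random(1234)
for params in rng.sample(combos, extra_budget):
    results.append((toy_score(params), params))

# The final pick can only match or beat the step 1 baseline.
best_score, best_params = min(results, key=lambda r: r[0])
print(best_score, best_params)
```

Because the step 1 results stay in the candidate pool, the grid search can improve on the baseline but never make the final choice worse.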

In [1]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "12.0.2" 2019-07-16; Java(TM) SE Runtime Environment (build 12.0.2+10); Java HotSpot(TM) 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)
  Starting server from /Users/megankurka/env2/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpwe82ymwi
  JVM stdout: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpwe82ymwi/h2o_megankurka_started_from_python.out
  JVM stderr: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpwe82ymwi/h2o_megankurka_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.0.4
H2O_cluster_version_age:,24 days
H2O_cluster_name:,H2O_from_python_megankurka_rdmonj
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,4 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,16


In [2]:
df = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/demos/bank-additional-full.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [3]:
df.head()

age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
45,services,married,basic.9y,unknown,no,no,telephone,may,mon,198,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
59,admin.,married,professional.course,no,no,no,telephone,may,mon,139,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,217,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
24,technician,single,professional.course,no,yes,no,telephone,may,mon,380,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
25,services,single,high.school,no,yes,no,telephone,may,mon,50,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no




In [4]:
# With no `ratios` argument, split_frame defaults to a 75/25 train/validation split.
train, valid = df.split_frame(seed=1234)

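Set the parameters shared by every model and grid search below: the number of cross-validation folds, the random seed, the target and predictor columns, and the search criteria.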

In [5]:
nfolds = 3
seed = 1234
target_col = "y"
x = [col for col in train.col_names if col != target_col]
search_criteria = {'max_models': 5, 'seed': seed, 'strategy': "RandomDiscrete"}
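
`RandomDiscrete` samples hyperparameter combinations at random instead of walking the full Cartesian grid, and `max_models` caps how many are built. A rough pure-Python sketch of that idea follows. It is not H2O's implementation: `sample_grid` is a hypothetical helper, `toy_grid` holds illustrative values, and sampling without repeats is an assumption of the sketch.

```python
import itertools
import random

def sample_grid(hyper_params, max_models, seed):
    """Draw up to max_models distinct combinations from a discrete grid."""
    names = list(hyper_params)
    combos = list(itertools.product(*(hyper_params[n] for n in names)))
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    chosen = rng.sample(combos, min(max_models, len(combos)))
    return [dict(zip(names, values)) for values in chosen]

toy_grid = {"max_depth": [3, 5, 7], "sample_rate": [0.6, 0.8, 1.0]}  # illustrative values
for params in sample_grid(toy_grid, max_models=5, seed=1234):
    print(params)
```

With 9 possible combinations and `max_models=5`, the sketch trains on 5 of them; raising `max_models` above 9 would simply return the whole grid.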

## GBM

In [6]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator
import copy

Step 1: Train the best-guess models. Each one combines the shared defaults below with a fixed `max_depth` and `min_rows` pair.

In [7]:
default_gbm_params = {
    'score_tree_interval': 5,
    'ntrees': 10000,
    'sample_rate': 0.8,
    'col_sample_rate': 0.8,
    'col_sample_rate_per_tree': 0.8
}

In [8]:
best_guess_params = [{'max_depth': 6, 'min_rows': 1},
                     {'max_depth': 7, 'min_rows': 10},
                     {'max_depth': 8, 'min_rows': 10},
                     {'max_depth': 10, 'min_rows': 10},
                     {'max_depth': 15, 'min_rows': 100},
                    ]

best_guess_gbms = []
for params in best_guess_params:
    # start from the shared defaults, then apply this model's overrides
    model_params = copy.deepcopy(default_gbm_params)
    model_params.update(params)
    gbm = H2OGradientBoostingEstimator(nfolds=nfolds, seed=seed, **model_params)
    gbm.train(x=x, y=target_col, training_frame=train, validation_frame=valid)
    best_guess_gbms.append(gbm)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%


Step 2: Run a random grid search over the parameters with the largest impact on performance.

In [9]:
from h2o.grid import H2OGridSearch

In [10]:
gbm_search_params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
                     'min_rows': [1, 5, 10, 15, 30, 100],
                     'sample_rate': [0.50, 0.60, 0.70, 0.80, 0.90, 1.00],
                     'col_sample_rate': [0.4, 0.7, 1.0],
                     'col_sample_rate_per_tree': [0.4, 0.7, 1.0],
                     'min_split_improvement': [1e-4, 1e-5]
                    }
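
A quick count shows why this grid is sampled rather than searched exhaustively: the six value lists multiply out to thousands of combinations, while `search_criteria` allows only 5 models. The sketch below repeats the grid values from this cell so that it runs on its own; `n_combinations` is a name introduced here for illustration.

```python
gbm_search_params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
                     'min_rows': [1, 5, 10, 15, 30, 100],
                     'sample_rate': [0.50, 0.60, 0.70, 0.80, 0.90, 1.00],
                     'col_sample_rate': [0.4, 0.7, 1.0],
                     'col_sample_rate_per_tree': [0.4, 0.7, 1.0],
                     'min_split_improvement': [1e-4, 1e-5]}

# size of the full Cartesian grid = product of the list lengths
n_combinations = 1
for values in gbm_search_params.values():
    n_combinations *= len(values)

print(n_combinations)  # 15 * 6 * 6 * 3 * 3 * 2 = 9720
```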

In [11]:
# remove parameters that are being searched from the default dictionary
for k in gbm_search_params.keys():
    default_gbm_params.pop(k, None)
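
Note that `pop` mutates `default_gbm_params` in place, so re-running the best-guess cell after this one would train those models without the sampling defaults. One non-mutating alternative, given here as a sketch, builds a filtered copy instead. The name `fixed_gbm_params` is hypothetical, and the dictionary and key set are copied from the cells above so the example runs on its own.

```python
default_gbm_params = {
    'score_tree_interval': 5,
    'ntrees': 10000,
    'sample_rate': 0.8,
    'col_sample_rate': 0.8,
    'col_sample_rate_per_tree': 0.8
}
# keys of gbm_search_params, i.e. the parameters the grid will vary
searched = {'max_depth', 'min_rows', 'sample_rate', 'col_sample_rate',
            'col_sample_rate_per_tree', 'min_split_improvement'}

# keep only the defaults that the grid does not vary; the original dict is untouched
fixed_gbm_params = {k: v for k, v in default_gbm_params.items() if k not in searched}
print(fixed_gbm_params)  # {'score_tree_interval': 5, 'ntrees': 10000}
```

With this variant, `fixed_gbm_params` would be passed to `gbm_grid.train(...)` in place of `default_gbm_params`.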

In [12]:
gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                         grid_id='gbm_grid',
                         search_criteria=search_criteria,
                         hyper_params=gbm_search_params)
gbm_grid.train(x=x, y=target_col,
               training_frame=train,
               validation_frame=valid,
               nfolds=nfolds,
               seed=seed,
               **default_gbm_params)

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [13]:
# combine the step 1 models with the grid models (avoid shadowing the built-in `id`)
all_gbm_models = best_guess_gbms + [h2o.get_model(model_id) for model_id in gbm_grid.get_grid().model_ids]

In [14]:
import pandas as pd

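# rank every GBM model by validation logloss (lower is better);
# `m` is used as the loop variable so it is not confused with the predictor list `x`
gbm_perf = pd.DataFrame.from_dict({m.model_id: m.logloss(valid=True) for m in all_gbm_models},
                                  orient="index")
gbm_perf.sort_values(by=0)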

model_id,validation_logloss
gbm_grid_model_4,0.122202
gbm_grid_model_2,0.128433
GBM_model_python_1593132904195_85,0.143117
gbm_grid_model_1,0.143567
GBM_model_python_1593132904195_22,0.143732
GBM_model_python_1593132904195_1,0.143763
GBM_model_python_1593132904195_43,0.144601
GBM_model_python_1593132904195_64,0.144904
gbm_grid_model_5,0.15718
gbm_grid_model_3,0.157957
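
The ranking above uses validation logloss, where lower is better. As a reminder of what that number measures, here is a minimal sketch of binary logloss in plain Python. `binary_logloss` is a hypothetical helper, not an H2O function, and the labels and probabilities are made up for illustration.

```python
import math

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# confident, mostly correct predictions give a low loss
print(binary_logloss([1, 0, 0, 1], [0.9, 0.1, 0.2, 0.6]))
# predicting 0.5 everywhere gives log(2), about 0.693
print(binary_logloss([1, 0], [0.5, 0.5]))
```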


## XGBoost

In [15]:
from h2o.estimators.xgboost import H2OXGBoostEstimator

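Step 1: Train the best-guess models. Each one combines the shared defaults below with a fixed `max_depth`, `min_rows`, and `sample_rate`.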

In [16]:
default_xgb_params = {
    'score_tree_interval': 5,
    'ntrees': 10000,
    'col_sample_rate': 0.8,
    'col_sample_rate_per_tree': 0.8
}

In [17]:
best_guess_params = [{'max_depth': 10, 'min_rows': 5, 'sample_rate': 0.6},
                     {'max_depth': 20, 'min_rows': 10, 'sample_rate': 0.6},
                     {'max_depth': 5, 'min_rows': 3, 'sample_rate': 0.8},
                    ]

best_guess_xgbs = []
for params in best_guess_params:
    # start from the shared defaults, then apply this model's overrides
    model_params = copy.deepcopy(default_xgb_params)
    model_params.update(params)
    xgb = H2OXGBoostEstimator(nfolds=nfolds, seed=seed, **model_params)
    xgb.train(x=x, y=target_col, training_frame=train, validation_frame=valid)
    best_guess_xgbs.append(xgb)

xgboost Model Build progress: |███████████████████████████████████████████| 100%
xgboost Model Build progress: |███████████████████████████████████████████| 100%
xgboost Model Build progress: |███████████████████████████████████████████| 100%


Step 2: Run a random grid search over the parameters with the largest impact on performance.

In [18]:
xgb_search_params = {'max_depth': [5, 10, 15, 20],
                     'min_rows': [0.01, 0.1, 1.0, 3.0, 5.0, 10.0, 15.0, 20.0],
                     'sample_rate': [0.6, 0.8, 1.0],
                     'col_sample_rate': [0.6, 0.8, 1.0],
                     'col_sample_rate_per_tree': [0.7, 0.8, 0.9, 1.0],
                     'reg_lambda': [0.001, 0.01, 0.1, 1, 10, 100],
                     'reg_alpha': [0.001, 0.01, 0.1, 0.5, 1]
                    }

In [19]:
# remove parameters that are being searched from the default dictionary
for k in xgb_search_params.keys():
    default_xgb_params.pop(k, None)

In [20]:
xgb_grid = H2OGridSearch(model=H2OXGBoostEstimator,
                         grid_id='xgb_grid',
                         search_criteria=search_criteria,
                         hyper_params=xgb_search_params)
xgb_grid.train(x=x, y=target_col,
               training_frame=train,
               validation_frame=valid,
               nfolds=nfolds,
               seed=seed,
               **default_xgb_params)

xgboost Grid Build progress: |████████████████████████████████████████████| 100%


In [21]:
# combine the step 1 models with the grid models (avoid shadowing the built-in `id`)
all_xgb_models = best_guess_xgbs + [h2o.get_model(model_id) for model_id in xgb_grid.get_grid().model_ids]

In [22]:
# rank every XGBoost model by validation logloss (lower is better)
xgb_perf = pd.DataFrame.from_dict({m.model_id: m.logloss(valid=True) for m in all_xgb_models},
                                  orient="index")
xgb_perf.sort_values(by=0)

model_id,validation_logloss
XGBoost_model_python_1593132904195_248,0.177674
XGBoost_model_python_1593132904195_227,0.178219
XGBoost_model_python_1593132904195_206,0.178755
xgb_grid_model_3,0.179315
xgb_grid_model_2,0.182802
xgb_grid_model_5,0.183267
xgb_grid_model_1,0.186752
xgb_grid_model_4,0.187953
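
In this run the three best-guess XGBoost models rank ahead of every grid model, whereas for GBM two grid models came out on top. A small sketch of that comparison, using the logloss values copied from the XGBoost table above; it identifies grid models by the `xgb_grid` prefix that follows from the `grid_id` set earlier, and the variable names are introduced here for illustration.

```python
xgb_logloss = {
    "XGBoost_model_python_1593132904195_248": 0.177674,
    "XGBoost_model_python_1593132904195_227": 0.178219,
    "XGBoost_model_python_1593132904195_206": 0.178755,
    "xgb_grid_model_3": 0.179315,
    "xgb_grid_model_2": 0.182802,
    "xgb_grid_model_5": 0.183267,
    "xgb_grid_model_1": 0.186752,
    "xgb_grid_model_4": 0.187953,
}
grid_prefix = "xgb_grid"

# best (lowest) logloss within each step
best_guess_best = min(v for k, v in xgb_logloss.items() if not k.startswith(grid_prefix))
grid_best = min(v for k, v in xgb_logloss.items() if k.startswith(grid_prefix))

print("step 1 best:", best_guess_best)
print("step 2 best:", grid_best)
print("grid search improved on the baseline:", grid_best < best_guess_best)
```

With only 5 sampled grid models, either outcome is plausible; a larger `max_models` would give the grid search more chances to beat the baseline.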


## GLM

In [23]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [24]:
default_glm_params = {
    'lambda_search': True,
    'family': "binomial"
}

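There is no best-guess step for GLM in this notebook; only a grid search over `alpha` is run. Because `search_criteria` caps the search at 5 models, only five of the six `alpha` values are trained.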

In [25]:
glm_search_params = {'alpha': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]}
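
`alpha` sets the elastic net mix between the L1 (lasso) and L2 (ridge) penalties, while `lambda_search` lets H2O choose the overall penalty strength. A minimal sketch of the penalty term, assuming the usual elastic net form lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2); `elastic_net_penalty` and the coefficient values are illustrative only.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Elastic net penalty: alpha=1 is pure lasso, alpha=0 is pure ridge."""
    l1 = sum(abs(b) for b in coefs)   # L1 norm of the coefficients
    l2 = sum(b * b for b in coefs)    # squared L2 norm of the coefficients
    return lam * (alpha * l1 + (1 - alpha) / 2 * l2)

coefs = [0.5, -1.0, 2.0]  # made-up coefficients
for alpha in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    print(alpha, elastic_net_penalty(coefs, alpha, lam=0.1))
```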

In [26]:
glm_grid = H2OGridSearch(model=H2OGeneralizedLinearEstimator,
                         grid_id='glm_grid',
                         search_criteria=search_criteria,
                         hyper_params=glm_search_params)
glm_grid.train(x=x, y=target_col,
               training_frame=train,
               validation_frame=valid,
               nfolds=nfolds,
               seed=seed,
               **default_glm_params)

glm Grid Build progress: |████████████████████████████████████████████████| 100%


In [27]:
# fetch the grid models (avoid shadowing the built-in `id`)
all_glm_models = [h2o.get_model(model_id) for model_id in glm_grid.get_grid().model_ids]

In [28]:
# rank every GLM model by validation logloss (lower is better)
glm_perf = pd.DataFrame.from_dict({m.model_id: m.logloss(valid=True) for m in all_glm_models},
                                  orient="index")
glm_perf.sort_values(by=0)

model_id,validation_logloss
glm_grid_model_2,0.105525
glm_grid_model_3,0.105549
glm_grid_model_5,0.105554
glm_grid_model_4,0.105718
glm_grid_model_1,0.10605


In [29]:
h2o.cluster().shutdown()

H2O session _sid_bab9 closed.
