# H2O Machine Learning Tutorial: Grid Search & Model Selection

Prepared for H2O Open Chicago 2016: http://open.h2o.ai/chicago.html

## Install H2O

The first step in this tutorial is to download and install the h2o Python module.  
The latest version is always here: http://www.h2o.ai/download/h2o/py

### Start up the H2O Cluster

Once the Python module is installed, we begin by starting up a local (on your laptop) H2O cluster.  If you are already running an H2O cluster from the introductory H2O tutorial, stop the H2O cluster and restart.

In [1]:
# If the cluster is running already, shut down and start up a new instance
#import h2o
#h2o.shutdown(prompt=False)

In [2]:
# Load the H2O library and start up the H2O cluster locally on your machine
import h2o

# Number of threads, nthreads = -1, means use all cores on your machine
# max_mem_size is the maximum memory (in GB) to allocate to H2O
h2o.init(nthreads = -1, max_mem_size = 8)



No instance found at ip and port: localhost:54321. Trying to start local jar...


JVM stdout: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpnn72d7/h2o_me_started_from_python.out
JVM stderr: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpeV2nfR/h2o_me_started_from_python.err
Using ice_root: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmp8znRmy


Java Version: java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


Starting H2O JVM and connecting: ......... Connection successful!




H2O cluster uptime: 1 seconds 40 milliseconds
H2O cluster version: 3.8.2.3
H2O cluster name: H2O_started_from_python_me_wzy124
H2O cluster total nodes: 1
H2O cluster total free memory: 7.11 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321


## Data prep

### Import data
Next we will import a cleaned up version of the Lending Club "Bad Loans" dataset. The purpose here is to predict whether a loan will be bad (i.e. not repaid to the lender). The response column, `bad_loan`, is 1 if the loan was bad, and 0 otherwise.

In [3]:
loan_csv = "/Volumes/H2OTOUR/loan.csv"  # modify this for your machine
# Alternatively, you can import the data directly from a URL
#loan_csv = "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
data = h2o.import_file(loan_csv)  # 163,987 rows x 15 columns


Parse Progress: [##################################################] 100%


In [4]:
data.shape

(163987, 15)

### Encode response variable
Since we want to train a binary classification model, we must ensure that the response is coded as a factor. If the response is 0/1, H2O will assume it's numeric, which means that H2O will train a regression model instead.

In [5]:
data['bad_loan'] = data['bad_loan'].asfactor()  #encode the binary response as a factor
data['bad_loan'].levels()  #optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

### Partition data

Next, we partition the data into training, validation and test sets.

In [6]:
# Partition data into 70%, 15%, 15% chunks
# Setting a seed will guarantee reproducibility

splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

Notice that `split_frame()` uses approximate rather than exact splitting (for efficiency), so the three sets are not exactly 70%, 15% and 15% of the total rows.

In [7]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

114908
24498
24581
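To see why the split sizes are only approximate, here is a minimal pure-Python sketch of probabilistic splitting. It assumes each row is assigned to a split by an independent uniform random draw. The function `approximate_split` is hypothetical and is not H2O's implementation, so its counts will not match the ones above.

```python
import random

def approximate_split(n_rows, ratios, seed=1):
    """Assign each row index to a split using one uniform random draw per row.

    Illustrative only: this mimics probabilistic splitting in general and is
    not H2O's actual split_frame() implementation.
    """
    rng = random.Random(seed)
    # Cumulative boundaries, e.g. [0.7, 0.85] for ratios [0.7, 0.15];
    # the remaining probability mass goes to the last split.
    boundaries = []
    total = 0.0
    for r in ratios:
        total += r
        boundaries.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        # Number of boundaries at or below u = index of the split for this row
        idx = sum(1 for b in boundaries if u >= b)
        splits[idx].append(row)
    return splits

train_idx, valid_idx, test_idx = approximate_split(163987, [0.7, 0.15])
sizes = [len(train_idx), len(valid_idx), len(test_idx)]
print(sizes)
print([round(s / 163987.0, 4) for s in sizes])
```

Because every row is assigned independently, the split sizes fluctuate around the requested proportions instead of hitting them exactly.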


### Identify response and predictor variables
In H2O, we use `y` to designate the response variable and `x` to designate the list of predictor columns.

In [8]:
y = 'bad_loan'
x = list(data.columns)

In [9]:
x.remove(y)  #remove the response
x.remove('int_rate')  #remove the interest rate column because it's correlated with the outcome

In [10]:
# List of predictor columns
x

[u'loan_amnt',
 u'term',
 u'emp_length',
 u'home_ownership',
 u'annual_inc',
 u'purpose',
 u'addr_state',
 u'dti',
 u'delinq_2yrs',
 u'revol_util',
 u'total_acc',
 u'longest_credit_length',
 u'verification_status']

## H2O Grid Search (GBM)

Now that we have prepared the data, we can train some models.  Rather than training models manually one-by-one, we will make use of the H2O Grid Search functionality to train a bunch of models at once.

H2O offers two types of grid search -- "Cartesian" and "RandomDiscrete".  Cartesian is the traditional, exhaustive, grid search over all the combinations of model parameters in the grid.  Random Grid Search will sample sets of model parameters randomly for some specified period of time (or maximum number of models).

We will use GBM as an example to demonstrate H2O's grid search functionality.

In [11]:
# Import H2O Grid Search:
from h2o.grid.grid_search import H2OGridSearch

# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

### Cartesian Grid Search

We first need to define a grid of GBM model hyperparameters.  For this particular example, we will grid over the following model parameters:

- `learn_rate`
- `max_depth`
- `sample_rate`
- `col_sample_rate`

In [12]:
# GBM hyperparameters
gbm_params1 = {'learn_rate': [0.01, 0.1], 
                'max_depth': [3, 5, 9],
                'sample_rate': [0.8, 1.0],
                'col_sample_rate': [0.2, 0.5, 1.0]}
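To make the size of this grid concrete, the following sketch enumerates every combination with `itertools.product`. It only illustrates what an exhaustive (Cartesian) search covers and does not call H2O. The names `names` and `combos` are introduced here for illustration.

```python
from itertools import product

# Same grid as above
gbm_params1 = {'learn_rate': [0.01, 0.1],
               'max_depth': [3, 5, 9],
               'sample_rate': [0.8, 1.0],
               'col_sample_rate': [0.2, 0.5, 1.0]}

# One dict per combination of hyperparameter values
names = sorted(gbm_params1)
combos = [dict(zip(names, values))
          for values in product(*(gbm_params1[n] for n in names))]

print(len(combos))  # 2 * 3 * 2 * 3 = 36
print(combos[0])
```

A Cartesian search over this grid therefore trains 36 models, and that number grows multiplicatively with every value added to any of the lists.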

#### Train and validate a grid of GBMs

If you want to specify non-default model parameters that are not part of your grid, you pass them along to the grid via the `H2OGridSearch.train()` method.  See `ntrees=100` in the example below.

In [13]:
gbm_grid1 = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid1',
                          hyper_params=gbm_params1)
gbm_grid1.train(x=x, y=y, 
                training_frame=train, 
                validation_frame=valid, 
                ntrees=100,
                seed=1)


gbm Grid Build Progress: [##################################################] 100%


#### Compare model performance

To compare the model performance among all the models in a grid, sorted by a particular metric (e.g. AUC), you can use the `get_grid` method. 

In [14]:
gbm_gridperf1 = gbm_grid1.get_grid(sort_by='auc', decreasing=True)

In [15]:
print(gbm_gridperf1)

      sample_rate  max_depth  learn_rate  col_sample_rate           model_ids  \
0             0.8          5        0.10              0.5  gbm_grid1_model_20   
1             1.0          5        0.10              0.5  gbm_grid1_model_21   
2             1.0          5        0.10              1.0  gbm_grid1_model_33   
3             0.8          5        0.10              1.0  gbm_grid1_model_32   
4             1.0          5        0.10              0.2   gbm_grid1_model_9   
5             0.8          5        0.10              0.2   gbm_grid1_model_8   
6             1.0          9        0.10              0.2  gbm_grid1_model_11   
7             1.0          3        0.10              1.0  gbm_grid1_model_31   
8             0.8          3        0.10              1.0  gbm_grid1_model_30   
9             0.8          3        0.10              0.5  gbm_grid1_model_18   
10            0.8          9        0.10              0.2  gbm_grid1_model_10   
11            1.0          3
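The grid above is ranked by AUC. As a reminder of what that metric measures, here is a small sketch of the pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function `pairwise_auc` and the toy data are hypothetical. This is not how H2O computes AUC internally, and the quadratic loop is only suitable for tiny inputs.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 1 = bad loan, 0 = good loan
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.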

### Random Grid Search
This example is set to run fairly quickly -- increase `max_runtime_secs` or `max_models` to cover more of the hyperparameter space.  Also, you can expand the hyperparameter space of each of the algorithms by modifying the hyperparameter lists below.

In addition to the hyperparameter dictionary, we will specify `search_criteria` with the 'RandomDiscrete' strategy and a maximum number of models equal to 36.

In [16]:
# GBM hyperparameters
gbm_params2 = {'learn_rate': [i * 0.01 for i in range(1, 11)], 
                'max_depth': list(range(2, 11)),
                'sample_rate': [i * 0.1 for i in range(5, 11)],
                'col_sample_rate': [i * 0.1 for i in range(1, 11)]}

# Search criteria
search_criteria2 = {'strategy': 'RandomDiscrete', 'max_models': 36}
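As a rough illustration of what a random discrete search does with these settings, the sketch below counts the combinations in the space and draws 36 distinct ones at random. The helper `sample_random_discrete` is hypothetical, and sampling without repeats is an assumption. H2O's own sampler will pick different combinations.

```python
import random

# Same search space as above
gbm_params2 = {'learn_rate': [i * 0.01 for i in range(1, 11)],
               'max_depth': list(range(2, 11)),
               'sample_rate': [i * 0.1 for i in range(5, 11)],
               'col_sample_rate': [i * 0.1 for i in range(1, 11)]}

def sample_random_discrete(hyper_params, max_models, seed=1):
    """Draw up to max_models distinct combinations uniformly at random."""
    rng = random.Random(seed)
    names = sorted(hyper_params)
    space_size = 1
    for n in names:
        space_size *= len(hyper_params[n])
    seen = set()
    chosen = []
    while len(chosen) < min(max_models, space_size):
        # Pick one index per hyperparameter; skip combinations already drawn
        idx = tuple(rng.randrange(len(hyper_params[n])) for n in names)
        if idx in seen:
            continue
        seen.add(idx)
        chosen.append({n: hyper_params[n][i] for n, i in zip(names, idx)})
    return chosen, space_size

chosen, space_size = sample_random_discrete(gbm_params2, max_models=36)
print(space_size)   # 10 * 9 * 6 * 10 = 5400
print(len(chosen))  # 36
```

With 5400 possible combinations, 36 models cover well under 1% of the space, which is why raising `max_models` or `max_runtime_secs` can pay off.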

#### Train and validate a random grid of GBMs

In [17]:
gbm_grid2 = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid2',
                          hyper_params=gbm_params2,
                          search_criteria=search_criteria2)
gbm_grid2.train(x=x, y=y, 
                training_frame=train, 
                validation_frame=valid, 
                ntrees=100,
                seed=1)


gbm Grid Build Progress: [##################################################] 100%


#### Compare model performance

In [18]:
gbm_gridperf2 = gbm_grid2.get_grid(sort_by='auc', decreasing=True)

In [19]:
print(gbm_gridperf2)

      sample_rate  max_depth  learn_rate  col_sample_rate           model_ids  \
0             1.0          7        0.09              0.5  gbm_grid2_model_24   
1             0.5          6        0.06              0.3  gbm_grid2_model_13   
2             0.5          6        0.07              0.5  gbm_grid2_model_33   
3             0.5          6        0.07              0.7  gbm_grid2_model_18   
4             0.8          7        0.07              1.0   gbm_grid2_model_2   
5             0.9          7        0.05              0.5  gbm_grid2_model_30   
6             0.9          9        0.10              0.4  gbm_grid2_model_32   
7             0.8          9        0.08              0.7  gbm_grid2_model_19   
8             0.5          6        0.05              0.9  gbm_grid2_model_16   
9             0.7         10        0.02              0.2   gbm_grid2_model_9   
10            0.9          8        0.09              0.9   gbm_grid2_model_0   
11            0.7          7

#### Add models to existing grid
It looks like `learn_rate=0.1` does well here, and that was the largest `learn_rate` in our previous search, so we may want to add some models with a higher `learn_rate` to our grid search.  We will create new `hyper_params` and `search_criteria` objects.

We can add models to the same grid by re-using the same `grid_id`. Let's add as many new models as we can train in 60 seconds by setting `max_runtime_secs=60` in `search_criteria`.

In [20]:
# GBM hyperparameters
gbm_params = {'learn_rate': [i * 0.01 for i in range(1, 31)],  #updated
                'max_depth': list(range(2, 11)),
                'sample_rate': [0.9, 0.95, 1.0],  #updated
                'col_sample_rate': [i * 0.1 for i in range(1, 11)]}

# Search criteria
search_criteria = {'strategy': 'RandomDiscrete', 'max_runtime_secs': 60}  #updated

In [21]:
gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                         grid_id='gbm_grid2',
                         hyper_params=gbm_params,
                         search_criteria=search_criteria)
gbm_grid.train(x=x, y=y, 
               training_frame=train, 
               validation_frame=valid, 
               ntrees=100,
               seed=1)


gbm Grid Build Progress: [##################################################] 100%


In [22]:
gbm_gridperf = gbm_grid.get_grid(sort_by='auc', decreasing=True)

In [23]:
print(gbm_gridperf)

      sample_rate  max_depth  learn_rate  col_sample_rate           model_ids  \
0            0.95          3        0.22              0.4  gbm_grid2_model_36   
1            1.00          7        0.09              0.5  gbm_grid2_model_24   
2            0.95          3        0.18              0.6  gbm_grid2_model_43   
3            0.50          6        0.06              0.3  gbm_grid2_model_13   
4            0.50          6        0.07              0.5  gbm_grid2_model_33   
5            0.50          6        0.07              0.7  gbm_grid2_model_18   
6            0.80          7        0.07              1.0   gbm_grid2_model_2   
7            0.90          7        0.05              0.5  gbm_grid2_model_30   
8            0.90          9        0.10              0.4  gbm_grid2_model_32   
9            0.80          9        0.08              0.7  gbm_grid2_model_19   
10           0.50          6        0.05              0.9  gbm_grid2_model_16   
11           0.70         10

Lastly, let's extract the top model, as determined by validation AUC, from the grid.

In [24]:
# Grab the top GBM model, chosen by validation AUC
best_gbm_model = gbm_gridperf.models[0]

In [25]:
# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance

gbm_perf = best_gbm_model.model_performance(test)
print(gbm_perf.auc())

0.683855910541


This is slightly higher than the top model's AUC on the validation set.  However, model performance evaluated on a held-out test set is a more honest estimate of performance: the validation set was used to select the best model, so it should not also be used to evaluate that model's performance.
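The reason for keeping a separate test set can be illustrated with a small simulation. The sketch below makes strong, hypothetical assumptions: all 36 models share the same true AUC of 0.68, and validation and test estimates are independent Gaussian draws around it. None of these numbers come from the grid above. Picking the model with the best validation score then yields a validation estimate that is optimistic on average, while the test estimate of the selected model is not.

```python
import random

def mean(values):
    return sum(values) / float(len(values))

def simulate_selection(n_models=36, true_auc=0.68, noise_sd=0.005,
                       n_trials=500, seed=1):
    """Compare validation vs. test AUC of the model selected on validation AUC.

    All parameters are hypothetical and chosen for illustration only.
    """
    rng = random.Random(seed)
    valid_of_best = []
    test_of_best = []
    for _ in range(n_trials):
        # Noisy validation estimates for models that are in truth equally good
        valid_scores = [rng.gauss(true_auc, noise_sd) for _ in range(n_models)]
        best = max(range(n_models), key=lambda i: valid_scores[i])
        valid_of_best.append(valid_scores[best])
        # Independent test-set estimate for the selected model
        test_of_best.append(rng.gauss(true_auc, noise_sd))
    return mean(valid_of_best), mean(test_of_best)

mean_valid, mean_test = simulate_selection()
print(mean_valid)  # above the true AUC on average
print(mean_test)   # close to the true AUC
```

In any single run the test AUC can still come out above the validation AUC, as it did here. The point is only that the validation score of a selected model is biased upward on average.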

## H2O Grid Search (DL)

Next we will explore some deep learning parameters in a random grid search.  We will execute the grid search for 120 seconds.

In [29]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

In [30]:
# DL hyperparameters
activation_opt = ["Rectifier", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"]
l1_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
l2_opt = [0, 0.00001, 0.0001, 0.001, 0.01, 0.1]
dl_params = {'activation': activation_opt, 'l1': l1_opt, 'l2': l2_opt}

# Search criteria
search_criteria = {'strategy': 'RandomDiscrete', 'max_runtime_secs': 120, 'seed':1}

In [31]:
dl_grid = H2OGridSearch(model=H2ODeepLearningEstimator,
                        grid_id='dl_grid1',
                        hyper_params=dl_params,
                        search_criteria=search_criteria)

# hyper_params and search_criteria were already given to H2OGridSearch above
dl_grid.train(x=x, y=y,
              training_frame=train, 
              validation_frame=valid, 
              hidden=[10,10])

dl_gridperf = dl_grid.get_grid(sort_by='auc', decreasing=True)


deeplearning Grid Build Progress: [##################################################] 100%


In [32]:
# Grab the top DL model, chosen by validation AUC
best_dl_model = dl_gridperf.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance

dl_perf = best_dl_model.model_performance(test)
print(dl_perf.auc())
