# Introductory H2O Machine Learning Tutorial

Prepared for H2O Open Chicago 2016: http://open.h2o.ai/chicago.html

## Install H2O

The first step in this tutorial is to download and install the h2o Python module.  
The latest version is always here: http://www.h2o.ai/download/h2o/py

### Start up the H2O Cluster

Once the Python module is installed, we begin by starting up a local (on your laptop) H2O cluster.

In [1]:
# Load the H2O library and start up the H2O cluster locally on your machine
import h2o

# Number of threads, nthreads = -1, means use all cores on your machine
# max_mem_size is the maximum memory (in GB) to allocate to H2O
h2o.init(nthreads = -1, max_mem_size = 8)



No instance found at ip and port: localhost:54321. Trying to start local jar...


JVM stdout: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpfsL1IA/h2o_me_started_from_python.out
JVM stderr: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpzl_9R7/h2o_me_started_from_python.err
Using ice_root: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpYBAmbc


Java Version: java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)


Starting H2O JVM and connecting: ......... Connection successful!



| Property | Value |
|---|---|
| H2O cluster uptime: | 1 seconds 115 milliseconds |
| H2O cluster version: | 3.8.2.3 |
| H2O cluster name: | H2O_started_from_python_me_ybz837 |
| H2O cluster total nodes: | 1 |
| H2O cluster total free memory: | 7.11 GB |
| H2O cluster total cores: | 8 |
| H2O cluster allowed cores: | 8 |
| H2O cluster healthy: | True |
| H2O Connection ip: | 127.0.0.1 |
| H2O Connection port: | 54321 |


## Data prep

### Import data
Next we will import a cleaned up version of the Lending Club "Bad Loans" dataset. The purpose here is to predict whether a loan will be bad (i.e. not repaid to the lender). The response column, `bad_loan`, is 1 if the loan was bad, and 0 otherwise.

In [2]:
loan_csv = "/Users/me/h2oai/code/demos/lending_club/loan.csv"  # modify this for your machine
# Alternatively, you can import the data directly from a URL
#loan_csv = "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
data = h2o.import_file(loan_csv)  # 163,994 rows x 15 columns


Parse Progress: [##################################################] 100%


In [3]:
data.shape

(163994, 15)

### Encode response variable
Since we want to train a binary classification model, we must ensure that the response is coded as a factor. If the response is 0/1, H2O will assume it's numeric, which means that H2O will train a regression model instead.

In [4]:
data['bad_loan'] = data['bad_loan'].asfactor()  #encode the binary response as a factor
data['bad_loan'].levels()  #optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

### Partition data

Next, we partition the data into training, validation and test sets.

In [5]:
# Partition data into 70%, 15%, 15% chunks
# Setting a seed will guarantee reproducibility

splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

Notice that `split_frame()` uses approximate splitting rather than exact splitting (for efficiency), so the three partitions are not exactly 70%, 15% and 15% of the total rows.

In [6]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

114914
24499
24581
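
To see why the row counts above are close to, but not exactly, 70/15/15, the sketch below assigns each row to a partition with an independent random draw. It is only a plain-Python illustration of approximate splitting: the helper name `approximate_split` is hypothetical, and the code is not H2O's `split_frame()` implementation.

```python
import random

def approximate_split(n_rows, ratios, seed):
    """Assign each row to a partition by an independent uniform draw.

    Hypothetical helper for illustration; not H2O's split_frame() code.
    """
    rng = random.Random(seed)
    # Cumulative cutoffs, e.g. roughly [0.7, 0.85] for ratios [0.7, 0.15]
    cutoffs = []
    running = 0.0
    for r in ratios:
        running += r
        cutoffs.append(running)
    counts = [0] * (len(ratios) + 1)
    for _ in range(n_rows):
        u = rng.random()
        # Number of cutoffs that u has passed is the partition index
        bucket = sum(1 for c in cutoffs if u >= c)
        counts[bucket] += 1
    return counts

n = 163994
counts = approximate_split(n, [0.7, 0.15], seed=1)
fractions = [c / float(n) for c in counts]

# Fractions actually observed in the tutorial output above
observed = [114914 / float(n), 24499 / float(n), 24581 / float(n)]

print(counts)
print(["%.4f" % f for f in observed])
```

Because every row is placed independently, the partition sizes fluctuate around their targets, while a fixed seed makes the same fluctuation repeat from run to run.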


### Identify response and predictor variables
In H2O, we use `y` to designate the response variable and `x` to designate the list of predictor columns.

In [7]:
y = 'bad_loan'
x = list(data.columns)

In [8]:
x.remove(y)  #remove the response
x.remove('int_rate')  #remove the interest rate column because it's correlated with the outcome

In [9]:
# List of predictor columns
x

[u'loan_amnt',
 u'term',
 u'emp_length',
 u'home_ownership',
 u'annual_inc',
 u'verification_status',
 u'purpose',
 u'addr_state',
 u'dti',
 u'delinq_2yrs',
 u'revol_util',
 u'total_acc',
 u'longest_credit_length']

## H2O Machine Learning

Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algos:

- Generalized Linear Model (GLM)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)
- Deep Learning (DL)
- Naive Bayes (NB)

## 1. Generalized Linear Model
Let's start with a basic binomial Generalized Linear Model (GLM).  By default, H2O's GLM uses a regularized, elastic net model.

In [10]:
# Import H2O GLM:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

### Train a default GLM
We first create an object of class `H2OGeneralizedLinearEstimator`.  This does not actually do any training; it just sets the model up for training by specifying model parameters.

In [11]:
# Initialize the GLM estimator:
# Similar to R's glm() and H2O's R GLM, H2O's GLM has the "family" argument

glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

Now that the `glm_fit1` object is initialized, we can train the model:

In [12]:
glm_fit1.train(x=x, y=y, training_frame=train)


glm Model Build Progress: [##################################################] 100%


### Train a GLM with lambda search

Next we will do some automatic tuning by passing in a validation frame and setting `lambda_search = True`.  Since we are training a GLM with regularization, we should try to find the right amount of regularization (to avoid overfitting).  The model parameter, `lambda`, controls the amount of regularization in a GLM model and we can find the optimal value for `lambda` automatically by setting `lambda_search = True` and passing in a validation frame (which is used to evaluate model performance using a particular value of lambda).

In [13]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)


glm Model Build Progress: [##################################################] 100%
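
The regularization that `lambda` controls can be written out explicitly. The sketch below uses the common elastic net form, `lambda * (alpha * L1 + (1 - alpha) / 2 * L2)`, where `alpha` mixes the lasso (L1) and ridge (L2) penalties. The function name and the example coefficients are made up for illustration, and the exact scaling H2O applies internally is an assumption here.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic net penalty in its common textbook form (illustrative only)."""
    l1 = sum(abs(b) for b in coefs)   # lasso part: sum of absolute values
    l2 = sum(b * b for b in coefs)    # ridge part: sum of squares
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)

# Hypothetical coefficient vector, used only to show how the penalty scales
example_coefs = [1.0, -2.0]

# A larger lambda penalizes the same coefficients more heavily
penalties = [elastic_net_penalty(example_coefs, lam, 0.5)
             for lam in (0.0, 0.01, 0.1)]
print(penalties)
```

A lambda search tries a sequence of such `lambda` values and keeps the one that scores best on the validation frame.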


### Evaluate model performance
Let's compare the performance of the two GLMs that were just trained.

In [14]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

In [15]:
# Print model performance
print(glm_perf1)
print(glm_perf2)


ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.142180931463
R^2: 0.0596363700308
LogLoss: 0.451975547887
Null degrees of freedom: 24580
Residual degrees of freedom: 24526
Null deviance: 23594.8464815
Residual deviance: 22220.0218852
AIC: 22330.0218852
AUC: 0.673463297871
Gini: 0.346926595742

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.194176582531: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 13754.0 | 6263.0 | 0.3129 | (6263.0/20017.0) |
| 1 | 1970.0 | 2594.0 | 0.4316 | (1970.0/4564.0) |
| Total | 15724.0 | 8857.0 | 0.3349 | (8233.0/24581.0) |



Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.1941766 | 0.3865584 | 220.0 |
| max f2 | 0.1125739 | 0.5515444 | 312.0 |
| max f0point5 | 0.2733495 | 0.3470929 | 150.0 |
| max accuracy | 0.5242047 | 0.8150197 | 25.0 |
| max precision | 0.7302341 | 0.6666667 | 1.0 |
| max recall | 0.0002607 | 1.0 | 399.0 |
| max specificity | 0.7351924 | 0.9999500 | 0.0 |
| max absolute_MCC | 0.1975597 | 0.2076043 | 217.0 |
| max min_per_class_accuracy | 0.1786614 | 0.6247190 | 236.0 |



Gains/Lift Table: Avg response rate: 18.57 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100077 | 0.4669643 | 2.7586039 | 2.7586039 | 0.5121951 | 0.5121951 | 0.0276074 | 0.0276074 | 175.8603920 | 175.8603920 |
| 2 | 0.0200155 | 0.4288903 | 2.3645176 | 2.5615608 | 0.4390244 | 0.4756098 | 0.0236635 | 0.0512708 | 136.4517646 | 156.1560783 |
| 3 | 0.0300232 | 0.4032008 | 2.3207303 | 2.4812839 | 0.4308943 | 0.4607046 | 0.0232252 | 0.0744961 | 132.0730282 | 148.1283950 |
| 4 | 0.0400309 | 0.3832194 | 2.0361124 | 2.3699911 | 0.3780488 | 0.4400407 | 0.0203769 | 0.0948729 | 103.6112417 | 136.9991067 |
| 5 | 0.0500386 | 0.3677809 | 2.0580061 | 2.3075941 | 0.3821138 | 0.4284553 | 0.0205960 | 0.1154689 | 105.8006099 | 130.7594073 |
| 6 | 0.1000366 | 0.3090170 | 1.8580949 | 2.0829359 | 0.3449959 | 0.3867426 | 0.0929010 | 0.2083699 | 85.8094872 | 108.2935871 |
| 7 | 0.1500346 | 0.2726729 | 1.7090967 | 1.9583566 | 0.3173312 | 0.3636117 | 0.0854514 | 0.2938212 | 70.9096698 | 95.8356602 |
| 8 | 0.2000325 | 0.2487816 | 1.3891889 | 1.8160936 | 0.2579333 | 0.3371975 | 0.0694566 | 0.3632778 | 38.9188855 | 81.6093604 |
| 9 | 0.3000285 | 0.2103532 | 1.3300278 | 1.6540936 | 0.2469487 | 0.3071186 | 0.1329974 | 0.4962752 | 33.0027815 | 65.4093644 |





ModelMetricsBinomialGLM: glm
** Reported on test data. **

MSE: 0.142186436321
R^2: 0.0595999617154
LogLoss: 0.452003200537
Null degrees of freedom: 24580
Residual degrees of freedom: 24524
Null deviance: 23594.8464815
Residual deviance: 22221.3813448
AIC: 22335.3813448
AUC: 0.673426207356
Gini: 0.346852414711

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.194406524554: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 13784.0 | 6233.0 | 0.3114 | (6233.0/20017.0) |
| 1 | 1981.0 | 2583.0 | 0.434 | (1981.0/4564.0) |
| Total | 15765.0 | 8816.0 | 0.3342 | (8214.0/24581.0) |



Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.1944065 | 0.3860987 | 222.0 |
| max f2 | 0.1076138 | 0.5516607 | 321.0 |
| max f0point5 | 0.2744753 | 0.3471057 | 152.0 |
| max accuracy | 0.5243884 | 0.8151011 | 23.0 |
| max precision | 0.7273924 | 0.6666667 | 1.0 |
| max recall | 0.0002627 | 1.0 | 399.0 |
| max specificity | 0.7346291 | 0.9999500 | 0.0 |
| max absolute_MCC | 0.1973189 | 0.2074339 | 219.0 |
| max min_per_class_accuracy | 0.1794809 | 0.6270815 | 237.0 |



Gains/Lift Table: Avg response rate: 18.57 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | cumulative_response_rate | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100077 | 0.4663871 | 2.7804976 | 2.7804976 | 0.5162602 | 0.5162602 | 0.0278265 | 0.0278265 | 178.0497602 | 178.0497602 |
| 2 | 0.0200155 | 0.4273468 | 2.3426240 | 2.5615608 | 0.4349593 | 0.4756098 | 0.0234443 | 0.0512708 | 134.2623964 | 156.1560783 |
| 3 | 0.0300232 | 0.4022065 | 2.2988366 | 2.4739861 | 0.4268293 | 0.4593496 | 0.0230061 | 0.0742770 | 129.8836600 | 147.3986056 |
| 4 | 0.0400309 | 0.3830004 | 1.9923251 | 2.3535708 | 0.3699187 | 0.4369919 | 0.0199387 | 0.0942156 | 99.2325054 | 135.3570805 |
| 5 | 0.0500386 | 0.3670886 | 2.0361124 | 2.2900791 | 0.3780488 | 0.4252033 | 0.0203769 | 0.1145925 | 103.6112417 | 129.0079128 |
| 6 | 0.1000366 | 0.3086472 | 1.8712418 | 2.0807456 | 0.3474369 | 0.3863359 | 0.0935583 | 0.2081507 | 87.1241770 | 108.0745613 |
| 7 | 0.1500346 | 0.2724995 | 1.7090967 | 1.9568962 | 0.3173312 | 0.3633406 | 0.0854514 | 0.2936021 | 70.9096698 | 95.6896232 |
| 8 | 0.2000325 | 0.2485387 | 1.4198649 | 1.8226657 | 0.2636290 | 0.3384177 | 0.0709904 | 0.3645925 | 41.9864949 | 82.2665716 |
| 9 | 0.3000285 | 0.2101283 | 1.3168809 | 1.6540936 | 0.2445077 | 0.3071186 | 0.1316827 | 0.4962752 | 31.6880918 | 65.4093644 |






Instead of printing the entire model performance metrics object, it is probably easier to print just the metric that you are interested in comparing.

In [16]:
# Retrieve test set AUC
print(glm_perf1.auc())
print(glm_perf2.auc())

0.673463297871
0.673426207356
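
Several of the numbers in the full `glm_perf1` printout can be recomputed by hand, which helps when deciding which single metric to report. The sketch below is plain arithmetic on the counts from the first confusion matrix and on the reported AUC; it does not call H2O, and the variable names are chosen here for illustration.

```python
# Counts copied from the glm_fit1 test-set confusion matrix (max F1 threshold)
tn, fp = 13754.0, 6263.0   # actual 0: predicted 0, predicted 1
fn, tp = 1970.0, 2594.0    # actual 1: predicted 0, predicted 1

# Per-class and overall error rates (the "Error" column)
error_class0 = fp / (tn + fp)
error_class1 = fn / (fn + tp)
error_total = (fp + fn) / (tn + fp + fn + tp)

# F1 is the harmonic mean of precision and recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The Gini coefficient reported next to the AUC is 2 * AUC - 1
auc = 0.673463297871
gini = 2 * auc - 1

print("errors: %.4f %.4f %.4f" % (error_class0, error_class1, error_total))
print("f1: %.7f" % f1)
print("gini: %.12f" % gini)
```

The results agree with the printed report: the error column, the `max f1` value, and the `Gini` line.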


In [17]:
# Compare test AUC to the training AUC and validation AUC
print(glm_fit2.auc(train=True))
print(glm_fit2.auc(valid=True))

0.676579958645
0.671422993628
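
As a reminder of what these AUC values measure, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The labels and scores are made-up toy data, and this quadratic-time function only illustrates the definition; it is not how H2O computes the metric.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): two negatives and two positives
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so values near 0.67 indicate a modest but real signal.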


## 2. Random Forest
H2O's Random Forest (RF) implements a distributed version of the standard Random Forest algorithm and variable importance measures.

In [18]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

### Train a default RF
First we will train a basic Random Forest model with default parameters. Random Forest will infer the response distribution from the response encoding. A seed is required for reproducibility.

In [19]:
# Initialize the RF estimator:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1)

Now that the `rf_fit1` object is initialized, we can train the model:

In [20]:
rf_fit1.train(x=x, y=y, training_frame=train)


drf Model Build Progress: [##################################################] 100%


### Train an RF with more trees

Next we will increase the number of trees used in the forest by setting `ntrees = 100`.  The default number of trees in an H2O Random Forest is 50, so this RF will be twice as big as the default.  Increasing the number of trees in an RF usually increases performance as well.  Unlike Gradient Boosting Machines (GBMs), Random Forests are fairly resistant to (although not free from) overfitting as the number of trees grows.  See the GBM example below for additional guidance on preventing overfitting using H2O's early stopping functionality.

In [21]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1)
rf_fit2.train(x=x, y=y, training_frame=train)


drf Model Build Progress: [##################################################] 100%


### Compare model performance
Let's compare the performance of the two RFs that were just trained.

In [22]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

In [23]:
# Retrieve test set AUC
print(rf_perf1.auc())
print(rf_perf2.auc())

0.6650349613
0.671842491069


### Cross-validate performance

Rather than using a held-out test set to evaluate model performance, a user may wish to estimate model performance using cross-validation.  Using the RF algorithm (with default model parameters) as an example, we demonstrate how to perform k-fold cross-validation using H2O.  No custom code or loops are required; you simply specify the number of desired folds in the `nfolds` argument.

Since we are not going to use a test set here, we can use the original (full) dataset, which we called `data` rather than the subsampled `train` dataset.  Note that this will take approximately k (`nfolds`) times longer than training a single RF model, since it will train k models in the cross-validation process (trained on n(k-1)/k rows), in addition to the final model trained on the full `training_frame` dataset with n rows. 

In [24]:
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=data)


drf Model Build Progress: [##################################################] 100%
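
The row counts behind k-fold cross-validation can be worked out with a few lines of plain Python. The sketch below assumes a simple modulo assignment (row `i` goes to fold `i % nfolds`), chosen only because it is easy to follow; how H2O assigns rows to folds by default is not shown here, and the helper name `kfold_sizes` is hypothetical.

```python
def kfold_sizes(n_rows, nfolds):
    """Return (training_rows, holdout_rows) for each fold.

    Assumes modulo fold assignment, purely for illustration.
    """
    sizes = []
    for fold in range(nfolds):
        holdout = len(range(fold, n_rows, nfolds))  # rows with i % nfolds == fold
        sizes.append((n_rows - holdout, holdout))
    return sizes

n = 163994
k = 5
sizes = kfold_sizes(n, k)

# k cross-validation models plus one final model on all n rows
models_trained = k + 1

for train_rows, holdout_rows in sizes:
    print(train_rows, holdout_rows)
```

Each cross-validation model trains on roughly n(k-1)/k rows, about 131,195 here, and every row is held out exactly once.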


To evaluate the cross-validated AUC, do the following:

In [25]:
print(rf_fit3.auc(xval=True))

0.661846394259


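Note that the cross-validated AUC (0.6618) is slightly lower than the test set AUC we estimated for `rf_fit1` (0.6650), even though the final `rf_fit3` model was trained on more data (n rows) than `rf_fit1`, which used `train` as the training set (about 0.7*n rows).  The two estimates are computed on different held-out rows, and each cross-validation model sees only n(k-1)/k rows, so a small gap like this is not surprising.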

## 3. Gradient Boosting Machine
H2O's Gradient Boosting Machine (GBM) offers a Stochastic GBM, which can increase performance quite a bit compared to the original GBM implementation.

In [26]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

### Train a default GBM

First we will train a basic GBM model with default parameters. GBM will infer the response distribution from the response encoding if not specified explicitly through the `distribution` argument. A seed is required for reproducibility.

In [27]:
# Initialize and train the GBM estimator:

gbm_fit1 = H2OGradientBoostingEstimator(model_id='gbm_fit1', seed=1)
gbm_fit1.train(x=x, y=y, training_frame=train)


gbm Model Build Progress: [##################################################] 100%


### Train a GBM with more trees

Next we will increase the number of trees used in the GBM by setting `ntrees=500`.  The default number of trees in an H2O GBM is 50, so this GBM will be trained using ten times the default.  Increasing the number of trees in a GBM is one way to increase performance of the model; however, you have to be careful not to overfit your model to the training data by using too many trees.  To automatically find the optimal number of trees, you must use H2O's early stopping functionality.  This example will not do that; the following example will.

In [28]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500, seed=1)
gbm_fit2.train(x=x, y=y, training_frame=train)


gbm Model Build Progress: [##################################################] 100%


### Train a GBM with early stopping

We will again set `ntrees = 500`, however, this time we will use early stopping in order to prevent overfitting (from too many trees).  All of H2O's algorithms have early stopping available, however, with the exception of Deep Learning, it is not enabled by default.  

There are several parameters that should be used to control early stopping.  The three that are generic to all the algorithms are: `stopping_rounds`, `stopping_metric` and `stopping_tolerance`.  The stopping metric is the metric by which you'd like to measure performance, and so we will choose AUC here.  The `score_tree_interval` is a parameter specific to Random Forest and GBM.  Setting `score_tree_interval=5` will score the model after every five trees.  The parameters we have set below specify that the model will stop training after there have been three scoring intervals where the AUC has not increased more than 0.0005.  Since we have specified a validation frame, the stopping tolerance will be computed on validation AUC rather than training AUC. 

In [29]:
# Now let's use early stopping to find optimal ntrees

gbm_fit3 = H2OGradientBoostingEstimator(model_id='gbm_fit3', 
                                        ntrees=500, 
                                        score_tree_interval=5,     #used for early stopping
                                        stopping_rounds=3,         #used for early stopping
                                        stopping_metric='AUC',     #used for early stopping
                                        stopping_tolerance=0.0005, #used for early stopping
                                        seed=1)

# The use of a validation_frame is recommended when using early stopping
gbm_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)


gbm Model Build Progress: [##################################################] 100%
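
The stopping rule described above can be sketched in plain Python. This is a deliberately simplified version that follows the wording of this tutorial (stop once the last `stopping_rounds` scoring events have not improved the metric by more than `stopping_tolerance`); H2O's actual criterion may differ in detail, for example in how it smooths the metric or whether the tolerance is relative, and the function names are hypothetical. The validation AUC values are the first ten scoring events from the `gbm_fit3` scoring history shown later in this tutorial; the plateau series is made up.

```python
def should_stop(scores, stopping_rounds, stopping_tolerance):
    """Simplified early-stopping check for a metric where larger is better."""
    if len(scores) <= stopping_rounds:
        return False  # not enough scoring events yet
    best_earlier = max(scores[:-stopping_rounds])
    best_recent = max(scores[-stopping_rounds:])
    return (best_recent - best_earlier) <= stopping_tolerance

def first_stop(scores, stopping_rounds, stopping_tolerance):
    """Number of scoring events after which training would stop, or None."""
    for i in range(1, len(scores) + 1):
        if should_stop(scores[:i], stopping_rounds, stopping_tolerance):
            return i
    return None

# Validation AUC at 0, 5, 10, ..., 45 trees (from the gbm_fit3 scoring history)
validation_auc = [0.5, 0.654807, 0.661266, 0.664876, 0.668252,
                  0.671015, 0.67342, 0.67467, 0.67608, 0.677038]

# Made-up series that flattens out, to show the rule firing
plateau = [0.60, 0.65, 0.6501, 0.6502, 0.6503]

print(first_stop(validation_auc, 3, 0.0005))
print(first_stop(plateau, 3, 0.0005))
```

In the first ten scoring events the validation AUC is still improving by more than the tolerance, so this simplified rule does not fire yet; on the flat series it fires as soon as three non-improving events have accumulated.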


### Compare model performance

Let's compare the performance of the three GBMs that were just trained.

In [30]:
gbm_perf1 = gbm_fit1.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
gbm_perf3 = gbm_fit3.model_performance(test)

In [31]:
# Retrieve test set AUC
print(gbm_perf1.auc())
print(gbm_perf2.auc())
print(gbm_perf3.auc())

0.682277847572
0.6711075658
0.683018793141


### Scoring History

To examine the scoring history, use the `scoring_history` method on a trained model.  If `score_tree_interval` is not specified, it will score at various intervals, as we can see for `gbm_fit2.scoring_history()` below.  However, regular 5-tree intervals are used for `gbm_fit3.scoring_history()`.  

The `gbm_fit2` was trained only using a training set (no validation set), so the scoring history is calculated for training set performance metrics only.

In [32]:
gbm_fit2.scoring_history()

| | timestamp | duration | number_of_trees | training_MSE | training_logloss | training_AUC | training_lift | training_classification_error |
|---|---|---|---|---|---|---|---|---|
| 0 | 2016-05-02 21:30:41 | 0.003 sec | 0 | 0.148636 | 0.473845 | 0.5 | 1.0 | 0.818377 |
| 1 | 2016-05-02 21:30:41 | 0.077 sec | 1 | 0.147131 | 0.468892 | 0.659246 | 2.381087 | 0.338662 |
| 2 | 2016-05-02 21:30:42 | 0.137 sec | 2 | 0.145885 | 0.464902 | 0.662886 | 3.007114 | 0.342909 |
| 3 | 2016-05-02 21:30:42 | 0.206 sec | 3 | 0.144821 | 0.461561 | 0.668455 | 3.044227 | 0.347503 |
| 4 | 2016-05-02 21:30:42 | 0.293 sec | 4 | 0.143939 | 0.458797 | 0.672099 | 3.018472 | 0.347451 |
| 5 | 2016-05-02 21:30:42 | 0.390 sec | 5 | 0.14318 | 0.45646 | 0.673476 | 3.03205 | 0.375141 |
| 6 | 2016-05-02 21:30:42 | 0.514 sec | 6 | 0.142511 | 0.454391 | 0.675946 | 3.186165 | 0.359678 |
| 7 | 2016-05-02 21:30:42 | 0.691 sec | 7 | 0.141913 | 0.452585 | 0.678256 | 3.188863 | 0.358163 |
| 8 | 2016-05-02 21:30:42 | 0.892 sec | 8 | 0.141389 | 0.450968 | 0.679707 | 3.319744 | 0.349209 |
| 9 | 2016-05-02 21:30:43 | 1.113 sec | 9 | 0.14092 | 0.449552 | 0.680947 | 3.314056 | 0.343326 |


When early stopping is used, we see that training stopped at 105 trees instead of the full 500.  Since we used a validation set in `gbm_fit3`, both training and validation performance metrics are stored in the scoring history object.  Take a look at the validation AUC to observe that the correct stopping tolerance was enforced.

In [33]:
gbm_fit3.scoring_history()

| | timestamp | duration | number_of_trees | training_MSE | training_logloss | training_AUC | training_lift | training_classification_error | validation_MSE | validation_logloss | validation_AUC | validation_lift | validation_classification_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-05-02 21:31:21 | 0.010 sec | 0 | 0.148636 | 0.473845 | 0.5 | 1.0 | 0.818377 | 0.152052 | 0.481921 | 0.5 | 1.0 | 0.813013 |
| 1 | 2016-05-02 21:31:22 | 1.252 sec | 5 | 0.14318 | 0.45646 | 0.673476 | 3.03205 | 0.375141 | 0.147481 | 0.467226 | 0.654807 | 2.223073 | 0.359566 |
| 2 | 2016-05-02 21:31:23 | 1.660 sec | 10 | 0.140523 | 0.448312 | 0.682198 | 3.289249 | 0.346381 | 0.14551 | 0.461083 | 0.661266 | 2.468289 | 0.376873 |
| 3 | 2016-05-02 21:31:23 | 2.104 sec | 15 | 0.138982 | 0.443568 | 0.68936 | 3.33504 | 0.339567 | 0.144536 | 0.45805 | 0.664876 | 2.46661 | 0.360056 |
| 4 | 2016-05-02 21:31:24 | 2.623 sec | 20 | 0.137902 | 0.44016 | 0.696091 | 3.375367 | 0.31112 | 0.14394 | 0.45613 | 0.668252 | 2.59758 | 0.344545 |
| 5 | 2016-05-02 21:31:24 | 3.205 sec | 25 | 0.137022 | 0.43739 | 0.701169 | 3.408881 | 0.317986 | 0.143482 | 0.454688 | 0.671015 | 2.553923 | 0.347606 |
| 6 | 2016-05-02 21:31:25 | 3.846 sec | 30 | 0.13631 | 0.435154 | 0.705434 | 3.456759 | 0.316724 | 0.143138 | 0.453564 | 0.67342 | 2.619408 | 0.334299 |
| 7 | 2016-05-02 21:31:26 | 4.525 sec | 35 | 0.135752 | 0.433369 | 0.708569 | 3.619542 | 0.297379 | 0.142954 | 0.452914 | 0.67467 | 2.575752 | 0.343687 |
| 8 | 2016-05-02 21:31:26 | 5.255 sec | 40 | 0.13523 | 0.431742 | 0.711801 | 3.653056 | 0.319178 | 0.142793 | 0.452352 | 0.67608 | 2.575752 | 0.348014 |
| 9 | 2016-05-02 21:31:27 | 6.025 sec | 45 | 0.134757 | 0.430274 | 0.714796 | 3.686571 | 0.311798 | 0.14266 | 0.451909 | 0.677038 | 2.619408 | 0.34083 |


## 4. Deep Learning

H2O's Deep Learning algorithm is a multilayer feed-forward artificial neural network.  It can also be used to train an autoencoder, however, in the example below we will train a standard supervised prediction model.

In [34]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

### Train a default DL

First we will train a basic DL model with default parameters. DL will infer the response distribution from the response encoding if not specified explicitly through the `distribution` argument. A seed is required for reproducibility.

In H2O's DL, early stopping is enabled by default, so below, it will use the training set and default stopping parameters to perform early stopping.

In [35]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1)
dl_fit1.train(x=x, y=y, training_frame=train)


deeplearning Model Build Progress: [##################################################] 100%


### Train a DL with new architecture and more epochs

Next we will increase the number of epochs used in the DL model by setting `epochs=20` (the default is 10).  Increasing the number of epochs in a deep neural net may increase performance of the model; however, you have to be careful not to overfit your model.  To automatically find the optimal number of epochs, you must use H2O's early stopping functionality.  Unlike the rest of the H2O algorithms, H2O's DL will use early stopping by default, so for comparison we will first turn it off in the next example by setting `stopping_rounds=0`.

In [36]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=20, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1)
dl_fit2.train(x=x, y=y, training_frame=train)


deeplearning Model Build Progress: [##################################################] 100%


### Train a DL with early stopping

This example will use the same model parameters as `dl_fit2`, however, we will turn on early stopping and specify the stopping criterion.  We will also pass a validation set, as is recommended for early stopping.

In [37]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=20, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=3,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)


deeplearning Model Build Progress: [##################################################] 100%


### Compare model performance

Again, we will compare the model performance of the three models using a test set and AUC.

In [38]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [39]:
# Retrieve test set AUC
print(dl_perf1.auc())
print(dl_perf2.auc())
print(dl_perf3.auc())

0.67102862326
0.678929674676
0.679735207107


In [40]:
dl_fit3.scoring_history()

| | timestamp | duration | training_speed | epochs | iterations | samples | training_MSE | training_r2 | training_logloss | training_AUC | training_lift | training_classification_error | validation_MSE | validation_r2 | validation_logloss | validation_AUC | validation_lift | validation_classification_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-05-02 21:33:09 | 0.000 sec | | 0.0 | 0 | 0 | | | | | | | | | | | | |
| 1 | 2016-05-02 21:33:09 | 0.290 sec | 486848 rows/sec | 0.86851 | 1 | 99804 | 0.144751 | 0.032042 | 0.466126 | 0.66621 | 2.207266 | 0.336367 | 0.147865 | 0.027352 | 0.47582 | 0.662752 | 2.59758 | 0.384179 |
| 2 | 2016-05-02 21:33:10 | 1.427 sec | 623514 rows/sec | 6.961458 | 8 | 799969 | 0.146427 | 0.020835 | 0.471884 | 0.67288 | 2.207266 | 0.33121 | 0.14929 | 0.017975 | 0.480727 | 0.668756 | 2.72855 | 0.374342 |
| 3 | 2016-05-02 21:33:11 | 2.510 sec | 693977 rows/sec | 13.920123 | 16 | 1599617 | 0.140187 | 0.062565 | 0.445 | 0.683268 | 2.538356 | 0.323018 | 0.143227 | 0.057857 | 0.453731 | 0.67503 | 2.663065 | 0.339891 |
| 4 | 2016-05-02 21:33:12 | 3.418 sec | 734327 rows/sec | 20.007832 | 23 | 2299180 | 0.140743 | 0.058845 | 0.449854 | 0.682104 | 2.538356 | 0.327164 | 0.143738 | 0.054497 | 0.459053 | 0.675358 | 2.859521 | 0.340871 |


## 5. Naive Bayes

The Naive Bayes (NB) algorithm does not usually beat an algorithm like a Random Forest or GBM, however it is still a popular algorithm, especially in the text domain (when your input is text encoded as "Bag of Words", for example).  The Naive Bayes algorithm is for binary or multiclass classification problems only, not regression.  Therefore, your response must be a factor instead of numeric. 

In [41]:
# Import H2O NB:
from h2o.estimators.naive_bayes import H2ONaiveBayesEstimator

### Train a default NB

First we will train a basic NB model with default parameters. 

In [44]:
# Initialize and train the NB estimator:

nb_fit1 = H2ONaiveBayesEstimator(model_id='nb_fit1')
nb_fit1.train(x=x, y=y, training_frame=train)


naivebayes Model Build Progress: [##################################################] 100%


### Train a NB model with Laplace Smoothing

One of the few tunable model parameters for the Naive Bayes algorithm is the amount of Laplace smoothing.  The H2O Naive Bayes model will not use any Laplace smoothing by default.

In [45]:
nb_fit2 = H2ONaiveBayesEstimator(model_id='nb_fit2', laplace=6)
nb_fit2.train(x=x, y=y, training_frame=train)


naivebayes Model Build Progress: [##################################################] 100%
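
To make the effect of the `laplace` parameter concrete, the sketch below applies the textbook form of Laplace (additive) smoothing to a conditional probability estimate for one categorical predictor. The function name and the counts are hypothetical, and the exact formula H2O uses internally is an assumption here.

```python
def smoothed_probability(level_count, class_count, n_levels, laplace):
    """Estimate P(level | class) with additive smoothing (textbook form).

    level_count: rows of this class that have the given predictor level
    class_count: total rows of this class
    n_levels:    number of distinct levels of the predictor
    laplace:     smoothing amount; 0 means no smoothing
    """
    return (level_count + laplace) / float(class_count + laplace * n_levels)

# Hypothetical case: a level never observed among 10 rows of a class,
# for a predictor with 3 levels
unsmoothed = smoothed_probability(0, 10, 3, laplace=0)
smoothed = smoothed_probability(0, 10, 3, laplace=6)

print(unsmoothed)
print(smoothed)
```

Without smoothing, an unseen level gets probability zero, which zeroes out the whole product of conditional probabilities for that class; smoothing keeps the estimate small but non-zero.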


### Compare model performance

We will compare the model performance of the two NB models using test set AUC.

In [46]:
nb_perf1 = nb_fit1.model_performance(test)
nb_perf2 = nb_fit2.model_performance(test)

In [47]:
# Retrieve test set AUC
print(nb_perf1.auc())
print(nb_perf2.auc())

0.648801389108
0.649067814706
