# H2O Machine Learning Tutorial

## H2O

H2O is fast, scalable, open-source machine learning and deep learning for smarter applications. With H2O, enterprises like PayPal, Nielsen Catalina, Cisco, and others can use all their data without sampling to get accurate predictions faster. 

Advanced algorithms such as deep learning, boosting, and bagging ensembles are built-in to help application designers create smarter applications through elegant APIs. 

Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. 

To make it easier for non-engineers to create complete analytic workflows, H2O’s platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, as well as a built-in web interface. 

H2O includes many common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naive Bayes, principal components analysis, k-means clustering, and others. H2O also implements
best-in-class algorithms at scale, such as distributed random forest, gradient boosting, and deep learning.


In [83]:
## Copied from https://github.com/h2oai/h2o-tutorials/blob/master/h2o-open-tour-2016/chicago/intro-to-h2o.ipynb

Install H2O
------------

To download and install the h2o Python module, use the following link: http://www.h2o.ai/download/h2o/py

### H2O Python Module

Load the H2O Python module

In [1]:
import h2o

#### Start up the H2O Cluster

In [3]:
# Number of threads, nthreads = -1, means use all cores on your machine
# max_mem_size is the maximum memory (in GB) to allocate to H2O
h2o.init(nthreads = -1, max_mem_size = 8)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
  Starting server from C:\Users\DELL-PC\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\DELL-PC\AppData\Local\Temp\tmp4ikubiuq
  JVM stdout: C:\Users\DELL-PC\AppData\Local\Temp\tmp4ikubiuq\h2o_DELL_PC_started_from_python.out
  JVM stderr: C:\Users\DELL-PC\AppData\Local\Temp\tmp4ikubiuq\h2o_DELL_PC_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


H2O cluster uptime: 12 secs
H2O cluster version: 3.16.0.2
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_DELL_PC_vjvdfy
H2O cluster total nodes: 1
H2O cluster free memory: 7.102 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321


In [4]:
h2o.init(max_mem_size = 2)            #uses all cores by default
h2o.remove_all()                          #clean slate, in case cluster was already running

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime: 1 min 20 secs
H2O cluster version: 3.16.0.2
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_DELL_PC_vjvdfy
H2O cluster total nodes: 1
H2O cluster free memory: 7.102 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321


To learn more about the h2o package itself, we can use Python's built-in help() function.

In [5]:
help(h2o)

Help on package h2o:

NAME
    h2o - :mod:`h2o` -- module for using H2O services.

DESCRIPTION
    (please add description).

PACKAGE CONTENTS
    assembly
    astfun
    automl (package)
    backend (package)
    cross_validation
    demos
    display
    estimators (package)
    exceptions
    expr
    expr_optimizer
    frame
    grid (package)
    group_by
    h2o
    job
    model (package)
    schemas (package)
    transforms (package)
    two_dim_table
    utils (package)

SUBMODULES
    __init__

FUNCTIONS
    api(endpoint, data=None, json=None, filename=None, save_to=None)
        Perform a REST API request to a previously connected server.
        
        This function is mostly for internal purposes, but may occasionally be useful for direct access to
        the backend H2O server. It has same parameters as :meth:`H2OConnection.request <h2o.backend.H2OConnection.request>`.
    
    as_list(data, use_pandas=True, header=True)
        Convert an H2O data object into a python




help() can also be used on individual H2O functions and models. Jupyter's built-in shift-tab functionality works as well.

In [7]:
h2o.init(ip="127.0.0.1", port=54321)

Checking whether there is an H2O instance running at http://127.0.0.1:54321. connected.


H2O cluster uptime: 10 mins 02 secs
H2O cluster version: 3.16.0.2
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_DELL_PC_vjvdfy
H2O cluster total nodes: 1
H2O cluster free memory: 7.102 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://127.0.0.1:54321


H2O Deep Learning
--------------------
While H2O Deep Learning has many parameters, it was designed to be just as easy to use as the other supervised training methods in H2O. Early stopping, automatic data standardization, automatic handling of categorical variables and missing values, and adaptive (per-weight) learning rates reduce the number of parameters the user has to specify. Often, it's just the number and sizes of the hidden layers, the number of epochs, the activation function, and maybe some regularization techniques.
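To make "automatic data standardization and handling of categorical variables" concrete, here is a minimal sketch of the two transformations in plain Python. It only illustrates the idea; it is not H2O's internal code, and the helper names `standardize` and `one_hot` as well as the toy values are assumptions made for this example.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a numeric column to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    # A constant column has sigma == 0; map it to zeros instead of dividing by zero.
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Expand a categorical column into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical toy rows, not taken from the loan dataset.
loan_amnt = [5000, 12000, 20000, 8000]
home_ownership = ["RENT", "OWN", "MORTGAGE", "RENT"]

scaled = standardize(loan_amnt)
levels, encoded = one_hot(home_ownership)
print(scaled)
print(levels)
print(encoded)
```

With H2O these steps are applied for you, so the raw frame can be passed to the estimator directly.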

Let's get our imports first

In [10]:
%matplotlib inline                         
#IMPORT ALL THE THINGS

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from h2o.estimators.deeplearning import H2OAutoEncoderEstimator, H2ODeepLearningEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator

## Data prep

### Import data

In [45]:
loan_csv = "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"
data_loan = h2o.import_file(loan_csv)  # 163,987 rows x 15 columns

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [46]:
data_loan.shape

(163987, 15)

#### Encode response variable

Since we want to train a binary classification model, we must ensure that the response is coded as a factor. If the response is 0/1, H2O will assume it's numeric, which means that H2O will train a regression model instead.

In [47]:
data_loan['bad_loan'] = data_loan['bad_loan'].asfactor()  #encode the binary response as a factor
data_loan['bad_loan'].levels()  #optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

#### Partition data

Next, we partition the data into training, validation and test sets.

In [48]:
# Partition data into 70%, 15%, 15% chunks
# Setting a seed will guarantee reproducibility

splits = data_loan.split_frame(ratios=[0.7, 0.15], seed=1)  

train = splits[0]
valid = splits[1]
test = splits[2]

Notice that split_frame() uses approximate splitting rather than exact splitting (for efficiency), so these are not exactly 70%, 15% and 15% of the total rows.

In [49]:
print(train.nrow)
print(valid.nrow)
print(test.nrow)

114908
24498
24581
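Row counts like these are what one would expect if every row were assigned to a split independently at random. The sketch below simulates that idea in plain Python. It is an assumption about the general mechanism, not H2O's actual implementation, and `approximate_split` is a hypothetical helper, so its counts will not match the ones printed by H2O.

```python
import random

def approximate_split(n_rows, ratios, seed):
    """Assign each row to a split by comparing one uniform draw to cumulative ratios."""
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. [0.7, 0.85] for ratios [0.7, 0.15];
    # the remaining probability mass goes to the last split.
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    counts = [0] * (len(ratios) + 1)
    for _ in range(n_rows):
        u = rng.random()
        # The split index is the number of thresholds that u has passed.
        split = sum(u >= t for t in thresholds)
        counts[split] += 1
    return counts

counts = approximate_split(163987, [0.7, 0.15], seed=1)
print(counts)
```

Because each row is decided by its own random draw, no global shuffle or exact count is needed, which is why the split sizes are close to, but not exactly, the requested ratios. Fixing the seed makes the assignment repeatable.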


#### Identify response and predictor variables

In H2O, we use y to designate the response variable and x to designate the list of predictor columns.

In [54]:
y = 'bad_loan'
x = list(data_loan.columns)  # data_loan is the frame imported above

In [55]:
x.remove(y)  #remove the response
x.remove('int_rate')  #remove the interest rate column because it's correlated with the outcome
# List of predictor columns
x

['loan_amnt',
 'term',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'purpose',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'revol_util',
 'total_acc',
 'longest_credit_length',
 'verification_status']

H2O Machine Learning
-----------------------

Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algos:

   * Random Forest (RF)
   * Deep Learning (DL)
   * Naive Bayes (NB)

### Random Forest (RF)

H2O's Random Forest (RF) implements a distributed version of the standard Random Forest algorithm and variable importance measures.


In [56]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

#### Train a default RF
First we will train a basic Random Forest model with default parameters. Random Forest will infer the response distribution from the response encoding. A seed is required for reproducibility.

In [57]:
# Initialize the RF estimator:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1)

In [58]:
rf_fit1.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%


#### Train an RF with more trees

Next we will increase the number of trees used in the forest by setting ntrees=100. The default number of trees in an H2O Random Forest is 50, so this RF will be twice as big as the default. Usually, increasing the number of trees in an RF will increase performance as well. Unlike Gradient Boosting Machines (GBMs), Random Forests are fairly resistant to (although not free from) overfitting as the number of trees grows. See the Deep Learning examples below for guidance on preventing overfitting using H2O's early stopping functionality.

In [59]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1)
rf_fit2.train(x=x, y=y, training_frame=train)

drf Model Build progress: |███████████████████████████████████████████████| 100%
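One way to build intuition for why more trees tend to help is a simplified voting model: if each tree were an independent classifier that is right with probability p, the chance that the majority is right grows with the number of voters. The sketch below computes that probability exactly from the binomial distribution. Independence is an assumption made only for illustration (real trees in a forest are correlated, so the actual gain is smaller), and `majority_vote_accuracy` is a hypothetical helper. Odd ensemble sizes are used to avoid ties.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are correct."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Assumed per-tree accuracy of 0.6, chosen only for illustration.
for n in (1, 51, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```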


#### Compare model performance

Let's compare the performance of the two RFs that were just trained.

In [60]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

In [61]:
print(rf_perf1.auc())
print(rf_perf2.auc())

0.6634669533752062
0.6692885327809279
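The AUC values above can be read as the probability that a randomly chosen bad loan receives a higher predicted score than a randomly chosen good loan. The following sketch computes AUC directly from that definition on a toy example. It is a plain-Python illustration, not how H2O computes the metric internally, and the function name `auc` and the toy labels and scores are assumptions.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the difference between the two forests here is small.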


#### Cross-validate performance

Rather than using a held-out test set to evaluate model performance, a user may wish to estimate model performance using cross-validation. Using the RF algorithm (with default model parameters) as an example, we demonstrate how to perform k-fold cross-validation using H2O. No custom code or loops are required; you simply specify the number of desired folds in the nfolds argument.

Since we are not going to use a test set here, we can use the original (full) dataset, which we called data_loan, rather than the subsampled train dataset. Note that this will take approximately k (nfolds) times longer than training a single RF model, since it will train k models in the cross-validation process (each trained on n(k-1)/k rows), in addition to the final model trained on the full training_frame dataset with n rows.

In [82]:
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=data_loan)  # full dataset, no held-out test set

drf Model Build progress: |███████████████████████████████████████████████| 100%
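The cost estimate for cross-validation can be checked with a little arithmetic. The sketch below works out, for n = 163,987 rows and nfolds = 5, how many rows each cross-validation model is trained on and how many models are built in total. It assumes evenly sized folds, which is an idealization (H2O's actual fold assignment may produce only approximately equal folds), and `fold_sizes` is a hypothetical helper.

```python
def fold_sizes(n_rows, nfolds):
    """Split n_rows into nfolds holdout folds whose sizes differ by at most one."""
    base, extra = divmod(n_rows, nfolds)
    return [base + 1 if i < extra else base for i in range(nfolds)]

n_rows, nfolds = 163987, 5
holdout = fold_sizes(n_rows, nfolds)
# Each cross-validation model trains on every row outside its own holdout fold,
# i.e. roughly n * (k - 1) / k rows.
train_rows = [n_rows - h for h in holdout]
# k cross-validation models plus the final model on all n rows.
models_trained = nfolds + 1

print(holdout)
print(train_rows)
print(models_trained)
```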


### Deep Learning

H2O's Deep Learning algorithm is a multilayer feed-forward artificial neural network. It can also be used to train an autoencoder; in the example below, however, we will train a standard supervised prediction model.

In [67]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

#### Train a default DL

First we will train a basic DL model with default parameters. DL will infer the response distribution from the response encoding if it is not specified explicitly through the distribution argument. H2O's DL will not be reproducible if run on more than a single core, so in this example, the performance metrics below may vary slightly from what you see on your machine.

Early stopping is enabled by default in H2O's DL. Because no validation frame is passed below, it will use the training set and the default stopping parameters to perform early stopping.

In [68]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1)
dl_fit1.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%


#### Train a DL with new architecture and more epochs

Next we will increase the number of epochs used to train the DL model by setting epochs=20 (the default is 10) and change the architecture to two hidden layers of 10 units each (hidden=[10,10]). Increasing the number of epochs in a deep neural net may increase performance of the model; however, you have to be careful not to overfit your model. To automatically find the optimal number of epochs, you must use H2O's early stopping functionality. Unlike the rest of the H2O algorithms, H2O's DL uses early stopping by default, so we will first turn it off in the next example by setting stopping_rounds=0, for comparison.

In [69]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=20, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1)
dl_fit2.train(x=x, y=y, training_frame=train)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
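To illustrate what an architecture such as hidden=[10,10] means, the sketch below runs one forward pass through a small feed-forward network with two hidden layers of 10 units each in plain Python. It is a conceptual illustration only: the weights are random rather than trained, the input size of 4 is a hypothetical stand-in for the encoded predictors, and all helper names are assumptions rather than H2O functions.

```python
import math
import random

def rectifier(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw output scores into class probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer: weights[j] holds the input weights of unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def random_layer(rng, n_in, n_out):
    """Untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(inputs, layers):
    """Hidden layers use the rectifier; the output layer uses softmax."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = rectifier(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    return softmax(dense(activations, out_weights, out_biases))

rng = random.Random(1)
sizes = [4, 10, 10, 2]  # hypothetical: 4 inputs, hidden=[10,10], 2 classes
layers = [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]
probs = forward([0.2, -1.3, 0.7, 0.0], layers)
print(probs)
```

Training adjusts the weights so that these output probabilities match the observed labels; each epoch is one pass over the training rows.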


#### Train a DL with early stopping

This example will use the same model parameters as dl_fit2; however, we will turn on early stopping and specify the stopping criterion. We will also pass a validation set, as is recommended for early stopping.

In [70]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=20, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=3,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

deeplearning Model Build progress: |██████████████████████████████████████| 100%
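The interaction between stopping_rounds, stopping_metric and stopping_tolerance can be sketched as a moving-average rule: training stops once the average of the metric over the most recent scoring events no longer improves on an earlier average by more than the relative tolerance. The code below is a simplified illustration of that idea under stated assumptions, not H2O's exact implementation; `should_stop` and the metric histories are hypothetical.

```python
def should_stop(history, stopping_rounds, stopping_tolerance, maximize=True):
    """Return True when the moving average of the metric has stopped improving.

    history: metric values from successive scoring events, oldest first
             (assumed positive, such as AUC or logloss).
    """
    k = stopping_rounds
    # stopping_rounds=0 disables early stopping; 2k events are needed to compare.
    if k == 0 or len(history) < 2 * k:
        return False
    recent = history[-2 * k:]
    # k+1 moving averages of window k over the last 2k scoring events.
    averages = [sum(recent[i:i + k]) / k for i in range(k + 1)]
    reference, later = averages[0], averages[1:]
    if maximize:
        # For metrics like AUC, require a relative gain above the tolerance.
        return max(later) <= reference * (1 + stopping_tolerance)
    # For metrics like logloss, require a relative drop above the tolerance.
    return min(later) >= reference * (1 - stopping_tolerance)

# Hypothetical validation AUC histories.
improving = [0.60, 0.62, 0.64, 0.66, 0.68, 0.70]
plateau = [0.60, 0.65, 0.677, 0.677, 0.677, 0.677, 0.677, 0.677]

print(should_stop(improving, stopping_rounds=3, stopping_tolerance=0.0005))
print(should_stop(plateau, stopping_rounds=3, stopping_tolerance=0.0005))
```

Averaging over several scoring events makes the rule less sensitive to noise in a single score, which is why a validation frame and a small tolerance are typically used together.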


#### Compare model performance

Again, we will compare the model performance of the three models using a test set and AUC.

In [71]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [73]:
# Retrieve test set AUC
print(dl_perf1.auc())
print(dl_perf2.auc())
print(dl_perf3.auc())

0.679221401264556
0.6812647935579725
0.6794045174358962


In [74]:
dl_fit3.scoring_history()

| | timestamp | duration | training_speed | epochs | iterations | samples | training_rmse | training_logloss | training_auc | training_lift | training_classification_error | validation_rmse | validation_logloss | validation_auc | validation_lift | validation_classification_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-12-19 23:35:54 | 0.000 sec | | 0.0 | 0 | 0.0 | | | | | | | | | | |
| 1 | 2017-12-19 23:35:55 | 1.733 sec | 83274 obs/sec | 0.795001 | 1 | 91352.0 | 0.378359 | 0.454813 | 0.663944 | 3.265791 | 0.438714 | 0.377578 | 0.453046 | 0.670317 | 2.465673 | 0.374439 |
| 2 | 2017-12-19 23:36:00 | 5.985 sec | 176282 obs/sec | 7.923704 | 10 | 910497.0 | 0.380267 | 0.459688 | 0.677006 | 2.449344 | 0.331715 | 0.380163 | 0.45977 | 0.675773 | 2.465673 | 0.328068 |
| 3 | 2017-12-19 23:36:02 | 7.933 sec | 222328 obs/sec | 13.468392 | 17 | 1547626.0 | 0.380605 | 0.464772 | 0.677574 | 2.503773 | 0.372573 | 0.380311 | 0.464613 | 0.676244 | 2.861942 | 0.364928 |
| 4 | 2017-12-19 23:36:03 | 9.592 sec | 247502 obs/sec | 18.220011 | 23 | 2093625.0 | 0.379961 | 0.460291 | 0.677489 | 2.558203 | 0.324333 | 0.379758 | 0.460005 | 0.677526 | 2.685822 | 0.354437 |
| 5 | 2017-12-19 23:36:04 | 10.480 sec | 257203 obs/sec | 20.597217 | 26 | 2366785.0 | 0.379487 | 0.458102 | 0.678384 | 2.340484 | 0.347087 | 0.379137 | 0.457884 | 0.678752 | 2.641792 | 0.360193 |


### References

   * https://github.com/h2oai/h2o-tutorials/blob/master/h2o-open-tour-2016/chicago/intro-to-h2o.ipynb
    
   * https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/deeplearning/deeplearning.ipynb
    
   * http://h2o2016.wpengine.com/wp-content/themes/h2o2016/images/resources/PythonBooklet.pdf
    
   * http://h2o2016.wpengine.com/wp-content/themes/h2o2016/images/resources/DeepLearningBooklet.pdf
    
   * https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/deeplearning/deeplearning.py