# Introduction
This tutorial shows how an H2O [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model) (GLM) can be used for supervised classification. It covers usage of H2O from Python; an R version of this tutorial will be available in a separate document. This file is available in plain R, plain Python and iPython Notebook formats. More examples and explanations can be found in our [H2O Generalized Linear Modeling booklet](http://h2o.ai/resources/) and in our [H2O GitHub repository](http://github.com/h2oai/h2o-3/).


### H2O Python Module

Load the H2O Python module.

In [1]:
import h2o

### Start H2O
Start up a 1-node H2O cloud on your local machine, and allow it to use all CPU cores and up to 2GB of memory:

In [4]:
h2o.init(max_mem_size_GB = 2)            #uses all cores by default
h2o.remove_all()                          #clean slate, in case cluster was already running



No instance found at ip and port: localhost:54321. Trying to start local jar...


JVM stdout: c:\users\kevin\appdata\local\temp\tmp1d3z8d\h2o_Kevin_started_from_python.out
JVM stderr: c:\users\kevin\appdata\local\temp\tmplo27xm\h2o_Kevin_started_from_python.err
Using ice_root: c:\users\kevin\appdata\local\temp\tmpxkbqio


Java Version: java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)


Starting H2O JVM and connecting: . Connection successful!


H2O cluster uptime: 1 seconds 716 milliseconds
H2O cluster version: 3.7.0.3248
H2O cluster name: H2O_started_from_python
H2O cluster total nodes: 1
H2O cluster total memory: 1.78 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: True
H2O Connection ip: 127.0.0.1
H2O Connection port: 54321


To learn more about the h2o package itself, we can use Python's built-in help() function.

In [None]:
help(h2o)

help() can also be used on H2O functions and models. Jupyter's built-in shift-tab functionality works as well.

In [6]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
help(H2OGeneralizedLinearEstimator)
help(h2o.import_file)

Help on class H2OGeneralizedLinearEstimator in module h2o.estimators.glm:

class H2OGeneralizedLinearEstimator(h2o.estimators.estimator_base.H2OEstimator)
 |  Method resolution order:
 |      H2OGeneralizedLinearEstimator
 |      h2o.estimators.estimator_base.H2OEstimator
 |      h2o.model.model_base.ModelBase
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, model_id=None, max_iterations=None, beta_epsilon=None, solver=None, standardize=None, family=None, link=None, tweedie_variance_power=None, tweedie_link_power=None, alpha=None, prior=None, lambda_search=None, nlambdas=None, lambda_min_ratio=None, beta_constraints=None, nfolds=None, fold_assignment=None, keep_cross_validation_predictions=None, intercept=None, Lambda=None, max_active_predictors=None, checkpoint=None)
 |      Build a Generalized Linear Model
 |      Fit a generalized linear model, specified by a response variable, a set of predictors,
 |      and a description of the error distribution.

Since we use pandas DataFrames to simplify some processes later in this demo, let's import both pandas and numpy.

In [7]:
import pandas as pd
import numpy as np

## H2O GLM

Generalized linear models (GLMs) are an extension of traditional linear models. They have gained popularity in statistical data analysis due to:  

1. the flexibility of the model structure unifying the typical regression methods (such as linear regression and logistic regression for binary classification)  
2. the recent availability of model-fitting software  
3. the ability to scale well with large datasets  

H2O's GLM algorithm fits generalized linear models to the data by maximizing the log-likelihood. The elastic net penalty can be used for parameter regularization. The model-fitting computation is distributed and fast, and it scales well for models with a limited number of predictors with non-zero coefficients (on the order of low thousands).  
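
To make the idea of a penalized likelihood concrete, here is a minimal pure-Python sketch of the quantity being minimized for a binomial (logistic) model with an elastic net penalty. It only illustrates the objective and is not H2O's solver. The function names are hypothetical, and the averaging over rows and the exact scaling of the L1 and L2 terms are assumptions of this sketch; see the GLM booklet for the definition H2O uses.

```python
import math

def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) / 2 * L2^2); alpha=1 is lasso, alpha=0 is ridge."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)

def logistic_neg_log_likelihood(X, y, beta, intercept):
    """Average negative log-likelihood of 0/1 labels under a logistic model."""
    total = 0.0
    for row, label in zip(X, y):
        eta = intercept + sum(b * x for b, x in zip(beta, row))
        # Numerically stable form of log(1 + exp(eta)) - label * eta
        total += max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - label * eta
    return total / len(X)

def penalized_objective(X, y, beta, intercept, lam, alpha):
    """Quantity to minimize; the intercept is left unpenalized in this sketch."""
    return (logistic_neg_log_likelihood(X, y, beta, intercept)
            + elastic_net_penalty(beta, lam, alpha))
```

Maximizing the log-likelihood is the same as minimizing its negative, so a larger `lam` trades goodness of fit for smaller coefficients, and `alpha` moves the penalty between ridge-like and lasso-like behavior.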

### Getting started

We begin by importing our data into H2OFrames, which behave much like pandas DataFrames but live on the H2O cluster itself.  

In this case, the H2O cluster is running on a laptop. Data files are imported by their paths relative to this notebook.

In [None]:
covtype_df = h2o.import_file("../data/covtype.full.csv")

We import the full covertype dataset (581k rows, 13 columns, 10 numerical, 3 categorical) and then split the data 3 ways:  
  
70% for training  
15% for validation (hyperparameter tuning)  
15% for final testing  

 We will train a model on the first set and use the others to check its validity by ensuring that it predicts accurately on data the model has not been shown.  
 
 The second set will be used for validation most of the time.  
 
 The third set will be withheld until the end, to confirm that our validation accuracy holds on data we have never seen during the iterative process. 
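
As a rough illustration of a three-way random split, the sketch below assigns each row index to a split by drawing a uniform random number and comparing it with cumulative ratio thresholds. The helper `split_indices` is hypothetical and is not H2O's implementation. It only shows why a ratio-based random split yields approximately, not exactly, the requested proportions.

```python
import random

def split_indices(n_rows, ratios, seed=None):
    """Randomly assign row indices to len(ratios) + 1 splits.

    ratios lists the fractions for all splits except the last,
    which receives whatever probability mass remains.
    """
    thresholds = []
    running = 0.0
    for r in ratios:
        running += r
        thresholds.append(running)   # e.g. [0.7, 0.85] for ratios [0.7, 0.15]
    if running >= 1.0:
        raise ValueError("ratios must sum to less than 1")

    rng = random.Random(seed)        # a fixed seed makes the split reproducible
    splits = [[] for _ in range(len(ratios) + 1)]
    for i in range(n_rows):
        u = rng.random()
        k = sum(u >= t for t in thresholds)   # index of the split this row falls into
        splits[k].append(i)
    return splits

train_idx, valid_idx, test_idx = split_indices(10000, [0.7, 0.15], seed=1234)
```

Every row lands in exactly one split, so the three sets never overlap, and fixing the seed gives the same partition on every run.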

In [8]:
#split the data as described above
train, valid, test = covtype_df.split_frame([0.7, 0.15], seed=1234)

#Prepare predictors and response columns
covtype_X = covtype_df.col_names[:-1]     #last column is Cover_Type, our desired response variable 
covtype_y = covtype_df.col_names[-1]    


Parse Progress: [##################################################] 100%


In [9]:
def cut_column(train_df, train, valid, test, col):
    '''
    Convenience function to change a column from numerical to categorical
    We use train_df only for bucketing with histograms.
    Uses np.histogram to generate a histogram, with the buckets forming the categories of our new categorical.
    Picks buckets based on training data, then applies the same classification to the test and validation sets
    
    Assumes that train, valid, test will have the same histogram behavior.
    '''
    only_col= train_df[col]                            #Isolate the column in question from the training frame
    counts, breaks = np.histogram(only_col, bins=20)   #Generate counts and breaks for our histogram
    min_val = min(only_col)-1                          #Establish min and max values
    max_val = max(only_col)+1
    
    new_b = [min_val]                                  #Redefine breaks such that each bucket has enough support
    for i in range(len(counts) - 1):                   #Compare each bucket with its right-hand neighbor
        if counts[i] > 1000 and counts[i+1] > 1000:
            new_b.append(breaks[i+1])
    new_b.append(max_val)
    
    names = [col + '_' + str(x) for x in range(len(new_b)-1)]  #Generate names for buckets, these will be categorical names

    train[col+"_cut"] = train[col].cut(breaks=new_b, labels=names)
    valid[col+"_cut"] = valid[col].cut(breaks=new_b, labels=names)
    test[col+"_cut"] = test[col].cut(breaks=new_b, labels=names)
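
To see what the break-merging step in cut_column does, here is a small pure-Python sketch of the same rule using a hand-rolled equal-width histogram instead of np.histogram. The helper name `merge_sparse_breaks` and its `min_support` parameter are hypothetical, and bin-edge handling may differ slightly from NumPy's at floating-point boundaries.

```python
def merge_sparse_breaks(values, bins=20, min_support=1000):
    """Keep an interior break only if the buckets on both sides are well populated."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: a single bucket
        return [lo - 1, hi + 1]

    # Equal-width histogram: bins + 1 edges, bins counts
    width = (hi - lo) / float(bins)
    breaks = [lo + k * width for k in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / width), bins - 1)   # the maximum goes in the last bin
        counts[k] += 1

    new_b = [lo - 1]                   # pad the outer edges so every value is covered
    for i in range(bins - 1):
        if counts[i] > min_support and counts[i + 1] > min_support:
            new_b.append(breaks[i + 1])
    new_b.append(hi + 1)
    return new_b

# One far outlier leaves three of the four bins almost empty,
# so every interior break is dropped and a single bucket remains.
print(merge_sparse_breaks(list(range(100)) + [1000], bins=4, min_support=10))
```

Dropping a break merges the two neighboring buckets, so sparsely populated regions collapse into wider categories.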


In [None]:
def add_features(train, valid, test):
    '''
    Helper function to add a specific set of features to our covertype dataset
    '''
    #pull train dataset into Python
    train_df = train.as_data_frame(True)
    
    #Make categoricals for several columns
    cut_column(train_df, train, valid, test, "Elevation")
    cut_column(train_df, train, valid, test, "Hillshade_Noon")
    cut_column(train_df, train, valid, test, "Hillshade_9am")
    cut_column(train_df, train, valid, test, "Hillshade_3pm")
    cut_column(train_df, train, valid, test, "Horizontal_Distance_To_Hydrology")
    cut_column(train_df, train, valid, test, "Slope")
    cut_column(train_df, train, valid, test, "Horizontal_Distance_To_Roadways")
    cut_column(train_df, train, valid, test, "Aspect")
    
    
    #Add interaction columns for a subset of columns
    interaction_cols1 = ["Elevation_cut",
                         "Wilderness_Area",
                         "Soil_Type",
                         "Hillshade_Noon_cut",
                         "Hillshade_9am_cut",
                         "Hillshade_3pm_cut",
                         "Horizontal_Distance_To_Hydrology_cut",
                         "Slope_cut",
                         "Horizontal_Distance_To_Roadways_cut",
                         "Aspect_cut"]

    train_cols = train.interaction(factors=interaction_cols1,    #Generate pairwise columns
                                   pairwise=True,
                                   max_factors=1000,
                                   min_occurrence=100,
                                   destination_frame="itrain")
    valid_cols = valid.interaction(factors=interaction_cols1,
                                   pairwise=True,
                                   max_factors=1000,
                                   min_occurrence=100,
                                   destination_frame="ivalid")
    test_cols = test.interaction(factors=interaction_cols1,
                                   pairwise=True,
                                   max_factors=1000,
                                   min_occurrence=100,
                                   destination_frame="itest")
    
    train = train.cbind(train_cols)                              #Append pairwise columns to H2OFrames
    valid = valid.cbind(valid_cols)
    test = test.cbind(test_cols)
    
    
    #Add a three-way interaction for Hillshade
    interaction_cols2 = ["Hillshade_Noon_cut","Hillshade_9am_cut","Hillshade_3pm_cut"]
    
    train_cols = train.interaction(factors=interaction_cols2,    #Generate the three-way interaction column
                                   pairwise=False,
                                   max_factors=1000,
                                   min_occurrence=100,
                                   destination_frame="itrain")
    valid_cols = valid.interaction(factors=interaction_cols2,
                                   pairwise=False,
                                   max_factors=1000,
                                   min_occurrence=100,
                                   destination_frame="ivalid")
    test_cols = test.interaction(factors=interaction_cols2,
                                   pairwise=False,
                                   max_factors=1000,
                                   min_occurrence=100,
                                   destination_frame="itest")
    
    train = train.cbind(train_cols)                              #Append three-way interaction column to H2OFrames
    valid = valid.cbind(valid_cols)
    test = test.cbind(test_cols)
    
    return train, valid, test
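
For intuition about what a pairwise interaction of categorical columns produces, the following pure-Python sketch builds one combined column per pair of factors, keeps only combinations that occur often enough, and caps the number of kept levels. It is a conceptual illustration and not H2O's `interaction` implementation. The helper name, the `"other"` catch-all level, and the exact treatment of `min_occurrence` and `max_factors` are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations

def pairwise_interactions(rows, factors, min_occurrence=1, max_factors=1000, other="other"):
    """rows: list of dicts mapping column name -> categorical level.

    Returns one dict per row holding the new interaction columns.
    """
    out = [dict() for _ in rows]
    for a, b in combinations(factors, 2):          # every unordered pair of factors
        name = a + "_" + b
        combined = [str(r[a]) + "_" + str(r[b]) for r in rows]
        counts = Counter(combined)
        # Most frequent combinations first; drop those that are too rare
        frequent = [lvl for lvl, n in counts.most_common() if n >= min_occurrence]
        keep = set(frequent[:max_factors])         # cap the number of levels
        for new_cols, lvl in zip(out, combined):
            new_cols[name] = lvl if lvl in keep else other
    return out

rows = [{"A": "x", "B": "p"}, {"A": "x", "B": "p"}, {"A": "y", "B": "q"}]
print(pairwise_interactions(rows, ["A", "B"], min_occurrence=2))
```

With ten factors, as in interaction_cols1, this pairing yields 45 interaction columns, which is why thresholds on level frequency and level count matter for keeping the model size manageable.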

In [11]:
%%time
train_v, valid_v, test_v = add_features(train, valid, test)


Interactions Progress: [##################################################] 100%

Interactions Progress: [##################################################] 100%

Interactions Progress: [##################################################] 100%

Interactions Progress: [##################################################] 100%

Interactions Progress: [##################################################] 100%

Interactions Progress: [##################################################] 100%
Wall time: 1min 3s


In [None]:
covtype_df.interaction