# Porto Seguro’s Safe Driver Prediction: H2O.ai


In this competition, you will predict the probability that an auto insurance policy holder files a claim.
In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc). In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
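
The naming convention described above can be turned into a small helper that labels each column by group and type. This is only a sketch: it assumes column names of the form ps_<group>_<number>[_bin|_cat], which the text above does not spell out, and the function name describe_feature and the example column list are illustrative.

```python
def describe_feature(name):
    """Split a column name into (group, kind) using the naming convention."""
    parts = name.split("_")
    # Assumed pattern: ps_<group>_<number>[_bin|_cat]; 'id' and 'target' do not match it.
    if len(parts) < 3 or parts[0] != "ps":
        return (None, "other")
    group = parts[1]      # ind, reg, car or calc
    suffix = parts[-1]    # 'bin', 'cat' or the feature number
    if suffix == "bin":
        kind = "binary"
    elif suffix == "cat":
        kind = "categorical"
    else:
        kind = "continuous_or_ordinal"
    return (group, kind)


# Illustrative column names following the assumed pattern
example_columns = ["id", "target", "ps_ind_01", "ps_ind_06_bin",
                   "ps_car_01_cat", "ps_reg_03", "ps_calc_15_bin"]
summary = {col: describe_feature(col) for col in example_columns}
for col, (group, kind) in summary.items():
    print(col, group, kind)
```

Once the data is loaded, the same helper could be applied to the frame's column list to decide, for example, which columns to convert to factors.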

## Initializing H2O

In [1]:
# Load the H2O library and start up the H2O cluster locally on your machine
import h2o
h2o.init(ip="localhost", port=54323)

### Importing Libs

In [2]:
import pandas as pd
import numpy as np

### Loading datasets
train.csv contains the training data, where each row corresponds to a policy holder, and the target column signifies whether a claim was filed. test.csv contains the test data. sample_submission.csv is a submission file showing the correct format.
"ind" relates to the individual or driver, "reg" to the region, "car" to the car itself, and "calc" marks a calculated feature.

In [3]:
# Setting working directory

path = '../input/'

In [4]:
#load files
train_data = h2o.import_file(path + 'train.csv')
test_data = h2o.import_file(path + 'test.csv')

In [None]:
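# Keep the test ids for the submission file (reuses the frame loaded above)
test_id = test_data['id']

In [None]:
# Convert the response to a factor so H2O treats this as classification
train_data["target"] = train_data["target"].asfactor()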

In [None]:
train, valid, test = train_data.split_frame(ratios=[0.7, 0.15], seed=42)  

In [None]:
y = 'target'
x = list(train_data.columns)

In [None]:
x.remove(y)     # remove the response
x.remove('id')  # remove the row identifier, which is not a predictor

In [None]:
print(x)

## H2O Machine Learning
Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algos:
- Generalized Linear Model (GLM)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)
- Deep Learning (DL)

### Generalized Linear Model
Let's start with a basic binomial Generalized Linear Model (GLM). By default, H2O's GLM uses a regularized, elastic net model.

In [None]:
# Import H2O GLM:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

#### Train a default GLM
We first create an object of the class H2OGeneralizedLinearEstimator.

In [None]:
glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

Now that the glm_fit1 object is initialized, we can train the model:

In [None]:
glm_fit1.train(x=x, y=y, training_frame=train)

#### Train a GLM with lambda search
Next we will do some automatic tuning. Since we are training a GLM with regularization, we should try to find the right amount of regularization (to avoid overfitting). The model parameter lambda controls the amount of regularization in a GLM, and we can find a good value for it automatically by setting lambda_search = True and passing in a validation frame, which is used to evaluate model performance at each candidate value of lambda.

In [None]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial',
                                         model_id='glm_fit2',
                                         lambda_search=True,
                                         balance_classes=True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)
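
To make the role of lambda concrete, here is a minimal sketch of the standard elastic net penalty that lambda scales. The function name elastic_net_penalty, the coefficient values, and the choice alpha=0.5 are illustrative assumptions; this is not H2O's internal code.

```python
def elastic_net_penalty(coefficients, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(b) for b in coefficients)   # lasso part
    l2 = sum(b * b for b in coefficients)    # ridge part (squared norm)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


# Made-up coefficients, purely for illustration
coefs = [0.5, -1.0, 0.0, 2.0]
for lam in (0.0, 0.01, 0.1):
    # A larger lambda penalizes the same coefficients more heavily
    print(lam, elastic_net_penalty(coefs, lam, alpha=0.5))
```

Lambda search fits the model along a sequence of lambda values and keeps the one that scores best on the validation frame, so the penalty is strong enough to curb overfitting but not so strong that it shrinks useful coefficients away.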

#### Evaluate model performance
Let's compare the performance of the two GLMs that were just trained.

In [None]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

In [None]:
# Retrieve the test set Gini coefficient
print(glm_perf1.gini())
print(glm_perf2.gini())

In [None]:
# Compare the test Gini to the training and validation Gini
print(glm_fit2.gini(train=True))
print(glm_fit2.gini(valid=True))
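
The cells above print the Gini coefficient rather than AUC. For a binary classifier the two are conventionally related by Gini = 2 * AUC - 1, so ranking models by one ranks them by the other. The sketch below illustrates that relation in plain Python on made-up labels and scores; the helper names are hypothetical and the code does not call H2O.

```python
def auc_from_scores(labels, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


def gini_from_auc(auc):
    """Conventional relation between the Gini coefficient and AUC."""
    return 2.0 * auc - 1.0


# Made-up example: 3 claims (1) and 3 non-claims (0)
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]
auc = auc_from_scores(labels, scores)
gini = gini_from_auc(auc)
print(auc, gini)
```

A model that ranks at random has an AUC of 0.5 and therefore a Gini of 0, which makes the Gini scale easier to read when improvements are small.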

### Random Forest
H2O's Random Forest (RF) implements a distributed version of the standard Random Forest algorithm and variable importance measures.

In [None]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator

#### Train a default RF
First we will train a basic Random Forest model with default parameters. Random Forest will infer the response distribution from the response encoding. A seed is required for reproducibility.

In [None]:
# Initialize the RF estimator:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1', seed=1)

Now that the rf_fit1 object is initialized, we can train the model:

In [None]:
rf_fit1.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Train an RF with more trees
Next we will increase the number of trees used in the forest by setting ntrees = 100. The default number of trees in an H2O Random Forest is 50, so this RF will be twice as big as the default. Usually, increasing the number of trees in an RF will increase performance as well. Unlike Gradient Boosting Machines (GBMs), Random Forests are fairly resistant to (although not free from) overfitting as the number of trees grows. See the GBM example below for additional guidance on preventing overfitting using H2O's early stopping functionality.

In [None]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100, seed=1)
rf_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Compare model performance
Let's compare the performance of the two RFs that were just trained.

In [None]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

In [None]:
# Retrieve the test set Gini coefficient
print(rf_perf1.gini())
print(rf_perf2.gini())

#### Cross-validate performance
Rather than using a held-out test set to evaluate model performance, a user may wish to estimate model performance using cross-validation. Using the RF algorithm (with default model parameters) as an example, we demonstrate how to perform k-fold cross-validation using H2O. No custom code or loops are required; you simply specify the number of desired folds in the nfolds argument.
Since we are not going to use a test set here, we can use the original (full) dataset, train_data, rather than the subsampled train split. Note that this will take approximately k (nfolds) times longer than training a single RF model, since it trains k models in the cross-validation process (each on n(k-1)/k rows), in addition to the final model trained on the full training_frame with n rows.

In [None]:
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=train_data)

To evaluate the cross-validated Gini coefficient, do the following:

In [None]:
print( rf_fit3.gini(xval=True))
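
As a rough illustration of what nfolds does, the sketch below partitions row indices into k folds and shows that each fold model trains on n(k-1)/k rows and is scored on the remaining n/k. The helper kfold_indices is hypothetical, and the modulo assignment is chosen only for readability; it is not a claim about how H2O assigns folds unless you request that scheme explicitly.

```python
def kfold_indices(n_rows, n_folds):
    """Return one (train_rows, holdout_rows) pair per fold, assigning row i to fold i % n_folds."""
    folds = []
    for k in range(n_folds):
        holdout_rows = [i for i in range(n_rows) if i % n_folds == k]
        train_rows = [i for i in range(n_rows) if i % n_folds != k]
        folds.append((train_rows, holdout_rows))
    return folds


# Toy sizes: n = 20 rows, k = 5 folds
folds = kfold_indices(n_rows=20, n_folds=5)
for k, (train_rows, holdout_rows) in enumerate(folds):
    # Each fold model sees n(k-1)/k rows and is scored on the other n/k
    print(k, len(train_rows), len(holdout_rows))
```

Every row lands in exactly one holdout set, so the cross-validated metric is computed from predictions on data the corresponding fold model never saw.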

If the cross-validated Gini comes out slightly higher than the test set performance we estimated for rf_fit1, this is likely because we trained on more data (n rows) than we did while using train as the training set (about 0.7*n rows, given the split ratio used earlier) in rf_fit1.

### Gradient Boosting Machine
H2O's Gradient Boosting Machine (GBM) offers a Stochastic GBM, which can increase performance quite a bit compared to the original GBM implementation.

In [None]:
# Import H2O GBM:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

#### Train a default GBM
First we will train a basic GBM model with default parameters. GBM will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. A seed is required for reproducibility.

In [None]:
# Initialize and train the GBM estimator:

gbm_fit1 = H2OGradientBoostingEstimator(model_id='gbm_fit1', seed=1)
gbm_fit1.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Train a GBM with more trees
Next we will increase the number of trees used in the GBM by setting ntrees=500. The default number of trees in an H2O GBM is 50, so this GBM will be trained using ten times the default. Increasing the number of trees in a GBM is one way to increase performance of the model; however, you have to be careful not to overfit your model to the training data by using too many trees. To automatically find the optimal number of trees, you must use H2O's early stopping functionality. This example does not do that, but the following one does.

In [None]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500, seed=1)
gbm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Train a GBM with early stopping
This time we set ntrees = 1000 as a generous upper bound and use early stopping to prevent overfitting (from too many trees). All of H2O's algorithms have early stopping available; however, with the exception of Deep Learning, it is not enabled by default.
There are several parameters that should be used to control early stopping. The three that are generic to all the algorithms are stopping_rounds, stopping_metric and stopping_tolerance. The stopping metric is the metric by which you'd like to measure performance, and so we will choose AUC here. The score_tree_interval is a parameter specific to Random Forest and GBM. Setting score_tree_interval=5 will score the model after every five trees. The parameters we have set below specify that the model will stop training after there have been three scoring intervals where the AUC has not increased by more than 0.0005. Since we have specified a validation frame, the stopping tolerance will be computed on validation AUC rather than training AUC.

In [None]:
# Now let's use early stopping to find optimal ntrees

gbm_fit3 = H2OGradientBoostingEstimator(model_id='gbm_fit3', 
                                        ntrees=1000, 
                                        score_tree_interval=5,     #used for early stopping
                                        stopping_rounds=3,         #used for early stopping
                                        stopping_metric='AUC',     #used for early stopping
                                        stopping_tolerance=0.0005, #used for early stopping
                                        seed=1)
# The use of a validation_frame is recommended when using early stopping
gbm_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)
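
The stopping parameters above can be pictured with a simplified rule: stop once the best score among the last stopping_rounds scoring events fails to beat the best earlier score by more than the tolerance. The sketch below is an illustration only. The function should_stop, the relative-tolerance interpretation, and the AUC history are all assumptions, and H2O's own criterion is more involved than this.

```python
def should_stop(auc_history, stopping_rounds, stopping_tolerance):
    """Simplified early-stopping check on a list of validation AUC scores."""
    if len(auc_history) <= stopping_rounds:
        return False  # not enough scoring events yet
    best_before = max(auc_history[:-stopping_rounds])
    best_recent = max(auc_history[-stopping_rounds:])
    # Assumed rule: stop if the recent best is not a relative improvement beyond the tolerance
    return best_recent <= best_before * (1.0 + stopping_tolerance)


# Made-up validation AUC values, one per scoring event (every 5 trees)
history = [0.60, 0.62, 0.63, 0.6301, 0.6302, 0.6300]
stop_at = None
for n_scored in range(1, len(history) + 1):
    if should_stop(history[:n_scored], stopping_rounds=3, stopping_tolerance=0.0005):
        stop_at = n_scored
        break
print(stop_at)  # scoring event at which training would halt under this rule
```

With score_tree_interval=5, stopping at the sixth scoring event would correspond to roughly 30 trees, far fewer than the ntrees upper bound.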

In [None]:
# Let's also try H2O's XGBoost estimator
from h2o.estimators import H2OXGBoostEstimator
param = {
      "model_id": 'gbm_fit4'
    , "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}
gbm_fit4 = H2OXGBoostEstimator(**param)
gbm_fit4.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Compare model performance
Let's compare the performance of the three GBMs and the XGBoost model that were just trained.

In [None]:
gbm_perf1 = gbm_fit1.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
gbm_perf3 = gbm_fit3.model_performance(test)
gbm_perf4 = gbm_fit4.model_performance(test)

In [None]:
# Retrieve the test set Gini coefficient
print(gbm_perf1.gini())
print(gbm_perf2.gini())
print(gbm_perf3.gini())
print(gbm_perf4.gini())

### Deep Learning
H2O's Deep Learning algorithm is a multilayer feed-forward artificial neural network. It can also be used to train an autoencoder; in the example below, however, we will train a standard supervised prediction model.

In [None]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

#### Train a default DL
First we will train a basic DL model with default parameters. DL will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. H2O's DL will not be reproducible if run on more than a single core, so in this example, the performance metrics below may vary slightly from what you see on your machine.
In H2O's DL, early stopping is enabled by default, so below it will use the default stopping parameters, evaluated on the validation frame we pass in.

In [None]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1', seed=1, balance_classes=True)
dl_fit1.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Train a DL with new architecture and more epochs
Next we will change the architecture to two hidden layers of 10 units each (hidden=[10,10]) and increase the number of epochs by setting epochs=50 (the default is 10). Increasing the number of epochs in a deep neural net may increase performance of the model; however, you have to be careful not to overfit your model. To automatically find the optimal number of epochs, you must use H2O's early stopping functionality. Unlike the rest of the H2O algorithms, H2O's DL uses early stopping by default, so we will first turn it off in the next example by setting stopping_rounds=0, for comparison.

In [None]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=50, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1,
                                   balance_classes=True)
dl_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Train a DL with early stopping
This example uses the same architecture as dl_fit2 but raises the epoch limit to 500, turns early stopping on, and specifies the stopping criterion. We also pass a validation set, as is recommended for early stopping.

In [None]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=500, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=50,        #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1,
                                   balance_classes=True)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Compare model performance
Again, we will compare the performance of the three models using the test set and the Gini coefficient.

In [None]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

In [None]:
# Retrieve the test set Gini coefficient
print(dl_perf1.gini())
print(dl_perf2.gini())
print(dl_perf3.gini())

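In [None]:
# Score the full competition test set with the XGBoost model
test_pred = gbm_fit4.predict(test_data)

In [None]:
# Build the submission: one row per test id with the predicted claim probability (column p1)
sub = pd.DataFrame()
sub['id'] = test_id.as_data_frame()['id']
sub['target'] = test_pred['p1'].as_data_frame()['p1']
sub.to_csv('xgb_h2o.csv', index=False)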