# AutoML: Automatic Machine Learning


H2O’s AutoML automates the machine learning workflow, including the training and tuning of many models within a user-specified time limit. Stacked Ensembles are automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, are the top-performing models on the AutoML Leaderboard.

## Tutorials

* Intro to AutoML + Hands-on Lab - Erin LeDell, Machine Learning Scientist... [https://youtu.be/42Oo8TOl85I](https://youtu.be/42Oo8TOl85I)  
* Scalable Automatic Machine Learning in H2O [https://youtu.be/j6rqrEYQNdo](https://youtu.be/j6rqrEYQNdo)      


![Scalable Automatic Machine Learning in H2O](http://nikbearbrown.com/YouTube/MachineLearning/IMG/Scalable_Automatic_Machine_Learning_in_H2O.png)




## Installing H2O and h2o python

See [http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html)  

Click the Download H2O button on the [http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html](http://h2o-release.s3.amazonaws.com/h2o/latest_stable.html) page. This downloads a zip file that contains everything you need to get started.


```bash
cd ~/Downloads
unzip h2o-3.20.0.1.zip
cd h2o-3.20.0.1
java -jar h2o.jar
```

Point your browser to http://localhost:54321.

**Install in Python**  

Install dependencies (prepending with sudo if needed):

```bash
pip install requests
pip install tabulate
pip install scikit-learn
pip install colorama
pip install future
```

Remove any existing H2O module for Python.

```bash
pip uninstall h2o
```

Use pip to install this version of the H2O Python module.  

```bash
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
```

Note: When installing H2O from pip in OS X El Capitan, users must include the --user flag. For example:


```bash
pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --user
```

Initialize H2O in Python and run a demo to see H2O at work.

```python
# Run the following inside a Python session (start one with the `python` command)
import h2o
h2o.init()
h2o.demo("glm")
```

## Saving data

An H2O model file is saved in one of two formats.


There are two ways to save the leader model -- binary format and MOJO format.  If you're taking your leader model to production, then we'd suggest the MOJO format since it's optimized for production use.

See [http://docs.h2o.ai/h2o/latest-stable/h2o-docs/save-and-load-model.html](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/save-and-load-model.html)    

```python

# save the model
model_path = h2o.save_model(model=model, path="/tmp/mymodel", force=True)

# or

h2o.save_model(aml.leader, path = "./models")

# or

aml.leader.download_mojo(path = "./models")

# load the model
saved_model = h2o.load_model(model_path)

```

**Saving data from runs**   


Statistics about the models can be saved as text or CSV, or written directly to a database.

Much of the data is gathered by converting H2O objects to pandas data frames, so any format a pandas data frame can be saved in is supported.
[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)    


```python
data_pd = object.as_data_frame(use_pandas=True)
```  

Otherwise data is returned as python dictionaries or lists.  

```python
[('addr_state', 258199.28125, 1.0, 0.19965953057652525), ('int_rate', 203347.0625, 0.7875585924002257, 0.15724357886013807), ('dti', 116477.5703125, 0.45111500600856147, 0.09006941033569575), ('revol_util', 110586.1484375, 0.42829766179877776, 0.08551371010176734), ('annual_inc', 96993.90625, 0.3756552139898724, 0.07500314368384206), ('loan_amnt', 95294.5, 0.36907345186500207, 0.0736890321476241), ('total_acc', 90064.8046875, 0.3488189597255124, 0.06964502975498767), ('longest_credit_length', 84291.921875, 0.3264607146345416, 0.06518099303560954), ('purpose', 77462.203125, 0.30000936776426446, 0.05989972953637317), ('emp_length', 63839.28125, 0.24724809821677224, 0.04936543922589935), ('term', 34895.7265625, 0.1351503629040408, 0.02698405801466782), ('home_ownership', 26499.876953125, 0.10263342649457897, 0.02049174175536795), ('delinq_2yrs', 20556.2578125, 0.0796139234508423, 0.015895678583550586), ('verification_status', 14689.3369140625, 0.056891470971368166, 0.01135892438795138)]
```  
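
A list of tuples like the one above can be written out with the standard library alone. The sketch below assumes each tuple is (variable, relative_importance, scaled_importance, percentage), matching the variable-importance table shown later in this document. The function name `rows_to_csv` and the file name `varimp.csv` are illustrative, not part of H2O.

```python
import csv
import io

# First three rows of the list above.
rows = [
    ("addr_state", 258199.28125, 1.0, 0.19965953057652525),
    ("int_rate", 203347.0625, 0.7875585924002257, 0.15724357886013807),
    ("dti", 116477.5703125, 0.45111500600856147, 0.09006941033569575),
]
# Assumed column names for the four tuple positions.
header = ["variable", "relative_importance", "scaled_importance", "percentage"]

def rows_to_csv(header, rows):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(header, rows)
print(csv_text)

# To persist the result, write the text to a file (hypothetical file name):
# with open("varimp.csv", "w", newline="") as f:
#     f.write(csv_text)
```

The same text could equally be inserted into a database table; CSV is used here only because it needs no extra dependencies.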


## Starting H2O server

In [1]:
# import h2o package and specific estimator 
import h2o
from h2o.automl import H2OAutoML

`h2o.init` in Python is very sensitive to the H2O version.  If the H2O cluster version is 3.20.0.1 and the Python h2o library is 3.19.0, the connection fails, so we set `strict_version_check=False`.

If the H2O cluster isn't found h2o.init will start one.

Note that the current script starts each H2O instance on a different port.  It is not clear why; if we keep doing this, we should choose only from the higher ports.

A port number is a 16-bit unsigned integer, thus ranging from 0 to 65535.  There is no reason to choose a port less than 10000.  
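
One way to choose such a port is to draw a random number from the high range and check that it can be bound locally. This is only a sketch: the helper name `pick_free_high_port` is hypothetical, and the port could still be taken by another process between the check and the moment H2O starts.

```python
import random
import socket

def pick_free_high_port(low=10000, high=65535, attempts=50):
    """Return a port in [low, high] that could be bound on localhost."""
    for _ in range(attempts):
        port = random.randint(low, high)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port already in use, try another one
            return port
    raise RuntimeError("no free port found in the requested range")

port = pick_free_high_port()
print(port)

# The chosen port could then be passed on, e.g. h2o.init(port=port).
```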


In [2]:
h2o.init(strict_version_check=False) # start h2o

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_121"; OpenJDK Runtime Environment (Zulu 8.20.0.5-macosx) (build 1.8.0_121-b15); OpenJDK 64-Bit Server VM (Zulu 8.20.0.5-macosx) (build 25.121-b15, mixed mode)
  Starting server from /Users/bear/anaconda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpwxe9j0n1
  JVM stdout: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpwxe9j0n1/h2o_bear_started_from_python.out
  JVM stderr: /var/folders/lh/42j8mfjx069d1bkc2wlf2pw40000gn/T/tmpwxe9j0n1/h2o_bear_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.


| Property | Value |
|---|---|
| H2O cluster uptime: | 03 secs |
| H2O cluster timezone: | America/New_York |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.20.0.1 |
| H2O cluster version age: | 26 days |
| H2O cluster name: | H2O_from_python_bear_mdycef |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 3.556 Gb |
| H2O cluster total cores: | 8 |
| H2O cluster allowed cores: | 8 |


## h2o.automl Parameters

[http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)  

NB:  Eventually one wants to expose all the parameters to the expert user.   

**Required Data Parameters**

y: This argument is the name (or index) of the response column.  

training_frame: Specifies the training set.  

The user gives the name of the dependent variable and the training file name.   


**Required Stopping Parameters**  

One of the following stopping strategies (time-based or number-of-models-based) must be specified. When both options are set, the AutoML run stops as soon as it hits either limit.

max_runtime_secs: This argument controls how long the AutoML run will execute for. This defaults to 3600 seconds (1 hour).  


max_models: Specify the maximum number of models to build in an AutoML run, excluding the Stacked Ensemble models. Defaults to NULL/None.
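
The "whichever limit is hit first" behaviour of the two stopping parameters can be illustrated with a toy training loop. This is a minimal sketch, not H2O's implementation: `run_until_limit`, `train_one_model` and the injectable `clock` are hypothetical names used only for illustration.

```python
import time

def run_until_limit(train_one_model, max_runtime_secs=None, max_models=None,
                    clock=time.monotonic):
    """Call train_one_model() repeatedly until a time or model-count limit is hit."""
    if max_runtime_secs is None and max_models is None:
        raise ValueError("specify max_runtime_secs, max_models, or both")
    start = clock()
    models = []
    while True:
        if max_models is not None and len(models) >= max_models:
            break  # model-count limit reached
        if max_runtime_secs is not None and clock() - start >= max_runtime_secs:
            break  # time limit reached
        models.append(train_one_model())
    return models

# Toy usage: a fake clock advances 10 "seconds" per reading, so nothing actually waits.
ticks = iter(range(0, 1000, 10))
fake_clock = lambda: next(ticks)
built = run_until_limit(lambda: "model", max_runtime_secs=30, max_models=5,
                        clock=fake_clock)
print(len(built))
```

With these toy settings the time limit is reached before the fifth model, so the loop ends on the time criterion.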


### Optional Parameters

**Optional Data Parameters**  

x: A list/vector of predictor column names or indexes. This argument only needs to be specified if the user wants to exclude columns from the set of predictors. If all columns (other than the response) should be used in prediction, then this does not need to be set.  

validation_frame: This argument is used to specify the validation frame used for early stopping of individual models and early stopping of the grid searches (unless max_models or max_runtime_secs overrides metric-based early stopping).  

leaderboard_frame: This argument allows the user to specify a particular data frame used to score and rank models on the leaderboard. This frame will not be used for anything besides leaderboard scoring. If a leaderboard frame is not specified by the user, then the leaderboard will use cross-validation metrics instead (or if cross-validation is turned off by setting nfolds = 0, then a leaderboard frame will be generated automatically from the validation frame (if provided) or the training frame).  

fold_column: Specifies a column with cross-validation fold index assignment per observation. This is used to override the default, randomized, 5-fold cross-validation scheme for individual models in the AutoML run.  

weights_column: Specifies a column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.  

ignored_columns: (Optional, Python only) Specify the column or columns (as a list/vector) to be excluded from the model. This is the converse of the x argument.  
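
The claim under weights_column, that a weight of 2 acts like a repeated row and a weight of 0 like a dropped row, can be checked on a simple weighted mean. The data and the helper `weighted_mean` below are made up for illustration.

```python
def weighted_mean(values, weights):
    """Mean of values where each value counts weights[i] times."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights are not allowed")
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

values = [10.0, 20.0, 40.0]
weights = [1, 2, 0]          # second row counts twice, third row is excluded

# The same data with the weights applied by hand: repeat row 2, drop row 3.
expanded = [10.0, 20.0, 20.0]

weighted = weighted_mean(values, weights)
plain = sum(expanded) / len(expanded)
print(weighted, plain)
```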

**Optional Miscellaneous Parameters**  

nfolds: Number of folds for k-fold cross-validation of the models in the AutoML run. Defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensembles (thus decreasing the overall best model performance).

balance_classes: Specify whether to oversample the minority classes to balance the class distribution. This option is not enabled by default and can increase the data frame size. This option is only applicable for classification. Majority classes can be undersampled to satisfy the max_after_balance_size parameter.

class_sampling_factors: Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance.

max_after_balance_size: Specify the maximum relative size of the training data after balancing class counts (balance_classes must be enabled). Defaults to 5.0. (The value can be less than 1.0).

stopping_metric: Specifies the metric to use for early stopping of the grid searches and individual models. Defaults to "AUTO". The available options are:

* AUTO: This defaults to logloss for classification, deviance for regression
* deviance (mean residual deviance)
* logloss
* MSE
* RMSE
* MAE
* RMSLE
* AUC
* lift_top_group
* misclassification
* mean_per_class_error

stopping_tolerance: This option specifies the relative tolerance for the metric-based stopping criterion to stop a grid search and the training of individual models within the AutoML run. This value defaults to 0.001 if the dataset is at least 1 million rows; otherwise it defaults to a bigger value determined by the size of the dataset and the non-NA-rate. In that case, the value is computed as 1/sqrt(nrows * non-NA-rate).
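
As a worked example of the default described above, the function below applies the stated rule. The name `default_stopping_tolerance` is hypothetical, and `non_na_rate=1.0` is an assumption made purely for illustration.

```python
import math

def default_stopping_tolerance(nrows, non_na_rate=1.0):
    """0.001 for at least 1M rows, otherwise 1/sqrt(nrows * non_na_rate)."""
    if nrows >= 1_000_000:
        return 0.001
    return 1.0 / math.sqrt(nrows * non_na_rate)

# The loan data used later in this document has 163,987 rows.
# non_na_rate=1.0 is an assumption; the real data has some missing values.
tol = default_stopping_tolerance(163987)
print(round(tol, 6))
```

For a dataset of this size the default tolerance is therefore roughly two to three times looser than the 0.001 used for very large datasets.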

stopping_rounds: This argument is used to stop model training when the stopping metric (e.g. AUC) doesn’t improve for this specified number of training rounds, based on a simple moving average. In the context of AutoML, this controls early stopping both within the random grid searches as well as the individual models. Defaults to 3 and must be a non-negative integer. To disable early stopping altogether, set this to 0.
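
The sketch below shows how a moving-average stopping rule of this kind might work. It is a deliberately simplified version, not H2O's exact rule: it stops when the latest moving average over `stopping_rounds` scores fails to beat the best earlier moving average by the relative tolerance. The function names and score values are illustrative.

```python
def moving_averages(scores, k):
    """Simple moving averages with window k over a list of scores."""
    return [sum(scores[i - k:i]) / k for i in range(k, len(scores) + 1)]

def should_stop(scores, stopping_rounds=3, tolerance=0.001, higher_is_better=True):
    """True when the latest moving average does not beat the best earlier one
    by at least the relative tolerance (simplified rule, see lead-in)."""
    k = stopping_rounds
    if k == 0 or len(scores) < k + 1:
        return False  # early stopping disabled, or not enough history yet
    averages = moving_averages(scores, k)
    latest, earlier = averages[-1], averages[:-1]
    if higher_is_better:
        return latest <= max(earlier) * (1 + tolerance)
    return latest >= min(earlier) * (1 - tolerance)

improving = [0.558, 0.590, 0.614, 0.625, 0.633, 0.643, 0.649, 0.656]  # AUC still rising
plateau = [0.60, 0.65, 0.68, 0.68, 0.68, 0.68]                        # AUC has flattened
print(should_stop(improving), should_stop(plateau))
```

Averaging over a window keeps a single noisy scoring round from ending training too early.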

sort_metric: Specifies the metric used to sort the Leaderboard by at the end of an AutoML run. Available options include:

* AUTO: This defaults to AUC for binary classification, mean_per_class_error for multinomial classification, and deviance for regression.
* deviance (mean residual deviance)
* logloss
* MSE
* RMSE
* MAE
* RMSLE
* AUC
* mean_per_class_error

seed: Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_runtime_secs is resource limited, meaning that if the available compute resources are not the same between runs, AutoML may be able to train more models on one run vs another. Defaults to NULL/None.

project_name: Character string to identify an AutoML project. Defaults to NULL/None, which means a project name will be auto-generated based on the training frame ID. More models can be trained and added to an existing AutoML project by specifying the same project name in multiple calls to the AutoML function (as long as the same training frame is used in subsequent runs).

exclude_algos: List/vector of character strings naming the algorithms to skip during the model-building phase. An example use is exclude_algos = ["GLM", "DeepLearning", "DRF"] in Python or exclude_algos = c("GLM", "DeepLearning", "DRF") in R. Defaults to None/NULL, which means that all appropriate H2O algorithms will be used, if the search stopping criteria allow. The algorithm names are:

* GLM
* DeepLearning
* GBM
* DRF (This includes both the Random Forest and Extremely Randomized Trees (XRT) models. Refer to the Extremely Randomized Trees section in the DRF chapter and the histogram_type parameter description for more information.)
* StackedEnsemble

keep_cross_validation_predictions: Specify whether to keep the cross-validation predictions. If set to FALSE, then running the same AutoML object for repeated runs will cause an exception because CV predictions are required to build additional Stacked Ensemble models in AutoML. This option defaults to TRUE.


keep_cross_validation_models: Specify whether to keep the cross-validated models. Deleting cross-validation models will save memory in the H2O cluster. This option defaults to TRUE.




In [3]:
# Assume the following are passed by the user from the web interface

'''
Need a user id and project id?

'''
target='bad_loan' 
data_file='loan.csv'
run_time=30
run_id='SOME_ID_20180617_221529' # Just some arbitrary ID
server_path='/Users/bear/Documents/INFO_7390/H2O'
classification=True
scale=False
max_models=None
balance_y=False # balance_classes=balance_y
balance_threshold=0.2
project ="automl_test"  # project_name = project

In [4]:
# Use local data file or download from some type of bucket
import os

data_path=os.path.join(server_path,data_file)
data_path

'/Users/bear/Documents/INFO_7390/H2O/loan.csv'

In [5]:
# Use local data file or download from some type of bucket
if not os.path.isfile(data_path):
  data_path = 'https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv'

# Load data into H2O
df = h2o.import_file(data_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
df.describe()

Rows:163987
Cols:15




| | loan_amnt | term | int_rate | emp_length | home_ownership | annual_inc | purpose | addr_state | dti | delinq_2yrs | revol_util | total_acc | bad_loan | longest_credit_length | verification_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| type | int | enum | real | int | enum | real | enum | enum | real | int | real | int | int | int | enum |
| mins | 500.0 | | 5.42 | 0.0 | | 1896.0 | | | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | |
| mean | 13074.169141456332 | | 13.715904065566189 | 5.684352932995338 | | 71915.67051974905 | | | 15.881530121290167 | 0.22735700606252723 | 54.07917280242262 | 24.579733834274574 | 0.1830388994249544 | 14.854273655448333 | |
| maxs | 35000.0 | | 26.06 | 10.0 | | 7141778.0 | | | 39.99 | 29.0 | 150.70000000000002 | 118.0 | 1.0 | 65.0 | |
| sigma | 7993.556188734672 | | 4.391939870545808 | 3.610663731100238 | | 59070.91565491818 | | | 7.5876682241925355 | 0.6941679229284191 | 25.285366766770498 | 11.685190365910666 | 0.3866995896078875 | 6.947732922546689 | |
| zeros | 0 | | 0 | 14248 | | 0 | | | 270 | 139459 | 1562 | 0 | 133971 | 11 | |
| missing | 0 | 0 | 0 | 5804 | 0 | 4 | 0 | 0 | 0 | 29 | 193 | 29 | 0 | 29 | 0 |
| 0 | 5000.0 | 36 months | 10.65 | 10.0 | RENT | 24000.0 | credit_card | AZ | 27.65 | 0.0 | 83.7 | 9.0 | 0.0 | 26.0 | verified |
| 1 | 2500.0 | 60 months | 15.27 | 0.0 | RENT | 30000.0 | car | GA | 1.0 | 0.0 | 9.4 | 4.0 | 1.0 | 12.0 | verified |
| 2 | 2400.0 | 36 months | 15.96 | 10.0 | RENT | 12252.0 | small_business | IL | 8.72 | 0.0 | 98.5 | 10.0 | 0.0 | 10.0 | not verified |


In [7]:
# assign the target column and the predictor columns
y = target
X = [name for name in df.columns if name != y]
print(y)
print(X)

bad_loan
['loan_amnt', 'term', 'int_rate', 'emp_length', 'home_ownership', 'annual_inc', 'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'revol_util', 'total_acc', 'longest_credit_length', 'verification_status']


In [8]:
# determine column types
ints, reals, enums = [], [], []
for key, val in df.types.items():
    if key in X:
        if val == 'enum':
            enums.append(key)
        elif val == 'int':
            ints.append(key)            
        else: 
            reals.append(key)

print(ints)
print(enums)
print(reals)

['loan_amnt', 'emp_length', 'delinq_2yrs', 'total_acc', 'longest_credit_length']
['term', 'home_ownership', 'purpose', 'addr_state', 'verification_status']
['int_rate', 'annual_inc', 'dti', 'revol_util']


In [9]:
# impute missing values
_ = df[reals].impute(method='mean')
_ = df[ints].impute(method='median')

if scale:
    df[reals] = df[reals].scale()
    df[ints] = df[ints].scale()

In [10]:
# set target to factor for classification by default or if user specifies classification
if classification:
    df[y] = df[y].asfactor()

In [11]:
df[y].levels()

[['0', '1']]

### balance_classes check 

In two-class classification, if one class makes up less than 20% of the rows, then one should set balance_classes=True.

That is,

balance_classes=balance_y
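
As a worked example of this check, the sketch below uses the class counts implied by the `df.describe()` output above (163,987 rows, of which 133,971 have `bad_loan` equal to 0). It runs on plain Python lists rather than an H2OFrame, and the helper name `minority_fraction` is hypothetical.

```python
from collections import Counter

def minority_fraction(labels):
    """Share of rows belonging to the least frequent class."""
    counts = Counter(labels)
    return min(counts.values()) / sum(counts.values())

# Class counts taken from the df.describe() output above:
# 163,987 rows, of which 133,971 have bad_loan == 0.
labels = [0] * 133971 + [1] * (163987 - 133971)

balance_threshold = 0.2
fraction = minority_fraction(labels)
balance_y = fraction < balance_threshold
print(round(fraction, 4), balance_y)
```

Taking the minimum over all classes means the check works regardless of which label happens to be the rare one.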


In [12]:
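if classification:
    # count rows per class and compare the minority-class share to the threshold
    # (the target is a factor at this point, so its mean is not a reliable class share)
    class_counts = df[y].table().as_data_frame().iloc[:, 1]
    minority_fraction = class_counts.min() / class_counts.sum()
    if minority_fraction < balance_threshold:
        balance_y = True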
        
        

In [13]:
print(run_time)
type(run_time)

30


int

## Cross-validate rather than take a test training split

Cross-validation, rather than a single train/test split, reduces the variance of the estimated goodness-of-fit statistics.  In rare cases one should use a train/test split, but this should be left to expert users.

This also means the pro user can just upload the data and not worry about creating a train/test split.  

We can pass the original, full dataset, `df` (without passing a `leaderboard_frame`).  This is a more efficient use of our data since we can use 100% of the data for training, rather than 80% or so.  This time our leaderboard will use cross-validated metrics. It also gives better estimates of goodness of fit statistics.

*Note: Using an explicit `leaderboard_frame` for scoring may be useful in some cases, which is why the option is available.*  

But it's not preferable in most cases.  Leave it as an expert option.  
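
To make the "100% of the data" point concrete, the sketch below assigns rows to k folds so that every row is used for validation exactly once and for training in the other k-1 rounds. It is a plain-Python illustration of k-fold splitting, not H2O's fold assignment code. The function name `kfold_indices` is hypothetical.

```python
import random

def kfold_indices(n_rows, nfolds=5, seed=1):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into nfolds roughly equal folds.
    folds = [indices[i::nfolds] for i in range(nfolds)]
    for i, valid_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, valid_idx

splits = list(kfold_indices(n_rows=10, nfolds=5))
for train_idx, valid_idx in splits:
    print(sorted(valid_idx), len(train_idx))
```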


In [14]:
# automl
# runs for run_time seconds then builds a stacked ensemble
aml = H2OAutoML(max_runtime_secs=run_time,project_name = project,balance_classes=balance_y) # init automl, run for run_time (here 30) seconds
aml.train(x=X,  
           y=y,
           training_frame=df) 

AutoML progress: |████████████████████████████████████████████████████████| 100%


## Leaderboard

Next, we will view the AutoML Leaderboard.  Since we did not specify a `leaderboard_frame` in the `H2OAutoML.train()` method for scoring and ranking the models, the AutoML leaderboard uses cross-validation metrics to rank the models.  

A default performance metric for each machine learning task (binary classification, multiclass classification, regression) is specified internally and the leaderboard will be sorted by that metric.  In the case of binary classification, the default ranking metric is Area Under the ROC Curve (AUC).  A different ranking metric can be chosen with the `sort_metric` parameter described above.

The leader model is stored at `aml.leader` and the leaderboard is stored at `aml.leaderboard`.

In [15]:
# view leaderboard
lb = aml.leaderboard
lb

| model_id | auc | logloss | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|
| DRF_0_AutoML_20180703_112311 | 0.684122 | 0.479454 | 0.37037 | 0.3836 | 0.147149 |




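Now we will view a snapshot of the top models.  With a longer run we should see the two Stacked Ensembles at or near the top of the leaderboard, since Stacked Ensembles can almost always outperform a single model.  In this 30-second run only a single DRF model appears on the leaderboard, so it is the leader.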

In [16]:
aml.leader

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_0_AutoML_20180703_112311


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.11340788234116275
RMSE: 0.33676086818566486
LogLoss: 0.3549776713152746
Mean Per-Class Error: 0.038664903973763765
AUC: 0.9936680310240623
Gini: 0.9873360620481246
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.25569971472199443: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 102283.0 | 4798.0 | 0.0448 | (4798.0/107081.0) |
| 1 | 3478.0 | 103463.0 | 0.0325 | (3478.0/106941.0) |
| Total | 105761.0 | 108261.0 | 0.0387 | (8276.0/214022.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.2556997 | 0.9615431 | 269.0 |
| max f2 | 0.2053985 | 0.9738037 | 287.0 |
| max f0point5 | 0.3111659 | 0.9667645 | 250.0 |
| max accuracy | 0.2556997 | 0.9613311 | 269.0 |
| max precision | 0.9999762 | 1.0 | 0.0 |
| max recall | 0.0626602 | 1.0 | 360.0 |
| max specificity | 0.9999762 | 1.0 | 0.0 |
| max absolute_mcc | 0.2556997 | 0.9227329 | 269.0 |
| max min_per_class_accuracy | 0.2678654 | 0.9603707 | 265.0 |


Gains/Lift Table: Avg response rate: 49.97 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100036,0.9491927,2.0013091,2.0013091,1.0,1.0,0.0200204,0.0200204,100.1309133,100.1309133
,2,0.0200026,0.9037044,2.0013091,2.0013091,1.0,1.0,0.0200110,0.0400314,100.1309133,100.1309133
,3,0.0300016,0.8707166,2.0003739,2.0009975,0.9995327,0.9998443,0.0200017,0.0600331,100.0373942,100.0997451
,4,0.0400006,0.8444126,2.0013091,2.0010754,1.0,0.9998832,0.0200110,0.0800441,100.1309133,100.1075363
,5,0.0500042,0.8218583,2.0013091,2.0011221,1.0,0.9999066,0.0200204,0.1000645,100.1309133,100.1122130
,6,0.1000037,0.7397437,2.0005611,2.0008416,0.9996262,0.9997664,0.1000271,0.2000916,100.0561050,100.0841603
,7,0.1500033,0.6780059,2.0000000,2.0005611,0.9993459,0.9996262,0.0999991,0.3000907,99.9999988,100.0561073
,8,0.2000028,0.6249758,1.9988779,2.0001403,0.9987852,0.9994160,0.0999430,0.4000337,99.8877863,100.0140281
,9,0.3000019,0.5259132,1.9916776,1.9973194,0.9951874,0.9980064,0.1991659,0.5991996,99.1677564,99.7319419




ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.14903310516783722
RMSE: 0.3860480606968998
LogLoss: 0.4821376849599736
Mean Per-Class Error: 0.3638057239360183
AUC: 0.6877402527198438
Gini: 0.3754805054396877
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.10055061700789454: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 17436.0 | 9454.0 | 0.3516 | (9454.0/26890.0) |
| 1 | 2313.0 | 3829.0 | 0.3766 | (2313.0/6142.0) |
| Total | 19749.0 | 13283.0 | 0.3562 | (11767.0/33032.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.1005506 | 0.3942342 | 256.0 |
| max f2 | 0.0563211 | 0.5646177 | 323.0 |
| max f0point5 | 0.1595904 | 0.3522935 | 182.0 |
| max accuracy | 0.4245376 | 0.8144527 | 28.0 |
| max precision | 0.6098247 | 0.8333333 | 2.0 |
| max recall | 0.0003886 | 1.0 | 399.0 |
| max specificity | 0.6351036 | 0.9999628 | 0.0 |
| max absolute_mcc | 0.1231030 | 0.2160883 | 226.0 |
| max min_per_class_accuracy | 0.0981921 | 0.6352979 | 259.0 |


Gains/Lift Table: Avg response rate: 18.59 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100206,0.3689836,2.7296461,2.7296461,0.5075529,0.5075529,0.0273527,0.0273527,172.9646110,172.9646110
,2,0.0200109,0.3189162,2.5423522,2.6361408,0.4727273,0.4901664,0.0253989,0.0527515,154.2352210,163.6140834
,3,0.0300012,0.2891187,2.0208441,2.4312489,0.3757576,0.4520686,0.0201889,0.0729404,102.0844064,143.1248873
,4,0.0400218,0.2674095,2.1284740,2.3554407,0.3957704,0.4379728,0.0213286,0.0942690,112.8474050,135.5440653
,5,0.0500121,0.2513634,1.8578728,2.2560476,0.3454545,0.4194915,0.0185607,0.1128297,85.7872769,125.6047553
,6,0.1000242,0.2024502,1.9012002,2.0786239,0.3535109,0.3865012,0.0950830,0.2079127,90.1200246,107.8623899
,7,0.1500061,0.1715512,1.7199345,1.9591090,0.3198062,0.3642785,0.0859655,0.2938782,71.9934496,95.9109025
,8,0.2000182,0.1509043,1.4975207,1.8436945,0.2784504,0.3428182,0.0748942,0.3687724,49.7520742,84.3694488
,9,0.3000121,0.1218877,1.3839978,1.6904777,0.2573418,0.3143290,0.1383914,0.5071638,38.3997832,69.0477732




ModelMetricsBinomial: drf
** Reported on cross-validation data. **

MSE: 0.1471492396617259
RMSE: 0.3836003645224101
LogLoss: 0.4794542730965334
Mean Per-Class Error: 0.3674051643392733
AUC: 0.6841224026225468
Gini: 0.3682448052450935
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.11167637609816343: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 76548.0 | 30533.0 | 0.2851 | (30533.0/107081.0) |
| 1 | 10877.0 | 12997.0 | 0.4556 | (10877.0/23874.0) |
| Total | 87425.0 | 43530.0 | 0.3162 | (41410.0/130955.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.1116764 | 0.3856448 | 239.0 |
| max f2 | 0.0481420 | 0.5551261 | 334.0 |
| max f0point5 | 0.1491935 | 0.3470326 | 194.0 |
| max accuracy | 0.5244993 | 0.8178840 | 10.0 |
| max precision | 0.5406449 | 0.7209302 | 8.0 |
| max recall | 0.0003730 | 1.0 | 399.0 |
| max specificity | 0.6630444 | 0.9999907 | 0.0 |
| max absolute_mcc | 0.1133350 | 0.2127000 | 237.0 |
| max min_per_class_accuracy | 0.0950666 | 0.6319702 | 262.0 |


Gains/Lift Table: Avg response rate: 18.23 %



0,1,2,3,4,5,6,7,8,9,10,11
,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0100034,0.3518201,2.5123310,2.5123310,0.4580153,0.4580153,0.0251319,0.0251319,151.2330959,151.2330959
,2,0.0200069,0.3068091,2.4537099,2.4830204,0.4473282,0.4526718,0.0245455,0.0496775,145.3709903,148.3020431
,3,0.0300027,0.2796578,2.3885377,2.4515422,0.4354469,0.4469331,0.0238753,0.0735528,138.8537722,145.1542226
,4,0.0400061,0.2600907,2.2150385,2.3924050,0.4038168,0.4361519,0.0221580,0.0957108,121.5038462,139.2405000
,5,0.0500019,0.2449838,1.9862577,2.3112128,0.3621085,0.4213500,0.0198542,0.1155650,98.6257685,131.1212752
,6,0.1000038,0.1961955,1.9174940,2.1143534,0.3495724,0.3854612,0.0958784,0.2114434,91.7494016,111.4353384
,7,0.1500057,0.1681192,1.7022053,1.9769707,0.3103238,0.3604154,0.0851135,0.2965569,70.2205260,97.6970676
,8,0.2,0.1472975,1.5348998,1.8664656,0.2798228,0.3402696,0.0767362,0.3732931,53.4899780,86.6465611
,9,0.3000038,0.1185259,1.3227274,1.6852149,0.2411423,0.3072263,0.1322778,0.5055709,32.2727414,68.5214932



Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| accuracy | 0.6538429 | 0.0183382 | 0.6781719 | 0.6496506 | 0.625482 | 0.6270093 | 0.6889008 |
| auc | 0.6841686 | 0.0025851 | 0.6814531 | 0.6876243 | 0.6783245 | 0.6871284 | 0.6863129 |
| err | 0.3461571 | 0.0183382 | 0.3218281 | 0.3503494 | 0.3745180 | 0.3729907 | 0.3110992 |
| err_count | 9066.2 | 480.29688 | 8429.0 | 9176.0 | 9809.0 | 9769.0 | 8148.0 |
| f0point5 | 0.3197558 | 0.0073429 | 0.3269654 | 0.3198925 | 0.3023575 | 0.3165978 | 0.3329658 |
| f1 | 0.3866551 | 0.0040865 | 0.3856862 | 0.3896501 | 0.3759781 | 0.392664 | 0.3892969 |
| f2 | 0.4901663 | 0.0130113 | 0.4701158 | 0.4983157 | 0.4969895 | 0.5168407 | 0.4685696 |
| lift_top_group | 2.5133004 | 0.0676990 | 2.5162978 | 2.4443011 | 2.697146 | 2.4670079 | 2.4417496 |
| logloss | 0.4794543 | 0.0026525 | 0.4834266 | 0.4767689 | 0.4734881 | 0.4824348 | 0.4811529 |


Scoring History: 


0,1,2,3,4,5,6,7,8,9,10,11,12,13
,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_lift,validation_classification_error
,2018-07-03 11:24:30,1 min 18.380 sec,0.0,,,,,,,,,,
,2018-07-03 11:24:30,1 min 18.776 sec,1.0,0.4243810,2.9500809,0.8534337,1.5963901,0.1935611,0.5079289,7.1723286,0.5582234,1.2736602,0.8140591
,2018-07-03 11:24:31,1 min 19.114 sec,2.0,0.4116788,2.4100875,0.8666501,1.6374159,0.1780079,0.4270653,2.7814400,0.5901808,1.5326344,0.4629753
,2018-07-03 11:24:31,1 min 19.396 sec,3.0,0.3991564,2.0077627,0.8799705,1.6718959,0.1630942,0.4046227,1.5547977,0.6142016,1.7547725,0.4966093
,2018-07-03 11:24:31,1 min 19.689 sec,4.0,0.3861652,1.6688530,0.8943019,1.7065689,0.1491473,0.3972569,1.1286337,0.6245992,2.1058640,0.3931945
,2018-07-03 11:24:32,1 min 19.989 sec,5.0,0.3805014,1.3732612,0.9050657,1.7360614,0.1417329,0.3928225,0.8725981,0.6331596,2.3721925,0.4191996
,2018-07-03 11:24:32,1 min 20.310 sec,6.0,0.3739801,1.1361723,0.9161899,1.7644842,0.1340910,0.3908130,0.7464744,0.6425739,2.4534319,0.4048801
,2018-07-03 11:24:32,1 min 20.605 sec,7.0,0.3706536,0.9468316,0.9253759,1.7935431,0.1289952,0.3894329,0.6540850,0.6486163,2.4371840,0.3721543
,2018-07-03 11:24:33,1 min 20.913 sec,8.0,0.3645844,0.8059596,0.9347503,1.8208734,0.1197094,0.3882897,0.6116692,0.6559480,2.4859277,0.3747578


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| addr_state | 255846.1406250 | 1.0 | 0.1989208 |
| int_rate | 200356.5156250 | 0.7831133 | 0.1557776 |
| revol_util | 113865.6718750 | 0.4450553 | 0.0885308 |
| dti | 112172.3203125 | 0.4384366 | 0.0872142 |
| loan_amnt | 95355.8125000 | 0.3727076 | 0.0741393 |
| annual_inc | 91148.6015625 | 0.3562633 | 0.0708682 |
| total_acc | 90776.2968750 | 0.3548082 | 0.0705787 |
| longest_credit_length | 84317.3828125 | 0.3295628 | 0.0655569 |
| purpose | 76727.6718750 | 0.2998977 | 0.0596559 |




In [17]:
aml.leader.algo

'drf'

In [18]:
dir(aml.leader)

['F0point5',
 'F1',
 'F2',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_bc',
 '_bcin',
 '_check_targets',
 '_compute_algo',
 '_estimator_type',
 '_future',
 '_get_metrics',
 '_have_mojo',
 '_have_pojo',
 '_id',
 '_is_xvalidated',
 '_job',
 '_keyify_if_h2oframe',
 '_metrics_class',
 '_model_json',
 '_parms',
 '_plot',
 '_requires_training_frame',
 '_resolve_model',
 '_verify_training_frame_params',
 '_xval_keys',
 'accuracy',
 'actual_params',
 'aic',
 'algo',
 'auc',
 'balance_classes',
 'biases',
 'binomial_double_trees',
 'build_tree_one_node',
 'calibrate_model',
 'calibration_frame',
 'categorical_encoding',
 'catoffsets',
 'checkpoint',
 'clas

## Ensemble Exploration

To understand how the ensemble works, let's take a peek inside the Stacked Ensemble "All Models" model.  The "All Models" ensemble is an ensemble of all of the individual models in the AutoML run.  This is often the top performing model on the leaderboard.

In [19]:
aml_leaderboard_df=aml.leaderboard.as_data_frame()
aml_leaderboard_df

| | model_id | auc | logloss | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| 0 | DRF_0_AutoML_20180703_112311 | 0.684122 | 0.479454 | 0.37037 | 0.3836 | 0.147149 |


## Getting models

Individual models can be found by searching the leaderboard or directly by name.  

In [20]:
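# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model, if the run produced one
se_ids = [mid for mid in model_ids if "StackedEnsemble_AllModels" in mid]
if se_ids:
    se = h2o.get_model(se_ids[0])
    # Get the Stacked Ensemble metalearner model
    metalearner = h2o.get_model(se.metalearner()['name'])
else:
    se = metalearner = None
    print("No Stacked Ensemble in this run")

Without this guard, indexing the empty list raises `IndexError: list index out of range`, because this short run produced only a single DRF model and no Stacked Ensemble.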

In [None]:
metalearner.coef_norm()

In [None]:
%matplotlib inline
metalearner.std_coef_plot()

**Getting a model directly by name**

In [None]:
aml_leaderboard_df.head()

In [None]:
m_id = ''
for model in aml_leaderboard_df['model_id']:
    if 'StackedEnsemble' not in model:
        print(model)
        if m_id == '':
            m_id = model
print("model_id ", m_id)

In [None]:
non_stacked= h2o.get_model(m_id)
print (non_stacked.algo)

In [None]:
dir(non_stacked)

Note that since this is a pandas dataframe the data can be saved.

The type of exploration depends on the learner.  If the learner isn't an ensemble then ensemble exploration doesn't make sense.  


Examine the variable importance of the metalearner (combiner) algorithm in the ensemble.  This shows us how much each base learner is contributing to the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm (GLM with non-negative weights), so the variable importance of the metalearner is actually the standardized coefficient magnitudes of the GLM. 

## Save Leader Model

There are two ways to save the leader model -- binary format and MOJO format.  If you're taking your leader model to production, then we'd suggest the MOJO format since it's optimized for production use.

In [None]:
h2o.save_model(aml.leader, path = "./models")

In [None]:
aml.leader.download_mojo(path = "./models")

## Making predictions

Predictions are normally made on new data.

Here we hold out 20% of the original file just to show the syntax.

In [None]:
# split into training and test for showing how to predict
train, test = df.split_frame([0.8])

## Predict Using Leader Model

If you need to generate predictions on a test set, you can make predictions on the `"H2OAutoML"` object directly, or on the leader model object.

In [None]:
pred = aml.predict(test)
pred.head()

## model_performance()

The standard `model_performance()` method can be applied to the AutoML leader model and a test set to generate an H2O model performance object.


In [None]:
perf = aml.leader.model_performance(test)
perf

In [None]:
dir(perf)

In [None]:
d=perf.confusion_matrix()
d
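
The Error and Rate columns of an H2O confusion matrix follow directly from the four cell counts. As a worked example, the snippet below recomputes them from the validation confusion matrix reported earlier in this document. The variable names are illustrative.

```python
# Counts from the validation confusion matrix reported above
# (rows are actual classes, columns are predicted classes).
tn, fp = 17436, 9454   # actual 0: predicted 0, predicted 1
fn, tp = 2313, 3829    # actual 1: predicted 0, predicted 1

error_class0 = fp / (tn + fp)                    # share of actual 0 predicted as 1
error_class1 = fn / (fn + tp)                    # share of actual 1 predicted as 0
error_total = (fp + fn) / (tn + fp + fn + tp)    # overall misclassification rate

print(round(error_class0, 4), round(error_class1, 4), round(error_total, 4))
```

These match the 0.3516, 0.3766 and 0.3562 shown in the Error column of that table.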

In R, the equivalent calls look like:

    # compute performance
    perf <- h2o.performance(automl_leader, conv_data.hex)
    h2o.confusionMatrix(perf)
    h2o.accuracy(perf)
    h2o.tpr(perf)

In [None]:
aml.leader.algo

In [None]:
dir(aml.leader)

In [None]:
aml.leader.model_performance(test).auc() 
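
The Gini values printed next to AUC in the metrics above are consistent with the usual relation Gini = 2 × AUC − 1. The check below uses the validation AUC reported earlier. The function name `gini_from_auc` is hypothetical.

```python
def gini_from_auc(auc):
    """Gini coefficient derived from AUC via Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0

validation_auc = 0.6877402527198438   # value from the validation metrics above
validation_gini = gini_from_auc(validation_auc)
print(validation_gini)
```

An AUC of 0.5 (random ranking) therefore corresponds to a Gini of 0, and an AUC of 1.0 to a Gini of 1.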

In [None]:
best_perf = aml.leader.model_performance()
best_perf

In [None]:
best_perf.plot()

In [None]:
aml.leader.confusion_matrix()


In [None]:
roc=aml.leader.roc()
roc

In [None]:
aml.leader.tnr()

In [None]:
aml.leader.tpr()

In [None]:
aml.leader.weights

### Test Data Sets for Binary Classifier 

#### Some Kaggle Binary classification competitions  

The idea here is to get a range of datasets to test our H2O binary classification models as well as to understand which approaches work best for binary classification.   The hope is to get a single model or set of models that perform well in these competitions as well as logic and tests to dynamically choose the best models and their parameters.  

[Santander Customer Satisfaction](https://www.kaggle.com/c/santander-customer-satisfaction)    

[Facebook Recruiting IV: Human or Robot?](https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot)    

[DonorsChoose.org Application Screening Predict whether teachers' project proposals are accepted](https://www.kaggle.com/c/donorschoose-application-screening)    

[Statoil/C-CORE Iceberg Classifier Challenge Ship or iceberg, can you decide from space?](https://www.kaggle.com/c/statoil-iceberg-classifier-challenge)    

[WSDM - KKBox's Churn Prediction Challenge Can you predict when subscribers will churn?](https://www.kaggle.com/c/kkbox-churn-prediction-challenge)    

[Porto Seguro’s Safe Driver Prediction Predict if a driver will file an insurance claim next year.](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction)    

[Truly Native? Predict which web pages served by StumbleUpon are sponsored](https://www.kaggle.com/c/dato-native)    

[Data Science Bowl 2017 Can you improve lung cancer detection?](https://www.kaggle.com/c/data-science-bowl-2017)    

[Random Acts of Pizza Predicting altruism through free pizza](https://www.kaggle.com/c/random-acts-of-pizza)    


Last update:  June 24, 2018