# H2O-3 for Distributed ML

This notebook helps you get started with distributed machine learning in the H2O AI Cloud using Python.

* **Product Documentation:** https://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
* **Python Documentation:** https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html
* **Additional Tutorials:** https://github.com/h2oai/h2o-tutorials

In [1]:
import h2o_engine_manager

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
import h2o

import pandas as pd
import numpy as np

## Securely connect to the platform

In [None]:
engine_manager = h2o_engine_manager.login()

## Connect to H2O-3
We will create a new H2O-3 engine and connect to it. Connecting points our imported `h2o` library at the engine, so that subsequent `h2o` calls run on that cluster.

In [None]:
h2o_engine = engine_manager.h2o_engine_client.create_engine(
    display_name="My-Tutorial-Engine-04",
    version="latest"
)

h2o_engine.wait()

In [None]:
h2o.connect(config=h2o_engine.get_connection_config())

## Data

We can create an H2OFrame from data on our local machine or from a URL.

In [None]:
data = h2o.import_file("https://h2o-internal-release.s3-us-west-2.amazonaws.com/data/Splunk/churn.csv")

In [None]:
data.shape

In [None]:
data.head()

In [None]:
data.types

### Data Exploration

We can use H2O-3 to explore our dataset.  We can find correlations, build decision trees, and visualize our dataset.  In this demo, we will view the correlations and plot distributions.

We will first use H2O-3 to find the numeric columns that are correlated with `Churn?`.

In [None]:
numeric_cols = [k for k, v in data.types.items() if v in ['real', 'int']]
churn_cor_hf = data['Churn?'].cor(data[numeric_cols])
churn_cor_hf

The result of the `cor` function is an H2O Frame with one row, showing the correlation of each variable to `Churn?`. Since this data is very small, we can convert it to a Pandas dataframe and order it based on absolute correlation.

In [None]:
churn_cor = churn_cor_hf.as_data_frame().transpose().reset_index()
churn_cor.columns = ['Feature', 'Correlation']
churn_cor = churn_cor.iloc[(-churn_cor['Correlation'].abs()).argsort()]
churn_cor.head()
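
To make the ranking step concrete, here is a minimal pure-Python sketch of the same idea: compute a Pearson correlation per feature, then order the features by absolute correlation. The toy data and the feature names (`service_calls`, `account_age`) are hypothetical and are not taken from the churn dataset, and the hand-written `pearson` helper is only an illustration, not how H2O-3 computes `cor`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: 1 = churned, 0 = stayed
churned = [0, 0, 1, 1, 0, 1]
features = {
    "service_calls": [1, 0, 4, 5, 2, 3],
    "account_age": [9, 8, 2, 7, 6, 1],
}

# Correlate every feature with the target, then sort by absolute value
correlations = {name: pearson(values, churned) for name, values in features.items()}
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, corr in ranked:
    print(f"{name}: {corr:+.3f}")
```

Sorting on the absolute value keeps strongly negative correlations near the top, which is why the notebook cell above also orders by `abs()`.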

The greatest indicators of churn seem to be more calls to customer service as well as more calling minutes/charges.

We can use the histogram function to see the distribution of these top features.

In [None]:
data['CustServ Calls'].hist();

In [None]:
data['Day Mins'].hist();

### Split a Dataset

Next, we will split the dataset for training. When building models, we set aside part of the data to validate how the model performs.  Performance on this held-out data is a good indicator of how well the model generalizes to unseen data.

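We will use the `split_frame` function to create a random split of the dataset.  The ratios `[0.7, 0.15]` produce three frames: roughly 70% of the rows for training, 15% for validation, and the remaining 15% for testing.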

In [14]:
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)

train = splits[0]
valid = splits[1]
test = splits[2]
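
The split sizes are approximate because, as the H2O documentation describes it, `split_frame` splits probabilistically rather than exactly. The sketch below illustrates that idea in plain Python under that assumption: each row gets a uniform random draw and lands in a split according to cumulative ratio thresholds. The `random_split` helper is hypothetical and is not H2O's implementation.

```python
import random

def random_split(rows, ratios, seed):
    """Assign each row to a split by drawing a uniform random number.

    `ratios` covers every split except the last; the remainder goes to the last split.
    """
    rng = random.Random(seed)

    # Cumulative thresholds, e.g. [0.7, 0.15] -> [0.7, 0.85]
    thresholds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        thresholds.append(total)

    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        # The split index is the number of thresholds the draw has passed
        index = sum(draw >= t for t in thresholds)
        splits[index].append(row)
    return splits

parts = random_split(range(10_000), ratios=[0.7, 0.15], seed=1)
print([len(part) for part in parts])
```

With this approach the three sizes come out close to, but not exactly, 7000, 1500, and 1500, and fixing the seed makes the assignment reproducible.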

### Prepare columns for training

In [15]:
y = 'Churn?'
x = [i for i in data.columns if i not in [y, 'Phone']] # exclude the target and the 'Phone' column from the predictors

## Modeling

In this section, we will create models predicting Churn using H2O-3's algorithms.

### Baseline Model

We will start our modeling by building a baseline model.  This is a simple model that we will use as a control. In this example, we will build a generalized linear model (GLM) to predict churn.

We first create an object of class `H2OGeneralizedLinearEstimator`. This does not do any training; it only sets the model up for training by specifying model parameters.

In [16]:
glm_fit1 = H2OGeneralizedLinearEstimator(model_id='glm_fit1',
                                         ## fix a random number generator seed for reproducibility
                                         seed=1234,

                                         ## predict a yes/no column
                                         family='binomial',

                                         ## cross validation
                                         nfolds=3,

                                         ## use cross validation to find the best regularization
                                         lambda_search=True
                                        )
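
The `nfolds=3` setting asks for 3-fold cross-validation: the training rows are divided into three folds, and each fold is held out once while a model is fit on the other two. The sketch below shows that bookkeeping in plain Python. It assigns rows to folds round-robin for simplicity, which is an assumption made for illustration; H2O-3 chooses its own fold assignment, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, holdout_indices) for each fold.

    Rows are assigned round-robin: row i belongs to fold i % n_folds.
    """
    for fold in range(n_folds):
        holdout = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, holdout

folds = list(k_fold_indices(n_rows=9, n_folds=3))
for train_idx, holdout_idx in folds:
    print("train:", train_idx, "holdout:", holdout_idx)
```

Every row is used for scoring exactly once, so the averaged holdout metric estimates performance on unseen data without touching the separate validation or test frames.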

Now that the `glm_fit1` object is initialized, we can train the model:

In [None]:
glm_fit1.train(x = x, y = y, training_frame = train, validation_frame = valid);

The plot below shows the value of the objective (loss) function at each training iteration.

In [None]:
glm_fit1.plot();

#### Explore predictions

Let's see how the GLM we just trained performs on the test data.

In [19]:
glm_perf1 = glm_fit1.model_performance(test)

In [None]:
glm_perf1.plot();

We can compare the AUC on the training dataset with the AUC on the validation dataset.  The linear model is slightly better at predicting on the training dataset.  This is to be expected, since the model saw the training data while it was being fit.

In [None]:
print("AUC on Training Data: {0:.3f}".format(glm_fit1.auc(train=True)))
print("AUC on Validation Data: {0:.3f}".format(glm_fit1.auc(valid=True)))
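
For intuition about what the AUC measures, it can be read as the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The following is a minimal sketch of that pairwise definition with hypothetical labels and scores; it is not how H2O-3 computes the metric internally.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical example: 1 = churned, 0 = stayed
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the training and validation values printed above.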

Here are the predictions on the validation dataset.

In [None]:
glm_preds = glm_fit1.predict(valid)
glm_preds.head()

Since our model is a GLM, we can also see the standardized coefficient of each variable. The plot below shows that `Int'l Plan=Yes` increases the likelihood of churn.

In [None]:
glm_fit1.std_coef_plot(num_of_features=5)
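
To see why a positive coefficient means a higher likelihood of churn, recall that a binomial GLM with a logit link passes a weighted sum of the features through the logistic function. The sketch below uses made-up values: the intercept, the coefficients, and the feature names `intl_plan_yes` and `day_mins_std` are all hypothetical and are not read from `glm_fit1`.

```python
import math

def churn_probability(intercept, coefficients, features):
    """Logistic model: sigmoid of the intercept plus the weighted feature sum."""
    linear = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical model parameters, chosen only for illustration
intercept = -2.0
coefficients = {"intl_plan_yes": 1.5, "day_mins_std": 0.6}

without_plan = churn_probability(intercept, coefficients, {"intl_plan_yes": 0, "day_mins_std": 0.0})
with_plan = churn_probability(intercept, coefficients, {"intl_plan_yes": 1, "day_mins_std": 0.0})

# exp(coefficient) is the multiplicative change in the odds of churn
odds_ratio = math.exp(coefficients["intl_plan_yes"])

print(f"P(churn) without plan: {without_plan:.3f}")
print(f"P(churn) with plan:    {with_plan:.3f}")
print(f"Odds ratio:            {odds_ratio:.2f}")
```

Because the plot shows standardized coefficients, the bar lengths are comparable across features measured on different scales.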

In [None]:
glm_fit1.partial_plot(data=train, cols=['Day Mins'], figsize=(5, 5));

### AutoML

We will now automate the machine learning process using AutoML and the `explain` function.  AutoML will automatically tune models and try out different algorithms for the user and provide a leaderboard. 

In [None]:
from h2o.automl import H2OAutoML

aml = H2OAutoML(## train up to 10 models
                max_models=10,

                ## set seed for reproducibility
                seed=1234,

                ## option to exclude specific algorithms
                exclude_algos=['StackedEnsemble', 'DeepLearning']
                )
aml.train(x=x, y=y, training_frame=train, leaderboard_frame=valid);

The leaderboard below shows us the AUC of each model on our `valid` data.  In this run, the best model is a Random Forest model.

In [None]:
aml.leaderboard

In [None]:
aml.leader

We can examine the model automatically by calling the explain function.  The explain function returns performance metrics, comparison between the models, and explains the best model.

In [None]:
aml.explain(train)

## Deployment objects

We can download the MOJO (Model Object, Optimized) for easy deployment in MLOps.

In [31]:
local_mojo = aml.leader.download_mojo()

## Clean up

In [None]:
h2o_engine.delete()