# H2O-3

This notebook helps you get started with distributed machine learning in the H2O AI Cloud using Python.

* **Product Documentation:** https://docs.h2o.ai/h2o/latest-stable/h2o-docs/index.html
* **Python Documentation:** https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html
* **Additional Tutorials:** https://github.com/h2oai/h2o-tutorials

## Prerequisites
This tutorial relies on the latest Steam SDK (1.8.11 at the time of writing), which can be installed into a Python environment as follows:

1. In the H2O AI Cloud, click `My AI Engines` and then `Python client` to download the wheel file.
2. Navigate to the location where the Python client was downloaded and install it with `pip install h2osteam-1.8.11-py2.py3-none-any.whl`.

We require the `h2o_authn` library to connect securely to the H2O AI Cloud platform. Install it with `pip install h2o_authn`.

We also set the following variables to connect to a specific H2O AI Cloud environment. To find them, log into the platform, click on your name, and choose the `CLI & API Access` page. Then copy the values from the `Accessing H2O AI Cloud APIs` section.

In [2]:
CLIENT_ID = "q8s-internal-platform"
TOKEN_ENDPOINT = "https://auth.demo.h2o.ai/auth/realms/q8s-internal/protocol/openid-connect/token"
REFRESH_TOKEN = "https://cloud-internal.h2o.ai/auth/get-platform-token"

H2O_STEAM_URL = "https://steam.cloud-internal.h2o.ai/"
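
Hard-coding connection settings works for a demo, but the same values can also be read from environment variables so that the notebook can be pointed at another environment without edits. The sketch below is one possible way to do that. The environment variable names are hypothetical (H2O AI Cloud does not define them), and the fallback values are simply the ones used in the cell above.

```python
import os

# Hypothetical environment variable names, chosen for this sketch only.
# The fallback values are copied from the configuration cell above.
DEFAULTS = {
    "H2O_CLOUD_CLIENT_ID": "q8s-internal-platform",
    "H2O_STEAM_URL": "https://steam.cloud-internal.h2o.ai/",
}


def load_setting(name, environ=os.environ):
    """Return `name` from the environment, falling back to DEFAULTS."""
    value = environ.get(name, DEFAULTS.get(name))
    if not value:
        raise KeyError(f"Missing required setting: {name}")
    return value


CLIENT_ID = load_setting("H2O_CLOUD_CLIENT_ID")
H2O_STEAM_URL = load_setting("H2O_STEAM_URL")
```

Exporting `H2O_STEAM_URL` in the shell before starting the notebook would then override the default without touching the code.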

In [1]:
from getpass import getpass

import h2o_authn
import h2osteam
from h2osteam.clients import H2oKubernetesClient

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
import h2o

import pandas as pd
import numpy as np

## Securely connect to the platform
We first connect to the H2O AI Cloud using our personal access token to create a token provider object. We can then use this object to log into Steam and other APIs.

In [3]:
print(f"Visit {REFRESH_TOKEN} to get your personal access token")
tp = h2o_authn.TokenProvider(
    refresh_token=getpass("Enter your access token: "),
    client_id=CLIENT_ID,
    token_endpoint_url=TOKEN_ENDPOINT
)

Visit https://cloud-internal.h2o.ai/auth/get-platform-token to get your personal access token
Enter your access token: ········


Next, we connect to our AI Engine manager to view all H2O-3 clusters that we have access to. If you don't have an H2O-3 cluster, please see the Enterprise Steam tutorial.

In [5]:
steam = h2osteam.login(
    url=H2O_STEAM_URL,
    access_token=tp()
)

## Connect to H2O-3
We now connect to a specific H2O-3 instance. This step points our imported `h2o` library at that cluster, so that subsequent `h2o` calls run against it.

We also check that our local Python package and the backend server have the same version number.

In [9]:
for instance in steam.get_h2o_kubernetes_clusters():
    print(instance["id"], "\t", instance["profile_name"], "\t", instance["status"], "\t", instance["name"])

82 	 default-h2o-kubernetes 	 stopped 	 es-test-instance
83 	 default-h2o-kubernetes 	 stopped 	 es-test-instance-2
160 	 default-h2o-kubernetes 	 running 	 test-instance
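
Instead of reading the cluster name off the listing by eye, the running clusters can be filtered programmatically. This is a minimal sketch that assumes each record behaves like a dictionary with the keys used in the loop above. The sample records are copied from the output above, and `find_running_clusters` is a hypothetical helper, not part of `h2osteam`.

```python
def find_running_clusters(clusters):
    """Return only the cluster records whose status is 'running'."""
    return [c for c in clusters if c.get("status") == "running"]


# Sample records shaped like the listing above (values copied from that output).
clusters = [
    {"id": 82, "profile_name": "default-h2o-kubernetes", "status": "stopped", "name": "es-test-instance"},
    {"id": 83, "profile_name": "default-h2o-kubernetes", "status": "stopped", "name": "es-test-instance-2"},
    {"id": 160, "profile_name": "default-h2o-kubernetes", "status": "running", "name": "test-instance"},
]

running = find_running_clusters(clusters)
print([c["name"] for c in running])  # ['test-instance']
```

In the notebook, the list returned by `steam.get_h2o_kubernetes_clusters()` would take the place of the sample `clusters` list.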


In [10]:
cluster = H2oKubernetesClient().get_cluster(name="test-instance", created_by="michelle.tanco@h2o.ai")
cluster.connect()

Connecting to H2O server at https://steam.cloud-internal.h2o.ai:443/proxy/h2o-k8s/160 ... successful.


H2O_cluster_uptime: 1 min 57 secs
H2O_cluster_timezone: Etc/GMT
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.36.0.3
H2O_cluster_version_age: 1 month and 28 days
H2O_cluster_name: test-instance
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.185 Gb
H2O_cluster_total_cores: 1
H2O_cluster_allowed_cores: 1


<h2osteam.clients.h2ok8s.h2ok8s.H2oKubernetesCluster at 0x1288cb2b0>

In [15]:
h2o.__version__

'3.36.0.3'
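
The version check above is done by eye: the cluster table reports `3.36.0.3` and so does the local package. A small helper can make the comparison explicit. This is a sketch only; `versions_match` is a hypothetical function, and it assumes both versions are purely numeric dotted strings like the ones shown here.

```python
def parse_version(version):
    """Split a dotted version string such as '3.36.0.3' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def versions_match(client_version, cluster_version):
    """True when the client and cluster report identical version components."""
    return parse_version(client_version) == parse_version(cluster_version)


# Values copied from the outputs above: the local package and the cluster table.
print(versions_match("3.36.0.3", "3.36.0.3"))  # True
```

In the notebook, the first argument would be `h2o.__version__` and the second the `H2O_cluster_version` value reported when connecting.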

## Data

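We can create an H2OFrame (H2O's distributed data frame) from a file or a URL. Here we import from a URL with `h2o.import_file`, which reads data from a location the cluster itself can reach; a file that exists only on our local machine would instead be sent to the cluster with `h2o.upload_file`.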

In [19]:
data = h2o.import_file("https://h2o-internal-release.s3-us-west-2.amazonaws.com/data/Splunk/churn.csv") 

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [20]:
data.shape

(3333, 21)

In [21]:
data.head()

State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
KS,128,415,382-4657,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
OH,107,415,371-7191,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
NJ,137,415,358-1921,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
OH,84,408,375-9999,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
OK,75,415,330-6626,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.
AL,118,510,391-8027,yes,no,0,223.4,98,37.98,220.6,101,18.75,203.9,118,9.18,6.3,6,1.7,0,False.
MA,121,510,355-9993,no,yes,24,218.2,88,37.09,348.5,108,29.62,212.6,118,9.57,7.5,7,2.03,3,False.
MO,147,415,329-9001,yes,no,0,157.0,79,26.69,103.1,94,8.76,211.8,96,9.53,7.1,6,1.92,0,False.
LA,117,408,335-4719,no,no,0,184.5,97,31.37,351.6,80,29.89,215.8,90,9.71,8.7,4,2.35,1,False.
WV,141,415,330-8173,yes,yes,37,258.6,84,43.96,222.0,111,18.87,326.4,97,14.69,11.2,5,3.02,0,False.




In [24]:
data.types

{'State': 'enum',
 'Account Length': 'int',
 'Area Code': 'int',
 'Phone': 'string',
 "Int'l Plan": 'enum',
 'VMail Plan': 'enum',
 'VMail Message': 'int',
 'Day Mins': 'real',
 'Day Calls': 'int',
 'Day Charge': 'real',
 'Eve Mins': 'real',
 'Eve Calls': 'int',
 'Eve Charge': 'real',
 'Night Mins': 'real',
 'Night Calls': 'int',
 'Night Charge': 'real',
 'Intl Mins': 'real',
 'Intl Calls': 'int',
 'Intl Charge': 'real',
 'CustServ Calls': 'int',
 'Churn?': 'enum'}

### Split a Dataset

In [25]:
# Roughly 70% train and 15% validation; the remaining ~15% becomes the test set
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)

train = splits[0]
valid = splits[1]
test = splits[2]
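
Note that `split_frame` does not produce exact splits: the training output further down reports 2342 training rows and 498 validation rows, not exactly 70% and 15% of 3333. The sketch below illustrates why a probabilistic split behaves this way, by assigning each row to a split with a random draw. It is an illustration under that assumption, not H2O's actual implementation, and its sizes will not match H2O's for the same seed.

```python
import random


def simulate_split(n_rows, ratios, seed):
    """Assign each row to a split by a uniform draw; return the split sizes.

    `ratios` lists the fractions for all but the last split, which
    receives whatever probability mass remains.
    """
    rng = random.Random(seed)

    # Cumulative boundaries, e.g. [0.7, 0.85] for ratios [0.7, 0.15].
    bounds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)

    sizes = [0] * (len(ratios) + 1)
    for _ in range(n_rows):
        draw = rng.random()
        # The split index is the number of boundaries the draw has passed.
        index = sum(draw >= bound for bound in bounds)
        sizes[index] += 1
    return sizes


sizes = simulate_split(3333, [0.7, 0.15], seed=1)
print(sizes)  # three sizes that sum to 3333, close to (but not exactly) 70/15/15
```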

### Prepare columns for training

In [27]:
y = 'Churn?'
x = list(data.columns)
x.remove(y)  # remove the response column
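
At this point `x` still contains `Phone`, which `data.types` reports as a `string` column. Assuming such columns are not wanted as predictors, the predictor list can be built from the type mapping instead. This is a sketch: `select_predictors` is a hypothetical helper, and the dictionary is a subset of the `data.types` output shown earlier.

```python
def select_predictors(column_types, response, excluded_types=("string",)):
    """Return the column names that are neither the response nor an excluded type."""
    return [
        name
        for name, col_type in column_types.items()
        if name != response and col_type not in excluded_types
    ]


# A subset of the data.types output shown earlier.
column_types = {
    "State": "enum",
    "Account Length": "int",
    "Phone": "string",
    "Day Mins": "real",
    "Churn?": "enum",
}

predictors = select_predictors(column_types, response="Churn?")
print(predictors)  # ['State', 'Account Length', 'Day Mins']
```

In the notebook, passing `data.types` and `y` would give a replacement for `x`.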

## Modeling

We first create an object of the class `H2OGeneralizedLinearEstimator`. This does not do any training; it only sets the model up for training by specifying model parameters.

In [28]:
glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')
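
No regularization parameters are passed here, and the model summary in the training output further down reports `Elastic Net (alpha = 0.5, lambda = 1.568E-4 )`. As a reminder of what those two numbers control, the sketch below evaluates the elastic net penalty in its usual form, `lambda * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)`. We assume this is the form H2O's GLM uses; the coefficient values are made up for illustration.

```python
def elastic_net_penalty(coefficients, alpha, lam):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1_norm = sum(abs(b) for b in coefficients)
    l2_squared = sum(b * b for b in coefficients)
    return lam * (alpha * l1_norm + 0.5 * (1.0 - alpha) * l2_squared)


# Hypothetical coefficients, with the alpha and lambda from the model summary.
penalty = elastic_net_penalty([1.0, -2.0, 0.0], alpha=0.5, lam=1.568e-4)
print(penalty)

# alpha = 1 gives a pure lasso (L1) penalty, alpha = 0 a pure ridge (L2) penalty.
print(elastic_net_penalty([1.0, -2.0], alpha=1.0, lam=0.1))
print(elastic_net_penalty([1.0, -2.0], alpha=0.0, lam=0.1))
```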

### Launch experiment

Now that the `glm_fit1` object is initialized, we can train the model:

In [29]:
glm_fit1.train(x=x, y=y, training_frame=train, validation_frame=valid)



glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_fit1


GLM Model: summary


| family | link | regularization | number_of_predictors_total | number_of_active_predictors | number_of_iterations | training_frame |
|---|---|---|---|---|---|---|
| binomial | logit | Elastic Net (alpha = 0.5, lambda = 1.568E-4 ) | 71 | 65 | 5 | py_3_sid_a22d |




ModelMetricsBinomialGLM: glm
** Reported on train data. **

MSE: 0.09700905774232732
RMSE: 0.3114627710374505
LogLoss: 0.31739964206788246
Null degrees of freedom: 2341
Residual degrees of freedom: 2276
Null deviance: 1982.3961686936407
Residual deviance: 1486.6999234459618
AIC: 1618.6999234459618
AUC: 0.8392095420283235
AUCPR: 0.5108690267881953
Gini: 0.6784190840566471

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.23159597734350887: 


| | False. | True. | Error | Rate |
|---|---|---|---|---|
| False. | 1729.0 | 261.0 | 0.1312 | (261.0/1990.0) |
| True. | 125.0 | 227.0 | 0.3551 | (125.0/352.0) |
| Total | 1854.0 | 488.0 | 0.1648 | (386.0/2342.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.231596 | 0.540476 | 191.0 |
| max f2 | 0.152883 | 0.641325 | 244.0 |
| max f0point5 | 0.372365 | 0.523844 | 126.0 |
| max accuracy | 0.461503 | 0.865073 | 96.0 |
| max precision | 0.991556 | 1.0 | 0.0 |
| max recall | 0.014646 | 1.0 | 384.0 |
| max specificity | 0.991556 | 1.0 | 0.0 |
| max absolute_mcc | 0.231596 | 0.452031 | 191.0 |
| max min_per_class_accuracy | 0.157894 | 0.774874 | 241.0 |
| max mean_per_class_accuracy | 0.157894 | 0.775221 | 241.0 |



Gains/Lift Table: Avg response rate: 15.03 %, avg score: 15.03 %


| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | kolmogorov_smirnov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.010248 | 0.785831 | 5.267282 | 5.267282 | 0.791667 | 0.864255 | 0.791667 | 0.864255 | 0.053977 | 0.053977 | 426.72822 | 426.72822 | 0.051465 |
| 2 | 0.020068 | 0.716649 | 4.628458 | 4.954666 | 0.695652 | 0.746966 | 0.744681 | 0.806859 | 0.045455 | 0.099432 | 362.84585 | 395.466634 | 0.093402 |
| 3 | 0.030316 | 0.6302 | 4.158381 | 4.685499 | 0.625 | 0.671029 | 0.704225 | 0.760944 | 0.042614 | 0.142045 | 315.838068 | 368.549936 | 0.131493 |
| 4 | 0.040137 | 0.581446 | 4.049901 | 4.529981 | 0.608696 | 0.608745 | 0.680851 | 0.723704 | 0.039773 | 0.181818 | 304.990119 | 352.998066 | 0.166743 |
| 5 | 0.050384 | 0.54414 | 2.495028 | 4.116092 | 0.375 | 0.561248 | 0.618644 | 0.690662 | 0.025568 | 0.207386 | 149.502841 | 311.609206 | 0.184773 |
| 6 | 0.100342 | 0.399868 | 3.355138 | 3.737234 | 0.504274 | 0.471876 | 0.561702 | 0.581735 | 0.167614 | 0.375 | 235.513792 | 273.723404 | 0.323241 |
| 7 | 0.150299 | 0.298976 | 2.78647 | 3.421213 | 0.418803 | 0.343047 | 0.514205 | 0.502398 | 0.139205 | 0.514205 | 178.647047 | 242.12132 | 0.428275 |
| 8 | 0.200256 | 0.24272 | 2.047203 | 3.078443 | 0.307692 | 0.268505 | 0.462687 | 0.44405 | 0.102273 | 0.616477 | 104.72028 | 207.844301 | 0.489844 |
| 9 | 0.300171 | 0.161335 | 1.450102 | 2.536435 | 0.217949 | 0.196309 | 0.381223 | 0.361587 | 0.144886 | 0.761364 | 45.010198 | 153.643476 | 0.542771 |
| 10 | 0.400085 | 0.110872 | 0.796134 | 2.101824 | 0.119658 | 0.134379 | 0.315902 | 0.304846 | 0.079545 | 0.840909 | -20.386558 | 110.1824 | 0.518799 |




ModelMetricsBinomialGLM: glm
** Reported on validation data. **

MSE: 0.0942585173998845
RMSE: 0.3070155002599779
LogLoss: 0.31602431469892744
Null degrees of freedom: 497
Residual degrees of freedom: 432
Null deviance: 390.8772978932661
Residual deviance: 314.76021744013167
AIC: 446.76021744013167
AUC: 0.8036966891133558
AUCPR: 0.4317381899113072
Gini: 0.6073933782267116

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2600844629113609: 


| | False. | True. | Error | Rate |
|---|---|---|---|---|
| False. | 374.0 | 58.0 | 0.1343 | (58.0/432.0) |
| True. | 23.0 | 43.0 | 0.3485 | (23.0/66.0) |
| Total | 397.0 | 101.0 | 0.1627 | (81.0/498.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.260084 | 0.51497 | 97.0 |
| max f2 | 0.196579 | 0.607235 | 119.0 |
| max f0point5 | 0.290944 | 0.471014 | 83.0 |
| max accuracy | 0.598987 | 0.87751 | 17.0 |
| max precision | 0.909452 | 1.0 | 0.0 |
| max recall | 0.013232 | 1.0 | 374.0 |
| max specificity | 0.909452 | 1.0 | 0.0 |
| max absolute_mcc | 0.260084 | 0.436178 | 97.0 |
| max min_per_class_accuracy | 0.162937 | 0.742424 | 141.0 |
| max mean_per_class_accuracy | 0.196579 | 0.768098 | 119.0 |



Gains/Lift Table: Avg response rate: 13.25 %, avg score: 15.39 %


| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | kolmogorov_smirnov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.01004 | 0.721192 | 6.036364 | 6.036364 | 0.8 | 0.806668 | 0.8 | 0.806668 | 0.060606 | 0.060606 | 503.636364 | 503.636364 | 0.058291 |
| 2 | 0.02008 | 0.692601 | 1.509091 | 3.772727 | 0.2 | 0.713878 | 0.5 | 0.760273 | 0.015152 | 0.075758 | 50.909091 | 277.272727 | 0.064184 |
| 3 | 0.03012 | 0.628115 | 6.036364 | 4.527273 | 0.8 | 0.667717 | 0.6 | 0.729421 | 0.060606 | 0.136364 | 503.636364 | 352.727273 | 0.122475 |
| 4 | 0.040161 | 0.571153 | 4.527273 | 4.527273 | 0.6 | 0.608535 | 0.6 | 0.6992 | 0.045455 | 0.181818 | 352.727273 | 352.727273 | 0.1633 |
| 5 | 0.050201 | 0.544978 | 3.018182 | 4.225455 | 0.4 | 0.555547 | 0.56 | 0.670469 | 0.030303 | 0.212121 | 201.818182 | 322.545455 | 0.186658 |
| 6 | 0.100402 | 0.43171 | 2.414545 | 3.32 | 0.32 | 0.478211 | 0.44 | 0.57434 | 0.121212 | 0.333333 | 141.454545 | 232.0 | 0.268519 |
| 7 | 0.150602 | 0.328285 | 3.32 | 3.32 | 0.44 | 0.380258 | 0.44 | 0.509646 | 0.166667 | 0.5 | 232.0 | 232.0 | 0.402778 |
| 8 | 0.200803 | 0.26039 | 2.716364 | 3.169091 | 0.36 | 0.292077 | 0.42 | 0.455254 | 0.136364 | 0.636364 | 171.636364 | 216.909091 | 0.502104 |
| 9 | 0.301205 | 0.155589 | 1.056364 | 2.464848 | 0.14 | 0.198825 | 0.326667 | 0.369777 | 0.106061 | 0.742424 | 5.636364 | 146.484848 | 0.508628 |
| 10 | 0.399598 | 0.117274 | 0.307978 | 1.93376 | 0.040816 | 0.136539 | 0.256281 | 0.312347 | 0.030303 | 0.772727 | -69.202226 | 93.375971 | 0.430135 |




Scoring History: 


| timestamp | duration | iterations | negative_log_likelihood | objective | training_rmse | training_logloss | training_r2 | training_auc | training_pr_auc | training_lift | training_classification_error | validation_rmse | validation_logloss | validation_r2 | validation_auc | validation_pr_auc | validation_lift | validation_classification_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-04-14 00:27:00 | 0.000 sec | 0 | 991.198084 | 0.423227 | | | | | | | | | | | | | | |
| 2022-04-14 00:27:00 | 0.102 sec | 1 | 784.460199 | 0.337412 | | | | | | | | | | | | | | |
| 2022-04-14 00:27:00 | 0.113 sec | 2 | 745.632273 | 0.321273 | | | | | | | | | | | | | | |
| 2022-04-14 00:27:00 | 0.125 sec | 3 | 743.46814 | 0.320706 | | | | | | | | | | | | | | |
| 2022-04-14 00:27:00 | 0.135 sec | 4 | 743.414674 | 0.320702 | | | | | | | | | | | | | | |
| 2022-04-14 00:27:00 | 0.216 sec | 5 | 743.349962 | 0.320684 | 0.311463 | 0.3174 | 0.240391 | 0.83921 | 0.510869 | 5.267282 | 0.164816 | 0.307016 | 0.316024 | 0.180118 | 0.803697 | 0.431738 | 6.036364 | 0.162651 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Int'l Plan.yes | 2.126976 | 1.0 | 0.071222 |
| State.IL | 1.766897 | 0.830709 | 0.059164 |
| State.HI | 1.201278 | 0.564782 | 0.040225 |
| State.SC | 1.12516 | 0.528995 | 0.037676 |
| State.NJ | 1.000195 | 0.470243 | 0.033492 |
| State.MT | 0.980623 | 0.461041 | 0.032836 |
| State.VT | 0.914072 | 0.429752 | 0.030608 |
| State.IN | 0.855199 | 0.402073 | 0.028636 |
| State.AL | 0.847208 | 0.398316 | 0.028369 |
| State.AK | 0.774537 | 0.364149 | 0.025935 |



See the whole table with table.as_data_frame()
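
Several numbers in the training report above are related by standard definitions, which makes the output easier to read. The sketch below recomputes the `max f1` value and the total error rate from the training confusion matrix, and checks that Gini equals `2 * AUC - 1` and that LogLoss equals the residual deviance divided by twice the row count. All inputs are copied from the report. This is a consistency check, not a description of how H2O computes these metrics internally.

```python
# Counts copied from the training confusion matrix above
# (rows are actual classes, columns are predicted classes).
true_neg, false_pos = 1729, 261
false_neg, true_pos = 125, 227

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)
error_rate = (false_pos + false_neg) / (true_neg + false_pos + false_neg + true_pos)

print(round(f1, 6))          # 0.540476, the reported "max f1" value
print(round(error_rate, 4))  # 0.1648, the error in the "Total" row

# Scalar training metrics copied from the report above.
n_rows = 2342
auc = 0.8392095420283235
residual_deviance = 1486.6999234459618

gini = 2 * auc - 1                           # compare with the reported Gini
logloss = residual_deviance / (2 * n_rows)   # compare with the reported LogLoss
print(gini, logloss)
```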




### Explore predictions

Let's evaluate the GLM that was just trained on the held-out test set.

In [30]:
glm_perf1 = glm_fit1.model_performance(test)

In [31]:
print(glm_perf1.auc())

0.8289899352983465


In [34]:
print(glm_fit1.auc(train=True))
print(glm_fit1.auc(valid=True))

0.8392095420283235
0.8036966891133558
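
To interpret these AUC values: AUC can be read as the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The sketch below computes AUC directly from that definition on a tiny made-up example. It is meant to show what the metric measures; it is not how H2O computes it, and the labels and scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = churn) and predicted churn probabilities.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

# Five of the six positive/negative pairs are ordered correctly.
example_auc = auc_from_scores(labels, scores)
print(example_auc)
```

Read this way, the test AUC of about 0.83 means the model ranks a churner above a non-churner in roughly 83% of such pairs.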
