# Steam to H2O3
This notebook is a getting-started tutorial on how to securely connect to an instance of the H2O AI Cloud from a local workstation and then accomplish common tasks using H2O-3.

##### For more H2O-3 tutorials with more data science detail, see: 'https://github.com/h2oai/h2o-tutorials/blob/master/h2o-open-tour-2016/chicago/intro-to-h2o.ipynb'

## Notebook Setup
This tutorial relies on the Steam SDK (release 1.8.9 in the command below), which can be installed into a Python environment using `pip install https://enterprise-steam.s3.amazonaws.com/release/1.8.9/python/h2osteam-1.8.9-py2.py3-none-any.whl`.
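
Before running the imports, it can help to confirm which of the required packages are installed. The following is a minimal sketch using only the standard library. The helper name `installed_version` is hypothetical, and the distribution names `h2osteam`, `h2o`, and `h2o-mlops-client` are assumptions about how the packages are published; adjust them to match your environment.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions; adjust them to match your environment.
for name in ("h2osteam", "h2o", "h2o-mlops-client"):
    print(name, installed_version(name) or "not installed")
```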

In [34]:
import h2osteam
import h2o
import os
import getpass
import h2o_mlops_client as mlops

from h2osteam.clients import H2oKubernetesClient

# Import H2O GLM:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

## Table of Contents
<div class="toc"><ul class="toc-item"><li><span><a href="#Notebook-Setup" data-toc-modified-id="Notebook-Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Notebook Setup</a></span></li><li><span><a href="#Securely-Connect" data-toc-modified-id="Securely-Connect-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Securely Connect</a></span></li><li><span><a href="#DAI-Instances" data-toc-modified-id="DAI-Instances-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>H2O3 Clusters</a></span><ul class="toc-item"><li><span><a href="#Create-new-cluster" data-toc-modified-id="Create-new-cluster"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Create new cluster</a></span></li><li><span><a href="#list-all-existing-clusters" data-toc-modified-id="list-all-existing-clusters-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>List all existing clusters</a></span></li><li><span><a href="#add-a-dataset" data-toc-modified-id="add-a-dataset-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Add a dataset</a></span></li><li><span><a href="#Run-an-experiment" data-toc-modified-id="Run-an-experiment-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Run an experiment</a></span></li><li><span><a href="#Pause-our-instance" data-toc-modified-id="Pause-our-instance-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Pause our instance</a></span></li><li><span><a href="#Delete-the-instance" data-toc-modified-id="Delete-the-instance-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Delete the instance</a></span></li></ul></li></ul></div>

## Securely Connect

To get a personal access token, log in to the Steam instance you would like to test with, click Configuration and then Token. Copy the token and paste it into the input box shown when prompted.

In [35]:
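# Note: despite its name, this variable holds the URL of the page that issues the token, not the token itself.
refresh_token = 'https://cloud-internal.h2o.ai/auth/get-platform-token'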

In [36]:
print('Click link to get personalized password:', refresh_token)

tp = mlops.TokenProvider(
    token_endpoint_url='https://auth.demo.h2o.ai/auth/realms/q8s-internal/protocol/openid-connect/token',
    client_id='q8s-internal-platform',
    refresh_token=getpass.getpass(),
)

Click link to get personalized password: https://cloud-internal.h2o.ai/auth/get-platform-token
········
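
When the notebook is run non-interactively, a prompt is inconvenient. The sketch below reads the token from an environment variable and falls back to the same hidden prompt otherwise. The helper name `read_refresh_token` and the variable name `H2O_REFRESH_TOKEN` are hypothetical and not part of any H2O library.

```python
import getpass
import os


def read_refresh_token(env_var="H2O_REFRESH_TOKEN"):
    """Return the token from the environment, or prompt for it without echoing."""
    token = os.environ.get(env_var)  # env_var is a made-up name for this sketch
    if token:
        return token.strip()
    return getpass.getpass("Platform token: ").strip()
```

The result could then be passed as `refresh_token=read_refresh_token()` in place of the `getpass.getpass()` call above.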


In [37]:
steam = h2osteam.login(
    url="https://steam.cloud-internal.h2o.ai/",
    access_token=tp.ensure_fresh_token(),
)

## H2O3 Clusters

This example shows how to create a cluster and connect to it. First, let's check the version of the H2O Python client and the available H2O server versions.

H2O Python Client Version:

In [38]:
h2o.__version__

'3.36.0.2'

H2O server versions available:

In [39]:
h2osteam.api().get_h2o_engines()

[]

### Create new cluster

Now let's launch the cluster!

In [40]:
cluster = H2oKubernetesClient().launch_cluster(
    name="test_cluster",
    version="3.36.0.1",
)

To ensure the cluster is running, run the following line:

In [41]:
cluster.is_running()

True
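
If a cluster is still starting, a single check may return `False`. A small polling loop with a timeout is one way to wait for it. This is a generic sketch: `wait_until` is a hypothetical helper, and the only call taken from the notebook is `cluster.is_running()`.

```python
import time


def wait_until(predicate, timeout_s=300.0, interval_s=5.0):
    """Poll predicate() until it returns True or timeout_s seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with the cluster object from the notebook (not executed here):
# ready = wait_until(cluster.is_running, timeout_s=600)
```

Returning `False` on timeout, rather than raising, leaves the decision about what to do next to the caller.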

#### Connecting to new cluster

Finally, let's connect to the cluster:

In [42]:
cluster.connect()

Connecting to H2O server at https://steam.cloud-internal.h2o.ai:443/proxy/h2o-k8s/112 ... successful.


0,1
H2O_cluster_uptime:,3 mins 27 secs
H2O_cluster_timezone:,Etc/GMT
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.36.0.1
H2O_cluster_version_age:,2 months
H2O_cluster_name:,test_cluster
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.185 Gb
H2O_cluster_total_cores:,1
H2O_cluster_allowed_cores:,1


<h2osteam.clients.h2ok8s.h2ok8s.H2oKubernetesCluster at 0x7f83191ddf70>

In [43]:
h2o.cluster().version

'3.36.0.1'
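
Note that the client reported `3.36.0.2` earlier while the cluster runs `3.36.0.1`, so the two differ in the last component. A quick way to compare them is sketched below. The helper names are hypothetical, and how strictly a mismatch is treated depends on the version check applied when connecting.

```python
def version_tuple(version):
    """Split a dotted version such as '3.36.0.1' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def same_release(client, server, parts=3):
    """Return True when the first `parts` components of both versions agree."""
    return version_tuple(client)[:parts] == version_tuple(server)[:parts]


client_version = "3.36.0.2"  # value of h2o.__version__ shown earlier
server_version = "3.36.0.1"  # value of h2o.cluster().version shown above

print(same_release(client_version, server_version))           # True
print(same_release(client_version, server_version, parts=4))  # False
```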

#### Connecting to existing clusters

If you want to connect to an existing H2O cluster, run the following code:

In [44]:
name = "test_cluster"
cluster = H2oKubernetesClient.get_cluster(name)
cluster.connect()


Connecting to H2O server at https://steam.cloud-internal.h2o.ai:443/proxy/h2o-k8s/112 ... successful.


0,1
H2O_cluster_uptime:,3 mins 36 secs
H2O_cluster_timezone:,Etc/GMT
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.36.0.1
H2O_cluster_version_age:,2 months
H2O_cluster_name:,test_cluster
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.185 Gb
H2O_cluster_total_cores:,1
H2O_cluster_allowed_cores:,1


<h2osteam.clients.h2ok8s.h2ok8s.H2oKubernetesCluster at 0x7f83191dda00>

### List all existing clusters

In [45]:
clusters = H2oKubernetesClient().get_clusters()
clusters

[<h2osteam.clients.h2ok8s.h2ok8s.H2oKubernetesCluster at 0x7f8329210ac0>]

### Add a dataset

In [48]:
#loan_csv = "/Volumes/H2OTOUR/loan.csv"  # modify this for your machine
# Alternatively, you can import the data directly from a URL
loan_csv = "https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv"

data = h2o.import_file(loan_csv)  # 163,987 rows x 15 columns

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [49]:
data.shape

(163987, 15)

In [52]:
h2o.cluster().show_status(True)

0,1
H2O_cluster_uptime:,14 mins 13 secs
H2O_cluster_timezone:,Etc/GMT
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.36.0.1
H2O_cluster_version_age:,2 months
H2O_cluster_name:,test_cluster
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.173 Gb
H2O_cluster_total_cores:,1
H2O_cluster_allowed_cores:,1


0,1
Nodes info:,Node 1
h2o,steam-1a3cbfec-5e5c-42be-a677-1958e0612c23-0/10.1.6.109:54321
healthy,True
last_ping,1646075391940.0000000
num_cpus,1
sys_load,0.88
mem_value_size,6048090
free_mem,3406542216.0000000
pojo_mem,68223262
swap_mem,0


### Run an experiment
#### 1. Split the dataset for training and testing

In [53]:
data['bad_loan'] = data['bad_loan'].asfactor()  # encode the binary response as a factor
data['bad_loan'].levels()  # optional: after encoding, this shows the two factor levels, '0' and '1'

[['0', '1']]

In [54]:
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)  # 70% train, 15% validation, remaining 15% test

train = splits[0]
valid = splits[1]
test = splits[2]
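
Two ratios produce three frames, because the last split receives whatever fraction is left over. `split_frame` assigns rows probabilistically, so the resulting sizes are close to the requested ratios but not exact. The sketch below checks this with plain arithmetic, using the row counts that appear in this notebook's outputs (163,987 total rows; 114,908 training and 24,498 validation rows in the model metrics further down).

```python
total_rows = 163987                            # data.shape[0] from the notebook
ratios = [0.7, 0.15]                           # ratios passed to split_frame
observed = {"train": 114908, "valid": 24498}   # row counts from the model output

# The last split receives whatever fraction is left over.
fractions = ratios + [1.0 - sum(ratios)]
expected = [round(total_rows * f) for f in fractions]
observed["test"] = total_rows - sum(observed.values())

for name, exp in zip(("train", "valid", "test"), expected):
    got = observed[name]
    print(f"{name}: expected ~{exp}, observed {got} ({got / total_rows:.4f})")
```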

In [55]:
y = 'bad_loan'
x = list(data.columns)

In [56]:
x.remove(y)  #remove the response
x.remove('int_rate')  #remove the interest rate column because it's correlated with the outcome

#### 2. Set up the experiment's settings (e.g. model family, model ID)

We first create an object of class `H2OGeneralizedLinearEstimator`. This does not do any training; it only sets the model up for training by specifying the model parameters.

In [57]:
# Initialize the GLM estimator:
# Similar to R's glm() and H2O's R GLM, H2O's GLM has the "family" argument

glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')
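
With `family='binomial'`, the GLM uses a logit link (as the model summary further down reports), so a prediction is the logistic function applied to a linear combination of the inputs. The following is a minimal sketch of that formula in plain Python. The function name, coefficients, and feature names are made up for illustration and are not taken from the trained model.

```python
import math


def predict_probability(intercept, coefficients, features):
    """Binomial GLM with logit link: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    eta = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Made-up coefficients and inputs, for illustration only.
coefs = {"feature_a": 0.5, "feature_b": -1.0}
print(predict_probability(-1.5, coefs, {"feature_a": 1.0, "feature_b": 0.2}))
```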

#### 3. Launch Experiment

Now that the `glm_fit1` object is initialized, we can train the model:

In [58]:
glm_fit1.train(x=x, y=y, training_frame=train, validation_frame=valid)

glm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  glm_fit1


GLM Model: summary


family,link,regularization,number_of_predictors_total,number_of_active_predictors,number_of_iterations,training_frame
binomial,logit,"Elastic Net (alpha = 0.5, lambda = 8.155E-5 )",82,52,5,py_5_sid_b772




ModelMetricsBinomialGLM: glm
** Reported on train data. **

MSE: 0.13984781035810745
RMSE: 0.37396231141400793
LogLoss: 0.44573426798537014
Null degrees of freedom: 114907
Residual degrees of freedom: 114855
Null deviance: 108939.63716428309
Residual deviance: 102436.86653132582
AIC: 102542.86653132582
AUC: 0.6741464380849886
AUCPR: 0.3129311477245476
Gini: 0.3482928761699773

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.1889005821903624: 


Actual,Predicted 0,Predicted 1,Error,Rate
0,62183.0,31841.0,0.3386,(31841.0/94024.0)
1,8549.0,12335.0,0.4094,(8549.0/20884.0)
Total,70732.0,44176.0,0.3515,(40390.0/114908.0)



Maximum Metrics: Maximum metrics at their respective thresholds


,metric,threshold,value,idx
0,max f1,0.188901,0.379188,223.0
1,max f2,0.111699,0.547285,316.0
2,max f0point5,0.262339,0.341004,153.0
3,max accuracy,0.521566,0.818733,24.0
4,max precision,0.73849,1.0,0.0
5,max recall,0.00099,1.0,399.0
6,max specificity,0.73849,1.0,0.0
7,max absolute_mcc,0.207706,0.201205,204.0
8,max min_per_class_accuracy,0.180996,0.624785,232.0
9,max mean_per_class_accuracy,0.180996,0.626243,232.0



Gains/Lift Table: Avg response rate: 18.17 %, avg score: 18.17 %


,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010008,0.4668043,2.727179,2.727179,0.495652,0.515867,0.495652,0.515867,0.027294,0.027294,172.71787,172.71787,0.021125
1,2,0.020007,0.4283093,2.466174,2.596733,0.448216,0.44596,0.471944,0.480929,0.02466,0.051954,146.617438,159.67333,0.039042
2,3,0.030007,0.4037065,2.150121,2.447906,0.390775,0.415485,0.444896,0.459121,0.0215,0.073453,115.012096,144.79057,0.053097
3,4,0.040006,0.3839043,2.130966,2.368688,0.387293,0.393512,0.430498,0.442722,0.021308,0.094762,113.096621,136.868806,0.066918
4,5,0.050005,0.3672821,2.159698,2.326897,0.392515,0.375539,0.422903,0.429288,0.021595,0.116357,115.969834,132.689739,0.081089
5,6,0.100002,0.3109176,1.882912,2.104924,0.342211,0.336551,0.38256,0.382923,0.094139,0.210496,88.291217,110.49241,0.135037
6,7,0.150007,0.2744069,1.611592,1.94047,0.292899,0.291427,0.352672,0.352423,0.080588,0.291084,61.15919,94.047049,0.172412
7,8,0.200003,0.2483117,1.513225,1.833669,0.275022,0.260707,0.333261,0.329496,0.075656,0.36674,51.322544,83.366853,0.203771
8,9,0.300005,0.211301,1.276074,1.647804,0.231921,0.228755,0.299481,0.295916,0.12761,0.49435,27.607432,64.780379,0.237511
9,10,0.399998,0.1848513,1.116243,1.514922,0.202872,0.197359,0.27533,0.271278,0.111617,0.605966,11.62432,51.492232,0.251716




ModelMetricsBinomialGLM: glm
** Reported on validation data. **

MSE: 0.14197789922883763
RMSE: 0.3767995478086958
LogLoss: 0.4509779914567159
Null degrees of freedom: 24497
Residual degrees of freedom: 24445
Null deviance: 23495.187288593857
Residual deviance: 22096.117669413252
AIC: 22202.117669413252
AUC: 0.6754349275583739
AUCPR: 0.31637176075742307
Gini: 0.35086985511674773

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.19001917992659606: 


Actual,Predicted 0,Predicted 1,Error,Rate
0,13326.0,6630.0,0.3322,(6630.0/19956.0)
1,1859.0,2683.0,0.4093,(1859.0/4542.0)
Total,15185.0,9313.0,0.3465,(8489.0/24498.0)



Maximum Metrics: Maximum metrics at their respective thresholds


,metric,threshold,value,idx
0,max f1,0.190019,0.387297,224.0
1,max f2,0.120977,0.554689,307.0
2,max f0point5,0.286058,0.342812,134.0
3,max accuracy,0.494281,0.815414,25.0
4,max precision,0.580675,0.708333,9.0
5,max recall,0.004379,1.0,398.0
6,max specificity,0.719034,0.99995,0.0
7,max absolute_mcc,0.210439,0.207877,203.0
8,max min_per_class_accuracy,0.180059,0.62728,236.0
9,max mean_per_class_accuracy,0.181039,0.629799,235.0



Gains/Lift Table: Avg response rate: 18.54 %, avg score: 18.20 %


,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010001,0.4639619,2.641792,2.641792,0.489796,0.512898,0.489796,0.512898,0.02642,0.02642,164.179225,164.179225,0.020156
1,2,0.020002,0.4251522,2.487688,2.56474,0.461224,0.442705,0.47551,0.477801,0.024879,0.051299,148.76877,156.473998,0.038421
2,3,0.030002,0.4031526,1.82724,2.318907,0.338776,0.413759,0.429932,0.456454,0.018274,0.069573,82.723964,131.890653,0.048577
3,4,0.040003,0.384716,2.157464,2.278546,0.4,0.393118,0.422449,0.44062,0.021576,0.091149,115.746367,127.854582,0.062787
4,5,0.050004,0.3682875,2.223508,2.267538,0.412245,0.376411,0.420408,0.427778,0.022237,0.113386,122.350848,126.753835,0.077808
5,6,0.100008,0.3115652,1.906493,2.087016,0.353469,0.337674,0.386939,0.382726,0.095332,0.208719,90.649341,108.701588,0.133453
6,7,0.150012,0.2764154,1.611493,1.928508,0.298776,0.292662,0.357551,0.352705,0.080581,0.2893,61.149327,92.850834,0.17099
7,8,0.200016,0.2495315,1.382538,1.792016,0.256327,0.261916,0.332245,0.330008,0.069133,0.358432,38.253795,79.201574,0.194472
8,9,0.300024,0.2123237,1.397948,1.66066,0.259184,0.229557,0.307891,0.296524,0.139806,0.498239,39.79484,66.065996,0.243328
9,10,0.399992,0.1846307,1.125423,1.526892,0.208657,0.197912,0.28309,0.271879,0.112506,0.610744,12.542256,52.689158,0.25872




Scoring History: 


,,timestamp,duration,iterations,negative_log_likelihood,objective,training_rmse,training_logloss,training_r2,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_r2,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
0,,2022-02-28 19:11:50,0.000 sec,0,54469.818582,0.47403,,,,,,,,,,,,,,
1,,2022-02-28 19:11:51,0.504 sec,1,51474.873626,0.448341,,,,,,,,,,,,,,
2,,2022-02-28 19:11:51,0.587 sec,2,51227.197551,0.446199,,,,,,,,,,,,,,
3,,2022-02-28 19:11:51,0.649 sec,3,51218.459867,0.44613,,,,,,,,,,,,,,
4,,2022-02-28 19:11:51,0.712 sec,4,51218.457018,0.44613,,,,,,,,,,,,,,
5,,2022-02-28 19:11:51,0.937 sec,5,51218.433266,0.44613,0.373962,0.445734,0.059619,0.674146,0.312931,2.727179,0.351499,0.3768,0.450978,0.059927,0.675435,0.316372,2.641792,0.346518



Variable Importances: 


,variable,relative_importance,scaled_importance,percentage
0,purpose.small_business,0.66414,1.0,0.077961
1,term.36 months,0.432335,0.65097,0.05075
2,purpose.credit_card,0.428948,0.64587,0.050353
3,annual_inc,0.401408,0.604402,0.04712
4,purpose.car,0.391287,0.589163,0.045932
5,addr_state.CO,0.315671,0.475307,0.037056
6,addr_state.TN,0.310776,0.467937,0.036481
7,term.60 months,0.302388,0.455308,0.035496
8,addr_state.DC,0.298106,0.448861,0.034994
9,addr_state.WV,0.296639,0.446652,0.034821



See the whole table with table.as_data_frame()
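
The headline numbers in the training output can be cross-checked from the training confusion matrix alone. The sketch below recomputes precision, recall, F1, and the overall error rate from the four counts reported there. The results should agree with the reported `max f1` value (0.379188) and the total error (0.3515).

```python
# Counts from the training confusion matrix (rows: actual, columns: predicted).
tn, fp = 62183, 31841   # actual 0: predicted 0, predicted 1
fn, tp = 8549, 12335    # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
error_rate = (fp + fn) / (tn + fp + fn + tp)

print(f"precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.6f} error={error_rate:.4f}")
```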




#### 4. View information, summary, model artifacts, and model performance of experiment

Let's see the performance of the GLM that was just trained.

In [59]:
glm_perf1 = glm_fit1.model_performance(test)

Instead of printing the entire model performance metrics object, it is probably easier to print just the metric that you are interested in comparing. Here we compare the test AUC to the training and validation AUC.

In [60]:
print(glm_perf1.auc())

0.6774747329108557


In [61]:
print(glm_fit1.auc(train=True))
print(glm_fit1.auc(valid=True))

0.6741464380849886
0.6754349275583739
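
The three AUC values are close, which suggests the model is not overfitting the training data. The sketch below makes that comparison explicit using the values printed above, and also derives the Gini coefficient via Gini = 2 * AUC - 1, which matches the Gini values shown in the training output.

```python
# AUC values copied from the notebook outputs.
auc = {
    "train": 0.6741464380849886,
    "valid": 0.6754349275583739,
    "test": 0.6774747329108557,
}

# Gini coefficient derived from AUC: Gini = 2 * AUC - 1.
gini = {name: 2 * value - 1 for name, value in auc.items()}

# Largest difference between any two of the three AUC values.
spread = max(auc.values()) - min(auc.values())

for name in auc:
    print(f"{name}: AUC={auc[name]:.4f} Gini={gini[name]:.4f}")
print(f"largest AUC gap: {spread:.4f}")
```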


### Pause our instance

In [62]:
cluster.stop()

<h2osteam.clients.h2ok8s.h2ok8s.H2oKubernetesCluster at 0x7f83191dda00>

### Delete the instance

In [63]:
cluster.terminate()

<h2osteam.clients.h2ok8s.h2ok8s.H2oKubernetesCluster at 0x7f83191dda00>
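
To avoid leaving a cluster running when a cell fails midway, the stop call can be tied to a `with` block. This is a minimal sketch: `stopped_on_exit` is a hypothetical helper, and it assumes only that the object has a `stop()` method, as the cluster above does.

```python
from contextlib import contextmanager


@contextmanager
def stopped_on_exit(cluster):
    """Yield the cluster and call cluster.stop() afterwards, even on errors."""
    try:
        yield cluster
    finally:
        cluster.stop()


# Usage with the cluster object from the notebook (not executed here):
# with stopped_on_exit(cluster) as c:
#     c.connect()
#     ...  # training and scoring
```

Swapping `stop()` for `terminate()` in the helper would delete the cluster instead of pausing it.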