# Steam to Driverless AI
This notebook is a getting-started tutorial on how to securely connect to an instance of the H2O AI Cloud from a local workstation and then accomplish common tasks using Driverless AI.

##### For more Driverless AI tutorials with more data science detail, go to: https://github.com/h2oai/driverlessai-tutorials/tree/master/dai_python_client

# Notebook Setup
This tutorial relies on the latest Steam SDK, which can be installed into a Python environment using `pip install https://enterprise-steam.s3.amazonaws.com/release/1.8.11/python/h2osteam-1.8.11-py2.py3-none-any.whl`.

This notebook was built on Steam version 1.8.9. If you are using a different version, there might be some differences in the code.

In [3]:
import os
import getpass
import h2osteam
import h2o_mlops_client as mlops
from h2osteam.clients import DriverlessClient

## Table of Contents
<div class="toc"><ul class="toc-item"><li><span><a href="#Notebook-Setup" data-toc-modified-id="Notebook-Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Notebook Setup</a></span></li><li><span><a href="#Securely-Connect" data-toc-modified-id="Securely-Connect-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Securely Connect</a></span></li><li><span><a href="#AI-Engines" data-toc-modified-id="AI-Engines-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>AI Engines</a></span><ul class="toc-item"><li><span><a href="#List-all-Driverless-AI-instances-I-own" data-toc-modified-id="List-all-engines-I-own-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>List all Driverless AI instances I own</a></span></li><li><span><a href="#Create-new-instance" data-toc-modified-id="Create-new-instance"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Create new instance</a></span></li><li><span><a href="#Add-a-dataset" data-toc-modified-id="add-a-dataset-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Add a dataset</a></span></li><li><span><a href="#Run-an-experiment" data-toc-modified-id="Run-an-experiment-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Run an experiment</a></span></li><li><span><a href="#Pause-the-instance" data-toc-modified-id="Pause-the-instance-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Pause the instance</a></span></li><li><span><a href="#Resume-the-instance" data-toc-modified-id="Resume-the-instance-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Resume the instance</a></span></li><li><span><a href="#Delete-the-instance" data-toc-modified-id="Delete-the-instance-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span>Delete the instance</a></span></li></ul></li></ul></div>

## Securely Connect

To get your personal access token, log in to the Steam instance you would like to test with, click Configuration, and then Token. Copy the token and paste it when the code prompts for it.

In [6]:
# Despite its name, this variable holds the URL of the page where you obtain your token
refresh_token = 'https://cloud-internal.h2o.ai/auth/get-platform-token'

In [7]:
print('Click link to get personalized password:', refresh_token)

# getpass hides the token as you paste it, so it is not stored in the notebook
tp = mlops.TokenProvider(
    token_endpoint_url='https://auth.demo.h2o.ai/auth/realms/q8s-internal/protocol/openid-connect/token',
    client_id='q8s-internal-platform',
    refresh_token=getpass.getpass()
)

Click link to get personalized password: https://cloud-internal.h2o.ai/auth/get-platform-token
········


In [8]:
steam = h2osteam.login(
    url="https://steam.cloud-internal.h2o.ai/",
    access_token=tp.ensure_fresh_token(),
)
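
The notebook imports `os` alongside `getpass`, which suggests one possible convenience: read the token from an environment variable and only prompt when it is missing. The following is a minimal sketch, not part of the Steam or MLOps API. The variable name `H2O_REFRESH_TOKEN` and the helper `read_refresh_token` are hypothetical names chosen for this example.

```python
import getpass
import os


def read_refresh_token(env_var="H2O_REFRESH_TOKEN", prompt=getpass.getpass):
    """Return the token from the environment, or prompt for it interactively.

    `env_var` is a hypothetical name picked for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    # Fall back to a hidden prompt so the token never shows up in the notebook.
    return prompt("Refresh token: ").strip()
```

If you use something like this, the returned value could be passed as `refresh_token=` to `mlops.TokenProvider(...)` in place of the direct `getpass.getpass()` call.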

## AI Engines

An AI Engine is a tool that helps you build an AI system or machine learning model. These tools help automate tasks that are repetitive and often difficult for a human to carry out. In this notebook, we are specifically looking at Driverless AI.

First, let's check whether your Steam client version matches the server version.

Server version of Steam:

In [9]:
print(h2osteam.api().get_config_meta())

{'version': '1.8.11', 'build': '1.8.10-30-gaf0c13e-dirty', 'built': 'Thu Feb  3 00:05:06 UTC 2022', 'restart_pending': True, 'support_email': 'cloud-feedback@h2o.ai', 'license_valid': True, 'is_hadoop_enabled': False, 'is_kubernetes_enabled': True, 'is_h2o_enabled': True, 'is_h2o_running': True, 'is_sparkling_enabled': False, 'is_sparkling_running': False, 'is_driverless_enabled': True, 'is_driverless_running': True, 'is_h2o_engine_uploaded': False, 'is_h2o_kubernetes_engine_uploaded': True, 'is_sparkling_engine_uploaded': False, 'is_driverless_engine_uploaded': True, 'driverless_backend_type': 'kubernetes', 'is_minio_enabled': False, 'h2o_backend_type': 'kubernetes', 'inside_cluster': True}


User version of Steam:

In [10]:
print(h2osteam.__version__)

1.8.11
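
If you prefer a programmatic check over comparing the two printouts by eye, something like the sketch below could work. It assumes the versions are purely numeric dotted strings, as in the outputs above. The helper names `parse_version` and `versions_match` are made up for this example. In the notebook you would pass `h2osteam.api().get_config_meta()` and `h2osteam.__version__` instead of the copied values.

```python
def parse_version(version):
    """Turn a dotted version string such as '1.8.11' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def versions_match(server_meta, client_version):
    """Compare the server's 'version' field with the client version string."""
    return parse_version(server_meta["version"]) == parse_version(client_version)


# Values copied from the two outputs above.
server_meta = {"version": "1.8.11", "build": "1.8.10-30-gaf0c13e-dirty"}
print(versions_match(server_meta, "1.8.11"))  # True
```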


### List all Driverless AI instances I own

The following code lists all the Driverless AI instances available to you on Steam, along with their attributes. Note that if you have not created an instance yet, the result will be empty. Run this code again after you have created an instance to see it!

In [11]:
dai_steam_instances = {}

# Index each Driverless AI instance by its name for easy lookup
for dai_details in steam.get_driverless_instances():
    dai_steam_instances[dai_details["name"]] = dai_details

dai_steam_instances

{}
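
Once the dictionary has entries, you may want to pick out instances in a particular state. Here is a small sketch that filters on the `"status"` field read above. The sample data and the status strings `"running"` and `"stopped"` are assumptions for illustration. Check the values your own Steam server returns.

```python
def instances_with_status(instances, status):
    """Return the sorted names of instances whose 'status' equals `status`."""
    return sorted(
        name
        for name, details in instances.items()
        if details.get("status") == status
    )


# Hypothetical sample shaped like `dai_steam_instances`; real entries carry more fields.
sample = {
    "test-instance": {"name": "test-instance", "status": "running"},
    "old-instance": {"name": "old-instance", "status": "stopped"},
}
print(instances_with_status(sample, "running"))  # ['test-instance']
```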

### Create new instance

If you want to use a different version of Driverless AI, run the following line to see which versions are available:

In [12]:
h2osteam.api().get_driverless_engines()

[{'version': '1.9.0.6', 'major': 1, 'minor': 9, 'patch': 0, 'fix': 6},
 {'version': '1.9.1.1', 'major': 1, 'minor': 9, 'patch': 1, 'fix': 1},
 {'version': '1.9.1.3', 'major': 1, 'minor': 9, 'patch': 1, 'fix': 3},
 {'version': '1.9.2.1', 'major': 1, 'minor': 9, 'patch': 2, 'fix': 1},
 {'version': '1.9.3', 'major': 1, 'minor': 9, 'patch': 3, 'fix': 0},
 {'version': '1.10.0', 'major': 1, 'minor': 10, 'patch': 0, 'fix': 0},
 {'version': '1.10.1', 'major': 1, 'minor': 10, 'patch': 1, 'fix': 0},
 {'version': '1.10.1.1', 'major': 1, 'minor': 10, 'patch': 1, 'fix': 1},
 {'version': '1.10.1.2', 'major': 1, 'minor': 10, 'patch': 1, 'fix': 2},
 {'version': '1.10.1.3', 'major': 1, 'minor': 10, 'patch': 1, 'fix': 3}]
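
If you want the newest engine without reading the list by eye, you can compare the numeric fields rather than the version strings. The sketch below uses a few entries copied from the output above. The helper name `newest_engine` is made up for this example.

```python
def newest_engine(engines):
    """Return the version string of the engine with the highest numeric version."""
    newest = max(
        engines,
        key=lambda e: (e["major"], e["minor"], e["patch"], e["fix"]),
    )
    return newest["version"]


# A few entries copied from the output above.
engines = [
    {"version": "1.9.0.6", "major": 1, "minor": 9, "patch": 0, "fix": 6},
    {"version": "1.9.3", "major": 1, "minor": 9, "patch": 3, "fix": 0},
    {"version": "1.10.0", "major": 1, "minor": 10, "patch": 0, "fix": 0},
    {"version": "1.10.1.3", "major": 1, "minor": 10, "patch": 1, "fix": 3},
]
print(newest_engine(engines))  # 1.10.1.3
```

Comparing the strings directly would give the wrong answer here: `max(e["version"] for e in engines)` returns `'1.9.3'`, because `'9'` sorts after `'1'` character by character.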

If you would like to use a different profile, run the following line to see what profiles are available:

In [13]:
h2osteam.print_profiles()

===
Profile name: default-h2o
Profile type: h2o
Number of nodes: MIN=1 MAX=10
CPUs per node: MIN=0 MAX=0
YARN virtual cores: MIN=0 MAX=0
Node memory [GB]: MIN=4 MAX=32
Extra node memory [%]: MIN=10 MAX=200
Max idle time [hrs]: MIN=1 MAX=24
Max uptime [hrs]: MIN=1 MAX=24
YARN queues: 
Start timeout [s]: MIN=60 MAX=600
===
Profile name: default-sparkling-internal
Profile type: sparkling_internal
Driver cores: MIN=1 MAX=8
Driver memory [GB]: MIN=4 MAX=32
Number of executors: MIN=1 MAX=10
Executor cores: MIN=1 MAX=8
Executor memory [GB]: MIN=4 MAX=32
H2O threads per node: MIN=0 MAX=0
Extra node memory [%]: MIN=0 MAX=200
Max idle time [hrs]: MIN=1 MAX=24
Max uptime [hrs]: MIN=1 MAX=24
YARN queues: 
Start timeout: MIN=60 MAX=600
===
Profile name: default-sparkling-external
Profile type: sparkling_external
Driver cores: MIN=1 MAX=8
Driver memory [GB]: MIN=4 MAX=32
Number of executors: MIN=1 MAX=10
Executor cores: MIN=1 MAX=8
Executor memory [GB]: MIN=4 MAX=32
H2O nodes: MIN=1 MAX=10
H2O CPUs 

This example shows how to create an instance of Driverless AI v1.10.1.3 and connect to it.

In [14]:
instance = DriverlessClient().launch_instance(name="test-instance",
                                              version="1.10.1.3",
                                              profile_name="default-driverless-kubernetes")
client = instance.connect()

Driverless AI instance is submitted, please wait...
Driverless AI instance is running


If you want to interact with the UI, you can use this link!

In [15]:
client.server.gui()

### Add a dataset

To create a dataset on the Driverless AI server using data from a URL, include the URL and the name of the connector used for data transfer, and set the dataset name.

In [16]:
telco_churn = client.datasets.create(
    data="https://h2o-internal-release.s3-us-west-2.amazonaws.com/data/Splunk/churn.csv",
    data_source="s3",
    name="Telco_Churn",
)

Complete 100.00% - [4/4] Computed stats for column Churn?


To create a dataset on the Driverless AI server using data on your local machine, you first need the file at a path on that machine. For this example, we download the dataset we just created to a local folder:

In [17]:
download_location = '/Users/admin/Downloads/'

In [18]:
local_file_path = telco_churn.download(download_location, overwrite=True)

Downloaded '/Users/admin/Downloads/churn.csv.1646077902.4344351.csv'


Then set the dataset name and create the dataset on the Driverless AI server:

In [19]:
telco_churn2 = client.datasets.create(local_file_path, name="Telco_Churn_Duplicate")

Complete 100.00% - [4/4] Computed stats for column Churn?


To change the dataset name, simply run this command:

In [20]:
print("Old Name:", telco_churn2.name)

telco_churn2.rename("Telco Churn New Name")

print("New Name:", telco_churn2.name)

Old Name: Telco_Churn_Duplicate
New Name: Telco Churn New Name


### Run an experiment
#### 1. First split the dataset for training and testing

In [21]:
telco_churn_split = telco_churn.split_to_train_test(
    train_size=0.8, 
    train_name='telco_churn_train', 
    test_name='telco_churn_test', 
    target_column="Churn?",
    seed=42
)

Complete


In [22]:
telco_churn_split

{'train_dataset': <class 'Dataset'> ecc456ce-98cf-11ec-8f6a-e670ca277224 telco_churn_train,
 'test_dataset': <class 'Dataset'> ecc479ce-98cf-11ec-8f6a-e670ca277224 telco_churn_test}
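
As a quick sanity check on `train_size=0.8`, you can work out the row counts you would expect. The training and test sets in this notebook end up with 2,666 and 667 rows, as the experiment preview and summary report. The sketch below assumes the training share is simply rounded down. How Driverless AI actually rounds, especially with a stratified split, is an assumption here.

```python
def expected_split_sizes(total_rows, train_size):
    """Return (train_rows, test_rows), rounding the training share down."""
    train_rows = int(total_rows * train_size)
    return train_rows, total_rows - train_rows


total_rows = 2666 + 667  # train + test rows reported elsewhere in this notebook
print(expected_split_sizes(total_rows, 0.8))  # (2666, 667)
```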

#### 2. Set up the experiment's settings (e.g., accuracy, time, target column)

You can list all existing experiments using:

In [23]:
[e.name for e in client.experiments.list()]

[]

You might want to run several experiments with different dial and expert settings. All of these will likely have some things in common, namely details about this specific dataset. We will create a dictionary to use in many experiments.

In [24]:
telco_settings = {
    **telco_churn_split,
    'task': 'classification',
    'target_column': "Churn?", 
    'scorer': 'F1'
}

In [25]:
client.experiments.preview( # Get experiment preview with our settings
    **telco_settings
)

ACCURACY [7/10]:
- Training data size: *2,666 rows, 20 cols*
- Feature evolution: *[Constant, LightGBM, XGBoostGBM]*, *3-fold CV**, 2 reps*
- Final pipeline: *Blend of up to 2 [Constant, LightGBM, XGBoostGBM] models, each averaged across 3-fold CV splits*

TIME [2/10]:
- Feature evolution: *8 individuals*, up to *10 iterations*
- Early stopping: After *5* iterations of no improvement

INTERPRETABILITY [8/10]:
- Feature pre-pruning strategy: Permutation Importance FS
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, Frequent, Interactions, NumCatTE, NumToCatWoEMonotonic, NumToCatWoE, Original, Text, WeightOfEvidence]
- Pre-trained PyTorch NLP models (with fine-tuning): ['disabled']

[Constant, LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *48*
- Feature evolution: *288*
- Final pipeline: *6*

Estimated runtime: *15 minutes*
Estimated mojo_size: *2.6MB*
Auto-click Finish/Abort if not done in: *1 day*/*7 days*


There may be several common types of experiments you want to run, and H2O.ai will be creating common experiment settings in dictionaries for easy use. The one below turns off all extra settings such as building pipelines or checking for leakage. It also uses the fastest experiment settings.

In [26]:
fast_settings = {
    'accuracy': 1,
    'time': 1,
    'interpretability': 6,
    'make_python_scoring_pipeline': 'off',
    'make_mojo_scoring_pipeline': 'off',
    'benchmark_mojo_latency': 'off',
    'make_autoreport': False,
    'check_leakage': 'off',
    'check_distribution_shift': 'off'
}
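
Both dictionaries are meant to be combined with `**` unpacking. The plain-Python sketch below shows how the merge behaves. The dataset entries are placeholder strings (in the notebook they are `Dataset` objects), and the `'AUC'` override is only there to show that later keys win.

```python
# Stand-in values: in the notebook the two dataset entries are Dataset objects.
telco_settings = {
    "train_dataset": "telco_churn_train",
    "test_dataset": "telco_churn_test",
    "task": "classification",
    "target_column": "Churn?",
    "scorer": "F1",
}
fast_settings = {"accuracy": 1, "time": 1, "interpretability": 6}

# Later entries win, so the per-experiment override replaces the shared scorer.
experiment_kwargs = {
    **telco_settings,
    **fast_settings,
    "scorer": "AUC",
    "name": "Fastest Settings",
}

print(experiment_kwargs["scorer"])  # AUC
print(telco_settings["scorer"])     # F1 (the shared dictionary is untouched)
```

Building a fresh dictionary per experiment keeps the shared settings unchanged, so you can reuse them for the next run.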

#### 3. Launch experiment

In [27]:
# Launches one experiment; to run the fast settings instead, swap which of the two `name=...` lines is commented out
default_baseline = client.experiments.create_async(
    **telco_settings,
    # name='Fastest Settings', **fast_settings,
    name='Default Baseline', accuracy=7, time=2, interpretability=8
)

Experiment launched at: https://steam.cloud-internal.h2o.ai:443/proxy/driverless/331/#/experiment?key=01c6ee7e-98d0-11ec-8f6a-e670ca277224


#### 4. View information, summary, model artifacts, and model performance of experiment

In [28]:
#Prints information on experiment

print("Name:", default_baseline.name)
print("Datasets:", default_baseline.datasets)
print("Target:", default_baseline.settings['target_column']) # beta users from before March 15th use target_col
print("Scorer:", default_baseline.metrics()['scorer'])
print("Task:", default_baseline.settings['task'])
print("Status:", default_baseline.status(verbose=2))
print("Web Page: ", end='')
default_baseline.gui()

Name: Default Baseline
Datasets: {'train_dataset': <class 'Dataset'> ecc456ce-98cf-11ec-8f6a-e670ca277224 telco_churn_train, 'validation_dataset': None, 'test_dataset': <class 'Dataset'> ecc479ce-98cf-11ec-8f6a-e670ca277224 telco_churn_test}
Target: Churn?
Scorer: F1
Task: classification
Status: Complete 100.00% - Status: Complete
Web Page: 

In [29]:
#view experiment summary

default_baseline.summary() 

Status: Complete
Experiment: Default Baseline (01c6ee7e-98d0-11ec-8f6a-e670ca277224)
  Version: 1.10.1.2, 2022-02-28 19:58
  Settings: 7/2/8, seed=393800133, GPUs disabled
  Train data: telco_churn_train (2666, 21)
  Validation data: N/A
  Test data: [Test] (667, 20)
  Target column: Churn? (binary, 14.479% target class)
System specs: Docker/Linux, 28 GB, 32 CPU cores, 0/0 GPU
  Max memory usage: 0.906 GB, 0 GB GPU
Recipe: AutoDL (14 iterations, 8 individuals)
  Validation scheme: stratified, 6 internal holdouts (3-fold CV)
  Feature engineering: 40 features scored (19 selected)
Timing: MOJO latency 0.0823 millis (1.7MB), Python latency 101.0092 millis (1.2MB)
  Data preparation: 10.65 secs
  Shift/Leakage detection: 3.66 secs
  Model and feature tuning: 70.01 secs (55 models trained)
  Feature evolution: 147.06 secs (126 of 288 models trained)
  Final pipeline training: 29.68 secs (6 models trained)
  Python / MOJO scorer building: 38.09 secs / 22.29 secs
Validation score: F1 = 0.2529
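
To see where the time went, you can add up the phase timings from the summary above. This is a rough sketch with the numbers copied by hand. It assumes the phases ran one after another, which may not hold (for example, the two scorers might be built in parallel).

```python
# Phase timings in seconds, copied from the summary above.
phase_seconds = {
    "Data preparation": 10.65,
    "Shift/Leakage detection": 3.66,
    "Model and feature tuning": 70.01,
    "Feature evolution": 147.06,
    "Final pipeline training": 29.68,
    "Python scorer building": 38.09,
    "MOJO scorer building": 22.29,
}

total = sum(phase_seconds.values())
longest = max(phase_seconds, key=phase_seconds.get)
print(f"{total / 60:.1f} minutes in total; longest phase: {longest}")
# 5.4 minutes in total; longest phase: Feature evolution
```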

In [30]:
#see what model artifacts are available

print("Available artifacts:", default_baseline.artifacts.list()) 

Available artifacts: ['autodoc', 'logs', 'mojo_pipeline', 'python_pipeline', 'summary', 'test_predictions', 'train_predictions']


In [31]:
#generate autodoc

default_baseline.artifacts.create('autoreport') 

Generating autodoc...


In [32]:
#download autodoc

artifacts = default_baseline.artifacts.download(['autoreport'], download_location, overwrite=True) 

Downloaded '/Users/admin/Downloads/report.docx'


In [33]:
#OSX - open autodoc on MacOS

!open -a "Microsoft Word" {artifacts["autoreport"]} 

In [34]:
#view final model performance

default_baseline.metrics() 

{'scorer': 'F1',
 'val_score': 0.770317916991267,
 'val_score_sd': 0.02596585550001304,
 'val_roc_auc': 0.9067638329415454,
 'val_pr_auc': 0.7882773719648645,
 'test_score': 0.7608695652173914,
 'test_score_sd': 0.03247189403885633,
 'test_roc_auc': 0.9001266051727257,
 'test_pr_auc': 0.7703275526356188}
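
One simple way to read these numbers is to compare the validation and test scores and see whether the gap is larger than the reported standard deviation. The sketch below does this with values copied from the output above. Using one test standard deviation as the threshold is a rule of thumb assumed for this example, not something Driverless AI prescribes.

```python
# Values copied from the metrics output above.
metrics = {
    "scorer": "F1",
    "val_score": 0.770317916991267,
    "val_score_sd": 0.02596585550001304,
    "test_score": 0.7608695652173914,
    "test_score_sd": 0.03247189403885633,
}

gap = metrics["val_score"] - metrics["test_score"]
within_noise = abs(gap) <= metrics["test_score_sd"]
print(f"gap={gap:.4f}, within one test SD: {within_noise}")
# gap=0.0094, within one test SD: True
```

In the notebook, you would use `default_baseline.metrics()` directly instead of the copied dictionary.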

### Pause the instance

You can pause an instance that is currently running. Pausing an instance shuts it down, similar to powering off a server. You will not lose any data, and you can start the instance again at any time.

In [35]:
instance.stop()

Driverless AI instance is stopping, please wait...
Driverless AI instance is stopped


### Resume the instance

You can resume a paused instance by simply running:

In [36]:
instance.start()

Driverless AI instance is starting, please wait...
Driverless AI instance is running


### Delete the instance

When you no longer need an instance, you can terminate it. Once deleted, there is no way to restart the instance or access any data.

In [37]:
instance.terminate()
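
For throwaway instances, you might want the cleanup to happen even if something fails along the way. One way to sketch this is a small context manager that calls `terminate()` on exit. The names `terminating` and `FakeInstance` are made up for this example. `FakeInstance` only stands in for a real instance so the sketch runs without a Steam server. Since termination cannot be undone, only use a pattern like this when you really do not need the instance or its data afterwards.

```python
from contextlib import contextmanager


@contextmanager
def terminating(instance):
    """Yield `instance` and call its terminate() method on exit, even on error."""
    try:
        yield instance
    finally:
        instance.terminate()


class FakeInstance:
    """Hypothetical stand-in so the sketch runs without a Steam server."""

    def __init__(self):
        self.terminated = False

    def terminate(self):
        self.terminated = True


fake = FakeInstance()
with terminating(fake) as inst:
    pass  # connect and run experiments here
print(fake.terminated)  # True
```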