# Steam to Driverless AI
This notebook is a getting-started tutorial that shows how to securely connect to an instance of the H2O AI Cloud from a local workstation and then carry out common tasks using Driverless AI.

##### For more Driverless AI tutorials with more data science detail, go to: 'https://github.com/h2oai/driverlessai-tutorials/tree/master/dai_python_client'

## Notebook Setup
This tutorial relies on the Steam SDK (version 1.8.9 is used here), which can be installed into a Python environment using `pip install https://enterprise-steam.s3.amazonaws.com/release/1.8.9/python/h2osteam-1.8.9-py2.py3-none-any.whl`.

In [226]:
import os
import getpass
import h2osteam
from h2osteam.clients import DriverlessClient

## Table of Contents
<div class="toc"><ul class="toc-item"><li><span><a href="#Notebook-Setup" data-toc-modified-id="Notebook-Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Notebook Setup</a></span></li><li><span><a href="#Securely-Connect" data-toc-modified-id="Securely-Connect-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Securely Connect</a></span></li><li><span><a href="#AI-Engines" data-toc-modified-id="AI-Engines-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>AI Engines</a></span><ul class="toc-item"><li><span><a href="#List-all-engines-I-own" data-toc-modified-id="List-all-engines-I-own-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>List all engines I own</a></span></li></ul></li><li><span><a href="#DAI-Instances" data-toc-modified-id="DAI-Instances-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Driverless AI Instances</a></span><ul class="toc-item"><li><span><a href="#Create-new-instance" data-toc-modified-id="Create-new-instance"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Create new instance</a></span></li><li><span><a href="#add-a-dataset" data-toc-modified-id="add-a-dataset-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Add a dataset</a></span></li><li><span><a href="#Run-an-experiment" data-toc-modified-id="Run-an-experiment-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Run an experiment</a></span></li><li><span><a href="#Pause-our-instance" data-toc-modified-id="Pause-our-instance-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Pause our instance</a></span></li><li><span><a href="#Resume-the-instance" data-toc-modified-id="Resume-the-instance-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Resume the instance</a></span></li><li><span><a href="#Delete-the-instance" data-toc-modified-id="Delete-the-instance-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Delete the instance</a></span></li></ul></li></ul></div>

## Securely Connect

To get a personal access token, log in to the Steam instance you would like to test with, click on configuration, and then on token. Copy the token and paste it at the prompt that appears when the cell below runs.

In [227]:
steam = h2osteam.login(
    url="https://steam.cloud-dev.h2o.ai",
    password=getpass.getpass("Enter your Steam token:"),
    verify_ssl=True,
)

Enter your Steam token: ········································
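For unattended runs, the interactive prompt can be skipped when the token is already available in the environment. The sketch below is one possible way to do that; the variable name `STEAM_TOKEN` and the helper `read_steam_token` are assumptions made for illustration and are not part of the Steam SDK.

```python
import getpass
import os


def read_steam_token(env=None, prompt=getpass.getpass):
    """Return the token from the environment, or prompt for it if unset.

    STEAM_TOKEN is a hypothetical variable name chosen for this example.
    """
    env = os.environ if env is None else env
    token = env.get("STEAM_TOKEN")
    if token:
        return token
    # Fall back to the same hidden prompt used in the login cell.
    return prompt("Enter your Steam token:")
```

The returned value can be passed as the `password` argument of `h2osteam.login(...)` in place of the direct `getpass.getpass` call.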


## AI Engines

### List all engines I own

The following code lists all available Driverless AI instances on Steam, along with their attributes.

In [233]:
dai_steam_instances = {}

# Index the details of each instance by its name
for dai_details in steam.get_driverless_instances():
    dai_steam_instances[dai_details["name"]] = dai_details

dai_steam_instances

{'test-instance': {'id': 99,
  'profile_name': 'default-driverless-kubernetes',
  'name': 'test-instance',
  'status': 'running',
  'target_status': 'running',
  'version': '1.10.0',
  'backend_type': '',
  'instance_type': 'singlenode',
  'master_id': -1,
  'cpu_count': 8,
  'gpu_count': 0,
  'memory_gb': 30,
  'storage_gb': 64,
  'max_idle_seconds': 7200,
  'max_uptime_seconds': 28800,
  'timeout_seconds': 600,
  'address': 'http://10.1.40.149:12345',
  'authentication': 'oidc',
  'password': '0ivpdwj5mb26aly2nlwdep8pb5s6u884',
  'created_at': 1636132064,
  'started_at': 1636132352,
  'created_by': 'jeffrey.canisius@h2o.ai',
  'current_uptime_seconds': 128,
  'current_idle_seconds': 128,
  'pod_latest_event': None,
  'service_latest_event': None,
  'pvc_latest_event': None,
  'stop_reason': '',
  'config_toml': ''}}
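The attributes above include both the configured limits and the current counters, so the time left before a limit is reached can be worked out by subtraction. The following is a minimal sketch that assumes the fields mean what their names suggest; the helper `time_remaining` is hypothetical, and the sample values are copied from the `test-instance` entry shown above.

```python
def time_remaining(details):
    """Return the seconds left before the uptime and idle limits are reached."""
    return {
        "uptime_left": details["max_uptime_seconds"] - details["current_uptime_seconds"],
        "idle_left": details["max_idle_seconds"] - details["current_idle_seconds"],
    }


# Values copied from the 'test-instance' entry in the output above.
sample = {
    "max_uptime_seconds": 28800,
    "current_uptime_seconds": 128,
    "max_idle_seconds": 7200,
    "current_idle_seconds": 128,
}

remaining = time_remaining(sample)
print(remaining)
```

In the notebook, the same helper could be applied to any entry of `dai_steam_instances`.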

## Driverless AI Instances

### Create new instance

This example shows how to create an instance of Driverless AI v1.10.0 and connect to it.

In [229]:
instance = DriverlessClient().launch_instance(name="test-instance",
                                              version="1.10.0",
                                              profile_name="default-driverless-kubernetes")
client = instance.connect()

Driverless AI instance is submitted, please wait...
Driverless AI instance is running


In [230]:
client.server.gui()

In case we want to use a different version of Driverless AI, run the following line to see what versions are available:

In [232]:
h2osteam.api().get_driverless_engines()

[{'version': '1.9.0.6', 'major': 1, 'minor': 9, 'patch': 0, 'fix': 6},
 {'version': '1.9.1.1', 'major': 1, 'minor': 9, 'patch': 1, 'fix': 1},
 {'version': '1.9.1.3', 'major': 1, 'minor': 9, 'patch': 1, 'fix': 3},
 {'version': '1.9.2.1', 'major': 1, 'minor': 9, 'patch': 2, 'fix': 1},
 {'version': '1.9.3', 'major': 1, 'minor': 9, 'patch': 3, 'fix': 0},
 {'version': '1.10.0', 'major': 1, 'minor': 10, 'patch': 0, 'fix': 0}]
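When picking the newest version from a list like this, comparing the `version` strings directly can mislead, because `'1.9.3'` sorts after `'1.10.0'` as text. A small sketch that compares the numeric fields instead is shown below; the helper `newest_engine` is hypothetical, and the sample list is a shortened copy of the output above.

```python
# A shortened copy of the engine list shown above.
engines = [
    {"version": "1.9.3", "major": 1, "minor": 9, "patch": 3, "fix": 0},
    {"version": "1.10.0", "major": 1, "minor": 10, "patch": 0, "fix": 0},
    {"version": "1.9.2.1", "major": 1, "minor": 9, "patch": 2, "fix": 1},
]


def newest_engine(engines):
    """Pick the engine with the highest (major, minor, patch, fix) tuple."""
    return max(engines, key=lambda e: (e["major"], e["minor"], e["patch"], e["fix"]))


latest = newest_engine(engines)["version"]

# Comparing the raw strings picks a different, older version.
latest_by_string = max(e["version"] for e in engines)

print(latest, latest_by_string)
```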

### Add a dataset

To create a dataset on the Driverless AI server from a URL, pass the URL, the name of the connector used for the data transfer, and a name for the dataset.

In [234]:
telco_churn = client.datasets.create(data="https://h2o-internal-release.s3-us-west-2.amazonaws.com/data/Splunk/churn.csv", 
                                  data_source="s3", 
                                  name="Telco_Churn")

Complete 100.00% - [4/4] Computed stats for column Churn?


To create a dataset on the Driverless AI server from data on your local machine, the file must first exist at a local path. Here we download the dataset created above to the local machine:

In [235]:
download_location = '/Users/admin/Downloads/'

In [236]:
local_file_path = telco_churn.download(download_location, overwrite=True)

Downloaded '/Users/admin/Downloads/churn.csv.1636132492.6540399.csv'


Then set the dataset name and create the dataset on the Driverless AI server from the downloaded file.

In [237]:
telco_churn2 = client.datasets.create(local_file_path, name="Telco_Churn_Duplicate")

Complete 100.00% - [4/4] Computed stats for column Churn?


To rename the dataset, run:

In [239]:
print("Old Name:", telco_churn2.name)

telco_churn2.rename("Telco Churn New Name")

print("New Name:", telco_churn2.name)

Old Name: Fancy New Name
New Name: Telco Churn New Name


### Run an experiment
#### 1. Split the dataset for training and testing

In [240]:
telco_churn_split = telco_churn.split_to_train_test(
    train_size=0.8, 
    train_name='telco_churn_train', 
    test_name='telco_churn_test', 
    target_column="Churn?",
    seed=42
)

Complete


In [241]:
telco_churn_split

{'train_dataset': <class 'Dataset'> 05d650ae-3e5d-11ec-8930-4ad205dfadcc telco_churn_train,
 'test_dataset': <class 'Dataset'> 05d690aa-3e5d-11ec-8930-4ad205dfadcc telco_churn_test}

#### 2. Set up the experiment's settings (e.g., accuracy, time, target column)

You can list all existing experiments using:

In [253]:
[e.name for e in client.experiments.list()]

['Default Baseline']

You might want to run several experiments with different dial and expert settings. These experiments will likely share the details that are specific to this dataset, so we store those details in a dictionary that can be reused across experiments.

In [242]:
telco_settings = {
    **telco_churn_split,
    'task': 'classification',
    'target_column': "Churn?", 
    'scorer': 'F1'
}

In [243]:
client.experiments.preview( # Get experiment preview with our settings
    **telco_settings
)

ACCURACY [7/10]:
- Training data size: *2,666 rows, 21 cols*
- Feature evolution: *[Constant, DecisionTree, LightGBM, XGBoostGBM]*, *3-fold CV**, 2 reps*
- Final pipeline: *Blend of up to 2 [Constant, DecisionTree, LightGBM, XGBoostGBM] models, each averaged across 3-fold CV splits*

TIME [2/10]:
- Feature evolution: *8 individuals*, up to *10 iterations*
- Early stopping: After *5* iterations of no improvement

INTERPRETABILITY [8/10]:
- Feature pre-pruning strategy: Permutation Importance FS
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, Frequent, Interactions, NumCatTE, NumToCatWoEMonotonic, NumToCatWoE, Original, Text, WeightOfEvidence]
- Pre-trained PyTorch NLP models (with fine-tuning): ['disabled']

[Constant, DecisionTree, LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *48*
- Feature evolution: *528*
- Final pipeline: *6*

Estimated runtime: *minutes*
Auto-click Finish/Abort if not done in: *1 day*/*7

There may be several common types of experiments you want to run, and H2O.ai will be creating dictionaries of common experiment settings for easy use. The one below turns off all extra steps, such as building scoring pipelines or checking for leakage, and uses the fastest experiment settings.

In [244]:
fast_settings = {
    'accuracy': 1,
    'time': 1,
    'interpretability': 6,
    'make_python_scoring_pipeline': 'off',
    'make_mojo_scoring_pipeline': 'off',
    'benchmark_mojo_latency': 'off',
    'make_autoreport': False,
    'check_leakage': 'off',
    'check_distribution_shift': 'off'
}
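Because both groups of settings are plain dictionaries, they can be combined with `**` unpacking before an experiment is launched. The sketch below illustrates only the merging behavior: placeholder strings stand in for the `Dataset` objects, and the names `dataset_settings`, `speed_settings`, and `fastest_run` are hypothetical.

```python
# Placeholder strings stand in for the Dataset objects returned by the split.
dataset_settings = {
    "train_dataset": "telco_churn_train",
    "test_dataset": "telco_churn_test",
    "task": "classification",
    "target_column": "Churn?",
    "scorer": "F1",
}
speed_settings = {"accuracy": 1, "time": 1, "interpretability": 6}

# When keys repeat, the later dictionary wins, so overrides go last.
fastest_run = {**dataset_settings, **speed_settings, "name": "Fastest Settings"}

print(sorted(fastest_run))
```

Keeping dataset details and speed settings in separate dictionaries means either one can be swapped without touching the other.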

#### 3. Launch Experiment

In [245]:
default_baseline = client.experiments.create_async(  # comment out the experiments you don't want to run
    **telco_settings,
    # name='Fastest Settings', **fast_settings,
    name='Default Baseline', accuracy=7, time=2, interpretability=8
)

Experiment launched at: https://steam.cloud-dev.h2o.ai:443/proxy/driverless/99/#/experiment?key=3d08c75a-3e5d-11ec-8930-4ad205dfadcc


#### 4. View information, summary, model artifacts, and model performance of experiment

In [246]:
# Print information about the experiment

print("Name:", default_baseline.name)
print("Datasets:", default_baseline.datasets)
print("Target:", default_baseline.settings['target_column']) # beta users from before March 15th use target_col
print("Scorer:", default_baseline.metrics()['scorer'])
print("Task:", default_baseline.settings['task'])
print("Status:", default_baseline.status(verbose=2))
print("Web Page: ", end='')
default_baseline.gui()

Name: Default Baseline
Datasets: {'train_dataset': <class 'Dataset'> 05d650ae-3e5d-11ec-8930-4ad205dfadcc telco_churn_train, 'validation_dataset': None, 'test_dataset': <class 'Dataset'> 05d690aa-3e5d-11ec-8930-4ad205dfadcc telco_churn_test}
Target: Churn?
Scorer: F1
Task: classification
Status: Complete 100.00% - Status: Complete
Web Page: 

In [247]:
#view experiment summary

default_baseline.summary() 

Status: Complete
Experiment: Default Baseline (3d08c75a-3e5d-11ec-8930-4ad205dfadcc)
  Version: 1.10.0, 2021-11-05 17:30
  Settings: 7/2/8, seed=771370521, GPUs disabled
  Train data: telco_churn_train (2666, 21)
  Validation data: N/A
  Test data: [Test] (667, 20)
  Target column: Churn? (binary, 14.479% target class)
System specs: Docker/Linux, 28 GB, 32 CPU cores, 0/0 GPU
  Max memory usage: 0.838 GB, 0 GB GPU
Recipe: AutoDL (18 iterations, 8 individuals)
  Validation scheme: stratified, 6 internal holdouts (3-fold CV)
  Feature engineering: 254 features scored (50 selected)
Timing: MOJO latency 0.1690 millis (1.5MB), Python latency 231.4950 millis (1.9MB)
  Data preparation: 6.09 secs
  Shift/Leakage detection: 2.55 secs
  Model and feature tuning: 102.87 secs (55 models trained)
  Feature evolution: 173.57 secs (222 of 528 models trained)
  Final pipeline training: 34.63 secs (6 models trained)
  Python / MOJO scorer building: 32.67 secs / 22.51 secs
Validation score: F1 = 0.25294

In [248]:
#see what model artifacts are available

print("Available artifacts:", default_baseline.artifacts.list()) 

Available artifacts: ['autodoc', 'logs', 'mojo_pipeline', 'python_pipeline', 'summary', 'test_predictions', 'train_predictions']


In [249]:
#generate autodoc

default_baseline.artifacts.create('autoreport') 

Generating autodoc...


In [250]:
#download autodoc

artifacts = default_baseline.artifacts.download(['autoreport'], download_location, overwrite=True) 

Downloaded '/Users/admin/Downloads/report.docx'


In [251]:
# macOS only: open the autodoc in Microsoft Word

!open -a "Microsoft Word" {artifacts["autoreport"]} 

In [252]:
#view final model performance

default_baseline.metrics() 

{'scorer': 'F1',
 'val_score': 0.7512781458128254,
 'val_score_sd': 0.02826029478157018,
 'val_roc_auc': 0.9100810971123915,
 'val_pr_auc': 0.7858128325514461,
 'test_score': 0.7333333333333333,
 'test_score_sd': 0.02826029478157018,
 'test_roc_auc': 0.8960752396455055,
 'test_pr_auc': 0.779115779079072}

### Pause our instance

In [254]:
instance.stop()

Driverless AI instance is stopping, please wait...
Driverless AI instance is stopped


### Resume the instance

In [255]:
instance.start()

Driverless AI instance is starting, please wait...
Driverless AI instance is running


### Delete the instance

In [256]:
instance.terminate()