# Steam to DAI
This notebook is a getting-started tutorial that shows how to securely connect to an instance of the H2O AI Cloud from a local workstation and then accomplish common tasks using Driverless AI.

## Notebook Setup
This tutorial relies on the Steam SDK (version 1.8.4 in this notebook), which can be installed into a Python environment using `pip install https://enterprise-steam.s3.amazonaws.com/release/1.8.4/python/h2osteam-1.8.4-py2.py3-none-any.whl`.

In [104]:
import os
import h2osteam
from h2osteam.clients import DriverlessClient

## Table of Contents
<div class="toc"><ul class="toc-item"><li><span><a href="#Notebook-Setup" data-toc-modified-id="Notebook-Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Notebook Setup</a></span></li><li><span><a href="#Securely-Connect" data-toc-modified-id="Securely-Connect-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Securely Connect</a></span></li><li><span><a href="#AI-Engines" data-toc-modified-id="AI-Engines-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>AI Engines</a></span><ul class="toc-item"><li><span><a href="#List-all-engines-I-own" data-toc-modified-id="List-all-engines-I-own-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>List all engines I own</a></span></li></ul></li><li><span><a href="#DAI-Instances" data-toc-modified-id="DAI-Instances-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Driverless AI Instances</a></span><ul class="toc-item"><li><span><a href="#Create-new-instance" data-toc-modified-id="Create-new-instance"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Create new instance</a></span></li><li><span><a href="#add-a-dataset" data-toc-modified-id="add-a-dataset-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Add a dataset</a></span></li><li><span><a href="#Run-an-experiment" data-toc-modified-id="Run-an-experiment-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Run an experiment</a></span></li><li><span><a href="#Pause-our-instance" data-toc-modified-id="Pause-our-instance-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Pause our instance</a></span></li><li><span><a href="#Resume-the-instance" data-toc-modified-id="Resume-the-instance-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Resume the instance</a></span></li><li><span><a href="#Delete-the-instance" data-toc-modified-id="Delete-the-instance-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Delete the instance</a></span></li></ul></li></ul></div>

## Securely Connect

To get a personal access token, log in to the Steam instance you would like to test with, click on configuration, and then on token. Copy and paste the token into the code below.

In [105]:
# Get Steam URL from the environment, use the DEMO url as a default if env is not set
steam_url = os.environ.get("STEAM_URL", "https://steam.demo.h2o.ai/")
secure_environment = steam_url.startswith("https://")

if os.getenv("STEAM_URL") is not None:  # running in the AI Cloud
    # NOTE: `q` is not defined in this notebook; it is presumably the request
    # context supplied by the hosting app when running in the AI Cloud.
    steam = h2osteam.login(
        url=steam_url,
        access_token=q.auth.access_token,
        verify_ssl=secure_environment,
    )
else:  # running locally
    personal_access_token = "pat_sppb0rmk15cu1tqi19v9pbj4hhzrhrmyn92o"  # <-- Insert token here
    steam = h2osteam.login(
        url=steam_url,
        password=personal_access_token,
        verify_ssl=secure_environment,
    )
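
Hard-coding a personal access token in a notebook makes it easy to leak by accident. One alternative, sketched below, is to read the token from an environment variable and fail early when it is missing. The variable name `STEAM_PAT` and the helper `resolve_token` are hypothetical and are not part of the Steam SDK.

```python
import os


def resolve_token(env=os.environ, var_name="STEAM_PAT"):
    """Return the token stored in `var_name`, or raise if it is missing.

    `STEAM_PAT` is a hypothetical variable name chosen for this sketch.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} to your Steam personal access token")
    return token


# Demonstrated with a plain dict standing in for os.environ
print(resolve_token({"STEAM_PAT": "pat_example"}))
```

The returned value could then be passed as `password=` in the `h2osteam.login` call above.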

## AI Engines

### List all AI Engines I own

The following code lists all available Driverless AI instances on Steam that are either running or stopped, along with their attributes.

In [106]:
dai_steam_instances = {}

if steam is not None:
    for dai_details in steam.get_driverless_instances():
        dai_name = dai_details["name"]
        dai_status = dai_details["status"]

        # Keep only instances that are running or can be resumed
        if dai_status in ("running", "stopped"):
            dai_steam_instances[dai_name] = dai_details

dai_steam_instances

{}
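
The filtering logic can be tried without a Steam connection by feeding it hand-written records. This is a minimal sketch: the sample records are hypothetical, and only the `name` and `status` keys are taken from the code above.

```python
def usable_instances(instances, statuses=("running", "stopped")):
    """Index instance records by name, keeping only the given statuses."""
    return {d["name"]: d for d in instances if d["status"] in statuses}


# Hypothetical records shaped like the dictionaries used above
sample = [
    {"name": "alpha", "status": "running"},
    {"name": "beta", "status": "stopped"},
    {"name": "gamma", "status": "failed"},
]
print(sorted(usable_instances(sample)))
```

An empty result, as in the output above, simply means that no instance matched the requested statuses.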

## Driverless AI Instances

### Create new instance
This example shows how to create an instance of Driverless AI v1.10.0.431 and connect to it.

In [107]:
instance = DriverlessClient().launch_instance(name="test-instance",
                                              version="1.10.0.431",
                                              profile_name="default-driverless-kubernetes")
client = instance.connect()

Driverless AI instance is submitted, please wait...
Driverless AI instance is running


In [108]:
client.server.gui()

### Add a dataset

In [109]:
telco_churn = client.datasets.create(
    data="https://h2o-internal-release.s3-us-west-2.amazonaws.com/data/Splunk/churn.csv",
    data_source="s3",
    name="Telco_Churn",
)

Complete 100.00% - [4/4] Computed stats for column Churn?


In [110]:
download_location = '/Users/admin/Downloads/'

In [111]:
local_file_path = telco_churn.download(download_location, overwrite=True)

Downloaded '/Users/admin/Downloads/churn.csv.1635874253.674064.csv'


In [113]:
telco_churn2 = client.datasets.create(local_file_path, name="Telco_Churn_Duplicate")

Complete 100.00% - [4/4] Computed stats for column Churn?


In [114]:
print("Old Name:", telco_churn2.name)

telco_churn2.rename("Fancy New Name")

print("New Name:", telco_churn2.name)

Old Name: Telco_Churn_Duplicate
New Name: Fancy New Name


### Run an experiment
#### 1. First, split the dataset for training and testing

In [115]:
telco_churn_split = telco_churn.split_to_train_test(
    train_size=0.8, 
    train_name='telco_churn_train', 
    test_name='telco_churn_test', 
    target_column= "Churn?",
    seed=42
)

Complete


In [116]:
telco_churn_split

{'train_dataset': <class 'Dataset'> a87891e8-3c03-11ec-80a7-3653d48b0dcd telco_churn_train,
 'test_dataset': <class 'Dataset'> a878b6d2-3c03-11ec-80a7-3653d48b0dcd telco_churn_test}
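
With `train_size=0.8`, about 80% of the rows go to training, and a fixed `seed` makes the split repeatable. The sketch below illustrates that idea on row indices in plain Python. It is not Driverless AI's implementation (which may, for example, take the target column into account), and the row count of 3,333 is inferred from the 2,666 training rows and 667 test rows reported in the experiment summary later in this notebook.

```python
import random


def split_indices(n_rows, train_size=0.8, seed=42):
    """Shuffle row indices with a fixed seed and cut them at train_size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_size)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(3333)
print(len(train_idx), len(test_idx))
```

Calling `split_indices` again with the same seed returns the same partition, which is what makes seeded splits reproducible.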

#### 2. Set up the experiment's settings (e.g., accuracy, time, target column)

##### You can list all existing experiments using:

In [117]:
[e.name for e in client.experiments.list()]

[]

##### You might want to run several experiments with different dial and expert settings. All of these will likely have some things in common, namely details about this specific dataset. We will create a dictionary to use in many experiments.

In [119]:
telco_settings = {
    **telco_churn_split,
    'task': 'classification',
    'target_column': "Churn?", 
    'scorer': 'F1'
}

In [120]:
client.experiments.preview( # Get experiment preview with our settings
    **telco_settings
)

ACCURACY [7/10]:
- Training data size: *2,666 rows, 21 cols*
- Feature evolution: *[Constant, DecisionTree, LightGBM, XGBoostGBM]*, *3-fold CV**, 2 reps*
- Final pipeline: *Blend of up to 2 [Constant, DecisionTree, LightGBM, XGBoostGBM] models, each averaged across 3-fold CV splits*

TIME [2/10]:
- Feature evolution: *8 individuals*, up to *10 iterations*
- Early stopping: After *5* iterations of no improvement

INTERPRETABILITY [8/10]:
- Feature pre-pruning strategy: Permutation Importance FS
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, Frequent, Interactions, NumCatTE, NumToCatWoEMonotonic, NumToCatWoE, Original, Text, WeightOfEvidence]
- Pre-trained PyTorch NLP models (with fine-tuning): ['disabled']

[Constant, DecisionTree, LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *48*
- Feature evolution: *528*
- Final pipeline: *6*

Estimated runtime: *minutes*
Auto-click Finish/Abort if not done in: *1 day*/*7

##### There may be several common types of experiments you want to run, and H2O.ai will be creating common experiment settings in dictionaries for easy use. The one below turns off all extra settings such as building pipelines or checking for leakage. It also uses the fastest experiment settings.

In [121]:
fast_settings = {
    'accuracy': 1,
    'time': 1,
    'interpretability': 6,
    'make_python_scoring_pipeline': 'off',
    'make_mojo_scoring_pipeline': 'off',
    'benchmark_mojo_latency': 'off',
    'make_autoreport': False,
    'check_leakage': 'off',
    'check_distribution_shift': 'off'
}
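
Because the settings are ordinary dictionaries, they can be combined with `**` unpacking before being passed to an experiment. The sketch below shows only the merging behaviour, with string placeholders standing in for the dataset objects. When a key appears more than once in a dictionary literal, the rightmost value wins.

```python
# Placeholders stand in for the real Dataset objects (assumption for this sketch)
shared = {
    "train_dataset": "telco_churn_train",
    "test_dataset": "telco_churn_test",
    "task": "classification",
    "target_column": "Churn?",
    "scorer": "F1",
}
fast = {"accuracy": 1, "time": 1, "interpretability": 6}

# Later entries override earlier ones, so one experiment can tweak a single dial
combined = {**shared, **fast, "name": "Fastest Settings", "time": 2}
print(combined["accuracy"], combined["time"], combined["scorer"])
```

Note that unpacking two dictionaries with overlapping keys directly into a function call raises a `TypeError`, so merging them into one dictionary first is the safer pattern.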

#### 3. Launch Experiment

In [122]:
default_baseline = client.experiments.create_async(  # comment out the settings you don't want to run
    **telco_settings, 
    #name='Fastest Settings', **fast_settings,
    name='Default Baseline', accuracy=7, time=2, interpretability=8
)

Experiment launched at: https://steam.demo.h2o.ai:443/proxy/driverless/853/#/experiment?key=df15e69c-3c03-11ec-80a7-3653d48b0dcd


#### 4. View information, summary, model artifacts, and model performance of experiment

In [125]:
# Print information about the experiment
print("Name:", default_baseline.name)
print("Datasets:", default_baseline.datasets)
print("Target:", default_baseline.settings['target_column']) # beta users from before March 15th use target_col
print("Scorer:", default_baseline.metrics()['scorer'])
print("Task:", default_baseline.settings['task'])
print("Status:", default_baseline.status(verbose=2))
print("Web Page: ", end='')
default_baseline.gui()

Name: Default Baseline
Datasets: {'train_dataset': <class 'Dataset'> a87891e8-3c03-11ec-80a7-3653d48b0dcd telco_churn_train, 'validation_dataset': None, 'test_dataset': <class 'Dataset'> a878b6d2-3c03-11ec-80a7-3653d48b0dcd telco_churn_test}
Target: Churn?
Scorer: F1
Task: classification
Status: Complete 100.00% - Status: Complete
Web Page: 

In [126]:
default_baseline.summary()  # view experiment summary

Status: Complete
Experiment: Default Baseline (df15e69c-3c03-11ec-80a7-3653d48b0dcd)
  Version: 1.10.0, 2021-11-02 17:46
  Settings: 7/2/8, seed=716918054, GPUs disabled
  Train data: telco_churn_train (2666, 21)
  Validation data: N/A
  Test data: [Test] (667, 20)
  Target column: Churn? (binary, 14.479% target class)
System specs: Docker/Linux, 28 GB, 32 CPU cores, 0/0 GPU
  Max memory usage: 0.778 GB, 0 GB GPU
Recipe: AutoDL (18 iterations, 8 individuals)
  Validation scheme: stratified, 6 internal holdouts (3-fold CV)
  Feature engineering: 248 features scored (40 selected)
Timing: MOJO latency 0.1236 millis (917.8kB), Python latency 160.1279 millis (1.6MB)
  Data preparation: 5.90 secs
  Shift/Leakage detection: 2.55 secs
  Model and feature tuning: 101.31 secs (55 models trained)
  Feature evolution: 193.93 secs (222 of 528 models trained)
  Final pipeline training: 26.26 secs (6 models trained)
  Python / MOJO scorer building: 36.34 secs / 18.35 secs
Validation score: F1 = 0.252

In [127]:
print("Available artifacts:", default_baseline.artifacts.list()) #see what model artifacts are available

Available artifacts: ['logs', 'mojo_pipeline', 'python_pipeline', 'summary', 'test_predictions', 'train_predictions']


In [128]:
default_baseline.artifacts.create('autoreport') #generate autodoc

Generating autodoc...


In [129]:
artifacts = default_baseline.artifacts.download(['autoreport'], download_location, overwrite=True) #download autodoc

Downloaded '/Users/admin/Downloads/report.docx'


In [130]:
!open -a "Microsoft Word" {artifacts["autoreport"]} #OSX - open autodoc on MacOS

In [131]:
default_baseline.metrics() #view final model performance

{'scorer': 'F1',
 'val_score': 0.7886191295135262,
 'val_score_sd': 0.03347658888253998,
 'val_roc_auc': 0.9126724360603727,
 'val_pr_auc': 0.8177303919498858,
 'test_score': 0.7513227513227513,
 'test_score_sd': 0.03347658888253998,
 'test_roc_auc': 0.8954060408753843,
 'test_pr_auc': 0.7826997815298492}
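
The `F1` scorer used here is the harmonic mean of precision and recall. As a reminder of the definition, the sketch below computes it from confusion-matrix counts. The counts are made up for illustration and are unrelated to this experiment.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0  # without true positives, F1 is defined as zero here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 80 true positives, 20 false positives, 30 false negatives
print(round(f1_score(80, 20, 30), 4))
```

Because F1 ignores true negatives, it is often preferred over accuracy when the positive class is rare, as with the roughly 14% target class in this dataset.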

### Pause our instance

In [132]:
instance.stop()

Driverless AI instance is stopping, please wait...
Driverless AI instance is stopped


### Resume the instance

In [133]:
instance.start()

Driverless AI instance is starting, please wait...
Driverless AI instance is running


### Delete the instance

In [134]:
instance.terminate()
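
Since a running instance keeps consuming resources, it can help to guarantee that it is stopped even when a cell raises an error. One way to sketch this is a context manager. `FakeInstance` below is a hypothetical stand-in so that the example runs without Steam, and only its `stop()` method mirrors a call used in this notebook.

```python
from contextlib import contextmanager


class FakeInstance:
    """Hypothetical stand-in for a launched instance (not the Steam SDK class)."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


@contextmanager
def stopped_afterwards(instance):
    """Yield the instance and always call stop() on exit, even on error."""
    try:
        yield instance
    finally:
        instance.stop()


demo = FakeInstance()
try:
    with stopped_afterwards(demo):
        raise ValueError("simulated failure during work")
except ValueError:
    pass
print(demo.stopped)
```

The same wrapper could call a different cleanup method instead of `stop()`, depending on whether the instance should be kept for later or removed.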