# Driverless AI: Using the Python API
This notebook walks through building and scoring models with the H2OAI Python client, following the same steps as the Driverless AI web UI workflow.

### Workflow Steps

**Build an Experiment with Python API:**
1. Sign in
2. Import train & test sets (or new data)
3. Specify columns to drop & select target column
4. Set problem type, configuration, and accuracy settings
5. Launch Experiment
6. View Results
    
**Build an Experiment in Web UI and Access Through Python:**
1. Get pointer to experiment
    
**Scoring:**
1. Score using Driverless AI
2. Score using the Scoring Package

**Train H2O Model:**
1. Train H2O Model on raw data
2. Train H2O Model on transformed data
3. Compare model performance

## Build an Experiment with Python API

### Import Datasets

#### 1. Sign In
To upload a dataset to the Driverless AI server, you first need to sign in. Instantiate the `Client` class with the address of your Driverless AI instance and your login credentials; the client uses them to create an authentication token that it sends to the Driverless AI server. This is the programmatic equivalent of signing in on the Driverless AI web page.

In [1]:
import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters

In [2]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# make sure to use the same user name and password when signing in through the GUI
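
Hard-coding credentials in a notebook makes them easy to leak. One alternative, sketched below, is to read them from environment variables. The variable names `DAI_ADDRESS`, `DAI_USERNAME`, and `DAI_PASSWORD` and the helper `load_credentials` are hypothetical and not part of `h2oai_client`; the returned values would be passed to `Client` exactly as in the cell above.

```python
import os

def load_credentials(env=None):
    """Return (address, username, password) from environment variables.

    The variable names are illustrative assumptions, not names that
    Driverless AI or h2oai_client define.
    """
    env = os.environ if env is None else env
    address = env.get("DAI_ADDRESS", "http://ip_where_driverless_is_running:12345")
    username = env.get("DAI_USERNAME", "username")
    password = env.get("DAI_PASSWORD", "password")
    return address, username, password

address, username, password = load_credentials()
# h2oai = Client(address=address, username=username, password=password)
```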

### Equivalent Steps in Driverless: Signing In
![Equivalent Steps in Driverless: Signing In](h2oai_client_images/sign_in_home_page_0.png)
![Equivalent Steps in Driverless: Signing In](h2oai_client_images/skip_sign_in_home_page_1.png)

#### 2. Upload Datasets

In our example, we will use the BNP Paribas data. We will split our data into training and validation sets so we can compare the performance of our Driverless AI model to an external model later on.

In [1]:
# Import Data
import pandas as pd
data_path = '/data/Kaggle/BNPParibas/BNPParibas-train.csv'
data = pd.read_csv(data_path)

In [2]:
# Split Data
import numpy as np
train_indices = np.random.rand(len(data)) < 0.8
train = data[train_indices]
test = data[~train_indices]
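
The random split above changes on every run because no seed is set. The sketch below builds the same kind of 80/20 mask with a seeded generator so the split can be reproduced. The toy `DataFrame` and the seed value `42` are placeholders, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the BNP Paribas data (assumption: any DataFrame behaves the same way).
toy = pd.DataFrame({"ID": range(1000), "target": np.arange(1000) % 2})

rng = np.random.default_rng(42)          # fixed seed makes the mask repeatable
train_mask = rng.random(len(toy)) < 0.8  # roughly 80% of rows are True
train_df = toy[train_mask]
test_df = toy[~train_mask]

print(len(train_df), len(test_df))
```

Every row lands in exactly one of the two frames, and rerunning the cell with the same seed yields the same split.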

In [3]:
# Export Data
train_path = '/data/Kaggle/BNPParibas/train-80.csv'
test_path = '/data/Kaggle/BNPParibas/valid-20.csv'
train.to_csv(path_or_buf=train_path, index=False)
test.to_csv(path_or_buf=test_path, index=False)

In [7]:
# Add Dataset to Driverless 
train = h2oai.create_dataset_sync(train_path)
test = h2oai.create_dataset_sync(test_path)

### Equivalent Steps in Driverless: Uploading Train & Test CSV Files
![Equivalent Steps in Driverless: Uploading Train & Test CSV Files](h2oai_client_images/import_datasets_bnp.png)

#### 3. Set Columns: Target, Ignore, Weight, and Fold

* Target Column: the column we are trying to predict
* Ignore Columns: the columns we do not want to use as predictors, such as ID columns or columns with data leakage
* Weight Column: the column that holds the per-row observation weights; if none is given, each row has an observation weight of 1
* Fold Column: the column that assigns each row to a fold; if none is given, Driverless AI determines the folds

For this example, we will be predicting **`target`** and we will ignore the column: **`ID`**.  We do not have a `weight_col` or `fold_col`.  

In [8]:
# Set the column parameters to pass to the experiment
target = "target"
drop_cols = ['ID']
weight_col = None
fold_col = None
time_col = None
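
Before launching an experiment, it can help to confirm locally that the chosen columns exist in the data. The following is a minimal sketch using pandas only; `check_columns` is a hypothetical helper, and the toy frame stands in for the training CSV.

```python
import pandas as pd

def check_columns(df, target, drop_cols=(), optional_cols=()):
    """Raise ValueError if a requested column is missing or the target is dropped.

    Returns the remaining columns, which would act as predictors.
    """
    requested = [target, *drop_cols, *(c for c in optional_cols if c is not None)]
    missing = [c for c in requested if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    if target in drop_cols:
        raise ValueError("The target column cannot also be dropped")
    excluded = set(requested)
    return [c for c in df.columns if c not in excluded]

toy = pd.DataFrame({"ID": [1, 2], "target": [0, 1], "v1": [0.5, 0.7]})
# optional_cols mirrors weight_col, fold_col, and time_col, all None in this example.
predictors = check_columns(toy, "target", drop_cols=["ID"], optional_cols=[None, None, None])
print(predictors)  # ['v1']
```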

### Equivalent Steps in Driverless: Set Target & Ignored Columns
![Equivalent Steps in Driverless: Set Target & Ignored Columns](h2oai_client_images/add_drop_test_bnp.png)

#### 4. Set Problem Type, Configuration & Accuracy Settings

Preset the parameters to pass to the model:

In [64]:
is_classification = True
enable_gpus = True
seed=True
scorer_str = 'logloss'

Preset the accuracy, time, and interpretability knobs:

In [10]:
accuracy_value = 7
time_value = 5
interpretability = 5
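
A small sanity check on the knob values can catch typos before an experiment is launched. The sketch below assumes each knob is an integer on a 1 to 10 scale, matching the dials in the web UI; `validate_knobs` is a hypothetical helper, not part of `h2oai_client`.

```python
def validate_knobs(accuracy, time, interpretability, low=1, high=10):
    """Return the settings as a dict, or raise ValueError if any is out of range.

    The 1..10 range is an assumption based on the dials in the web UI.
    """
    knobs = {"accuracy": accuracy, "time": time, "interpretability": interpretability}
    for name, value in knobs.items():
        if not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name} must be an integer in [{low}, {high}], got {value!r}")
    return knobs

print(validate_knobs(accuracy=7, time=5, interpretability=5))
```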

#### 5. Launch Experiment: Feature Engineering + Final Model Training

In [11]:
experiment = h2oai.start_experiment_sync(ModelParameters(
    dataset_key=train.key,
    target_col=target,
    is_classification=is_classification,
    cols_to_drop=drop_cols,
    testset_key=test.key, 
    enable_gpus=enable_gpus,
    seed=seed,
    accuracy=accuracy_value, 
    time= time_value,
    interpretability=interpretability, 
    scorer=scorer_str,
    weight_col = weight_col,
    fold_col = fold_col,
    time_col = time_col
))

### Equivalent Steps in Driverless: Set the Knobs, Configuration & Launch
![Equivalent Steps in Driverless: Set the Knobs](h2oai_client_images/set_experiment_settings_bnp.png)

![Equivalent Steps in Driverless: Launch Your Experiment](h2oai_client_images/exp_running_bnp.png)
![Equivalent Steps in Driverless: Launch Your Experiment](h2oai_client_images/launch_experiment_bnp.png)

#### 6. View Results

You can use the experiment object to get the final model score on the training and testing data. This final model may be an ensemble model depending on the accuracy setting.

In [12]:
print("Final Model Score on Train Data: " + str(round(experiment.train_score, 4)))
print("Final Model Score on Test Data: " + str(round(experiment.test_score, 4)))

Final Model Score on Train Data: 0.4585
Final Model Score on Test Data: 0.46
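
Because `scorer_str` is `'logloss'`, the scores above are log loss values, where lower is better. As a reminder of what the number measures, here is a minimal NumPy sketch of binary log loss. It is the textbook formula, not Driverless AI's internal scorer, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Always predicting 0.5 gives ln(2), about 0.6931, regardless of the labels.
print(round(binary_logloss([1, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]), 4))
# Confident, correct predictions give a value close to 0.
print(binary_logloss([1, 0], [0.99, 0.01]))
```

For reference, the reported test score of 0.46 is well below the 0.6931 of a constant 0.5 prediction.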


### Equivalent Steps in Driverless: View Results
![Equivalent Steps in Driverless: View Results](h2oai_client_images/experiment_done_bnp.png)

#### All of the Yellow Download Buttons Are Also Available through the Python API:
* the munged test csv
* the munged train csv
* the predictions on the (holdout) train csv
* the predictions on the test csv

We will show an example of downloading the test predictions below:

In [13]:
h2oai.download(src_path = experiment.test_predictions_path, dest_dir = ".")

'./test_preds.csv'

In [14]:
test_preds = pd.read_csv("./test_preds.csv")
test_preds.head()

|   | target.1 |
|---|---|
| 0 | 0.646166 |
| 1 | 0.932822 |
| 2 | 0.756273 |
| 3 | 0.91709 |
| 4 | 0.659382 |
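
The `target.1` column appears to hold the predicted probability that `target` equals 1 (an assumption based on the column name). If hard class labels are needed, a threshold can be applied. The sketch below uses the five rows shown above and an illustrative cutoff of 0.5, which is not a value recommended by the source.

```python
import pandas as pd

# The five predictions printed above, re-entered by hand for illustration.
preds = pd.DataFrame({"target.1": [0.646166, 0.932822, 0.756273, 0.91709, 0.659382]})

threshold = 0.5  # illustrative cutoff; tune it for the application
preds["label"] = (preds["target.1"] >= threshold).astype(int)
print(preds)
```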


## Build an Experiment in Web UI and Access Through Python

It is also possible to use the Python API to examine an experiment that was started through the Web UI using the experiment key.

![Experiments List](h2oai_client_images/launch_experiment_bnp.png)

#### 1. Get pointer to experiment

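You can get a pointer to the experiment by referencing its experiment key, which is shown in the Web UI. The cells below list the keys of up to 100 experiments and then retrieve the first one.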

In [4]:
experiment_list = [model.key for model in h2oai.list_models(offset=0, limit=100)]
experiment_list

['797d7d']

In [5]:
experiment = h2oai.get_model_job(experiment_list[0]).entity

## Scoring

There are two ways to score: through Driverless AI or by downloading the Scoring Package. The Scoring Package contains a Python module and can score independently of Driverless AI.

#### 1. Score using Driverless AI

We will start out by scoring through Driverless AI.  This is equivalent to the `SCORE ON ANOTHER DATASET` button in the Web UI.  In the example below, we will score on the test data and download the predictions.

In [17]:
prediction = h2oai.make_prediction_sync(experiment.key, test_path)
pred_path = h2oai.download(prediction.predictions_csv_path, '.')
pred_table = pd.read_csv(pred_path)
pred_table.head()

|   | target.1 |
|---|---|
| 0 | 0.646166 |
| 1 | 0.932822 |
| 2 | 0.756273 |
| 3 | 0.91709 |
| 4 | 0.659382 |
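
The first rows returned here match the test predictions downloaded in the View Results step, as expected when the same model scores the same file. A quick way to confirm agreement between two prediction tables is sketched below with `numpy.allclose`. The two frames are filled with the values printed in this notebook rather than read from disk, so they are stand-ins for `test_preds` and `pred_table`.

```python
import numpy as np
import pandas as pd

values = [0.646166, 0.932822, 0.756273, 0.91709, 0.659382]
downloaded = pd.DataFrame({"target.1": values})  # stand-in for test_preds
rescored = pd.DataFrame({"target.1": values})    # stand-in for pred_table

same_shape = downloaded.shape == rescored.shape
same_values = same_shape and np.allclose(downloaded["target.1"], rescored["target.1"])
print(same_values)  # True
```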


#### 2. Score using the Scoring Package

We will download the Scoring Package and import its Python scoring module to score the test data again.

In [21]:
# Download the scoring package
h2oai.download(experiment.scoring_package_path, '.')

'./scorer.zip'

In [22]:
%%bash
# Unzip scoring package and install the scoring python library
unzip scorer

cd scoring-package
ls

Archive:  scorer.zip
   creating: scoring-package/
  inflating: scoring-package/h2o4gpu-0.1.0+master.c200632-py36-none-any.whl  
  inflating: scoring-package/example_client.py  
  inflating: scoring-package/requirements.txt  
  inflating: scoring-package/server.py  
  inflating: scoring-package/example.py  
  inflating: scoring-package/run_tcp_server.sh  
  inflating: scoring-package/run_tcp_client.sh  
  inflating: scoring-package/README.txt  
  inflating: scoring-package/run_http_server.sh  
  inflating: scoring-package/features.txt  
 extracting: scoring-package/server_requirements.txt  
  inflating: scoring-package/run_http_client.sh  
  inflating: scoring-package/run_example.sh  
  inflating: scoring-package/scoring.thrift  
  inflating: scoring-package/h2oaicore-1.0.10-cp36-cp36m-linux_x86_64.whl  
  inflating: scoring-package/datatable-0.2.2+master.372.noomp-cp36-cp36m-linux_x86_64.whl  
  inflating: scoring-package/scoring_797d7d_20171203000345_39780-1.0.0-py3-none-any.whl  

In [9]:
%%bash
# Install the scoring module
pip install scoring-package/scoring_797d7d_20171203000345_39780-1.0.0-py3-none-any.whl

Processing ./scoring-package/scoring_797d7d_20171203000345_39780-1.0.0-py3-none-any.whl
Installing collected packages: scoring-797d7d-20171203000345-39780
Successfully installed scoring-797d7d-20171203000345-39780-1.0.0


In [10]:
import pandas as pd
from numpy import nan
from scoring_797d7d_20171203000345_39780 import Scorer

In [11]:
#
# Create a singleton Scorer instance.
# For optimal performance, create a Scorer instance once, and call score() or score_batch() multiple times.
#
scorer = Scorer()



In [58]:
# Import test data
test_data = pd.read_csv(test_path)

In [29]:
# Score the test data
# capture_output() suppresses the scorer's console output in the notebook
from IPython.utils import io
with io.capture_output() as captured:
    scorer_preds = scorer.score_batch(test_data)

In [75]:
# Transform test data
transformed_test = scorer.transform_batch(test_data)
transformed_test.head()

|   | 0_CV_TE_v107_0 | 1_CV_TE_v110_0 | 2_CV_TE_v112_0 | 3_CV_TE_v113_0 | 4_CV_TE_v125_0 | 5_CV_TE_v129_0 | 6_CV_TE_v22_0 | 7_CV_TE_v24_0 | 8_CV_TE_v3_0 | 9_CV_TE_v30_0 | ... | 147_CV_CatNumEnc_v113_v56_v66__v12_std | 147_CV_CatNumEnc_v113_v56_v66__v5_std | 147_CV_CatNumEnc_v113_v56_v66__v50_std | 147_CV_CatNumEnc_v113_v56_v66__v6_std | 148_NumCatTE_v1_v11_v113_v4_v47_v5_v50_v56_v6_v66_v69_0 | 149_NumCatTE_v1_v11_v113_v12_v2_v24_v3_v31_v5_v50_v56_0 | 150_WoE_v110_v66_0 | 151_NumToCatTE_v1_v11_0 | 152_CV_TE_v56_0 | 153_NumCatTE_v1_v11_v110_v113_v12_v3_v31_v4_v5_v50_v56_v6_v66_v7_v79_0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.764832 | 0.6903 | 0.789034 | 0.68381 | 0.795302 | 0.726638 | 0.80303 | 0.800503 | 0.760087 | 0.77412 | ... | 0.316153 | 1.740665 | 0.586085 | 0.596651 | 0.761961 | 0.761961 | -0.504404 | 0.770112 | 0.636078 | 0.761961 |
| 1 | 0.758349 | 0.6903 | 0.773467 | 0.68381 | 0.774168 | 0.726638 | 0.775436 | 0.75598 | 0.760087 | 0.761961 | ... | 0.926541 | 2.03992 | 1.16603 | 0.602259 | 0.761961 | 0.761961 | 0.479076 | 0.770112 | 0.761961 | 0.761961 |
| 2 | 0.758349 | 0.6903 | 0.778271 | 0.676176 | 0.784792 | 0.726638 | 0.761931 | 0.733728 | 0.760087 | 0.761961 | ... | 0.639567 | 1.700511 | 0.843003 | 0.376991 | 0.761961 | 0.761961 | -0.504404 | 0.770112 | 0.704082 | 0.761961 |
| 3 | 0.743438 | 0.6903 | 0.747622 | 0.676176 | 0.750423 | 0.726638 | 0.692308 | 0.75598 | 0.760087 | 0.727316 | ... | 0.430758 | 2.247023 | 1.217886 | 0.540155 | 0.840945 | 0.761961 | -0.504404 | 0.770112 | 0.871929 | 0.795727 |
| 4 | 0.758349 | 0.6903 | 0.773467 | 0.660176 | 0.774168 | 0.726638 | 0.681034 | 0.75598 | 0.760087 | 0.761961 | ... | 0.364064 | 2.725092 | 0.622478 | 0.624809 | 0.525734 | 0.761961 | 0.479076 | 0.770112 | 0.771385 | 0.653876 |


## Train H2O Model 

We are not limited to using the scores from Driverless AI.  We can also use the transformed data to train our own model.  In this section, we will train a GBM model on the raw data and transformed data and compare the results to Driverless AI. 

#### 1. Train H2O Model on Raw Data

In [15]:
# Install H2O

! pip install http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/1/Python/h2o-3.16.0.1-py2.py3-none-any.whl

Collecting h2o==3.16.0.1 from http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/1/Python/h2o-3.16.0.1-py2.py3-none-any.whl
  Downloading http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/1/Python/h2o-3.16.0.1-py2.py3-none-any.whl (119.6MB)
    100% |################################| 119.6MB 109.4MB/s eta 0:00:01
Installing collected packages: h2o
Successfully installed h2o-3.16.0.1


In [76]:
# Import H2O and connect
import h2o
h2o.init(port = 54323)

Checking whether there is an H2O instance running at http://localhost:54323..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_151"; Java(TM) SE Runtime Environment (build 1.8.0_151-b12); Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
  Starting server from /h2oai_env/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpzaf2vqc6
  JVM stdout: /tmp/tmpzaf2vqc6/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpzaf2vqc6/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323... successful.


|   |   |
|---|---|
| H2O cluster uptime: | 01 secs |
| H2O cluster version: | 3.16.0.1 |
| H2O cluster version age: | 8 days |
| H2O cluster name: | H2O_from_python_unknownUser_2k4ws0 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 13.33 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |
| H2O cluster status: | accepting new members, healthy |
| H2O connection url: | http://127.0.0.1:54323 |


In [77]:
train_h2o = h2o.import_file(train_path)
test_h2o = h2o.import_file(test_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [78]:
if is_classification:
    train_h2o[target] = train_h2o[target].asfactor()
    test_h2o[target] = test_h2o[target].asfactor()

In [79]:
# Train GBM on Raw Data
from h2o.estimators import H2OGradientBoostingEstimator
gbm_raw = H2OGradientBoostingEstimator(model_id = "raw_data", 
                                       nfolds = 3, # 3-fold cross validation
                                       ntrees = 1000,   # more than enough trees
                                       ## early stopping once the validation logloss doesn't improve 
                                       ## by at least 0.1% for 3 consecutive scoring events
                                       stopping_rounds = 3, stopping_tolerance = 1e-3, stopping_metric = "logloss", 
                                       score_tree_interval = 10,
                                       seed = 1234)
gbm_raw.train(y = target, training_frame = train_h2o)

gbm Model Build progress: |███████████████████████████████████████████████| 100%
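
The early stopping settings in the cell above can be read as a rule over the sequence of logloss values recorded every `score_tree_interval` trees. The sketch below is a simplified illustration of that idea, not H2O's exact implementation (H2O documents a criterion based on moving averages of the metric). The function `should_stop` and the sample histories are hypothetical.

```python
def should_stop(history, rounds=3, tolerance=1e-3):
    """Simplified early-stopping check for a metric where lower is better.

    Stop when none of the last `rounds` scores improves on the best
    earlier score by at least a relative `tolerance` (1e-3 = 0.1%).
    """
    if len(history) <= rounds:
        return False  # not enough scoring events yet
    best_before = min(history[:-rounds])
    required = best_before * (1 - tolerance)
    return all(score > required for score in history[-rounds:])

improving = [0.60, 0.50, 0.45, 0.44, 0.43]       # still getting better
stalled = [0.60, 0.50, 0.4999, 0.4998, 0.4997]   # gains smaller than 0.1%
print(should_stop(improving), should_stop(stalled))  # False True
```

With `ntrees = 1000` as an upper bound, a rule of this kind is what lets training end well before all 1000 trees are built.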


#### 2. Train H2O Model on Transformed Data

In [80]:
# Convert Transformed Test Data to H2O Frame
transformed_test = pd.concat([transformed_test, test_data[target]], axis = 1)
transformed_test_h2o = h2o.H2OFrame(transformed_test)

Parse progress: |█████████████████████████████████████████████████████████| 100%
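
Note that `pd.concat(..., axis=1)` aligns rows by index label, not by position. In the cell above both frames presumably carry a default 0..n-1 index, so the rows line up. If either frame had a different index, the result would contain misaligned rows and `NaN` values. A minimal sketch of the pitfall and of the `reset_index(drop=True)` guard follows, using toy data.

```python
import pandas as pd

features = pd.DataFrame({"f0": [0.1, 0.2, 0.3]})                   # index 0, 1, 2
labels = pd.Series([1, 0, 1], index=[10, 11, 12], name="target")  # index 10, 11, 12

# Different index labels: concat produces the union of both indexes.
misaligned = pd.concat([features, labels], axis=1)
print(len(misaligned))  # 6 rows, half of each column is NaN

# Resetting both indexes makes the concat positional.
aligned = pd.concat(
    [features.reset_index(drop=True), labels.reset_index(drop=True)], axis=1
)
print(len(aligned))  # 3 rows, no NaN
```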


In [81]:
# Import Transformed Train Data as H2O Frame
transformed_train_path = h2oai.download(src_path = experiment.train_transformed_csv_path, dest_dir = ".")
transformed_train_h2o = h2o.import_file(transformed_train_path)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [82]:
if is_classification:
    transformed_test_h2o[target] = transformed_test_h2o[target].asfactor()
    transformed_train_h2o[target] = transformed_train_h2o[target].asfactor()

In [83]:
# Train GBM on Transformed Data
gbm_transformed = H2OGradientBoostingEstimator(model_id = "transformed_data", 
                                               nfolds = 3, # 3-fold cross validation
                                               ntrees = 1000,   # more than enough trees 
                                               ## early stopping once the validation logloss doesn't improve 
                                               ## by at least 0.1% for 3 consecutive scoring events
                                               stopping_rounds = 3, stopping_tolerance = 1e-3, stopping_metric = "logloss",
                                               score_tree_interval = 10,
                                               seed = 1234)
gbm_transformed.train(y = target, training_frame = transformed_train_h2o)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


#### 3. Compare Model Performance

In [84]:
raw_data_logloss = gbm_raw.model_performance(test_data = test_h2o).logloss()
transformed_data_logloss = gbm_transformed.model_performance(test_data = transformed_test_h2o).logloss()
driverless_logloss = experiment.test_score

In [85]:
print("Raw Data Logloss: " + str(round(raw_data_logloss, 4)))
print("Transformed Data Logloss: " + str(round(transformed_data_logloss, 4)))
print("Driverless AI Logloss: " + str(round(driverless_logloss, 4)))

Raw Data Logloss: 0.4683
Transformed Data Logloss: 0.4685
Driverless AI Logloss: 0.46
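
To put the three numbers side by side, the sketch below ranks them and computes each model's relative gap to the best score. The values are copied from the output above; the dictionary and variable names are illustrative.

```python
scores = {
    "Raw Data GBM": 0.4683,
    "Transformed Data GBM": 0.4685,
    "Driverless AI": 0.46,
}

best_name = min(scores, key=scores.get)  # lower logloss is better
best = scores[best_name]

# Print the models from best to worst with their gap to the best score.
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    gap_pct = 100 * (value - best) / best
    print(f"{name}: {value:.4f} (+{gap_pct:.2f}% vs best)")
```

In this run the GBM trained on transformed features did not beat the GBM trained on raw data (0.4685 versus 0.4683), while the Driverless AI final model scored best at 0.46.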
