# Driverless AI: Using the Python API
This notebook walks through building and scoring a model with the H2OAI Python client, mirroring each step of the Driverless AI web UI workflow.

### Workflow Steps

**Build an Experiment with Python API:**
1. Sign in
2. Import train & test set/new data
3. Specify columns to drop & select target column
4. Set problem type, configuration, and accuracy settings
5. Launch Experiment
6. View Results
    
**Build an Experiment in Web UI and Access Through Python:**
1. Get pointer to experiment
    
**Scoring:**
1. Score using Driverless AI
2. Score using the Scoring Package

**Train H2O Model:**
1. Train H2O Model on raw data
2. Train H2O Model on transformed data
3. Compare model performance

## Build an Experiment with Python API

### Import Datasets

#### 1. Sign In
To upload a dataset to the Driverless AI server, pass your credentials to the `Client` class, which creates an authentication token and sends it to the server. In plain English: to sign in to the Driverless AI web page (which then sends requests to the Driverless AI server), instantiate the `Client` class with your Driverless AI address and login credentials.

In [71]:
import h2oai_client
import numpy as np
import pandas as pd
# import h2o
import requests
import math
from h2oai_client import Client, ModelParameters

In [72]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# Use the same username and password that you use to sign in through the GUI
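
The cell above hardcodes the address and credentials. As a minimal sketch of one alternative, the three `Client` arguments could be read from environment variables instead. The variable names `DAI_ADDRESS`, `DAI_USERNAME`, and `DAI_PASSWORD` and the helper `load_credentials` are hypothetical; they are not part of `h2oai_client` or Driverless AI.

```python
import os

# Hypothetical environment variable names; they are not defined by Driverless AI.
ENV_KEYS = {
    "address": "DAI_ADDRESS",
    "username": "DAI_USERNAME",
    "password": "DAI_PASSWORD",
}

def load_credentials(environ=os.environ):
    """Collect the three Client arguments from environment variables."""
    missing = [name for name in ENV_KEYS.values() if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {arg: environ[name] for arg, name in ENV_KEYS.items()}

# Usage with the client from the cell above (not executed here):
# h2oai = Client(**load_credentials())
```

The returned dictionary uses the same keyword names (`address`, `username`, `password`) that the sign-in cell passes to `Client`.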

### Equivalent Steps in Driverless: Signing In
![Equivalent Steps in Driverless: Signing In](h2oai_client_images/sign_in_home_page_0.png)
![Equivalent Steps in Driverless: Signing In](h2oai_client_images/skip_sign_in_home_page_1.png)

#### 2. Upload Datasets

In this example, we use the BNP Paribas data. We split the data into an 80% training set and a 20% validation set so that we can later compare the performance of our Driverless AI model to an external model.

In [73]:
# Import Data
import pandas as pd
data_path = '/data/Kaggle/BNPParibas/BNPParibas-train.csv'
data = pd.read_csv(data_path)

In [74]:
# Split Data
import numpy as np
train_indices = np.random.rand(len(data)) < 0.8
train = data[train_indices]
test = data[~train_indices]
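
The split above draws from `np.random.rand` without a seed, so the rows assigned to each set change on every run. The following is a minimal sketch of a repeatable split mask using only the standard library; the helper name `split_mask` and the seed value `1234` are assumptions chosen for illustration.

```python
import random

def split_mask(n_rows, train_fraction=0.8, seed=1234):
    """Return one boolean per row; True marks a training row."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    return [rng.random() < train_fraction for _ in range(n_rows)]

mask = split_mask(10_000)
print(sum(mask) / len(mask))  # close to 0.8, and identical on every run
```

To apply the mask to the DataFrame from the cell above, convert it first: `train = data[np.array(mask)]` and `test = data[~np.array(mask)]`, with `mask = split_mask(len(data))`.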

In [75]:
# Export Data
train_path = '/data/Kaggle/BNPParibas/train-80.csv'
test_path = '/data/Kaggle/BNPParibas/valid-20.csv'
train.to_csv(path_or_buf=train_path, index=False)
test.to_csv(path_or_buf=test_path, index=False)

In [76]:
# Add Dataset to Driverless 
train = h2oai.create_dataset_sync(train_path)
test = h2oai.create_dataset_sync(test_path)

### Equivalent Steps in Driverless: Uploading Train & Test CSV Files
![Equivalent Steps in Driverless: Uploading Train & Test CSV Files](h2oai_client_images/import_datasets_bnp.png)

#### 3. Set Columns: Target, Ignore, Weight, and Fold

* Target Column: the column we are trying to predict
* Ignore Columns: the columns we do not want to use as predictors, such as ID columns or columns with data leakage
* Weight Column: the column that holds the per-row observation weights; if none is given, each row has an observation weight of 1
* Fold Column: the column that assigns each row to a cross-validation fold; if none is given, Driverless AI determines the folds

For this example, we will be predicting **`target`** and we will ignore the column: **`ID`**.  We do not have a `weight_col` or `fold_col`.  

In [77]:
# Set the column parameters to pass to the experiment
target = "target"
drop_cols = ['ID']
weight_col = None
fold_col = None
time_col = None
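
A misspelled column name is easy to miss until the experiment is launched. As a rough sketch, the column settings can be checked against the CSV header beforehand. The helper `check_columns` and the tiny in-memory sample are illustrative assumptions, not part of the Driverless AI client.

```python
import csv
import io

def check_columns(header, target, drop_cols=(), weight_col=None, fold_col=None, time_col=None):
    """Return a list of problems found in the column settings (empty if none)."""
    problems = []
    roles = {"target": target, "weight": weight_col, "fold": fold_col, "time": time_col}
    for role, name in roles.items():
        if name is not None and name not in header:
            problems.append(f"{role} column {name!r} is not in the dataset")
    for name in drop_cols:
        if name not in header:
            problems.append(f"dropped column {name!r} is not in the dataset")
    if target in drop_cols:
        problems.append("the target column is also listed in drop_cols")
    return problems

# Tiny in-memory stand-in for the first lines of the training CSV (made-up values)
sample = io.StringIO("ID,target,v1,v2\n3,1,1.33,8.72\n")
header = next(csv.reader(sample))
print(check_columns(header, target="target", drop_cols=["ID"]))  # prints []
```

With the real data, the header could be taken from `list(pd.read_csv(train_path, nrows=0).columns)`.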

### Equivalent Steps in Driverless: Set Target & Ignored Columns
![Equivalent Steps in Driverless: Set Target & Ignored Columns](h2oai_client_images/add_drop_test_bnp.png)

#### 4. Set Problem Type, Configuration & Accuracy Settings

Preset the parameters to pass to the model:

In [78]:
is_classification = True
enable_gpus = True
seed=True
scorer_str = 'auc'

Preset the accuracy, time, and interpretability knobs:

In [79]:
accuracy_value = 7
time_value = 5
interpretability = 5
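
The three knobs are integer settings. The sketch below validates them before launch, assuming each knob accepts whole numbers from 1 to 10 as in the web UI dials; the range and the helper `check_knobs` are assumptions to confirm against your Driverless AI version.

```python
def check_knobs(accuracy, time, interpretability, low=1, high=10):
    """Raise ValueError if a knob is not an integer in [low, high]; return the knobs otherwise."""
    knobs = {"accuracy": accuracy, "time": time, "interpretability": interpretability}
    for name, value in knobs.items():
        # bool is a subclass of int, so reject it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
    return knobs

print(check_knobs(accuracy=7, time=5, interpretability=5))
```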

#### 5. Launch Experiment: Feature Engineering + Final Model Training

In [80]:
experiment = h2oai.start_experiment_sync(ModelParameters(
    dataset_key=train.key,
    target_col=target,
    is_classification=is_classification,
    cols_to_drop=drop_cols,
    testset_key=test.key, 
    enable_gpus=enable_gpus,
    seed=seed,
    accuracy=accuracy_value, 
    time= time_value,
    interpretability=interpretability, 
    scorer=scorer_str,
    weight_col = weight_col,
    fold_col = fold_col,
    time_col = time_col
))

### Equivalent Steps in Driverless: Set the Knobs, Configuration & Launch
![Equivalent Steps in Driverless: Set the Knobs](h2oai_client_images/set_experiment_settings_bnp.png)

![Equivalent Steps in Driverless: Launch Your Experiment](h2oai_client_images/exp_running_bnp.png)
![Equivalent Steps in Driverless: Launch Your Experiment](h2oai_client_images/launch_experiment_bnp.png)

#### 6. View Results

You can use the experiment object to get the final model score on the training and testing data. This final model may be an ensemble model depending on the accuracy setting.

In [82]:
print("Final Model Score on Train Data: " + str(round(experiment.train_score, 4)))
print("Final Model Score on Test Data: " + str(round(experiment.test_score, 4)))

Final Model Score on Train Data: 0.7818
Final Model Score on Test Data: 0.7822


### Equivalent Steps in Driverless: View Results
![Equivalent Steps in Driverless: View Results](h2oai_client_images/experiment_done_bnp.png)

#### Everything Behind the Yellow Download Buttons Is Also Available through the Python API:
* the munged test csv
* the munged train csv
* the predictions on the (holdout) train csv
* the predictions on the test csv

We will show an example of downloading the test predictions below:

In [83]:
h2oai.download(src_path = experiment.test_predictions_path, dest_dir = ".")

'./test_preds.csv'

In [84]:
test_preds = pd.read_csv("./test_preds.csv")
test_preds.head()

Unnamed: 0,target.1
0,0.877975
1,0.948201
2,0.932511
3,0.899017
4,0.931647


## Build an Experiment in Web UI and Access Through Python

It is also possible to use the Python API to examine an experiment that was started through the Web UI using the experiment key.

![Experiments List](h2oai_client_images/launch_experiment_bnp.png)

#### 1. Get pointer to experiment

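You can get a pointer to the experiment by looking up its experiment key, which is also shown in the web UI. The cells below list the keys of the available experiments and fetch the first one.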

In [85]:
experiment_list = list(map(lambda x: x.key, h2oai.list_models(offset = 0, limit = 100)))
experiment_list

['482b68']

In [86]:
experiment = h2oai.get_model_job(experiment_list[0]).entity
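
The cell above takes the first experiment in the list. When several experiments exist, it may be safer to select one by the key shown in the web UI. This is a minimal sketch using stand-in objects that only carry a `.key` attribute; the helper `find_by_key` and the second key `0a1b2c` are hypothetical.

```python
from types import SimpleNamespace

def find_by_key(items, key):
    """Return the first item whose .key attribute equals key, else None."""
    return next((item for item in items if item.key == key), None)

# Stand-ins for the objects returned by h2oai.list_models(offset=0, limit=100)
models = [SimpleNamespace(key="482b68"), SimpleNamespace(key="0a1b2c")]
match = find_by_key(models, "482b68")
print(match.key if match else "not found")  # prints 482b68
```

With a live connection, the matched key would then be passed on as in the cell above: `h2oai.get_model_job(match.key).entity`.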

## Scoring

There are two ways to score: through Driverless AI or by downloading the Scoring Package. The Scoring Package contains a Python module and can score independently of Driverless AI.

#### 1. Score using Driverless AI

We will start out by scoring through Driverless AI.  This is equivalent to the `SCORE ON ANOTHER DATASET` button in the Web UI.  In the example below, we will score on the test data and download the predictions.

In [87]:
prediction = h2oai.make_prediction_sync(experiment.key, test_path)
pred_path = h2oai.download(prediction.predictions_csv_path, '.')
pred_table = pd.read_csv(pred_path)
pred_table.head()

Unnamed: 0,target.1
0,0.877975
1,0.948201
2,0.932511
3,0.899017
4,0.931647


#### 2. Score using the Scoring Package

We will download the scoring package and import the Python scoring module to score the test data again.

In [88]:
# Download the scoring package
h2oai.download(experiment.scoring_package_path, '.')

'./scorer.zip'

In [89]:
%%bash
# Unzip scoring package and install the scoring python library
unzip scorer

cd scoring-package
ls

Archive:  scorer.zip
   creating: scoring-package/
  inflating: scoring-package/example_client.py  
  inflating: scoring-package/requirements.txt  
  inflating: scoring-package/server.py  
  inflating: scoring-package/datatable-0.2.2+master.366.noomp-cp36-cp36m-linux_x86_64.whl  
  inflating: scoring-package/example.py  
  inflating: scoring-package/run_tcp_server.sh  
  inflating: scoring-package/run_tcp_client.sh  
  inflating: scoring-package/README.txt  
  inflating: scoring-package/h2oaicore-1.0.9-cp36-cp36m-linux_x86_64.whl  
  inflating: scoring-package/scoring_482b68_20171202204342_b6f33-1.0.0-py3-none-any.whl  
  inflating: scoring-package/run_http_server.sh  
  inflating: scoring-package/features.txt  
 extracting: scoring-package/server_requirements.txt  
  inflating: scoring-package/run_http_client.sh  
  inflating: scoring-package/run_example.sh  
  inflating: scoring-package/scoring.thrift  
 extracting: scoring-package/client_requirements.txt  
  inflating: scoring-packa

In [90]:
%%bash
# Import scoring module
pip install scoring-package/scoring_482b68_20171202204342_b6f33-1.0.0-py3-none-any.whl

Processing ./scoring-package/scoring_482b68_20171202204342_b6f33-1.0.0-py3-none-any.whl
Installing collected packages: scoring-482b68-20171202204342-b6f33
Successfully installed scoring-482b68-20171202204342-b6f33-1.0.0


In [91]:
import pandas as pd
from numpy import nan
from scoring_482b68_20171202204342_b6f33 import Scorer

In [92]:
#
# Create a singleton Scorer instance.
# For optimal performance, create a Scorer instance once, and call score() or score_batch() multiple times.
#
scorer = Scorer()

In [93]:
# Import test data
test_data = pd.read_csv(test_path)

In [124]:
check = scorer.score_batch(test_data)

('[Sat Dec  2 22:32:00 2017] [make_holdout_preds_subprocess] Begin GB used by self   26585: 0.150147',)
('loading subprocess input data (model, X) took 0.01145 secs',)
('[Sat Dec  2 22:32:00 2017] [predict_subprocess_xgboost] Begin GB used by self   26599: 0.152592',)
('loading subprocess input data (model, X) took 0.01638 secs',)
('subprocess predict took 0.68071 secs',)
('saving subprocess output data (preds) took 0.00304 secs',)
('[Sat Dec  2 22:32:01 2017] [predict_subprocess_xgboost] End GB used by self   26599: 0.239006',)
('subprocess predict took 0.90996 secs',)
('saving subprocess output data (preds) took 0.00395 secs',)
('[Sat Dec  2 22:32:01 2017] [make_holdout_preds_subprocess] End GB used by self   26585: 0.193364',)
('[Sat Dec  2 22:32:01 2017] [make_holdout_preds_subprocess] Begin GB used by self   26585: 0.193384',)
('loading subprocess input data (model, X) took 0.01348 secs',)
('[Sat Dec  2 22:32:01 2017] [predict_subprocess_xgboost] Begin GB used by self   26651: 0.1

('[Sat Dec  2 22:32:12 2017] [make_holdout_preds_subprocess] Begin GB used by self   26585: 0.281895',)
('loading subprocess input data (model, X) took 0.00903 secs',)
('[Sat Dec  2 22:32:12 2017] [predict_subprocess_xgboost] Begin GB used by self   27174: 0.148386',)
('loading subprocess input data (model, X) took 0.02847 secs',)
('subprocess predict took 0.86315 secs',)
('saving subprocess output data (preds) took 0.00243 secs',)
('[Sat Dec  2 22:32:13 2017] [predict_subprocess_xgboost] End GB used by self   27174: 0.247665',)
('subprocess predict took 1.10955 secs',)
('saving subprocess output data (preds) took 0.00388 secs',)
('[Sat Dec  2 22:32:13 2017] [make_holdout_preds_subprocess] End GB used by self   26585: 0.281915',)
('[Sat Dec  2 22:32:13 2017] [make_holdout_preds_subprocess] Begin GB used by self   26585: 0.281915',)
('loading subprocess input data (model, X) took 0.00740 secs',)
('[Sat Dec  2 22:32:14 2017] [predict_subprocess_xgboost] Begin GB used by self   27226: 0.1

('[Sat Dec  2 22:32:24 2017] [make_holdout_preds_subprocess] Begin GB used by self   26585: 0.286347',)
('loading subprocess input data (model, X) took 0.00679 secs',)
('[Sat Dec  2 22:32:24 2017] [predict_subprocess_xgboost] Begin GB used by self   27749: 0.146276',)
('loading subprocess input data (model, X) took 0.02092 secs',)
('subprocess predict took 0.79961 secs',)
('saving subprocess output data (preds) took 0.00267 secs',)
('[Sat Dec  2 22:32:25 2017] [predict_subprocess_xgboost] End GB used by self   27749: 0.23253',)
('subprocess predict took 1.02942 secs',)
('saving subprocess output data (preds) took 0.00245 secs',)
('[Sat Dec  2 22:32:25 2017] [make_holdout_preds_subprocess] End GB used by self   26585: 0.286368',)
('[Sat Dec  2 22:32:25 2017] [make_holdout_preds_subprocess] Begin GB used by self   26585: 0.286368',)
('loading subprocess input data (model, X) took 0.00387 secs',)
('[Sat Dec  2 22:32:25 2017] [predict_subprocess_xgboost] Begin GB used by self   27801: 0.14

In [94]:
# Score the test data
scorer.score_batch(test_data)

('[Sat Dec  2 22:09:47 2017] [make_holdout_preds_subprocess] Begin GB used by self   21186: 0.09984',)
('loading subprocess input data (model, X) took 0.01517 secs',)
('[Sat Dec  2 22:09:47 2017] [predict_subprocess_xgboost] Begin GB used by self   21196: 0.095998',)
('loading subprocess input data (model, X) took 0.01635 secs',)
('subprocess predict took 0.69448 secs',)
('saving subprocess output data (preds) took 0.00254 secs',)
('[Sat Dec  2 22:09:48 2017] [predict_subprocess_xgboost] End GB used by self   21196: 0.171848',)
('subprocess predict took 0.89756 secs',)
('saving subprocess output data (preds) took 0.00167 secs',)
('[Sat Dec  2 22:09:48 2017] [make_holdout_preds_subprocess] End GB used by self   21186: 0.142971',)
('[Sat Dec  2 22:09:48 2017] [make_holdout_preds_subprocess] Begin GB used by self   21186: 0.143065',)
('loading subprocess input data (model, X) took 0.01642 secs',)
('[Sat Dec  2 22:09:48 2017] [predict_subprocess_xgboost] Begin GB used by self   21228: 0.08

('loading subprocess input data (model, X) took 0.01058 secs',)
('[Sat Dec  2 22:09:59 2017] [predict_subprocess_xgboost] Begin GB used by self   21551: 0.0938926',)
('loading subprocess input data (model, X) took 0.03185 secs',)
('subprocess predict took 0.92132 secs',)
('saving subprocess output data (preds) took 0.00220 secs',)
('[Sat Dec  2 22:10:00 2017] [predict_subprocess_xgboost] End GB used by self   21551: 0.179253',)
('subprocess predict took 1.14244 secs',)
('saving subprocess output data (preds) took 0.00146 secs',)
('[Sat Dec  2 22:10:00 2017] [make_holdout_preds_subprocess] End GB used by self   21186: 0.243368',)
('[Sat Dec  2 22:10:00 2017] [make_holdout_preds_subprocess] Begin GB used by self   21186: 0.243372',)
('loading subprocess input data (model, X) took 0.00749 secs',)
('[Sat Dec  2 22:10:00 2017] [predict_subprocess_xgboost] Begin GB used by self   21586: 0.0959734',)
('loading subprocess input data (model, X) took 0.02244 secs',)
('subprocess predict took 0.7

('[Sat Dec  2 22:10:11 2017] [predict_subprocess_xgboost] Begin GB used by self   21906: 0.0917914',)
('loading subprocess input data (model, X) took 0.02359 secs',)
('subprocess predict took 0.84650 secs',)
('saving subprocess output data (preds) took 0.00178 secs',)
('[Sat Dec  2 22:10:12 2017] [predict_subprocess_xgboost] End GB used by self   21906: 0.164852',)
('subprocess predict took 1.05603 secs',)
('saving subprocess output data (preds) took 0.00140 secs',)
('[Sat Dec  2 22:10:12 2017] [make_holdout_preds_subprocess] End GB used by self   21186: 0.252846',)
('[Sat Dec  2 22:10:12 2017] [make_holdout_preds_subprocess] Begin GB used by self   21186: 0.252854',)
('loading subprocess input data (model, X) took 0.00430 secs',)
('[Sat Dec  2 22:10:12 2017] [predict_subprocess_xgboost] Begin GB used by self   21938: 0.0980787',)
('loading subprocess input data (model, X) took 0.01517 secs',)
('subprocess predict took 0.69104 secs',)
('saving subprocess output data (preds) took 0.0017

array([ 0.87132628,  0.9488554 ,  0.93376813, ...,  0.94583185,
        0.79710248,  0.93223693])
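
The predictions from the Scoring Package are close to, but not identical to, the ones downloaded from Driverless AI earlier. A small sketch for quantifying the difference, using the first three rows of each output shown in this notebook; the helper `max_abs_diff` is illustrative only.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("prediction vectors differ in length")
    return max(abs(x - y) for x, y in zip(a, b))

# First three rows from each output shown in this notebook
driverless_preds = [0.877975, 0.948201, 0.932511]
package_preds = [0.87132628, 0.9488554, 0.93376813]
print(round(max_abs_diff(driverless_preds, package_preds), 4))  # prints 0.0066
```

On the full data, the same function could be applied to `pred_table["target.1"].tolist()` and the array returned by `scorer.score_batch(test_data)`.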

In [95]:
# Transform test data
transformed_test = scorer.transform_batch(test_data)
transformed_test.head()

Unnamed: 0,0_v1,1_v10,2_v101,3_v102,4_v103,5_v104,6_v105,7_v106,8_v109,9_v11,...,119_CV_CatNumEnc_v110_v66__v1_median,119_CV_CatNumEnc_v110_v66__v11_median,119_CV_CatNumEnc_v110_v66__v4_median,119_CV_CatNumEnc_v110_v66__v40_median,119_CV_CatNumEnc_v110_v66__v5_median,119_CV_CatNumEnc_v110_v66__v50_median,119_CV_CatNumEnc_v110_v66__v6_median,120_NumCatTE_v1_v12_v5_v50_v56_v6_v66_v7_v79_0,121_WoE_v113_v22_v30_v72_v75_0,122_WoE_v113_v22_v56_v66_v75_0
0,1.335739,0.503281,8.389237,2.757375,4.374296,1.574039,0.007294,12.579184,3.930922,16.434108,...,1.444466,15.446568,4.406222,8.568845,8.906318,1.418824,2.445583,0.761595,1.161431,1.161431
1,0.797415,6.542669,8.507281,2.503055,4.872157,2.573664,0.113967,12.554274,1.990131,16.347483,...,1.458727,15.507245,4.272285,10.815426,8.713364,1.324921,2.437831,0.761595,1.161431,1.161431
2,,1.050328,,,,,,,,,...,1.442494,15.492702,4.254459,5.83677,8.446665,1.188468,2.367107,0.761595,1.161431,1.161431
3,,1.31291,,,,,,,,,...,1.435345,15.507402,4.151858,10.256855,8.466711,1.025603,2.377665,0.777084,1.161431,1.161431
4,1.344477,1.050329,7.940586,2.329044,4.751472,2.281442,0.075693,12.360166,-2.394541e-07,16.003086,...,1.442494,15.492702,4.254459,5.83677,8.446665,1.188468,2.367107,0.761595,1.161431,1.161431


## Train H2O Model 

We are not limited to using the scores from Driverless AI. We can also use the transformed data to train our own model. In this section, we will train one GBM on the raw data and another on the transformed data, then compare both results to Driverless AI.

#### 1. Train H2O Model on Raw Data

In [96]:
# Install H2O
! pip install http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/1/Python/h2o-3.16.0.1-py2.py3-none-any.whl



In [100]:
# Import H2O and connect
import h2o
h2o.init(port = 54323)

Checking whether there is an H2O instance running at http://localhost:54323..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_151"; Java(TM) SE Runtime Environment (build 1.8.0_151-b12); Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
  Starting server from /h2oai_env/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpuc0yr3ej
  JVM stdout: /tmp/tmpuc0yr3ej/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpuc0yr3ej/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54323
Connecting to H2O server at http://127.0.0.1:54323... successful.


0,1
H2O cluster uptime:,01 secs
H2O cluster version:,3.16.0.1
H2O cluster version age:,8 days
H2O cluster name:,H2O_from_python_unknownUser_of43v7
H2O cluster total nodes:,1
H2O cluster free memory:,13.33 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4
H2O cluster status:,"accepting new members, healthy"
H2O connection url:,http://127.0.0.1:54323


In [106]:
train_h2o = h2o.import_file(train_path, col_types = {target: "enum"})
test_h2o = h2o.import_file(test_path, col_types = {target: "enum"})

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [109]:
from h2o.estimators import H2OGradientBoostingEstimator
gbm_raw = H2OGradientBoostingEstimator(model_id = "raw_data", 
                                       nfolds = 3, # 3-fold cross validation
                                       ntrees = 1000,   # more than enough trees
                                       ## early stopping once the validation AUC doesn't improve 
                                       ## by at least 0.1% for 3 consecutive scoring events
                                       stopping_rounds = 3, stopping_tolerance = 1e-3, stopping_metric = "AUC", 
                                       score_tree_interval = 10,
                                       seed = 1234)
gbm_raw.train(y = target, training_frame = train_h2o)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


#### 2. Train H2O Model on Transformed Data

In [110]:
def CreateTransformedH2OFrame(path, scorer, target):
    
    # Import Data
    data = pd.read_csv(path)
    
    # Transform Data with scoring package
    transformed_data = scorer.transform_batch(data)
    transformed_data = pd.concat([transformed_data, data[target]], axis = 1)
    
    # Convert to H2O Frame
    transformed_h2o = h2o.H2OFrame(transformed_data, column_types = {target:"enum"})
    
    return transformed_h2o

In [111]:
transformed_train_h2o = CreateTransformedH2OFrame(train_path, scorer, target)
transformed_test_h2o = CreateTransformedH2OFrame(test_path, scorer, target)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [113]:
gbm_transformed = H2OGradientBoostingEstimator(model_id = "transformed_data", 
                                               nfolds = 3, # 3-fold cross validation
                                               ntrees = 1000,   # more than enough trees 
                                               ## early stopping once the validation AUC doesn't improve 
                                               ## by at least 0.1% for 3 consecutive scoring events
                                               stopping_rounds = 3, stopping_tolerance = 1e-3, stopping_metric = "AUC",
                                               score_tree_interval = 10,
                                               seed = 1234)
gbm_transformed.train(y = target, training_frame = transformed_train_h2o)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


#### 3. Compare Model Performance

In [114]:
raw_data_auc = gbm_raw.model_performance(test_data = test_h2o).auc()
transformed_data_auc = gbm_transformed.model_performance(test_data = transformed_test_h2o).auc()
driverless_auc = experiment.test_score

In [115]:
print("Raw Data AUC: " + str(round(raw_data_auc, 4)))
print("Transformed Data AUC: " + str(round(transformed_data_auc, 4)))
print("Driverless AI AUC: " + str(round(driverless_auc, 4)))

Raw Data AUC: 0.7533
Transformed Data AUC: 0.6182
Driverless AI AUC: 0.7822
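
To make the comparison easier to read, the three scores can be ranked and shown with their gap to the best model. This is a small sketch that uses the AUC values printed above; the helper `rank_models` is illustrative only.

```python
# AUC values copied from the printed output above
auc = {"Raw Data": 0.7533, "Transformed Data": 0.6182, "Driverless AI": 0.7822}

def rank_models(scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

best_name, best_score = rank_models(auc)[0]
for name, score in rank_models(auc):
    print(f"{name}: {score:.4f} (gap to best: {best_score - score:.4f})")
```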
