## 1) Train a binary classification model

This flow is essentially the same as for a regression model, but the drum commands take additional arguments for the class labels (--positive-class-label and --negative-class-label).

In [1]:
# # Read the train and test data
# TRAIN_DATA_CLF = './data/surgical_dataset_train.csv'
# TEST_DATA_CLF = './data/surgical_dataset_test.csv'

# clf_X_train = pd.read_csv(TRAIN_DATA_CLF)
# clf_Y_train = clf_X_train.pop('complication')

# clf_X_test = pd.read_csv(TEST_DATA_CLF)

# # Fit the model
# clf_rf_model = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
# clf_rf_model.fit(clf_X_train, clf_Y_train)

# # Pickle the file and write it to the file system
# if not os.path.exists('custom_model_clf'):
#     os.makedirs('custom_model_clf')
# with open('custom_model_clf/clf_rf_model.pkl', 'wb') as pkl:
#     pickle.dump(clf_rf_model, pkl)
    
# # Call predict to confirm it works
# clf_rf_model.predict(clf_X_test)

# threshold = 0.3
# predicted_proba = clf_rf_model.predict_proba(clf_X_test)
# predicted = (predicted_proba [:,1] >= threshold).astype('int')
# predicted
# # accuracy_score(clf_Y_test, predicted)

In [11]:
"""
A test to see if I can pickle a model as a pipeline and not have issues with a custom model warning
"""
import os
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Read the train and test data
TRAIN_DATA_CLF = '../data/surgical_dataset_train.csv'
TEST_DATA_CLF = '../data/surgical_dataset_test.csv'

clf_X_train = pd.read_csv(TRAIN_DATA_CLF)
clf_Y_train = clf_X_train.pop('complication')

clf_X_test = pd.read_csv(TEST_DATA_CLF)

# Fit the model as a pipeline with an imputer
clf_rf_model = Pipeline([("imputer", SimpleImputer(missing_values=np.nan,
                                        strategy="constant",
                                        fill_value=0)),
                      ("forest", RandomForestClassifier(n_estimators=10,
                                        max_depth=2,
                                        random_state=0))])
clf_rf_model.fit(clf_X_train, clf_Y_train)

# Pickle the file and write it to the file system
if not os.path.exists('custom_model_clf'):
    os.makedirs('custom_model_clf')
with open('custom_model_clf/clf_rf_model_pipeline.pkl', 'wb') as pkl:
    pickle.dump(clf_rf_model, pkl)

# Call predict to confirm it works
clf_rf_model.predict(clf_X_test)

# Apply a custom decision threshold to the positive-class probability (column 1)
threshold = 0.3
predicted_proba = clf_rf_model.predict_proba(clf_X_test)
predicted = (predicted_proba[:, 1] >= threshold).astype('int')
predicted
# accuracy_score(clf_Y_test, predicted)


array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

## 2) Generate the model template file for any additional pipeline processing

The custom.py file is optional, but it lets you insert additional processing steps into the prediction flow. The following functions are available:

* init
* load_model
* transform
* score
* post_process

Place the file in the location specified by the --code-dir argument. For this example, null values must be imputed to 0, either by editing the transform function in custom.py or, as in the cell above, by including an imputer step in the pickled pipeline. See the comments in custom.py for a description of each function.
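
As a minimal sketch of the transform hook described above, the snippet below fills nulls with 0. The signature `transform(data, model)` is an assumption based on the generated template (check the comments in your own custom.py), and the column names in the sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def transform(data, model):
    """Replace missing values with 0 before the data reaches the model.

    Assumption: the hook receives the input as a pandas DataFrame plus the
    loaded model, and must return a DataFrame.
    """
    return data.fillna(0)


# Quick local check with a made-up frame (column names are illustrative only)
sample = pd.DataFrame({"age": [61.0, np.nan], "bmi": [np.nan, 27.4]})
cleaned = transform(sample, model=None)
print(cleaned)
```

If the pickled model is already a pipeline with an imputer step, this hook is redundant and custom.py can be left out.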

In [12]:
# Create a new directory with a custom.py template
# !drum new model --code-dir ./custom_model_clf/ --language python

## 3) Validate that the classification model can handle data with errors

The validation check takes the input file and alters it to test various failure conditions, such as setting column values to null. For this example, null values must be imputed to 0, either in the transform function in custom.py or by an imputer step inside the pickled pipeline.
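
A rough local analogue of the null-value check can help when debugging before running drum. The sketch below is not drum's implementation: it uses synthetic data with hypothetical column names (`f1`, `f2`, `f3`) and a helper, `null_column_check`, invented for this example. It nulls out one column at a time and confirms that the pipeline still returns finite probabilities.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; the notebook itself uses the surgical dataset
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y = (X["f1"] + X["f2"] > 0).astype(int)

# Same pipeline shape as the training cell: constant imputer, then forest
model = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan,
                              strategy="constant",
                              fill_value=0)),
    ("forest", RandomForestClassifier(n_estimators=10,
                                      max_depth=2,
                                      random_state=0)),
])
model.fit(X, y)


def null_column_check(model, X):
    """Null out one column at a time and record whether prediction succeeds."""
    results = {}
    for column in X.columns:
        broken = X.copy()
        broken[column] = np.nan  # simulate a fully missing feature
        try:
            proba = model.predict_proba(broken)
            results[column] = bool(np.isfinite(proba).all())
        except ValueError:
            results[column] = False
    return results


check_results = null_column_check(model, X)
print(check_results)
```

Removing the imputer step from the pipeline is a quick way to see what a failing check looks like.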

In [13]:
!drum validation --code-dir ./custom_model_clf --input ../data/surgical_dataset_test.csv  --positive-class-label 1 --negative-class-label 0

           1         0
0   0.043216  0.956784
1   0.264750  0.735250
2   0.274012  0.725988
3   0.173259  0.826741
4   0.278859  0.721141
5   0.286626  0.713374
6   0.338873  0.661127
7   0.278859  0.721141
8   0.278859  0.721141
9   0.281780  0.718220
10  0.043216  0.956784
11  0.264545  0.735455
12  0.043216  0.956784
13  0.256982  0.743018
14  0.243151  0.756849
15  0.415967  0.584033
16  0.286626  0.713374
17  0.043216  0.956784
18  0.345199  0.654801
19  0.432839  0.567161
           1         0
0   0.126457  0.873543
1   0.264750  0.735250
2   0.274012  0.725988
3   0.173259  0.826741
4   0.278859  0.721141
5   0.264113  0.735887
6   0.338873  0.661127
7   0.256345  0.743655
8   0.256345  0.743655
9   0.259267  0.740733
10  0.126457  0.873543
11  0.264545  0.735455
12  0.126457  0.873543
13  0.234469  0.765531
14  0.220638  0.779362
15  0.415967  0.584033
16  0.286626  0.713374
17  0.126457  0.873543
18  0.322686  0.677314
19  0.432839  0.567161
           1         0
0   0.04321

           1         0
0   0.043216  0.956784
1   0.256982  0.743018
2   0.274012  0.725988
3   0.173259  0.826741
4   0.278859  0.721141
5   0.256345  0.743655
6   0.331105  0.668895
7   0.256345  0.743655
8   0.256345  0.743655
9   0.251499  0.748501
10  0.043216  0.956784
11  0.264545  0.735455
12  0.043216  0.956784
13  0.234469  0.765531
14  0.220638  0.779362
15  0.415967  0.584033
16  0.278859  0.721141
17  0.043216  0.956784
18  0.322686  0.677314
19  0.425072  0.574928
           1         0
0   0.043216  0.956784
1   0.264750  0.735250
2   0.274012  0.725988
3   0.173259  0.826741
4   0.278859  0.721141
5   0.264113  0.735887
6   0.338873  0.661127
7   0.256345  0.743655
8   0.256345  0.743655
9   0.259267  0.740733
10  0.043216  0.956784
11  0.264545  0.735455
12  0.043216  0.956784
13  0.234469  0.765531
14  0.220638  0.779362
15  0.415967  0.584033
16  0.286626  0.713374
17  0.043216  0.956784
18  0.299554  0.700446
19  0.340870  0.659130
           1         0
0   0.04321

## 4) Test that the classification model can return predictions

Input the prediction dataset that includes all features except the target feature.

In [7]:
!drum score --code-dir ./custom_model_clf/ --input ../data/surgical_dataset_test.csv --positive-class-label 1 --negative-class-label 0 --output surgical_complications_test_results.csv --verbose

Detected score mode
Start initializing pipeline
No file detected at /Users/matthew.cohen/Documents/DR/MLOps/_DRUM local testing/DRUM notebook example/test - model with pipeline/custom_model_clf/custom.py
Start running pipeline

Component: csv_to_df
Language:  Python
Output:
------------------------------------------------------------
------------------------------------------------------------
Runtime:    0.0 sec
NR outputs: 1

Component: python_predictor
Language:  Python
Output:
------------------------------------------------------------
------------------------------------------------------------
Runtime:    0.0 sec
NR outputs: 1

Component: df_to_csv
Language:  Python
Output:
------------------------------------------------------------

## Testing model performance

Use this command to assess the model's response time for prediction requests.

In [8]:
!drum perf-test --code-dir ./custom_model_clf --input data/surgical_dataset_test.csv --positive-class-label 1 --negative-class-label 0

Preparing test data...

Running test case: 96 bytes - 1 samples, 100 iterations
Processing |################################| 100/100
Running test case: 0.1MB - 1091 samples, 50 iterations
Processing |################################| 50/50
Running test case: 10MB - 109113 samples, 5 iterations
Processing |################################| 5/5
Running test case: 50MB - 545566 samples, 1 iterations
Processing |################################| 1/1

  size     samples   iters    min     avg     max    used (MB)   total (MB)
96 bytes         1     100   0.062   0.068   0.099     118.672    16384.000
0.1MB         1091      50   0.104   0.118   0.232     122.848    16384.000
10MB        109113       5   0.959   1.013   1.167     200.574    16384.000
50MB        545566       1   4.641   4.641   4.641     472.871    16384.000
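
To make the min, avg, and max columns concrete, here is a small in-process timing sketch. It is not how drum measures performance: `time_calls` and `dummy_predict` are hypothetical stand-ins, and the numbers it prints will not match the table above.

```python
import statistics
import time


def time_calls(predict, payload, iterations):
    """Call predict repeatedly and summarize the wall-clock durations."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        predict(payload)
        durations.append(time.perf_counter() - start)
    return {
        "samples": len(payload),
        "iters": iterations,
        "min": min(durations),
        "avg": statistics.mean(durations),
        "max": max(durations),
    }


def dummy_predict(rows):
    """Placeholder model: returns a constant score for every row."""
    return [0.5 for _ in rows]


# Mirror the first two test cases from the output (sample counts and iterations)
report = [
    time_calls(dummy_predict, list(range(samples)), iterations)
    for samples, iterations in [(1, 100), (1091, 50)]
]
for row in report:
    print(row)
```

Swapping `dummy_predict` for a call such as `clf_rf_model.predict_proba` would time the pickled pipeline directly, without server overhead.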

## Prediction server mode

The command below launches drum as a server and blocks program flow. To test that it responds to prediction requests, issue this command in a terminal shell or another notebook environment:

curl -F "X=@./data/boston_housing_test.csv" localhost:6789/predict/
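
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: a multipart form upload with the file in field `X`, posted to `localhost:6789/predict/`. The helper names and the `text/csv` content type are assumptions made for this example.

```python
import urllib.request
import uuid


def build_multipart(field_name, filename, file_bytes):
    """Encode one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: text/csv\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    content_type = f"multipart/form-data; boundary={boundary}"
    return head + file_bytes + tail, content_type


def build_predict_request(csv_path, url="http://localhost:6789/predict/"):
    """Prepare (but do not send) a POST request equivalent to the curl call."""
    with open(csv_path, "rb") as handle:
        body, content_type = build_multipart("X", "data.csv", handle.read())
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

With the server running, `urllib.request.urlopen(build_predict_request("./data/boston_housing_test.csv")).read()` returns the raw response body.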

In [13]:
!drum server --code-dir ./custom_model_reg --address localhost:6789

^C


## Fit a model

https://github.com/datarobot/datarobot-user-models/blob/master/QUICKSTART-FOR-TRAINING.md

In [14]:
!drum fit --code-dir model_templates/training/python3_sklearn --target complication --input data/surgical_dataset_train.csv --positive-class-label 1 --negative-class-label 0

Validation Complete 🎉 Your model can be fit to your data, and predictions can be made on the fit model! 
You're ready to add it to DataRobot. 


## Running inside a Docker container