# $\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;$ Tryst with Data Robot with Quality Wine
### $\;\;\;\;\;\;\;\;\;$ Getting started with using DataRobot python API for multi-class classification
###### $\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;$ Shaheen Gauher, PhD 

<img src="https://raw.githubusercontent.com/shaheeng/DataRobot/master/pythonAPI/image_drblog.PNG">

Recently, I had the opportunity to explore [DataRobot](https://www.datarobot.com/). As a hardcore pythonic data scientist, I was curious about its capabilities and wondered if it would help expedite my work. DataRobot promises [automated machine learning](https://www.datarobot.com/product/automated-machine-learning/) wherein it chooses the most appropriate machine learning algorithms, automatically optimizes data preprocessing, applies feature engineering, and tunes parameters for each algorithm. It creates and ranks highly accurate [models](https://www.datarobot.com/wiki/model/) and recommends the best model to deploy for the data and prediction target. So when the opportunity to use DataRobot presented itself, I decided to give it a try. 

So far, I have been quite pleased with the functionality and was able to integrate it into my pythonic workflow. I especially like how it internalises all the complexity of AutoML and presents a clean interface to work with while maintaining transparency and explainability. I have consolidated my DataRobot exploration into this tutorial on ***Getting started with using DataRobot python API for multi-class classification***. Hopefully, this will help you understand how DataRobot works and expedite the onboarding process for your data science needs within your enterprise.

For this tutorial, I used the [wine quality data set](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the UCI Machine Learning Repository. The dataset contains quality ratings (labels) for 4,898 white wine samples. The features are the wines’ physical and chemical properties (11 predictors), and the goal is to predict the quality of a wine from these physicochemical properties. DataRobot expects training data in the form of a flat file, and the wine data set needed only minimal preparation. The full code for this tutorial, including the intermediate steps, is on [GitHub](https://github.com/shaheeng/DataRobot/blob/master/pythonAPI/Getting_started_with_using_DataRobot_pythonAPI.ipynb).

To access the DataRobot modeling engine, it is necessary to establish an authenticated [connection](https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.15.0/setup/getting_started.html). With the training data prepared and the connection established from my Python session, I will create three projects to showcase the three modes in which DataRobot can work on my data: the full autopilot mode, the quick autopilot mode and the manual mode. I will call the corresponding projects project_wine1, project_wine2, and project_wine3. Once the training process is complete, I will show how to retrieve the results from the various models and how to generate predictions on new data using a selected model.
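
For the connection step, a minimal sketch is shown below, assuming the credentials are kept in environment variables rather than typed into the notebook. `dr.Client(endpoint=..., token=...)` is the client's connection call; the helper names `client_settings` and `connect`, the environment variable names read here, and the default endpoint URL are choices made for this illustration, so substitute the values for your own installation.

```python
import os

def client_settings(env=None):
    """Collect the endpoint and API token from environment variables (names are an assumption)."""
    env = os.environ if env is None else env
    return {
        # assumed default; an on-premise installation will have its own URL
        "endpoint": env.get("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2"),
        # raises KeyError if the token has not been set
        "token": env["DATAROBOT_API_TOKEN"],
    }

def connect():
    """Open the authenticated connection used by all later dr.* calls."""
    import datarobot as dr  # imported here so client_settings works without the package installed
    return dr.Client(**client_settings())

# Example with an explicit mapping instead of the real environment
print(client_settings({"DATAROBOT_API_TOKEN": "my-token"})["endpoint"])
```

Calling `connect()` once at the top of the session should be enough; the project calls that follow reuse the connection.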

In [20]:
import pandas as pd
import numpy as np
from numpy import array
import datarobot as dr

# Prepare training and test data (saved in the dataframes data_train and data_test below)
# Establish connection

data_train.head(2)

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,target
4751,7.3,0.36,0.62,7.1,0.033,48.0,185.0,0.99472,3.14,0.62,10.6,Med
3047,7.7,0.18,0.53,1.2,0.041,42.0,167.0,0.9908,3.11,0.44,11.9,Low


The DataFrame data_train contains the input to the DataRobot modeling engine. We will use data_test to test the performance of the model from DataRobot.

# AUTOPILOT_MODE.FULL_AUTO

I will run the first project in full autopilot mode. Given a dataset, DataRobot starts by recommending a set of blueprints that are appropriate for the task at hand. A blueprint is a series of steps or computation paths that a dataset passes through before producing predictions from data. There can be multiple blueprints for the same algorithm, depending on the underlying preprocessing steps, and each blueprint can result in one or more models. A model is the result of training a blueprint on a dataset at a specified sample percentage, with a given set of features and hyperparameters.

In autopilot mode (which is also the default mode on DataRobot), the modeling process proceeds completely automatically, including running the recommended models at different sample sizes and blending models. For the wine dataset, DataRobot recommended 22 blueprints and created 28 models based on these blueprints in less than half an hour!

Recommendations depend on the size and complexity of the data. For other, larger datasets that I tried (~3 million rows and ~50 columns), I saw up to 60 recommended blueprints, and the run took a few hours to complete.



In [51]:
# Quickly Starting a project. Project.start method combines the project creation, file upload and target selection 
project_wine1 = dr.Project.start(data_train, project_name='shaheen_dr1',target="target", target_type=dr.enums.TARGET_TYPE.MULTICLASS)

# Get the recommended blueprints for this project
allblueprints = project_wine1.get_blueprints()
print(len(allblueprints))
print(allblueprints)

22
[Blueprint(Vowpal Wabbit Classifier), Blueprint(Vowpal Wabbit Stagewise Polynomial Classifier), Blueprint(eXtreme Gradient Boosted Trees Classifier), Blueprint(Gradient Boosted Trees Classifier), Blueprint(RandomForest Classifier (Gini)), Blueprint(Stochastic Gradient Descent Classifier), Blueprint(Stochastic Gradient Descent Classifier), Blueprint(Stochastic Gradient Descent Classifier), Blueprint(TensorFlow Logistic Regression), Blueprint(TensorFlow Neural Network Classifier), Blueprint(Light Gradient Boosted Trees Classifier with Early Stopping (SoftMax Loss) (16 leaves)), Blueprint(ExtraTrees Classifier (Gini)), Blueprint(TensorFlow Deep Learning Classifier), Blueprint(Decision Tree Classifier (Gini)), Blueprint(Stochastic Gradient Descent Classifier), Blueprint(Majority Class Classifier), Blueprint(Regularized Logistic Regression (L2)), Blueprint(TensorFlow Deep Learning Classifier), Blueprint(Gradient Boosted Greedy Trees Classifier), Blueprint(Vowpal Wabbit Low Rank Quadratic

Next I will compare all the models created using the Accuracy metric, select one model and examine its performance in detail. Then I will upload data_test from above to get predictions using this model.

In [13]:
# get_models method returns a list of the project models that have finished training:
allmodels = project_wine1.get_models()
print('Number of models created in full auto mode = ', len(allmodels))

# Accuracy of all models returned in full autopilot mode, best first
chosen_metric = 'Accuracy'  # Options are 'Accuracy', 'Balanced Accuracy', 'LogLoss' and 'AUC'
df_modelmetrics = pd.DataFrame(columns=['model_name', 'accuracy_crossValidation', 'accuracy_validation'])  # create empty df
for m, mth_model in enumerate(allmodels):
    # read both partitions from the same metric so the two columns are comparable
    scores = mth_model.metrics.get(chosen_metric)
    df_modelmetrics.loc[m] = [mth_model.model_type, scores.get('crossValidation'), scores.get('validation')]

# descending order suits metrics where larger is better; use ascending=True for 'LogLoss'
df_modelmetrics = df_modelmetrics.sort_values('accuracy_validation', ascending=False)
print(df_modelmetrics.shape)
df_modelmetrics.head(4)

Number of models created in full auto mode =  28
(28, 3)


Unnamed: 0,model_name,accuracy_crossValidation,accuracy_validation
2,ENET Blender,0.676558,0.6874
4,RandomForest Classifier (Gini),0.679748,0.68262
3,ENET Blender,0.675916,0.68102
7,Advanced AVG Blender,0.670178,0.67943
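
The ranking above sorts in descending order, which is right for Accuracy but would put the worst model first for LogLoss, where smaller is better. The sketch below shows one way to make the ordering depend on the metric. It assumes only that each model exposes `model_type` and a nested `metrics` dictionary as in the loop above; `rank_models`, `LOWER_IS_BETTER` and the toy models with their scores are made up for this illustration and are not DataRobot results.

```python
from types import SimpleNamespace

# Metrics where a smaller value means a better model; all others are treated as "larger is better"
LOWER_IS_BETTER = {"LogLoss"}

def rank_models(models, metric="Accuracy", partition="validation"):
    """Return (model_type, score) pairs sorted from best to worst for the given metric."""
    scored = []
    for m in models:
        score = (m.metrics.get(metric) or {}).get(partition)
        if score is not None:  # skip models that have no score for this partition
            scored.append((m.model_type, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=metric not in LOWER_IS_BETTER)

# Toy stand-ins for dr.Model objects, with invented scores
toy_models = [
    SimpleNamespace(model_type="model_a", metrics={"Accuracy": {"validation": 0.60}, "LogLoss": {"validation": 0.90}}),
    SimpleNamespace(model_type="model_b", metrics={"Accuracy": {"validation": 0.70}, "LogLoss": {"validation": 0.80}}),
    SimpleNamespace(model_type="model_c", metrics={"Accuracy": {"validation": 0.65}, "LogLoss": {"validation": None}}),
]

print(rank_models(toy_models, "Accuracy"))
print(rank_models(toy_models, "LogLoss"))
```

With real projects, `rank_models(project_wine1.get_models(), 'LogLoss')` would be the corresponding call.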


Next we will select the eXtreme Gradient Boosted Trees model, examine its metrics in detail and use it to predict for the test data.

In [52]:
# selecting a model, XGBoost for further examination
example_model = project_wine1.get_models(search_params={'name': "eXtreme Gradient Boosted" })[0]
# print blueprint chart for the selected model and look at the pre-processing steps.
print(example_model.get_model_blueprint_chart().to_graphviz())
print(example_model.sample_pct)

digraph "Blueprint Chart" {
graph [rankdir=LR]
0 [label="Data"]
-1 [label="Numeric Variables"]
1 [label="Missing Values Imputed"]
2 [label="eXtreme Gradient Boosted Trees Classifier"]
3 [label="Prediction"]
0 -> -1
-1 -> 1
1 -> 2
2 -> 3
}
64.0123


### Performance on Training Data  
Let's extract the confusion matrix and retrieve accuracy, precision and recall for the selected model.

In [12]:
# Get the Confusion Chart data for the selected model.
cms = example_model.get_all_confusion_charts()
print(cms)

[ConfusionChart(validation), ConfusionChart(crossValidation)]


In [47]:
# Extract the crossValidation results
crossValidation_results = cms[1]
raw_results = crossValidation_results.raw_data

# extract confusion matrix from raw results
cm = raw_results.get('confusion_matrix')

from numpy import array
cm = array(cm)
print(cm)

[[369  26 282]
 [ 27 689 333]
 [160 259 990]]


In [48]:
# Extract precision, recall and f1 for the three classes using the one-vs-all confusion matrix from raw_results
num_class = len(raw_results.get('class_metrics'))
df_examplemodelmetrics = pd.DataFrame(columns = ['Class', 'precision', 'recall', 'f1']) ##create empty df
for n in range(num_class):
    list_row = [raw_results.get('class_metrics')[n].get('class_name') , raw_results.get('class_metrics')[n].get('precision') ,\
               raw_results.get('class_metrics')[n].get('recall'), raw_results.get('class_metrics')[n].get('f1')]
    df_examplemodelmetrics.loc[n] = list_row

df_examplemodelmetrics

Unnamed: 0,Class,precision,recall,f1
0,High,0.663669,0.545052,0.59854
1,Low,0.707392,0.656816,0.681167
2,Med,0.616822,0.702626,0.656934
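
The per-class figures above can be reproduced by hand from the confusion matrix, which also confirms how the matrix is laid out. The sketch below assumes that rows are actual classes and columns are predicted classes, in the order High, Low, Med; `per_class_metrics` is a helper name introduced here and is not part of the DataRobot client.

```python
# Cross-validation confusion matrix printed earlier
# (assumed layout: rows = actual class, columns = predicted class)
classes = ["High", "Low", "Med"]
cm = [[369, 26, 282],
      [27, 689, 333],
      [160, 259, 990]]

def per_class_metrics(cm, classes):
    """Precision, recall and F1 for each class of a square confusion matrix."""
    predicted_totals = [sum(col) for col in zip(*cm)]  # column sums
    out = {}
    for i, name in enumerate(classes):
        true_positives = cm[i][i]
        precision = true_positives / predicted_totals[i]
        recall = true_positives / sum(cm[i])  # row sum = number of actual samples
        f1 = 2 * precision * recall / (precision + recall)
        out[name] = {"precision": precision, "recall": recall, "f1": f1}
    return out

metrics = per_class_metrics(cm, classes)
# Overall accuracy: correctly classified samples (the diagonal) over all samples
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for name in classes:
    print(name, {k: round(v, 6) for k, v in metrics[name].items()})
print("accuracy", round(accuracy, 4))
```

With this layout the results agree with the table above (for High, precision is 369/556 ≈ 0.664 and recall is 369/677 ≈ 0.545), and the overall cross-validation accuracy comes to 2048/3135 ≈ 0.653.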


### Performance on Test Data

Now we want to test the performance of the selected model on test data. We will upload the test data, start a predict job and retrieve the predictions.

In [18]:
# upload test dataset
data_test_dr = project_wine1.upload_dataset(data_test)
# start a predict job
predict_job = example_model.request_predictions(data_test_dr.id)
# retrieve the predictions when complete
predictions = predict_job.get_result_when_complete()
predictions.head(2)

Unnamed: 0,prediction,row_id,class_High,class_Low,class_Med
0,Med,0,0.222951,0.047465,0.729585
1,Med,1,0.174603,0.266344,0.559053


With the predictions retrieved as a data frame the performance metrics can be easily calculated using scikit-learn.
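
As a minimal sketch of that last step, the snippet below computes accuracy from prediction rows using plain Python so that it runs without extra packages; scikit-learn's `accuracy_score` should report the same number. It assumes that `row_id` is the position of the row in the uploaded test set. The two prediction rows are copied from the output above, while the true labels in `toy_truth` are hypothetical placeholders, and both helper names are introduced here for illustration.

```python
def accuracy_from_records(prediction_rows, true_labels):
    """Fraction of rows whose predicted label equals the true label.

    prediction_rows: dicts with 'row_id' and 'prediction' keys
    true_labels: true classes in the order the test rows were uploaded
    """
    hits = sum(1 for row in prediction_rows
               if row["prediction"] == true_labels[row["row_id"]])
    return hits / len(prediction_rows)

def most_probable_class(row, prefix="class_"):
    """Label whose class_<label> probability is the largest in a prediction row."""
    probs = {key[len(prefix):]: value for key, value in row.items() if key.startswith(prefix)}
    return max(probs, key=probs.get)

# The two rows shown in predictions.head(2)
rows = [
    {"prediction": "Med", "row_id": 0, "class_High": 0.222951, "class_Low": 0.047465, "class_Med": 0.729585},
    {"prediction": "Med", "row_id": 1, "class_High": 0.174603, "class_Low": 0.266344, "class_Med": 0.559053},
]
toy_truth = ["Med", "Low"]  # hypothetical true labels, for illustration only

print(accuracy_from_records(rows, toy_truth))
print([most_probable_class(r) for r in rows])
```

For the real data, `predictions.to_dict("records")` and `data_test["target"].tolist()` would supply the two arguments.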

### Feature Impact

DataRobot also computes [Feature Impact](https://www.datarobot.com/wiki/feature-impact/), a measure of the relevance of each feature in the model. A prerequisite to computing prediction explanations is that you need to compute the feature impact for your model (this only needs to be done once per model).

In [19]:
# Features used in the selected model
print(example_model.get_features_used())
# compute feature impact for selected model
feature_impacts = example_model.get_or_request_feature_impact()
df_feature = pd.DataFrame(feature_impacts).sort_values(by=['impactNormalized'], ascending=False)
df_feature.head(2)

['alcohol', 'chlorides', 'citric_acid', 'density', 'fixed_acidity', 'free_sulfur_dioxide', 'pH', 'residual_sugar', 'sulphates', 'target', 'total_sulfur_dioxide', 'volatile_acidity']


Unnamed: 0,featureName,impactNormalized,impactUnnormalized,redundantWith
0,alcohol,1.0,0.306579,
1,volatile_acidity,0.494697,0.151664,


Feature Impact is computed by modifying the data for one feature at a time (as I understand it, by randomly permuting that feature's column while leaving the others unchanged) and checking how the predictions are affected. The ‘impactUnnormalized’ is how much worse the error metric score is when making predictions on this modified data. The ‘impactNormalized’ is normalized so that the largest value is 1. In both cases, larger values indicate more important features. If a feature is redundant, i.e. once other features are considered it doesn’t contribute much in addition, the ‘redundantWith’ value is the name of the feature that has the highest correlation with it. Along with Feature Impact, DataRobot can also compute [prediction explanations](https://www.datarobot.com/wiki/prediction-explanations/) for every row of the dataset. Note, however, that this functionality is not available for multiclass models.
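
To make the two impact columns concrete, here is a small sketch of a permutation-style impact calculation on a toy model. It is an illustration of the general idea, not DataRobot's implementation: the error metric is a simple error rate, the "permutation" is a fixed rotation so that the result is reproducible, and the toy model, data and helper names are all invented for the example.

```python
def permutation_impact(predict, rows, labels, features):
    """Increase in error rate when each feature column is permuted in turn."""
    def error_rate(data):
        wrong = sum(1 for row, label in zip(data, labels) if predict(row) != label)
        return wrong / len(labels)

    baseline = error_rate(rows)
    impact = {}
    for feature in features:
        column = [row[feature] for row in rows]
        shifted = column[1:] + column[:1]  # fixed rotation: deterministic stand-in for a random shuffle
        permuted = [dict(row, **{feature: value}) for row, value in zip(rows, shifted)]
        impact[feature] = error_rate(permuted) - baseline
    return impact

def normalize(impact):
    """Scale impacts so that the largest value is 1 (assumes at least one positive impact)."""
    top = max(impact.values())
    return {feature: value / top for feature, value in impact.items()}

# Toy model that only looks at alcohol, so 'noise' should have no impact
predict = lambda row: "High" if row["alcohol"] > 11 else "Low"
rows = [{"alcohol": a, "noise": n}
        for a, n in zip([9.0, 13.0, 10.0, 12.0, 9.5, 12.5], [1, 2, 3, 4, 5, 6])]
labels = [predict(row) for row in rows]

toy_impact = permutation_impact(predict, rows, labels, ["alcohol", "noise"])
print(toy_impact, normalize(toy_impact))

# The same normalization applied to the two unnormalized values in the table above
print(normalize({"alcohol": 0.306579, "volatile_acidity": 0.151664}))
```

The last line reproduces the relationship between the two columns in the table: dividing 0.151664 by the largest impact, 0.306579, gives approximately 0.4947.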

# AUTOPILOT_MODE.QUICK

The quick mode (AUTOPILOT_MODE.QUICK) runs a more limited set of models rather than the full recommended set of blueprints, so that insights arrive sooner. In quick mode, DataRobot built 8 models for my data from the 22 blueprints. We can select one of the models, just as I showed above, to look at its performance and to get predictions for the test data. In quick mode DataRobot does not make any model recommendations.

In [23]:
project_wine2 = dr.Project.create(data_train, project_name='shaheen_dr2')
project_wine2.set_target('target', target_type=dr.enums.TARGET_TYPE.MULTICLASS, mode=dr.AUTOPILOT_MODE.QUICK)
project_wine2.wait_for_autopilot()
print('Number of blueprints = ',len(project_wine2.get_blueprints()))
print('Number of models created in quick mode = ',len(project_wine2.get_models()))

Number of blueprints =  22
Number of models created in quick mode =  8


# AUTOPILOT_MODE.MANUAL

In the manual mode (AUTOPILOT_MODE.MANUAL), we can select which models to execute before starting the modeling process rather than use the DataRobot autopilot. We will select the 'eXtreme Gradient Boosted Trees Classifier' blueprint and train a model using this blueprint. We can look at the performance of the manual model created in more detail as I showed above and use it for getting predictions for the test data.

In [None]:
project_wine3 = dr.Project.create(data_train, project_name='shaheen_dr3')
project_wine3.set_target(target='target',target_type=dr.enums.TARGET_TYPE.MULTICLASS, mode=dr.AUTOPILOT_MODE.MANUAL)

No models are created at the start in manual mode. To create a model, we will specify a blueprint to use below.

In [26]:
print('Number of blueprints = ',len(project_wine3.get_blueprints()))
print('Number of models created in manual mode = ',len(project_wine3.get_models()))

Number of blueprints =  22
Number of models created in manual mode =  0


In [23]:
import re
from datarobot.models.modeljob import wait_for_async_model_creation

# Select one of the blueprint instances from above to train a model. The default dataset for the project is used.
allblueprints = project_wine3.get_blueprints()
# select the eXtreme Gradient Boosted Trees Classifier blueprint
blueprint_manual = [x for x in allblueprints if re.search("eXtreme Gradient Boosted", str(x))][0]
# Train a model using 80% of the data. The train method puts a new modeling job into the queue and returns the id of the created ModelJob.
manualmodel_job_id = project_wine3.train(blueprint_manual, sample_pct=80)

# Wait for the queued job to finish (get_models only lists finished models), then keep the model for further investigation
manualmodel = wait_for_async_model_creation(project_id=project_wine3.id, model_job_id=manualmodel_job_id)
print('Manual model created = ', manualmodel.model_type)

Manual model created =  eXtreme Gradient Boosted Trees Classifier


# Delete and manage projects

It is always good to clean up afterwards!

In [3]:
allmyprojects = dr.Project.list()
print('all my projects - ',allmyprojects)
# load the project 
project_to_delete = allmyprojects[0]
print('project to delete - ',project_to_delete)
# delete the selected project
project_to_delete.delete()

all my projects -  [Project(shaheen_dr3), Project(shaheen_dr2), Project(shaheen_dr1)]
project to delete -  Project(shaheen_dr3)
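
Deleting `allmyprojects[0]` removes whichever project happens to be listed first, so on a shared account it can be safer to pick projects by name. The sketch below filters on the `project_name` attribute; `projects_named` is a helper introduced here, and the toy objects (including the name `other_team_project`) are hypothetical stand-ins for what `dr.Project.list()` returns.

```python
from types import SimpleNamespace

def projects_named(projects, prefix):
    """Projects whose name starts with the given prefix."""
    return [p for p in projects if p.project_name.startswith(prefix)]

# Stand-ins for the Project objects returned by dr.Project.list()
toy_projects = [SimpleNamespace(project_name=name)
                for name in ("shaheen_dr3", "other_team_project", "shaheen_dr1")]
selected = projects_named(toy_projects, "shaheen_dr")
print([p.project_name for p in selected])

# With a live connection this would become:
# for p in projects_named(dr.Project.list(), "shaheen_dr"):
#     p.delete()
```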


I hope this tutorial was helpful in understanding how DataRobot works and that it makes it easier to integrate its rich functionality into your pythonic workflow.

About the Author:

Shaheen Gauher is an AI communicator, an intelligent solution enabler and a Data Scientist by profession. She helps enterprises build and deploy predictive solutions to best leverage their data and empowers them to achieve more through technology and AI. She is a climate scientist and physicist by training and serves on the advisory board for Data Analytics at Tufts University Graduate School of Arts and Sciences. Find me on Twitter, @Shaheen_Gauher.
