### Step 0: Specify configuration
- project id, data file
- configuration: which models to use

### Step 1: Download stacked predictions from DataRobot
(from the models selected using the heuristics configured in Step 0)

### Step 2: Merge these predictions together, along with the target and partitioning

### Step 3: Run non-negative least squares
This is a constrained optimization problem in which every coefficient must be non-negative, that is, positive or 0 (as the name implies).
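Written out, the problem looks roughly as follows. The notation is introduced here for illustration only: $X$ is the matrix whose columns are the models' stacked predictions, $y$ is the target, and $\beta$ holds one coefficient per model.

$$
\hat{\beta} = \arg\min_{\beta} \; \lVert X\beta - y \rVert_2^2
\quad \text{subject to} \quad \beta_j \ge 0 \;\; \text{for all } j
$$

Models whose $\hat{\beta}_j$ comes out as exactly 0 are dropped; the rest go forward to Step 4.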

### Step 4: Take the non-zero coefficients and build blenders
I've seen this work well in my own research projects. It can be applied to both regression and classification projects (for classification problems, the target must be converted to an integer). The advantage is that it's fast and doesn't require making a new project. Ideally, the solution would automatically produce the right balance between diversity and accuracy.

You could also skip the blenders and compute predictions directly from the coefficient calculated for each model (this approach is the default for the SuperLearner package: https://www.rdocumentation.org/packages/SuperLearner/versions/2.0-24/topics/SuperLearner).
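The direct-weighting alternative mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual predictions: `X_demo`, `y_demo`, the three model names, and the choice to rescale the weights so they sum to 1 are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic stand-in for stacked predictions: 200 rows, 3 hypothetical models.
y_demo = rng.integers(0, 2, size=200).astype(float)
X_demo = np.column_stack([
    np.clip(y_demo * 0.6 + rng.normal(0.2, 0.2, 200), 0, 1),  # informative model
    np.clip(y_demo * 0.4 + rng.normal(0.3, 0.3, 200), 0, 1),  # weaker model
    rng.uniform(0, 1, 200),                                   # pure-noise model
])

# nnls returns (coefficients, residual norm); every coefficient is >= 0.
coefs, residual_norm = nnls(X_demo, y_demo)

# Assumption: rescale so the weights sum to 1, giving a convex combination.
weights = coefs / coefs.sum()

# The blended prediction is the weighted sum of the per-model predictions.
blended = X_demo @ weights
print(dict(zip(["model_a", "model_b", "model_c"], weights.round(3))))
```

Because the weights are non-negative and sum to 1, the blended value stays within the range of the individual predictions, which is convenient when those predictions are probabilities.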

In [1]:
import datarobot as dr
import pandas as pd
import numpy as np
dr.__version__

'2.17.0'

In [2]:
# Which models do we use?
limit_to_already_cv_only = False  # only use models that already ran CV
limit_to_top_models = True  # limit to top X models
num_top_models = 30  # only use top 30 models

In [3]:
# https://datarobot.atlassian.net/wiki/spaces/CFDS/pages/96239700/Predictive+Maintenance+-+Binary+Classification

training_df = pd.read_csv("https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_Pred_Main_Bin.csv")
training_df.sample(5)

Unnamed: 0,material_id,rig_plant,qty_replaced,m_weight,material_type,material_group,surface_matl,has_coatings,has_documents,has_matlspecs,...,has_documents.1,has_matlspecs.1,has_weldspecs.1,has_qspecs.1,area1,area2,area3,area4,part_desc,Date
9151,M169747,ZW1S0,1,5150.0,HALB,F-T03-RUB,True,False,True,False,...,False,True,False,False,57,19,483,81,fail suction bodies rapid front while along je...,5-2-2012
1346,M1111115528,ZW1S0,1,1421.0,HALB,99,True,True,False,False,...,False,False,False,False,178,35,71,21,internal vwall isolation growing protects Casi...,7-21-2012
863,M1111114451,ZW1S0,0,116250.0,HALB,99,True,True,True,False,...,False,True,False,False,1,1,496,86,mobile depths micropiling accessibility northe...,7-4-2012
6671,M1111181242,ZW1S0,0,68100.0,HALB,A-A05-SWA,False,False,True,False,...,False,False,False,False,1,1,506,88,mec hercules number decrease processing arms g...,2-20-2012
7152,M1111189881,ZW1S0,0,140500.0,HALB,A-A05-SWA,True,False,True,False,...,False,True,False,False,175,35,161,36,transmitting option mowing interlocked functio...,10-4-2012


In [4]:
# Run autopilot

dr.Client(config_path="/Users/taylor.larkin/.config/datarobot/drconfig.yaml")

project = dr.Project.start(project_name="BBB NNLS Pred Main", sourcedata=training_df,
                           target="qty_replaced", worker_count=-1)
project.wait_for_autopilot()

In progress: 20, queued: 17 (waited: 0s)
In progress: 20, queued: 17 (waited: 1s)
In progress: 20, queued: 17 (waited: 1s)
In progress: 20, queued: 17 (waited: 2s)
In progress: 20, queued: 17 (waited: 4s)
In progress: 20, queued: 17 (waited: 6s)
In progress: 19, queued: 17 (waited: 10s)
In progress: 20, queued: 16 (waited: 17s)
In progress: 20, queued: 14 (waited: 35s)
In progress: 8, queued: 14 (waited: 56s)
In progress: 3, queued: 14 (waited: 76s)
In progress: 3, queued: 13 (waited: 96s)
In progress: 10, queued: 5 (waited: 117s)
In progress: 14, queued: 1 (waited: 138s)
In progress: 11, queued: 0 (waited: 158s)
In progress: 2, queued: 0 (waited: 178s)
In progress: 2, queued: 0 (waited: 199s)
In progress: 1, queued: 0 (waited: 219s)
In progress: 13, queued: 4 (waited: 239s)
In progress: 17, queued: 0 (waited: 259s)
In progress: 17, queued: 0 (waited: 280s)
In progress: 12, queued: 0 (waited: 300s)
In progress: 4, queued: 0 (waited: 320s)
In progress: 2, queued: 0 (waited: 341s)
In pro

In [5]:
print("Project Optimization Metric: " + project.metric)
print("Project Target: " + project.target)
total_rows = training_df.shape[0]
print(f'Total Training Rows: {total_rows}')

Project Optimization Metric: LogLoss
Project Target: qty_replaced
Total Training Rows: 10000


In [6]:
# assemble scoring data and attributes
# order by cv score, then by validation score

model_scores = pd.DataFrame(
    [[model.metrics[project.metric]['crossValidation'],
      model.metrics[project.metric]['validation'],
      model,
      model.training_row_count,
      model.model_category] for model in project.get_models(with_metric=project.metric)],
    columns=['cv', 'v', 'model', 'rows', 'category']).sort_values(['cv', 'v'],
                                                                 na_position='last')
# TODO: add ability to detect need for reverse sorting for metrics like AUC
model_scores

Unnamed: 0,cv,v,model,rows,category
0,0.490946,0.48357,Model('RandomForest Classifier (Entropy)'),8000,model
1,0.496986,0.49653,Model('Advanced GLM Blender'),6400,blend
3,0.497130,0.49869,Model('GLM Blender'),6400,blend
4,0.499236,0.49949,Model('AVG Blender'),6400,blend
2,0.499308,0.49748,Model('ENET Blender'),6400,blend
6,0.499442,0.50133,Model('Advanced AVG Blender'),6400,blend
5,0.500212,0.49986,Model('RandomForest Classifier (Entropy)'),6400,model
7,0.503054,0.50536,Model('RandomForest Classifier (Entropy)'),6400,model
8,0.528438,0.52774,Model('eXtreme Gradient Boosted Trees Classifi...,6400,model
11,0.528728,0.53113,Model('eXtreme Gradient Boosted Trees Classifi...,6400,model


In [7]:
# Eliminate blenders, prime models, and scaleout models.
# Eliminate anything not at the max training percentage of the project
# If we are not training new CV models, eliminate the ones that haven't run CV yet.
# If we are limiting to top models, select the top X models from remaining ones,
#     otherwise select all of them.

# Base filter: non-blender models trained at the project's max training size
mask = ((model_scores['category'] == 'model')
        & (model_scores['rows'] == project.max_train_rows))
if limit_to_already_cv_only:
    # Additionally require an existing cross-validation score
    mask &= model_scores['cv'].notna()
models_to_use = model_scores.loc[mask, 'model'].tolist()
if limit_to_top_models:
    models_to_use = models_to_use[:num_top_models]
models_to_use

[Model('RandomForest Classifier (Entropy)'),
 Model('RandomForest Classifier (Entropy)'),
 Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping'),
 Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping'),
 Model('eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features'),
 Model('RandomForest Classifier (Gini)'),
 Model('Light Gradient Boosted Trees Classifier with Early Stopping'),
 Model('ExtraTrees Classifier (Gini)'),
 Model('Gradient Boosted Greedy Trees Classifier with Early Stopping'),
 Model('Auto-Tuned Word N-Gram Text Modeler using token occurrences - part_desc')]

In [8]:
# kick off all training predictions jobs if they haven't already been run
train_pred_jobs = list()
for model in models_to_use:
    if model.metrics[project.metric]['crossValidation'] is None:  # if CV has not been run, try to run CV
        try:
            model.cross_validate()
            print('Cross Validation begun for model ' + model.id)
        except Exception as e:
            print('CV error: ', str(e))
    try:
        train_pred_jobs.append(model.request_training_predictions(dr.enums.DATA_SUBSET.ALL))
        print('Training Predictions begun for model ' + model.id)
    except Exception as e:
        if str(e) == "422 client error: {'message': 'Training predictions request already submitted for these parameters'}":
            print('Predictions already kicked off for model ' + model.id)
        else:
            print(str(e))
print('\nAll training prediction jobs kicked off')

Training Predictions begun for model 5d827edd95d03038b679d17e
Training Predictions begun for model 5d82807e95d0303f5c79d185
Training Predictions begun for model 5d827edd95d03038b679d182
Training Predictions begun for model 5d827edd95d03038b679d184
Training Predictions begun for model 5d827edd95d03038b679d180
Training Predictions begun for model 5d827edd95d03038b679d186
Training Predictions begun for model 5d827edd95d03038b679d188
Training Predictions begun for model 5d827edd95d03038b679d18a
Training Predictions begun for model 5d827edd95d03038b679d18c
Cross Validation begun for model 5d827edd95d03038b679d17d
Training Predictions begun for model 5d827edd95d03038b679d17d

All training prediction jobs kicked off


In [9]:
model_ids = [model.id for model in models_to_use]
# create a mapping from model ids to names, which start with model id and end with model_type,
#   like "5d6edd3578132c4e72cd27df eXtreme Gradient Boosted Trees Classifier with Early Stopping - Forest (10x)"
# I set them in this order so that the model id would be copyable, since tooltips are not. I would prefer using
#   "M127 eXtreme Gradient Boosted Trees Classifier with Early Stopping - Forest (10x)" but the M### are not available
model_ids_to_names = {model.id: model.id + ' ' + model.model_type
                      for model in models_to_use}

In [10]:
for train_pred_job in train_pred_jobs:
    train_pred_job.get_result_when_complete()
print('All training prediction jobs complete!')

All training prediction jobs complete!


In [11]:
tp_list = dr.TrainingPredictions.list(project_id=project.id)
predictions = [(model_ids_to_names[tp.model_id], tp.get_all_as_dataframe())
               for tp in tp_list if tp.model_id in model_ids and tp.data_subset == 'all']
predictions

[('5d827edd95d03038b679d180 eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features',
        row_id partition_id  prediction  class_1.0  class_0.0
  0          0      Holdout         1.0   0.741989   0.258011
  1          1          1.0         1.0   0.822867   0.177133
  2          2          1.0         1.0   0.618867   0.381133
  3          3          4.0         1.0   0.825805   0.174195
  4          4          0.0         1.0   0.553553   0.446447
  5          5          2.0         1.0   0.603436   0.396564
  6          6      Holdout         1.0   0.742929   0.257071
  7          7          2.0         1.0   0.666791   0.333209
  8          8          4.0         1.0   0.565023   0.434977
  9          9      Holdout         1.0   0.529202   0.470798
  10        10          2.0         0.0   0.471382   0.528618
  11        11      Holdout         1.0   0.567357   0.432643
  12        12      Holdout         1.0   0.540411   0.459589
  13 

In [12]:
frames = []
for model, tp_df in predictions:
    tp_df['Model'] = model
    frames.append(tp_df)
# DataFrame.append was removed in pandas 2.0; concatenate once instead
pred_df = pd.concat(frames, sort=True)

In [13]:
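# quick sanity check to make sure partitions are consistent:
# every (partition_id, row_id) pair should appear once per model, so compare
# against len(predictions); a stale hard-coded count such as 14 returns False
all(pred_df.groupby(['partition_id', 'row_id']).agg({'Model': 'count'}).reset_index()['Model'] == len(predictions))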

False
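A slightly more explicit version of this check is sketched below on a small synthetic frame shaped like `pred_df`. The frame `demo_pred_df` and its values are made up for illustration. With real training predictions, `partition_id` mixes strings and numbers, so casting it to `str` first may be needed.

```python
import pandas as pd

# Synthetic long-format frame in the shape of pred_df: one row per (model, row_id).
demo_pred_df = pd.DataFrame({
    "Model": ["m1", "m1", "m1", "m2", "m2", "m2"],
    "row_id": [0, 1, 2, 0, 1, 2],
    "partition_id": ["Holdout", "1.0", "2.0", "Holdout", "1.0", "2.0"],
})

n_models = demo_pred_df["Model"].nunique()

# Each row_id should map to exactly one partition across all models...
partitions_per_row = demo_pred_df.groupby("row_id")["partition_id"].nunique()
# ...and should have one prediction from every model.
models_per_row = demo_pred_df.groupby("row_id")["Model"].nunique()

partitions_consistent = bool((partitions_per_row == 1).all()
                             and (models_per_row == n_models).all())
print(partitions_consistent)  # True for this synthetic frame
```

Splitting the check in two makes it easier to see whether a failure comes from disagreeing partitions or from a model with missing rows.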

In [14]:
# merge all the predictions...
merged_df = pd.DataFrame()
merged_df['target'] = training_df[project.target]
first = True
for model, tp_df in predictions:
    if first:
        merged_df['partition_id'] = tp_df['partition_id']
        first = False
    if project.target_type != "Regression":
        # Grab first class probabilities
        merged_df[model] = tp_df.iloc[:,3]
    else:
        # Grab predictions
        merged_df[model] = tp_df.iloc[:,2]   
    print(model + ' predictions downloaded')
merged_df

5d827edd95d03038b679d180 eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features predictions downloaded
5d827edd95d03038b679d182 eXtreme Gradient Boosted Trees Classifier with Early Stopping predictions downloaded
5d827edd95d03038b679d17d Auto-Tuned Word N-Gram Text Modeler using token occurrences - part_desc predictions downloaded
5d827edd95d03038b679d186 RandomForest Classifier (Gini) predictions downloaded
5d82807e95d0303f5c79d185 RandomForest Classifier (Entropy) predictions downloaded
5d827edd95d03038b679d17e RandomForest Classifier (Entropy) predictions downloaded
5d827edd95d03038b679d184 eXtreme Gradient Boosted Trees Classifier with Early Stopping predictions downloaded
5d827edd95d03038b679d188 Light Gradient Boosted Trees Classifier with Early Stopping predictions downloaded
5d827edd95d03038b679d18a ExtraTrees Classifier (Gini) predictions downloaded
5d827edd95d03038b679d18c Gradient Boosted Greedy Trees Classifier with Early Stopping p

Unnamed: 0,target,partition_id,5d827edd95d03038b679d180 eXtreme Gradient Boosted Trees Classifier with Early Stopping and Unsupervised Learning Features,5d827edd95d03038b679d182 eXtreme Gradient Boosted Trees Classifier with Early Stopping,5d827edd95d03038b679d17d Auto-Tuned Word N-Gram Text Modeler using token occurrences - part_desc,5d827edd95d03038b679d186 RandomForest Classifier (Gini),5d82807e95d0303f5c79d185 RandomForest Classifier (Entropy),5d827edd95d03038b679d17e RandomForest Classifier (Entropy),5d827edd95d03038b679d184 eXtreme Gradient Boosted Trees Classifier with Early Stopping,5d827edd95d03038b679d188 Light Gradient Boosted Trees Classifier with Early Stopping,5d827edd95d03038b679d18a ExtraTrees Classifier (Gini),5d827edd95d03038b679d18c Gradient Boosted Greedy Trees Classifier with Early Stopping
0,1,Holdout,0.741989,0.765262,0.602620,0.762088,0.804613,0.804282,0.742621,0.780468,0.689931,0.550729
1,1,1.0,0.822867,0.763222,0.589188,0.653049,0.681062,0.766236,0.776252,0.791147,0.508981,0.402087
2,1,1.0,0.618867,0.633534,0.604113,0.502531,0.477713,0.493013,0.609539,0.472570,0.531060,0.485972
3,1,4.0,0.825805,0.803723,0.587299,0.749289,0.620631,0.747209,0.802483,0.624525,0.793556,0.704208
4,0,0.0,0.553553,0.602726,0.602620,0.659419,0.744921,0.630837,0.623506,0.506363,0.547745,0.376269
5,1,2.0,0.603436,0.669207,0.578986,0.685361,0.511040,0.525357,0.578789,0.534684,0.584909,0.516532
6,1,Holdout,0.742929,0.647193,0.611696,0.652936,0.646609,0.688198,0.759669,0.776722,0.580641,0.492947
7,1,2.0,0.666791,0.567625,0.545774,0.704558,0.757033,0.759624,0.604511,0.654219,0.590040,0.414814
8,1,4.0,0.565023,0.606439,0.552854,0.541847,0.628526,0.600222,0.495165,0.444140,0.522509,0.409952
9,1,Holdout,0.529202,0.447255,0.618847,0.479857,0.408894,0.417302,0.490429,0.433025,0.668792,0.536973
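The merge above relies on column position (`iloc[:, 3]`) and on every prediction frame being in the same row order as the training data. A more defensive variant, sketched here on synthetic data, keys on `row_id` and selects the probability column by name. It assumes that `row_id` matches the positional index of the training data and that the positive-class column is called `class_1.0`, as in the output shown earlier. The `demo_*` names are hypothetical.

```python
import pandas as pd

# Synthetic stand-ins for the training target and two models' training predictions.
demo_training = pd.DataFrame({"target": [1, 0, 1]})
demo_predictions = [
    ("model_a", pd.DataFrame({"row_id": [2, 0, 1],  # deliberately shuffled
                              "partition_id": ["Holdout", "0.0", "1.0"],
                              "prediction": [1.0, 0.0, 1.0],
                              "class_1.0": [0.9, 0.2, 0.7],
                              "class_0.0": [0.1, 0.8, 0.3]})),
    ("model_b", pd.DataFrame({"row_id": [0, 1, 2],
                              "partition_id": ["0.0", "1.0", "Holdout"],
                              "prediction": [0.0, 1.0, 1.0],
                              "class_1.0": [0.3, 0.6, 0.8],
                              "class_0.0": [0.7, 0.4, 0.2]})),
]

demo_merged = demo_training.copy()
for i, (name, tp) in enumerate(demo_predictions):
    tp = tp.set_index("row_id")
    if i == 0:
        # Column assignment aligns on the index, so row order no longer matters.
        demo_merged["partition_id"] = tp["partition_id"]
    demo_merged[name] = tp["class_1.0"]

print(demo_merged)
```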


In [15]:
# Prepping data for non-negative least squares

y = merged_df.pop("target")

# In case of a classification problem
if project.target_type != "Regression":
    y = y.astype("category").cat.codes

X = merged_df.drop("partition_id", axis = 1)
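Note that `partition_id` is dropped here without being used, so holdout rows also enter the least-squares fit. If the holdout should stay untouched for final evaluation, one option is to filter those rows out first. The sketch below uses a made-up frame (`demo_merged`) and assumes holdout rows are labeled `Holdout`, as in the output above.

```python
import pandas as pd

# Synthetic frame in the shape of merged_df, with mixed-type partition labels.
demo_merged = pd.DataFrame({
    "target": [1, 0, 1, 0],
    "partition_id": ["Holdout", 0.0, 1.0, "Holdout"],
    "model_a": [0.9, 0.2, 0.7, 0.4],
    "model_b": [0.8, 0.3, 0.6, 0.5],
})

# Keep only rows that belong to a cross-validation fold.
in_cv = demo_merged["partition_id"].astype(str) != "Holdout"
fit_df = demo_merged.loc[in_cv]

y_fit = fit_df["target"]
X_fit = fit_df.drop(columns=["target", "partition_id"])
print(X_fit.shape)  # (2, 2)
```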

In [16]:
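# Perform non-negative least squares to select models
# (each model's stacked predictions act as one feature)

from scipy.optimize import nnls

# nnls returns a tuple: (coefficient vector, residual norm)
nnls_opt = nnls(A=X.values, b=y)

# Selected models: those with a non-zero coefficient
selected_models = list(X.columns[np.flatnonzero(nnls_opt[0])])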
selected_models

['5d82807e95d0303f5c79d185 RandomForest Classifier (Entropy)',
 '5d827edd95d03038b679d17e RandomForest Classifier (Entropy)',
 '5d827edd95d03038b679d188 Light Gradient Boosted Trees Classifier with Early Stopping',
 '5d827edd95d03038b679d18a ExtraTrees Classifier (Gini)',
 '5d827edd95d03038b679d18c Gradient Boosted Greedy Trees Classifier with Early Stopping']

In [17]:
# Make some blends

# Grab ids
model_ids_to_blend = [x.split(" ", 1)[0] for x in selected_models]

# Start blending
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.AVERAGE)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.MEDIAN)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.ENET)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.GLM)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.PLS)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.TENSORFLOW)

ModelJob(TF Blender, status=inprogress)

In [18]:
project.open_leaderboard_browser()

True