### Step 0: Specify configuration
- project id, data file
- configuration: which models to use

### Step 1: Download stacked predictions from DataRobot
(from models selected using specified heuristics)

### Step 2: Merge these predictions together, along with the target and partitioning

### Step 3: Run non-negative least squares
This is a constrained optimization problem in which every coefficient must be non-negative, i.e. positive or 0 (as the name implies).
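
As a minimal sketch of what this step does, the toy example below fits weights to three made-up prediction columns with `scipy.optimize.nnls`. The data and the column names (`good_model`, `noisy_model`, `inverted_model`) are invented for illustration and have nothing to do with the project in this notebook.

```python
import numpy as np
from scipy.optimize import nnls

# Toy data (hypothetical): a target and three columns of "model predictions".
rng = np.random.default_rng(0)
target = rng.normal(10.0, 3.0, size=200)
good_model = target + rng.normal(0.0, 1.0, size=200)   # close to the target
noisy_model = target + rng.normal(0.0, 3.0, size=200)  # same signal, more noise
inverted_model = -target                               # would need a negative weight

A = np.column_stack([good_model, noisy_model, inverted_model])

# nnls solves: minimize ||A @ w - target||_2 subject to w >= 0
weights, residual_norm = nnls(A, target)

print(dict(zip(["good", "noisy", "inverted"], weights.round(3))))
```

Because a weight cannot go below zero, a column that would only help with a negative coefficient ends up at exactly 0, which is what makes the result usable as a model selector.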

### Step 4: Take the non-zero coefficients and build blenders
I've seen this work well in my own research projects. It can be applied to both regression and classification projects (for classification problems, the target must be converted to an integer). The advantages are that it's fast and doesn't require making a new project. Ideally, the solution would automatically strike the right balance between diversity and accuracy.

You could also skip the blenders and compute predictions directly from the calculated coefficient for each model (this approach is the default for the SuperLearner package: https://www.rdocumentation.org/packages/SuperLearner/versions/2.0-24/topics/SuperLearner).
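
Computing predictions directly from the coefficients could look roughly like the sketch below. The helper `nnls_blend` and the toy frame are hypothetical, and rescaling the weights to sum to one is a convention assumed here (it turns the blend into a weighted average), not something this notebook does.

```python
import pandas as pd
from scipy.optimize import nnls


def nnls_blend(pred_matrix: pd.DataFrame, target: pd.Series, normalize: bool = True):
    """Fit non-negative weights and return (weights, blended predictions).

    pred_matrix: one column of predictions per model (hypothetical layout).
    normalize:   if True, rescale the weights so they sum to 1 (an assumption).
    """
    values = pred_matrix.to_numpy(dtype=float)
    coefs, _ = nnls(values, target.to_numpy(dtype=float))
    if normalize and coefs.sum() > 0:
        coefs = coefs / coefs.sum()
    weights = pd.Series(coefs, index=pred_matrix.columns)
    blended = pd.Series(values @ coefs, index=pred_matrix.index)
    return weights, blended


# Made-up predictions from two models and a made-up target.
toy_preds = pd.DataFrame({"model_a": [1.0, 2.0, 3.0, 4.0],
                          "model_b": [1.5, 1.5, 3.5, 3.5]})
toy_target = pd.Series([1.1, 2.1, 2.9, 4.2])

weights, blended = nnls_blend(toy_preds, toy_target)
print(weights)
print(blended)
```

In the notebook's terms, `pred_matrix` would correspond to `X` and `target` to `y` from the cells further down.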

In [1]:
import datarobot as dr
import pandas as pd
import numpy as np
dr.__version__

'2.17.0'

In [2]:
# Which models do we use?
limit_to_already_cv_only = False  # only use models that already ran CV
limit_to_top_models = True  # limit to top X models
num_top_models = 30  # only use top 30 models

In [3]:
# https://datarobot.atlassian.net/wiki/spaces/CFDS/pages/558858298/Sports+Analytics+Predicting+NBA+Player+Performance+-+Regression

training_df = pd.read_csv("https://s3.amazonaws.com/datarobot_public_datasets/DR_Demo_NBA_2017-2018.csv")
training_df.sample(5)

Unnamed: 0,roto_fpts_per_min,roto_minutes,roto_fpts,roto_value,free_throws_lag30_mean,field_goals_decay1_mean,game_score_lag30_mean,minutes_played_decay1_mean,PF_lastseason,free_throws_attempted_lag30_mean,...,team,opponent,over_under,eff_field_goal_percent_lastseason,spread_decay1_mean,position,OWS_lastseason,free_throws_percent_decay1,text_yesterday_and_today,game_score
3810,,,,,1.461538,0.966304,11.230769,7.509653,2.4,1.923077,...,OKC,CHI,,0.571,7.301062,C,3.3,0.788001,Thunder's Steven Adams: Doubtful for Wednesday...,0.0
479,1.01,27.0,27.3,5.15,0.866667,3.591254,6.133333,26.921195,2.6,1.4,...,MIA,PHI,201.0,0.535,0.901175,PF,1.9,0.935095,Heat's James Johnson: Scores eight points in W...,2.9
2639,0.84,33.0,27.7,4.2,2.2,7.531135,12.82,33.92915,1.6,2.866667,...,DAL,CHI,210.5,0.498,0.53044,SF,2.5,0.864345,,18.0
6291,0.87,31.0,26.9,5.07,0.933333,1.718828,3.473333,25.055692,0.9,1.133333,...,SAS,PHI,206.5,0.481,3.433701,SF,0.7,0.760559,,10.5
1355,,,,,0.928571,2.797223,5.264286,25.690895,2.3,1.428571,...,BRK,IND,,0.543,-18.001993,SG,0.1,0.621636,,9.3


In [4]:
# Run autopilot

dr.Client(config_path="/Users/taylor.larkin/.config/datarobot/drconfig.yaml")

project = dr.Project.start(project_name="BBB NNLS NBA", sourcedata=training_df,
                           target="game_score", worker_count=-1)
project.wait_for_autopilot()

In progress: 4, queued: 32 (waited: 0s)
In progress: 4, queued: 32 (waited: 1s)
In progress: 4, queued: 32 (waited: 1s)
In progress: 4, queued: 32 (waited: 2s)
In progress: 6, queued: 30 (waited: 4s)
In progress: 8, queued: 28 (waited: 6s)
In progress: 8, queued: 28 (waited: 9s)
In progress: 14, queued: 22 (waited: 17s)
In progress: 17, queued: 19 (waited: 30s)
In progress: 17, queued: 16 (waited: 51s)
In progress: 12, queued: 16 (waited: 71s)
In progress: 6, queued: 16 (waited: 92s)
In progress: 8, queued: 11 (waited: 114s)
In progress: 15, queued: 1 (waited: 134s)
In progress: 16, queued: 0 (waited: 155s)
In progress: 9, queued: 0 (waited: 175s)
In progress: 6, queued: 0 (waited: 196s)
In progress: 1, queued: 0 (waited: 216s)
In progress: 3, queued: 14 (waited: 236s)
In progress: 6, queued: 11 (waited: 257s)
In progress: 14, queued: 2 (waited: 277s)
In progress: 13, queued: 0 (waited: 303s)
In progress: 8, queued: 0 (waited: 323s)
In progress: 3, queued: 0 (waited: 344s)
In progress:

In [5]:
print("Project Optimization Metric: " + project.metric)
print("Project Target: " + project.target)
total_rows = training_df.shape[0]
print(f'Total Training Rows: {total_rows}')

Project Optimization Metric: RMSE
Project Target: game_score
Total Training Rows: 9999


In [6]:
# assemble scoring data and attributes
# order by cv score, then by validation score

model_scores = pd.DataFrame(
    [[model.metrics[project.metric]['crossValidation'],
      model.metrics[project.metric]['validation'],
      model,
      model.training_row_count,
      model.model_category] for model in project.get_models(with_metric=project.metric)],
    columns=['cv', 'v', 'model', 'rows', 'category']).sort_values(['cv', 'v'],
                                                                 na_position='last')
# TODO: add ability to detect need for reverse sorting for metrics like AUC
model_scores

Unnamed: 0,cv,v,model,rows,category
0,6.702872,6.66194,Model('ExtraTrees Regressor'),8000,model
4,6.726294,6.67778,Model('ENET Blender'),6400,blend
3,6.726992,6.67760,Model('AVG Blender'),6400,blend
2,6.727492,6.67451,Model('Advanced AVG Blender'),6400,blend
1,6.729854,6.66923,Model('ENET Blender'),6400,blend
5,6.731212,6.69450,Model('ExtraTrees Regressor'),6400,model
10,6.771144,6.71953,Model('ExtraTrees Regressor'),6400,model
6,6.775368,6.71667,Model('eXtreme Gradient Boosted Trees Regresso...,6400,model
8,6.778788,6.71789,Model('eXtreme Gradient Boosted Trees Regresso...,6400,model
11,6.783520,6.73290,Model('eXtreme Gradient Boosted Trees Regresso...,6400,model
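
One way to address the TODO about reverse sorting is sketched below. The set `HIGHER_IS_BETTER` is a hypothetical, incomplete list of metric names assumed to be "larger is better"; it should be checked against the metrics actually available in the project before being relied on.

```python
import pandas as pd

# Assumption: illustrative, incomplete set of metrics where larger is better.
HIGHER_IS_BETTER = {"AUC", "Gini Norm", "R Squared"}


def sort_model_scores(scores: pd.DataFrame, metric: str) -> pd.DataFrame:
    """Sort by CV score, then validation score, best model first.

    Error metrics (e.g. RMSE) sort ascending; metrics listed in
    HIGHER_IS_BETTER sort descending. Missing scores always go last.
    """
    ascending = metric not in HIGHER_IS_BETTER
    return scores.sort_values(["cv", "v"], ascending=ascending, na_position="last")


# Made-up scores: model "b" has no CV score yet.
toy_scores = pd.DataFrame({"cv": [0.70, None, 0.90],
                           "v": [0.71, 0.80, 0.88],
                           "model": ["a", "b", "c"]})

print(sort_model_scores(toy_scores, "AUC")["model"].tolist())
print(sort_model_scores(toy_scores, "RMSE")["model"].tolist())
```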


In [7]:
# Eliminate blenders, prime models, and scaleout models.
# Eliminate anything not at the max training percentage of the project
# If we are not training new CV models, eliminate the ones that haven't run CV yet.
# If we are limiting to top models, select the top X models from remaining ones,
#     otherwise select all of them.

if limit_to_already_cv_only:
    models_to_use = model_scores.loc[
        (model_scores['category'] == 'model')
        & (model_scores['rows'] == project.max_train_rows)
        & model_scores['cv'].notna(),
        'model'].tolist()
else:
    models_to_use = model_scores.loc[
        (model_scores['category'] == 'model')
        & (model_scores['rows'] == project.max_train_rows),
        'model'].tolist()
if limit_to_top_models:
    models_to_use = models_to_use[:num_top_models]
models_to_use

[Model('ExtraTrees Regressor'),
 Model('ExtraTrees Regressor'),
 Model('eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
 Model('eXtreme Gradient Boosted Trees Regressor with Early Stopping'),
 Model('eXtreme Gradient Boosted Trees Regressor with Early Stopping and Unsupervised Learning Features'),
 Model('Gradient Boosted Trees Regressor with Early Stopping (Least-Squares Loss)'),
 Model('Nystroem Kernel SVM Regressor'),
 Model('Gradient Boosted Trees Regressor (Least-Squares Loss)'),
 Model('Gradient Boosted Trees Regressor'),
 Model('Auto-Tuned Word N-Gram Text Modeler using token occurrences - text_yesterday_and_today')]

In [8]:
# kick off all training predictions jobs if they haven't already been run
train_pred_jobs = list()
for model in models_to_use:
    if model.metrics[project.metric]['crossValidation'] is None:  # if CV has not been run, try to run CV
        try:
            model.cross_validate()
            print('Cross Validation begun for model ' + model.id)
        except Exception as e:
            print('CV error: ', str(e))
    try:
        train_pred_jobs.append(model.request_training_predictions(dr.enums.DATA_SUBSET.ALL))
        print('Training Predictions begun for model ' + model.id)
    except Exception as e:
        if 'Training predictions request already submitted' in str(e):  # match on the message text, not the full string
            print('Predictions already kicked off for model ' + model.id)
        else:
            print(str(e))
print('\nAll training prediction jobs kicked off')

Training Predictions begun for model 5d827f3495d0303a8079d17e
Training Predictions begun for model 5d8280fb27f88e74fa044e34
Training Predictions begun for model 5d827f3495d0303a8079d189
Training Predictions begun for model 5d827f3495d0303a8079d182
Training Predictions begun for model 5d827f3495d0303a8079d180
Training Predictions begun for model 5d827f3495d0303a8079d186
Training Predictions begun for model 5d827f3495d0303a8079d184
Training Predictions begun for model 5d827f3495d0303a8079d187
Training Predictions begun for model 5d827f3495d0303a8079d18a
Cross Validation begun for model 5d827f3495d0303a8079d17d
Training Predictions begun for model 5d827f3495d0303a8079d17d

All training prediction jobs kicked off


In [9]:
model_ids = [model.id for model in models_to_use]
# create a mapping from model ids to names, which start with model id and end with model_type,
#   like "5d6edd3578132c4e72cd27df eXtreme Gradient Boosted Trees Classifier with Early Stopping - Forest (10x)"
# I set them in this order so that the model id would be copyable, since tooltips are not. I would prefer using
#   "M127 eXtreme Gradient Boosted Trees Classifier with Early Stopping - Forest (10x)" but the M### are not available
model_ids_to_names = {model.id: model.id + ' ' + model.model_type
                      for model in models_to_use}

In [10]:
for train_pred_job in train_pred_jobs:
    train_pred_job.get_result_when_complete()
print('All training prediction jobs complete!')

All training prediction jobs complete!


In [11]:
tp_list = dr.TrainingPredictions.list(project_id=project.id)
predictions = [(model_ids_to_names[tp.model_id], tp.get_all_as_dataframe())
               for tp in tp_list if tp.model_id in model_ids and tp.data_subset == 'all']
predictions

[('5d827f3495d0303a8079d18a Gradient Boosted Trees Regressor',
        row_id partition_id  prediction
  0          0      Holdout   13.161561
  1          1      Holdout   10.716866
  2          2      Holdout   13.691411
  3          3          4.0   11.504570
  4          4          3.0   12.963454
  5          5      Holdout   13.077293
  6          6      Holdout   14.000896
  7          7          4.0   11.731391
  8          8          0.0   13.350650
  9          9          1.0   14.454387
  10        10          4.0   13.748579
  11        11      Holdout   13.465634
  12        12          2.0   13.768457
  13        13          3.0   12.730611
  14        14      Holdout   13.892001
  15        15          4.0   14.224886
  16        16      Holdout   13.339055
  17        17          4.0   12.773731
  18        18      Holdout   13.617614
  19        19      Holdout   13.317456
  20        20          2.0   13.762582
  21        21          3.0    9.858629
  22        22   

In [12]:
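# stack every model's training predictions into one long frame
# (DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate once)
pred_frames = []
for model, tp_df in predictions:
    tp_df['Model'] = model
    pred_frames.append(tp_df)
pred_df = pd.concat(pred_frames, sort=True)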

In [13]:
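# quick sanity check to make sure partitions are consistent: each (partition_id, row_id) pair
# should appear once per model. NOTE: the expected count is hard-coded to 14, but only 10 models
# were selected above, so this comparison cannot pass as written; len(predictions) avoids the mismatch.
all(pred_df.groupby(['partition_id', 'row_id']).agg({'Model': 'count'}).reset_index()['Model'] == 14)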

False
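
A version of this check that does not depend on a fixed model count might look like the sketch below. The helper name `partitions_consistent` and the toy frame are hypothetical; the column names follow the long-format `pred_df` built above.

```python
import pandas as pd


def partitions_consistent(long_preds: pd.DataFrame, n_models: int) -> bool:
    """True if every row_id appears exactly once per model, always in the same partition."""
    # Cast partition ids to str so mixed values ("Holdout", 0.0, 1.0, ...) group cleanly.
    keys = [long_preds["row_id"], long_preds["partition_id"].astype(str)]
    counts = long_preds.groupby(keys).size()
    return bool((counts == n_models).all())


# Made-up long-format predictions for two models over three rows.
toy_long_preds = pd.DataFrame({
    "row_id": [0, 1, 2, 0, 1, 2],
    "partition_id": ["Holdout", 0.0, 1.0, "Holdout", 0.0, 1.0],
    "Model": ["m1"] * 3 + ["m2"] * 3,
})

print(partitions_consistent(toy_long_preds, n_models=2))
```

With the notebook's objects, the call would be `partitions_consistent(pred_df, len(predictions))`.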

In [14]:
# merge all the predictions...
merged_df = pd.DataFrame()
merged_df['target'] = training_df[project.target]
first = True
for model, tp_df in predictions:
    if first:
        merged_df['partition_id'] = tp_df['partition_id']
        first = False
    if project.target_type != "Regression":
        # Grab first class probabilities
        merged_df[model] = tp_df.iloc[:,3]
    else:
        # Grab predictions
        merged_df[model] = tp_df.iloc[:,2]   
    print(model + ' predictions downloaded')
merged_df

5d827f3495d0303a8079d18a Gradient Boosted Trees Regressor predictions downloaded
5d827f3495d0303a8079d189 eXtreme Gradient Boosted Trees Regressor with Early Stopping predictions downloaded
5d827f3495d0303a8079d186 Gradient Boosted Trees Regressor with Early Stopping (Least-Squares Loss) predictions downloaded
5d827f3495d0303a8079d180 eXtreme Gradient Boosted Trees Regressor with Early Stopping and Unsupervised Learning Features predictions downloaded
5d827f3495d0303a8079d17d Auto-Tuned Word N-Gram Text Modeler using token occurrences - text_yesterday_and_today predictions downloaded
5d827f3495d0303a8079d187 Gradient Boosted Trees Regressor (Least-Squares Loss) predictions downloaded
5d827f3495d0303a8079d182 eXtreme Gradient Boosted Trees Regressor with Early Stopping predictions downloaded
5d827f3495d0303a8079d17e ExtraTrees Regressor predictions downloaded
5d827f3495d0303a8079d184 Nystroem Kernel SVM Regressor predictions downloaded
5d8280fb27f88e74fa044e34 ExtraTrees Regressor predi

Unnamed: 0,target,partition_id,5d827f3495d0303a8079d18a Gradient Boosted Trees Regressor,5d827f3495d0303a8079d189 eXtreme Gradient Boosted Trees Regressor with Early Stopping,5d827f3495d0303a8079d186 Gradient Boosted Trees Regressor with Early Stopping (Least-Squares Loss),5d827f3495d0303a8079d180 eXtreme Gradient Boosted Trees Regressor with Early Stopping and Unsupervised Learning Features,5d827f3495d0303a8079d17d Auto-Tuned Word N-Gram Text Modeler using token occurrences - text_yesterday_and_today,5d827f3495d0303a8079d187 Gradient Boosted Trees Regressor (Least-Squares Loss),5d827f3495d0303a8079d182 eXtreme Gradient Boosted Trees Regressor with Early Stopping,5d827f3495d0303a8079d17e ExtraTrees Regressor,5d827f3495d0303a8079d184 Nystroem Kernel SVM Regressor,5d8280fb27f88e74fa044e34 ExtraTrees Regressor
0,8.1,Holdout,13.161561,11.294995,13.142506,12.367445,10.550072,11.607484,13.037875,10.669949,13.001448,10.071771
1,18.6,Holdout,10.716866,12.066273,12.677987,12.228329,10.550072,10.299732,11.657744,11.700363,11.253059,11.444714
2,5.4,Holdout,13.691411,13.241269,13.232603,13.419540,10.550072,11.119019,14.622608,13.497095,11.278088,14.381510
3,16.5,4.0,11.504570,13.182579,12.133708,10.675391,10.508400,10.641212,13.059536,12.778480,11.506106,12.554135
4,8.4,3.0,12.963454,13.725925,13.128561,13.923594,10.546838,12.764852,13.475514,14.389809,12.881784,14.683216
5,25.5,Holdout,13.077293,13.677299,13.605642,13.413597,10.550072,12.767859,14.029983,13.356206,12.565127,13.760434
6,20.7,Holdout,14.000896,13.670481,14.239653,14.609518,10.550072,13.787153,14.522532,14.815258,13.604861,14.960875
7,18.3,4.0,11.731391,13.163185,12.865276,13.420309,10.597748,12.973867,12.949984,15.000263,12.765175,15.453476
8,23.0,0.0,13.350650,12.728014,12.751873,13.673044,10.550072,13.622487,13.708130,13.942984,12.886249,13.376334
9,0.4,1.0,14.454387,13.500424,14.982912,13.938272,11.474463,14.089572,13.821175,14.383501,13.397105,15.290266


In [15]:
# Prepping data for non-negative least squares

y = merged_df.pop("target")

# In case of a classification problem
if project.target_type != "Regression":
    y = y.astype("category").cat.codes

X = merged_df.drop("partition_id", axis = 1)

In [16]:
# Perform non-negative least squares to select models (each column of X is one model's predictions)

from scipy.optimize import nnls

nnls_opt = nnls(A=X.values, b=y)

# Selected models
selected_models = list(X.columns[np.nonzero(nnls_opt[0])])
selected_models

['5d827f3495d0303a8079d189 eXtreme Gradient Boosted Trees Regressor with Early Stopping',
 '5d827f3495d0303a8079d186 Gradient Boosted Trees Regressor with Early Stopping (Least-Squares Loss)',
 '5d827f3495d0303a8079d180 eXtreme Gradient Boosted Trees Regressor with Early Stopping and Unsupervised Learning Features',
 '5d827f3495d0303a8079d17e ExtraTrees Regressor',
 '5d827f3495d0303a8079d184 Nystroem Kernel SVM Regressor']
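
The fit above uses every row, including the ones labelled `Holdout`. One possible variation, not part of this notebook, is to fit the weights on the non-holdout rows only and use the holdout rows to score the weighted combination. The sketch below assumes a frame that still contains the `target` and `partition_id` columns; the helper name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.optimize import nnls


def fit_nnls_excluding_holdout(frame: pd.DataFrame,
                               target_col: str = "target",
                               partition_col: str = "partition_id",
                               holdout_label: str = "Holdout"):
    """Fit NNLS weights on non-holdout rows; return (weights, holdout RMSE)."""
    is_holdout = frame[partition_col].astype(str) == holdout_label
    model_cols = [c for c in frame.columns if c not in (target_col, partition_col)]

    train = frame.loc[~is_holdout]
    coefs, _ = nnls(train[model_cols].to_numpy(dtype=float),
                    train[target_col].to_numpy(dtype=float))
    weights = pd.Series(coefs, index=model_cols)

    holdout = frame.loc[is_holdout]
    holdout_pred = holdout[model_cols].to_numpy(dtype=float) @ coefs
    errors = holdout[target_col].to_numpy(dtype=float) - holdout_pred
    return weights, float(np.sqrt(np.mean(errors ** 2)))


# Made-up data: 10 holdout rows, 40 rows spread over five CV folds.
rng = np.random.default_rng(1)
y_toy = rng.normal(10.0, 3.0, size=50)
toy = pd.DataFrame({
    "target": y_toy,
    "partition_id": ["Holdout"] * 10 + [float(i % 5) for i in range(40)],
    "model_a": y_toy + rng.normal(0.0, 1.0, size=50),
    "model_b": y_toy + rng.normal(0.0, 2.0, size=50),
})

toy_weights, toy_holdout_rmse = fit_nnls_excluding_holdout(toy)
print(toy_weights)
print(toy_holdout_rmse)
```

Printing the weights next to the model names also shows how strongly each selected model contributes, rather than only which ones are non-zero.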

In [17]:
# Make some blends

# Grab ids
model_ids_to_blend = [x.split(" ", 1)[0] for x in selected_models]

# Start blending
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.AVERAGE)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.MEDIAN)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.ENET)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.GLM)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.PLS)
project.blend(model_ids_to_blend, dr.enums.BLENDER_METHOD.TENSORFLOW)

ModelJob(TF Blender, status=inprogress)

In [18]:
project.open_leaderboard_browser()

True