With smaller data, or more generally when the ratio of compute resources to data size is high, H2O AutoML typically produces a better Stacked Ensemble model using cross-validation. For larger datasets, however, especially in time-constrained scenarios (under 1 hour), we see better results when we reduce nfolds or skip cross-validation entirely and instead use a blending frame to train the Stacked Ensemble. We need a dynamic strategy, based on the data-to-compute ratio, for choosing the number of folds or switching to blending.
There is strong evidence that as datasets grow in size, it's better to switch to a blending frame for training the metalearner in Stacked Ensembles instead of 5-fold CV (which is currently the default on all datasets).
On a benchmark using the HIGGS dataset, we compare blending, 3-fold CV, and 5-fold CV at 1 hour against 5-fold CV at 4 hours. With 1M rows, running the default 5-fold CV for longer (4 hours) can beat a 1-hour blending frame, though it is still clearly better to use less time (and hence the blending frame here). At 10M rows, blending for 1 hour still gives better results than default 5-fold CV for 4 hours, which means there is really no reason to be doing CV at that scale. These results use a separate test set for leaderboard scoring (the AUCs shown on the plot).
We will need to do more benchmarking on this: if we switch to a 10% (or some other fraction) blending frame for datasets above a certain "size" or "size relative to compute resources", we no longer get CV metrics for the leaderboard. We would then have to hold out another slice of data just for leaderboard scoring, which could be acceptable when there is "enough" data, but we need to be careful about doing this properly.
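As a sketch of what such a dynamic strategy could look like, the function below picks nfolds or a blending frame from a rough data-to-compute ratio. All thresholds and split fractions are illustrative assumptions for discussion, not values derived from the benchmark:

```python
def choose_ensemble_strategy(n_rows, runtime_secs, n_cores=1,
                             blending_fraction=0.10,
                             leaderboard_fraction=0.10):
    """Illustrative heuristic: choose nfolds or blending based on a
    rough data-to-compute ratio. Thresholds are hypothetical placeholders
    that would need to be calibrated by benchmarks."""
    # Rows to process per core-second available within the time budget.
    ratio = n_rows / (runtime_secs * n_cores)
    if ratio < 50:
        # Plenty of compute per row: keep the default 5-fold CV.
        return {"nfolds": 5, "blending": False}
    if ratio < 500:
        # Medium data-to-compute ratio: reduce the number of folds.
        return {"nfolds": 3, "blending": False}
    # Large data relative to compute: skip CV and use a blending frame.
    # Since there are no CV metrics, also hold out a separate slice of
    # data just for leaderboard scoring.
    return {"nfolds": 0, "blending": True,
            "blending_fraction": blending_fraction,
            "leaderboard_fraction": leaderboard_fraction}
```

For example, 10M rows with a 1-hour budget on 4 cores would fall into the blending regime under these placeholder thresholds, while 100K rows would keep the default 5-fold CV.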
!Screen Shot 2020-05-16 at 5.07.53 PM.png|width=200,height=183!
Tomas Fryda commented: NOTE: We should also consider the {{time constraint}} and {{computational power}}, since blending behaves differently on different sizes of AWS instances, at least for smaller datasets.
!Screen Shot 2020-05-18 at 2.20.48 PM.png|width=1238,height=955!
!Screen Shot 2020-05-18 at 2.12.36 PM.png|width=1280,height=665!
Sebastien Poirier commented: For a first simple heuristic, I suggest focusing on a single node and empirically determining a threshold for the following ratio:
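The exact ratio was not spelled out here; one hypothetical form would normalize the dataset's cell count by the available core-seconds on a single node. The threshold value below is a placeholder to be determined empirically, as suggested:

```python
def data_to_compute_ratio(n_rows, n_cols, n_cores, max_runtime_secs):
    """Cells of data per core-second of time budget (single node).
    A hypothetical form of the ratio; the actual definition is TBD."""
    return (n_rows * n_cols) / (n_cores * max_runtime_secs)

def use_blending(n_rows, n_cols, n_cores, max_runtime_secs,
                 threshold=5000.0):
    """Switch to a blending frame when the ratio exceeds the threshold.
    The threshold is a placeholder, to be calibrated by benchmarks."""
    ratio = data_to_compute_ratio(n_rows, n_cols, n_cores, max_runtime_secs)
    return ratio > threshold
```

Under this placeholder threshold, 10M rows of a 28-column dataset (the HIGGS width) with 8 cores and a 1-hour budget would trigger blending, while 100K rows would not.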