Dynamic Stacked Ensemble metalearning strategy in AutoML #8096

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 3 comments

@exalate-issue-sync

With smaller data, or when the ratio of data size to compute resources is low, H2O AutoML will typically produce a better Stacked Ensemble model using cross-validation. For larger datasets, however, especially in time-constrained scenarios (under 1 hour), we see better results when we reduce nfolds or skip cross-validation completely and instead use a blending frame to train the Stacked Ensemble. We need a dynamic strategy, based on the data-to-compute ratio, for choosing the number of folds or switching to a blending frame.

There is strong evidence that, as datasets grow in size, it is better to switch to a blending frame for training the metalearner in Stacked Ensembles instead of using 5-fold CV (which is what we use by default on all datasets).

On a benchmark of the HIGGS dataset, we compare 1-hour runs using blending, 3-fold CV, and 5-fold CV against a 4-hour run using 5-fold CV. With 1M rows, we can beat a 1-hour blending frame by running the default 5-fold CV for longer (4 hours), though it is still clearly preferable to use less time (and hence the blending frame here). At 10M rows, blending for 1 hour still gives better results than the default 5-fold CV for 4 hours, which means there is really no reason to do CV at this point. These results use a separate test set for leaderboard scoring (the AUCs shown on the plot).

We will have to do more benchmarking on this. If we switch to using a 10% (or some other fraction) blending frame for datasets of a certain "size" or "size in relation to compute resources", we no longer get the CV metrics for the leaderboard. We would then have to hold out another piece of the data just for leaderboard scoring, which could be fine if there is "enough" data, but we need to be careful to do this properly.
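The hold-out scheme described above can be sketched as a three-way split of row indices. This is only an illustration in plain Python, not the AutoML implementation: the function name `three_way_split`, the 10% leaderboard fraction, and the seed are assumptions (only the 10% blending fraction is mentioned above as a candidate).

```python
import random


def three_way_split(nrows, blend_frac=0.1, leaderboard_frac=0.1, seed=42):
    """Split row indices into (train, blend, leaderboard) index lists.

    blend_frac: share of rows held out to train the metalearner (blending frame).
    leaderboard_frac: share of rows held out only for leaderboard scoring
                      (hypothetical default, not taken from the issue).
    """
    if blend_frac <= 0 or leaderboard_frac <= 0:
        raise ValueError("fractions must be positive")
    if blend_frac + leaderboard_frac >= 1:
        raise ValueError("fractions must leave rows for training")

    # Shuffle once with a fixed seed so the split is reproducible.
    indices = list(range(nrows))
    random.Random(seed).shuffle(indices)

    n_blend = int(nrows * blend_frac)
    n_leaderboard = int(nrows * leaderboard_frac)

    blend = indices[:n_blend]
    leaderboard = indices[n_blend:n_blend + n_leaderboard]
    train = indices[n_blend + n_leaderboard:]
    return train, blend, leaderboard


train_idx, blend_idx, leaderboard_idx = three_way_split(1000)
```

In H2O, frames built from such a split would typically be passed as `training_frame`, `blending_frame`, and `leaderboard_frame` to `H2OAutoML.train`, with `nfolds=0` to disable cross-validation. The open question raised above, how much data is "enough" to afford two hold-outs, is not answered by this sketch.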

(Attached image: Screen Shot 2020-05-16 at 5.07.53 PM.png)

@exalate-issue-sync
Author

Tomas Fryda commented: Note: we should also consider `time constraint` and `computational power`, since blending behaves differently on different sizes of AWS instances, at least for smaller datasets.

(Attached image: Screen Shot 2020-05-18 at 2.20.48 PM.png)

(Attached image: Screen Shot 2020-05-18 at 2.12.36 PM.png)

@exalate-issue-sync
Author

Sebastien Poirier commented: For a first simple heuristic, I suggest focusing on a single node and empirically determining a threshold for the following ratio:

`r = dataset_size / (time_constraint * computation_power)`

with:

`dataset_size = nrows * ncols`

`computation_power = nthreads` (the number of allocated CPUs)

If `r > threshold`, we would automatically switch to blending.
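A minimal sketch of this heuristic in Python, to make the ratio concrete. The names `RunConfig`, `data_to_compute_ratio`, and `choose_strategy` are hypothetical, the time constraint is assumed to be in seconds, and the threshold value used below is a placeholder: the real one would have to be determined empirically, as suggested above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    nrows: int
    ncols: int
    max_runtime_secs: float  # time constraint (unit is an assumption)
    nthreads: int            # allocated CPUs on the single node


def data_to_compute_ratio(cfg: RunConfig) -> float:
    """r = dataset_size / (time_constraint * computation_power)."""
    if cfg.max_runtime_secs <= 0 or cfg.nthreads <= 0:
        raise ValueError("time constraint and nthreads must be positive")
    dataset_size = cfg.nrows * cfg.ncols
    return dataset_size / (cfg.max_runtime_secs * cfg.nthreads)


def choose_strategy(cfg: RunConfig, threshold: float) -> str:
    """Return "blending" when r exceeds the threshold, otherwise "cv"."""
    return "blending" if data_to_compute_ratio(cfg) > threshold else "cv"


# Illustrative values only; neither the column count nor the threshold
# comes from the benchmark discussed in this issue.
PLACEHOLDER_THRESHOLD = 1000.0
large_short = RunConfig(nrows=10_000_000, ncols=29, max_runtime_secs=3600, nthreads=16)
small_long = RunConfig(nrows=1_000_000, ncols=29, max_runtime_secs=14400, nthreads=16)

print(choose_strategy(large_short, PLACEHOLDER_THRESHOLD))
print(choose_strategy(small_long, PLACEHOLDER_THRESHOLD))
```

Because `r` mixes rows, columns, seconds, and threads, the threshold is only meaningful for the units it was calibrated with; changing the time unit would require rescaling it.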

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-7542
Assignee: Tomas Fryda
Reporter: Erin LeDell
State: Resolved
Fix Version: 3.36.0.1
Attachments: Available (Count: 3)
Development PRs: Available

Linked PRs from JIRA

#5225

Attachments From Jira

Attachment Name: Screen Shot 2020-05-16 at 5.07.53 PM.png
Attached By: Erin LeDell
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7542/Screen Shot 2020-05-16 at 5.07.53 PM.png

Attachment Name: Screen Shot 2020-05-18 at 2.12.36 PM.png
Attached By: Tomas Fryda
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7542/Screen Shot 2020-05-18 at 2.12.36 PM.png

Attachment Name: Screen Shot 2020-05-18 at 2.20.48 PM.png
Attached By: Tomas Fryda
File Link:https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-7542/Screen Shot 2020-05-18 at 2.20.48 PM.png
