With smaller data, or more generally when the ratio of compute resources to data size is high, H2O AutoML typically produces a better Stacked Ensemble model using cross-validation. For larger datasets, however, especially in time-constrained scenarios (under 1 hour), we see better results when we reduce nfolds or skip cross-validation entirely and instead use a blending frame to train the Stacked Ensemble. We need a dynamic strategy, based on the data-to-compute ratio, for choosing the number of folds or switching to blending.
There is strong evidence that as datasets grow in size, it's better to switch to a blending frame for training the metalearner in Stacked Ensembles instead of 5-fold CV (which is currently the default on all datasets).
On a benchmark using the HIGGS dataset, we compare blending, 3-fold CV, and 5-fold CV at 1 hour against 5-fold CV at 4 hours. With 1M rows, running the default 5-fold CV for longer (4 hours) can beat a 1-hour blending frame, though it is still clearly better to use less time (and hence the blending frame here). At 10M rows, blending for 1 hour still gives better results than default 5-fold CV for 4 hours, which means there is really no reason to be doing CV at that scale. These results use a separate test set for leaderboard scoring (the AUCs shown on the plot).
We will need to do more benchmarking on this: if we switch to a 10% (or some other fraction) blending frame for datasets above a certain "size" or "size relative to compute resources", we no longer get CV metrics for the leaderboard. We would then have to hold out another slice of data just for leaderboard scoring, which could be acceptable when there is "enough" data, but we need to be careful about doing this properly.
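As a sketch of what such a dynamic strategy could look like, the function below picks nfolds or a blending frame from a rough data-to-compute ratio. All thresholds and split fractions are illustrative assumptions for discussion, not values derived from the benchmark:

```python
def choose_ensemble_strategy(n_rows, runtime_secs, n_cores=1,
                             blending_fraction=0.10,
                             leaderboard_fraction=0.10):
    """Illustrative heuristic: choose nfolds or blending based on a
    rough data-to-compute ratio. Thresholds are hypothetical placeholders
    that would need to be calibrated by benchmarks."""
    # Rows to process per core-second available within the time budget.
    ratio = n_rows / (runtime_secs * n_cores)
    if ratio < 50:
        # Plenty of compute per row: keep the default 5-fold CV.
        return {"nfolds": 5, "blending": False}
    if ratio < 500:
        # Medium data-to-compute ratio: reduce the number of folds.
        return {"nfolds": 3, "blending": False}
    # Large data relative to compute: skip CV and use a blending frame.
    # Since there are no CV metrics, also hold out a separate slice of
    # data just for leaderboard scoring.
    return {"nfolds": 0, "blending": True,
            "blending_fraction": blending_fraction,
            "leaderboard_fraction": leaderboard_fraction}
```

For example, 10M rows with a 1-hour budget on 4 cores would fall into the blending regime under these placeholder thresholds, while 100K rows would keep the default 5-fold CV.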
!Screen Shot 2020-05-16 at 5.07.53 PM.png|width=200,height=183!
Tomas Fryda commented: NOTE: We should also consider the {{time constraint}} and {{computational power}}, since blending behaves differently on different sizes of AWS instances, at least for smaller datasets.
!Screen Shot 2020-05-18 at 2.20.48 PM.png|width=1238,height=955!
!Screen Shot 2020-05-18 at 2.12.36 PM.png|width=1280,height=665!
Sebastien Poirier commented: For a first simple heuristic, I suggest focusing on a single node and empirically determining a threshold for the following ratio:
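The exact ratio was not spelled out here; one hypothetical form would normalize the dataset's cell count by the available core-seconds on a single node. The threshold value below is a placeholder to be determined empirically, as suggested:

```python
def data_to_compute_ratio(n_rows, n_cols, n_cores, max_runtime_secs):
    """Cells of data per core-second of time budget (single node).
    A hypothetical form of the ratio; the actual definition is TBD."""
    return (n_rows * n_cols) / (n_cores * max_runtime_secs)

def use_blending(n_rows, n_cols, n_cores, max_runtime_secs,
                 threshold=5000.0):
    """Switch to a blending frame when the ratio exceeds the threshold.
    The threshold is a placeholder, to be calibrated by benchmarks."""
    ratio = data_to_compute_ratio(n_rows, n_cols, n_cores, max_runtime_secs)
    return ratio > threshold
```

Under this placeholder threshold, 10M rows of a 28-column dataset (the HIGGS width) with 8 cores and a 1-hour budget would trigger blending, while 100K rows would not.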