Early Stopping in EBM #268

Closed
onacrame opened this issue Jul 28, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@onacrame

Correct me if I'm wrong, but the native early stopping mechanism within EBM just takes a random slice of the data. In the case of (i) grouped observations (panel data, where one ID might relate to multiple rows of data) or (ii) imbalanced data, where one might want to ensure stratification, a random cut may not be optimal. Is there any way to use an iterator to predefine which slice of the data is used for early stopping?

@interpret-ml
Collaborator

interpret-ml commented Jul 29, 2021

Hi @onacrame,

Great point -- our default validation sampling does stratify across the label, but unfortunately does not customize beyond that. Adding support for custom validation sets (which are only used for early stopping) is on our backlog, but has not been implemented yet.

An iterator is an interesting idea. We were also thinking about supplementing the fit call to take in a user defined validation_set = (X_val, y_val) as another option (which we would then sample from for each bag of data). Would be interested to hear your thoughts on different options for defining this!

-InterpretML Team
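
For concreteness, a minimal sketch of what the proposal above might look like from the caller's side. Note that `validation_set` is only the suggested parameter from this comment, not something that exists in the library today:

```python
# Hypothetical sketch only: `validation_set` is the parameter proposed above and is
# NOT an existing argument of ExplainableBoostingClassifier.fit.
# X_train, y_train, X_val, y_val are assumed to have been prepared by the caller.
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train, validation_set=(X_val, y_val))  # early stopping would monitor this set
```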

@onacrame
Author

Defining the validation set would be a great option, as one could use whatever sklearn-type iterators one wants while keeping the Interpret-ML API simpler. So the default behaviour would stay as it is now, but with the ability to pass in a user-defined validation set.
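
A sketch of what that caller side could look like with standard sklearn splitters, covering both the grouped (panel) and the stratified cases raised at the top of the thread; the data and split sizes here are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, StratifiedShuffleSplit

# Illustrative data: imbalanced labels and panel-style group IDs.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = rng.binomial(1, 0.1, size=1000)        # imbalanced target
groups = rng.randint(0, 100, size=1000)    # one ID maps to multiple rows

# Grouped holdout: every row of a given ID lands on the same side of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
train_idx, val_idx = next(gss.split(X, y, groups))

# Alternatively, a stratified holdout that preserves the class ratio:
# sss = StratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
# train_idx, val_idx = next(sss.split(X, y))

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]      # the slice early stopping would use
```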

@candalfigomoro

@interpret-ml

In catboost (https://catboost.ai/docs/concepts/python-reference_catboostclassifier_fit.html#python-reference_catboostclassifier_fit), xgboost (https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) and lightgbm (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) you have an eval_set parameter for the fit() method, that you can use to provide "A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed".
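
For reference, the eval_set pattern looks like this in LightGBM's sklearn wrapper (CatBoost and XGBoost are analogous, though the exact early-stopping arguments differ by library and version); this reuses the X_train/X_val split from the splitter sketch above:

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(n_estimators=1000)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],                            # user-supplied validation data
    callbacks=[lgb.early_stopping(stopping_rounds=50)],   # stop when X_val stops improving
)
```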

@onacrame
Author

onacrame commented Aug 5, 2021

Another ancillary point: once a model-building process is finished, it's customary to train the final model on all the data, using whatever early-stopping threshold was found during cross validation or while running against a validation set. The EBM framework doesn't really allow for this, given that there's always a holdout set and no "refit" of the model without the validation set, so some portion of the data can never be used in the final model.

Just an observation.
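
A sketch of that conventional workflow, shown with LightGBM rather than EBM since EBM currently has no equivalent (names reuse the split from the earlier sketch): find the best iteration against a validation set, then refit on all rows with that iteration count fixed.

```python
import lightgbm as lgb

# Step 1: probe run with early stopping against the explicit validation set.
probe = lgb.LGBMClassifier(n_estimators=1000)
probe.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(stopping_rounds=50)])

# Step 2: refit on *all* of the data with the discovered round count;
# no holdout is left out of the final model.
final = lgb.LGBMClassifier(n_estimators=probe.best_iteration_)
final.fit(X, y)
```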

@candalfigomoro

Another problem is that if, for example, you oversampled a class in the training set, you should not have an oversampled validation set (the validation set distribution should be similar to the test set distribution and to the live data distribution). If you split the validation set from the training set, you inherit the oversampled training set distribution. This is also true if you perform data augmentation on the training set. Splitting the validation set from the training set is often a bad idea.
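
A sketch of the ordering being argued for: split the validation set off first, then oversample only the remaining training rows, so the validation distribution still matches the test/live distribution (again reusing the illustrative X and y from the earlier sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Carve out the validation set BEFORE any resampling.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Naive random oversampling of the minority class, applied to the training rows only.
rng = np.random.RandomState(0)
minority = np.flatnonzero(y_train == 1)
extra = rng.choice(minority, size=4 * len(minority), replace=True)
X_train_os = np.vstack([X_train, X_train[extra]])
y_train_os = np.concatenate([y_train, y_train[extra]])

# X_val / y_val keep the original class balance and are never resampled.
```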

@paulbkoch added the enhancement (New feature or request) label Feb 10, 2023
@sarim-zafar

Any timeline for when this feature will be incorporated? This is crucial, especially for problems where you can't randomly split the data.

@paulbkoch
Collaborator

This can now be accomplished with the bags parameter. Details in our docs: https://interpret.ml/docs/ebm.html#interpret.glassbox.ExplainableBoostingClassifier.fit
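
A minimal sketch of that bags-based approach, assuming the convention described in the linked docs (+1 marks a training sample, -1 a validation sample, 0 excludes the sample from that bag) and reusing val_idx from the splitter sketch above; here every outer bag is forced to validate on the same predefined slice, so early stopping always runs against it:

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

n_outer_bags = 8
# Assumed convention per the linked docs: +1 = train, -1 = validation, 0 = excluded.
bags = np.full((n_outer_bags, len(y)), 1, dtype=np.int8)
bags[:, val_idx] = -1   # every bag validates on the same predefined rows

ebm = ExplainableBoostingClassifier(outer_bags=n_outer_bags)
ebm.fit(X, y, bags=bags)
```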
