Early Stopping in EBM #268

Closed
onacrame opened this issue Jul 28, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@onacrame

Correct me if I'm wrong, but the native early stopping mechanism within EBM just takes a random slice of the data. In the case of (i) grouped observations (panel data, where one ID might relate to multiple rows of data) or (ii) imbalanced data, where one might want to ensure stratification, a random cut may not be optimal. Is there any way to use an iterator to predefine which slice of the data is used for early stopping?

@interpret-ml
Collaborator

interpret-ml commented Jul 29, 2021

Hi @onacrame,

Great point -- our default validation sampling does stratify across the label, but unfortunately does not customize beyond that. Adding support for custom validation sets (which are only used for early stopping) is on our backlog, but has not been implemented yet.

An iterator is an interesting idea. We were also thinking about supplementing the fit call to take in a user defined validation_set = (X_val, y_val) as another option (which we would then sample from for each bag of data). Would be interested to hear your thoughts on different options for defining this!

-InterpretML Team
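
For concreteness, a minimal sketch of what the proposal above might look like from the caller's side. Note that `validation_set` is only the suggested parameter from this comment, not something that exists in the library today:

```python
# Hypothetical sketch only: `validation_set` is the parameter proposed above and is
# NOT an existing argument of ExplainableBoostingClassifier.fit.
# X_train, y_train, X_val, y_val are assumed to have been prepared by the caller.
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train, validation_set=(X_val, y_val))  # early stopping would monitor this set
```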

@onacrame
Author

Defining the validation set would be a great option, as one could use whatever sklearn-type iterators one wants while keeping the Interpret-ML API simpler. So the default behaviour would stay as it is now, but with the ability to pass in a user-defined validation set.
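
A sketch of what that caller side could look like with standard sklearn splitters, covering both the grouped (panel) and the stratified cases raised at the top of the thread; the data and split sizes here are purely illustrative:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, StratifiedShuffleSplit

# Illustrative data: imbalanced labels and panel-style group IDs.
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = rng.binomial(1, 0.1, size=1000)        # imbalanced target
groups = rng.randint(0, 100, size=1000)    # one ID maps to multiple rows

# Grouped holdout: every row of a given ID lands on the same side of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
train_idx, val_idx = next(gss.split(X, y, groups))

# Alternatively, a stratified holdout that preserves the class ratio:
# sss = StratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
# train_idx, val_idx = next(sss.split(X, y))

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]      # the slice early stopping would use
```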

@candalfigomoro

@interpret-ml

In catboost (https://catboost.ai/docs/concepts/python-reference_catboostclassifier_fit.html#python-reference_catboostclassifier_fit), xgboost (https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) and lightgbm (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) you have an eval_set parameter for the fit() method, that you can use to provide "A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed".
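
For reference, the eval_set pattern looks like this in LightGBM's sklearn wrapper (CatBoost and XGBoost are analogous, though the exact early-stopping arguments differ by library and version); this reuses the X_train/X_val split from the splitter sketch above:

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(n_estimators=1000)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],                            # user-supplied validation data
    callbacks=[lgb.early_stopping(stopping_rounds=50)],   # stop when X_val stops improving
)
```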

@onacrame
Author

onacrame commented Aug 5, 2021

Another ancillary point: once a model-building process is finished, it's customary to train the final model on all the data, using whatever early-stopping threshold was found during cross validation or while running against a validation set. The EBM framework doesn't really allow for this, given that there's always a holdout set and no "refit" of the model without the validation set, so some portion of the data can never be used in the final model.

Just an observation.
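
A sketch of that conventional workflow, shown with LightGBM rather than EBM since EBM currently has no equivalent (names reuse the split from the earlier sketch): find the best iteration against a validation set, then refit on all rows with that iteration count fixed.

```python
import lightgbm as lgb

# Step 1: probe run with early stopping against the explicit validation set.
probe = lgb.LGBMClassifier(n_estimators=1000)
probe.fit(X_train, y_train,
          eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(stopping_rounds=50)])

# Step 2: refit on *all* of the data with the discovered round count;
# no holdout is left out of the final model.
final = lgb.LGBMClassifier(n_estimators=probe.best_iteration_)
final.fit(X, y)
```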

@candalfigomoro

Another problem is that if, for example, you oversampled a class in the training set, you should not have an oversampled validation set (the validation set distribution should be similar to the test set distribution and to the live data distribution). If you split the validation set from the training set, you inherit the oversampled training set distribution. This is also true if you perform data augmentation on the training set. Splitting the validation set from the training set is often a bad idea.
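
A sketch of the ordering being argued for: split the validation set off first, then oversample only the remaining training rows, so the validation distribution still matches the test/live distribution (again reusing the illustrative X and y from the earlier sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Carve out the validation set BEFORE any resampling.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Naive random oversampling of the minority class, applied to the training rows only.
rng = np.random.RandomState(0)
minority = np.flatnonzero(y_train == 1)
extra = rng.choice(minority, size=4 * len(minority), replace=True)
X_train_os = np.vstack([X_train, X_train[extra]])
y_train_os = np.concatenate([y_train, y_train[extra]])

# X_val / y_val keep the original class balance and are never resampled.
```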

@paulbkoch added the enhancement (New feature or request) label Feb 10, 2023
@sarim-zafar

Any timeline for when this feature will be incorporated? This is crucial, especially for problems where you can't randomly split the data.

@paulbkoch
Collaborator

This can now be accomplished with the bags parameter. Details in our docs: https://interpret.ml/docs/ebm.html#interpret.glassbox.ExplainableBoostingClassifier.fit
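
A minimal sketch of that bags-based approach, assuming the convention described in the linked docs (+1 marks a training sample, -1 a validation sample, 0 excludes the sample from that bag) and reusing val_idx from the splitter sketch above; here every outer bag is forced to validate on the same predefined slice, so early stopping always runs against it:

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

n_outer_bags = 8
# Assumed convention per the linked docs: +1 = train, -1 = validation, 0 = excluded.
bags = np.full((n_outer_bags, len(y)), 1, dtype=np.int8)
bags[:, val_idx] = -1   # every bag validates on the same predefined rows

ebm = ExplainableBoostingClassifier(outer_bags=n_outer_bags)
ebm.fit(X, y, bags=bags)
```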
