Early Stopping in EBM #268
Comments
Hi @onacrame, great point -- our default validation sampling does stratify across the label, but unfortunately does not customize beyond that. Adding support for custom validation sets (which are only used for early stopping) is on our backlog, but has not been implemented yet. An iterator is an interesting idea. We were also thinking about supplementing the fit call to take in a user defined -- InterpretML Team
Defining the validation set would be a great option, as one could just use whatever sklearn-type iterators one wants, keeping the Interpret-ML API simpler. The default behavior would stay as it is now, but with the ability to pass in a user-defined validation set.
In catboost (https://catboost.ai/docs/concepts/python-reference_catboostclassifier_fit.html#python-reference_catboostclassifier_fit), xgboost (https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) and lightgbm (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) you have an eval_set parameter on fit for passing a predefined validation set.
Another ancillary point: after a model-building process is finished, it's customary to train the final model on all the data, using whatever early-stopping thresholds were found during cross-validation or while running against a validation set. The EBM framework doesn't really allow for this, since there's always a holdout set and no way to "refit" the model without the validation set, so some portion of the data can never be used in the final model. Just an observation.
Another problem is that if, for example, you oversampled a class in the training set, you should not have an oversampled validation set (the validation set distribution should be similar to the test set distribution and to the live data distribution). If you split the validation set from the training set, you inherit the oversampled training set distribution. This is also true if you perform data augmentation on the training set. Splitting the validation set from the training set is often a bad idea. |
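To make the point above concrete, here is a minimal sketch (using only sklearn and numpy; the data and class ratio are invented for illustration) of the correct ordering: split the validation set off the original data first, then oversample only the training portion, so the validation distribution still matches the deployment distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = (rng.uniform(size=1000) < 0.1).astype(int)  # roughly 10% minority class

# Split the validation set from the ORIGINAL data first, stratified by label,
# so the validation set keeps the original class distribution.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Oversample the minority class ONLY in the training portion.
minority = X_train[y_train == 1]
n_majority = int((y_train == 0).sum())
extra = resample(minority, replace=True,
                 n_samples=n_majority - len(minority),
                 random_state=0)
X_train_bal = np.vstack([X_train, extra])
y_train_bal = np.concatenate([y_train, np.ones(len(extra), dtype=int)])

# The validation set keeps the original ~10% positive rate,
# while the training set is now exactly balanced.
```

Doing the split the other way around (oversample first, split second) would leak duplicated minority rows into the validation set and distort its class balance, which is exactly the problem described above.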
Any timeline for when this feature will be incorporated? This is extremely important, especially for problems where you can't randomly split the data.
This can now be accomplished with the bags parameter. Details in our docs: https://interpret.ml/docs/ebm.html#interpret.glassbox.ExplainableBoostingClassifier.fit |
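A minimal sketch of the bags approach, assuming the convention described in the linked docs: fit accepts a bags array of shape (n_outer_bags, n_samples), where positive entries mark training rows, negative entries mark validation rows (used for early stopping), and 0 excludes a row. The group IDs and sizes below are invented for illustration; GroupShuffleSplit keeps all rows of a panel ID on the same side of each split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.RandomState(0)
n_samples = 200
X = rng.normal(size=(n_samples, 4))
y = rng.randint(0, 2, size=n_samples)
groups = rng.randint(0, 40, size=n_samples)  # panel IDs; one ID -> many rows

n_outer_bags = 8  # must match the estimator's outer_bags parameter
bags = np.zeros((n_outer_bags, n_samples), dtype=np.int8)

splitter = GroupShuffleSplit(n_splits=n_outer_bags, test_size=0.15,
                             random_state=0)
for bag, (train_idx, val_idx) in enumerate(splitter.split(X, y, groups=groups)):
    bags[bag, train_idx] = 1    # row used for boosting in this bag
    bags[bag, val_idx] = -1     # row held out for early stopping in this bag

# Every row in each bag is either train (+1) or validation (-1), and rows
# sharing a group ID always land on the same side of the split.

# from interpret.glassbox import ExplainableBoostingClassifier
# ebm = ExplainableBoostingClassifier(outer_bags=n_outer_bags)
# ebm.fit(X, y, bags=bags)
```

The same pattern works for stratified validation sets on imbalanced data: swap GroupShuffleSplit for StratifiedShuffleSplit when building the bags matrix.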
Correct me if I'm wrong, but the native early-stopping mechanism within EBM will just take a random slice of the data. In the case of (i) grouped observations (panel data where one ID might relate to multiple rows of data) or (ii) imbalanced data where one might want to ensure stratification, a random cut may not be optimal. Is there any way to use an iterator to predefine which slice of the data is used for early stopping?