Handle Imbalanced Datasets #157
Comments
Is it possible in the meantime to add a scale_pos_weight advanced option for xgboost? Thanks
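(Not an official option yet, but as a rough sketch outside of AutoML: a common heuristic is to set scale_pos_weight to the ratio of negative to positive samples. The data below is a toy placeholder.)

```python
import numpy as np
from xgboost import XGBClassifier

# Toy imbalanced binary dataset (placeholder only).
rng = np.random.default_rng(42)
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(len(y), 10))

# Common heuristic: ratio of negative to positive samples.
spw = (y == 0).sum() / (y == 1).sum()  # 19.0 here

model = XGBClassifier(scale_pos_weight=spw, n_estimators=50)
model.fit(X, y)
```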
@tmontana I've created issue #168 for adding this option.
I can work on this, but shouldn't we also add oversampling and undersampling before we add imbalanced-learn? It could be in the same module, e.g. …
There can be 3 methods to handle imbalanced datasets: oversampling, undersampling, and sample weights.
The performance of each method depends on the dataset. I would add them in the AutoML steps after … However, maybe first the advanced metrics for evaluation should be implemented (#73)? And the weight vector feature (#154). Then we can start on imbalanced data handling. Please take a look at the roadmap in the docs.
Sure @pplonski
@pplonski Some binary/multi-class datasets also require rolling fading (based on time), and that is possible only with sample weights. Oversampling could be a solution for very tiny datasets only, which is a rare case.

handle_imbalance could be defined as: 'auto' | 'oversampling' | 'undersampling' | 'sample_weights_balanced' | 'sample_weights_sqrt_balanced' | float[] | float

Here, float would be a threshold for enabling 'auto', and float[] is a vector of pre-calculated sample_weights. If the float passed is 0 or 1, this step is skipped. 'auto' could always be handled by sklearn class_weights / sample_weights for a binary/multi-class target.

I want to use it together with fading object importance, so I can implement a small prototype to merge. It will support: …
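A rough sketch of how the proposed sample-weight options could be computed: the handle_imbalance option names follow the proposal above and are not an existing mljar-supervised API, while compute_sample_weight is the real scikit-learn utility.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

def get_sample_weights(y, handle_imbalance="sample_weights_balanced"):
    """Sketch only: option names come from the proposal above, not an existing API."""
    if handle_imbalance == "sample_weights_balanced":
        # Inverse-frequency weights, same scheme as sklearn's class_weight="balanced".
        return compute_sample_weight("balanced", y)
    if handle_imbalance == "sample_weights_sqrt_balanced":
        # Softer variant: square root of the balanced weights.
        return np.sqrt(compute_sample_weight("balanced", y))
    if isinstance(handle_imbalance, (list, np.ndarray)):
        # Pre-calculated sample weights passed by the user.
        return np.asarray(handle_imbalance, dtype=float)
    return None  # 0, 1, or an unsupported value -> skip the step

y = np.array([0] * 90 + [1] * 10)
weights = get_sample_weights(y)  # minority-class rows get ~9x larger weights
```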
I just saw that no work was done to implement this, so I could take this feature. Also, I want to implement the TabR, NODE, and GATE algorithms here, because they are cutting-edge algorithms for binary/multi-class problems. I will probably create a new issue for that and implement it in a new PR. NODE/GATE could give a few percent boost compared to CatBoost on most datasets, and TabR shows a good improvement compared to KMeans on all datasets relevant to KMeans.
There are also cases where a combination of undersampling and oversampling could be used. Would that be okay? Also, I'm not sure about running that after the not_so_random step, because hyperparameters can behave differently with different types of imbalanced-data handling, so it's probably better to run it after the default_algorithms step.
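For illustration, a minimal sketch of combining oversampling and undersampling with imbalanced-learn (SMOTE followed by random undersampling); the dataset and sampling ratios are arbitrary placeholders:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Toy imbalanced dataset (placeholder for a real one).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Oversample the minority class up to 30% of the majority size...
X_over, y_over = SMOTE(sampling_strategy=0.3, random_state=42).fit_resample(X, y)
# ...then undersample the majority down to a 2:1 majority-to-minority ratio.
X_res, y_res = RandomUnderSampler(sampling_strategy=0.5, random_state=42).fit_resample(X_over, y_over)

print(Counter(y), "->", Counter(y_res))
```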
Hi @strukevych, there are many ways to implement it. I think it is good to start with a simple solution. I would love to check a prototype. Do you have example public datasets for testing?
Yep, will create a PR after testing :)
Consider adding an option to handle imbalanced data, for example with https://github.com/scikit-learn-contrib/imbalanced-learn. It can be implemented in a similar way as the Golden Features step.