
Handle Imbalanced Datasets #157

Open · pplonski opened this issue Aug 28, 2020 · 9 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

@pplonski (Contributor)

Consider adding an option to handle imbalanced data, e.g. with https://github.com/scikit-learn-contrib/imbalanced-learn.

It can be implemented in a similar way to the Golden Features step.
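
For context, a minimal sketch of what resampling with imbalanced-learn looks like on a synthetic dataset (the sampler choice is illustrative, not the proposed design):

```python
# Hedged sketch: resampling an imbalanced dataset with imbalanced-learn.
# RandomOverSampler is one of several samplers the library offers; the
# dataset below is synthetic.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))     # heavily imbalanced

sampler = RandomOverSampler(random_state=42)
X_res, y_res = sampler.fit_resample(X, y)
print("after:", Counter(y_res))  # minority class duplicated up to balance
```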

pplonski added the enhancement and help wanted labels on Aug 28, 2020
@tmontana commented Sep 5, 2020

Is it possible, in the meantime, to add a scale_pos_weight advanced option for xgboost? Thanks

@pplonski (Contributor, Author) commented Sep 7, 2020

@tmontana I've created issue #168 for adding scale_pos_weight. I did a little research on the scale_pos_weight parameter:

  • it works only for binary classification
  • for multiclass classification, a weight should be passed for each sample in the data (during creation of the DMatrix). This solution is more general, because it will also allow handling sample weights passed by the user (or created on the fly for imbalanced data). See the sketch below.
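
For illustration only (none of this is mljar-supervised code), a sketch of the two approaches just described, using xgboost's public API:

```python
# Binary-only scale_pos_weight ratio versus general per-sample weights
# passed to the DMatrix. Data below is random dummy data.
import numpy as np
import xgboost as xgb
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = rng.integers(0, 3, size=300)  # 3-class target

# Binary classification only: a single ratio, e.g. n_negative / n_positive
# params = {"objective": "binary:logistic",
#           "scale_pos_weight": (y == 0).sum() / (y == 1).sum()}

# General (works for multiclass): one weight per sample, given to the DMatrix
weights = compute_sample_weight(class_weight="balanced", y=y)
dtrain = xgb.DMatrix(X, label=y, weight=weights)
params = {"objective": "multi:softprob", "num_class": 3}
booster = xgb.train(params, dtrain, num_boost_round=10)
```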

@shahules786 (Contributor)

I can work on this. But shouldn't we also add oversampling and undersampling before we add imbalanced-learn? It could be in the same module, e.g. handle_imbalance.py? @pplonski

@pplonski (Contributor, Author)

There are three methods to handle imbalanced datasets (sketched below):

  • sample weights
  • oversampling
  • undersampling

The performance of each method depends on the dataset. I would add them to the AutoML steps after the not_so_random step (before golden_features), see the docs. I would add a handle_imbalance parameter to AutoML.__init__().

However, maybe the advanced metrics for evaluation should be implemented first (#73)? And the weight vector feature (#154). Then we can start on imbalanced data. Please take a look at the roadmap in the docs.
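
A hedged illustration of the three approaches above, using sklearn's compute_sample_weight and imbalanced-learn's random samplers on a synthetic 9:1 dataset; the variable names are placeholders, not the AutoML API:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X = np.random.rand(200, 4)
y = np.array([0] * 180 + [1] * 20)

# 1) sample weights: keep all rows, reweight them in the loss
sample_weight = compute_sample_weight(class_weight="balanced", y=y)

# 2) oversampling: duplicate minority rows until the classes match
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)

# 3) undersampling: drop majority rows until the classes match
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
```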

@shahules786 (Contributor)

Sure @pplonski

@strukevych commented Nov 8, 2023

@pplonski
Sample weights are the preferred approach because they allow us to manipulate the importance of individual objects.

Some binary/multi-class datasets also require rolling, time-based fading of sample importance, and that is possible only with sample weights.
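
As a rough sketch of combining that time-based fading with class balancing (the column names, decay horizon, and the multiplicative combination are all assumptions on my part):

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

df = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=100, freq="D"),
    "target": np.random.randint(0, 2, size=100),
})

# class-balance component
class_w = compute_sample_weight(class_weight="balanced", y=df["target"])

# time-decay component: recent rows keep weight ~1, older rows fade out
age_days = (df["timestamp"].max() - df["timestamp"]).dt.days
time_w = np.exp(-age_days / 30.0)  # 30-day decay horizon (assumption)

sample_weight = class_w * time_w   # per-row weight handed to the model
```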

Oversampling could be a solution for very tiny datasets only, which is a rare case.

handle_imbalance can be defined as: 'auto' | 'oversampling' | 'undersampling' | 'sample_weights_balanced' | 'sample_weights_sqrt_balanced' | float[] | float

Here float would be a threshold for enabling auto, and float[] a vector of pre-calculated sample_weights.

If the float passed is 0 or 1, this step is skipped.

'auto' could always be handled by the sklearn class_weights / sample_weights for the binary/multi-class target. It would call oversampling only if the dataset has < 1000 entries or < 1/(10*N)% for one of the classes.

I want to use it together with fading object importance so I can implement a small prototype to merge.

It will support (see the resolver sketch after this list):

  • passing float[] for custom sample weights
  • passing float as a threshold for enabling sample weights
  • passing 'auto' to enable balanced sample weights
  • passing 0 or 1 to turn off sample weights
  • passing 'sample_weights_balanced'
  • passing 'sample_weights_sqrt_balanced' (there are ways to detect which variant is better, but that would require a lot of checks and be very dependent on the exact use case, so it is better to select it by a string rather than auto-detect it)
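
A rough sketch of how these values could be resolved into per-row sample weights; the function name, threshold semantics, and sqrt variant follow my reading of the proposal rather than any existing API, and the 'oversampling' / 'undersampling' values are omitted because they resample rows instead of weighting them:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight


def resolve_handle_imbalance(handle_imbalance, y):
    # Hypothetical helper, not part of mljar-supervised.
    y = np.asarray(y)
    if isinstance(handle_imbalance, (list, np.ndarray)):
        # float[]: user-provided, pre-calculated sample weights
        return np.asarray(handle_imbalance, dtype=float)
    if handle_imbalance in (0, 1):
        # 0 or 1: skip the step, i.e. uniform weights
        return np.ones(len(y))
    if isinstance(handle_imbalance, float):
        # float threshold: balance only if the rarest class falls below it
        _, counts = np.unique(y, return_counts=True)
        if counts.min() / len(y) >= handle_imbalance:
            return np.ones(len(y))
        handle_imbalance = "sample_weights_balanced"
    if handle_imbalance in ("auto", "sample_weights_balanced"):
        # per the proposal, 'auto' would also switch to oversampling for
        # tiny datasets (< 1000 rows or a very rare class), not shown here
        return compute_sample_weight(class_weight="balanced", y=y)
    if handle_imbalance == "sample_weights_sqrt_balanced":
        # softened variant: square root of the balanced weights
        return np.sqrt(compute_sample_weight(class_weight="balanced", y=y))
    raise ValueError(f"Unknown handle_imbalance value: {handle_imbalance}")
```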

I just saw that no work has been done to implement this yet, so I could take this feature.

Also, I want to implement the TabR, NODE, and GATE algorithms here, because they are cutting-edge algorithms for binary/multi-class problems.

I will probably create a new issue for that and implement it in a new PR. TabR is KMeans on steroids, and NODE/GATE will be pretty similar to CatBoost/XGBoost/LightGBM, just more complicated.

NODE/GATE could give a few percent boost compared to CatBoost on most datasets, and TabR shows a good improvement over KMeans on all datasets where KMeans is relevant.

@strukevych commented Nov 8, 2023

@pplonski

There are also cases where a combination of undersampling and oversampling could be used, but I don't want to include that in the first prototype. Would that be okay?
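
For reference, a hedged example of combining over- and undersampling with imbalanced-learn; SMOTETomek is my choice for illustration, the issue does not name a specific combiner:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE oversamples the minority class, then Tomek links remove borderline
# majority samples
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```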

Also, I'm not sure about running this after the not_so_random step, because hyperparameters can behave differently depending on how the imbalanced dataset is handled, so it is probably better to run it after the default_algorithms step.

@pplonski (Contributor, Author) commented Nov 8, 2023

Hi @strukevych,

There are many ways to implement it. I think it is good to start with a simple solution. I would love to check a prototype. Do you have example public datasets for testing?

@strukevych


Yep, I will create a PR after testing :)
