Adding an option to get stratified Train-Test splits #2662
Comments
Thanks for opening the detailed issue. Feel free to open a PR on this one. My only concern is that it might be more efficient to build an index vector first and then to use
The index vector idea makes sense. I will change the implementation to reflect that. Thanks @zoq.
Sounds good, thanks.
@Abilityguy looks like you did most of the work already! 😄 As a fun challenge, I think that this can be done most efficiently with only a single pass over the label set and a single pass over the dataset. 👍
@rcurtin challenge accepted! 😂
This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍
Keep open.
Thanks @Abilityguy! I forgot that this issue was open and needed to be closed after the PR was merged. :)
What is the desired addition or change?
Addition of an option to get stratified train-test splits in mlpack. While going through the docs, I noticed we don't have such an option yet. I brought this up on IRC a few weeks ago and decided to take it up and open an issue for it.
What is the motivation for this feature?
In a stratified train-test split, the proportion of each label in the dataset is maintained across the train set and the test set. For datasets with imbalanced classes, it would be desirable to reflect this class imbalance in both the train and test sets.
If applicable, describe how this feature would be implemented.
I referred to the implementation of split in mlpack/src/mlpack/core/data/split_data.hpp and wrote a sample template. Let me know what you guys think about this.
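For readers who want a concrete picture of the idea, here is a minimal sketch of a stratified split that returns index vectors. It is not the template referred to above and not mlpack's API: the function name `StratifiedSplitIndices`, its parameters, and the use of `std::vector` instead of Armadillo types are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper (not part of mlpack). Returns {trainIndices, testIndices}
// so that every label contributes roughly `testRatio` of its points to the
// test set.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
StratifiedSplitIndices(const std::vector<std::size_t>& labels,
                       const double testRatio,
                       const unsigned seed = 42)
{
  assert(testRatio >= 0.0 && testRatio <= 1.0);

  // One pass over the labels: bucket the point indices by class.
  std::map<std::size_t, std::vector<std::size_t>> byClass;
  for (std::size_t i = 0; i < labels.size(); ++i)
    byClass[labels[i]].push_back(i);

  std::mt19937 rng(seed);
  std::vector<std::size_t> train, test;
  for (auto& entry : byClass)
  {
    std::vector<std::size_t>& idx = entry.second;
    // Shuffle within the class so the chosen test points are random.
    std::shuffle(idx.begin(), idx.end(), rng);

    // Number of test points for this class, rounded to the nearest integer.
    const std::size_t nTest =
        static_cast<std::size_t>(testRatio * idx.size() + 0.5);
    const auto mid = idx.begin() + static_cast<std::ptrdiff_t>(nTest);

    test.insert(test.end(), idx.begin(), mid);
    train.insert(train.end(), mid, idx.end());
  }
  return {train, test};
}
```

The two index vectors can then be used to copy the selected columns of the data matrix in a single pass over the dataset, which is in the spirit of the index-vector suggestion in the comments. Rounding is applied per class, so the overall test size can differ slightly from `testRatio` times the dataset size.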
Additional information?
I integrated this with mlpack_preprocess_split and ran tests on a few datasets.
Dataset 1: covertype dataset (https://www.mlpack.org/datasets/covertype-small.data.csv.gz)
Dataset 2: MNIST train dataset from Kaggle (https://www.kaggle.com/c/digit-recognizer/data)
Should I go ahead and make a PR on this?
Let me know if you guys have any suggestions or changes on this.