Added an implementation to Stratify Data #2671
Hi @zoq. I have made the changes. I was wondering whether it would be better to keep the stratified split templates in split_data.hpp itself. What is your take on this?
I guess we should also add some tests here, both for the method (tests/split_data_test.cpp) and for the bindings (main_tests/preprocess_split_test.cpp).
…red code in preprocess_split_main.cpp
Hey @Yashwants19, thanks for the review.
I think the plan right now is to add some tests for the new feature?
Yes, I will look into adding a few tests.
The timings are pretty close; maybe the hashing isn't perfect and the lookup is not O(1)? Personally, I would go with the current approach just to save some memory.
Hey @zoq, it seems like you are right. The hashing may not have been I had to make an extra pass to figure out the maximum value of the label. The times are listed below (average of 5).
I think we can stick to the
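To make the trade-off in this exchange concrete, here is a minimal sketch of the two ways of counting samples per label that the discussion seems to contrast: a hash map (expected O(1) lookups, memory proportional to the number of distinct labels) and a dense array indexed by label value (needs an extra pass to find the maximum label). It uses only the standard library; the function names are hypothetical and this is not the code from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hash-map counting: a single pass over the labels. Lookups are O(1) only in
// expectation, so collisions can make this slower than it looks on paper.
std::unordered_map<std::size_t, std::size_t>
CountLabelsHashed(const std::vector<std::size_t>& labels)
{
  std::unordered_map<std::size_t, std::size_t> counts;
  for (const std::size_t label : labels)
    ++counts[label];
  return counts;
}

// Dense-array counting: an extra pass finds the maximum label so the array
// can be sized up front; memory is proportional to (max label + 1).
std::vector<std::size_t>
CountLabelsDense(const std::vector<std::size_t>& labels)
{
  if (labels.empty())
    return {};

  const std::size_t maxLabel =
      *std::max_element(labels.begin(), labels.end());
  std::vector<std::size_t> counts(maxLabel + 1, 0);
  for (const std::size_t label : labels)
  {
    assert(label < counts.size());
    ++counts[label];
  }
  return counts;
}
```

With labels {0, 2, 2, 5}, the hashed version stores three entries, while the dense version allocates six slots, three of them zero. Which variant is faster in practice depends on the label range and the hash function, which matches the observation above that the measured timings were close.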
Sorry for the slightly slow response on my end. I have just a couple comments but everything is looking good to me. Thanks for putting time into this! From my perspective I think it is just about ready. 👍
Not an issue. :)
Oops, looks like I missed testing the stratify_data flag in mlpack/tests/main_tests/preprocess_split_test.cpp.
I have added tests in Also, I noticed that data was loaded using the below snippet in
If the file doesn't exist or fails to load, a test case can still pass (if the test case depends on the input size) even though the data was never loaded. A better way to load data would be
In this case, if the file is not loaded properly, the error message comes up in the testing log.
Even if the test doesn't depend on the input size in any way, I think it would be a good idea to add the I could open an issue on this and I think this could be a good first issue to introduce new contributors to the codebase.
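The loading concern raised in this comment can be illustrated with a small sketch. The snippets from the original comment are not reproduced on this page, so the loader below is a hypothetical stand-in written with the standard library only (LoadCsv and CountRowsOrThrow are assumed names, not mlpack functions). The idea is simply to check the loader's return value and fail with a message instead of continuing with empty data.

```cpp
#include <cstddef>
#include <exception>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical loader: returns false when the file cannot be opened or
// parsed, and leaves `rows` untouched in that case.
bool LoadCsv(const std::string& path, std::vector<std::vector<double>>& rows)
{
  std::ifstream in(path);
  if (!in.is_open())
    return false;

  std::vector<std::vector<double>> parsed;
  std::string line;
  try
  {
    while (std::getline(in, line))
    {
      if (line.empty())
        continue;
      std::vector<double> row;
      std::stringstream fields(line);
      std::string field;
      while (std::getline(fields, field, ','))
        row.push_back(std::stod(field)); // Throws on non-numeric input.
      parsed.push_back(row);
    }
  }
  catch (const std::exception&)
  {
    return false;
  }

  rows = parsed;
  return true;
}

// Test-style usage: a failed load stops the test with a clear message rather
// than letting size-based checks pass on an empty dataset.
std::size_t CountRowsOrThrow(const std::string& path)
{
  std::vector<std::vector<double>> rows;
  if (!LoadCsv(path, rows))
    throw std::runtime_error("Cannot load test dataset " + path + "!");
  return rows.size();
}
```

In a real test suite the throw would typically be replaced by the framework's failure macro, so that the message appears in the testing log.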
Looks good, thanks for taking the time to implement and tune this! 👍 Do you want to add a note to HISTORY.md too?
Totally agreed; if you'd like to take the lead on that, it would be great! 👍
Yup. Added a note to
Sure. I will open up an issue soon.
Second approval provided automatically after 24 hours. 👍
Thanks @Abilityguy!
Reference #2662
@rcurtin challenge completed!
The current implementation seems to be faster than the one in the issue.
Dataset 1: covertype dataset (https://www.mlpack.org/datasets/covertype-small.data.csv.gz)
Dataset 2: MNIST train dataset from Kaggle (https://www.kaggle.com/c/digit-recognizer/data)
I have added some comments in stratified_split_data.hpp to convey the basic idea.
Let me know what you guys think.
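For readers unfamiliar with stratified splitting, the general idea can be sketched as follows: group sample indices by label, shuffle each group, and send the same fraction of every group to the test set so class proportions are preserved. This is a standard-library illustration under assumed names (SplitIndices, StratifiedSplit); it is not the implementation in stratified_split_data.hpp and may differ from it in rounding and ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <random>
#include <vector>

// Indices of the samples assigned to each side of the split.
struct SplitIndices
{
  std::vector<std::size_t> train;
  std::vector<std::size_t> test;
};

// Stratified split over sample indices: each label contributes
// floor(count * testRatio) samples to the test set and the rest to train.
SplitIndices StratifiedSplit(const std::vector<std::size_t>& labels,
                             const double testRatio,
                             const unsigned seed)
{
  // Group sample indices by label.
  std::map<std::size_t, std::vector<std::size_t>> byLabel;
  for (std::size_t i = 0; i < labels.size(); ++i)
    byLabel[labels[i]].push_back(i);

  std::mt19937 rng(seed);
  SplitIndices out;
  for (auto& entry : byLabel)
  {
    std::vector<std::size_t>& indices = entry.second;
    // Shuffle within the class so the chosen test samples are random.
    std::shuffle(indices.begin(), indices.end(), rng);

    const std::size_t testCount =
        static_cast<std::size_t>(indices.size() * testRatio);
    const auto cut = indices.begin() + static_cast<std::ptrdiff_t>(testCount);
    out.test.insert(out.test.end(), indices.begin(), cut);
    out.train.insert(out.train.end(), cut, indices.end());
  }
  return out;
}
```

For example, with eight samples of label 0, four samples of label 1, and a test ratio of 0.25, the test set receives two samples of label 0 and one of label 1, keeping the 2:1 class ratio of the full dataset.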