Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added an implementation to Stratify Data #2671

Merged
merged 27 commits into from Nov 15, 2020

Conversation

Abilityguy
Copy link
Contributor

@Abilityguy Abilityguy commented Oct 13, 2020

Reference #2662

@rcurtin challenge completed!

The current implementation seems to be faster than the one in the issue.

Dataset 1: covertype dataset (https://www.mlpack.org/datasets/covertype-small.data.csv.gz)

Dataset size - 54 x 100000
Test ratio - 0.3

Previous time:
[INFO ]   loading_data: 0.458566s
[INFO ]   saving_data: 2.796008s
[INFO ]   total_time: 0.641220s

Improved time:
[INFO ]   loading_data: 0.366347s
[INFO ]   saving_data: 2.499058s
[INFO ]   total_time: 0.413254s

Dataset 2: MNIST train dataset from Kaggle (https://www.kaggle.com/c/digit-recognizer/data)

Dataset size - 784 x 42000
Test ratio - 0.2

Previous time:
[INFO ]   loading_data: 2.912084s
[INFO ]   saving_data: 19.503108s
[INFO ]   total_time: 3.998145s

Improved time:
[INFO ]   loading_data: 2.153167s
[INFO ]   saving_data: 14.734252s
[INFO ]   total_time: 2.406783s

I have added some comments to convey the basic idea in stratified_split_data.hpp
Let me know what you guys think.

src/mlpack/core/data/stratified_split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/stratified_split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/stratified_split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/stratified_split_data.hpp Outdated Show resolved Hide resolved
@Abilityguy
Copy link
Contributor Author

Hi @zoq. I have made the changes.

I was thinking if it would be better to have the stratified split templates in the file split_data.hpp file itself.
This would help to refactor the code in preprocess_split_main.cpp into a single call to data::Split(..., stratify_data) with stratify_data being an extra flag that indicates whether to stratify or not.

What is your take on this?

@rcurtin rcurtin added this to the mlpack 3.4.2 milestone Oct 16, 2020
Copy link
Member

@Yashwants19 Yashwants19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should also add some tests here, both for method(tests/split_data_test.cpp) as well as bindings(main_tests/preprocess_split_test.cpp).

src/mlpack/methods/preprocess/preprocess_split_main.cpp Outdated Show resolved Hide resolved
src/mlpack/methods/preprocess/preprocess_split_main.cpp Outdated Show resolved Hide resolved
src/mlpack/methods/preprocess/preprocess_split_main.cpp Outdated Show resolved Hide resolved
@Abilityguy
Copy link
Contributor Author

Abilityguy commented Oct 19, 2020

Hey @Yashwants19 , thanks for the review.
I have made changes according to your inputs.
I also deleted the stratified_split_data.hpp file and moved it's template to split_data.hpp.
There seems to be a documentation fail. It might be due to a mistake by me while documenting. I am not sure about it's fix though. If someone can let me know how to fix it, I will make the change.

Copy link
Member

@zoq zoq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the plan right now is to add some tests for the new feature?

src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
@Abilityguy
Copy link
Contributor Author

Yes, I will look into adding a few tests.

src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/tests/split_data_test.cpp Outdated Show resolved Hide resolved
src/mlpack/tests/split_data_test.cpp Outdated Show resolved Hide resolved
src/mlpack/tests/split_data_test.cpp Outdated Show resolved Hide resolved
@zoq
Copy link
Member

zoq commented Oct 31, 2020

Yup. I too expected the opposite.
I made another check and recorded the timing over 5 runs and got similar results. The indices approach is faster compared to the labelMap and testLabelMap approach.
Of course, the indices approach takes more memory so it's a trade off.
Let me know which implementation would be better.

The timings are pretty close, maybe the hashing isn't perfect and the lookup is not O(1)? Personally I would go with the current approach just to save some memory.

@Abilityguy
Copy link
Contributor Author

Abilityguy commented Nov 1, 2020

Hey @zoq, it seems like you are right. The hashing may not have been O(1). I made another implementation where I replaced the unordered_map with an arma::uvec vector.

I had to make an extra pass to figure out the maximum value of the label. (inputLabel.max()). But I found that this implementation achieves the best of both worlds by saving memory and being faster.

The times are listed below (average of 5).

Dataset 1: MNIST dataset

StratifiedSplit (with indicies vector):  0.123578 s
StratifiedSplit (with unordered map):    0.125799 s
StratifiedSplit (with arma::uvec):       0.1226292 s
Dataset 2: Covertype dataset

StratifiedSplit (with indicies vector):  0.0388812 s
StratifiedSplit (with unordered map):    0.040097 s
StratifiedSplit (with arma::uvec):       0.0356198 s

I think we can stick to the arma::uvec based implementation. What is your take @rcurtin ?

@Abilityguy
Copy link
Contributor Author

Hey @rcurtin, could you take a look at this when you have the time?
I also wanted to know how we could go ahead and solve the issue raised by @zoq .

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slightly slow response on my end. I have just a couple comments but everything is looking good to me. Thanks for putting time into this! From my perspective I think it is just about ready. 👍

src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/split_data.hpp Outdated Show resolved Hide resolved
@Abilityguy
Copy link
Contributor Author

Abilityguy commented Nov 11, 2020

Sorry for the slightly slow response on my end. I have just a couple comments but everything is looking good to me. Thanks for putting time into this! From my perspective I think it is just about ready.

Not an issue. :)
Thanks for the review comments!.
I have made the required documentation and style changes.

@Abilityguy
Copy link
Contributor Author

Oops. looks I missed out testing the stratify_data flag in mlpack/tests/main_tests/preprocess_split_test.cpp.
I will add a few tests in there.

@Abilityguy
Copy link
Contributor Author

Abilityguy commented Nov 13, 2020

I have added tests in preprocess_split_test.cpp to test the bindings.
This PR should be ready for merge now.

Also, I noticed that data was loaded using the below snippet in preprocess_split_test.cpp.

data::Load("vc2.csv", inputData);

If the file doesn't exist or there is an error in loading the file, there are instances where the test case can pass (if test case depends on input size) even though there is an error in loading the file.

A better way to load data would be

  if (!data::Load("vc2.csv", inputData))
    FAIL("Cannot load train dataset vc2.csv!");

In this case, if the file is not loaded properly, the error message comes up in the testing log.
A quick search through the catch testing files showed most files follow this convention in the testing suite, though there are a few files where this is not followed. Some of these are

  • nbc_test.cpp
  • feedforward_network_test.cpp
  • krann_search_test.cpp
  • random_forest_test.cpp
  • facilities_test.cpp

Even if the test doesn't depend on the input size in any way, I think it would be a good idea to add the FAIL() message just to ensure better testing and debugging.

I could open an issue on this and I think this could be a good first issue to introduce new contributors to the codebase.
What are your thoughts on this?

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for taking the time to implement and tune this! 👍 Do you want to add a note to HISTORY.md too?

@rcurtin
Copy link
Member

rcurtin commented Nov 14, 2020

I could open an issue on this and I think this could be a good first issue to introduce new contributors to the codebase.
What are your thoughts on this?

Totally agreed---if you'd like to take the lead on that, it would be great! 👍

@Abilityguy
Copy link
Contributor Author

Looks good, thanks for taking the time to implement and tune this! Do you want to add a note to HISTORY.md too?

Yup. Added a note to HISTORY.md. Thanks for the detailed reviews. I learned a lot while coding this feature! :)

Totally agreed---if you'd like to take the lead on that, it would be great! 👍🏼

Sure. I will open up an issue soon.

Copy link

@mlpack-bot mlpack-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

@rcurtin rcurtin merged commit f3e3735 into mlpack:master Nov 15, 2020
@rcurtin
Copy link
Member

rcurtin commented Nov 15, 2020

Thanks @Abilityguy!

@Abilityguy Abilityguy deleted the StratifiedSplit branch November 15, 2020 12:59
This was referenced Oct 14, 2022
@rcurtin rcurtin mentioned this pull request Oct 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants