Added an implementation to Stratify Data #2671

Abilityguy · 2020-10-13T07:43:00Z

Reference #2662

@rcurtin challenge completed!

The current implementation seems to be faster than the one in the issue.

Dataset 1: covertype dataset (https://www.mlpack.org/datasets/covertype-small.data.csv.gz)

Dataset size - 54 x 100000
Test ratio - 0.3

Previous time:
[INFO ]   loading_data: 0.458566s
[INFO ]   saving_data: 2.796008s
[INFO ]   total_time: 0.641220s

Improved time:
[INFO ]   loading_data: 0.366347s
[INFO ]   saving_data: 2.499058s
[INFO ]   total_time: 0.413254s

Dataset 2: MNIST train dataset from Kaggle (https://www.kaggle.com/c/digit-recognizer/data)

Dataset size - 784 x 42000
Test ratio - 0.2

Previous time:
[INFO ]   loading_data: 2.912084s
[INFO ]   saving_data: 19.503108s
[INFO ]   total_time: 3.998145s

Improved time:
[INFO ]   loading_data: 2.153167s
[INFO ]   saving_data: 14.734252s
[INFO ]   total_time: 2.406783s

I have added some comments in stratified_split_data.hpp to convey the basic idea.
Let me know what you think.

src/mlpack/core/data/stratified_split_data.hpp

Abilityguy · 2020-10-14T13:35:09Z

Hi @zoq. I have made the changes.

I was wondering whether it would be better to put the stratified split templates in split_data.hpp itself.
That would let us refactor the code in preprocess_split_main.cpp into a single call to data::Split(..., stratify_data), where stratify_data is an extra flag that indicates whether to stratify or not.
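
To illustrate the single-entry-point idea, here is a minimal sketch. It is not mlpack's actual code: everything in the `sketch` namespace (the `Indices` alias, `PlainSplit`, `StratifiedSplit`, and this `Split` signature) is hypothetical, it works on index vectors instead of Armadillo matrices, and it skips shuffling so the behaviour is easy to follow.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of mlpack

using Indices = std::vector<std::size_t>;

// Plain split: the last floor(testRatio * n) positions become the test set.
inline std::pair<Indices, Indices> PlainSplit(const std::size_t n,
                                              const double testRatio)
{
  const std::size_t testSize = static_cast<std::size_t>(testRatio * n);
  Indices train, test;
  for (std::size_t i = 0; i < n; ++i)
    (i < n - testSize ? train : test).push_back(i);
  return {train, test};
}

// Stratified split: apply the same rule separately inside each class.
inline std::pair<Indices, Indices> StratifiedSplit(
    const std::vector<std::size_t>& labels, const double testRatio)
{
  // Group point indices by label (ordered by label value).
  std::map<std::size_t, Indices> byClass;
  for (std::size_t i = 0; i < labels.size(); ++i)
    byClass[labels[i]].push_back(i);

  Indices train, test;
  for (const auto& entry : byClass)
  {
    const Indices& members = entry.second;
    const std::size_t testSize =
        static_cast<std::size_t>(testRatio * members.size());
    for (std::size_t j = 0; j < members.size(); ++j)
      (j < members.size() - testSize ? train : test).push_back(members[j]);
  }
  return {train, test};
}

// Single entry point: the caller only flips a flag.
// Returns {train indices, test indices}.
inline std::pair<Indices, Indices> Split(
    const std::vector<std::size_t>& labels,
    const double testRatio,
    const bool stratifyData)
{
  assert(testRatio >= 0.0 && testRatio <= 1.0);
  return stratifyData ? StratifiedSplit(labels, testRatio)
                      : PlainSplit(labels.size(), testRatio);
}

}  // namespace sketch
```

With labels {0,0,0,0,0,0,0,0,1,1,1,1} and a test ratio of 0.25, the plain path puts positions 9, 10 and 11 (all class 1) in the test set, while the stratified path picks 6, 7 and 11, which keeps the 2:1 class ratio of the input.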

What is your take on this?

Yashwants19

I guess we should also add some tests here, both for the method (tests/split_data_test.cpp) and for the bindings (main_tests/preprocess_split_test.cpp).

src/mlpack/methods/preprocess/preprocess_split_main.cpp

Abilityguy · 2020-10-19T06:04:02Z

Hey @Yashwants19, thanks for the review.
I have made changes according to your inputs.
I also deleted the stratified_split_data.hpp file and moved its templates to split_data.hpp.
The documentation check seems to be failing. It might be due to a mistake I made while documenting, but I am not sure how to fix it. If someone can let me know how to fix it, I will make the change.

src/mlpack/core/data/split_data.hpp

zoq

I think the plan right now is to add some tests for the new feature?

src/mlpack/core/data/split_data.hpp

Abilityguy · 2020-10-21T06:22:01Z

Yes, I will look into adding a few tests.

src/mlpack/core/data/split_data.hpp

src/mlpack/tests/split_data_test.cpp

zoq · 2020-10-31T20:41:55Z

> Yup. I too expected the opposite.
> I made another check and recorded the timing over 5 runs and got similar results. The indices approach is faster compared to the labelMap and testLabelMap approach.
> Of course, the indices approach takes more memory so it's a trade off.
> Let me know which implementation would be better.

The timings are pretty close, maybe the hashing isn't perfect and the lookup is not O(1)? Personally I would go with the current approach just to save some memory.

Abilityguy · 2020-11-01T15:07:20Z

Hey @zoq, it seems like you are right. The hashing may not have been O(1). I made another implementation where I replaced the unordered_map with an arma::uvec vector.

I had to make an extra pass to find the maximum label value (inputLabel.max()), but I found that this implementation achieves the best of both worlds: it saves memory and is faster.
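
The idea of swapping the hash map for a label-indexed array can be sketched as follows. This is only an illustration under stated assumptions, not the code in split_data.hpp: `std::vector<std::size_t>` stands in for `arma::uvec`, the names `SplitIndices` and `StratifiedSplitDense` are made up for this example, labels are assumed to be small non-negative integers, and shuffling is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: positions of the points in each subset.
struct SplitIndices
{
  std::vector<std::size_t> train;
  std::vector<std::size_t> test;
};

// Stratified split using arrays indexed directly by label instead of a
// hash map. std::vector is a stand-in for arma::uvec here.
inline SplitIndices StratifiedSplitDense(
    const std::vector<std::size_t>& labels, const double testRatio)
{
  assert(testRatio >= 0.0 && testRatio <= 1.0);
  SplitIndices out;
  if (labels.empty())
    return out;

  // Extra pass: the largest label fixes the size of the lookup tables.
  const std::size_t maxLabel =
      *std::max_element(labels.begin(), labels.end());

  // counts[l] = number of points with label l (plain indexing, no hashing).
  std::vector<std::size_t> counts(maxLabel + 1, 0);
  for (const std::size_t l : labels)
    ++counts[l];

  // quota[l] = how many points of label l still have to go to the test set.
  std::vector<std::size_t> quota(maxLabel + 1, 0);
  for (std::size_t l = 0; l <= maxLabel; ++l)
    quota[l] = static_cast<std::size_t>(testRatio * counts[l]);

  // Final pass: fill each class's test quota first, the rest goes to train.
  for (std::size_t i = 0; i < labels.size(); ++i)
  {
    if (quota[labels[i]] > 0)
    {
      out.test.push_back(i);
      --quota[labels[i]];
    }
    else
    {
      out.train.push_back(i);
    }
  }
  return out;
}
```

The lookup tables have `maxLabel + 1` entries, so this layout only pays off when the labels are dense (for example after label normalization). With sparse or very large label values a map would use less memory.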

The times are listed below (average of 5).

Dataset 1: MNIST dataset

StratifiedSplit (with indices vector):   0.123578 s
StratifiedSplit (with unordered map):    0.125799 s
StratifiedSplit (with arma::uvec):       0.1226292 s

Dataset 2: Covertype dataset

StratifiedSplit (with indices vector):   0.0388812 s
StratifiedSplit (with unordered map):    0.040097 s
StratifiedSplit (with arma::uvec):       0.0356198 s

I think we can stick to the arma::uvec based implementation. What is your take @rcurtin ?

src/mlpack/core/data/split_data.hpp

Abilityguy · 2020-11-08T17:27:56Z

Hey @rcurtin, could you take a look at this when you have the time?
I also wanted to know how we could go ahead and resolve the issue raised by @zoq.

rcurtin

Sorry for the slightly slow response on my end. I have just a couple comments but everything is looking good to me. Thanks for putting time into this! From my perspective I think it is just about ready. 👍

src/mlpack/core/data/split_data.hpp

Abilityguy · 2020-11-11T08:28:10Z

> Sorry for the slightly slow response on my end. I have just a couple comments but everything is looking good to me. Thanks for putting time into this! From my perspective I think it is just about ready.

Not an issue. :)
Thanks for the review comments!
I have made the required documentation and style changes.

src/mlpack/core/data/split_data.hpp

Abilityguy · 2020-11-12T18:48:15Z

Oops, looks like I missed testing the stratify_data flag in mlpack/tests/main_tests/preprocess_split_test.cpp.
I will add a few tests in there.

Abilityguy · 2020-11-13T14:09:45Z

I have added tests in preprocess_split_test.cpp to test the bindings.
This PR should be ready for merge now.

Also, I noticed that preprocess_split_test.cpp loads data with the snippet below.

data::Load("vc2.csv", inputData);

If the file doesn't exist or fails to load, a test case can still pass (for example, when its checks only depend on the input size), which hides the loading error.

A better way to load data would be

  if (!data::Load("vc2.csv", inputData))
    FAIL("Cannot load train dataset vc2.csv!");

In this case, if the file is not loaded properly, the error message shows up in the testing log.
A quick search through the Catch test files showed that most of them follow this convention, though a few do not. Some of these are:

nbc_test.cpp
feedforward_network_test.cpp
krann_search_test.cpp
random_forest_test.cpp
facilities_test.cpp

Even if the test doesn't depend on the input size in any way, I think it would be a good idea to add the FAIL() message just to ensure better testing and debugging.
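
A small self-contained sketch of why the unchecked load is risky. None of this is mlpack or Catch code: `LoadNumbers` is a hypothetical stand-in for a loader that reports failure through its return value, and throwing `std::runtime_error` stands in for a `FAIL()` call.

```cpp
#include <cassert>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical loader: returns false on failure and leaves `out` empty.
inline bool LoadNumbers(const std::string& path, std::vector<double>& out)
{
  out.clear();
  std::ifstream in(path);
  if (!in)
    return false;
  double value;
  while (in >> value)
    out.push_back(value);
  return true;
}

// A check whose work scales with the input size: it loops over the data,
// so it is vacuously true when the data is empty.
inline bool AllNonNegative(const std::vector<double>& data)
{
  for (const double v : data)
    if (v < 0.0)
      return false;
  return true;
}

// Unguarded test body: the return value of the loader is ignored.
inline void UnguardedTest(const std::string& path)
{
  std::vector<double> data;
  LoadNumbers(path, data);       // A failure here goes unnoticed.
  assert(AllNonNegative(data));  // Passes on empty data.
}

// Guarded test body: a failed load stops the test with a clear message.
inline void GuardedTest(const std::string& path)
{
  std::vector<double> data;
  if (!LoadNumbers(path, data))
    throw std::runtime_error("Cannot load dataset " + path);
  assert(AllNonNegative(data));
}
```

Calling `UnguardedTest` with a path that does not exist returns normally, so the test would be reported as passing, while `GuardedTest` with the same path throws and names the missing file.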

I could open an issue on this and I think this could be a good first issue to introduce new contributors to the codebase.
What are your thoughts on this?

rcurtin

Looks good, thanks for taking the time to implement and tune this! 👍 Do you want to add a note to HISTORY.md too?

rcurtin · 2020-11-14T00:21:23Z

> I could open an issue on this and I think this could be a good first issue to introduce new contributors to the codebase.
> What are your thoughts on this?

Totally agreed---if you'd like to take the lead on that, it would be great! 👍

Abilityguy · 2020-11-14T06:14:11Z

> Looks good, thanks for taking the time to implement and tune this! Do you want to add a note to HISTORY.md too?

Yup. Added a note to HISTORY.md. Thanks for the detailed reviews. I learned a lot while coding this feature! :)

> Totally agreed---if you'd like to take the lead on that, it would be great! 👍🏼

Sure. I will open up an issue soon.

mlpack-bot

Second approval provided automatically after 24 hours. 👍

rcurtin · 2020-11-15T01:12:03Z

Thanks @Abilityguy!

Abilityguy added 6 commits October 8, 2020 22:40

StratifiedSplit file added with basic code

c0f725e

Added Stratified split implementation

db94a2c

Minor changes to preprocess_split file

e6985fc

Stratified Split implementation done

9d34e23

Added Stratified Split implementation and integrated with preprocess …

1f8697b

…split

Minor style fix

196d561

mlpack-bot bot added s: needs review s: unanswered s: unlabeled labels Oct 13, 2020

Basic idea added in comments

dd11c74

Yashwants19 added c: methods t: added feature and removed s: unanswered s: unlabeled labels Oct 13, 2020

zoq reviewed Oct 13, 2020

Refactoring code and style fixes

a7befb5

rcurtin added this to the mlpack 3.4.2 milestone Oct 16, 2020

Yashwants19 reviewed Oct 18, 2020

src/mlpack/methods/preprocess/preprocess_split_main.cpp Outdated

src/mlpack/methods/preprocess/preprocess_split_main.cpp Outdated

src/mlpack/methods/preprocess/preprocess_split_main.cpp Outdated

Moved StratifiedSplit templates moved into split_data.hpp and refacto…

2f36940

…red code in preprocess_split_main.cpp

Yashwants19 reviewed Oct 19, 2020

src/mlpack/core/data/split_data.hpp

Fix for documentation failing test

fdc01e5

zoq reviewed Oct 20, 2020

src/mlpack/core/data/split_data.hpp Outdated

src/mlpack/core/data/split_data.hpp Outdated

src/mlpack/core/data/split_data.hpp Outdated

Code review changes

d06cf96

Abilityguy added 2 commits October 23, 2020 21:14

Added tests for stratified split

86bc837

Possible fix for failing tests

9eabc40

zoq reviewed Oct 23, 2020

Review changes

e7539c6

Changed unordered map implementation to uvec implementation

76fe709

zoq reviewed Nov 3, 2020

src/mlpack/core/data/split_data.hpp Outdated

Abilityguy added 2 commits November 8, 2020 10:21

Removed the 'ReportIgnoredParam' attribute

1a49c5f

Fix for failing style check

b1e8382

rcurtin reviewed Nov 11, 2020

src/mlpack/core/data/split_data.hpp Outdated

src/mlpack/core/data/split_data.hpp Outdated

src/mlpack/core/data/split_data.hpp Outdated

Changes made based on review comments

2fc09bf

rcurtin reviewed Nov 11, 2020

src/mlpack/core/data/split_data.hpp Outdated

Changed to direct looping of labels

3f2220f

Abilityguy added 2 commits November 13, 2020 19:00

Added tests to preprocess_split_test

67dc33e

Removed unused variables in a test case

04f2d46

rcurtin approved these changes Nov 14, 2020

Abilityguy added 2 commits November 14, 2020 11:34

Merge branch 'master' into StratifiedSplit

db62edf

Modified HISTORY.md

765d278

Abilityguy mentioned this pull request Nov 14, 2020

Adding a FAIL message when data::Load is used in test files. #2715

Closed

36 tasks

mlpack-bot bot approved these changes Nov 15, 2020

mlpack-bot bot removed the s: needs review label Nov 15, 2020

rcurtin merged commit f3e3735 into mlpack:master Nov 15, 2020

Abilityguy deleted the StratifiedSplit branch November 15, 2020 12:59

This was referenced Oct 14, 2022

Release version 4.0.0 #3285

Closed

Release version 4.0.0 #3286

Closed

rcurtin mentioned this pull request Oct 23, 2022

Release version 4.0.0 #3293

Merged
