Refactor for faster assembly of secondHashTable. #675

Merged
merged 103 commits into mlpack:master on Jun 14, 2016

7 participants

@rcurtin
Member
rcurtin commented Jun 4, 2016

@mentekid: a first attempt, not 100% tested. Let me know what you think. I should have time tonight to do further testing and comparison but it appears to be good so far.

rcurtin and others added some commits May 16, 2016
@rcurtin rcurtin Force positive-definiteness when training a Gaussian. 4536910
@rcurtin rcurtin Better handling of NaNs. 2744628
@rcurtin rcurtin Add --random_initialization for mlpack_hmm_train. 3e4f3ca
@rcurtin rcurtin Don't forget the period in the output.
9866203
@zoq zoq Add MultiplyConstantLayer which multiplies the input by a non-learnable constant. 2114358
@zoq zoq Add NegativeLogLikelihoodLayer class which is useful to train a classification problem with n classes. cc7bac3
@zoq zoq Add the VRClassRewardLayer class which implements the REINFORCE algorithm for classification models. To be precise, this is a Variance Reduced classification reinforcement learning rule. b69c6dc
@zoq zoq Add GlimpseLayer class which takes an input image and a location to extract a retina-like representation of the input image. be43684
@zoq zoq Remove debug message. fe69d33
@zoq zoq Add Recurrent Model of Visual Attention (RMVA) implementation. 4d8347a
@zoq zoq Removes trailing whitespaces at the end of lines. d1650e4
@zoq zoq Include split_data.hpp file into the build process, so that it is available afterwards. 989dd35
@zoq zoq Remove unused output parameter. f1bf339
@zoq zoq Add RMVA class and function documentation.
39eefde
@MarcosPividori MarcosPividori Properly use Enum type.
dd136db
@rcurtin rcurtin Merge pull request #641 from MarcosPividori/master
Properly use Enum type.
9b42c22
@dasayan05 dasayan05 Typo fix in knn_main.cpp
9b811f9
@rcurtin rcurtin Merge pull request #643 from dasayan05/master
Typo fix in knn_main.cpp
36b7316
@MarcosPividori MarcosPividori Properly use Enum type.
c82c747
@MarcosPividori MarcosPividori Remove duplicated code for traversal info.
 Instead of including: methods/neighbor_search/ns_traversal_info.hpp
 Include the definition in: core/tree/traversal_info.hpp
eef40b9
@MarcosPividori MarcosPividori Remove unnecessary include.
b4ee954
@rcurtin rcurtin Merge pull request #645 from MarcosPividori/master
Properly use Enum type, in rann and range_search.
f55427d
@rcurtin rcurtin Merge pull request #646 from MarcosPividori/traversal-info
Remove duplicated code for traversal info.
ade7fa0
@zoq zoq Remove trailing underscores in header guards as discussed in #533. 8acc4ce
@zoq zoq Add CMakeLists file to build the RMVA.
cd7f063
@zoq zoq Remove CMakeLists file; do not build the RMVA code.
c15541b
@zoq zoq Use n_rows and n_cols to define the matrix size instead of arma::size().
04fe0d5
@dasayan05 dasayan05 Deprecated arma function replaced by new arma constant
4fae385
@rcurtin rcurtin Merge pull request #648 from dasayan05/PR1
Deprecated arma function replaced by new arma constant
5d1723d
@zoq zoq Update documentation for changed names.
6f6173c
@keon keon add cli executable for data_split
a35c390
@dasayan05 dasayan05 vc2_test.csv wasn't in proper csv format
c68c3b9
@stereomatchingkiss stereomatchingkiss fix performance issue--forgot to move name parameters
fe1b6b9
@keon keon change parameter flags and names for consistency with other programs #discussion_r64743312 de4226c
@rcurtin rcurtin Merge pull request #652 from dasayan05/test
vc2_test.csv wasn't in proper csv format
290977d
@keon keon add TrainTestSplit without label
2cad593
@keon keon rename TrainTestSplit to Split
c8a60b2
@rcurtin rcurtin Merge pull request #653 from stereomatchingkiss/split_data_performance_issue
fix performance issue--forgot to move name parameters
ec1c46a
@keon keon add data split executable without labels + tests
e41b3db
@MarcosPividori MarcosPividori Fix error, a balltree should use ballbounds instead of hrectbounds. 706f0ee
@MarcosPividori MarcosPividori Set proper template order. To work with binary space trees, the metric template parameter must be the first. 9e5d9c2
@MarcosPividori MarcosPividori Proper template order to BallBound. c0cdb8b
@MarcosPividori MarcosPividori Fix error. kdtree where should be a balltree. 903be7e
@MarcosPividori MarcosPividori Change template name to be similar to the definition of hrectbound.
8285c31
@keon keon fix directive name for split_data.hpp
ef01c05
@zoq zoq Use add_cli_executable to control the CLI executables build; for more information take a look at #619. 3d1ed0f
@rcurtin rcurtin Merge pull request #655 from MarcosPividori/fix-ball-tree
Fix erroneous ball tree definition.
2fe9e82
@rcurtin rcurtin Fix spacing.
1dad2b6
@keon keon refine style and detail test
9e88669
@stereomatchingkiss stereomatchingkiss 1 : fixing issue #658, work with non-transpose data, do not work with transpose data yet 0e61fe2
@stereomatchingkiss stereomatchingkiss remove useless codes 29f0536
@stereomatchingkiss stereomatchingkiss use algorithm to replace for loop 9aac1dc
@keon keon Merge pull request #1 from mlpack/master
update
8ad3b90
@rcurtin rcurtin Document the state of loading sparse matrices.
e36eec5
@stereomatchingkiss stereomatchingkiss fix bug--isdigit will not treat 1.004e10 as digit a340e23
@stereomatchingkiss stereomatchingkiss support transpose mapping 56a0361
@stereomatchingkiss stereomatchingkiss fix format--place const before variable 0634d4a
@stereomatchingkiss stereomatchingkiss fix bugs--may read invalid stream 5d964a1
@stereomatchingkiss stereomatchingkiss remove debug message 6eb1723
@stereomatchingkiss stereomatchingkiss reduce copy 5eaa2a9
@stereomatchingkiss stereomatchingkiss add new test cases
4ea13b0
@rcurtin rcurtin A first simple test. 1beb58d
@stereomatchingkiss stereomatchingkiss fix bug--forgot to put the functions into nameless namespace
3ee8906
@rcurtin rcurtin Better Doxygen formatting.
02e31b3
@stereomatchingkiss stereomatchingkiss make TransPoseTokens as rcurtin suggested, should not use nameless namespace in header file 3e75b48
@MarcosPividori MarcosPividori Implement B_aux according to what was discussed in #642. 6877607
@MarcosPividori MarcosPividori Remove unused information from neighbor search stats. ba4c5bb
@MarcosPividori MarcosPividori Remove unnecessary include
79d07f2
@rcurtin rcurtin Adapt Keon's test into something harder.
b3fc468
@mentekid mentekid Adds code that gives controllable access to LSH projection tables e94896d
@mentekid mentekid Adds code that gives controllable access to LSH projection tables f3c4939
@mentekid mentekid Adds code that gives controllable access to LSH projection tables
934fe08
@rcurtin rcurtin I'm not sure what line width Pari used, but it wasn't 80 columns.
This will probably make the merge of #663 and other LSH improvements by Yannis harder...
e6d2ca7
@mentekid mentekid Fixes naming conventions of accessors
4773efb
@mentekid mentekid Changes LSHSearch.projections from std::vector<arma::mat> to arma::cube
b067e89
@mentekid mentekid Changes LSHSearch::projections from std::vector<arma::mat> to arma::cube
7ebf1da
@rcurtin @mentekid rcurtin Add a templated version for BOOST_CLASS_VERSION(). ff7877d
@rcurtin @mentekid rcurtin We actually need to wrap mlpack::data::SecondShim<> objects. 6958cc0
@rcurtin @mentekid rcurtin Include new serialization version macro. 1474050
@rcurtin @mentekid rcurtin Use n_slices not size() to fix correctness. 81b6b4f
@rcurtin @mentekid rcurtin Refactor Serialize(), add backwards compatibility, and update tests.
f989f1f
@rcurtin rcurtin Add Classify() functions and tests.
5546ebc
@rcurtin rcurtin Add --output_probabilities_file option.
dca52fd
@rcurtin rcurtin Fix spacing and make method const.
eba4f99
@keon keon add more program info of split
2417a19
@keon keon fix tests for split data
ac016fe
@mentekid mentekid Adds new constructor to LSHSearch and merges Train() and BuildHash() methods 06fdfa8
@rcurtin rcurtin Merge branch 'LSHTableAccess' of https://github.com/mentekid/mlpack into mentekid-LSHTableAccess e0b6ce7
@rcurtin rcurtin Remove BuildHash(); simpler cube default. b30e697
@rcurtin rcurtin Minor code cleanups and style fixes. d3e3c54
@rcurtin rcurtin We can't remove Projection() because we'd break our versioning principles.

So we'll have to wait until mlpack 2.0.1 to remove it... :(
8ed22ce
@rcurtin rcurtin Update HISTORY.
e3a23c2
@stereomatchingkiss stereomatchingkiss Merge pull request #650 from keonkim/master
add cli executable for data_split
181792d
@rcurtin rcurtin Update some documentation, change --input_labels to --input_labels_file.
6a94eb6
@rcurtin rcurtin Merge pull request #668 from MarcosPividori/knn-bounds
Add B_aux according to issue #642
281555a
@rcurtin rcurtin Marcus thinks this will fix the Windows build... let's find out.
dde781f
@rcurtin rcurtin Merge pull request #670 from rcurtin/master
Marcus thinks this will fix the Windows build... let's find out.
2629a4f
@rcurtin rcurtin Merge pull request #660 from stereomatchingkiss/fix_mapping_issue
Fix mapping issue
ee95e20
@rcurtin rcurtin Merge pull request #661 from rcurtin/keontests
More tests for DatasetInfo
a2440f1
@rcurtin rcurtin Refactor for faster assembly of secondHashTable.
1bc61db
@rcurtin
Member
rcurtin commented Jun 4, 2016

I had fun rewriting it too, just ran out of time for now to test it. :)

@mentekid
Contributor
mentekid commented Jun 4, 2016

Essentially you reshape the secondHashTable from secondHashSize x bucketSize to secondHashSize x maxBucketSize if I understand correctly.

I'll run some tests to see if that improves search time.

I think there's still a drawback in this though - if only a few hash codes are present then we'll have few buckets, all of them full. That way we discard a lot of points due to capacity and still allocate the same size since maxBucketSize = bucketSize.

An idea would be to change secondHashTable from arma::Mat to arma::SpMat. The problem with this is that we currently denote an empty bucket position by setting it to N, not 0, because 0 corresponds to point 0 in the reference set. So if we switch to SpMat we will need to change that notation everywhere, which will probably cause compatibility issues when reading saved models.
Another drawback is that we still set a maximum number of points that secondHashTable stores per bucket, but that number can be set arbitrarily high because we're more efficient memory-wise.
A third consideration is that Armadillo uses a compressed sparse column (not row) representation, so we'll need to transpose anything that has to do with secondHashTable in order to be more efficient.
I started trying to do this but spent 2 hours on it and I'm not done yet. It seemed like a quick and dirty trick, but it's quite complicated :(
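To make the notation clash concrete, here is a toy example (my own illustration, not mlpack code; it just shows why the current sentinel would have to change):

```cpp
#include <armadillo>
#include <iostream>

// Toy example: unstored sparse entries read back as 0, but 0 is also a valid
// reference-point index, so "empty slot" and "point 0" become
// indistinguishable unless every stored index is shifted (the dense table
// sidesteps this by marking empty slots with N instead).
int main()
{
  arma::sp_umat table(4, 4);
  table(2, 0) = 0; // Intended: this slot holds point 0... but writing 0 into
                   // a sparse matrix is the same as writing nothing at all.
  const arma::uword stored = table(2, 0);    // reads back 0
  const arma::uword untouched = table(3, 0); // also 0, though never written
  std::cout << stored << " " << untouched << std::endl; // prints "0 0"
  return 0;
}
```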

An alternative would be to have a C++ array of std::vectors, where each vector holds the contents (indices) of the corresponding bucket. This might require a little more refactoring, but since vectors use only the memory they require (plus whatever they need to amortize expansion), I think we will be better off than with SpMat.
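Roughly, I'm picturing something like this (a minimal sketch with made-up names, not tested against the rest of LSHSearch):

```cpp
#include <cstddef>
#include <vector>

// Sketch: one std::vector<size_t> per second-level hash bucket.  Each vector
// holds the indices of the reference points that hashed into that bucket, so
// buckets only use the memory they actually need and no sentinel values or
// fixed bucket width are required.
std::vector<std::vector<size_t>> BuildSecondHashTable(
    const std::vector<size_t>& secondHashCodes, // one code per reference point
    const size_t secondHashSize)
{
  std::vector<std::vector<size_t>> table(secondHashSize);
  for (size_t point = 0; point < secondHashCodes.size(); ++point)
    table[secondHashCodes[point]].push_back(point);
  return table;
}
```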

Let me know what you think

@mentekid
Contributor
mentekid commented Jun 4, 2016

Here's what I have in mind regarding an std::vector<size_t> array.

This requires more work, but let me know what you think.

If you run this with default parameters it will probably take a lot of time; that's because the old implementation discards a lot of points, while this way we don't discard any.

We could also implement an optional size limit for the vectors for backward compatibility: for example, set the default to actually be 500 points, but if you set bucketSize to 0 the vectors grow arbitrarily large.
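For example, the cap could look something like this (hypothetical helper, just to show the behavior I mean):

```cpp
#include <cstddef>
#include <vector>

// Sketch: bucketSize == 0 means "no limit"; otherwise a bucket stops
// accepting points once it already holds bucketSize of them.
void InsertIntoBucket(std::vector<size_t>& bucket,
                      const size_t point,
                      const size_t bucketSize,
                      const size_t numReferencePoints)
{
  const size_t cap = (bucketSize == 0) ? numReferencePoints : bucketSize;
  if (bucket.size() < cap)
    bucket.push_back(point);
}
```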


@rcurtin
Member
rcurtin commented Jun 5, 2016

I think there's still a drawback in this though - if only a few hash codes are present then we'll have few buckets, all of them full. That way we discard a lot of points due to capacity and still allocate the same size since maxBucketSize = bucketSize.

I agree, we can definitely end up with a case where we discard a lot of points. I wasn't trying to fix that situation here, because I'm not 100% sure what the best option is: keeping all of the points in every bucket can lead to very long runtimes (because of very large buckets, like you saw with the Corel data).

A third consideration is armadillo uses compressed sparse column (not row) representation, so we'll need to transpose anything that has to do with secondHashTable in order to be more efficient.

I actually think we should transpose how secondHashTable is stored; I think it is being accessed by rows. To me it would make the most sense to have the points in each bin stored in a column, not in a row like it is now.
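To illustrate the storage-order point (a sketch only, assuming the existing sentinel-of-N convention and one bucket per column):

```cpp
#include <armadillo>

// Sketch: Armadillo stores matrices column by column, so if each bucket lives
// in its own column, scanning one bucket walks a contiguous block of memory
// instead of striding across rows.
void ScanBucket(const arma::Mat<size_t>& secondHashTable,
                const size_t hashCode,  // which bucket (column) to scan
                const size_t numPoints) // sentinel marking empty slots
{
  for (size_t j = 0; j < secondHashTable.n_rows; ++j)
  {
    const size_t index = secondHashTable(j, hashCode);
    if (index == numPoints)
      break; // The rest of this column is empty.
    // ... collect 'index' as a candidate neighbor ...
  }
}
```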

Here's what I have in mind regarding an std::vector<size_t> array.

I took a look through the code and the benchmarks that you sent. I noticed that the recall was much higher in the std::vector<size_t> implementation, but the runtime tended to be much higher for the larger datasets (though it was lower for small datasets). This makes sense: in the existing approach, we are taking extra time to calculate the size of the minimal second hash table and then fill it, but in your approach we do not do that (there may be many empty std::vector<size_t>s).

It seems like our changes are actually a bit orthogonal: my idea with this PR was to take the existing approach (with the existing bucket size limitation enforced by bucketSize) and produce the same results without changing any behavior. But I think that your changes address a different problem: specifically, that there are some situations where we discard massive numbers of points that fall into a single bucket. If I have not misunderstood, the major change of your code is that bucketSize is ignored (and std::vector<size_t> is used to avoid allocating more space than necessary).

Do you think we should remove the bucketSize parameter? I think that it is helpful to keep un-tuned LSH from running for too long, and the user can always specify bucketSize to be equal to the size of the reference set. (We could change behavior so that if the user specifies bucketSize = 0 then we take bucketSize = referenceSet.n_cols, that would not be an issue.)

The other concern is that if we want to compress secondHashTable after we build it, the best way is to take one pass over the data and calculate the size (like in my changes here), then allocate the right amount of space. If we are doing that, then we can just use arma::Mat<size_t> instead of std::vector<size_t>* in order to allocate space only once, instead of possibly many times with std::vector, so it should be faster and also use less or the same amount of memory. What do you think?
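Roughly what I mean, as a sketch (my own names, not the actual diff in this PR; buckets are rows here, matching the current layout):

```cpp
#include <armadillo>

// Sketch: one pass to measure, one allocation of exactly the needed width,
// then a second pass fills the table.  No repeated reallocation as with
// std::vector.
arma::Mat<size_t> AllocateSecondHashTable(const arma::Col<size_t>& hashCodes,
                                          const size_t secondHashSize,
                                          const size_t bucketSize)
{
  // Count how many points fall into each bucket, capped at bucketSize.
  arma::Col<size_t> counts(secondHashSize, arma::fill::zeros);
  for (size_t i = 0; i < hashCodes.n_elem; ++i)
    if (counts[hashCodes[i]] < bucketSize)
      ++counts[hashCodes[i]];

  // Single allocation: the widest bucket defines the width; empty slots keep
  // the usual sentinel (numPoints, i.e. hashCodes.n_elem).
  arma::Mat<size_t> table(secondHashSize, counts.max());
  table.fill(hashCodes.n_elem);
  return table; // ... a second pass then writes the point indices in.
}
```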

@mentekid
Contributor
mentekid commented Jun 5, 2016

the major change of your code is that bucketSize is ignored (and std::vector<size_t> is used to avoid allocating more space than necessary).

Not exactly, though ignoring it is a side effect. The major change is that buckets can have variable lengths: one bucket can be very large without requiring all the other buckets to grow with it, as happened when specifying (for example) a bucketSize of 5000-6000. I don't think we can do something like that with a monolithic matrix, because it has to be rectangular.

Do you think we should remove the bucketSize parameter? I think that it is helpful to keep un-tuned LSH from running for too long, and the user can always specify bucketSize to be equal to the size of the reference set. (We could change behavior so that if the user specifies bucketSize = 0 then we take bucketSize = referenceSet.n_cols, that would not be an issue.)

I agree, we shouldn't take away the user's ability to set a maximum capacity for the buckets. My implementation ignores bucketSize because I just wanted to quickly demonstrate my idea; the final code should definitely work as you say: have some default bucketSize, and if the user specifies 0, simply store all points.

The other concern is that if we want to compress secondHashTable after we build it, the best way is to take one pass over the data and calculate the size (like in my changes here), then allocate the right amount of space. If we are doing that, then we can just use arma::Mat<size_t> instead of std::vector<size_t>* in order to allocate space only once, instead of possibly many times with std::vector, so it should be faster and also use less or the same amount of memory. What do you think?

I can't see your changes any more because there's something wrong with the commits (it looks like there are 100 commits and 74 files changed, but I can't find the changes you made in lsh_search_impl.hpp). It is true that allocating even an empty std::vector takes some memory, and some time, but I believe the memory we save (by not wasting it on empty buckets) makes up for it. Of course, if we can avoid both it would be cool, but I'm not sure how :/
Another concern with std::vectors is memory locality: if each vector is spread around in RAM there will be overhead. On the other hand, we don't access the buckets sequentially, so there's already some jumping around being done.

@rcurtin
Member
rcurtin commented Jun 5, 2016

Oh, right, I did not think about the fact that different buckets have different numbers of points in them! Now that I think about it, I do think that perhaps std::vector<size_t>* is the right way to go (or actually maybe std::vector<arma::Col<size_t>>).

I think that we can have the best of both worlds if we do it like this:

  • Use std::vector<arma::Col<size_t>> for representing secondHashTable (this also avoids manual memory management, which is good; I am pretty sure your code had a subtle bug where the user could initialize the LSHSearch object without training, but then the destructor would still try to delete the std::vector<size_t>* object which would cause a crash).
  • Before filling secondHashTable, calculate the size of each bin (the code I wrote does this), truncating the length to bucketSize. Then we can allocate exactly the correct size for each arma::Col<size_t> (and also allocate exactly the right number of arma::Col<size_t>s), and then fill them like your code does.
  • When the object is constructed, if bucketSize = 0, set bucketSize = referenceSet.n_cols.

What do you think: would this work? We would have to modify the serialization again, but I don't think we need to increment the version from 1 to 2, because we did not release mlpack with the serialization change we made before (the change from std::vector<arma::mat> to arma::cube). I was going to try to release mlpack 2.0.2 today, but if we are going to change serialization again I will wait on this; otherwise we will end up with more-complex-than-necessary legacy code to handle. :)
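As a rough sketch of the plan above (hypothetical helper names; for simplicity it keeps one, possibly empty, arma::Col<size_t> per hash code and leaves serialization aside):

```cpp
#include <armadillo>
#include <vector>

// Sketch of the plan: count each bucket's occupancy (truncated to
// bucketSize), allocate each arma::Col<size_t> at exactly that length, then
// fill.  The caller handles bucketSize == 0 by passing
// bucketSize = referenceSet.n_cols.
std::vector<arma::Col<size_t>> BuildSecondHashTable(
    const arma::Col<size_t>& hashCodes, // second-level code of each point
    const size_t secondHashSize,
    const size_t bucketSize)
{
  const size_t numPoints = hashCodes.n_elem;

  // Pass 1: occupancy per bucket, truncated to bucketSize.
  arma::Col<size_t> counts(secondHashSize, arma::fill::zeros);
  for (size_t i = 0; i < numPoints; ++i)
    if (counts[hashCodes[i]] < bucketSize)
      ++counts[hashCodes[i]];

  // Allocate each bucket at exactly its final size; empty buckets stay
  // zero-length, so no sentinel values are needed.
  std::vector<arma::Col<size_t>> table(secondHashSize);
  for (size_t b = 0; b < secondHashSize; ++b)
    table[b].set_size(counts[b]);

  // Pass 2: fill.
  arma::Col<size_t> filled(secondHashSize, arma::fill::zeros);
  for (size_t i = 0; i < numPoints; ++i)
  {
    const size_t code = hashCodes[i];
    if (filled[code] < counts[code])
      table[code][filled[code]++] = i;
  }

  return table;
}
```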

I can't see your changes any more because there's something wrong with the commits

Yes, there was a force push that reset the repository to the state it was in about 20 days ago, but I restored the current state earlier today. It seems like the PR interface has not been updated though, so it still shows way more commits.

@mentekid
Contributor
mentekid commented Jun 5, 2016

I am pretty sure your code had a subtle bug where the user could initialize the LSHSearch object without training, but then the destructor would still try to delete the std::vector<size_t>* object which would cause a crash
You're right, I didn't even think about that case.

I like this idea, I think it's quite an improvement over the current implementation. I'm still not clear on how we can first use secondHashSize as the hash size and then compress the empty buckets though. I'll look into your and Pari's code to see that.

I think I can have this working in one or two days. I have some final work to do on my thesis and might not be very active until Wednesday, but I'll find time to work on this. So if there isn't a very big hurry, wait for me and release once this is done. I'll do my best to deliver it as soon as possible :)

@rcurtin
Member
rcurtin commented Jun 5, 2016

Are you sure you have time to do that? If you think you can't, I can refactor what I've done since I'm pretty familiar with the code already. But if you do think you have time to do this, I will wait. :)

@mentekid
Contributor
mentekid commented Jun 5, 2016

Sure, if you have time I'm fine with that, I just didn't want to pile extra work on you when I could do it :)

Thanks!


rcurtin added some commits Jun 8, 2016
@rcurtin rcurtin Switch secondHashTable to vector<Col<size_t>>.
This should provide a good amount of speedup, and also save RAM.
bd65c87
@rcurtin rcurtin Fix serialization.
74cdbc8
@rcurtin
Member
rcurtin commented Jun 9, 2016

Okay, this is ready to go. It'll need to be merged manually but that is no problem...

Here are some benchmarks as measured by mlpack_lsh on both the old and the new version. The results are significant enough that I didn't bother with more than one trial or with tuning parameters.

corel.csv: hash building 0.474s -> 0.293s, computing neighbors 28.057s -> 5.832s
isolet.csv: hash building 0.614s -> 0.517s, computing neighbors 0.776s -> 0.565s
covertype.csv: hash building 5.173s -> 6.215s, computing neighbors 1739.332s -> 1172.641s
phy.csv: hash building 2.074s -> 1.823s, computing neighbors 167.203s -> 51.062s
pokerhand.csv: hash building 0.329s -> 0.212s, computing neighbors 1.101s -> 0.737s
miniboone.csv: hash building 1.311s -> 1.429s, computing neighbors 51.144s -> 25.309s

@rcurtin rcurtin merged commit 9c28c08 into mlpack:master Jun 14, 2016

1 check failed

continuous-integration/appveyor/pr AppVeyor was unable to build non-mergeable pull request
@mentekid mentekid added a commit to mentekid/mlpack that referenced this pull request Jun 23, 2016
@mentekid mentekid Merged changes from #690 and #675 840efe9