Refactor for faster assembly of secondHashTable. #675

Merged
merged 103 commits into mlpack:master on Jun 14, 2016

Conversation

@rcurtin
Member

rcurtin commented Jun 4, 2016

@mentekid: a first attempt, not 100% tested. Let me know what you think. I should have time tonight to do further testing and comparison, but it appears to be good so far.

rcurtin and others added some commits May 16, 2016

Add the VRClassRewardLayer class, which implements the REINFORCE algorithm for classification models. To be precise, this is a Variance Reduced classification reinforcement learning rule.
Add the GlimpseLayer class, which takes an input image and a location to extract a retina-like representation of the input image.
Merge pull request #641 from MarcosPividori/master
Properly use Enum type.
Merge pull request #643 from dasayan05/master
Typo fix in knn_main.cpp
Remove duplicated code for traversal info.
 Instead of including: methods/neighbor_search/ns_traversal_info.hpp
 Include the definition in: core/tree/traversal_info.hpp
Merge pull request #645 from MarcosPividori/master
Properly use Enum type, in rann and range_search.
Merge pull request #646 from MarcosPividori/traversal-info
Remove duplicated code for traversal info.
Merge pull request #648 from dasayan05/PR1
Deprecated arma function replaced by new arma constant

keon and others added some commits Jun 2, 2016

We can't remove Projection() because we'd break our versioning principles.

So we'll have to wait until mlpack 2.0.1 to remove it... :(
Merge pull request #650 from keonkim/master
add cli executable for data_split
Merge pull request #668 from MarcosPividori/knn-bounds
Add B_aux according to issue #642
Merge pull request #670 from rcurtin/master
Marcus thinks this will fix the Windows build... let's find out.
Merge pull request #661 from rcurtin/keontests
More tests for DatasetInfo
@rcurtin
Member

rcurtin commented Jun 4, 2016

I had fun rewriting it too, just ran out of time for now to test it. :)

@mentekid
Contributor

mentekid commented Jun 4, 2016

If I understand correctly, you essentially reshape secondHashTable from secondHashSize x bucketSize to secondHashSize x maxBucketSize.

I'll run some tests to see whether that improves search time.

I think there's still a drawback here, though: if only a few hash codes are present, we end up with few buckets, all of them full. In that case we discard a lot of points due to the capacity limit and still allocate the same amount of memory, since maxBucketSize = bucketSize.

One idea would be to change secondHashTable from arma::Mat to arma::SpMat. The problem is that we currently mark an empty bucket position by setting it to N, not 0, because 0 corresponds to point 0 in the reference set.
So with SpMat we would need to change that convention everywhere, which would probably cause compatibility issues when reading saved models.
Another drawback is that we would still set a maximum number of points that secondHashTable stores per bucket, although that number could be set arbitrarily high, since we would be more memory-efficient.
A third consideration is that Armadillo uses compressed sparse column (not row) storage, so we would need to transpose everything that touches secondHashTable to stay efficient.
I started trying this, but I have already spent two hours and I'm not done yet. It seemed like a quick and dirty trick, but it's quite complicated :(

An alternative would be a C++ array of std::vectors, where each vector holds the contents (indices) of the corresponding bucket. This would need a little more refactoring, but since vectors use only the memory they require (plus whatever they reserve to amortize growth), I think we would be better off than with SpMat.

Let me know what you think.
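
A minimal sketch of the array-of-buckets idea (the class and member names are hypothetical, not the code from this PR; it assumes second-level hash codes are already computed):

```c++
#include <cstddef>
#include <vector>

// Ragged second-level hash table: one bucket per hash code, holding only the
// reference-point indices that actually landed there.  An empty vector is an
// empty bucket, so the "fill unused slots with N" convention goes away.
class BucketTable
{
 public:
  explicit BucketTable(const std::size_t secondHashSize)
      : buckets(secondHashSize) { }

  // Add reference point `pointIndex` to the bucket of `hashCode`.
  void Insert(const std::size_t hashCode, const std::size_t pointIndex)
  {
    buckets[hashCode % buckets.size()].push_back(pointIndex);
  }

  // The candidate set for a query that hashes to `hashCode`.
  const std::vector<std::size_t>& Candidates(const std::size_t hashCode) const
  {
    return buckets[hashCode % buckets.size()];
  }

 private:
  std::vector<std::vector<std::size_t>> buckets;
};
```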

@mentekid
Contributor

mentekid commented Jun 4, 2016

Here's what I have in mind regarding an std::vector<size_t> array.

This requires more work, but let me know what you think.

If you run this with default parameters it will probably take a lot of time; that's because the old implementation discards a lot of points, while this way we don't discard any.

We could also implement an optional size limit for the vectors for backward compatibility: for example, keep the default at 500 points, but if you set bucketSize to 0, let the vectors grow arbitrarily large.
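
A small sketch of that optional cap (hypothetical helper, assuming the vector-of-buckets layout sketched earlier; bucketSize == 0 means unlimited):

```c++
#include <cstddef>
#include <vector>

// Insert `pointIndex` into `bucket`, honoring an optional capacity limit.
// bucketSize == 0 means "no limit", so no point is ever discarded; any other
// value reproduces the old fixed-capacity behavior.
inline void InsertWithCap(std::vector<std::size_t>& bucket,
                          const std::size_t pointIndex,
                          const std::size_t bucketSize)
{
  if (bucketSize == 0 || bucket.size() < bucketSize)
    bucket.push_back(pointIndex);
  // Otherwise the bucket is full and the point is dropped.
}
```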


@rcurtin
Member

rcurtin commented Jun 5, 2016

I think there's still a drawback here, though: if only a few hash codes are present, we end up with few buckets, all of them full. In that case we discard a lot of points due to the capacity limit and still allocate the same amount of memory, since maxBucketSize = bucketSize.

I agree; we can definitely end up with a case where we discard a lot of points. I wasn't trying to fix that situation here, because I'm not 100% sure what the best option is: keeping all of the points in every bucket can lead to very long runtimes (because of very large buckets, like you saw with the Corel data).

A third consideration is that Armadillo uses compressed sparse column (not row) storage, so we would need to transpose everything that touches secondHashTable to stay efficient.

I actually think we should transpose how secondHashTable is stored; I think it is being accessed by rows. To me it would make the most sense to have the points in each bin stored in a column, not in a row like it is now.

Here's what I have in mind regarding an std::vector<size_t> array.

I took a look through the code and the benchmarks that you sent. I noticed that the recall was much higher in the std::vector<size_t> implementation, but the runtime tended to be much higher for the larger datasets (though it was lower for small datasets). This makes sense: in the existing approach, we are taking extra time to calculate the size of the minimal second hash table and then fill it, but in your approach we do not do that (there may be many empty std::vector<size_t>s).

It seems like our changes are actually a bit orthogonal: my idea with this PR was to take the existing approach (with the existing bucket size limitation enforced by bucketSize) and produce the same results without changing any behavior. But I think that your changes address a different problem: specifically, that there are some situations where we discard massive numbers of points that fall into a single bucket. If I have not misunderstood, the major change of your code is that bucketSize is ignored (and std::vector<size_t> is used to avoid allocating more space than necessary).

Do you think we should remove the bucketSize parameter? I think that it is helpful to keep un-tuned LSH from running for too long, and the user can always specify bucketSize to be equal to the size of the reference set. (We could change behavior so that if the user specifies bucketSize = 0 then we take bucketSize = referenceSet.n_cols, that would not be an issue.)

The other concern is that if we want to compress secondHashTable after we build it, the best way is to take one pass over the data and calculate the size (like in my changes here), then allocate the right amount of space. If we are doing that, then we can just use arma::Mat<size_t> instead of std::vector<size_t>* in order to allocate space only once, instead of possibly many times with std::vector, so it should be faster and also use less or the same amount of memory. What do you think?
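
A rough sketch of that count-then-allocate idea (hypothetical function and parameter names, assuming a flat vector of per-point second-level hash codes; not the code in this PR):

```c++
#include <armadillo>

// Build a rectangular secondHashTable in two passes: first count how many
// points fall into each bucket (capped at bucketSize), then allocate the
// table once with maxBucketSize columns and fill it.
void BuildRectangularSecondHash(const arma::Col<size_t>& pointHashCodes,
                                const size_t secondHashSize,
                                const size_t bucketSize,
                                arma::Mat<size_t>& secondHashTable,
                                arma::Col<size_t>& bucketContentSize)
{
  const size_t numPoints = pointHashCodes.n_elem;

  // Pass 1: count points per bucket, truncated to bucketSize.
  bucketContentSize.zeros(secondHashSize);
  for (size_t i = 0; i < numPoints; ++i)
  {
    const size_t code = pointHashCodes[i] % secondHashSize;
    if (bucketContentSize[code] < bucketSize)
      ++bucketContentSize[code];
  }
  const size_t maxBucketSize = bucketContentSize.max();

  // Allocate once; unused slots keep the sentinel value numPoints ("N"),
  // following the existing empty-slot convention.
  secondHashTable.set_size(secondHashSize, maxBucketSize);
  secondHashTable.fill(numPoints);

  // Pass 2: fill each bucket's row.
  arma::Col<size_t> cursor(secondHashSize, arma::fill::zeros);
  for (size_t i = 0; i < numPoints; ++i)
  {
    const size_t code = pointHashCodes[i] % secondHashSize;
    if (cursor[code] < bucketContentSize[code])
      secondHashTable(code, cursor[code]++) = i;
  }
}
```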

@mentekid
Contributor

mentekid commented Jun 5, 2016

the major change of your code is that bucketSize is ignored (and std::vector<size_t> is used to avoid allocating more space than necessary).

Not exactly, though ignoring it is a side effect. The major change is that buckets can have variable lengths: one bucket can be very large without forcing every other bucket to grow with it, as happened when specifying (for example) a bucketSize of 5000-6000. I don't think we can do that with a monolithic matrix, because it has to be rectangular.

Do you think we should remove the bucketSize parameter? I think that it is helpful to keep un-tuned LSH from running for too long, and the user can always specify bucketSize to be equal to the size of the reference set. (We could change behavior so that if the user specifies bucketSize = 0 then we take bucketSize = referenceSet.n_cols, that would not be an issue.)

I agree, we shouldn't take away the user's ability to set a maximum bucket capacity. My implementation ignores bucketSize because I just wanted to quickly demonstrate my idea; the final code should definitely work as you say: have some default bucketSize, and if the user specifies 0, simply store all points.

The other concern is that if we want to compress secondHashTable after we build it, the best way is to take one pass over the data and calculate the size (like in my changes here), then allocate the right amount of space. If we are doing that, then we can just use arma::Mat<size_t> instead of std::vector<size_t>* in order to allocate space only once, instead of possibly many times with std::vector, so it should be faster and also use less or the same amount of memory. What do you think?

I can't see your changes any more because there's something wrong with the commits (it looks like there are 100 commits and 74 files changed, but I can't find the changes you made in lsh_search_impl.hpp). It is true that allocating even an empty std::vector takes some memory and some time, but I believe the memory we save (by not wasting it on empty buckets) makes up for it. Of course it would be nice to avoid both, but I'm not sure how :/
Another concern with std::vectors is memory locality: if each vector's storage is spread around in RAM, there will be some overhead. On the other hand, we don't access the buckets sequentially anyway, so there is already some jumping around.
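
A back-of-the-envelope comparison of the two layouts (illustrative numbers only: bucketSize = 500 is the default discussed above, secondHashSize = 99901 is just an example value, and a 64-bit size_t and 24-byte std::vector header are assumed):

```c++
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
  const std::size_t secondHashSize = 99901, bucketSize = 500;

  // Rectangular table: every bucket pays for bucketSize slots, used or not.
  const double rectangularMB =
      secondHashSize * bucketSize * sizeof(std::size_t) / 1048576.0; // ~381 MB

  // Ragged table: a small fixed header per bucket (~2.3 MB total with a
  // 24-byte vector header), plus 8 bytes per point actually stored
  // (ignoring std::vector's growth slack).
  const double raggedHeaderMB =
      secondHashSize * sizeof(std::vector<std::size_t>) / 1048576.0;

  std::printf("rectangular: %.1f MB, ragged: %.1f MB + 8 B per stored point\n",
              rectangularMB, raggedHeaderMB);
  return 0;
}
```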

@rcurtin
Member

rcurtin commented Jun 5, 2016

Oh, right, I did not think about the fact that different buckets contain different numbers of points! With that in mind, I do think that perhaps std::vector<size_t>* is the right way to go (or actually maybe std::vector<arma::Col<size_t>>).

I think that we can have the best of both worlds if we do it like this:

  • Use std::vector<arma::Col<size_t>> to represent secondHashTable (this also avoids manual memory management, which is good; I am pretty sure your code had a subtle bug where the user could initialize the LSHSearch object without training, but then the destructor would still try to delete the std::vector<size_t>* object, which would cause a crash).
  • Before filling secondHashTable, calculate the sizes of each bin (the code I wrote does this), truncating the length to bucketSize. Then we can allocate the exact correct size for each arma::Col<size_t> (and also allocate exactly the right number of arma::Col<size_t>s), and then fill them like your code does.
  • When the object is constructed, if bucketSize = 0, set bucketSize = referenceSet.n_cols.

Do you think this would work? We would have to modify the serialization again, but I don't think we need to increment the version from 1 to 2, because we never released mlpack with the previous serialization change (from std::vector<arma::mat> to arma::cube). I was going to try to release mlpack 2.0.2 today, but if we are going to change serialization again I will wait on this; otherwise we will end up with more-complex-than-necessary legacy code to handle. :)
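
A rough sketch of how those three points could fit together (hypothetical names, assuming per-point second-level hash codes are available; for brevity it keeps one possibly-empty Col per hash code instead of also compressing out empty buckets):

```c++
#include <armadillo>
#include <vector>

// Ragged secondHashTable: one arma::Col<size_t> per hash code, each sized
// exactly to its (possibly truncated) point count.
void BuildRaggedSecondHash(const arma::Col<size_t>& pointHashCodes,
                           const size_t secondHashSize,
                           size_t bucketSize, // 0 means "no limit"
                           std::vector<arma::Col<size_t>>& secondHashTable)
{
  const size_t numPoints = pointHashCodes.n_elem;
  if (bucketSize == 0)
    bucketSize = numPoints;

  // Pass 1: count points per hash code, truncated to bucketSize.
  arma::Col<size_t> counts(secondHashSize, arma::fill::zeros);
  for (size_t i = 0; i < numPoints; ++i)
  {
    const size_t code = pointHashCodes[i] % secondHashSize;
    if (counts[code] < bucketSize)
      ++counts[code];
  }

  // Allocate each bucket exactly once, at exactly the right size.
  secondHashTable.resize(secondHashSize);
  for (size_t code = 0; code < secondHashSize; ++code)
    secondHashTable[code].set_size(counts[code]);

  // Pass 2: fill the buckets.
  arma::Col<size_t> next(secondHashSize, arma::fill::zeros);
  for (size_t i = 0; i < numPoints; ++i)
  {
    const size_t code = pointHashCodes[i] % secondHashSize;
    if (next[code] < counts[code])
      secondHashTable[code][next[code]++] = i;
  }
}
```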

I can't see your changes any more because there's something wrong with the commits

Yes, there was a force push that reset the repository to the state it was in about 20 days ago, but I restored the current state earlier today. It seems the PR interface has not been updated yet, though, so it still shows far more commits than it should.

@mentekid
Contributor

mentekid commented Jun 5, 2016

I am pretty sure your code had a subtle bug where the user could initialize the LSHSearch object without training, but then the destructor would still try to delete the std::vector<size_t>* object, which would cause a crash

You're right, I didn't even think about that case.

I like this idea; I think it's quite an improvement over the current implementation. I'm still not clear on how we can first use secondHashSize as the hash size and then compress away the empty buckets, though. I'll look into your and Pari's code to figure that out.

I think I can have this working in one or two days. I have some final things to do for my thesis and might not be very active until Wednesday, but I'll find time to work on this. So if there isn't a big hurry, wait for me and release once this is done; I'll do my best to deliver it as soon as possible. :)

@rcurtin
Member

rcurtin commented Jun 5, 2016

Are you sure you have time to do that? If you think you can't, I can refactor what I've done since I'm pretty familiar with the code already. But if you do think you have time to do this, I will wait. :)

@mentekid
Contributor

mentekid commented Jun 5, 2016

Sure, if you have time I'm fine with that; I just didn't want to pile extra work on you when I could do it :)

Thanks!


rcurtin added some commits Jun 8, 2016

Switch secondHashTable to vector<Col<size_t>>.
This should provide a good amount of speedup, and also save RAM.
@rcurtin
Member

rcurtin commented Jun 9, 2016

Okay, this is ready to go. It'll need to be merged manually but that is no problem...

Here are some benchmarks as measured by mlpack_lsh on both the old and the new version. The results are significant enough that I didn't bother with more than one trial or with tuning parameters.

corel.csv: hash building 0.474s -> 0.293s, computing neighbors 28.057s -> 5.832s
isolet.csv: hash building 0.614s -> 0.517s, computing neighbors 0.776s -> 0.565s
covertype.csv: hash building 5.173s -> 6.215s, computing neighbors 1739.332s -> 1172.641s
phy.csv: hash building 2.074s -> 1.823s, computing neighbors 167.203s -> 51.062s
pokerhand.csv: hash building 0.329s -> 0.212s, computing neighbors 1.101s -> 0.737s
miniboone.csv: hash building 1.311s -> 1.429s, computing neighbors 51.144s -> 25.309s

@rcurtin rcurtin merged commit 9c28c08 into mlpack:master Jun 14, 2016

1 check failed

continuous-integration/appveyor/pr AppVeyor was unable to build non-mergeable pull request

mentekid added a commit to mentekid/mlpack that referenced this pull request Jun 23, 2016
