Refactor for faster assembly of secondHashTable. #675
Essentially you reshape the secondHashTable from secondHashSize x bucketSize to secondHashSize x maxBucketSize if I understand correctly.
I'll run some tests to see if that improves search time.
I think there's still a drawback in this though: if only a few hash codes are present, then we'll have few buckets, all of them full. In that case we discard a lot of points due to capacity and still allocate the same size, since maxBucketSize = bucketSize.
One idea would be to change secondHashTable from an arma::Mat to an arma::SpMat. The problem with this is that we currently denote an empty bucket position by setting it to N rather than 0, because 0 corresponds to point 0 in the reference set.
An alternative would be to have a C++ array of std::vectors. Each vector holds the contents (indices) of the corresponding bucket. This might require a little more refactoring but since vectors use only the memory they require (plus whatever they need for amortization of expansion) I think we will be better off than using SpMat.
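In code, the kind of structure I mean would look roughly like the sketch below (made-up names, not actual lsh_search code):

```cpp
#include <cstddef>
#include <vector>

// Sketch only: one std::vector of point indices per bucket.  An empty bucket
// costs just an empty vector, and each bucket grows independently instead of
// every bucket sharing one fixed width.
std::vector<std::vector<size_t>> BuildBuckets(
    const std::vector<size_t>& secondHashValues, // second-level hash per point
    const size_t secondHashSize)
{
  std::vector<std::vector<size_t>> buckets(secondHashSize);
  for (size_t i = 0; i < secondHashValues.size(); ++i)
    buckets[secondHashValues[i] % secondHashSize].push_back(i);
  return buckets;
}
```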
Let me know what you think.
Here's what I have in mind regarding an std::vector<size_t> array.
This requires more work, but let me know what you think.
If you run this with default parameters it will probably take a lot of time.
We could also implement an optional size limit for the vectors, for backward compatibility.
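That limit would probably only need a check like the one below (illustrative names, assuming for the sake of the sketch that a bucketSize of 0 means "no limit"):

```cpp
#include <cstddef>
#include <vector>

// Sketch of an optional per-bucket capacity check; bucketSize == 0 would mean
// "store everything".  Names here are illustrative, not the actual code.
void InsertCapped(std::vector<size_t>& bucket,
                  const size_t pointIndex,
                  const size_t bucketSize)
{
  if (bucketSize == 0 || bucket.size() < bucketSize)
    bucket.push_back(pointIndex);
  // Otherwise the point is discarded, matching the old fixed-width behavior.
}
```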
I agree, we can definitely end up with a case where we discard a lot of points. I wasn't trying to fix that situation here, because I'm not 100% sure what the best option is: keeping all of the points in every bucket can lead to very long runtimes (because of very large buckets, like you saw with the Corel data).
I actually think we should transpose how secondHashTable is stored; right now it is being accessed by rows. Since Armadillo matrices are column-major, to me it would make the most sense to store the points of each bin in a column, not in a row like it is now.
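With column-major storage, scanning a bucket stored as a column touches contiguous memory, while scanning a row strides through it. Roughly (illustration only, not the actual code; it uses the N-as-empty-slot convention mentioned above):

```cpp
#include <armadillo>
#include <cstddef>

// Count the valid entries in one bucket, where an empty slot is marked with
// N, the size of the reference set.  Column-wise accesses are contiguous in
// Armadillo's column-major layout, so this scan is cache-friendly.
size_t ScanBucket(const arma::Mat<size_t>& secondHashTable,
                  const size_t bucket,
                  const size_t referenceSetSize)
{
  size_t candidates = 0;
  for (size_t i = 0; i < secondHashTable.n_rows; ++i)
  {
    if (secondHashTable(i, bucket) != referenceSetSize)
      ++candidates;
  }
  return candidates;
}
```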
I took a look through the code and the benchmarks that you sent. I noticed that the recall was much higher in the std::vector<size_t> implementation, but the runtime tended to be much higher for the larger datasets (though it was lower for small datasets). This makes sense: in the existing approach, we take extra time to calculate the size of the minimal second hash table and then fill it, but in your approach we do not do that (there may be many empty vectors, each of which still takes some time and memory to allocate).
It seems like our changes are actually a bit orthogonal: my idea with this PR was to take the existing approach (with the existing bucket size limitation enforced by bucketSize) and simply make assembling the second hash table faster.
Do you think we should remove the bucketSize parameter entirely?
The other concern is that if we want to compress the empty buckets out of secondHashTable, the bookkeeping becomes more complicated.
Not exactly, though ignoring it is a side effect. The major change is that we can have buckets of variable lengths: one bucket can be very large without requiring all other buckets to grow with it, as happened when specifying (for example) a bucketSize of 5000-6000. I don't think we can do something like that with a monolithic matrix, because it has to be rectangular.
I agree, we shouldn't remove the user's choice to set a maximum capacity for the buckets. My implementation ignores bucketSize because I just wanted to quickly demonstrate my idea; the final code should definitely work as you say: have some default bucketSize, and if the user specifies 0, simply store all points.
I can't see your changes any more because there's something wrong with the commits (it looks like there are 100 commits and 74 files changed, but I can't find the changes you made in lsh_search_impl.hpp). It is true that allocating even an empty std::vector takes some memory and some time, but I believe the memory we save (by not wasting it on empty buckets) makes up for it. Of course, if we can avoid both it would be cool, but I'm not sure how :/
Oh, right, I did not think about the fact that different buckets have different numbers of points in them! Now that I think of it, I do think that perhaps a variable-length representation makes more sense.
I think that we can have the best of both worlds if we do it like this:
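Roughly, the combined idea could look something like the sketch below (the names are made up for illustration, and the bucketSize cap is omitted): hash values still range over [0, secondHashSize), but only the non-empty hash values get a bucket, and a small lookup table maps each hash value to its compact bucket row.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch, not the final implementation: SIZE_MAX in bucketRow
// means "no bucket allocated yet for this hash value".
class CompressedSecondHash
{
 public:
  explicit CompressedSecondHash(const size_t secondHashSize) :
      bucketRow(secondHashSize, SIZE_MAX) { }

  // Add a reference point index to the bucket for this hash value, creating
  // the bucket on first use.
  void Insert(const size_t hashValue, const size_t pointIndex)
  {
    size_t& row = bucketRow[hashValue];
    if (row == SIZE_MAX)
    {
      row = buckets.size();
      buckets.emplace_back();
    }
    buckets[row].push_back(pointIndex);
  }

  // Candidate points for a query that hashed to hashValue; nullptr if empty.
  const std::vector<size_t>* Candidates(const size_t hashValue) const
  {
    return (bucketRow[hashValue] == SIZE_MAX) ? nullptr
                                              : &buckets[bucketRow[hashValue]];
  }

 private:
  std::vector<size_t> bucketRow;             // hash value -> compact row.
  std::vector<std::vector<size_t>> buckets;  // only non-empty buckets stored.
};
```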
What do you think, would this work? We would have to modify the serialization again, but I don't think we need to increment the version from 1 to 2, because we did not release mlpack with the earlier serialization change.
Yes, there was a force push to the repository to the state it was in about 20 days ago, but I restored the current state earlier today. It seems like the PR interface has not been updated though, so it still shows way more commits.
I like this idea; I think it's quite an improvement over the current implementation. I'm still not clear on how we can first use secondHashSize as the hash size and then compress the empty buckets, though. I'll look into your and Pari's code to figure that out.
I think I can have this working in one or two days. I have some final thesis work to do and might not be very active until Wednesday, but I'll find time to work on this. So if there isn't a very big hurry, wait for me and release once this is done. I'll do my best to deliver it as soon as possible :)
Sure, if you have time I'm fine with that; I just didn't want to pile extra work on you.
Okay, this is ready to go. It'll need to be merged manually but that is no problem...
Here are some benchmarks as measured by mlpack_lsh on both the old and the new version. The results are significant enough that I didn't bother with more than one trial or with tuning parameters.
corel.csv: hash building: 0.474s -> 0.293s, computing neighbors 28.057s -> 5.832s