Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add simhash #24

Merged
merged 18 commits into from Apr 21, 2015
Merged

add simhash #24

merged 18 commits into from Apr 21, 2015

Conversation

kid1412z
Copy link
Contributor

@kid1412z kid1412z commented Apr 1, 2015

Hi Haifeng,
I add the LSH search based on signature of simhash, during the implementation, I found that there are a lot of repeated code and we cannot reuse some code already has been implemented. I think we may extract the Hash functions as devices so that all LSH search could share them. But it's not quite easy when I tried, because even the Euclidean hash functions used by LSH and MPLSH is slightly different from each other. It will take time to do this refactor.
BTW, do you have some recommended data sets to test this commit?

commit summary:

  • add simhash(simply use weights[1,1,1...1]) and the LSH search for signatures

@haifengl
Copy link
Owner

haifengl commented Apr 1, 2015

Thank you very much for the great work! In smile.sort (of SmileMath) package, we have a HeapSelect class. I guess that it does the same thing as your MaxHeap.

I am not familiar with SimLSH. What test data did they use in the original paper?

I totally agree that there are a lot of duplicated code between LSH and MPLSH. I was very lazy and simply copied LSH to MPLSH and then modify it. Please feel free to refactor it.

@kid1412z
Copy link
Contributor Author

kid1412z commented Apr 2, 2015

In LSH.java line:534

n2[i] = neighbors[i + 1];

Is it right that copy from index 1 not 0 ?

@haifengl
Copy link
Owner

haifengl commented Apr 2, 2015

That is the case of hit < k. Note that we add k fake neighbors with 0 distance (line 506 - 512). The purpose of that piece code to remove these fake neighbors. But why starting with 1? I don't remember. Probably should be a more dynamic determined number to filter out all fake neighbors. Maybe it is a bug. Thanks!

@kid1412z
Copy link
Contributor Author

kid1412z commented Apr 3, 2015

line 506 puts n fake neighbors with distance Double.MAX, if it's a max heap, the real neighbors are far from top.

Neighbor<double[], E> neighbor = new Neighbor<double[], E>(null, null, 0, Double.MAX_VALUE);

I pushed a test that indicate there's a bug exists in HeapSelect.

@haifengl
Copy link
Owner

haifengl commented Apr 3, 2015

Thank you very much for finding the bug! I am out of town without easy access to computer. I will check it asap.

@haifengl
Copy link
Owner

haifengl commented Apr 4, 2015

This is actually not a bug. Note that thee peek methods returns the k-th smallest value seen so far. k is 10 in your test case. The values you insert are 0.0, 0.1 0.2 0.3 0.4 MAX_VALUE MAX_VALUE ....

Of course, the peek method will return MAX_VALUE in this case.

@haifengl
Copy link
Owner

haifengl commented Apr 9, 2015

Any updates?

@kid1412z
Copy link
Contributor Author

kid1412z commented Apr 9, 2015

There's no specified data set in this paper, I decide to use MSRP for testing, and I am completing it.

@haifengl
Copy link
Owner

haifengl commented Apr 9, 2015

Thanks for updates! I will debug line 534 ASAP. So busy in these days.

@haifengl
Copy link
Owner

Line 534 is a bug. I fix it. Thanks!

# Please enter a commit message to explain why this merge is necessary,
# especially if it merges an updated upstream into a topic branch.
#
# Lines starting with '#' will be ignored, and an empty message aborts
# the commit.

add nearest recall test
@kid1412z
Copy link
Contributor Author

Hi Haifeng,
I have completed the test, but had no time to do the refactor.

recall test result:

    SNLSH KNN recall is 0.6253140096618396
    SNLSH Nearest recall is 0.6492753623188405
    SNLSH range recall is 0.3713021310103753

@haifengl
Copy link
Owner

Thanks! Are you sure to keep MaxHeap? BTW, I imported SmileNLP, which includes stuffs such as tokenizer. Do you still want to use Google library?

@kid1412z
Copy link
Contributor Author

I will try to change to the existing lib, I used the murmur hash in Google library, do we need to implement it ourselves?

@haifengl
Copy link
Owner

If you believe that murmur is very important for this class, I will implement it. It doesn't look complicated. All other algorithms don't depend on third party libraries. Very few applications use LSH but they will have include this google library in their distribution due to this dependency. Not a necessary. Thanks!

@haifengl
Copy link
Owner

I import Apache Cassandra's murmur implementation to smile.hash.MurmurHash. It is Apache licensed and thus okay.

* @author Qiyang Zuo
* @since 15/4/9
*/
public class Tokenizer {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have tokenizer in SmileNLP.

@kid1412z
Copy link
Contributor Author

module smile-nlp depends on smile-core, so I can't use nlp in core.

@haifengl
Copy link
Owner

Can you assume that the input data is already tokenized? This also give the user flexibility to use their own tokenizer (e.g. for some specific language).

@@ -0,0 +1,58 @@
/**
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be reused in other algorithms? If not, shall we move it into SNLSH?

@haifengl
Copy link
Owner

Thanks for refactors. If you can remove MaxHeap, I will pull in. Thanks!

@kid1412z
Copy link
Contributor Author

Hi Haifeng,
I completed the simhash lsh. Thanks.

@haifengl
Copy link
Owner

Thanks! The MaxHeap file is still there although it is not in use now. Shall we remove it?

@kid1412z
Copy link
Contributor Author

Sorry, I forgot to push the change.

haifengl added a commit that referenced this pull request Apr 21, 2015
@haifengl haifengl merged commit 1c7b2c2 into haifengl:master Apr 21, 2015
@haifengl
Copy link
Owner

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants