add simhash #24
Thank you very much for the great work! In the smile.sort package (of SmileMath), we have a HeapSelect class. I guess that it does the same thing as your MaxHeap. I am not familiar with SimLSH. What test data did they use in the original paper? I totally agree that there is a lot of duplicated code between LSH and MPLSH. I was lazy and simply copied LSH to MPLSH and then modified it. Please feel free to refactor it.
In LSH.java, line 534:
Is it right that the copy starts from index 1 rather than 0?
That is the hit < k case. Note that we add k fake neighbors with 0 distance (lines 506-512). The purpose of that piece of code is to remove these fake neighbors. But why start at 1? I don't remember. It should probably be a dynamically determined number that filters out all fake neighbors. Maybe it is a bug. Thanks!
Line 506 puts n fake neighbors with distance Double.MAX; if it's a max heap, the real neighbors are far from the top.
I pushed a test that indicates there is a bug in HeapSelect.
Thank you very much for finding the bug! I am out of town without easy access to a computer. I will check it ASAP.
This is actually not a bug. Note that the peek method returns the k-th smallest value seen so far. k is 10 in your test case. The values you insert are 0.0, 0.1, 0.2, 0.3, 0.4, MAX_VALUE, MAX_VALUE, ... Of course, the peek method will return MAX_VALUE in this case.
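The behavior described above can be reproduced with a small stand-in. The sketch below is not Smile's HeapSelect; it uses `java.util.PriorityQueue` as a bounded max-heap, and the class and method names are hypothetical. It only illustrates why tracking the 10 smallest of five real values plus MAX_VALUE placeholders leaves MAX_VALUE at the top.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.PriorityQueue;

// Hypothetical stand-in for a k-smallest selector; not the smile.sort.HeapSelect source.
class KthSmallestSketch {
    /** Returns the k-th smallest value seen, keeping at most k values in a max-heap. */
    static double kthSmallest(int k, double[] values) {
        PriorityQueue<Double> heap = new PriorityQueue<>(k, Collections.reverseOrder());
        for (double v : values) {
            if (heap.size() < k) {
                heap.add(v);                 // heap not full yet: keep everything
            } else if (v < heap.peek()) {
                heap.poll();                 // evict the current k-th smallest
                heap.add(v);
            }
        }
        return heap.peek();                  // top of the max-heap is the k-th smallest
    }

    public static void main(String[] args) {
        // Five real distances followed by MAX_VALUE placeholders, as in the test case discussed.
        double[] values = new double[15];
        Arrays.fill(values, Double.MAX_VALUE);
        for (int i = 0; i < 5; i++) {
            values[i] = 0.1 * i;
        }
        // With k = 10 only five real values exist, so the k-th smallest is a placeholder.
        System.out.println(kthSmallest(10, values) == Double.MAX_VALUE);
        // With k = 5 the top of the heap is the largest real value.
        System.out.println(kthSmallest(5, values) == values[4]);
    }
}
```

With fewer than k real values, the top of the heap is always a placeholder, so a caller has to strip placeholders before reading results.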
Any updates?
The paper does not specify a data set. I decided to use MSRP for testing, and I am finishing that now.
Thanks for the update! I will debug line 534 ASAP. I have been very busy these days.
Line 534 is a bug. I fixed it. Thanks!
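For readers following the line 534 discussion: the intent described earlier in the thread (dropping placeholder neighbors when fewer than k real hits are found) can be sketched as below. This is not the fix that was committed, which is not shown in this conversation. The class and method names are hypothetical, and the sketch assumes placeholders carry a distance of `Double.MAX_VALUE`.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the change made to LSH.java.
class FakeNeighborFilterSketch {
    /** Keeps real distances in their original order and drops MAX_VALUE placeholders. */
    static double[] dropPlaceholders(double[] distances) {
        return Arrays.stream(distances)
                .filter(d -> d != Double.MAX_VALUE)
                .toArray();
    }

    public static void main(String[] args) {
        double[] padded = {Double.MAX_VALUE, 0.3, Double.MAX_VALUE, 0.1};
        // Only the two real distances remain.
        System.out.println(dropPlaceholders(padded).length);
    }
}
```

Filtering by value does not depend on where the placeholders sit in the array, which avoids hard-coding a start index.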
add nearest recall test
Hi Haifeng, recall test result:
Thanks! Are you sure you want to keep MaxHeap? BTW, I imported SmileNLP, which includes things such as a tokenizer. Do you still want to use the Google library?
I will try to switch to the existing library. I used the murmur hash from the Google library; do we need to implement it ourselves?
If you believe that murmur is very important for this class, I will implement it. It doesn't look complicated. None of the other algorithms depend on third-party libraries. Very few applications use LSH, but all of them would have to include this Google library in their distribution because of this dependency. That is not necessary. Thanks!
I imported Apache Cassandra's murmur implementation as smile.hash.MurmurHash. It is Apache licensed and thus okay.
 * @author Qiyang Zuo
 * @since 15/4/9
 */
public class Tokenizer {
We have a tokenizer in SmileNLP.
The smile-nlp module depends on smile-core, so I can't use nlp in core.
Can you assume that the input data is already tokenized? This also gives users the flexibility to use their own tokenizer (e.g. for a specific language).
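As an illustration of the pre-tokenized design suggested here, the following is a minimal simhash sketch that accepts a token list and leaves tokenization to the caller. It is not the code from this pull request: the class and method names are hypothetical, every token has equal weight, and 64-bit FNV-1a is used as a stand-in for the murmur hash discussed above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and the hash choice are assumptions, not the PR's code.
class SimHashSketch {
    /** 64-bit FNV-1a, used here only as a stand-in for murmur. */
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    /** Builds a 64-bit signature from tokens the caller has already produced. */
    static long simhash(List<String> tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                // Each token votes +1 for a set bit and -1 for a clear bit.
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long signature = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                signature |= 1L << bit;           // majority of tokens had this bit set
            }
        }
        return signature;
    }

    /** Number of differing bits between two signatures. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = simhash(Arrays.asList("the", "quick", "brown", "fox"));
        long b = simhash(Arrays.asList("the", "quick", "brown", "dog"));
        System.out.println(hammingDistance(a, b));
    }
}
```

Because `simhash` only sees a `List<String>`, a caller can plug in any tokenizer, including a language-specific one, without the core module depending on smile-nlp.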
@@ -0,0 +1,58 @@
/**
Will this be reused in other algorithms? If not, shall we move it into SNLSH?
Thanks for the refactoring. If you can remove MaxHeap, I will pull it in. Thanks!
Hi Haifeng,
Thanks! The MaxHeap file is still there although it is no longer in use. Shall we remove it?
Sorry, I forgot to push the change.
Thank you!
Hi Haifeng,
I added an LSH search based on simhash signatures. During the implementation, I found that there is a lot of repeated code and that some code that has already been implemented cannot be reused. I think we could extract the hash functions as separate components so that all LSH searches can share them. It was not easy when I tried, though, because even the Euclidean hash functions used by LSH and MPLSH are slightly different from each other. This refactoring will take time.
BTW, can you recommend any data sets for testing this commit?
commit summary: