add simhash #24

kid1412z · 2015-04-01T13:35:35Z

Hi Haifeng,
I added the LSH search based on simhash signatures. During the implementation I found a lot of repeated code, and some code that has already been implemented cannot be reused. I think we could extract the hash functions into shared components so that all LSH searches can use them. It was not easy when I tried, though, because even the Euclidean hash functions used by LSH and MPLSH differ slightly from each other. This refactoring will take time.
BTW, do you have any recommended data sets for testing this commit?

commit summary:

add simhash(simply use weights[1,1,1...1]) and the LSH search for signatures
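
For readers unfamiliar with simhash, here is a minimal sketch of a 64-bit signature with every token weighted 1, as the commit summary describes. It is not the code from this pull request: the class name `SimHashSketch`, the helper names, and the use of FNV-1a as the token hash are assumptions made for illustration (the thread later mentions murmur hash).

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SimHashSketch {
    /** FNV-1a 64-bit hash; a stand-in (assumption) for the token hash used in the PR. */
    static long fnv1a64(String token) {
        long h = 0xcbf29ce484222325L;           // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                // FNV prime
        }
        return h;
    }

    /** 64-bit simhash signature where every token has weight 1. */
    static long simhash(List<String> tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = fnv1a64(token);
            for (int i = 0; i < 64; i++) {
                // Each token votes +1 for a set bit and -1 for a clear bit.
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long signature = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) {
                signature |= (1L << i);
            }
        }
        return signature;
    }

    /** Hamming distance between two signatures. */
    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Similar token lists tend to produce signatures with a small Hamming distance, which is what an LSH index over signatures can exploit.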

haifengl · 2015-04-01T14:17:47Z

Thank you very much for the great work! In the smile.sort package (of SmileMath), we have a HeapSelect class. I guess that it does the same thing as your MaxHeap.

I am not familiar with SimLSH. What test data did they use in the original paper?

I totally agree that there is a lot of duplicated code between LSH and MPLSH. I was very lazy and simply copied LSH to MPLSH and then modified it. Please feel free to refactor it.

kid1412z · 2015-04-02T14:43:39Z

In LSH.java, line 534:

n2[i] = neighbors[i + 1];

Is it right to copy from index 1 rather than 0?

haifengl · 2015-04-02T15:46:35Z

That is the case of hit < k. Note that we add k fake neighbors with 0 distance (lines 506 - 512). The purpose of that piece of code is to remove these fake neighbors. But why start at 1? I don't remember. It should probably be a dynamically determined number so that all fake neighbors are filtered out. Maybe it is a bug. Thanks!

kid1412z · 2015-04-03T02:54:11Z

Line 506 puts n fake neighbors with distance Double.MAX_VALUE. If it is a max heap, the real neighbors are far from the top.

Neighbor<double[], E> neighbor = new Neighbor<double[], E>(null, null, 0, Double.MAX_VALUE);

I pushed a test that indicates there is a bug in HeapSelect.

haifengl · 2015-04-03T12:27:26Z

Thank you very much for finding the bug! I am out of town without easy access to a computer. I will check it ASAP.

haifengl · 2015-04-04T04:30:29Z

This is actually not a bug. Note that the peek method returns the k-th smallest value seen so far, and k is 10 in your test case. The values you insert are 0.0, 0.1, 0.2, 0.3, 0.4, MAX_VALUE, MAX_VALUE, ...

Of course, the peek method will return MAX_VALUE in this case.
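
The behaviour described above can be reproduced with a small sketch built on `java.util.PriorityQueue`. This is not Smile's HeapSelect; the class `KthSmallestSketch` and its methods are hypothetical and only assume the semantics stated in this comment (peek returns the k-th smallest value seen so far). The `realValues` helper assumes placeholders carry distance `Double.MAX_VALUE`, as in the snippet quoted earlier.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class KthSmallestSketch {
    private final int k;
    // Max-heap: the root is the largest of the (at most) k smallest values kept.
    private final PriorityQueue<Double> heap;

    KthSmallestSketch(int k) {
        this.k = k;
        this.heap = new PriorityQueue<Double>(k, Collections.<Double>reverseOrder());
    }

    void add(double value) {
        if (heap.size() < k) {
            heap.add(value);
        } else if (value < heap.peek()) {
            // The new value displaces the current k-th smallest.
            heap.poll();
            heap.add(value);
        }
    }

    /** The k-th smallest value seen so far; call only after at least one add. */
    double peek() {
        return heap.peek();
    }

    /** Kept values in ascending order with MAX_VALUE placeholders dropped. */
    List<Double> realValues() {
        List<Double> out = new ArrayList<Double>();
        for (double v : heap) {
            if (v != Double.MAX_VALUE) {
                out.add(v);
            }
        }
        Collections.sort(out);
        return out;
    }
}
```

With k = 10, five real distances, and five placeholders, all ten values fit in the heap, so the root (and therefore peek) is a placeholder. Filtering by value rather than by a fixed index is one way to drop however many placeholders remain.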

haifengl · 2015-04-09T13:57:29Z

Any updates?

kid1412z · 2015-04-09T14:09:35Z

The paper does not specify a data set. I decided to use MSRP for testing, and I am working on it now.

haifengl · 2015-04-09T14:29:57Z

Thanks for the update! I will debug line 534 ASAP. I have been very busy these days.

haifengl · 2015-04-10T14:01:14Z

Line 534 is a bug. I fixed it. Thanks!

add nearest recall test

kid1412z · 2015-04-13T12:31:15Z

Hi Haifeng,
I have completed the test but have not had time to do the refactoring.

Recall test results:

    SNLSH KNN recall is 0.6253140096618396
    SNLSH Nearest recall is 0.6492753623188405
    SNLSH range recall is 0.3713021310103753
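
The thread does not say how these recall figures were computed. A common definition, assumed here, is the fraction of the exact-search neighbors that the approximate search also returns. The class `RecallSketch` and its method are hypothetical, and returning 1.0 for an empty ground truth is just one possible convention.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecallSketch {
    /**
     * Recall of one query: |exact ∩ approximate| / |exact|.
     * Items are identified by integer ids (an assumption for this sketch).
     */
    static double recall(List<Integer> exact, List<Integer> approximate) {
        if (exact.isEmpty()) {
            return 1.0; // nothing to find, so nothing was missed
        }
        Set<Integer> found = new HashSet<Integer>(approximate);
        int hits = 0;
        for (Integer id : exact) {
            if (found.contains(id)) {
                hits++;
            }
        }
        return (double) hits / exact.size();
    }
}
```

Averaging this value over all test queries would give a single figure comparable in form to the numbers above.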

haifengl · 2015-04-13T13:41:33Z

Thanks! Are you sure you want to keep MaxHeap? BTW, I imported SmileNLP, which includes things such as a tokenizer. Do you still want to use the Google library?

kid1412z · 2015-04-14T01:48:39Z

I will try to switch to the existing library. I used the murmur hash from the Google library; do we need to implement it ourselves?

haifengl · 2015-04-14T02:17:29Z

If you believe that murmur is very important for this class, I will implement it. It doesn't look complicated. None of the other algorithms depend on third-party libraries. Very few applications use LSH, but they would have to include this Google library in their distribution because of this dependency. That is not necessary. Thanks!

haifengl · 2015-04-14T02:44:46Z

I imported Apache Cassandra's murmur implementation as smile.hash.MurmurHash. It is Apache licensed and thus okay.

haifengl · 2015-04-14T02:46:26Z

Smile/src/main/java/smile/util/Tokenizer.java

+ * @author Qiyang Zuo
+ * @since 15/4/9
+ */
+public class Tokenizer {


We have tokenizer in SmileNLP.

kid1412z · 2015-04-15T02:50:03Z

The smile-nlp module depends on smile-core, so I can't use nlp in core.

haifengl · 2015-04-15T13:29:29Z

Can you assume that the input data is already tokenized? This also gives users the flexibility to use their own tokenizer (e.g. for a specific language).

haifengl · 2015-04-16T14:03:06Z

Smile/src/main/java/smile/hash/SimHash.java

@@ -0,0 +1,58 @@
+/**


Will this be reused in other algorithms? If not, shall we move it into SNLSH?

haifengl · 2015-04-20T18:39:33Z

Thanks for the refactoring. If you can remove MaxHeap, I will pull it in. Thanks!

kid1412z · 2015-04-21T03:06:06Z

Hi Haifeng,
I completed the simhash lsh. Thanks.

haifengl · 2015-04-21T14:12:49Z

Thanks! The MaxHeap file is still there although it is not in use now. Shall we remove it?

kid1412z · 2015-04-21T14:14:46Z

Sorry, I forgot to push the change.

add simhash

haifengl · 2015-04-21T14:21:48Z

Thank you!

kid1412z added 2 commits April 1, 2015 20:39

add simhash

e4c0271

add range and test

cd3c7aa

merge from upstream

af9faf2

add a test of HeapSelect

f254602

remove test

17d01a1

kid1412z added 5 commits April 13, 2015 16:27

test knn recall -> 0.6

1d8a5fd

add nearest recall

923867a

add range recall

9a971ce

exclude identical

f9ec487

haifengl reviewed Apr 14, 2015

Merge branch 'master' of https://github.com/haifengl/smile

d76a501

remove google lib and tokenizer

8025978

haifengl reviewed Apr 16, 2015

kid1412z added 3 commits April 20, 2015 15:47

refactor add licence

d9b1480

recover .gitignore

ea28b40

refector

e00f30c

remove MaxHeap

70cb9f2

fix import

e8e899f

remove MaxHeap

e74b8ac

haifengl added a commit that referenced this pull request Apr 21, 2015

Merge pull request #24 from kid1412z/master

1c7b2c2

add simhash

haifengl merged commit 1c7b2c2 into haifengl:master Apr 21, 2015

haifengl added the new feature label Oct 18, 2017
