
How to compare Hashes? #2

Closed
hafidmermouri opened this issue Apr 14, 2018 · 4 comments

Comments

@hafidmermouri

Hello,
Nice work! Thanks for sharing.
Could you add a method to compare a list of hashes? I have a collection of 5,000 documents and used this code to calculate the simhash of each one; now I want to detect duplicates.

Thanks for your help.

@memosstilvi
Owner

@hafidmermouri thanks.

I won't add a method to compare hashes to this repo, because the repo is about the algorithmic implementation. The approach you should follow to compare the hashes and decide/store the duplicate documents depends on your use case.

However, given that your corpus is small, you could follow an approach like this:

# assuming that you have a dictionary with the document id as the key
# and the document as the value, i.e. documents = { doc_id: doc }, you can do:

from simhash import simhash

def split_hash(hash_str, num):
    # split the hash string into chunks of `num` characters
    return [hash_str[start:start + num] for start in range(0, len(hash_str), num)]

hashes = {}
for doc_id, doc in documents.items():
    doc_hash = simhash(doc)

    # you can either use the whole hash for higher precision
    # or split it into chunks for higher recall
    hash_chunks = split_hash(doc_hash, 4)

    for chunk in hash_chunks:
        if chunk not in hashes:
            hashes[chunk] = []
        hashes[chunk].append(doc_id)

# now you can print the candidate duplicate documents:
for chunk, doc_list in hashes.items():
    if len(doc_list) > 1:
        print("Duplicate documents:", doc_list)

@hafidmermouri
Author

@memosstilvi thanks for your quick response.

I was implementing a similar approach, but yes, precision is the main issue.

I'm looking at another approach which consists of chunking the hash into 4 (or more) parts (say ABCD), and then permuting them so I get a list of possible duplicate hashes for a single document (for example: ABCD (the current one), ACDB, ADBC, BCAD, BDAC and CDAB). Then I can just look them up in my database to find duplicates, as in the sketch below.

What do you think?
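
For illustration, here is a minimal sketch of that chunk-permutation lookup, assuming the hash is available as a string whose length is a multiple of the chunk size; the orderings listed above are a subset of all chunk orderings, which itertools.permutations enumerates. The lookup step against the database is only indicated in a comment.

from itertools import permutations

def split_hash(hash_str, num):
    # split the hash string into chunks of `num` characters
    return [hash_str[start:start + num] for start in range(0, len(hash_str), num)]

def candidate_keys(hash_str, chunk_size=4):
    # split the hash into chunks, e.g. A, B, C, D
    chunks = split_hash(hash_str, chunk_size)
    # every reordering of the chunks becomes one lookup key
    # (ABCD, ACDB, ADBC, BCAD, ... are among them)
    return {"".join(p) for p in permutations(chunks)}

# each key in candidate_keys(some_hash) would then be looked up in the database
# to find previously stored documents that share one of the permuted hashes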

@memosstilvi
Owner

@hafidmermouri

I don't think this will give you anything more than just comparing the different chunks of the hashes.

If your goal is high precision, I would use the whole hash or split it into 2 chunks.

I suggest you play with a training set of documents and tailor the algorithm depending on your goal and the results you get on the training set.

Then test against another set of documents whose duplicate status you already know but which is not part of the training set. This will help you avoid overfitting.
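
As a rough illustration of that evaluation step, here is a minimal sketch assuming you have a set of known duplicate pairs for the test documents (ground truth) and the set of pairs your detector flagged; both inputs are hypothetical names, not part of this repo.

def precision_recall(predicted_pairs, true_pairs):
    # normalize pairs so (a, b) and (b, a) count as the same pair
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# e.g. precision_recall({("doc1", "doc2")}, {("doc2", "doc1"), ("doc3", "doc4")})
# returns (1.0, 0.5)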

@memosstilvi
Owner

I consider the issue closed since the original question has been addressed.
