
How to compare Hashes? #2

Closed
hafidmermouri opened this issue Apr 14, 2018 · 4 comments

Comments

@hafidmermouri

Hello,
Nice work! Thanks for sharing.
Could you add a method to compare a list of hashes? I have a collection of 5,000 documents and used this code to calculate the simhash of each one; now I want to detect duplicates.

Thanks for your help.

@memosstilvi
Owner

@hafidmermouri thanks.

I won't add a method to compare hashes to this repo, because the repo is about the algorithmic implementation. The approach you should follow to compare the hashes and decide/store the duplicate documents depends on your use case.

However, given that your corpus is small, you could follow an approach like this:

# assuming that you have a dictionary with the document id as the key
# and the document as the value, i.e. documents = { doc_id: doc }, you can do:

from simhash import simhash

def split_hash(hash_str, num):
    # split the hash string into chunks of `num` characters
    return [hash_str[start:start + num] for start in range(0, len(hash_str), num)]

hashes = {}
for doc_id, doc in documents.items():
    doc_hash = simhash(doc)

    # you can either use the whole hash for higher precision
    # or split it into chunks for higher recall
    hash_chunks = split_hash(doc_hash, 4)

    for chunk in hash_chunks:
        if chunk not in hashes:
            hashes[chunk] = []
        hashes[chunk].append(doc_id)

# now you can print the candidate duplicate documents:
for chunk, doc_list in hashes.items():
    if len(doc_list) > 1:
        print("Duplicate documents:", doc_list)

@hafidmermouri
Author

@memosstilvi thanks for your quick response.

I was implementing a similar approach, but yes, precision is the main issue.

I'm looking at another approach which consists of chunking the hash into 4 (or more) parts (say ABCD), and then permuting them so I get a list of possible duplicate hashes for a single document (for example: ABCD (the current one), ACDB, ADBC, BCAD, BDAC and CDAB). Then I can just look them up in my database to find duplicates, as in the sketch below.

What do you think?
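
For illustration, here is a minimal sketch of that chunk-permutation lookup, assuming the hash is available as a string whose length is a multiple of the chunk size; the orderings listed above are a subset of all chunk orderings, which itertools.permutations enumerates. The lookup step against the database is only indicated in a comment.

from itertools import permutations

def split_hash(hash_str, num):
    # split the hash string into chunks of `num` characters
    return [hash_str[start:start + num] for start in range(0, len(hash_str), num)]

def candidate_keys(hash_str, chunk_size=4):
    # split the hash into chunks, e.g. A, B, C, D
    chunks = split_hash(hash_str, chunk_size)
    # every reordering of the chunks becomes one lookup key
    # (ABCD, ACDB, ADBC, BCAD, ... are among them)
    return {"".join(p) for p in permutations(chunks)}

# each key in candidate_keys(some_hash) would then be looked up in the database
# to find previously stored documents that share one of the permuted hashes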

@memosstilvi
Owner

@hafidmermouri

I don't think this will give you anything more than just comparing the different chunks of the hashes.

If your goal is high precision, I would use the whole hash or split it into 2 chunks.

I suggest you play with a training set of documents and tailor the algorithm depending on your goal and the results you get on the training set.

Then test against another set of documents whose duplicate status you already know but which is not part of the training set. This will help you avoid overfitting.
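
As a rough illustration of that evaluation step, here is a minimal sketch assuming you have a set of known duplicate pairs for the test documents (ground truth) and the set of pairs your detector flagged; both inputs are hypothetical names, not part of this repo.

def precision_recall(predicted_pairs, true_pairs):
    # normalize pairs so (a, b) and (b, a) count as the same pair
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    truth = {tuple(sorted(p)) for p in true_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# e.g. precision_recall({("doc1", "doc2")}, {("doc2", "doc1"), ("doc3", "doc4")})
# returns (1.0, 0.5)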

@memosstilvi
Owner

I consider the issue closed since the original question has been addressed.
