How to compare hashes? #2
@hafidmermouri thanks. I won't add a method to compare hashes to this repo, because the repo is about the algorithmic implementation. The approach you should follow to compare the hashes and decide/store the duplicate documents depends on your use case. However, given that your corpus is small, you can do something like this:

```python
from simhash import simhash

def split_hash(s, num):
    return [s[start:start + num] for start in range(0, len(s), num)]

# assuming you have a dictionary with the document id as the key
# and the document as the value:
# documents = {doc_id: doc}
hashes = {}
for doc_id, doc in documents.items():
    doc_hash = simhash(doc)
    # you can either use the whole hash for higher precision
    # or split it into chunks for higher recall
    hash_chunks = split_hash(doc_hash, 4)
    for chunk in hash_chunks:
        if chunk not in hashes:
            hashes[chunk] = []
        hashes[chunk].append(doc_id)

# now you can print the duplicate documents:
for chunk, doc_list in hashes.items():
    if len(doc_list) > 1:
        print("Duplicate documents:", doc_list)
```
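If you want to confirm the candidates that land in the same bucket, a common follow-up (not part of this repo) is a pairwise Hamming-distance check over the full hashes. A minimal sketch, assuming the hashes are hex strings; the `bucket` contents and the threshold of 3 bits are illustrative choices, not values from the repo:

```python
from itertools import combinations

def hamming_distance(h1, h2):
    # number of differing bits between two same-length hex strings
    return bin(int(h1, 16) ^ int(h2, 16)).count("1")

# hypothetical bucket produced by the chunking loop above
bucket = {"doc1": "1f2a3b4c", "doc2": "1f2a3b4d", "doc3": "ffffffff"}

near_duplicates = [
    (a, b)
    for a, b in combinations(bucket, 2)
    if hamming_distance(bucket[a], bucket[b]) <= 3  # threshold is a tuning choice
]
# near_duplicates -> [("doc1", "doc2")]
```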
@memosstilvi thanks for your quick response. I was implementing a similar approach, but yes, precision is the main issue. I'm looking at another approach which consists of chunking the hash into 4 (or more) parts (say ABCD), then permuting them so I can get a list of possible duplicate hashes for a single document (for example: ABCD (the current one), ACDB, ADBC, BCAD, BDAC and CDAB). Then I can just look in my DB to find duplicates. What do you think?
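The permuted-key idea above can be sketched as follows. This is a hypothetical illustration using cyclic rotations of the chunks (so each chunk leads once), not the exact set of orderings listed in the comment; `split_hash` is the helper from the earlier answer:

```python
def split_hash(s, num):
    return [s[start:start + num] for start in range(0, len(s), num)]

def rotations(chunks):
    # every cyclic rotation of the chunk list, so each chunk leads once
    return ["".join(chunks[i:] + chunks[:i]) for i in range(len(chunks))]

chunks = split_hash("AABBCCDD", 2)
keys = rotations(chunks)
# keys -> ['AABBCCDD', 'BBCCDDAA', 'CCDDAABB', 'DDAABBCC']
```

Each key in `keys` would then be stored (or looked up) in the DB, so two documents sharing any leading chunk collide on at least one key.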
I don't think this will give you anything beyond just comparing the different chunks of the hashes. If your goal is high precision, I would use the whole hash or split it into 2 chunks. I suggest you experiment with a training set of documents and tailor the algorithm to your goal based on the results you get on that set. Then test against another set of documents that you already know to be duplicates or not, but that was not part of the training set. This will help you avoid overfitting.
I consider the issue closed since the original question has been addressed. |
Hello,
Nice work! Thanks for sharing.
Can you add a method to compare a list of hashes? I mean, I have a collection of 5000 documents and I used this code to calculate the simhash for each one; now I want to detect duplicates.
Thanks for your help.