-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: content hash should not depend on chunks #1611
Conversation
@@ -45,6 +45,7 @@ def get_content_hash(doc: 'DocumentProto') -> str: | |||
doc_without_id = DocumentProto() | |||
doc_without_id.CopyFrom(doc) | |||
doc_without_id.id = "" | |||
del doc_without_id.chunks[:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method should be overriden for MultiModalDocument where the chunks contents are considered for id. it can be done later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the id should be only from the content itself not the chunks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open an issue as a reminder. I'll also mark it as 'good first issue', maybe someone will pick it up before we get to it ;)
5e36b95
to
6ecda52
Compare
Latency summaryCurrent PR yields:
Breakdown
Backed by latency-tracking. Further commits will update this comment. |
Seems some async test is failing, Doesn't seem to be related, does it? |
Codecov Report
@@ Coverage Diff @@
## master #1611 +/- ##
==========================================
+ Coverage 83.98% 84.92% +0.93%
==========================================
Files 127 127
Lines 6652 6713 +61
==========================================
+ Hits 5587 5701 +114
+ Misses 1065 1012 -53
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This content hash logic should be based on a mask or a set of fields to be added (whitelisted) more than a set of fields to be removed.
It will be otherwise unmantainable as the structure of the document grows.
Test added was failing before the change introduced in the hashing method