fix: content hash should not depend on chunks #1611

Merged
hanxiao merged 1 commit into master from fix-content-hash-not-dependent-on-chunks
Jan 6, 2021

Conversation

Contributor

@cristianmtr cristianmtr commented Jan 6, 2021

The added test was failing before the change to the hashing method.

@cristianmtr cristianmtr requested a review from a team as a code owner January 6, 2021 15:33
@jina-bot jina-bot added the size/S, area/core, area/testing, and component/type labels Jan 6, 2021
JoanFM previously requested changes Jan 6, 2021
jina/types/document/uid.py
@@ -45,6 +45,7 @@ def get_content_hash(doc: 'DocumentProto') -> str:
     doc_without_id = DocumentProto()
     doc_without_id.CopyFrom(doc)
     doc_without_id.id = ""
+    del doc_without_id.chunks[:]
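
To illustrate the effect of the added line, here is a minimal, self-contained sketch. It does not use Jina's `DocumentProto`: the `Doc` dataclass, its fields, and `content_hash` are hypothetical stand-ins, and the JSON-plus-BLAKE2b serialization is an assumption chosen for determinism, not the project's actual implementation.

```python
import copy
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a document message (not Jina's DocumentProto)."""
    id: str = ""
    text: str = ""
    chunks: List["Doc"] = field(default_factory=list)


def content_hash(doc: Doc) -> str:
    """Hash a copy of the document with its id and chunks cleared."""
    stripped = copy.deepcopy(doc)  # never mutate the caller's document
    stripped.id = ""               # the id must not influence the hash
    del stripped.chunks[:]         # mirrors the line added in this PR
    payload = json.dumps(asdict(stripped), sort_keys=True).encode("utf-8")
    return hashlib.blake2b(payload, digest_size=8).hexdigest()


plain = Doc(id="a", text="hello")
chunked = Doc(id="b", text="hello", chunks=[Doc(text="hel"), Doc(text="lo")])

# Same own content, different ids and chunks: the hashes should match.
same_hash = content_hash(plain) == content_hash(chunked)
```

Without the `del stripped.chunks[:]` line, the two documents above would serialize differently and hash differently, which is the kind of behavior the added test is described as catching.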
Member
This method should be overridden for MultiModalDocument, where the chunks' contents are considered for the id. It can be done later.

Member
I thought the id should be derived only from the content itself, not the chunks?

Contributor Author
I'll open an issue as a reminder. I'll also mark it as 'good first issue', maybe someone will pick it up before we get to it ;)
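
As a rough illustration of the override discussed in this thread, the sketch below shows one way a multi-modal document type could fold its chunks into the hash while the base type ignores them. `Doc`, `MultiModalDoc`, and `content_hash` are hypothetical names for this example only, not Jina classes or methods, and the hashing scheme is an assumption.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical base document: only its own text is hashed."""
    text: str = ""
    chunks: List["Doc"] = field(default_factory=list)

    def content_hash(self) -> str:
        return hashlib.blake2b(self.text.encode("utf-8"), digest_size=8).hexdigest()


@dataclass
class MultiModalDoc(Doc):
    """Hypothetical multi-modal document: chunk contents are part of the hash."""

    def content_hash(self) -> str:
        h = hashlib.blake2b(digest_size=8)
        h.update(self.text.encode("utf-8"))
        for chunk in self.chunks:  # chunk order matters in this sketch
            h.update(chunk.content_hash().encode("ascii"))
        return h.hexdigest()


image, caption = Doc(text="<image bytes>"), Doc(text="a cat")
plain_a = Doc(text="", chunks=[image])
plain_b = Doc(text="", chunks=[caption])
multi_a = MultiModalDoc(text="", chunks=[image])
multi_b = MultiModalDoc(text="", chunks=[caption])

base_ignores_chunks = plain_a.content_hash() == plain_b.content_hash()
override_uses_chunks = multi_a.content_hash() != multi_b.content_hash()
```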

github-actions bot commented Jan 6, 2021

Latency summary

Current PR yields:

  • 😶 index QPS at 1725, delta to last 3 avg.: +3%
  • 😶 query QPS at 32, delta to last 3 avg.: +0%

Breakdown

Version   Index QPS   Query QPS
current   1725        32
0.9.2     1665        32
0.9.1     1665        32
0.8.22    1649        32
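
The "delta to last 3 avg." figures can be reproduced from the table with a few lines of arithmetic. This is a sketch only: `qps_delta_percent` is a hypothetical helper, and truncating the percentage toward zero is an assumption inferred from the reported numbers, not taken from the latency-tracking tool.

```python
def qps_delta_percent(current, previous):
    """Percent change of `current` relative to the mean of `previous`."""
    baseline = sum(previous) / len(previous)
    # Assumption: the bot truncates rather than rounds (3.9... is shown as +3%).
    return int((current / baseline - 1) * 100)


index_delta = qps_delta_percent(1725, [1665, 1665, 1649])
query_delta = qps_delta_percent(32, [32, 32, 32])
```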

Backed by latency-tracking. Further commits will update this comment.

@cristianmtr
Contributor Author

An async test seems to be failing: tests/unit/flow/test_asyncflow.py::test_run_async_flow_other_task_concurrent. https://github.com/jina-ai/jina/pull/1611/checks?check_run_id=1657292065#step:6:2069

Doesn't seem to be related, does it?

codecov bot commented Jan 6, 2021

Codecov Report

Merging #1611 (6ecda52) into master (7200c65) will increase coverage by 0.93%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1611      +/-   ##
==========================================
+ Coverage   83.98%   84.92%   +0.93%     
==========================================
  Files         127      127              
  Lines        6652     6713      +61     
==========================================
+ Hits         5587     5701     +114     
+ Misses       1065     1012      -53     
Impacted Files                  Coverage Δ
jina/types/document/uid.py      81.81% <100.00%> (+0.56%) ⬆️
jina/logging/sse.py             91.42% <0.00%> (-0.76%) ⬇️
jina/logging/profile.py         69.84% <0.00%> (-0.56%) ⬇️
jina/executors/decorators.py    91.11% <0.00%> (-0.17%) ⬇️
jina/drivers/craft.py           100.00% <0.00%> (ø)
jina/types/ndarray/generic.py   100.00% <0.00%> (ø)
jina/drivers/encode.py          94.91% <0.00%> (+0.08%) ⬆️
jina/enums.py                   96.59% <0.00%> (+0.09%) ⬆️
jina/jaml/__init__.py           95.93% <0.00%> (+0.09%) ⬆️
jina/flow/base.py               86.58% <0.00%> (+0.10%) ⬆️
... and 13 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c68d2da...6ecda52.

@hanxiao hanxiao merged commit ccbd74b into master Jan 6, 2021
@hanxiao hanxiao deleted the fix-content-hash-not-dependent-on-chunks branch January 6, 2021 17:22
@JoanFM JoanFM left a comment

This content hash logic should be based on a mask, or a set of fields to be included (whitelisted), rather than a set of fields to be removed.

Otherwise it will become unmaintainable as the structure of the document grows.
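
A minimal sketch of the whitelist idea, assuming a document represented as a plain dictionary. `HASHED_FIELDS`, the field names, and `content_hash` are hypothetical and chosen only for illustration; they are not Jina's schema or API.

```python
import hashlib
import json
from typing import Any, Dict, Iterable

# Hypothetical whitelist: only these fields contribute to the hash.
HASHED_FIELDS = ("text", "mime_type")


def content_hash(doc: Dict[str, Any], fields: Iterable[str] = HASHED_FIELDS) -> str:
    """Hash only the whitelisted fields; everything else is ignored."""
    selected = {name: doc.get(name) for name in fields}
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.blake2b(payload, digest_size=8).hexdigest()


base = {"id": "1", "text": "hello", "mime_type": "text/plain"}
# New fields added to the document later do not change the hash.
grown = dict(base, id="2", chunks=[{"text": "hel"}], score=0.5)
unchanged = content_hash(base) == content_hash(grown)
```

With this shape, a field added to the document structure later is excluded by default, whereas the remove-list approach in the diff needs a new deletion line for each such field.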

Labels
size/S, area/core, area/testing, component/type

5 participants