test: sharding non existent #1719

florian-hoenicke · 2021-01-17T22:48:36Z

Needs to be fixed. We insert 201 documents but only get 101 back.

codecov · 2021-01-17T22:53:09Z

Codecov Report

Merging #1719 (5363d25) into master (75f9569) will increase coverage by 1.94%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1719      +/-   ##
==========================================
+ Coverage   83.32%   85.27%   +1.94%     
==========================================
  Files         134      134              
  Lines        6878     6884       +6     
==========================================
+ Hits         5731     5870     +139     
+ Misses       1147     1014     -133

Impacted Files	Coverage Δ
jina/drivers/convert.py	`100.00% <ø> (+3.70%)`	⬆️
jina/drivers/evaluate.py	`98.27% <ø> (+1.72%)`	⬆️
jina/drivers/predict.py	`89.65% <ø> (-0.18%)`	⬇️
jina/drivers/reduce.py	`100.00% <100.00%> (ø)`
jina/logging/profile.py	`70.40% <0.00%> (+3.20%)`	⬆️
jina/parsers/helloworld.py	`100.00% <0.00%> (+3.84%)`	⬆️
jina/helloworld/helper.py	`90.52% <0.00%> (+90.52%)`	⬆️
jina/helloworld/__init__.py	`100.00% <0.00%> (+100.00%)`	⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 75f9569...8f790f4.

github-actions · 2021-01-17T22:58:17Z

Latency summary

Current PR yields:

🐢🐢 index QPS at 1199, delta to last 3 avg.: -12%
😶 query QPS at 27, delta to last 3 avg.: +2%

Breakdown

Version	Index QPS	Query QPS
current	1199	27
`0.9.19`	1186	26
`0.9.18`	1565	25

JoanFM · 2021-01-17T23:02:03Z

This is expected as the KVSearchDriver removes the non-found documents.

florian-hoenicke · 2021-01-18T10:55:33Z

> This is expected as the KVSearchDriver removes the non-found documents.

In this case we index random_docs(0, 201) and query random_docs(0, 220) but just get 101 documents back. Expected would be to get 201 documents.

JoanFM · 2021-01-18T11:18:36Z

> > This is expected as the KVSearchDriver removes the non-found documents.
>
> In this case we index random_docs(0, 201) and query random_docs(0, 220) but just get 101 documents back. Expected would be to get 201 documents.

The expected result is that you get as many documents in the output as are found in the KVIndex. Do the 220 documents guarantee that 201 of them have an id that was previously indexed?

JoanFM · 2021-01-18T11:20:16Z

> > This is expected as the KVSearchDriver removes the non-found documents.
>
> In this case we index random_docs(0, 201) and query random_docs(0, 220) but just get 101 documents back. Expected would be to get 201 documents.

And if that is the case, you need uses_after to merge all the documents at root level, so that the Request is rebuilt.

What is happening in this test is that the first shard to finish returns its results, which is why you only get a portion. What you would maybe see is 2 responses.
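The behaviour described above can be illustrated with a small simulation in plain Python. This is only a sketch and does not use Jina: the routing rule `id % 2`, the dict-based shards and the function names are assumptions made for illustration. It shows that when each shard drops ids it does not hold, a single shard's response contains only part of the indexed documents, while merging the partial results recovers all 201.

```python
from typing import Dict, List

NUM_SHARDS = 2  # assumption: two shards, as in the test under discussion


def index(doc_ids: List[int]) -> List[Dict[int, str]]:
    """Distribute documents over shards (id % NUM_SHARDS is a hypothetical routing rule)."""
    shards: List[Dict[int, str]] = [{} for _ in range(NUM_SHARDS)]
    for doc_id in doc_ids:
        shards[doc_id % NUM_SHARDS][doc_id] = f"doc-{doc_id}"
    return shards


def kv_search(shard: Dict[int, str], query_ids: List[int]) -> List[int]:
    """Mimic the described behaviour: ids not found in this shard are dropped."""
    return [doc_id for doc_id in query_ids if doc_id in shard]


shards = index(list(range(201)))   # 201 indexed documents
query_ids = list(range(220))       # 220 queried ids, 201 of them were indexed

# Every shard answers the full query (polling all) but only knows its own ids.
partial_results = [kv_search(shard, query_ids) for shard in shards]

first_response = partial_results[0]  # what is seen if only one shard's result is returned
merged = sorted(doc_id for part in partial_results for doc_id in part)

print(len(first_response), len(merged))  # prints: 101 201
```

With this particular routing rule the first shard holds the 101 even ids, which happens to match the count reported in the thread; the real distribution across shards may differ.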

florian-hoenicke · 2021-01-18T12:28:28Z

> The expected result is that you get as many documents in the output as are found in the KVIndex. Do the 220 documents guarantee that 201 of them have an id that was previously indexed?

Yes

> And if that is the case, you need uses_after to merge all the documents at root level, so that the Request is rebuilt.

Good point. However, I tried it out with uses_after='_merge' as well, and it did not work.

JoanFM · 2021-01-18T12:32:17Z

> > The expected result is that you get as many documents in the output as are found in the KVIndex. Do the 220 documents guarantee that 201 of them have an id that was previously indexed?
>
> Yes
>
> > And if that is the case, you need uses_after to merge all the documents at root level, so that the Request is rebuilt.
>
> Good point. However, I tried it out with uses_after='_merge' as well, and it did not work.

What traversal paths are used there?

florian-hoenicke · 2021-01-18T21:38:12Z

As discussed with @JoanFM, supporting sharding for plain kv-indexers would be useful, but it still requires some conceptual thinking and might take some time to implement.
In this example we use polling all to retrieve results from two shards.
The problem is that the ReduceDriver does not seem to work on document level, only on matches.

JoanFM · 2021-01-18T21:39:15Z

> As discussed with @JoanFM, supporting sharding for plain kv-indexers would be useful, but it still requires some conceptual thinking and might take some time to implement.
> In this example we use polling all to retrieve results from two shards.
> The problem is that the ReduceDriver does not seem to work on document level, only on matches.

ReduceAllDriver ...

JoanFM · 2021-01-21T16:11:29Z

jina/drivers/__init__.py

+    def _apply_root(
+        self,
+        docs: 'DocumentSet',
+        field: str,


If we say context_doc is None, should we also hide field?

florian-hoenicke · 2021-01-22

field contains "docs".
The context_doc is set to None.

JoanFM · 2021-01-21T16:12:06Z

jina/drivers/reduce.py

+            docs.append(doc)
+        request = self.msg.request
+        request.body.ClearField(field)
+        request.docs.extend(docs)


so I guess you saw the problem I had?

florian-hoenicke · 2021-01-22

@JoanFM The problem was that the docs were stored in body.docs as well as in docs?
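The idea of reducing at request level, as in the diff hunk above (clear the field, then extend it with the collected docs), can be sketched in plain Python. This is not the code from jina/drivers/reduce.py: the `Doc` and `Request` classes are hypothetical stand-ins, and de-duplicating by id is an assumption that the thread does not confirm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a document."""
    id: str


@dataclass
class Request:
    """Hypothetical stand-in for a request carrying root-level docs."""
    docs: List[Doc] = field(default_factory=list)


def reduce_root_docs(target: Request, partial_requests: List[Request]) -> Request:
    """Rebuild target.docs from all partial requests, keeping each id once."""
    seen = set()
    merged: List[Doc] = []
    for partial in partial_requests:
        for doc in partial.docs:
            if doc.id not in seen:  # assumption: the same id may come back from several shards
                seen.add(doc.id)
                merged.append(doc)
    # Clear before extending so that stale docs do not remain next to the merged ones.
    target.docs.clear()
    target.docs.extend(merged)
    return target


shard_a = Request(docs=[Doc("0"), Doc("2")])
shard_b = Request(docs=[Doc("1"), Doc("2")])
result = reduce_root_docs(shard_b, [shard_a, shard_b])
print([d.id for d in result.docs])  # prints: ['0', '2', '1']
```

The merged list is built before the target is cleared, so the sketch also works when the target request is itself one of the partial requests.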

tests/integration/sharding/test_search_non_existent.py

cristianmtr

some quick clarifications

florian-hoenicke · 2021-01-22T12:48:42Z

jina/drivers/__init__.py

+    @property
+    def docs(self):
+        if self.expect_parts > 1:
+            return (d for r in reversed(self.partial_reqs) for d in r.docs)


@JoanFM do we reverse the requests in order to wait for the last one to arrive?

JoanFM · 2021-01-22

It was already like this, so to be honest I am not sure why it was there in the first place.
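On the question of what the reversal does: in plain Python, `reversed` only changes the order in which an existing list is iterated and does not wait for anything to arrive. The sketch below uses lists of strings as hypothetical stand-ins for the partial requests and their docs; it mirrors the shape of the generator in the diff but is not the Jina implementation.

```python
from typing import Iterator, List

# Hypothetical stand-in: docs of each partial request, in arrival order.
partial_reqs: List[List[str]] = [["a", "b"], ["c"], ["d", "e"]]


def docs_reversed(parts: List[List[str]]) -> Iterator[str]:
    """Same shape as the generator in the diff: last-arrived request first."""
    return (d for r in reversed(parts) for d in r)


def docs_in_order(parts: List[List[str]]) -> Iterator[str]:
    """Variant without the reversal, for comparison."""
    return (d for r in parts for d in r)


reversed_docs = list(docs_reversed(partial_reqs))
ordered_docs = list(docs_in_order(partial_reqs))

print(reversed_docs)  # prints: ['d', 'e', 'c', 'a', 'b']
print(ordered_docs)   # prints: ['a', 'b', 'c', 'd', 'e']
```

Both variants yield the same set of docs; only the order of the requests differs, while the order of docs inside each request is preserved.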

florian-hoenicke · 2021-01-22T13:28:23Z

@hanxiao could you review this?
@JoanFM, @maximilianwerk and I worked on this PR together, which means none of us can approve it.

jina-bot added size/S area/testing This issue/PR affects testing labels Jan 17, 2021

jina-bot added the area/core This issue/PR affects the core codebase label Jan 18, 2021

jina-bot added size/M component/driver and removed size/S labels Jan 21, 2021

maximilianwerk marked this pull request as ready for review January 21, 2021 15:57

maximilianwerk requested a review from a team as a code owner January 21, 2021 15:57

maximilianwerk requested review from maximilianwerk and JoanFM January 21, 2021 15:57

JoanFM reviewed Jan 21, 2021

florian-hoenicke and others added 9 commits January 21, 2021 17:22

test: sharding non existent

df0d457

test: sharding merge

02549f0

fix: proof of concept on how to reduce at request level

b40b054

refactor: promote apply_root

4b14f1d

refactor: fixed circular import

e2b81bf

fix: type error

e627712

fix: another linting error fixed

5830024

feat: cleaner interface

794c091

fix: name change due to deprecation

3227b64

maximilianwerk force-pushed the test-sharding-non-existent branch from 1b3c851 to 3227b64 Compare January 21, 2021 16:32

JoanFM previously approved these changes Jan 21, 2021

cristianmtr reviewed Jan 21, 2021

tests/integration/sharding/test_search_non_existent.py (outdated)

cristianmtr reviewed Jan 21, 2021

tests/integration/sharding/test_search_non_existent.py (outdated)

cristianmtr suggested changes Jan 21, 2021

test: sharding move merge root

48c049b

florian-hoenicke dismissed JoanFM’s stale review via 48c049b January 22, 2021 09:12

jina-bot added the component/resource label Jan 22, 2021

Merge branch 'master' of https://github.com/jina-ai/jina into test-sharding-non-existent

8f790f4

florian-hoenicke commented Jan 22, 2021

hanxiao approved these changes Jan 22, 2021

hanxiao merged commit 86853bf into master Jan 22, 2021

hanxiao deleted the test-sharding-non-existent branch January 22, 2021 13:39
