feat(documentarray): heap based top k sorting in DocumentArray (#3467) #3570

Merged (2 commits) on Oct 6, 2021

@gmastrapas gmastrapas commented Oct 4, 2021

Issue #3467

Enable more efficient sorting in DocumentArray by exposing an optional top_k argument in the sort method. Specifying an integer top_k guarantees that only the first top_k elements are sorted, rather than sorting the entire document list. Uses heapq.heapify and heapq.heappop (and their max-heap counterparts) from the stdlib.
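
For readers unfamiliar with the approach, the idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `sort_top_k` is a hypothetical helper, it works on a plain list rather than a DocumentArray, it assumes numeric keys (so that `reverse` can be emulated by negating the key), and it relies only on the public `heapq.heapify` and `heapq.heappop` functions.

```python
import heapq


def sort_top_k(items, top_k, key, reverse=False):
    """Return a new list whose first `top_k` entries are in sorted order.

    Hypothetical helper for illustration; assumes `key` returns numbers.
    """
    sign = -1 if reverse else 1
    # The index breaks ties, so the items themselves are never compared.
    heap = [(sign * key(item), index, item) for index, item in enumerate(items)]
    heapq.heapify(heap)  # O(n)
    k = max(0, min(top_k, len(heap)))
    head = [heapq.heappop(heap)[2] for _ in range(k)]  # O(k log n)
    # Whatever is left stays in heap order, i.e. not fully sorted.
    tail = [entry[2] for entry in heap]
    return head + tail


scores = [5, 1, 9, 3, 7]
print(sort_top_k(scores, 2, key=lambda s: s))                # starts with 1, 3
print(sort_top_k(scores, 2, key=lambda s: s, reverse=True))  # starts with 9, 7
```

Building the heap costs O(n) and each pop costs O(log n), so for a small top_k this does less work than a full O(n log n) sort.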

@gmastrapas gmastrapas requested a review from a team as a code owner October 4, 2021 10:49
@gmastrapas gmastrapas changed the title feat(documentarray): heap based top k sorting in DocumentArray feat(documentarray): heap based top k sorting in DocumentArray (#3467) Oct 4, 2021

@JoanFM JoanFM left a comment

Hey @gmastrapas,

Thank you very much for the contribution. I have some minor comments.

else:
    self._pb_body.sort(*args, **kwargs)
if _key is not None:
    self._pb_body = [(_key(element), element) for element in self._pb_body]
Member

I would prefer not to change self._pb_body. I see that you do it to avoid having the if/else at the other level, but all the magic done later seems more confusing. I would separate quite clearly what happens to the actual _pb_body from the values that are put in the heap.

Member

Also, if an exception is raised after this point, the DocumentArray will be left in an invalid state.
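
One way to address both concerns is to build the heap in a local variable and assign to the backing list only once, at the end. The sketch below is an illustration under assumptions, not the code that was merged: `TinyArray`, `_items`, and `flaky_key` are hypothetical stand-ins for DocumentArray, `_pb_body`, and a key function that can fail.

```python
import heapq


class TinyArray:
    """Hypothetical stand-in for a list-backed container (not DocumentArray)."""

    def __init__(self, items):
        self._items = list(items)

    def sort_top_k(self, key, top_k):
        # Build the heap in a local variable; self._items is untouched so far.
        heap = [(key(item), index, item) for index, item in enumerate(self._items)]
        heapq.heapify(heap)
        k = max(0, min(top_k, len(heap)))
        head = [heapq.heappop(heap)[2] for _ in range(k)]
        tail = [entry[2] for entry in heap]
        # Single assignment at the very end: if anything above raised,
        # the container still holds its original items.
        self._items = head + tail


def flaky_key(value):
    """Hypothetical key function that fails for one particular item."""
    if value == 3:
        raise ValueError('cannot score this item')
    return value


arr = TinyArray([4, 3, 2, 1])
try:
    arr.sort_top_k(key=flaky_key, top_k=2)
except ValueError:
    pass
print(arr._items)  # still [4, 3, 2, 1]
```

Because the (key, index, item) tuples live only in the local heap, the container never holds wrapped values, even when the key function raises midway.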

codecov bot commented Oct 4, 2021

Codecov Report

Merging #3570 (d0e27f8) into master (4286a86) will increase coverage by 1.80%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3570      +/-   ##
==========================================
+ Coverage   86.90%   88.71%   +1.80%     
==========================================
  Files         152      154       +2     
  Lines       11305    11536     +231     
==========================================
+ Hits         9825    10234     +409     
+ Misses       1480     1302     -178     
Flag Coverage Δ
daemon 45.05% <14.70%> (-0.46%) ⬇️
jina 88.65% <100.00%> (+2.26%) ⬆️

Impacted Files Coverage Δ
jina/__init__.py 71.25% <100.00%> (ø)
jina/peapods/pods/k8s.py 60.92% <100.00%> (-13.44%) ⬇️
jina/types/arrays/document.py 85.88% <100.00%> (+0.67%) ⬆️
jina/types/arrays/memmap.py 95.09% <100.00%> (+0.06%) ⬆️
jina/peapods/pods/k8slib/kubernetes_tools.py 72.72% <0.00%> (-6.99%) ⬇️
jina/peapods/peas/__init__.py 85.89% <0.00%> (-4.49%) ⬇️
jina/peapods/runtimes/gateway/websocket/app.py 85.41% <0.00%> (-3.72%) ⬇️
jina/peapods/pods/compound.py 82.71% <0.00%> (-1.24%) ⬇️
jina/helper.py 82.76% <0.00%> (-0.36%) ⬇️
... and 32 more

assert da[0].id == '0'
assert da[0].scores['euclid'].value == 10

da.sort(top_k=3, key=lambda d: d.scores['euclid'].value, reverse=True)
Member

This test should validate that the elements up to top_k are kept in order, and that the rest are not necessarily kept in order.

It should also validate that the set of Documents is still valid.
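
A check along these lines could look like the sketch below. It is only an illustration: `check_top_k_sorted` is a hypothetical helper, plain dicts with `id` and `score` fields stand in for Documents, and the partially sorted list is written by hand rather than produced by `DocumentArray.sort`.

```python
from collections import Counter


def check_top_k_sorted(before, after, top_k, key, reverse=False):
    """Hypothetical test helper: verify a partial sort of `before` into `after`."""
    # 1. Nothing was lost or duplicated.
    assert Counter(d['id'] for d in before) == Counter(d['id'] for d in after)
    # 2. The first top_k keys match those of a full sort.
    expected = sorted((key(d) for d in before), reverse=reverse)[:top_k]
    assert [key(d) for d in after[:top_k]] == expected


docs = [{'id': str(i), 'score': s} for i, s in enumerate([10, 4, 7, 1, 9])]
# Hand-written result of a descending top-3 partial sort: 10, 9, 7 first,
# then the remaining documents in arbitrary order.
partially_sorted = [docs[0], docs[4], docs[2], docs[3], docs[1]]
check_top_k_sorted(docs, partially_sorted, top_k=3,
                   key=lambda d: d['score'], reverse=True)
```

The helper deliberately makes no assertion about the order of the elements after position top_k, since a partial sort gives no guarantee there.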

@gmastrapas gmastrapas force-pushed the feat-documentarray-topk-sort-3467 branch from 9252f28 to 82e9306 Compare October 4, 2021 13:54
@@ -332,6 +332,29 @@ def test_da_sort_by_document_interface_in_proto():
assert da[0].embedding.shape == (1,)


def test_da_sort_topk():
Member

What is the behavior when top_k is given but no key is?

Member Author
@gmastrapas gmastrapas Oct 4, 2021

Both with and without top_k, an exception is raised if key=None:

TypeError: '<' not supported between instances of 'DocumentProto' and 'DocumentProto'

Member

That was my fear. I think then we should not consider key as an Optional argument

Member Author

Sure, I can make it required. We'll break the API, but I think it raises the same exception in master, so we're good, correct?

Member

Yes, we are good, because key would previously have been accepted through **kwargs.
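
The exchange above can be reproduced with plain Python objects. In this sketch, `Doc` and `sort_docs` are hypothetical stand-ins (not DocumentProto or the merged `sort` signature): sorting objects that define no ordering fails without a key, and making `key` a required parameter does not affect callers that already pass it by keyword.

```python
class Doc:
    """Hypothetical stand-in for a type that defines no ordering."""

    def __init__(self, score):
        self.score = score


def sort_docs(docs, key, reverse=False):
    """`key` is required here; callers already passing key=... are unaffected."""
    docs.sort(key=key, reverse=reverse)


docs = [Doc(2), Doc(1)]

try:
    docs.sort()  # no key: Python has to compare Doc instances directly
except TypeError as exc:
    print(exc)  # '<' not supported between instances of 'Doc' and 'Doc'

sort_docs(docs, key=lambda d: d.score)
print([d.score for d in docs])  # [1, 2]
```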

gmastrapas commented Oct 6, 2021

Hey @JoanFM can you lend a hand with the failed tests? I'm not getting much from the logs

test-flow-with-init-container@2805[E]:Flow is aborted due to ['gateway'] can not be started.

JoanFM commented Oct 6, 2021

> Hey @JoanFM can you lend a hand with the failed tests? I'm not getting much from the logs
>
> test-flow-with-init-container@2805[E]:Flow is aborted due to ['gateway'] can not be started.

Hey @gmastrapas, do not worry: these tests are currently flaky and the failures are not caused by this PR, so they will not be considered.

@JoanFM JoanFM merged commit 31a6968 into jina-ai:master Oct 6, 2021
Successfully merging this pull request may close these issues.

DocumentArray sort should also get top_k to allow for more efficient sorting
2 participants