-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(documentarray): heap based top k sorting in DocumentArray (#3467) #3570
feat(documentarray): heap based top k sorting in DocumentArray (#3467) #3570
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @gmastrapas ,
Thank you very much for the contribution, I have some minor comments
jina/types/arrays/document.py
Outdated
else: | ||
self._pb_body.sort(*args, **kwargs) | ||
if _key is not None: | ||
self._pb_body = [(_key(element), element) for element in self._pb_body] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not to change self._pb_body
, I see that u do too to avoid having the if
else at the other level. But all the magic done later seems to be more confusing. I would separate quite clearly what happens to the actual _pb_body
and what are the values put in the heap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if some exception happens after here, the DocumentArray will stay in a very invalid state
Codecov Report
@@ Coverage Diff @@
## master #3570 +/- ##
==========================================
+ Coverage 86.90% 88.71% +1.80%
==========================================
Files 152 154 +2
Lines 11305 11536 +231
==========================================
+ Hits 9825 10234 +409
+ Misses 1480 1302 -178
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
assert da[0].id == '0' | ||
assert da[0].scores['euclid'].value == 10 | ||
|
||
da.sort(top_k=3, key=lambda d: d.scores['euclid'].value, reverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should validate the up to top_k things are kept, and the rest are not kept in order.
Also it should validate that the set of Documents is still valid
9252f28
to
82e9306
Compare
@@ -332,6 +332,29 @@ def test_da_sort_by_document_interface_in_proto(): | |||
assert da[0].embedding.shape == (1,) | |||
|
|||
|
|||
def test_da_sort_topk(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the behavior when top-k is present but no key is given?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both with and without top_k
, if key=None
an exception is raised
TypeError: '<' not supported between instances of 'DocumentProto' and 'DocumentProto'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my fear. I think then we should not consider key
as an Optional argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I can make it required. We 'll break the API but I think it raises the same Exception in master
so we 're good, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we are good because key
would be accepted as a *kwargs
so good
@@ -332,6 +332,29 @@ def test_da_sort_by_document_interface_in_proto(): | |||
assert da[0].embedding.shape == (1,) | |||
|
|||
|
|||
def test_da_sort_topk(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my fear. I think then we should not consider key
as an Optional argument
Hey @JoanFM can you lend a hand with the failed tests? I'm not getting much from the logs
|
Hey @gmastrapas , do not worry, these tests are currently flaky and not because of this PR, will not be considered |
Issue #3467
Enable more efficient sorting in
DocumentArray
by exposing an optionaltop_k
argument in methodsort
. Specifying atop_k
integer guarantees that only thetop_k
elements will be sorted rather that sorting the entire document list. Usesheapq.heapify
andheapq.heappop
(and the maxheap counterparts) from stdlib.