fix: optimize scores avoiding array creation #2163
Conversation
@@ -72,30 +72,35 @@ def _insert_query_matches(

    @staticmethod
    def _group_by(match_idx, col_name):
        # sort by ``col``
        """
        Create a list of numpy arrays with the same ``doc_id`` in each position of the list
I guess it depends on the col_name, right?
            r.append((match_id, score))
        return self._sort_doc_by_score(r)
        n_groups = len(_groups)
        res = np.empty((n_groups,), dtype=[('ids', 'U64'), ('scores', np.float64)])
Instead of the literal 'U64', use the Chunk2DocRanker variable to define the type.
'scores' is also a name found in the Chunk2DocRanker namespace.
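Following the suggestion above, the dtype string and the column name could come from class-level constants rather than inline literals. A minimal sketch, where `ID_DTYPE` and `COL_SCORE` are hypothetical stand-ins for the real Chunk2DocRanker attributes the reviewer refers to:

```python
import numpy as np

# Hypothetical constants standing in for the Chunk2DocRanker namespace
# names the reviewer mentions; the actual names live on that class.
ID_DTYPE = 'U64'      # dtype used for document ids
COL_SCORE = 'scores'  # column name for aggregated scores

n_groups = 3
# One (id, score) record per group, allocated once up front.
res = np.empty((n_groups,), dtype=[('ids', ID_DTYPE), (COL_SCORE, np.float64)])
res[0] = ('doc-1', 0.9)
```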
        for i, _g in enumerate(_groups):
            # res[i] = (match_id, score)
            res[i] = (_g['c0'][0], self.exec_fn(_g, query_chunk_meta, match_chunk_meta))
What is this 'c0'?
Latency summary
Current PR yields:
Breakdown
Backed by latency-tracking. Further commits will update this comment.
Codecov Report
@@ Coverage Diff @@
## master #2163 +/- ##
==========================================
+ Coverage 89.00% 90.46% +1.46%
==========================================
Files 211 211
Lines 11269 11278 +9
==========================================
+ Hits 10030 10203 +173
+ Misses 1239 1075 -164
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
        _, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True)
        # group by ``col``
        return np.split(_sorted_m, np.cumsum(_doc_counts))[:-1]
        list_numpy_arrays = []
Just curious, why do you think this is an optimization?
I think it's because _, _doc_counts = np.unique(_sorted_m[col_name], return_counts=True) will do the sorting again, so it is at least O(N log N)?
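To illustrate the point above: since `_sorted_m` is already sorted by the key column, the groups can be found in a single O(N) pass by cutting wherever the key changes, rather than letting `np.unique` sort again. A minimal sketch on dummy data (the `'c0'` field name follows the snippet under review):

```python
import numpy as np

# Dummy structured array already sorted by 'c0', mimicking _sorted_m.
m = np.array([(1, 0.1), (1, 0.2), (2, 0.3), (3, 0.4), (3, 0.5)],
             dtype=[('c0', np.int64), ('score', np.float64)])

# Original approach: np.unique sorts internally (O(N log N)) even though
# the array is already sorted by 'c0'.
_, counts = np.unique(m['c0'], return_counts=True)
groups_unique = np.split(m, np.cumsum(counts))[:-1]

# Single-pass alternative (O(N)) on the pre-sorted array: split at every
# index where the key differs from its predecessor.
cut_points = np.flatnonzero(m['c0'][1:] != m['c0'][:-1]) + 1
groups_single_pass = np.split(m, cut_points)
```

Both produce the same list of per-id groups; only the amount of work to find the cut points differs.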
This is really cool! Can we check which function contributes most to the optimization?
I think most of the optimization comes from _group_by; the other function's contribution is not so obvious.
                list_numpy_arrays.append(_sorted_m[prev_val:i])
                prev_val = i
                current = val
            if i == n_elements - 1 and val == current:
Can this be handled outside the for loop, once you are sure you handled the last element? What is the problem here?
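A sketch of the restructuring the reviewer suggests: the final group is appended once after the loop instead of via an extra last-element check inside it. This is an illustration on dummy data, not the PR's actual code; the `'c0'` field name mirrors the snippets above:

```python
import numpy as np

# Dummy structured array already sorted by 'c0', mimicking _sorted_m.
_sorted_m = np.array([(1, 0.1), (1, 0.2), (2, 0.3)],
                     dtype=[('c0', np.int64), ('score', np.float64)])

# Single pass that cuts a group whenever the key changes. No
# `i == n_elements - 1` special case is needed inside the loop:
# the trailing group is flushed once the loop ends.
groups = []
prev_val = 0
current = _sorted_m['c0'][0]
for i, val in enumerate(_sorted_m['c0']):
    if val != current:
        groups.append(_sorted_m[prev_val:i])
        prev_val = i
        current = val
groups.append(_sorted_m[prev_val:])  # last group handled outside the loop
```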
This PR avoids many array creations at ranker runtime.
The code can be benchmarked locally with the Python script minimal_working_example_score.py attached below. The script is self-contained and generates dummy data to test performance, so it can be copied and executed anywhere.
The results of python minimal_working_example_score.py are:
Script minimal_working_example_score.py:
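The attached script itself is not reproduced in this thread. A minimal sketch of what such a self-contained benchmark could look like, assuming it times the two grouping strategies discussed above on generated dummy data (all names here are illustrative, not the attachment's):

```python
import timeit
import numpy as np

# Generate dummy sorted match data, as the attached script is described to do.
rng = np.random.default_rng(42)
keys = np.sort(rng.integers(0, 1_000, size=100_000))
m = np.zeros(keys.size, dtype=[('c0', np.int64), ('score', np.float64)])
m['c0'] = keys
m['score'] = rng.random(keys.size)

def group_with_unique(a):
    # Baseline: np.unique re-sorts internally, O(N log N).
    _, counts = np.unique(a['c0'], return_counts=True)
    return np.split(a, np.cumsum(counts))[:-1]

def group_single_pass(a):
    # Optimized: exploit that the array is pre-sorted, O(N) cut detection.
    cuts = np.flatnonzero(a['c0'][1:] != a['c0'][:-1]) + 1
    return np.split(a, cuts)

t_unique = timeit.timeit(lambda: group_with_unique(m), number=20)
t_single = timeit.timeit(lambda: group_single_pass(m), number=20)
print(f'np.unique grouping: {t_unique:.3f}s, single pass: {t_single:.3f}s')
```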