feat: add batching support for rankers #1467
Conversation
Hey @deepampatel,
Thank you very much for your contribution.
As a comment, I would just recommend not trying to fit every type of batching pattern into a single function, since it can become really unreadable and unmaintainable. So if you feel you need a special function for some special executor or type of input, you are welcome to add it!
Great work!
jina/executors/decorators.py (Outdated)

    class MultiModalExecutor:

        @batching(batch_size=64, slice_on=[1, 2])
Just as a comment: there is a special batching decorator for `MultiModalExecutor` itself.
@JoanFM Thanks for the quick feedback. I agree with you on the special function for the special executor class; I will add a separate class for ranker input.
WDYT about having the `slice_on` parameter as a `Union[int, List[int]]` instead of just an `int` value in `@batching_multi_input`? I don't have any particular example in mind right now where this could be used, but the thought came up when I was trying to merge both decorators.
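For what it's worth, a minimal sketch of what a `Union[int, List[int]]` `slice_on` could look like. The decorator below is a hypothetical illustration with assumed semantics, not Jina's actual `@batching_multi_input`:

```python
from itertools import islice
from typing import Callable, List, Union


def batching_multi_input(batch_size: int, slice_on: Union[int, List[int]] = 1,
                         num_data: int = 1) -> Callable:
    """Hypothetical sketch: batch several positional arguments in lock-step.

    ``slice_on`` is either a start index (batching ``num_data`` consecutive
    arguments) or an explicit list of argument indices, which would also
    cover non-consecutive parameters like [1, 3].
    """
    def decorator(func: Callable) -> Callable:
        def wrapper(*args):
            # Normalize both accepted forms to an explicit list of indices.
            if isinstance(slice_on, list):
                indices = slice_on
            else:
                indices = list(range(slice_on, slice_on + num_data))
            iterators = [iter(args[i]) for i in indices]
            results = []
            while True:
                batches = [list(islice(it, batch_size)) for it in iterators]
                if not batches[0]:
                    break
                # Substitute the batched slices into the original arguments.
                new_args = list(args)
                for i, b in zip(indices, batches):
                    new_args[i] = b
                results.extend(func(*new_args))
            return results
        return wrapper
    return decorator
```

With `slice_on=[1, 2]` arguments 1 and 2 are batched together; with `slice_on=1, num_data=2` the same two arguments are selected by position instead.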
Force-pushed from cc4a142 to d0fb777.
Latency summary: backed by latency-tracking. Further commits will update this comment.
Now I see the description. Please, unless it looks very, very clean, avoid having `batching` and `batching_multi_input` inside a single decorator; it was split for better readability.
Previously I had tried to merge all batching decorators into one; after your comments above I have reverted the commit and added a new `@batching_ranker_input` decorator only for ranker executors. The `@batching` and `@batching_multi_input` decorators are untouched.
jina/executors/decorators.py (Outdated)

    batch_size: Union[int, Callable] = None,
    num_batch: Optional[int] = None,
    split_over_axis: int = 0,
    merge_over_axis: int = 0,
Since it is only for rankers, this will not be different from 0, so let's remove it. Also the `split_over_axis`: will it be different from 0?
The same for `slice_on`?
`slice_on`, I feel, can be a configurable parameter.
At least `slice_on` should be set to the default that the rankers would use.
jina/executors/decorators.py (Outdated)

    else:
        if isinstance(_slice_on, int):
            _slice_on = [_slice_on]
            _num_data = 1
`_num_data` is not used after here.
Codecov Report

    @@            Coverage Diff             @@
    ##           master    #1467      +/-   ##
    ==========================================
    + Coverage   84.37%   84.69%   +0.32%
    ==========================================
      Files         108      108
      Lines        6311     6424     +113
    ==========================================
    + Hits         5325     5441     +116
    + Misses        986      983       -3

Continue to review the full report at Codecov.
There is a lot of code that is common (or can be made common) between `batching_multi_input` and `batching_ranker_input`. I am sure some helper functions can be made to unify both a little bit.
jina/executors/decorators.py (Outdated)

    _num_data = num_data
    if _num_data is not None:
        if isinstance(_slice_on, List):
            raise ValueError(f'When using num_data in @batching_ranker_input, an integer value '
Then please fix the `slice_on` type hint, since it suggests a `List` is accepted.
I would just adapt the type hint and remove this check.
Avoid this noisy part; consider that `num_data` is never `None` (the type hint suggests so).
I was thinking: `slice_on` and `num_data`, when used together, allow you to batch consecutive parameters. There might be a case where you want to batch non-consecutive parameters, let's say [1, 3]; in those cases we could have `slice_on` as a list of indices. So whenever `slice_on` is passed as a list, `num_data` (whose type hint needs to be updated to `Optional`) cannot be used. WDYT?
Since there is not yet a case where we require this, I'd prefer to keep it simple
jina/executors/decorators.py (Outdated)

    batch_size: Union[int, Callable] = None,
    num_batch: Optional[int] = None,
    slice_on: Union[int, List[int]] = 2,
    num_data: int = None) -> Any:
The `num_data` default is 3 for this case, I think, no?
Out of `self, query_meta, old_match_scores, match_meta`, we only want to batch `old_match_scores`. The other two are meta, which don't need to be batched.
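As a toy illustration of that point (the names `score` and `call_in_batches` and the argument order are assumptions taken from this comment, not the real ranker API): only the scores argument is cut into batches, while the meta arguments are forwarded whole on every call:

```python
from itertools import islice


def score(query_meta, old_match_scores, match_meta):
    # Stand-in for a ranker's scoring method: doubles each match score.
    return {match_id: s * 2 for match_id, s in old_match_scores.items()}


def call_in_batches(func, query_meta, old_match_scores, match_meta, batch_size):
    # Batch only `old_match_scores`; `query_meta` and `match_meta` are
    # passed through unchanged, mirroring slice_on=2, num_data=1.
    merged = {}
    it = iter(old_match_scores.items())
    while True:
        chunk = dict(islice(it, batch_size))
        if not chunk:
            return merged
        merged.update(func(query_meta, chunk, match_meta))
```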
jina/executors/decorators.py (Outdated)

    for idx, slice_idx in enumerate(_slice_on[1:]):
        batch_idx = next(data_iterators[idx + 1])
        if yield_dict[idx + 1]:
            args[slice_on] = dict(batch) if yield_dict[0] else batch
Why is `yield_dict` needed? What is `batch` returning?
`batch` returns a list of tuples of `(key, value)`.
OK, I see. Maybe it is better to handle, in `batch_iterator`, the case where a `dict` is provided?
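A hedged sketch of that suggestion (the real `batch_iterator` takes more parameters, e.g. `axis` and `num_batch`; this shows only the dict branch): handling `dict` inside `batch_iterator` lets each batch come back as a dict again, with no extra flag needed:

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Union


def batch_iterator(data: Union[Dict, Iterable[Any]], batch_size: int) -> Iterator:
    if isinstance(data, dict):
        # Re-assemble each batch of (key, value) items into a dict.
        it = iter(data.items())
        while True:
            chunk = dict(islice(it, batch_size))
            if not chunk:
                return
            yield chunk
    else:
        # Plain iterables keep the existing tuple-chunk behavior.
        it = iter(data)
        while True:
            chunk = tuple(islice(it, batch_size))
            if not chunk:
                return
            yield chunk
```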
It feels that, until obtaining the iterators, the code could be merged with `batching_multi_input`?
Force-pushed from 35d0a99 to 1918212.
    @@ -107,7 +107,10 @@ def batch_iterator(data: Iterable[Any], batch_size: int, axis: int = 0,
        data = iter(data)
        # as iterator, there is no way to know the length of it
        while True:
            chunk = tuple(islice(data, batch_size))
            if yield_dict:
Better to add, before the `elif isinstance(data, Iterable)`, an `elif isinstance(data, dict)`; like this it will work without the need of the `yield_dict` parameter.
`data` needs to be sliced before passing it to `batch_iterator`, because we have `num_batch`, which in turn sets the `total_size`. There are two options, I think:
- Pass `total_size` to `batch_iterator` and slice the data inside `batch_iterator`.
- Slice it before passing it to `batch_iterator` and pass `yield_dict`. In this case a dict, when sliced, is converted to an iterable.

WDYT?
OK, I see. Let's keep passing `yield_dict` as an argument so that we do not affect current behavior (the code is a little verbose already).
jina/executors/decorators.py (Outdated)

    total_size = _get_total_size(full_data_size, b_size, num_batch)
    final_result = []
    yield_dict = [isinstance(args[slice_on + i], Dict) for i in range(0, num_data)]
    data_iterators = [batch_iterator(_get_slice(args[slice_on + i], total_size), b_size, yield_dict=yield_dict[i]) for i in range(0, num_data)]
Let's adapt this part together with `batching_multi_input` if possible; `get_slice` can be used also there, right? Also, the comma before `total_size` needs a separating space.
Since both `batch_multi_input` and `batch_ranker_input` are almost the same, we could completely remove `batch_ranker_input` and update `batching` and `batch_multi_input` with `get_slice` and `yield_dict` to handle dictionary data. Or should we keep `batch_ranker_input` separate?
Yes, let's try. But `yield_dict` can be avoided in regular batching; it can be set to default `False` in the `batch_decorator`.
If we are removing `batching_ranker_input`, maybe we should add dictionary support to both `batching` and `batching_multi_input`, no?
Maybe in the future, but for now I would prefer to keep it conservative and just touch what may be used.
For rankers we actually need only one parameter, `old_match_scores`, to be batched. So do we still need to use `batch_multi_input`?
Not true; you may want to batch `match_meta` as well.
recheckcla
Jina CLA check ✅ All Contributors have signed the CLA.
Hey @deepampatel,
Thanks for your great contribution. Now we just need you to sign the CLA before merging the PR.
I have read the CLA Document and I hereby sign the CLA.