
feat: add batching support for rankers #1467

Merged

Conversation

deepampatel
Contributor

@deepampatel deepampatel commented Dec 15, 2020

Previously I had tried to merge all batching decorators into one; after your comments below I have reverted that commit and added a new @batching_ranker_input decorator only for ranker executors. The @batching and @batching_multi_input decorators are untouched.
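For context, a minimal usage sketch of the new decorator on a ranker's score method. The class name MyRanker and the batch_size value are illustrative only; the method signature and the slice_on default of 2 are taken from the review discussion below.

from jina.executors.decorators import batching_ranker_input

class MyRanker:  # hypothetical ranker class, for illustration only
    @batching_ranker_input(batch_size=64)  # slice_on defaults to 2, i.e. old_match_scores
    def score(self, query_meta, old_match_scores, match_meta):
        # old_match_scores arrives here in batches of at most 64 items;
        # query_meta and match_meta are passed through as-is
        ...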

@jina-bot jina-bot added size/M area/core This issue/PR affects the core codebase area/testing This issue/PR affects testing component/executor executor/meta labels Dec 15, 2020
Member

@JoanFM JoanFM left a comment

Hey @deepampatel ,

Thank you very much for your contribution.

As a comment, I would just recommend not trying to fit every type of batching pattern into a single function, since it can become really unreadable and unmaintainable. So if you feel like you need a special function for some special executor or type of input, you are welcome to add it!

Great work!


class MultiModalExecutor:

@batching(batch_size = 64, slice_on=[1,2])
Member

Just as a comment, there is a special batching decorator for MultiModalExecutor itself.

Contributor Author

@deepampatel deepampatel Dec 15, 2020

@JoanFM Thanks for the quick feedback. I agree with you on having a special function for a special executor class; I will add a separate decorator for ranker input.
WDYT about having the slice_on parameter as a Union[int, List[int]] instead of just an int value in @batching_multi_input? I don't have any particular example in mind right now where this can be used, but the thought came up when I was trying to merge both decorators.

@deepampatel deepampatel force-pushed the feat-add-batching-support-for-rankers branch from cc4a142 to d0fb777 Compare December 27, 2020 13:05
@github-actions

github-actions bot commented Dec 27, 2020

Latency summary

Current PR yields:

  • 😶 index QPS at 1756, delta to last 3 avg.: -3%
  • 😶 query QPS at 31, delta to last 3 avg.: -4%

Breakdown

Version | Index QPS | Query QPS
current | 1756 | 31
0.8.16 | 1796 | 32
0.8.15 | 1812 | 32
0.8.14 | 1822 | 32

Backed by latency-tracking. Further commits will update this comment.

@JoanFM
Member

JoanFM commented Dec 27, 2020

This commit is a feat but also a refactor commit.

  1. feat: add support for batching dictionary data, so it can be used to add batching support for rankers.
  2. wip: refactor @batching and @batching_multi_input into a single decorator
TODO: add support for label_on and ordinal_idx_arg

Now I see the description: unless it looks very, very clean, please avoid having batching and batching_multi_input inside a single decorator; it was split for better readability.

@deepampatel
Contributor Author

Previously I had tried to merge all batching decorators into one; after your comments above I have reverted that commit and added a new @batching_ranker_input decorator only for ranker executors. The @batching and @batching_multi_input decorators are untouched. @JoanFM

@deepampatel deepampatel marked this pull request as ready for review December 28, 2020 16:06
@deepampatel deepampatel requested a review from a team as a code owner December 28, 2020 16:06
batch_size: Union[int, Callable] = None,
num_batch: Optional[int] = None,
split_over_axis: int = 0,
merge_over_axis: int = 0,
Member

since it is only for rankers, this will not be different from 0, so let's remove it. The same question for split_over_axis: will it ever be different from 0?
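For readers following the thread, a tiny standalone illustration (plain numpy, not the decorator internals) of what splitting over axis 0 and merging over axis 0 mean for a batched call:

import numpy as np

data = np.arange(12).reshape(6, 2)            # 6 rows of input
batches = np.array_split(data, 3, axis=0)     # split_over_axis=0: cut along the rows
results = [b * 10 for b in batches]           # process each batch independently
merged = np.concatenate(results, axis=0)      # merge_over_axis=0: stitch results back together
assert merged.shape == data.shape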

Member

the same for slice_on?

Contributor Author

slice_on, I feel, can be a configurable parameter.

Member

at least slice_on should be set to the default that the rankers would use

else:
if isinstance(_slice_on, int):
_slice_on = [_slice_on]
_num_data = 1
Member

_num_data is not used after here

@codecov

codecov bot commented Dec 29, 2020

Codecov Report

Merging #1467 (9e28ecc) into master (3f5a588) will increase coverage by 0.32%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1467      +/-   ##
==========================================
+ Coverage   84.37%   84.69%   +0.32%     
==========================================
  Files         108      108              
  Lines        6311     6424     +113     
==========================================
+ Hits         5325     5441     +116     
+ Misses        986      983       -3     
Impacted Files Coverage Δ
jina/executors/decorators.py 92.22% <100.00%> (+1.31%) ⬆️
jina/logging/profile.py 68.33% <0.00%> (-0.58%) ⬇️
jina/drivers/craft.py 100.00% <0.00%> (ø)
jina/proto/jina_pb2.py 100.00% <0.00%> (ø)
jina/types/ndarray/generic.py 100.00% <0.00%> (ø)
jina/executors/evaluators/rank/recall.py 100.00% <0.00%> (ø)
jina/executors/evaluators/rank/precision.py 100.00% <0.00%> (ø)
jina/executors/encoders/helper.py
jina/executors/encoders/numeric/__init__.py 42.42% <0.00%> (ø)
jina/drivers/encode.py 94.91% <0.00%> (+0.08%) ⬆️
... and 22 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9a427b...9e28ecc. Read the comment docs.

Member

@JoanFM JoanFM left a comment

There is a lot of code that is common (or can be made common) between batching_multi_input and batching_ranker_input. I am sure some helper functions can be made to unify the two a bit.
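To illustrate the kind of shared helper being hinted at, here is a rough sketch; _get_total_size and _get_slice do appear in the diff further down, but the bodies below are guesses based on this conversation, not the PR's actual implementation.

from itertools import islice
from typing import Optional

def _get_total_size(full_data_size: Optional[int], batch_size: int, num_batch: Optional[int]) -> Optional[int]:
    # guessed behavior: cap the number of items processed when num_batch is set
    capped = batch_size * num_batch if num_batch else None
    if full_data_size is None:
        return capped
    return min(full_data_size, capped) if capped is not None else full_data_size

def _get_slice(data, total_size: Optional[int]):
    # guessed behavior: take only the first total_size items;
    # slicing a dict yields an iterable of (key, value) pairs, hence the later yield_dict flag
    items = data.items() if isinstance(data, dict) else data
    return islice(items, total_size) if total_size is not None else items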

_num_data = num_data
if _num_data is not None:
if isinstance(_slice_on, List):
raise ValueError(f'When using num_data in @batching_ranker_input, an integer value '
Member

then please fix the slice_on type hint, since it suggests a List is accepted

Member

I would just adapt the type hint and remove this check

Member

Avoid this noisy part; consider that num_data is never None (the type hint suggests so).

Contributor Author

I was thinking: slice_on and num_data, when used together, allow you to batch consecutive parameters. There might be a case where you want to batch non-consecutive parameters, let's say [1, 3]; in those cases we could have slice_on as a list of indices. So whenever slice_on is passed as a list, num_data (the type hint needs to be updated to Optional) cannot be used. WDYT?
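To make the distinction concrete (an illustration of the idea only; the list form was not implemented in this PR), counting positional arguments with self at index 0:

# consecutive: slice_on=1, num_data=2 would batch scores and metas (indices 1 and 2)
# non-consecutive (hypothetical): slice_on=[1, 3] would batch scores and flags, skipping metas
def foo(self, scores, metas, flags):  # parameter names are made up for the example
    ...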

Member

Since there is not yet a case where we require this, I'd prefer to keep it simple

batch_size: Union[int, Callable] = None,
num_batch: Optional[int] = None,
slice_on: Union[int, List[int]] = 2,
num_data: int = None) -> Any:
Member

I think the num_data default should be 3 for this case, no?

Contributor Author

Out of self, query_meta, old_match_scores and match_meta, we only want to batch old_match_scores; the other two are meta and don't need to be batched.
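Spelled out, with self counted at positional index 0, old_match_scores sits at index 2, which is why slice_on defaults to 2 and a single batched argument suffices:

def score(self, query_meta, old_match_scores, match_meta):
    # positional indices: self=0, query_meta=1, old_match_scores=2, match_meta=3
    # slice_on=2 with num_data=1 therefore batches only old_match_scores
    ...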

for idx,slice_idx in enumerate(_slice_on[1:]):
batch_idx = next(data_iterators[idx+1])
if yield_dict[idx+1]:
args[slice_on] = dict(batch) if yield_dict[0] else batch
Member

why is yield_dict needed? what is batch returning?

Contributor Author

batch returns a list of tuples of (key, value)

Member

OK, I see. Maybe it is better to handle the case where a dict is provided inside batch_iterator?

Member

It feels like, up until obtaining the iterators, this code could be merged with batching_multi_input?

@deepampatel deepampatel force-pushed the feat-add-batching-support-for-rankers branch from 35d0a99 to 1918212 Compare December 30, 2020 17:10
@jina-bot jina-bot added the area/helper This issue/PR affects the helper functionality label Dec 30, 2020
@@ -107,7 +107,10 @@ def batch_iterator(data: Iterable[Any], batch_size: int, axis: int = 0,
data = iter(data)
# as iterator, there is no way to know the length of it
while True:
chunk = tuple(islice(data, batch_size))
if yield_dict:
Member

better to add an elif isinstance(data, dict) before the elif isinstance(data, Iterable); like this it will work without the need for the yield_dict parameter.
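A rough, standalone sketch of the dict handling being suggested here (the helper name _batch_dict is made up; as discussed below, the PR kept the yield_dict argument instead):

from itertools import islice

def _batch_dict(data: dict, batch_size: int):
    # possible shape of the suggested handling: yield dict chunks directly
    items = iter(data.items())
    while True:
        chunk = dict(islice(items, batch_size))
        if not chunk:
            return
        yield chunk

# list(_batch_dict({'a': 1, 'b': 2, 'c': 3}, 2)) -> [{'a': 1, 'b': 2}, {'c': 3}]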

Contributor Author

data needs to be sliced before passing it to batch_iterator because we have num_batch, which in turn sets the total_size. There are two options I think:

  1. Pass total_size to batch_iterator and slice the data inside batch_iterator.
  2. Slice it before passing it to batch_iterator and pass yield_dict. In this case a dict, when sliced, is converted to an iterable (a quick sketch of this follows below).
    wdyt?
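A quick plain-Python illustration of option 2's caveat, that slicing a dict produces an iterable of (key, value) tuples, which is exactly why yield_dict is needed to rebuild the dict per batch:

from itertools import islice

scores = {'m1': 0.9, 'm2': 0.7, 'm3': 0.4}
batch = tuple(islice(scores.items(), 2))   # (('m1', 0.9), ('m2', 0.7)) -- no longer a dict
rebuilt = dict(batch)                      # {'m1': 0.9, 'm2': 0.7} -- what yield_dict=True restores
print(batch, rebuilt)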

Member

OK I see, let's keep passing yield_dict as argument so that we do not affect current behavior (the code is a little verbose already)

total_size = _get_total_size(full_data_size, b_size, num_batch)
final_result = []
yield_dict = [isinstance(args[slice_on + i], Dict) for i in range(0,num_data)]
data_iterators = [batch_iterator(_get_slice(args[slice_on + i],total_size), b_size , yield_dict=yield_dict[i]) for i in range(0, num_data)]
Member

let's adapt this part together with batching_multi_input if possible; get_slice can be used there too, right? Also, add a space after the separating comma before total_size.

Contributor Author

@deepampatel deepampatel Dec 30, 2020

Since batching_multi_input and batching_ranker_input are very similar, we could completely remove batching_ranker_input and update batching and batching_multi_input with get_slice and yield_dict to handle dictionary data. Or should we keep batching_ranker_input separate?

Member

yes, let's try, but yield_dict can be avoided in regular batching; it can be set to default to False in the batch_decorator

Contributor Author

@deepampatel deepampatel Dec 31, 2020

if we are removing batching_ranker_input, maybe we should add dictionary support to both batching and batching_multi_input, no?

Member

maybe in the future, but now I would prefer to keep it conservative and just touch where it may be used

Contributor Author

For rankers we actually need only one parameter, old_match_scores, to be batched. So do we still need to use batching_multi_input?

Member

not true, you may also want to batch the match meta

@jina-bot jina-bot added size/S and removed size/M labels Dec 31, 2020
@deepampatel
Contributor Author
Contributor Author

#1382

@JoanFM
Member

JoanFM commented Dec 31, 2020

recheckcla

@github-actions

github-actions bot commented Dec 31, 2020

Jina CLA check ✅ All Contributors have signed the CLA.

Member

@JoanFM JoanFM left a comment

Hey @deepampatel ,

thanks for your great contribution, now we just need you to sign the CLA before merging the PR

@deepampatel
Contributor Author
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@JoanFM JoanFM merged commit 4a2c744 into jina-ai:master Dec 31, 2020
@jina-ai jina-ai locked and limited conversation to collaborators Dec 31, 2020