Adding batch_size support for (almost) all pipelines #13724

Merged
merged 31 commits into huggingface:master from Narsil:pipeline_batch_size_support on Oct 29, 2021

Conversation

Contributor

@Narsil Narsil commented Sep 24, 2021

What does this PR do?

When running a pipeline on a dataset with a small model (relative to the GPU), it can be beneficial to batch
the forward pass for performance.

This PR addresses this by adding a batch_size argument.

This PR contains

  • Some facilities for batching and unbatching, handled generically rather than within each individual pipeline
  • Automated testing of this functionality for ALL small models + pipelines
  • Disabled for question-answering and zero-shot-classification. They are trickier because they already use batching internally (over candidate labels and question features). The full solution would involve moving the iterator to the real N [hypothesis, template] pairs and batching there, with another iterator on top that recreates the current zero-shot/question-answering results. Should we add that capability, at least for these 2 pipelines we would have a much better idea of alignment.
  • Ran all slow (pipelines) tests without issue
  • Refactor the batch/unbatch logic for better code quality
  • More docs: caveats about this argument, use cases, benchmarks, and so on.
  • Need to think about TF, which currently has no support for either streaming or batching

The good example (https://gist.github.com/Narsil/4e1c36d7cf8477e5c1d580585860810e):

This code was executed on a GTX 970 (and on a Titan RTX, with similar conclusions); the model is distilbert-base-uncased-finetuned-sst-2-english (250MB bin file).

The old pipelines' GPU iteration method is excluded because it's an order of magnitude slower in all cases.
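A minimal sketch of the benchmark loop that produces the numbers below (the exact script is in the gist above; the identical-sentence generator here is just an illustrative stand-in for the real dataset):

import tqdm
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,  # run on the GPU
)

def data(n=5000):
    # Perfectly aligned inputs: every sentence tokenizes to the same length.
    for _ in range(n):
        yield "This is a test sentence."

for batch_size in [1, 8, 64, 256]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    # Passing an iterable makes the pipeline stream results one by one;
    # batch_size only controls how many inputs go through the model at once.
    for _ in tqdm.tqdm(pipe(data(), batch_size=batch_size), total=5000):
        pass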

------------------------------
Streaming no batching
100%|██████████████████████████████████████████████████████████████████████| 5000/5000 [00:26<00:00, 187.52it/s]
------------------------------
Streaming batch_size=8
100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:04<00:00, 1205.95it/s]
------------------------------
Streaming batch_size=64
100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:02<00:00, 2478.24it/s]
------------------------------
Streaming batch_size=256
100%|█████████████████████████████████████████████████████████████████████| 5000/5000 [00:01<00:00, 2554.43it/s]
(diminishing returns)

This seems promising!

However, this setup has:

  • Perfect alignment (all inputs are exactly the same length)
  • Small model (lots of GPU RAM left for inputs and intermediary results)

Let's look at another example, which might (or might not) be a bit more realistic:
Using varying-size inputs (https://gist.github.com/Narsil/de88b2d7c242c29772a61af56a5c8270)

------------------------------
Streaming no batching
100%|█████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:30<00:00, 32.51it/s]
------------------------------
Streaming batch_size=8
100%|█████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:29<00:00, 33.62it/s]
------------------------------
Streaming batch_size=64
100%|█████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:29<00:00, 34.29it/s]
------------------------------
Streaming batch_size=256
  0%|                                                                                                                                          | 0/1000 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/nicolas/src/transformers/test.py", line 38, in <module>
    for out in tqdm.tqdm(pipe(dataset, batch_size=256), total=len(dataset)):
  File "/home/nicolas/src/transformers/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1133, in __iter__
    for obj in iterable:
....
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/home/nicolas/src/transformers/.venv/lib/python3.9/site-packages/torch/nn/functional.py", line 1555, in gelu
    return torch._C._nn.gelu(input)
RuntimeError: CUDA out of memory. Tried to allocate 472.00 MiB (GPU 0; 3.95 GiB total capacity; 2.13 GiB already allocated; 266.75 MiB free; 2.49 GiB reserved in total by PyTorch)

Here we can see that no speedup was achieved, and we actually crashed for the large batch size.
This is entirely due to the inputs not being aligned.

The problem can be made even worse when you have large batch sizes and RARE very long sentences (https://gist.github.com/Narsil/357519fd385d864bfec3caf5aa8df575).

------------------------------
Streaming no batching
100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:05<00:00, 183.69it/s]
------------------------------
Streaming batch_size=8
100%|█████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 265.74it/s]
------------------------------
Streaming batch_size=64
100%|██████████████████████████████████████████████████████████████████████| 1000/1000 [00:26<00:00, 37.80it/s]
------------------------------
Streaming batch_size=256
  0%|                                                                                 | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/nicolas/src/transformers/test.py", line 42, in <module>
    for out in tqdm.tqdm(pipe(dataset, batch_size=256), total=len(dataset)):
....
    q = q / math.sqrt(dim_per_head)  # (bs, n_heads, q_length, dim_per_head)
RuntimeError: CUDA out of memory. Tried to allocate 376.00 MiB (GPU 0; 3.95 GiB total capacity; 1.72 GiB already allocated; 354.88 MiB free; 2.46 GiB reserved in total by PyTorch)

Here we are actually 5x SLOWER with batch_size=64 than with the non-batched version. That is because the rare long sentence is so long that it forces the whole batch to be padded to its sequence length, using much more memory and processing power (the padding tokens ARE processed by the GPU, they just don't influence the end result).
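To see the padding effect directly, one can inspect the padded shapes (a rough sketch; the sentences and exact shapes are illustrative, not taken from the gist):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

short = ["This is a short sentence."] * 63
outlier = [" ".join(["very"] * 400) + " long sentence"]

# Without the outlier, the whole padded batch stays small.
print(tokenizer(short, padding=True, return_tensors="pt")["input_ids"].shape)
# -> roughly torch.Size([63, 8])

# One very long sentence forces every row to be padded to its length,
# so the model processes ~50x more tokens for the same number of useful inputs.
print(tokenizer(short + outlier, padding=True, return_tensors="pt")["input_ids"].shape)
# -> roughly torch.Size([64, 400])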

For users, a rule of thumb is:

  • Measure performance on your load, with your hardware. Measure, measure, and keep measuring. Real numbers are the only way to go.
  • If you are latency constrained (live product doing inference), don't batch.
  • If you are using CPU, don't batch.
  • If you care about throughput (you want to run your model on a bunch of static data) on GPU, then:
    • If you have no clue about the sequence_length ("natural" data), don't batch by default; measure, try adding it tentatively, and add OOM checks to recover when it fails (and it will fail at some point if you don't control the sequence_length).
    • If your sequence_length is super regular, then batching is more likely to be VERY interesting; measure and push it until you get OOMs.
    • The larger the GPU, the more likely batching is to be interesting.
  • As soon as you enable batching, make sure you can handle OOMs nicely (see the sketch after this list).
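To illustrate the last point, a rough sketch of one way to recover from OOMs; this is not part of the PR, and the fallback policy (retrying unbatched) is just an example:

import torch
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)

def run_batched(texts, batch_size):
    # Try the batched run; on CUDA OOM, free the cache and retry unbatched.
    try:
        return list(pipe(texts, batch_size=batch_size))
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        # A real policy might halve batch_size and retry instead of going straight to 1.
        return list(pipe(texts, batch_size=1))

results = run_batched(["Some example text"] * 1000, batch_size=64)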

There are no good (general) solutions for this problem, and your mileage may vary depending on your use case, which is why, for now:

  • batch_size=1 by default (both for speed and for OOM issues; we can't guess the correct parameters, and at least with batch_size=1 we have the smallest possible chance of going OOM).
  • batch_size=1 is roughly comparable in speed to batched runs on irregularly sized data (which is an important use case, e.g. live products where latency also matters).
  • Other batch_sizes are opt-in, because they might be valuable for users (for instance when computing a metric on a dataset with very regular input lengths), but it is then the user's responsibility to check for OOMs and slowness.
  • batch_size > 1 won't work for a tokenizer/feature_extractor that doesn't have a padding mechanism (if one is required); see the sketch after this list for how to set a pad token when one is missing.
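For the last point, when a tokenizer has no pad token (gpt2, for example), the usual workaround, also discussed in the review comments below, is to reuse the EOS token. A rough sketch:

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2", device=0)

# gpt2 ships without a pad token, so batching would otherwise be rejected.
generator.tokenizer.pad_token = generator.tokenizer.eos_token
generator.model.config.pad_token_id = generator.model.config.eos_token_id

outputs = generator(
    ["Hello there", "A much longer prompt than the first one"],
    batch_size=2,
    max_length=30,
)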

It would be ideal if pipelines could start taking that responsibility on their shoulders and batch dynamically for users, but it's a hard problem right now:

  • It's hard to anticipate OOMs, and an OOM might happen late (so batch_size would always have to be somewhat dynamic during the streaming process).
  • It's even harder to evaluate the slowdown due to padding; pipelines would have to count padding tokens and implement some kind of batch exclusion mechanism.
  • The padding issue could be helped quite a bit by RaggedTensors; however, they don't play that nicely with GPU capabilities either (GPUs need data that is as aligned/regular as possible).

Some other links/issues/discussions:

#11251
https://discuss.huggingface.co/t/how-to-change-the-batch-size-in-a-pipeline/8738
https://discuss.huggingface.co/t/how-to-make-pipeline-automatically-scale/7432
#13141
#12195
https://gist.github.com/Narsil/ee5c09875e74fa6f018dc6d014f6c06c

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@LysandreJik @sgugger

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil Narsil changed the title Adding batch_size support for (almost) all pipelines [WIP] Adding batch_size support for (almost) all pipelines Sep 24, 2021
@Narsil Narsil force-pushed the pipeline_batch_size_support branch from 354bb4c to ecedb2e Compare October 11, 2021 08:59
@Narsil Narsil changed the title [WIP] Adding batch_size support for (almost) all pipelines Adding batch_size support for (almost) all pipelines Oct 11, 2021
Member

@LysandreJik LysandreJik left a comment


This is a fantastic PR and write-up! Thanks for doing all of the work.

The code looks okay to me, but there are a lot of small changes across pipelines - would it be possible to add comments where those changes are unintuitive so that we may better understand the need addressed? I added comments where I think those would be helpful.

The test changes are clean. Thanks for adding this layer which should make testing simpler for new pipelines.

Finally, the write-up is great, it would be ideal to add it to the documentation. Can you add it to the pipeline RST document?

Comment on lines 683 to 689
k: element[self._unbatch_index].unsqueeze(0)
if isinstance(element[self._unbatch_index], torch.Tensor)
else np.expand_dims(element[self._unbatch_index], 0)
if isinstance(element[self._unbatch_index], np.ndarray)
else element[self._unbatch_index]
for k, element in self._unbatch_data.items()
if k != "past_key_values"
Member


oof this is a tough one to understand, it would be nice to spread it over different lines.

Contributor Author


Rewrote it, hopefully it's better now, can you confirm?

Comment on lines 70 to 108
raise ValueError("Pipeline without tokenizer or feature_extractor cannot do batching")
if tokenizer is not None:
if tokenizer.pad_token_id is None:
raise ValueError("Pipeline with tokenizer without pad_token cannot do batching")
Member


Nice error raising! It would be nice to show how to assign a padding token in that case.

Contributor Author


Can we add a padding_token just like that? Wouldn't it be erasing an existing (likely used) token?

Not sure what you mean.

Member


We generally show how to do this with the following:

model.config.pad_token_id = model.config.eos_token_id

I think this is particularly important for the pipeline as users don't necessarily understand what/how to change the underlying model's attributes, so printing an example of that in the console would be helpful

Collaborator

@sgugger sgugger left a comment


Thanks a lot for all your work on this!

Comment on lines 700 to 701
self._unbatch_index = None
self._unbatch_data = None
Collaborator


For those other variables, I would prefer unpack_xxx to unbatch personally.

Contributor Author


If I make the switch, I will make it for all variables (so unpack_size too), as I consider these completely linked, so using similar names is important.

I am fine with the name, even though I feel we lose the connection to the batch concept.

Is that what you are implying?

Contributor Author


With respect to the other comment, I updated everything to loader_batch_*, which I think is better.

@Narsil Narsil force-pushed the pipeline_batch_size_support branch 2 times, most recently from 376923a to e2d6a6a Compare October 18, 2021 12:11
@Narsil Narsil mentioned this pull request Oct 25, 2021
output_ids = self.model.generate(**model_inputs, **generate_kwargs)
if self.model.config.is_encoder_decoder:
start_position = 1
else:
start_position = n
return {"output_ids": output_ids[0, start_position:], "conversation": conversation}
return {"output_ids": output_ids[:, start_position:], "conversation": conversation}
Contributor Author

@Narsil Narsil Oct 25, 2021


We are changing the interface between _forward and postprocess to keep the batch in the tensors so batching/unbatching can happen.

@@ -204,26 +204,29 @@ def _forward(self, model_inputs):
offset_mapping = model_inputs.pop("offset_mapping", None)
sentence = model_inputs.pop("sentence")
if self.framework == "tf":
outputs = self.model(model_inputs.data)[0][0]
logits = self.model(model_inputs.data)[0]
Contributor Author


We are changing the interface between _forward and postprocess to keep the batch in the tensors so batching/unbatching can happen.

sentence = model_outputs["sentence"]
input_ids = model_outputs["input_ids"][0]
offset_mapping = model_outputs["offset_mapping"][0] if model_outputs["offset_mapping"] is not None else None
special_tokens_mask = model_outputs["special_tokens_mask"][0].numpy()

scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
maxes = np.max(logits, axis=-1, keepdims=True)
Contributor Author


logits trick
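For reference, this refers to the standard max-subtraction trick for a numerically stable softmax; a minimal sketch:

import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    # Subtracting the row-wise max before exponentiating avoids overflow on large
    # logits and does not change the result (softmax is shift-invariant).
    maxes = np.max(logits, axis=-1, keepdims=True)
    shifted_exp = np.exp(logits - maxes)
    return shifted_exp / shifted_exp.sum(axis=-1, keepdims=True)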

Collaborator


Thanks for adding that

filename = dataset[0]["file"]
output = audio_classifier(filename)
audio = dataset[0]["audio"]["array"]
output = audio_classifier(audio)
Contributor Author


We're not relying on a filename anymore since the tests don't run ffmpeg anymore.

Member

@LysandreJik LysandreJik left a comment


This looks good to me, thank you @Narsil. If it works for you, I would like for this PR to be merged after the v4.12.0 release (tomorrow Thursday) so that it gets a bit of testing on master before being set in stone.

Narsil and others added 14 commits October 29, 2021 11:01
@Narsil Narsil force-pushed the pipeline_batch_size_support branch from 09a1db8 to 5cd831c Compare October 29, 2021 09:01
Contributor Author

Narsil commented Oct 29, 2021

Release done, merging.

@Narsil Narsil merged commit be23636 into huggingface:master Oct 29, 2021
@Narsil Narsil deleted the pipeline_batch_size_support branch October 29, 2021 09:34
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
…3724)

* Tentative enabling of `batch_size` for pipelines.

* Add systematic test for pipeline batching.

* Enabling batch_size on almost all pipelines

- Not `zero-shot` (it's already passing stuff as batched so trickier)
- Not `QA` (preprocess uses squad features, we need to switch to real
tensors at this boundary.

* Adding `min_length_for_response` for conversational.

* Making CTC, speech mappings avaiable regardless of framework.

* Attempt at fixing automatic tests (ffmpeg not enabled for fast tests)

* Removing ffmpeg dependency in tests.

* Small fixes.

* Slight cleanup.

* Adding docs

and adressing comments.

* Quality.

* Update docs/source/main_classes/pipelines.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines/question_answering.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines/zero_shot_classification.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Improving docs.

* Update docs/source/main_classes/pipelines.rst

Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>

* N -> oberved_batch_size

softmax trick.

* Follow `padding_side`.

* Supporting image pipeline batching (and padding).

* Rename `unbatch` -> `loader_batch`.

* unbatch_size forgot.

* Custom padding for offset mappings.

* Attempt to remove librosa.

* Adding require_audio.

* torchaudio.

* Back to using datasets librosa.

* Adding help to set a pad_token on the tokenizer.

* Update src/transformers/pipelines/base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines/base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/pipelines/base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Quality.

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>