
How to perform batch inference? #26061

Closed
ryanshrott opened this issue Sep 8, 2023 · 6 comments · Fixed by #26937

Comments

@ryanshrott

Feature request

I want to pass a list of texts to model.generate. Right now I only know how to generate from a single input:

text = "hey there"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=184)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Motivation

I want to do batch inference.

Your contribution

Testing

@NielsRogge
Contributor

I opened #24432 to illustrate this for GPT-2, but it will be incorporated into a bigger PR.

cc @gante @stevhliu
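
For reference, here is a minimal sketch of what left-padded batched generation with GPT-2 looks like (this is not the exact code from #24432; the prompts and generation settings below are placeholders):

# Minimal sketch of left-padded batched generation with GPT-2 (placeholder prompts/settings).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["Hello, my name is", "The capital of France is"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

# The attention mask produced by the tokenizer tells generate() which positions are padding.
outputs = model.generate(**inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Left padding matters for decoder-only models: with right padding, the generated continuation would be appended after the pad tokens rather than after the prompt.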

@NielsRogge
Contributor

See #24575

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@gante
Member

gante commented Oct 19, 2023

Hey there @ryanshrott @NielsRogge 👋

In this PR I've added a short section on batched generation to our basic LLM tutorial page.

Taken from the updated guide, here's an example:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained(
...     "mistralai/Mistral-7B-v0.1", device_map="auto", load_in_4bit=True
... )
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", padding_side="left")
>>> tokenizer.pad_token = tokenizer.eos_token  # Most LLMs don't have a pad token by default
>>> model_inputs = tokenizer(
...     ["A list of colors: red, blue", "Portugal is"], return_tensors="pt", padding=True
... ).to("cuda")
>>> generated_ids = model.generate(**model_inputs)
>>> tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
['A list of colors: red, blue, green, yellow, orange, purple, pink,',
'Portugal is a country in southwestern Europe, on the Iber']

@ryanshrott
Author

ryanshrott commented Oct 19, 2023

@gante Thanks. Is this faster than running them in a loop?

@gante
Member

gante commented Oct 23, 2023

@ryanshrott yes, much faster when measured in throughput! The caveat is that it requires slightly more memory from your hardware, and each individual request will have slightly higher latency.
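
If you want to see the difference on your own hardware, here is a rough sketch of a loop-vs-batch comparison (it assumes model and tokenizer are already loaded as in the example above, with left padding and a pad token set; the prompt list and token budget are placeholders):

# Rough loop-vs-batch timing sketch; assumes `model` and `tokenizer` are loaded as above
# (left padding, pad token set). Prompts and max_new_tokens are placeholders.
import time

prompts = ["A list of colors: red, blue", "Portugal is"] * 8  # 16 prompts

# One prompt at a time
start = time.perf_counter()
for prompt in prompts:
    single = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**single, max_new_tokens=32)
loop_seconds = time.perf_counter() - start

# All prompts in a single padded batch
start = time.perf_counter()
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
model.generate(**batch, max_new_tokens=32)
batch_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.1f}s  batch: {batch_seconds:.1f}s")

The batched call processes all prompts in parallel, so total wall-clock time drops even though each individual prompt may take slightly longer than it would alone.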
