
Batch generate? #1008

Closed · 1 of 4 tasks
wj210 opened this issue Sep 11, 2023 · 5 comments

Comments


wj210 commented Sep 11, 2023

System Info

Hi, I'd like to ask if it is possible to do batch generation?

client = Client("http://127.0.0.1:8081", timeout=60)
gen_t = client.generate(batch_text, max_new_tokens=64)

generate only accepts a single str. Is there any way to do batch inference, or is TGI only meant for single queries?

Thanks!

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

shown above
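
For reference, a minimal self-contained version of the snippet above. The text_generation import is an assumption about how Client was obtained, and batch_text is assumed to be a list of prompt strings:

from text_generation import Client  # assumed import for the TGI Python client

client = Client("http://127.0.0.1:8081", timeout=60)
batch_text = ["prompt one", "prompt two"]  # assumed: a list of prompt strings
# generate() only accepts a single str, so passing the whole list is not
# treated as a batch and fails:
gen_t = client.generate(batch_text, max_new_tokens=64)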

Expected behavior

Single inference only, instead of batch generation.


LopezGG commented Sep 22, 2023

I see a lot of code when I search for batch, but I can't find a way to do it. I thought that if I sent async calls, the server would batch them automatically, but the runtime for 1200 requests did not decrease, so that obviously wasn't the way.
I also tried setting
CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --model-id {model_path} --quantize gptq --sharded true --max-input-length {max_input_length} --max-total-tokens {max_total_tokens} --max-batch-prefill-tokens {max_batch_prefill_tokens}
where
max_batch_prefill_tokens = max_total_tokens*10

I tried this with both AsyncClient and Client; there was no difference in generation time.
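
One possible reason for seeing no speedup is that the requests were not actually in flight at the same time: awaiting each AsyncClient call one after another is still sequential, so the server only ever sees one request at a time to batch (see Narsil's note on continuous batching below). Here is a minimal sketch of issuing the requests concurrently, assuming the text_generation AsyncClient and the same server URL and parameters as above:

import asyncio
from text_generation import AsyncClient  # assumed import for the async TGI client

async def batch_generate(prompts):
    client = AsyncClient("http://127.0.0.1:8081", timeout=60)
    # fire all requests at once so the server's continuous batching can group them
    tasks = [client.generate(p, max_new_tokens=64) for p in prompts]
    responses = await asyncio.gather(*tasks)
    return [r.generated_text for r in responses]

out = asyncio.run(batch_generate(batch_text))  # batch_text: list of prompt strings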


LopezGG commented Sep 22, 2023

A lot of folks talk about batch_size in the issues, but I am not sure batching works; it would be good to have some documentation on it. In the meantime, based on #736 (comment), it looks like sending a list of prompts as a batch is not supported.


wj210 commented Sep 22, 2023

I did try async calls and it was faster based on my implementation.

import concurrent.futures

def gen_text(text):
    return client.generate(text, max_new_tokens=128).generated_text

with concurrent.futures.ThreadPoolExecutor(max_workers=bs) as executor:
    out = list(executor.map(gen_text, batch_text))

Did you do something similar?


LopezGG commented Sep 22, 2023

Thanks @wj210. This brought down one of my runtimes from 1 hour 22 min to 30 min :)

Narsil (Collaborator) commented Oct 2, 2023

TGI runs continuous batching: https://github.com/huggingface/text-generation-inference/tree/main/router#simple-continuous-batching

Therefore, sending multiple requests concurrently can leverage it. We do not plan to support sending lists of requests (it messes up return types quite a bit and doesn't unlock any especially valuable speedups).

Narsil closed this as completed on Oct 2, 2023.