
Batch generate? #1008

Closed · 1 of 4 tasks
wj210 opened this issue Sep 11, 2023 · 5 comments

Comments


wj210 commented Sep 11, 2023

System Info

Hi, I'd like to ask if it is possible to do batch generation?

client = Client("http://127.0.0.1:8081", timeout=60)
gen_t = client.generate(batch_text, max_new_tokens=64)

generate only accepts a single str. Is there any way to do batch inference, or is TGI only meant for single queries?

Thanks!

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

shown above
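
For reference, a minimal self-contained version of the snippet above. The text_generation import is an assumption about how Client was obtained, and batch_text is assumed to be a list of prompt strings:

from text_generation import Client  # assumed import for the TGI Python client

client = Client("http://127.0.0.1:8081", timeout=60)
batch_text = ["prompt one", "prompt two"]  # assumed: a list of prompt strings
# generate() only accepts a single str, so passing the whole list is not
# treated as a batch and fails:
gen_t = client.generate(batch_text, max_new_tokens=64)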

Expected behavior

Single inference only, instead of batch generation.


LopezGG commented Sep 22, 2023

I see a lot of code when I search for batch, but I can't find a way to do it. I thought that if I sent async calls, the server would batch them automatically, but the runtime for 1200 requests did not decrease, so that obviously wasn't the way.
I also tried setting
CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --model-id {model_path} --quantize gptq --sharded true --max-input-length {max_input_length} --max-total-tokens {max_total_tokens} --max-batch-prefill-tokens {max_batch_prefill_tokens}
where
max_batch_prefill_tokens = max_total_tokens*10

I tried this with both AsyncClient and Client; there was no difference in generation time.
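
One possible reason for seeing no speedup is that the requests were not actually in flight at the same time: awaiting each AsyncClient call one after another is still sequential, so the server only ever sees one request at a time to batch (see Narsil's note on continuous batching below). Here is a minimal sketch of issuing the requests concurrently, assuming the text_generation AsyncClient and the same server URL and parameters as above:

import asyncio
from text_generation import AsyncClient  # assumed import for the async TGI client

async def batch_generate(prompts):
    client = AsyncClient("http://127.0.0.1:8081", timeout=60)
    # fire all requests at once so the server's continuous batching can group them
    tasks = [client.generate(p, max_new_tokens=64) for p in prompts]
    responses = await asyncio.gather(*tasks)
    return [r.generated_text for r in responses]

out = asyncio.run(batch_generate(batch_text))  # batch_text: list of prompt strings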


LopezGG commented Sep 22, 2023

A lot of folks talk about batch_size in the issues, but I am not sure batching works; it would be good to have some documentation on it. In the meantime, based on #736 (comment), it looks like sending a list of prompts as a batch is not supported.


wj210 commented Sep 22, 2023

I did try async calls and it was faster based on my implementation.

import concurrent.futures

def gen_text(text):
    return client.generate(text, max_new_tokens=128).generated_text

with concurrent.futures.ThreadPoolExecutor(max_workers=bs) as executor:
    out = list(executor.map(gen_text, batch_text))

Did you do something similar?


LopezGG commented Sep 22, 2023

Thanks @wj210. This brought down one of my runtimes from 1 hour 22 min to 30 min :)

Narsil (Collaborator) commented Oct 2, 2023

TGI runs continuous batching: https://github.com/huggingface/text-generation-inference/tree/main/router#simple-continuous-batching

Therefore, sending multiple requests concurrently can leverage it. We do not plan to support sending lists of requests (it messes up return types quite a bit and doesn't unlock any especially valuable speedups).

Narsil closed this as completed on Oct 2, 2023.