Batch generate? #1008
Comments
I see a lot of code when I search for "batch", but I can't find a way to do it. I thought that if I sent async calls, the server would batch them automatically, but the runtime for 1200 requests did not decrease, so that obviously wasn't the way. I tried both AsyncClient and Client; no difference in generation time.
A lot of folks mention batch_size in the issues, but I am not sure batching works; it would be good to have some documentation on it. In the meantime, based on #736 (comment), it looks like sending a list of prompts as a batch is not supported.
I did try async calls, and it was faster in my implementation.
Did you do something similar?
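For reference, a minimal sketch of what such an async implementation might look like, using the `AsyncClient` from the `text_generation` package. The endpoint URL and `max_new_tokens` value are assumptions carried over from the snippet in the issue body, and `generate_all` is a hypothetical helper, not part of the library:

```python
import asyncio

from text_generation import AsyncClient

async def generate_all(prompts, url="http://127.0.0.1:8081"):
    # Hypothetical helper; URL and max_new_tokens mirror the issue's snippet.
    client = AsyncClient(url, timeout=60)
    # Fire all requests concurrently; TGI's continuous batching groups
    # whatever is in flight on the server side.
    tasks = [client.generate(p, max_new_tokens=64) for p in prompts]
    responses = await asyncio.gather(*tasks)
    return [r.generated_text for r in responses]

# e.g. outputs = asyncio.run(generate_all(batch_text))
```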
Thanks @wj210. This brought down one of my runtimes from 1 hour 22 minutes to 30 minutes :)
TGI runs continuous batching: https://github.com/huggingface/text-generation-inference/tree/main/router#simple-continuous-batching. Therefore, sending multiple requests concurrently can leverage it. We do not plan to support sending lists of requests (it messes up the return types quite a bit and doesn't unlock any especially valuable speedups).
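To make the mechanism concrete, here is a sketch of the same idea hitting the server's REST `/generate` route directly with `aiohttp` instead of the Python client. The URL and parameters are assumptions mirroring the snippet in the issue body; `generate_one` and `generate_many` are hypothetical helpers:

```python
import asyncio

import aiohttp

async def generate_one(session, prompt, url="http://127.0.0.1:8081/generate"):
    # TGI's REST API takes {"inputs": ..., "parameters": {...}}
    # and returns {"generated_text": ...}.
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()
        data = await resp.json()
        return data["generated_text"]

async def generate_many(prompts):
    # All requests are in flight at once, so the router can batch them.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(generate_one(session, p) for p in prompts))
```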
System Info
Hi, I'd like to ask: is it possible to do batch generation?
```python
from text_generation import Client

client = Client("http://127.0.0.1:8081", timeout=60)
# Fails when batch_text is a list: generate() accepts a single prompt string.
gen_t = client.generate(batch_text, max_new_tokens=64)
```
generate can only take a str. Is there any way I can do batch inference, or is TGI just meant for single queries?
Thanks!
Information
Tasks
Reproduction
Shown above, under System Info.
Expected behavior
Batch inference should be possible; currently only single inference is supported.