Adding support for CTranslate2 acceleration #496
Comments
The batch size of 2 seems very small.
Feel free to try your own example:

```python
if __name__ == "__main__":
    import timeit

    for framework in Framework:
        start = timeit.default_timer()
        model = CodeCompletionModel(framework=framework)
        end = timeit.default_timer()
        model.predict_batch(["warmup"])
        # 15 prompts with an increasing number of tokens
        inp = [("def " * i) for i in range(1, 16)]
        t = timeit.timeit("model.predict_batch(inp)", globals=locals(), number=10) / 10
        print("framework:", framework, "mean time per batch:", t, "s, model load time was", end - start, "s")
```

e.g. with batch size 64 and increasing prompt sizes:

```
framework: Framework.vllm mean time per batch: 0.9124439172912389 s, model load time was 6.033201194833964 s
framework: Framework.ctranslate2 mean time per batch: 0.5082830318249763 s, model load time was 7.9577884785830975 s
```
@michaelfeil Hi, I was working on a CTranslate2 version of WizardCoder, and the latency with max_new_tokens=128 is a lot higher (about 4 s) compared to yours. May I ask what machine you are using, or did you modify anything to make it faster? In my experience, the mean time of your example should grow roughly 4x to about 2 s with max_new_tokens=128, which is still 2x faster than my results.
Yung-Kai The above uses santacoder (1.1B params) with 32 tokens. What's the relative speed difference across all frameworks?
@michaelfeil Thanks for your reply. I got around 16 s on HF and 5 s with 4-bit quantization using AutoGPTQ.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Feature request
The feature would be to support accelerated inference with the CTranslate2 framework
https://github.com/OpenNMT/CTranslate2
Motivation
Reasons to use CTranslate2
Faster float16 generation
In my case, it outperforms vLLM (#478) by a factor of ~1.8x and Transformers by ~1.9x.
This was measured for gpt_bigcode_starcoder with an input length of 10, batch_size=2, and max_new_tokens=32.
Even faster int8 quantization
Wide range of model support
This could be a new "best effort" option: https://opennmt.net/CTranslate2/guides/transformers.html (see the conversion sketch below).
Stability
Compared to other frameworks, it has cross-platform unit tests.
It is the backend of LibreTranslate https://github.com/LibreTranslate/LibreTranslate
Streaming Support
https://opennmt.net/CTranslate2/generation.html?highlight=stream
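To make the above concrete, here is a minimal sketch of converting a Transformers checkpoint and generating from it with CTranslate2, in both batch and streaming mode. The model id, output directory, quantization mode, and the use of `generate_tokens` are assumptions based on the linked guides, not a tested configuration from this issue.

```python
import ctranslate2
import transformers

# Convert a Hugging Face checkpoint to the CTranslate2 format
# (model id, output directory, and quantization mode are example choices).
converter = ctranslate2.converters.TransformersConverter("bigcode/gpt_bigcode-santacoder")
converter.convert("santacoder-ct2", quantization="float16")  # "int8_float16" enables int8 weights

tokenizer = transformers.AutoTokenizer.from_pretrained("bigcode/gpt_bigcode-santacoder")
generator = ctranslate2.Generator("santacoder-ct2", device="cuda")

# CTranslate2 generators consume token strings rather than ids.
prompt = tokenizer.convert_ids_to_tokens(tokenizer.encode("def hello_world"))

# Batch generation: one call, full sequences back.
results = generator.generate_batch([prompt], max_length=32, include_prompt_in_result=False)
print(tokenizer.decode(results[0].sequences_ids[0]))

# Streaming: tokens are yielded one at a time as they are generated,
# so a server can forward them to the client immediately.
for step in generator.generate_tokens(prompt, max_length=32):
    print(step.token, end="", flush=True)  # raw subwords; detokenize incrementally in practice
print()
```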
Your contribution
Here is an example script comparing the three frameworks with ray[serve] (as a replacement for the dynamic batching in TGI). With the code below, the timings are measured for the input "def hello_world" with 32 new tokens, taking the median time of 10 runs.
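As a rough sketch of what dynamic batching with ray[serve] around a CTranslate2 generator can look like (the deployment class, model paths, and batching parameters are illustrative assumptions, not the author's script):

```python
import ctranslate2
import transformers
from ray import serve
from starlette.requests import Request


@serve.deployment(ray_actor_options={"num_gpus": 1})
class Completion:
    def __init__(self, model_dir: str = "santacoder-ct2"):
        self.generator = ctranslate2.Generator(model_dir, device="cuda")
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            "bigcode/gpt_bigcode-santacoder"
        )

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.01)
    async def predict_batch(self, prompts: list[str]) -> list[str]:
        # Requests arriving within the wait window are fused into a single batch.
        tokens = [
            self.tokenizer.convert_ids_to_tokens(self.tokenizer.encode(p))
            for p in prompts
        ]
        results = self.generator.generate_batch(
            tokens, max_length=32, include_prompt_in_result=False
        )
        return [self.tokenizer.decode(r.sequences_ids[0]) for r in results]

    async def __call__(self, request: Request) -> str:
        # Each HTTP request submits a single prompt; @serve.batch groups them.
        return await self.predict_batch((await request.json())["prompt"])


app = Completion.bind()
# serve.run(app)  # then POST {"prompt": "def hello_world"} to http://127.0.0.1:8000/
```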