Optimum-NVIDIA #1316
Comments
Yes, +1 to this. As I understand it, supporting Llama models would be straightforward, @SunMarc @younesbelkada?
+1
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Unstale. We'll see if we can leverage it in our default regular transformers branch, but it won't work with flash attention or paged attention, leading to suboptimal performance in the server overall. In general, TGI already provides much better performance than what the blog is comparing against. We haven't run benchmarks against each other because they serve different purposes, so there's no real point. We're still working on enabling everything TRT-LLM provides (mostly FP8, I think) to get the exact same performance, WITH all the features they do not provide (speculative decoding, Prometheus metrics, a simple Docker interface, no preprocessing of the models).
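As a concrete illustration of the Prometheus metrics mentioned above, TGI serves Prometheus-format metrics over HTTP. Here is a minimal sketch of scraping them, assuming a local deployment on port 8080, the `/metrics` path, and `tgi_`-prefixed metric names (all assumptions about the deployment, not a guaranteed API):

```python
import requests

# Assumed local TGI instance; adjust host/port for your deployment.
TGI_BASE = "http://localhost:8080"

# TGI exposes Prometheus-format metrics alongside its generation API.
resp = requests.get(f"{TGI_BASE}/metrics", timeout=10)
resp.raise_for_status()

# Keep only TGI's own metrics (assumed "tgi_" prefix), skipping the
# "# HELP" / "# TYPE" comment lines in the Prometheus text format.
for line in resp.text.splitlines():
    if line.startswith("tgi_"):
        print(line)
```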
We already have that. Mileage may vary on actual setups, but TGI was 2x faster (latency) than TRT-LLM on some 7B models I was testing the other day. There are probably setups where they are faster (FP8 for sure), and we'll catch up on those. If you have a specific use case where library X is doing better than TGI on hardware Y, and you can provide anything to replicate your numbers, that will go a long way toward making it easy for us to prioritize a given combo. Also, please always specify whether we're talking about LATENCY or THROUGHPUT in what is measured; since the two are orthogonal, it's easy to get confused about what "faster" means.
Hey @Narsil, thanks for the reply. As far as throughput goes, though, the Hugging Face blog claims speeds of 1,200 tokens/second on 7-billion-parameter models. I am running Zephyr and other Mistral/Llama-2-based models on my H100s using TGI, and the highest throughput I've seen for a single request is 130 tokens/second. Is this your experience as well?
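To make the LATENCY vs. THROUGHPUT distinction above concrete, here is a minimal benchmarking sketch against a TGI `/generate` endpoint: it measures tokens/second for a single request stream, then the aggregate across concurrent streams, which continuous batching can push far above the single-stream figure. The URL, prompt, concurrency level, and the simplification that each request generates the full `max_new_tokens` are illustrative assumptions, not a definitive harness:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TGI_URL = "http://localhost:8080/generate"  # assumed local TGI endpoint
MAX_NEW_TOKENS = 128  # approximation: generation may stop early at EOS

def one_request() -> float:
    """Send one generation request and return its wall-clock duration."""
    payload = {
        "inputs": "Explain paged attention in one paragraph.",
        "parameters": {"max_new_tokens": MAX_NEW_TOKENS},
    }
    start = time.perf_counter()
    requests.post(TGI_URL, json=payload, timeout=120).raise_for_status()
    return time.perf_counter() - start

# LATENCY view: tokens/second as seen by a single request stream.
single = one_request()
print(f"single stream: {MAX_NEW_TOKENS / single:.1f} tok/s")

# THROUGHPUT view: aggregate tokens/second across concurrent streams.
n_streams = 32
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=n_streams) as pool:
    list(pool.map(lambda _: one_request(), range(n_streams)))
elapsed = time.perf_counter() - start
print(f"aggregate ({n_streams} streams): "
      f"{n_streams * MAX_NEW_TOKENS / elapsed:.1f} tok/s")
```

Note that the same server can show a modest per-stream rate alongside a much higher aggregate rate, which is likely the gap between the two figures discussed in this thread.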
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Feature request
This was just released; it would be awesome to see it supported here:
https://huggingface.co/blog/optimum-nvidia
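For reference, the blog pitches optimum-nvidia as a near drop-in replacement for transformers. A minimal sketch of the advertised usage follows; the model ID is an example, and the `use_fp8` flag is taken from the blog post, so treat the exact API as subject to change:

```python
from transformers import AutoTokenizer

# Advertised drop-in swap: optimum.nvidia's AutoModelForCausalLM
# in place of transformers'.
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model from the blog
tokenizer = AutoTokenizer.from_pretrained(model_id)

# use_fp8=True enables the FP8 path highlighted in the blog's benchmarks.
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer("What is paged attention?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```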
Motivation
Faster latencies and higher throughput
Your contribution
/