Optimum-NVIDIA #1316
Comments
Yes, +1 to this. As I understand it, supporting Llama models would be straightforward, @SunMarc @younesbelkada?
+1
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Unstale. We'll see if we can leverage it in our default regular transformers branch, but it won't work with flash attention or paged attention, leading to suboptimal performance in the server overall. In general, TGI already provides much better performance than what the blog is comparing against. We haven't run benchmarks against each other because they serve different purposes, so there's no real point. We're still working on enabling everything TRT-LLM provides (mostly FP8, I think) to get the exact same performance, WITH all the features they do not provide (speculative decoding, Prometheus metrics, a simple Docker interface, no preprocessing of the models).
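As a concrete illustration of the Prometheus metrics mentioned above, TGI serves Prometheus-format metrics over HTTP. Here is a minimal sketch of scraping them, assuming a local deployment on port 8080, the `/metrics` path, and `tgi_`-prefixed metric names (all assumptions about the deployment, not a guaranteed API):

```python
import requests

# Assumed local TGI instance; adjust host/port for your deployment.
TGI_BASE = "http://localhost:8080"

# TGI exposes Prometheus-format metrics alongside its generation API.
resp = requests.get(f"{TGI_BASE}/metrics", timeout=10)
resp.raise_for_status()

# Keep only TGI's own metrics (assumed "tgi_" prefix), skipping the
# "# HELP" / "# TYPE" comment lines in the Prometheus text format.
for line in resp.text.splitlines():
    if line.startswith("tgi_"):
        print(line)
```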
We already have that. Mileage may vary on actual setups, but TGI was 2x faster (latency) than TRT-LLM on some 7B models I was testing the other day. There are probably setups where they are faster (FP8 for sure), and we'll catch up on those. If you have a specific use case where library X is doing better than TGI on hardware Y, and you can provide anything to replicate your numbers, that will go a long way toward making it easy for us to prioritize a given combo. Also, please always specify whether we're talking about LATENCY or THROUGHPUT in what is measured; since the two are orthogonal, it's easy to get confused about what "faster" means.
Hey @Narsil, thanks for the reply. As far as throughput goes, though, the Hugging Face blog claims speeds of 1,200 tokens/second on 7-billion-parameter models. I am running Zephyr and other Mistral/Llama-2-based models on my H100s using TGI, and the highest throughput I've seen for a single request is 130 tokens/second. Is this your experience as well?
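To make the LATENCY vs. THROUGHPUT distinction above concrete, here is a minimal benchmarking sketch against a TGI `/generate` endpoint: it measures tokens/second for a single request stream, then the aggregate across concurrent streams, which continuous batching can push far above the single-stream figure. The URL, prompt, concurrency level, and the simplification that each request generates the full `max_new_tokens` are illustrative assumptions, not a definitive harness:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TGI_URL = "http://localhost:8080/generate"  # assumed local TGI endpoint
MAX_NEW_TOKENS = 128  # approximation: generation may stop early at EOS

def one_request() -> float:
    """Send one generation request and return its wall-clock duration."""
    payload = {
        "inputs": "Explain paged attention in one paragraph.",
        "parameters": {"max_new_tokens": MAX_NEW_TOKENS},
    }
    start = time.perf_counter()
    requests.post(TGI_URL, json=payload, timeout=120).raise_for_status()
    return time.perf_counter() - start

# LATENCY view: tokens/second as seen by a single request stream.
single = one_request()
print(f"single stream: {MAX_NEW_TOKENS / single:.1f} tok/s")

# THROUGHPUT view: aggregate tokens/second across concurrent streams.
n_streams = 32
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=n_streams) as pool:
    list(pool.map(lambda _: one_request(), range(n_streams)))
elapsed = time.perf_counter() - start
print(f"aggregate ({n_streams} streams): "
      f"{n_streams * MAX_NEW_TOKENS / elapsed:.1f} tok/s")
```

Note that the same server can show a modest per-stream rate alongside a much higher aggregate rate, which is likely the gap between the two figures discussed in this thread.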
This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
Feature request
This was just released; it would be awesome to see it supported here:
https://huggingface.co/blog/optimum-nvidia
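For reference, the blog pitches optimum-nvidia as a near drop-in replacement for transformers. A minimal sketch of the advertised usage follows; the model ID is an example, and the `use_fp8` flag is taken from the blog post, so treat the exact API as subject to change:

```python
from transformers import AutoTokenizer

# Advertised drop-in swap: optimum.nvidia's AutoModelForCausalLM
# in place of transformers'.
from optimum.nvidia import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model from the blog
tokenizer = AutoTokenizer.from_pretrained(model_id)

# use_fp8=True enables the FP8 path highlighted in the blog's benchmarks.
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer("What is paged attention?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```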
Motivation
Faster latencies and higher throughput
Your contribution
/