
Optimum-NVIDIA #1316

Closed
SinanAkkoyun opened this issue Dec 6, 2023 · 8 comments

@SinanAkkoyun

Feature request

This was just released; it would be awesome to see it supported here.

https://huggingface.co/blog/optimum-nvidia
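For context, the blog pitches optimum-nvidia as a near drop-in replacement for the transformers `AutoModelForCausalLM`. A minimal sketch of that usage, following the blog's example (the `use_fp8` flag and the `generate()` return format come from the blog and may differ between releases):

```python
# Sketch of the drop-in swap shown in the optimum-nvidia blog post linked above.
# The use_fp8 flag follows the blog; the generate() return format may differ
# from vanilla transformers depending on the optimum-nvidia release.
from optimum.nvidia import AutoModelForCausalLM  # swapped in for the transformers class
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # model family used in the blog's example
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_fp8=True)

inputs = tokenizer("Optimum-NVIDIA is", return_tensors="pt").to("cuda")
generated = model.generate(**inputs, max_new_tokens=64)
```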

Motivation

Faster latencies and higher throughput

Your contribution

/

@RonanKMcGovern

Yes, +1 to this.

As I understand it, supporting Llama models would be straightforward, @SunMarc @younesbelkada?

@tim-a-davis

+1
Adding this in as an option seems like a no-brainer. The optimum-nvidia team has said they plan to support more models soon.

@github-actions

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jan 10, 2024
@Narsil
Collaborator

Narsil commented Jan 10, 2024

unstale.

We'll see if we can leverage it in our default, regular transformers branch, but it won't work with flash attention or paged attention, leading to suboptimal overall performance in the server.

In general, TGI already provides much better performance than what the blog is comparing against. We haven't benchmarked the two against each other because they serve different purposes, so there's no real point.

We're still working on enabling everything TRT-llm provides (mostly FP8, I think) to get the exact same performance, WITH all the features they do not provide (speculative decoding, Prometheus metrics, a simple Docker interface, no preprocessing of the models).

Faster latencies and higher throughput

We already have that. Mileage may vary on actual setups, but TGI was 2x faster (latency) than TRT-llm on some 7b models I was testing the other day. There are probably setups where they are faster (FP8 for sure), and we'll catch up on those.

If you have a specific use case where library X is doing better than TGI on hardware Y, and you can provide anything to replicate your numbers, that will go a long way toward making it easy for us to prioritize a given combo. Also, please always specify whether the measurement is LATENCY or THROUGHPUT; the two are orthogonal, and it's easy to get confused about what "faster" means.
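
As a minimal sketch of such a report, assuming a TGI instance listening on the default local port, single-request latency can be probed via the `/generate` endpoint (the URL, prompt, and token count are placeholders, and the tokens/s figure assumes the request actually generates `max_new_tokens` tokens):

```python
# Rough single-request latency probe against a local TGI instance.
# URL, prompt, and token count are placeholders; decode speed is approximated
# as max_new_tokens / wall time, assuming no early stop.
import time
import requests

TGI_URL = "http://localhost:8080/generate"  # adjust to your deployment
MAX_NEW_TOKENS = 128
payload = {
    "inputs": "Explain KV caching in one paragraph.",
    "parameters": {"max_new_tokens": MAX_NEW_TOKENS},
}

start = time.perf_counter()
requests.post(TGI_URL, json=payload, timeout=120).raise_for_status()
elapsed = time.perf_counter() - start

print(f"wall time: {elapsed:.2f}s")
print(f"~single-request decode speed (latency view): {MAX_NEW_TOKENS / elapsed:.1f} tokens/s")
```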

@github-actions github-actions bot removed the Stale label Jan 11, 2024
@tim-a-davis

tim-a-davis commented Jan 19, 2024

Hey @Narsil, thanks for the reply. As far as throughput goes, though, the Hugging Face blog claims speeds of 1,200 tokens/second on 7-billion-parameter models. I am running Zephyr and other Mistral/Llama 2-based models on my H100s using TGI, and the highest throughput I've seen for a single request is 130 tokens/second. Is this your experience as well?
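
The blog's headline number may well be aggregate batched throughput rather than per-request decode speed; one rough way to see the difference against a running TGI instance is to fire several concurrent requests and compare total versus per-request tokens/s (the endpoint, prompt, and request count below are placeholders):

```python
# Rough aggregate-throughput probe: N concurrent requests against a local TGI
# instance. With continuous batching, per-request decode speed typically stays
# roughly flat while total tokens/s grows with the number of in-flight requests.
import concurrent.futures
import time
import requests

TGI_URL = "http://localhost:8080/generate"  # placeholder endpoint
N_REQUESTS = 16
MAX_NEW_TOKENS = 128

def one_request(_):
    payload = {
        "inputs": "Write a short poem about GPUs.",
        "parameters": {"max_new_tokens": MAX_NEW_TOKENS},
    }
    t0 = time.perf_counter()
    requests.post(TGI_URL, json=payload, timeout=300).raise_for_status()
    return time.perf_counter() - t0

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    per_request_times = list(pool.map(one_request, range(N_REQUESTS)))
total = time.perf_counter() - start

avg_time = sum(per_request_times) / N_REQUESTS
print(f"per-request: ~{MAX_NEW_TOKENS / avg_time:.0f} tokens/s")
print(f"aggregate:   ~{N_REQUESTS * MAX_NEW_TOKENS / total:.0f} tokens/s")
```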

@github-actions github-actions bot added Stale and removed Stale labels Mar 31, 2024

github-actions bot commented May 1, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 1, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 7, 2024