The quantized llama-3-8b-instruct-awq with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct with TGI 1.4 on the same RTX 3090 with 24GB VRAM.
#1856
System Info
Test with llama-3-8b-instruct (around 16 GB in size) with TGI 1.4 on an RTX 3090 with 24 GB VRAM:
Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 21.1 GB, with max batch total tokens set to 38928
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 22.5 GB
VRAM usage after batched inference, 24 x (prompt 512, decode 512): 23.8 GB
So it can support 24 x (prompt 512, decode 512).
Test with llama-3-8b-instruct-awq (around 6 GB in size) with TGI 1.4 on an RTX 3090 with 24 GB VRAM:
Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 19.2 GB, with max batch total tokens set to 104384
VRAM usage after batched inference, 8 x (prompt 512, decode 512): 23.2 GB
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 24.5 GB
So it can only support 8 x (prompt 512, decode 512).
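For reference, a quick arithmetic check (a plain Python sketch using only the numbers reported above) shows that both of these batch configurations sit well inside the token budgets TGI announced at load time, suggesting the AWQ limit at 16 requests is not the token budget itself:

```python
# Compare each batch's total token count (prompt + decode per request)
# with the "max batch total tokens" value TGI set at load time.
PROMPT, DECODE = 512, 512

for label, batch_size, budget in [
    ("fp16 at 24 requests", 24, 38928),
    ("awq  at 16 requests", 16, 104384),
]:
    total = batch_size * (PROMPT + DECODE)
    print(f"{label}: {total} of {budget} budgeted tokens")

# fp16 at 24 requests: 24576 of 38928 budgeted tokens
# awq  at 16 requests: 16384 of 104384 budgeted tokens
```

The AWQ run at 16 requests uses barely 16% of its announced budget, which makes the roughly 5 GB of VRAM growth during inference the puzzling part.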
The questions are:
1) Why does TGI reserve a significant amount of VRAM, and what is it used for?
2) Why does VRAM usage keep growing during inference, even after TGI has already reserved that large block of VRAM?
It's hard to believe that the quantized model, which is almost 10 GB smaller than the standard Llama 3 8B, can handle far fewer batched requests on the same hardware.
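One plausible (unconfirmed) explanation for question 1: TGI pre-allocates the paged KV cache from whatever VRAM is left after the weights are loaded. The back-of-envelope sketch below checks the reported numbers against that assumption, using the published Llama-3-8B dimensions (32 layers, 8 KV heads via GQA, head dim 128) and an fp16 cache; note that AWQ quantizes only the weights, so the per-token cache cost should be the same in both runs.

```python
# Back-of-envelope: does "VRAM after load minus weight size" match an fp16
# KV cache sized for the announced max batch total tokens?
# Model dimensions are the published Llama-3-8B config; treating the whole
# reservation as KV cache is an assumption, not confirmed TGI internals.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 32, 8, 128, 2

# K and V for one token, across all layers.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES  # 131072 B

for label, max_tokens, loaded_gb, weights_gb in [
    ("fp16", 38928, 21.1, 16.0),
    ("awq", 104384, 19.2, 6.0),
]:
    implied = max_tokens * kv_bytes_per_token / 1e9
    print(f"{label}: implied cache {implied:.1f} GB, "
          f"observed reservation {loaded_gb - weights_gb:.1f} GB")

# fp16: implied cache 5.1 GB, observed reservation 5.1 GB
# awq: implied cache 13.7 GB, observed reservation 13.2 GB
```

If that reading is right, the AWQ run simply reserves a much larger cache up front, which would explain the load-time numbers but not why usage keeps climbing past the reservation during inference (question 2).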
Information
Tasks
Reproduction
model=casperhansen/llama-3-8b-instruct-awq
text-generation-launcher --quantize awq --model-id $model
model=meta-llama/Meta-Llama-3-8B-Instruct
text-generation-launcher --model-id $model
text-generation-benchmark --tokenizer-name $model --batch-size 1 --batch-size 2 --batch-size 4 --batch-size 8 --batch-size 16 --batch-size 24 --sequence-length 512 --decode-length 512
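Not part of the repro commands themselves, but for completeness: VRAM figures like those quoted above can be sampled while the benchmark runs, e.g. with a small polling sketch like this (assumes nvidia-smi is on the PATH):

```python
import subprocess
import time

# Print the GPU's used memory once per second; run alongside
# text-generation-benchmark and stop with Ctrl-C.
while True:
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    ).stdout.strip()
    print(used)
    time.sleep(1)
```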
Expected behavior
Given its smaller size (6 GB vs. 16 GB), the quantized model should be able to support larger batches of requests than the standard Llama 3 8B.