The quantized llama-3-8b-instruct-awq with TGI 1.4 can handle fewer batch requests than the standard llama-3-8b-instruct with TGI 1.4 on the same RTX 3090 with 24GB VRAM.
#1856
System Info
Test with llama-3-8b-instruct (around 16 GB in size) with TGI 1.4 on an RTX 3090 with 24 GB VRAM:
Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 21.1 GB, with max batch total tokens set to 38928
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 22.5 GB
VRAM usage after batched inference, 24 x (prompt 512, decode 512): 23.8 GB
So it can support 24 x (prompt 512, decode 512).
Test with llama-3-8b-instruct-awq (around 6 GB in size) with TGI 1.4 on an RTX 3090 with 24 GB VRAM:
Initial VRAM usage: 0.7 GB
VRAM usage after the model is loaded: 19.2 GB, with max batch total tokens set to 104384
VRAM usage after batched inference, 8 x (prompt 512, decode 512): 23.2 GB
VRAM usage after batched inference, 16 x (prompt 512, decode 512): 24.5 GB
So it can only support 8 x (prompt 512, decode 512).
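For reference, a quick arithmetic check (a plain Python sketch using only the numbers reported above) shows that both of these batch configurations sit well inside the token budgets TGI announced at load time, suggesting the AWQ limit at 16 requests is not the token budget itself:

```python
# Compare each batch's total token count (prompt + decode per request)
# with the "max batch total tokens" value TGI set at load time.
PROMPT, DECODE = 512, 512

for label, batch_size, budget in [
    ("fp16 at 24 requests", 24, 38928),
    ("awq  at 16 requests", 16, 104384),
]:
    total = batch_size * (PROMPT + DECODE)
    print(f"{label}: {total} of {budget} budgeted tokens")

# fp16 at 24 requests: 24576 of 38928 budgeted tokens
# awq  at 16 requests: 16384 of 104384 budgeted tokens
```

The AWQ run at 16 requests uses barely 16% of its announced budget, which makes the roughly 5 GB of VRAM growth during inference the puzzling part.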
The questions are:
1) Why does TGI reserve a significant amount of VRAM, and what is it used for?
2) Why does VRAM usage keep growing during inference, even after TGI has already reserved that large block of VRAM?
It's hard to believe that the quantized model, which is almost 10 GB smaller than the standard Llama 3 8B, can handle far fewer batched requests on the same hardware.
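One plausible (unconfirmed) explanation for question 1: TGI pre-allocates the paged KV cache from whatever VRAM is left after the weights are loaded. The back-of-envelope sketch below checks the reported numbers against that assumption, using the published Llama-3-8B dimensions (32 layers, 8 KV heads via GQA, head dim 128) and an fp16 cache; note that AWQ quantizes only the weights, so the per-token cache cost should be the same in both runs.

```python
# Back-of-envelope: does "VRAM after load minus weight size" match an fp16
# KV cache sized for the announced max batch total tokens?
# Model dimensions are the published Llama-3-8B config; treating the whole
# reservation as KV cache is an assumption, not confirmed TGI internals.
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 32, 8, 128, 2

# K and V for one token, across all layers.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES  # 131072 B

for label, max_tokens, loaded_gb, weights_gb in [
    ("fp16", 38928, 21.1, 16.0),
    ("awq", 104384, 19.2, 6.0),
]:
    implied = max_tokens * kv_bytes_per_token / 1e9
    print(f"{label}: implied cache {implied:.1f} GB, "
          f"observed reservation {loaded_gb - weights_gb:.1f} GB")

# fp16: implied cache 5.1 GB, observed reservation 5.1 GB
# awq: implied cache 13.7 GB, observed reservation 13.2 GB
```

If that reading is right, the AWQ run simply reserves a much larger cache up front, which would explain the load-time numbers but not why usage keeps climbing past the reservation during inference (question 2).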
Information
Tasks
Reproduction
model=casperhansen/llama-3-8b-instruct-awq
text-generation-launcher --quantize awq --model-id $model
model=meta-llama/Meta-Llama-3-8B-Instruct
text-generation-launcher --model-id $model
text-generation-benchmark --tokenizer-name $model --batch-size 1 --batch-size 2 --batch-size 4 --batch-size 8 --batch-size 16 --batch-size 24 --sequence-length 512 --decode-length 512
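Not part of the repro commands themselves, but for completeness: VRAM figures like those quoted above can be sampled while the benchmark runs, e.g. with a small polling sketch like this (assumes nvidia-smi is on the PATH):

```python
import subprocess
import time

# Print the GPU's used memory once per second; run alongside
# text-generation-benchmark and stop with Ctrl-C.
while True:
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    ).stdout.strip()
    print(used)
    time.sleep(1)
```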
Expected behavior
Given its smaller size (6 GB vs. 16 GB), the quantized model should be able to support larger batches of requests than the standard Llama 3 8B.