System Info
I have reason to believe that #1729 is causing a 2-3x performance regression in the decode stage when running EETQ-quantized models on multiple shards with CUDA graphs enabled. Supporting experiments are below.
Note: I understand the TGI built-in benchmarker is the preferred way to provide such results; I can follow up with that if needed.
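For reference, a run along these lines with the built-in benchmarker should cover the same workload (a minimal sketch; the exact flag set is an assumption, not a command I have run for these numbers):

```shell
# Hypothetical invocation inside the running TGI container; flag names assumed.
text-generation-benchmark \
  --tokenizer-name mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --sequence-length 512 \
  --decode-length 32
```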
Hardware used:
- NVIDIA-SMI 535.129.03
- Driver Version: 535.129.03
- CUDA Version: 12.2
- [NVIDIA A100-SXM4-40GB | 400W | 40960MiB] x 8
Information
- Docker
- The CLI directly

Tasks
- An officially supported command
- My own modifications
Reproduction
Experiment 1
- TGI image: `sha-c2fd35d` (from #1716, before the "Upgrade EETQ" change)
- Args: `--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --quantize eetq --sharded true --num-shard 2 --disable-grammar-support`
- Hardware: 2x A100 (40 GB)
- 50th percentile of per-token decode latency: ~8 ms
- Load: sending 1 request at a time to `/generate`, with inputs of 128|256|512 tokens and a max output of 32 tokens.
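A launch command along these lines should reproduce this setup (a sketch: only the image tag and trailing launcher args come from the experiment; the port mapping, shared-memory size, and device selection are assumptions):

```shell
docker run --gpus '"device=0,1"' --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:sha-c2fd35d \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --quantize eetq --sharded true --num-shard 2 \
  --disable-grammar-support
```

Experiments 2 and 3 below differ only in the image tag and the trailing launcher args.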
Experiment 2
- TGI image: `sha-6c2c44b` ("Upgrade EETQ", #1729)
- Args: `--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --quantize eetq --sharded true --num-shard 2 --disable-grammar-support`
- Hardware: 2x A100 (40 GB)
- 50th percentile of per-token decode latency: ~25 ms
- Load: sending 1 request at a time to `/generate`, with inputs of 128|256|512 tokens and a max output of 32 tokens.
Experiment 3
- TGI image: `2.0.0`
- Args: `--model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --sharded true --num-shard 4 --disable-grammar-support`
- Hardware: 4x A100 (40 GB)
- 50th percentile of per-token decode latency: ~10 ms
- Load: sending 1 request at a time to `/generate`, with inputs of 128|256|512 tokens and a max output of 32 tokens.
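The load in all three experiments consists of plain HTTP requests against the server's `/generate` endpoint; a minimal sketch of one such request (the placeholder prompt, host, and port are assumptions):

```shell
curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "<prompt of 128, 256, or 512 tokens>", "parameters": {"max_new_tokens": 32}}'
```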
Expected behavior
Exp. 2 shows a ~3x regression in per-token decode latency vs. Exp. 1, which has the same configuration but a TGI image from before the EETQ upgrade. Exp. 3 shows that when the model is not quantized, per-token decode latency is ~2.5x better.
Performance should remain consistent when sharding an EETQ-quantized model.