Describe the bug
I am benchmarking an Int4 quantized model against the original Float32 model. In my inference comparisons, the Int4 model runs slower than the Float32 or BFloat16 models.
I did observe a significant decrease in RAM usage and a slight reduction in GPU/CPU power usage for the Int4 model. Still, I would like to understand why overall inference is slower. Please clarify whether this is expected behavior.
To Reproduce
Any standard LLM. I simplified my experiment to run only on the encoder.
Include code snippet
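from time import perf_counter_ns

import mlx.core as mx

# ...performed model warmup prior to this
for prompt in data:
    prompt = tokenizer.encode(prompt)
    mx.eval(prompt)  # materialize the input before the timer starts
    tic = perf_counter_ns()
    output = model.encode(prompt)
    output_tokens = output.tolist()  # -> forces mx.eval()
    toc = perf_counter_ns()
    elapsed = toc - tic  # latency for this prompt, in nanoseconds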
Expected behavior
I expected the average and 90th percentile inference times of the Int4 model to be lower than those of the Float32 model. Instead, I see the opposite.
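For reference, here is a minimal sketch of how the average and 90th percentile could be computed from the per-prompt `elapsed` values. It assumes the timings were collected into a plain list of nanosecond integers. The function name `summarize_latencies` and the choice of the "inclusive" interpolation method are illustrative assumptions, not necessarily what my actual benchmark script does.

```python
import statistics


def summarize_latencies(elapsed_ns):
    """Return (mean_ms, p90_ms) for a list of per-prompt latencies in nanoseconds.

    Hypothetical helper for illustration only.
    """
    if len(elapsed_ns) < 2:
        raise ValueError("need at least two samples to compute a percentile")
    # Convert nanoseconds to milliseconds for readability.
    ms = [t / 1e6 for t in elapsed_ns]
    mean_ms = statistics.mean(ms)
    # quantiles(n=10) returns the nine decile cut points;
    # the last one is the 90th percentile.
    p90_ms = statistics.quantiles(ms, n=10, method="inclusive")[-1]
    return mean_ms, p90_ms
```

Calling this once per model variant (Float32, BFloat16, Int4) on the same prompts gives directly comparable numbers.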
Desktop (please complete the following information):
- OS Version: macOS 14.3.1
- MLX Version: 0.7.0
Additional context
N/A