[BUG] Int4 Quantized model inference is slower than Float32 model #846

@r4ghu

Describe the bug
I am benchmarking an Int4 quantized model against the original Float32 model. In my inference comparisons, the Int4 model runs slower than both the Float32 and BFloat16 models.

I did observe a significant decrease in RAM usage and a slight reduction in GPU/CPU power usage for the Int4 model, but I am trying to understand why overall inference is slower. Please clarify whether this is expected behavior.

To Reproduce
Any standard LLM. I simplified my experiment to run only on the encoder.

Include code snippet

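from time import perf_counter_ns

import mlx.core as mx

# ...performed model warmup prior to this
timings_ns = []  # per-prompt latency in nanoseconds
for prompt in data:
    tokens = tokenizer.encode(prompt)
    mx.eval(tokens)  # materialize the input before the timer starts
    tic = perf_counter_ns()
    output = model.encode(tokens)
    output_tokens = output.tolist()  # -> forces mx.eval()
    toc = perf_counter_ns()
    timings_ns.append(toc - tic)  # keep every measurement, not just the last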

Expected behavior
I expected the average and 90th-percentile inference times of the Int4 model to be lower than those of the Float32 model, but I observe the opposite.
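
For reference, here is a minimal sketch of how the average and 90th-percentile latency could be computed from per-prompt timings like the ones collected in the loop above. The helper name `summarize_latency` and the sample values are hypothetical and not part of the original benchmark. The percentile uses the nearest-rank definition, which is only one of several common conventions.

```python
from statistics import fmean


def summarize_latency(elapsed_ns, percentile=90):
    """Return (mean_ms, percentile_ms) for a list of per-call latencies in ns."""
    if not elapsed_ns:
        raise ValueError("need at least one measurement")
    ordered = sorted(elapsed_ns)
    # Nearest-rank percentile: the smallest value with at least `percentile`%
    # of the samples at or below it. Integer math avoids float rounding.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    mean_ms = fmean(ordered) / 1e6
    pct_ms = ordered[rank - 1] / 1e6
    return mean_ms, pct_ms


# Made-up timings (1 ms to 10 ms), standing in for real measurements.
sample_ns = [i * 1_000_000 for i in range(1, 11)]
mean_ms, p90_ms = summarize_latency(sample_ns)
print(f"mean = {mean_ms:.2f} ms, p90 = {p90_ms:.2f} ms")
```

Running the same summary over the Int4 and Float32 timings separately would give the two pairs of numbers being compared here.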

Desktop (please complete the following information):

  • OS Version: macOS 14.3.1
  • Version 0.7.0

Additional context
N/A