[BUG] Int4 Quantized model inference is slower than Float32 model #846

**Describe the bug**
I am benchmarking an Int4 quantized model against the original Float32 model. In my inference comparisons, the Int4 model runs slower than both the Float32 and BFloat16 models.

I did observe a significant decrease in RAM usage and a slight reduction in GPU/CPU power usage with the Int4 model, but I would like to understand why overall inference is slower. Please clarify whether this is expected behavior.

**To Reproduce**
Any standard LLM reproduces this. I simplified my experiment to run only the encoder.

```python
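from time import perf_counter_ns

import mlx.core as mx

# ...performed model warmup prior to this
timings_ns = []  # per-prompt latency, in nanoseconds
for prompt in data:
    tokens = tokenizer.encode(prompt)
    mx.eval(tokens)  # materialize the input before starting the timer
    tic = perf_counter_ns()
    output = model.encode(tokens)
    output_tokens = output.tolist()  # -> forces mx.eval()
    toc = perf_counter_ns()
    elapsed = toc - tic
    timings_ns.append(elapsed)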
```

**Expected behavior**
I expected the average and 90th-percentile inference times of the Int4 model to be lower than those of the Float32 model, but I observe the opposite.
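
For reference, this is roughly how the per-prompt latencies from the timing loop can be reduced to the two statistics mentioned here. It is a minimal sketch using only the standard library; `summarize` and `sample_ns` are hypothetical names introduced for illustration, and the sample values are placeholders, not measurements from either model.

```python
from statistics import mean, quantiles


def summarize(elapsed_ns):
    """Return (mean_ms, p90_ms) for per-prompt latencies given in nanoseconds."""
    ms = [t / 1e6 for t in elapsed_ns]
    # quantiles(n=10) returns the nine decile cut points;
    # the last one is the 90th percentile.
    p90 = quantiles(ms, n=10, method="inclusive")[-1]
    return mean(ms), p90


# Placeholder latencies in nanoseconds, NOT real measurements.
sample_ns = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
mean_ms, p90_ms = summarize(sample_ns)
print(f"mean={mean_ms:.2f} ms  p90={p90_ms:.2f} ms")
```

Running the same function over the latencies recorded for each model variant gives directly comparable numbers.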

**Desktop (please complete the following information):**
 - OS Version: macOS 14.3.1
 - Version 0.7.0

**Additional context**
N/A

