
[Quantization] Add metal quantization for MPS devices! #43934

Merged
SunMarc merged 23 commits into main from add-mlx-quantization
Feb 27, 2026
Conversation

@MekkCyber
Contributor

What does this PR do?

Adds mlx quantization for MPS devices, leveraging the kernels library for pre-built kernels!
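For context, the scheme discussed in this PR is group-wise affine quantization: each group of weights gets a float scale and bias (zero offset) plus low-bit integer codes. The sketch below is a minimal pure-Python illustration of that idea only; the function names, group size, and packing are hypothetical and are not the kernels-library or Metal API.

```python
# Illustrative sketch of group-wise affine quantization (hypothetical names,
# not the kernels-library API). Real Metal kernels also bit-pack the codes.

def affine_quantize(weights, group_size=4, bits=4):
    """Quantize a flat list of floats, one (scale, bias) pair per group."""
    levels = (1 << bits) - 1  # e.g. 15 representable steps for 4-bit
    qweights, scales, biases = [], [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        qweights.extend(round((x - lo) / scale) for x in group)
        scales.append(scale)
        biases.append(lo)  # the "bias" is the per-group zero offset
    return qweights, scales, biases

def affine_dequantize(qweights, scales, biases, group_size=4):
    """Reconstruct approximate floats: x ~ q * scale + bias."""
    return [q * scales[i // group_size] + biases[i // group_size]
            for i, q in enumerate(qweights)]

w = [0.1, -0.4, 0.25, 0.9, -1.2, 0.0, 0.33, 0.7]
q, s, b = affine_quantize(w)
w_hat = affine_dequantize(q, s, b)
max_err = max(abs(a - c) for a, c in zip(w, w_hat))  # bounded by max(scale)/2
```

The reconstruction error is bounded by half a quantization step per group, which is why per-group (rather than per-tensor) scales are used.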

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


Super nice, missing some tests tho!!

@SunMarc SunMarc self-requested a review February 17, 2026 12:37
Member

@SunMarc SunMarc left a comment


Thanks! Maybe we can change the name to Metal instead of Mlx, as it can create confusion? In the future, we might use mlx if we add compatibility with mlx models. Please add some e2e tests, plus tests that check we have the right dtype after quantization and dequantization.

Comment on lines +294 to +298
orig_dtype = value.dtype # e.g. bfloat16 for Llama
return {
target_key: w_packed,
scale_key: scales.to(orig_dtype),
bias_key: biases.to(orig_dtype),
Member


Fine, but I think _affine_quantize_tensor should return them in the right dtype already.

Contributor Author


Not sure about this: when we quantize, the scales keep the same dtype as the weight before quantization, which is float32.
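The point under discussion can be shown in a small sketch: the quantization math runs in float32, so the scales come out as float32 even when the original weight was half precision, and the caller casts them back to the weight's dtype afterwards (as the diff snippet above does with scales.to(orig_dtype)). This is a hedged illustration only; numpy stands in for torch, and quantize_scales is a hypothetical helper, not the PR's _affine_quantize_tensor.

```python
# Hypothetical sketch of the dtype round-trip discussed in this thread.
import numpy as np

def quantize_scales(weight):
    # Compute in float32 regardless of the weight dtype, as the thread notes.
    w32 = weight.astype(np.float32)
    scale = (w32.max() - w32.min()) / 15.0  # 4-bit: 15 steps
    return np.asarray([scale], dtype=np.float32)

weight = np.linspace(-1.0, 1.0, 8, dtype=np.float16)  # e.g. a half-precision weight
orig_dtype = weight.dtype
scales = quantize_scales(weight)
assert scales.dtype == np.float32     # produced in float32 by the quantizer...
scales = scales.astype(orig_dtype)    # ...then cast back to the weight's dtype
assert scales.dtype == np.float16
```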

@MekkCyber force-pushed the add-mlx-quantization branch from 212b192 to 6ac192b on February 26, 2026 at 15:31
Member

@SunMarc SunMarc left a comment


Thanks a lot! Can you just update the overview docs to add this quantization method?

@SunMarc SunMarc changed the title [Quantization] Add mlx quantization for MPS devices! [Quantization] Add metal quantization for MPS devices! Feb 27, 2026
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: metal

@SunMarc SunMarc merged commit 9dd9076 into main Feb 27, 2026
26 checks passed
@SunMarc SunMarc deleted the add-mlx-quantization branch February 27, 2026 13:28
zvik pushed a commit to zvik/transformers that referenced this pull request Mar 1, 2026
…3934)

* first commit

* style

* fix

* fix

* mlx -> metal

* other fixes

* add tests

* fixes

* weight -> qweight

* fix

* tests

* fix style

* fix

* toctree

* some docs

* qweight -> weight

* fix dtype

* rm print

* overview

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>