
fix: propagate quantization mode in QuantizedAllToShardedLinear / QuantizedShardedToAllLinear#3133

Merged
angeloskath merged 1 commit into ml-explore:main from vskiwi:fix-quantized-sharded-mode-propagation
Feb 16, 2026

Conversation

Contributor

@vskiwi vskiwi commented Feb 15, 2026

Summary

Fixes #3132

QuantizedAllToShardedLinear and QuantizedShardedToAllLinear in mlx/nn/layers/distributed.py do not accept, store, or pass the mode parameter to mx.quantized_matmul. When an MXFP8-quantized QuantizedLinear is converted via shard_linear(), the mode is silently lost. The resulting sharded layer calls quantized_matmul without mode=, which defaults to "affine" — interpreting FP8 packed weights as affine int8, producing garbage output with no error.

Additionally, MXFP8 does not use biases, but both classes unconditionally accessed self["biases"], which would raise ValueError once the mode fix is applied.
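For reference, a minimal illustration of that asymmetry (assuming the `mode` keyword on `mx.quantize` described in this PR; shapes and group sizes are arbitrary):

```python
import mlx.core as mx

w = mx.random.normal((512, 256))

# Affine quantization returns (w_q, scales, biases)...
w_q, scales, biases = mx.quantize(w, group_size=64, bits=4, mode="affine")

# ...but mxfp8 returns only (w_q, scales), so an unconditional self["biases"]
# lookup breaks as soon as the mode is actually forwarded.
packed = mx.quantize(w, group_size=32, bits=8, mode="mxfp8")
print(len(packed))  # 2
```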

Changes

  • Add mode parameter (default "affine") to both __init__ methods
  • Store self.mode and pass it to mx.quantize and mx.quantized_matmul (see the sketch after this list)
  • Use *biases unpacking to handle modes that don't produce biases (mxfp8, mxfp4)
  • Use self.get("biases") instead of self["biases"] for safe access (consistent with QuantizedLinear)
  • Propagate mode from source layer in from_quantized_linear
  • Include mode in _extra_repr output
  • Add distributed test for mxfp8 quantized shard_linear
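
A minimal sketch of the resulting local forward pass shared by both sharded layers. The distributed collectives are omitted and the helper name `_sharded_quantized_forward` is illustrative, not taken from the diff:

```python
import mlx.core as mx

def _sharded_quantized_forward(layer, x):
    # Local shard matmul with the quantization mode forwarded; previously the
    # mode keyword was omitted, so non-affine weights were decoded as affine int8.
    out = mx.quantized_matmul(
        x,
        layer["weight"],
        layer["scales"],
        layer.get("biases"),   # None for mxfp8 / mxfp4, present for affine
        transpose=True,
        group_size=layer.group_size,
        bits=layer.bits,
        mode=layer.mode,
    )
    if "bias" in layer:        # the (unquantized) output bias, if any
        out = out + layer["bias"]
    return out
```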

Impact

This unblocks tensor parallel inference for all MXFP8-quantized models (and likely mxfp4). Confirmed working: GLM-5 754B (mlx-community/GLM-5-8bit-MXFP8, mode=mxfp8, group_size=32, bits=8) on 2× M3 Ultra 512GB at ~14 tok/s with tensor parallel.

No changes to the affine (default) code path — full backward compatibility.

Test plan

  • Existing test_shard_linear test for affine quantization is unchanged and should still pass
  • New mxfp8 test in test_shard_linear verifies mode propagation, biases=None, and output correctness (sketched below)
  • Formatted with black
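
The new check roughly takes the following shape. This is only a sketch: the QuantizedLinear `mode` keyword and the `from_quantized_linear(..., group)` signature used here are assumptions, not the exact test code:

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.nn.layers.distributed import QuantizedAllToShardedLinear

group = mx.distributed.init()

# Hypothetical mxfp8-quantized source layer (keyword names assumed).
qlin = nn.QuantizedLinear(256, 512, bias=False, group_size=32, bits=8, mode="mxfp8")

sharded = QuantizedAllToShardedLinear.from_quantized_linear(qlin, group)

assert sharded.mode == "mxfp8"        # mode is propagated from the source layer
assert sharded.get("biases") is None  # mxfp8 produces no biases
# The real test additionally compares the (gathered) sharded output against
# the unsharded QuantizedLinear's output.
```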

Made with Cursor

QuantizedAllToShardedLinear and QuantizedShardedToAllLinear did not
accept, store, or forward the `mode` parameter to `mx.quantized_matmul`.
When a non-affine QuantizedLinear (e.g. mode="mxfp8") was converted via
`shard_linear()`, the mode was silently lost and `quantized_matmul`
defaulted to "affine", producing garbage output with no error.

Additionally, MXFP8 does not use biases, but both classes
unconditionally accessed `self["biases"]` which would fail once the mode
fix was applied because `mx.quantize` does not return biases for mxfp8.

Changes:
- Add `mode` parameter (default "affine") to both __init__ methods
- Store `self.mode` and pass it to `mx.quantize` and `mx.quantized_matmul`
- Use `*biases` unpacking to handle modes that don't produce biases
- Use `self.get("biases")` instead of `self["biases"]` for safe access
- Propagate mode from source layer in `from_quantized_linear`
- Include mode in `_extra_repr` output
- Add distributed test for mxfp8 quantized shard_linear

Fixes ml-explore#3132

Co-authored-by: Cursor <cursoragent@cursor.com>
Member

@angeloskath angeloskath left a comment


Thank you that looks great!

I'll merge after the tests pass.

@angeloskath angeloskath merged commit e226af7 into ml-explore:main Feb 16, 2026
16 checks passed
@vskiwi vskiwi deleted the fix-quantized-sharded-mode-propagation branch February 16, 2026 12:11


Development

Successfully merging this pull request may close these issues.

[BUG] QuantizedAllToShardedLinear / QuantizedShardedToAllLinear don't propagate quantization mode — silent garbage output for MXFP8 tensor parallel
