
Conversation

@jambayk (Contributor) commented Nov 3, 2025

Describe your changes

  • New QuantEmbedding module added to perform input embedding quantization (a hypothetical lookup sketch is included after this list)
    • Export to GatherBlockQuantized is supported in both torch script and dynamo export modes. Model builder doesn't support it yet since it requires changes to the builder script.
    • If a quantization pass is responsible for quantizing both the embeddings and the lm_head of a model with originally tied weights, it can keep them tied in the PyTorch model. During export, torch script duplicates the shared qweight because of the reshape, while dynamo keeps the reshape on the MatMulNBits (still need to test whether it is more efficient to put the reshape on the GatherBlockQuantized instead).
  • New Rtn pass that can be composed on top of other quantization passes, for example Gptq on the transformer layers and then Rtn on the embedding and lm_head to take advantage of weight tying.
  • New quantized model checkpoint format, moved to the same layout used by the MatMulNBits and GatherBlockQuantized contrib ops. There is no longer any overhead for unpacking and repacking the weights during export, so export is very fast (a minimal packing sketch is included after this list).
    • Enforce the same restrictions on block size and weight shapes required by the contrib ops for compatibility.
    • We also enforce that the quantization dim is divisible by the block size. This makes the packing logic easier since we don't have to worry about padding, and it keeps the 3D qweight for Linears compatible with the 2D qweight for Embeddings.
  • Updated the autogptq and autoawq checkpoint exports to make the quantization parameters 2D, as per the latest specs for the contrib operators.
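
For illustration, here is a minimal PyTorch sketch of what a 4-bit block-quantized embedding lookup can look like. The module, buffer names, and nibble order below are assumptions made for this example, not the actual QuantEmbedding implementation or the exact GatherBlockQuantized layout:

```python
import torch
from torch import nn


class BlockQuantEmbedding(nn.Module):
    """Hypothetical sketch of a 4-bit block-quantized embedding lookup.

    qweight holds two 4-bit values per uint8; scales and zero points are
    stored per (row, block) as 2D tensors. This is only an illustration of
    the idea, not the QuantEmbedding module from this PR.
    """

    def __init__(self, qweight, scales, zeros, block_size):
        super().__init__()
        self.register_buffer("qweight", qweight)  # (num_embeddings, hidden // 2), uint8
        self.register_buffer("scales", scales)    # (num_embeddings, hidden // block_size)
        self.register_buffer("zeros", zeros)      # same shape as scales
        self.block_size = block_size

    def forward(self, input_ids):
        rows = self.qweight[input_ids]              # gather the packed rows
        low = rows & 0x0F                           # first 4-bit value in each byte
        high = (rows >> 4) & 0x0F                   # second 4-bit value in each byte
        unpacked = torch.stack((low, high), dim=-1).flatten(-2)  # (..., hidden)
        # broadcast the per-block scales/zero points across each block
        scales = self.scales[input_ids].repeat_interleave(self.block_size, dim=-1)
        zeros = self.zeros[input_ids].repeat_interleave(self.block_size, dim=-1)
        return (unpacked.float() - zeros) * scales
```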
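And a minimal sketch of round-to-nearest 4-bit block quantization with packed storage, assuming the last dimension is the quantization dim and is divisible by the block size. The function name and return layout are hypothetical and only illustrate the packing idea, not the actual Rtn pass or the new checkpoint format:

```python
import torch


def rtn_quantize_4bit(weight: torch.Tensor, block_size: int = 32):
    """Hypothetical round-to-nearest 4-bit block quantization sketch.

    weight: 2D tensor quantized along its last dimension, which must be
    divisible by block_size (mirroring the restriction described above).
    Returns a packed uint8 qweight (two values per byte) plus 2D scales
    and zero points.
    """
    rows, cols = weight.shape
    assert cols % block_size == 0, "quantization dim must be divisible by block size"

    blocks = weight.reshape(rows, cols // block_size, block_size)
    w_min = blocks.amin(dim=-1, keepdim=True)
    w_max = blocks.amax(dim=-1, keepdim=True)
    scales = (w_max - w_min).clamp(min=1e-8) / 15.0        # 4-bit range [0, 15]
    zeros = (-w_min / scales).round().clamp(0, 15)

    q = (blocks / scales + zeros).round().clamp(0, 15).to(torch.uint8)
    q = q.reshape(rows, cols)
    packed = q[:, 0::2] | (q[:, 1::2] << 4)               # two 4-bit values per byte

    return packed, scales.squeeze(-1), zeros.squeeze(-1).to(torch.uint8)
```

Because the checkpoint already stores weights in a packed per-block layout like this, export can hand them to the contrib ops directly instead of unpacking and repacking them.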

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

@jambayk jambayk requested a review from xiaoyu-work November 3, 2025 19:55
@jambayk jambayk changed the title from "Qauntization: Embeddings quantization, new packing format, Rtn quantizer" to "Quantization: Embeddings quantization, new packing format, Rtn quantizer" Nov 3, 2025
@xiaoyu-work (Collaborator)

We are trying to improve our test coverage. Can you please add unit tests for the new files created in this PR?

@jambayk jambayk merged commit d645057 into main Nov 3, 2025
11 checks passed
@jambayk jambayk deleted the jambayk/embeds-rtn branch November 3, 2025 23:58
