
Conversation

@jambayk (Contributor) commented on Nov 4, 2025

Describe your changes

  • New surgery TieWordEmbeddings that ties the weights between the input embeddings and the lm head
  • Two cases are supported:
    • Both are unquantized: the lm head MatMul is replaced with a Gemm with transB set to 1. The input activation is reshaped to 2D and the output is reshaped back to the original rank. On CPU and CUDA at least, the reshape on the activation is just a metadata change, so there is no overhead. This option was chosen over adding a Transpose node on the shared weight, since ORT constant folds the Transpose and duplicates the initializer during session initialization (see the first sketch after this list).
    • Both are quantized with MatMulNBits and GatherBlockQuantized: a Reshape node is added on the MatMulNBits qweight so it can read the shared 2D initializer (see the second sketch after this list). This saves disk space but not device memory, because ORT constant folds the Reshape and duplicates the initializer. I plan to follow up with the contrib op developers about whether the 2D shape requirement on the GatherBlockQuantized op can be relaxed. That said, I am not sure sharing the initializers in memory would help the CPU EP case, since the MatMulNBits op prepacks its kernel weights anyway.
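Below is a minimal sketch of the unquantized rewrite on a toy graph with fixed dimensions; tensor and node names are illustrative rather than Olive's actual identifiers, and a real surgery would derive the output shape from a Shape node on the activation instead of hard-coding it.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

b, s, hidden, vocab = 2, 3, 4, 10

nodes = [
    # Flatten (b, s, hidden) -> (b*s, hidden); on CPU/CUDA this is metadata only.
    helper.make_node("Reshape", ["hidden_states", "shape_2d"], ["hs_2d"]),
    # Gemm with transB=1 consumes the shared (vocab, hidden) embedding weight
    # directly, so no Transpose node (and no constant-folded duplicate) is needed.
    helper.make_node("Gemm", ["hs_2d", "shared_weight"], ["logits_2d"], transB=1),
    # Restore the original rank of the activation.
    helper.make_node("Reshape", ["logits_2d", "shape_out"], ["logits"]),
]
graph = helper.make_graph(
    nodes,
    "tied_lm_head",
    inputs=[helper.make_tensor_value_info("hidden_states", TensorProto.FLOAT, [b, s, hidden])],
    outputs=[helper.make_tensor_value_info("logits", TensorProto.FLOAT, [b, s, vocab])],
    initializer=[
        numpy_helper.from_array(np.random.rand(vocab, hidden).astype(np.float32), "shared_weight"),
        numpy_helper.from_array(np.array([-1, hidden], dtype=np.int64), "shape_2d"),
        numpy_helper.from_array(np.array([b, s, vocab], dtype=np.int64), "shape_out"),
    ],
)
onnx.checker.check_model(helper.make_model(graph))
```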

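And a sketch of the quantized rewiring, assuming the MatMulNBits qweight bytes match the embedding's 2D GatherBlockQuantized data; the helper and tensor names here are hypothetical, and the Reshape is inserted at the front of the node list on the assumption that an initializer-fed node stays topologically valid there.

```python
import numpy as np
import onnx
from onnx import helper, numpy_helper

def share_qweight_via_reshape(model: onnx.ModelProto, matmul_name: str, shared_name: str) -> None:
    """Rewire a MatMulNBits node's qweight to read a shared 2D initializer through a Reshape."""
    graph = model.graph
    matmul = next(n for n in graph.node if n.name == matmul_name)
    # The original MatMulNBits qweight uses a 3D layout: (N, n_blocks_per_col, blob_size).
    qweight = next(t for t in graph.initializer if t.name == matmul.input[1])
    shape_init = numpy_helper.from_array(
        np.array(qweight.dims, dtype=np.int64), f"{shared_name}_3d_shape"
    )
    reshape = helper.make_node(
        "Reshape",
        [shared_name, shape_init.name],
        [f"{shared_name}_3d"],
        name=f"{shared_name}_reshape",
    )
    matmul.input[1] = reshape.output[0]
    # Drop the duplicate 3D copy from the serialized model; the shared 2D copy remains.
    graph.initializer.remove(qweight)
    graph.initializer.append(shape_init)
    graph.node.insert(0, reshape)

# Note: ORT constant folds this Reshape at session init and materializes a second
# copy of the weight in memory, so only the serialized model gets smaller.
```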
Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

(Optional) Issue link

@jambayk requested a review from xiaoyu-work on November 4, 2025 03:50
Commits: doc, reskip ort fusion test, skip dynamo quant export
@jambayk force-pushed the jambayk/embeds-tie branch from 40c5155 to 328b1a3 on November 4, 2025 16:56
@jambayk enabled auto-merge (squash) on November 4, 2025 19:10
@jambayk merged commit ff4050c into main on Nov 4, 2025 (11 checks passed)
@jambayk deleted the jambayk/embeds-tie branch on November 4, 2025 19:17