
Conversation

@githubnemo
Collaborator

While we already implemented forward compatibility with the way transformers>=5 handles weight tying, there was an issue with weight tying of trainable tokens wrappers.

Previously, we simply got fixed strings naming the modules tied to the embeddings, e.g. `"lm_head"`; these never changed, since they were just a static property of the respective `PreTrainedModel` class. However, with the new way `get_tied_weights_keys` is implemented, the names of the tied-to-embeddings modules change when they are moved around. So once we wrap `lm_head` in a trainable tokens wrapper, it becomes `lm_head.token_adapter.base_layer` instead of `lm_head`. That means the check for whether we already wrapped the tied layer needs to look at the grandparent instead of the target layer.

This assumes that we always have a nesting level of two, which is true for `TrainableTokensWrapper`.
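
To illustrate the renaming and the resulting check, here is a minimal, self-contained sketch; the `TokenAdapter` and `TrainableTokensWrapper` classes below are simplified stand-ins, not the actual PEFT implementation:

```python
# Hypothetical stand-ins for PEFT's trainable tokens wrapper, only to show the
# name change and the grandparent check; not the real implementation.
import torch.nn as nn


class TokenAdapter(nn.Module):
    """Holds the original module as `base_layer` (simplified)."""
    def __init__(self, base_layer: nn.Module):
        super().__init__()
        self.base_layer = base_layer

    def forward(self, x):
        return self.base_layer(x)


class TrainableTokensWrapper(nn.Module):
    """Nests the wrapped layer as `token_adapter.base_layer` (simplified)."""
    def __init__(self, base_layer: nn.Module):
        super().__init__()
        self.token_adapter = TokenAdapter(base_layer)

    def forward(self, x):
        return self.token_adapter(x)


model = nn.ModuleDict({"lm_head": nn.Linear(16, 100, bias=False)})
assert "lm_head" in dict(model.named_modules())  # tied key before wrapping

# After wrapping, the same weight lives under a longer module name.
model["lm_head"] = TrainableTokensWrapper(model["lm_head"])
tied_key = "lm_head.token_adapter.base_layer"
assert tied_key in dict(model.named_modules())

# To decide whether the tied layer is already wrapped, look at the grandparent
# of the module the tied key resolves to (two levels up), not the module itself.
grandparent_name = ".".join(tied_key.split(".")[:-2])  # -> "lm_head"
already_wrapped = isinstance(model.get_submodule(grandparent_name), TrainableTokensWrapper)
assert already_wrapped
```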

@githubnemo
Collaborator Author

I think `ModulesToSaveWrapper` is not relevant here, since it only handles weight tying explicitly via `PeftConfig.modules_to_tie`.


@BenjaminBossan
Member

Thanks, LGTM.

@githubnemo githubnemo merged commit 708d69d into huggingface:main Nov 20, 2025
9 of 10 checks passed
Conzel pushed a commit to Conzel/peft that referenced this pull request Nov 25, 2025
Arlaz pushed a commit to obvious-research/peft that referenced this pull request Jan 9, 2026