Deal with weight tying in transformers >=5 #2922
Merged
While we already implemented forward compatibility with the way transformers>=5 handles weight tying, there was an issue with weight tying of trainable tokens wrappers.
Previously, we simply got fixed strings naming the modules that are tied to the embeddings, e.g. `"lm_head"`. This never changed since it was just a static attribute of the respective `PreTrainedModel` class. However, with the way `get_tied_weights_keys` is now implemented, the names of the tied-to-embeddings modules change when those modules are moved around. So if we wrap the `lm_head` once in a trainable tokens wrapper, it becomes `lm_head.token_adapter.base_layer` instead of `lm_head`. That means the check for whether we already wrapped the tied layer needs to look at the grand-parent instead of the target layer itself. This obviously assumes that we always have a nesting level of two, which is true for `TrainableTokensWrapper`.