Fix embeddings resizing in TF models #8657
Conversation
if new_num_tokens is not None:
    self.predictions.bias = self.add_weight(
        shape=(new_num_tokens,), initializer="zeros", trainable=True, name="bias"
    )
    self.predictions.decoder_bias = self.add_weight(
        shape=(new_num_tokens,), initializer="zeros", trainable=True, name="bias"
    )
Are you sure this does not erase the current value of the bias, replacing it with all zeros?
It does erase it, yes. I thought the resizing happened only on a fresh new model just before training, to adapt the model to the vocab size it has to be trained on. Something like:
from transformers import TFBertForMaskedLM, BertConfig

config = BertConfig()
model = TFBertForMaskedLM(config)
model.resize_token_embeddings(new_size)  # new_size: the target vocab size
But I can make it not erase the current bias values, it is not a problem :)
That's not the case! Resizing is important also in the case of fine-tuning. In that case, you would want to keep the existing token embeddings, but add randomly initialized vectors for the added tokens.
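For illustration, a minimal sketch of a value-preserving bias resize (standalone code, not the library's actual implementation; all names are illustrative):

import tensorflow as tf

def resize_bias_preserving_values(old_bias: tf.Variable, new_num_tokens: int) -> tf.Tensor:
    # Copy the first min(old, new) entries and zero-initialize any added ones,
    # instead of re-creating the whole bias from zeros.
    num_to_copy = min(old_bias.shape[0], new_num_tokens)
    padding = tf.zeros((new_num_tokens - num_to_copy,), dtype=old_bias.dtype)
    return tf.concat([old_bias.value()[:num_to_copy], padding], axis=0)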
@@ -486,16 +486,17 @@ def _get_resized_embeddings(self, old_embeddings, new_num_tokens=None) -> tf.Var
        # todo: initializer range is not always passed in config.
        init_range = getattr(self.config, "initializer_range", 0.02)
        new_embeddings = self.add_weight(
-            "weight",
+            name=word_embeddings.name.split(":")[0],
This fixes a naming issue. When the resized embedding is created, its name is changed as well, which raises a naming issue when the updated model is saved and then loaded:
from transformers import BertConfig, TFBertForMaskedLM
model = TFBertForMaskedLM(BertConfig())
model(model.dummy_inputs)
model.resize_token_embeddings(28996)
model.save_pretrained("here")
model = TFBertForMaskedLM.from_pretrained("here")
This gives:
Some layers from the model checkpoint at here were not used when initializing TFBertForMaskedLM: ['']
- This IS expected if you are initializing TFBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
.....
The weights and bias are not properly loaded. After the naming fix:
All model checkpoint layers were used when initializing TFBertForMaskedLM.
All the layers of TFBertForMaskedLM were initialized from the model checkpoint at here.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.
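For reference, the reason the fix reuses word_embeddings.name is that TF appends a ":0" suffix to variable names, which has to be stripped before the name can be passed back to add_weight. A small standalone illustration:

import tensorflow as tf

weights = tf.Variable(tf.zeros((2, 2)), name="word_embeddings")
print(weights.name)                # word_embeddings:0
print(weights.name.split(":")[0])  # word_embeddings -- the reusable base name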
Oh boy, that part of the lib is really clumsy (compared to the PyTorch side), thanks for tackling this resizing problem!
After discussing with @LysandreJik, we would like to suggest a different API (mainly to avoid redefining resize_bias and overriding resize_token_embeddings in each model).
First things first, get_output_embeddings is not used anywhere inside the TF utils, so I'd suggest removing it. (It is used on the PyTorch side to tie the weights, but there is no such thing in TF.) The only thing it would break, from a search in the repo, is the check in generation_tf_utils (L187), but that check is super clumsy too (it should use one of the auto mappings instead of using that attribute to determine if a model has an LM head).
Then, we can replace this get_output_embeddings by a get_output_bias that would return None by default. The resize_token_embeddings method should then check the result of this method, and if it's not None, resize the bias. This way you would avoid the multiple copies of resize_bias :-)
Let me know if this makes sense.
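A rough sketch of the flow being suggested, assuming a shared, hypothetical _resize_bias helper on the base class (simplified, not a final implementation):

class TFPreTrainedModel:
    def get_output_bias(self):
        # Default: no separate output bias, so nothing extra to resize.
        return None

    def _resize_bias(self, bias, new_num_tokens):
        # A single, shared, value-preserving bias resize would live here.
        ...

    def resize_token_embeddings(self, new_num_tokens=None):
        # ... resize the token embeddings themselves first ...
        output_bias = self.get_output_bias()
        if output_bias is not None:
            # Only models overriding get_output_bias() reach this branch, so
            # resize_bias no longer has to be copied into every model.
            self._resize_bias(output_bias, new_num_tokens)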
        super().build(input_shape)

    def resize_bias(self, new_num_tokens):
        if new_num_tokens is not None:
            num_tokens_to_copy = min(self.bias.shape[0], new_num_tokens)
self.bias is never used, so we can forget about resizing it. We should also delete it from the model; then we can add it to the _keys_to_ignore_on_load_unexpected variable to avoid the warning.
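For illustration, silencing the load warning could look something like this (the regex value is hypothetical, and the base class is the one from modeling_tf_albert):

class TFAlbertForMaskedLM(TFAlbertPreTrainedModel):
    # Tell from_pretrained to skip the now-deleted bias weight found in old
    # checkpoints instead of warning about it.
    _keys_to_ignore_on_load_unexpected = [r"predictions.bias"]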
            init_weights = self.decoder.value()[:num_tokens_to_copy]
            self.decoder = self.add_weight(
                shape=(self.config.vocab_size, self.config.embedding_size),
                initializer="zeros",
                trainable=True,
                name=self.decoder.name.split(":")[0],
            )
            self.decoder.assign(init_weights)
This part is about the weights, so it shouldn't be there.
Thanks @sgugger for your useful comments. I was thinking the same; I like the solution you proposed very much and I'm totally fine with it!
Force-pushed from 3d3b129 to 3705fe0.
@sgugger I have reworked the resizing for the bias and applied it to BERT first for testing. Do you agree with this new way of doing it? If yes, I will do the same for the other models.
Looks perfect to me. Just one nit on the naming: since you're returning the layer that has the bias instead of the actual bias, I think get_bias should be named get_output_layer_with_bias (or something better if you're more inspired).
@@ -184,10 +197,10 @@ def generate(
        """

        # We cannot generate if the model does not have a LM head
-        if self.get_output_embeddings() is None:
+        is_lm, list_models = self.is_lm_model()
Way cleaner, thanks!
I don't really agree here -> I prefer to leave the self.get_output_embeddings() check here. E.g. TFRag won't be in any of TF_MODEL_FOR_CAUSAL_LM_MAPPING, TF_MODEL_FOR_MASKED_LM_MAPPING, TF_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING. And since TFRag will have two models that can both generate, but won't both be able to be in the same MAPPING class, we would already run into problems here. Also, this is inconsistent with PyTorch.
Logically, I like this check as it is right now: if the model has output_embeddings, then it can generate, because all one needs to generate is a logit output vector. I don't want to tie this functionality to the LM mappings in modeling_auto.py. It creates an unnecessary dependency and makes it unnecessarily inflexible IMO.
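In code, the check being defended is essentially the one visible in the diff above, along these lines (a sketch; the error message is paraphrased):

def generate(self, input_ids=None, **kwargs):
    # All generate() needs is a logit output vector, so the presence of
    # output embeddings is treated as "this model can generate".
    if self.get_output_embeddings() is None:
        raise AttributeError(
            "You tried to generate sequences with a model that does not have an LM head."
        )
    ...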
Just to be sure: self.get_output_embeddings() is used nowhere except for a simple check in generation_tf_utils.py, so what I deduce from that is that it is a kind of useless method. Unless it is used somewhere else?
We can certainly create a smarter way to detect whether a model has an LM layer, if the role of self.get_output_embeddings() is just to know that. What would be the best compromise?
""" | ||
Returns the model's output embeddings. | ||
Get the layer that handles a bias attribute in case the model has an LM head. |
Suggested change:
-        Get the layer that handles a bias attribute in case the model has an LM head.
+        Get the layer that handles a bias attribute in case the model has an LM head with weights tied to the embeddings.
Other than the point made below, the changes look good.
def is_lm_model(self):
    from .models.auto import (
        TF_MODEL_FOR_CAUSAL_LM_MAPPING,
        TF_MODEL_FOR_MASKED_LM_MAPPING,
        TF_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
    )

    list_models = list(TF_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING.values())
    list_models.extend(list(TF_MODEL_FOR_CAUSAL_LM_MAPPING.values()))
    list_models.extend(list(TF_MODEL_FOR_MASKED_LM_MAPPING.values()))

    return (type(self) in list_models, [model.__name__ for model in list_models])
1. I find the API for this method a bit weird. I wouldn't expect an is_lm_model method, which implies a boolean return, to return a tuple (bool, List[str]).
2. I also wouldn't expect that to be a method of TFGenerationMixin, given its description: "A class containing all of the functions supporting generation, to be used as a mixin in :class:`~transformers.TFPreTrainedModel`."
3. I find it weird to have an is_lm_model() method on models. If we have this, why not is_sequence_classification_model, is_question_answering_model, etc.?
Given how it's implemented and what it's used for, I'd rather have a has_tied_weights() method which returns a boolean value, not a tuple. If you want the list of LM models, I would put that in another method, which I wouldn't place on TFPreTrainedModel (or any of its mixins), as it's a simple operation over three mappings.
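A sketch of the suggested split, with the mapping lookup as a plain module-level helper and a boolean-only method on the model (both bodies are illustrative, reusing the imports from the snippet above):

def list_lm_model_names():
    # Simple operation over the three mappings; no need for this to live on
    # TFPreTrainedModel or its mixins.
    from .models.auto import (
        TF_MODEL_FOR_CAUSAL_LM_MAPPING,
        TF_MODEL_FOR_MASKED_LM_MAPPING,
        TF_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING,
    )

    lm_models = (
        list(TF_MODEL_FOR_SEQ_TO_SEQ_CAUSAL_LM_MAPPING.values())
        + list(TF_MODEL_FOR_CAUSAL_LM_MAPPING.values())
        + list(TF_MODEL_FOR_MASKED_LM_MAPPING.values())
    )
    return [model.__name__ for model in lm_models]


class TFPreTrainedModel:
    def has_tied_weights(self) -> bool:
        # Boolean-only answer, as suggested; membership in the LM mappings is
        # used as the proxy here.
        return type(self).__name__ in list_lm_model_names()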
Ok, I'm fine with dividing and renaming this method into two separate ones with distinct roles 👍
@@ -931,8 +955,28 @@ def __init__(self, config, *inputs, **kwargs):
        self.albert = TFAlbertMainLayer(config, name="albert")
        self.predictions = TFAlbertMLMHead(config, self.albert.embeddings, name="predictions")

    def get_output_embeddings(self):
I'd really like to keep this API - completely removing the function is very breaking in my opinion. This is not really a private method.
What does it break, more precisely?
I mean, I would think that people made use of this method, for example if they wrapped their language model in a custom class or built their own generation method.
@sgugger @patrickvonplaten @LysandreJik This PR takes care of resizing all the biases, and if we start to also change how the embeddings are resized and modify the generation, I think it would be a bit too much and out of the scope of this PR. So what I propose is to keep how it was at the beginning in …
It would be awesome if we can keep the get_output_embeddings function. A couple of reasons why I would like to keep it: …
Thanks a lot @patrickvonplaten for sharing this! I think we should move this talk to a more suitable place, and meanwhile I will revert that part of the changes.
I disagree with you on this @patrickvonplaten.
The weight tying cannot be done the same way in TF (and honestly the resizing on the PyTorch side is a bit hacky and very hard to understand; it kind of goes against our principle of no magic code), so this alone is not an argument for keeping the get_output_embeddings method.
The problem is that this function is always implemented to return the input embeddings, so the function as it is does not do anything more than get_input_embeddings.
The PyTorch side has no assert, so in that case the consistent thing is to remove the assert entirely. I could be convinced to leave the …
Ok, we debriefed a bit with @patrickvonplaten to avoid spamming the PR. I had missed that some models already use output embeddings that are different from the input embeddings (most models are tied), like T5 or mT5. In the end, we both agree on keeping that method. Does that all make sense?
Ok, I'm totally fine with this 👍! Nevertheless, there are still a few things I don't get.
What do you mean by "the resizing does not work"? Which one? Do you have a more specific example?
I don't understand this sentence, do you have an example? What do we have to check if the input/output embeddings are different, given that we get them with two separate methods (namely get_input_embeddings and get_output_embeddings)?
The new T5 and mT5 models have an output embedding layer that is sometimes tied to the input embeddings (so the same weights, like BERT) and sometimes different. When it's different, it is not resized.
The output embeddings are, very often, the same as the input embeddings (the BERT situation), so in most instances the two methods return the same layer.
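Concretely, the tied/untied distinction can be checked by identity, e.g. (a sketch, not library code):

def embeddings_are_tied(model) -> bool:
    # True for the BERT-style case where the output layer is literally the
    # input embedding layer; False for untied heads like some T5/mT5 setups.
    input_embeddings = model.get_input_embeddings()
    output_embeddings = model.get_output_embeddings()
    return output_embeddings is None or output_embeddings is input_embeddings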
Crystal clear!!! Thanks a lot for the details! I will proceed with the changes once the sprint is finished 👍
@patrickvonplaten I have put back the get_output_embeddings method.
@@ -818,6 +819,32 @@ def __init__(self, config, *inputs, **kwargs):
    def get_output_embeddings(self):
        return self.albert.embeddings

    def resize_token_embeddings(self, new_num_tokens):
In PT, albert can just use the "normal" resize_token_embeddings function, see: https://github.com/huggingface/transformers/pull/8880/files#r534143307
Do we need a special case here?
If we don't do that, the two biases are not resized, so yes, it is mandatory.
Maybe a comment on why it's necessary would help (I remember asking you, too, why there were two biases in ALBERT ;-) )
Take a look, there is already a comment ;)
# ALBERT is a special case where there are two bias to update
# even though self.bias is not used anywhere and is here
# just to make the loading weights from a PT model happy
Ah sorry, I was responding too fast here and didn't look closely. That's great then :-)
@@ -1051,6 +1051,24 @@ def __init__(self, config, *inputs, **kwargs):
            name="/final_logits_bias", shape=[1, config.vocab_size], initializer="zeros", trainable=False
        )

    def resize_token_embeddings(self, new_num_tokens):
yes! This is the same in PT!
@@ -612,14 +612,24 @@ def set_input_embeddings(self, value):
        else:
            raise NotImplementedError

    def get_output_embeddings(self) -> tf.keras.layers.Layer:
Can we leave the default get_output_embeddings function for now as well?
Method restored!
LGTM! @LysandreJik just missing your approval. The Flax tests do not pass and I don't know why :(
LGTM. Thanks for working on this and improving the API @jplu!
What does this PR do?
Currently, when the embeddings are resized, the biases are not resized at the same time. In TF there is no explicit link between the decoder weights and the biases in a dense layer, contrary to PT. This PR fixes this issue by resizing the biases at the same time, even though I don't know if this is the best solution. @LysandreJik @sgugger what do you think?
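For context, a minimal standalone Keras layer illustrating the situation described above: the bias is created as its own variable via add_weight, with no structural tie to the decoder weights (unlike torch.nn.Linear, which bundles weight and bias in one module). Illustrative code, not the library's:

import tensorflow as tf

class TinyLMHead(tf.keras.layers.Layer):
    def __init__(self, vocab_size, hidden_size, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size

    def build(self, input_shape):
        # Two independent variables: resizing one does not touch the other,
        # which is why the bias was left behind when the embeddings resized.
        self.decoder = self.add_weight(
            shape=(self.vocab_size, self.hidden_size), name="decoder"
        )
        self.bias = self.add_weight(
            shape=(self.vocab_size,), initializer="zeros", name="bias"
        )
        super().build(input_shape)

    def call(self, hidden_states):
        return tf.matmul(hidden_states, self.decoder, transpose_b=True) + self.bias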