Optional layers #8961

Merged
merged 44 commits into huggingface:master from the optional-layers branch on Dec 8, 2020

Conversation

@jplu (Contributor) commented Dec 7, 2020

What does this PR do?

This PR adds the possibility to have optional layers in the models, thanks to the new input/output process. Here, the pooling layer is created or not for the BERT/ALBERT/Longformer/MobileBERT/Roberta models. The keys to ignore when loading have been updated for these layers at the same time.
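
For illustration, here is a minimal sketch of the optional-layer pattern (the class and argument names below are illustrative, not the exact code in the diff): the pooling layer is only built when add_pooling_layer is requested, so architectures that never use the pooled output do not carry unused weights.

import tensorflow as tf

class OptionalPoolerMainLayer(tf.keras.layers.Layer):
    def __init__(self, hidden_size, add_pooling_layer=True, **kwargs):
        super().__init__(**kwargs)
        # Only instantiate the pooler when it is actually needed.
        self.pooler = (
            tf.keras.layers.Dense(hidden_size, activation="tanh", name="pooler")
            if add_pooling_layer
            else None
        )

    def call(self, sequence_output):
        # Pool the first ([CLS]) token when a pooler exists, otherwise return None.
        pooled_output = self.pooler(sequence_output[:, 0]) if self.pooler is not None else None
        return sequence_output, pooled_output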

@LysandreJik (Member) commented Dec 7, 2020

Thanks for working on this @jplu. I think we should take the opportunity to think about this issue: #8793.

The problem with the add_pooling_layer option, as it's currently done in PyTorch models, is that when training is initialized from a model checkpoint that contains the pooling layer, like bert-base-cased:

model = BertForMaskedLM.from_pretrained("bert-base-cased")
# Fine-tune the model on an MLM task

we lose the pooling layer in the process. It's not a big deal here since we're doing an MLM task; however, if we want to use that model for a downstream task:

model.save_pretrained("bert-base-cased-finetuned-mlm")
classifier_model = BertForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mlm")

we end up with a classifier model that has a randomly initialized pooling layer, whereas the weights stored in the original bert-base-cased checkpoint would have been better than a randomly initialized layer.

The issue is that right now, we have no way of specifying whether we want to keep the pooling layer in such a setup. I would argue that controlling it from the configuration would really be useful here, rather than setting it to add_pooling_layer=False in architectures that do not need it.

cc @jplu @sgugger @patrickvonplaten
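
As a hedged illustration of what configuration-level control could look like (the add_pooling_layer config attribute below is hypothetical, not an existing option at the time of this discussion):

from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_pretrained("bert-base-cased")
config.add_pooling_layer = True  # hypothetical flag: keep the pooler even in the MLM architecture
model = BertForMaskedLM.from_pretrained("bert-base-cased", config=config)
# After fine-tuning, save_pretrained would then also store the pooler weights,
# so a later BertForSequenceClassification.from_pretrained call would not
# re-initialize them randomly.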

@jplu (Contributor, Author) commented Dec 7, 2020

Indeed, this is getting more complicated than we thought at the beginning, but the case you are raising is a very good one!!

I think that controlling this from the config to get the same behavior would be more flexible; +1 on this proposal!

_keys_to_ignore_on_load_unexpected = [
    r"model.encoder.embed_tokens.weight",
    r"model.decoder.embed_tokens.weight",
]
_keys_to_ignore_on_load_missing = [
    r"final_logits_bias",

Contributor:

The logits bias can be fine-tuned, so I'm not sure we want to have those in _keys_to_ignore_on_load_missing. Or are they missing from all BART models (and thus set to 0)?

Contributor (Author):

That's a good question. I don't know BART well enough to answer it, so I will remove that one 👍

@patrickvonplaten (Contributor) left a comment

Very much like this PR!

@patrickvonplaten (Contributor) commented Dec 7, 2020

(Quoting @LysandreJik's comment above about the add_pooling_layer option and controlling the pooling layer from the configuration.)

I remember that we were thinking about adding a config param for add_pooling_layer for PT: #7272, and decided not to. I still think the cleaner solution is not to add a config param, because it's a very weird use case IMO. Why wouldn't the user just use a BertForPreTraining model for their use case? But I'm also fine with adding a config param instead; it's not a big deal to me. In that case, though, I'd definitely prefer not to add it to the general PretrainedConfig, but to each model's config.
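
A minimal sketch of that alternative, assuming the goal is simply to keep the pooler weights through the MLM fine-tuning round trip:

from transformers import BertForPreTraining, BertForSequenceClassification

model = BertForPreTraining.from_pretrained("bert-base-cased")
# ... fine-tune on the MLM objective; this architecture keeps the pooling layer ...
model.save_pretrained("bert-base-cased-finetuned-mlm")
# The pooler weights are part of the saved checkpoint, so they are loaded here
# instead of being randomly initialized.
classifier = BertForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mlm")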

@LysandreJik (Member) commented

Good point regarding BertForPreTraining. I think this is a use case (keeping a layer from another architecture) where you would want to build your own architecture, so that you have complete control over the layers.

I think we might be missing some documentation on how to do that, and on how creating an architecture that inherits from PreTrainedModel works, but this is a discussion for another time.

Ok to keep it this way.
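
A hedged sketch of that "build your own architecture" route (the class below is illustrative, not part of the library):

import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertPoolerClassifier(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)  # built with its pooling layer
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        # Classify from the pooled [CLS] representation.
        return self.classifier(outputs.pooler_output)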

@sgugger (Collaborator) left a comment

It all looks good to me, thanks for working on this!

@LysandreJik (Member) left a comment

LGTM, great work @jplu!

@@ -876,6 +876,8 @@ def call(self, input_ids, use_cache=False):
)
@keras_serializable
class TFBartModel(TFPretrainedBartModel):
    base_model_prefix = "model"

Member:

Was this previously missing? It is in the TFPretrainedBartModel:

class TFPretrainedBartModel(TFPreTrainedModel):
    config_class = BartConfig
    base_model_prefix = "model"

@jplu (Contributor, Author) commented Dec 8, 2020

LGTM!

@LysandreJik LysandreJik merged commit bf7f79c into huggingface:master Dec 8, 2020
@jplu jplu deleted the optional-layers branch December 8, 2020 15:53