Optional layers #8961

Merged
merged 44 commits into huggingface:master from the optional-layers branch on Dec 8, 2020

Conversation

@jplu (Contributor) commented Dec 7, 2020

What does this PR do?

This PR adds the possibility to have optional layers in the models, thanks to the new input/output process. Here, the pooling layer is created or not for the BERT/ALBERT/Longformer/MobileBERT/Roberta models. The keys to ignore when loading have been updated for these layers at the same time.
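
For illustration, here is a minimal sketch of the optional-layer pattern (the class and argument names below are illustrative, not the exact code in the diff): the pooling layer is only built when add_pooling_layer is requested, so architectures that never use the pooled output do not carry unused weights.

import tensorflow as tf

class OptionalPoolerMainLayer(tf.keras.layers.Layer):
    def __init__(self, hidden_size, add_pooling_layer=True, **kwargs):
        super().__init__(**kwargs)
        # Only instantiate the pooler when it is actually needed.
        self.pooler = (
            tf.keras.layers.Dense(hidden_size, activation="tanh", name="pooler")
            if add_pooling_layer
            else None
        )

    def call(self, sequence_output):
        # Pool the first ([CLS]) token when a pooler exists, otherwise return None.
        pooled_output = self.pooler(sequence_output[:, 0]) if self.pooler is not None else None
        return sequence_output, pooled_output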

@LysandreJik (Member) commented Dec 7, 2020

Thanks for working on this @jplu. I think we should take the opportunity to think about this issue: #8793.

The problem with the add_pooling_layer option, as it's currently done in PyTorch models, is that when training is initialized from a model checkpoint that contains the pooling layer, like bert-base-cased:

model = BertForMaskedLM.from_pretrained("bert-base-cased")
# Fine-tune the model on an MLM task

we lose the pooling layer in the process. It's not a big deal here since we're doing an MLM task; however, if we want to use that model for a downstream task:

model.save_pretrained("bert-base-cased-finetuned-mlm")
classifier_model = BertForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mlm")

we end up with a classifier model that has a randomly initialized pooling layer, whereas the weights stored in the original bert-base-cased checkpoint would have been better than a randomly initialized layer.

The issue is that right now, we have no way of specifying whether we want to keep the pooling layer in such a setup. I would argue that controlling it from the configuration would really be useful here, rather than setting it to add_pooling_layer=False in architectures that do not need it.

cc @jplu @sgugger @patrickvonplaten
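
As a hedged illustration of what configuration-level control could look like (the add_pooling_layer config attribute below is hypothetical, not an existing option at the time of this discussion):

from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_pretrained("bert-base-cased")
config.add_pooling_layer = True  # hypothetical flag: keep the pooler even in the MLM architecture
model = BertForMaskedLM.from_pretrained("bert-base-cased", config=config)
# After fine-tuning, save_pretrained would then also store the pooler weights,
# so a later BertForSequenceClassification.from_pretrained call would not
# re-initialize them randomly.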

@jplu (Contributor, Author) commented Dec 7, 2020

Indeed, this is getting more complicated than we thought at the beginning, but the case you are raising is a very good one!!

I think that controlling this from the config to get the same behavior would be more flexible; +1 on this proposal!

_keys_to_ignore_on_load_unexpected = [
    r"model.encoder.embed_tokens.weight",
    r"model.decoder.embed_tokens.weight",
]
_keys_to_ignore_on_load_missing = [
    r"final_logits_bias",

Contributor:

The logits bias can be fine-tuned, so I'm not sure we want to have those in _keys_to_ignore_on_load_missing. Or are they missing from all BART models (and thus set to 0)?

Contributor (Author):

That's a good question. I don't know BART well enough to answer it, so I will remove that one 👍

@patrickvonplaten (Contributor) left a comment

Very much like this PR!

@patrickvonplaten (Contributor) commented Dec 7, 2020

(Quoting @LysandreJik's comment above about the add_pooling_layer option and controlling the pooling layer from the configuration.)

I remember that we were thinking about adding a config param for add_pooling_layer for PT: #7272, and decided not to. I still think the cleaner solution is not to add a config param, because it's a very weird use case IMO. Why wouldn't the user just use a BertForPreTraining model for their use case? But I'm also fine with adding a config param instead; it's not a big deal to me. In that case, though, I'd definitely prefer not to add it to the general PretrainedConfig, but to each model's config.
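
A minimal sketch of that alternative, assuming the goal is simply to keep the pooler weights through the MLM fine-tuning round trip:

from transformers import BertForPreTraining, BertForSequenceClassification

model = BertForPreTraining.from_pretrained("bert-base-cased")
# ... fine-tune on the MLM objective; this architecture keeps the pooling layer ...
model.save_pretrained("bert-base-cased-finetuned-mlm")
# The pooler weights are part of the saved checkpoint, so they are loaded here
# instead of being randomly initialized.
classifier = BertForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mlm")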

@LysandreJik (Member) commented

Good point regarding BertForPreTraining. I think this is a use case (keeping a layer from another architecture) where you would want to build your own architecture, so that you have complete control over the layers.

I think we might be missing some documentation on how to do that, and on how creating an architecture that inherits from PreTrainedModel works, but this is a discussion for another time.

Ok to keep it this way.
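
A hedged sketch of that "build your own architecture" route (the class below is illustrative, not part of the library):

import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertPoolerClassifier(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)  # built with its pooling layer
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        # Classify from the pooled [CLS] representation.
        return self.classifier(outputs.pooler_output)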

@sgugger (Collaborator) left a comment

It all looks good to me, thanks for working on this!

@LysandreJik (Member) left a comment

LGTM, great work @jplu!

@@ -876,6 +876,8 @@ def call(self, input_ids, use_cache=False):
)
@keras_serializable
class TFBartModel(TFPretrainedBartModel):
    base_model_prefix = "model"

Member:

Was this previously missing? It is in the TFPretrainedBartModel:

class TFPretrainedBartModel(TFPreTrainedModel):
    config_class = BartConfig
    base_model_prefix = "model"

@jplu (Contributor, Author) commented Dec 8, 2020

LGTM!

@LysandreJik LysandreJik merged commit bf7f79c into huggingface:master Dec 8, 2020
@jplu jplu deleted the optional-layers branch December 8, 2020 15:53