[Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile #5395
Conversation
src/transformers/modeling_tf_bert.py (Outdated)

@@ -843,6 +847,80 @@ def call(self, inputs, **kwargs):
        return outputs  # prediction_scores, (hidden_states), (attentions)


class TFBertLMHeadModel(TFBertForPreTraining, TFCausalLanguageModelingLoss):
@LysandreJik @thomwolf split Bert into two as was done for PyTorch -> small break in backward compatibility here.
class TFCausalLanguageModelingLoss:
    def compute_loss(self, labels, logits):
        loss_fn = tf.keras.losses.CategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE
I set the reduction to `Reduction.NONE` as was done for the other losses -> is that correct @LysandreJik @jplu ?
Yes, because NONE makes the loss computation compliant with custom training loops; the reduction is left to the trainer.
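For context, a minimal eager-mode sketch (not the exact library code) of why `Reduction.NONE` matters here: the loss object returns one value per active token, and masking plus the final reduction stay under the caller's control.

```python
import tensorflow as tf

# Per-token loss: Reduction.NONE returns one loss value per label instead of a scalar.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)

labels = tf.constant([[1, 2, -100]])   # -100 marks positions excluded from the loss
logits = tf.random.normal((1, 3, 10))  # (batch_size, seq_length, vocab_size)

# Keep only the active positions, then compute the unreduced loss.
active = tf.reshape(labels, (-1,)) != -100
reduced_labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active)
reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, 10)), active)
per_token_loss = loss_fn(reduced_labels, reduced_logits)  # shape: (num_active_tokens,)

# The trainer (not Keras) decides how to reduce, e.g. a plain mean:
mean_loss = tf.reduce_mean(per_token_loss)
```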
src/transformers/modeling_tf_bert.py (Outdated)

        outputs = (logits,) + outputs[2:]  # Add hidden states and attention if they are here
        if labels is not None:
            logits = logits[:, :-1]
shift logits the same way it's done in PyTorch
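For illustration, a short sketch of what the shift does (toy shapes, assuming the usual CLM convention that position t predicts token t+1):

```python
import tensorflow as tf

logits = tf.random.normal((2, 5, 10))     # (batch_size, seq_length, vocab_size)
labels = tf.ones((2, 5), dtype=tf.int32)  # token ids

# Position t predicts token t+1, so drop the last logit and the first label,
# mirroring what the PyTorch CLM models do.
shifted_logits = logits[:, :-1]           # (2, 4, 10)
shifted_labels = labels[:, 1:]            # (2, 4)
```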
@LysandreJik @julien-c @jplu @thomwolf @sgugger - can you take a look at this example to check whether the CLM loss is correctly added? If yes, I will add this loss to all other CLM models and add tests.
Looks good to me!!
Codecov Report

@@            Coverage Diff             @@
##           master    #5395      +/-   ##
==========================================
- Coverage   76.39%   76.35%   -0.04%
==========================================
  Files         141      141
  Lines       24617    24868     +251
==========================================
+ Hits        18807    18989     +182
- Misses       5810     5879      +69

Continue to review full report at Codecov.
LGTM. Great addition!
src/transformers/modeling_tf_bert.py (Outdated)

    r"""
    Return:
        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
The additional argument `labels` should be added here.
Ok, will add this for all TF CLM models then :-) and add tests.
        )
        # make sure only labels that are not equal to -100
        # are taken into account as loss
        active_loss = tf.reshape(labels, (-1,)) != -100
@jplu @sgugger @LysandreJik - In PyTorch we use -100 and not -1 to mask tokens for the loss. Should we do the same here? It would slightly break backward compatibility, since -1 was already used for token classification - but I'm not sure how many people have already trained on token classification.
I think -100 would be the most rigorous, right?
Maybe we can release 3.0.1 immediately so that nearly no users are affected
It would be nice to have a consistent value there. Before, it was only used for TokenClassification, and we don't have any notebooks/examples on TF token classification training, so not too many people should be affected. I think it's worth it to align the values here.
+1 for consistency
Ok to replace for consistency! But don't forget to update `run_tf_ner.py` and `TFTokenClassificationLoss` accordingly as well.
Actually, it might be better to deprecate -1 here and still allow its usage for backward compatibility, no? @sgugger @LysandreJik @jplu
Yep! Good idea.
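A minimal eager-mode sketch of such a deprecation path (assumed, not the exact library code): accept the legacy -1, warn, and map it onto -100.

```python
import warnings
import tensorflow as tf

def active_loss_mask(labels):
    """Return a mask of positions that count towards the loss (eager-mode sketch)."""
    flat = tf.reshape(labels, (-1,))
    if bool(tf.reduce_any(flat == -1)):
        warnings.warn(
            "Using -1 to mask the loss is deprecated, please use -100 instead.",
            FutureWarning,
        )
        flat = tf.where(flat == -1, -100, flat)  # treat legacy -1 like -100
    return flat != -100
```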
@@ -122,7 +135,9 @@ def compute_loss(self, labels, logits):
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE
        )
        active_loss = tf.reshape(labels, (-1,)) != -1
        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)
        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)
        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)
        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)
-       prediction_logits = self.vocab_projector(prediction_logits)
+       prediction_logits = self.vocab_projector(prediction_logits, training=training)
@LysandreJik @VictorSanh - I think this `training=training` was missing here when comparing to tf_bert. Not 100% sure though.
I don't think that's necessary, as `self.vocab_projector` is a linear layer. I believe the `training` parameter is only useful for dropout?
True! `training=training` is only relevant if there is a dropout or batchnorm Keras layer: tensorflow/tensorflow#36936
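A quick illustration of that point (a toy sketch, not library code): the `training` flag changes what `Dropout` does, while a plain `Dense` layer behaves identically either way.

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(4)
dropout = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 4))

# Dense has no training-dependent behavior: same output on every call.
tf.debugging.assert_equal(dense(x), dense(x))

# Dropout zeroes ~half the units (and rescales) only when training=True.
y_train = dropout(x, training=True)
y_eval = dropout(x, training=False)   # identity at inference time
tf.debugging.assert_equal(y_eval, x)
```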
tests/test_modeling_tf_common.py (Outdated)

@@ -94,6 +95,8 @@ def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
            inputs_dict["labels"] = tf.zeros(self.model_tester.batch_size)
        elif model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.values():
            inputs_dict["labels"] = tf.zeros((self.model_tester.batch_size, self.model_tester.seq_length))
        elif model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.values():
Love these tests, whoever made them @jplu
        return outputs  # (loss), prediction_scores, (hidden_states), (attentions)


class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
split Bert the same way it's done in PyTorch
        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)
        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)
        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)
-       prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)
+       prediction_logits = self.vocab_layer_norm(prediction_logits, training=training)  # (bs, seq_length, dim)
I think layernorm weights should be conditioned on the training parameter in TF Keras
It can't hurt but I don't see it used in the source code.
        output_attentions=None,
        output_hidden_states=None,
        labels=None,
just move `labels` to its correct position
-       # for model_name in DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
-       #     model = DistilBertModel.from_pretrained(model_name)
-       #     self.assertIsNotNone(model)
+       @slow
enable slow test
@@ -25,6 +25,8 @@
from transformers import (
    AutoConfig,
    BertConfig,
    GPT2Config,
Add missing tests from the PyTorch version
@@ -292,7 +301,7 @@ def test_compile_tf_model(self):
                "decoder_input_ids": tf.keras.Input(
                    batch_shape=(2, 2000), name="decoder_input_ids", dtype="int32"
                ),
-               "inputs": tf.keras.Input(batch_shape=(2, 2000), name="inputs", dtype="int32"),
+               "input_ids": tf.keras.Input(batch_shape=(2, 2000), name="input_ids", dtype="int32"),
fix T5 inputs vs input_ids @sshleifer
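For reference, a toy sketch of the compile pattern this test exercises (the `Embedding` layer is an illustrative stand-in, not the real model): the `name` of each symbolic `tf.keras.Input` has to match the argument name the model expects, which is why `inputs` had to become `input_ids`.

```python
import tensorflow as tf

# The Input name must match the model's expected keyword argument.
input_ids = tf.keras.Input(batch_shape=(2, 2000), name="input_ids", dtype="int32")

# Toy stand-in for a real TF transformer model.
outputs = tf.keras.layers.Embedding(input_dim=100, output_dim=8)(input_ids)

model = tf.keras.Model(inputs={"input_ids": input_ids}, outputs=outputs)
model.compile(optimizer="adam", loss="mse")  # compiles only if the names line up
```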
        if model.__class__ in TF_MODEL_FOR_CAUSAL_LM_MAPPING.values():
            # if the loss is a causal LM loss, labels are shifted,
            # so that one label per batch is cut
            loss_size = loss_size - self.model_tester.batch_size
For me it is OK to do it like this, it doesn't seem too odd.
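As a sanity check on the arithmetic (illustrative numbers, not the actual tester config):

```python
# A non-causal LM loss has batch_size * seq_length entries; the causal LM loss
# shifts the labels by one, leaving batch_size * (seq_length - 1) entries,
# i.e. exactly batch_size fewer.
batch_size, seq_length = 13, 7
full_size = batch_size * seq_length       # 91
clm_size = batch_size * (seq_length - 1)  # 78
assert full_size - batch_size == clm_size
```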
        if isinstance(inputs, dict):
            kwargs.update(inputs)
        if isinstance(inputs, (tuple, list)):
There was one inconsistency in TF T5 before, in that the variable `input_ids` was wrongly called `inputs` @sshleifer. Also, TF T5 is made completely Keras-compilation compatible here, which was not the case before.
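A hedged sketch of the input-handling pattern quoted above (names illustrative): Keras may call the model with a dict, a tuple/list, or a bare tensor, and `call()` has to normalize all three into keyword arguments.

```python
import tensorflow as tf

def normalize_call_inputs(inputs, **kwargs):
    """Illustrative helper, not the library code."""
    if isinstance(inputs, dict):
        kwargs.update(inputs)            # Keras functional API passes named Inputs as a dict
    elif isinstance(inputs, (tuple, list)):
        kwargs["input_ids"] = inputs[0]  # positional: first element is input_ids
    else:
        kwargs["input_ids"] = inputs     # a bare tensor
    return kwargs

example = normalize_call_inputs({"input_ids": tf.ones((2, 4), dtype=tf.int32)})
```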
@@ -140,126 +142,158 @@

TF_MODEL_MAPPING = OrderedDict(
    [
        (T5Config, TFT5Model),
Reordered all the mappings here the same way it's done in PyTorch and added a test to check it's correct (also cc @Pierrci - think you reordered Roberta here recently).
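A hedged sketch of what such an ordering test can check (toy config classes standing in for the real ones): lookup code that walks the mapping in order would match a base class first, so a subclass like RobertaConfig (which subclasses BertConfig in transformers) must never appear after its base class.

```python
from collections import OrderedDict

# Toy stand-ins: in transformers, RobertaConfig really does subclass BertConfig.
class BertConfig: ...
class RobertaConfig(BertConfig): ...

TF_MODEL_MAPPING = OrderedDict(
    [(RobertaConfig, "TFRobertaModel"), (BertConfig, "TFBertModel")]
)

configs = list(TF_MODEL_MAPPING)
for i, config in enumerate(configs):
    # No config may appear after one of its own base classes,
    # otherwise the base class would shadow it during lookup.
    assert not any(issubclass(config, earlier) for earlier in configs[:i])
```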
src/transformers/modeling_tf_t5.py (Outdated)

        )

        # insert decoder past at right place
        # to speed up decoding
-       if use_cache is True:
+       if cast_bool_to_primitive(use_cache) is True:
I suggest cast_bool_to_primitive(use_cache, True) is True
`cast_bool_to_primitive(use_cache, self.use_cache) is True` as you did in your other PR is actually much cleaner :-)
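`cast_bool_to_primitive` is a transformers helper; roughly (an assumed sketch, not the real implementation), the second argument is the fallback used when the value cannot be resolved to a Python bool, e.g. for a symbolic graph tensor:

```python
import tensorflow as tf

def cast_bool_to_primitive_sketch(bool_variable, default=False):
    """Assumed behavior, for illustration only."""
    if bool_variable is None:
        return default
    if isinstance(bool_variable, tf.Tensor):
        try:
            return bool(bool_variable.numpy())  # eager tensor: read the value
        except Exception:
            return default                      # symbolic tensor: fall back to default
    return bool(bool_variable)

# The suggested call then reads: resolve use_cache, defaulting to the config value.
# if cast_bool_to_primitive_sketch(use_cache, self.use_cache) is True: ...
```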
This looks great! Just a few suggestions for docstrings (missing TF or tf.Tensor).
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This PR aligns the TF code more with the PT code and adds full training support to all CLM and MLM models, applying @jplu's loss design to the remaining models. In more detail, the following things are included in the PR:

- Add `TFMaskedLanguageModelingLoss` and `TFCausalLanguageModelingLoss` to all CLM and MLM TF models. Only Transfo-XL and XLM are not included since they use adaptive softmax (TF Transfo-XL currently has no adaptive softmax implemented, cc @TevenLeScao for notification).
- Add tests to `modeling_tf_auto.py`, e.g. that the mappings are correctly ordered.

TODO:

Future PR: