[Almost all TF models] TF clean up: add missing CLM / MLM loss; fix T5 naming and keras compile #5395
Conversation
src/transformers/modeling_tf_bert.py (Outdated)

@@ -843,6 +847,80 @@ def call(self, inputs, **kwargs):
        return outputs  # prediction_scores, (hidden_states), (attentions)


class TFBertLMHeadModel(TFBertForPreTraining, TFCausalLanguageModelingLoss):
@LysandreJik @thomwolf split Bert into two as was done for PyTorch -> small break in backward compatibility here.
class TFCausalLanguageModelingLoss:
    def compute_loss(self, labels, logits):
        loss_fn = tf.keras.losses.CategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE
I set the reduction to `Reduction.NONE` as was done for the other losses -> is that correct @LysandreJik @jplu ?
Yes, because NONE makes the loss computation compliant with custom training loops; the reduction is left to the trainer.
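For context, a minimal eager-mode sketch (not the exact library code) of why `Reduction.NONE` matters here: the loss object returns one value per active token, and masking plus the final reduction stay under the caller's control.

```python
import tensorflow as tf

# Per-token loss: Reduction.NONE returns one loss value per label instead of a scalar.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE
)

labels = tf.constant([[1, 2, -100]])   # -100 marks positions excluded from the loss
logits = tf.random.normal((1, 3, 10))  # (batch_size, seq_length, vocab_size)

# Keep only the active positions, then compute the unreduced loss.
active = tf.reshape(labels, (-1,)) != -100
reduced_labels = tf.boolean_mask(tf.reshape(labels, (-1,)), active)
reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, 10)), active)
per_token_loss = loss_fn(reduced_labels, reduced_logits)  # shape: (num_active_tokens,)

# The trainer (not Keras) decides how to reduce, e.g. a plain mean:
mean_loss = tf.reduce_mean(per_token_loss)
```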
src/transformers/modeling_tf_bert.py (Outdated)

        outputs = (logits,) + outputs[2:]  # Add hidden states and attention if they are here
        if labels is not None:
            logits = logits[:, :-1]
shift logits the same way it's done in PyTorch
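For illustration, a short sketch of what the shift does (toy shapes, assuming the usual CLM convention that position t predicts token t+1):

```python
import tensorflow as tf

logits = tf.random.normal((2, 5, 10))     # (batch_size, seq_length, vocab_size)
labels = tf.ones((2, 5), dtype=tf.int32)  # token ids

# Position t predicts token t+1, so drop the last logit and the first label,
# mirroring what the PyTorch CLM models do.
shifted_logits = logits[:, :-1]           # (2, 4, 10)
shifted_labels = labels[:, 1:]            # (2, 4)
```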
@LysandreJik @julien-c @jplu @thomwolf @sgugger - can you take a look at this example to check whether the CLM loss is correctly added? If yes, I will add this loss to all other CLM models and add tests.
Looks good to me!!
Codecov Report

@@            Coverage Diff             @@
##           master    #5395      +/-   ##
==========================================
- Coverage   76.39%   76.35%   -0.04%
==========================================
  Files         141      141
  Lines       24617    24868     +251
==========================================
+ Hits        18807    18989     +182
- Misses       5810     5879      +69

Continue to review full report at Codecov.
LGTM. Great addition!
src/transformers/modeling_tf_bert.py (Outdated)

    r"""
    Return:
        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
The additional argument `labels` should be added here.
Ok, will add this for all TF CLM models then :-) and add tests.
        )
        # make sure only labels that are not equal to -100
        # are taken into account as loss
        active_loss = tf.reshape(labels, (-1,)) != -100
@jplu @sgugger @LysandreJik - In PyTorch we use -100 and not -1 to mask tokens for the loss. Should we do the same here? It would slightly break backward compatibility, since -1 was already used for token classification - but I'm not sure how many people have already trained on token classification.
I think -100 would be the most rigorous, right?
Maybe we can release 3.0.1 immediately so that nearly no users are affected
It would be nice to have a consistent value there. Before, it was only used for TokenClassification, and we don't have any notebooks/examples on TF token classification training, so not too many people should be affected. I think it's worth it to align the values here.
+1 for consistency
Ok to replace for consistency! But don't forget to update `run_tf_ner.py` and `TFTokenClassificationLoss` accordingly as well.
Actually, it might be better to deprecate -1 here and still allow its usage for backward compatibility, no? @sgugger @LysandreJik @jplu
Yep! Good idea.
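A minimal eager-mode sketch of such a deprecation path (assumed, not the exact library code): accept the legacy -1, warn, and map it onto -100.

```python
import warnings
import tensorflow as tf

def active_loss_mask(labels):
    """Return a mask of positions that count towards the loss (eager-mode sketch)."""
    flat = tf.reshape(labels, (-1,))
    if bool(tf.reduce_any(flat == -1)):
        warnings.warn(
            "Using -1 to mask the loss is deprecated, please use -100 instead.",
            FutureWarning,
        )
        flat = tf.where(flat == -1, -100, flat)  # treat legacy -1 like -100
    return flat != -100
```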
@@ -122,7 +135,9 @@ def compute_loss(self, labels, logits):
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE
        )
        active_loss = tf.reshape(labels, (-1,)) != -1
        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)
        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)
        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)
        prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)
-       prediction_logits = self.vocab_projector(prediction_logits)
+       prediction_logits = self.vocab_projector(prediction_logits, training=training)
@LysandreJik @VictorSanh - I think this `training=training` was missing here when comparing to tf_bert. Not 100% sure though.
I don't think that's necessary, as `self.vocab_projector` is a linear layer. I believe the `training` parameter is only useful for dropout?
True! `training=training` is only relevant if there is a dropout or batchnorm Keras layer: tensorflow/tensorflow#36936
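A quick illustration of that point (a toy sketch, not library code): the `training` flag changes what `Dropout` does, while a plain `Dense` layer behaves identically either way.

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(4)
dropout = tf.keras.layers.Dropout(0.5)
x = tf.ones((1, 4))

# Dense has no training-dependent behavior: same output on every call.
tf.debugging.assert_equal(dense(x), dense(x))

# Dropout zeroes ~half the units (and rescales) only when training=True.
y_train = dropout(x, training=True)
y_eval = dropout(x, training=False)   # identity at inference time
tf.debugging.assert_equal(y_eval, x)
```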
tests/test_modeling_tf_common.py (Outdated)

@@ -94,6 +95,8 @@ def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
            inputs_dict["labels"] = tf.zeros(self.model_tester.batch_size)
        elif model_class in TF_MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.values():
            inputs_dict["labels"] = tf.zeros((self.model_tester.batch_size, self.model_tester.seq_length))
        elif model_class in TF_MODEL_WITH_LM_HEAD_MAPPING.values():
Love these tests, whoever made them @jplu
        return outputs  # (loss), prediction_scores, (hidden_states), (attentions)


class TFBertLMHeadModel(TFBertPreTrainedModel, TFCausalLanguageModelingLoss):
split Bert the same way it's done in PyTorch
        hidden_states = distilbert_output[0]  # (bs, seq_length, dim)
        prediction_logits = self.vocab_transform(hidden_states)  # (bs, seq_length, dim)
        prediction_logits = self.act(prediction_logits)  # (bs, seq_length, dim)
-       prediction_logits = self.vocab_layer_norm(prediction_logits)  # (bs, seq_length, dim)
+       prediction_logits = self.vocab_layer_norm(prediction_logits, training=training)  # (bs, seq_length, dim)
I think layernorm weights should be conditioned on the training parameter in TF Keras
It can't hurt but I don't see it used in the source code.
        output_attentions=None,
        output_hidden_states=None,
        labels=None,
just move `labels` to its correct position
-       # for model_name in DISTILBERT_PRETRAINED_MODEL_ARCHIVE_LIST[:1]:
-       #     model = DistilBertModel.from_pretrained(model_name)
-       #     self.assertIsNotNone(model)
+       @slow
enable slow test
@@ -25,6 +25,8 @@
from transformers import (
    AutoConfig,
    BertConfig,
    GPT2Config,
Add missing tests from the PyTorch version
@@ -292,7 +301,7 @@ def test_compile_tf_model(self):
                "decoder_input_ids": tf.keras.Input(
                    batch_shape=(2, 2000), name="decoder_input_ids", dtype="int32"
                ),
-               "inputs": tf.keras.Input(batch_shape=(2, 2000), name="inputs", dtype="int32"),
+               "input_ids": tf.keras.Input(batch_shape=(2, 2000), name="input_ids", dtype="int32"),
fix T5 inputs vs input_ids @sshleifer
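For reference, a toy sketch of the compile pattern this test exercises (the `Embedding` layer is an illustrative stand-in, not the real model): the `name` of each symbolic `tf.keras.Input` has to match the argument name the model expects, which is why `inputs` had to become `input_ids`.

```python
import tensorflow as tf

# The Input name must match the model's expected keyword argument.
input_ids = tf.keras.Input(batch_shape=(2, 2000), name="input_ids", dtype="int32")

# Toy stand-in for a real TF transformer model.
outputs = tf.keras.layers.Embedding(input_dim=100, output_dim=8)(input_ids)

model = tf.keras.Model(inputs={"input_ids": input_ids}, outputs=outputs)
model.compile(optimizer="adam", loss="mse")  # compiles only if the names line up
```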
        if model.__class__ in TF_MODEL_FOR_CAUSAL_LM_MAPPING.values():
            # if the loss is a causal LM loss, labels are shifted,
            # so that one label per batch is cut
            loss_size = loss_size - self.model_tester.batch_size
For me it is OK to do it like this, it doesn't seem too odd.
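As a sanity check on the arithmetic (illustrative numbers, not the actual tester config):

```python
# A non-causal LM loss has batch_size * seq_length entries; the causal LM loss
# shifts the labels by one, leaving batch_size * (seq_length - 1) entries,
# i.e. exactly batch_size fewer.
batch_size, seq_length = 13, 7
full_size = batch_size * seq_length       # 91
clm_size = batch_size * (seq_length - 1)  # 78
assert full_size - batch_size == clm_size
```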
        if isinstance(inputs, dict):
            kwargs.update(inputs)
        if isinstance(inputs, (tuple, list)):
There was one inconsistency in TF T5 before, in that the variable `input_ids` was wrongly called `inputs` @sshleifer. Also, TF T5 is made completely Keras-compilation compatible here, which was not the case before.
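A hedged sketch of the input-handling pattern quoted above (names illustrative): Keras may call the model with a dict, a tuple/list, or a bare tensor, and `call()` has to normalize all three into keyword arguments.

```python
import tensorflow as tf

def normalize_call_inputs(inputs, **kwargs):
    """Illustrative helper, not the library code."""
    if isinstance(inputs, dict):
        kwargs.update(inputs)            # Keras functional API passes named Inputs as a dict
    elif isinstance(inputs, (tuple, list)):
        kwargs["input_ids"] = inputs[0]  # positional: first element is input_ids
    else:
        kwargs["input_ids"] = inputs     # a bare tensor
    return kwargs

example = normalize_call_inputs({"input_ids": tf.ones((2, 4), dtype=tf.int32)})
```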
@@ -140,126 +142,158 @@

TF_MODEL_MAPPING = OrderedDict(
    [
        (T5Config, TFT5Model),
Reordered all the mappings here the same way it's done in PyTorch and added a test to check it's correct (also cc @Pierrci - think you reordered Roberta here recently).
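A hedged sketch of what such an ordering test can check (toy config classes standing in for the real ones): lookup code that walks the mapping in order would match a base class first, so a subclass like RobertaConfig (which subclasses BertConfig in transformers) must never appear after its base class.

```python
from collections import OrderedDict

# Toy stand-ins: in transformers, RobertaConfig really does subclass BertConfig.
class BertConfig: ...
class RobertaConfig(BertConfig): ...

TF_MODEL_MAPPING = OrderedDict(
    [(RobertaConfig, "TFRobertaModel"), (BertConfig, "TFBertModel")]
)

configs = list(TF_MODEL_MAPPING)
for i, config in enumerate(configs):
    # No config may appear after one of its own base classes,
    # otherwise the base class would shadow it during lookup.
    assert not any(issubclass(config, earlier) for earlier in configs[:i])
```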
src/transformers/modeling_tf_t5.py (Outdated)

        )

        # insert decoder past at right place
        # to speed up decoding
-       if use_cache is True:
+       if cast_bool_to_primitive(use_cache) is True:
I suggest cast_bool_to_primitive(use_cache, True) is True
`cast_bool_to_primitive(use_cache, self.use_cache) is True` as you did in your other PR is actually much cleaner :-)
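`cast_bool_to_primitive` is a transformers helper; roughly (an assumed sketch, not the real implementation), the second argument is the fallback used when the value cannot be resolved to a Python bool, e.g. for a symbolic graph tensor:

```python
import tensorflow as tf

def cast_bool_to_primitive_sketch(bool_variable, default=False):
    """Assumed behavior, for illustration only."""
    if bool_variable is None:
        return default
    if isinstance(bool_variable, tf.Tensor):
        try:
            return bool(bool_variable.numpy())  # eager tensor: read the value
        except Exception:
            return default                      # symbolic tensor: fall back to default
    return bool(bool_variable)

# The suggested call then reads: resolve use_cache, defaulting to the config value.
# if cast_bool_to_primitive_sketch(use_cache, self.use_cache) is True: ...
```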
This looks great! Just a few suggestions for docstrings (missing TF or tf.Tensor).
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
This PR aligns the TF code more with the PT code and adds full training support to all CLM and MLM models, applying @jplu's loss design to the remaining models. In more detail, the following things are included in the PR:

- Add `TFMaskedLanguageModelingLoss` and `TFCausalLanguageModelingLoss` to all CLM and MLM TF models. Only Transfo-XL and XLM are not included since they use adaptive softmax (TF Transfo-XL currently has no adaptive softmax implemented, cc @TevenLeScao for notification).
- Add tests to `modeling_tf_auto.py`, e.g. that the mappings are correctly ordered.

TODO:

Future PR: