prepare_seq2seq_batch makes labels/ decoder_input_ids made later. #6654
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #6654      +/-   ##
==========================================
- Coverage   79.74%   79.70%   -0.05%
==========================================
  Files         157      157
  Lines       28479    28477       -2
==========================================
- Hits        22712    22697      -15
- Misses       5767     5780      +13
==========================================
```
Continue to review full report at Codecov.
Love the variable renaming/readability changes. Can the BART tokenization tests leverage the TokenizerTesterMixin?
examples/seq2seq/distillation.py (Outdated)
```
@@ -246,7 +260,7 @@ def add_distill_args(parser):

class BartTranslationDistiller(BartSummarizationDistiller):
    mode = "translation"
    loss_names = ["loss"]
    loss_names = ["loss", "ce_loss", "mlm_loss", "enc_mse_loss", "hid_loss_enc", "hid_loss_dec"]
```
QOL improvement unrelated to the batching change
```
        return layers_to_copy[n_to_get]
    else:
        return all_layers[:n_to_get]  # TODO: better version on theseus-bart branch

LAYERS_TO_COPY = {
```
QOL improvement unrelated to the batching change
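For readers outside the diff: the helper falls back to copying the first n teacher layers when no curated mapping exists. A minimal sketch of the idea, assuming a mapping from teacher depth to hand-picked layer indices (the indices here are illustrative; the real table lives in examples/seq2seq/distillation.py and may differ):

```python
# Sketch only -- indices are illustrative, not the repo's exact table.
LAYERS_TO_COPY = {
    # teacher depth -> {student depth: teacher layer indices to copy}
    12: {1: [0], 2: [0, 6], 3: [0, 6, 11], 6: [0, 2, 4, 7, 9, 11]},
}

def pick_layers_to_copy(n_to_get, n_teacher, all_layers):
    """Choose which teacher layers to initialize an n_to_get-layer student from."""
    layers_to_copy = LAYERS_TO_COPY.get(n_teacher, {})
    if n_to_get in layers_to_copy:
        return layers_to_copy[n_to_get]
    else:
        return all_layers[:n_to_get]  # TODO: better version on theseus-bart branch
```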
```
            self.dataset_class = TranslationDataset
        else:
            self.dataset_class = Seq2SeqDataset
        self.dataset_class = (
```
important change
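That "important change" boils down to a capability check on the tokenizer. A minimal sketch, assuming the hasattr-based test described in the PR summary and the dataset classes renamed by this PR in examples/seq2seq/utils.py:

```python
def pick_dataset_class(tokenizer):
    # Use the new dataset only if the tokenizer can build seq2seq batches
    # itself; otherwise fall back to the legacy slicing-based dataset.
    return (
        Seq2SeqDataset
        if hasattr(tokenizer, "prepare_seq2seq_batch")
        else LegacySeq2SeqDataset
    )
```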
If every seq2seq model is going to have prepare_seq2seq_batch, then we might not need LegacyDataset. Would it be a good idea to make prepare_seq2seq_batch mandatory for all seq2seq tokenizers and not maintain two datasets?
Maybe eventually. @patrickvonplaten is experimenting with composing GLUE models into seq2seq models with EncoderDecoderModel (e.g. Roberta2Roberta), so I think we should keep the Legacy dataset until we can guarantee that most tokenizers have prepare_seq2seq_batch methods, which might be never.
examples/seq2seq/finetune.py (Outdated)
```
            lm_labels = target_ids
            lm_labels = batch["labels"]
            decoder_input_ids = self.model._shift_right(lm_labels)
            decoder_attention_mask = decoder_input_ids.ne(pad_token_id)
```
this is new
examples/seq2seq/finetune.py (Outdated)
```
            decoder_attention_mask = decoder_input_ids.ne(pad_token_id)
        elif "labels" in batch:
            lm_labels = batch["labels"]
            decoder_input_ids = shift_tokens_right(lm_labels, pad_token_id)
```
This is the key change
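For context, shift_tokens_right (from modeling_bart at the time, reproduced here from memory, so treat it as a sketch) rotates the last non-pad token, usually `</s>`, into position 0 so the decoder gets a start token while the labels keep every real target token:

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Shift ids one position to the right; the last non-pad token (usually
    <eos>) wraps around to index 0 and acts as the decoder start token."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens
```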
Added common tokenizer tests @LysandreJik
LGTM, cool tests!
…ggingface#6654)
* broken test
* batch parity
* tests pass
* boom boom
* boom boom
* split out bart tokenizer tests
* fix tests
* boom boom
* Fixed dataset bug
* Fix marian
* Undo extra
* Get marian working
* Fix t5 tok tests
* Test passing
* Cleanup
* better assert msg
* require torch
* Fix mbart tests
* undo extra decoder_attn_mask change
* Fix import
* pegasus tokenizer can ignore src_lang kwargs
* unused kwarg test cov
* boom boom
* add todo for pegasus issue
* cover one word translation edge case
* Cleanup
* doc
…ter. (huggingface#6654)" This reverts commit 4fd670a.
src/ changes:
- When tgt_texts is supplied, prepare_seq2seq_batch calls the tensor that used to be called decoder_input_ids, labels (see the usage sketch after this list).
- This branch was originally called "Fairseq batch equivalence", because it makes batches that look identical to fairseq's for mbart (and bart).
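A usage sketch of the new behavior (the checkpoint name is just an example; the returned key names assume the post-PR API):

```python
from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ro")
batch = tok.prepare_seq2seq_batch(
    src_texts=["UN Chief says there is no military solution in Syria"],
    tgt_texts=["Șeful ONU declară că nu există soluții militare în Siria"],
)
# The target tensor now comes back as `labels`; decoder_input_ids are made later.
print(batch.keys())  # input_ids, attention_mask, labels
```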
examples/seq2seq changes:
- examples/seq2seq/finetune.py (and eventually Seq2SeqTrainer) makes decoder_input_ids by shifting tokens right.
- Added a --label_smoothing option to seq2seq/distillation.py (a sketch follows this list).
- Renamed Seq2SeqDataset -> LegacySeq2SeqDataset and TranslationDataset -> Seq2SeqDataset. The new Seq2SeqDataset calls prepare_seq2seq_batch. The choice of which dataset to use is determined by whether the tokenizer has a prepare_seq2seq_batch method.
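The label-smoothing sketch referenced above: this is the standard fairseq-style formulation, not necessarily the exact code added to distillation.py:

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=-100):
    """Label-smoothed NLL; lprobs are log-probabilities over the vocab."""
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)      # pick gold log-prob
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)      # uniform component
    pad_mask = target.eq(ignore_index)
    nll_loss = nll_loss.masked_fill(pad_mask, 0.0).sum()
    smooth_loss = smooth_loss.masked_fill(pad_mask, 0.0).sum()
    eps_i = epsilon / lprobs.size(-1)
    return (1.0 - epsilon) * nll_loss + eps_i * smooth_loss

# Toy usage: batch of 2, sequence length 5, vocab 100.
lprobs = F.log_softmax(torch.randn(2, 5, 100), dim=-1)
target = torch.randint(0, 100, (2, 5))
loss = label_smoothed_nll_loss(lprobs, target, epsilon=0.1)
```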
Problem:
Previously on master, if the target-language sequence was "Șeful ONU declară că nu există soluții militare în Siria" and the tokenizer was Marian, lm_labels would become "ONU declară că nu există soluții militare în Siria", and the model would learn to skip the first token (or not generate bos). Generations would then start very strangely, for example:
", fostul şef al personalului prezidenţial din Brazilia, va participa la un proces"
Now:
"Fostul şef al personalului prezidenţial al Braziliei va fi judecat"
(The same thing is happening for pegasus, #6711.)
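A toy before/after of the alignment bug (token ids are made up; the old slicing is paraphrased from memory of the previous finetune.py, and shift_tokens_right is the sketch shown earlier):

```python
import torch

pad, eos = 0, 2
# Marian-style target ids for "Șeful ONU declară ..." + </s>; no bos token.
y = torch.tensor([[11, 12, 13, 14, eos]])

# master (paraphrased): both tensors were slices of the same tensor, so id 11
# ("Șeful") was a decoder input but never a label -- the model learned to skip it.
old_decoder_input_ids = y[:, :-1]  # [[11, 12, 13, 14]]
old_lm_labels = y[:, 1:]           # [[12, 13, 14, 2]]

# this branch: labels keep every token; decoder inputs are y shifted right.
labels = y                                      # [[11, 12, 13, 14, 2]]
decoder_input_ids = shift_tokens_right(y, pad)  # [[2, 11, 12, 13, 14]]
```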
Metrics:
- mbart en->ro: no change
- marian (en-ro distillation, no teacher, 3 decoder layers): master 23 BLEU, this branch 25 BLEU
- distilbart-cnn-12-3: no change (within 0.01 ROUGE-2)
  - master + label smoothing: {'rouge1': 43.2764, 'rouge2': 20.4969, 'rougeL': 29.9210}
  - this branch + label smoothing: {'rouge1': 43.1997, 'rouge2': 20.4879, 'rougeL': 30.1607}
TODO:
If you want to test whether this branch makes truncation go away, the easiest way is to pull the mirror branch with
cc @patil-suraj