
prepare_seq2seq_batch makes labels / decoder_input_ids made later #6654

Merged: 33 commits merged into huggingface:master on Aug 28, 2020

Conversation

@sshleifer (Contributor) commented Aug 21, 2020

src/ changes:

  • when tgt_texts is supplied, prepare_seq2seq_batch returns the tensor that used to be called decoder_input_ids under the key labels (see the usage sketch after this list).
  • This change helps metrics for models whose tokenizers do not add bos to the beginning of target sequences, like Marian and Pegasus, without affecting metrics for other models (bart).
    This branch was originally called "Fairseq batch equivalence", because it makes batches that look identical to fairseq's for mbart (and bart).
  • adds a tokenization test file for bart.
  • lots of cleanup and testing.
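
A rough usage sketch of the new behaviour (the checkpoint name and the English source sentence are illustrative, not taken from this PR):

```python
from transformers import MarianTokenizer

tok = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ro")

batch = tok.prepare_seq2seq_batch(
    src_texts=["UN Chief Says There Is No Military Solution in Syria"],
    tgt_texts=["Șeful ONU declară că nu există soluții militare în Siria"],
    return_tensors="pt",
)

# The target tensor now comes back under the key "labels";
# decoder_input_ids is no longer produced by the tokenizer.
print(batch.keys())  # input_ids, attention_mask, labels
```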

examples/seq2seq changes:

  • examples/seq2seq/finetune.py (and eventually Seq2SeqTrainer) makes decoder_input_ids by shifting tokens right (see the sketch after this list).
  • this enables Marian finetuning and distillation, with a few extra changes.
  • add --label_smoothing option to seq2seq/distillation.py
  • rename Seq2SeqDataset -> LegacySeq2SeqDataset and TranslationDataset -> Seq2SeqDataset. The new Seq2SeqDataset calls prepare_seq2seq_batch. The choice of which dataset to use is determined by whether the tokenizer has a prepare_seq2seq_batch method.
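
For BART/Marian-style models, a minimal sketch of the right shift (this mirrors the shift_tokens_right helper in modeling_bart.py at the time; T5 uses model._shift_right instead): the last non-pad token, usually eos, wraps around to position 0 and acts as the decoder start token.

```python
import torch

def shift_tokens_right(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Shift input ids one position to the right, wrapping the last non-pad
    token (usually eos) around to position 0 as the decoder start token."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens

# finetune.py then derives the decoder inputs from the labels, roughly:
# decoder_input_ids = shift_tokens_right(batch["labels"], tokenizer.pad_token_id)
```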

Problem:

Previously on master, if the target-language sequence was "Șeful ONU declară că nu există soluții militare în Siria" and the tokenizer was Marian, lm_labels would become "ONU declară că nu există soluții militare în Siria", so the model would learn to skip the first token (or not to generate bos).

Generations would then start very strangely, for example:
", fostul şef al personalului prezidenţial din Brazilia, va participa la un proces"

now: "Fostul şef al personalului prezidenţial al Braziliei va fi judecat".
(same thing is happening for pegasus #6711)
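
A toy illustration of the off-by-one (hypothetical token strings, not the actual finetune.py code):

```python
# Marian adds no bos, so the target starts directly with the first word.
target = ["▁Șeful", "▁ONU", "▁declară", "...", "</s>"]

# Old behaviour (described above): lm_labels dropped the first position,
# so "Șeful" was never supervised and the model learned to skip it.
old_lm_labels = target[1:]          # ['▁ONU', '▁declară', '...', '</s>']

# New behaviour: the full target is the label sequence; decoder_input_ids
# are derived by shifting right instead, so nothing is dropped.
new_labels = target
new_decoder_input_ids = [target[-1]] + target[:-1]  # BART/Marian-style shift
```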

Metrics

  • mbart en->ro: no change
  • marian (en->ro distillation, no teacher, 3 decoder layers): master 23 BLEU, this branch 25 BLEU
  • distilbart-cnn-12-3: no change (within 0.01 ROUGE-2)
      master + label smoothing: {"rouge1": 43.2764, "rouge2": 20.4969, "rougeL": 29.9210}
      this branch + label smoothing: {"rouge1": 43.1997, "rouge2": 20.4879, "rougeL": 30.1607}

TODO:

  • check t5-base
  • check pegasus

If you want to test whether this branch makes truncation go away, the easiest way is to pull the mirror branch with

git fetch
git checkout batch-parity-cleaner

cc @patil-suraj

@sshleifer changed the title from "[wip] prepare_seq2seq_batch lets model make decoder_input_ids" to "prepare_seq2seq_batch lets model make decoder_input_ids" on Aug 27, 2020

codecov bot commented Aug 27, 2020

Codecov Report

Merging #6654 into master will decrease coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #6654      +/-   ##
==========================================
- Coverage   79.74%   79.70%   -0.05%     
==========================================
  Files         157      157              
  Lines       28479    28477       -2     
==========================================
- Hits        22712    22697      -15     
- Misses       5767     5780      +13     
Impacted Files                                 Coverage Δ
src/transformers/modeling_bart.py              95.57% <100.00%> (+0.01%) ⬆️
src/transformers/modeling_t5.py                83.83% <100.00%> (ø)
src/transformers/tokenization_bart.py          100.00% <100.00%> (ø)
src/transformers/tokenization_marian.py        99.15% <100.00%> (+32.48%) ⬆️
src/transformers/tokenization_mbart.py         96.82% <100.00%> (+1.51%) ⬆️
src/transformers/tokenization_pegasus.py       95.23% <100.00%> (+49.92%) ⬆️
src/transformers/tokenization_t5.py            95.32% <100.00%> (ø)
src/transformers/tokenization_roberta.py       87.67% <0.00%> (-10.96%) ⬇️
src/transformers/tokenization_utils_base.py    86.58% <0.00%> (-7.19%) ⬇️
src/transformers/tokenization_transfo_xl.py    38.73% <0.00%> (-3.76%) ⬇️
... and 9 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@LysandreJik (Member) left a comment:
Love the variable renaming/readability changes.

Can the BART tokenization tests leverage the TokenizerTesterMixin?

@@ -246,7 +260,7 @@ def add_distill_args(parser):

class BartTranslationDistiller(BartSummarizationDistiller):
mode = "translation"
loss_names = ["loss"]
loss_names = ["loss", "ce_loss", "mlm_loss", "enc_mse_loss", "hid_loss_enc", "hid_loss_dec"]
@sshleifer (Author): QOL improvement unrelated to the batching change.

return layers_to_copy[n_to_get]
else:
return all_layers[:n_to_get] # TODO: better version on theseus-bart branch
LAYERS_TO_COPY = {
@sshleifer (Author): QOL improvement unrelated to the batching change.

self.dataset_class = TranslationDataset
else:
self.dataset_class = Seq2SeqDataset
self.dataset_class = (
@sshleifer (Author): important change.

@patil-suraj (Contributor) commented Aug 27, 2020:
If every seq2seq model is going to have prepare_seq2seq_batch, then we might not need LegacyDataset. Would it be a good idea to make prepare_seq2seq_batch mandatory for all seq2seq tokenizers and not maintain two datasets?

@sshleifer (Author):
Maybe eventually. @patrickvonplaten is experimenting with composing GLUE models into seq2seq models with EncoderDecoderModel (e.g. Roberta2Roberta), so I think we should keep the Legacy dataset until we can guarantee that most tokenizers have prepare_seq2seq_batch methods, which might be never.

lm_labels = target_ids
lm_labels = batch["labels"]
decoder_input_ids = self.model._shift_right(lm_labels)
decoder_attention_mask = decoder_input_ids.ne(pad_token_id)
@sshleifer (Author): this is new.

decoder_attention_mask = decoder_input_ids.ne(pad_token_id)
elif "labels" in batch:
lm_labels = batch["labels"]
decoder_input_ids = shift_tokens_right(lm_labels, pad_token_id)
@sshleifer (Author): This is the key change.

@patil-suraj mentioned this pull request, Aug 27, 2020
@sshleifer (Author): Added common tokenizer tests @LysandreJik

@sshleifer mentioned this pull request, Aug 28, 2020
@LysandreJik (Member) left a comment:
LGTM, cool tests!

@LysandreJik merged commit 9336086 into huggingface:master on Aug 28, 2020
Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020
…ggingface#6654)

* broken test
* batch parity
* tests pass
* boom boom
* boom boom
* split out bart tokenizer tests
* fix tests
* boom boom
* Fixed dataset bug
* Fix marian
* Undo extra
* Get marian working
* Fix t5 tok tests
* Test passing
* Cleanup
* better assert msg
* require torch
* Fix mbart tests
* undo extra decoder_attn_mask change
* Fix import
* pegasus tokenizer can ignore src_lang kwargs
* unused kwarg test cov
* boom boom
* add todo for pegasus issue
* cover one word translation edge case
* Cleanup
* doc

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Successfully merging this pull request may close these issues:

  • Pegasus finetuning: OOM
  • Bart: make decoder_input_ids correctly if labels specified