[bart] rename self-attention -> attention #6708
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #6708      +/-   ##
==========================================
+ Coverage   80.02%   80.15%   +0.12%
==========================================
  Files         157      157
  Lines       28586    28586
==========================================
+ Hits        22876    22912      +36
+ Misses       5710     5674      -36

Continue to review full report at Codecov.
As long as it doesn't break anything, this is fine by me.
* Only access loss tensor every logging_steps (see the sketch after this list)
  * `tensor.item()` was being called every step. This must not be done for XLA:TPU tensors, as it forces TPU<>CPU communication at each step and is terrible for performance. On RoBERTa MLM, for example, it reduces step time by 30%; the gain should be larger for models/tasks with smaller step times.
  * Train batch size was not correct when a user passes the `per_gpu_train_batch_size` flag
  * Average-reduce loss across eval shards
* Fix style (#6803)
* t5 model should make decoder_attention_mask (#6800)
* [s2s] Test hub configs in self-scheduled CI (#6809)
* [s2s] round runtime in run_eval (#6798)
* Pegasus finetune script: add --adafactor (#6811)
* [bart] rename self-attention -> attention (#6708)
* [tests] fix typos in inputs (#6818)
* Fixed open in colab link (#6825)
* Add model card for singbert lite. Update widget for singbert and singbert-large. (#6827)
* BR_BERTo model card (#6793)
* Clearly indicate shuffle=False (#6312)
  Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
* [s2s README] Add more dataset download instructions (#6737)
* Style
* Patch logging issue: set default logging level to `WARNING` instead of `INFO`
* TF Flaubert w/ pre-norm (#6841)
* Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task (#6644)
  * add data collator and dataset for the next sentence prediction task
  * bug fix (number of special tokens & truncate sequences)
  * bug fix (+ dict inputs support for the data collator)
  * add padding for the NSP data collator; renamed cached files to avoid conflicts
  * add test for the NSP data collator
  * Style
  Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
  Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
* Fix in Adafactor docstrings (#6845)
* Fix resuming training for Windows (#6847)
* Only access loss tensor every logging_steps
  * `tensor.item()` was being called every step (see above)
  * Train batch size was not correct when a user passes the `per_gpu_train_batch_size` flag
  * Average-reduce loss across eval shards
* comments

Co-authored-by: Sam Shleifer <sshleifer@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Thomas Ashish Cherian <6967017+PandaWhoCodes@users.noreply.github.com>
Co-authored-by: Zane Lim <zyuanlim@gmail.com>
Co-authored-by: Rodolfo De Nadai <rdenadai@gmail.com>
Co-authored-by: xujiaze13 <37360975+xujiaze13@users.noreply.github.com>
Co-authored-by: Kevin Canwen Xu <canwenxu@126.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Huang Lianzhe <hlz@pku.edu.cn>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
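For context, the logging change above boils down to deferring the `loss.item()` call until a logging step is actually due, since `.item()` forces a device-to-host transfer on XLA:TPU. The snippet below is only a minimal sketch of that pattern; the names (`train`, `tr_loss`, `logging_steps`) are illustrative and do not mirror the actual Trainer internals.

```python
import torch

def train(model, dataloader, optimizer, logging_steps=50):
    """Sketch only: keep the running loss as a device tensor and call
    .item() once every logging_steps steps instead of every step."""
    tr_loss, logging_loss, global_step = 0.0, 0.0, 0
    for batch in dataloader:
        loss = model(**batch).loss      # assumes an HF-style output with a .loss field
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        tr_loss = tr_loss + loss.detach()   # tensor arithmetic only, no device->host sync
        global_step += 1

        if logging_steps > 0 and global_step % logging_steps == 0:
            # The only TPU/GPU -> CPU transfer happens here, once per logging window.
            avg_loss = (tr_loss - logging_loss).item() / logging_steps
            print({"step": global_step, "loss": avg_loss})
            logging_loss = tr_loss
```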
This reverts commit adcc753.
The name `attention` makes more sense since the module is also used for cross-attention.
The rename does not change the state dict or break slow tests/backwards compatibility.
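To illustrate why a rename like this can leave checkpoints untouched: PyTorch state-dict keys come from module attribute names, not from Python class names, so renaming the class (and optionally keeping an alias for the old name) does not affect saved weights. A hypothetical sketch, not the actual BART code:

```python
import torch.nn as nn

class Attention(nn.Module):
    """Hypothetical stand-in for the renamed layer (previously `SelfAttention`)."""
    def __init__(self, embed_dim):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)

# Backwards-compatible alias so code importing the old class name keeps working.
SelfAttention = Attention

class EncoderLayer(nn.Module):
    def __init__(self, embed_dim=16):
        super().__init__()
        # The attribute name ("self_attn") determines the checkpoint keys,
        # so renaming the *class* does not touch the state dict.
        self.self_attn = Attention(embed_dim)

layer = EncoderLayer()
print(list(layer.state_dict().keys()))
# ['self_attn.q_proj.weight', 'self_attn.q_proj.bias',
#  'self_attn.k_proj.weight', 'self_attn.k_proj.bias']
```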