
T5Tokenizer adds EOS token if not already added #5866

Merged
sshleifer merged 19 commits into master from t5tok on Aug 25, 2020

Conversation

@sshleifer (Contributor) commented Jul 18, 2020

T5Tokenizer should add </s> to the end of sequences. Since some users are doing this on their own, this PR only adds </s> if it has not already been added.
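For illustration, a minimal sketch of the intended behavior after this change (assuming the t5-base checkpoint is available; encode keeps its standard add_special_tokens=True default):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")

ids = tok.encode("translate English to Romanian: I like turtles.")
assert ids[-1] == tok.eos_token_id  # </s> is appended automatically

# users who already append </s> themselves do not get a second one
ids_manual = tok.encode("translate English to Romanian: I like turtles.</s>")
assert ids_manual.count(tok.eos_token_id) == 1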

On my machine, this change moves zero-shot validation BLEU from 27.87 to 27.65. Since the change is needed for finetuning, the empirical difference is small, and the drop doesn't reproduce on Stas' machine, I would recommend merging this.

If others want to test, the command below takes about 3 minutes to run on brutasse.

Zero-Shot BLEU Scores

For English -> Romanian, I grabbed the WMT English-Romanian dataset:

wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz

Then ran evaluation (without finetuning) on the validation split:

export DATA_DIR=wmt_en_ro
python run_eval.py t5-base \
    $DATA_DIR/val.source t5_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path t5_enro_bleu_eos.json \
    --task translation_en_to_ro \
    --device cuda \
    --fp16 \
    --bs 32

this branch (with EOS): 27.65
master (no EOS): 27.87

sacrebleu==1.4.3
torch==1.5.1

Will merge and fix tests if others have positive results.

@sshleifer sshleifer changed the title T5Tokenizer adds EOS token [WIP] T5Tokenizer adds EOS token Jul 18, 2020
@codecov bot commented Jul 23, 2020

Codecov Report

Merging #5866 into master will decrease coverage by 0.67%.
The diff coverage is 89.47%.


@@            Coverage Diff             @@
##           master    #5866      +/-   ##
==========================================
- Coverage   80.10%   79.42%   -0.68%     
==========================================
  Files         156      156              
  Lines       28411    28426      +15     
==========================================
- Hits        22758    22578     -180     
- Misses       5653     5848     +195     
| Impacted Files | Coverage Δ |
|---|---|
| src/transformers/tokenization_t5.py | 95.32% <89.47%> (-1.42%) ⬇️ |
| src/transformers/modeling_tf_openai.py | 22.58% <0.00%> (-72.26%) ⬇️ |
| src/transformers/tokenization_roberta.py | 87.67% <0.00%> (-10.96%) ⬇️ |
| src/transformers/tokenization_utils_base.py | 86.58% <0.00%> (-7.19%) ⬇️ |
| src/transformers/tokenization_transfo_xl.py | 38.73% <0.00%> (-3.76%) ⬇️ |
| src/transformers/tokenization_openai.py | 82.57% <0.00%> (-1.52%) ⬇️ |
| src/transformers/tokenization_utils_fast.py | 92.85% <0.00%> (-1.43%) ⬇️ |
| src/transformers/modeling_openai.py | 80.96% <0.00%> (-1.30%) ⬇️ |
| src/transformers/generation_tf_utils.py | 86.21% <0.00%> (-0.51%) ⬇️ |
| src/transformers/tokenization_bert.py | 91.07% <0.00%> (-0.45%) ⬇️ |
| ... and 2 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@stas00 (Contributor) commented Aug 15, 2020

FWIW, I get identical results of 27.84 with both this branch and master.

@sshleifer sshleifer requested review from patrickvonplaten, sgugger and LysandreJik and removed request for patrickvonplaten and sgugger August 16, 2020 17:30
@sgugger (Collaborator) left a comment

LGTM thanks! Just a few nits in the docs.

@patrickvonplaten (Contributor) left a comment

Great! Very clean implementation. I'm not sure what to say about the evaluation, but I think the functionality should be added anyway; users can set add_special_tokens=False if they want to opt out.
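A minimal sketch of that opt-out, using the standard tokenizer argument (tokenizer here is assumed to be a T5Tokenizer instance):

# skip the automatic </s> (and all other special-token handling) entirely
ids = tokenizer.encode("some input text", add_special_tokens=False)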

sshleifer and others added 2 commits August 20, 2020 14:58
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@LysandreJik (Member) left a comment

Great, nice addition. Checking whether the token is already present is maybe not the cleanest approach, since we don't do that in other implementations, but it's cool that this preserves backwards compatibility.

I believe we should align this with the other tokenizers' prepare-inputs methods in a future version, by always adding the EOS token even if there's already one in the input.

def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
    """Do not add eos again if user already added it."""
    if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
        return token_ids
    return token_ids + [self.eos_token_id]
A reviewer (Member) commented on the diff:

Could we raise a warning/info telling the user that this is handled by the method, and that adding it manually + using the function would result in two tokens being added in a future version?

@sshleifer (Author) replied:

sure.
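For reference, a hedged sketch of what such a warning could look like; the exact message and placement are assumptions, not necessarily the merged code:

from typing import List
import warnings

def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
    """Do not add eos again if user already added it."""
    if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
        # tell the user the method handles EOS, and that manual addition
        # may produce duplicated EOS tokens in a future version
        warnings.warn(
            f"This sequence already has {self.eos_token}. In future versions this "
            "behavior may lead to duplicated eos tokens being added."
        )
        return token_ids
    return token_ids + [self.eos_token_id]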

@sshleifer (Author) commented Aug 25, 2020

Happy to eventually remove the check to see if it's already there.

@LysandreJik (Member) commented

I think we can keep it like this for now, with a warning about future versions. Removing the check would create a breaking change for users, and I feel it would be especially hard to debug an unknown drop in performance caused by an extra token being added, right?

@sshleifer sshleifer changed the title [WIP] T5Tokenizer adds EOS token T5Tokenizer adds EOS token Aug 25, 2020
@sshleifer sshleifer changed the title T5Tokenizer adds EOS token T5Tokenizer adds EOS token if not already added Aug 25, 2020
@sshleifer sshleifer merged commit 6244957 into master Aug 25, 2020
@sshleifer sshleifer deleted the t5tok branch August 25, 2020 18:56
@ahoho commented Oct 17, 2020

Will this behavior cause problems for the unsupervised setting? Per the docs, </s> is not added during denoising training:

input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
model(input_ids=input_ids, labels=labels)
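With this PR's check, the explicit </s> in the labels string should not be doubled, since it already tokenizes to eos_token_id. A minimal sanity check (a sketch, assuming the same t5-base tokenizer):

ids = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>')
assert ids.count(tokenizer.eos_token_id) == 1  # no duplicate EOS expected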

Not sure if this will cause problems. (Also, as a somewhat related question: should the sentinel tokens in the labels be excluded from the loss in this setting, as I believe is the case with [MASK] in BERT?)

@sshleifer (Author) commented

I'm not sure about either question. I made an issue to verify the docs: #7904.
Feel free to open an issue about the sentinel-tokens question; I'd tag thomwolf/patrickvonplaten.

Zigur pushed a commit to Zigur/transformers that referenced this pull request on Oct 26, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020
Successfully merging this pull request may close these issues:

- Truncated Outputs by t5 fine-tuned models
- t5-base translation_en_to_de BLEU lower than the paper