
T5Tokenizer adds EOS token if not already added #5866

Merged
sshleifer merged 19 commits into master from t5tok on Aug 25, 2020

Conversation

@sshleifer (Contributor) commented Jul 18, 2020

T5Tokenizer should add </s> to the end of sequences. Since some users are doing this on their own, this PR only adds </s> if it has not already been added.
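For illustration, a minimal sketch of the intended behavior after this change (assuming the t5-base checkpoint is available; encode keeps its standard add_special_tokens=True default):

from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")

ids = tok.encode("translate English to Romanian: I like turtles.")
assert ids[-1] == tok.eos_token_id  # </s> is appended automatically

# users who already append </s> themselves do not get a second one
ids_manual = tok.encode("translate English to Romanian: I like turtles.</s>")
assert ids_manual.count(tok.eos_token_id) == 1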

On my machine, this change moves zero-shot validation BLEU from 27.87 to 27.65. Since the change is needed for finetuning, the empirical difference is small, and the drop doesn't reproduce on Stas' machine, I would recommend merging this.

If others want to test, the command below takes about 3 minutes to run on brutasse.

Zero-Shot BLEU Scores

For English -> Romanian, I grabbed the WMT English-Romanian dataset:

wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz

Then ran evaluation (without finetuning) on the validation split:

export DATA_DIR=wmt_en_ro
python run_eval.py t5-base \
    $DATA_DIR/val.source t5_val_generations.txt \
    --reference_path $DATA_DIR/val.target \
    --score_path t5_enro_bleu_eos.json \
    --task translation_en_to_ro \
    --device cuda \
    --fp16 \
    --bs 32

this branch (with EOS): 27.65
master (no EOS): 27.87

sacrebleu==1.4.3
torch==1.5.1

Will merge and fix tests if others have positive results.

@sshleifer sshleifer changed the title T5Tokenizer adds EOS token [WIP] T5Tokenizer adds EOS token Jul 18, 2020
@codecov bot commented Jul 23, 2020

Codecov Report

Merging #5866 into master will decrease coverage by 0.67%.
The diff coverage is 89.47%.


@@            Coverage Diff             @@
##           master    #5866      +/-   ##
==========================================
- Coverage   80.10%   79.42%   -0.68%     
==========================================
  Files         156      156              
  Lines       28411    28426      +15     
==========================================
- Hits        22758    22578     -180     
- Misses       5653     5848     +195     
| Impacted Files | Coverage Δ |
|---|---|
| src/transformers/tokenization_t5.py | 95.32% <89.47%> (-1.42%) ⬇️ |
| src/transformers/modeling_tf_openai.py | 22.58% <0.00%> (-72.26%) ⬇️ |
| src/transformers/tokenization_roberta.py | 87.67% <0.00%> (-10.96%) ⬇️ |
| src/transformers/tokenization_utils_base.py | 86.58% <0.00%> (-7.19%) ⬇️ |
| src/transformers/tokenization_transfo_xl.py | 38.73% <0.00%> (-3.76%) ⬇️ |
| src/transformers/tokenization_openai.py | 82.57% <0.00%> (-1.52%) ⬇️ |
| src/transformers/tokenization_utils_fast.py | 92.85% <0.00%> (-1.43%) ⬇️ |
| src/transformers/modeling_openai.py | 80.96% <0.00%> (-1.30%) ⬇️ |
| src/transformers/generation_tf_utils.py | 86.21% <0.00%> (-0.51%) ⬇️ |
| src/transformers/tokenization_bert.py | 91.07% <0.00%> (-0.45%) ⬇️ |
| ... and 2 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@stas00 (Contributor) commented Aug 15, 2020

FWIW, I get identical results of 27.84 with both this branch and master.

@sshleifer sshleifer requested review from patrickvonplaten, sgugger and LysandreJik and removed request for patrickvonplaten and sgugger August 16, 2020 17:30
@sgugger (Collaborator) left a comment

LGTM thanks! Just a few nits in the docs.

@patrickvonplaten (Contributor) left a comment

Great! Very clean implementation. I'm not sure what to say about the evaluation, but I think the functionality should be added anyway; users can set add_special_tokens=False if they want to opt out.
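A minimal sketch of that opt-out, using the standard tokenizer argument (tokenizer here is assumed to be a T5Tokenizer instance):

# skip the automatic </s> (and all other special-token handling) entirely
ids = tokenizer.encode("some input text", add_special_tokens=False)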

sshleifer and others added 2 commits August 20, 2020 14:58
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@LysandreJik (Member) left a comment

Great, nice addition. Checking whether the token is already present is maybe not the cleanest approach, since we don't do that in other implementations, but it's cool that this preserves backwards compatibility.

I believe we should align this with the other tokenizers' prepare-inputs methods in a future version, by always adding the EOS token even if there's already one in the input.

def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
    """Do not add eos again if user already added it."""
    if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
        return token_ids
    return token_ids + [self.eos_token_id]
A reviewer (Member) commented on the diff:

Could we raise a warning/info telling the user that this is handled by the method, and that adding it manually + using the function would result in two tokens being added in a future version?

@sshleifer (Author) replied:

sure.
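For reference, a hedged sketch of what such a warning could look like; the exact message and placement are assumptions, not necessarily the merged code:

from typing import List
import warnings

def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]:
    """Do not add eos again if user already added it."""
    if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id:
        # tell the user the method handles EOS, and that manual addition
        # may produce duplicated EOS tokens in a future version
        warnings.warn(
            f"This sequence already has {self.eos_token}. In future versions this "
            "behavior may lead to duplicated eos tokens being added."
        )
        return token_ids
    return token_ids + [self.eos_token_id]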

@sshleifer (Author) commented Aug 25, 2020

Happy to eventually remove the check to see if it's already there.

@LysandreJik (Member) commented

I think we can keep it like this for now, with a warning about future versions. Removing the check would create a breaking change for users, and I feel it would be especially hard to debug an unknown drop in performance caused by an extra token being added, right?

@sshleifer sshleifer changed the title [WIP] T5Tokenizer adds EOS token T5Tokenizer adds EOS token Aug 25, 2020
@sshleifer sshleifer changed the title T5Tokenizer adds EOS token T5Tokenizer adds EOS token if not already added Aug 25, 2020
@sshleifer sshleifer merged commit 6244957 into master Aug 25, 2020
@sshleifer sshleifer deleted the t5tok branch August 25, 2020 18:56
@ahoho commented Oct 17, 2020

Will this behavior cause problems for the unsupervised setting? Per the docs, </s> is not added during denoising training:

input_ids = tokenizer.encode('The <extra_id_0> walks in <extra_id_1> park', return_tensors='pt')
labels = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>', return_tensors='pt')
model(input_ids=input_ids, labels=labels)
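With this PR's check, the explicit </s> in the labels string should not be doubled, since it already tokenizes to eos_token_id. A minimal sanity check (a sketch, assuming the same t5-base tokenizer):

ids = tokenizer.encode('<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>')
assert ids.count(tokenizer.eos_token_id) == 1  # no duplicate EOS expected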

Not sure if this will cause problems. (Also, as a somewhat related question: should the sentinel tokens in the labels be excluded from the loss in this setting, as I believe is the case with [MASK] in BERT?)

@sshleifer (Author) commented

I'm not sure about either question. I made an issue to verify the docs: #7904.
Feel free to open an issue about the sentinel-tokens question; I'd tag thomwolf/patrickvonplaten.

Zigur pushed a commit to Zigur/transformers that referenced this pull request on Oct 26, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020
Successfully merging this pull request may close these issues:

- Truncated Outputs by t5 fine-tuned models
- t5-base translation_en_to_de BLEU lower than the paper