-
Notifications
You must be signed in to change notification settings - Fork 25.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
T5Tokenizer adds EOS token if not already added #5866
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5866 +/- ##
==========================================
- Coverage 80.10% 79.42% -0.68%
==========================================
Files 156 156
Lines 28411 28426 +15
==========================================
- Hits 22758 22578 -180
- Misses 5653 5848 +195
Continue to review full report at Codecov.
|
FWIW, I get identical results of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks! Just a few nits in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Very clean implementation. Not sure what to say about the evaluation. I think the functionality should be added anyways and the user can optionally set add_special_tokens=False
if he/she wants
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, nice addition. Maybe not the cleanest to check if there's already an existing token as we don't do that in other implementations, but cool that this brings backwards-compatibility.
I believe we should align this with other tokenizers' prepare inputs methods in a future version, by always adding the EOS token even if there's already one in the input.
def _add_eos_if_not_present(self, token_ids: List[int]) -> List[int]: | ||
"""Do not add eos again if user already added it.""" | ||
if len(token_ids) > 0 and token_ids[-1] == self.eos_token_id: | ||
return token_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we raise a warning/info telling the user that this is handled by the method, and that adding it manually + using the function would result in two tokens being added in a future version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
Happy to eventually remove the check to see if it's already there. |
I think we can keep it like this right now with a warning for future versions. It would create a breaking change to users, and I feel it would be especially hard to debug an unknown drop in performance due to an additional token being added, right? |
Will this behavior cause problems for the unsupervised setting? Per the docs,
Not sure if this will cause problems. (Also, Aa a somewhat related question, should the sentinel tokens in the |
I'm not sure about either question: |
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
T5 Tokenizer should add
</s>
to the end of sequences. Since some users are doing this on their own, this PR only adds</s>
if it has already been added.On my machine, this makes zero shot validation BLEU go from 27.87 -> 27.65. Since this change is needed for finetuning, and the empirical difference is small and doesn't happen on Stas' machine, I would recommend merging this.
If others want to test, the command takes about 3 mins to run on brutasse.
Zero Shot BLEU Scores
For english -> romanian
I grabbed the WMT english-romanian dataset:
Then ran evaluation (without finetuning) on the validation split:
(this branch) (with EOS):27.65
master (no EOS): 27.87
Will merge and fix tests if others have positive results.