Allow BPE to treat special tokens as one token #939
Conversation
/gcbrun
cool! left some initial comments
    )
    self.assertAllEqual(call_output, expected)

  def test_tokenize_with_special_tokens(self):
have you tested detokenization? That is probably worth checking out
yea, it's actually not affected by this PR. Here the special token has to be present inside the vocab, such as <|endoftext|>.
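For illustration, a toy example of why detokenization is unaffected (the vocab and ids below are made up, not the real tokenizer state): since the special token is just another vocab entry, detokenizing its id returns the original string unchanged.

  # Hypothetical vocab/ids, for illustration only.
  vocab = {0: "hello", 1: " world", 2: "<|endoftext|>"}
  ids = [0, 1, 2]
  print("".join(vocab[i] for i in ids))  # -> "hello world<|endoftext|>"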
    if special_tokens:

      def build_alt(i):
        return " " + "I" * i + "IGuessIamAValidWord" + "d" * i
heh this feels a little too janky :)
What if we...
- Move this function out (don't nest it) for readability.
- Take in the original special_token.
- Strip all splittable \s\p{L}\p{N} from the original special token.
- Add a unique prefix or suffix, f"Ĵ{i}" or something like that.
That is nice in that it won't expand the length of these characters much, and if someone is debugging intermediate state, it will look a little more readable.
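A minimal sketch of that suggestion (the function name, signature, and use of the third-party regex package for \p{...} classes are assumptions here, not the code actually merged):

  import regex  # third-party "regex" package, assumed for \p{L}/\p{N} support

  def build_alt(special_token, i):
    # Strip every splittable character (\s, \p{L}, \p{N}) from the original
    # special token, then append a unique suffix so each alternative is distinct.
    stripped = regex.sub(r"[\s\p{L}\p{N}]", "", special_token)
    return stripped + f"Ĵ{i}"

  # e.g. build_alt("<|endoftext|>", 0) -> "<||>Ĵ0"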
haha done.
/gcbrun
Thanks!
/gcbrun
There are some cases where BPE needs to treat special tokens as one token, e.g., "<|endoftext|>" in GPT-2.
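For context, a conceptual sketch of the idea (this is only an illustration of the desired behavior, not the implementation in this PR): the input is split around the special tokens first, so each special token maps directly to its vocab id while the remaining pieces go through the normal BPE merges.

  import re

  def split_on_special_tokens(text, special_tokens):
    # The capturing group keeps the special tokens themselves in the output.
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    return [piece for piece in re.split(pattern, text) if piece]

  print(split_on_special_tokens("hello world<|endoftext|>next doc",
                                ["<|endoftext|>"]))
  # -> ['hello world', '<|endoftext|>', 'next doc']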