Small fixes for special_tokens arg in BPE #969
Conversation
sequence_length=None,
add_prefix_space=False,
special_tokens=None,
special_tokens_lst=None,
I think a rename would make sense, but maybe we just name this `unsplittable_tokens` or something like that. That would be more literal as to what this is doing. Docstring...

unsplittable_tokens: A list of strings that will never be split during the word-level splitting applied before the byte-pair encoding. This can be used to ensure string special tokens map to a unique index in the vocabulary, even if these special tokens contain splittable characters such as punctuation.
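To illustrate the behavior the docstring describes, here is a minimal sketch of a word-level pre-tokenizer that protects unsplittable tokens (the function name and regex are hypothetical, not the actual BytePairTokenizer implementation):

```python
import re

def pre_tokenize(text, unsplittable_tokens=None):
    """Split text into words/punctuation, keeping unsplittable tokens whole.

    A toy sketch: without protection, the default split would break a
    special token like "</s>" into its punctuation characters, so it
    could never map to a single vocabulary index.
    """
    unsplittable_tokens = unsplittable_tokens or []
    if unsplittable_tokens:
        # Try to match any protected token first, then fall back to
        # plain words and single punctuation characters.
        protected = "|".join(re.escape(t) for t in unsplittable_tokens)
        pattern = f"({protected})|(\\w+)|([^\\w\\s])"
    else:
        pattern = r"(\w+)|([^\w\s])"
    return [m.group(0) for m in re.finditer(pattern, text)]

print(pre_tokenize("hello </s>"))
# ['hello', '<', '/', 's', '>']
print(pre_tokenize("hello </s>", unsplittable_tokens=["</s>"]))
# ['hello', '</s>']
```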
/gcbrun
pad_token = "<pad>"
end_token = "</s>"

if "unsplittable_tokens" in kwargs:
I agree that this is probably better UX, but just in terms of our own code maintenance, I think we can ditch this. If users pass this argument they will get an error message about duplicate kwargs, and can check the code here. It is fairly self-explanatory.
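The built-in error referred to here is Python's own duplicate keyword argument check. A minimal sketch (the class and argument wiring are hypothetical, not the real BartTokenizer code) of what a user would see without the explicit `kwargs` check:

```python
class Base:
    def __init__(self, unsplittable_tokens=None):
        self.unsplittable_tokens = unsplittable_tokens

class Sub(Base):
    def __init__(self, **kwargs):
        # The subclass sets `unsplittable_tokens` internally; a user
        # passing it too triggers Python's duplicate-kwarg TypeError.
        super().__init__(unsplittable_tokens=["<pad>", "</s>"], **kwargs)

try:
    Sub(unsplittable_tokens=["<mask>"])
except TypeError as err:
    print(err)
# ... got multiple values for keyword argument 'unsplittable_tokens'
```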
/gcbrun
LGTM!
Thanks! 2 minor comments.
# In the constructor, we pass the list of special tokens to the
# `unsplittable_tokens` arg of the superclass' constructor. Hence, we
# delete it from the config here.
del config["unsplittable_tokens"]
If we don't remove it from the config, will that just get written again at `from_config` time?
If we don't remove it from the config, it will get passed to the `BartTokenizer` constructor. I was advising the `del` as we don't actually want to support the user providing their own unsplittables at the Bart modeling level. `BartTokenizer` takes over `unsplittable_tokens` as an internal option.
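The round-trip being discussed can be sketched as follows (class names and config shape are simplified stand-ins, not the actual library code): the superclass records `unsplittable_tokens` in its config, but the subclass constructor sets it itself, so `from_config` must drop the key before forwarding the config as kwargs.

```python
class BytePairTokenizer:
    def __init__(self, unsplittable_tokens=None):
        self.unsplittable_tokens = unsplittable_tokens

    def get_config(self):
        # The superclass serializes its own __init__ args.
        return {"unsplittable_tokens": self.unsplittable_tokens}

class BartTokenizer(BytePairTokenizer):
    def __init__(self):
        # Fixed internally; not a user-facing option at the Bart level.
        super().__init__(unsplittable_tokens=["<pad>", "</s>"])

    @classmethod
    def from_config(cls, config):
        # Without this del, `unsplittable_tokens` would be forwarded
        # to a constructor that does not accept it.
        del config["unsplittable_tokens"]
        return cls(**config)

restored = BartTokenizer.from_config(BartTokenizer().get_config())
print(restored.unsplittable_tokens)
# ['<pad>', '</s>']
```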
and will tokenize a word with a leading space differently. Adding
a prefix space to the first word will cause it to be tokenized
equivalently to all subsequent words in the sequence.
unsplittable_tokens: list, defaults to None. A list of strings that will
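The `add_prefix_space` behavior described in the docstring above can be shown with a toy sketch (the vocabulary and the "Ġ" space-marker convention follow byte-level BPE practice, but this is an assumed illustration, not the real tokenizer):

```python
# Toy vocabulary: mid-sentence words carry the "Ġ" leading-space marker.
vocab = {"hello": 0, "Ġhello": 1, "Ġworld": 2}

def encode(text, add_prefix_space=False):
    if add_prefix_space:
        text = " " + text
    tokens = []
    for i, word in enumerate(text.split()):
        # Words preceded by a space get the "Ġ" marker, mirroring
        # byte-level BPE's treatment of leading spaces.
        marked = ("Ġ" + word) if (i > 0 or text.startswith(" ")) else word
        tokens.append(vocab[marked])
    return tokens

print(encode("hello world"))                         # [0, 2]
print(encode("hello world", add_prefix_space=True))  # [1, 2]
```

Note how the first word maps to a different id (0 vs 1) unless a prefix space is added, which makes it match the same vocabulary entries as all subsequent words.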
Maybe worth calling out these "tokens" must exist in the vocab.
sg! I can just add that as I merge
GCP testing looks good! The only failure was #960
- `special_tokens_lst` (since subclass `WhisperTokenizer` also has the same `special_tokens` arg). Open to suggestions.
- `special_tokens_lst` (open to a better explanation).
- `__init__` args to `get_config()`.