
Conversation

abheesht17 (Collaborator)

  • Renamed the arg to special_tokens_lst (since the subclass WhisperTokenizer also has a special_tokens arg). Open to suggestions.
  • Added a docstring for special_tokens_lst (open to a better explanation).
  • Added missing __init__ args to get_config() (a rough sketch of this pattern is included below).
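
For illustration, a minimal sketch of the get_config() contract referenced in that last item (a toy class, not the keras-nlp source; names are stand-ins): every __init__ argument needs to appear in the config, otherwise from_config(get_config()) cannot rebuild an equivalent tokenizer.

```python
# Toy example of the get_config() pattern: serialize every constructor argument.
class MyTokenizer:
    def __init__(self, vocabulary, sequence_length=None, add_prefix_space=False):
        self.vocabulary = vocabulary
        self.sequence_length = sequence_length
        self.add_prefix_space = add_prefix_space

    def get_config(self):
        return {
            "vocabulary": self.vocabulary,
            "sequence_length": self.sequence_length,
            # Constructor args that were previously missing from the config
            # must be serialized too, or from_config() silently drops them.
            "add_prefix_space": self.add_prefix_space,
        }

    @classmethod
    def from_config(cls, config):
        return cls(**config)
```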

abheesht17 requested a review from mattdangerw on April 6, 2023.
sequence_length=None,
add_prefix_space=False,
special_tokens=None,
special_tokens_lst=None,
Member
I think a rename would make sense, but maybe we just name this unsplittable_tokens or something like that. That would be more literal as to what this is doing. Docstring...

unsplittable_tokens: A list of strings that will never be split during the word-level splitting applied before the byte-pair encoding. This can be used to ensure special tokens map to a unique index in the vocabulary, even if these special tokens contain splittable characters such as punctuation.
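
To make the behavior in that docstring concrete, here is a standalone sketch in plain Python (not the keras-nlp implementation; the token and function names are made up): the word-level splitter carves unsplittable tokens out first, so they survive as single pieces and can map to a single vocabulary id.

```python
import re

# Rough illustration of protecting unsplittable tokens during word-level splitting.
UNSPLITTABLE = ["<|endoftext|>"]

def word_level_split(text, unsplittable):
    # Split out the unsplittable tokens first, keeping them as whole pieces.
    pattern = "|".join(re.escape(t) for t in unsplittable)
    pieces = re.split(f"({pattern})", text)
    words = []
    for piece in pieces:
        if piece in unsplittable:
            words.append(piece)  # kept intact -> one vocabulary entry
        else:
            # Ordinary word-level split: words and punctuation become separate pieces.
            words.extend(re.findall(r"\w+|[^\w\s]", piece))
    return words

print(word_level_split("hello world!<|endoftext|>", UNSPLITTABLE))
# ['hello', 'world', '!', '<|endoftext|>']
```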

@mattdangerw (Member)

/gcbrun

pad_token = "<pad>"
end_token = "</s>"

if "unsplittable_tokens" in kwargs:
Member
I agree that this is probably better UX, but just in terms of our own code maintenance, I think we can ditch this. If users pass this argument they will get an error message about duplicate kwargs, and can check the code here. It is fairly self-explanatory.
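
For reference, a toy sketch of the "duplicate kwargs" failure mode mentioned above (illustrative class names, not the real tokenizers): when the subclass always supplies the argument itself, a user who passes it too gets a built-in TypeError rather than a silent override.

```python
class Base:
    def __init__(self, unsplittable_tokens=None):
        self.unsplittable_tokens = unsplittable_tokens

class Sub(Base):
    def __init__(self, **kwargs):
        # The subclass reserves unsplittable_tokens for internal use.
        super().__init__(unsplittable_tokens=["<pad>", "</s>"], **kwargs)

Sub(unsplittable_tokens=["<s>"])
# TypeError: got multiple values for keyword argument 'unsplittable_tokens'
```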

@mattdangerw (Member)

/gcbrun

mattdangerw (Member) left a comment

LGTM!

chenmoneygithub (Contributor) left a comment

Thanks! 2 minor comments.

# In the constructor, we pass the list of special tokens to the
# `unsplittable_tokens` arg of the superclass' constructor. Hence, we
# delete it from the config here.
del config["unsplittable_tokens"]
Contributor
If we don't remove it from the config, will it just get written back in at from_config time?

Member
If we don't remove it from the config, it will get passed to the BartTokenizer constructor. I was advising the del since we don't actually want to support the user providing their own unsplittables at the Bart modeling level; BartTokenizer takes over unsplittable_tokens as an internal option.
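
A small sketch of the config round-trip under discussion (toy classes standing in for BytePairTokenizer and BartTokenizer, not the real code): without the del, from_config() would forward unsplittable_tokens back into __init__, where the Bart-level class already supplies it internally.

```python
class ToyBPETokenizer:
    def __init__(self, vocabulary, unsplittable_tokens=None):
        self.vocabulary = vocabulary
        self.unsplittable_tokens = unsplittable_tokens

    def get_config(self):
        return {
            "vocabulary": self.vocabulary,
            "unsplittable_tokens": self.unsplittable_tokens,
        }

    @classmethod
    def from_config(cls, config):
        return cls(**config)

class ToyBartTokenizer(ToyBPETokenizer):
    def __init__(self, vocabulary, **kwargs):
        # The Bart-level tokenizer always supplies unsplittable_tokens itself.
        super().__init__(vocabulary, unsplittable_tokens=["<s>", "</s>", "<pad>"], **kwargs)

    def get_config(self):
        config = super().get_config()
        # Drop the internal-only argument; otherwise from_config() would pass it
        # back to __init__ and collide with the value supplied above.
        del config["unsplittable_tokens"]
        return config

tok = ToyBartTokenizer(vocabulary={"<s>": 0, "</s>": 1, "<pad>": 2})
restored = ToyBartTokenizer.from_config(tok.get_config())  # round-trips cleanly
```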

and will tokenize a word with a leading space differently. Adding
a prefix space to the first word will cause it to be tokenized
equivalently to all subsequent words in the sequence.
unsplittable_tokens: list, defaults to None. A list of strings that will
Contributor
Maybe worth calling out that these "tokens" must exist in the vocab.

Member
sg! I can just add that as I merge

@mattdangerw (Member)

GCP testing looks good! The only failure was #960

mattdangerw merged commit 21f9bcc into keras-team:master on Apr 7, 2023.