Closed
Labels
good first issue, stat:contributions welcome
Description
We should add a `vocabulary_size` argument to the `UnicodeCharacterTokenizer` layer that clamps the output code point values to the range `[0, vocabulary_size)`.
We should also add a `vocabulary_size()` method, as with other tokenizers, that returns this value if it has been set and returns `None` otherwise.
Potential docstring:
vocabulary_size: Set the vocabulary size by clamping all code
    points to the range [0, vocabulary_size). Effectively this
    makes the `vocabulary_size - 1` id the OOV token.
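A minimal pure-Python sketch of the requested behavior might look like the following. `CharTokenizerSketch` is a hypothetical stand-in written for illustration, not the actual `UnicodeCharacterTokenizer` layer, and it works on plain Python strings rather than tensors. It assumes that clamping means mapping every code point at or above `vocabulary_size` to `vocabulary_size - 1`.

```python
from typing import List, Optional


class CharTokenizerSketch:
    """Hypothetical stand-in for illustration; not the real layer."""

    def __init__(self, vocabulary_size: Optional[int] = None):
        if vocabulary_size is not None and vocabulary_size < 1:
            raise ValueError("vocabulary_size must be a positive integer.")
        # Stored under a private name so it does not shadow the method below.
        self._vocabulary_size = vocabulary_size

    def tokenize(self, text: str) -> List[int]:
        # Each character becomes its Unicode code point.
        code_points = [ord(ch) for ch in text]
        if self._vocabulary_size is None:
            return code_points
        # Code points are never negative, so only the upper bound needs
        # clamping. Anything too large collapses onto the last id.
        oov_id = self._vocabulary_size - 1
        return [min(cp, oov_id) for cp in code_points]

    def vocabulary_size(self) -> Optional[int]:
        # Returns the configured size, or None if it was never set.
        return self._vocabulary_size


unbounded = CharTokenizerSketch()
ascii_only = CharTokenizerSketch(vocabulary_size=128)

print(unbounded.tokenize("añb"))     # [97, 241, 98]
print(ascii_only.tokenize("añb"))    # [97, 127, 98]
print(unbounded.vocabulary_size())   # None
print(ascii_only.vocabulary_size())  # 128
```

One consequence of this design is that a genuine character whose code point equals `vocabulary_size - 1` becomes indistinguishable from an out-of-vocabulary character after tokenization.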