Add a vocabulary_size argument to UnicodeCharacterTokenizer #155

@mattdangerw

We should add a vocabulary_size argument to the UnicodeCharacterTokenizer layer that clamps the output code point values to be in the range [0, vocabulary_size).

We should also add a vocabulary_size() method, as with other tokenizers, that returns this value if it has been set and None otherwise.

Potential docstring:

        vocabulary_size: The size of the vocabulary. If set, all code
            points are clamped to the range [0, vocabulary_size), which
            effectively makes `vocabulary_size - 1` the id of the OOV
            token.
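
To make the proposal concrete, here is a rough pure-Python sketch of the behavior described above. It is not the actual layer: `CodePointTokenizerSketch` and its `tokenize` method are hypothetical names used only for illustration, and the real implementation would operate on tensors rather than Python lists.

```python
class CodePointTokenizerSketch:
    """Illustrative stand-in, not the real UnicodeCharacterTokenizer."""

    def __init__(self, vocabulary_size=None):
        # Hypothetical constructor argument mirroring the proposed one.
        self._vocabulary_size = vocabulary_size

    def vocabulary_size(self):
        # Returns the configured size, or None if it was never set.
        return self._vocabulary_size

    def tokenize(self, text):
        # One id per character: its Unicode code point.
        ids = [ord(ch) for ch in text]
        if self._vocabulary_size is None:
            return ids
        # Clamp into [0, vocabulary_size): any code point at or above the
        # limit collapses onto the last id, which then acts as the OOV id.
        upper = self._vocabulary_size - 1
        return [min(max(i, 0), upper) for i in ids]


tokenizer = CodePointTokenizerSketch(vocabulary_size=128)
ids = tokenizer.tokenize("héllo")
print(ids)  # [104, 127, 108, 108, 111]
print(tokenizer.vocabulary_size())  # 128
print(CodePointTokenizerSketch().vocabulary_size())  # None
```

With `vocabulary_size=128`, the non-ASCII character `é` (code point 233) maps to id 127, the same id any other out-of-range character would receive.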

Labels

good first issue, stat:contributions welcome
