Add a vocabulary_size argument to UnicodeCharacterTokenizer

We should add a `vocabulary_size` argument to the `UnicodeCharacterTokenizer` layer that clamps the output code point values to be in the range `[0, vocabulary_size)`.

We should also add a `vocabulary_size()` method, as with other tokenizers, that returns this value if it has been set and returns `None` otherwise.

Potential docstring:

```
        vocabulary_size: Set the vocabulary size by clamping all code
            points to the range [0, vocabulary_size). Effectively, this
            makes `vocabulary_size - 1` the id of the OOV token.
```
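To make the proposed behavior concrete, here is a rough pure-Python sketch. The class `CodePointTokenizerSketch` is a hypothetical stand-in written for this issue, not the actual `UnicodeCharacterTokenizer` layer, and it ignores the layer's other arguments; the real implementation would presumably clamp tensors rather than Python lists.

```python
from typing import List, Optional


class CodePointTokenizerSketch:
    """Hypothetical stand-in for the tokenizer layer, for illustration only."""

    def __init__(self, vocabulary_size: Optional[int] = None):
        if vocabulary_size is not None and vocabulary_size < 1:
            raise ValueError("vocabulary_size must be a positive integer or None")
        self._vocabulary_size = vocabulary_size

    def vocabulary_size(self) -> Optional[int]:
        # Returns the configured size, or None if it was never set.
        return self._vocabulary_size

    def tokenize(self, text: str) -> List[int]:
        # One id per character: its Unicode code point.
        ids = [ord(ch) for ch in text]
        if self._vocabulary_size is None:
            return ids
        # Code points are never negative, so clamping to
        # [0, vocabulary_size) only needs an upper bound.
        oov_id = self._vocabulary_size - 1
        return [min(cp, oov_id) for cp in ids]


tokenizer = CodePointTokenizerSketch(vocabulary_size=128)
print(tokenizer.tokenize("hé"))     # "é" (code point 233) is clamped to 127
print(tokenizer.vocabulary_size())
```

One consequence of clamping worth noting: a character whose code point is exactly `vocabulary_size - 1` becomes indistinguishable from an OOV character.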

Add a vocabulary_size argument to UnicodeCharacterTokenizer #155
