Add add_bos, add_eos parameters to SentencePieceTokenizer. #1710

@briango28

Is your feature request related to a problem? Please describe.
SentencePieceTokenizer, a wrapper around tf_text.SentencepieceTokenizer, does not currently expose all of the parameters accepted by tf_text.SentencepieceTokenizer.__init__().
The parameters I am particularly interested in are add_bos and add_eos. As it stands, users must explicitly add the token ids for '<s>' and '</s>' (which default to 1 and 2) to the result of SentencePieceTokenizer.tokenize().
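
For illustration, the manual workaround looks roughly like the sketch below. It is plain Python on a list of ids rather than a real tokenizer call: the helper name add_special_tokens and the sample ids are hypothetical, and the ids 1 and 2 are only the defaults mentioned above, so a model trained with other settings may differ.

```python
# Default SentencePiece ids for '<s>' and '</s>', as noted above.
# Assumption: the model was trained with the default bos/eos ids.
BOS_ID = 1
EOS_ID = 2


def add_special_tokens(token_ids, add_bos=True, add_eos=True):
    """Return a new list with the BOS/EOS ids wrapped around token_ids."""
    out = list(token_ids)  # copy so the caller's list is not mutated
    if add_bos:
        out = [BOS_ID] + out
    if add_eos:
        out = out + [EOS_ID]
    return out


# Hypothetical ids standing in for the result of tokenize() on one sentence.
ids = [87, 412, 9]
wrapped = add_special_tokens(ids)
print(wrapped)
```

With batched or ragged tensor output the same step needs a concatenation along the last axis, which is the boilerplate this request would remove.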

Describe the solution you'd like
Add add_bos=False, add_eos=False to SentencePieceTokenizer.__init__(), save them in self.add_bos and self.add_eos, and use them in set_proto() when initializing tf_text.SentencepieceTokenizer.
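
A minimal sketch of the shape of that change, assuming nothing about the rest of the class. To keep it runnable without TensorFlow, FakeBackendTokenizer is a hypothetical stand-in for tf_text.SentencepieceTokenizer (it splits on whitespace and looks ids up in a dict), and SentencePieceTokenizerSketch is an illustrative name, not the library's actual class. The only part meant to carry over is how add_bos and add_eos are stored in __init__() and forwarded in set_proto().

```python
class FakeBackendTokenizer:
    """Hypothetical stand-in for tf_text.SentencepieceTokenizer.

    It only mimics the add_bos/add_eos keyword arguments; the "model"
    here is a plain dict mapping whitespace-separated words to ids.
    """

    BOS_ID = 1  # default id for '<s>'
    EOS_ID = 2  # default id for '</s>'

    def __init__(self, model, add_bos=False, add_eos=False):
        self.vocab = model
        self.add_bos = add_bos
        self.add_eos = add_eos

    def tokenize(self, text):
        ids = [self.vocab[word] for word in text.split()]
        if self.add_bos:
            ids = [self.BOS_ID] + ids
        if self.add_eos:
            ids = ids + [self.EOS_ID]
        return ids


class SentencePieceTokenizerSketch:
    """Illustrative wrapper showing the proposed parameter plumbing."""

    def __init__(self, proto=None, add_bos=False, add_eos=False):
        # Store the flags first so set_proto() can read them.
        self.add_bos = add_bos
        self.add_eos = add_eos
        self._backend = None
        if proto is not None:
            self.set_proto(proto)

    def set_proto(self, proto):
        # Forward the stored flags when building the wrapped tokenizer.
        self._backend = FakeBackendTokenizer(
            model=proto,
            add_bos=self.add_bos,
            add_eos=self.add_eos,
        )

    def tokenize(self, text):
        return self._backend.tokenize(text)


toy_vocab = {"hello": 5, "world": 6}  # hypothetical vocabulary
tokenizer = SentencePieceTokenizerSketch(toy_vocab, add_bos=True, add_eos=True)
print(tokenizer.tokenize("hello world"))
```

Defaulting both flags to False keeps the current behavior for existing callers.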

Describe alternatives you've considered
It's always possible to write a custom tokenizer by subclassing, but the change seemed trivial enough to make in the library itself.
