
Allow usage of tokenizedDocument in BERT tokenization #20

Open
bwdGitHub opened this issue Dec 10, 2021 · 0 comments
Labels
enhancement New feature or request

We would like to use these issues to gauge user interest.

The BERT tokenizer is intended as an identical reimplementation of the original BERT tokenization. However, it is possible to replace bert.tokenizer.internal.BasicTokenizer with a tokenizer based on tokenizedDocument.

The expectation is that this should not significantly affect model behavior, since the WordPiece encoding remains the same, and it is these WordPiece-encoded sub-tokens that form the input to the model.

The advantages are that tokenizedDocument is considerably faster than BasicTokenizer and may integrate better with other Text Analytics Toolbox functionality.
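A minimal sketch of what the swap could look like: tokenizedDocument handles the basic (whitespace/punctuation) tokenization, and the resulting tokens are then passed to the existing WordPiece step. Here tokenizedDocument and tokenDetails are real Text Analytics Toolbox functions, while wordpieceEncode is a hypothetical placeholder for whatever sub-token encoder the BERT implementation exposes internally.

```matlab
% Sketch: basic tokenization via tokenizedDocument instead of
% bert.tokenizer.internal.BasicTokenizer.
str = "The BERT tokenizer splits text into subwords.";

% Basic tokenization (Text Analytics Toolbox).
doc = tokenizedDocument(str);

% Extract the individual tokens as a string array.
details = tokenDetails(doc);
basicTokens = details.Token;

% Hypothetical WordPiece step: 'wordpieceEncode' stands in for the
% model's actual sub-token encoder, which is unchanged by this proposal.
% subTokens = wordpieceEncode(basicTokens, vocab);
```

Because the WordPiece step is untouched, any difference in model input would come only from where the basic tokenizer places token boundaries.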

@bwdGitHub bwdGitHub added the enhancement New feature or request label Dec 10, 2021