Improve additional vocabulary management #309

Merged
merged 9 commits into from Jun 19, 2020

@n1t0 n1t0 commented Jun 18, 2020

Fix #302

  • Extracted Additional vocabulary from Tokenizer
  • Added a normalized option to handle cases where tokens should be extracted from the normalized version of the text. For example, after adding a token like "yesterday", we expect it to match in the input text "Yesterday I saw a cat" when a Lowercase normalizer is in use.
  • Added missing tests
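
To illustrate the normalized option described in the list above, here is a minimal sketch in plain Python. It is not the library's implementation: find_added_token is a hypothetical helper, and str.lower stands in for a Lowercase normalizer.

```python
def find_added_token(text, token, normalized, normalizer=str.lower):
    """Return the index of `token` in `text`, or -1 if it does not match.

    Hypothetical helper for illustration only. When `normalized` is True,
    the search runs on the normalized text; otherwise on the raw text.
    """
    haystack = normalizer(text) if normalized else text
    return haystack.find(token)


text = "Yesterday I saw a cat"

# Searching the lowercased text: "yesterday" matches at the start.
match_normalized = find_added_token(text, "yesterday", normalized=True)

# Searching the raw text: the capital "Y" prevents a match.
match_raw = find_added_token(text, "yesterday", normalized=False)

print(match_normalized, match_raw)  # 0 -1
```

The same token either matches or does not depending only on which version of the text is searched, which is the case the option is meant to cover.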

This introduces a breaking change: AddedToken now requires a second argument, is_special_token: bool, which sets the default normalized behavior. Special tokens should not be normalized and are extracted exactly as given, while regular tokens are normalized by default. The normalized behavior can be set manually on any token to override this default.
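
The defaulting rule can be sketched roughly as follows. This is a plain-Python model, not the actual AddedToken type: AddedTokenSketch and its field layout are assumptions made for illustration, and only the is_special_token and normalized names come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddedTokenSketch:
    """Hypothetical stand-in for AddedToken, for illustration only."""

    content: str
    is_special_token: bool
    # None means "not set explicitly, derive it from is_special_token".
    normalized: Optional[bool] = None

    def __post_init__(self):
        if self.normalized is None:
            # Special tokens are matched as given; regular tokens are
            # matched against the normalized text by default.
            self.normalized = not self.is_special_token


special = AddedTokenSketch("[CLS]", is_special_token=True)
regular = AddedTokenSketch("yesterday", is_special_token=False)
# An explicit value overrides the default derived from is_special_token.
overridden = AddedTokenSketch("[CLS]", is_special_token=True, normalized=True)

print(special.normalized, regular.normalized, overridden.normalized)
# False True True
```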

@n1t0 n1t0 merged commit 58e29fd into master Jun 19, 2020
@n1t0 n1t0 deleted the improve-added-tokens branch June 19, 2020 14:43
Successfully merging this pull request may close these issues.

Some rough edges with add_tokens