Improve additional vocabulary management #309

n1t0 · 2020-06-18T15:04:38Z

Extracted Additional vocabulary from Tokenizer
Added a normalized option to handle cases where tokens should be extracted from the normalized version of the text. For example, after adding a token like "yesterday", we expect it to match in the input text "Yesterday I saw a cat" if a Lowercase normalizer is in use.
Added missing tests
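
To make the normalized option above concrete, here is a minimal toy sketch in plain Python. It does not use the library and is not how AddedVocabulary is implemented; the names `lowercase` and `find_token` are hypothetical, and a simple `str.lower()` stands in for the Lowercase normalizer. It only shows the difference between matching against the raw text and matching against the normalized text.

```python
from typing import Callable, Optional, Tuple


def lowercase(text: str) -> str:
    """Hypothetical stand-in for a Lowercase normalizer."""
    return text.lower()


def find_token(
    text: str,
    token: str,
    normalized: bool,
    normalizer: Callable[[str], str] = lowercase,
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of the first match of `token`, or None.

    With normalized=True the search runs on the normalized text,
    otherwise it runs on the raw input as given.
    """
    haystack = normalizer(text) if normalized else text
    start = haystack.find(token)
    if start == -1:
        return None
    return (start, start + len(token))


text = "Yesterday I saw a cat"
print(find_token(text, "yesterday", normalized=True))   # (0, 9)
print(find_token(text, "yesterday", normalized=False))  # None
```

The offsets line up with the original text here only because lowercasing ASCII keeps the length unchanged; a real normalizer can change lengths, so a real implementation needs to track alignment between the normalized and original text.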

This introduces a breaking change: AddedToken now requires a second argument, is_special_token: bool, which sets the default normalized behavior. Special tokens should not be normalized and are extracted as given, while regular tokens are normalized by default. The normalized behavior can be set manually on any token to override the default.
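
The defaulting rule described above can be sketched as follows. This is an illustration only: `AddedTokenSketch` is a hypothetical class written for this example, not the library's AddedToken, and the constructor shape is an assumption. It shows normalized defaulting to the opposite of is_special_token, with an explicit value taking precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddedTokenSketch:
    """Hypothetical model of the defaulting rule, not the real AddedToken."""

    content: str
    is_special_token: bool
    # None means "not set explicitly": fall back to the default rule.
    normalized: Optional[bool] = None

    def __post_init__(self) -> None:
        if self.normalized is None:
            # Special tokens are matched as given; regular tokens are
            # matched against the normalized text by default.
            self.normalized = not self.is_special_token


word = AddedTokenSketch("yesterday", is_special_token=False)
special = AddedTokenSketch("[MASK]", is_special_token=True)
override = AddedTokenSketch("[MASK]", is_special_token=True, normalized=True)

print(word.normalized, special.normalized, override.normalized)
# True False True
```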

n1t0 added 7 commits June 16, 2020 14:42

Rust - Extract AddedVocabulary management from Tokenizer

66be62b

Rust - Add slice and slice_bytes to NormalizedString

7dff86b

Rust - Add AddedVocabulary + normalized option on AddedToken

397cc53

Rust - Fix/Tweak AddedVocabulary + Fix python tests

c6f633e

AddedVocabulary - Add tests, update bindings + various tweaks

fc63d56

Rust - Fix byte-level decoding for added tokens

b92d739

Update CHANGELOGs

4c7a0ff

n1t0 force-pushed the improve-added-tokens branch from 7cedb13 to 4c7a0ff June 18, 2020 18:50

n1t0 added 2 commits June 19, 2020 10:18

Python - Update AddedToken repr

63edb95

Python - Make AddedToken picklable

898a4a8

n1t0 merged commit 58e29fd into master Jun 19, 2020

n1t0 deleted the improve-added-tokens branch June 19, 2020 14:43
