We can now encode both pre-tokenized inputs and raw strings. This is especially useful when
processing datasets that are already pre-tokenized, as for NER (Named Entity Recognition), and helps
when applying labels to each word.
Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file, to later
load it back with just one line of code. That's what sharing a Tokenizer means now: 1 line of code.
With the serialization comes the compatibility with Pickle! The Tokenizer, all of its components,
Encodings, everything can be pickled!
Training a tokenizer is now even faster (up to 5-10x) than before!
Compatibility with multiprocessing, even when using the fork start method. Since this library
makes heavy use of the multithreading capabilities of our computers to allow very fast tokenization,
this led to problems (deadlocks) when used with multiprocessing. This version now allows
disabling the parallelism, and will warn you if this is necessary.
And a lot of other improvements and fixes.
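As a minimal sketch of the multiprocessing point above: the TOKENIZERS_PARALLELISM environment variable (listed under Added, #311) can be set before any worker process is started. The value "false" and the helper name `child_sees_setting` are assumptions made for illustration; the snippet only shows that child processes inherit the setting and does not call the library itself.

```python
import os
import subprocess
import sys

# Set the variable early, before any tokenizer work and before any
# worker process is created, so every child inherits it.
# Assumption: "false" is the value that disables the parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"


def child_sees_setting() -> str:
    """Start a child Python process and return the value it inherited."""
    result = subprocess.run(
        [
            sys.executable,
            "-c",
            "import os; print(os.environ.get('TOKENIZERS_PARALLELISM'))",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Setting the variable in the shell before launching the script (for example `TOKENIZERS_PARALLELISM=false python train.py`, where `train.py` is a placeholder name) should have the same effect.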
Fixed
[#286]: Fix various crashes when training a BPE model
[#309]: Fixed a few bugs related to additional vocabulary/tokens
Added
[#272]: Serialization of the Tokenizer and all the parts (PreTokenizer, Normalizer, ...).
This adds some methods to easily save/load an entire tokenizer (from_str, from_file).
[#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure
activation of the Tensor Cores by padding to a multiple of 8. Use with enable_padding(pad_to_multiple_of=8) for example.
[#298]: Ability to get the currently set truncation/padding params
[#311]: Ability to enable/disable the parallelism using the TOKENIZERS_PARALLELISM environment
variable. This is especially useful when using multiprocessing capabilities with the fork
start method, which happens to be the default on Linux systems. Without disabling the parallelism,
the process deadlocks while encoding. (Cf. [#187] for more information)
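To make the pad_to_multiple_of option from #289 concrete, here is a small sketch of the arithmetic involved: a length is rounded up to the next multiple of the given value. The helper names `padded_length` and `pad_ids` and the pad id of 0 are hypothetical; this illustrates the rounding only and is not the library's implementation.

```python
from typing import List


def padded_length(length: int, multiple: int = 8) -> int:
    """Round `length` up to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be a positive integer")
    return ((length + multiple - 1) // multiple) * multiple


def pad_ids(ids: List[int], pad_id: int = 0, multiple: int = 8) -> List[int]:
    """Append `pad_id` until the sequence length is a multiple of `multiple`."""
    target = padded_length(len(ids), multiple)
    return ids + [pad_id] * (target - len(ids))


print(padded_length(13))    # 16
print(padded_length(16))    # 16 (already a multiple of 8)
print(pad_ids([5, 6, 7]))   # [5, 6, 7, 0, 0, 0, 0, 0]
```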
Changed
Improved errors generated during truncation: cases where the provided max length is too low are
now handled properly.
[#249]: encode and encode_batch now accept pre-tokenized inputs. When the input is pre-tokenized,
the argument is_pretokenized=True must be specified.
[#276]: Improve BPE training speeds, by reading files sequentially, but parallelizing the
processing of each file
[#280]: Use onig for byte-level pre-tokenization to remove all the differences with the original
implementation from GPT-2
[#309]: Improved the management of the additional vocabulary. This introduces an option normalized, controlling whether a token should be extracted from the normalized version of the
input text.
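The pre-tokenized input support from #249 is aimed at word-labeled data such as NER, where each word's label has to be carried over to every sub-token produced for that word. The sketch below illustrates that label propagation with a hypothetical toy splitter (`toy_subword_split`) standing in for a real model; it does not call the library. With the library, the word list would instead be passed to encode with is_pretokenized=True.

```python
from typing import Callable, List, Tuple


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a sub-word model: cut into fixed-size chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)] or [word]


def align_labels(
    words: List[str],
    labels: List[str],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[str]]:
    """Split each word and repeat its label for every resulting sub-token."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    tokens: List[str] = []
    token_labels: List[str] = []
    for word, label in zip(words, labels):
        pieces = split(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))
    return tokens, token_labels


words = ["Hugging", "Face", "is", "in", "Brooklyn"]
labels = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
tokens, token_labels = align_labels(words, labels)
print(tokens)        # ['Hugg', 'ing', 'Face', 'is', 'in', 'Broo', 'klyn']
print(token_labels)  # ['B-ORG', 'B-ORG', 'I-ORG', 'O', 'O', 'B-LOC', 'B-LOC']
```

Repeating the word label on every sub-token is only one possible convention; some setups label the first sub-token and mask the rest.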