Releases: huggingface/tokenizers
Python v0.4.0
Python v0.3.0
Changes:
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`).
- Added `WordLevel`: a new model that simply maps tokens to their ids.
- Improved truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a language model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
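The `WordLevel` model and the new overflow handling can be sketched in plain Python. Note that `word_level_encode` and `truncate_with_overflow` below are hypothetical helper names used for illustration, not part of the tokenizers API:

```python
def word_level_encode(tokens, vocab, unk_id=0):
    """WordLevel-style model: map each token directly to its id.

    Tokens missing from the vocabulary fall back to unk_id.
    """
    return [vocab.get(tok, unk_id) for tok in tokens]

def truncate_with_overflow(ids, max_length, stride=0):
    """Split ids into a main chunk plus overflowing chunks.

    Each overflowing chunk keeps `stride` ids of context from the
    previous chunk, mirroring the overflow behavior described above.
    """
    step = max_length - stride
    assert step > 0, "stride must be smaller than max_length"
    chunks = [ids[i:i + max_length] for i in range(0, len(ids), step)]
    return chunks[0], chunks[1:]

vocab = {"hello": 1, "world": 2}
ids = word_level_encode(["hello", "world", "hello", "oov"], vocab)
print(ids)  # [1, 2, 1, 0]
main, overflowing = truncate_with_overflow(ids, max_length=2)
print(main, overflowing)  # [1, 2] [[1, 0]]
```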
Bug fixes:
- Fix a bug with IndexableString
- Fix a bug with truncation
Python v0.2.1
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5.
Python v0.2.0
In this release, we fixed some inconsistencies between `BPETokenizer` and the original Python version of this tokenizer. If you created your own vocabulary using this tokenizer, you will need to either train a new one, or use a modified version where you set the `PreTokenizer` back to `Whitespace` (instead of `WhitespaceSplit`).
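The practical difference between the two pre-tokenizers can be sketched in plain Python, assuming `Whitespace` splits on the word-or-punctuation pattern `\w+|[^\w\s]+` while `WhitespaceSplit` splits on whitespace only; the helper names below are illustrative, not library functions:

```python
import re

def whitespace_split(text):
    # WhitespaceSplit: split on whitespace only, like str.split()
    return text.split()

def whitespace(text):
    # Whitespace: split into words OR runs of punctuation, so
    # punctuation becomes its own token (pattern \w+|[^\w\s]+)
    return re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_split("Hey friend!"))  # ['Hey', 'friend!']
print(whitespace("Hey friend!"))        # ['Hey', 'friend', '!']
```

Because `Whitespace` separates punctuation, the two pre-tokenizers produce different token streams and therefore incompatible vocabularies, which is why a vocabulary trained with one cannot be reused with the other.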
Python v0.1.1
- Fix a bug where special tokens get split while encoding.
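The intended behavior (special tokens kept intact rather than split by the tokenizer) can be sketched in plain Python; `encode_with_specials` is a hypothetical illustration of the idea, not the library's implementation:

```python
import re

def encode_with_specials(text, special_tokens, tokenize):
    """Protect special tokens (e.g. "[CLS]", "[SEP]") from splitting.

    Cut the text around the special tokens first, then apply the
    normal tokenizer only to the plain spans in between.
    """
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            tokens.append(piece)       # keep special tokens whole
        elif piece:
            tokens.extend(tokenize(piece))
    return tokens

print(encode_with_specials("[CLS] hello world [SEP]",
                           ["[CLS]", "[SEP]"], str.split))
# ['[CLS]', 'hello', 'world', '[SEP]']
```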
Python v0.1.0
Bump python version for release