Releases: huggingface/tokenizers
Rust v0.10.1
Fixed
- [#226]: Fix the word indexes when there are special tokens
Rust v0.10.0
Rust v0.9.0
Changed
- Only one progress bar while reading files during training. This is better for use-cases with a
  high number of files, as it avoids having too many progress bars on screen. It also avoids reading
  the size of each file before actually reading them, as this could take a long time.
- [#190]: Improved `BPE` and `WordPiece` builders
- [#193]: `encode` and `encode_batch` now take a new argument, specifying whether the special
  tokens should be added
- [#197]: The `NormalizedString` has been removed from the `Encoding`. It is now possible to
  retrieve it by calling `normalize` on the `Tokenizer`. This brings a 70% reduction of the memory
  footprint
- [#197]: The `NormalizedString` API has been improved. It is now possible to retrieve parts of
  both strings using either "normalized" or "original" offsets
- [#197]: The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore
- `AddedToken` is now used for both `add_special_tokens` and `add_tokens`. Also, these `AddedToken`
  have more options to allow various behaviors.
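The switch to original-string offsets can be illustrated without the library itself. The sketch below is a simplified, hypothetical stand-in for what a `NormalizedString`-style alignment enables: a normalization (lowercasing plus accent stripping, chosen here purely for illustration) that records, for each normalized character, the original character it came from, so a span found in the normalized text can be mapped back to "original" offsets.

```python
import unicodedata

def normalize_with_alignment(original: str):
    """Lowercase and strip combining accents, keeping, for each
    normalized character, the index of the original character it
    came from (a simplified stand-in for NormalizedString)."""
    normalized_chars = []
    alignment = []  # normalized index -> original index
    for orig_idx, ch in enumerate(original):
        for decomposed in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.category(decomposed) == "Mn":
                continue  # drop the combining accent mark
            normalized_chars.append(decomposed)
            alignment.append(orig_idx)
    return "".join(normalized_chars), alignment

original = "Héllo"
normalized, alignment = normalize_with_alignment(original)
print(normalized)           # hello
# A token spanning normalized[1:4] ("ell") maps back to the original string:
start, end = alignment[1], alignment[3] + 1
print(original[start:end])  # éll
```

Because normalization can change string length, offsets computed on the normalized text are meaningless against the original input without such an alignment, which is what makes original-relative offsets the more useful default.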
Added
- [#188]: `impl PostProcessor for ByteLevel`: handles trimming the offsets if activated. This
  avoids the unintuitive inclusion of the whitespace in the produced offsets, even when this
  whitespace is part of the actual token
- More alignment mappings on the `Encoding`
- `post_process` can be called on the `Tokenizer`
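To illustrate the offset-trimming idea (not the actual `ByteLevel` implementation, which lives on the Rust side): byte-level tokens typically carry their leading whitespace, so a naive offset for the second token of `"Hello world"` covers the space too. A minimal, hypothetical `trim_offsets` helper could shrink such spans:

```python
def trim_offsets(text: str, offsets: list) -> list:
    """Shrink each (start, end) span so it no longer covers the
    leading/trailing whitespace a byte-level token may carry."""
    trimmed = []
    for start, end in offsets:
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        trimmed.append((start, end))
    return trimmed

text = "Hello world"
# A byte-level tokenization keeps the space inside the second token: " world"
raw = [(0, 5), (5, 11)]
print(trim_offsets(text, raw))  # [(0, 5), (6, 11)]
```

With trimming active, a span points only at the visible characters of the token, which matches what most users expect when highlighting tokens in the original text.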
Fixed
- [#193]: Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space` is activated
  - [#156]: when a Unicode character gets split up into multiple byte-level characters
- Fix a bug where offsets were wrong when there were added tokens in the sequence being encoded
- [#175]: Fix a bug that prevented the addition of more than a certain amount of tokens (even
  though adding that many tokens is not advised)
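The multi-byte issue in [#156] stems from a byte-level model indexing UTF-8 bytes while users expect character offsets; a character like `é` occupies two bytes. The stdlib-only sketch below (the `byte_to_char_offsets` helper is hypothetical, not part of the library) shows how a byte span can be mapped back to a character span:

```python
def byte_to_char_offsets(text: str, byte_start: int, byte_end: int):
    """Convert a span expressed in UTF-8 byte positions (what a
    byte-level model sees) into character positions in `text`."""
    # Cumulative byte length before each character.
    cum = [0]
    for ch in text:
        cum.append(cum[-1] + len(ch.encode("utf-8")))
    # First character whose byte range reaches past byte_start,
    # and last character whose byte range starts before byte_end.
    start = next(i for i in range(len(text)) if cum[i + 1] > byte_start)
    end = next(i + 1 for i in reversed(range(len(text))) if cum[i] < byte_end)
    return start, end

text = "héllo"
print(len(text.encode("utf-8")))          # 6 bytes for 5 characters: "é" takes 2
# Bytes 1..3 are the two bytes of "é"; they map back to the single character at 1..2
print(byte_to_char_offsets(text, 1, 3))   # (1, 2)
```

A span that cuts through the middle of a multi-byte character is rounded out to cover the whole character, which is the behavior needed to keep offsets meaningful on the original string.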
How to migrate
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant.
Rust v0.8.0
Python v0.6.0
Python v0.5.2
Fixes:
- We introduced a bug related to the saving of the WordPiece model in 0.5.2: the `vocab.txt` file
  was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary in the wrong format.
Python v0.5.1
Changes:
- The `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not
  specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.
Python v0.5.0
Changes:
- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has `dropout` (thanks @colinclement with #149)
- Added a new `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers
  (especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- Expose `__len__` on `Encoding` (cf #139)
- Improved padding performances.
Python v0.4.2
Fixes:
- Fix a bug in the `WordPieceTrainer` class that prevented `BertWordPieceTokenizer` from being
  trained (cf #137)
Python v0.4.1
Fixes:
- Fix a bug related to the punctuation in `BertWordPieceTokenizer` (thanks to @Mansterteddy with #134)