Releases: huggingface/tokenizers

Rust v0.10.1

09 Apr 15:37

Fixed

  • [#226]: Fix the word indexes when there are special tokens

Rust v0.10.0

08 Apr 18:23

Changed

  • [#222]: All of the Tokenizer's subparts must now be Send + Sync

Added

  • [#208]: Ability to retrieve the vocabulary from the Tokenizer & Model
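
This change is also exposed through the Python bindings; here is a minimal sketch, assuming a
bindings version that includes it (the toy tokens are arbitrary placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy tokenizer around an empty BPE model, with a couple of added tokens
tokenizer = Tokenizer(BPE())
tokenizer.add_tokens(["hello", "world"])

# New: retrieve the vocabulary as a {token: id} dict, optionally
# including the added tokens
vocab = tokenizer.get_vocab(with_added_tokens=True)
print(vocab)  # e.g. {'hello': 0, 'world': 1}
```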

Fixed

  • [#205]: Trim the decoded string in BPEDecoder
  • [b770f36]: Fix a bug with the IDs generated for added tokens

Rust v0.9.0

26 Mar 21:23

Changed

  • Only one progress bar is displayed while reading files during training. This is better for
    use cases with a large number of files, as it avoids cluttering the screen with progress
    bars. It also avoids reading the size of each file before actually reading it, which could
    take a long time.
  • [#190]: Improved BPE and WordPiece builders
  • [#193]: encode and encode_batch now take a new argument specifying whether the special
    tokens should be added (see the sketch after this list)
  • [#197]: The NormalizedString has been removed from the Encoding. It is now possible to
    retrieve it by calling normalize on the Tokenizer. This reduces the memory footprint by 70%.
  • [#197]: The NormalizedString API has been improved. It is now possible to retrieve parts of
    both strings using either "normalized" or "original" offsets
  • [#197]: The offsets provided on Encoding are now relative to the original string, and not the
    normalized one anymore
  • AddedToken is now used for both add_special_tokens and add_tokens. These AddedTokens also
    come with more options to allow various behaviors.
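
A minimal sketch of both changes through the Python bindings, assuming a bindings version that
mirrors them (all tokens shown are arbitrary placeholders):

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# AddedToken now drives both kinds of added tokens, with extra options
# controlling how they match in the input
tokenizer.add_special_tokens([AddedToken("[CLS]"), AddedToken("[SEP]")])
tokenizer.add_tokens([AddedToken("hello", single_word=True)])

# encode now takes an explicit flag saying whether the post-processor
# should add the special tokens
with_special = tokenizer.encode("hello", add_special_tokens=True)
without_special = tokenizer.encode("hello", add_special_tokens=False)
```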

Added

  • [#188]: impl PostProcessor for ByteLevel: handles trimming the offsets if activated. This
    avoids the unintuitive inclusion of whitespace in the produced offsets, even when this
    whitespace is part of the actual token
  • More alignment mappings on the Encoding.
  • post_process can be called on the Tokenizer

Fixed

  • [#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
    • when add_prefix_space is activated
    • [#156]: when a Unicode character gets split up into multiple byte-level characters
  • Fix a bug where offsets were wrong when there were any added tokens in the sequence being
    encoded.
  • [#175]: Fix a bug that prevented the addition of more than a certain number of tokens (not
    advisable, but it should work nonetheless)

How to migrate

  • Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant, as shown
    below.
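
A minimal sketch of this migration through the Python bindings (the empty model stands in for a
trained byte-level BPE model):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.processors import ByteLevel as ByteLevelProcessor

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevelPreTokenizer(add_prefix_space=True)

# Migration: attach the new ByteLevel post-processor so the produced
# offsets get trimmed of the byte-level whitespace
tokenizer.post_processor = ByteLevelProcessor(trim_offsets=True)
```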

Rust v0.8.0

02 Mar 19:53

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Do not open all files directly while training (#163)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got
    split up into multiple bytes (cf #156)
  • The LongestFirst truncation strategy had a bug (#174)

Python v0.6.0

02 Mar 20:03

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Some default tokens were missing from BertWordPieceTokenizer (cf #160)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got
    split up into multiple bytes (cf #156)
  • The longest_first truncation strategy had a bug (#174)

Python v0.5.2

24 Feb 21:10

Fixes:

  • We introduced a bug related to the saving of the WordPiece model in 0.5.1: the vocab.txt
    file was named vocab.json. This is now fixed.
  • The WordLevel model was also saving its vocabulary in the wrong format.

Python v0.5.1

24 Feb 15:16

Changes:

  • The name argument is now optional when saving a Model's vocabulary. When the name is not
    specified, the files get more generic names, such as vocab.json or merges.txt.
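
A minimal sketch with the save API of this release (the output directory and name are
placeholders):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# ... train the tokenizer on some files ...

# With a name: produces my-tokenizer-vocab.json and my-tokenizer-merges.txt
tokenizer.save("./out", "my-tokenizer")

# Without a name (new): produces vocab.json and merges.txt
tokenizer.save("./out")
```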

Python v0.5.0

18 Feb 23:59

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • do_lowercase has been renamed to lowercase for consistency between the different tokenizers
    (especially ByteLevelBPETokenizer and CharBPETokenizer); see the sketch after this list
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.
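
A minimal sketch of the renamed argument, the new dropout option, and the new Strip normalizer
(the values are arbitrary):

```python
from tokenizers import ByteLevelBPETokenizer
from tokenizers.normalizers import Strip

# `lowercase` replaces the former `do_lowercase` argument, and BPE
# dropout can now be enabled with a probability between 0 and 1
tokenizer = ByteLevelBPETokenizer(dropout=0.1, lowercase=True)

# The new Strip normalizer removes leading and/or trailing whitespace
strip_normalizer = Strip(left=True, right=True)
```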

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer

Python v0.4.2

11 Feb 13:24

Fixes:

  • Fix a bug in the WordPieceTrainer class that prevented BertWordPieceTokenizer from being
    trained (cf #137)

Python v0.4.1

11 Feb 04:34

Fixes:

  • Fix a bug related to punctuation in BertWordPieceTokenizer (thanks to @Mansterteddy with #134)