Releases: huggingface/tokenizers

Rust v0.10.1

09 Apr 15:37

Fixed

  • [#226]: Fix the word indexes when there are special tokens

Rust v0.10.0

08 Apr 18:23

Changed

  • [#222]: All of the Tokenizer's subparts must now be Send + Sync

Added

  • [#208]: Ability to retrieve the vocabulary from the Tokenizer & Model
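
This change is also exposed through the Python bindings; here is a minimal sketch, assuming a
bindings version that includes it (the toy tokens are arbitrary placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy tokenizer around an empty BPE model, with a couple of added tokens
tokenizer = Tokenizer(BPE())
tokenizer.add_tokens(["hello", "world"])

# New: retrieve the vocabulary as a {token: id} dict, optionally
# including the added tokens
vocab = tokenizer.get_vocab(with_added_tokens=True)
print(vocab)  # e.g. {'hello': 0, 'world': 1}
```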

Fixed

  • [#205]: Trim the decoded string in BPEDecoder
  • [b770f36]: Fix a bug with the IDs generated for added tokens

Rust v0.9.0

26 Mar 21:23

Changed

  • Only one progress bar is displayed while reading files during training. This is better for
    use cases with a large number of files, as it avoids cluttering the screen with progress
    bars. It also avoids reading the size of each file before actually reading it, which could
    take a long time.
  • [#190]: Improved BPE and WordPiece builders
  • [#193]: encode and encode_batch now take a new argument specifying whether the special
    tokens should be added (see the sketch after this list)
  • [#197]: The NormalizedString has been removed from the Encoding. It is now possible to
    retrieve it by calling normalize on the Tokenizer. This reduces the memory footprint by 70%.
  • [#197]: The NormalizedString API has been improved. It is now possible to retrieve parts of
    both strings using either "normalized" or "original" offsets
  • [#197]: The offsets provided on Encoding are now relative to the original string, and not the
    normalized one anymore
  • AddedToken is now used for both add_special_tokens and add_tokens. These AddedTokens also
    come with more options to allow various behaviors.
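
A minimal sketch of both changes through the Python bindings, assuming a bindings version that
mirrors them (all tokens shown are arbitrary placeholders):

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# AddedToken now drives both kinds of added tokens, with extra options
# controlling how they match in the input
tokenizer.add_special_tokens([AddedToken("[CLS]"), AddedToken("[SEP]")])
tokenizer.add_tokens([AddedToken("hello", single_word=True)])

# encode now takes an explicit flag saying whether the post-processor
# should add the special tokens
with_special = tokenizer.encode("hello", add_special_tokens=True)
without_special = tokenizer.encode("hello", add_special_tokens=False)
```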

Added

  • [#188]: impl PostProcessor for ByteLevel: handles trimming the offsets if activated. This
    avoids the unintuitive inclusion of whitespace in the produced offsets, even when this
    whitespace is part of the actual token
  • More alignment mappings on the Encoding.
  • post_process can be called on the Tokenizer

Fixed

  • [#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
    • when add_prefix_space is activated
    • [#156]: when a Unicode character gets split up into multiple byte-level characters
  • Fix a bug where offsets were wrong when there were any added tokens in the sequence being
    encoded.
  • [#175]: Fix a bug that prevented the addition of more than a certain number of tokens (not
    advisable, but it should work nonetheless)

How to migrate

  • Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant, as shown
    below.
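
A minimal sketch of this migration through the Python bindings (the empty model stands in for a
trained byte-level BPE model):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.processors import ByteLevel as ByteLevelProcessor

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevelPreTokenizer(add_prefix_space=True)

# Migration: attach the new ByteLevel post-processor so the produced
# offsets get trimmed of the byte-level whitespace
tokenizer.post_processor = ByteLevelProcessor(trim_offsets=True)
```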

Rust v0.8.0

02 Mar 19:53

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Do not open all files directly while training (#163)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got
    split up into multiple bytes (cf #156)
  • The LongestFirst truncation strategy had a bug (#174)

Python v0.6.0

02 Mar 20:03

Changes:

  • Big improvements in speed for BPE (both training and tokenization) (#165)

Fixes:

  • Some default tokens were missing from BertWordPieceTokenizer (cf #160)
  • There was a bug in the ByteLevel PreTokenizer that caused offsets to be wrong if a char got
    split up into multiple bytes (cf #156)
  • The longest_first truncation strategy had a bug (#174)

Python v0.5.2

24 Feb 21:10

Fixes:

  • We introduced a bug related to the saving of the WordPiece model in 0.5.1: the vocab.txt
    file was named vocab.json. This is now fixed.
  • The WordLevel model was also saving its vocabulary in the wrong format.

Python v0.5.1

24 Feb 15:16

Changes:

  • The name argument is now optional when saving a Model's vocabulary. When the name is not
    specified, the files get more generic names, such as vocab.json or merges.txt.
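
A minimal sketch with the save API of this release (the output directory and name are
placeholders):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
# ... train the tokenizer on some files ...

# With a name: produces my-tokenizer-vocab.json and my-tokenizer-merges.txt
tokenizer.save("./out", "my-tokenizer")

# Without a name (new): produces vocab.json and merges.txt
tokenizer.save("./out")
```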

Python v0.5.0

18 Feb 23:59

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • do_lowercase has been renamed to lowercase for consistency between the different tokenizers
    (especially ByteLevelBPETokenizer and CharBPETokenizer); see the sketch after this list
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.
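
A minimal sketch of the renamed argument, the new dropout option, and the new Strip normalizer
(the values are arbitrary):

```python
from tokenizers import ByteLevelBPETokenizer
from tokenizers.normalizers import Strip

# `lowercase` replaces the former `do_lowercase` argument, and BPE
# dropout can now be enabled with a probability between 0 and 1
tokenizer = ByteLevelBPETokenizer(dropout=0.1, lowercase=True)

# The new Strip normalizer removes leading and/or trailing whitespace
strip_normalizer = Strip(left=True, right=True)
```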

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer

Python v0.4.2

11 Feb 13:24

Fixes:

  • Fix a bug in the WordPieceTrainer class that prevented BertWordPieceTokenizer from being
    trained (cf #137)

Python v0.4.1

11 Feb 04:34

Fixes:

  • Fix a bug related to punctuation in BertWordPieceTokenizer (thanks to @Mansterteddy with #134)