Releases: huggingface/tokenizers

Python v0.4.0

10 Feb 21:12

Changes:

  • Replaced all .new() class methods with a proper __new__ implementation (huge thanks to @ljos for #131).
  • Improved typings
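
To picture what moving from .new() to __new__ means for calling code, here is a minimal pure-Python sketch. The Widget class and its name attribute are hypothetical and are not part of tokenizers; the sketch only shows the general pattern, not the bindings' actual implementation.

```python
class Widget:
    """Hypothetical class used only for illustration (not a tokenizers type)."""

    def __new__(cls, name):
        # With __new__ defined, Widget("demo") builds the instance directly.
        instance = super().__new__(cls)
        instance.name = name
        return instance

    @classmethod
    def new(cls, name):
        # The older style: an explicit factory class method.
        return cls(name)


old_style = Widget.new("demo")  # factory-method call
new_style = Widget("demo")      # ordinary constructor call
print(old_style.name == new_style.name)
```

With __new__ in place, objects are created with the usual constructor syntax, which is what Python users generally expect.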

Python v0.3.0

05 Feb 19:03

Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
  • Added CharDelimiterSplit: a new PreTokenizer that splits sequences on a given delimiter (works like .split(delimiter)).
  • Added WordLevel: a new model that simply maps tokens to their ids.
  • Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing Encoding objects that are ready to be processed by a language model, just like the main Encoding.
  • Provided a mapping to the original string offsets:
        output = tokenizer.encode(...)
        print(output.original_str.offsets(output.offsets[3]))
  • Exposed the vocabulary size on all tokenizers: #99 by @kdexd
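
As a rough illustration of the CharDelimiterSplit and WordLevel entries above, the following pure-Python sketch splits a string on a delimiter while tracking offsets, then maps each piece to an id. The function names, the "<unk>" fallback, and the choice to drop empty pieces are assumptions made for this sketch; they are not the library's API or confirmed behavior.

```python
def split_on_delimiter(text, delimiter):
    """Split text on a delimiter, returning (piece, (start, end)) pairs."""
    pieces = []
    start = 0
    for piece in text.split(delimiter):
        end = start + len(piece)
        if piece:  # assumption: empty pieces from repeated delimiters are dropped
            pieces.append((piece, (start, end)))
        start = end + len(delimiter)
    return pieces


def word_level_ids(pieces, vocab, unk_token="<unk>"):
    """Look each piece up in the vocabulary, falling back to the unknown token."""
    return [vocab.get(piece, vocab[unk_token]) for piece, _ in pieces]


vocab = {"<unk>": 0, "hello": 1, "world": 2}  # hypothetical vocabulary
pieces = split_on_delimiter("hello_world_again", "_")
print(pieces)
print(word_level_ids(pieces, vocab))
```

Keeping the offsets next to each piece is what makes it possible to map a token back to its span in the original string.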

Bug fixes:

  • Fix a bug with IndexableString
  • Fix a bug with truncation
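
The truncation behavior described in this release (a main result plus a list of overflowing pieces) can be sketched in plain Python as follows. The function name and the use of bare id lists instead of Encoding objects are assumptions for illustration only; the library's own truncation options are not reproduced here.

```python
def truncate_with_overflow(ids, max_length):
    """Return the first max_length ids and the remainder as same-sized chunks."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    # Cut the sequence into consecutive chunks of at most max_length ids.
    chunks = [ids[i:i + max_length] for i in range(0, len(ids), max_length)]
    if not chunks:
        return [], []
    # The first chunk is the main result; the rest are the overflowing pieces.
    return chunks[0], chunks[1:]


main, overflowing = truncate_with_overflow(list(range(10)), 4)
print(main)
print(overflowing)
```

Each overflowing chunk respects the same length limit as the main one, so it can be fed to a model in the same way.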

Python v0.2.1

22 Jan 21:13
  • Fix a bug with the IDs associated with added tokens.
  • Fix a bug that was causing crashes in Python 3.5

Python v0.2.0

20 Jan 14:24

In this release, we fixed some inconsistencies between the BPETokenizer and the original Python version of this tokenizer. If you created your own vocabulary with this tokenizer, you will need to either train a new one or use a modified version in which you set the PreTokenizer back to Whitespace (instead of WhitespaceSplit).
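
To see why the choice of pre-tokenizer matters for an existing vocabulary, the sketch below contrasts the two splitting strategies in plain Python. It assumes that WhitespaceSplit splits on whitespace only and that Whitespace uses the pattern \w+|[^\w\s]+, as described in later library documentation; whether this release used exactly that pattern is an assumption. The function names are hypothetical.

```python
import re


def whitespace_split(text):
    # Split on whitespace only; punctuation stays attached to words.
    return text.split()


def whitespace(text):
    # Assumed pattern: runs of word characters, or runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


sample = "Hello, world!"
print(whitespace_split(sample))  # ['Hello,', 'world!']
print(whitespace(sample))        # ['Hello', ',', 'world', '!']
```

Because the two strategies produce different word boundaries, a vocabulary trained with one will generally not line up with text pre-tokenized by the other.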

Python v0.1.1

12 Jan 07:37
  • Fix a bug where special tokens get split while encoding

Python v0.1.0

10 Jan 18:49
Bump Python version for release

v0.0.13

08 Jan 18:43
Hotfix Python bindings for 32-bit systems

v0.0.12

07 Jan 02:05
Bump version for release

v0.0.11

27 Dec 15:44

Fixes the sdist build for Python

v0.0.10

26 Dec 19:56
Bump for release