Releases: huggingface/tokenizers
Python v0.4.0
Python v0.3.0
Changes:
- `BPETokenizer` has been renamed to `CharBPETokenizer` for clarity.
- Added `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`).
- Added `WordLevel`: a new model that simply maps tokens to their ids.
- Improved truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing `Encoding`s that are ready to be processed by a language model, just like the main `Encoding`.
- Provide mapping to the original string offsets using:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
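The `WordLevel` model and the new overflow handling can be sketched in plain Python. Note that `word_level_encode` and `truncate_with_overflow` below are hypothetical helper names used for illustration, not part of the tokenizers API:

```python
def word_level_encode(tokens, vocab, unk_id=0):
    """WordLevel-style model: map each token directly to its id.

    Tokens missing from the vocabulary fall back to unk_id.
    """
    return [vocab.get(tok, unk_id) for tok in tokens]

def truncate_with_overflow(ids, max_length, stride=0):
    """Split ids into a main chunk plus overflowing chunks.

    Each overflowing chunk keeps `stride` ids of context from the
    previous chunk, mirroring the overflow behavior described above.
    """
    step = max_length - stride
    assert step > 0, "stride must be smaller than max_length"
    chunks = [ids[i:i + max_length] for i in range(0, len(ids), step)]
    return chunks[0], chunks[1:]

vocab = {"hello": 1, "world": 2}
ids = word_level_encode(["hello", "world", "hello", "oov"], vocab)
print(ids)  # [1, 2, 1, 0]
main, overflowing = truncate_with_overflow(ids, max_length=2)
print(main, overflowing)  # [1, 2] [[1, 0]]
```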
Bug fixes:
- Fix a bug with IndexableString
- Fix a bug with truncation
Python v0.2.1
- Fix a bug with the IDs associated with added tokens.
- Fix a bug that was causing crashes in Python 3.5.
Python v0.2.0
In this release, we fixed some inconsistencies between `BPETokenizer` and the original Python version of this tokenizer. If you created your own vocabulary using this tokenizer, you will need to either train a new one, or use a modified version where you set the `PreTokenizer` back to `Whitespace` (instead of `WhitespaceSplit`).
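The practical difference between the two pre-tokenizers can be sketched in plain Python, assuming `Whitespace` splits on the word-or-punctuation pattern `\w+|[^\w\s]+` while `WhitespaceSplit` splits on whitespace only; the helper names below are illustrative, not library functions:

```python
import re

def whitespace_split(text):
    # WhitespaceSplit: split on whitespace only, like str.split()
    return text.split()

def whitespace(text):
    # Whitespace: split into words OR runs of punctuation, so
    # punctuation becomes its own token (pattern \w+|[^\w\s]+)
    return re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_split("Hey friend!"))  # ['Hey', 'friend!']
print(whitespace("Hey friend!"))        # ['Hey', 'friend', '!']
```

Because `Whitespace` separates punctuation, the two pre-tokenizers produce different token streams and therefore incompatible vocabularies, which is why a vocabulary trained with one cannot be reused with the other.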
Python v0.1.1
- Fix a bug where special tokens get split while encoding.
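The intended behavior (special tokens kept intact rather than split by the tokenizer) can be sketched in plain Python; `encode_with_specials` is a hypothetical illustration of the idea, not the library's implementation:

```python
import re

def encode_with_specials(text, special_tokens, tokenize):
    """Protect special tokens (e.g. "[CLS]", "[SEP]") from splitting.

    Cut the text around the special tokens first, then apply the
    normal tokenizer only to the plain spans in between.
    """
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    tokens = []
    for piece in re.split(pattern, text):
        if piece in special_tokens:
            tokens.append(piece)       # keep special tokens whole
        elif piece:
            tokens.extend(tokenize(piece))
    return tokens

print(encode_with_specials("[CLS] hello world [SEP]",
                           ["[CLS]", "[SEP]"], str.split))
# ['[CLS]', 'hello', 'world', '[SEP]']
```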
Python v0.1.0
Bump python version for release