Python v0.5.0

@n1t0 released this 18 Feb 23:59

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • do_lowercase has been renamed to lowercase for consistency across the different tokenizers (especially ByteLevelBPETokenizer and CharBPETokenizer).
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.
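
As a rough illustration of two of the changes above (the renamed lowercase argument and __len__ on Encoding), here is a minimal sketch. The tiny corpus, the temporary file, and the vocab_size and min_frequency values are assumptions made only to keep the example self-contained; they are not part of the release. Dropout is not shown.

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Hypothetical throwaway corpus, used only so the sketch can run on its own.
corpus = "Hello World\nhello tokenizers\nHELLO again world\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(corpus)

    # `lowercase` replaces the former `do_lowercase` argument.
    tokenizer = ByteLevelBPETokenizer(lowercase=True)
    # Small, arbitrary training settings (assumptions for illustration).
    tokenizer.train(files=[path], vocab_size=300, min_frequency=1, show_progress=False)

encoding = tokenizer.encode("Hello World")

# Encoding now supports len(); it matches the number of tokens produced.
print(encoding.tokens)
print(len(encoding))
```

Code that still passes do_lowercase would need to switch to lowercase after upgrading.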

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer.