
Boundary problems of BertWordPieceTokenizer (rust fatal error, max length) #174

Closed
wxupjack opened this issue Feb 27, 2020 · 1 comment
wxupjack commented Feb 27, 2020

Here is a simple snippet to reproduce the issue:

tokenizers version: 0.5.2

# Write a minimal vocabulary file
with open('./vocab.txt', 'w') as f:
    f.write('\n'.join(['[SEP]', '[UNK]', '[CLS]', '[MASK]', '[PAD]', ',', '.']))

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer('./vocab.txt')
tokenizer.enable_truncation(max_length=128)

# A sentence pair: 125 commas and 130 periods; with this vocabulary,
# each character tokenizes to exactly one token
text1, text2 = ',' * 125, '.' * 130
encodings = tokenizer.encode_batch([(text1, text2)])

print(len(encodings[0].ids))

Here, max_length=128, so 125 = max_length - 3 is the budget left for the pair after the three special tokens ([CLS] and two [SEP]).

  • When len(text2) < 125, no problem occurs, regardless of the length of text1. The normal encoding looks like [2, (part of the) token indexes of text1 ..., 0, (part of the) token indexes of text2 ..., 0] (0 for [SEP], 2 for [CLS]).
  • When len(text2) >= 125:
  1. If len(text1) < 125, every case still works fine.
  2. If len(text1) = 125, a Rust panic occurs:

thread '' panicked at 'assertion failed: stride < max_len', /__w/tokenizers/tokenizers/tokenizers/src/tokenizer/encoding.rs:109:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Aborted (core dumped)

  3. If len(text1) > 125, the length of the encoding exceeds max_length; some assertions are probably needed here.

The last case may be meaningless, so I ignored it at first. But once I found the fatal runtime error, I decided this should be reported and fixed, just in case.
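For reference, the pair layout assumed above can be modeled in plain Python. This is only a sketch of the expected [CLS] A [SEP] B [SEP] layout, not the tokenizers internals; the helper name and the hard-coded ids come from the vocabulary file written earlier.

```python
# Pure-Python model of the expected pair layout [CLS] A [SEP] B [SEP]
# under max_length=128 (a sketch; not the actual tokenizers code).
CLS, SEP = 2, 0                      # ids from the vocab above
max_length = 128
budget = max_length - 3              # room left for text1 + text2 tokens

def expected_ids(a_ids, b_ids):
    """Lay out an already-truncated pair with the three special tokens."""
    assert len(a_ids) + len(b_ids) <= budget
    return [CLS] + a_ids + [SEP] + b_ids + [SEP]

# The crashing boundary case: text1 fully kept (125 tokens), which
# leaves no budget at all for text2 -- it must be truncated entirely.
ids = expected_ids([5] * 125, [])
print(len(ids))  # 128
```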

n1t0 commented Feb 29, 2020

Hi @wxupjack, and thank you for your report. There is indeed a bug in the LongestFirst strategy (the default truncation strategy) that makes it truncate only the longest sequence, not both, so when it is expected to truncate one of the sequences entirely, it crashes.

Fixed with this commit: f8f0702. Will be part of the next release!
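The intended behavior of the strategy can be sketched in plain Python (hypothetical function name; not the actual Rust implementation): at every step, remove one token from whichever sequence is currently longer, so both sequences shrink together and either one may legitimately end up empty.

```python
# Minimal pure-Python sketch of the LongestFirst truncation strategy
# (a model of the intended behavior, not the tokenizers Rust code).
def truncate_longest_first(seq_a, seq_b, max_len):
    """Trim two token lists until their combined length fits max_len."""
    a, b = list(seq_a), list(seq_b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()          # drop from the end of the longer sequence
        else:
            b.pop()
    return a, b

# Reproduce the shape of the failing case: 125 + 130 tokens, with a
# budget of 128 - 3 = 125 after [CLS]/[SEP]/[SEP] are reserved.
a, b = truncate_longest_first(list(range(125)), list(range(130)), 125)
print(len(a), len(b))  # 62 63
```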

@n1t0 n1t0 closed this as completed Feb 29, 2020