
Boundary problems of BertWordPieceTokenizer (rust fatal error, max length) #174

Closed
wxupjack opened this issue Feb 27, 2020 · 1 comment
wxupjack commented Feb 27, 2020

Here is a simple snippet to reproduce the issue:

tokenizers version: 0.5.2

# Write a minimal vocabulary file
with open('./vocab.txt', 'w') as f:
    f.write('\n'.join(['[SEP]', '[UNK]', '[CLS]', '[MASK]', '[PAD]', ',', '.']))

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer('./vocab.txt')
tokenizer.enable_truncation(max_length=128)

# A sentence pair: 125 commas and 130 periods; with this vocabulary,
# each character tokenizes to exactly one token
text1, text2 = ',' * 125, '.' * 130
encodings = tokenizer.encode_batch([(text1, text2)])

print(len(encodings[0].ids))

Here, max_length=128, so 125 = max_length - 3 is the budget left for the pair after the three special tokens ([CLS] and two [SEP]).

  • When len(text2) < 125, no problem occurs, regardless of the length of text1. The normal encoding looks like [2, (part of the) token indexes of text1 ..., 0, (part of the) token indexes of text2 ..., 0] (0 for [SEP], 2 for [CLS]).
  • When len(text2) >= 125:
  1. If len(text1) < 125, every case still works fine.
  2. If len(text1) = 125, a Rust panic occurs:

thread '' panicked at 'assertion failed: stride < max_len', /__w/tokenizers/tokenizers/tokenizers/src/tokenizer/encoding.rs:109:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
Aborted (core dumped)

  3. If len(text1) > 125, the length of the encoding exceeds max_length; some assertions are probably needed here.

The last case may be meaningless, so I ignored it at first. But once I found the fatal runtime error, I decided this should be reported and fixed, just in case.
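For reference, the pair layout assumed above can be modeled in plain Python. This is only a sketch of the expected [CLS] A [SEP] B [SEP] layout, not the tokenizers internals; the helper name and the hard-coded ids come from the vocabulary file written earlier.

```python
# Pure-Python model of the expected pair layout [CLS] A [SEP] B [SEP]
# under max_length=128 (a sketch; not the actual tokenizers code).
CLS, SEP = 2, 0                      # ids from the vocab above
max_length = 128
budget = max_length - 3              # room left for text1 + text2 tokens

def expected_ids(a_ids, b_ids):
    """Lay out an already-truncated pair with the three special tokens."""
    assert len(a_ids) + len(b_ids) <= budget
    return [CLS] + a_ids + [SEP] + b_ids + [SEP]

# The crashing boundary case: text1 fully kept (125 tokens), which
# leaves no budget at all for text2 -- it must be truncated entirely.
ids = expected_ids([5] * 125, [])
print(len(ids))  # 128
```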

n1t0 commented Feb 29, 2020

Hi @wxupjack, and thank you for your report. There is indeed a bug in the LongestFirst strategy (the default truncation strategy) that makes it truncate only the longest sequence, not both, so when it is expected to truncate one of the sequences entirely, it crashes.

Fixed with this commit: f8f0702. Will be part of the next release!
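The intended behavior of the strategy can be sketched in plain Python (hypothetical function name; not the actual Rust implementation): at every step, remove one token from whichever sequence is currently longer, so both sequences shrink together and either one may legitimately end up empty.

```python
# Minimal pure-Python sketch of the LongestFirst truncation strategy
# (a model of the intended behavior, not the tokenizers Rust code).
def truncate_longest_first(seq_a, seq_b, max_len):
    """Trim two token lists until their combined length fits max_len."""
    a, b = list(seq_a), list(seq_b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()          # drop from the end of the longer sequence
        else:
            b.pop()
    return a, b

# Reproduce the shape of the failing case: 125 + 130 tokens, with a
# budget of 128 - 3 = 125 after [CLS]/[SEP]/[SEP] are reserved.
a, b = truncate_longest_first(list(range(125)), list(range(130)), 125)
print(len(a), len(b))  # 62 63
```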

@n1t0 n1t0 closed this as completed Feb 29, 2020