Recently I used ByteLevelBPETokenizer for tokenizer training and set add_prefix_space to True. Later I realized that adding a prefix space is reasonable for English, but it is actually unnecessary for Chinese, Japanese, and Korean. So I set tokenizer.normalizer = normalizers.Replace(pattern=tokenizers.Regex(r"^(?=\p{Latin})"), content=' ') and used add_prefix_space=False to get that behavior. But during training, an error was reported:
Traceback (most recent call last):
File "trainer_bbpe_kwai.py", line 119, in <module>
tokenizer.train(
File "/share/miniconda3/envs/hf_tokenizers/lib/python3.8/site-packages/tokenizers/implementations/byte_level_bpe.py", line 98, in train
self._tokenizer.train(files, trainer=trainer)
pyo3_runtime.PanicException: index out of bounds: the len is 39 but the index is 39
thread '<unnamed>' panicked at 'index out of bounds: the len is 35 but the index is 35', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 115 but the index is 115', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 16 but the index is 16', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 4 but the index is 4', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 4 but the index is 4', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 32 but the index is 32', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 13 but the index is 13', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 7 but the index is 7', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 2 but the index is 2', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
How can we solve this problem? (training code)
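For clarity, the behavior I want from the normalizer, a prefix space only when the text starts with a Latin-script character, can be sketched in plain Python. This is just an illustration of the intended logic, not the tokenizers API; the helper name is made up:

```python
import unicodedata

def add_prefix_space_if_latin(text: str) -> str:
    """Prepend a space only when the first character is Latin-script,
    mimicking Replace(Regex(r"^(?=\p{Latin})"), ' ')."""
    if not text:
        return text
    # unicodedata.name() embeds the script for most letters,
    # e.g. 'LATIN SMALL LETTER H' vs 'CJK UNIFIED IDEOGRAPH-4F60'.
    try:
        is_latin = unicodedata.name(text[0]).startswith("LATIN")
    except ValueError:  # character has no name (e.g. some controls)
        is_latin = False
    return " " + text if is_latin else text

print(repr(add_prefix_space_if_latin("hello")))  # ' hello'
print(repr(add_prefix_space_if_latin("你好")))    # '你好' (unchanged)
```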
Thanks for your attention to this issue.
Sorry, I did not have a look, but the normalizer is of course the cause here. Not sure I'll have the time to debug this. @Narsil, ping if anything comes to your mind!