
sentencepiece\sentencepiece\src\sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())] #20011

Closed
2 of 4 tasks
showpiecep opened this issue Nov 1, 2022 · 5 comments

Comments

@showpiecep

System Info

  • transformers version: 4.24.0
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.9.2
  • Huggingface_hub version: 0.10.1
  • PyTorch version (GPU?): 1.13.0+cpu (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@patrickvonplaten
@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import T5Tokenizer
tokenizer = T5Tokenizer(vocab_file='vocab_ruturk.spm')

Traceback (most recent call last):
  File "app.py", line 3, in <module>
    tokenizer = T5Tokenizer(vocab_file='vocab.ruturk.spm')
  File "env\lib\site-packages\transformers\models\t5\tokenization_t5.py", line 157, in __init__
    self.sp_model.Load(vocab_file)
  File "env\lib\site-packages\sentencepiece\__init__.py", line 910, in Load
    return self.LoadFromFile(model_file)
  File "env\lib\site-packages\sentencepiece\__init__.py", line 311, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: a\sentencepiece\sentencepiece\src\sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
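The RuntimeError means sentencepiece could not parse the file as a serialized model protobuf. A minimal, stdlib-only sanity check can catch the common causes before the tokenizer is even constructed: an empty, truncated, or plain-text file (e.g. an HTML error page) saved under a `.spm` name. The function name and the size threshold below are assumptions for illustration, not part of any library API:

```python
import os

def looks_like_binary_model(path, min_size=1024):
    """Rough heuristic: a usable .spm model is a non-trivial binary
    protobuf, not an empty file or a small text file saved under the
    wrong name. min_size is an assumed threshold, not a spec value."""
    if not os.path.isfile(path) or os.path.getsize(path) < min_size:
        return False
    with open(path, "rb") as f:
        head = f.read(256)
    # Plain-text content (HTML, JSON error bodies, LFS pointers) decodes
    # cleanly as ASCII; a protobuf model almost always contains
    # non-ASCII bytes early on.
    try:
        head.decode("ascii")
        return False
    except UnicodeDecodeError:
        return True
```

If this returns False for your vocab file, the download or export is broken and no tokenizer setting will fix it.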

Expected behavior

No errors

@sgugger
Collaborator

sgugger commented Nov 2, 2022

It looks like you are using the tokenizer with a broken sentencepiece vocab. In any case, we would need a reproducer with a file we can access in order to investigate.

@cliangyu

cliangyu commented Nov 5, 2022

Ran into the same issue. How did you solve it?

@showpiecep
Author

showpiecep commented Nov 6, 2022

Ran into the same issue. How did you solve it?

The whole problem was the vocab. I just took a different one.

@AlexWortega

What's wrong with the vocab? How do I fix it correctly?

@jiangzhuolin

What's wrong with the vocab? How do I fix it correctly?

Make sure your vocab files (the *.bin files) were downloaded in full. In my case, I had not installed git-lfs, so git clone from the Hugging Face Hub silently failed for these large files. Download them manually or install git-lfs and clone again.
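Following up on the git-lfs point: when a repo is cloned without Git LFS installed, what lands on disk for each large file is a small text pointer stub, not the binary blob, which is exactly what makes sentencepiece's parser fail. The prefix below is the documented first line of the Git LFS pointer format; the helper function name is an assumption for illustration:

```python
# First line every Git LFS pointer file starts with, per the LFS spec.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_git_lfs_pointer(path):
    """Return True if the file is a Git LFS pointer stub rather than
    the actual binary blob it stands in for."""
    with open(path, "rb") as f:
        return f.read(len(LFS_PREFIX)) == LFS_PREFIX
```

If this returns True for your vocab file, re-download it (or install git-lfs and run `git lfs pull` in the cloned repo).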
