
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3 #31789

Closed
murthyrudra opened this issue Jul 4, 2024 · 5 comments

Comments

@murthyrudra

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 133, in __init__
    super().__init__(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
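The error comes from deserializing tokenizer.json: the pre_tokenizer section is decoded as an untagged enum, so an older tokenizers build that does not recognize a newly added pre-tokenizer type fails to match any variant and raises. A minimal sketch of that failure mode, using illustrative type names rather than the real tokenizers internals:

```python
import json

# Pre-tokenizer types an (older) deserializer knows about -- an
# illustrative subset, not the actual list from the tokenizers library.
KNOWN_PRE_TOKENIZERS = {"Whitespace", "ByteLevel", "Metaspace"}

def load_pre_tokenizer(config_text):
    """Mimic deserializing the pre_tokenizer section of tokenizer.json:
    an unrecognized "type" matches no known variant and raises."""
    config = json.loads(config_text)
    kind = config.get("type")
    if kind not in KNOWN_PRE_TOKENIZERS:
        raise Exception(
            "data did not match any variant of untagged enum "
            "PyPreTokenizerTypeWrapper"
        )
    return kind

# An old library version handles a type it knows...
print(load_pre_tokenizer('{"type": "ByteLevel"}'))
# ...but fails on a type introduced in a newer tokenizers release.
try:
    load_pre_tokenizer('{"type": "SomeNewerPreTokenizer"}')
except Exception as e:
    print(e)
```

This is why the fix below is an upgrade: a newer tokenizers knows the new variant and the same file deserializes cleanly.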


Expected behavior

There should be no error when loading the tokenizer.

@LysandreJik
Member

Hello! Is it possible you have an outdated version of tokenizers? Do you mind upgrading to the latest one?

pip install -U tokenizers
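To confirm the upgrade took effect, you can compare the installed version against a minimum before re-running. A hedged sketch; the 0.15.0 floor is an assumption for illustration, not a documented requirement (check the model card or release notes for the real one):

```python
from importlib.metadata import version, PackageNotFoundError

def parse(v):
    """Parse a dotted version string into a comparable tuple of ints."""
    parts = []
    for p in v.split(".")[:3]:
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

MINIMUM = "0.15.0"  # assumed floor, for illustration only

try:
    installed = version("tokenizers")
    if parse(installed) < parse(MINIMUM):
        print(f"tokenizers {installed} is too old; run: pip install -U tokenizers")
    else:
        print(f"tokenizers {installed} looks recent enough")
except PackageNotFoundError:
    print("tokenizers is not installed")
```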

@murthyrudra
Author

Thanks, updating the tokenizers library helped :)

mawandm added a commit to ametnes/nesis that referenced this issue Jul 5, 2024
This PR
- makes new document columns nullable
- upgrades rag transformers and tokenizers to fix the error
huggingface/transformers#31789

Part of #108
squidarth added a commit to basetenlabs/truss-examples that referenced this issue Jul 8, 2024
Fix Mistral truss-examples; see
[issue](huggingface/transformers#31789) for
context. Something changed in the tokenizers library that requires
updating these examples.

This is the exception that we're seeing:

```
Exception while loading model

Traceback (most recent call last):
  File "/app/model_wrapper.py", line 118, in load
    self.try_load()
  File "/app/model_wrapper.py", line 179, in try_load
    retry(
  File "/app/common/retry.py", line 20, in retry
    raise exc
  File "/app/common/retry.py", line 15, in retry
    fn()
  File "/app/model/model.py", line 34, in load
    self.tokenizer = AutoTokenizer.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/auto/tokenization_auto.py", line 751, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 2045, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/llama/tokenization_llama_fast.py", line 122, in __init__
    super().__init__(
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
```
squidarth added a commit to basetenlabs/truss-examples that referenced this issue Jul 9, 2024
# Context

There was a recent change to the Mistral repo on Hugging Face where they
started using a newer tokenizers feature; this resulted in
huggingface/transformers#31789.

Bumping transformers to fix this across all of our Mistral models.
As a follow-up, we should start pinning the HF repository for all of our
examples to prevent this from happening.

# Testing

I have tested a couple of the TRT examples, but not _everything_.
@anki-code

anki-code commented Jul 10, 2024

I have this issue, and a partial update does not work:

!pip install -U 'tokenizers<0.15'
# Successfully installed tokenizers-0.14.1

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

!pip install -U 'tokenizers'
# Successfully installed tokenizers-0.19.1
# RESTART NOTEBOOK

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# ImportError: tokenizers>=0.14,<0.15 is required for a normal functioning of this module, but found tokenizers==0.19.1.

A full update solves the issue for me:

!pip install -U transformers
#Successfully installed huggingface-hub-0.23.4 transformers-4.42.3

It also works when run from the xonsh shell.
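The partial-upgrade failure above is transformers' own dependency pin: each transformers release checks the installed tokenizers against a version range at import time, so bumping tokenizers alone trips that check. A simplified sketch of such a range check; the ">=0.14,<0.15" spec comes from the error message above, while the helper itself is illustrative, not transformers' actual code:

```python
def satisfies(installed: str, spec: str) -> bool:
    """Check a dotted version against a comma-separated spec like
    '>=0.14,<0.15' -- a simplified stand-in for the compatibility
    check transformers runs on its tokenizers dependency at import."""
    def parse(v):
        return tuple(int(p) for p in v.split("."))

    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            ok = parse(installed) >= parse(clause[2:])
        elif clause.startswith("<"):
            ok = parse(installed) < parse(clause[1:])
        else:
            raise ValueError(f"unsupported clause: {clause}")
        if not ok:
            return False
    return True

# The pin from the error message above:
print(satisfies("0.14.1", ">=0.14,<0.15"))  # old tokenizers accepted
print(satisfies("0.19.1", ">=0.14,<0.15"))  # newer tokenizers rejected
```

Upgrading transformers itself moves the pin forward, which is why the full update works while the partial one does not.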

@littlerookie

I have the same issue, and the package versions are the latest: transformers==4.43.3, tokenizers==0.19.1.

@ArthurZucker
Collaborator

@littlerookie sorry, but I can't reproduce:
Could you share a bit more?

5 participants