
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3 #31789

Closed
murthyrudra opened this issue Jul 4, 2024 · 5 comments

Comments

@murthyrudra

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 133, in __init__
    super().__init__(
  File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
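The error comes from deserializing tokenizer.json: the pre_tokenizer section is decoded as an untagged enum, so an older tokenizers build that does not recognize a newly added pre-tokenizer type fails to match any variant and raises. A minimal sketch of that failure mode, using illustrative type names rather than the real tokenizers internals:

```python
import json

# Pre-tokenizer types an (older) deserializer knows about -- an
# illustrative subset, not the actual list from the tokenizers library.
KNOWN_PRE_TOKENIZERS = {"Whitespace", "ByteLevel", "Metaspace"}

def load_pre_tokenizer(config_text):
    """Mimic deserializing the pre_tokenizer section of tokenizer.json:
    an unrecognized "type" matches no known variant and raises."""
    config = json.loads(config_text)
    kind = config.get("type")
    if kind not in KNOWN_PRE_TOKENIZERS:
        raise Exception(
            "data did not match any variant of untagged enum "
            "PyPreTokenizerTypeWrapper"
        )
    return kind

# An old library version handles a type it knows...
print(load_pre_tokenizer('{"type": "ByteLevel"}'))
# ...but fails on a type introduced in a newer tokenizers release.
try:
    load_pre_tokenizer('{"type": "SomeNewerPreTokenizer"}')
except Exception as e:
    print(e)
```

This is why the fix below is an upgrade: a newer tokenizers knows the new variant and the same file deserializes cleanly.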


Expected behavior

There should be no error when loading the tokenizer.

@LysandreJik
Member

Hello! Is it possible you have an outdated version of tokenizers? Do you mind upgrading to the latest one?

pip install -U tokenizers
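To confirm the upgrade took effect, you can compare the installed version against a minimum before re-running. A hedged sketch; the 0.15.0 floor is an assumption for illustration, not a documented requirement (check the model card or release notes for the real one):

```python
from importlib.metadata import version, PackageNotFoundError

def parse(v):
    """Parse a dotted version string into a comparable tuple of ints."""
    parts = []
    for p in v.split(".")[:3]:
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

MINIMUM = "0.15.0"  # assumed floor, for illustration only

try:
    installed = version("tokenizers")
    if parse(installed) < parse(MINIMUM):
        print(f"tokenizers {installed} is too old; run: pip install -U tokenizers")
    else:
        print(f"tokenizers {installed} looks recent enough")
except PackageNotFoundError:
    print("tokenizers is not installed")
```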

@murthyrudra
Author

Thanks, updating the tokenizers library helped :)

mawandm added a commit to ametnes/nesis that referenced this issue Jul 5, 2024
This PR
- makes new document columns nullable
- upgrades rag transformers and tokenizers to fix the error
huggingface/transformers#31789

Part of #108
squidarth added a commit to basetenlabs/truss-examples that referenced this issue Jul 8, 2024
Fix Mistral truss-examples; see
[issue](huggingface/transformers#31789) for
context. Something changed in the tokenizers library that requires
updating these examples.

This is the exception that we're seeing:

```
Exception while loading model

Traceback (most recent call last):
  File "/app/model_wrapper.py", line 118, in load
    self.try_load()
  File "/app/model_wrapper.py", line 179, in try_load
    retry(
  File "/app/common/retry.py", line 20, in retry
    raise exc
  File "/app/common/retry.py", line 15, in retry
    fn()
  File "/app/model/model.py", line 34, in load
    self.tokenizer = AutoTokenizer.from_pretrained(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/auto/tokenization_auto.py", line 751, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 2045, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/llama/tokenization_llama_fast.py", line 122, in __init__
    super().__init__(
  File "/usr/local/lib/python3.11/dist-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
```
squidarth added a commit to basetenlabs/truss-examples that referenced this issue Jul 9, 2024
# Context

There was a recent change to the Mistral repo on Hugging Face where they
started using a newer tokenizers feature; this resulted in
huggingface/transformers#31789.

Bumping transformers to fix this across all of our Mistral models.
As a follow-up, we should start pinning the HF repository for all of our
examples to prevent this from happening.

# Testing

I have tested a couple of the TRT examples, but not _everything_.
@anki-code

anki-code commented Jul 10, 2024

I have this issue, and a partial update does not work:

!pip install -U 'tokenizers<0.15'
# Successfully installed tokenizers-0.14.1

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

!pip install -U 'tokenizers'
# Successfully installed tokenizers-0.19.1
# RESTART NOTEBOOK

from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1" 
tokenizer = AutoTokenizer.from_pretrained(model_id, use_auth_token=True)
# ImportError: tokenizers>=0.14,<0.15 is required for a normal functioning of this module, but found tokenizers==0.19.1.

A full update solves the issue for me:

!pip install -U transformers
#Successfully installed huggingface-hub-0.23.4 transformers-4.42.3

It also works when run from the xonsh shell.
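The partial-upgrade failure above is transformers' own dependency pin: each transformers release checks the installed tokenizers against a version range at import time, so bumping tokenizers alone trips that check. A simplified sketch of such a range check; the ">=0.14,<0.15" spec comes from the error message above, while the helper itself is illustrative, not transformers' actual code:

```python
def satisfies(installed: str, spec: str) -> bool:
    """Check a dotted version against a comma-separated spec like
    '>=0.14,<0.15' -- a simplified stand-in for the compatibility
    check transformers runs on its tokenizers dependency at import."""
    def parse(v):
        return tuple(int(p) for p in v.split("."))

    for clause in spec.split(","):
        clause = clause.strip()
        if clause.startswith(">="):
            ok = parse(installed) >= parse(clause[2:])
        elif clause.startswith("<"):
            ok = parse(installed) < parse(clause[1:])
        else:
            raise ValueError(f"unsupported clause: {clause}")
        if not ok:
            return False
    return True

# The pin from the error message above:
print(satisfies("0.14.1", ">=0.14,<0.15"))  # old tokenizers accepted
print(satisfies("0.19.1", ">=0.14,<0.15"))  # newer tokenizers rejected
```

Upgrading transformers itself moves the pin forward, which is why the full update works while the partial one does not.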

@littlerookie

I have the same issue, and the package versions are the latest: transformers==4.43.3, tokenizers==0.19.1.

@ArthurZucker
Collaborator

@littlerookie sorry, but I can't reproduce:
Could you share a bit more?

5 participants