PreTrainedTokenizerBase issue produced by PR #19073 #19488

Closed · 2 of 4 tasks
gugarosa opened this issue Oct 11, 2022 · 5 comments · Fixed by #19626
gugarosa commented Oct 11, 2022

System Info

  • transformers version: 4.23.1
  • Platform: Linux-5.4.0-125-generic-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.13
  • Huggingface_hub version: 0.10.0
  • PyTorch version (GPU?): 1.12.1+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@SaulLu

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Have a local tokenizer.json file (different from the Hub's file and in the same folder as the invoked code) and run the following:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
print(tokenizer)
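For anyone reproducing: the contents of the local file barely matter; its presence under the exact name tokenizer.json is what triggers the lookup. A hypothetical way to create such a conflicting file (the JSON payload below is made up and only needs to differ from the Hub's copy):

import json

# Drop a deliberately different tokenizer.json into the current working
# directory; with the lines added in PR #19073, from_pretrained resolves
# this local file instead of the Hub's copy.
with open("tokenizer.json", "w") as f:
    json.dump({"version": "1.0", "truncation": None, "padding": None}, f)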

Error

The exact error depends on the contents of the local tokenizer.json; it could, for instance, state that the two tokenizers have a different number of tokens. Nevertheless, here is the trace of what could be the issue:

transformers.tokenization_utils_base (Line 1726)

for file_id, file_path in vocab_files.items():
    if file_path is None:
        resolved_vocab_files[file_id] = None
    elif os.path.isfile(file_path):
        # file_path can be a bare filename such as "tokenizer.json", which
        # os.path.isfile resolves against the current working directory
        resolved_vocab_files[file_id] = file_path
...

If we print the vocab_files dictionary, most of the time its output will be as expected:

{'vocab_file': 'vocab.json', 'merges_file': 'merges.txt', 'tokenizer_file': 'tokenizer.json', 'added_tokens_file': 'added_tokens.json', 'special_tokens_map_file': 'special_tokens_map.json', 'tokenizer_config_file': 'tokenizer_config.json'}

With the lines added in PR #19073, the code now checks whether tokenizer.json exists as a file on the local filesystem and, if it does, marks it as the file_path in the resolved_vocab_files dictionary. Unfortunately, this is not the expected behavior: since we are loading a pre-trained tokenizer from an identifier found on the Hub, the file_path should come from the Hub download, not from a local file.

If we print the resolved_vocab_files dictionary with the added lines from PR #19073, this is its output:

{... 'tokenizer_file': 'tokenizer.json' ...}

Without the added lines:

{... 'tokenizer_file': '/home/gderosa/.cache/huggingface/hub/models--Salesforce--codegen-350M-mono/snapshots/40b7a3b6e99e73bdb497a14b740e7167b3413c74/tokenizer.json' ...}

My assumption is that this very same behavior occurs for any of the files listed in the vocab_files dictionary, whenever a file with that name exists in the folder users are running their scripts from.
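The shadowing can be seen with plain os.path alone. A minimal sketch (independent of transformers) of how bare filenames in vocab_files resolve against the current working directory:

import os

# Bare filenames, as in the vocab_files dictionary printed above;
# os.path.isfile resolves them against the current working directory,
# so any same-named local file shadows the Hub/cache lookup.
vocab_files = {"vocab_file": "vocab.json", "tokenizer_file": "tokenizer.json"}

for file_id, file_path in vocab_files.items():
    if os.path.isfile(file_path):
        print(f"{file_id}: resolved to local ./{file_path}")
    else:
        print(f"{file_id}: would fall through to the Hub/cache lookup")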

Solutions

Maybe the cached_file lookup should happen before the added lines, and only fall back to local files when the cached version cannot be found?
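As a rough sketch of that ordering (hypothetical, not the actual fix in #19626; it reuses the vocab_files / resolved_vocab_files names from the snippet above and assumes cached_file from transformers.utils raises an EnvironmentError when the file cannot be fetched):

from transformers.utils import cached_file

# Hypothetical reordering: resolve against the Hub/cache first, and only
# fall back to a same-named local file when the cached lookup fails.
for file_id, file_path in vocab_files.items():
    if file_path is None:
        resolved_vocab_files[file_id] = None
        continue
    try:
        resolved_vocab_files[file_id] = cached_file(pretrained_model_name_or_path, file_path)
    except EnvironmentError:
        # Last resort: a file sitting next to the running script
        resolved_vocab_files[file_id] = file_path if os.path.isfile(file_path) else None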

Expected behavior

The expected behavior is to use the tokenizer_file coming from pretrained_model_name_or_path instead of the local file.

gugarosa (Contributor, Author) commented:

Hi everyone! Hope everything is going well with you.

Please let me know whether I've described the issue clearly enough.

Thanks for your attention and best regards,
Gustavo.

LysandreJik (Member) commented:

Thank you for the issue @gugarosa!

Pinging @sgugger

sgugger self-assigned this Oct 14, 2022

sgugger commented Oct 14, 2022

Thanks for the report. I understand the bug, and your analysis of its cause seems correct. I will work on a fix as soon as I have some free time (might be early next week only)!


sgugger commented Oct 14, 2022

Got time today actually, this should be fixed by the PR linked above!

gugarosa (Contributor, Author) commented:

Thanks so much for the prompt response @sgugger!
