PreTrainedTokenizerBase issue produced by PR #19073 #19488
Comments
Hi everyone! Hope everything is going well with you. Please let me know if I was not clear enough in describing the issue. Thanks for your attention and best regards.

Thanks for the report. I understand the bug, and your analysis of its cause seems correct. I will work on a fix as soon as I have some free time (might be early next week only)!

Got time today actually, this should be fixed by the PR linked above!

Thanks so much for the prompt response @sgugger!
System Info
transformers version: 4.23.1

Who can help?
@Saull
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Have a local tokenizer.json file (different from the Hub's file and in the same folder as the invoked code) and invoke the following code:

Error
The exact error depends on what the local tokenizer.json contains. It could be an error stating that the two tokenizers have a different number of tokens, etc. Nevertheless, here is the trace of what could be the issue: transformers.tokenization_utils_base (line 1726)
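The original reproduction snippet was lost in formatting; a minimal sketch of the setup described above might look like the following. The checkpoint name and the dummy tokenizer.json contents are assumptions, and the `from_pretrained` call needs `transformers` plus network access, so it is guarded:

```python
import json
import os

# Write a local tokenizer.json into the working directory that differs
# from the Hub's file. The contents here are a placeholder assumption.
with open("tokenizer.json", "w") as f:
    json.dump({"version": "1.0", "model": {"type": "WordLevel", "vocab": {}}}, f)

try:
    # Needs transformers (4.23.1 to reproduce) and network access.
    from transformers import AutoTokenizer

    # With the lines added in PR #19073, this picks up the local
    # tokenizer.json instead of the Hub's file and may raise an error.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
except Exception as exc:
    print(f"Loading failed: {exc}")
finally:
    os.remove("tokenizer.json")
```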
If we print the vocab_files dictionary, most of the time its output will be as expected.

With the added lines in PR #19073, at some point the code checks whether tokenizer.json is a file that exists on the system, and if it does, it marks it as the file_path in the resolved_vocab_files dictionary. Unfortunately, this is not the expected behavior, because we need the file_path to come from the Hub's download (since we are loading a pre-trained tokenizer from an identifier found on the Hub) and not from a local file.

If we print the resolved_vocab_files dictionary with the added lines from PR #19073, it points to the local file; without the added lines, it points to the cached Hub file.
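The resolution order described above can be modeled with a small stdlib-only sketch. This is an illustration, not the actual transformers code, and the function name is made up:

```python
import os
import tempfile

def resolve_vocab_file(file_name, cached_path):
    # Mirrors the order described above: the check added in PR #19073
    # runs first, so an existing local file shadows the Hub download.
    if os.path.isfile(file_name):
        return file_name
    return cached_path

with tempfile.TemporaryDirectory() as tmp:
    local = os.path.join(tmp, "tokenizer.json")
    cached = os.path.join(tmp, "cache", "models--x", "tokenizer.json")

    # No local file yet: the cached Hub path is used, as expected.
    assert resolve_vocab_file(local, cached) == cached

    # A local tokenizer.json now shadows the Hub download -- the bug.
    open(local, "w").close()
    assert resolve_vocab_file(local, cached) == local
```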
My assumption is that this very same behavior occurs if users have any of the local files named in the vocab_files dictionary in the folder where they run their scripts.

Solutions
Maybe the cached_file loading should happen before the added lines, and if the cached version cannot be found, it could fall back to local files?

Expected behavior
The expected behavior is to use the tokenizer_file from pretrained_model_name_or_path instead of the local file.