
"AutoTokenizer.from_pretrained" does not work when loading a pretrained MarianTokenizer from a local directory #5040

Closed
erikchwang opened this issue Jun 15, 2020 · 8 comments · Fixed by #5043

@erikchwang

🐛 Bug

Information

I want to save MarianConfig, MarianTokenizer, and MarianMTModel to a local directory ("my_dir") and then load them:

import transformers

transformers.AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-de").save_pretrained("my_dir")
transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de").save_pretrained("my_dir")
transformers.AutoModelWithLMHead.from_pretrained("Helsinki-NLP/opus-mt-en-de").save_pretrained("my_dir")

config = transformers.AutoConfig.from_pretrained("my_dir")
tokenizer = transformers.AutoTokenizer.from_pretrained("my_dir")
model = transformers.AutoModelWithLMHead.from_pretrained("my_dir")

But the above code failed when loading the saved MarianTokenizer from "my_dir":

Traceback (most recent call last):
  File "", line 8, in <module>
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_auto.py", line 206, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 911, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 1062, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_marian.py", line 83, in __init__
    self.spm_source = load_spm(source_spm)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_marian.py", line 236, in load_spm
    spm.Load(path)
  File "/Users/anaconda/lib/python3.6/site-packages/sentencepiece.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/Users/anaconda/lib/python3.6/site-packages/sentencepiece.py", line 177, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

@erikchwang
Author

I noticed that after saving the pretrained MarianTokenizer to "my_dir", the "source.spm" and "target.spm" files are actually named:

1bec78f268e25152d11e6efa41998f2ebebe3ce5452c952c90fc7264c8c45a5b.23f506277c63e64e484c4de9d754a6625e5ba734bb6153470be9b7ffdb7c4ac5

and

5f95a1efcd8b6093955eb77d42cf97bde71563395863991bd96ad0832776f409.52488b746595fe55ab4afaebb1c23e29994354ddfebd6eddb77815395dc1d604

When I renamed the files back to "source.spm" and "target.spm", the error disappeared.
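
For anyone who needs an interim workaround until the fix lands, here is a minimal sketch: rename the hash-named files in the save directory back to the canonical names. The mapping below (first hash to "source.spm", second to "target.spm") is an assumption based on the order reported above; the hashes will differ per machine and cache, so check your own directory.

import os

save_dir = "my_dir"  # the directory used in the example above

# Assumed mapping from the cache-hash names back to the canonical names
# that MarianTokenizer expects; verify against the files actually present
# in save_dir before running.
renames = {
    "1bec78f268e25152d11e6efa41998f2ebebe3ce5452c952c90fc7264c8c45a5b.23f506277c63e64e484c4de9d754a6625e5ba734bb6153470be9b7ffdb7c4ac5": "source.spm",
    "5f95a1efcd8b6093955eb77d42cf97bde71563395863991bd96ad0832776f409.52488b746595fe55ab4afaebb1c23e29994354ddfebd6eddb77815395dc1d604": "target.spm",
}

for old_name, new_name in renames.items():
    old_path = os.path.join(save_dir, old_name)
    if os.path.exists(old_path):
        os.rename(old_path, os.path.join(save_dir, new_name))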

@sshleifer
Contributor

sshleifer commented Jun 16, 2020

I figured it out! The spm files are coming from the cache, so their names are not human readable. This will be fixed by tomorrow.
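
In other words, save_pretrained was copying the SentencePiece files verbatim from the Hugging Face cache, where blobs are stored under content-hash names. A minimal sketch of the fix's idea (not the exact #5043 diff; the spm_files attribute and return value here are assumptions for illustration): write each model out under its canonical name.

from pathlib import Path
from shutil import copyfile

def save_vocabulary(self, save_directory):
    # Sketch: save the SentencePiece models under the canonical names that
    # MarianTokenizer looks for, regardless of how the cached blobs on disk
    # are named.
    save_dir = Path(save_directory)
    saved = []
    # self.spm_files (assumed attribute) holds the resolved paths of the
    # source and target spm models, which may be hash-named cache files.
    for orig, canonical in zip(self.spm_files, ("source.spm", "target.spm")):
        dest = save_dir / canonical
        if Path(orig).resolve() != dest.resolve():
            copyfile(orig, dest)
        saved.append(str(dest))
    return tuple(saved)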

@erikchwang
Author

Thanks a lot... Will this fix be included in the next release?

@sshleifer linked pull request #5043 on Jun 16, 2020 that will close this issue
@sshleifer
Contributor

Yes!

@mittalsuraj18

The same issue also exists for ALBERT models.

@sshleifer
Contributor

Please make a new issue with instructions to reproduce. Thanks!

@danielbellhv

Did you ever solve this for ALBERT models? @mittalsuraj18

@lunaryan

If you also see the warning:

"The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. The class this function is called from is 'LlamaTokenizer'. You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at #24565"

make sure to set the tokenizer type to None.
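
In practice that means not forcing a tokenizer class and letting AutoTokenizer resolve whatever class the checkpoint records. A minimal sketch, where "my_llama_dir" is a placeholder path:

from transformers import AutoTokenizer

# Leave tokenizer_type at its default of None so AutoTokenizer uses the
# class recorded in the checkpoint's tokenizer_config.json (here
# PreTrainedTokenizerFast) instead of forcing LlamaTokenizer.
tokenizer = AutoTokenizer.from_pretrained("my_llama_dir", tokenizer_type=None)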
