
"AutoTokenizer.from_pretrained" does not work when loading a pretrained MarianTokenizer from a local directory #5040

Closed
erikchwang opened this issue Jun 15, 2020 · 8 comments · Fixed by #5043

@erikchwang

🐛 Bug

Information

I want to save MarianConfig, MarianTokenizer, and MarianMTModel to a local directory ("my_dir") and then load them:

import transformers

transformers.AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-de").save_pretrained("my_dir")
transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de").save_pretrained("my_dir")
transformers.AutoModelWithLMHead.from_pretrained("Helsinki-NLP/opus-mt-en-de").save_pretrained("my_dir")

config = transformers.AutoConfig.from_pretrained("my_dir")
tokenizer = transformers.AutoTokenizer.from_pretrained("my_dir")
model = transformers.AutoModelWithLMHead.from_pretrained("my_dir")

But the above code failed when loading the saved MarianTokenizer from "my_dir":

Traceback (most recent call last):
  File "", line 8, in <module>
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_auto.py", line 206, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 911, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 1062, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_marian.py", line 83, in __init__
    self.spm_source = load_spm(source_spm)
  File "/Users/anaconda/lib/python3.6/site-packages/transformers/tokenization_marian.py", line 236, in load_spm
    spm.Load(path)
  File "/Users/anaconda/lib/python3.6/site-packages/sentencepiece.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/Users/anaconda/lib/python3.6/site-packages/sentencepiece.py", line 177, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

@erikchwang
Author

I noticed that after saving the pretrained MarianTokenizer to "my_dir", the "source.spm" and "target.spm" files are actually named:

1bec78f268e25152d11e6efa41998f2ebebe3ce5452c952c90fc7264c8c45a5b.23f506277c63e64e484c4de9d754a6625e5ba734bb6153470be9b7ffdb7c4ac5

and

5f95a1efcd8b6093955eb77d42cf97bde71563395863991bd96ad0832776f409.52488b746595fe55ab4afaebb1c23e29994354ddfebd6eddb77815395dc1d604

When I renamed the files back to "source.spm" and "target.spm", the error disappeared.
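
For anyone who needs an interim workaround until the fix lands, here is a minimal sketch: rename the hash-named files in the save directory back to the canonical names. The mapping below (first hash to "source.spm", second to "target.spm") is an assumption based on the order reported above; the hashes will differ per machine and cache, so check your own directory.

import os

save_dir = "my_dir"  # the directory used in the example above

# Assumed mapping from the cache-hash names back to the canonical names
# that MarianTokenizer expects; verify against the files actually present
# in save_dir before running.
renames = {
    "1bec78f268e25152d11e6efa41998f2ebebe3ce5452c952c90fc7264c8c45a5b.23f506277c63e64e484c4de9d754a6625e5ba734bb6153470be9b7ffdb7c4ac5": "source.spm",
    "5f95a1efcd8b6093955eb77d42cf97bde71563395863991bd96ad0832776f409.52488b746595fe55ab4afaebb1c23e29994354ddfebd6eddb77815395dc1d604": "target.spm",
}

for old_name, new_name in renames.items():
    old_path = os.path.join(save_dir, old_name)
    if os.path.exists(old_path):
        os.rename(old_path, os.path.join(save_dir, new_name))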

@sshleifer
Contributor

sshleifer commented Jun 16, 2020

I figured it out! The spm files are coming from the cache, so their names are not human readable. This will be fixed by tomorrow.
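
In other words, save_pretrained was copying the SentencePiece files verbatim from the Hugging Face cache, where blobs are stored under content-hash names. A minimal sketch of the fix's idea (not the exact #5043 diff; the spm_files attribute and return value here are assumptions for illustration): write each model out under its canonical name.

from pathlib import Path
from shutil import copyfile

def save_vocabulary(self, save_directory):
    # Sketch: save the SentencePiece models under the canonical names that
    # MarianTokenizer looks for, regardless of how the cached blobs on disk
    # are named.
    save_dir = Path(save_directory)
    saved = []
    # self.spm_files (assumed attribute) holds the resolved paths of the
    # source and target spm models, which may be hash-named cache files.
    for orig, canonical in zip(self.spm_files, ("source.spm", "target.spm")):
        dest = save_dir / canonical
        if Path(orig).resolve() != dest.resolve():
            copyfile(orig, dest)
        saved.append(str(dest))
    return tuple(saved)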

@erikchwang
Author

Thanks a lot... Will this fix be included in the next release?

@sshleifer linked pull request #5043 on Jun 16, 2020 that will close this issue
@sshleifer
Contributor

Yes!

@mittalsuraj18

The same issue also exists for ALBERT models.

@sshleifer
Contributor

Please make a new issue with instructions to reproduce. Thanks!

@danielbellhv

Did you ever solve this for ALBERT models? @mittalsuraj18

@lunaryan

If you also see the warning:

"The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. The class this function is called from is 'LlamaTokenizer'. You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at #24565"

make sure to set the tokenizer type to None.
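
In practice that means not forcing a tokenizer class and letting AutoTokenizer resolve whatever class the checkpoint records. A minimal sketch, where "my_llama_dir" is a placeholder path:

from transformers import AutoTokenizer

# Leave tokenizer_type at its default of None so AutoTokenizer uses the
# class recorded in the checkpoint's tokenizer_config.json (here
# PreTrainedTokenizerFast) instead of forcing LlamaTokenizer.
tokenizer = AutoTokenizer.from_pretrained("my_llama_dir", tokenizer_type=None)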
