AutoTokenizer not able to load saved Roberta Tokenizer #4197

Closed
noncuro opened this issue May 7, 2020 · 4 comments

noncuro commented May 7, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...): Roberta

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

I'm trying to run run_language_modeling.py with my own tokenizer.

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

# Colab: install/upgrade transformers
!pip install transformers --upgrade

from transformers import AutoTokenizer

# Download the pretrained tokenizer, save it locally, then try to reload it.
my_tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
!mkdir my_tokenizer
my_tokenizer.save_pretrained("my_tokenizer")
my_tokenizer2 = AutoTokenizer.from_pretrained("./my_tokenizer")  # raises OSError
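For context, listing the output directory shows what save_pretrained actually wrote (a hedged sketch; the exact file set depends on the transformers version):

import os

# Typical contents for a RoBERTa-style tokenizer: vocab.json, merges.txt,
# special_tokens_map.json, tokenizer_config.json. Note that config.json
# (the *model* config) is not among them, which is what the trace below
# complains about.
print(sorted(os.listdir("my_tokenizer")))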

Stack Trace:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
    246                 resume_download=resume_download,
--> 247                 local_files_only=local_files_only,
    248             )

(... 4 intermediate frames elided by the Colab traceback display ...)
/usr/local/lib/python3.6/dist-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
    260         # File, but it doesn't exist.
--> 261         raise EnvironmentError("file {} not found".format(url_or_filename))
    262     else:

OSError: file ./my_tokenizer/config.json not found

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-16-b699c4abfe8e> in <module>()
      3 get_ipython().system('mkdir my_tokenizer')
      4 my_tokenizer.save_pretrained("my_tokenizer")
----> 5 my_tokenizer2 = AutoTokenizer.from_pretrained("./my_tokenizer")

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    184         config = kwargs.pop("config", None)
    185         if not isinstance(config, PretrainedConfig):
--> 186             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    187 
    188         if "bert-base-japanese" in pretrained_model_name_or_path:

/usr/local/lib/python3.6/dist-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    185         """
    186         config_dict, _ = PretrainedConfig.get_config_dict(
--> 187             pretrained_model_name_or_path, pretrained_config_archive_map=ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, **kwargs
    188         )
    189 

/usr/local/lib/python3.6/dist-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
    268                     )
    269                 )
--> 270             raise EnvironmentError(msg)
    271 
    272         except json.JSONDecodeError:

OSError: Can't load './my_tokenizer'. Make sure that:

- './my_tokenizer' is a correct model identifier listed on 'https://huggingface.co/models'

- or './my_tokenizer' is the correct path to a directory containing a 'config.json' file

Expected behavior

AutoTokenizer should be able to load the tokenizer from the saved directory.

Possible duplicate of #1063 and #3838.

Environment info

  • transformers version:
  • Platform: Google Colab
  • Python version: 3.6.9
  • PyTorch version (GPU?): N/A
  • Tensorflow version (GPU?): N/A
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

jaymody (Contributor) commented May 7, 2020

It looks like when you load a tokenizer from a directory, it also looks for the files to load its related model config via AutoConfig.from_pretrained. It does this because it uses the information from the config to determine which model class the tokenizer belongs to (BERT, XLNet, etc.), since there is no way of knowing that from the saved tokenizer files themselves. This is deceptive, as intuitively it would seem you should be able to load a tokenizer without needing its model config. Plus, in the documentation of AutoTokenizer, under examples, it states:

# If vocabulary files are in a directory (e.g. tokenizer was saved using `save_pretrained('./test/saved_model/')`)
tokenizer = AutoTokenizer.from_pretrained('./test/bert_saved_model/')

Either:

  1. The documentation needs an update clarifying that config must be passed as a kwarg to from_pretrained, or that the directory must also contain the files AutoConfig needs (when loading from a directory).
  2. AutoTokenizer should be changed to be independent of AutoConfig (maybe the model base class name could be stored alongside "tokenizer_config.json"; see the sketch after this list).
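A minimal sketch of what option 2 could look like (hypothetical: the "tokenizer_class" key and the save_with_class_hint helper are illustrations, not part of the library):

import json
import os

def save_with_class_hint(tokenizer, save_dir):
    # Hypothetical helper: record which tokenizer class produced these files,
    # so an Auto class could dispatch without needing the model's config.json.
    tokenizer.save_pretrained(save_dir)
    cfg_path = os.path.join(save_dir, "tokenizer_config.json")
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg["tokenizer_class"] = type(tokenizer).__name__  # e.g. "RobertaTokenizer"
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)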

atowey01 commented May 13, 2020

I had the same issue loading a saved tokenizer. Separately loading the config and saving it into the tokenizer's directory worked:

from transformers import AutoConfig

# saved_tokenizer_directory is the directory the tokenizer was saved to.
config_file = AutoConfig.from_pretrained('distilgpt2')
config_file.save_pretrained(saved_tokenizer_directory)  # writes config.json
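Put together, a fuller sketch of this workaround, applied to the distilroberta-base case from the original report (assumes network access to fetch the pretrained files):

import os

from transformers import AutoConfig, AutoTokenizer

saved_dir = "my_tokenizer"
os.makedirs(saved_dir, exist_ok=True)

# Save the tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
tokenizer.save_pretrained(saved_dir)

# Also save the model config; this writes the config.json that
# AutoTokenizer uses to identify the model type on reload.
config = AutoConfig.from_pretrained("distilroberta-base")
config.save_pretrained(saved_dir)

# Reloading now succeeds because config.json is present.
tokenizer2 = AutoTokenizer.from_pretrained(saved_dir)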

yoavz commented May 17, 2020

+1 to @jaymody. It seems as though AutoTokenizer is coupled to the model that uses it, and I don't see why it should be impossible to load a tokenizer without an explicit model configuration.

For those who happen to be loading a tokenizer and model using the Auto classes, I was able to use the following workaround, where my tokenizer relies on the configuration in the model directory:

tokenizer = AutoTokenizer.from_pretrained(<tokenizer-dir>, config=AutoConfig.from_pretrained(<model-dir>))
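For example (the directory names are hypothetical placeholders; substitute wherever the tokenizer and model were saved):

from transformers import AutoConfig, AutoTokenizer

# Hypothetical paths for illustration.
tokenizer = AutoTokenizer.from_pretrained(
    "./my_tokenizer",
    config=AutoConfig.from_pretrained("./my_model"),
)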

stale bot commented Jul 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 17, 2020
stale bot closed this as completed Jul 25, 2020