Model config is logged twice on startup #11104

Open
stas00 opened this issue Apr 7, 2021 · 1 comment
Labels: WIP

stas00 (Contributor) commented Apr 7, 2021

Currently, the model config is logged twice during startup:

  1. via AutoConfig.from_pretrained
  2. via AutoTokenizer.from_pretrained -> AutoConfig.from_pretrained

Should there be a state variable that prevents the logging of the same config twice?
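
One possible shape for such a guard (a rough sketch of the idea only; log_config_once and _ALREADY_LOGGED are hypothetical names, not anything that exists in transformers):

    import logging

    logger = logging.getLogger(__name__)

    _ALREADY_LOGGED = set()  # hypothetical module-level state remembering which configs were dumped

    def log_config_once(config_json: str):
        # Log the serialized config only the first time this exact text is seen;
        # later identical dumps are demoted to DEBUG so the INFO output stays clean.
        fingerprint = hash(config_json)
        if fingerprint in _ALREADY_LOGGED:
            logger.debug("Model config already logged, skipping duplicate dump")
            return
        _ALREADY_LOGGED.add(fingerprint)
        logger.info("Model config %s", config_json)

Keying on the serialized config (rather than object identity) would also catch the tokenizer path, since it builds a fresh but identical config object.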

This happens, for example, with all of the example scripts:

Example log when running run_clm.py:

  File "examples/language-modeling/run_clm.py", line 444, in <module>
    main()
  File "examples/language-modeling/run_clm.py", line 275, in main
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/auto/configuration_auto.py", line 401, in from_pretrained
    return config_class.from_dict(config_dict, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/configuration_utils.py", line 526, in from_dict
    traceback.print_stack()
[INFO|configuration_utils.py:527] 2021-04-06 21:16:04,999 >> Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.5.0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|configuration_utils.py:490] 2021-04-06 21:16:05,277 >> loading configuration file https://huggingface.co/distilgpt2/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/f985248d2791fcff97732e4ee263617adec1edb5429a2b8421734c6d14e39bee.422318838d1ec4e061efb4ea29671cb2a044e244dc69229682bebd7cacc81631
  File "examples/language-modeling/run_clm.py", line 444, in <module>
    main()
  File "examples/language-modeling/run_clm.py", line 289, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/auto/tokenization_auto.py", line 390, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/models/auto/configuration_auto.py", line 401, in from_pretrained
    return config_class.from_dict(config_dict, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-gpt-neo-nan/src/transformers/configuration_utils.py", line 526, in from_dict
    traceback.print_stack()
[INFO|configuration_utils.py:527] 2021-04-06 21:16:05,279 >> Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.5.0",
  "use_cache": true,
  "vocab_size": 50257
}

To get the traceback I just added:

        # added inside from_dict in src/transformers/configuration_utils.py,
        # right before the "Model config ..." log line (line 526 in the trace above)
        import traceback
        traceback.print_stack()
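
For context, the double dump can be reproduced without the example scripts by running just the two Auto calls the script makes (a minimal sketch; distilgpt2 is the checkpoint from the log above, and INFO verbosity matches what the example scripts configure):

    import transformers

    transformers.logging.set_verbosity_info()

    # Both calls end up in PretrainedConfig.from_dict, which emits the full
    # "Model config GPT2Config {...}" dump, so the same config is printed twice.
    config = transformers.AutoConfig.from_pretrained("distilgpt2")        # first dump
    tokenizer = transformers.AutoTokenizer.from_pretrained("distilgpt2")  # AutoTokenizer -> AutoConfig -> second dump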

@LysandreJik, @sgugger

sgugger (Collaborator) commented Apr 7, 2021

Yes, this is known and on the roadmap for the tokenizers: the goal is for them to use their specific config instead of the model config. Hoping to get to that in the next month or so!

huggingface deleted a comment from the github-actions bot May 7, 2021
stas00 added the WIP label May 7, 2021