
Distributed Trainer: 2 little fixes #7461

Merged
merged 5 commits into huggingface:master on Oct 1, 2020

Conversation

@sshleifer (Contributor) commented Sep 29, 2020

  1. Fix DDP access to model.config. We could also set self.config = model.config earlier in __init__.
  2. Switch torch.Tensor -> torch.tensor; the latter infers the dtype automatically (see the sketch below).
    With these changes, the command from Seq2SeqTrainer Distributed: AttributeError and the RuntimeError #7460 works.
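
For context, a minimal sketch (not part of the PR) of the dtype difference between the two constructors:

```python
import torch

# torch.Tensor is the tensor class constructor: it always produces float32,
# silently casting integer data to float.
print(torch.Tensor([1, 2, 3]).dtype)   # torch.float32

# torch.tensor is the factory function: it infers the dtype from the data.
print(torch.tensor([1, 2, 3]).dtype)   # torch.int64
print(torch.tensor([1.0, 2.0]).dtype)  # torch.float32
```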

CC @patil-suraj , @TevenLeScao

@sshleifer changed the title from "reset model.config" to "Trainer: reset model.config after calling DDP" on Sep 29, 2020
@sshleifer sshleifer linked an issue Sep 29, 2020 that may be closed by this pull request
@sgugger (Collaborator) commented Sep 29, 2020

Can we see where the config is accessed (in your error message)? model.config should be accessed as sparsely as possible in Trainer so it works with any kind of model, and I'll probably remove the requirement entirely soon.

@sshleifer changed the title from "Trainer: reset model.config after calling DDP" to "Distributed Trainer: 2 little fixes" on Sep 29, 2020
@@ -675,12 +675,14 @@ def train(self, model_path: Optional[str] = None, trial: Union["optuna.Trial", D

         # Distributed training (should be after apex fp16 initialization)
         if self.args.local_rank != -1:
+            config = model.config
@sgugger (Collaborator) commented on this diff, Sep 29, 2020

We shouldn't assume the model has a config without a proper check; having Trainer work with models that are not PreTrainedModels is a feature that has been requested. If there is an access to config that makes the code fail, we should fix that place.

@sshleifer (Contributor, Author) replied Sep 30, 2020

It's already assumed that model.config exists. The base trainer.py accesses model.config 23 times, including in the statement below this one

https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L682
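
For background on the underlying AttributeError: once the model is wrapped in DistributedDataParallel, plain Python attributes such as config live on wrapper.module, not on the wrapper itself. A minimal, self-contained sketch (TinyModel and its config dict are made up for illustration; this is not code from the PR):

```python
import os
import torch.distributed as dist
from torch import nn

# Single-process "gloo" group so DistributedDataParallel can be built on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class TinyModel(nn.Module):
    """Stand-in for a PreTrainedModel: carries a plain `config` attribute."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.config = {"pad_token_id": 0}

wrapped = nn.parallel.DistributedDataParallel(TinyModel())

print(hasattr(wrapped, "config"))  # False: the DDP wrapper does not expose it
print(wrapped.module.config)       # {'pad_token_id': 0}; .module is the original model

dist.destroy_process_group()
```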

@sshleifer (Contributor, Author) commented

Seq2SeqTrainer uses model.config 8 times, mostly pad_token_id, to avoid counting padding in the loss function (see the sketch below).
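
For illustration, a minimal sketch of that masking pattern (not Seq2SeqTrainer's actual code; the shapes and pad id below are invented):

```python
import torch
import torch.nn.functional as F

pad_token_id = 0  # in the trainer this would come from model.config.pad_token_id

# Fake batch: (batch, seq_len, vocab_size) logits and (batch, seq_len) labels,
# where the trailing label positions are padding.
logits = torch.randn(2, 5, 100)
labels = torch.tensor([[7, 12, 3, pad_token_id, pad_token_id],
                       [4, 9, pad_token_id, pad_token_id, pad_token_id]])

# ignore_index excludes the padded positions from the cross-entropy average,
# so padding does not dilute the loss.
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    labels.view(-1),
    ignore_index=pad_token_id,
)
print(loss.item())
```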

@sgugger (Collaborator) commented Sep 30, 2020

We should add an assert that the model is a PreTrainedModel at init, just to be clean. Then, for your specific problem, use the function self._actual_model() to grab the config and avoid your error (e.g., self.model.config -> self._actual_model().config).

Trainer is on its way to fully handling models without a config, see #7464.
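
A minimal sketch of the unwrapping idea behind such a helper (illustrative only, not the actual _actual_model implementation): the DDP/DataParallel wrapper keeps the original model on .module, so the config can be read from there.

```python
from torch import nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Hypothetical helper: return the underlying model if it has been wrapped
    by DataParallel or DistributedDataParallel, otherwise return it unchanged."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# Usage sketch: config = unwrap_model(self.model).config  instead of self.model.config
```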

@sshleifer (Contributor, Author) commented

OK. I reduced the scope of this PR to just the Tensor -> tensor change.

@sgugger (Collaborator) left a comment

That works for me :-)

@sshleifer sshleifer merged commit 097049b into huggingface:master Oct 1, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
* reset model.config

* Update src/transformers/trainer.py

* use lower case tensor

* Just tensor change
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Linked issue that may be closed by merging this pull request: Seq2SeqTrainer Distributed: AttributeError and the RuntimeError (#7460)