
Distributed Trainer: 2 little fixes #7461

Merged
merged 5 commits into huggingface:master on Oct 1, 2020

Conversation

@sshleifer (Contributor) commented Sep 29, 2020

  1. Fix DDP access to model.config. We could also set self.config = model.config earlier in __init__.
  2. Switch torch.Tensor -> torch.tensor; the latter infers the dtype automatically (see the sketch below).
    With these changes, the command from Seq2SeqTrainer Distributed: AttributeError and the RuntimeError #7460 works.
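
For context, a minimal sketch (not part of the PR) of the dtype difference between the two constructors:

```python
import torch

# torch.Tensor is the tensor class constructor: it always produces float32,
# silently casting integer data to float.
print(torch.Tensor([1, 2, 3]).dtype)   # torch.float32

# torch.tensor is the factory function: it infers the dtype from the data.
print(torch.tensor([1, 2, 3]).dtype)   # torch.int64
print(torch.tensor([1.0, 2.0]).dtype)  # torch.float32
```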

CC @patil-suraj , @TevenLeScao

@sshleifer changed the title from "reset model.config" to "Trainer: reset model.config after calling DDP" on Sep 29, 2020
@sshleifer sshleifer linked an issue Sep 29, 2020 that may be closed by this pull request
@sgugger (Collaborator) commented Sep 29, 2020

Can we see where the config is accessed (in your error message)? model.config should be accessed as sparsely as possible in Trainer so it works with any kind of model, and I'll probably remove the requirement entirely soon.

@sshleifer changed the title from "Trainer: reset model.config after calling DDP" to "Distributed Trainer: 2 little fixes" on Sep 29, 2020
@@ -675,12 +675,14 @@ def train(self, model_path: Optional[str] = None, trial: Union["optuna.Trial", D

         # Distributed training (should be after apex fp16 initialization)
         if self.args.local_rank != -1:
+            config = model.config
@sgugger (Collaborator) commented on this diff, Sep 29, 2020

We shouldn't assume the model has a config without a proper check; having Trainer work with models that are not PreTrainedModels is a feature that has been requested. If there is an access to config that makes the code fail, we should fix that place.

@sshleifer (Contributor, Author) replied Sep 30, 2020

It's already assumed that model.config exists. The base trainer.py accesses model.config 23 times, including in the statement below this one

https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L682
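
For background on the underlying AttributeError: once the model is wrapped in DistributedDataParallel, plain Python attributes such as config live on wrapper.module, not on the wrapper itself. A minimal, self-contained sketch (TinyModel and its config dict are made up for illustration; this is not code from the PR):

```python
import os
import torch.distributed as dist
from torch import nn

# Single-process "gloo" group so DistributedDataParallel can be built on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class TinyModel(nn.Module):
    """Stand-in for a PreTrainedModel: carries a plain `config` attribute."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.config = {"pad_token_id": 0}

wrapped = nn.parallel.DistributedDataParallel(TinyModel())

print(hasattr(wrapped, "config"))  # False: the DDP wrapper does not expose it
print(wrapped.module.config)       # {'pad_token_id': 0}; .module is the original model

dist.destroy_process_group()
```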

@sshleifer (Contributor, Author) commented

Seq2SeqTrainer uses model.config 8 times, mostly pad_token_id, to avoid counting padding in the loss function (see the sketch below).
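
For illustration, a minimal sketch of that masking pattern (not Seq2SeqTrainer's actual code; the shapes and pad id below are invented):

```python
import torch
import torch.nn.functional as F

pad_token_id = 0  # in the trainer this would come from model.config.pad_token_id

# Fake batch: (batch, seq_len, vocab_size) logits and (batch, seq_len) labels,
# where the trailing label positions are padding.
logits = torch.randn(2, 5, 100)
labels = torch.tensor([[7, 12, 3, pad_token_id, pad_token_id],
                       [4, 9, pad_token_id, pad_token_id, pad_token_id]])

# ignore_index excludes the padded positions from the cross-entropy average,
# so padding does not dilute the loss.
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    labels.view(-1),
    ignore_index=pad_token_id,
)
print(loss.item())
```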

@sgugger (Collaborator) commented Sep 30, 2020

We should add an assert that the model is a PreTrainedModel at init, just to be clean. Then, for your specific problem, use the function self._actual_model() to grab the config and avoid your error (e.g., self.model.config -> self._actual_model().config).

Trainer is on its way to fully handling models without a config, see #7464.
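
A minimal sketch of the unwrapping idea behind such a helper (illustrative only, not the actual _actual_model implementation): the DDP/DataParallel wrapper keeps the original model on .module, so the config can be read from there.

```python
from torch import nn

def unwrap_model(model: nn.Module) -> nn.Module:
    """Hypothetical helper: return the underlying model if it has been wrapped
    by DataParallel or DistributedDataParallel, otherwise return it unchanged."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model

# Usage sketch: config = unwrap_model(self.model).config  instead of self.model.config
```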

@sshleifer (Contributor, Author) commented

OK. I reduced the scope of this PR to just the Tensor -> tensor change.

@sgugger (Collaborator) left a comment

That works for me :-)

@sshleifer sshleifer merged commit 097049b into huggingface:master Oct 1, 2020
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
* reset model.config

* Update src/transformers/trainer.py

* use lower case tensor

* Just tensor change
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Linked issue that may be closed by merging this pull request: Seq2SeqTrainer Distributed: AttributeError and the RuntimeError (#7460)