
[trainer] make generate work with multigpu #8716

Merged: merged 2 commits into huggingface:master on Nov 23, 2020

Conversation

stas00
Contributor

@stas00 stas00 commented Nov 22, 2020

This PR:

  • fixes `torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'generate'` under `DataParallel`
  • enables `test_finetune_bert2bert` under multi-GPU; the test now works with any number of GPUs

Chances are that the same problem affects any other `model.foo` call, as this is not the first time this has happened; i.e., the base model class most likely needs to be made aware of `DataParallel` and transparently unwrap the model at the calling point.
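To illustrate the failure mode without needing torch or multiple GPUs, here is a minimal sketch using stand-in classes (`FakeModel` and `FakeDataParallel` are hypothetical names; like `torch.nn.DataParallel`, the wrapper keeps the real model in `.module` and does not forward arbitrary attributes such as `generate`):

```python
class FakeModel:
    """Stand-in for a transformers model with a generate() method."""
    def generate(self, prompt):
        return f"generated from {prompt}"

    def forward(self, x):
        return x


class FakeDataParallel:
    """Stand-in for torch.nn.DataParallel: it forwards forward() and
    stores the real model in .module, but it does NOT expose the
    model's other methods, such as generate()."""
    def __init__(self, module):
        self.module = module

    def forward(self, x):
        return self.module.forward(x)


model = FakeDataParallel(FakeModel())
try:
    model.generate("hello")  # fails: the wrapper has no generate()
except AttributeError as e:
    print("AttributeError:", e)

# Unwrapping via .module works:
print(model.module.generate("hello"))  # prints "generated from hello"
```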

@sgugger, @LysandreJik, @patrickvonplaten

Fixes: #8713

@stas00
Contributor Author

stas00 commented Nov 23, 2020

Normally the `model` attribute of the `Trainer` is always a reference to the real model (without the `module` wrapper from `DataParallel` and the like), so using `self.model` here should prevent this error.

It did - thank you!

This is a very ambiguous situation for a user who wants to use the HF trainer in their code: when should one use the `model` argument and when `self.model`?

What happens here is `model = torch.nn.DataParallel(self.model)` in the previous frame (`src/transformers/trainer.py:prediction_loop`), so `model` no longer has its normal methods accessible.
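A minimal sketch of the pattern the fix relies on, using hypothetical stand-in classes rather than the real torch/Trainer code: the trainer keeps an unwrapped reference in `self.model`, so `prediction_step` can call `generate()` on it even after the `model` argument has been wrapped:

```python
class FakeModel:
    """Stand-in for the real (unwrapped) model."""
    def generate(self):
        return "generated"


class FakeDataParallel:
    """Stand-in for torch.nn.DataParallel: stores the model in .module
    and does not expose generate()."""
    def __init__(self, module):
        self.module = module


class TinyTrainer:
    def __init__(self, model):
        self.model = model  # always a reference to the unwrapped model

    def prediction_loop(self):
        # The multi-GPU path wraps the model before calling prediction_step.
        model = FakeDataParallel(self.model)
        return self.prediction_step(model)

    def prediction_step(self, model):
        # The fix: call generate() on self.model, not on the (possibly
        # wrapped) `model` argument.
        return self.model.generate()


print(TinyTrainer(FakeModel()).prediction_loop())  # prints "generated"
```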

Here are some possible solutions to resolve this ambiguity:

  1. monkeypatch `torch.nn.DataParallel` to expand its API to transparently support all the methods of the original model, by installing a catch-all `__getattr__` that remaps failed attribute lookups to `self.module`.

  2. stop calling the function argument `model`, since under multi-GPU it isn't the model but something else.

  3. remove the `model` argument completely and document that one should always use `self.model`. Currently in `seq2seq_trainer.py`, once we switch to `self.model`, `prediction_step()` no longer needs `model` as an argument (but is that always the case?)
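Solution 1's delegation idea can be sketched in plain Python (a stand-in wrapper, not the actual monkeypatch; note that real `nn.Module` subclasses already define `__getattr__` for parameters and buffers, so a real patch would need to chain to the existing implementation):

```python
class Model:
    """Stand-in for the wrapped model."""
    def generate(self):
        return "generated"


class DelegatingParallel:
    """Sketch of solution 1: a wrapper whose __getattr__ delegates any
    failed attribute lookup to the wrapped module."""
    def __init__(self, module):
        self.module = module

    def __getattr__(self, name):
        # __getattr__ is only called when normal lookup fails, so
        # attributes defined on the wrapper itself take precedence.
        return getattr(self.module, name)


wrapped = DelegatingParallel(Model())
print(wrapped.generate())  # delegated to the underlying model
```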

@sgugger
Collaborator

sgugger commented Nov 23, 2020

We can certainly improve the documentation and the debugging experience. I think I prefer solution 2, since solution 1 is too magical (and so will probably make things harder to debug) and solution 3 is not compatible with the regular `Trainer` (which needs the unwrapped model, though I'd need to check to be sure).

Renaming `model` -> `wrapped_model` should be enough to clarify things? Wdyt?

@stas00
Contributor Author

stas00 commented Nov 23, 2020

[...] solution 3 is not compatible with the regular `Trainer` (which needs the unwrapped model, though I'd need to check to be sure).

Did you mean to say "needs the wrapped model"?

Unless I'm misreading what you wrote, the 3rd solution is the right one, since the `Trainer` doesn't do anything with the wrapped model. I don't know, though, whether this holds everywhere.

The 4th solution is passing `self.model` as the `model` arg, and making the wrapped model available via `self.wrapped_model` if the user needs it.

Renaming `model` -> `wrapped_model` should be enough to clarify things? Wdyt?

Except it won't be wrapped, per se, most of the time, which is very confusing to the user. Currently it should be called the `may_be_wrapped_model_use_self_model_instead` variable ;)

@sgugger
Collaborator

sgugger commented Nov 23, 2020

I meant the wrapped model, sorry.

@stas00 stas00 merged commit 1e45bef into huggingface:master Nov 23, 2020
@stas00 stas00 deleted the finetrain branch November 23, 2020 18:58
@JamesDeAntonis
Contributor

JamesDeAntonis commented Sep 22, 2021

I'm getting this issue too, using a T5 model on multiple GPUs:

`AttributeError: 'DataParallel' object has no attribute 'generate'`

Is this supposed to be resolved? I've never seen this error before. I've tried with 4.10.0 as well as the current master branch.

@stas00
Contributor Author

stas00 commented Sep 22, 2021

@JamesDeAntonis

Is it possible you somehow have a really old transformers in your `sys.path`?

If not, as always, we need a way to reproduce the problem as the first step, ideally in a new issue so that it can be tracked.

But you can also look at the fix in this PR and try to trace where the `generate` call is made; clearly it isn't being called on the correct object.

Thank you.

Successfully merging this pull request may close these issues.

eval of seq2seq/finetune_trainer does not work on multiple gpus