
[trainer] make generate work with multigpu #8716

Merged: merged 2 commits into huggingface:master on Nov 23, 2020

Conversation

stas00
Contributor

@stas00 stas00 commented Nov 22, 2020

This PR:

  • fixes `torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'generate'` under `DataParallel`
  • enables `test_finetune_bert2bert` under multi-GPU; the test now works with any number of GPUs

Chances are that the same problem affects any other `model.foo` call, as this is not the first time this has happened; i.e., the base model class most likely needs to be made aware of `DataParallel` and transparently unwrap the model at the calling point.
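To illustrate the failure mode without needing torch or multiple GPUs, here is a minimal sketch using stand-in classes (`FakeModel` and `FakeDataParallel` are hypothetical names; like `torch.nn.DataParallel`, the wrapper keeps the real model in `.module` and does not forward arbitrary attributes such as `generate`):

```python
class FakeModel:
    """Stand-in for a transformers model with a generate() method."""
    def generate(self, prompt):
        return f"generated from {prompt}"

    def forward(self, x):
        return x


class FakeDataParallel:
    """Stand-in for torch.nn.DataParallel: it forwards forward() and
    stores the real model in .module, but it does NOT expose the
    model's other methods, such as generate()."""
    def __init__(self, module):
        self.module = module

    def forward(self, x):
        return self.module.forward(x)


model = FakeDataParallel(FakeModel())
try:
    model.generate("hello")  # fails: the wrapper has no generate()
except AttributeError as e:
    print("AttributeError:", e)

# Unwrapping via .module works:
print(model.module.generate("hello"))  # prints "generated from hello"
```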

@sgugger, @LysandreJik, @patrickvonplaten

Fixes: #8713

@stas00
Contributor Author

stas00 commented Nov 23, 2020

Normally the `model` attribute of the `Trainer` is always a reference to the real model (without the `module` wrapper from `DataParallel` and the like), so using `self.model` here should prevent this error.

It did - thank you!

This is a very ambiguous situation for a user who wants to use the HF trainer in their code: when should one use the `model` argument and when `self.model`?

What happens here is `model = torch.nn.DataParallel(self.model)` in the previous frame (`src/transformers/trainer.py:prediction_loop`), so `model` no longer has its normal methods accessible.
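A minimal sketch of the pattern the fix relies on, using hypothetical stand-in classes rather than the real torch/Trainer code: the trainer keeps an unwrapped reference in `self.model`, so `prediction_step` can call `generate()` on it even after the `model` argument has been wrapped:

```python
class FakeModel:
    """Stand-in for the real (unwrapped) model."""
    def generate(self):
        return "generated"


class FakeDataParallel:
    """Stand-in for torch.nn.DataParallel: stores the model in .module
    and does not expose generate()."""
    def __init__(self, module):
        self.module = module


class TinyTrainer:
    def __init__(self, model):
        self.model = model  # always a reference to the unwrapped model

    def prediction_loop(self):
        # The multi-GPU path wraps the model before calling prediction_step.
        model = FakeDataParallel(self.model)
        return self.prediction_step(model)

    def prediction_step(self, model):
        # The fix: call generate() on self.model, not on the (possibly
        # wrapped) `model` argument.
        return self.model.generate()


print(TinyTrainer(FakeModel()).prediction_loop())  # prints "generated"
```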

Here are some possible solutions to resolve this ambiguity:

  1. monkeypatch `torch.nn.DataParallel` to expand its API to transparently support all the methods of the original model, by installing a catch-all `__getattr__` that remaps failed attribute lookups to `self.module`.

  2. stop calling the function argument `model`, since under multi-GPU it isn't the model but something else.

  3. remove the `model` argument completely and document that one should always use `self.model`. Currently in `seq2seq_trainer.py`, once we switch to `self.model`, `prediction_step()` no longer needs `model` as an argument (but is that always the case?)
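Solution 1's delegation idea can be sketched in plain Python (a stand-in wrapper, not the actual monkeypatch; note that real `nn.Module` subclasses already define `__getattr__` for parameters and buffers, so a real patch would need to chain to the existing implementation):

```python
class Model:
    """Stand-in for the wrapped model."""
    def generate(self):
        return "generated"


class DelegatingParallel:
    """Sketch of solution 1: a wrapper whose __getattr__ delegates any
    failed attribute lookup to the wrapped module."""
    def __init__(self, module):
        self.module = module

    def __getattr__(self, name):
        # __getattr__ is only called when normal lookup fails, so
        # attributes defined on the wrapper itself take precedence.
        return getattr(self.module, name)


wrapped = DelegatingParallel(Model())
print(wrapped.generate())  # delegated to the underlying model
```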

@sgugger
Collaborator

sgugger commented Nov 23, 2020

We can certainly improve the documentation and the debugging experience. I think I prefer solution 2, since solution 1 is too magical (and so will probably make things harder to debug) and solution 3 is not compatible with the regular `Trainer` (which needs the unwrapped model, though I'd need to check to be sure).

Renaming `model` -> `wrapped_model` should be enough to clarify things? Wdyt?

@stas00
Contributor Author

stas00 commented Nov 23, 2020

[...] solution 3 is not compatible with the regular `Trainer` (which needs the unwrapped model, though I'd need to check to be sure).

Did you mean to say "needs the wrapped model"?

Unless I'm misreading what you wrote, the 3rd solution is the right one, since the `Trainer` doesn't do anything with the wrapped model. I don't know, though, whether this holds everywhere.

The 4th solution is passing `self.model` as the `model` arg, and making the wrapped model available via `self.wrapped_model` if the user needs it.

Renaming `model` -> `wrapped_model` should be enough to clarify things? Wdyt?

Except it won't be wrapped, per se, most of the time, which is very confusing to the user. Currently it should be called the `may_be_wrapped_model_use_self_model_instead` variable ;)

@sgugger
Collaborator

sgugger commented Nov 23, 2020

I meant the wrapped model, sorry.

@stas00 stas00 merged commit 1e45bef into huggingface:master Nov 23, 2020
@stas00 stas00 deleted the finetrain branch November 23, 2020 18:58
@JamesDeAntonis
Contributor

JamesDeAntonis commented Sep 22, 2021

I'm getting this issue too, using a T5 model on multiple GPUs:

`AttributeError: 'DataParallel' object has no attribute 'generate'`

Is this supposed to be resolved? I've never seen this error before. I've tried with 4.10.0 as well as the current master branch.

@stas00
Contributor Author

stas00 commented Sep 22, 2021

@JamesDeAntonis

Is it possible you somehow have a really old transformers in your `sys.path`?

If not, as always, we need a way to reproduce the problem as the first step, ideally in a new issue so that it can be tracked.

But you can also look at the fix in this PR and try to trace where the `generate` call is made; clearly it isn't being called on the correct object.

Thank you.

Successfully merging this pull request may close these issues.

eval of seq2seq/finetune_trainer does not work on multiple gpus