Use model.from_pretrained for DataParallel also #8795

Merged · 3 commits · Nov 30, 2020

Conversation

@shaie (Contributor) commented on Nov 26, 2020

When training on multiple GPUs, the code wraps the model with `torch.nn.DataParallel`. However, if the model has custom `from_pretrained` logic, it does not get applied during `load_best_model_at_end`.

This commit uses the underlying model during `load_best_model_at_end` and re-wraps the loaded model with `DataParallel`.

If you choose to reject this change, could you please move this logic into a function, e.g. `def load_best_model_checkpoint(best_model_checkpoint)` or similar, so that it can be overridden?
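
A minimal sketch of the pattern described above, where `self` stands for the Trainer; the helper name and the `is_data_parallel` flag are illustrative, not the actual upstream code:

```python
import torch


def _load_best_model(self):
    # Unwrap DataParallel so the underlying model class (and any custom
    # from_pretrained logic it defines) is used to reload the best checkpoint.
    is_data_parallel = isinstance(self.model, torch.nn.DataParallel)
    model = self.model.module if is_data_parallel else self.model

    self.model = model.from_pretrained(self.state.best_model_checkpoint)

    if is_data_parallel:
        # Re-wrap the freshly loaded model so multi-GPU training keeps working.
        self.model = torch.nn.DataParallel(self.model)
```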

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@sgugger (Collaborator) left a comment
I understand the problem, but I disagree with the solution you picked: the `model` attribute of the Trainer is always a reference to the original (unwrapped) model. I suggested a change that uses that reference for the fix; would you mind applying it?
Thanks!

Comment on lines 817 to 819
if is_data_parallel:
    # re-wrap with DataParallel
    self.model = torch.nn.DataParallel(self.model)
@sgugger (Collaborator) commented:

The `model` attribute of the Trainer is never wrapped in `DataParallel` or the like; it stays a reference to the original model, so these lines are not necessary.
By the same reasoning, I think the only change needed is on the line above this comment (L816): replace

self.model = model.from_pretrained(self.state.best_model_checkpoint)

by

self.model = self.model.from_pretrained(self.state.best_model_checkpoint)
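
As a small standalone illustration of why the unwrapped reference matters (not Trainer code; the tiny config values are arbitrary): `torch.nn.DataParallel` keeps the original model under `.module`, is not itself a `PreTrainedModel`, and does not forward `from_pretrained`:

```python
import torch
from transformers import BertConfig, BertModel, PreTrainedModel

# Tiny randomly initialized model so the example needs no download.
config = BertConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2, intermediate_size=64)
model = BertModel(config)
wrapped = torch.nn.DataParallel(model)

assert wrapped.module is model                       # the original model lives under .module
print(isinstance(wrapped, PreTrainedModel))          # False: the wrapper hides the model's type
print(isinstance(wrapped.module, PreTrainedModel))   # True
print(hasattr(wrapped, "from_pretrained"))           # False: only the model class defines it
```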

@shaie replied: Thanks for the feedback. I made the change you proposed, but I also think we should update L811 to check whether `self.model` is an instance of `PreTrainedModel`; otherwise we would still not get into that `if` section, right?
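
A rough sketch of how that block could then read with both changes, using `self.model` for the isinstance check and for `from_pretrained`; the helper name, the surrounding structure, and the state-dict fallback are paraphrased from this discussion rather than quoted from trainer.py:

```python
import os

import torch
from transformers import WEIGHTS_NAME, PreTrainedModel


def _maybe_load_best_model(self):
    # Hypothetical helper sketching the end-of-training logic with both changes applied.
    if self.args.load_best_model_at_end and self.state.best_model_checkpoint is not None:
        if isinstance(self.model, PreTrainedModel):
            # `self.model` is the unwrapped model, so any custom from_pretrained logic runs here.
            self.model = self.model.from_pretrained(self.state.best_model_checkpoint)
        else:
            # Fallback for models without from_pretrained: load the raw state dict.
            state_dict = torch.load(os.path.join(self.state.best_model_checkpoint, WEIGHTS_NAME))
            self.model.load_state_dict(state_dict)
```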
@sgugger (Collaborator) left a comment

Perfect, thanks a lot!

@sgugger (Collaborator) commented on Nov 30, 2020

Oh, looks like there is one last code-style issue to fix. Could you run `make style` on your branch? Then we can merge this.

@shaie (Contributor, Author) commented on Nov 30, 2020

I don't have `make` installed 😄. What is the style issue? I wonder what could go wrong style-wise in such a simple patch; the only thing added is `self.` on those two lines.

@shaie (Contributor, Author) commented on Nov 30, 2020

`check_code_quality` complains about `finetune.py`, but that file isn't modified by this patch.

@sgugger (Collaborator) commented on Nov 30, 2020

Weird indeed. Will merge and fix if the issue persists.

@sgugger merged commit 7738494 into huggingface:master on Nov 30, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
* Use model.from_pretrained for DataParallel also

* Fix silly bug

* Address review comments
