Use model.from_pretrained for DataParallel also #8795

Merged · 3 commits · Nov 30, 2020

Conversation

@shaie (Contributor) commented on Nov 26, 2020

When training on multiple GPUs, the code wraps the model with `torch.nn.DataParallel`. However, if the model has custom `from_pretrained` logic, it does not get applied during `load_best_model_at_end`.

This commit uses the underlying model during `load_best_model_at_end` and re-wraps the loaded model with `DataParallel`.

If you choose to reject this change, could you please move this logic into a function, e.g. `def load_best_model_checkpoint(best_model_checkpoint)` or similar, so that it can be overridden?
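
A minimal sketch of the pattern described above, where `self` stands for the Trainer; the helper name and the `is_data_parallel` flag are illustrative, not the actual upstream code:

```python
import torch


def _load_best_model(self):
    # Unwrap DataParallel so the underlying model class (and any custom
    # from_pretrained logic it defines) is used to reload the best checkpoint.
    is_data_parallel = isinstance(self.model, torch.nn.DataParallel)
    model = self.model.module if is_data_parallel else self.model

    self.model = model.from_pretrained(self.state.best_model_checkpoint)

    if is_data_parallel:
        # Re-wrap the freshly loaded model so multi-GPU training keeps working.
        self.model = torch.nn.DataParallel(self.model)
```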

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@sgugger (Collaborator) left a comment
I understand the problem, but I disagree with the solution you picked: the `model` attribute of the Trainer is always a reference to the original (unwrapped) model. I suggested a change that uses that reference for the fix; would you mind applying it?
Thanks!

Comment on lines 817 to 819
if is_data_parallel:
    # re-wrap with DataParallel
    self.model = torch.nn.DataParallel(self.model)
@sgugger (Collaborator) commented:

The `model` attribute of the Trainer is never wrapped in `DataParallel` or the like; it stays a reference to the original model, so these lines are not necessary.
By the same reasoning, I think the only change needed is on the line above this comment (L816): replace

self.model = model.from_pretrained(self.state.best_model_checkpoint)

by

self.model = self.model.from_pretrained(self.state.best_model_checkpoint)
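
As a small standalone illustration of why the unwrapped reference matters (not Trainer code; the tiny config values are arbitrary): `torch.nn.DataParallel` keeps the original model under `.module`, is not itself a `PreTrainedModel`, and does not forward `from_pretrained`:

```python
import torch
from transformers import BertConfig, BertModel, PreTrainedModel

# Tiny randomly initialized model so the example needs no download.
config = BertConfig(hidden_size=32, num_hidden_layers=1, num_attention_heads=2, intermediate_size=64)
model = BertModel(config)
wrapped = torch.nn.DataParallel(model)

assert wrapped.module is model                       # the original model lives under .module
print(isinstance(wrapped, PreTrainedModel))          # False: the wrapper hides the model's type
print(isinstance(wrapped.module, PreTrainedModel))   # True
print(hasattr(wrapped, "from_pretrained"))           # False: only the model class defines it
```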

@shaie replied: Thanks for the feedback. I made the change you proposed, but I also think we should update L811 to check whether `self.model` is an instance of `PreTrainedModel`; otherwise we would still not get into that `if` section, right?
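
A rough sketch of how that block could then read with both changes, using `self.model` for the isinstance check and for `from_pretrained`; the helper name, the surrounding structure, and the state-dict fallback are paraphrased from this discussion rather than quoted from trainer.py:

```python
import os

import torch
from transformers import WEIGHTS_NAME, PreTrainedModel


def _maybe_load_best_model(self):
    # Hypothetical helper sketching the end-of-training logic with both changes applied.
    if self.args.load_best_model_at_end and self.state.best_model_checkpoint is not None:
        if isinstance(self.model, PreTrainedModel):
            # `self.model` is the unwrapped model, so any custom from_pretrained logic runs here.
            self.model = self.model.from_pretrained(self.state.best_model_checkpoint)
        else:
            # Fallback for models without from_pretrained: load the raw state dict.
            state_dict = torch.load(os.path.join(self.state.best_model_checkpoint, WEIGHTS_NAME))
            self.model.load_state_dict(state_dict)
```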
@sgugger (Collaborator) left a comment

Perfect, thanks a lot!

@sgugger (Collaborator) commented on Nov 30, 2020

Oh, looks like there is one last code-style issue to fix. Could you run `make style` on your branch? Then we can merge this.

@shaie (Contributor, Author) commented on Nov 30, 2020

I don't have `make` installed 😄. What is the style issue? I wonder what could go wrong style-wise in such a simple patch; the only thing added is `self.` on those two lines.

@shaie (Contributor, Author) commented on Nov 30, 2020

`check_code_quality` complains about `finetune.py`, but that file isn't modified by this patch.

@sgugger (Collaborator) commented on Nov 30, 2020

Weird indeed. Will merge and fix if the issue persists.

@sgugger merged commit 7738494 into huggingface:master on Nov 30, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
* Use model.from_pretrained for DataParallel also

* Fix silly bug

* Address review comments
