[s2s trainer] fix DP mode #8823
Conversation
Very clean! Thanks
Moving the discussion out of the review commentary, since it disappears as soon as it's resolved; it's best to discuss it in the normal comments, as this is what this PR is trying to solve. Oh, I see - thank you for catching that. So I didn't solve the actual problem, but just got lucky hiding it under the carpet. The problem is that the
or re-coded to handle dp too? I don't know the initial intention - should it support it? We need to know whether to:
And, somewhat unrelated to the actual bug, I'd like to repeat the request from #8822 - let's have a simple flag so that the downstream code knows which mode it is under, rather than checking ranks and n_gpus, which is very confusing and error-prone.
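To make the request concrete, here is a minimal sketch of the kind of mode flag being asked for; the `ParallelMode` enum and `get_parallel_mode` helper are hypothetical names for illustration, not existing library code:

```python
import enum

import torch


class ParallelMode(enum.Enum):
    SINGLE = "single"  # one process, at most one GPU
    DP = "dp"          # torch.nn.DataParallel: one process driving several GPUs
    DDP = "ddp"        # DistributedDataParallel: one process per GPU


def get_parallel_mode(n_gpu: int) -> ParallelMode:
    """Hypothetical helper: give downstream code one flag to consult,
    instead of re-deriving the mode from ranks and GPU counts everywhere."""
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return ParallelMode.DDP  # a process group only exists under ddp
    if n_gpu > 1:
        return ParallelMode.DP  # several GPUs, but a single process
    return ParallelMode.SINGLE
```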
Here is where the problem happens with dp: transformers/examples/seq2seq/utils.py, lines 361 to 368 at commit 9995a34.
Great, so should we then change the signature to make it clear that ddp is wanted, and not just any distributed mode, and adjust the invocations accordingly?
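The signature suggested in the original comment was lost in extraction; a plausible reconstruction of the idea, using a hypothetical dataloader helper, might look like this:

```python
from typing import Optional

from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


def get_dataloader(dataset: Dataset, batch_size: int, ddp: bool = False) -> DataLoader:
    # Naming the argument `ddp` rather than `distributed` makes it explicit that
    # DistributedSampler is only correct when there is one process per GPU.
    sampler: Optional[DistributedSampler] = DistributedSampler(dataset) if ddp else None
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, shuffle=sampler is None)
```

Call sites would then pass `ddp=` based on the actual process-group state rather than on `n_gpu`.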
Great. Should we create a feature request for that?
I think there is a misunderstanding about the terminology:
We can do that, yes.
If you stick to the specific implementation, yes, ddp is the only distributed mode. But logically that doesn't make sense: DP is just as distributed as DDP, it just isn't using the
As an example, if you look at this function's usage pattern, it's mostly
I disagree, in the sense that code using PyTorch should stick with the PyTorch naming conventions. They chose to have a non-distributed DataParallel and a distributed DistributedDataParallel.
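For reference, these are the two PyTorch wrappers whose naming is being contrasted (standard PyTorch API; the tiny model is just for illustration):

```python
import torch

model = torch.nn.Linear(10, 10).cuda()

# "Not distributed": a single process drives every visible GPU.
dp_model = torch.nn.DataParallel(model)

# "Distributed": one process per GPU; this requires an initialized process group,
# e.g. torch.distributed.init_process_group() under torch.distributed.launch.
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[0])
```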
That is a reasonable choice to follow. I'm only flagging how this leads to coding errors when a developer assumes that n_gpu > 1 == ddp. So perhaps some extra support is needed there.
Let's see how it goes once we add the "distributed_env" to
@sgugger, please kindly review at your convenience - I addressed all the issues you raised, so all should be good. The CI failures are unrelated. Thank you!
Perfect, thanks a lot for humoring me and my annoying comments :-)
On the contrary, your comments were excellent and to the point. I was just slow to get your point of view, since in my mind, if we solve a problem on multiple GPUs, it's distributed across multiple GPUs, regardless of how it's implemented. But here "distributed" means distributed across multiple processes. Different semantics.
So this is probably wrong too:
But that's code based on PL. @patil-suraj, maybe you could have a look when you start working on this one? I suspect it should do a different check for distributed mode and not check the number of GPUs. Let me know if you prefer that I open a separate issue.
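The snippet referenced above was lost in extraction, but the direction of the suggested fix would be something like this sketch (the `build_sampler` helper is hypothetical):

```python
from typing import Optional

import torch
from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


def build_sampler(dataset: Dataset) -> Optional[DistributedSampler]:
    # Suspect check: `n_gpu > 1` is also true under DP, where a DistributedSampler
    # would be wrong because only a single process is running.
    # Suggested check: look for an actual process group instead.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return DistributedSampler(dataset)
    return None
```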
Dunno how PL works.
Added a feature request: #8858
Thank you, HuggingFace Team and @stas00 - I cannot express how much I appreciate your efforts.
* fix DP case on multi-gpu
* make executable
* test all 3 modes
* use the correct check for distributed
* dp doesn't need a special case
* restore original name
* cleanup
This PR:
* fixes DP mode in finetune_trainer.py
* makes it executable/runnable

@patrickvonplaten, @sgugger