
[s2s trainer] fix DP mode #8823

Merged - 8 commits - Nov 30, 2020

Conversation

stas00
Contributor

@stas00 stas00 commented Nov 27, 2020

This PR:

@patrickvonplaten, @sgugger

Contributor

@patrickvonplaten patrickvonplaten left a comment


Very clean! Thanks

Review comments (outdated, resolved):
examples/seq2seq/seq2seq_trainer.py
examples/seq2seq/test_finetune_trainer.py
@stas00
Contributor Author

stas00 commented Nov 30, 2020

Moving the discussion out of the review commentary, since it disappears as soon as it's resolved - it's best to discuss this in the normal comments, as this is what this PR is trying to solve.


Oh, I see - thank you for catching that. So I didn't solve the actual problem; I just got lucky in sweeping it under the carpet.

The problem is that distributed=... is wrong here - it is currently coded to expect DDP, not DP, when distributed==True. DP doesn't have get_world_size()/etc., so it fails. Should that arg be called ddp instead of distributed? In any case, the correct solution is then:

                self.train_dataset.make_sortish_sampler(
                    self.args.per_device_train_batch_size, distributed=self.args.local_rank != -1)

Or should it be re-coded to handle DP too? I don't know the original intention - should the sortish sampler be supported under DP or not?

We need to know whether to:

  1. recode make_sortish_sampler to support DP (it can't use get_world_size()/etc.), or
  2. recode make_sortish_sampler to rename its distributed arg to ddp, so that it only does the special case for DDP.

And somewhat unrelated to the actual bug, I'd like to repeat the request at #8822 - let's have a simple flag so that downstream code knows which mode it is running under, rather than checking ranks and n_gpus, which is very confusing and error-prone.
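As an illustration of the kind of flag that request is about, here is a minimal sketch; parallel_mode() is a hypothetical helper, and args is assumed to carry local_rank and n_gpu the way TrainingArguments does:

def parallel_mode(args):
    """Return "ddp", "dp" or "single" for the current run (hypothetical helper)."""
    if args.local_rank != -1:
        # the distributed launcher sets local_rank, so this is DistributedDataParallel
        return "ddp"
    if args.n_gpu > 1:
        # several GPUs driven by a single process -> DataParallel
        return "dp"
    return "single"

With something like this, downstream code could test parallel_mode(args) == "ddp" instead of reasoning about ranks and GPU counts.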

@stas00
Contributor Author

stas00 commented Nov 30, 2020

Here is where the problem happens with dp:

class DistributedSortishSampler(Sampler):
    """Copied from torch DistributedSampler"""

    def __init__(self, dataset, batch_size, num_replicas=None, rank=None, add_extra_examples=True, shuffle=True):
        if num_replicas is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            num_replicas = dist.get_world_size()

So dist.is_available() returns True under DP, but dist.get_world_size() fails, since it only works under DDP and requires torch.distributed.init_process_group(), which doesn't get called under DP.
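A minimal sketch of a safer guard, assuming plain torch.distributed (dist.is_initialized() is the check that distinguishes DDP from DP here):

import torch.distributed as dist

def safe_world_size() -> int:
    # Under DP the package is importable (is_available() is True), but
    # init_process_group() was never called, so get_world_size() would raise.
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1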

@sgugger
Collaborator

sgugger commented Nov 30, 2020

In DataParallel mode, you don't need to do anything to your dataloader (only in DistributedDataParallel do you need to split the batches across the various processes somehow), so you should make a regular dataloader in that case.
In general, the only proper way to detect whether you are in distributed data parallel is to check local_rank != -1, as torch.distributed can give you false information there. I agree it would all be much easier if the training arguments contained something that directly gives the distributed environment.
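A minimal sketch of that rule, assuming a plain PyTorch dataset and the local_rank convention described above (build_train_dataloader and its arguments are placeholders, not an existing API):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_dataloader(dataset, batch_size, local_rank):
    if local_rank != -1:
        # DistributedDataParallel: each process must see its own shard of the data
        sampler = DistributedSampler(dataset)
    else:
        # single GPU or DataParallel: a regular dataloader is enough
        sampler = None
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, shuffle=(sampler is None))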

@stas00
Contributor Author

stas00 commented Nov 30, 2020

In DataParallel mode, you don't need to do anything to your dataloader (only in DistributedDataParallel do you need to split the batches across the various processes somehow), so you should make a regular dataloader in that case.

Great, so should we then change the signature to make it clear that DDP is wanted, and not any distributed mode:

- def make_sortish_sampler(self, batch_size, distributed=False, shuffle=True, **kwargs):
+ def make_sortish_sampler(self, batch_size, ddp=False, shuffle=True, **kwargs):

and adjust the invocations accordingly?
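For illustration, the call site quoted earlier in this thread would then read something like this (hypothetical - only if the rename goes through):

                self.train_dataset.make_sortish_sampler(
                    self.args.per_device_train_batch_size, ddp=self.args.local_rank != -1)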

In general, the only proper way to detect whether you are in distributed data parallel is to check local_rank != -1, as torch.distributed can give you false information there. I agree it would all be much easier if the training arguments contained something that directly gives the distributed environment.

Great. Should we create a feature request for that?

@sgugger
Collaborator

sgugger commented Nov 30, 2020

I think there is a misunderstanding about the terminology: DataParallel is not distributed; distributed means launching several processes with the same script. The torch.distributed package does not return anything useful for DataParallel, and ddp stands for DistributedDataParallel, so leaving that argument as distributed seems better to me.

Great. Should we create a feature request for that?

We can do that, yes.

@stas00
Contributor Author

stas00 commented Nov 30, 2020

If you stick to the specific implementation, yes, DDP is the only distributed mode. But logically it doesn't make sense: DP is just as distributed as DDP; it just doesn't use torch.distributed. So it's not a very clear distinction and will lead to this kind of confusion all over.

As an example, if you look at this function's usage pattern, it's mostly dataset.make_sortish_sampler(batch_size, distributed=self.hparams.gpus > 1), which clearly (and erroneously) implies any multi-GPU mode.

@sgugger
Collaborator

sgugger commented Nov 30, 2020

I disagree, in the sense that code using PyTorch should stick with the PyTorch naming conventions. They chose to have a non-distributed DataParallel, so we should honor that in our naming as well. In DistributedDataParallel you have to use a DistributedSampler (but not in DataParallel), etc. Those are all parallel modes (as you're training with multiple GPUs), but only one is distributed.

@stas00
Contributor Author

stas00 commented Nov 30, 2020

That is a reasonable choice to follow. I'm only flagging how this leads to coding errors when a developer assumes that n_gpu > 1 implies DDP. So perhaps some extra support is needed there.

@sgugger
Collaborator

sgugger commented Nov 30, 2020

Let's see how it goes once we add the "distributed_env" to TrainingArguments!

@stas00
Contributor Author

stas00 commented Nov 30, 2020

@sgugger, please kindly review at your convenience - I have addressed all the issues you raised, so all should be good; the CI failures are unrelated. Thank you!

Collaborator

@sgugger sgugger left a comment


Perfect, thanks a lot for humoring me and my annoying comments :-)

@stas00
Contributor Author

stas00 commented Nov 30, 2020

Perfect, thanks a lot for humoring me and my annoying comments :-)

On the contrary, your comments were excellent and to the point.

I was just slow to get your point of view, since in my mind, if we solve a problem on multiple GPUs, it's distributed across multiple GPUs, regardless of how it's implemented. But here distributed means distributed across multiple processes. Different semantics.

@stas00 stas00 merged commit 7f34d75 into huggingface:master Nov 30, 2020
@stas00
Contributor Author

stas00 commented Nov 30, 2020

So this is probably wrong too:

# examples/seq2seq/finetune.py:  
sampler = dataset.make_sortish_sampler(batch_size, distributed=self.hparams.gpus > 1)

But that's code based on PL.

@patil-suraj, maybe you could have a look when you start working on this one? I suspect it should do a different check for distributed mode rather than checking the number of GPUs. Let me know if you'd prefer that I open a separate issue.
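For reference, a hedged sketch of what a corrected check could look like if it keyed off torch.distributed state rather than the GPU count (how PL exposes its own distributed flag isn't confirmed here):

import torch.distributed as dist

# hypothetical correction: DP also has gpus > 1, so key the sampler on actual DDP state
is_ddp = dist.is_available() and dist.is_initialized()
sampler = dataset.make_sortish_sampler(batch_size, distributed=is_ddp)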

@sgugger
Collaborator

sgugger commented Nov 30, 2020

Dunno how PL works.

@stas00
Contributor Author

stas00 commented Nov 30, 2020

Let's see how it goes once we add the "distributed_env" to TrainingArguments!

Added a feature request: #8858

@rabeehk

rabeehk commented Dec 1, 2020

Thank you, HuggingFace team and @stas00 - I cannot express how much I appreciate your efforts.

stas00 added a commit to stas00/transformers that referenced this pull request Dec 5, 2020
* fix DP case on multi-gpu

* make executable

* test all 3 modes

* use the correct check for distributed

* dp doesn't need a special case

* restore original name

* cleanup