
Add a parallel_mode property to TrainingArguments #8877

Merged (3 commits) on Dec 1, 2020
Conversation

@sgugger (Collaborator) commented Dec 1, 2020

What does this PR do?

This PR adds a distributed_env property to TrainingArguments, making it clear whether we are in:

  • a single process (CPU or one GPU)
  • a parallel setting (one process but several GPUs)
  • a distributed parallel setting (several processes, one per GPU)
  • a TPU setting

Fixes #8858
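The four settings above could be detected along these lines. This is a minimal sketch, not the actual TrainingArguments implementation: the helper name and its arguments are hypothetical, though local_rank == -1 meaning "not launched by torch.distributed" follows the usual PyTorch convention.

```python
from enum import Enum

class DistributedEnvironment(Enum):
    SINGLE = "single"                              # one process, CPU or one GPU
    PARALLEL = "parallel"                          # one process, several GPUs (DataParallel)
    DISTRIBUTED_PARALLEL = "distributed_parallel"  # several processes, one per GPU
    TPU = "tpu"

def detect_environment(n_gpu: int, local_rank: int, is_tpu: bool = False) -> DistributedEnvironment:
    # Hypothetical detection logic mirroring the four cases listed above.
    if is_tpu:
        return DistributedEnvironment.TPU
    if local_rank != -1:
        # torch.distributed launched one process per GPU
        return DistributedEnvironment.DISTRIBUTED_PARALLEL
    if n_gpu > 1:
        # a single process driving several GPUs
        return DistributedEnvironment.PARALLEL
    return DistributedEnvironment.SINGLE
```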

Comment on lines 500 to 504
class DistributedEnvironment(Enum):
    SINGLE = "single"
    PARALLEL = "parallel"
    DISTRIBUTED_PARALLEL = "distributed_parallel"
    TPU = "tpu"
@stas00 (Contributor) commented Dec 1, 2020

Perhaps stick to the commonly used abbreviations, dp/ddp?

  • dp = DataParallel
  • ddp = DistributedDataParallel
  • tpu = TPU

and "single" is too ambiguous - single gpu or single process?

Possible ideas:

  • dnp = DataNotParallel (goes well with dp/ddp, but not mp/pp)
  • np = NotParallel
  • ds = DataSingle or DataSimple (possible confusion with DeepSpeed)
  • sp = SimpleProcess (ambiguous processing unit-wise)
  • ss = SinglegpuorcpuSingleprocess
  • 11 = 1 cpu/gpu 1 process (variation of ss)
  • pu = ProcessingUnit (G or C) (ambiguous process-wise)
  • bu = Basic Unit
  • one = one of cpu/gpu and of process (not an abbreviation, like the rest)
  • basic = Well, basic (not an abbreviation, like the rest)
  • simple = Same as basic (not an abbreviation, like the rest)

I think I like "np" the most - as it works well with dp/ddp/mp/pp

@stas00 (Contributor) commented Dec 1, 2020

We will need a mode for MP (ModelParallel) and PP (PipeParallel) too. But of course these can be added when needed. Just mentioning these so that it will be easier to build a nicely mapped enum set.

  • mp = ModelParallel
  • pp = PipeParallel

Collaborator (Author):

A three-letter abbreviation won't be clear to users who are not super familiar with PyTorch, and we generally try to avoid them in the Transformers codebase.

@stas00 (Contributor) commented Dec 1, 2020

Sure, but there are existing conventions; why reinvent names when we already have them? We can use the full DistributedDataParallel, etc., if abbreviations don't fit.

@stas00 (Contributor) commented Dec 1, 2020

Given our discussion yesterday, I'm not sure distributed_env is a fitting name. Since you convinced me that DP is not distributed by PyTorch conventions, if self.distributed_env == "dp" becomes confusing again.

Given that, with the exception of tpu, all of dp/ddp/mp/pp are SomethingParallel, should it be called parallel_mode?

I don't know anything about tpu, so it's hard for me to know where it fits. But it's probably not distributed either. And not parallel either.

So perhaps we call it compute_env

Comment on lines 509 to 513
class ParallelMode(Enum):
    NO = "no"
    NOT_DISTRIBUTED = "not_distributed"
    DISTRIBUTED = "distributed"
    TPU = "tpu"
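For reference, the snippet under review can be run in isolation to see the comparison style the thread debates. This is a sketch: the mode values here are set by hand rather than read from a real TrainingArguments instance.

```python
from enum import Enum

class ParallelMode(Enum):
    NO = "no"
    NOT_DISTRIBUTED = "not_distributed"
    DISTRIBUTED = "distributed"
    TPU = "tpu"

# Writing out the comparisons the review discusses:
mode = ParallelMode.DISTRIBUTED
is_distributed = mode is ParallelMode.DISTRIBUTED  # reads naturally

mode = ParallelMode.NO
is_single = mode is ParallelMode.NO  # "NO" reads oddly next to NOT_DISTRIBUTED
```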
@stas00 (Contributor) commented Dec 1, 2020

OK, this is a different approach. Let's try applying these:

So if self.parallel_mode == ParallelMode.DISTRIBUTED reads well.

But if self.parallel_mode == ParallelMode.NO is weird and can be confused with ParallelMode.NOT_DISTRIBUTED. I'd rename NO => NOT_PARALLEL, so that if self.parallel_mode == ParallelMode.NOT_PARALLEL is more readable, no?

Does this one make sense: if self.parallel_mode == ParallelMode.TPU? (Again, deferring to you, since you know TPU.)

Collaborator (Author):

self.parallel_mode == ParallelMode.TPU works fine, ok to change NO to NOT_PARALLEL
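A sketch of the enum with the agreed rename applied: NO becomes NOT_PARALLEL, everything else as in the snippet under review (the comments restate the four cases from the PR description; the string values are assumptions).

```python
from enum import Enum

class ParallelMode(Enum):
    NOT_PARALLEL = "not_parallel"        # single process, CPU or one GPU
    NOT_DISTRIBUTED = "not_distributed"  # one process, several GPUs (DataParallel)
    DISTRIBUTED = "distributed"          # one process per GPU (DistributedDataParallel)
    TPU = "tpu"

# Both comparisons now read unambiguously:
mode = ParallelMode.NOT_PARALLEL
not_parallel = mode is ParallelMode.NOT_PARALLEL
tpu_mode = ParallelMode.TPU is ParallelMode("tpu")
```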

Contributor:

and I assume we will add ParallelMode.PIPE and ParallelMode.MODEL for PP and MP, right?

@stas00 stas00 changed the title Add a distributed_env property to TrainingArguments Add a parallel_mode property to TrainingArguments Dec 1, 2020
@stas00 (Contributor) commented Dec 1, 2020

LGTM, @sgugger!

@LysandreJik (Member) left a comment:
Cool names :)

@sgugger sgugger merged commit b08843c into master Dec 1, 2020
@sgugger sgugger deleted the distributed_env branch December 1, 2020 18:46
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
* Add a `distributed_env` property to TrainingArguments

* Change name

* Address comment
Successfully merging this pull request may close these issues.

[trainer] add distributed_env to TrainingArguments