[RFC][Discussion] Automatic Parallelization #335

soodoshll · 2023-07-28T20:27:11Z

rendered rfc

yaoyaoding · 2023-07-28T22:13:37Z

Hi @soodoshll, thanks for the draft!

It looks good as a first-version RFC draft!

I have several suggestions:

The 0001 and 0002 RFC slots have been used. Consider using 0003.
Might be better to give a concrete example of distributed_config and optimization_config in the guide-level explanation.
The reference-level explanation is a good place to show which configs are available for distributed_config and optimization_config, and what each config specifies. It's okay to list only the ones that are known now, then update the draft and add more configs during implementation and future refactoring.
I prefer making out_dir a separate function parameter instead of an attribute in the config.
Add a reference to Alpa and use something like "Alpa[1]" in the main text.
This part "For example, for a 4x4 multi-machine-multi-GPU cluster, the possible sharding specifications are (4x4, 1), (4(machine), 4(gpus)), (4(gpus), 4(machines)), (1, 4x4). We do not consider (2, 8) or (8, 2). Therefore, using R or Si is sufficient since the number of shards is determined by the number of devices. " is a little vague. Consider adding an example to illustrate what a specific sharding specification means (e.g., (4 gpus, 4 machines)), and explaining the meaning of "R" and "Si".
Consider using mesh_axes_per_dim in TensorShardSpec.
The math formula in "Operator Sharding Specification" has some typesetting flaws.
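To make the quoted 4x4 example concrete, here is a minimal sketch that enumerates the four specifications for a 2-D tensor. It assumes "R" and "Si" follow Alpa's notation (R = not split, Si = split along mesh axis i) and that every mesh axis is assigned to exactly one tensor dimension, which is what yields exactly the four quoted cases. `TensorShardSpec` here is a hypothetical stand-in built around the suggested `mesh_axes_per_dim` field, not the RFC's actual class.

```python
from dataclasses import dataclass
from itertools import product
from math import prod

# Hypothetical device mesh: 4 machines x 4 GPUs per machine.
MESH = {"machine": 4, "gpu": 4}


@dataclass(frozen=True)
class TensorShardSpec:
    # mesh_axes_per_dim[d] holds the mesh axes that tensor dim d is split over.
    # An empty tuple means the dim is not split ("R" under the assumption above).
    mesh_axes_per_dim: tuple

    def num_shards(self, mesh):
        """Number of shards along each tensor dim."""
        return tuple(prod(mesh[a] for a in axes) for axes in self.mesh_axes_per_dim)


def enumerate_specs(rank, mesh):
    """Assign every mesh axis to exactly one tensor dim (no partial replication)."""
    axes = list(mesh)
    specs = []
    for assignment in product(range(rank), repeat=len(axes)):
        per_dim = tuple(
            tuple(a for a, d in zip(axes, assignment) if d == dim)
            for dim in range(rank)
        )
        specs.append(TensorShardSpec(per_dim))
    return specs


specs = enumerate_specs(rank=2, mesh=MESH)
for spec in specs:
    print(spec.mesh_axes_per_dim, spec.num_shards(MESH))
```

Under these assumptions the shard counts come out as (16, 1), (4, 4), (4, 4), (1, 16); the two (4, 4) entries differ only in which mesh axis splits which dimension. Shapes such as (2, 8) never appear because a mesh axis is never subdivided.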

yaoyaoding · 2023-07-28T23:52:29Z

Could add a section to describe the ILP formulation.

The design looks good to me. Hi @soodoshll and @xinli-git, could you also discuss how to separate the whole feature into relatively small steps to implement? We can use this issue to track the PRs related to this RFC, something like apache/tvm#15319. Thanks!

soodoshll · 2023-07-29T06:02:07Z

Hi @yaoyaoding, thanks for your suggestions. I've fixed the draft.

The whole feature can be decomposed into the following steps:

1. Design and implement the data structure for tensor and op sharding specifications
2. Connect function, which relies on (1)
3. Sharding rule generation, which relies on (1)
4. Weight sharding and comm op injection, which relies on (2)
5. Auto-parallelization algorithm, which relies on (2) and (3)
6. Run end-to-end tests

I'm working on 1; after it is done, we can start 2 and 3. I have a prototype of 3, which I will integrate later.

Hi @xinli-git, let's work in the auto-parallel branch.

soodoshll · 2023-07-31T05:38:18Z

I found that resharding (tensor conversion between ops with different specifications) sometimes requires the collective communication primitive all-to-all. For example, it happens when an MxN matrix is sharded along axis M and we want to convert it to be sharded along axis N.

Though NCCL does not directly support all-to-all, it can be implemented with send and recv. Without all-to-all, a workaround is to use all-gather and then slice for the same purpose, at the cost of suboptimal performance.
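The two resharding paths can be compared with a small single-process simulation. This is only a sketch: devices are modelled as list entries, no real communication happens, and the helper names are made up for illustration. The traffic count at the end is a plain element count per device that ignores topology and latency.

```python
def shard_rows(mat, p):
    """Split a matrix (list of rows) into p row shards, i.e. sharded along M."""
    step = len(mat) // p
    return [mat[i * step:(i + 1) * step] for i in range(p)]


def shard_cols(mat, p):
    """Split a matrix into p column shards, i.e. sharded along N."""
    step = len(mat[0]) // p
    return [[row[j * step:(j + 1) * step] for row in mat] for j in range(p)]


def reshard_all_to_all(row_shards):
    p = len(row_shards)
    # Device i cuts its row shard into p column blocks; block j is sent to device j.
    blocks = [shard_cols(shard, p) for shard in row_shards]  # blocks[i][j]
    # Device j stacks the blocks received from devices 0..p-1 along the row axis.
    return [[row for i in range(p) for row in blocks[i][j]] for j in range(p)]


def reshard_all_gather_slice(row_shards):
    p = len(row_shards)
    # all-gather: every device ends up holding the full matrix.
    full = [row for shard in row_shards for row in shard]
    # Each device j then keeps only its own column slice.
    return shard_cols(full, p)


M, N, P = 4, 8, 4
mat = [[r * N + c for c in range(N)] for r in range(M)]
row_shards = shard_rows(mat, P)

a2a = reshard_all_to_all(row_shards)
ag = reshard_all_gather_slice(row_shards)

# Elements each device receives from its P - 1 peers.
recv_all_to_all = (P - 1) * (M // P) * (N // P)
recv_all_gather = (P - 1) * (M // P) * N
print(a2a == ag, recv_all_to_all, recv_all_gather)
```

Both paths produce the same column shards, but in this count all-gather moves P times as many elements per device as all-to-all, most of which are discarded by the slice.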

I'd suggest treating it as a low-priority TODO item and seeing whether it really causes a performance issue. We can fix it after finishing the backbone of the whole pipeline.

xinli-git · 2023-07-31T06:11:24Z

Thanks! @soodoshll. The RFC is very detailed.

For modelling computation, it seems that Alpa assumes that all tensor contraction ops (MM, Conv) must be fully sharded, so all such ops have the same computation cost under different sharding strategies. They also observe that other ops have negligible computation cost at runtime (I verified this as well). As a result, they concluded there was no need to model computation.
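For context, the intra-operator ILP objective in the Alpa paper has roughly the following form. The symbols are Alpa's, not names from this RFC, so treat this as a reference sketch rather than the formulation we will implement:

$$
\min_{s} \; \sum_{v \in V} s_v^{\top} (c_v + d_v) \; + \sum_{(v,u) \in E} s_v^{\top} R_{vu} \, s_u
$$

Here $s_v$ is a one-hot vector selecting the sharding strategy of op $v$, $c_v$ and $d_v$ are its per-strategy computation and communication costs, and $R_{vu}$ is the resharding cost matrix for edge $(v, u)$. Under the assumption above, $c_v$ is the same for every strategy (or negligible), so it can be set to zero and only the communication and resharding terms affect the solution.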

Since this feature probably requires a month of work for multiple people (currently me and Qidong), I was thinking maybe we can leverage GitHub Projects (https://github.com/hidet-org/hidet/projects?query=is%3Aopen).

@yaoyaoding, if you think that's a good idea, I will take the lead on this.

yaoyaoding · 2023-07-31T06:15:47Z

Hi @xinli-git, sounds good to me. I have not used the GitHub Projects feature before, but you can give it a try and let's see whether it helps the organization and planning.

soodoshll added the enhancement New feature or request label Jul 28, 2023

soodoshll added the rfc Discussion of potential rfc label Jul 29, 2023

soodoshll mentioned this issue Jul 31, 2023

[Distributed][auto-parallel] Sharding Specification and rule discovery #336

Merged
