Add `num_replicas` and `num_partitions` to ModuleOp #650

sdasgup3 · 2022-11-29T19:42:36Z

Request description

The statically known values num_replicas and num_partitions, provided by HLOModuleConfig, help determine the size of each process group used in parallel execution. Currently, these values are not exposed in StableHLO.

Having these values in StableHLO will enable:

- Type inference of CollectiveOps. For example, using num_replicas and num_partitions we can determine shard_count using GetSubgroupSize. With that we can say: dim(result, all_gather_dim) = shard_count * dim(operand, all_gather_dim).
- Populating empty replica_groups (ref).

Originated in #503 (comment) and #503 (comment).
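The type-inference rule in the request can be illustrated with a minimal Python sketch. It is not the XLA or StableHLO implementation: the function names `subgroup_size` and `infer_all_gather_shape`, the string mode names, and the exact per-mode behavior are assumptions made for illustration, loosely following the idea of GetSubgroupSize mentioned above. The point it shows is that when replica_groups is empty, the group size can only come from num_replicas and num_partitions.

```python
from typing import Sequence, Tuple


def subgroup_size(
    mode: str,
    replica_groups: Sequence[Sequence[int]],
    num_replicas: int,
    num_partitions: int,
) -> int:
    """Hypothetical helper: size of one process group for a collective op.

    With empty replica_groups, the size falls back to the module-level
    num_replicas / num_partitions, which is why they must be known.
    """
    group_len = len(replica_groups[0]) if replica_groups else None
    if mode == "cross_replica":
        return group_len if group_len is not None else num_replicas
    if mode == "cross_partition":
        return group_len if group_len is not None else num_partitions
    if mode == "cross_replica_and_partition":
        per_replica = group_len if group_len is not None else num_replicas
        return per_replica * num_partitions
    if mode == "flattened_ids":
        if group_len is None:
            raise ValueError("flattened_ids requires non-empty replica_groups")
        return group_len
    raise ValueError(f"unknown mode: {mode}")


def infer_all_gather_shape(
    operand_shape: Sequence[int], all_gather_dim: int, shard_count: int
) -> Tuple[int, ...]:
    """dim(result, all_gather_dim) = shard_count * dim(operand, all_gather_dim)."""
    result = list(operand_shape)
    result[all_gather_dim] *= shard_count
    return tuple(result)


# Example: 4 replicas, 2 partitions, empty replica_groups, cross-replica mode.
shard_count = subgroup_size("cross_replica", [], num_replicas=4, num_partitions=2)
print(infer_all_gather_shape((2, 3), all_gather_dim=1, shard_count=shard_count))
```

In this sketch the result shape is computed from the operand shape alone, which is the direction type inference needs; without module-level counts, shard_count would have to be recovered from the result type instead.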


fixes #462

Addresses the following:

1. Adds verification checks for AllGather w.r.t. #498
2. fixes #491

A few points:

- Type inference is marked `infeasible` because the return type of the op depends on the [shard_count](https://github.com/tensorflow/tensorflow/blob/20c6943d3cd7e07da162f7778a0af5d3776274b4/tensorflow/compiler/xla/service/hlo_verifier.cc#L452), which [depends on the result type](https://github.com/tensorflow/tensorflow/blob/20c6943d3cd7e07da162f7778a0af5d3776274b4/tensorflow/compiler/xla/service/hlo_verifier.cc#L426). Note that `shard_count` is a parameter in the HLO spec, whereas in MHLO it is [derived](https://github.com/tensorflow/tensorflow/blob/a1acd6a6466f58ed4197b7beaa2f3e0b6fcfc32a/tensorflow/compiler/xla/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc#L766) from the result type before exporting to HLO.
- With (1) and (2), the Verifier is a `yes`. Note that we still do not have the [check](https://github.com/tensorflow/tensorflow/blob/20c6943d3cd7e07da162f7778a0af5d3776274b4/tensorflow/compiler/xla/service/hlo_verifier.cc#L436) for `shard_count == subgroup_size`. The `subgroup_size` depends on the [module configuration](https://github.com/tensorflow/tensorflow/blob/20c6943d3cd7e07da162f7778a0af5d3776274b4/tensorflow/compiler/xla/service/hlo_verifier.cc#L95), which manages the settings and values that affect the compiled executable outside of the HLO code itself, and I am not sure if that info is available in StableHLO IR.

upd:

- With #650, the type inference should be made feasible. Marked it `revisit` until that is fixed.
- The verifier should be `revisit` based on #652


burmako · 2023-04-13T15:47:07Z

JAX has just added mhlo.num_replicas and mhlo.num_partitions to their lowering: jax-ml/jax#15586. It's great to know that this information is available during lowering 🎉

We have the following constraints in the spec:

```
(I1) `operand`: tensor.
(I2) `source_target_pairs`: 2-dimensional tensor constant of type `si64`.
(I3) `channel_id`: constant of type `si64`.
(C1) `dim(source_target_pairs, 1) = 2`.
(C2) `is_unique(source_target_pairs[:, 0])`.
(C3) `is_unique(source_target_pairs[:, 1])`.
(C4) `0 <= source_target_pairs < N`, where N is defined as:
  * `num_replicas` if `cross_replica` is used.
  * `num_partitions` if `cross_partition` is used.
(C5) `type(result) = type(operand)`.
```

These constraints will be comprehensively covered by the following tests:

```
I1: a) `operand` is not a tensor. (Covered by ODS).
I2: a) `source_target_pairs` is not a 2-dimensional tensor constant of type `si64`.
I3: a) `channel_id` is not a constant of type `si64`. (Covered by ODS).
C1: a) `dim(source_target_pairs, 1) != 2`.
C2: a) `is_unique(source_target_pairs[:, 0]) = false`.
C3: a) `is_unique(source_target_pairs[:, 1]) = false`.
C4: a) `source_target_pairs < 0`.
C4: b) `source_target_pairs >= N`, where `N` is defined as:
  * `num_replicas` if `cross_replica` is used.
  * `num_partitions` if `cross_partition` is used.
C5: a) `type(result) != type(operand)`.
```

If we drop the "Covered by ODS" pieces, this will leave us with the following test cases:

```
I2a: `source_target_pairs` is not a 2-dimensional tensor constant of type `si64`.
C1a: `dim(source_target_pairs, 1) != 2`.
C2a: `is_unique(source_target_pairs[:, 0]) = false`.
C3a: `is_unique(source_target_pairs[:, 1]) = false`.
C4a: `source_target_pairs < 0`.
C4b: `source_target_pairs >= N`, where `N` is defined as:
  * `num_replicas` if `cross_replica` is used.
  * `num_partitions` if `cross_partition` is used.
C5a: `type(result) != type(operand)`.
```

Notes:

* C4b verification is infeasible since `num_replicas` and `num_partitions` are not known statically at the moment (see #650).

closes #1124
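To make the role of `N` concrete, here is a small Python sketch of checks C1 through C4 on `source_target_pairs`. It is not the StableHLO verifier: the function name `verify_source_target_pairs` and the error strings are hypothetical, and `n` stands for whichever of `num_replicas` or `num_partitions` applies. It shows that every check except C4b can run on the attribute alone, while C4b needs a value that only the module can supply.

```python
from typing import List, Optional, Sequence


def verify_source_target_pairs(
    pairs: Sequence[Sequence[int]], n: Optional[int] = None
) -> List[str]:
    """Hypothetical checker returning the labels of violated constraints.

    `n` is num_replicas or num_partitions; pass None when it is not
    statically known, in which case the C4b check is skipped.
    """
    errors: List[str] = []

    # C1: every row must hold exactly one (source, target) pair.
    if any(len(pair) != 2 for pair in pairs):
        return ["C1a: dim(source_target_pairs, 1) != 2"]

    sources = [pair[0] for pair in pairs]
    targets = [pair[1] for pair in pairs]

    # C2 / C3: sources and targets must each be unique.
    if len(set(sources)) != len(sources):
        errors.append("C2a: duplicate source")
    if len(set(targets)) != len(targets):
        errors.append("C3a: duplicate target")

    # C4a: ids must be non-negative.
    if any(value < 0 for value in sources + targets):
        errors.append("C4a: source_target_pairs < 0")

    # C4b: ids must be below N, which requires module-level information.
    if n is not None and any(value >= n for value in sources + targets):
        errors.append("C4b: source_target_pairs >= N")

    return errors


print(verify_source_target_pairs([[0, 1], [1, 2]], n=2))
print(verify_source_target_pairs([[0, 1], [1, 2]]))  # N unknown: C4b skipped
```

Passing `n=None` mirrors the situation described in the notes: without `num_replicas` and `num_partitions` on the module, the out-of-range pair above goes undetected.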

sdasgup3 added the Spec label Nov 29, 2022

sdasgup3 mentioned this issue Nov 29, 2022

Add spec for AllGatherOp #503

Merged

burmako changed the title ~~Expose num_replicas and num_partitions in StableHLO.~~ Add num_replicas and num_partitions to ModuleOp Nov 29, 2022

burmako added the Type inference label Nov 29, 2022

This was referenced Dec 2, 2022

Add spec for ReduceScatterOp #564

Merged

Missing verification of replica_groups for AllGather/AllReduce/AllToAll/CollectivePermute/ReduceScatter #498

Open

burmako added this to Frontend contract Feb 6, 2023

burmako assigned atondwal Feb 6, 2023

julianwa added this to (Deprecated) IREE Apr 11, 2023

github-project-automation bot moved this to Inbox in (Deprecated) IREE Apr 11, 2023

julianwa moved this from Inbox to Needs Scheduling in (Deprecated) IREE Apr 11, 2023

julianwa moved this from Needs Scheduling to Not Started in (Deprecated) IREE Apr 11, 2023

okkwon mentioned this issue Apr 13, 2023

Lower StableHLO ops with multiple replica groups into flow ops iree-org/iree#12946

Closed

burmako moved this to Todo in Frontend contract Apr 23, 2023

allieculp moved this from Not Started to Backlog in (Deprecated) IREE May 16, 2023

burmako assigned sdasgup3 and unassigned atondwal Aug 2, 2023

burmako mentioned this issue Aug 6, 2023

Add interpreter for CollectivePermuteOp #1715

Merged
