Reduction latency variation with coll/tuned #10947

@devreal

Description

As part of working on #10347 I found that the performance of coll/tuned varies significantly for small reductions, depending on the number of nodes I run on. All experiments were done on Hawk, the dual-socket AMD EPYC Rome system with ConnectX-6 installed at HLRS. I'm using the main branch of Open MPI.

coll/han seems to provide better performance if the ranking is done by node (no surprise, since that reduces cross-node traffic) and fairly consistent performance with by-core ranking. coll/tuned, in contrast, fluctuates between higher and lower latency than coll/han.

I have collected data for 4-byte reductions using the OSU benchmarks, with the -f flag to gather the average as well as the minimum and maximum latency observed on any process. It is interesting to note that it is not just the average that varies but (to a lesser degree) also the maximum latency, which suggests that this is not an artifact of the benchmark's timing.
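For reference on how those numbers come about (this is only a sketch of the measurement scheme, not the OSU source): each rank times its own reduce calls, and the per-rank averages are then combined across ranks, so min/max describe the fastest/slowest process rather than individual iterations. A minimal MPI sketch, with illustrative iteration counts and a 4-byte MPI_INT payload:

/* Sketch of "-f"-style avg/min/max latency reporting: every rank times its
 * own MPI_Reduce calls, then the per-rank averages are combined across ranks.
 * Iteration counts and payload are illustrative, not the OSU defaults. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int warmup = 100, iters = 1000;
    int sendbuf = rank, recvbuf = 0;          /* 4-byte message, as in the runs above */
    double t_start = 0.0;

    for (int i = 0; i < warmup + iters; i++) {
        if (i == warmup) {                    /* start timing after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();
        }
        MPI_Reduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }
    double local_avg = (MPI_Wtime() - t_start) / iters * 1e6;  /* us per reduce, this rank */

    double sum = 0.0, min = 0.0, max = 0.0;   /* combine per-rank averages */
    MPI_Reduce(&local_avg, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_avg, &min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_avg, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("avg %.2f us  min %.2f us  max %.2f us\n", sum / size, min, max);

    MPI_Finalize();
    return 0;
}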

To avoid wasting too many node hours, I allocated nodes in multiples of 8 and, within each N-node allocation, ran on N-7..N nodes. I tested the partitions 1-8, 9-16, ... as well as 5-12, 13-20, 21-28, ... to make sure that the effects I'm seeing aren't an artifact of this partitioning; the results are the same. In all cases I ran with 64 processes per node. I also have data with 48 processes per node where the effects are the same, albeit with slightly shifted switching points.

Average latency:
[plot: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-2]

Maximum latency:
[plot: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-6]

Minimum latency:
[plot: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-4]

Here is how I run the benchmarks:

mpirun --mca coll ^hcoll --rank-by ${rankby} -N $npn -n $((npn*nodes)) --bind-to core --mca coll_tuned_priority 100 --mca btl ^uct mpi/collective/osu_reduce -f -m 4:4

${rankby} is either node or core, and $npn is 64 in the data above. I had to disable btl/uct because I get segfaults otherwise (a different story, not sure why).

As far as I can tell from the decision function, all runs use the binomial reduction tree (https://github.com/open-mpi/ompi/blob/main/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c#L662).
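To make the communication pattern concrete: in a binomial reduction tree rooted at rank 0, every non-root rank forwards its partial result to the rank obtained by clearing its lowest set bit, and receives partial results from the ranks below it. Here is a small standalone sketch of that shape (not the Open MPI topology code; the in-order bmtree that coll/tuned actually builds may differ in detail):

/* Textbook binomial reduction tree rooted at rank 0; prints, for a given
 * communicator size, which rank each process sends its partial result to.
 * Illustration only, not the coll/tuned in-order bmtree code. */
#include <stdio.h>

/* Parent of rank r (r > 0): clear the lowest set bit of r. */
static int bmtree_parent(int r) { return r & (r - 1); }

int main(void) {
    const int size = 16;                       /* illustrative communicator size */
    for (int r = 1; r < size; r++)
        printf("rank %2d reduces into rank %2d\n", r, bmtree_parent(r));
    /* Children of rank r are r + 2^k for every 2^k smaller than the lowest
     * set bit of r (any 2^k for the root), as long as r + 2^k < size. */
    return 0;
}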

I dumped the binomial trees used in coll/tuned for both the by-core and the by-node ranking, and they look just like what you would expect: mostly intra-node communication with by-core ranking and mostly inter-node communication with by-node ranking. This is reflected in the performance above. I did not find any change in the tree between slow and fast runs of coll/tuned.

In the graphs below, colors indicate the compute node (same color means same compute node). Dashed lines represent shared-memory communication and solid lines represent inter-node communication.

By-core ranking:
[graph: bmtree_leafs_4x64_bycore.dot]

By-node ranking:
[graph: bmtree_leafs_4x64_bynode.dot]

No surprise here, but it confirms that we're doing the right thing in coll/tuned with a linear distribution across nodes (at least with binomial trees).
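One way to quantify that picture: with the same textbook binomial tree as above, count how many parent links stay on a node for the two placements, assuming --rank-by core places ranks consecutively per node and --rank-by node places them round-robin across nodes. A sketch for the 4x64 case shown in the graphs (again a model, not the actual coll/tuned tree):

/* Sketch: count intra- vs inter-node parent links in a textbook binomial
 * tree (rooted at rank 0) for by-core vs by-node rank placement.
 * 4 nodes x 64 ppn, matching the graphs above. */
#include <stdio.h>

static int bmtree_parent(int r) { return r & (r - 1); }

/* Node that a rank lives on, for the two placements. */
static int node_bycore(int rank, int ppn)    { return rank / ppn; }     /* consecutive ranks per node */
static int node_bynode(int rank, int nnodes) { return rank % nnodes; }  /* round-robin across nodes   */

int main(void) {
    const int nnodes = 4, ppn = 64, size = nnodes * ppn;
    int intra_core = 0, inter_core = 0, intra_node = 0, inter_node = 0;

    for (int r = 1; r < size; r++) {
        int p = bmtree_parent(r);
        if (node_bycore(r, ppn) == node_bycore(p, ppn)) intra_core++; else inter_core++;
        if (node_bynode(r, nnodes) == node_bynode(p, nnodes)) intra_node++; else inter_node++;
    }
    printf("by-core: %d intra-node edges, %d inter-node edges\n", intra_core, inter_core);
    printf("by-node: %d intra-node edges, %d inter-node edges\n", intra_node, inter_node);
    return 0;
}

Under these placement assumptions the counts reproduce the pattern in the two graphs: almost all edges are intra-node with by-core placement and predominantly inter-node with by-node placement.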

@janjust could you please run a similar set of benchmarks on your machines to make sure I'm not chasing a machine artifact?
