Reduction latency variation with coll/tuned #10947

@devreal

Description

As part of working on #10347 I found that the performance of coll/tuned varies significantly for small reductions, depending on the number of nodes I run on. All experiments were done on Hawk, the dual-socket AMD EPYC Rome system with ConnectX-6 installed at HLRS. I'm using the main branch of Open MPI.

coll/han seems to provide better performance if the ranking is done by node (no surprise, since that reduces cross-node traffic) and fairly consistent performance with by-core ranking. coll/tuned, in contrast, fluctuates between higher and lower latency than coll/han.

I have collected data for 4-byte reductions using the OSU benchmarks, with the -f flag to gather the average as well as the minimum and maximum latency observed on any process. It is interesting to note that it is not just the average that varies but (to a lesser degree) also the maximum latency, which suggests that this is not an artifact of the benchmark's timing.
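For reference on how those numbers come about (this is only a sketch of the measurement scheme, not the OSU source): each rank times its own reduce calls, and the per-rank averages are then combined across ranks, so min/max describe the fastest/slowest process rather than individual iterations. A minimal MPI sketch, with illustrative iteration counts and a 4-byte MPI_INT payload:

/* Sketch of "-f"-style avg/min/max latency reporting: every rank times its
 * own MPI_Reduce calls, then the per-rank averages are combined across ranks.
 * Iteration counts and payload are illustrative, not the OSU defaults. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int warmup = 100, iters = 1000;
    int sendbuf = rank, recvbuf = 0;          /* 4-byte message, as in the runs above */
    double t_start = 0.0;

    for (int i = 0; i < warmup + iters; i++) {
        if (i == warmup) {                    /* start timing after warm-up */
            MPI_Barrier(MPI_COMM_WORLD);
            t_start = MPI_Wtime();
        }
        MPI_Reduce(&sendbuf, &recvbuf, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }
    double local_avg = (MPI_Wtime() - t_start) / iters * 1e6;  /* us per reduce, this rank */

    double sum = 0.0, min = 0.0, max = 0.0;   /* combine per-rank averages */
    MPI_Reduce(&local_avg, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_avg, &min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local_avg, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("avg %.2f us  min %.2f us  max %.2f us\n", sum / size, min, max);

    MPI_Finalize();
    return 0;
}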

To avoid wasting too many node hours, I allocated nodes in multiples of 8 and, within each N-node allocation, ran on N-7..N nodes. I tested the partitions 1-8, 9-16, ... as well as 5-12, 13-20, 21-28, ... to make sure that the effects I'm seeing aren't an artifact of this partitioning; the results are the same. In all cases I ran with 64 processes per node. I also have data with 48 processes per node where the effects are the same, albeit with slightly shifted switching points.

Average latency:
[plot: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-2]

Maximum latency:
[plot: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-6]

Minimum latency:
[plot: reduce_64_osu_tuned_52_64_1800703 hawk-pbs5-4]

Here is how I run the benchmarks:

mpirun --mca coll ^hcoll --rank-by ${rankby} -N $npn -n $((npn*nodes)) --bind-to core --mca coll_tuned_priority 100 --mca btl ^uct mpi/collective/osu_reduce -f -m 4:4

${rankby} is either node or core, and $npn is 64 in the data above. I had to disable btl/uct because I get segfaults otherwise (a different story, not sure why).

As far as I can tell from the decision function, all runs use the binomial reduction tree (https://github.com/open-mpi/ompi/blob/main/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c#L662).
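To make the communication pattern concrete: in a binomial reduction tree rooted at rank 0, every non-root rank forwards its partial result to the rank obtained by clearing its lowest set bit, and receives partial results from the ranks below it. Here is a small standalone sketch of that shape (not the Open MPI topology code; the in-order bmtree that coll/tuned actually builds may differ in detail):

/* Textbook binomial reduction tree rooted at rank 0; prints, for a given
 * communicator size, which rank each process sends its partial result to.
 * Illustration only, not the coll/tuned in-order bmtree code. */
#include <stdio.h>

/* Parent of rank r (r > 0): clear the lowest set bit of r. */
static int bmtree_parent(int r) { return r & (r - 1); }

int main(void) {
    const int size = 16;                       /* illustrative communicator size */
    for (int r = 1; r < size; r++)
        printf("rank %2d reduces into rank %2d\n", r, bmtree_parent(r));
    /* Children of rank r are r + 2^k for every 2^k smaller than the lowest
     * set bit of r (any 2^k for the root), as long as r + 2^k < size. */
    return 0;
}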

I dumped the binomial trees used in coll/tuned for both the by-core and the by-node ranking, and they look just like what you would expect: mostly intra-node communication with by-core ranking and mostly inter-node communication with by-node ranking. This is reflected in the performance above. I did not find any change in the tree between slow and fast runs of coll/tuned.

In the graphs below, colors indicate the compute node (same color means same compute node). Dashed lines represent shared-memory communication and solid lines represent inter-node communication.

By-core ranking:
[graph: bmtree_leafs_4x64_bycore.dot]

By-node ranking:
[graph: bmtree_leafs_4x64_bynode.dot]

No surprise here, but it confirms that we're doing the right thing in coll/tuned with a linear distribution across nodes (at least with binomial trees).
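One way to quantify that picture: with the same textbook binomial tree as above, count how many parent links stay on a node for the two placements, assuming --rank-by core places ranks consecutively per node and --rank-by node places them round-robin across nodes. A sketch for the 4x64 case shown in the graphs (again a model, not the actual coll/tuned tree):

/* Sketch: count intra- vs inter-node parent links in a textbook binomial
 * tree (rooted at rank 0) for by-core vs by-node rank placement.
 * 4 nodes x 64 ppn, matching the graphs above. */
#include <stdio.h>

static int bmtree_parent(int r) { return r & (r - 1); }

/* Node that a rank lives on, for the two placements. */
static int node_bycore(int rank, int ppn)    { return rank / ppn; }     /* consecutive ranks per node */
static int node_bynode(int rank, int nnodes) { return rank % nnodes; }  /* round-robin across nodes   */

int main(void) {
    const int nnodes = 4, ppn = 64, size = nnodes * ppn;
    int intra_core = 0, inter_core = 0, intra_node = 0, inter_node = 0;

    for (int r = 1; r < size; r++) {
        int p = bmtree_parent(r);
        if (node_bycore(r, ppn) == node_bycore(p, ppn)) intra_core++; else inter_core++;
        if (node_bynode(r, nnodes) == node_bynode(p, nnodes)) intra_node++; else inter_node++;
    }
    printf("by-core: %d intra-node edges, %d inter-node edges\n", intra_core, inter_core);
    printf("by-node: %d intra-node edges, %d inter-node edges\n", intra_node, inter_node);
    return 0;
}

Under these placement assumptions the counts reproduce the pattern in the two graphs: almost all edges are intra-node with by-core placement and predominantly inter-node with by-node placement.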

@janjust could you please run a similar set of benchmarks on your machines to make sure I'm not chasing a machine artifact?
