TCP and OpenIB BTLs cause performance degradation in unrelated frameworks  #2839

@jladd-mlnx

Description

We have observed this both in OSHMEM UCX testing and now in HCOLL shared-memory testing on the POWER8 Garrison platform. In the 2.x series, when the BTLs are silently loaded to support OSC (one-sided communication), we see the following when running HCOLL shared-memory collectives.

Running with 160 processes:

mpirun -np 160 -mca coll_hcoll_enable 1 ./IBM-MPI1 Allreduce

8-byte allreduce latency: 21.9

With the tcp and openib BTLs excluded:

mpirun -np 160 -mca btl ^tcp,openib -mca coll_hcoll_enable 1 ./IBM-MPI1 Allreduce

8-byte allreduce latency: 11.58

@hjelmn and I discussed this at the meeting, and the root cause appears to be the asynchronous progress thread in these two BTLs.
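As a workaround until the progress-thread issue is resolved, the BTL exclusion can be made persistent in an MCA parameter file instead of being passed on every command line. This is a sketch assuming the standard per-user parameter file location; a system-wide file under the install prefix works the same way:

```
# ~/.openmpi/mca-params.conf -- read by Open MPI at startup.
# Exclude the tcp and openib BTLs so their asynchronous progress
# threads are never spawned (same effect as "-mca btl ^tcp,openib").
btl = ^tcp,openib
```

Note that this disables those BTLs for all jobs run by the user, so any OSC or other traffic that relied on them must be handled by another component.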
