Description
Running Open MPI with the Libfabric (RXM/verbs) MTL works fine, but there is no shared-memory support in that configuration. Open MPI also has an OFI BTL, but RXM is explicitly excluded there due to a lack of real FI_DELIVERY_COMPLETE support. Ref: 41acfee
If I comment out that exclusion, it "works": I can see shared memory being used for intra-node communication, with a massive performance benefit. Of course, I may have just opened myself up to subtle race conditions and correctness issues...
Can we re-evaluate whether this still needs to be excluded? This may be a libfabric question, so I will follow up there as well.
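For reference, this is roughly how I have been probing the libfabric side for providers whose default TX completion semantic is FI_DELIVERY_COMPLETE (a minimal standalone sketch, not Open MPI code; link with `-lfabric`). Whether a provider that comes back here implements the semantic natively, rather than emulating or silently downgrading it, is exactly the open question:

```c
/* Minimal probe: request FI_DELIVERY_COMPLETE as the default TX completion
 * semantic in the hints and see which providers fi_getinfo returns. */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    if (!hints)
        return 1;

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_TAGGED;
    hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE;  /* requested default semantic */

    ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
    } else {
        for (struct fi_info *cur = info; cur; cur = cur->next)
            printf("provider: %s, domain: %s\n",
                   cur->fabric_attr->prov_name, cur->domain_attr->name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return ret ? 1 : 0;
}
```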
Out of curiosity, is there a reason why Open MPI needs the "delivery complete" semantic as opposed to "transmit complete"? If so, is there a test case or example of how the more relaxed guarantee could result in a correctness issue?
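To make the distinction concrete, my understanding is that both semantics can also be requested per operation via fi_sendmsg flags; a minimal sketch, with `ep`, `buf`, `len`, and `dest_addr` assumed to be set up elsewhere:

```c
/* Sketch: request a specific completion semantic on a single send.
 * With FI_TRANSMIT_COMPLETE the completion only means the transport has
 * accepted/ACKed the message; with FI_DELIVERY_COMPLETE it means the data
 * is visible in the target process's memory. */
#include <sys/uio.h>
#include <rdma/fi_endpoint.h>

static ssize_t send_with_semantic(struct fid_ep *ep, const void *buf,
                                  size_t len, fi_addr_t dest_addr,
                                  uint64_t completion_flag)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct fi_msg msg = {
        .msg_iov   = &iov,
        .desc      = NULL,
        .iov_count = 1,
        .addr      = dest_addr,
        .context   = NULL,
        .data      = 0,
    };

    /* completion_flag is FI_TRANSMIT_COMPLETE or FI_DELIVERY_COMPLETE */
    return fi_sendmsg(ep, &msg, FI_COMPLETION | completion_flag);
}
```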
I could be wrong, but I assume that libfabric is basically using the underlying verbs RDMA write completion, which indeed doesn't guarantee that the data has actually landed in remote memory yet (only that the remote NIC has ACKed it). However, it is guaranteed that any subsequent read will reflect all previously written data from prior WQEs, so I am curious how any side effect of this weaker completion guarantee could actually be observed.
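For what it's worth, this is the verbs-level pattern I have in mind, sketched with libibverbs (`qp`, `mr`, the local buffer, and the remote addr/rkey are assumed to be set up elsewhere; error handling and completion polling omitted):

```c
/* Sketch: a signaled RDMA_WRITE whose completion only tells us the
 * responder's NIC ACKed the request, chained with an RDMA_READ on the
 * same QP, which is ordered after the write and should return the
 * written data even if placement in remote memory was still in flight
 * when the write completion was generated. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int write_then_read(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr write_wr, read_wr, *bad_wr = NULL;

    memset(&write_wr, 0, sizeof(write_wr));
    write_wr.wr_id               = 1;
    write_wr.sg_list             = &sge;
    write_wr.num_sge             = 1;
    write_wr.opcode              = IBV_WR_RDMA_WRITE;
    write_wr.send_flags          = IBV_SEND_SIGNALED;   /* completion = remote NIC ACK */
    write_wr.wr.rdma.remote_addr = remote_addr;
    write_wr.wr.rdma.rkey        = rkey;

    memset(&read_wr, 0, sizeof(read_wr));
    read_wr.wr_id               = 2;
    read_wr.sg_list             = &sge;
    read_wr.num_sge             = 1;
    read_wr.opcode              = IBV_WR_RDMA_READ;      /* ordered after the write on this QP */
    read_wr.send_flags          = IBV_SEND_SIGNALED;
    read_wr.wr.rdma.remote_addr = remote_addr;
    read_wr.wr.rdma.rkey        = rkey;

    write_wr.next = &read_wr;
    return ibv_post_send(qp, &write_wr, &bad_wr);
}
```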
I am also curious how this same thing is handled with UCX - as in, are they doing a full software ACK?