Allow use of OFI BTL with RXM #13383

@jakemoroni


Running Open MPI over the Libfabric MTL (RXM/verbs) works fine, but there is no shared memory support in this configuration. Open MPI also has an OFI BTL, but RXM is explicitly excluded from it due to a lack of real FI_DELIVERY_COMPLETE support. Ref: 41acfee

If I comment out the exclusion added in that commit, it "works": I can observe shared memory being used for intra-node communication, which gives a massive performance benefit. Of course, I may have just opened myself up to subtle race conditions and correctness issues...

Can we re-evaluate whether this needs to still be excluded? This may be a libfabric question, so I will follow up there as well.

Out of curiosity, is there a reason why Open MPI needs the "delivery complete" semantic as opposed to "transmit complete"? If so, is there a test case or example of how this more relaxed guarantee could result in a correctness issue?

I could be wrong, but I assume that libfabric is basically using the underlying verbs RDMA write completion, which indeed doesn't guarantee that the data has actually landed in remote memory yet (only that the remote NIC has ACKed it). However, it is guaranteed that any subsequent read will reflect all data written by prior WQEs, so I am curious how any side effect of this weaker completion guarantee could actually be observed anyway.
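
To make the question concrete, here is a toy model of one hypothetical way the weaker guarantee might become observable. It assumes the target is notified over a path that is not ordered with the RDMA write (for example, some out-of-band channel), so it loads the buffer as soon as the initiator sees its completion. Nothing here calls libfabric or verbs; every name (`toy_mode`, `toy_target`, `toy_read_after_completion`, and so on) is made up for illustration, and whether the OFI BTL ever hits such a path is exactly what I don't know.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: nothing here calls libfabric or verbs. All names are hypothetical. */
enum toy_mode { TOY_TRANSMIT_COMPLETE, TOY_DELIVERY_COMPLETE };

struct toy_target {
    int  memory;     /* what the target CPU sees when it loads the buffer */
    int  in_flight;  /* payload the NIC has ACKed but not yet placed */
    bool pending;    /* true until the payload lands in memory */
};

/* The remote NIC accepts (ACKs) the write; placement is deferred. */
static void toy_rdma_write(struct toy_target *t, int value)
{
    t->in_flight = value;
    t->pending = true;
}

/* The payload finally lands in target memory. */
static void toy_place(struct toy_target *t)
{
    if (t->pending) {
        t->memory = t->in_flight;
        t->pending = false;
    }
}

/*
 * Returns what the target CPU reads when it is notified out of band
 * (assumed: a path not ordered with the write) right after the
 * initiator sees its completion.
 */
static int toy_read_after_completion(enum toy_mode mode)
{
    struct toy_target t = { .memory = 0, .in_flight = 0, .pending = false };

    toy_rdma_write(&t, 42);

    /* Delivery complete: the completion is held back until placement. */
    if (mode == TOY_DELIVERY_COMPLETE)
        toy_place(&t);

    /* Completion reported; the target is notified and loads the buffer. */
    int seen = t.memory;

    /* Placement happens eventually in either mode. */
    toy_place(&t);
    assert(!t.pending && t.memory == 42);

    return seen;
}
```

In this model, `toy_read_after_completion(TOY_TRANSMIT_COMPLETE)` returns the stale value 0, while `toy_read_after_completion(TOY_DELIVERY_COMPLETE)` returns 42. If the notification instead travels on the same ordered path as the write, the stale read should not be possible, which is the case I describe above.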

I am also curious how this same thing is handled with UCX - as in, are they doing a full software ACK?
