During the weekly Webex today, it became evident that there are differing views on when the UCX PML is to be used by default. NOTE: All of the discussion below is about the default case -- any component / transport can be selected via MCA params / CLI options / etc., of course.
My understanding of the UCX PML was that it was only to be used when IB or RoCE devices were discovered on the system.
@Akshay-Venkatesh was surprised by this, and thought that the UCX PML should be used whenever it could be used. This includes in TCP and/or shared memory systems.
After looking into this a little bit today, here's what I found:
The UCX PML has a priority of 51.
The UCX PML will report that it can run if any of its transports can be used.
The UCX PML does not exclude any of its transports, so it will report that it can be used even if only TCP and/or shared memory endpoints are available.
The CM PML sets a priority based on whether any MTL can run.
The OFI MTL, for example, sets a priority of 25.
The CM PML will report that it can run if any of its transports can run.
The OB1 PML will always report that it can run, probably on the assumption that everywhere has TCP and/or shared memory.
(all of these priorities are MCA params and can be overridden, of course)
Effectively, it looks like this setup means that the UCX PML is used for all cases, because just about every system has TCP.
For example, on a usNIC-based system, if the UCX PML is available, then by default the UCX PML is selected (because it sees TCP available). That excludes OB1, so usNIC is not used.
That being said, for reasons I don't quite understand, on an EFA-based system, the UCX PML errors/fails to open the EFA device and therefore excludes itself from selection. CM+OFI MTL then take over and run, as expected.
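To make the selection behavior described above concrete, here is a minimal Python sketch of "highest priority among the components that report they can run." It is a toy model, not Open MPI code: the names `Component`, `build_candidates`, and `select_pml` are hypothetical, the OB1 priority of 20 is an assumed placeholder (only the 51 for UCX and 25 for the OFI MTL come from the findings above), and having CM simply inherit the MTL's priority is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Component:
    name: str
    priority: int
    usable: bool  # whether the component reports that it can run


def build_candidates(ucx_transports: Set[str],
                     mtl_available: bool,
                     ucx_open_fails: bool = False) -> List[Component]:
    """Build the PML candidates for one hypothetical system."""
    # UCX reports usable if ANY transport works (TCP and shared memory
    # included), unless opening a device fails, as seen on EFA.
    ucx_usable = bool(ucx_transports) and not ucx_open_fails
    return [
        Component("ucx", 51, ucx_usable),
        # Assumption: CM takes the priority of its MTL (25 for OFI).
        Component("cm", 25 if mtl_available else 0, mtl_available),
        # OB1 always reports usable; 20 is an assumed placeholder priority.
        Component("ob1", 20, True),
    ]


def select_pml(candidates: List[Component]) -> Optional[Component]:
    """Pick the usable component with the highest priority."""
    usable = [c for c in candidates if c.usable]
    return max(usable, key=lambda c: c.priority, default=None)


# usNIC-like system: UCX sees only TCP and shared memory, yet still wins.
usnic = select_pml(build_candidates({"tcp", "sm"}, mtl_available=False))

# EFA-like system: UCX fails to open the device, so CM + OFI MTL take over.
efa = select_pml(build_candidates({"tcp", "sm"}, mtl_available=True,
                                  ucx_open_fails=True))
```

In this model the only thing that keeps UCX from winning is UCX excluding itself, which matches the observation that usNIC loses by default while EFA works.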
I do not know what happens on machines with other networks not supported by UCX. However:
This is unacceptable for usNIC.
I am also greatly surprised to discover that UCX has, by default, effectively taken over TCP and shared memory handling in non-IB/non-RoCE networks.
Am I clueless to not have realized that this is happening? ☹️ I'm curious to know if others are aware of this UCX PML behavior.
Say what?? Absolutely not - TCP was supposed to default to the TCP BTL, as it has done for many years. Users would certainly be surprised to find it wasn't.
To be clear, the only time UCX was supposed to be the default is when we are on Mellanox hardware. Otherwise, we were supposed to default to (a) the vendor's BTL/MTL (e.g., usnic) and then (b) to the BTLs. UCX was never supposed to be the default everywhere, just like OFI isn't the default (even though it too supports TCP).