Peer Shared Receive Queues and Peer Shared Completion Queues #8118

amirshehataornl · 2022-10-14T23:12:29Z

I have a few comments/questions, which may or may not have been addressed before.

PEER_SRX Capability

Would it make sense to have Peer SRX be a capability? This way an application can query providers which support Peer SRX, and opt to use them.

With LINKx it would be useful to get the capability from fi_getinfo(). LINKx can then decide whether to use SRX or not. Currently SRX imposes a restriction on LINKx: all providers must support it in order to be linked. However, there is a use case where you can link multiple providers without SRX. Receiving from ADDR_UNSPEC will not be supported, but everything else will.

Currently the only way to find out if SRX is supported is by calling specific APIs with FI_PEER_... and if that returns failure then SRX is not supported.

This means that if LINKx is linking multiple providers, and the first one supports SRX, but the second one doesn't, then it's more complicated to rewind the setup in order to "turn off" SRX on the first provider.
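To make the point concrete, here is a minimal sketch of the up-front check a capability bit would allow. Everything in it is hypothetical: `HYP_CAP_PEER_SRX`, `struct hyp_prov_info`, and `hyp_all_support_srx()` are illustrative names, not libfabric definitions. The sketch assumes the capability could be read from whatever fi_getinfo() returned for each provider, before any SRX is created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bit; NOT a real libfabric flag. */
#define HYP_CAP_PEER_SRX (UINT64_C(1) << 0)

/* Hypothetical summary of what one provider advertised at query time. */
struct hyp_prov_info {
    const char *name;
    uint64_t caps;
};

/*
 * Return true only if every provider to be linked advertises peer SRX.
 * The decision is made before any SRX is set up, so nothing has to be
 * rewound when a later provider turns out not to support it.
 */
bool hyp_all_support_srx(const struct hyp_prov_info *provs, size_t count)
{
    assert(provs != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (!(provs[i].caps & HYP_CAP_PEER_SRX))
            return false;
    }
    /* An empty set of providers is treated as "no SRX". */
    return count > 0;
}
```

With a check like this, LINKx could choose between the SRX and non-SRX paths once, instead of discovering the answer halfway through setup.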

Shared Receive Queues vs Shared Completion Queues Workflow

I was initially under the impression that setting up SRQ and SCQ would be symmetric:

Call the fi_<><>() API with the FI_PEER<> flag

However, SRQ setup is broken up into multiple steps:

- Create the domain
- "bind" srx to the domain via fi_srx_context()
- bind srx to the endpoint

I'm not sure of the benefit of that workflow. Do you see a scenario where you can have multiple endpoints, some with SRX enabled and others not?

If that's not the case, then why are we not binding the SRX to the domain via fi_srx_context() and then when endpoints are created they inherit the SRX from the domain?

If you want multiple different endpoints, some with SRX and others not, then it would make sense to have different domains in that case, no?

Can you please explain the benefit of binding the SRX to the endpoint explicitly vs the endpoint inheriting the SRX from the domain? The inheritance method, I believe, does have precedent.

SRX Binding to Endpoint

In the current implementation, binding an SRX to an endpoint is done under the FI_CLASS_SRX_CTX. Why is it not done under FI_CLASS_PEER_SRX? Isn't that the purpose of the latter? And shouldn't the peer srx operations be under their own class?

That said, I don't think we should bind at all, as mentioned in my point above.

shefty · 2022-10-15T01:00:05Z

Maintainer

I believe that we are moving to a highly composable peer model. Within a year, I expect we'll be combining 3-4 providers together to act as one. The peer model got more complex with the addition of offload providers, and I expect it will get worse.

Realistically, any provider that is going to pair with another one will implement everything that it needs. And we will know up front what peer combinations we'll want. Trying to handle this through a dynamic query is a recipe for unneeded complexity. We have a bounded set of peers we're working with.

I want to replace all of the FI_PEER_ bits with a single FI_PEER bit, not add more. We're consuming too many capabilities bits. Applications should not use FI_PEER bits, nor try to control libfabric internal architecture through the public API.

An srx is not 'bound' to a domain like it is an endpoint. It's allocated, and that allocation is restricted to a domain. This is the same as for CQs. There can be multiple srx's/CQs allocated per domain. An endpoint is attached to an srx/CQ. This is the same flow used by applications through the main API. The peer allocates the srx, cq, ep, etc. the same as an app, and binds them the same as an app. We can argue whether a domain should only allow the creation of a single srx and cq, but that's an API 2.0 discussion.
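The allocate-then-attach relationship described here can be modeled in a few lines. This is only a sketch with hypothetical stand-in types (`hyp_domain`, `hyp_srx`, `hyp_ep`); it does not use the real libfabric structs, and the real calls (fi_srx_context() to allocate, fi_ep_bind() to attach) do considerably more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for libfabric objects; not the real structs. */
struct hyp_domain { int id; };
struct hyp_srx { struct hyp_domain *domain; };  /* scoped to a domain */
struct hyp_ep {
    struct hyp_domain *domain;
    struct hyp_srx *srx;                        /* NULL: ep has no srx */
};

/* Allocate an srx on a domain. A domain may own any number of these. */
struct hyp_srx hyp_srx_alloc(struct hyp_domain *domain)
{
    assert(domain != NULL);
    struct hyp_srx srx = { .domain = domain };
    return srx;
}

/* Create an endpoint on a domain with nothing attached yet. */
struct hyp_ep hyp_ep_create(struct hyp_domain *domain)
{
    assert(domain != NULL);
    struct hyp_ep ep = { .domain = domain, .srx = NULL };
    return ep;
}

/*
 * Attach an endpoint to an srx. Returns 0 on success, or -1 if the two
 * objects were allocated on different domains.
 */
int hyp_ep_bind_srx(struct hyp_ep *ep, struct hyp_srx *srx)
{
    if (ep->domain != srx->domain)
        return -1;
    ep->srx = srx;
    return 0;
}
```

In this model one domain can hold two srx objects serving different endpoints, plus an endpoint with no srx at all, which is the flexibility that domain-level inheritance would remove.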

The srx that is allocated is a normal srx (i.e. FI_CLASS_SRX_CTX). It's attached to the ep's the same as other srx's are attached. The FI_CLASS_PEER_SRX is the object that's imported by the peer as the void *context parameter, it's not what gets attached to the ep's.

4 replies

amirshehataornl Oct 15, 2022
Author

Thanks for the explanation.

Regarding your point about a "highly composable peer model": I'm still trying to get the terminology straight. When you run the "fi_info" utility, you get a list of entries. What term do you use to refer to each of these entries?

shefty Oct 16, 2022
Maintainer

I call them fi_info entries. Original, no? :)

Originally, the term provider referred to an implementation of the API targeting a specific lower-level hardware or software interface. I think that term still mostly holds; it's just that the lower-level software interface being targeted may itself be libfabric. So, we try to group providers based on what they're trying to do: utility, core, hooking. I think of the link provider as distinct from any of these. An offload provider will be different yet, in that it will only implement a subset of the API and only be used as a peer.

My goal is to allow for independently maintained and developed providers, within reason. I don't want to skip architecture and design for the sake of perceived short-term convenience. But I also don't want an abstraction to increase maintenance costs just for the purpose of having an abstraction.

Here's just one example of what I envision for a 'highly composable peer model'. The link provider will join shared memory with rxm. Rxm sits over the verbs provider. Rxm may also link in a software collective module (currently the rxm util collective code). Rxm may also link in an offload collective module that communicates with NIC-based collectives (e.g. SHARP). So, from one view, we're dealing with at least 5 providers: link, shm, rxm, verbs, and sharp. A key here is that the functionality implemented by each separate provider is required. Trying to combine the functionality into a single, monolithic provider is just packaging. It won't simplify anything or remove functionality. It only results in duplicating the same functionality throughout other monolithic providers and makes tuning each component that much more difficult.

amirshehataornl Oct 16, 2022
Author

Is it fair to use object-oriented terminology to explain the relationship between providers and fi_info entries?

An fi_info entry could be referred to as an instance of the provider class. For example, if you have a net provider and fi_info returns one entry for each ethernet interface, then each of these entries is an instance of the net provider class.

I'm trying to clarify this in my mind within the scope of the LINKx provider.

While we can say LINKx links multiple providers, it would be more accurate to say that it links instances of the providers. Yes?
Per your definition of a provider ("an implementation of the API targeting a specific lower-level hardware or software interface"), you don't link the APIs together; rather, you link the interfaces together.
I might be using the term "link" differently from you. You might've meant link the different provider functionality together, while I'm using "link" in the context of linking interfaces together.

If we have 3 interfaces of a specific HW (ex: ethernet) LINKx can potentially group all 3 instances of the provider into one linkx-group and we can utilize these interfaces in some multi-rail method.

It is also possible to link (or compose) different providers together. So the hierarchy would be something like

LINKx
- provider-A (ex net)
  - provider instance 1 (ex eth0)
  - provider instance N (ex ethN)
- provider-B (ex shm)
  - provider instance 1 (ex: shm which supports HMEM)

Looking at OpenMPI as an example application, it calls fi_getinfo() and then proceeds to select only one fi_info. This means it can only use one particular interface at a time.

That was my original purpose of having the LINKx provider, where instead of using only one interface, we can abstract the use of multiple interfaces from the application. The practical use case at the moment is to use SHM provider for intra node and some other interface for inter node.

However, my next step is to allow for using multiple interfaces. I'll probably start with homogeneous interfaces, but using heterogeneous interfaces is possible as well, although probably less of a priority at the moment.

For example, suppose you have a node with ib0, ib1 and eth0, eth1, and you want to use all of them for your application. You could create an IB link-group containing ib0 and ib1, and an eth link-group containing eth0 and eth1. The application can then establish connections to peers on the IB network as well as those on the eth network. This can be abstracted under the LINKx provider. The application sees only the LINKx address, which is a composition of all the addresses on which a specific node can be reached. When the address is inserted, LINKx decomposes it into its constituents and updates the AV table for each provider instance. If there are multiple instances of the same provider, it would make sense for them to share one AV table, provided all the interfaces are on the same network.
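A rough sketch of the decompose-on-insert idea follows. The layout is purely an assumption for illustration: `hyp_link_addr`, `hyp_av`, and `hyp_link_av_insert()` are hypothetical names, addresses are reduced to 64-bit integers, and each provider instance ("rail") gets its own fixed-size table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HYP_MAX_PARTS 4
#define HYP_AV_SLOTS  8

/* Hypothetical per-provider-instance address table. */
struct hyp_av {
    uint64_t addr[HYP_AV_SLOTS];
    size_t used;
};

/* One constituent of a composite address: which rail it belongs to. */
struct hyp_addr_part {
    size_t rail;
    uint64_t addr;
};

/* Hypothetical composite address as the application would see it. */
struct hyp_link_addr {
    size_t count;
    struct hyp_addr_part part[HYP_MAX_PARTS];
};

/*
 * Decompose a composite address and insert each constituent into the AV
 * of the rail it names. Returns the number of constituents inserted, or
 * -1 on the first invalid rail or full table. Earlier inserts are NOT
 * rolled back in this sketch.
 */
int hyp_link_av_insert(struct hyp_av *avs, size_t nrails,
                       const struct hyp_link_addr *la)
{
    assert(avs != NULL && la != NULL);

    if (la->count > HYP_MAX_PARTS)
        return -1;

    for (size_t i = 0; i < la->count; i++) {
        size_t rail = la->part[i].rail;

        if (rail >= nrails || avs[rail].used >= HYP_AV_SLOTS)
            return -1;
        avs[rail].addr[avs[rail].used++] = la->part[i].addr;
    }
    return (int)la->count;
}
```

Sharing one AV among instances on the same network would correspond to several rails pointing at the same table, which this sketch leaves out.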

Of course some interfaces can be on different subnets, in which case you'd need to create a different linkx-group per subnet. So the structure ends up something like

Provider-A (ex net)
- LINKx-Group (subnet-1)
  - provider instance 1 (ex eth0)
  - provider instance 2 (ex eth1)
- LINKx-Group (subnet-2)
  - provider instance 3 (ex eth2)
Provider-B (ex IB)
- LINKx-Group (ib-subnet-1)
  - provider instance 1 (ex ib0)
  - provider instance 2 (ex ib1)
- LINKx-Group (ib-subnet-2)
  - provider instance 3 (ex ib2)

Provider A and B refer to a provider class and each instance is a specific interface.
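The grouping above could be derived mechanically. The sketch below assumes each interface is described by a provider name and a subnet identifier (`struct hyp_iface` and `hyp_assign_groups()` are hypothetical), and puts interfaces in the same link-group when both match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one interface (one provider instance). */
struct hyp_iface {
    const char *provider;  /* e.g. "net" or "ib" */
    const char *name;      /* e.g. "eth0" */
    uint32_t subnet;       /* opaque subnet identifier */
};

/*
 * Assign a link-group id to every interface. Two interfaces share a
 * group when they have the same provider and the same subnet. Ids are
 * handed out in order of first appearance. Returns the group count.
 */
size_t hyp_assign_groups(const struct hyp_iface *ifs, size_t n,
                         size_t *group_of)
{
    size_t ngroups = 0;

    assert((ifs != NULL && group_of != NULL) || n == 0);

    for (size_t i = 0; i < n; i++) {
        size_t j;

        /* Look for an earlier interface in the same provider/subnet. */
        for (j = 0; j < i; j++) {
            if (ifs[j].subnet == ifs[i].subnet &&
                strcmp(ifs[j].provider, ifs[i].provider) == 0) {
                group_of[i] = group_of[j];
                break;
            }
        }
        if (j == i)                 /* no match: start a new group */
            group_of[i] = ngroups++;
    }
    return ngroups;
}
```

Fed the six interfaces from the structure above (eth0 to eth2 and ib0 to ib2), this yields four groups, one per provider/subnet pair.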

thoughts?

shefty Oct 17, 2022
Maintainer

I think of a provider being an implementation, not a class or instance. However, one could consider struct fi_provider as a class, and an instance of a provider as the loaded provider library. That is, there's only 1 instance of each provider present.

OFI classes are fabrics, domains, ep, etc. fi_endpoint() creates an instance of an ep. Simply stated, fi_info describes the attributes for creating an instance of an ep. Those attributes include information about which implementation will be used. But an fi_info isn't an instance of any class itself.

The peer architecture defines how different instances of the various classes work together, when the instances come from different implementations. For example, how an ep instance from the tcp provider works with a cq instance from the shm provider.

If a provider is only using the public libfabric API to provide additional functionality, it's acting as what we've been calling a utility provider. While useful, it doesn't allow for optimizations that we can achieve using the peer model.

In your bulleted lists above, link isn't creating provider instances. It's creating ep instances. Those ep's may belong to different provider instances, and even in cases where the ep's are within the same provider instance, they may belong to different domain (NIC) instances. The starting point for the peer model was to define how we could create a single cq instance, on any of the domain or provider instances, and share the cq across all the different ep instances. We just recently expanded that to allow sharing eq, av, and av sets.
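As a simplified illustration of one cq instance shared by ep instances from different providers, consider the model below. It is not the peer API: `hyp_cq`, `hyp_peer_ep`, and the functions are hypothetical, the queue is a plain single-threaded ring, and a completion is reduced to a provider name plus an operation context.

```c
#include <assert.h>
#include <stddef.h>

#define HYP_CQ_DEPTH 16

/* Hypothetical completion record: who completed which operation. */
struct hyp_cq_entry {
    const char *provider;
    void *op_context;
};

/* Hypothetical shared completion queue, owned by one provider. */
struct hyp_cq {
    struct hyp_cq_entry ring[HYP_CQ_DEPTH];
    size_t head;  /* next entry to read */
    size_t tail;  /* next slot to write */
};

/* Hypothetical endpoint from any provider, pointing at the shared cq. */
struct hyp_peer_ep {
    const char *provider;
    struct hyp_cq *cq;
};

/* Report a completion into the shared cq. Returns 0, or -1 if full. */
int hyp_ep_complete(struct hyp_peer_ep *ep, void *op_context)
{
    struct hyp_cq *cq = ep->cq;

    assert(cq != NULL);
    if (cq->tail - cq->head == HYP_CQ_DEPTH)
        return -1;
    cq->ring[cq->tail % HYP_CQ_DEPTH] =
        (struct hyp_cq_entry){ ep->provider, op_context };
    cq->tail++;
    return 0;
}

/* Read the oldest completion. Returns 1 if one was read, 0 if empty. */
int hyp_cq_read(struct hyp_cq *cq, struct hyp_cq_entry *out)
{
    if (cq->head == cq->tail)
        return 0;
    *out = cq->ring[cq->head % HYP_CQ_DEPTH];
    cq->head++;
    return 1;
}
```

The reader polls a single queue and sees completions from both endpoints in arrival order, regardless of which provider produced them.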
