Ofi 2.0 changes #10009

j-xiong · 2024-04-29T06:51:16Z

Here are the proposed changes for OFI 2.0, most of which are cherry-picked from #9384 and modified to fit current tip of main branch. Since the sockets and usnic provider have not been removed as suggested by #9384, extra work is needed to keep them in sync with the changes.

More commits are going to be added to this PR. The merging of this PR will embark the forking of v1.x-main.

Several defines and values should not have been exposed in the public header files. Remove or move the definitions into internal headers. This removes the chance of possible conflicts with application definitions and API breakage. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Remove support for asynchronous insertions and AV_MAP. The format of the fi_addr_t value will either be indexed based in the standard case or provider defined in more advanced use cases, based on the AV configuration (such as using auth_keys). Signed-off-by: Sean Hefty <sean.hefty@intel.com>

Remove FI_BUFFERED_RECV as an exported API option. Since it's currently used internally between mrail and rxm, make it an internal only option. It has a limited use case for multirail over rxm over connected endpoints where shared receive queues are not available. With shared receive queues, the feature wouldn't be needed, as mrail could own the buffers outright. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Remove overly complicated threading models and focus on specific models to allow better alignment between application designs and provider implementation. Use FI_THREAD_DOMAIN as the preferred lockless threading model for standard endpoints. Signed-off-by: Sean Hefty <sean.hefty@intel.com>

Combine data and control progress into one progress option. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Completions are always unordered. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Field was deprecated and only serves as a placeholder for compatility. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Support for wait sets adds significant complexity to the provider implementation and is basically an abstraction around the OS constructs for poll/epoll (on Linux). Remove the feature from the API, but keep the internal implementation for now. This allows providers to move away from wait set support. Note that blocking support and support for native wait objects (e.g. epoll fd's) are still supported by the API. Only the wait set abstraction is removed, which allows providers control over creating wait objects. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

shijin-aws · 2024-04-29T16:38:05Z

man/fi_domain.3.md

-  Setting control=FI_PROGRESS_CONTROL_UNIFIED, data=FI_PROGRESS_MANUAL, and
-  threading=FI_THREAD_DOMAIN/FI_THREAD_COMPLETION allows Libfabric to remove all
-  locking in the critical data progress path.
+: This progress model indicates that the user will synchronize progressing


If we remove control progress, aren't we assuming data and ctrl progress are unified?

That is a question I would like to discuss. It is possible that both use the same model (auto or manual) but progress independently. But does it make practical sense to keep them separated? Or are there technical difficulties for all providers to support unified progress?

My intent was to remove the arbitrary separation between data and control progress. There's only progress, which can be manual or auto. Heck, even trying to determine if an async operation that generates a completion meets the requirement for being manual versus auto is non-trivial.

The only "non-data" asynchronous operation that is left is connection establishment. Such operations are reported to an EQ, which already provides separation from data transfer completions. Connections require user intervention in order to complete. I don't know what auto progress really means there.

shijin-aws · 2024-04-30T16:07:12Z

AWS CI failed in ubertest as wait_obj is removed

--------------------------------- Captured Out ---------------------------------

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.38.120 'timeout 1000 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10009-debug/install/fabtests/bin/fi_ubertest -x'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.38.120 'timeout 1000 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10009-debug/install/fabtests/bin/fi_ubertest -u /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10009-debug/install/fabtests/share/fabtests/test_configs/sockets/quick.test 172.31.38.120'"'"''
client_stdout:
[error] fabtests:ubertest/config.c:316: Unknown wait_obj
[error] fabtests:ubertest/config.c:568: Error parsing config!
[error] fabtests:ubertest/config.c:629: Unable to parse file

client returncode: 1
server_stdout:

server returncode: 124

j-xiong · 2024-04-30T16:42:48Z

@shijin-aws Thanks. Similarly, the failure in Intel CI is related to trying to run removed test (fi_poll).

shefty · 2024-05-06T23:07:04Z

include/rdma/fabric.h

 #define FI_VARIABLE_MSG		0ULL
 #define FI_NOTIFY_FLAGS_ONLY	0ULL
 #define FI_RESTRICTED_COMP	0ULL
 #define FI_EP_SOCK_STREAM	FI_EP_UNSPEC
+#define FI_ORDER_NONE		0ULL
+#define FI_MR_BASIC		(FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY)
+#define FI_MR_SCALABLE		0


Defining FI_MR_BASIC and FI_MR_SCALABLE allow the app to compile, but the behavior isn't the same. FI_MR_BASIC, specifically will result is a broken app. IMO, it's better to fail the compilation.

Keep these temporarily for current CI testing. Will remove before merge.

shefty · 2024-05-06T23:08:52Z

Makefile.am

@@ -294,7 +293,7 @@ real_man_pages = \
        man/man7/fi_arch.7 \
        man/man7/fi_direct.7 \
        man/man7/fi_guide.7 \
-	man/man7/fi_hook.7 \
+	      man/man7/fi_hook.7 \


tab got inserted here

belynam

The changes to the OPX provider look good to me.

shefty · 2024-05-14T15:00:22Z

include/rdma/fabric.h

+enum {
+	FI_TAG_BITS,
+	FI_TAG_HPC,
+	FI_TAG_AI,


Are we wanting HPC and AI or MPI and CCL? The latter are closer aligned to what the definitions correspond to.

MPI and CCL. Forgot to change that.

shefty · 2024-05-14T15:02:03Z

man/fi_endpoint.3.md

+
+  The AI tag format does not require searching for matching receive
+  buffers, only directing the message to the correct virtual message queue
+  based on to the payload identifier.


extra 'to' in the text that isn't needed

shefty · 2024-05-14T15:04:06Z

man/fi_atomic.3.md

+: 16-bit half precision floating point value (IEEE 754-2008).
+
+*FI_BFLOAT16*
+: 16-bit brain floating point value (IEEE 754-2008).


brain floating point - I love IEEE

Poll sets are a simple iterator around progressing multiple objects. Remove from the API to reduce provider complexity. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

It is only implemented in the sockets provider as a proof of concept. The sockets provider itself is deprecated. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Remove FI_MR_BASIC/SCALABLE/UNSPEC. These were deprecated in version 1.5. Remove FI_LOCAL_MR, which was an earlier version of FI_MR_LOCAL and also deprecated. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Feature is not implemented natively by providers. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Remove FI_ORDER_NONE (flag that's 0) and FI_ORDER_STRICT (which isn't a flag, and covers only a portion of the valid flags). Remove FI_ORDER_DATA, which will not be supported by version 2 in order to allow for greater provider optimization. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

They are still used by MPICH which is needed by current CI runs. Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Remove the option of directing transmit and receive completions to separate CQs for the same endpoint. The option adds complexity at the provider and application levels. This is largely the result of needing SW based protocols for certain operations, such as tag matching. This either makes it necessary for the app to drive progress across multiple CQs, or the provider emulates the application's CQs in SW. This change updates the man page only. Provider developers are left to update their code bases separately. Signed-off-by: Sean Hefty <sean.hefty@intel.com>

Disallow users hand-crafting or hand-copying their own fi_info structs. This allows the library to allocate hidden fields for internal use. Plus, there's no need to apps to do this, given that the API call is way easier to use. Signed-off-by: Sean Hefty <sean.hefty@intel.com>

Add a new call that takes fi_info as input, which provides consistency with the domain and endpoint open calls. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Subsystem filtering isn't useful (or likely ever used). Remove the enum fi_log_subsys. Instead convert the API to accept int flags. This maintains API compatibility. Update all current FI_LOG_xxx subsys values to 0. This avoids needing to update the providers to the new API, forcing them to pass in 0 for the flags. No actual flag values are defined. Those become a placeholder for future options. The logging checks are simplified by this change. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Introduce the concept of peer groups. A peer group is a set of peers that are communicating together for some specific set of tasks. Peer groups provide a lower-level mapping of HPC and AI communicators. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Allow specifying precise tag formatting options. The mem_tag_format takes as input a set of bit fields. In practice, this ends up being unusable to implement, resulting in the entire tag simply being masked with ignore bits. When the mem_tag_format value only has the lower bits set (< 256), interpret the format as specific options. Two new options are defined, one aligned with MPI and the other with CCLs. This information can be used by providers to optimize for the separate use cases. Signed-off-by: Sean Hefty <sean.hefty@intel.com>

Add 16-bit and 8-bit floating point datatypes that are often used for machine learning and AI applications. Add atomic diff operation. Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

Today, `ep_attr->max_msg_size` and `tx_attr->inject_size` define the transport's size limitation for a single data transfer operation. It lacks the flexibility to allow providers having different limits on different categories of operations (msg/tagged/rma/atomic). At the same time, applications can't express usage models that have very different size limits (e.g, only need small send/recv, but want large RMA). Here a set of options are added to the fi_getopt()/fi_setopt() interface. These options allow the maximum message size and inject size be retrieved or set independently for msg/tagged/rma/atomci operations. The existing definition of `ep_attr->max_msg_size` and `tx_attr->inject_size` remain unchanged -- they are the upper bound of the limits for all the data transfer functions. A typical application will call fi_getopt() to retrieve the default values first, and call fi_setopt() if lower values are desired. If -FI_ENOPROTOOPT is returned, `ep_attr->max_msg_size` and `tx_attr->inject_size` should be used. Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

miharulidze · 2024-05-16T17:09:41Z

Dear all,

Are there any published schedule for the 2.0 API/library release and/or any a specific public mailing list or meeting where the ideas/release plans are discussed?

Thank you very much !

shijin-aws · 2024-05-28T21:30:07Z

For 8f17f05, I think it doesn't have to be ofi 2.0 only, can we make a separate PR that can be merged to main and we can release it in a future 1.x release

j-xiong · 2024-05-28T21:51:42Z

Dear all,

Are there any published schedule for the 2.0 API/library release and/or any a specific public mailing list or meeting where the ideas/release plans are discussed?

Thank you very much !

There is a roadmap here:
#8049

Discussions are usually held on the bi-weekly ofiwg meeting on Tuesdays mornings: https://www.openfabrics.org/my-calendar/

j-xiong · 2024-05-28T21:52:41Z

@shijin-aws Yes, we can do that.

j-xiong · 2024-08-06T19:56:01Z

Replaced with #10279 and #10250 .

shefty added 8 commits April 28, 2024 15:54

core: Simplify progress definition

b629dfb

Combine data and control progress into one progress option. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

core: Remove comp_order attribute

44295a5

Completions are always unordered. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

core: Remove total_buffered_recv

5acaa43

Field was deprecated and only serves as a placeholder for compatility. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

j-xiong added the work in progress label Apr 29, 2024

shijin-aws reviewed Apr 29, 2024

View reviewed changes

j-xiong force-pushed the ofi-2.0-changes branch from 03514ce to a77882b Compare April 29, 2024 16:46

j-xiong force-pushed the ofi-2.0-changes branch from a77882b to 91f8a34 Compare May 5, 2024 01:05

shefty reviewed May 6, 2024

View reviewed changes

belynam reviewed May 14, 2024

View reviewed changes

shefty reviewed May 14, 2024

View reviewed changes

shefty and others added 10 commits May 14, 2024 17:14

core: Remove fid_poll from the public API

fe1551b

Poll sets are a simple iterator around progressing multiple objects. Remove from the API to reduce provider complexity. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

core: Remove FI_WAIT_MUTEX_COND support from API

b3fc385

It is only implemented in the sockets provider as a proof of concept. The sockets provider itself is deprecated. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

core: Remove support for async memory registration

168e05c

Feature is not implemented natively by providers. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

core: Add dummy definitions for some removed caps/modes

70cf42f

They are still used by MPICH which is needed by current CI runs. Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

core: Add fi_fabric2() API

14b5745

Add a new call that takes fi_info as input, which provides consistency with the domain and endpoint open calls. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

shefty and others added 4 commits May 14, 2024 17:14

core: Introduce new atomic datatypes and operation

a4cf6ff

Add 16-bit and 8-bit floating point datatypes that are often used for machine learning and AI applications. Add atomic diff operation. Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>

j-xiong force-pushed the ofi-2.0-changes branch from c6174dd to 8f17f05 Compare May 15, 2024 02:36

This was referenced Jul 31, 2024

2.0 deprecations #10250

Merged

2.0 new features #10279

Merged

libfabric version2: Initial set of proposed changes #9384

Closed

j-xiong closed this Aug 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ofi 2.0 changes #10009

Ofi 2.0 changes #10009

j-xiong commented Apr 29, 2024

shijin-aws Apr 29, 2024

j-xiong Apr 29, 2024

shijin-aws Apr 29, 2024

shefty Apr 29, 2024

shijin-aws commented Apr 30, 2024 •

edited

Loading

j-xiong commented Apr 30, 2024

shefty May 6, 2024

j-xiong May 14, 2024

shefty May 6, 2024

belynam left a comment

shefty May 14, 2024

j-xiong May 14, 2024

shefty May 14, 2024

shefty May 14, 2024

miharulidze commented May 16, 2024

shijin-aws commented May 28, 2024 •

edited

Loading

j-xiong commented May 28, 2024

j-xiong commented May 28, 2024

j-xiong commented Aug 6, 2024

Ofi 2.0 changes #10009

Ofi 2.0 changes #10009

Conversation

j-xiong commented Apr 29, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shijin-aws commented Apr 30, 2024 • edited Loading

j-xiong commented Apr 30, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

belynam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

miharulidze commented May 16, 2024

shijin-aws commented May 28, 2024 • edited Loading

j-xiong commented May 28, 2024

j-xiong commented May 28, 2024

j-xiong commented Aug 6, 2024

shijin-aws commented Apr 30, 2024 •

edited

Loading

shijin-aws commented May 28, 2024 •

edited

Loading