
Ofi 2.0 changes #10009

Closed
wants to merge 22 commits

Conversation

@j-xiong (Contributor) commented Apr 29, 2024

Here are the proposed changes for OFI 2.0, most of which are cherry-picked from #9384 and modified to fit the current tip of the main branch. Since the sockets and usnic providers have not been removed as suggested by #9384, extra work is needed to keep them in sync with the changes.

More commits are going to be added to this PR. Merging this PR will kick off the forking of v1.x-main.

Several defines and values should not have been exposed in the
public header files.  Remove or move the definitions into internal
headers.  This avoids possible conflicts with application
definitions and API breakage.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Remove support for asynchronous insertions and AV_MAP.  The
format of the fi_addr_t value will either be index based in
the standard case or provider defined in more advanced use cases,
based on the AV configuration (such as using auth_keys).

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
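
For illustration only (not code from this PR), a minimal sketch of what the synchronous, index-based path looks like from the application side; the AV 'av' and encoded address 'peer_addr' are assumed to exist elsewhere, and error handling is reduced to a return code.

#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

/* Sketch only: synchronous insert; the returned fi_addr_t is index based
 * in the standard case, or provider defined when the AV is configured
 * with features such as auth_keys. */
static int add_peer(struct fid_av *av, const void *peer_addr, fi_addr_t *peer)
{
        int ret = fi_av_insert(av, peer_addr, 1, peer, 0, NULL);
        return (ret == 1) ? 0 : -FI_EOTHER;
}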
Remove FI_BUFFERED_RECV as an exported API option.  Since it's
currently used internally between mrail and rxm, make it an
internal-only option.  It has a limited use case for multirail
over rxm over connected endpoints where shared receive queues are
not available.  With shared receive queues, the feature wouldn't
be needed, as mrail could own the buffers outright.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Remove overly complicated threading models and focus on specific
models to allow better alignment between application designs and
provider implementation.

Use FI_THREAD_DOMAIN as the preferred lockless threading model
for standard endpoints.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
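
As a hedged sketch (assumptions: standard libfabric getinfo hints, all other setup omitted), this is how an application would opt into the preferred lockless model:

#include <rdma/fabric.h>

static struct fi_info *make_hints(void)
{
        struct fi_info *hints = fi_allocinfo();

        if (!hints)
                return NULL;
        /* App promises to serialize access to all objects under the domain,
         * allowing the provider to drop locking on the data path.
         * FI_THREAD_COMPLETION remains available when per-completion-object
         * serialization is a better fit. */
        hints->domain_attr->threading = FI_THREAD_DOMAIN;
        return hints;
}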
Combine data and control progress into one progress option.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Completions are always unordered.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
The field was deprecated and only serves as a placeholder for
compatibility.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Support for wait sets adds significant complexity to the provider
implementation and is basically an abstraction around the OS
constructs for poll/epoll (on Linux).  Remove the feature from the
API, but keep the internal implementation for now.  This allows
providers to move away from wait set support.

Note that blocking support and support for native wait objects
(e.g. epoll fd's) are still supported by the API.  Only the wait
set abstraction is removed, which allows providers control over
creating wait objects.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
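
A hedged sketch of the native wait object path that remains, assuming a CQ opened with wait_obj = FI_WAIT_FD; the epoll handling below is illustrative, not code from this PR.

#include <errno.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

static int wait_on_cq(struct fid_cq *cq, int timeout_ms)
{
        struct epoll_event ev = { .events = EPOLLIN };
        int fd, epfd, ret;

        /* Retrieve the provider's native wait object (an fd on Linux). */
        ret = fi_control(&cq->fid, FI_GETWAIT, &fd);
        if (ret)
                return ret;

        epfd = epoll_create1(0);
        if (epfd < 0)
                return -errno;
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        ret = epoll_wait(epfd, &ev, 1, timeout_ms);
        close(epfd);
        return ret < 0 ? -errno : 0;
}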
Setting control=FI_PROGRESS_CONTROL_UNIFIED, data=FI_PROGRESS_MANUAL, and
threading=FI_THREAD_DOMAIN/FI_THREAD_COMPLETION allows Libfabric to remove all
locking in the critical data progress path.
: This progress model indicates that the user will synchronize progressing
Contributor:

If we remove control progress, aren't we assuming data and ctrl progress are unified?

Contributor Author:

That is a question I would like to discuss. It is possible that both use the same model (auto or manual) but progress independently. But does it make practical sense to keep them separated? Or are there technical difficulties that prevent all providers from supporting unified progress?

Contributor:

cc @shefty

Member:

My intent was to remove the arbitrary separation between data and control progress. There's only progress, which can be manual or auto. Heck, even trying to determine if an async operation that generates a completion meets the requirement for being manual versus auto is non-trivial.

The only "non-data" asynchronous operation that is left is connection establishment. Such operations are reported to an EQ, which already provides separation from data transfer completions. Connections require user intervention in order to complete. I don't know what auto progress really means there.

@shijin-aws (Contributor) commented Apr 30, 2024

AWS CI failed in ubertest because wait_obj was removed:

--------------------------------- Captured Out ---------------------------------

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.38.120 'timeout 1000 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10009-debug/install/fabtests/bin/fi_ubertest -x'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.38.120 'timeout 1000 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10009-debug/install/fabtests/bin/fi_ubertest -u /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10009-debug/install/fabtests/share/fabtests/test_configs/sockets/quick.test 172.31.38.120'"'"''
client_stdout:
[error] fabtests:ubertest/config.c:316: Unknown wait_obj
[error] fabtests:ubertest/config.c:568: Error parsing config!
[error] fabtests:ubertest/config.c:629: Unable to parse file

client returncode: 1
server_stdout:

server returncode: 124

@j-xiong (Contributor Author) commented Apr 30, 2024

@shijin-aws Thanks. Similarly, the failure in Intel CI is related to trying to run a removed test (fi_poll).

#define FI_VARIABLE_MSG 0ULL
#define FI_NOTIFY_FLAGS_ONLY 0ULL
#define FI_RESTRICTED_COMP 0ULL
#define FI_EP_SOCK_STREAM FI_EP_UNSPEC
#define FI_ORDER_NONE 0ULL
#define FI_MR_BASIC (FI_MR_VIRT_ADDR | FI_MR_ALLOCATED | FI_MR_PROV_KEY)
#define FI_MR_SCALABLE 0
Member:

Defining FI_MR_BASIC and FI_MR_SCALABLE allows the app to compile, but the behavior isn't the same. FI_MR_BASIC, specifically, will result in a broken app. IMO, it's better to fail the compilation.

Contributor Author:

Keep these temporarily for current CI testing. Will remove before merge.

Makefile.am Outdated
@@ -294,7 +293,7 @@ real_man_pages = \
man/man7/fi_arch.7 \
man/man7/fi_direct.7 \
man/man7/fi_guide.7 \
man/man7/fi_hook.7 \
man/man7/fi_hook.7 \
Member:

tab got inserted here

@belynam (Contributor) left a comment:

The changes to the OPX provider look good to me.

enum {
FI_TAG_BITS,
FI_TAG_HPC,
FI_TAG_AI,
Member:

Do we want HPC and AI or MPI and CCL? The latter align more closely with what the definitions correspond to.

Contributor Author:

MPI and CCL. Forgot to change that.


The AI tag format does not require searching for matching receive
buffers, only directing the message to the correct virtual message queue
based on to the payload identifier.
Member:

extra 'to' in the text that isn't needed

: 16-bit half precision floating point value (IEEE 754-2008).

*FI_BFLOAT16*
: 16-bit brain floating point value (IEEE 754-2008).
Member:

brain floating point - I love IEEE

shefty and others added 10 commits May 14, 2024 17:14
Poll sets are a simple iterator around progressing multiple objects.
Remove from the API to reduce provider complexity.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
It is only implemented in the sockets provider as a proof of concept.
The sockets provider itself is deprecated.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Remove FI_MR_BASIC/SCALABLE/UNSPEC.  These were deprecated in
version 1.5.  Remove FI_LOCAL_MR, which was an earlier version of
FI_MR_LOCAL and also deprecated.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Feature is not implemented natively by providers.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Remove FI_ORDER_NONE (a flag whose value is 0) and FI_ORDER_STRICT (which
isn't a flag and covers only a portion of the valid flags).

Remove FI_ORDER_DATA, which will not be supported by version 2 in
order to allow for greater provider optimization.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
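
Illustrative only: applications that previously asked for FI_ORDER_STRICT would instead name the specific guarantees they rely on. FI_ORDER_SAS below is one existing flag chosen as an example, and 'hints' is assumed to come from fi_allocinfo().

/* Request only the ordering actually needed, e.g. send-after-send. */
hints->tx_attr->msg_order = FI_ORDER_SAS;
hints->rx_attr->msg_order = FI_ORDER_SAS;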
They are still used by MPICH, which is needed by current CI runs.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Remove the option of directing transmit and receive completions
to separate CQs for the same endpoint.  The option adds complexity
at the provider and application levels.  This is largely the
result of needing SW-based protocols for certain operations, such
as tag matching, which either forces the app to drive progress
across multiple CQs or requires the provider to emulate the
application's CQs in SW.

This change updates the man page only.  Provider developers are
left to update their code bases separately.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
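
A minimal sketch of the resulting binding model, assuming 'ep' and 'cq' have already been opened; with this change one CQ serves both directions of the endpoint.

/* Bind a single CQ for both transmit and receive completions. */
int ret = fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
if (ret)
        return ret;   /* handle bind failure */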
Disallow users from hand-crafting or hand-copying their own fi_info structs.
This allows the library to allocate hidden fields for internal use.
Plus, there's no need for apps to do this, given that the API calls are
way easier to use.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
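
A hedged sketch of the allocation path this change steers applications toward (a fragment inside application setup code; the caps set here are placeholders):

#include <rdma/fabric.h>

struct fi_info *hints = fi_allocinfo();          /* instead of calloc() */
hints->caps = FI_MSG | FI_RMA;                   /* placeholder capabilities */

struct fi_info *copy = fi_dupinfo(hints);        /* instead of memcpy()/manual copy */

fi_freeinfo(copy);
fi_freeinfo(hints);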
Add a new call that takes fi_info as input, which provides
consistency with the domain and endpoint open calls.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
Subsystem filtering isn't useful (and was likely never used).  Remove
the enum fi_log_subsys.  Instead, convert the API to accept int
flags.  This maintains API compatibility.  Update all current
FI_LOG_xxx subsys values to 0, which avoids needing to update
the providers to the new API; they effectively just pass in 0 for
the flags.

No actual flag values are defined.  Those become a placeholder
for future options.

The logging checks are simplified by this change.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
shefty and others added 4 commits May 14, 2024 17:14
Introduce the concept of peer groups.  A peer group is a set of
peers that are communicating together for some specific set of tasks.
Peer groups provide a lower-level mapping of HPC and AI communicators.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
The mem_tag_format field allows specifying precise tag formatting
options by taking as input a set of bit fields.  In practice, this
ends up being unusable to implement, resulting in the entire tag
simply being masked with ignore bits.

When the mem_tag_format value only has the lower bits set (< 256),
interpret the format as specific options.  Two new options are
defined, one aligned with MPI and the other with CCLs.  This
information can be used by providers to optimize for the separate
use cases.

Signed-off-by: Sean Hefty <sean.hefty@intel.com>
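
For illustration, requesting one of the new format options through hints; FI_TAG_HPC is the value shown in the enum earlier in this PR (expected to be renamed per the review discussion), and 'hints' is assumed to come from fi_allocinfo().

/* Values below 256 in mem_tag_format are interpreted as format options
 * rather than a bit-field mask. */
hints->ep_attr->mem_tag_format = FI_TAG_HPC;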
Add 16-bit and 8-bit floating point datatypes that are often used for
machine learning and AI applications.

Add atomic diff operation.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
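
An illustrative use of one new datatype with an existing atomic operation; 'ep', 'buf', 'count', 'desc', 'peer', 'addr', and 'key' are assumed to be set up elsewhere, and FI_BFLOAT16 is the name shown in this PR's man page diff.

#include <rdma/fi_atomic.h>

/* Atomic sum on a remote buffer of 16-bit brain floating point values. */
ssize_t ret = fi_atomic(ep, buf, count, desc, peer,
                        addr, key, FI_BFLOAT16, FI_SUM, NULL);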
Today, `ep_attr->max_msg_size` and `tx_attr->inject_size` define the
transport's size limitation for a single data transfer operation. This lacks
the flexibility to allow providers to have different limits on different
categories of operations (msg/tagged/rma/atomic). At the same time,
applications can't express usage models that have very different size
limits (e.g., only needing small send/recv, but wanting large RMA).

Here, a set of options is added to the fi_getopt()/fi_setopt() interface.
These options allow the maximum message size and inject size to be retrieved
or set independently for msg/tagged/rma/atomic operations. The existing
definitions of `ep_attr->max_msg_size` and `tx_attr->inject_size` remain
unchanged -- they are the upper bound of the limits for all the data
transfer functions.

A typical application will call fi_getopt() to retrieve the default values
first, and call fi_setopt() if lower values are desired. If -FI_ENOPROTOOPT
is returned, `ep_attr->max_msg_size` and `tx_attr->inject_size` should be
used.

Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com>
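
A hedged sketch of that getopt/setopt pattern. FI_OPT_MAX_TAGGED_SIZE stands in for whichever per-category option name the PR defines, and 'ep' and 'info' are assumed to exist.

#include <rdma/fi_endpoint.h>
#include <rdma/fi_errno.h>

size_t max_tagged, len = sizeof(max_tagged);
size_t wanted = 1 << 20;    /* this app only needs 1 MB tagged messages */
int ret;

/* Retrieve the provider default for the tagged category first ... */
ret = fi_getopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MAX_TAGGED_SIZE,
                &max_tagged, &len);
if (ret == -FI_ENOPROTOOPT)
        max_tagged = info->ep_attr->max_msg_size;   /* fall back to the global limit */
else if (!ret && max_tagged > wanted)
        /* ... then lower it if a smaller limit is sufficient. */
        fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_MAX_TAGGED_SIZE,
                  &wanted, sizeof(wanted));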
@miharulidze (Contributor) commented:

Dear all,

Is there any published schedule for the 2.0 API/library release, and/or a specific public mailing list or meeting where the ideas/release plans are discussed?

Thank you very much!

@shijin-aws (Contributor) commented May 28, 2024

For 8f17f05, I think it doesn't have to be OFI 2.0 only. Can we make a separate PR that can be merged to main and released in a future 1.x release?

@j-xiong (Contributor Author) commented May 28, 2024

Dear all,

Is there any published schedule for the 2.0 API/library release, and/or a specific public mailing list or meeting where the ideas/release plans are discussed?

Thank you very much!

There is a roadmap here:
#8049

Discussions are usually held at the bi-weekly ofiwg meeting on Tuesday mornings: https://www.openfabrics.org/my-calendar/

@j-xiong (Contributor Author) commented May 28, 2024

@shijin-aws Yes, we can do that.

@j-xiong (Contributor Author) commented Aug 6, 2024

Replaced with #10279 and #10250.

@j-xiong j-xiong closed this Aug 6, 2024