Support zero-copy non-contiguous send #12536

Open
pascal-boeschoten-hapteon opened this issue May 9, 2024 · 4 comments

Comments

@pascal-boeschoten-hapteon

Hello,

MPICH supports zero-copy intra-node sends for non-contiguous datatypes (although with some restrictions).
Could Open MPI add support as well? And could it be made to work inter-node, e.g. with UCX active messages?
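
To illustrate what I mean by a non-contiguous send, here is a minimal sketch (the strided layout and counts are made up for illustration):

#include <mpi.h>
#include <vector>

// Send every other row of a 1000x1000 float matrix as one derived datatype.
// Ideally the transport would handle this without packing into a staging buffer.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<float> matrix(1000 * 1000);

    // 500 blocks of 1000 floats each, strided by 2000 floats (every other row).
    MPI_Datatype every_other_row;
    MPI_Type_vector(500, 1000, 2000, MPI_FLOAT, &every_other_row);
    MPI_Type_commit(&every_other_row);

    if (rank == 0) {
        MPI_Send(matrix.data(), 1, every_other_row, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(matrix.data(), 1, every_other_row, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&every_other_row);
    MPI_Finalize();
    return 0;
}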

Kind regards,
Pascal Boeschoten

@bosilca
Member

bosilca commented May 9, 2024

I saw your question on the MPICH mailing list, but it was slightly different from what you asked on the OMPI user mailing list and from what you state in this issue. While MPICH does have some level of support for zero-copy over XPMEM, that support does not cover structs (which is the exact type you asked about on the OMPI user mailing list), nor does MPICH provide similar support when UCX is used as the underlying transport (for intra-node communications).

Thus, even if we add node-level support for zero-copy over XPMEM, it will become useless as soon as you enable UCX.

@pascal-boeschoten-hapteon
Author

Thanks, I see. Is UCX itself unable to do non-contiguous zero-copy (either intra-node or inter-node), or is it not supported on the OMPI/MPICH side?

Not having zero-copy for non-contiguous sends means certain data structures need to be split into many requests.
For example, if you have this:

#include <vector>

struct S {
    std::vector<float> vec_f;  // size 10k
    std::vector<int> vec_i;    // size 10k
    std::vector<double> vec_d; // size 10k
    std::vector<char> vec_c;   // size 10k
};
std::vector<S> vec_s; // size 100

Being able to send it in 1 request instead of 400 (one for each contiguous buffer) seems like it could be quite advantageous for both performance and ease of use, even if there are restrictions, e.g. that the buffer datatype must be MPI_BYTE (i.e. assuming the sender and receiver have the same architecture / a homogeneous cluster).

At the moment, sending such a data structure in one request with a struct datatype results in packing/unpacking, which is very slow for large buffers, so much so that it is significantly slower than sending the 400 zero-copy requests from the example above.
You mentioned on the mailing list that doing it in one zero-copy request is a complex operation that would be expensive and add significant latency.
But it seems like it could at least be an improvement over 400 separate zero-copy requests?
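
To make that concrete, this is roughly the kind of single-request send I have in mind, continuing from the struct above and assuming a homogeneous cluster so everything can be described as MPI_BYTE (just a sketch, not a proposal for a specific API):

// Describe all 400 contiguous buffers of vec_s as one hindexed datatype of
// MPI_BYTE blocks at absolute addresses, then send them in a single request.
// Today this path is packed/unpacked internally rather than sent zero-copy.
std::vector<int> block_lengths;
std::vector<MPI_Aint> displacements;

for (const S& s : vec_s) {
    auto add_block = [&](const void* ptr, int bytes) {
        MPI_Aint addr;
        MPI_Get_address(ptr, &addr);
        displacements.push_back(addr);
        block_lengths.push_back(bytes);
    };
    add_block(s.vec_f.data(), static_cast<int>(s.vec_f.size() * sizeof(float)));
    add_block(s.vec_i.data(), static_cast<int>(s.vec_i.size() * sizeof(int)));
    add_block(s.vec_d.data(), static_cast<int>(s.vec_d.size() * sizeof(double)));
    add_block(s.vec_c.data(), static_cast<int>(s.vec_c.size() * sizeof(char)));
}

MPI_Datatype all_buffers;
MPI_Type_create_hindexed(static_cast<int>(block_lengths.size()),
                         block_lengths.data(), displacements.data(),
                         MPI_BYTE, &all_buffers);
MPI_Type_commit(&all_buffers);

// One request instead of 400.
MPI_Send(MPI_BOTTOM, 1, all_buffers, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
MPI_Type_free(&all_buffers);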

Other use cases could be sending many sub-views of a large 2D array, or sending map-like/tree-like types.
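
For the 2D sub-view case, the datatype description itself is simple; a sketch with made-up dimensions:

// Describe a 256x256 sub-block of a row-major 4096x4096 double matrix,
// starting at row 1024, column 2048, and send it as one datatype.
int sizes[2]    = {4096, 4096};
int subsizes[2] = {256, 256};
int starts[2]   = {1024, 2048};

MPI_Datatype sub_view;
MPI_Type_create_subarray(2, sizes, subsizes, starts,
                         MPI_ORDER_C, MPI_DOUBLE, &sub_view);
MPI_Type_commit(&sub_view);

std::vector<double> matrix(4096ull * 4096ull);
MPI_Send(matrix.data(), 1, sub_view, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
MPI_Type_free(&sub_view);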

@yosefe
Contributor

yosefe commented May 10, 2024

@pascal-boeschoten-hapteon The UCX API currently does not optimize complex datatypes (for example, with zero-copy over RDMA or XPMEM), even though adding this has been discussed and considered.

@pascal-boeschoten-hapteon
Author

@yosefe Thanks, that's good to know.

@bosilca To give a bit more context: we've observed that when sending many large, complex data structures (similar to the one in the example above) to many other ranks, posting many small zero-copy requests is significantly slower than posting one big request with packing/unpacking. The sheer volume of requests seems to be the bottleneck, and we've seen up to a factor of 5 difference in throughput. But when it's just 1 rank sending to 1 other rank, the many small zero-copy requests are faster, as the packing/unpacking becomes limited by memory bandwidth. This suggests that if we could issue one big zero-copy request, the performance gain in the congested case would be very significant.
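
For reference, the "many small zero-copy requests" pattern we benchmarked looks roughly like this (a sketch, same S/vec_s as above), as opposed to the single hindexed-datatype send sketched earlier, which packs internally today:

// One non-blocking send per contiguous buffer: 400 requests per destination.
// Each send can be zero-copy, but the request volume becomes the bottleneck
// when many ranks are involved.
std::vector<MPI_Request> requests;
for (const S& s : vec_s) {
    MPI_Request req;
    MPI_Isend(s.vec_f.data(), static_cast<int>(s.vec_f.size()), MPI_FLOAT,
              /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD, &req);
    requests.push_back(req);
    MPI_Isend(s.vec_i.data(), static_cast<int>(s.vec_i.size()), MPI_INT,
              /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD, &req);
    requests.push_back(req);
    MPI_Isend(s.vec_d.data(), static_cast<int>(s.vec_d.size()), MPI_DOUBLE,
              /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD, &req);
    requests.push_back(req);
    MPI_Isend(s.vec_c.data(), static_cast<int>(s.vec_c.size()), MPI_CHAR,
              /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD, &req);
    requests.push_back(req);
}
MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
            MPI_STATUSES_IGNORE);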
