New collectives component: XPMEM-based Hierarchical Collectives (XHC) #11418
Can one of the admins verify this patch?
ok to test.
Your barrier graphs are missing legends or titles to be able to understand what they are showing. Different numbers of processes, I assume?
Ah sorry, something went wrong there; I fixed them and updated the post. It's 3 different nodes (and yes, different proc counts), and then one multi-node run across 48 nodes.
I might be missing something, but rather than the changes to HAN to support Reduce, why not save the previous Reduce function and call it if the parameters aren't sufficient for your implementation? An example of doing this is the sync collective, which doesn't actually implement any algorithms, but simply calls a barrier after calling the underlying algorithm for the non-synchronizing collectives.
I could do that with the fallback approach. In the future, I was thinking of implementing the intermediate buffers necessary on the leaders for the hierarchical reduction using a memory pool (not sure if opal/mpool does what I have in mind or not). This might then be further extended to have HAN's Allreduce register the rbuf to the pool as a "single-use" memory area, which would then be picked up by XHC's reduce when it requests a temporary buffer to use.
I think you're right; that's a bummer. I'm not a huge fan of the coupling between xhc and han, especially to the level in the last patch, but maybe it is unavoidable.
Other than the HAN/XHC coupling, that makes sense. The rbuf definition in the MPI standard is rather unfortunate, but such is life.
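The save-and-delegate pattern suggested above can be sketched as follows. This is an illustrative C sketch, not actual OMPI code: the function names, the `reduce_fn_t` signature, and the `MY_MAX_COUNT` precondition are all hypothetical stand-ins for the real component interfaces.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*reduce_fn_t)(const void *sbuf, void *rbuf, int count);

static int fallback_calls = 0;

/* Stand-in for the previously selected component's reduce function. */
static int fallback_reduce(const void *sbuf, void *rbuf, int count) {
    fallback_calls++;
    for (int i = 0; i < count; i++)
        ((int *)rbuf)[i] = ((const int *)sbuf)[i];
    return 0;
}

/* Saved at module-enable time, before this component takes over. */
static reduce_fn_t previous_reduce = fallback_reduce;

#define MY_MAX_COUNT 4  /* hypothetical precondition of this component */

static int my_reduce(const void *sbuf, void *rbuf, int count) {
    /* Parameters not sufficient for this implementation: delegate. */
    if (count > MY_MAX_COUNT)
        return previous_reduce(sbuf, rbuf, count);

    /* Optimized path (trivial stand-in here). */
    for (int i = 0; i < count; i++)
        ((int *)rbuf)[i] = ((const int *)sbuf)[i];
    return 0;
}
```

The appeal of this design is that the component needs no knowledge of which other component it is falling back to; it only keeps the function pointer that was in place before it.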
This is a big PR; it takes time to review and understand how it interacts with the rest of the OMPI internals. But here are some comments for you.
int return_code = OMPI_SUCCESS;
int ret;

errno = 0;
?
What is the question here? Why errno is set to 0? If so, it's for error reporting: in case something fails, I show what errno indicated. In the past I've had trouble where errno was set to non-zero outside the code in question, and when the code failed in a way that did not alter errno, the stale value misleadingly hinted at another error. I don't have strong feelings about it -- let me know if it would be best to remove it.
We report errors via the return code, not via errno. Moreover, why set it to success?
I also report init errors via a help text -- perhaps that could/should go away. In the olden times before smsc, xhc_init also handled xpmem initialization, so failures there were even more common due to xpmem misconfigurations, and it was helpful to be explicitly notified of them. So the help text there is in part a remnant that could potentially be removed.
Or maybe it should only appear when e.g. the priority is explicitly set to non-zero. Or maybe something like that would be nice in coll/base -- a mechanism to inform the user that something went wrong with something that they explicitly requested. Anyway.
Setting errno to 0 brings it to a known state. Suppose that some previously running code encounters an error, e.g. a failed open for a non-existing file, but does not abort because of it. My init may then fail for some unrelated reason, but errno will misleadingly indicate that the failure is related to "No such file or directory". (I'm not 100% sure on errno conventions and whether it's OK to reset it like that.)
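On the convention question: the C standard allows a program to set errno to 0, and no standard library function ever resets it to 0 itself, which is exactly why stale values can linger. A small sketch of the scenario described above (the function name and the failing init step are hypothetical):

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>

/* Hypothetical init step that can fail WITHOUT touching errno. */
static int some_init_step(void) {
    return -1;
}

const char *init_with_clean_errno(void) {
    /* Earlier, unrelated code leaves a stale errno behind: */
    open("/no/such/file/hopefully", O_RDONLY);  /* fails, sets ENOENT */

    /* Bring errno to a known state before our own work. */
    errno = 0;

    if (some_init_step() != 0) {
        /* Report errno only if something actually set it; otherwise the
         * stale ENOENT above would misleadingly be blamed. */
        return (errno != 0) ? strerror(errno)
                            : "init failed (errno not set)";
    }
    return "ok";
}
```

Without the `errno = 0` line, the error report would name the unrelated "No such file or directory" failure.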
ompi/mca/coll/xhc/coll_xhc.c
    return return_code;
}

void xhc_deinit(mca_coll_xhc_module_t *module) {
we tend to call these xhc_fini, and they should either follow the naming scheme or be static.
OK, I will adjust it. There are also a number of other functions that are not static and begin with xhc_, so I'll fix those as well.
I renamed it to xhc_fini to match the other coll components. I also fixed the naming of all non-static functions.
For the functions that are non-static because they are called across XHC's files, but are only intended to be used internally, I added some macros that omit the mca_coll_ prefix, so as to keep the names short and wieldy, and to offer a slight hint of their internal nature.
int return_code = OMPI_SUCCESS;
int ret;

comms = malloc((comms_size = 5) * sizeof(xhc_comm_t));
This is ugly; please don't initialize a variable in a function-call argument list!
I don't have too strong feelings about it, but I think it's rather nice; it is compact and groups together both the allocation of the memory and the assignment of the variable tracking its size. And it has all its parentheses! Or is there a correctness concern here?
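For the record, there is no correctness concern with the construct itself: the assignment expression `(comms_size = 5)` evaluates to 5, and since `comms_size` is modified only once in the statement there is no sequence-point hazard, so the objection is purely stylistic. A self-contained sketch with a stand-in type (`xhc_comm_t` here is a dummy struct, not the real one):

```c
#include <stdlib.h>

typedef struct { int id; } xhc_comm_t;  /* stand-in for the real struct */

/* Returns 0 on success, -1 on allocation failure. */
int alloc_comms(xhc_comm_t **out_comms, size_t *out_size) {
    size_t comms_size;
    xhc_comm_t *comms;

    /* Compact form: allocate and record the size in one statement. */
    comms = malloc((comms_size = 5) * sizeof(xhc_comm_t));

    /* Equivalent two-statement form:
     *   comms_size = 5;
     *   comms = malloc(comms_size * sizeof(xhc_comm_t));
     */

    if (comms == NULL)
        return -1;
    *out_comms = comms;
    *out_size = comms_size;
    return 0;
}
```

The two forms compile to the same thing; the disagreement is only about whether coupling the size-tracking assignment to the allocation aids or hurts readability.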
Thanks for the review and the initial comments. I addressed most of them with code changes.
bot:ibm:retest
Finally done with the review, and 👍. However, I would like to revisit the discussion above about the rbuf handling.
@bosilca I really like that solution.
I like it! I'll implement/test it on the XHC side and update the PR.
The only caveat concerns the root, for whom the rbuf is valid to begin with. It's certainly possible and better than the other hack, so I'll think it over. Part of the problem is the non-full implementation of reduce. In the future, the decision shouldn't be whether to fall back to another component or not (based on the presence of rbuf).
I see, you need more than local knowledge about the buffers. My patch will definitely not provide that; it just gives us all (aka implementers of collective algorithms) better knowledge of what we can use locally. Let me think a little about the buffer management; the cleaner approach would be a well-defined internal API for management of temporary buffers. If you don't mind, send me a message (email or slack) about the requirements you would need from such an API.
I got rid of the hacky approach for the HAN allreduce. XHC's reduce now keeps a sticky internal buffer that it manually allocates and grows in size when necessary. If rbuf is NULL (and thanks to #11552, a scenario where rbuf is non-NULL but is not valid is not possible), the internal buffer is used instead. During HAN's allreduce, in XHC's reduce, the rbuf will be present for all ranks, and in this case an internal rbuf won't need to be allocated. The only case in which XHC's reduce falls back to another component now is when ...
This PR should ideally be merged after #11552. See also #11650.
I also applied a bunch of misc fixes and a small new feature for the barrier's root rank (it should all be in github's comparison).
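A sticky, grow-on-demand scratch buffer of the kind described above can be sketched as follows. This is an illustrative sketch, not the actual XHC code; the function name and the use of file-scope state are assumptions (the real component would presumably hang the buffer off its module struct).

```c
#include <stdlib.h>

/* Persistent scratch buffer: allocated lazily, grown when a larger
 * request arrives, and kept around across calls so repeated reductions
 * don't re-allocate. */
static void *sticky_buf = NULL;
static size_t sticky_size = 0;

/* Return a scratch buffer of at least `needed` bytes, reusing and
 * growing the persistent allocation as necessary. NULL on failure. */
void *get_internal_rbuf(size_t needed) {
    if (needed > sticky_size) {
        void *p = realloc(sticky_buf, needed);
        if (p == NULL)
            return NULL;  /* old buffer (if any) remains valid */
        sticky_buf = p;
        sticky_size = needed;
    }
    return sticky_buf;
}
```

A reduce implementation would call this only when the caller's rbuf is NULL; when the rbuf is present for all ranks (as in HAN's allreduce), the internal buffer is never touched.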
Hello @bosilca and others, is now a good time to look into completing the integration of this PR? From what I understand, the only prerequisites are #11552 and #11650, correct? Can I assist in any way, e.g. with testing or code? I could look into putting together a PR to address the issue identified in #11650 (?). I'll also bring this PR up to speed with the latest main, and test to make sure all still works.
I performed some tests and measurements with this component. I think the performance improvements are very significant, and we should try to merge this in the near future, before we branch for the next release. (Short-message allreduce was the only area where the ...)
89d8897 to cbeff9a
With #11552 and #11650 completed, I believe we are now ready to merge this? Is there something else pending? Thanks for the benchmarks, Edgar; I'll take a look regarding the short-message allreduce. It's also possible that it has already been improved/fixed in upcoming versions; I might ask you at some point if you can repeat the test!
9508b41 to 8c032f2
There were some mpi4py CI failures, with the tests timing out (327, 328). They actually went away after I re-ran the CI twice. Kinda mysterious, since all changes introduced here are not expected to be utilized at all unless specifically activated. I'll try to see if I can reproduce the failures on my machine. In the meantime, if there are any insights on the mpi4py CI and whether the failures are related to this PR or not...
WRT the mpi4py failures, they look to be related to problems we are seeing with intercommunicators. Intermittent. They are highly unlikely to be triggered by the work in this PR.
@gkatev We should have brought this to your attention earlier. During our developer meeting last week, we had a constructive discussion about this PR. The consensus is that the change is rather safe, as it does not alter default collective behaviors. However, from a code-maintenance perspective, the community is severely under-staffed (we are all volunteers). We want to understand if your sponsor would be committed to maintaining XHC in the short to long term. This includes fixing bugs, as well as reviewing PRs from other contributors.
@gkatev I sent you an email earlier this week about the topics mentioned above but have not received an answer yet.
@bosilca Sorry, I didn't know that.
No problem; my emails either bounced back (the one from the XHC paper) or acted as a black hole (his github email). Let's hope a direct poke at @gkatev works better ;)
Hi, sorry for the delay in responding; I just sent a reply. (Short answer: yes, we are able to support this work as necessary!) (PS: the email from the paper will stop bouncing quite soon...)
Please revise this PR to account for the merge of the init/fini PR (#12429).
XHC implements hierarchical and topology-aware intra-node collectives, using XPMEM for inter-process communication. See the README for more information. Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
Signed-off-by: George Katevenis <gkatev@ics.forth.gr>
ready
Hello,
This PR proposes the XPMEM-based Hierarchical Collectives (XHC) component.
XHC implements optimized intra-node (only) collectives according to a topology-aware n-level hierarchy, especially suited to nodes with high core counts and elaborate topologies (though not limited to them). Supported topological features include NUMA nodes, CPU sockets, and shared caches (based on hwloc, through OMPI's internal process-to-process distance mappings). The design is centered around communication using XPMEM, but Bcast also supports CMA and KNEM via opal/smsc. Currently Bcast, Allreduce and Barrier are supported. Reduce is implemented partially (and disabled by default).
XHC is also combinable with coll/HAN for multi-node runs.
A framework for hierarchical single-copy MPI collectives on multicore nodes
https://doi.org/10.1109/CLUSTER51413.2022.00024
Besides the main commit that adds the component, this PR includes two auxiliary commits:
The 2nd commit, an allegedly mostly-trivial one, adds support in coll/HAN for selecting XHC as the component to use for the intra-comm.
The 3rd commit, which is a temporary hack, allows XHC's partially-implemented MPI_Reduce to be used for HAN's Allreduce. XHC's reduce is (for now) implemented as a sub-case of allreduce and requires that the rbuf param be valid for all ranks. I will understand if this commit is not desirable :-) (the reason for its existence would be alleged increased performance potential).
Below are some benchmarks from various nodes and operations:
Allreduce
Broadcast
Barrier
HAN+XHC Allreduce
HAN+XHC Broadcast
HAN+XHC Barrier