-
Couldn't load subscription status.
- Fork 928
coll: Remove hcoll component #13467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll: Remove hcoll component #13467
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also remove hcoll from all the docs.
This PRs removes any reference to HCOLL in main, at least grepping for hcoll/hcol or HCOLL/HCOL I don't see anything, so we should be good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PRs removes any reference to HCOLL in main, at least grepping for hcoll/hcol or HCOLL/HCOL I don't see anything, so we should be good.
I'm a dope -- I did my grep wrong. This PR looks good.
|
|
||
| /* Since each process sends several non-contiguos blocks of data, each block sent (and therefore each send and recv call) needs a different tag. */ | ||
| /* As base OpenMPI only provides one tag for allgather, we are forced to use a tag space from other components in the send and recv calls */ | ||
| MCA_PML_CALL(isend(tmpsend + (ptrdiff_t) send_disp * scount * rext, scount, rdtype, sendto, MCA_COLL_BASE_TAG_HCOLL_BASE - send_disp, MCA_PML_BASE_SEND_STANDARD, comm, requests + transfer_count)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is problematic:
- for jobs with more than 1024 ranks we will use tags into an unreserved space (outside the
[ MCA_COLL_BASE_TAG_NEIGHBOR_BASE .. MCA_COLL_BASE_TAG_NEIGHBOR_END ]which is only 1024 long). - it changes all the collective components internal messaging. Which means we need to bump the component version up to break compatibility with older components
- I think 2 means we cannot do it in the middle of a stable series, soas is this PR is only meant for 6.0. If we want to bring it into the 5.x we should leave the HCOLL tag space as is, and only clean it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't going into v5.x only into into v6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine, I just wanted to make sure we are well aware of this. However, I think the change to the collective component version is still required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I'll adjust
This commit completely removes the mca/coll/hcoll component from the Open MPI source code. The hcoll component provided integration with Mellanox's Hierarchical Collectives library for collective operation offload. Changes include: - Removed ompi/mca/coll/hcoll component directory and all source files - Removed config/ompi_check_libhcoll.m4 configuration macro - Updated coll_tags.h to remove HCOLL tag space definitions - Updated coll_base_allgather.c and coll_base_allgatherv.c to use NEIGHBOR tag space instead of HCOLL tag space - Removed hcoll references from platform configuration files - Removed hcoll documentation and configuration options - Removed hcoll references from code comments Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
| (1 == component->mca_type_minor_version && | ||
| 0 == component->mca_type_release_version))) { | ||
|
|
||
| return query_3_0_0((const mca_coll_base_component_3_0_0_t *)component, comm, priority, module); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you have bumped to 3.1.0, please rename this function to query_3_1_0 -- the convention is to name everything according to the highest required version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course
|
If it was already bumped for 6.x then there is no need to bump it again. I missed that bump. |
|
Did you actually change the coll interface? I didn't review the entire PR -- I just saw that @bosilca told you to change the interface number to 3.1.0, and assumed that that was because you changed the coll interface. If this PR isn't changing the coll interface, then there's no reason to update the coll interface version number to 3.1.0. But if you are changing the coll interface, then yes, you do need to change the version number (e.g., to 3.1.0). The coll interface version is a different version number than the main OMPI version number (i.e., it's a different value than / not related to / not comparable to v5.0.x or v6.0.x). |
|
He didn't change the API, he changed the space for the pt2pt tags used by collective components, which is as problematic as changing the API. I think what @janjust is saying is that there is release with the coll components are version 3.0.0 yet, because there is no 6.x release, so we might not need to bump the coll component version to 3.1. |
Agreed - this is changing the interface. Did we change the coll API version to 3.0.0 just because the main OMPI version went to 6.0.0? Or some other reason? Regardless, now I grok: the v5.0.x series had coll API version < 3.0.0, so v6.0.x will be the first release with coll aPI version 3.0.0. So: no need to bump it again to 3.1.0. |
|
that's right not changing the interface but i'm reverting the pt2pt tags used by different collective components. |
|
ok great, I just removed the version bump, so can I get a thumbs up and let's merge this |
This commit completely removes the mca/coll/hcoll component from the Open MPI source code. The hcoll component provided integration with Mellanox's Hierarchical Collectives library for collective operation offload.
Changes include: