COLL framework needs work to support MPI Bigcount #12336

Open
hppritcha opened this issue Feb 14, 2024 · 4 comments

@hppritcha
Member

hppritcha commented Feb 14, 2024

While working on #12226, we discovered that, unlike the PML API, the COLL API is not ready for big count.
One option would be to extend the existing table of coll methods with entry points for large count functions. The advantage is that big count support would initially need to be implemented only in the basic and maybe tuned components. The downside is that it would roughly double the size of the mca_coll_base_comm_coll_t struct. The alternative, generalizing the definitions of all the existing methods to support big count, has the downside of requiring us to go into every existing component and make sure its collective methods can handle MPI_Count and MPI_Aint; where a method can't be modified to support big count, the component's implementation of that particular collective operation would have to be disqualified.
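To make the first option concrete, here is a minimal sketch of a method table in which each existing slot gains a large count twin. Every name in it (`coll_table_sketch`, `coll_bcast_c`, `call_bcast_c`, the simplified signatures) is hypothetical and not taken from the Open MPI source, and `big_count_t` is only a stand-in for MPI_Count so the sketch compiles without mpi.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for MPI_Count (assumption, so the sketch builds without mpi.h). */
typedef long long big_count_t;

/* Hypothetical, heavily simplified collective signatures. */
typedef int (*bcast_fn_t)(void *buf, int count, int root);
typedef int (*bcast_c_fn_t)(void *buf, big_count_t count, int root);

/* Hypothetical table: each existing slot gets a large count twin,
 * which is why the struct roughly doubles in size. */
struct coll_table_sketch {
    bcast_fn_t   coll_bcast;   /* existing int-count entry point */
    bcast_c_fn_t coll_bcast_c; /* big count entry point, NULL if unsupported */
};

/* Placeholder implementations a component such as "basic" might provide. */
static int basic_bcast(void *buf, int count, int root)
{
    (void) buf; (void) count; (void) root;
    return 0;
}

static int basic_bcast_c(void *buf, big_count_t count, int root)
{
    (void) buf; (void) count; (void) root;
    return 0;
}

/* Returns 0 on success, -1 when the selected component has no
 * big count implementation of this collective. */
static int call_bcast_c(const struct coll_table_sketch *t, void *buf,
                        big_count_t count, int root)
{
    assert(NULL != t);
    if (NULL == t->coll_bcast_c) {
        return -1;
    }
    return t->coll_bcast_c(buf, count, root);
}
```

A component that leaves `coll_bcast_c` as NULL keeps working for int counts, which is what would allow big count support to start in only one or two components.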

Related to issue #9194 and PR #12226.

We do not plan to include this work in PR #12226, as it's already complex enough and is really targeted at the infrastructure for generating the _c MPI API C entry points for big count and the long-overdue correct implementation of the TS 29113 entry points for Fortran F08 (along with big count support on the Fortran side too).

@bosilca
Member

bosilca commented Feb 14, 2024

Not all APIs will need to be doubled. Every API handling a single count and displacement (per buffer) can simply be extended to take the larger count into account. However, the APIs that take arrays of counts and displacements, where access is more complicated (allgatherv, alltoallv, alltoallw, gatherv, reduce_scatter, scatterv, plus the non-blocking and persistent versions), will need to be doubled.
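A small sketch of why the two groups differ: a single int count is converted to a wider type implicitly at the call, but an array of int cannot be handed to code expecting an array of wider elements without an element-by-element conversion. The function names here are hypothetical, `size_t` merely stands in for a wide count type, and counts are assumed to be non-negative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical internal routine that already takes a wide single count. */
static size_t bytes_to_send(size_t count, size_t elem_size)
{
    return count * elem_size;
}

/* An int-count caller needs no extra code path: the value is simply
 * converted to the wider type when it is passed along. */
static size_t bytes_to_send_int(int count, size_t elem_size)
{
    assert(count >= 0);
    return bytes_to_send((size_t) count, elem_size);
}

/* Arrays are different: the element width changes, so an int[] cannot be
 * reinterpreted as a wider array. One workaround (an illustration only,
 * not something proposed in this thread) is an explicit widening copy. */
static size_t *widen_counts(const int *counts, int n)
{
    assert(n >= 0);
    size_t *wide = malloc((size_t) n * sizeof(*wide));
    if (NULL == wide) {
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        wide[i] = (size_t) counts[i];
    }
    return wide; /* caller frees */
}
```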

This path is a nightmare: it will basically force us to maintain two copies of the same, already complex code just to cope with the difference in count and displacement types. I think I would prefer to change the MCA coll type for counts/disps to void*, use macros to compute the right value, and then compile the code twice (once with int/int and once with MPI_Count/MPI_Aint).
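A rough sketch of the void* plus macros idea follows. In a real build the same source file would presumably be compiled twice with different type definitions; to stay self-contained, this sketch instead uses a macro to stamp out the same body twice within one translation unit. All names (`DEFINE_TOTAL_EXTENT`, `total_extent_int`, `total_extent_big`) are hypothetical, and `big_count_t` and `big_disp_t` are stand-ins for MPI_Count and MPI_Aint.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for MPI_Count and MPI_Aint (assumptions, to avoid mpi.h). */
typedef long long big_count_t;
typedef long long big_disp_t;

/* One body, written once. The counts and disps arrive as void* and are
 * given their real element types by the macro parameters. */
#define DEFINE_TOTAL_EXTENT(SUFFIX, COUNT_T, DISP_T)                       \
    static long long total_extent_##SUFFIX(const void *counts,            \
                                           const void *disps, int nprocs) \
    {                                                                      \
        const COUNT_T *c = (const COUNT_T *) counts;                       \
        const DISP_T *d = (const DISP_T *) disps;                          \
        long long max_end = 0;                                             \
        assert(nprocs >= 0);                                               \
        for (int i = 0; i < nprocs; i++) {                                 \
            long long end = (long long) d[i] + (long long) c[i];           \
            if (end > max_end) {                                           \
                max_end = end;                                             \
            }                                                              \
        }                                                                  \
        return max_end;                                                    \
    }

/* Instantiate once with int/int and once with the wide types. */
DEFINE_TOTAL_EXTENT(int, int, int)
DEFINE_TOTAL_EXTENT(big, big_count_t, big_disp_t)
```

The algorithm exists in a single place, and only the table of entry points grows to hold both instantiations.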

@hppritcha
Member Author

> This path is a nightmare: it will basically force us to maintain two copies of the same, already complex code just to cope with the difference in count and displacement types. I think I would prefer to change the MCA coll type for counts/disps to void*, use macros to compute the right value, and then compile the code twice (once with int/int and once with MPI_Count/MPI_Aint).

Okay. Would you also propose adding an extra arg to the methods to indicate whether the app invoked a big count or a small count collective op?
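For comparison, the extra-argument variant asked about here could look roughly like the sketch below, where a run-time flag selects how a void* counts array is read. The names `count_at`, `sum_counts`, and `is_bigcount` are hypothetical, and `big_count_t` again stands in for MPI_Count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for MPI_Count (assumption, to avoid mpi.h). */
typedef long long big_count_t;

/* Read element i of a counts array whose element type is only known
 * at run time, via the flag passed down from the MPI entry point. */
static big_count_t count_at(const void *counts, bool is_bigcount, int i)
{
    assert(NULL != counts);
    return is_bigcount ? ((const big_count_t *) counts)[i]
                       : (big_count_t) ((const int *) counts)[i];
}

/* A single compiled body serves both callers, at the cost of a branch
 * on every access. */
static big_count_t sum_counts(const void *counts, bool is_bigcount, int n)
{
    big_count_t total = 0;
    for (int i = 0; i < n; i++) {
        total += count_at(counts, is_bigcount, i);
    }
    return total;
}
```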

@bosilca
Member

bosilca commented Feb 14, 2024

That is also a possible approach, but not the one I was going for. My idea would double the size of the coll structure and add a little code to the build infrastructure, but would not double the size of the collective code.

@hppritcha
Member Author

Oh I see, that's kind of what @jtronge and I were thinking of doing, then.

hppritcha moved this from To do to In progress in MPI 4.0 compliance Apr 24, 2024
hppritcha pushed a commit to hppritcha/ompi that referenced this issue May 9, 2024
This commit adds only those functions which make use of C integer promotion.
So none of the 'v,w' and reduce_scatter related methods are added in this PR.

Related to open-mpi#12336

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha pushed a commit to hppritcha/ompi that referenced this issue May 9, 2024
This commit adds only those functions which make use of C integer promotion.
So none of the 'v,w' and reduce_scatter related methods are added in this PR.

Related to open-mpi#12336
Pieces of open-mpi#12478 were taken out to make this PR.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
hppritcha pushed a commit to hppritcha/ompi that referenced this issue May 20, 2024
This commit adds only those functions which make use of C integer promotion.
So none of the 'v,w' and reduce_scatter related methods are added in this PR.

Related to open-mpi#12336
Pieces of open-mpi#12478 were taken out to make this PR.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>