
Inconsistent & Perhaps buggy dynamic tuning behavior #7672

Open
wckzhang opened this issue Apr 29, 2020 · 6 comments

@wckzhang
Contributor

wckzhang commented Apr 29, 2020

Dynamic tuning files are organized in this format:

1             # Number of collectives
1             # Collective ID
1             # Number of com sizes
2             # Com size
2             # Number of message sizes
0 1 0 0       # Message size 0, algorithm 1, topo and segmentation at 0
1024 2 0 0    # Message size 1024, algorithm 2, topo and segmentation at 0
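
The lookup this file implies can be sketched as follows (illustrative Python, not the actual OMPI parser; the nesting of collective ID → comm size → message-size thresholds mirrors the file above):

```python
# Illustrative sketch of the dynamic-rule lookup implied by the file format
# above. The data structure and function names are hypothetical, not OMPI code.
rules = {
    1: {                        # collective ID
        2: [                    # comm size
            (0,    (1, 0, 0)),  # message size -> (algorithm, topo, segmentation)
            (1024, (2, 0, 0)),
        ],
    },
}

def lookup(coll_id, comm_size, msg_size):
    """Return the (algorithm, topo, segmentation) tuple for the largest
    message-size threshold that does not exceed msg_size."""
    best = None
    for threshold, params in rules[coll_id][comm_size]:
        if msg_size >= threshold:
            best = params
    return best
```

With the example rules, a 512-byte message selects algorithm 1 and a 2048-byte message selects algorithm 2.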

However, the actual definition of "Message size" is vague and dependent on the collective in question (https://github.com/open-mpi/ompi/blob/master/ompi/mca/coll/tuned/coll_tuned_decision_dynamic.c)

The following table represents the behavior of each collective's interpretation of "Message size"

| Collective MPI Call | Message Size in coll tuned |
| --- | --- |
| MPI_Allreduce | datatype size * number of elements in send buffer |
| MPI_Alltoall | datatype size * comm size * number of elements to send to each process |
| MPI_Alltoallv | 0 |
| MPI_Barrier | 0 |
| MPI_Bcast | datatype size * number of entries in buffer |
| MPI_Reduce | datatype size * number of elements in send buffer |
| MPI_Reduce_scatter | datatype size * sum of number of elements in result distributed to each process (sum of recvcounts) |
| MPI_Reduce_scatter_block | datatype size * comm size * element count per block |
| MPI_Allgather | datatype size * comm size * number of elements in send buffer |
| MPI_Allgatherv | datatype size * sum of number of elements to be received from each process (sum of recvcounts) |
| MPI_Gather | datatype size * comm size |
| MPI_Scatter | datatype size * comm size |
| MPI_Exscan | datatype size * comm size |
| MPI_Scan | datatype size * comm size |
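
The table can be summarized as a small dispatch function (an illustrative sketch, not OMPI code; the function name and argument shapes are assumptions made for clarity):

```python
# Sketch of how coll/tuned derives the "message size" used for the dynamic-rule
# lookup, per the table above. Hypothetical helper, not the actual OMPI code.
def message_size(collective, dtype_size, count, comm_size, recvcounts=None):
    if collective in ("allreduce", "reduce", "bcast"):
        return dtype_size * count
    if collective in ("alltoall", "allgather", "reduce_scatter_block"):
        return dtype_size * comm_size * count
    if collective in ("alltoallv", "barrier"):
        return 0  # no size available (or not applicable)
    if collective in ("reduce_scatter", "allgatherv"):
        return dtype_size * sum(recvcounts)
    if collective in ("gather", "scatter", "exscan", "scan"):
        # Element count is ignored entirely -- the inconsistency raised here.
        return dtype_size * comm_size
    raise ValueError(f"unknown collective: {collective}")
```

Note how the last group discards the element count, so two gathers of wildly different sizes map to the same rule.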

There are a couple of issues with how this is done. First, given the way the decision file is parsed, each message size already has a corresponding comm size, so there's no reason to include comm size in the calculation; it gives us no additional data. Second, I don't see how the MPI_Gather implementation's interpretation of message size as datatype size * comm size is correct in any way.

We should have a more consistent definition of what message size means for our dynamic tuning file.

@bosilca
Member

bosilca commented Apr 30, 2020

The decision is made based on two principles: use a globally known amount so that the decision is deterministic across all processes, and use an amount that reflects how much data each process is supposed to receive, without taking the algorithm into account. This explains why for allgather we use comm size (because at the end each process will need to receive that amount), but we do not use the comm size for allreduce. Scan and Exscan are outliers that were copy/pasted from another algorithm without taking the above considerations into account.

@wckzhang
Contributor Author

@bosilca Can you explain why datatype size * comm size is used for MPI_Gather? I don't really understand why datatype size matters without the number of elements.

@wckzhang
Contributor Author

It's not that important whether comm size is used for each collective or not, but MPI_Gather just seems broken: it can only ever be tuned based on communicator size, never on the number of elements being sent.

@bosilca
Member

bosilca commented May 1, 2020

The comm size is an extremely important tuning parameter, as most of the papers highlight. But you are right, the number of elements should have been part of the gather decision.

@wckzhang
Contributor Author

wckzhang commented May 1, 2020

I do think comm size is important, but that information is already contained in the file format (since every message size already belongs to a comm size), so including it in the calculation seems a little extraneous; it's not a big deal, though. The datatype size * comm size calculation we should just change; that seems like a bug we should fix.

@wckzhang
Contributor Author

#7758 resolves this issue for gather/scatter
