
Clarify where MPI_PROC_NULL is permitted #87

Open
omor1 opened this issue Mar 13, 2018 · 45 comments

Comments

@omor1 (Member) commented Mar 13, 2018

Problem

MPI_PROC_NULL is currently specified in the Point-to-Point Communication chapter. It is unclear from the wording whether it is intended to be usable only from within point-to-point communication calls or anywhere a rank parameter is otherwise permitted. Note that MPI_PROC_NULL is also referenced from within the Process Topologies chapter, where it is defined as the neighbor of a process at the border of a Cartesian topology with a non-periodic dimension.

Proposal

The standard should clarify whether MPI_PROC_NULL is allowed in other contexts, e.g. as a source or destination in a (distributed) graph topology or as the target of a one-sided communication call.

Changes to the Text

The Null Processes section of the Point-to-Point Communication chapter will need to be changed to accommodate this clarification. The Process Topologies chapter may also need to clarify that although the neighbor acts as though it were MPI_PROC_NULL, such a construct cannot be specified by the user (should null processes be disallowed in contexts other than point-to-point communication).

Should null processes be allowed as the root of a collective communication, the impact on correctness may need to be described (e.g. if the root is MPI_PROC_NULL, the call returns as soon as possible without modifying the buffers, but processes must still call collectives in the same order on the same communicator).

Impact on Implementations

The impact depends on the direction of the clarification. If MPI_PROC_NULL is disallowed outside point-to-point communication, implementations that currently accept it in other contexts may need to add a runtime error check for it. If it is allowed, implementations that do not currently accept it may need to add special behavior (e.g. an additional branch).

Impact on Users

Users may be able to define certain behavior more cleanly (for instance, a binary tree process topology). Current code remains valid.

References

#138: Ranks are currently described in some places as 'non-negative integers', which would preclude the permissive resolution of this proposal on implementations that define MPI_PROC_NULL as a negative integer. It has no impact if the clarification is a narrowing one.
PR: mpi-forum/mpi-standard#43

@hjelmn commented Mar 13, 2018

I don't see the benefit of this change. Can you give an example of what would be easier if MPI_PROC_NULL is allowed for dist graph?

@omor1 (Member, Author) commented Mar 13, 2018

A clear example is a virtual binary tree topology. Nodes have between 1 and 3 neighbors (a parent and up to two children); most nodes are inner nodes and have all three, some are leaves and have only one, and a few have two (the root and some inner nodes in a non-full tree).

Currently, this must be implemented as follows:

MPI_Comm tree;
int world_rank;
int world_size;
int neighbors[3];
int n = 0;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

if (world_rank > 0)                    /* parent (every rank but the root) */
        neighbors[n++] = (world_rank-1)/2;
if (2*world_rank+1 < world_size)       /* left child */
        neighbors[n++] = 2*world_rank+1;
if (2*world_rank+2 < world_size)       /* right child */
        neighbors[n++] = 2*world_rank+2;

MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               n, neighbors, MPI_UNWEIGHTED,
                               n, neighbors, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, 1 /* reorder */, &tree);

It then becomes difficult to tell what is in neighbors[0]; is this a parent, or a left child? This differs depending on whether or not you are the root. The same is true for neighbors[1]; this could be a left child, a right child, or a bogus value.

Suppose I later want to have every node send something to their left child; to do so, I must either store the left child separately from the neighbors array, or store what index in the neighbors array contains the left child, or recalculate either of those on demand—each time!

While this isn't difficult for a binary tree, it is easy to see how in a more general topology that is mostly homogeneous this may become a burden.

With this clarification, this instead becomes

MPI_Comm tree;
int world_rank;
int world_size;
int neighbors[3];

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

neighbors[0] = world_rank > 0              ? (world_rank-1)/2 : MPI_PROC_NULL;  /* parent */
neighbors[1] = 2*world_rank+1 < world_size ? 2*world_rank+1   : MPI_PROC_NULL;  /* left child */
neighbors[2] = 2*world_rank+2 < world_size ? 2*world_rank+2   : MPI_PROC_NULL;  /* right child */

MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               3, neighbors, MPI_UNWEIGHTED,
                               3, neighbors, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, 1 /* reorder */, &tree);

Now neighbors[0] is always the parent; neighbors[1] is always the left child; and neighbors[2] is always the right child. These may be nonexistent (MPI_PROC_NULL), but sending and receiving from them is never incorrect. In short, this helps simplify operations near the boundaries—exactly what MPI_PROC_NULL is meant to be used for!

In many instances, it is convenient to specify a "dummy" source or destination for communication. This simplifies the code that is needed for dealing with boundaries, for example, in the case of a non-circular shift done with calls to send-receive. (MPI 3.1, Section 3.11, Page 80)
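
For instance, with the fixed layout above, forwarding a value down the leftmost branch of the tree needs no special cases at all. This is a minimal sketch built on the arrays above; communication goes over MPI_COMM_WORLD because the neighbors array holds MPI_COMM_WORLD ranks, and the sends to MPI_PROC_NULL at the leaves simply complete immediately:

int value;

if (world_rank % 2 == 1)        /* left children are the odd ranks: wait for the parent */
        MPI_Recv(&value, 1, MPI_INT, neighbors[0], 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
else
        value = world_rank;     /* the root and right children start a new chain */

/* Forward down the tree; a no-op whenever neighbors[1] is MPI_PROC_NULL. */
MPI_Send(&value, 1, MPI_INT, neighbors[1], 0, MPI_COMM_WORLD);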

A further point is that currently, a distributed graph topology cannot emulate a cartesian topology! I believe that this is not the intention of the MPI specification; a graph topology is intended to be the most general, and it is only because cartesian topologies are so common that the cartesian topology functions are provided (so as to simplify the process).
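
For concreteness, here is a sketch of what the clarification would permit: a distributed graph that mirrors a 1-D non-periodic cartesian topology, with MPI_PROC_NULL at the two borders. Under a strict reading of "non-negative integers" this call is erroneous today; the names line and nbrs are illustrative only.

MPI_Comm line;
int rank, size;
int nbrs[2];

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

nbrs[0] = rank > 0        ? rank - 1 : MPI_PROC_NULL;   /* left neighbor  */
nbrs[1] = rank < size - 1 ? rank + 1 : MPI_PROC_NULL;   /* right neighbor */

MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               2, nbrs, MPI_UNWEIGHTED,
                               2, nbrs, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, 0 /* reorder */, &line);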

@omor1 (Member, Author) commented Mar 16, 2018

Also note that MPI states:

The special value MPI_PROC_NULL can be used instead of a rank wherever a source or a destination argument is required in a call.

So a strict reading of the text already requires this; this proposal merely seeks to clarify the fact, as some implementations do not currently support it.

(Actually, I've gone and tested—it seems that while Open MPI's behavior is erroneous, MPICH's appears to be correct. In any case, a clarification may be useful.)

dholmes-epcc-ed-ac-uk self-assigned this Mar 21, 2018

@hjelmn commented Apr 24, 2018

I would not call Open MPI's behavior erroneous. It follows an interpretation of the standard. Now, I agree that clarification is needed, but I don't yet agree with your interpretation.

@omor1 (Member, Author) commented Apr 29, 2018

Given that MPI_PROC_NULL is supposed to be able to be used wherever a rank is required as a source or destination, and coupled with the fact that the neighborhood collectives explicitly support MPI_PROC_NULL neighbors, rejecting a topology in which MPI_PROC_NULL is specified as the rank of a neighbor would appear to be erroneous—at least to me.

What is your reasoning for the validity of Open MPI's behavior? Keep in mind that the neighbors at the borders of non-periodic dimensions of Cartesian topologies are explicitly defined to be MPI_PROC_NULL by the standard.
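
(For reference, a minimal sketch of that already-defined Cartesian behavior; variable names are illustrative:)

MPI_Comm cart;
int world_size;
int dims[1] = {0}, periods[1] = {0};   /* one non-periodic dimension */
int left, right;
int sendval = 1, recvval = 0;

MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Dims_create(world_size, 1, dims);
MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);
MPI_Cart_shift(cart, 0, 1, &left, &right);   /* MPI_PROC_NULL at the borders */

/* Valid even at the ends of the line: on a border rank the transfer to/from
 * MPI_PROC_NULL completes immediately and leaves recvval untouched. */
MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
             &recvval, 1, MPI_INT, left,  0,
             cart, MPI_STATUS_IGNORE);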

It is interesting to note that, as far as I can tell, the distributed graph topology constructors are the only place in the standard where ranks are specified to be 'non-negative integers'; all other functions merely specify them as 'integers'. Even the older non-distributed graph topology constructor specifies the flattened edge representation of neighbors as 'integers'.

@hjelmn commented Apr 29, 2018

Exactly, the sources and destinations are non-negative. This requirement implicitly forbids MPI_PROC_NULL (which by definition MUST be negative to avoid conflicting with another valid rank). As such, Open MPI's behavior is not erroneous. To me it is also logical.

This behavior may or may not have been the original intent and there may be need for an errata (at which point we will change the behavior of Open MPI). This will certainly be discussed at the MPI Forum meeting in June.

@jdinan commented Apr 30, 2018

Does MPI require implementations to support INT_MAX processes? I'm not sure that non-negative in this context was intended to exclude MPI_PROC_NULL.

@tonyskjellum commented Apr 30, 2018

@omor1 (Member, Author) commented Apr 30, 2018

As far as I can tell, there is no requirement that an MPI implementation support INT_MAX processes or that MPI_PROC_NULL be negative. It is usually implemented as a negative integer (-2 in Open MPI, -1 in MPICH), but this is not required—it could conceivably be implemented as INT_MAX in an implementation that supported fewer than INT_MAX processes.

According to @hjelmn's reasoning, that would make whether or not MPI_PROC_NULL is a valid neighbor an artifact of the implementation, which is probably not the intent.

@hjelmn commented Apr 30, 2018

As I said, I do not know the original intent, but it does call out non-negative integers while the other functions do not. I interpret this to be any rank but null.

Who wrote the text for these functions?

@hjelmn commented Apr 30, 2018

FWIW, my interpretation is based on there being no material benefit to allowing null. Allowing null saves some minimal bookkeeping for apps, but I can see the restriction allowing for some optimization in the library. There is a cost to allowing null.

@omor1 (Member, Author) commented Apr 30, 2018

Distributed graph topologies were added in MPI-2.2. The author for Process Topologies for MPI-2.2 was Torsten Hoefler. He was also the editor and organizer for that chapter for MPI-3.0 and MPI-3.1 and is still the chair for the upcoming MPI-4.0.

Certainly, every feature has a cost associated with it—hence why determining what the correct behavior ought to be and ensuring that implementations function according to said behavior is important.

@wgropp commented Apr 30, 2018

@hjelmn commented Apr 30, 2018

@wgropp Ok, then we probably should fix this by dropping non-negative so that the argument descriptions match the other calls. Is this a ticket 0 change?

@wgropp commented Apr 30, 2018

@dholmes-epcc-ed-ac-uk (Member) commented Apr 30, 2018

I think there is a reasonable argument for choosing to make this an errata in order to increase its visibility. I have always assumed, through osmosis, that MPI_PROC_NULL is not permitted for (dist) graph topologies. I agree that there is no strong indication in the text of the MPI Standard stating that restriction, but there is a clear general statement to the contrary. This may be a common misconception, and it may behove the MPI Forum to publicise this change much more loudly than is typically done for a ticket 0 change.

@wgropp so, for dist graph, the text for the sources and destinations arguments would read "array of valid rank values" instead of "array of non-negative integers"? And, for graph, the text for the edges argument would read "array of valid rank values describing graph edges (see below)"?

I think an explicit sentence akin to the RMA errata should be added for each case as well.
For dist graph, how about adding (at line 34 on page 297):

MPI_PROC_NULL is a valid rank value for \mpiarg{sources} or for \mpiarg{destinations}.

For graph, how about adding (at line 43 on page 294):

MPI_PROC_NULL is a valid rank value for \mpiarg{edges}.

@wgropp commented May 1, 2018

@dholmes-epcc-ed-ac-uk (Member) commented May 22, 2018

Suggested change, based on MPI-3.x (as of a few moments ago), with change-bars and change log entry:
mpi-report-ticket87.pdf

@omor1 (Member, Author) commented May 22, 2018

  1. "array of non-negative integers" should be replaced by "array of valid rank values" in MPI_DIST_GRAPH_CREATE_ADJACENT as well (for both sources and destinations) on page 296 (298 in draft).
  2. The following should be added at line 34 of page 297 (page 299 in draft) [it was added at line 32 of page 299 (35 of 301 in draft) instead, it's required in both]:

MPI_PROC_NULL is a valid rank value for \mpiarg{sources} or for \mpiarg{destinations}.

  1. Change-Log references wrong section and pages for these changes (§ 6.4.2 on page 239 instead of § 7.5.3 on page 296 and § 7.5.4 on pages 298, 299, 300, and 301)
  2. Slight typo on page 300—should be "rank", not "rnak" :)

@omor1 (Member, Author) commented May 22, 2018

Similar text would probably be needed for the new functions if #78 and/or #84 are accepted.

@dholmes-epcc-ed-ac-uk (Member) commented May 23, 2018

@omor1 thanks for the careful review 👍

Here's the fixed version:
mpi-report-ticket87.pdf

I expect that this errata change will be passed by the MPI Forum before issues #78, #84, and #93; the authors of those issues will be responsible for propagating this change to their new functions.

@dholmes-epcc-ed-ac-uk (Member) commented May 23, 2018

@jdinan (21 hours ago, on the pull request, non-public - hence copied here)

As long as we're clarifying this, it might be helpful to further clarify "MPI_PROC_NULL is valid ..., +and indicates that ...+" to also ensure the meaning of MPI_PROC_NULL is unambiguous.

@dholmes-epcc-ed-ac-uk just now

When explaining, we should cover several points:

  • communication with null neighbours completes normally, like with point-to-point to/from null processes or off the edges of a cartesian topology [the easy tee shot]
  • Graph and dist graph enquiry functions will include null neighbours just like non-null neighbours [the edge of the fairway]
  • buffers/counts/datatypes/displacements/etc for null neighbours must be included for neighbourhood collectives - but not for other collectives [definitely into the rough now]

Alternatively, the graph and dist graph enquiry functions should remove all MPI_PROC_NULL values and give back only non-null neighbours (like a cartesian topology). That means getting back from a query function something different than what went into the creation function. In @omor1's example (issue #87 (comment), comment 13th Mar, binary tree), the user must keep their neighbours input array, perform MPI_GROUP_TRANSLATE_RANKS to get ranks in the new communicator (including MPI_PROC_NULL->MPI_PROC_NULL mappings), and avoid usage of the topology query functions. However, "the sequence of neighbours is defined as the sequence returned by " (MPI-3.1, section 7.6, page 314) [looks like we've found a water hazard]

None of this is trivial, so I agree that some additional explanation is required here.

@rlgraham32 commented Jun 15, 2018

After thinking it over, I withdraw my concern. Since proc null is not in the range of the communicator's ranks, and the implementation can store a version of the graph without the proc nulls, an implementation can access only the data that actually needs to be sent/received without any special logic for handling the proc null case.

@wesbland (Member) commented Jun 15, 2018

Is that true? If you remove the MPI_PROC_NULLs from your list of neighbors, the user's input to the collective will still include buffers (or perhaps NULL pointers) for those MPI_PROC_NULL MPI processes, so don't you need to do some adjustment before transmitting or receiving buffers?

For instance, if these are your neighbors:

0 | 1, 2
1 | NULL, 0
2 | 0, 3
3 | NULL, 2

Your buffers on ranks 1 and 3 will include extra entries that you'll need to ignore.
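
To make that concrete, here is a sketch of what rank 1 would pass to a neighborhood collective under the permissive reading being discussed; graph_comm is a hypothetical communicator built from the adjacency lists above, and the block for the NULL neighbor is supplied but never transferred:

/* Rank 1's neighbor list is {MPI_PROC_NULL, 0}: two blocks are required,
 * but block 0 is neither sent nor written because that neighbor is null. */
int sendbuf[2] = { 0 /* to MPI_PROC_NULL, ignored */, 11 /* to rank 0 */ };
int recvbuf[2] = { -1, -1 };   /* recvbuf[0] stays -1; recvbuf[1] is filled by rank 0 */

MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT,
                      recvbuf, 1, MPI_INT, graph_comm);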

@rlgraham32 commented Jun 15, 2018

I looked at the neighborhood alltoall and alltoallv, and a single pointer is used for both source and destination buffers. Data size and offsets, respectively, are used to get the address within the buffer, so there is really no portable way for at least the alltoall to specify an address for proc null. I may have missed something ...

@wesbland (Member) commented Jun 15, 2018

Ok. That's probably fine then. I just wanted to make sure we weren't requiring a bunch of new null checks for every operation.

@omor1 (Member, Author) commented Jun 15, 2018

the implementation can store a version of the graph without the proc nulls

Do you mean in addition to the version with them? They must be recorded somewhere, for the purpose of retrieving all neighbors with MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS (MPI-3.1, § 7.5.5, p. 309, lines 37–41):

The number of edges into and out of the process returned by MPI_DIST_GRAPH_NEIGHBORS_COUNT are the total number of such edges given in the call to MPI_DIST_GRAPH_CREATE_ADJACENT or MPI_DIST_GRAPH_CREATE (potentially by processes other than the calling process in the case of MPI_DIST_GRAPH_CREATE).

For this proposal, the input and/or output buffers could have blocks that are not modified by the neighborhood collectives, since the corresponding neighbor is MPI_PROC_NULL. These neighbors can't just be ignored; that bypasses the whole point of the proposal. In any case, this particular behavior is already mandated by the standard for cartesian topologies with non-periodic dimensions.

Did I misunderstand what you meant @rlgraham32? My impression of what you said was that an implementation could accept MPI_PROC_NULL neighbors, but then just ignore them.

@rlgraham32 commented Jun 15, 2018

The "version" referenced in my comment just means an implementation copy of the graph - implementation-private data.

I think you are right with your comment on proc null - I guess I was assuming that destinations were ordered such that the proc null in the graph would be first or last in the list, as with the example of the Cartesian shift, which I guess does not need to be the case.

So, one does need to check each neighbor to see if the data really needs to be sent or not - whether this is an MPI send routine, or some other logic, when MPI semantics are not relied on for the internal collective implementation logic.

As for Wesley's concern (and mine) about conditional logic, we can replace conditional logic by adding another vector internal to the implementation with offsets - another memory reference.
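
A hypothetical sketch of that internal offset-vector idea (names are illustrative, not taken from any real implementation): at topology-creation time the library records which neighbor indices are non-null, so a collective iterates only over real neighbors with a table lookup instead of a branch per neighbor.

#include <mpi.h>

/* Hypothetical helper inside an MPI library: collect the indices of the
 * non-null neighbors once, when the topology is created. real_idx must have
 * room for n_neighbors entries; the return value is how many are used.
 * During a neighborhood collective, real_idx[k] gives the block index in the
 * user's buffer for the k-th actual send/receive. */
static int compact_neighbors(const int *neighbors, int n_neighbors, int *real_idx)
{
    int n_real = 0;
    for (int i = 0; i < n_neighbors; i++)
        if (neighbors[i] != MPI_PROC_NULL)
            real_idx[n_real++] = i;
    return n_real;
}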

@omor1 (Member, Author) commented Jun 15, 2018

I expect null neighbors to be uncommon (seeing as this hasn't been brought up as an issue before now), so optimizing for the more common case where neighbors actually exist is probably better. In that case, the CPU branch predictors (which should be able to predict the branches with high accuracy, since no neighbor is MPI_PROC_NULL) might get better performance with conditional logic than having to dereference memory—but that's something that would require testing & optimization.

@bosilca (Member) commented Jun 18, 2018

Would this force implementations to support 2 versions, one optimized for the case without PROC_NULL and one generic (using more memory, as suggested by @rlgraham32)?

@dholmes-epcc-ed-ac-uk (Member) commented Jun 1, 2019

Is there still interest in pursuing this proposal?

It is marked as an errata (a little ambitious, perhaps), which means it has one chance to get into MPI-4.0 - specifically, it must be formally read at the next face-to-face meeting.

For the associated PR, I must choose between:

  1. leave base branch as mpi-3.x, close the PR - target: unmerged
  2. leave base branch as mpi-3.x, update the PR urgently, read at next meeting - target: errata for mpi-3.x
  3. update base branch to mpi-4.x, update the PR eventually, read whenever - target: change for mpi-4.x

@omor1 (Member, Author) commented Jun 1, 2019

@dholmes-epcc-ed-ac-uk actually I emailed the mpi-comments mailing list a couple days ago about a related (and extended) version of this.

The gist is that 'rank' arguments (source, destination, root, target, etc) are variously described as either 'integer' or 'non-negative integer'. There's no real reason I can see for this. Additionally, it's unclear whether MPI_PROC_NULL is meant to be useable only in point-to-point communication or more generally wherever a 'rank' argument is allowed (e.g. one can imagine calling MPI_PUT with a null process as the target).

The proposal would be to add 'rank integer' as a semantic term with a possible value in the range [0, N) for a group of size N. If more expansive null process support is desired, then the definition of 'rank integer' can be extended to also include MPI_PROC_NULL; otherwise, this can be specified only for those functions in the point-to-point communication chapter. Receive calls will also need to specify that they can take MPI_ANY_SOURCE (which they already do).

I wanted to pass it through mpi-comments to get feedback before writing and submitting a proposal, but I guess no one actually uses mailing lists nowadays. Should I submit it on mpi-forum/mpi-issues, or to mpiwg-semantic-terms/semantic-terms-issues? It would touch large parts of the standard and the maintainers of the different chapters may want input. (I'd expect little opposition to clarifying 'integer' versus 'non-negative integer', but whether MPI_PROC_NULL should be allowed is evidently much more controversial.)

In any case, that proposal would supersede this one I think, since it's more general. Actually, it should probably be split in two for the two different portions...

@dholmes-epcc-ed-ac-uk (Member) commented Jun 1, 2019

@omor1 this proposal/issue could be expanded in scope to encompass the broader problems with inconsistent descriptions of rank in the standard. Closing this issue and starting another loses the history of comments, which dooms us to repeat them on the next issue.

Re-reading the comments on this issue, there is a clear need to make some clarifying text changes for this problem - so option (1) should be dropped.

The comment from @schulzm (#87 (comment)) reminds me that the Forum decided this should be handled as a full change (2 votes), rather than errata (1 vote). In which case, this issue has probably already missed the MPI-4.0 boat - so option (2) should be dropped.

I think we are left with option (3) only.

@omor1 (Member, Author) commented Jun 1, 2019

@dholmes-epcc-ed-ac-uk right, I agree: I think this is more complex than an errata, and there's a clear need for clarification, so option (3) is left. I'll rewrite the proposal text and update the labels. Is it worth keeping the current proposal text around for archival purposes? If so, should I copy it to a comment or place it below the updated text?

@dholmes-epcc-ed-ac-uk (Member) commented Jun 3, 2019

@omor1 I think we could update this issue to encompass the wider context that you propose to tackle, rather than creating a new issue that would be disassociated from this conversation.

Alternatively, this issue (and its PR) can be closed and referenced from the new ones when they are created (with a couple of appropriate sentences of explanation).

@omor1 (Member, Author) commented Jun 4, 2019

I think that I'll create a new issue for the 'integer'/'non-negative integer' issue and update this one to expand where MPI_PROC_NULL should/shouldn't be permitted. The problems are of course related, but mostly orthogonal.

omor1 changed the title from "Clarification: allow MPI_PROC_NULL as neighbor in (distributed) graph topologies" to "Clarify where MPI_PROC_NULL is permitted" on Jun 4, 2019

@omor1 (Member, Author) commented Jun 4, 2019

Alright, I've updated this proposal. It now allows for clarification in either direction; i.e. either prohibit MPI_PROC_NULL in all contexts other than P2P calls, or allow it in all calls with 'rank' parameters. Or something in between, I suppose, as long as that is strictly clarified, though I'd like to see a strong rationale for that.

@dholmes-epcc-ed-ac-uk (Member) commented Oct 21, 2020

@omor1 is there still interest in pushing this issue forward? We are now running out of time for the mpi-4.0 release, just as we ran out of time for the mpi-3.1 release.
