
Clarification: allow MPI_PROC_NULL as neighbor in (distributed) graph topologies #87

Open
omor1 opened this Issue Mar 13, 2018 · 37 comments


@omor1

omor1 commented Mar 13, 2018

This was originally posted on the mpi-comments mailing list, but as @AndrewGaspar noted in #86, this repository is much more active.

Problem

The topology creation functions (MPI_GRAPH_CREATE, MPI_DIST_GRAPH_CREATE, and MPI_DIST_GRAPH_CREATE_ADJACENT) define a virtual topology, described as a communication graph. Edges are defined between sources and destinations, both of which are ranks of processes in the input communicator [0].
In some instances, such as a communication graph where most nodes have the same number of neighbors (for instance a binary tree topology), it may be useful to have a "dummy" source or destination. This is provided for in MPI as MPI_PROC_NULL [1]. Then a caller can have boundary nodes (such as a root or leaves in a tree) communicate with these "dummy" neighbors, simplifying topology code logic.
This is analogous to what is done in non-periodic Cartesian topologies [2]; the MPI 3.1 documentation on MPI_CART_SHIFT [3] states:

Depending on the periodicity of the Cartesian group in the specified coordinate direction, MPI_CART_SHIFT provides the identifiers for a circular or an end-off shift. In the case of an end-off shift, the value MPI_PROC_NULL may be returned in rank_source or rank_dest, indicating that the source or the destination for the shift is out of range.

Additionally, the MPI 3.1 documentation on Neighborhood Collective Communication on Process Topologies [4] states:

For a Cartesian topology, created with MPI_Cart_create, the sequence of neighbors in the send and receive buffers at each process is defined by order of the dimensions, first the neighbor in the negative direction and then in the positive direction with displacement 1. The numbers of sources and destinations in the communication routines are 2*ndims with ndims defined in MPI_Cart_create. If a neighbor does not exist, i.e., at the border of a Cartesian topology in the case of a non-periodic virtual grid dimension (i.e., periods[ . . . ]==false), then this neighbor is defined to be MPI_PROC_NULL.

If a neighbor in any of the functions is MPI_PROC_NULL, then the neighborhood collective communication behaves like a point-to-point communication with MPI_PROC_NULL in this direction. That is, the buffer is still part of the sequence of neighbors but it is neither communicated nor updated.
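To make the quoted Cartesian behavior concrete, here is a minimal sketch (variable names are illustrative, not taken from the standard): a 1-D non-periodic grid in which the border processes receive MPI_PROC_NULL from MPI_CART_SHIFT, and the corresponding blocks of the neighborhood-collective receive buffer are simply left untouched.

MPI_Comm cart;
int dims[1] = {0}, periods[1] = {0};
int size, me, left, right;
int recvbuf[2] = { -1, -1 };

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Dims_create(size, 1, dims);
MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);
MPI_Comm_rank(cart, &me);
MPI_Cart_shift(cart, 0, 1, &left, &right);   /* MPI_PROC_NULL at the ends */
MPI_Neighbor_allgather(&me, 1, MPI_INT, recvbuf, 1, MPI_INT, cart);
/* recvbuf[0] holds the left neighbor's rank and recvbuf[1] the right one;
 * at the borders the corresponding entry is still -1. */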

There is some additional discussion with @bosilca in [5].

Proposal

Explicitly allow MPI_PROC_NULL as a neighbor in graph topologies. Communication with such a neighbor in the neighborhood communication collectives (MPI_(I)NEIGHBOR_ALLGATHER(,V), MPI_(I)NEIGHBOR_ALLTOALL(,V,W)) is defined identically to point-to-point communications with MPI_PROC_NULL [1]:

A communication with process MPI_PROC_NULL has no effect. A send to MPI_PROC_NULL succeeds and returns as soon as possible. A receive from MPI_PROC_NULL succeeds and returns as soon as possible with no modifications to the receive buffer.

This is, as stated in [4], indeed already in the standard, and merely needs a clarification that MPI_PROC_NULL is explicitly permitted as a neighbor in graph topologies.
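For illustration, a minimal sketch of what the clarified semantics would permit (it is not guaranteed to run on implementations that currently reject MPI_PROC_NULL neighbors): each process pads its one-source/one-destination ring adjacency to a fixed degree of two with MPI_PROC_NULL, and the receive block for the null source is left untouched.

MPI_Comm ring;
int rank, size;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

int srcs[2] = { (rank - 1 + size) % size, MPI_PROC_NULL };
int dsts[2] = { (rank + 1) % size,        MPI_PROC_NULL };
int sendval = rank, recvbuf[2] = { -1, -1 };

MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               2, srcs, MPI_UNWEIGHTED,
                               2, dsts, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, 0, &ring);
MPI_Neighbor_allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, ring);
/* recvbuf[0] is the left neighbor's rank; recvbuf[1] corresponds to the
 * MPI_PROC_NULL source and is neither communicated nor updated (still -1). */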

Changes to the Text

On pages 295, 297, and 299, clarify that MPI_PROC_NULL may be the neighbor of a process. Sample wording includes:

Processes with MPI_PROC_NULL neighbors are allowed.
MPI_PROC_NULL may be a neighbor of a process.

Impact on Implementations

Implementations will be required to permit MPI_PROC_NULL as a valid neighbor in graph topologies, if they do not already. In most cases, this should be fairly simple: see [6].

Impact on Users

None for current users; all existing code will remain valid. This may open up new possibilities and allow for some simplification of existing code.

References

[0]: MPI 3.1, Section 7.5.3 & 7.5.4, Pages 294–302
[1]: MPI 3.1, Section 3.11, Pages 80–81
[2]: MPI 3.1, Section 7.5.1, Page 292
[3]: MPI 3.1, Section 7.5.6, Page 310
[4]: MPI 3.1, Section 7.5, Pages 314–315
[5]: open-mpi/ompi#4675
[6]: open-mpi/ompi#4898

PR: mpi-forum/mpi-standard#43

@hjelmn

hjelmn commented Mar 13, 2018

I don't see the benefit of this change. Can you give an example of what would be easier if MPI_PROC_NULL is allowed for dist graph?

@omor1

omor1 commented Mar 13, 2018

A clear example is a virtual binary tree topology. Nodes have between one and three neighbors (a parent and up to two children); most nodes are inner nodes and have all three, some are leaves and have only one, and a few have two (the root and some inner nodes in a non-full tree).

Currently, this must be implemented as follows:

MPI_Comm tree;
int world_rank;
int world_size;
int neighbors[3];
int n = 0;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

if (world_rank > 0)                     /* parent */
        neighbors[n++] = (world_rank-1)/2;
if (2*world_rank+1 < world_size)        /* left child */
        neighbors[n++] = 2*world_rank+1;
if (2*world_rank+2 < world_size)        /* right child */
        neighbors[n++] = 2*world_rank+2;

MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               n, neighbors, MPI_UNWEIGHTED,
                               n, neighbors, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, true, &tree);

It then becomes difficult to tell what is in neighbors[0]; is this a parent, or a left child? This differs depending on whether or not you are the root. The same is true for neighbors[1]; this could be a left child, a right child, or a bogus value.

Suppose I later want to have every node send something to their left child; to do so, I must either store the left child separately from the neighbors array, or store what index in the neighbors array contains the left child, or recalculate either of those on demand—each time!

While this isn't difficult for a binary tree, it is easy to see how in a more general topology that is mostly homogeneous this may become a burden.

With this clarification, this instead becomes

MPI_Comm tree;
int world_rank;
int world_size;
int neighbors[3];

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

neighbors[0] = world_rank > 0              ? (world_rank-1)/2 : MPI_PROC_NULL;
neighbors[1] = 2*world_rank+1 < world_size ? 2*world_rank+1   : MPI_PROC_NULL;
neighbors[2] = 2*world_rank+2 < world_size ? 2*world_rank+2   : MPI_PROC_NULL;

MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               3, neighbors, MPI_UNWEIGHTED,
                               3, neighbors, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, true, &tree);

Now neighbors[0] is always the parent; neighbors[1] is always the left child; and neighbors[2] is always the right child. These may be nonexistent (MPI_PROC_NULL), but sending to and receiving from them is never incorrect. In short, this helps simplify operations near the boundaries—exactly what MPI_PROC_NULL is meant to be used for!
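To make the "send to the left child" case from above concrete, here is a minimal sketch using the variables from the example (communication stays on MPI_COMM_WORLD, since the neighbors array holds world ranks). Note that point-to-point with MPI_PROC_NULL is already well defined; the clarification is only needed so that the padded neighbors array is accepted by the graph constructor in the first place.

int msg = world_rank, from_parent = -1;

/* Every process forwards its rank to both children; a send to a missing
 * child (MPI_PROC_NULL) completes immediately with no effect. */
MPI_Send(&msg, 1, MPI_INT, neighbors[1], 0, MPI_COMM_WORLD);  /* left child  */
MPI_Send(&msg, 1, MPI_INT, neighbors[2], 0, MPI_COMM_WORLD);  /* right child */

/* Every process receives from its parent; the root receives from
 * MPI_PROC_NULL, which returns at once without touching from_parent. */
MPI_Recv(&from_parent, 1, MPI_INT, neighbors[0], 0, MPI_COMM_WORLD,
         MPI_STATUS_IGNORE);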

In many instances, it is convenient to specify a "dummy" source or destination for communication. This simplifies the code that is needed for dealing with boundaries, for example, in the case of a non-circular shift done with calls to send-receive. (MPI 3.1, Section 3.11, Page 80)

A further point is that currently, a distributed graph topology cannot emulate a cartesian topology! I believe that this is not the intention of the MPI specification; a graph topology is intended to be the most general, and it is only because cartesian topologies are so common that the cartesian topology functions are provided (so as to simplify the process).

@omor1

omor1 commented Mar 16, 2018

Also note that MPI states:

The special value MPI_PROC_NULL can be used instead of a rank wherever a source or a destination argument is required in a call.

So a strict reading of the text would require this anyway; this proposal merely seeks to clarify the point, as some implementations do not currently support it.

(Actually, I've gone and tested—it seems that while Open MPI's behavior is erroneous, MPICH's appears to be correct. In any case, a clarification may be useful.)

@dholmes-epcc-ed-ac-uk dholmes-epcc-ed-ac-uk self-assigned this Mar 21, 2018

@hjelmn

hjelmn commented Apr 24, 2018

I would not call Open MPI's behavior erroneous. It follows an interpretation of the standard. Now, I agree that clarification is needed, but I don't yet agree with your interpretation.

@omor1

omor1 commented Apr 29, 2018

Given that MPI_PROC_NULL is supposed to be able to be used wherever a rank is required as a source or destination, and coupled with the fact that the neighborhood collectives explicitly support MPI_PROC_NULL neighbors, rejecting a topology in which MPI_PROC_NULL is specified as the rank of a neighbor would appear to be erroneous—at least to me.

What is your reasoning for the validity of Open MPI's behavior? Keep in mind that the neighbors at the borders of non-periodic dimensions of Cartesian topologies are explicitly defined to be MPI_PROC_NULL by the standard.

It is interesting to note that, as far as I can tell, the distributed graph topology constructors are the only place in the standard where ranks are specified to be 'non-negative integers'; all other functions merely specify them as 'integers'. Even the older non-distributed graph topology constructor specifies the flattened edge representation of neighbors as 'integers'.

@hjelmn

hjelmn commented Apr 29, 2018

Exactly, the sources and destinations are non-negative. This requirement implicitly forbids MPI_PROC_NULL (which by definition MUST be negative to avoid conflicting with another valid rank). As such Open MPI's behavior is not erroneous. To me it is also logical.

This behavior may or may not have been the original intent and there may be need for an errata (at which point we will change the behavior of Open MPI). This will certainly be discussed at the MPI Forum meeting in June.

@jdinan

jdinan commented Apr 30, 2018

Does MPI require implementations to support INT_MAX processes? I'm not sure that non-negative in this context was intended to exclude MPI_PROC_NULL.

@tonyskjellum

tonyskjellum commented Apr 30, 2018

@omor1

omor1 commented Apr 30, 2018

As far as I can tell, there is no requirement that an MPI implementation support INT_MAX processes or that MPI_PROC_NULL be negative. It is usually implemented as a negative integer (-2 in Open MPI, -1 in MPICH), but this is not required—it could conceivably be implemented as INT_MAX in an implementation that supported fewer than INT_MAX processes.

According to @hjelmn's reasoning, that would make whether or not MPI_PROC_NULL is a valid neighbor an artifact of the implementation, which is probably not the intent.

@hjelmn

hjelmn commented Apr 30, 2018

As I said, I do not know the original intent, but it does call out non-negative integers while the other functions do not. I interpret this to mean any rank but null.

Who wrote the text for these functions?

@hjelmn

hjelmn commented Apr 30, 2018

FWIW. My interpretation is based on there being no material benefit to allowing null. Allowing null saves some minimal bookkeeping for apps. But I can see how the limitation could allow for some optimization in the library. There is a cost to allowing null.

@omor1

omor1 commented Apr 30, 2018

Distributed graph topologies were added in MPI-2.2. The author for Process Topologies for MPI-2.2 was Torsten Hoefler. He was also the editor and organizer for that chapter for MPI-3.0 and MPI-3.1 and is still the chair for the upcoming MPI-4.0.

Certainly, every feature has a cost associated with it—hence why determining what the correct behavior ought to be and ensuring that implementations function according to said behavior is important.

@wgropp

wgropp commented Apr 30, 2018

@hjelmn

hjelmn commented Apr 30, 2018

@wgropp Ok, then we probably should fix this by dropping non-negative so that the argument descriptions match the other calls. Is this a ticket 0 change?

@wgropp

wgropp commented Apr 30, 2018

@dholmes-epcc-ed-ac-uk

dholmes-epcc-ed-ac-uk commented Apr 30, 2018

I think there is a reasonable argument for choosing to make this an errata in order to increase its visibility. I have always assumed, through osmosis, that MPI_PROC_NULL is not permitted for (dist) graph topologies. I agree that there is no strong indication in the text of the MPI Standard stating that restriction, but there is a clear general statement to the contrary. This may be a common misconception, and it may behove the MPI Forum to publicise this change much more loudly than is typically done for a ticket 0 change.

@wgropp so, for dist graph, the text for the sources and destinations arguments would read "array of valid rank values" instead of "array of non-negative integers"? And, for graph, the text for the edges argument would read "array of valid rank values describing graph edges (see below)"?

I think an explicit sentence akin to the RMA errata should be added for each case as well.
For dist graph, how about adding (at line 34 on page 297):

MPI_PROC_NULL is a valid rank value for \mpiarg{sources} or for \mpiarg{destinations}.

For graph, how about adding (at line 43 on page 294):

MPI_PROC_NULL is a valid rank value for \mpiarg{edges}.

@wgropp

wgropp commented May 1, 2018

@dholmes-epcc-ed-ac-uk

dholmes-epcc-ed-ac-uk commented May 22, 2018

Suggested change, based on MPI-3.x (as of a few moments ago), with change-bars and change log entry:
mpi-report-ticket87.pdf

@omor1

omor1 commented May 22, 2018

  1. "array of non-negative integers" should be replaced by "array of valid rank values" in MPI_DIST_GRAPH_CREATE_ADJACENT as well (for both sources and destinations) on page 296 (298 in draft).
  2. The following should be added at line 34 of page 297 (page 299 in draft) [it was added at line 32 of page 299 (35 of 301 in draft) instead, it's required in both]:

MPI_PROC_NULL is a valid rank value for \mpiarg{sources} or for \mpiarg{destinations}.

  1. Change-Log references wrong section and pages for these changes (§ 6.4.2 on page 239 instead of § 7.5.3 on page 296 and § 7.5.4 on pages 298, 299, 300, and 301)
  2. Slight typo on page 300—should be "rank", not "rnak" :)
@omor1

omor1 commented May 22, 2018

Similar text would probably be needed for the new functions if #78 and/or #84 are accepted.

@dholmes-epcc-ed-ac-uk

dholmes-epcc-ed-ac-uk commented May 23, 2018

@omor1 thanks for the careful review 👍

Here's the fixed version:
mpi-report-ticket87.pdf

I expect that this errata change will be passed by the MPI Forum before issues #78, #84, and #93; the authors of those issues will be responsible for propagating this change to their new functions.

@dholmes-epcc-ed-ac-uk

dholmes-epcc-ed-ac-uk commented May 23, 2018

@jdinan (21 hours ago, on the pull request, non-public - hence copied here)

As long as we're clarifying this, it might be helpful to further clarify "MPI_PROC_NULL is valid ..., +and indicates that ...+" to also ensure the meaning of MPI_PROC_NULL is unambiguous.

@dholmes-epcc-ed-ac-uk just now

When explaining, we should cover several points:

  • communication with null neighbours completes normally, like with point-to-point to/from null processes or off the edges of a cartesian topology [the easy tee shot]
  • Graph and dist graph enquiry functions will include null neighbours just like non-null neighbours [the edge of the fairway]
  • buffers/counts/datatypes/displacements/etc for null neighbours must be included for neighbourhood collectives - but not for other collectives [definitely into the rough now]

Alternatively, the graph and dist graph enquiry functions should remove all MPI_PROC_NULL values and give back only non-null neighbours (like a cartesian topology). That means getting back from a query function something different from what went into the creation function. In @omor1's example (the binary tree in the 13th Mar comment above), the user must keep their neighbours input array, perform MPI_GROUP_TRANSLATE_RANKS to get ranks in the new communicator (including MPI_PROC_NULL->MPI_PROC_NULL mappings), and avoid usage of the topology query functions. However, "the sequence of neighbours is defined as the sequence returned by " (MPI-3.1, section 7.6, page 314) [looks like we've found a water hazard]

None of this is trivial, so I agree that some additional explanation is required here.
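A rough sketch of the "keep your own neighbours array" workaround described above, assuming the tree communicator and the 3-entry neighbors array from the binary-tree example earlier in the thread; per the standard, MPI_GROUP_TRANSLATE_RANKS maps MPI_PROC_NULL to MPI_PROC_NULL.

MPI_Group world_grp, tree_grp;
int tree_neighbors[3];

MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
MPI_Comm_group(tree, &tree_grp);
/* Translate the user's own world-rank adjacency list into ranks of the
 * topology communicator; MPI_PROC_NULL entries stay MPI_PROC_NULL. */
MPI_Group_translate_ranks(world_grp, 3, neighbors, tree_grp, tree_neighbors);
MPI_Group_free(&world_grp);
MPI_Group_free(&tree_grp);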

@omor1

omor1 commented May 23, 2018

@dholmes-epcc-ed-ac-uk updated version looks good.

Regarding your points:

  1. § 7.6 already covers this:

If a neighbor in any of the functions is MPI_PROC_NULL, then the neighborhood collective communication behaves like a point-to-point communication with MPI_PROC_NULL in this direction. That is, the buffer is still part of the sequence of neighbors but it is neither communicated nor updated.

  2. The neighborhood collectives are defined in terms of point-to-point communications with all neighbors (see e.g. § 7.6.2). As per § 7.6 (the quote above), this includes the null neighbors. Thus the inquiry functions must return null neighbors. Cartesian topologies don't have a direct analogue to MPI_DIST_GRAPH_NEIGHBORS_COUNT or MPI_DIST_GRAPH_NEIGHBORS; the closest is MPI_CART_SHIFT with input disp=1, which does indeed return null neighbors at the borders of non-periodic dimensions.
  3. Since neighborhood collectives 'communicate' with null neighbors, buffers/counts/datatypes/displacements/etc. must be included for them. I don't think that the 'normal' collectives should communicate with null neighbors, though. How are normal collectives handled for Cartesian topologies with non-periodic dimensions? As far as I know, all the non-neighborhood collectives only communicate with real processes, and never more than once (whereas the neighborhood collectives can communicate with null processes and with the same process multiple times, in the case of an edge with multiplicity greater than 1). MPI_PROC_NULL 'always' belongs to every group (even MPI_GROUP_EMPTY), as per the definition of MPI_GROUP_TRANSLATE_RANKS, which states that the translation of MPI_PROC_NULL is always MPI_PROC_NULL—yet the null process is never included in 'normal' collective communications.

Some additional text clarifying interactions between communicators with virtual process topologies that have null neighbors or multiply-defined edges and non-neighborhood collective communications may be useful, but I think this is fairly unambiguous. (Side note: is it valid to pass MPI_PROC_NULL as the root in broadcast, gather, scatter, reduce, etc. operations for intracommunicators? This would appear to be ambiguous—it isn't explicitly excluded, and as per § 3.11 may be allowed. However, unlike with null neighbors, I fail to see the point of it, other than ensuring the generality of the operation.)

@dholmes-epcc-ed-ac-uk

dholmes-epcc-ed-ac-uk commented May 24, 2018

@omor1 I agree completely - my points in the list (I hope) follow the already-unambiguous text in §7.6 (without referencing it directly; thanks for finding the quotes) but show that if we start explaining (again) then the resulting text must be long and complex. Normal collectives are defined in terms of n point-to-point operations, where n is implied to be the result of MPI_COMM_SIZE, i.e. the number of real MPI processes in the communicator (e.g. MPI_GATHER, MPI-3.1, §5.5, p150, line 11, "as if the n processes in the group"). The "alternative" tries to make neighbourhood collectives like normal collectives but is, IMHO, untenable.

@jdinan Perhaps a cross-reference to §7.6 is sufficient? Such as "Section 7.6 describes how including MPI_PROC_NULL affects neighborhood collective operations."

@hjelmn This discussion about including MPI_PROC_NULL nodes (and duplicate nodes?) seems to indicate that the number of nodes in a distributed graph topology can be arbitrarily larger than the number of MPI processes in the communicator, which has implications for #89 (or, at least, for any future proposal to add MPI_DIST_GRAPH_MAP).

Given (MPI-3.1, §7.5.3, p294, line 35-36, re: MPI_GRAPH_CREATE):

The call is erroneous if it specifies a graph that is larger than the group size of the input communicator.

and (MPI-3.1, §7.5.3, p295, line 45-46, re: MPI_GRAPH_CREATE):

For a graph structure the number of nodes is equal to the number of processes in the group. Therefore, the number of nodes does not have to be stored explicitly.

it would seem that including MPI_PROC_NULL as a node in a non-distributed graph implies that one of the real MPI processes must be omitted as a node in that non-distributed graph.

The definitions for the distributed graph functions don't (I think) have this restriction.

@jdinan

jdinan commented May 24, 2018

This sounds ok. A couple of quick comments on the text: "valid" should be unnecessary; we would certainly like every argument to be valid. The new text specifying the set of valid ranks should also capture the requirement that ranks must be members of the group of the parent/old communicator in addition to MPI_PROC_NULL. I don't see this in the text (sorry if I missed it, I only had time for a quick skim). Elsewhere in the spec, set notation is used for this, but that is perhaps extra credit.

@omor1

omor1 commented May 24, 2018

@dholmes-epcc-ed-ac-uk @hjelmn counting MPI_PROC_NULL as a process in a communicator doesn't make sense. It isn't a 'real' process, and so poses a problem for collectives (MPI-3.1, §5.2.1, line 31-32):

All processes in the group identified by the intracommunicator must call the collective routine.

MPI_PROC_NULL can't call anything, so it can't be a process in the group identified by an intracommunicator (whether it has a virtual process topology or otherwise).

Also note that while non-distributed graphs explicitly state the number of nodes in the resulting graph in the constructor, the distributed graph constructors do not (MPI-3.1, §7.5.4, p297 line 36-37 and p299 line 34-35):

The number of processes in comm_dist_graph is identical to the number of processes in comm_old.

@schulzm

schulzm commented Jun 12, 2018

The forum decided that this should be handled as a full ticket and should be considered after further investigation of the implementation impact (having to check for NULL could lead to problems in optimized implementations). Either way, the forum decided that it should be clarified whether MPI_PROC_NULL is allowed or not.

@wgropp

wgropp commented Jun 13, 2018

@wesbland

wesbland commented Jun 13, 2018

@rlgraham32 will be able to more accurately represent his argument, but I think the main points were something like this:

Optimized implementations of neighborhood collectives (whether software or some future hardware) would not only have to check for MPI_PROC_NULL for the sake of not sending messages, but they would also have to shuffle buffers around to make sure the input and output buffers go to the correct places. Doing all of this extra shuffling is the concern as it may have a much greater performance impact than a single branch.

If an implementation is purely relying on point-to-point communication to implement neighborhood collectives (as all implementations we know of right now do), this has very little performance impact as you say because we only have to check for MPI_PROC_NULL, which we already do during all of those communication calls.
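To illustrate that second point, a purely illustrative sketch (not taken from Open MPI, MPICH, or any other implementation) of a neighborhood alltoall layered on nonblocking point-to-point, simplified to one MPI_INT per neighbor and a fixed tag: the MPI_PROC_NULL entries need no special handling, because MPI_Isend/MPI_Irecv already treat them as immediately completing no-ops while each buffer block keeps its position in the sequence of neighbors.

static int sketch_neighbor_alltoall(const int *sendbuf, int *recvbuf,
                                    int indegree, const int *sources,
                                    int outdegree, const int *destinations,
                                    MPI_Comm comm)
{
    MPI_Request reqs[indegree + outdegree];   /* C99 VLA, for brevity */
    int i;

    /* One receive per in-neighbor; a recv from MPI_PROC_NULL completes
     * immediately and leaves recvbuf[i] untouched. */
    for (i = 0; i < indegree; i++)
        MPI_Irecv(&recvbuf[i], 1, MPI_INT, sources[i], 0, comm, &reqs[i]);
    /* One send per out-neighbor; a send to MPI_PROC_NULL is a no-op. */
    for (i = 0; i < outdegree; i++)
        MPI_Isend(&sendbuf[i], 1, MPI_INT, destinations[i], 0, comm,
                  &reqs[indegree + i]);
    return MPI_Waitall(indegree + outdegree, reqs, MPI_STATUSES_IGNORE);
}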

@rlgraham32

rlgraham32 commented Jun 15, 2018

After thinking it over, I withdraw my concern. Since proc null is not in the range of the communicator's ranks, and the implementation can store a version of the graph without the proc nulls, an implementation can access only the data that needs to be sent/received without any special logic for handling the proc null case.

@wesbland

wesbland commented Jun 15, 2018

Is that true? If you remove the MPI_PROC_NULLs from your list of neighbors, the user's input to the collective will still include buffers (or perhaps NULL pointers) for those MPI_PROC_NULL processes, so don't you need to do some adjustment before transmitting or receiving buffers?

For instance, if these are your neighbors:

0 | 1, 2
1 | NULL, 0
2 | 0, 3
3 | NULL, 2

Your buffers on ranks 1 and 3 will include extra entries that you'll need to ignore.

@rlgraham32

rlgraham32 commented Jun 15, 2018

I looked at the neighborhood alltoall and alltoallv: a single base pointer is used for all of the source blocks, and likewise for all of the destination blocks, with the data size or the displacements, respectively, used to get the address within the buffer. So there is really no portable way, for the alltoall at least, to specify an address for proc null. I may have missed something ...

@wesbland

wesbland commented Jun 15, 2018

Ok. That's probably fine then. I just wanted to make sure we weren't requiring a bunch of new null checks for every operation.

@omor1

omor1 commented Jun 15, 2018

the implementation can store a version of the graph without the proc nulls

Do you mean in addition to the version with them? They must be recorded somewhere, for the purpose of retrieving all neighbors with MPI_DIST_GRAPH_NEIGHBORS_COUNT and MPI_DIST_GRAPH_NEIGHBORS (MPI-3.1, §7.5.5, p. 309, lines 37–41):

The number of edges into and out of the process returned by MPI_DIST_GRAPH_NEIGHBORS_COUNT are the total number of such edges given in the call to MPI_DIST_GRAPH_CREATE_ADJACENT or MPI_DIST_GRAPH_CREATE (potentially by processes other than the calling process in the case of MPI_DIST_GRAPH_CREATE).

For this proposal, the input and/or output buffers could have blocks that are not modified by the neighborhood collectives, since the corresponding neighbor is MPI_PROC_NULL. These neighbors can't just be ignored; that bypasses the whole point of the proposal. In any case, this particular behavior is already mandated by the standard for cartesian topologies with non-periodic dimensions.
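As a sketch of that reading (assuming the clarified semantics, and the tree communicator plus 3-entry adjacency arrays from the earlier binary-tree example; since that graph was created with MPI_UNWEIGHTED, MPI_UNWEIGHTED is passed for the weight arrays here as well):

int indegree, outdegree, weighted;
int in[3], out[3];

MPI_Dist_graph_neighbors_count(tree, &indegree, &outdegree, &weighted);
/* Under this reading, indegree == outdegree == 3 on every process,
 * even on the root and the leaves. */
MPI_Dist_graph_neighbors(tree, indegree, in, MPI_UNWEIGHTED,
                         outdegree, out, MPI_UNWEIGHTED);
/* For MPI_Dist_graph_create_adjacent the neighbors come back in the
 * order they were passed in, so in[0]/out[0] is MPI_PROC_NULL on the
 * root, and the child entries may be MPI_PROC_NULL on leaves and on
 * some inner nodes. */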

Did I misunderstand what you meant @rlgraham32? My impression of what you said was that an implementation could accept MPI_PROC_NULL neighbors, but then just ignore them.

@rlgraham32

rlgraham32 commented Jun 15, 2018

The "version" referenced in my comment just means an implementation copy of the graph - implementation-private data.

I think you are right with your comment on proc null - I guess I was assuming that destinations were ordered such that the proc nulls in the graph would be first or last in the list, as in the Cartesian shift example, which does not need to be the case.

So one does need to check each neighbor to see whether the data really needs to be sent or not - whether that check is inside an MPI send routine or in some other logic, when MPI semantics are not relied on for the internal collective implementation.

As for Wesley's concern (and mine) about conditional logic, we can replace the conditional logic by adding another vector, internal to the implementation, with offsets - at the cost of another memory reference.
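Purely for illustration (not from any implementation; names such as real_dest and real_offset are made up, and destinations/outdegree stand for the adjacency given at creation), one way to do that is to precompute, when the topology is created, a compacted destination list plus the buffer-block index of each real edge, so the collective's inner loop has no null test at all:

int nreal = 0;
int *real_dest   = malloc(outdegree * sizeof(int));   /* assumes <stdlib.h> */
int *real_offset = malloc(outdegree * sizeof(int));

for (int i = 0; i < outdegree; i++) {
    if (destinations[i] != MPI_PROC_NULL) {
        real_dest[nreal]   = destinations[i];
        real_offset[nreal] = i;   /* block index of this edge in sendbuf */
        nreal++;
    }
}
/* The collective then loops i over [0, nreal), sending the block at
 * sendbuf + real_offset[i] * extent to real_dest[i]. */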

@omor1

omor1 commented Jun 15, 2018

I expect null neighbors to be uncommon (seeing as this hasn't been brought up as an issue before now), so optimizing for the more common case where neighbors actually exist is probably better. In that case, the CPU branch predictors (which should be able to predict the branches with high accuracy, since no neighbor is MPI_PROC_NULL) might get better performance with conditional logic than having to dereference memory—but that's something that would require testing & optimization.

@bosilca

bosilca commented Jun 18, 2018

Would this force implementations to support two versions, one optimized for the case without PROC_NULL and one generic (using more memory, as suggested by @rlgraham32)?
