Topology awareness in MPI_Dims_create #195

Open
mpiforumbot opened this issue Jul 24, 2016 · 36 comments

Originally by balaji on 2009-10-09 16:45:51 -0500


Description

MPI_Dims_create takes no information about the processes in its parameters (e.g., there is no communicator argument). Without this information the function cannot use any knowledge of the physical network topology to optimize the Cartesian layout. (This problem is addressed by Alternative A.)

Additionally, MPI_Dims_create is not aware of the application topology. For example, if the application uses a grid of 1200 x 500 elements and 60 MPI processes (as in ticket #195), then the expected Cartesian process topology should be 12 x 5, and not 10 x 6 as returned by the existing MPI_Dims_create. (Both problems are addressed by solution B.)

Extended Scope

None.

History

Proposed Solution

**MPI-2.2, Chapter 7 (Topologies), 7.4 (Overview), page 247, lines 9-17 read**

MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension.
For each coordinate direction one specifies whether the process structure is periodic or
not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per
coordinate direction. Thus, special support for hypercube structures is not necessary.
The local auxiliary function MPI_DIMS_CREATE
can be used to compute a balanced distribution
of processes among a given number of dimensions.

Rationale. Similar functions are contained in EXPRESS [12] and PARMACS. (End of rationale.)

**but should read**

MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension.
For each coordinate direction one specifies whether the process structure is periodic or
not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per
coordinate direction. Thus, special support for hypercube structures is not necessary.
The local auxiliary function MPI_DIMS_CREATE
and the collective function MPI_DIMS_CREATEV
can be used to compute an optimized distribution
of processes among a given number of dimensions.

Rationale. Similar functions are contained in EXPRESS [12] and PARMACS. (End of rationale.)

**MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 248, lines 38-44 read**

7.5.2 Cartesian Convenience Function: MPI_DIMS_CREATE

For Cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

**but should read**

7.5.2 Cartesian Convenience Functions: MPI_DIMS_CREATE

For Cartesian topologies, the functions MPI_DIMS_CREATE and MPI_DIMS_CREATEV help the user select an optimized distribution of processes per coordinate direction, depending on the number of processes in the group to be optimized and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

**Add after MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 249, line 35, i.e. after MPI_DIMS_CREATE:**

MPI_DIMS_CREATEV(comm,ndims,gridsizes,adapt,dims)
 IN     comm      communicator that defines the number 
                  of nodes and the basis for the 
                  hardware adaptation (handle)
 IN     ndims     number of Cartesian dimensions 
                  (positive integer)
 IN     gridsizes integer array of size ndims specifying
                  the application size in each dimension
 IN     info      hints on optimization if adapt=true (handle)
 INOUT  dims      integer array of size ndims specifying
                  the number of nodes in each dimension
int MPI_Dims_createv(MPI_Comm comm, int ndims, 
             int gridsizes[], MPI_Info info, int dims[])
MPI_DIMS_CREATEV(COMM, NDIMS, GRIDSIZES, INFO, DIMS, IERROR)
    INTEGER COMM, NDIMS, GRIDSIZES(*), INFO, DIMS(*), IERROR

The entries in the array dims are set to describe a Cartesian grid
with ndims dimensions and a total number of nodes identical to the
size of the group of comm. The caller may constrain the operation
of this routine by specifying elements of array dims. If
dims[i] is set to a positive number, the routine will not modify
the number of nodes in dimension i; only those entries where
dims[i] = 0 will be modified. Those dimensions where dims[i] = 0 are
set such that **the returned values will allow
MPI_CART_CREATE to return a communicator where neighbor communication is efficient**
~~all ratios gridsizes[i]/dims[i] (using floating point divide)
are as close to each other as possible, using an appropriate divisibility
algorithm. The largest ratio divided by the smallest ratio should be
minimal.
If there are different possibilities for calculating these
dims[i], then the solution set with minimal difference between
largest and smallest dims[i] should be used~~.
If i<j, gridsize[i]=gridsize[j], dims[i]=0,
and dims[j]=0 at input, then the output dims shall
satisfy this property: dims[i]>=dims[j].

Negative input values of dims[i] are erroneous. An error will occur if the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0.

The array dims is suitable for use as input to MPI_CART_CREATE.
MPI_DIMS_CREATEV is a collective operation.
The arguments comm, ndims, gridsizes, adapt and the (key,value)-pairs stored in info must have identical values on all processes of the group of comm.
MPI_DIMS_CREATEV will return in dims the same results on all processes.

The ~~meaning of adapt=true~~ **returned result** can be additionally
influenced by the info argument.
Valid info values are implementation-dependent.
The constant MPI_INFO_NULL can be used when no hints
should be specified.
An MPI implementation is not obliged to follow specific hints,
and it is valid for an MPI implementation to ignore the hints.

Example 7.2 MPI_DIMS_CREATEV on a system with 12 cores per SMP node and 1 MPI process per core, and the following input values: nnodes=2400, ndims=3, gridsizes=(120,80,250), dims=(0,0,0). Possible output values and their implication for MPI_CART_CREATE and the underlying communication.

| dims after the call | possible cores x nodes | gridpoints per process | gridpoints per node | on 2-dim boundary [1] | sum of communicated points |
|---|---|---|---|---|---|
| (12,8,25)  | (12x1, 1x8,  1x25) | (10,10,10)    | (120,10,10) | (100 [3], 1200, 1200) | 2400 |
| (12,8,25)  | (6x2,  2x4,  1x25) | (10,10,10)    | (60,20,10)  | (200, 600, 1200)      | 2000 |
| (12,8,25)  | (3x4,  4x2,  1x25) | (10,10,10)    | (30,40,10)  | (400, 300, 1200)      | 1900 |
| (10,10,24) | (2x5,  2x5,  3x8)  | (12,8,11 [2]) | (24,16,33)  | (528, 792, 384)       | 1704 |
| (10,10,24) | (1x10, 2x5,  6x4)  | (12,8,11 [2]) | (12,16,66)  | (1056, 792, 192)      | 2040 |
| (10,10,24) | (2x5,  1x10, 6x4)  | (12,8,11 [2]) | (24,8,66)   | (528, 1584, 192)      | 2304 |
| (10,10,24) | (1x10, 1x10, 12x2) | (12,8,11 [2]) | (12,8,132)  | (1056, 1584, 96)      | 2736 |

[1] In each dimension, calculated as the product of the "gridpoints per node" in the other dimensions.

[2] 1.0*250/24 rounded up.

[3] No node-to-node communication in this direction because the number of nodes is only 1.
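
To make the intended calling sequence concrete, here is a minimal usage sketch. MPI_Dims_createv is only the routine proposed in this ticket (its C binding is given above) and is not part of any released MPI standard, so the call is hypothetical; the grid sizes are taken from Example 7.2.

```c
#include <mpi.h>

/* Usage sketch only: MPI_Dims_createv is the routine proposed in this
 * ticket and does not exist in any released MPI standard.  The 3-D
 * application grid of 120 x 80 x 250 points is the one from Example 7.2. */
int main(int argc, char **argv)
{
    int gridsizes[3] = {120, 80, 250};
    int dims[3]      = {0, 0, 0};   /* 0 = let the routine choose */
    int periods[3]   = {0, 0, 0};
    MPI_Comm cart;

    MPI_Init(&argc, &argv);

    /* Collective over the group of comm, as specified above:
     * all processes pass identical arguments and get identical dims. */
    MPI_Dims_createv(MPI_COMM_WORLD, 3, gridsizes, MPI_INFO_NULL, dims);

    /* The returned dims are intended as direct input to MPI_Cart_create. */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```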

Impact on Implementations

The new routine must be implemented.

Impact on Applications / Users

The proposal is source-code and binary (ABI) backward compatible with MPI-2.2.
Applications that currently use their own algorithm for calculating MPI process
dims based on application grid dimensions may benefit from this new routine.

Alternative Solutions

None.

Entry for the Change Log

Section 7.5.2 on page 248.
New routine MPI_DIMS_CREATEV.


Originally by gropp on 2009-10-11 14:12:40 -0500


Minor changes in wording and two spelling fixes.


Originally by gropp on 2009-10-11 14:17:00 -0500


Added more rationale.


Originally by RolfRabenseifner on 2009-11-12 15:04:09 -0600


First step, make it readable on a normal screen size.


Originally by RolfRabenseifner on 2009-11-12 16:27:48 -0600


As proposed in the discussion at Portland.


Originally by RolfRabenseifner on 2009-11-12 16:39:53 -0600


New name for this new routine


Originally by RolfRabenseifner on 2009-11-12 17:53:07 -0600


Corrected and with example


Originally by RolfRabenseifner on 2012-01-17 11:20:18 -0600


After the discussion on ticket #314, I propose

  • add before Alternative A: See Alternative B = Proposed Solution

  • change headline "Alternative A" into "Alternative A (Withdrawn)"

  • change headline "Alternative B" into "Alternative B = Proposed Solution"

  • add after "Alternative B = Proposed Solution" the following line

    **MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 248, line 38-44 read**

    7.5.2 Cartesian Convenience Function: MPI_DIMS_CREATE

    For Cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

    **but should read**

    7.5.2 Cartesian Convenience Function**s**: MPI_DIMS_CREATE**(V)**

    For Cartesian topologies, the function MPI_DIMS_CREATE **and MPI_DIMS_CREATEV** help~~s~~ the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

    **Add after MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 249, line 35, i.e. after MPI_DIMS_CREATE:**


Originally by RolfRabenseifner on 2012-01-18 04:12:58 -0600


Pavan,

I propose to add an additional example with adapt=true:

Example 7.3 With adapt=true on a system with 12 cores per SMP node and 1 MPI process per core, and the following input values: nnodes=2400, ndims=3, gridsizes=(120,80,250), dims=(0,0,0). Possible output values and their implications for MPI_CART_CREATE and the underlying communication.

| dims after the call | possible cores x nodes | gridpoints per process | gridpoints per node | on 2-dim boundary [1] | sum of communicated points |
|---|---|---|---|---|---|
| (12,8,25)  | (12x1, 1x8,  1x25) | (10,10,10)    | (120,10,10) | (100 [3], 1200, 1200) | 2400 |
| (12,8,25)  | (3x4,  4x2,  1x25) | (10,10,10)    | (30,40,10)  | (400, 300, 1200)      | 1900 |
| (10,10,24) | (2x5,  2x5,  3x8)  | (12,8,11 [2]) | (24,16,33)  | (528, 792, 384)       | 1704 |
| (10,10,24) | (1x10, 2x5,  6x4)  | (12,8,11 [2]) | (12,16,66)  | (1056, 792, 192)      | 2040 |
| (10,10,24) | (2x5,  1x10, 6x4)  | (12,8,11 [2]) | (24,8,66)   | (528, 1584, 192)      | 2304 |
| (10,10,24) | (1x10, 1x10, 12x2) | (12,8,11 [2]) | (12,8,132)  | (1056, 1584, 96)      | 2736 |

[1] In each dimension, calculated as the product of the "gridpoints per node" in the other dimensions.

[2] 1.0*250/24 rounded up.

[3] No node-to-node communication in this direction because the number of nodes is only 1.

Although the decision about the factorization of each dims[i] into cores[i] x nodes[i] is part of MPI_CART_CREATE, the factorization within MPI_DIMS_CREATEV with
adapt=true already restricts the possibilities for MPI_CART_CREATE.
Although the first two results both have the perfect ratios (10.0, 10.0, 10.0),
the third solution with the ratios (12.0, 8.0, 10.4) may have minimal communication time
because the sum of the node-to-node communications in all 3 directions is proportional to only 1704 grid points.
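
As a cross-check of the numbers above, the following sketch recomputes the table columns for the row dims=(10,10,24) with the cores x nodes split (2x5, 2x5, 3x8); the variable names are ours, not part of the proposal.

```c
#include <stdio.h>

/* Recomputes the "gridpoints per process/node", per-dimension boundary, and
 * "sum of communicated points" columns for one table row above:
 * dims = (10,10,24), cores x nodes = (2x5, 2x5, 3x8), 12 cores per node. */
int main(void)
{
    int gridsizes[3] = {120, 80, 250};
    int dims[3]      = {10, 10, 24};
    int cores[3]     = {2, 2, 3};   /* processes per node in each dimension */
    int nodes[3]     = {5, 5, 8};   /* nodes per dimension (dims[i] = cores[i]*nodes[i]) */

    int per_proc[3], per_node[3];
    long sum = 0;

    for (int i = 0; i < 3; i++) {
        per_proc[i] = (gridsizes[i] + dims[i] - 1) / dims[i];  /* rounded up */
        per_node[i] = per_proc[i] * cores[i];
    }
    for (int i = 0; i < 3; i++) {
        long boundary = 1;          /* 2-dim face of one node in direction i */
        for (int j = 0; j < 3; j++)
            if (j != i) boundary *= per_node[j];
        if (nodes[i] > 1)           /* only count real node-to-node traffic  */
            sum += boundary;
        printf("dim %d: per process %d, per node %d, boundary %ld\n",
               i, per_proc[i], per_node[i], boundary);
    }
    printf("sum of communicated points: %ld\n", sum);  /* prints 1704 */
    return 0;
}
```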


Originally by RolfRabenseifner on 2012-01-23 05:48:45 -0600


Update done, based on previous comments, and as discussed with Pavan.

I additionally added an info argument.


Originally by RolfRabenseifner on 2012-01-23 07:08:24 -0600


C and Fortran interface added.

Example 7.3 extended a bit, with clearer wording.


Originally by gropp on 2012-01-23 08:03:32 -0600


The idea is good but there are some problems with the details.

  1. The info argument is missing from the definitions
  2. The language needs editing
  3. What exactly does "as close to each other as possible" mean? Could I use different norms for evaluating this (e.g., 1-norm or infinity-norm)? Could those norms have weights?


Originally by RolfRabenseifner on 2012-01-23 09:37:27 -0600


Fixed the missing info in C and Fortran.
My English should be fixed by native speakers.
"close to" must be fixed. I'm not sure whether the second condition
about the ratio of largest to smallest ratio is already enough.


Originally by RolfRabenseifner on 2012-01-23 10:35:24 -0600


Replying to gropp:

The idea is good but there are some problems with the details.

  1. The info argument is missing from the definitions
    Done.
  2. The language needs editing
    Needs native speakers.
  3. What exactly does "as close to each other as possible" mean? Could I use different norms for evaluating this (e.g., 1-norm or infinity-norm)?

This sentence is stolen from MPI_DIMS_CREATE:

MPI-2.2 page 249, lines 16-17: "The dimensions are set to be as close to each other as possible, using an appropriate divisibility algorithm."

Example: Input is nnodes=15360, ndims=4.

Possible results may be

  • (16,12,10,8) and
  • (16,15,8,8), i.e. both with largest/smallest = 2

The neighbor ratios are

  • (1.333, 1.20, 1.25) with avg=1.261 and max=1.333 (best)
  • (1.067, 1.875, 1.0) with avg=1.314 and max=1.875

More complicated nnodes=1843200, ndims=6:

  • (8 10 10 12 12 16) -> (1.25 1 1.2 1 1.3333) -> avg=1.1567 max=1.33
  • (8 8 10 12 15 16) -> (1 1.25 1.2 1.25 1.0667) -> avg=1.1533 max=1.25 (best)
    Remark: although nnodes=2**13 * 3**2 * 5**2, the best result does not contain any "2"
    in the 5th dims (=15).
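
To make these comparisons reproducible, here is a small sketch that computes the two norms used above, the average and the maximum of the neighbor ratios dims[i]/dims[i+1] of a non-increasingly sorted dims vector; the helper name evaluate_dims is ours, purely for illustration.

```c
#include <stdio.h>

/* Illustrative helper (not an MPI routine): computes the average and the
 * maximum of the neighbor ratios dims[i]/dims[i+1] for a dims vector that
 * is sorted in non-increasing order, i.e. the two norms used above. */
static void evaluate_dims(const int dims[], int ndims, double *avg, double *max)
{
    double sum = 0.0;
    *max = 0.0;
    for (int i = 0; i < ndims - 1; i++) {
        double r = (double)dims[i] / (double)dims[i + 1];
        sum += r;
        if (r > *max)
            *max = r;
    }
    *avg = sum / (ndims - 1);
}

int main(void)
{
    int a[4] = {16, 12, 10, 8};   /* first candidate for nnodes = 15360 */
    int b[4] = {16, 15, 8, 8};    /* second candidate                   */
    double avg, max;

    evaluate_dims(a, 4, &avg, &max);
    printf("(16,12,10,8): avg=%.3f max=%.3f\n", avg, max);  /* 1.261  1.333 */
    evaluate_dims(b, 4, &avg, &max);
    printf("(16,15,8,8):  avg=%.3f max=%.3f\n", avg, max);  /* 1.314  1.875 */
    return 0;
}
```
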

Our Cray MPI (mpich2 based) returns the wrong result.
The algorithm in OpenMPI seems to do the same.

I'm not sure, whether there may be an example where both norms
produce different results for MPI_DIMS_CREATE.

But it may be, that this is possible for the V-version MPI_DIMS_CREATEV.

Another question for adapt=false is whether we explicitly allow an implementation to choose
a worse solution (compared to an optimum based on the wording in the MPI standard)
in order to allow fast (heuristic) algorithms.


Originally by goodell on 2012-01-23 14:16:08 -0600


editing pass over the language


Originally by goodell on 2012-01-23 14:22:09 -0600


I've made an editing pass over the text, but I'm still not sure what was intended by this sentence:

"All `dims[i]` modified by this call are returned in the same order as the `gridsize[i]` values, and if several `gridsize[i]` values are identical and then the in non-increasing order in the sequence of the indices."

I understand it's meant to disambiguate the correct choice among several otherwise valid options, but the first part still doesn't make sense to me.

Also, is the following sentence too restrictive?

The array dims is suitable for use as input to MPI_CART_CREATE. MPI_DIMS_CREATEV is local and will return the same results when called with the same input arguments on all processes of the group of comm. 

This would seem to preclude any optimizations that involve any amount of randomization, since it's probably impossible to make sure that the random number generator is used exactly the same way on all processes. This may not matter too much for the default adapt=false case, but the adapt=true case is probably excessively hampered this way. OTOH, if we remove this restriction then we are obligating the user to broadcast the resulting information themselves in order to have consistent arguments at all processes for the call to MPI_Cart_create.


Originally by RolfRabenseifner on 2012-01-24 03:57:25 -0600


Replying to goodell:

I've made an editing pass over the text, but I'm still not sure what was intended by this sentence:

"All `dims[i]` modified by this call are returned in the same order as the `gridsize[i]` values, and if several `gridsize[i]` values are identical and then the in non-increasing order in the sequence of the indices."

I understand it's meant to disambiguate the correct choice among several otherwise valid options, but the first part still doesn't make sense to me.

The first part may want to say that the result in dims is not ordered as in the non-V
MPI_DIMS_CREATE. This is a nice goal, but the result may not be good:

This first part of the sentence implies that after finding the optimal dims result (i.e., with optimal ratios...), the routine has to reorder these values
according to the sequence in gridsizes.

My understanding is that this would imply that the ratio rules are broken, i.e., the final result is worse than defined by the ratio rules (because the set of ratios is changed by such a reordering when gridsize-values are not identical).

There are two possibilities:

  • Such a reordering can never happen, because the ratio rules already produce
    such a sequence. Then the sentence is not needed and may only cause
    additional implementation overhead, because nobody wants to mathematically prove
    that this cannot happen.
    This means the first part may be removed.
  • Such a reordering can happen and the result would therefore be worse.
    In this case, this first part should definitely be removed.

Both possibilities imply that the first part may or must be removed.
Therefore I recommend substituting the sentence with

"If several `gridsize[i]` values are identical, then for those indices `i`, the `dims[i]` modified by this call are returned in non-increasing order in the sequence of the indices."

Okay?

Also, is the following sentence too restrictive?

The array dims is suitable for use as input to MPI_CART_CREATE. MPI_DIMS_CREATEV is local and will return the same results when called with the same input arguments on all processes of the group of comm. 

This would seem to preclude any optimizations that involve any amount of randomization, since it's probably impossible to make sure that the random number generator is used exactly the same way on all processes. This may not matter too much for the default adapt=false case, but the adapt=true case is probably excessively hampered this way. OTOH, if we remove this restriction then we are obligating the user to broadcast the resulting information themselves in order to have consistent arguments at all processes for the call to MPI_Cart_create.

We should not change this sentence, because this routine has one major goal, returning in dims the input for MPI_CART_CREATE.
It is up to the application whether MPI_DIMS_CREATE(V) is called by only one process, with dims then broadcast to all others, or whether MPI_DIMS_CREATE(V) is called collectively. For both routines, the same rule should apply to make it easy to move from MPI_DIMS_CREATE to MPI_DIMS_CREATEV if this is better for the application.


Originally by RolfRabenseifner on 2012-01-24 05:53:59 -0600


Even worse: the classical MPI_DIMS_CREATE(240, 2, dims) returns

  • Cray: 16*15 (correct)
  • OpenMPI 1.5.4: 20*12 (bad result - not as close as possible)
  • MVAPICH 1.6: 20*12 (bad result - not as close as possible)

I do not report this to point out bugs that should be corrected.
I mention it because we may want to allow heuristics, which are obviously already used in
all these implementations.

Proposal for both MPI_DIMS_CREATE and MPI_DIMS_CREATEV:

**The returned result in the array dims may differ from the optimum
because the implementation may use a faster algorithm based on heuristics.**

This would also settle the discussion of whether the text unambiguously defines the optimum.


Originally by goodell on 2012-01-24 13:08:43 -0600


Replying to RolfRabenseifner:

Both possibilities imply that the first part may or must be removed.
Therefore I recommend substituting the sentence with

"If several `gridsize[i]` values are identical, then for those indices `i`, the `dims[i]` modified by this call are returned in non-increasing order in the sequence of the indices."

Okay?

Not okay. I think I agree with eliminating the first part. But I still don't entirely understand what you mean even in the new version. Here's my attempt at cleaning it up into what I think you mean; you tell me if I got it right:

"If `i<j`, `gridsize[i]=gridsize[j]`, `dims[i]=0`, and `dims[j]=0` at input, then the output `dims` shall satisfy this property: `dims[i]>=dims[j]`.

Why decreasing versus increasing order?

Also, is the following sentence too restrictive?

The array dims is suitable for use as input to MPI_CART_CREATE. MPI_DIMS_CREATEV is local and will return the same results when called with the same input arguments on all processes of the group of comm. 

This would seem to preclude any optimizations that involve any amount of randomization, since it's probably impossible to make sure that the random number generator is used exactly the same way on all processes. This may not matter too much for the default adapt=false case, but the adapt=true case is probably excessively hampered this way. OTOH, if we remove this restriction then we are obligating the user to broadcast the resulting information themselves in order to have consistent arguments at all processes for the call to MPI_Cart_create.

We should not change this sentence, because this routine has one major goal, returning in dims the input for MPI_CART_CREATE.
It is up to the application whether MPI_DIMS_CREATE(V) is called by only one process, with dims then broadcast to all others, or whether MPI_DIMS_CREATE(V) is called collectively. For both routines, the same rule should apply to make it easy to move from MPI_DIMS_CREATE to MPI_DIMS_CREATEV if this is better for the application.

I don't think I agree with that assessment, but I don't feel strongly enough to fight over it. Your interpretation means that calling MPI_DIMS_CREATEV must be very deterministic, even across processes. I think that is too strong of a restriction on the implementation. Let's see what the other chapter committee members think.


Originally by gropp on 2012-01-27 05:42:48 -0600


It's clearly the case that the intended use of MPI_DIMS_CREATE(V) is as input to MPI_CART_CREATE. That establishes the requirements on the determinacy of the outputs based on the inputs. (Asking the user to run it in one process and broadcast the results is both ugly and introduces unnecessary communication in almost all cases.)

An alternative would be to give MPI_DIMS_CREATEV collective semantics. That is, require all processes in the input communicator to call it. In the typical use, this is not a problem. Further, the collective semantics does not require any communication - a deterministic implementation could execute purely locally. Should some ambitious implementation want to do something more, then it would have the ability to communicate to ensure that the output was usable to MPI_CART_CREATE.


Originally by RolfRabenseifner on 2012-01-29 06:27:22 -0600


I did the following changes:

  • Two sub-proposals:

    • B = local call
    • C = collective call (as proposed by Bill Gropp)
  • Correction also to MPI-2.2 page 247 lines 9-17

  • Correction of

    All dims[i] modified by this call are returned in the same order as the gridsize[i] values, and if several gridsize[i] values are identical and then the in non-increasing order in the sequence of the indices.

    into (as proposed by Dave Goodell)

    If i<j, gridsize[i]=gridsize[j], dims[i]=0,
    and dims[j]=0 at input, then the output dims shall
    satisfy this property: dims[i]>=dims[j].

  • ndims as "positive integer" instead of "integer"

I personally would prefer proposal C, because it allows both

  • a deterministic (non-parallelized and non-communicating) implementation, and
  • a parallelized implementation to achieve an optimum although
    the search space may be huge on a huge number of processes.


Originally by htor on 2012-02-01 22:07:15 -0600


I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

  • I disagree to the word "balanced" in the general description, the returned
    dimensions are not optimized for balance
  • "1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"
  • why is the constraint "If i<j, gridsize[i]=gridsize[j], dims[i]=0, and
    dims[j]=0 at input, then the output dims shall satisfy this property:
    dims[i]>=dims[j]." defined?
  • why would we ever say adapt = 0? It does not change program semantics
    and is "good" :-)
  • "An error will occur if the size of the group of comm is not a
    multiple of the product of all dims[i] with dims[i]>0." -- this
    doesn't seem to work because if this were true, then dims[i]==0 would
    need to remain 0.

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

Thanks,
Torsten


Originally by RolfRabenseifner on 2012-02-05 06:56:08 -0600


Replying to htor:

I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

  • I disagree to the word "balanced" in the general description, the returned
    dimensions are not optimized for balance

I would keep the text because, with the new routine, the values of gridsizes[i]/dims[i]
are balanced. We need not be better than the current wording
that uses "balanced" in combination with "optional constraints".
In MPI-2.2, these optional constraints have been fixed dims[i] values.
With MPI_DIMS_CREATEV in MPI-3.0, the gridsizes[i]/dims[i] rule is an additional constraint.

  • "1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"

This is to make clear, that the ratio is a float and not a rounded or cut integer
division. To stay as precise as possible, I would keep it until a better
proposal is done.
A term float(gridsizes[i])/float(dims[i]) would be worse, because it is not language neutral.

  • why is the constraint "If i<j, gridsize[i]=gridsize[j], dims[i]=0, and
    dims[j]=0 at input, then the output dims shall satisfy this property:
    dims[i]>=dims[j]." defined?

To get the same behavior with gridsizes == 1 as with old MPI_DIMS_CREATE,
and to achieve deterministic results in most cases.

  • why would we ever say adapt = 0? It does not change program semantics
    and is "good" :-)

To be able to substitute MPI_DIMS_CREATE by
MPI_DIMS_CREATEV with gridsizes == 1,
as long as nnodes is identical to the size of a communicator
and all processes in the communicator are calling the old MPI_DIMS_CREATE.

  • "An error will occur if the size of the group of comm is not a
    multiple of the product of all dims[i] with dims[i]>0." -- this
    doesn't seem to work because if this were true, then dims[i]==0 would
    need to remain 0.

I do not see a logical difference to the text of MPI_DIMS_CREATE, MPI-2.2 page 249 lines 21-22. Therefore, I would not make it more complicated.
It would be possible to substitute the current sentence by

An error will occur **if one or more indices i with dims[i]>0 exist and**
the size of the group of comm is not a
multiple of the product of all dims[i] with dims[i]>0.

I do not include the bold portion in the sentence, because nobody complained about the existing sentence in MPI_DIMS_CREATE of MPI-1.1 - MPI-2.2.

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

I will try to fix all these small things now, so that we can read it in Chicago.
There is nothing complicated in it.

Thanks,
Torsten

Thank you for your detailed and very helpful review.

I'll add with my next steps the still needed text for handling non-optimal
results based on heuristics.
This will be the more complicated wording and content.
I'll do it as an additional option D.

Rolf


Originally by RolfRabenseifner on 2012-02-05 09:29:23 -0600


Several current implementations of MPI_DIMS_CREATE are wrong, i.e., they definitely do not
follow the standard: "The dimensions are set to be as close to each other as possible" (MPI-2.2, page 249, line 16).

They use fast heuristics instead of really calculating the optimum.

There are two options:

  1. We keep the standard as it is and all these MPI implementations are buggy.
    Correcting these libraries will be a non-backward compatible change.
  2. We change the standard and allow the use of fast heuristics.
    Then the implementers have the freedom to choose:
    • Changing the current heuristics to produce better results, or
    • keeping it as it is to be backward compatible.

I'll integrate an Option D into the proposal to allow the second choice.
Current text is along the first choice.


Details about current implementations of MPI_DIMS_CREATE:

I checked existing MPI_DIMS_CREATE implementations and learned that they work with a heuristic:

  1. Calculating a prime factorization
  2. Filling all empty dims with 1
  3. Starting with the largest primes and filling them into these dims entries,
    always into the one with the currently smallest value
  4. Sorting the entries

Example: MPI_DIMS_CREATE(16*16*15, 3, dims)

  1. 16*16*15 = 5**1 * 3**1 * 2**8
  2. dims = 1 x 1 x 1
  3. Starting with the largest prime: 5, filling into the (first) smallest dims[i]
    • 5 --> 5 x 1 x 1
    • 3 --> 5 x 3 x 1
    • 2 --> 5 x 3 x 2
    • 2 --> 5 x 3 x 4
    • 2 --> 5 x 6 x 4
    • 2 --> 5 x 6 x 8
    • 2 --> 10 x 6 x 8
    • 2 --> 10 x 12 x 8
    • 2 --> 10 x 12 x 16
    • 2 --> 20 x 12 x 16
  4. Sorting: --> 20 x 16 x 12.

This result is of course wrong according to the definition of MPI_DIMS_CREATE.

The correct result would be 16 x 16 x 15.
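
As an illustration of the four steps above, here is a compact sketch of such a heuristic (our own reconstruction, not code taken from any MPI implementation). For 16*16*15 processes in 3 dimensions it reproduces the 20 x 16 x 12 result derived above.

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

/* Reconstruction of the heuristic described above (illustrative only). */
static void dims_create_heuristic(int nnodes, int ndims, int dims[])
{
    for (int i = 0; i < ndims; i++)            /* step 2: fill all dims with 1 */
        dims[i] = 1;

    while (nnodes > 1) {                       /* step 3: largest prime first  */
        int n = nnodes, p = n;
        for (int d = 2; d * d <= n; d++)
            while (n % d == 0) { p = d; n /= d; }
        if (n > 1) p = n;                      /* p = largest prime factor     */

        int smallest = 0;                      /* multiply into the (first)    */
        for (int i = 1; i < ndims; i++)        /* currently smallest entry     */
            if (dims[i] < dims[smallest]) smallest = i;
        dims[smallest] *= p;
        nnodes /= p;
    }

    qsort(dims, ndims, sizeof(int), cmp_desc); /* step 4: sort non-increasing  */
}

int main(void)
{
    int dims[3];
    dims_create_heuristic(16 * 16 * 15, 3, dims);
    printf("%d x %d x %d\n", dims[0], dims[1], dims[2]);  /* prints 20 x 16 x 12 */
    return 0;
}
```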

I checked MVAPICH2, OpenMPI, and CrayMPI with following results:

  • compiler/intel/12.0.4 mpi/openmpi/1.5.4-intel-12.0.4 and
  • compiler/intel/12.0 mpi/mvapich2/1.6-intel-12.0
  • PrgEnv-cray/4.0.36

Results:

  • dt= 0.000008 nnodes= 240=16*15 best ndims= 2 dims= 20 12
  • dt= 0.000545 nnodes= 3840=16*16*15 best ndims= 3 dims= 20 16 12
  • dt= 0.069598 nnodes= 61440=16*16*16*15 best ndims= 4 dims= 20 16 16 12

In all 3 cases, the results
(20 12), (20 16 12) and (20 16 16 12)
are worse than the optimum which would be
(16 15), (16 16 15) and (16 16 16 15)

Only OpenMPI and CrayMPI returned a result (mvapich2 seems to be in an endless loop):
dt=34.429274 nnodes= 1843200=16*15*12*10*8*8 ndims= 6 dims= 16 12 12 10 10 8

This result (16 12 12 10 10 8) is worse than
the optimum (16 15 12 10 8 8).

Reason why the second one is really the optimum:

|         | dims               | ratios between dims[i] and dims[i+1]                            | max   | avg.   |
|---------|--------------------|------------------------------------------------------------------|-------|--------|
| Result  | (16 12 12 10 10 8) | 16/12=1.3333, 12/12=1.000, 12/10=1.200, 10/10=1.000, 10/8=1.250  | 1.333 | 1.1567 |
| Optimum | (16 15 12 10 8 8)  | 16/15=1.0667, 15/12=1.250, 12/10=1.200, 10/8=1.250, 8/8=1.000    | 1.250 | 1.1533 |

About the timing:

  • dt values are reported by OpenMPI.
  • dt by mvapich2 and CrayMPI: between 0.000000 and 0.000004 sec.


Originally by RolfRabenseifner on 2012-02-05 10:15:03 -0600


With the last change of the Description, I only added Option D and defined
C+D as proposed solution.


Originally by htor on 2012-02-05 11:47:31 -0600


Replying to RolfRabenseifner:

Replying to htor:

I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

  • I disagree to the word "balanced" in the general description, the returned
    dimensions are not optimized for balance

I would keep the text because, with the new routine, the values of gridsizes[i]/dims[i]
are balanced. We need not be better than the current wording
that uses "balanced" in combination with "optional constraints".
In MPI-2.2, these optional constraints have been fixed dims[i] values.
With MPI_DIMS_CREATEV in MPI-3.0, the gridsizes[i]/dims[i] rule is an additional constraint.
This is unclear and confusing. I'd go with "optimized" because it does not imply what the optimization criterion is.

  • "1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"

This is to make clear, that the ratio is a float and not a rounded or cut integer
division. To stay as precise as possible, I would keep it until a better
proposal is done.
A term float(gridsizes[i])/float(dims[i]) would be worse, because it is not language neutral.
Something like "the exact" or so seems more appropriate to me.

  • why is the constraint "If i<j, gridsize[i]=gridsize[j], dims[i]=0, and
    dims[j]=0 at input, then the output dims shall satisfy this property:
    dims[i]>=dims[j]." defined?

To get the same behavior with gridsizes == 1 as with old MPI_DIMS_CREATE,
and to achieve deterministic results in most cases.
ok

  • why would we ever say adapt = 0? It does not change program semantics
    and is "good" :-)

To be able to substitute MPI_DIMS_CREATE by
MPI_DIMS_CREATEV with gridsizes == 1,
as long as nnodes is identical to the size of a communicator
and all processes in the communicator are calling the old MPI_DIMS_CREATE.
Why would we want to do this?

  • "An error will occur if the size of the group of comm is not a
    multiple of the product of all dims[i] with dims[i]>0." -- this
    doesn't seem to work because if this were true, then dims[i]==0 would
    need to remain 0.

I do not see a logical difference to the text of MPI_DIMS_CREATE, MPI-2.2 page 249 lines 21-22. Therefore, I would not make it more complicated.
It would be possible to substitute the current sentence by

An error will occur **if one or more indices i with dims[i]>0 exist and**
the size of the group of comm is not a
multiple of the product of all dims[i] with dims[i]>0.

I do not include the bold portion in the sentence, because nobody complained about the existing sentence in MPI_DIMS_CREATE of MPI-1.1 - MPI-2.2.
I see, the first version seems best (together with what we have already in dims_create)!

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

I will try to fix all these small things now, so that we can read it in Chicago.
There is nothing complicated in it.
I think this exceeds the threshold for a proposal to be passed without a discussion and a demonstration etc.. I'm not saying we should not go ahead, but we need to discuss it in the broader audience (and not just read it!).

I'll add with my next steps the still needed text for handling non-optimal
results based on heuristics.
This will be the more complicated wording and content.
I'll do it as an additional option D.
see ....

Torsten


Originally by RolfRabenseifner on 2012-02-06 04:25:41 -0600


Replying to htor:

Replying to RolfRabenseifner:

Replying to htor:

I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:

  • I disagree to the word "balanced" in the general description, the returned
    dimensions are not optimized for balance

I would keep the text because, with the new routine, the values of gridsizes[i]/dims[i]
are balanced. We need not be better than the current wording
that uses "balanced" in combination with "optional constraints".
In MPI-2.2, these optional constraints have been fixed dims[i] values.
With MPI_DIMS_CREATEV in MPI-3.0, the gridsizes[i]/dims[i] rule is an additional constraint.
This is unclear and confusing. I'd go with "optimized" because it does not imply what the optimization criterion is.

Yes, this is a good proposal. I'll change the four locations of "balanced" (all in the overview paragraphs) into "optimized".

  • "1.0*gridsizes[i]/dims[i]" should really be "gridsizes[i]/dims[i]"

This is to make clear, that the ratio is a float and not a rounded or cut integer
division. To stay as precise as possible, I would keep it until a better
proposal is done.
A term float(gridsizes[i])/float(dims[i]) would be worse, because it is not language neutral.
Something like "the exact" or so seems more appropriate to me.

I'll change "1.0*gridsizes[i]/dims[i]" into "gridsizes[i]/dims[i](using floating point divide)"

... (topic is resolved)

  • why would we ever say adapt = 0? It does not change program semantics
    and is "good" :-)

To be able to substitute MPI_DIMS_CREATE by
MPI_DIMS_CREATEV with gridsizes == 1,
as long as nnodes is identical to the size of a communicator
and all processes in the communicator are calling the old MPI_DIMS_CREATE.
Why would we want to do this?

As long as we go with Alternative C (collective MPI_DIMS_CREATEV),
the new routine can calculate the optimum faster (i.e., in parallel)
than the existing MPI_DIMS_CREATE. Therefore users of MPI_DIMS_CREATE
may want to switch to MPI_DIMS_CREATEV without losing the semantics.
With gridsizes == 1, they have the same semantics.

I know, this is not a very strong argument, but for the implementation
of MPI_DIMS_CREATEV, the adapt=0 is not a real problem,
because it can be internally handled as adapt=1 with
a CPU size equal to one core.

... (topic is resolved)

Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This needs certainly a discussion at the Forum.

I will try to fix all these small things now, so that we can read it in Chicago.
There is nothing complicated in it.
I think this exceeds the threshold for a proposal to be passed without a discussion and a demonstration etc.. I'm not saying we should not go ahead, but we need to discuss it in the broader audience (and not just read it!).

This topic has a 3h slot, Wednesday March 7, 9am-12pm.

... (topic is resolved)

Rolf


Originally by RolfRabenseifner on 2012-02-06 04:34:46 -0600


Changed "balanced" into "optimized" and
"1.0*gridsizes[i]/dims[i]" into "gridsizes[i]/dims[i] (using floating point divide)".


Originally by RolfRabenseifner on 2012-02-06 10:05:55 -0600


Attachment added: dims_create_test.c (2.4 KiB)
Testing MPI_Dims_create and its factorizations


Originally by RolfRabenseifner on 2012-02-06 10:06:45 -0600


Attachment added: dims_create_test_log.txt (10.6 KiB)
Protocol of test of MPI_Dims_create with three MPI libraries.


Originally by RolfRabenseifner on 2012-03-06 16:46:27 -0600


In the Chicago meeting, March 2012, we decided:

  • Use Alternative C
  • remove the argument adapt
   IN     adapt     the result is independent of the 
                    underlying hardware (false) or may be adapted
                    to the underlying hardware (true) (logical) 
  • substitute Option D by the text used for MPI_DIMS_CREATE in Ensure MPI_Dims_create is really suitable for MPI_Cart_create #194:

    Advice to Implementors:
    MPI_Dims_create should return, as much as possible, values that will allow
    MPI_Cart_create to return a communicator so that neighbor communication is efficient.

  • Remove "MPI_DIMS_CREATE(V)" from the section title

  • Use the collective interface (Alternative C).

  • Remove Example 7.2:

    Example 7.2. MPI_DIMS_CREATEV with adapt=false

    | size of comm | ndims | gridsizes   | dims before the call | dims after the call |
    |--------------|-------|-------------|----------------------|---------------------|
    | 60           | 2     | (1200,500)  | (0,0)                | (12,5)              |
    | 60           | 2     | (500,1200)  | (0,0)                | (5,12)              |
    | 60           | 3     | (50,100,60) | (0,2,0)              | (5,2,6)             |
    | 60           | 3     | (1,1,1)     | (0,0,0)              | (5,4,3)             |
    | 60           | 3     | (10,15,10)  | (0,0,0)              | (4,5,3)             |
    | 60           | 3     | (6,15,10)   | (0,0,0)              | (3,5,4)             |
    | 60           | 3     | (5,15,10)   | (0,0,0)              | (2,6,5)             |

    Note that for the second to last example the dims values (3,5,4) and (2,6,5) would
    imply the same set of ratios, i.e., (2.0, 3.0, 2.5) and (3.0, 2.5, 2.0), each with a ratio of 1.5
    between the largest and smallest.
    The result is (3,5,4) because of its smaller difference between largest and
    smallest dims[i].

  • Remove the explanations to Example 7.3 (now renumbered to 7.2)

    Although the decision about the factorization of each dims[i]
    into cores[i] x nodes[i] is part of MPI_CART_CREATE, the factorization
    within MPI_DIMS_CREATEV with
    adapt=true restricts the possibilities for MPI_CART_CREATE.
    The first three results all have perfect ratios of (10.0, 10.0, 10.0),
    i.e., dims=(12,8,25) would be returned with adapt=false, although the
    values shown in the last column are not minimal.
    The fourth solution with dims=(10,10,24) and a core distribution of (2,2,3) has worse
    ratios (12.0, 8.0, 10.4), but may have minimal communication time
    because the sum of the node-to-node communications in all 3 directions is proportional
    to only 1704 grid points. Specifically, with adapt=true the routine may
    return dims=(10,10,24) if the group of comm is executed with 12 MPI processes
    per SMP node.


Originally by RolfRabenseifner on 2012-03-06 19:11:16 -0600


Based on a discussion with Bill, I further changed

Those dimensions where dims[i] = 0 are
set such that
all ratios gridsizes[i]/dims[i] (using floating point divide)
are as close to each other as possible, using an appropriate divisibility
algorithm. The largest ratio divided by the smallest ratio should be
minimal.
If there are different possibilities for calculating these
dims[i], then the solution set with minimal difference between
largest and smallest dims[i] should be used.

into

Those dimensions where dims[i] = 0 are
set such that
they are good as input for MPI_CART_CREATE.

It needs re-reading of these parts.


Originally by RolfRabenseifner on 2012-03-07 13:48:15 -0600


Latest changes as discussed in the formal re-reading.


Originally by RolfRabenseifner on 2012-03-07 13:51:56 -0600


Had formal reading in the Chicago meeting, March 2012.


Originally by gropp on 2012-05-29 19:39:28 -0500


There are still too many errors in the description, starting with the fact that the language-independent form has errors (there is an info parameter in the argument descriptions but no info parameter in the argument list; conversely, there is no adapt in the argument descriptions but there is an adapt in the parameter list).


Originally by jsquyres on 2012-06-20 09:48:16 -0500


1st vote failed in Japan Forum meeting, May, 2012. Moved to "author rework".


Originally by RolfRabenseifner on 2012-06-21 03:07:01 -0500


The vote was yes=2, no=10, abstain=4, missed=0.

Did it fail because

  • we should not proceed with this ticket,
  • or because it is not ready (bugs in the description)?

Corrections are needed in the block with the language independent definition:

  • substitute "adapt" by "info"
  • remove "if adapt=true" in the description of the "info" argument

The language bindings were correct.

The full text description was also correct, but we should remove the canceled text portions.
