Topology awareness in MPI_Dims_create #195
Comments
Originally by gropp on 2009-10-11 14:12:40 -0500 Minor changes in wording and two spelling fixes.
Originally by gropp on 2009-10-11 14:17:00 -0500 Added more rationale.
Originally by RolfRabenseifner on 2009-11-12 15:04:09 -0600 First step, make it readable on a normal screen size.
Originally by RolfRabenseifner on 2009-11-12 16:27:48 -0600 As proposed in the discussion at Portland.
Originally by RolfRabenseifner on 2009-11-12 16:39:53 -0600 New name for this new routine.
Originally by RolfRabenseifner on 2009-11-12 17:53:07 -0600 Corrected and with example.
Originally by RolfRabenseifner on 2012-01-17 11:20:18 -0600 After the discussion on ticket #314, I propose ...
Originally by RolfRabenseifner on 2012-01-18 04:12:58 -0600 Pavan, I propose to add an additional example with adapt=true: Example 7.3, with the columns "dims after the call", "possible cores x nodes", "gridpoints per process", "gridpoints per node", "on 2-dim boundary^1)^", and "sum of communicated points". ^1)^ In each dimension, calculated as product of the "gridpoints per node" in the other dimensions. ^2)^ 1.0*250/24 rounded up. ^3)^ No node-to-node communication in this direction because the number of nodes is only 1. Although the decision about the factorization of each ...
Originally by RolfRabenseifner on 2012-01-23 05:48:45 -0600 Update done, based on previous comments, and as discussed with Pavan. I additionally added an info argument.
Originally by RolfRabenseifner on 2012-01-23 07:08:24 -0600 C and Fortran interfaces added. Example 7.3 a bit extended and clearer wording.
Originally by gropp on 2012-01-23 08:03:32 -0600 The idea is good but there are some problems with the details.
Originally by RolfRabenseifner on 2012-01-23 09:37:27 -0600 Fixed the missing info in C and Fortran.
Originally by RolfRabenseifner on 2012-01-23 10:35:24 -0600 Replying to gropp:
This sentence is stolen from MPI_DIMS_CREATE, MPI-2.2 page 249, lines 16-17: "The dimensions are set to be as close to each other as possible, using an appropriate divisibility algorithm." Example: input is nnodes=15360, ndims=4. Possible results may be ...
The neighbor ratios are ...
More complicated: nnodes=1843200, ndims=6: ...
Our Cray MPI (MPICH2 based) returns the wrong result. I'm not sure whether there may be an example where both norms ... But it may be that this is possible for the V-version MPI_DIMS_CREATEV. Another question for adapt=false is whether we explicitly allow that an implementation chooses ...
Originally by goodell on 2012-01-23 14:16:08 -0600 Editing pass over the language.
Originally by goodell on 2012-01-23 14:22:09 -0600 I've made an editing pass over the text, but I'm still not sure what was intended by this sentence:
I understand it's meant to disambiguate the correct choice among several otherwise valid options, but the first part still doesn't make sense to me. Also, is the following sentence too restrictive?
This would seem to preclude any optimizations that involve any amount of randomization, since it's probably impossible to make sure that the random number generator is used exactly the same way on all processes. This may not matter too much for the default adapt=false case, but the adapt=true case is probably excessively hampered this way. OTOH, if we remove this restriction then we are obligating the user to broadcast the resulting information themselves in order to have consistent arguments at all processes for the call to MPI_Cart_create.
Originally by RolfRabenseifner on 2012-01-24 03:57:25 -0600 Replying to goodell:
The first part may want to tell that the result in dims must not be ordered as in the non-V routine. This first part of the sentence implies that after finding the optimal dims result (i.e., with optimal ratios...), the routine has to reorder these values. My understanding is that this would imply that the ratio rules are broken, i.e., the final result is worse than defined by the ratio rules (because the set of ratios is changed by such a reordering when gridsize values are not identical). There are two possibilities: ...
Both possibilities imply that the first part may or must be removed.
Okay?
We should not change this sentence, because this routine has one major goal: returning in dims the input for MPI_CART_CREATE.
Originally by RolfRabenseifner on 2012-01-24 05:53:59 -0600 Even worse: classical MPI_DIMS_CREATE(240,2,dims) returns ...
I do not report this to point out bugs that should be corrected. Proposal for both MPI_DIMS_CREATE and MPI_DIMS_CREATEV: "The returned result in the array ..." This would also settle the discussion whether the text unambiguously defines the optimum.
Originally by goodell on 2012-01-24 13:08:43 -0600 Replying to RolfRabenseifner:
Not okay. I think I agree with eliminating the first part. But I still don't entirely understand what you mean even in the new version. Here's my attempt at cleaning it up into what I think you mean; you tell me if I got it right:
Why decreasing versus increasing order?
I don't think I agree with that assessment, but I don't feel strongly enough to fight over it. Your interpretation means that calling MPI_DIMS_CREATEV must be very deterministic, even across processes. I think that is too strong a restriction on the implementation. Let's see what the other chapter committee members think.
Originally by gropp on 2012-01-27 05:42:48 -0600 It's clearly the case that the intended use of MPI_DIMS_CREATE(V) is as input to MPI_CART_CREATE. That establishes the requirements on the determinacy of the outputs based on the inputs. (Asking the user to run it in one process and broadcast the results is both ugly and introduces unnecessary communication in almost all cases.) An alternative would be to give MPI_DIMS_CREATEV collective semantics, that is, require all processes in the input communicator to call it. In the typical use, this is not a problem. Further, the collective semantics do not require any communication; a deterministic implementation could execute purely locally. Should some ambitious implementation want to do something more, then it would have the ability to communicate to ensure that the output was usable by MPI_CART_CREATE.
Originally by RolfRabenseifner on 2012-01-29 06:27:22 -0600 I made the following changes: ...
I personally would prefer proposal C, because it allows both ...
Originally by htor on 2012-02-01 22:07:15 -0600 I think this is generally a good idea but it needs some work and discussion. I read through it and have some questions:
Thus, I don't think we should rush it through 3.0 since we have enough other things to push. This certainly needs a discussion at the Forum. Thanks,
Originally by RolfRabenseifner on 2012-02-05 06:56:08 -0600 Replying to htor:
I would keep the text because, with the new routine, the values of gridsizes[i]/dims[i] ...
This is to make clear that the ratio is a float and not a rounded or truncated integer.
To get the same behavior with gridsizes == 1 as with the old MPI_DIMS_CREATE, ...
To be able to substitute MPI_DIMS_CREATE by ...
I do not see a logical difference to the text of MPI_DIMS_CREATE, MPI-2.2 page 249, lines 21-22. Therefore, I would not make it more complicated. "An error will occur if one or more indices i with dims[i]>0 exist and ..." I do not include the bold portion in the sentence, because nobody complained about the existing sentence in MPI_DIMS_CREATE of MPI-1.1 through MPI-2.2.
I'll try to fix all these small things now, so that we can read it in Chicago.
Thank you for your detailed and very helpful review. I'll add with my next steps the still needed text for handling non-optimal ... Rolf
Originally by RolfRabenseifner on 2012-02-05 09:29:23 -0600 Several current implementations of MPI_DIMS_CREATE are wrong, i.e., they definitely do not ... They use fast heuristics instead of really calculating the optimum. There are two options: ...
I'll integrate an Option D into the proposal to allow the second choice. Details about current implementations of MPI_DIMS_CREATE: I checked existing MPI_DIMS_CREATE implementations and learned that they work with heuristics: ...
Example: MPI_DIMS_CREATE(16*16*15, 3, dims)
This result is of course wrong according to the definition of MPI_DIMS_CREATE. The correct result would be 16 x 16 x 15. I checked MVAPICH2, OpenMPI, and CrayMPI with the following results: ...
Results: ...
In all 3 cases, the results ... Only OpenMPI and CrayMPI (MVAPICH2 seems to be in an endless loop): this result (16 12 12 10 10 8) is worse than ... Reason why the second one is really the optimum: comparison of the ratios between dims[i] and dims[i+1], their max and avg. ... About the timing: ...
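To illustrate the difference between such heuristics and really calculating the optimum, here is a minimal, illustrative C sketch (not taken from any of the implementations named above). It enumerates every factorization of nnodes into ndims factors and keeps the one that minimizes max(dims)/min(dims); for nnodes = 16*16*15 = 3840 and ndims = 3 it finds the factor set {15, 16, 16}. Note that MPI_DIMS_CREATE additionally requires the result in non-increasing order, which this sketch does not enforce.

```c
#include <stdio.h>

/* Exhaustively enumerate factorizations of nnodes into ndims factors and
 * keep the one with the smallest ratio max(dims)/min(dims). */
static void search(int nnodes, int ndims, int *cur, int depth, int *best)
{
    if (depth == ndims) {
        if (nnodes != 1)
            return;                        /* not a complete factorization */
        int mx = cur[0], mn = cur[0];
        for (int i = 1; i < ndims; i++) {
            if (cur[i] > mx) mx = cur[i];
            if (cur[i] < mn) mn = cur[i];
        }
        int bmx = best[0], bmn = best[0];
        for (int i = 1; i < ndims; i++) {
            if (best[i] > bmx) bmx = best[i];
            if (best[i] < bmn) bmn = best[i];
        }
        /* Keep the first factorization, or any strictly better one. */
        if (best[0] == 0 || (double)mx / mn < (double)bmx / bmn)
            for (int i = 0; i < ndims; i++)
                best[i] = cur[i];
        return;
    }
    for (int f = 1; f <= nnodes; f++)
        if (nnodes % f == 0) {             /* try every divisor in this slot */
            cur[depth] = f;
            search(nnodes / f, ndims, cur, depth + 1, best);
        }
}

int main(void)
{
    int cur[3], best[3] = {0, 0, 0};
    search(16 * 16 * 15, 3, cur, 0, best);
    printf("optimal factors: %d x %d x %d\n", best[0], best[1], best[2]);
    return 0;
}
```

An exhaustive search like this is more expensive than the heuristics, which is presumably why the implementations above avoid it; the question raised here is whether the standard text should permit that trade-off.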
Originally by RolfRabenseifner on 2012-02-05 10:15:03 -0600 With the last change of the Description, I only added Option D and defined ...
Originally by htor on 2012-02-05 11:47:31 -0600 Replying to RolfRabenseifner:
Torsten
Originally by RolfRabenseifner on 2012-02-06 04:25:41 -0600 Replying to htor:
Yes, this is a good proposal. I'll change the four locations of "balanced" (all in the overview paragraphs) into "optimized".
I'll change "1.0*gridsizes[i]/dims[i]" into "gridsizes[i]/dims[i](using floating point divide)"
I know, this is not a very strong argument, but for the implementation
This topic has a 3h slot, Wednesday March 7, 9am-12pm.
Rolf
Originally by RolfRabenseifner on 2012-02-06 04:34:46 -0600 Changed "balanced" into "optimized" and ...
Originally by RolfRabenseifner on 2012-02-06 10:05:55 -0600 Attachment added: ...
Originally by RolfRabenseifner on 2012-02-06 10:06:45 -0600 Attachment added: ...
Originally by RolfRabenseifner on 2012-03-06 16:46:27 -0600 In the Chicago meeting, March 2012, we decided: ...
Originally by RolfRabenseifner on 2012-03-06 19:11:16 -0600 Based on a discussion with Bill, I further changed "Those dimensions where ..." into "Those dimensions where ...". It needs re-reading of these parts.
Originally by RolfRabenseifner on 2012-03-07 13:48:15 -0600 Latest changes as discussed in the formal re-reading.
Originally by RolfRabenseifner on 2012-03-07 13:51:56 -0600 Had formal reading in the Chicago meeting, March 2012.
Originally by gropp on 2012-05-29 19:39:28 -0500 There are still too many errors in the description, starting with the fact that the language-independent form has errors: there is an info parameter in the argument description but no info parameter in the argument list; conversely, there is no adapt in the argument descriptions but an adapt in the parameter list.
Originally by jsquyres on 2012-06-20 09:48:16 -0500 1st vote failed in the Japan Forum meeting, May 2012. Moved to "author rework".
Originally by RolfRabenseifner on 2012-06-21 03:07:01 -0500 The vote was yes=2, no=10, abstain=4, missed=0. Did it fail because ...
Corrections are needed in the block with the language-independent definition: ...
The full text description was also correct, but we should remove the canceled text portions.
Originally by balaji on 2009-10-09 16:45:51 -0500
Description
MPI_Dims_create does not take any information about the processes in its parameter list (e.g., there is no communicator parameter). Without this information the function cannot really use any information about the topology of the physical network to optimize the Cartesian layout. (This problem is answered by Alternative A.)
Additionally, MPI_Dims_create is not aware of the application topology. E.g., if the application uses a grid of 1200 x 500 elements and 60 MPI processes (as in ticket #195), then the expected Cartesian process topology should be 12 x 5 (yielding equal 100 x 100 blocks per process) and not 10 x 6 as returned by the existing MPI_Dims_create. (Both problems are answered by solution B.)
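To make this concrete, here is a small illustrative C sketch (not part of the proposal): it calls the existing, grid-unaware MPI_Dims_create for 60 processes and compares the result with the grid-aware 12 x 5 factorization of the 1200 x 500 grid. The 10 x 6 output is the one reported in this ticket; other implementations may return something else.

```c
#include <stdio.h>
#include <mpi.h>

/* Number of grid points one process exchanges with its four neighbors in a
 * 2-D decomposition of a gx x gy grid over px x py processes
 * (1-deep halo, no periodicity, integer block sizes). */
static long halo_points(long gx, long gy, long px, long py)
{
    return 2 * (gx / px) + 2 * (gy / py);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {0, 0};
    MPI_Dims_create(60, 2, dims);              /* grid-unaware: typically 10 x 6 */
    printf("MPI_Dims_create(60,2): %d x %d, halo = %ld points, "
           "500 %% %d = %d (uneven blocks)\n",
           dims[0], dims[1], halo_points(1200, 500, dims[0], dims[1]),
           dims[1], 500 % dims[1]);

    /* Grid-aware choice: 12 x 5 gives equal 100 x 100 blocks per process. */
    printf("grid-aware 12 x 5:     halo = %ld points\n",
           halo_points(1200, 500, 12, 5));

    MPI_Finalize();
    return 0;
}
```

Only a routine that also sees the application grid sizes, as the proposed MPI_DIMS_CREATEV does, can pick the factorization that matches the grid.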
Extended Scope
None.
History
Proposed Solution
MPI-2.2, Chapter 7 (Topologies), 7.4 (Overview), page 247, lines 9-17 read:
MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension.
For each coordinate direction one specifies whether the process structure is periodic or
not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per
coordinate direction. Thus, special support for hypercube structures is not necessary.
The local auxiliary function MPI_DIMS_CREATE
can be used to compute a balanced distribution
of processes among a given number of dimensions.
Rationale. Similar functions are contained in EXPRESS [12] and PARMACS. (End of rationale.)
but should read:
MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension.
For each coordinate direction one specifies whether the process structure is periodic or
not. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per
coordinate direction. Thus, special support for hypercube structures is not necessary.
The local auxiliary function MPI_DIMS_CREATE
and the collective function MPI_DIMS_CREATEV
can be used to compute
~~a balanced~~ __an optimized__ distribution of processes among a given number of dimensions.
Rationale. Similar functions are contained in EXPRESS [12] and PARMACS. (End of rationale.)
MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 248, lines 38-44 read:
7.5.2 Cartesian Convenience Function: MPI_DIMS_CREATE
For Cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.
but should read:
7.5.2 Cartesian Convenience Function__s__ ~~: MPI_DIMS_CREATE~~
For Cartesian topologies, the function__s__ MPI_DIMS_CREATE __and MPI_DIMS_CREATEV__ help~~s~~ the user select ~~a balanced~~ __an optimized__ distribution of processes per coordinate direction, depending on the number of processes in the group to be ~~balanced~~ __optimized__ and optional constraints that can be specified by the user. One use is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.

Add after MPI-2.2, Chapter 7 (Topologies), 7.5.2, page 249, line 35, i.e., after MPI_DIMS_CREATE:
The entries in the array dims are set to describe a Cartesian grid with ndims dimensions and a total number of nodes identical to the size of the group of comm. The caller may constrain the operation of this routine by specifying elements of array dims. If dims[i] is set to a positive number, the routine will not modify the number of nodes in dimension i; only those entries where dims[i] = 0 will be modified. Those dimensions where dims[i] = 0 are set such that the returned values will allow MPI_CART_CREATE to return a communicator where neighbor communication is efficient: all ratios gridsizes[i]/dims[i] (using floating point divide) are as close to each other as possible, using an appropriate divisibility algorithm. The largest ratio divided by the smallest ratio should be minimal. If there are different possibilities for calculating these dims[i], then the solution set with minimal difference between largest and smallest dims[i] should be used. If i<j, gridsizes[i]=gridsizes[j], dims[i]=0, and dims[j]=0 at input, then the output dims shall satisfy this property: dims[i]>=dims[j]. Negative input values of dims[i] are erroneous. An error will occur if the size of the group of comm is not a multiple of the product of all dims[i] with dims[i]>0. The array dims is suitable for use as input to MPI_CART_CREATE.

MPI_DIMS_CREATEV is a collective operation. The arguments comm, ndims, gridsizes, and the (key,value)-pairs stored in info must have identical values on all processes of the group of comm. MPI_DIMS_CREATEV will return in dims the same results on all processes.

The returned result can be additionally influenced by the info argument. Valid info values are implementation-dependent. The constant MPI_INFO_NULL can be used when no hints should be specified. An MPI implementation is not obliged to follow specific hints, and it is valid for an MPI implementation to ignore the hints.
Example 7.2 MPI_DIMS_CREATEV on a system with 12 cores per SMP node and 1 MPI process per core, and the following input values: nnodes=2400, ndims=3, gridsizes=(120,80,250), dims=(0,0,0). Possible output values and their implication for MPI_CART_CREATE and the underlying communication:

| dims after the call | possible cores x nodes | gridpoints per process | gridpoints per node | on 2-dim boundary^1)^ | sum of communicated points |
|---|---|---|---|---|---|
| (12,8,25) | (12x1, 1x8, 1x25) | (10,10,10) | (120,10,10) | (100^3)^, 1200, 1200) | 2400 |
| (12,8,25) | (6x2, 2x4, 1x25) | (10,10,10) | (60,20,10) | (200, 600, 1200) | 2000 |
| (12,8,25) | (3x4, 4x2, 1x25) | (10,10,10) | (30,40,10) | (400, 300, 1200) | 1900 |
| (10,10,24) | (2x5, 2x5, 3x8) | (12,8,11^2)^) | (24,16,33) | (528, 792, 384) | 1704 |
| (10,10,24) | (1x10, 2x5, 6x4) | (12,8,11^2)^) | (12,16,66) | (1056, 792, 192) | 2040 |
| (10,10,24) | (2x5, 1x10, 6x4) | (12,8,11^2)^) | (24,8,66) | (528, 1584, 192) | 2304 |
| (10,10,24) | (1x10, 1x10, 12x2) | (12,8,11^2)^) | (12,8,132) | (1056, 1584, 96) | 2736 |

^1)^ In each dimension, calculated as product of the "gridpoints per node" in the other dimensions.
^2)^ 1.0*250/24 rounded up.
^3)^ No node-to-node communication in this direction because the number of nodes is only 1.
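The C and Fortran bindings mentioned in the comments above are not reproduced in this ticket text. Purely as an illustration, the following sketch shows how the call sequence of Example 7.2 might look in C under an assumed prototype derived from the argument names in the description (comm, ndims, gridsizes, info, dims); the routine name MPI_Dims_createv, the argument order, and the const qualifiers are assumptions, not part of the proposal text.

```c
#include <mpi.h>

/* Assumed prototype, inferred from the description above;
 * not an official MPI binding. */
int MPI_Dims_createv(MPI_Comm comm, int ndims, const int gridsizes[],
                     MPI_Info info, int dims[]);

void decompose_grid(void)
{
    /* Input values of Example 7.2: 2400 processes, application grid
     * 120 x 80 x 250, no constraints on dims. */
    int gridsizes[3] = {120, 80, 250};
    int dims[3]      = {0, 0, 0};
    int periods[3]   = {0, 0, 0};
    MPI_Comm cart;

    /* Collective call: all processes of the communicator pass identical
     * comm, ndims, gridsizes, and info values and receive the same dims. */
    MPI_Dims_createv(MPI_COMM_WORLD, 3, gridsizes, MPI_INFO_NULL, dims);

    /* dims is suitable as direct input to MPI_CART_CREATE,
     * e.g. (12,8,25) or (10,10,24) from the table above. */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);
}
```

Because the call is collective and returns identical dims on all processes, the result can be passed directly to MPI_Cart_create without an extra broadcast.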
Impact on Implementations
The new routine must be implemented.
Impact on Applications / Users
The proposal is source-code and binary (ABI) backward compatible with MPI-2.2.
Applications that currently use their own algorithm for calculating MPI process dims based on application grid dimensions may benefit from this new routine.
Alternative Solutions
None.
Entry for the Change Log
Section 7.5.2 on page 248.
New routine MPI_DIMS_CREATEV.