MPI_Comm_split_type functionality #287
Comments
Originally by balaji on 2011-08-25 19:26:47 -0500 Attachment added.
Originally by balaji on 2011-08-25 19:29:03 -0500 The current draft of the context chapter, which contains a new routine MPI_Comm_split_type, has been uploaded. This routine provides a mechanism to split a communicator to form subcommunicators on which shared memory can be created.
Originally by balaji on 2011-08-26 02:32:51 -0500 Attachment added.
Originally by dougmill on 2011-08-26 09:03:00 -0500 Some comments on v0.3; we can discuss at the meeting.

The two predefined types seem a little disjoint. Maybe that is OK, but one relates directly to some concept of "shared" memory, while the other has an implied concept of memory accessible to all ranks (endpoints) in the subcommunicator, and perhaps other implications as well. I think the terms are not quite consistent: it is not really ranks/endpoints that have shared memory, it is the processes/threads that share various types of memory. The endpoints proposal, and threads themselves, confuse the "hidden" relationship between processes/threads and ranks/endpoints.

I'm thinking that these predefined types may not support a clean extension into the various memory hierarchies that exist in NUMA-like systems, where various groups of processing elements share optimal access to regions/types of memory. Of course, NUMA placement is a matter of optimization, not necessity, while the traditional concept of shared memory simply will not work with processes/threads that are outside of the domain (e.g. the node). Even on systems that support global shared memory, though, there are performance reasons to keep the concept of a "node-local domain". I'll spend some time trying to think of an alternative set of types.
Originally by dougmill on 2011-08-29 07:50:59 -0500 As I thought more about this, I came up with the following domains, where each domain is a subset of the previous one:
- GLOBAL: may not exist in all implementations, unless the platform supports global shared memory.
- NODE: is this term universal? This would be like POSIX or SysV "shared memory".
- PROCESS: all threads in a process "share" this memory. This might just be "normal heap".
- PROCESSOR_GROUP* (better name needed): subset(s) of threads in a process that share "optimal" access to some memory. This potentially represents multiple nested domains, depending on the platform's NUMA characteristics.

An implementation might add more domains, possibly between GLOBAL and NODE, or more likely below PROCESS.

I think the other key point is that the calling thread of MPI_Comm_split_type drives how the domain (type) is processed. I guess that is already said, but the clarification is that it is the thread+endpoint that drives the selection (a thread must be attached to an endpoint to make an MPI call), not just the endpoint (rank). This is a bit unusual in MPI, I think. It becomes most significant when dealing with sub-process memory domains, where clusters of cores have "near memory" and "far memory". Also, a thread might attach to an endpoint that is implicitly associated with a different memory domain, as there is no mechanism for a user to determine which threads share memory domains with which endpoints. Perhaps this points out the need for another function in the endpoints proposal?
Originally by jhammond on 2011-08-29 14:04:42 -0500 Hi Doug, Anything besides MPI_COMM_TYPE_SHM and MPI_COMM_TYPE_PROCESS seems to be what is referred to by this: "Advice to implementors. Implementations can define their own types, or use the info argument, to assist in creating communicators that help expose platform-specific information to the application." It will be impossible to clearly define all possible flavors of MPI_COMM_TYPE_NUMA_DOMAIN one can imagine. We could create even more mud by trying to define MPI_COMM_TYPE_ACCELERATOR (note: I think this is a terrible idea). It seems that MPI_COMM_TYPE_PROCESS might be slightly harder to define if MPI ranks are threads, not processes. Do we really mean MPI_COMM_TYPE_PROCESS, or do we mean MPI_COMM_TYPE_RANK? Isn't the goal to have a communicator for endpoints associated with the same rank? Best, Jeff Replying to dougmill:
Originally by dougmill on 2011-08-29 14:23:43 -0500 I was not suggesting we define all types here, just that we have some consistency about the names and ensure that we've thought enough about how some implementers might extend it to be sure that it works. I do not like "SHM" and "PROCESS", because they do not seem consistent. "NODE" and "PROCESS" seemed more consistent to me. Maybe I just need to know what these names refer to (what are the "units"). I was thinking "scope" of memory, so node-scoped and process-scoped seemed natural.
Originally by jhammond on 2011-08-29 14:36:42 -0500 NODE is problematic in the case where we can do load-store across the network à la SGI or Cray. Does it make sense to say that such a machine has only one node if one can use load-store across the entire machine? Does it make sense to say MPI_COMM_TYPE_PHYS_ADDR and MPI_COMM_TYPE_VIRT_ADDR, meaning the sets of ranks that can directly access the same physical and virtual address spaces, respectively? Ron B. is probably going to pwn this noob now :-)
Originally by dougmill on 2011-08-29 14:40:16 -0500 I was thinking that the SGI/Cray situation was more of a GLOBAL-scoped memory, yet another type that is implementation defined. Don't these systems still have NODE-scoped memory, which I'd imagine is more efficient than GLOBAL?
Originally by balaji on 2011-08-30 13:39:22 -0500 Attachment added.
Originally by balaji on 2011-08-30 13:40:57 -0500 I've attached a new version of the proposal with the changes discussed during the telecon today. Please read through the proposal to make sure everything looks OK.
Originally by balaji on 2011-10-05 09:57:21 -0500 Attachment added.
Originally by balaji on 2011-10-05 10:03:16 -0500 I've attached a new version of the proposal based on the Forum feedback in Santorini.
Originally by jdinan on 2011-10-11 11:25:32 -0500 Attachment added.
Originally by jsquyres on 2011-10-27 09:03:00 -0500 Pavan -- what's the implementation status of this proposal?
Originally by balaji on 2011-10-27 10:22:16 -0500 We are working on the implementation and will have it ready before the second vote.
Originally by balaji on 2011-10-27 16:57:35 -0500 Attachment added.
Originally by balaji on 2011-10-27 17:02:54 -0500 Attached a new draft with a few ticket 0 changes. These two were suggested at the Forum meeting.
This change needs to be discussed in the working group.
Originally by balaji on 2011-10-27 17:04:46 -0500 In the above changes suggested at the Forum, I forgot to mention that I included the ChangeLog entry as well.
Originally by balaji on 2011-10-27 17:07:09 -0500 An implementation of the MPI_Comm_split_type function is now publicly available in MPICH2 here: http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/ (please use r9071 or higher).
Originally by balaji on 2011-10-30 15:39:15 -0500 Updated the ticket description based on Rolf's suggestions at the Forum.
Originally by gropp on 2012-05-29 00:25:03 -0500 I assume that this does not go into the context chapter. If it does, I need a diff of the context chapter LaTeX file; it isn't feasible to work with the nearly 700-page full draft to extract the update.
Originally by balaji on 2012-05-30 02:15:22 -0500 This is supposed to go into the context chapter. I was asked to commit the changes into the approved branch. I'll do that soon and send out a note.
Originally by RolfRabenseifner on 2012-06-28 15:12:03 -0500 appLang committed (SVN 1206)
Originally by jsquyres on 2012-07-03 09:09:48 -0500 Rolf: the change log currently reads:
But should read:
I committed the fix.
Originally by RolfRabenseifner on 2012-07-14 01:18:14 -0500 Bill already committed text to context.tex (before svn r1280).
Originally by gropp on 2012-07-18 14:23:54 -0500 As noted, already committed to the context chapter.
Originally by buntinas on 2012-07-18 16:28:07 -0500 Reviewed PDF.
Originally by RolfRabenseifner on 2013-01-07 11:42:51 -0600 Since Sep. 21, 2012, this ticket has been included in MPI-3.0, and the PDF has been checked according to https://svn.mpi-forum.org/svn/mpi-forum-docs/trunk/meetings/2012-07-jul/mpi3-tickets.xlsx Therefore, I set the priority to "Ticket complete".
Originally by balaji on 2011-08-25 19:25:55 -0500
Authors: MPI-3 Hybrid working group
Description
Creating communicators based on platform-specific information, such as shared memory capabilities, can provide several benefits, especially in multi-core environments. We propose to add a new function to split a communicator based on a user-provided split type.
History
The original proposal for this functionality came from Ron Brightwell and combined the communicator creation functionality with shared memory creation. This ticket deals with the communicator creation functionality; the shared memory creation functionality has been spun off into a new ticket (ticket #284).
Proposed Solution
Define a new call MPI_Comm_split_type that splits a parent communicator based on a split_type argument. See the attached proposal for details.
Impact on Implementations
An implementation that does not expose any shared memory capabilities is trivial. An implementation that does expose a shared-memory communicator is relatively simple as well, since this functionality is already used internally in most MPI implementations. An implementation within MPICH2 is available for reference.
Impact on Applications / Users
None for current users. Adds a new function in a fully backward-compatible manner.
Alternative Solutions
See past discussions in the Hybrid WG.
Entry for the Change Log
Added MPI_Comm_split_type.