MPI_Comm_split_type functionality #287

Closed
mpiforumbot opened this issue Jul 24, 2016 · 29 comments

@mpiforumbot
Collaborator

mpiforumbot commented Jul 24, 2016

Originally by balaji on 2011-08-25 19:25:55 -0500


Authors: MPI-3 Hybrid working group

Description

Creating communicators based on platform-specific information, such as shared memory capabilities, can provide several benefits, especially in multi-core environments. We propose to add a new function that splits a communicator based on a user-provided split type.

History

The original proposal for this functionality came from Ron Brightwell and combined communicator creation with shared memory creation. This ticket deals with the communicator creation functionality; the shared memory creation functionality has been spun off into a separate ticket (ticket #284).

Proposed Solution

Define a new call MPI_Comm_split_type that splits a parent communicator based on a split_type argument. See the attached proposal for details.
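
For context, a minimal usage sketch follows, assuming the C binding and the MPI_COMM_TYPE_SHARED constant as they appear in the attached drafts; treat it as an illustration rather than normative text:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm node_comm;
    int world_rank, node_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD into subcommunicators whose members can
       create shared memory with each other. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        0 /* key: keep relative rank order */,
                        MPI_INFO_NULL, &node_comm);

    MPI_Comm_rank(node_comm, &node_rank);
    printf("world rank %d has node-local rank %d\n", world_rank, node_rank);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```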

Impact on Implementations

An implementation that does not expose any shared memory capabilities is trivial. An implementation that does expose a shared-memory communicator is relatively simple as well, since this functionality is already used internally in most MPI implementations. An implementation within MPICH2 is available for reference.
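
To make the "trivial" case concrete, here is a hedged sketch (not the MPICH2 code referenced above; trivial_comm_split_type is a name chosen for illustration) that layers the new call on MPI_Comm_split and exposes no sharing beyond the calling process itself:

```c
#include <mpi.h>

/* Sketch of a minimal, sharing-free implementation: every process is
   placed in its own single-process communicator, which is always a
   correct (if unhelpful) answer for MPI_COMM_TYPE_SHARED. */
static int trivial_comm_split_type(MPI_Comm comm, int split_type, int key,
                                   MPI_Info info, MPI_Comm *newcomm)
{
    int rank;
    (void)info;                      /* info hints are ignored in this sketch */
    MPI_Comm_rank(comm, &rank);

    if (split_type == MPI_COMM_TYPE_SHARED) {
        /* Unique color per rank => one-process communicators. */
        return MPI_Comm_split(comm, rank, key, newcomm);
    }
    /* MPI_UNDEFINED as the color makes MPI_Comm_split return
       MPI_COMM_NULL, covering split_type == MPI_UNDEFINED as well. */
    return MPI_Comm_split(comm, MPI_UNDEFINED, key, newcomm);
}
```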

Impact on Applications / Users

None for current users. The ticket adds a new function in a fully backward-compatible manner.

Alternative Solutions

See past discussions in the Hybrid WG.

Entry for the Change Log

Added MPI_Comm_split_type.

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-08-25 19:26:47 -0500


Attachment added: context-v0.2.pdf (534.9 KiB)
Context chapter draft v0.2

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-08-25 19:29:03 -0500


The current draft of the context chapter, which contains a new routine MPI_Comm_split_type, has been uploaded. This routine provides a mechanism to split a communicator to form subcommunicators on which shared memory can be created.
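
To show how this routine is meant to combine with the shared memory creation functionality in ticket #284, here is a hedged sketch; MPI_Win_allocate_shared comes from that companion ticket, and the helper name allocate_node_shared is purely illustrative:

```c
#include <mpi.h>

/* Sketch: form a subcommunicator of processes that can share memory,
   then allocate a shared window on it (the window call belongs to the
   companion ticket #284 and is shown only to illustrate the pairing). */
void allocate_node_shared(MPI_Comm comm, MPI_Aint bytes,
                          double **local_base, MPI_Win *win,
                          MPI_Comm *node_comm)
{
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, node_comm);

    /* Every member of *node_comm can create shared memory with the others. */
    MPI_Win_allocate_shared(bytes, (int)sizeof(double), MPI_INFO_NULL,
                            *node_comm, local_base, win);
}
```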

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-08-26 02:32:51 -0500


Attachment added: context-v0.3.pdf (537.9 KiB)

@mpiforumbot
Collaborator Author

Originally by dougmill on 2011-08-26 09:03:00 -0500


Some comments on v0.3; we can discuss them at the meeting.

The two predefined types seem a little disjoint. Maybe that is OK, but one relates directly to some concept of "shared" memory, while the other has an implied concept of memory accessible to all ranks (endpoints) in the subcommunicator, and perhaps other implications as well. I think the terms are not quite consistent: it's not really ranks/endpoints that have shared memory, it is the processes/threads that share various types of memory. I guess the endpoints proposal, and threads themselves, confuse the "hidden" relationship between processes/threads and ranks/endpoints.

I'm also thinking that these predefined types may not support a clean extension into the various memory hierarchies that exist in NUMA-like systems, where various groups of processing elements share optimal access to certain regions/types of memory. Of course, NUMA placement is a matter of optimization, not necessity, while the traditional concept of shared memory simply will not work with processes/threads that are outside of the domain (e.g., node). Even on systems that support global shared memory, though, there are performance reasons to keep the concept of a "node-local domain". I'll spend some time trying to think of an alternative set of types.

@mpiforumbot
Collaborator Author

Originally by dougmill on 2011-08-29 07:50:59 -0500


As I thought more about this, I came up with the following domains, where each domain is a subset of the previous one:

GLOBAL:: May not exist in all implementations, unless the platform supports global shared memory.

NODE:: Is this term universal? This would be like POSIX or SYSV "shared memory".

PROCESS:: All threads in a process "share" this memory. This might just be "normal heap".

PROCESSOR_GROUP*:: (better name needed) Subset(s) of threads in a process that share "optimal" access to some memory. This potentially represents multiple nested domains, depending on the platform's NUMA characteristics.

An implementation might add more domains, possibly between GLOBAL and NODE or more likely below PROCESS.

I think the other key point is that the calling thread of MPI_Comm_split_type drives how the domain (type) is processed. I guess that is already said, but the clarification is that it is the thread+endpoint that drives the selection (a thread must be attached to an endpoint to make an MPI call), not just the endpoint (rank). This is a bit unusual in MPI, I think. It becomes most significant when dealing with sub-process memory domains, where clusters of cores have "near memory" and "far memory". Also, a thread might attach to an endpoint that is implicitly associated with a different memory domain, as there is no mechanism for a user to determine which threads share memory domains with which endpoints. Perhaps this points out the need for another function in the endpoints proposal?
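
Purely to illustrate how such a nested hierarchy might be consumed, here is a hypothetical sketch; the MPIX_COMM_TYPE_* constants are placeholders for the GLOBAL/NODE/PROCESS domains above (they are not proposed names), and error handling for domains that do not exist on a given platform is omitted:

```c
#include <mpi.h>

/* Placeholder values, for illustration only; an implementation defining
   these domains would supply its own constants. */
#define MPIX_COMM_TYPE_GLOBAL  1001
#define MPIX_COMM_TYPE_NODE    1002
#define MPIX_COMM_TYPE_PROCESS 1003

/* Hypothetical: split the parent into successively narrower memory
   domains, each a subset of the previous one, as described above. */
void split_memory_domains(MPI_Comm *global_comm, MPI_Comm *node_comm,
                          MPI_Comm *process_comm)
{
    MPI_Comm_split_type(MPI_COMM_WORLD, MPIX_COMM_TYPE_GLOBAL, 0,
                        MPI_INFO_NULL, global_comm);  /* global shared memory, if any */
    MPI_Comm_split_type(*global_comm, MPIX_COMM_TYPE_NODE, 0,
                        MPI_INFO_NULL, node_comm);    /* POSIX/SYSV-style node scope */
    MPI_Comm_split_type(*node_comm, MPIX_COMM_TYPE_PROCESS, 0,
                        MPI_INFO_NULL, process_comm); /* threads/endpoints of the calling process */
}
```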

@mpiforumbot
Collaborator Author

Originally by jhammond on 2011-08-29 14:04:42 -0500


Hi Doug,

Anything besides MPI_COMM_TYPE_SHM and MPI_COMM_TYPE_PROCESS seems to be what is referred to by this:

"Advice to implementors. Implementations can define their own types, or use the info argument, to assist in creating communicators that help expose platform-specific information to the application."

It will be impossible to clearly define all possible flavors of MPI_COMM_TYPE_NUMA_DOMAIN one can imagine. We can create even more mud by trying to define MPI_COMM_TYPE_ACCELERATOR (Note: I think this is a terrible idea).

It seems that MPI_COMM_TYPE_PROCESS might be slightly harder to define if MPI ranks are threads, not processes. Do we really mean MPI_COMM_TYPE_PROCESS or do we mean MPI_COMM_TYPE_RANK? Isn't the goal to have a communicator for endpoints associated with the same rank?

Best,

Jeff

Replying to dougmill:

As I thought more about this, I came up with the following domains, where each domain is a subset of the previous one:

GLOBAL:: May not exist in all implementations, unless the platform supports global shared memory.

NODE:: Is this term universal? This would be like POSIX or SYSV "shared memory".

PROCESS:: All threads in a process "share" this memory. This might just be "normal heap".

PROCESSOR_GROUP*:: (better name needed) Subset(s) of threads in a process that share "optimal" access to some memory. This potentially represents multiple nested domains, depending on the platform's NUMA characteristics.

An implementation might add more domains, possibly between GLOBAL and NODE or more likely below PROCESS.

I think the other key point is that the calling thread of MPI_Comm_split_type drives how the domain (type) is processed. I guess that is already said, but the clarification is that it is the thread+endpoint that drives the selection (a thread must be attached to an endpoint to make an MPI call), not just the endpoint (rank). This is a bit unusual in MPI, I think. It becomes most significant when dealing with sub-process memory domains, where clusters of cores have "near memory" and "far memory". Also, a thread might attach to an endpoint that is implicitly associated with a different memory domain, as there is no mechanism for a user to determine which threads share memory domains with which endpoints. Perhaps this points out the need for another function in the endpoints proposal?

@mpiforumbot
Collaborator Author

Originally by dougmill on 2011-08-29 14:23:43 -0500


I was not suggesting we define all types here, just that we keep the names consistent and think enough about how implementers might extend the mechanism to be sure that it works.

I do not like "SHM" and "PROCESS", because they do not seem consistent. "NODE" and "PROCESS" seemed more consistent to me. Maybe I just need to know what these names refer to (what are the "units"). I was thinking "scope" of memory, so node-scoped and process-scoped seemed natural.

@mpiforumbot
Collaborator Author

Originally by jhammond on 2011-08-29 14:36:42 -0500


NODE is problematic in the case where we can do load-store across the network, à la SGI or Cray. Does it make sense to say that such a machine has only one node if one can use load-store across the entire machine?

Does it make sense to say MPI_COMM_TYPE_PHYS_ADDR and MPI_COMM_TYPE_VIRT_ADDR, meaning the sets of ranks that can directly access the same physical and virtual address spaces, respectively? Ron B. is probably going to pwn this noob now :-)

@mpiforumbot
Collaborator Author

Originally by dougmill on 2011-08-29 14:40:16 -0500


I was thinking that the SGI/Cray situation was more of a GLOBAL-scoped memory, i.e., yet another type that is implementation defined. Don't these systems still have NODE-scoped memory, which I'd imagine is more efficient than GLOBAL?

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-08-30 13:39:22 -0500


Attachment added: context-v0.4.pdf (534.9 KiB)

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-08-30 13:40:57 -0500


I've attached a new version of the proposal with the changes discussed during the telecon today. Please read through the proposal to make sure everything looks OK.

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-05 09:57:21 -0500


Attachment added: context-v0.5.pdf (534.9 KiB)

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-05 10:03:16 -0500


I've attached a new version of the proposal based on the Forum feedback in Santorini.

@mpiforumbot
Collaborator Author

Originally by jdinan on 2011-10-11 11:25:32 -0500


Attachment added: fulldoc-v0.5.pdf (2287.3 KiB)
This draft includes the full spec and adds the constants definition that was omitted in the chapter-only version.

@mpiforumbot
Collaborator Author

Originally by jsquyres on 2011-10-27 09:03:00 -0500


Pavan -- what's the implementation status of this proposal?

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-27 10:22:16 -0500


We are working on the implementation and will have it ready before the second vote.

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-27 16:57:35 -0500


Attachment added: fulldoc-v0.6.pdf (3654.6 KiB)

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-27 17:02:54 -0500


Attached a new draft with a few ticket 0 changes.

These two were suggested at the Forum meeting.

  1. Fixed a typo in the word "assigment" --> "assignment".
  2. Changed "Communicator type constants" --> "Communicator split type constants".

The following change still needs to be discussed in the working group.

  1. Removed the reference to MPI_WIN_ALLOCATE_SHARED since that is a separate ticket.

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-27 17:04:46 -0500


In the above changes suggested at the Forum, I forgot to mention that I included the ChangeLog entry as well.

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-27 17:07:09 -0500


An implementation of the MPI_Comm_split_type function is now publicly available in MPICH2 here: http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/ (please use r9071 or higher).
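
For anyone trying that build, a small sanity test along the following lines (this program is an illustration, not part of the MPICH2 distribution) prints the node-local rank and size next to the processor name, so co-located ranks can be checked by eye:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char name[MPI_MAX_PROCESSOR_NAME];
    int len, world_rank, node_rank, node_size;
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(name, &len);

    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Ranks reporting the same processor name should also report the
       same node_size if the split matches the hardware layout. */
    printf("%s: world %d -> node-local %d of %d\n",
           name, world_rank, node_rank, node_size);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```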

@mpiforumbot
Collaborator Author

Originally by balaji on 2011-10-30 15:39:15 -0500


Updated the ticket description based on Rolf's suggestions at the Forum.

@mpiforumbot
Collaborator Author

Originally by gropp on 2012-05-29 00:25:03 -0500


I assume that this does not go into the context chapter. If it does, I need a diff of the context chapter LaTeX file; it isn't feasible to work with the nearly 700-page full draft to extract the update.

@mpiforumbot
Collaborator Author

Originally by balaji on 2012-05-30 02:15:22 -0500


This is supposed to go into the context chapter. I was asked to commit the changes into the approved branch. I'll do that soon and send out a note.

@mpiforumbot
Collaborator Author

Originally by RolfRabenseifner on 2012-06-28 15:12:03 -0500


appLang committed (SVN 1206)

@mpiforumbot
Collaborator Author

Originally by jsquyres on 2012-07-03 09:09:48 -0500


Rolf: the change log currently reads:

Added MPI_COMM_SPLIT_TYPE function and the communicator split type constand MPI_COMM_TYPE_SHARED.

But should read:

Added MPI_COMM_SPLIT_TYPE function and the communicator split type constan**t** MPI_COMM_TYPE_SHARED.

I committed the fix.

@mpiforumbot
Collaborator Author

Originally by RolfRabenseifner on 2012-07-14 01:18:14 -0500


Bill already committed text to context.tex (before svn r1280).
I committed the new Fortran binding for MPI_COMM_SPLIT_TYPE in context.tex in svn r1281.

@mpiforumbot
Collaborator Author

Originally by gropp on 2012-07-18 14:23:54 -0500


As noted, already committed to context chapter.

@mpiforumbot
Collaborator Author

Originally by buntinas on 2012-07-18 16:28:07 -0500


Reviewed PDF.
-d

@mpiforumbot
Collaborator Author

Originally by RolfRabenseifner on 2013-01-07 11:42:51 -0600


Since Sep. 21, 2012, this ticket has been included in MPI-3.0, and the PDF has been checked according to https://svn.mpi-forum.org/svn/mpi-forum-docs/trunk/meetings/2012-07-jul/mpi3-tickets.xlsx

Therefore, I set the priority to "Ticket complete".
