
Big MPI---point-to-point considerations (MPI_Rank only) #97

Closed
tonyskjellum opened this issue Jun 14, 2018 · 17 comments

5 participants

@tonyskjellum commented Jun 14, 2018

Problem

For 64-bit clean functionality, convenience, and symmetry, the Big MPI principles being applied in Ticket #80 to collective operations should be applied to MPI more widely. In this case, we consider the idea that you might want more than 2^31 MPI ranks, hence needing a new data type, MPI_Rank.

Proposal

MPI needs to be 64-bit clean throughout.

Changes to the Text

MPI_Rank will replace int for ranks, supporting more than 2^31 MPI processes in a communicator.

A separate ticket considers MPI_Count and miscellaneous concerns for point-to-point.

Impact on Implementations

No current API is impacted; new _X APIs will be needed for all affected point-to-point operations.

MPI implementations will have to be 64-bit clean inside since count*extent > 2^31 is already problematic for some implementations. New APIs will have to be added and the internals of MPI will have to be 64-bit capable for buffers and related issues.

Impact on Users

Users who opt in to the new API will be able to have communicators larger than 2^31 ranks (via MPI_Rank).

References

See also Ticket #80, #98, #99, #100

@jeffhammond (Member) commented Jun 14, 2018

I vigorously object to MPI_Rank. At the very least, this has nothing to do with the original problem and proposed solutions.

Furthermore, you need to propose new versions of a large number of functions that have nothing to do with large-count support. Below is a partial list.

  • MPI_Comm_rank
  • MPI_Comm_size
  • MPI_Group_translate_ranks
  • MPI_Group_incl
  • MPI_Group_excl
  • MPI_Group_range_incl
  • MPI_Group_range_excl
  • MPI_Cart_coords
  • MPI_Graph_neighbors_count
  • MPI_Graph_neighbors
  • MPI_Cart_shift
  • MPI_Win_lock and MPI_Win_unlock
  • MPI_Win_shared_query
  • MPI_Win_flush and MPI_Win_flush_local
  • MPI_Type_create_darray

Please make the MPI_Rank proposal separate. If you insist on adding it to the large-count stuff, you will likely tank the whole thing, because I'm going to lobby full-time against anything that includes MPI_Rank.

If somebody builds a system that needs more than 2147483648 ranks, it is not unreasonable to expect them to move to ILP64, such that INT_MAX is sufficiently large and no MPI changes are required.

@jeffhammond (Member) commented Jun 14, 2018

Honestly, MPI_Rank is a reason to break backwards compatibility, because it will break the ABI if it is implemented in the intended manner (something wider than int).

You really need to consider how badly you want to break every piece of MPI software in the world today and if the nonsensical possibility of a machine that effectively supports more than 2147483648 ranks is worth it.

MPICH

/* The order of these elements must match that in mpif.h, mpi_f08_types.f90,
   and mpi_c_interface_types.f90 */
typedef struct MPI_Status {
    int count_lo;
    int count_hi_and_cancelled;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} MPI_Status;

Open-MPI

struct ompi_status_public_t {
    /* These fields are publicly defined in the MPI specification.
       User applications may freely read from these fields. */
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
    /* The following two fields are internal to the Open MPI
       implementation and should not be accessed by MPI applications.
       They are subject to change at any time.  These are not the
       droids you're looking for. */
    int _cancelled;
    size_t _ucount;
};
typedef struct ompi_status_public_t ompi_status_public_t;
@jeffhammond (Member) commented Jun 14, 2018

You will also need to revise the matching rules for how this works when users send using MPI_Rank and probe or recv using int. What is the semantic of this? Return an error code and expect the user to try again with the new function?

@tonyskjellum (Author) commented Jun 14, 2018

Jeff, we will consider that.

Process: Martin specifically asked that we consider the MPI_Rank option. We will specifically split the vote in a way to allow the forum to accept/reject this change for MPI-4. It was pointed out that endpoints and GPU-like devices and fine-grain accelerators could yield > 2^31 ranks in a communicator.

Technical: The problem with heterogeneous use of the APIs will have to be fully considered, as you say. It seems that, to allow this, protocols will have to carry an extra 32 bits of rank space. (Not my favorite answer; it's my straw answer. That means a tax on current performance.)

@tonyskjellum (Author) commented Jun 14, 2018

How about if I split this ticket now --- a) MPI_Count + miscellaneous; b) MPI_Rank ?

@jdinan commented Jun 14, 2018

There are several constants that can be passed through rank arguments -- e.g. MPI_ANY_SOURCE and MPI_PROC_NULL. We would potentially need to introduce _X versions of these constants or require implementations to define them in a way that is compatible with both int and MPI_Rank (in both C and Fortran language interfaces).

@tonyskjellum tonyskjellum changed the title Big MPI---point-to-point considerations Big MPI---point-to-point considerations (MPI_Rank only) Jun 14, 2018

@hjelmn commented Jun 14, 2018

I maintain my view that we should just break backwards compatibility in MPI-4.0. Yes, this will require a period of time where MPI implementors have an MPI-3.x release and an MPI-4.x+ release but it would be worth it to avoid having _x and _with_info versions all over the place.

@jeffhammond (Member) commented Jun 14, 2018

@hjelmn Note that this will not only break ABI compatibility but also user code that uses int for ranks. For example, MPI_Group{_range}_[in,ex]cl and MPI_Group_translate_ranks take vectors of int in C. Any user code that passes arrays of ints to these functions will likely segfault if MPI_Rank is more than 32 bits.

As out-of-bounds array accesses are undefined behavior in C, your proposal not only breaks applications in practice but also causes them to violate the base language in which they are written.

The only reasonable thing to do here is expect ILP64 support if more than 2Bi ranks are required.

@jeffhammond (Member) commented Jun 14, 2018

Any tickets proposing to add MPI_Rank or break ABI compatibility must be accompanied by a production-grade implementation that demonstrates successful execution on more than 2Bi ranks to know that the proposal is technically sound.

@hjelmn commented Jun 14, 2018

MPI does not define ABI compatibility so I am not concerned about that.

@hjelmn commented Jun 14, 2018

The way I see it, if we break the API, users will have to modify their code for MPI-4.0. All the changes will be simple to make but will take some work. That's why I imagine that a high-quality implementation will provide an MPI-3.x layer during some transition period.

@hjelmn commented Jun 14, 2018

Not at all. My preference is that we will continue to support an older version for a little longer than usual and drop the MPI-3.x API in the new releases.

This happens all the time in the software world. MPI's API has issues. We should fix it the right way now and be done with it. None of this _foo nonsense.

@hjelmn commented Jun 18, 2018

I'm not even remotely saying we break the API for ranks. I'm just saying that if we break it because of info, counts, etc., we might as well change ranks as well.

@tonyskjellum (Author) commented May 31, 2019

I am closing this ticket for now; it is highly controversial and would distract from the rest of the Big MPI activities.
