What features do users need from an MPI C++ interface? #288

Open
jeffhammond opened this issue Apr 24, 2020 · 61 comments
Labels
mpi-5 (For inclusion in the MPI 5.0 standard), needs guidance (Needs guidance on what chapter committees need to do)

Comments

@jeffhammond
Member

This is a meta-issue, which I am creating to capture user feedback on MPI C++ bindings.

I am moving this over from https://scicomp.stackexchange.com/questions/7978/what-features-do-users-need-from-an-mpi-c-interface, which was extremely well-received despite not complying with the rules of StackExchange.

Original Prompt

The 3.0 version of the MPI standard formally deleted the C++ interface (it was previously deprecated). While implementations may still support it, features that are new in MPI-3 do not have a C++ interface defined in the MPI standard. See http://blogs.cisco.com/performance/the-mpi-c-bindings-what-happened-and-why/ for more information.

The motivation for removing the C++ interface from MPI was that it had no significant value over the C interface. There were very few differences other than "s/_/::/g" and many features that C++ users are accustomed to were not employed (e.g. automatic type determination via templates).

As someone who participates in the MPI Forum and works with a number of C++ projects that have implemented their own C++ interface to the MPI C functions, I would like to know what are the desirable features of a C++ interface to MPI. While I commit to nothing, I would be interested in seeing the implementation of a standalone MPI C++ interface that meets the needs of many users.

And yes, I am familiar with Boost::MPI but it only supports MPI-1 features and the serialization model would be extremely difficult to support for RMA.

One C++ interface to MPI that I like is that of Elemental's MPI wrapper, so perhaps people can provide some pros and cons w.r.t. that approach. In particular, I think MpiMap solves an essential problem.

@jeffhammond jeffhammond self-assigned this Apr 24, 2020
@jeffhammond
Member Author

Wolfgang Bangerth provided the following response (https://scicomp.stackexchange.com/a/7991/150):

Let me first answer why I think C++ interfaces to MPI have generally not been overly successful, having thought about the issue for a good long time when trying to decide whether we should just use the standard C bindings of MPI or build on something at a higher level:

When you look at real-world MPI codes (say, PETSc, or in my case deal.II), one finds that, maybe surprisingly, the number of MPI calls isn't actually very large. For example, in the 500k lines of deal.II, there are only ~100 MPI calls. A consequence of this is that the pain involved in using lower-level interfaces such as the MPI C bindings is not too large. Conversely, one would not gain all that much by using higher-level interfaces.

My second observation is that many systems have multiple MPI libraries installed (different MPI implementations, or different versions). This poses a significant difficulty if you want to use packages such as boost::mpi that don't just consist of header files: either there need to be multiple installations of this package as well, or one needs to build it as part of the project that uses boost::mpi (but that's a problem in itself again, given that boost uses its own build system, which is unlike anything else).

So I think all of this has conspired against the current crop of C++ interfaces to MPI: The old MPI C++ bindings didn't offer any advantage, and external packages had difficulties with the real world.

This all said, here's what I think would be the killer features I would like to have from a higher-level interface:

  • It should be generic. Having to specify the data type of a variable is decidedly not C++-like. Of course, it also leads to errors. Elemental's MpiMap class would already be a nice first step (though I can't figure out why the heck the MpiMap::type variable isn't static const, so that it can be accessed without creating an object).

  • It should have facilities for streaming arbitrary data types.

  • Operations that require an MPI_Op argument (e.g., reductions) should integrate nicely with C++'s std::function interface, so that it's easy to just pass a function pointer (or a lambda!) rather than having to clumsily register something.

boost::mpi actually satisfies all of these. I think if it were a header-only library, it'd be a lot more popular in practice. It would also help if it supported post-MPI 1.0 functions, but let's be honest: this covers most of what we need most of the time.
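To make the genericity point from the first bullet concrete, here is a minimal sketch of an MpiMap-style type-to-datatype mapping (illustrative only, not Elemental's actual code; a static member function is used because MPI_Datatype handles are not guaranteed to be constant expressions across implementations):

#include <mpi.h>
#include <complex>

template <typename T> struct MpiMap;  // left undefined for unsupported types

template <> struct MpiMap<int>    { static MPI_Datatype type() { return MPI_INT; } };
template <> struct MpiMap<float>  { static MPI_Datatype type() { return MPI_FLOAT; } };
template <> struct MpiMap<double> { static MPI_Datatype type() { return MPI_DOUBLE; } };
template <> struct MpiMap<std::complex<double>> {
  static MPI_Datatype type() { return MPI_CXX_DOUBLE_COMPLEX; }
};

// With such a trait, a generic wrapper no longer needs the caller to spell out
// the datatype:
template <typename T>
int typed_send(const T* buf, int count, int dest, int tag, MPI_Comm comm) {
  return MPI_Send(buf, count, MpiMap<T>::type(), dest, tag, comm);
}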

@jeffhammond
Member Author

jeffhammond commented Apr 24, 2020

@gnzlbg provided the following response (https://scicomp.stackexchange.com/a/14640/150):

My list in no particular order of preference. The interface should:

  • be header only, without any dependencies but <mpi.h>, and the standard library,
  • be generic and extensible,
  • be non-blocking only (if you want to block, then block explicitly, not by default),
  • allow continuation-based chaining of non-blocking operations,
  • support extensible and efficient serialization (Boost.Fusion like, such that it works with RMA),
  • have zero abstraction penalty (i.e. be at least as fast as the C interface),
  • be safe (the destructor of a non-ready future is called? -> std::terminate!),
  • have a strong DEBUG mode with tons of assertions,
  • be extremely type-safe (no more ints/void* for everything, heck I want tags to be types!),
  • work with lambdas (e.g. all reduce + lambda),
  • use exceptions consistently as error-reporting and error-handling mechanism (no more error codes! no more function output arguments!),
  • MPI-IO should offer a non-blocking I/O interface in the style of Boost.AFIO,
  • and just follow good modern C++ interface design practices (define regular types, non-member non-friend functions, play well with move semantics, support range operations, ...)

Extras:

  • allow me to choose the executor of the MPI environment, that is, which thread pool it uses. Right now you can have applications with a mix of OpenMP, MPI, CUDA, and TBB... all at the same time, where each runtime thinks it owns the environment and thus asks the operating system for threads every time it feels like it. Seriously?

  • use the STL (and Boost) naming convention. Why? Every C++ programmer knows it.

I want to write code like this:

    auto buffer = some_t{no_ranks};
    auto future = gather(comm, root(comm), my_offsets, buffer)
                  .then([&](){
                    /* when the gather is finished, this lambda will 
                       execute at the root node, and perform an expensive operation
                       there asynchronously (compute data required for load 
                       redistribution) whose result is broadcasted to the rest 
                       of the communicator */
                    return broadcast(comm, root(comm), buffer);
                  }).then([&]() {
                    /* when broadcast is finished, this lambda executes 
                       on all processes in the communicator, performing an expensive
                       operation asynchronously (redistribute the load, 
                       maybe using non-blocking point-to-point communication) */
                     return do_something_with(buffer);
                  }).then([&](auto result) {
                     /* finally perform a reduction on the result to check
                        everything went fine */
                     return all_reduce(comm, root(comm), result, 
                                      [](auto acc, auto v) { return acc && v; }); 
                  }).then([&](auto result) {
                      /* check the result at every process */
                      if (result) { return; /* we are done */ }
                      else {
                        root_only([](){ write_some_error_log(); });
                        throw some_exception;
                      }
                  });

    /* Here nothing has happened yet! */
 
    /* ... lots and lots of unrelated code that can execute concurrently 
       and overlaps with communication ... */

    /* When we now call future.get() we will block 
       on the whole chain (which might have finished by then!).
    */
    
    future.get();

Think how one could chain all these operations using MPI C's requests. You would have to test at multiple (or every single) intermediate steps, scattered through a whole lot of unrelated code, to see if you can advance your chain without blocking.
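For contrast, here is a minimal sketch (not part of the original answer) of what chaining just two dependent collectives looks like with plain MPI requests; every stage has to be polled explicitly before the next one can be posted:

#include <mpi.h>

void gather_then_bcast(const double* sendbuf, double* recvbuf, int n, int nranks,
                       int root, MPI_Comm comm) {
  MPI_Request req;
  int done = 0;

  MPI_Igather(sendbuf, n, MPI_DOUBLE, recvbuf, n, MPI_DOUBLE, root, comm, &req);
  while (!done) {
    // ... interleave unrelated work here to overlap with communication ...
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
  }

  // Only once the gather has completed can the dependent broadcast be posted,
  // and the same polling pattern repeats for every further stage of the chain.
  MPI_Ibcast(recvbuf, n * nranks, MPI_DOUBLE, root, comm, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
}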

@jeffhammond
Member Author

GradGuy provided the following response (https://scicomp.stackexchange.com/a/8009/150):

Personally, I don't really mind calling long C-style functions for the exact reason Wolfgang mentioned; there are really few places you need to call them and even then, they almost always get wrapped around by some higher-level code.

The only things that really bother me with C-style MPI are custom datatypes and, to a lesser degree, custom operations (because I use them less often). As for custom datatypes, I'd say that a good C++ interface should be able to support a generic and efficient way of handling them, most probably through serialization. This is of course the route that boost.mpi has taken, which, if you are careful, is a big time saver.

As for boost.mpi having extra dependencies (particularly boost.serialization, which itself is not header-only), I've recently come across a header-only C++ serialization library called cereal which seems promising; granted, it requires a C++11-compliant compiler. It might be worth looking into and using as a basis for something similar to boost.mpi.
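For illustration, a small sketch of what that cereal-based route could look like (the Particle type and pack helper are made up for this example; the archive calls are cereal's real API):

#include <cereal/archives/binary.hpp>
#include <cereal/types/vector.hpp>
#include <sstream>
#include <string>
#include <vector>

struct Particle {
  double x, y, z;
  std::vector<int> neighbors;

  // The user annotates a type once; a wrapper library can then pack it into a
  // contiguous buffer and send the bytes with MPI_BYTE.
  template <class Archive>
  void serialize(Archive& ar) { ar(x, y, z, neighbors); }
};

std::string pack(const Particle& p) {
  std::ostringstream os;
  {
    cereal::BinaryOutputArchive ar(os);
    ar(p);
  }
  return os.str();  // bytes suitable for MPI_Send(..., MPI_BYTE, ...)
}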

@jeffhammond
Member Author

Utkarsh Bhardwaj provided the following response (https://scicomp.stackexchange.com/a/25094/150):

The github project easyLambda provides a high level interface to MPI with C++14.

I think the project has similar goals, and it will give some idea of things that can be and are being done in this area using modern C++, guiding other efforts as well as easyLambda itself.

The initial benchmarks on performance and lines of code have shown promising results.


Following is a short description of features and interface it provides.

The interface is based on data-flow programming and functional list operations that provide inherent parallelism. The parallelism is expressed as a property of a task. The process allocation and data distribution for the task can be requested with a .prll() property. There are a good number of examples on the webpage and in the code repository, including LAMMPS molecular dynamics post-processing, an explicit finite difference solution to the heat equation, logistic regression, etc. As an example, the heat diffusion problem discussed in the article "HPC is dying..." can be expressed in ~20 lines of code.

I hope it is fine to give links rather than adding more details and example codes here.

Disclaimer: I am the author of the library. I believe I am not doing any harm in hoping to get constructive feedback on the current interface of easyLambda that might be advantageous to easyLambda and any other project that pursues similar goals.

@mhoemmen

Given how fast the C++ Standard is moving with respect to thread and task parallelism, coroutines, networking, and reflection, it seems premature to standardize a C++ MPI interface now. Why not let all these great libraries first build experience presenting a modern C++ interface to the latest MPI features? Why repeat the mistake of the '90s and rush to standardize? I would love for someone to modernize Boost.MPI, for example; I would be happy to help with that (at least to test changes).

If we want gather(...).then(...).then(...)...., then why not build on the C++ networking TS? If we worry about thread interactions, then why not wait on (or participate in) an executors-networking merger? I can guess some reasons why, but I would expect an MPI proposal to answer questions like that.

Regarding a header-only library: this sounds good if you're starting a new project, but some existing C++ projects that use MPI care a lot about build sizes and times. If we want to put something in the MPI Standard, I'd like to see some build experiments in real applications.

@mhoemmen

Wolfgang Bangerth wrote:

My second observation is that many systems have multiple MPI libraries installed (different MPI implementations, or different versions). This poses a significant difficulty if you want to use packages such as boost::mpi that don't just consist of header files: either there need to be multiple installations of this package as well, or one needs to build it as part of the project that uses boost::mpi (but that's a problem in itself again, given that boost uses its own build system, which is unlike anything else).

We've dealt with this issue of multiple MPI installations by writing an MPI (C binding) library that just calls through to an underlying MPI implementation. Our library dispatches to an underlying MPI implementation at run time via dlopen or the Windows equivalent (it works great on Windows). We don't expose any details of the underlying MPI implementation's ABI, so it's handy for things like Python bindings. Our library takes effort to maintain and incurs function call overhead, but it's been useful enough that we're thinking about open-sourcing it. If you're interested, please let me know.

@omor1
Member

omor1 commented Apr 27, 2020

We've dealt with this issue of multiple MPI installations by writing an MPI (C binding) library that just calls through to an underlying MPI implementation. Our library dispatches to an underlying MPI implementation at run time via dlopen or the Windows equivalent (it works great on Windows). We don't expose any details of the underlying MPI implementation's ABI, so it's handy for things like Python bindings.

Unrelated to the discussion at hand, but I'm curious as to how you deal with the opaque handles (e.g. MPI_Comm, MPI_Request) that are exposed via mpi.h? These are highly implementation-dependent features whose sizes depend on the underlying ABI. There was discussion of exactly this issue in #159. As a concrete example: in Open MPI, handles are pointers, while in MPICH derivatives, they are ints.

@omor1
Member

omor1 commented Apr 27, 2020

Regarding a header-only library: this sounds good if you're starting a new project, but some existing C++ projects that use MPI care a lot about build sizes and times. If we want to put something in the MPI Standard, I'd like to see some build experiments in real applications.

There are both benefits and detriments to defining the MPI C++ interface so that it can be implemented as a header-only library. An obvious benefit is that a single generic implementation may be sufficient for all underlying MPI libraries, which can ease adoption and maintenance burden. The flip side is that there are then severe restrictions on, e.g., the datatypes interface, as it would have to be implemented in terms of the MPI C interface rather than whatever low-level representation the implementation uses.

@mhoemmen

@acdemiralp wrote:

Why not co-develop it along with the C++ standard?

Yes -- let's write a library first, then standardize it. Maybe that means becoming a Boost.MPI developer or taking over Boost.MPI development, or maybe it means starting a new library (if one can make a strong technical argument that Boost.MPI has a fundamentally flawed design).

@sg0

sg0 commented Apr 28, 2020 via email

@mhoemmen

@sg0 wrote:

However, from the example mentioned by Mark H., it seems the return object of the MPI function invocation is a future.

It would be a sender, in P0443R13 terms, not a future. Senders and receivers avoid some of the shared state issues that futures have.

In any case, I'm not necessarily advocating this design. I'm just saying that if people want that kind of design, then it should fit with how modern C++ is doing it. I'd like to see the people doing that design engage with C++ networking and executors experts.

@StellarTodd

We've dealt with this issue of multiple MPI installations by writing an MPI (C binding) library that just calls through to an underlying MPI implementation. Our library dispatches to an underlying MPI implementation at run time via dlopen or the Windows equivalent (it works great on Windows). We don't expose any details of the underlying MPI implementation's ABI, so it's handy for things like Python bindings.

Unrelated to the discussion at hand, but I'm curious as to how you deal with the opaque handles (e.g. MPI_Comm, MPI_Request) that are exposed via mpi.h? These are highly implementation-dependent features whose sizes depend on the underlying ABI. There was discussion of exactly this issue in #159. As a concrete example: in Open MPI, handles are pointers, while in MPICH derivatives, they are ints.

We defined a Handle class that contains a union, and conversion methods for converting back and forth between native handles and our handles. The conversions are done in the plugin portion of the library that is compiled against a specific MPI implementation.

Since this is off topic, I don't want to get into any more details here. Feel free to contact Mark or me for further details.
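For readers following along, a rough sketch of the union-based Handle idea described above (hypothetical names; the real library's details differ):

struct Handle {
  union {
    int   as_int;  // MPICH-style handles are ints
    void* as_ptr;  // Open MPI-style handles are pointers to opaque structs
  } u;
};

// Inside a plugin compiled against Open MPI, for example, the conversions are
// trivial (hypothetical helper names):
//   Handle   wrap(MPI_Comm c) { Handle h; h.u.as_ptr = c; return h; }
//   MPI_Comm unwrap(Handle h) { return static_cast<MPI_Comm>(h.u.as_ptr); }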

@jeffhammond
Member Author

jeffhammond commented May 18, 2020 via email

@raffenet

FYI https://gitlab.com/correaa/boost-mpi3. I don't know any of the details of the implementation, just that it exists and some projects have investigated using it.

@mhoemmen

@acdemiralp wrote:

Can https://www.mpich.org/static/docs/latest/www3/MPI_Type_create_struct.html forward the difficulties of serialization to MPI, and potentially even allow removing the dependency to Boost.Serialization?

If C++ gets actual reflection, that would let us use MPI_Type_create_struct to iterate over the fields of a class and convert them into an MPI_Datatype. Right now, there's no way in standard C++ to do that.
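For reference, this is what users have to write by hand today, absent reflection: describe each struct's layout with offsetof and MPI_Type_create_struct (standard MPI calls; the Particle struct is just an example):

#include <mpi.h>
#include <cstddef>

struct Particle {
  double pos[3];
  int    id;
};

MPI_Datatype make_particle_type() {
  int          blocklens[2] = {3, 1};
  MPI_Aint     displs[2]    = {offsetof(Particle, pos), offsetof(Particle, id)};
  MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};

  MPI_Datatype raw, resized;
  MPI_Type_create_struct(2, blocklens, displs, types, &raw);
  // Resize so arrays of Particle have the right stride (accounts for padding).
  MPI_Type_create_resized(raw, 0, sizeof(Particle), &resized);
  MPI_Type_commit(&resized);
  MPI_Type_free(&raw);
  return resized;
}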

@omor1
Member

omor1 commented May 25, 2020

If C++ gets actual reflection, that would let us use MPI_Type_create_struct to iterate over the fields of a class and convert them into an MPI_Datatype. Right now, there's no way in standard C++ to do that.

This would probably work for most POD / Trivial / StandardLayout types, but isn't portable to types that don't need all members serialized. I think most high-level C++-based APIs (thinking Charm++ and STAPL here, for instance) use user-provided pack/unpack routines to do serialization. If we can find a mechanism that allows users to easily select which fields of a class must be serialized, that would probably be the way to go.

@omor1
Member

omor1 commented May 25, 2020

I believe the best practice solution to such a problem lies on the user's part: Create a smaller struct of things which will actually be serialized, and put it in a struct which also contains other stuff. If you need sequentiality, use pointer to the serialized struct in the larger struct and store them sequentially separately. Decent, intuitive solution in C++ terms.

I agree that this is indeed a nifty solution. Actually, it should be possible to make a template type with a parameter pack that serializes the types in the order given, something similar to std::tuple. That would allow use in current C++.

@mhoemmen

Automagical serialization could be a footgun. I'm already uncomfortable with Boost automatically "taking care of" types that have run-time length, like std::string. It's useful for my current project, but I don't like that there could be multiple messages happening when I only typed one (what does that mean for progress of nonblocking messages, for instance?).

@rabauke

rabauke commented May 28, 2020

@acdemiralp wrote:

Can https://www.mpich.org/static/docs/latest/www3/MPI_Type_create_struct.html forward the difficulties of serialization to MPI, and potentially even allow removing the dependency to Boost.Serialization?

If C++ gets actual reflection, that would let us use MPI_Type_create_struct to iterate over the fields of a class and convert them into an MPI_Datatype. Right now, there's no way in standard C++ to do that.

Actually, one can do a kind of reflection for some generic types such as std::tuple, std::array, etc. to build MPI datatypes at run time fully automatically, not visible to the user. This is the route that I took in MPL. MPL is a C++11 header-only message passing library built around the MPI standard.

@omor1
Member

omor1 commented May 29, 2020

The problem with using std::tuple and std::pair directly is that as far as I know they aren't guaranteed to be standard layout types and don't provide direct access to the underlying storage.

@rabauke

rabauke commented May 29, 2020

@omor1 Not being standard layout types is the reason why reflection via template magic is performed and an MPI datatype is constructed via MPI_Type_create_struct for each std::tuple type. Access to the underlying member storage is gained via std::get and &. To my understanding, a restriction to standard layout types would only be required if one were to send data in a memcpy-like fashion in MPI calls, e.g., by sending blocks of raw memory and using MPI_BYTE.
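A sketch of the technique (not MPL's actual code, and written with C++17 fold expressions for brevity even though MPL targets C++11; MpiMap is the hypothetical type-to-datatype trait sketched earlier in this thread):

#include <mpi.h>
#include <array>
#include <tuple>
#include <utility>

template <typename... Ts, std::size_t... Is>
MPI_Datatype make_tuple_type(const std::tuple<Ts...>& t, std::index_sequence<Is...>) {
  constexpr int n = sizeof...(Ts);
  std::array<int, n> blocklengths;
  blocklengths.fill(1);

  // Take the address of each element via std::get and compute its displacement
  // relative to the tuple object itself.
  MPI_Aint base;
  MPI_Get_address(&t, &base);
  std::array<MPI_Aint, n> displs{};
  (MPI_Get_address(&std::get<Is>(t), &displs[Is]), ...);
  for (auto& d : displs) d = MPI_Aint_diff(d, base);

  std::array<MPI_Datatype, n> types = {MpiMap<Ts>::type()...};

  MPI_Datatype newtype;
  MPI_Type_create_struct(n, blocklengths.data(), displs.data(), types.data(), &newtype);
  MPI_Type_commit(&newtype);
  return newtype;
}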

@omor1
Member

omor1 commented May 29, 2020

Oh, I think I understand—you can get the offset from the base of the tuple and thus construct an MPI type for the tuple itself. Very clever! I'd been playing around for a bit with something similar, but I was recursively constructing structures to ensure they would be standard layout and thus be able to use offsetof, since C++ has no way to expand a parameter pack into a set of variables of those types.

@VictorEijkhout

Well, this discussion went on a long time before anyone mentioned MPL. I've been very impressed with MPL, which, like mpi4py, makes life a lot easier. For instance, data knows which type it is, so for the 99.99 percent of cases where you don't care you don't have to spell it out.

I've started incorporating MPL in my MPI book, hoping that it will find wider adoption.
https://web.corral.tacc.utexas.edu/CompEdu/pdf/pcse/EijkhoutParComp.pdf

@jeffhammond
Member Author

@mhoemmen

why not build on the C++ networking TS?

I tried a few years ago to get the C++ networking people to support semantics other than HTTP and they were rather hostile. I proposed a fabric TS that behaved like OFI/libfabric and was told I just didn't understand what the word "networking" meant.

You may have better luck, but I don't have time to teach SG14 people that Internet Protocol is not the only way to move bytes between computers.

@mhoemmen

@jeffhammond Ugh, sorry to hear that. I wish I had more time to work on this.

@hzhangxyz

With C++ coroutines, maybe we could write something like this?

auto value = MPI::Async::Receive(xxxxxx);
something_else();
use_value(co_await value);
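A rough sketch of the awaitable side, assuming C++20 coroutines (no such MPI::Async API exists today; a real implementation would also need a progress engine that resumes suspended coroutines):

#include <mpi.h>
#include <coroutine>

struct request_awaiter {
  MPI_Request req;

  bool await_ready() {
    int done = 0;
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // poll once; ready if already complete
    return done != 0;
  }
  void await_suspend(std::coroutine_handle<> h) {
    // A real implementation would register `h` with a progress loop that calls
    // MPI_Test periodically and resumes the coroutine once the request completes.
    (void)h;
  }
  void await_resume() {}
};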

@correaa

correaa commented Feb 17, 2022

Hi,

When I got my hands on the article "MPI Language Bindings are Holding MPI Back", I wrote a couple of notes (for myself) as a friendly critique of the paper.
I don't disagree with the paper, I just think that there are harder problems than the ones mentioned in it:

I will leave here the link to these notes: https://gitlab.com/correaa/boost-mpi3/-/wikis/A-critique-on-%22MPI-Language-Bindings-are-Holding-MPI-Back%22 .

In addition to that, to add to what @bangerth just wrote,

  1. constexpr. MPI is a runtime system, and even if some things could be declared constexpr, I don't think the system can do much with them in terms of composing more compile-time operations. constexpr Datatypes seem like something useful, although the MPI system has to be able to "compile" or "bless" them at compile time for them to be useful. Also, I would say that covers only a very particular subset of all useful Datatypes (e.g. it excludes arrays of dynamic size).

  2. Continuations: The problem of "continuations" is also very important, and I myself have a pressing need for this in the C++ interface I propose, because except for trivial datatypes and trivial data structures (arrays) I almost always need to attach encoding or decoding tasks to the communication task.
    Sometimes what I need can be regarded as a continuation (like decoding a serialized packet), but sometimes it is something that needs to be executed before the communication task, like packing data asynchronously.
    So generically what I need is to be able to reuse the available MPI threads to piggyback, at the least, some O(N) data manipulation.
    I started doing things in this direction but I left it for lack of time.

A related problem with asynchronous operations is whether there is any idiom available in C++ that allows "marking" data or values as being "locked" into a request, perhaps by some combination of smart pointers (for ranges) or move semantics (for values).
Or, for that matter, anything where a static analyzer, or the compiler, can help (e.g. something similar to "use variable after move", or in this case "use variable after an asynchronous request has started but not finished").

  3. Value semantics: I couldn't agree more. In my library, I experimented with two types of interfaces. One takes iterators generically, and the other deals with values and incidentally defines the concept of a "process", as can be seen in the examples.
    A collection can be sent and received in the canonical form:
std::vector<double> v = ...;
std::vector<double> w = ...;
comm.send(v.begin(), v.end(), 1 );
comm.receive(w.begin(), w.end(), 0);  // see elsewhere the discussions on a less redundant interface comm.receive(w.begin(), 0)

or a value based interface:

comm[1] << v;
comm[0] >> w;

(see details here: https://gitlab.com/correaa/boost-mpi3/-/blob/master/test/process.cpp#L51-58)

Note that I am all for dealing with values, but not necessarily "returning" them from functions.
Returning values is not natural for IO in my opinion, and always tends to generate more allocations than needed.
(Think of the case where w doesn't need to be resized above.)

Interestingly, move semantics can implicitly hint to the library to use asynchronous operations, which would simplify the interface tremendously.
For example (this is not implemented yet):

auto unique_req = (comm[1] << std::move(v));
comm[0] >> w;
... // v cannot (mostly) be used yet, and that is clear to the user (and to a static analyzer)
v = unique_req.get();

This is still not perfect, because std::move still allows operations with no preconditions to be performed on the variable.
In Rust one can "steal" the variable completely, but I am not aware of how to do that in C++, except for the partial solution above.
So, for the idiom to really work and be foolproof, one needs to really move v into unique_req above.

  4. I also agree that error handling should be done via exceptions. The hard part is to write exception-safe code around it, including MPI (or MPI interface) code.
    I also have the view that exceptions should not be the defined behavior for logical errors. (They can be a de facto implementation of undefined behavior, but one shouldn't be forced to handle them.)
    The point is that when I look at the error codes reported by MPI functions, 90% of them are logical errors (for example, an invalid communicator).
    Some basic functions do not report any non-logical errors anyway; even though we all know they can happen, there are no error codes for them, which begs the question: what can we really do from the C++ perspective?
    One would expect to get runtime errors when the network is down or things like that, but they are not reported AFAIK.
    Perhaps I don't know enough to have an opinion about this.

  5. At the time, when asked by LLNL, I contributed my two cents about big count. The main idea I transmitted was that without big count it was impossible to send data structures such as std::deque, and datatypes wouldn't help, because it is not a matter of the number of elements but the size of the gaps between elements coming from independent allocations.
    I started implementing a fallback mechanism for when big "pointer differences" or big "numbers of elements" are implicitly used, but it was a lot of work.

  6. I have strong opinions about serialization; I think it is fundamental. Serialization is an integral part of value semantics and regular types. Datatypes are at best an optimization over serialization and they don't cover all cases.
    Boost.Serialization (what I use) has lots of issues, especially not being header-only and being old, but it is a good canonical model.
    What I am working on is having the option to use different serialization backends, such as Cereal.

@VictorEijkhout

VictorEijkhout commented Feb 17, 2022 via email

@bangerth

bangerth commented Feb 17, 2022 via email

@jacobmerson

As @VictorEijkhout says, in C++ "futures" are a bit of a loaded term; use of std::future causes all sorts of lifetime/state issues and is not particularly performant due to the need for shared state. I think any forward-looking C++ MPI API should consider the async utilities that are coming into the language via coroutines and std::execution/P2300.

@bangerth

bangerth commented Feb 17, 2022 via email

@sg0

sg0 commented Feb 17, 2022

Technical reasons aside, there has to be some dedicated funding for getting this work done, since this is not just forum participation and developing myriad modern C++ language bindings. I contributed to 3 LDRD open calls and one DOE proposal solicitation (jointly with more established/senior scientists in this area) in the last 3 years trying to get some funding for this work - all of them failed (I am still trying, but am mostly pessimistic). I think there is perhaps a limited incentive structure for this work in the minds of the senior people, at least in the US DOE.

@bkmgit

bkmgit commented Feb 18, 2022

US DOE has traditionally been important, but MPI is used in a wide range of codes. An important additional consideration is use in industry. Examining software such as OpenFOAM may be helpful to get some idea of used features. Some C++ applications may also choose to directly build on top of UCX.

@correaa

correaa commented Feb 18, 2022

@bangerth,

The good thing about the word "future" (and continuations) is that many people know what it means, and it is a good initial sketch in principle.

Having said that, it is important to recognize that std::future in its current state might be too general and too heavyweight for some families of basic tasks.
Coincidentally, this family includes things that are very relevant to message passing.

First, std::future is not ideal because it does type erasure on the task (sort of like std::function); it is quite flexible but not the best option in all cases. Second, std::future contemplates the possibility of tasks failing (throwing), and that has a cost. It also typically needs to allocate the return object, which in turn can be a failure point.

What I found in my experiments is that from the outset, before and after sending a message, there is typically a need for encoding and decoding messages (for example [de]serialization).
These are the specific tasks we should consider before going to the more general case of an arbitrary continuation.
In fact, while decoding can be seen as a continuation, encoding is not; it is more like a prologue.

Also, it is interesting to consider that encoding and decoding tasks can be made/programmed in such a way that they cannot fail (and do not throw).
Therefore, in principle, it is possible to disregard exceptions in this context.

Additionally, as I mentioned in other posts, I don't think that returning objects or values is a good idea, and this extends to asynchronous messaging too.
There are several reasons for that, and even a specific reason in this context.
If these future-like requests return iterator-like objects instead of new values, then we don't even need to worry about exceptions thrown during construction.

In summary, for requests or future-likes that do not return values and that are restricted to only doing encoding and decoding (or, more generally, epilogues or prologues that cannot fail and are noexcept), the implementation doesn't need to be as complicated or as heavy as what std::future offers right now.

Feedback on these ideas will be appreciated too.

@correaa

correaa commented Feb 19, 2022

You can use a std::expected instead of throwing. Even nicer is to allow both via macros.

Any problem can be solved by adding a level of indirection, except too many levels of indirection. (std::expected is the indirection here.)

More seriously, I think returning values (or expecteds) does not reflect what MPI communication ultimately is: IO.
In the IO picture, the object exists (maybe in an unspecified but valid state) before communication.

Returning values forces an allocation even in cases where it is obviously not needed. (Think of the case of receiving into a vector that already has enough capacity to hold the number of elements sent.)

I do not understand why you are occupied with the idea of byte-level serialization, which to my knowledge is last resort practice.

I don't know in general, but in my case it is not byte-level serialization; the fundamental blocks of serialization are typed packages of basic types.
I call it encoding for lack of a better word. What I refer to is a standard transformation of a data structure into a packed format that both ends of a message have to agree upon.
Also, byte-level serialization would break endianness compatibility, which (I won't defend it strongly) is a nice-to-have.

If you have proper reflection, or even precise flat reflection like MPL's or Boost.PFR, you often do not need byte-level serialization.

(Static) reflection can only get you so far; it doesn't solve all the problems. Reflection is OK for generating custom datatypes which can be known at compile time, but not much more.
It doesn't help with dynamic data structures (e.g. a multi-block data structure, like std::deque, or a CSR matrix) or with MPI datatypes that in practice would take about the same memory as the size of the message itself (e.g. std::list).

I also do not understand what problem you have between std::future and serialization.

No problem; I am just pointing out that std::future is made to handle almost any kind of task.

And serialization, which is an important example of the need for "continuations", is not a general task, but a simpler one.

If you want one or more intermediate (de)serialization steps that are not async, then make them async compatible via https://en.cppreference.com/w/cpp/experimental/make_ready_future instead of opening callback points for them or using asymmetrical packing and unpacking to confuse the user.

I have to think about that.
Yes, the idea is that generic asynchronous messaging (like in BMPI3) needs preprocessing or postprocessing.

I would like to make this processing 1) asynchronous as well, and 2) use the resources (threads, buffers) already given to MPI optimally.
I don't know exactly how to do this yet.
This part is also work in progress.

Which iterators? Iterators of contiguous sequential containers (span, string, valarray, vector<!bool>)? Or iterators of non-contiguous sequential containers (deque, forward list, list, vector<bool>)? Or iterators of associative containers (map, unordered map, set, unordered set)?

All of the above, depending on the case. They can even be pure input and output iterators (not that I recommend using them).

The BMPI3 "basic" interface is iterator-based, as you indicate.
It also returns other (new) iterators in the cases where the internal computations are hard or impossible to replicate outside the message call.

(The STL is designed with the same philosophy, although it didn't always get it right.)

The asynchronous versions are not different in principle, in the sense that the request could return iterators (e.g. via future::get).
This is work in progress.

The latter two do not ensure contiguity, whereas MPI often prerequisites contiguity.

Sure, low-level interfaces require contiguity (think of memcpy).

High-level interfaces try to take advantage of it through direct or indirect means, even when data is not contiguous.
They do whatever is possible with whatever resources they have available: heuristics, buffers, pinned memory, datatypes, packed-level serialization, byte-level serialization, etc.
And yes, in sufficiently complex situations they can fail to do their job efficiently (while still doing the job correctly).

MPI forces a C mentality: we think about how to use it through contiguous arrays, and that is fine.
BMPI3 has a C++ (or STL) mentality.
It will try to do the best job possible, and the idea is to have a decent base level of quality of implementation, which will be work in progress for a while; any help will be appreciated.

This is also confusing to me in your library Boost.MPI3. What happens when I pass a std::unordered_map::begin() and std::unordered_map::end() to your functions that accept iterators? Does my map get copied to contiguous memory e.g. a std::vector<std::pair> and then transmitted?

Very good question.
(The answer has many corner cases because you didn't say what the element types are, but I am going to ignore this and assume the best possible scenario: that the datatype is a builtin.)

But, yes, broadly speaking, what you describe is a good starting-point solution.
(I will add some levels of detail as we go.)
After all, what is the alternative otherwise? Partition the message into N smaller messages with one element (or pair) each? That is, as you know, unacceptable.

The solution you propose works, and one has to accept that the user had a very good reason to use an unordered_map to begin with. The user has to know the cost of traversal in general, and of communication in particular, for such a specialized data structure.

An important point before continuing is that if you pass a pair of iterators, the library has already lost the information that the container is associative.

The only information it has is that the range is defined by a pair of bidirectional iterators and that the elements are decomposable as pairs.

Where does that std::vector<std::pair> live if the call is immediate?

OK, yes, assuming we are going this route, then the vector lives in some sort of free store. A possible candidate is the default heap (std::allocator), and that would work.

But we can do better: we have access to the MPI system as well, and to the communicator, with all its hypothetical buffers. We also know we are copying into the vector for the sake of communicating, nothing else.

Therefore what the library should do is put the vector in MPI pinned memory, which, if it is available, can make the communication faster.

What if there is not enough pinned memory? Well, then a series of smaller intermediate vectors can be built and sent, one at a time.

If many vectors need to be constructed and destructed, maybe it is also a good idea not to allocate each one: reuse a single one, or use a specialized arena allocator.

So, as you see, it can get intricate internally. There are levels of optimization one can take advantage of.

Is this the only way to do it? No; I could also take advantage of the fact that the elements are pairs and construct two vectors, one for each type. I am not doing this; maybe, if it is proven to work across multiple systems, one can write (inside the library) special code for it. What I am trying to illustrate is that one can optimize to different levels.

What about std::vector::begin() and std::vector::end()? Do you still make a copy like you would in the std::map case or do you somehow detect it and avoid the copy?

No, I don't. First of all, at this point I have a temporary vector and I can send it directly; I know it is a vector.

But anyway, if you were to pass a vector::begin() and vector::end(), the library (not necessarily with your help) detects that these are random-access, contiguous iterators, so it knows how to handle this case without intermediate copies.

I will stop detailing what I am doing internally here.
I hope the idea is clear even if you disagree with it in general or in the details.
The important point is that this is all internal to the library.

You see? Iterators are confusing in this context.

Sorry, no, I don't see.
What is confusing about this? This is work that the library does for you.
If the implementation I described confuses you, that's fine: it is just that, an implementation. It is enough for you to know that an unordered_map has costly traversal and is not contiguous; and if your dataset is small enough, you can even get away with not knowing that.

When you use iterators, do you worry about whether they use memcpy at some point below? Maybe, maybe not. If you don't have many elements you might not care.
Of course, if you want performance you need to know your data structures: do not expect that unordered_map will be able to take much advantage of hardware or low-level MPI primitives.

To finish, the two types of iterators that you mentioned belong to two different iterator categories, and they naturally have different performance guarantees.

In summary, for request or future-likes that do not return values and are that restricted to only do encoding and decoding (or more generally epilogs or prologs that cannot fail and be noexcept) the implementation doesn't need to be as complicated or as heavy as what std::future offers right now.

Yes as you can see in the 89 liner above.

Yes to what exactly? (what is the “89 liner”?)

Yes to the claim that prologues and epilogues need to be handled by things as heavy as futures?

Maybe; I didn't write out all the possible epilogues and prologues that could be necessary, so yes, this is, until proven correct, a guess. The fundamental difference is that prologues and epilogues do not need to return values, as futures are designed to do. My prologues work with elements that are already there in some sense; they do not need to return anything "new".

Thank you for your questions. -- A

@bangerth

bangerth commented Feb 21, 2022 via email

@correaa

correaa commented Feb 21, 2022

On 2/19/22 12:55, Alfredo Correa wrote:

More seriously, i think returning values (or expected) do not reflect what MPI
communication ultimately is, IO.
In the IO picture, object exists (maybe in unspecified but valid state) before
communication.

Just to be clear, this is not what I wanted to advocate for. The actual send
and receive buffers should be allocated by the user. It is things such as the
output integer arguments of MPI_Comm_rank and MPI_Comm_size that would be
nice to return, as well as MPI_Request objects by immediate functions.

Thank you for the very important clarification.

If you are referring to your quote "return whatever they are producing by-value, rather than through arguments; ...", and by values you didn't mean the values of the communicated data, then, yes, we are on the same page.

Maybe @acdemiralp was referring to the same thing as well and I also misinterpreted.

@mhoemmen

I'm all for this kind of stuff. But do you want to standardize on things that are only available in C++23 or C++26?

  1. P2300 won't make C++23, though it has a good chance at C++26.
  2. I've seen plenty of MPI 1.x code in the wild. This suggests that people shouldn't worry about requiring newer versions of a programming language in newer versions of MPI, because users will always be able to fall back to implementations of older MPI versions.
  3. That being said, a standard should standardize existing practice. Thus, I'd rather see one or more examples of a senders/receivers-based C++ MPI interface first, before considering its standardization. P2300 is a library solution with existing implementations, so interested parties should feel welcome to try this. P2300's authors are open to considering more use cases, so now would be a good time to explore using senders/receivers.
  4. I think MPI (2-sided or 1-sided) is a poor match for senders/receivers, but am open to discussion.

@VictorEijkhout

VictorEijkhout commented Feb 21, 2022 via email

@mhoemmen

@VictorEijkhout wrote:

Considering what a terrible mess threading is in C++ (every next standard seems to say “Oh no, we should have done it this way”)....

I'll fight you on that one, my friend Victor : - ) .

  1. std::thread is a perfectly fine wrapper for an operating system thread. It never aimed to be anything more.
  2. Regarding "every next standard seems to say...," the only way in which the Standard has actually changed was in discouraging use of release-consume memory ordering. That came out of some recent academic work. I've never seen code in the wild that uses this ordering.
  3. I've written and used thread-parallel C++ code for over a decade. It works fine and it runs at scale.

You don't have to like C++, but phrases like "terrible mess" just aren't accurate. I would say MPI is a bigger mess; consider, for example, how long it's taking the community of MPI experts to decide what MPI_THREAD_MULTIPLE means.

@ibaned

ibaned commented Apr 5, 2022

Reading through some of this discussion, it strikes me that the primary pitfall is the sheer size and complexity of ISO C++ and the temptation to ask ourselves how an MPI interface might be compatible with every single feature of C++.

Thinking of how an MPI interface could interact with ranges, reflection, threading, executors, etc. is an exciting exercise but seems to lead to an MPI interface that is as large as the ISO C++ standard itself.

My thought is that the C++ interface to MPI should look more like the MPI standard than the ISO C++ standard. By this I mean that it should mainly consist of applying tried-and-true (albeit less exciting) C++ features consistently over the whole interface. I'm convinced enough of this principle of simplicity that I made a C++ interface to MPI that I am using in large projects:

https://github.com/sandialabs/mpicpp

Here are the tried-and-true, non-controversial and non-daunting features of C++ that it applies to MPI so far:

  1. RAII for requests, communicators, etc. with unique ownership and move semantics. This also encompasses non-blocking semantics by having the destructor of a request wait on the request. Ignoring a returned request is equivalent to calling a blocking function.
  2. Exception-based error handling. Throws exceptions everywhere that the C MPI interface returns an error code.
  3. Deduction of MPI_Datatype for C++ types but only for pre-defined MPI_Datatypes

Personally, I don't currently have code that sends user-defined structs or maps of lists that is begging for reflection, nor code that calls MPI from multiple threads that would really benefit from concurrency compatibility.

I think a minimal system like this would be a good starting point, and over time it can add compatibility with more and more C++ features. Adding compatibility with a new feature should consider carefully the maintenance cost of this part of the MPI C++ interface (both standardization and implementation), the stability and user experience of the C++ feature itself, and the clear benefit to existing users of MPI.
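As a concrete illustration of items 1 and 2 above, here is a minimal sketch of a move-only RAII request whose destructor waits (illustrative only, not mpicpp's actual API; the isend() in the usage comment is a hypothetical wrapper):

#include <mpi.h>
#include <stdexcept>
#include <utility>

class request {
  MPI_Request req_ = MPI_REQUEST_NULL;

public:
  explicit request(MPI_Request r) : req_(r) {}
  request(request&& o) noexcept : req_(std::exchange(o.req_, MPI_REQUEST_NULL)) {}
  request& operator=(request&& o) noexcept {
    if (this != &o) { wait_noexcept(); req_ = std::exchange(o.req_, MPI_REQUEST_NULL); }
    return *this;
  }
  request(const request&) = delete;
  request& operator=(const request&) = delete;

  void wait() {  // error codes become exceptions
    if (req_ != MPI_REQUEST_NULL && MPI_Wait(&req_, MPI_STATUS_IGNORE) != MPI_SUCCESS)
      throw std::runtime_error("MPI_Wait failed");
  }
  ~request() { wait_noexcept(); }  // ignoring the returned request makes the call blocking

private:
  void wait_noexcept() noexcept {
    if (req_ != MPI_REQUEST_NULL) MPI_Wait(&req_, MPI_STATUS_IGNORE);
  }
};

// Usage:
//   isend(buf, n, dest, tag, comm);           // request ignored: destructor waits
//   auto r = isend(buf, n, dest, tag, comm);  // kept: call r.wait() explicitly later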

@correaa

correaa commented Apr 5, 2022

Hi @ibaned

My thought is that the C++ interface to MPI should look more like the MPI standard than the ISO C++ standard. By this I mean that it should mainly consist of applying tried-and-true (albeit less exciting) C++ features consistently over the whole interface. I'm convinced enough of this principle of simplicity that I made a C++ interface to MPI that I am using in large projects:

https://github.com/sandialabs/mpicpp

Yes, you leave me with no option other than to agree with you. :)
These are exactly the principles I designed my wrapper https://github.com/LLNL/b-mpi3 around.

The subtitle of the project is "This aims to be an wrapper to C-MPI3 for C++, using the principles of simplicity, STL, RAII and Boost and enforcing type-safety."

I would like to comment some subtleties below.

Here are the tried-and-true, non-controversial and non-daunting features of C++ that it applies to MPI so far:

1. RAII for requests, communicators, etc. with unique ownership and move semantics. This also encompasses non-blocking semantics by having the destructor of a request wait on the request. Ignoring a returned request is equivalent to calling a blocking function.

I couldn't agree more; if I had to choose a single principle, it would be this one.
RAII starts with writing the necessary destructor/constructor pairs, which is more or less mechanical, but it doesn't end there: one has to think about the other fundamental operations and, more importantly, whether they make sense: assignment, move-assignment, copy-construction and move-construction.

RAII also touches the broader topic of "guarantees"; modern C++ is all about guarantees in my opinion: thread safety and exception safety.
Making the code exception safe is the real challenge.

2. Exception-based error handling. Throws exceptions everywhere that the C MPI interface returns an error code.

In principle yes; however, I would like to add that logical errors should not be handled by exceptions at all.
When I read the documentation of MPI, many "returned" errors look like logical errors, so I don't see an urgent need to handle them with exceptions.
In the end, the situation I find myself in is that most of the errors the C MPI interface reports should not even be converted to exceptions.
(We could still throw exceptions, but there is little gain in doing so. I am a fan of the concept of narrow and wide contracts and of not "defining undefined behavior".)

3. Deduction of `MPI_Datatype` for C++ types but only for pre-defined `MPI_Datatype`s

I agree: if something can be mapped to an MPI_Datatype (and the size of the MPI_Datatype is less than O(N)), we should use all the tools at our disposal to achieve that (including dark magic).

Having said that, it is a fact of life that not all value objects have an MPI_Datatype of size less than O(N).
The question is what to do in these cases: should we go further and use magic/user helper code? Or just say that anything like that will not be handled, and the user is responsible for communicating such complicated data structures?

In https://github.com/LLNL/b-mpi3, I went the route of 0) detect basic MPI_Datatypes and basic data structures; if that doesn't work, 1) attempt (at compile time) to construct an MPI_Datatype; if that doesn't work, 2) invoke serialization routines if available; fail (at compile time) otherwise.

(The boundary between 1) and 2) is tricky and I don't have a general way to handle it.)

Step 2) introduces the need for a serialization framework, which may or may not introduce a hard dependency on a third-party serialization library, such as Cereal or Boost.Serialization.
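For concreteness, a self-contained sketch of this kind of layered dispatch, reduced to two levels (not b-mpi3's actual code; a real library would insert the "build an MPI_Datatype at compile time" level and a proper serialization fallback in between):

#include <mpi.h>
#include <type_traits>

template <typename T> struct predefined_datatype;  // level 0: undefined by default
template <> struct predefined_datatype<int>    { static MPI_Datatype get() { return MPI_INT; } };
template <> struct predefined_datatype<double> { static MPI_Datatype get() { return MPI_DOUBLE; } };

template <typename T, typename = void> struct has_predefined : std::false_type {};
template <typename T>
struct has_predefined<T, std::void_t<decltype(predefined_datatype<T>::get())>> : std::true_type {};

template <typename T>
void generic_send(const T& value, int dest, int tag, MPI_Comm comm) {
  if constexpr (has_predefined<T>::value) {
    MPI_Send(&value, 1, predefined_datatype<T>::get(), dest, tag, comm);
  } else {
    // Crude stand-in for the later levels; a real library would try to build a
    // datatype here and only then fall back to serialization.
    static_assert(std::is_trivially_copyable_v<T>, "type cannot be communicated by this sketch");
    MPI_Send(&value, static_cast<int>(sizeof(T)), MPI_BYTE, dest, tag, comm);
  }
}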

Personally, I don't currently have code that sends user-defined structs or maps of lists that is begging for reflection,

Here is an example of a custom class communicated via MPI: https://github.com/LLNL/b-mpi3/blob/master/test/communicator_send_class.cpp

nor code that calls MPI from multiple threads that would really benefit from concurrency compatibility.

I think the library should be thread-compatible, and thread-safe only if the user wants to handle that. I think there are simple rules to achieve this and, at the least, to be transparent about the relation between communication and threads.

I think a minimal system like this would be a good starting point, and over time it can add compatibility with more and more C++ features. Adding compatibility with a new feature should consider carefully the maintenance cost of this part of the MPI C++ interface (both standardization and implementation), the stability and user experience of the C++ feature itself, and the clear benefit to existing users of MPI.

I agree: not every C++ feature should be used, reflected, or taken into account by an MPI C++ interface.
Hopefully most features will be orthogonal to, or simply play nice with, what we achieve.

@tschuett

tschuett commented Apr 10, 2022 via email

@tschuett

tschuett commented Apr 10, 2022 via email

@mhoemmen

@ibaned Hi! : - ) Excellent points y'all! Some thoughts on your list of C++ features:

RAII for requests, communicators, etc. with unique ownership and move semantics.

I'm not actually convinced that code should manage lifetimes of MPI communicators at all. Idiomatic C++ destructors are nonblocking and nonthrowing, while MPI "destructors" are collective and possibly blocking. C++ code paths can diverge, which may break the requirement to call free functions collectively.

I see a callback-based model as more natural. MPI kind of already does this. For example, MPI_Init effectively launches a program that takes "nonowning references" to MPI_COMM_WORLD and MPI_COMM_SELF, and MPI_Comm_split effectively launches some number of programs with a rebinding of the "current communicator." The idea is that the callback would take its "current communicator" as a function argument. The "current communicator"'s lifetime would strictly contain the callback's execution. Users would then not need to think about communicator lifetimes at all. The callback approach would also expose nonblocking execution for communicator creation functions that require communication. One challenge with this technique, though, would be handling more complicated intercommunicator lifetimes.
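A minimal sketch of that callback shape (not a concrete proposal; names are illustrative):

#include <mpi.h>

template <typename F>
void with_split_comm(MPI_Comm parent, int color, int key, F&& body) {
  MPI_Comm sub;
  MPI_Comm_split(parent, color, key, &sub);
  body(sub);            // the callback never stores or frees `sub`
  MPI_Comm_free(&sub);  // the collective free happens at a structurally fixed point
}

// Usage:
//   with_split_comm(MPI_COMM_WORLD, my_color, my_rank, [](MPI_Comm comm) {
//     int rank;
//     MPI_Comm_rank(comm, &rank);
//     // ... work on the sub-communicator ...
//   });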

Exception-based error handling. Throws exceptions everywhere that the C MPI interface returns an error code.

Exceptions are for recoverable errors. How would I write code to recover from MPI_Comm_free returning something other than MPI_SUCCESS?

@ibaned

ibaned commented Apr 11, 2022

Hi @mhoemmen !

I agree that the collective requirement on MPI "constructor/destructor" C APIs is a different requirement than just the scope executed on one rank. The design I landed on is based on the idea of "standardize current practice", where the majority of existing code I see does use C++ scopes to denote these parallel collective lifetimes, so using RAII to do this wouldn't change their mental model. Your callback model sounds interesting; I wonder how much work is involved in building the "runtime" that executes those callbacks and handles communicator lifetimes.

Working on "parameter scans" of many small simulations has really changed my perspective on what is a recoverable error. Even if the simulation code has a bug and MPI basically says "you gave me invalid arguments", we can still terminate the current simulation and just start a new one (even better, if using RAII, its communicators are safely freed). Our users are often upset when one or two out of ten thousand simulations fail and all the data points are lost, not just those two.

The other use case for exceptions is to build up useful debugging information while fatally exiting. Calling code in outer scopes can catch the MPI exception and throw a new exception with further information, like a manual stacktrace with application-specific metadata included.

My new perspective (I definitely didn't think this before) is that it is not up to the point of origin to decide what is recoverable; it is up to the calling code to decide what to do with an exception: catch it and recover, catch it and add useful debug info but don't recover, or don't catch it at all.

@bangerth

bangerth commented Apr 11, 2022 via email

@devreal

devreal commented Apr 11, 2022

Exceptions and RAII for MPI objects with non-local, collective destruction semantics (files and windows, potentially communicators) won't match well. If an exception triggers the destructor of such an object, the application will deadlock, and users won't even see why, because the exception never makes it up the call chain. Such an interface would require littering code with try-catch blocks even if you are perfectly fine with not handling the exception and letting the application die, which is hardly more readable than checking return codes.

Other MPI types work perfectly fine with RAII (data types, ops, info objects). It would be somewhat inconsistent though.

Writing a simple wrapper around the C-API will never go far.
Using unique C++ features to make MPI more ergonomic to use sounds more interesting:

  • RAII
  • coroutines
  • reflection
  • futures

I'd like to see a good use-case for coroutines in MPI. Not everything that is possible is suitable and efficient. Also, there is a proposal for continuations under active discussion, which would (hopefully) work well with future.then() semantics.

@mhoemmen

Hi @ibaned ! Always good to discuss C++ with you! : - )

Your callback model sounds interesting, I wonder how much work is involved in building the "runtime" that executes those callbacks and handles communicator lifetimes.

I built a callback-based interface like this a few years ago for access to a global (distributed) object's local data. It was an almost entirely compile-time wrapper around Kokkos::DualView. I don't think the MPI communicator version would need more run-time tracking than current code already needs to do. Contact me offline if you'd like an overview of the design.

My new perspective (I definitely didn't this think before) is that it is not up to the point of origin to decide what is recoverable, it is up to the calling code what to do with an exception: catch it and recover, catch it and add useful debug info but don't recover, or don't catch it at all.

Sure, I suppose that if MPI_Comm_split fails, the algebraic multigrid library could carefully handle and report this to the library encapsulating linear solvers, and the latter library could then fall back to a slower multigrid hierarchy construction strategy, or even CG + domain decomposition. In practice, though, this requires structuring code as a sequence of transactions. If any MPI process fails to catch an exception that needs to be handled collectively, a crash is the good outcome and a hang is the more likely outcome. I think it's too hard to get a whole team to write a whole code base like this without a helpful programming model, that forces them to write transactional code and doesn't let them own MPI state.

@ibaned

ibaned commented Apr 11, 2022

I think the point being made here about only one rank in a communicator throwing an exception is a good and important one... I agree that some structure or tools to help the user deal with non-collective failures in otherwise collective environments are needed. This might tie into regular MPI standardization efforts around resilience.

@devreal

devreal commented Apr 11, 2022

On second thought, you might be right: maybe we just need guaranteed local destruction semantics for all MPI objects...

@correaa

correaa commented Apr 12, 2022

I am squarely in the camp of the deterministic destruction and release of resources in C++.
If nothing else, at least, because this is what allows (me) to reason about programs and performance.

The blocking aspect is in fact the exact dilemma I have for the destructor of the mpi3::communicator object in my library https://gitlab.com/correaa/boost-mpi3.
Whether to use MPI_Comm_free or MPI_Comm_disconnect for destruction.
After some back and forth, I decided that MPI_Comm_disconnect is more correct and desirable.

Among other things, this fits with the view that pending messages are some sort of dependent resource of the communicator.

@wesbland wesbland added needs guidance Needs guidance on what chapter committees need to do mpi-5 For inclusion in the MPI 5.0 standard labels Jun 14, 2023