Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Always set the MPI_ERROR field in status objects #814

Open
devreal opened this issue Nov 7, 2023 · 24 comments
Open

Always set the MPI_ERROR field in status objects #814

devreal opened this issue Nov 7, 2023 · 24 comments
Assignees
Labels
chap-p2p Point to Point Communication Chapter Committee mpi-4.2 wg-ft Fault Tolerance Working Group

Comments

@devreal
Copy link

devreal commented Nov 7, 2023

Problem

Section 3.2.5 states:

In general, message-passing calls do not modify the value of the error code field of
status variables. This field may be updated only by the functions in Section 3.7.5 that
return multiple statuses.

I understand that for single-operation message passing calls (MPI_RECV for example) the error code is returned to the user, if the appropriate error handler is set. However, I would argue that the difference in behavior between MPI_Testany (does not set the MPI_ERROR field), MPI_Testsome (does set the MPI_ERROR field), and MPI_Test (does not set the MPI_ERROR field) is confusing for users and breaks with the principle of least surprise. Intentionally leaving a value undefined in some cases in a structure we return to the user can cause all sorts of issues if users are not careful. That's bad API design.

Case in point: open-mpi/ompi#12049

I should note that not all fields of the status are set in all cases. There are operations where MPI_TAG or MPI_SOURCE have no meaning (e.g., RMA request-based operations). Leaving them undefined worries me less since this is a property of the operation and consistent across all test/wait functions. Every relevant MPI operation produces a return value so we should set the error field in every associated status object.

Proposal

The MPI_ERROR field of the status should always be set, even if the error value is returned to the user directly (MPI_Test, MPI_Recv`, ...).

Changes to the Text

Strike the statement in Section 3.2.5 and add verbiage that the error field is always set.

Impact on Implementations

This could simplify implementations since they don't have to deal with the difference between single- and multi-request functions.

Impact on Users

Less surprises and exceptions to think about.

There may be applications out there that rely on the status not being set (i.e., carry information through the MPI_ERROR field across MPI function calls) but I expect the chances of that are very slim and it's bad design anyway. Such applications can easily be fixed by storing that information in a local variable instead.

References and Pull Requests

Brought up in an Open MPI ticket: open-mpi/ompi#12049

@devreal devreal added the chap-p2p Point to Point Communication Chapter Committee label Nov 7, 2023
@devreal devreal self-assigned this Nov 7, 2023
@raffenet
Copy link

This also came up recently in pmodels/mpich#6751. We would support this change. cc @hzhou.

@hzhou
Copy link

hzhou commented Nov 10, 2023

@devreal It is incorrect that MPI_Testsome does set the MPI_ERROR field. It only sets the MPI_ERROR field if it returns MPI_ERR_IN_STATUS. If it returns MPI_SUCCESS, the MPI_ERROR is unset. Similarly MPI_Testall/MPI_Waitsome/MPI_Waitall. Now we also need cover MPI_Request_get_status_{all,some}.

I think the rationale -

Rationale. The error field in status is not needed for calls that return only one status,
such as MPI_WAIT, since that would only duplicate the information returned by the
function itself. The current design avoids the additional overhead of setting it, in such
cases. The field is needed for calls that return multiple statuses, since each request
may have had a different failure. (End of rationale.)

makes sense. Users in general don't need to check the status MPI_ERROR field and it is unnecessary to always require implementations to set MPI_ERROR field, in the same sense that implementations are not required to set the MPI_SOURCE or MPI_TAG field in a send request status.

However, I am strongly against the interpretation that implementations must not touch the MPI_ERROR field unless returning MPI_ERR_IN_STATUS -

The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS.

The if and only if seems to force this interpretation.

In reality, it is often simpler for implementations to blindly update the entire statuses struct. To untouch certain field, e.g. MPI_ERROR field, often forces implementations to add more code and instructions to preserve the field. This is, IMO, contrary to the rationale. IMO, the rationale is supposed to give implementation flexibility rather than adding restrictions without any rationale on the user's part.

Thus, I am against requiring implementations to always set the MPI_ERROR field. But I am also against requiring implementations to if and only if set the field when returning MPI_ERR_IN_STATUS. My suggestion is to leave the MPI_ERROR field undefined unless MPI_ERR_IN_STATUS is returned.

@devreal
Copy link
Author

devreal commented Nov 10, 2023

@devreal It is incorrect that MPI_Testsome does set the MPI_ERROR field. It only sets the MPI_ERROR field if it returns MPI_ERR_IN_STATUS. If it returns MPI_SUCCESS, the MPI_ERROR is unset.

I assumed the error case:

if the appropriate error handler is set

The current design avoids the additional overhead of setting it, in such cases.

Maybe this was true 20 years ago, it certainly isn't today. The potential saving (one load, one store) from allowing it is irrelevant on today's architectures so there is no rationale for allowing such "optimizations" in a public API.

My suggestion is to leave the MPI_ERROR field undefined unless MPI_ERR_IN_STATUS is returned.

That is at the heart of this issue: we should never leaves values undefined in our return values. They can easily cause undefined behavior and I hope we can agree that that is a bad idea. The trade-off between avoiding undefined behavior and saving two instructions should be clear.

"Premature optimization is the root of all evil."

@bosilca
Copy link
Member

bosilca commented Nov 12, 2023

Also #522

@abouteiller abouteiller added the wg-ft Fault Tolerance Working Group label Nov 14, 2023
@jeffhammond
Copy link
Member

That is at the heart of this issue: we should never leaves values undefined in our return values. They can easily cause undefined behavior and I hope we can agree that that is a bad idea. The trade-off between avoiding undefined behavior and saving two instructions should be clear.

Is it only two instructions when MPI_{Wait,Test}all return MPI_SUCCESS and you require the implementation to set the MPI_ERROR field of every single status object to MPI_SUCCESS?

If we are so concerned about uninitialized data, shouldn't we make underflow illegal? Or should we require all buffers passed to receive functions to be initialized to ensure that users can never read uninitialized data?

No, because we have MPI_Get_count and the user is expected to determine how much of the buffer has been written, and not read more than that.

In the same manner, it is clear when the fields of status objects can/should be read. There is no serious threat from UB unless users write obviously incorrect code.

@devreal
Copy link
Author

devreal commented Nov 29, 2023

Is it only two instructions when MPI_{Wait,Test}all return MPI_SUCCESS and you require the implementation to set the MPI_ERROR field of every single status object to MPI_SUCCESS?

OK, one instruction per status, in the worst case. In the best case, the compiler is able to use wider move instructions if we don't have to poke holes into the status object. See OMPI for example: https://github.com/open-mpi/ompi/blob/main/ompi/request/request.h#L219 OMPI_COPY_STATUS is called for every status on every request and we just distinguish whether the error field is copied or not (typically with a compile time constant so the branch in the macro is optimized out).

Or should we require all buffers passed to receive functions to be initialized to ensure that users can never read uninitialized data?

There is an important difference here: the buffer is updated the same way no matter how you send or receive your data. It doesn't matter whether you use MPI_Recv or MPI_Irecv, MPI_Send or MPI_Isend. The problem is that the resulting value of MPI_ERROR is dependent on the procedure used to complete an operation. That's not true for message buffers.

In the same manner, it is clear when the fields of status objects can/should be read. There is no serious threat from UB unless users write obviously incorrect code.

I went through the standard to try and collect all the places where we define the semantics of MPI_ERROR.

Section 3.2.5, at the top:

In C, status is a structure that contains three fields named MPI_SOURCE, MPI_TAG,
and MPI_ERROR; the structure may contain additional fields. Thus,
status.MPI_SOURCE, status.MPI_TAG, and status.MPI_ERROR contain the source, tag, and
error code, respectively, of the received message.

Similar verbiage follows for Fortran. So the MPI_ERROR field contains the error code of the received message?

Not so fast! Later in this section we say:

In general, message-passing calls do not modify the value of the error code field of
status variables. This field may be updated only by the functions in Section 3.7.5 that
return multiple statuses. The field is updated if and only if such function returns with an
error code of MPI_ERR_IN_STATUS.

So this contradicts the first paragraph: status.MPI_ERROR does not contain the status of the received message, unless there was an error in some tested/awaited request. But we still don't know exactly it is updated.

Section 3.7.3:

Error codes belonging to the error class MPI_ERR_IN_STATUS should be returned only
by the MPI completion functions that take arrays of MPI_Status. For the functions that take
a single MPI_Status argument, the error code is returned by the function, and the value of
the MPI_ERROR field in the MPI_Status argument is undefined (see 3.2.5).

So now we say that the value of the MPI_ERROR field is undefined after a call to a function that takes a single status argument. It says the value of the field is undefined, which is different from "updated if and only if" in the previous paragraph: are implementations allowed to change the field? "Undefined" suggests they can, "updated if and only if" says they cannot. But we still don't know how the field is updated if it is updated.

Section 3.7.5 (MPI_WAITALL):

When one or more of the communication operations completed by a call to
MPI_WAITALL fail, it is desirable to return specific information on each communication. The
function MPI_WAITALL will return in such case the error code MPI_ERR_IN_STATUS and will
set the error field of each status to a specific error code. This code will be MPI_SUCCESS, if
the specific communication completed; it will be another specific error code, if it failed; or it
can be MPI_ERR_PENDING if it has neither failed nor completed. The function MPI_WAITALL
will return MPI_SUCCESS if no request had an error, or will return another error code if it
failed for other reasons (such as invalid arguments). In such cases, it will not update the
error fields of the statuses.

Now we know. There is more text explaining these rules in different colors for MPI_WAITSOME and in Section 9.4.

What I am proposing is that we cut out all the special cases littered throughout the standard and stick to the semantics at the very top of 3.2.5: the MPI_ERROR field contains the error code of the received message.

Save some trees!

@wgropp
Copy link

wgropp commented Nov 29, 2023

Just so we're talking about what counts in terms of overhead, it isn't instructions - its memory references, especially writes. Getting a single byte out to memory, while it might happen in the background, can impact subsequent operations.

@devreal
Copy link
Author

devreal commented Nov 29, 2023

The status structure in OMPI is 24B (12B public, 12B private), The error field is 4B, or 16% of the total structure size. In each function call, we're writing several kB of data to memory. Most of that (and likely all of the status) will never make its way out of a cache and then its part of a cache line anyway. The overhead of storing 4 additional B is negligible. From where I stand, the discussion on overhead in this context is moot and we should focus on usability here.

@jprotze
Copy link

jprotze commented Nov 29, 2023

If an application cares about ub, it should just initialize the error field before passing the status to MPI.
MPI_Status s = {.MPI_ERROR=MPI_SUCCESS}; should even initialize the whole object.

We only need to make sure then that MPI will never copy uninitiated memory to this field.

@devreal
Copy link
Author

devreal commented Nov 29, 2023

If an application cares about ub, it should just initialize the error field before passing the status to MPI.
MPI_Status s = {.MPI_ERROR=MPI_SUCCESS}; should even initialize the whole object.

That's not the problem. See my initial post. The problem is that some procedure calls set the field and others don't. Of course, to avoid UB you just don't make the mistake of reading that field after a call to MPI_Wait or MPI_Waitany. Good ol' "just don't pull the trigger on that footgun" advice :)

@hzhou
Copy link

hzhou commented Nov 29, 2023

It is a strawman arguing for the instruction cost. The key points are:

  • Users do not need to check status.MPI_ERROR other than when aggregate completion calls return MPI_ERR_IN_STATUS.
  • Whether the implementations set or do not set the MPI_ERROR field when it is not returning MPI_ERR_IN_STATUS is an implementation detail, since it does not affect the correctness of a correct user program.

The cost argument is ultimately an implementation issue. For some implementations, it may be more efficient or easier always to set the status fields. But for some implementations that may be a cost or concern, say, due to some weird internal design. Anyway, until we determine it is necessary, why should the standard restrict the implementation to one way or another?

@jeffhammond
Copy link
Member

jeffhammond commented Nov 30, 2023

Of course, to avoid UB you just don't make the mistake of reading that field after a call to MPI_Wait or MPI_Waitany. Good ol' "just don't pull the trigger on that footgun" advice :)

How is this different from "don't read the receive buffer until the operation is complete"? The specification states if and when one can read various parts of memory. We say when the user can and cannot read status.MPI_ERROR.

The best solution for users who are confused is to just use MPI_STATUS(ES)_IGNORE unless (1) they are using a bulk completion operation and errors will not stop program execution, (2) underflow is possible and they intend to call MPI_Get_count or similar, or (3) they are using wildcards and .MPI_SOURCE or .MPI_TAG are needed.

Given that very few users allow errors to return in the first place, the hand-wringing over this issue seems undeserved. Writing error-aware code is nontrivial and it is not too much to ask to require that users only read .MPI_ERROR when it is appropriate.

@jeffhammond
Copy link
Member

jeffhammond commented Nov 30, 2023

If we change the specification to allow but not require implementations to set the MPI_ERROR field then the people who believe it's important can implement it and prove that it's trivial and free. That evidence would justify changing it to a requirement in the future.

Maybe I missed it, but any change here is backwards-incompatible, although I doubt any user code depends on the current behavior.

@jeffhammond
Copy link
Member

jeffhammond commented Nov 30, 2023

Regarding performance, I think some folks need a refresher on CPU microarchitecture.

When one stores any number of bytes to memory with an x86 store instruction, it causes RFO (request for ownership), which fetches the 64B lines (usually 1) containing those bytes. This means that the 4B store generates 64B of memory traffic from memory into cache and consumes a cache line, regardless of when the line is evicted. It also consumes an entry in one or more microarchitectural state buffers, such as the store buffer and - if the memory page was not already translated - a TLB entry. In the worst case, when the line is cold, the store will take 50+ nanoseconds (100-200 instructions), so we must hope that our CPU can have a sufficient number of instructions, stores and TLB walkers in-flight to hide this latency.

Note that in the case of MPI RMA request-based operations, MPI_ERROR is the only valid field. If we unconditionally require MPI_ERROR to be set, then we are forcing the implementation of bulk request completion operations to add the aforementioned overheads to what is otherwise a no-op in most cases. Of course, MPI_STATUS(ES)_IGNORE should be used in most cases, but if I am writing an error-aware library, I can't do that, so now I pay the price of MPI_ERROR_IN_STATUS every time, rather than only when I need it.

Do implementations mitigate this by checking whether errors return and only initializing the MPI_ERROR field in that case? Is this a trivial change or not? I don't recall seeing any code like this in MPICH or OMPI. Is it possible to have a single flag for this or does one need to check multiple communications and/or sessions?

If we add the branch on can-errors-return to hide the aforementioned 4B store overheads, modern CPUs will still speculatively execute both sides of the branch, so we can still end up taking a TLB and cache miss anyways.

@bosilca
Copy link
Member

bosilca commented Nov 30, 2023

@jeffhammond In OMPI we had a lax interpretation of the The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS requirement. We kept it simple (aka setting all fields) until very recently (2021, issue PR) basically until someone came up with a test from MPICH that was failing in OMPI and we had to fix it.

Such a trivial matter resulting in such a waste of time. We can argue on how users shall be doing their coding, but if we put ourselves in their place for a little bit we would realize how ridiculous all this debate is. Writing correct distributed code is already hard enough without forcing the users to learn all the convoluted cases in which they should or should not test a field in a structure the user had explicitly requested to be returned.

@jeffhammond
Copy link
Member

My position is that we should relax the specification to allow both the OMPI 2020 and MPICH implementations. I'm sorry that you had to fix that, but it was a choice. You could have treated that trivial noncompliance the same as MPI_THREAD_MULTIPLE and RMA were treated for years.

Is it your position that we should now change the spec to force MPICH to change their implementation (and break backwards-compatibility of MPI)?

What is your proposal to fix the undefined situation with RMA and NBC requests, so that users do not have to learn that .MPI_SOURCE and .MPI_TAG can't be read in those cases?

@bosilca
Copy link
Member

bosilca commented Nov 30, 2023

It was a indeed a choice OMPI devs made to abide by the standard. Based on the fact that nobody complained about this change in a little over 2 years, I don't think we affected any codes because as you mentionned above 99.*% of users rely on the default error handler for MPI applications.

My position is "make it simple and consistent for the users", and if some implementations have to change their ways so be it. What @devreal proposed above seems very sensible, initialize the MPI_ERROR field all the time, and let the users decide how to use it.

@jprotze
Copy link

jprotze commented Nov 30, 2023

A reasonable choice might be to explicitly state that the value of the field is undefined in these cases. That would allow both implementations and at the same time shout out that accessing the field is ub, even if the value was initialized before the call.

The argument of not returning uninitialized memory to the application would reach beyond the receive buffer, just think of the index array in waitsome.

@jeffhammond
Copy link
Member

initialize the MPI_ERROR field all the time, and let the users decide how to use it.

I will continue to object to this, because it forces implementations to do completely unnecessary operations in order to accommodate user illiteracy.

The majority of users don't handle errors at all. You are foisting backwards-compatibility, implementation changes, and runtime overhead onto every usage of status objects, in order to support illiterate users trying to write error-aware code.

How many MPI users do you know who write error-aware MPI code that is correct except for its use of the .MPI_ERROR field when the return code is not MPI_ERROR_IN_STATUS?

A reasonable choice might be to explicitly state that the value of the field is undefined in these cases.

This is what I want.

We should clean up the text to say very clearly that .MPI_ERROR is only relevant when the return code is MPI_ERROR_IN_STATUS. This is not a hard rule to learn.

@hppritcha
Copy link

This is what I want.

We should clean up the text to say very clearly that .MPI_ERROR is only relevant when the return code is MPI_ERROR_IN_STATUS. This is not a hard rule to learn.

This is what I would like to see as well.

@bosilca
Copy link
Member

bosilca commented Nov 30, 2023

If your proposal is to replace

The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS

with

The field is only relevant when functions return with an error code of MPI_ERR_IN_STATUS

then we are in complete agreement.

@hzhou
Copy link

hzhou commented Nov 30, 2023

In OMPI we had a lax interpretation of the The field is updated if and only if such function returns with an error code of MPI_ERR_IN_STATUS requirement. We kept it simple (aka setting all fields) until very recently (2021, issue PR) basically until someone came up with a test from MPICH that was failing in OMPI and we had to fix it.

I would love to remove those tests from MPICH testsuite. I don't understand why we went through so much trouble engineering those tests and fixing the code back then instead of protesting.

@devreal
Copy link
Author

devreal commented Dec 5, 2023

FWIW, RMA requests are the only requests where we do not have to touch the the status object if one is provided. For all other requests, the status object must at least contain some information on whether the request was cancelled (even in cases where cancellation is not allowed, like collective operations). That's a rather narrow justification for leaving the error field unset...

@devreal
Copy link
Author

devreal commented Jan 27, 2024

Here is a nugget from Section 3.7.2:

An empty status is a status that is set to
return tag = MPI_ANY_TAG, source = MPI_ANY_SOURCE, error = MPI_SUCCESS, and is also
internally configured so that calls to MPI_GET_COUNT and MPI_GET_ELEMENTS return
count = 0 and MPI_TEST_CANCELLED returns false. We set a status variable to empty
when the value returned by it is not significant. Status is set in this way so as to prevent
errors due to accesses of stale information.

Based on the highlighted text, a user can reasonably assume that for all but receive requests the returned status is empty as defined here, i.e., has a well-defined content. After all, for all but receive requests the value of the status object is not significant if no error occurred.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chap-p2p Point to Point Communication Chapter Committee mpi-4.2 wg-ft Fault Tolerance Working Group
Projects
Status: To Do
Development

No branches or pull requests

10 participants