User-Level Failure Mitigation #20
This chapter describes a flexible approach, providing process fault tolerance by allowing the application to react to failures, while maintaining a minimal execution path in failure-free executions. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library.
More information on the prototype implementation in Open MPI can be found here:
Current (full) draft proposal (May 2022):
Process errors denote the impossibility of providing normal MPI semantics during an operation (as observed by a particular process). The proposal clearly specifies the error classes returned in this scenario, provides new APIs for applications to obtain a consistent view of failures, and adds new APIs to create replacement communication objects for damaged ones.
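As a rough sketch of the usage pattern this enables (function names follow the Open MPI ULFM prototype, e.g. MPIX_Comm_revoke/MPIX_Comm_shrink and the mpi-ext.h header; the standardized names may differ):

```c
#include <mpi.h>
#include <mpi-ext.h>  /* Open MPI prototype extensions (MPIX_*); an assumption here */

/* Hedged sketch: after a process-failure error is raised, invalidate the
 * damaged communicator and build a replacement containing only survivors.
 * Error checking omitted; names follow the ULFM prototype, not final text. */
void repair(MPI_Comm *comm)
{
    MPI_Comm newcomm;

    /* Revoke so that no process blocks forever on a failed peer:
     * pending and future operations on *comm raise MPI_ERR_REVOKED. */
    MPIX_Comm_revoke(*comm);

    /* Collectively create a replacement communicator of survivors. */
    MPIX_Comm_shrink(*comm, &newcomm);

    MPI_Comm_free(comm);
    *comm = newcomm;
}
```

This illustrates the "replacement communication objects" idea: recovery is explicit and user-driven rather than automatic.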
Impact on Implementations
Adds semantics and functions to communicator operations. Implementations that do not care about fault tolerance still have to provide all the proposed functions, with the correct semantics when no failures occur. However, an implementation that never raises an exception related to process failures does not have to actually tolerate failures.
Impact on Applications / Users
Provides fault tolerance to interested users. Users/implementations that do not care about fault tolerance are not impacted: they can request that fault tolerance be turned off, or the implementation can inform them that it is not supported.
This issue covers only the main portion of ULFM (the general fault model, how MPI handles faults generally, and communicator-based functions). For the sections on RMA, see ticket #21; for the sections on Files, see ticket #22.
Stronger consistency models are more convenient to users, but much more expensive. These can be implemented on top of this proposal as user libraries (or potential future candidates to standardization, without conflict).
The run-through stabilization proposal was a completely different effort. This ticket represents a ground-up restart, accounting for the issues raised during that previous work.
Lots of old conversations and history attached to this issue can be found on the old Trac ticket.
Updated PDF with comments from June 2016:
The full diff and changes from the last reading are available at https://github.com/mpi-forum/mpi-standard/pull/13/commits
Please can you fix two things:
Proposal for that: it should look like
process failure --> 20, <and all the many pages in the other section, like dynamic>
process fault tolerance, see fault tolerance
@dholmes-epcc-ed-ac-uk I think it's more than just what you mentioned here. The biggest challenge, I believe, that we faced with non-shrinking recovery of communicators is that you'd also have to define how each of the collective functions behaves in the face of "gaps" in your communicator. For some (
@wesbland I think that Martin's point was that if ULFM is put in before we know the impact of these other ideas then we could be doing the wrong thing. That could be used as a way to delay ULFM forever and I don't want to advocate or support that. On the other hand, do you have notes/documentation from your previous consideration of "non-shrinking recovery"? I'd like to review that, and the reasoning behind the "could easily be added later" statement. If we could do recovery without needing revoke or shrink, would they be deprecated and removed?
This turned out to be quite the tome so please forgive the length, but I wanted to capture all of the discussion from yesterday in one place as there were many side discussions that went on.
I want to show the example that @schulzm gave that could potentially cause a deadlock with ULFM as currently specified.
In this example, there is a green communicator and an orange communicator which overlap. Process 1 is communicating (point-to-point) with process 5 and process 3 is communicating (point-to-point) with process 4. Processes 3 and 5 simultaneously fail and trigger error handlers which revoke both the green and orange communicator. This sends all processes into error handlers where they attempt to repair the two communicators by shrinking. The problem here is that
We tried to come up with a variety of scenarios which would let the user escape this situation. I'm probably going to forget some because I didn't take notes quickly enough, so add more if I leave some out.
In the end, @schulzm suggested that the main problem is the combination of asynchronous notification (error reporting and MPI_COMM_REVOKE) and synchronous recovery (MPI_COMM_SHRINK).
We then briefly discussed what a solution that works as @schulzm suggested might look like. One solution would be to just bring the revoke and shrink operations (as defined in ULFM) into the MPI library itself and not expose them to the user. Instead of returning
There are some (probably many) unresolved questions around this:
Whatever we decide to do next, the concerns of @schulzm and @dholmes-epcc-ed-ac-uk (and others) are very valid. Namely, that we have to be very careful with whatever FT solution(s) we pass as a forum. We don't want to pick something that will exclude future solutions and even more importantly, we don't want to pick something that we would later discover has a fundamental flaw.
Hopefully this is a fair representation of everyone's opinions as expressed yesterday. I know that the ULFM authors (myself included) will want to think more about the concerns expressed to make sure we can't come up with a more elegant solution to resolve the problem before trying to re-architect the entire proposal.
@dholmes-epcc-ed-ac-uk I completely agree with your statement about wanting to make sure we are forward compatible with FT, and my remark about allowing both shrinking and non-shrinking recovery is more hopeful than certain at this point. On quick consideration, it seems to me that if we had all the semantics that we discussed yesterday (and I wrote about above), picking the particular style of communicator recovery is something that probably could be swappable while still being compatible with the other styles. We'd definitely have to work through that more extensively though.
As for the issues with leaving holes in the communicators, I don't think that it's an impossible problem. I think it's just a much more difficult problem than using shrinking recovery because it requires many more modifications to the rest of MPI. I don't have specific notes (or if I do, they're from years ago and hard to find), but I think my previous example of the two different collectives demonstrates how tricky the semantics of collectives could be if you didn't have dense communicators. I can expand a bit here:
If you have an
To name a few: you have to reconcile communicator names between respawned and surviving processes; hardware matching is difficult (possibly impossible) as distributed race conditions allow messages from previous "epochs" in the comm to continue to be delivered for some time; collective operations have to be reimplemented and are going to be more expensive; etc.
I obviously strongly disagree with the way @schulzm frames the issue. The issue (as illustrated by @wesbland above) has a known solution that has been deployed in practice in developing the X10 programming language FT support. This solution (the one with an iRecv from any source on an over-spanning communicator) is effective and bug-free. I can however concede the point that standardizing a solution to that problem (rather than letting users deal with it on their own in a more ad-hoc fashion)
Obviously, feedback on and participation in the WG to follow through is most welcome!
@wesbland First of all, thanks for the extensive summary, I think this hits it fairly well. Only one clarification: the part of "In the end, @schulzm suggested that the main problem is the combination of asynchronous notification (error reporting and MPI_COMM_REVOKE) and synchronous recovery (MPI_COMM_SHRINK)." is not quite what I had in mind. I suggested that the combination of synchronous notification (as in, one has to actively call a particular set of operations on the failing communicator, instead of being interrupted somehow) and collective recovery (all processes have to participate, as in MPI_COMM_SHRINK) is the main problem.
Regarding "However, it also has potential downsides when compared to current ULFM, primarily that it enforces a particular recovery model which the user might not want" - I agree, this is a concern and we would have to look into it. It seems to me, though (in the short time we had to think about this), that this could be alleviated in a composable way and on a per-communicator basis. For each communicator, the user could decide whether they want automatic shrinking or whether they are OK with leaving gaps, which would then avoid triggering any recovery operation (and should even be lower cost than the current ULFM scheme?). The former would be targeted (probably) at end users, especially those wanting collectives, while the latter could be a reasonable solution for runtimes that manage the process space themselves (like the mentioned X10 runtime).
One additional comment regarding: "This solution is workable for applications that can continue with fewer processes, but it still kludgy for apps that can't recover in that way easily" - yes, that is true. However, ReInit-like solutions (at least the ones we looked at) already rely on the asynchronous notification scheme (and have to). Having a shrinking solution (which, I agree, other user groups can use or even need) based on the same notification concept has a much higher chance of being combinable towards a compatible solution that supports both. On the other hand, having two different notification schemes in MPI (one for shrinking and one for a ReInit style, which we will need as well, since a majority of our current apps are that way) is likely to clash no matter what we do.
Apologies. That's what I was trying to convey.
I completely agree here. That's what I was trying to get at in the conversation with @dholmes-epcc-ed-ac-uk. That seems like something that could interoperate more safely than completely changing the model. I think there are other things that might be able to do something with the non-shrinking model. Perhaps libraries such as Fenix from @marcgamell?
I'm not sure what you mean here. I would think that having an automatically shrinking (or not) solution might be worse because it doesn't give you the clean way of jumping to a clean state the way reinit does. It would still have the problems that reinit initially faced with ULFM, namely that all communicators, windows, files, requests, etc. would need to be tracked by the application and torn down in an error handler. Am I missing something?
You used the term "non-shrinking" to describe the "leave gaps" approach. My understanding of that portion of the discussion was a little different. The communicator would not shrink, in that it contains the same number of processes (some failed and some non-failed), but would shrink, in that only non-failed processes would be expected to participate. There is no expectation or intent to replace failed processes with non-failed ones in this communicator. Therefore, "non-shrinking" is an incorrect characterisation; the active portion of the communicator does shrink.
@wesbland MPI_REDUCE with failed processes: failed processes do not participate. The comm size for the operation is the number of non-failed processes. The operation may fail with MPI_ERR_PROC_FAILED if a new (non-acked) process failure is detected during the operation (i.e. it happened before the operation completed).

MPI_ALLTOALL with gaps: failed processes do not participate. The input/output buffers would be sized according to the total number of ranks, but the array elements related to failed processes would be unaffected by the operation. Compare with MPI_SEND/MPI_RECV to/from MPI_PROC_NULL, which succeed but do not read/write data from/into the user buffer. Basically, treat failed processes exactly like MPI_PROC_NULL by extending semantics already present elsewhere in the MPI Standard.
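The MPI_PROC_NULL analogy can be shown concretely. This is plain standard MPI (no ULFM extensions needed); the "leave gaps" idea would give failed ranks the same no-op behavior in collectives:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int buf = 42;
    MPI_Status st;

    /* A send to MPI_PROC_NULL succeeds immediately; no data is moved. */
    MPI_Send(&buf, 1, MPI_INT, MPI_PROC_NULL, 0, MPI_COMM_WORLD);

    /* A receive from MPI_PROC_NULL completes immediately; buf is
     * untouched, st.MPI_SOURCE is MPI_PROC_NULL, and the received
     * element count is 0. */
    MPI_Recv(&buf, 1, MPI_INT, MPI_PROC_NULL, 0, MPI_COMM_WORLD, &st);

    MPI_Finalize();
    return 0;
}
```

Extending this to collectives would mean, e.g., that the slots of failed ranks in an alltoall buffer are simply left unmodified.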
@abouteiller Reconcile communicator names between re-spawned and surviving processes: no processes are re-spawned, so there is no issue here.

Hardware matching is difficult (possibly impossible) as distributed race conditions make messages from previous "epochs" in the comm continue to be delivered for some time: matching can continue without interruption or change because no context id has changed and no ranks have changed. There is no "epoch" concept.

Collective operations have to be reimplemented and are going to be more expensive: agreed, this is my main concern regarding this approach. A tree-based collective, for example, would need to rebuild the tree (a distributed agreement, in the worst case). Some topology structures are easier to repair than others. In many cases, if process X can calculate its parent/children processes for a particular topology (as a local operation) then it can calculate their parent/children processes too (also as a local operation). So, it could figure out (locally) how to skip over a failed process. Forcing a higher cost, even for the non-failure usage, is going to be a difficult sell. Alternatively, build the initial tree/topology during communicator creation (known to be expensive) and rebuild it during failure recovery (known to be expensive); otherwise assume that the current topology is usable (normal usage has no performance degradation).
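The local skip-over computation described above can be illustrated for a binomial broadcast tree, where each rank's parent is a purely local function of the rank. A minimal sketch in plain C (the `failed` array is an illustrative assumption standing in for locally-known failure information, and the root is assumed to survive):

```c
#include <assert.h>

/* Parent of rank r in a binomial broadcast tree rooted at 0:
 * clear the lowest set bit of r. The root has no parent (-1). */
static int parent(int r)
{
    return (r == 0) ? -1 : (r & (r - 1));
}

/* Locally compute an effective parent for `rank`, skipping failed
 * ancestors: if my parent has failed, adopt its parent, and so on.
 * Because parent() is a local computation, no communication is
 * needed to repair the tree after a failure (assumes rank 0 alive). */
static int effective_parent(int rank, const int *failed)
{
    int p = parent(rank);
    while (p > 0 && failed[p])
        p = parent(p);
    return p;  /* -1 only for the root itself */
}
```

For example, with rank 4 failed, rank 5 (whose tree parent is 4) locally re-attaches to rank 0 (4's parent), with no agreement protocol in the common case.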
@schulzm The ReInit recovery could be built on top of the "leave gaps" approach by jumping back to the Init point whenever MPI_ERR_PROC_FAILED was raised. However, in order to guarantee that all processes noticed this and did the same, the scope of the revoke-like notification would have to be expanded to any MPI function, or made into an interrupt. If one process has jumped back to Init (or is about to), that involves destroying all communicators (+windows+files+requests+etc.); there is no longer any point in letting other processes use any MPI object, even if it were safe for them to do so.

On the other hand, the current notification mechanism (revoke) notifies that all processes in the affected communicator have failed, which is incorrect. The scope could be narrowed to a notification of particular process failures. A ReInit recovery model would react the same way, because even a single process failure would cause a jump back to Init. Finer-grained recovery would be possible, though: continuing with the same communicator (now 'sparse' or 'with gaps', with fewer non-failed processes) would be the default; spawning new processes and creating a new communicator the same size as the original before the failure would be the user's responsibility (using MPI_COMM_SPAWN and MPI_INTERCOMM_MERGE, exactly as in the ULFM examples of such recovery).
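The user-driven restore-to-original-size path mentioned above can be sketched as follows. This is a hedged outline, not normative text: MPIX_Comm_shrink follows the ULFM prototype naming, the helper name and argument layout are assumptions, and error handling and rank reordering are omitted:

```c
#include <mpi.h>
#include <mpi-ext.h>  /* ULFM prototype extensions (MPIX_*); an assumption here */

/* Hedged sketch: shrink away failed ranks, respawn replacements, and
 * merge back to a communicator of the original size, per the pattern
 * in the ULFM recovery examples. `command` is the executable to spawn. */
void restore_to_size(MPI_Comm damaged, int orig_size, char *command,
                     MPI_Comm *restored)
{
    MPI_Comm survivors, inter;
    int nsurv;

    MPIX_Comm_shrink(damaged, &survivors);        /* drop failed ranks */
    MPI_Comm_size(survivors, &nsurv);

    /* Respawn exactly the number of processes that were lost. */
    MPI_Comm_spawn(command, MPI_ARGV_NULL, orig_size - nsurv,
                   MPI_INFO_NULL, 0 /* root */, survivors,
                   &inter, MPI_ERRCODES_IGNORE);

    /* Merge the intercommunicator into one flat intracommunicator;
     * survivors take the low ranks. */
    MPI_Intercomm_merge(inter, 0, restored);

    MPI_Comm_free(&inter);
    MPI_Comm_free(&survivors);
}
```

Note that the merged communicator does not, by itself, restore the original rank numbering; an application would typically reorder with MPI_Comm_split or similar.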
@dholmes-epcc-ed-ac-uk I think I'd have to start trying to write text for the collectives before deciding whether it would really be that hard. In principle, I think we agree on what the "right thing" to do here would be, as long as we can figure out the right words. You're right that I was conflating the two terms. I did mean "leave gaps" in the same way you did.
This is an update that substitutes the 'ack/get_acked' functions with 'get_failed/ack_failed' functions, as we decided during the last forum meeting.
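A hedged sketch of how the substituted interface might be used. The signatures follow the working-group drafts as I understand them (MPIX_Comm_get_failed returning a group, MPIX_Comm_ack_failed acknowledging a count); the final names and signatures may differ:

```c
#include <mpi.h>
#include <mpi-ext.h>  /* MPIX_* prototype extensions; an assumption here */

/* Hedged sketch: snapshot the locally-known failed processes as a
 * group, then acknowledge them so that, e.g., MPI_ANY_SOURCE receives
 * can proceed despite those failures. Draft signatures, not final. */
void note_failures(MPI_Comm comm)
{
    MPI_Group failed;
    int nfailed, nacked;

    MPIX_Comm_get_failed(comm, &failed);   /* consistent local snapshot */
    MPI_Group_size(failed, &nfailed);

    /* Acknowledge every failure known so far; nacked reports how many
     * failures are actually covered by this acknowledgement. */
    MPIX_Comm_ack_failed(comm, nfailed, &nacked);

    MPI_Group_free(&failed);
}
```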
Experimental version (for comment):
This adds a number of new error reporting modes that permit loosely synchronized and implicit recovery modes. This has been available as a branch on the wg-ft repository; I am just duplicating it here so it is accessible to a wider audience.