
User-Level Failure Mitigation #20

Open
wesbland opened this issue Nov 11, 2015 · 24 comments
Assignees
Labels
mpi-5 (For inclusion in the MPI 5.0 standard), wg-ft (Fault Tolerance Working Group)
Milestone

Comments

@wesbland
Member

wesbland commented Nov 11, 2015

Description

This chapter describes a flexible approach, providing process fault tolerance by allowing the application to react to failures, while maintaining a minimal execution path in failure-free executions. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library.

Note: the versions attached to the ticket are updated loosely and are not always current. Please access the repository to see the latest revision, especially between forum meetings.

More information on the prototype implementation in Open MPI can be found here:
http://fault-tolerance.org/

Pull Requests:
READING MAY 2022: Slice 1: https://github.com/mpi-forum/mpi-standard/pull/665

Current (full) draft proposal (May 2022):
20220509-ulfm-master.pdf
rolling diff: https://github.com/mpiwg-ft/mpi-standard/pull/19/files

RFCs:
https://github.com/mpiwg-ft/mpi-standard/pull/17
https://github.com/mpiwg-ft/mpi-standard/pull/18
https://github.com/mpiwg-ft/mpi-standard/commit/92f7596e8958dfc3a71bbc83514dec3d3b7dcc07

Proposed Solution

Process errors denote the impossibility of providing the normal MPI semantics during an operation (as observed by a particular process). The proposal clearly specifies the error classes returned in this scenario, provides new APIs for applications to obtain a consistent view of failures, and adds new APIs to create replacement communication objects for damaged ones.
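
As a rough, non-normative illustration of the pattern this enables, a typical recovery step might look like the sketch below. The function and error-class names (MPI_Comm_revoke, MPI_Comm_shrink, MPI_ERR_PROC_FAILED, MPI_ERR_REVOKED) follow the ULFM draft; the Open MPI prototype exposes them with an MPIX_ prefix.

```c
#include <mpi.h>

/* Minimal sketch of the recovery pattern enabled by the proposal.
 * Not normative text: names follow the ULFM draft (MPIX_ in Open MPI). */
static void step_with_recovery(MPI_Comm *comm)
{
    int x = 1, sum = 0, eclass = MPI_SUCCESS;

    /* Errors must be returned (not abort the job) for the application
     * to be able to react to them. */
    MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

    int rc = MPI_Allreduce(&x, &sum, 1, MPI_INT, MPI_SUM, *comm);
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);

    if (eclass == MPI_ERR_PROC_FAILED || eclass == MPI_ERR_REVOKED) {
        MPI_Comm newcomm;
        MPI_Comm_revoke(*comm);            /* interrupt peers still blocked on *comm */
        MPI_Comm_shrink(*comm, &newcomm);  /* replacement comm without failed ranks  */
        MPI_Comm_free(comm);
        *comm = newcomm;
        /* ...roll back to the last consistent state and retry... */
    }
}
```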

Impact on Implementations

Adds semantics and functions to communicator operations. Implementations that do not care about fault tolerance still have to provide all the proposed functions, with the correct semantics when no failures occur. However, an implementation that never raises an exception related to process failures does not have to actually tolerate failures.

Impact on Applications / Users

Provides fault tolerance to interested users. Users/implementations that do not care about fault tolerance are not impacted. They can request that fault tolerance be turned off, or the implementation can inform them that it is not supported.

Connected Issues

This issue only covers the main portion of ULFM (the general fault model, how MPI handles faults generally, and the communicator-based functions). For the sections on RMA, ticket #21 has more information. For the sections on Files, ticket #22 has more information.

Alternative Solutions

Stronger consistency models are more convenient to users, but much more expensive. These can be implemented on top of this proposal as user libraries (or potential future candidates to standardization, without conflict).

History

The run-through stabilization proposal was a completely different effort. This current ticket represents a ground-up restart, accounting for the issues raised during that previous work.

Lots of old conversations and history attached to this issue can be found on the old Trac ticket.

@wesbland
Member Author

The most recent copy of the text can be found here.

@wesbland wesbland added the wg-ft Fault Tolerance Working Group label Nov 12, 2015
@wesbland wesbland added scheduled reading Reading is scheduled for the next meeting and removed not ready labels Feb 2, 2016
@wesbland
Member Author

@abouteiller
Member

Same pdf as above, with color annotations to find the changes.

issue-20-21-22-anotated.pdf

@akoniges akoniges closed this as completed Mar 2, 2016
@akoniges akoniges reopened this Mar 2, 2016
@abouteiller
Member

20160302-Aurelien@mpiforum-ulfmreading.pdf
slides used for the reading

@abouteiller
Member

20160608-Aurelien@mpiforum-ulfmreading.pdf
Updated slides

@abouteiller abouteiller added had reading Completed the formal proposal reading and removed scheduled reading Reading is scheduled for the next meeting labels Jun 8, 2016
@abouteiller abouteiller self-assigned this Jun 8, 2016
@abouteiller
Member

abouteiller commented Sep 7, 2016

Updated PDF with comments from june 2016:
issue-20-21-22-a905a5a.pdf

Full diff and changes from last reading available https://github.com/mpi-forum/mpi-standard/pull/13/commits

@abouteiller abouteiller added the scheduled reading Reading is scheduled for the next meeting label Sep 7, 2016
@RolfRabenseifner

Can you please fix two things:

  • In the errata item, one section is mentioned twice
    -- it should be in the sequence of pages.
  • In the General Index, fault tolerance is missing.

Proposal for the index entries; they should look like:
fault tolerance --> 601 (BOLD),

process failure --> 20, <and all the many pages in the other section, like dynamic>
MPI_FT -->
MPI_ERR_PROC_FAILED -->
....
process fault tolerance, see fault tolerance

@dholmes-epcc-ed-ac-uk
Member

@wesbland I think that Martin's point was that if ULFM is put in before we know the impact of these other ideas then we could be doing the wrong thing. That could be used as a way to delay ULFM forever and I don't want to advocate or support that. On the other hand, do you have notes/documentation from your previous consideration of "non-shrinking recovery"? I'd like to review that, and the reasoning behind the "could easily be added later" statement. If we could do recovery without needing revoke or shrink, would they be deprecated and removed?

@wesbland
Member Author

wesbland commented Dec 8, 2016

This turned out to be quite the tome so please forgive the length, but I wanted to capture all of the discussion from yesterday in one place as there were many side discussions that went on.

I want to show the example that @schulzm gave that could potentially cause a deadlock with ULFM as currently specified.

Overlapping Communicator Failures

In this example, there is a green communicator and an orange communicator which overlap. Process 1 is communicating (point-to-point) with process 5 and process 3 is communicating (point-to-point) with process 4. Processes 3 and 5 simultaneously fail and trigger error handlers which revoke both the green and orange communicator. This sends all processes into error handlers where they attempt to repair the two communicators by shrinking. The problem here is that {0,3,4} will be in a call to MPI_COMM_SHRINK(green) while {1,2,5} are in a call to MPI_COMM_SHRINK(orange). This is a deadlock because 1 and 4 are needed in order to shrink either green or orange.
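
A minimal sketch of the hazardous pattern (assuming the ULFM function names; communicator setup and checkpointing are omitted):

```c
#include <mpi.h>

/* Each process repairs only the communicator on which it observed the failure. */
static void naive_repair(MPI_Comm damaged)
{
    MPI_Comm repaired;
    MPI_Comm_revoke(damaged);
    /* Collective over the survivors of `damaged` only.  When two overlapping
     * communicators are damaged at the same time, the survivor groups can end
     * up in MPI_Comm_shrink on *different* communicators: {0,3,4} shrink green
     * while {1,2,5} shrink orange, yet ranks 1 and 4 are needed by both, so
     * neither shrink can complete. */
    MPI_Comm_shrink(damaged, &repaired);
    MPI_Comm_free(&repaired);
}
```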

We tried to come up with a variety of scenarios which would let the user escape this situation. I'm probably going to forget some because I didn't take notes quickly enough, so add more if I leave some out.

  1. Aurelien - The user could post an MPI_IRECV(MPI_ANY_SOURCE, some_unused_tag) on MPI_COMM_WORLD (or similar) which would be used to detect failures besides the one that is currently being handled. By doing this before calling MPI_COMM_SHRINK, it could detect that there have been other failures from which it also needs to recover and attempt to do so on a communicator which encompasses all of the failures (see the sketch after this list).
  • Criticisms - This might still be unsafe due to the inherent race conditions present in the asynchronous detection and notification of failures. It is also a really gross way to tell users that they need to do their own failure detection. If this were absolutely required for safety, there should be some sort of API to help instead of a non-matching MPI_IRECV.
  2. Wesley - It might be possible to escape this problem with MPI_ISHRINK (planned for a follow-on issue). You can instead use MPI_TEST to do recovery and check for new failures in the meantime.
  • Criticisms - This might have a similar race condition problem as above, though the non-blocking nature might help. It is still subject to the "ick factor" of having to constantly be checking for new failures on unrelated communicators while doing recovery.
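
A sketch of escape (1), assuming the ULFM draft names; SENTINEL_TAG is an illustrative, otherwise unused tag, and the race condition noted in the criticisms above still applies:

```c
#include <mpi.h>

#define SENTINEL_TAG 4242   /* hypothetical, otherwise unused tag */

/* Post a never-matched any-source receive on an over-spanning communicator
 * so that failures outside `damaged` are noticed before shrinking it. */
static void shrink_with_sentinel(MPI_Comm damaged, MPI_Comm *repaired)
{
    int dummy, flag = 0;
    MPI_Request sentinel;

    MPI_Irecv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, SENTINEL_TAG,
              MPI_COMM_WORLD, &sentinel);

    /* In the ULFM draft, a failure anywhere in MPI_COMM_WORLD surfaces on the
     * pending any-source request (as MPI_ERR_PROC_FAILED_PENDING). */
    if (MPI_Test(&sentinel, &flag, MPI_STATUS_IGNORE) != MPI_SUCCESS) {
        /* Other failures exist: widen the recovery to a communicator that
         * spans all of them (application specific, not shown). */
    } else {
        MPI_Comm_shrink(damaged, repaired);
    }
    /* The window between MPI_Test and MPI_Comm_shrink is still racy,
     * which is exactly the criticism raised above. */
}
```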

In the end, @schulzm suggested that the main problem is the combination of asynchronous notification (error reporting and MPI_COMM_REVOKE) and synchronous recovery (MPI_COMM_SHRINK). His opinion was that the best way to solve this problem would be to pull the notification and recovery portions of any fault tolerance solution into the MPI library itself and perform automatic recovery. (It's important to note here that automatic recovery in this sense is different than what has been discussed previously, where the intention was to replace failed processes with new processes, which the FTWG believes is not solvable. This is instead advocating for using either an internal shrink or to leave holes in the communicator, as @dholmes-epcc-ed-ac-uk suggested above.)

We then briefly discussed what a solution that works as @schulzm suggested might look like. One solution would be to just bring the revoke and shrink operations (as defined in ULFM) into the MPI library itself and not expose them to the user. Instead of returning MPI_ERR_PROC_FAILED to the user when a failure is detected, the MPI library will automatically begin recovery on that communicator. It would internally "revoke" the communicator and shrink the group of processes to remove failures. Then at some point in the future (probably on the next operation, but it could be delayed without a problem I believe), the user will see a new error, perhaps MPI_COMM_SHRUNK, indicating that the nature of the communicator has changed and the user should find out the new information about the communicator (size, rank, etc.) and then continue using the communicator with its new makeup. This has the benefit of being able to happen simultaneously with failure detection and recovery of as many other communicators as necessary (alleviating the deadlock situation described above), as well as exposing an easy interface for users (automatic, in-place recovery of communicators). However, it also has potential downsides when compared to current ULFM, primarily that it enforces a particular recovery model which the user might not want (the user might be ok with leaving the failures in place and communicating around them in the future, e.g. a simple master-worker style application where failed workers can be simply discarded).
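
To make the idea concrete, here is a hedged sketch of what this could look like from the application's side; MPI_COMM_SHRUNK is the hypothetical error class named above and does not exist in any draft or implementation:

```c
#include <mpi.h>

/* Hypothetical error class from the discussion above -- not real MPI. */
#define MPI_COMM_SHRUNK (-1)

static void bcast_with_auto_recovery(MPI_Comm comm, int *buf, int n)
{
    int rc, eclass, rank, size;

    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    rc = MPI_Bcast(buf, n, MPI_INT, 0, comm);
    if (rc != MPI_SUCCESS) {
        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_COMM_SHRUNK) {
            /* The library has already revoked and shrunk `comm` internally;
             * the application only refreshes its view of the communicator
             * and redistributes work before retrying. */
            MPI_Comm_size(comm, &size);
            MPI_Comm_rank(comm, &rank);
        }
    }
}
```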

There are some (probably many) unresolved questions around this:

  • It may not be necessary to have both current ULFM and automatic recovery. It may be enough to define error reporting and post-failure semantics and let the user just figure out which processes have failed.
  • This solution is workable for applications that can continue with fewer processes, but it is still kludgy for apps that can't recover in that way easily (like bulk synchronous style applications with which colleagues of @kathrynmohror, Ignacio, and @schulzm are concerned). For those applications, a reinit-style recovery is still more desirable (something that wipes the MPI state clean while not requiring complete teardown and reconstruction of the MPI runtime).
    • In a perfect world, we would allow these two solutions to both live in the MPI Standard. However, there are serious interaction concerns about how these two options would work together if an application chose to use automatic recovery and a library which the application used chose to use a reinit-style recovery. It is a non-starter to say that library A can use FT solution A and is therefore incompatible with any other library or application that uses FT solution B.
    • It's possible that sessions might save us here by allowing us to make recovery a property of the session, but this idea hasn't been thought through at all and could be very wrong after more consideration. If it does work however, it would be a great way to marry these two (and possibly other) FT solutions.

Whatever we decide to do next, the concerns of @schulzm and @dholmes-epcc-ed-ac-uk (and others) are very valid. Namely, that we have to be very careful with whatever FT solution(s) we pass as a forum. We don't want to pick something that will exclude future solutions and even more importantly, we don't want to pick something that we would later discover has a fundamental flaw.

Hopefully this is a fair representation of everyone's opinions as expressed yesterday. I know that the ULFM authors (myself included) will want to think more about the concerns expressed to make sure we can't come up with a more elegant solution to resolve the problem before trying to re-architect the entire proposal.

@wesbland
Member Author

wesbland commented Dec 8, 2016

@dholmes-epcc-ed-ac-uk I completely agree with your statement about wanting to make sure we are forward compatible with FT and my remark about allowing both shrinking and non-shrinking recovery is more hopeful than certain at this point. A quick consideration seems to me that if we had all the semantics that we discussed yesterday (and I wrote about above), picking the particular style of communicator recovery is something that probably could be swappable while still being compatible with the other styles. We'd definitely have to work through that more extensively though.

As for the issues with leaving holes in the communicators, I don't think that it's an impossible problem. I think it's just a much more difficult problem than using shrinking recovery because it requires many more modifications to the rest of MPI. I don't have specific notes (or if I do, they're from years ago and hard to find), but I think my previous example of the two different collectives demonstrates how tricky the semantics of collectives could be if you didn't have dense communicators. I can expand a bit here:

If you have an MPI_REDUCE on a communicator with holes, what do the empty processes contribute? If you are doing MPI_SUM, it's simple to say that they contribute 0, but if you are doing multiplication or division, that's not a good answer. You could also say that they contribute nothing, but then it makes MPI_ALLTOALL very complicated because you would need to leave holes in your output buffers that you provide to users. None of these are impossible problems, just ones that we preferred to avoid if possible.

@abouteiller
Member

@dholmes-epcc-ed-ac-uk
You can re-read the working group minutes from the previous ill-fated RTS proposal from 5/6 years ago. It investigated these replace/blank modes heavily, and it was found that this approach, although seductive from a mile-high perspective, had holes everywhere (pun intended) once you scratch the surface.

To name a few: you have to reconcile communicator names between respawned and surviving processes; hardware matching is difficult (possibly impossible) because distributed race conditions make messages from previous "epochs" in the comm continue to be delivered for some time; collective operations have to be reimplemented and are going to be more expensive; etc.

I obviously strongly disagree with the way @schulzm frames the issue. The issue (as illustrated by @wesbland above) has a known solution that has been deployed in practice in developing the X10 programming language FT support. This solution (the one with iRecv from any source on an over-spanning communicator) is effective and bug-free. I can however concede the point that standardizing a solution to that problem with some new interface (rather than letting users deal with it on their own in a more ad-hoc fashion) can help foster a more interoperable environment for libraries that are not part of the same software package, and I think that at this point we should investigate this direction.

Obviously, feedback and participation to the WG to follow through is most welcome!

@schulzm

schulzm commented Dec 8, 2016

@wesbland First of all, thanks for the extensive summary; I think it hits it fairly well. Only one clarification: the part "In the end, @schulzm suggested that the main problem is the combination of asynchronous notification (error reporting and MPI_COMM_REVOKE) and synchronous recovery (MPI_COMM_SHRINK)." is not quite what I had in mind. I suggested that the combination of synchronous notification (as in, one has to actively call a particular set of operations on the failing communicator, instead of being interrupted somehow) and collective recovery (all processes have to participate, as in MPI_COMM_SHRINK) is the main problem.

Regarding "However, it also has potential downsides when compared to current ULFM, primarily that it enforces a particular recovery model which the user might not want" - I agree, this is a concern and we would have to look. It seems to me, though (in the short time we had to think about this), that this could be alleviated in a composable way and on a per-communicator basis. For each communicator, the user could decide whether they want automatic shrinking or whether they are OK with leaving gaps, which then would avoid triggering any recovery operation (and should even be lower cost than the current ULFM scheme?). The former would be targeted (probably) at end users, especially those wanting collectives, while the latter could a reasonable solution for runtimes that manage the process space themselves (like the mentioned X10 runtime).

One additional comment regarding: "This solution is workable for applications that can continue with fewer processes, but it still kludgy for apps that can't recover in that way easily" - yes, that is true. However, ReInit like solutions (at least the ones we looked at) rely already on the asynchronous notification scheme (and have to). Having a shrinking solution (which, I agree, other user groups can use or even need) based on the same notification concept has a much higher chance of being combinable towards a compatible solution that supports both. On the other hand, having two different notification schemes in MPI (one for shrinking and one for a reinit style, which we will need as well, since a majority of our current apps are that way) is likely to clash no matter what we do.

@wesbland
Member Author

wesbland commented Dec 8, 2016

I suggested that the combination of synchronous notification (as in, one has to actively call a particular set of operations on the failing communicator, instead of being interrupted somehow) and collective recovery (all processes have to participate as in MPI_COMM_SHRINK) is the main problem.

Apologies. That's what I was trying to convey.

It seems to me, though (in the short time we had to think about this), that this could be alleviated in a composable way and on a per-communicator basis.

I completely agree here. That's what I was trying to get at in the conversation with @dholmes-epcc-ed-ac-uk. That seems like something that could interoperate more safely than completely changing the model. I think there are other things that might be able to do something with the non-shrinking model. Perhaps libraries such as Fenix from @marcgamell?

Having a shrinking solution (which, I agree, other user groups can use or even need) based on the same notification concept has a much higher chance of being combinable towards a compatible solution that supports both.

I'm not sure what you mean here. I would think that having an automatically shrinking (or not) solution might be worse because it doesn't give you the clean way of jumping to a clean state the way reinit does. It would still have the problems that reinit initially faced with ULFM, namely that all communicators, windows, files, requests, etc. would need to be tracked by the application and torn down in an error handler. Am I missing something?

@dholmes-epcc-ed-ac-uk
Member

@wesbland & @abouteiller thanks for the summary of the discussion at this meeting and the references to previous discussions. I'll take a look at RTS soon.

You used the term "non-shrinking" to describe the "leave gaps" approach. My understanding of that portion of the discussion was a little different. The communicator would not shrink, in that it contains the same number of processes (some failed and some non-failed), but would shrink, in that only non-failed processes would be expected to participate. There is no expectation or intent to replace failed processes with non-failed ones in this communicator. Therefore, "non-shrinking" is an incorrect characterisation; the active portion of the communicator does shrink.

@wesbland MPI_REDUCE with failed processes: failed processes do not participate. The comm size for the operation is the number of non-failed processes. The operation may fail with MPI_ERR_PROC_FAILED if a new (non-acked) process failure is detected during the operation (i.e. happened before the operation completed). MPI_ALLTOALL with gaps: failed processes do not participate. The input/output buffers would be sized according to the total number of ranks, but the array elements related to failed processes would be unaffected by the operation. Compare with MPI_SEND/MPI_RECV to/from MPI_PROC_NULL, which succeed but do not read/write data from/into the user buffer. Basically, treat failed processes exactly like MPI_PROC_NULL by extending semantics already present elsewhere in the MPI Standard (see the short illustration below).
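
A short illustration of the MPI_PROC_NULL analogy (a sketch of the suggested "leave gaps" semantics, not standard or draft text):

```c
#include <mpi.h>

static void proc_null_analogy(MPI_Comm comm)
{
    int val = 42;
    MPI_Status st;

    /* Both calls succeed but transfer no data -- the existing MPI_PROC_NULL
     * behavior that the "leave gaps" idea would extend to failed ranks. */
    MPI_Send(&val, 1, MPI_INT, MPI_PROC_NULL, 0, comm);
    MPI_Recv(&val, 1, MPI_INT, MPI_PROC_NULL, 0, comm, &st);

    /* Under the sketched idea, an MPI_Alltoall on a communicator with failed
     * ranks would likewise leave the output-buffer slots that correspond to
     * failed processes untouched. */
}
```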

@abouteiller reconcile communicator names between re-spawned and surviving processes: no processes are re-spawned, so there is no issue here. Hardware matching is difficult (possibly impossible) as distributed race conditions make messages from previous "epochs" in the comm continue to be delivered for some time: matching can continue without interruption or change because no context id has changed and no ranks have changed. There is no "epoch" concept. Collective operations have to be reimplemented and are going to be more expensive: agreed, this is my main concern regarding this approach. A tree-based collective, for example, would need to rebuild the tree (a distributed agreement, in the worst case). Some topology structures are easier to repair than others. In many cases, if process X can calculate its parent/children processes for a particular topology (as a local operation), then it can calculate their parent/children processes too (also as a local operation). So, it could figure out (locally) how to skip over a failed process. Forcing a higher cost, even for the non-failure usage, is going to be a difficult sell. Alternatively, build the initial tree/topology during communicator creation (known to be expensive) and rebuild it during failure recovery (known to be expensive); otherwise, assume that the current topology is usable (normal usage has no performance degradation).

@schulzm the ReInit recovery could be built on top of the "leave gaps" thing by jumping back to the Init point whenever MPI_ERR_PROC_FAILED was raised. However, in order to guarantee all processes noticed this and did the same, the scope of the revoke-like notification should be expanded to any MPI function or made into an interrupt. If one process has jumped back to Init (or is about to), that involves destroying all communicators (+windows+files+requests+etc). There is no longer any point in letting other processes use any MPI object, even if it were safe for them to do so. On the other hand, the current notification mechanism (revoke) notifies that all processes in the affected communicator are failed, which is incorrect. The scope could be narrowed to a notification of particular process failures. A ReInit recovery model would react the same way because even a single process failure would cause a jump back to Init. Finer-grained recovery would be possible, though: continuing with the same communicator (now 'sparse' or 'with gaps', with fewer non-failed processes) would be the default; spawning new processes and creating a new communicator that is the same size as the original before the failure would be the user's responsibility (using MPI_COMM_SPAWN and MPI_INTERCOMM_MERGE, exactly as in the ULFM examples of such recovery; see the sketch below).
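
A hedged sketch of the user-driven "grow back to the original size" recovery mentioned at the end, following the pattern of the ULFM examples; the spawned command line, error handling, and rank reordering are omitted:

```c
#include <mpi.h>

/* Survivors collectively respawn `nfailed` replacement processes and merge
 * them into an intracommunicator of the original size. */
static void respawn_recovery(MPI_Comm shrunk, int nfailed, MPI_Comm *restored)
{
    MPI_Comm intercomm;

    MPI_Comm_spawn("./a.out", MPI_ARGV_NULL, nfailed, MPI_INFO_NULL,
                   0 /* root */, shrunk, &intercomm, MPI_ERRCODES_IGNORE);

    /* Survivors get the low ranks; reassigning the original rank numbering
     * to the replacements (e.g. with MPI_Comm_split) is application specific
     * and not shown. */
    MPI_Intercomm_merge(intercomm, 0, restored);
    MPI_Comm_free(&intercomm);
}
```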

@wesbland
Member Author

wesbland commented Dec 8, 2016

@dholmes-epcc-ed-ac-uk I think I'd have to start trying to write text for the collectives before deciding if it were really going to be that hard. In principle, I think we agree what the "right thing" to do here would be as long as we can figure out the right words. You're right that I was conflating the two terms. I did mean "leave gaps" in the same way you did.

@wesbland
Member Author

Updated PDF for February/March 2018 Reading:

ulfm.pdf

@wesbland wesbland added scheduled reading Reading is scheduled for the next meeting and removed not ready labels Feb 21, 2018
@wesbland wesbland added this to the 2018-02 Portland, USA milestone Feb 21, 2018
@wesbland wesbland removed the scheduled reading Reading is scheduled for the next meeting label Mar 1, 2018
@wesbland wesbland removed this from the 2018-02 Portland, USA milestone Mar 1, 2018
@abouteiller
Member

New version:
mpi40-ulfm-20200921.pdf

This is a 'parity' update to port the existing ULFM text over to MPI 4.x, adding compatibility with the new error handling and pythonization.

@wesbland wesbland removed the mpi-4.0 label Oct 21, 2020
@abouteiller
Member

abouteiller commented Jan 25, 2022

New version:

ft-annotated-with-ackfailed.pdf

This is an update that substitutes the 'ack/get_acked' functions with 'get_failed/ack_failed' functions, as we decided during the last forum meeting.

Experimental version (for comment):

ft-annotated-with-modes.pdf

This adds a number of new error reporting modes that permit loosely synchronized and implicit recovery. It has been available as a branch on the wg-ft repository; it is duplicated here so it is accessible to a wider audience.

@wesbland wesbland added the mpi-5 For inclusion in the MPI 5.0 standard label Feb 7, 2022
@wesbland wesbland added the scheduled reading Reading is scheduled for the next meeting label May 10, 2022
@wesbland wesbland added this to the May 2022 milestone May 10, 2022
@wesbland wesbland removed their assignment Jul 7, 2022
@wesbland wesbland removed the scheduled reading Reading is scheduled for the next meeting label Nov 6, 2023