User-Level Failure Mitigation #20

@wesbland

Description

This chapter describes a flexible approach to process fault tolerance: the application is allowed to react to failures, while failure-free executions keep a minimal execution path. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library.

Note: the versions attached to the ticket are updated loosely and are not always current. Please access the repository to see the latest revision, especially between forum meetings.

More information on the prototype implementation in Open MPI can be found here:
http://fault-tolerance.org/

Pull Requests:
READING MAY 2022: Slice 1: https://github.com/mpi-forum/mpi-standard/pull/665

Current (full) draft proposal (May 2022):
20220509-ulfm-master.pdf
rolling diff: https://github.com/mpiwg-ft/mpi-standard/pull/19/files

RFCs:
https://github.com/mpiwg-ft/mpi-standard/pull/17
https://github.com/mpiwg-ft/mpi-standard/pull/18
https://github.com/mpiwg-ft/mpi-standard/commit/92f7596e8958dfc3a71bbc83514dec3d3b7dcc07

Proposed Solution

A process error denotes that normal MPI semantics cannot be provided during an operation, as observed by a particular process. The proposal clearly specifies the error classes returned in this scenario, provides new APIs for applications to obtain a consistent view of failures, and adds new APIs to create replacement communication objects for damaged ones.

Impact on Implementations

Adds semantics and functions to communicator operations. Implementations that do not care about fault tolerance must still provide all the proposed functions, with the correct semantics when no failure occurs. However, an implementation that never raises an exception related to process failures does not have to actually tolerate failures.

Impact on Applications / Users

Provides fault tolerance to interested users. Users and implementations that do not care about fault tolerance are not impacted: users can request that fault tolerance be turned off, or the implementation can inform them that it is not supported.

Connected Issues

This issue covers only the main portion of ULFM: the general fault model, how MPI handles faults in general, and the communicator-based functions. For the sections on RMA, see ticket #21. For the sections on Files, see ticket #22.

Alternative Solutions

Stronger consistency models are more convenient for users, but much more expensive. They can be implemented on top of this proposal as user libraries (or as potential future candidates for standardization, without conflict).

History

The run-through stabilization proposal was a completely different effort. The current ticket represents a ground-up restart that accounts for the issues raised during that previous work.

Lots of old conversations and history attached to this issue can be found on the old Trac ticket.

Metadata

Labels

mpi-next: For inclusion in the MPI 5.1 or 6.0 standard
wg-ft: Fault Tolerance Working Group

Projects

Status

In Progress
