Description
This chapter describes a flexible approach, providing process fault tolerance by allowing the application to react to failures, while maintaining a minimal execution path in failure-free executions. The focus is on returning control to the application by avoiding deadlocks due to failures within the MPI library.
Note: the versions attached to the ticket are updated loosely and are not always current. Please access the repository to see the latest revision, especially between forum meetings.
More information on the prototype implementation in Open MPI can be found here:
http://fault-tolerance.org/
Pull Requests:
READING MAY 2022: Slice 1: https://github.com/mpi-forum/mpi-standard/pull/665
Current (full) draft proposal (May 2022):
20220509-ulfm-master.pdf
rolling diff: https://github.com/mpiwg-ft/mpi-standard/pull/19/files
RFCs:
https://github.com/mpiwg-ft/mpi-standard/pull/17
https://github.com/mpiwg-ft/mpi-standard/pull/18
https://github.com/mpiwg-ft/mpi-standard/commit/92f7596e8958dfc3a71bbc83514dec3d3b7dcc07
Proposed Solution
A process error denotes the impossibility of providing normal MPI semantics during an operation, as observed by a particular process. The proposal clearly specifies the error classes returned in this scenario, provides new APIs for applications to obtain a consistent view of failures, and adds new APIs to create replacement communication objects for damaged ones.
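To make the three ingredients above concrete (error classes, a consistent view of failures, replacement communication objects), here is a minimal sketch in plain C. It is a single-process toy model, not the MPI API and not the Open MPI prototype: every name (`toy_comm`, `toy_send`, `toy_revoke`, `toy_shrink`, `toy_agree`, the `TOY_ERR_*` codes) is hypothetical and was chosen only for illustration. It also assumes that agreement combines the contributions of living processes with a bitwise AND, and it ignores details such as failure acknowledgement.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout: a toy model, not the MPI API. */
enum { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 1, TOY_ERR_REVOKED = 2 };

#define TOY_MAX_RANKS 32

typedef struct {
    int  size;                  /* number of ranks in the communicator */
    bool alive[TOY_MAX_RANKS];  /* alive[r] is false once rank r has failed */
    bool revoked;               /* set once the communicator is revoked */
} toy_comm;

/* Create a communicator with `size` ranks, all alive and not revoked. */
void toy_comm_init(toy_comm *c, int size) {
    assert(size >= 0 && size <= TOY_MAX_RANKS);
    c->size = size;
    c->revoked = false;
    for (int r = 0; r < TOY_MAX_RANKS; r++) {
        c->alive[r] = (r < size);
    }
}

/* Simulate the failure of one rank. */
void toy_kill(toy_comm *c, int rank) {
    assert(rank >= 0 && rank < c->size);
    c->alive[rank] = false;
}

/* A point-to-point operation returns an error code instead of blocking
 * forever when the peer has failed or the communicator was revoked. */
int toy_send(const toy_comm *c, int dest) {
    assert(dest >= 0 && dest < c->size);
    if (c->revoked) return TOY_ERR_REVOKED;
    if (!c->alive[dest]) return TOY_ERR_PROC_FAILED;
    return TOY_SUCCESS;
}

/* Mark the communicator as unusable for further communication. */
void toy_revoke(toy_comm *c) {
    c->revoked = true;
}

/* Build a replacement communicator that contains only the survivors.
 * Returns the size of the new communicator. */
int toy_shrink(const toy_comm *c, toy_comm *out) {
    int survivors = 0;
    for (int r = 0; r < c->size; r++) {
        if (c->alive[r]) survivors++;
    }
    toy_comm_init(out, survivors);
    return survivors;
}

/* Consistent decision: bitwise AND of the flags contributed by living
 * ranks; contributions of failed ranks are ignored. */
int toy_agree(const toy_comm *c, const int flags[]) {
    int result = ~0;
    for (int r = 0; r < c->size; r++) {
        if (c->alive[r]) result &= flags[r];
    }
    return result;
}
```

In this model, after `toy_kill(&c, 2)` on a four-rank communicator, `toy_send(&c, 2)` reports `TOY_ERR_PROC_FAILED` while sends to living ranks still succeed. After `toy_revoke(&c)`, every send reports `TOY_ERR_REVOKED`, and `toy_shrink` yields a fresh three-rank communicator on which communication can resume.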
Impact on Implementations
Adds semantics and functions to communicator operations. Implementations that do not care about fault tolerance must still provide all the proposed functions, with the correct semantics when no failure occurs. However, an implementation that never raises an exception related to process failures does not have to actually tolerate failures.
Impact on Applications / Users
Provides fault tolerance to interested users. Users/implementations that do not care about fault tolerance are not impacted. They can request that fault tolerance be turned off, or the implementation can inform them that it is not supported.
Connected Issues
This issue covers only the main portion of ULFM (the general fault model, how MPI handles faults in general, and communicator-based functions). For the sections on RMA, see ticket #21. For the sections on files, see ticket #22.
Alternative Solutions
Stronger consistency models are more convenient for users, but much more expensive. These can be implemented on top of this proposal as user libraries (or as potential future candidates for standardization, without conflict).
History
The run-through stabilization proposal was a completely different effort. The current ticket represents a ground-up restart, accounting for the issues raised during that previous work.
Lots of old conversations and history attached to this issue can be found on the old Trac ticket.