Supported fault tolerance techniques

Open MPI is a vehicle for research in fault tolerance and has, over the years, provided support for a wide range of resilience techniques:

  • Currently supported

    • User Level Fault Mitigation (ULFM) techniques.
  • Only for research / non-production usage

    • Message logging techniques. Similar to those implemented in MPICH-V.
  • Deprecated / no longer available

    • Coordinated and uncoordinated process checkpoint and restart. Similar to those implemented in LAM/MPI and MPICH-V, respectively.
    • Data Reliability and network fault tolerance. Similar to those implemented in LA-MPI.

Current fault tolerance development

The only active resilience work in Open MPI targets the User Level Fault Mitigation (ULFM) approach, a technique under discussion within the MPI standardization body.

For information on the Fault Tolerant MPI prototype in Open MPI, see the links below:

Support for other types of resilience (e.g., :ref:`data reliability <ft-data-reliability-label>`, :ref:`checkpoint <ft-checkpoint-restart-label>`) has been deprecated over the years due to lack of adoption and maintenance. If you are interested in doing some archeological work, traces of these implementations are still available in the main repository.