When you execute a long-running simulation, it can be very helpful to store the state of the simulation at certain intervals. For example, your simulation running on an HPC cluster may crash due to insufficient available memory. Instead of restarting this simulation from scratch, you could restart it -- with an increased memory allocation -- from a checkpoint, saving a lot of compute time!
Checkpointing in distributed simulations is difficult. Fortunately, MUSCLE3 comes with built-in checkpointing support. This page describes in detail how to use the MUSCLE3 checkpointing API, how to specify checkpoints in the workflow configuration and how to resume a workflow.
In the user tutorial, you can read about the checkpointing concepts and how to use the API when running and resuming MUSCLE3 simulations. This is followed by a developer tutorial, which explains how to add checkpointing capabilities to your MUSCLE3 component. Finally, the checkpointing deep-dive describes in detail the inner workings of checkpointing in MUSCLE3, though this level of detail is not required for general usage of the API.
Contents
- Checkpoint
A checkpoint is a moment during the workflow where the user wants to have the state of the whole workflow stored.
- Snapshot
A snapshot is the stored state of an instance in the workflow.
- Workflow snapshot
A workflow snapshot is a collection of snapshots for all instances in the workflow, which can be resumed from. This means that the snapshots of every combination of peer instances must be consistent.
- Peer instances
Two instances that are connected by a Conduit.
This user tutorial explains all you need to know about checkpointing for running and resuming simulations. Some details are deliberately left out, though you can read all about those in the developer tutorial or the checkpointing deep-dive.
User tutorial contents
The first step for using checkpoints is to define checkpoints in your workflow. The checkpoint definitions are for your whole workflow, and you can specify them in yMMSL as in the following example:
checkpoints:
at_end: true
simulation_time:
- every: 10
start: 0
stop: 100
- every: 20
start: 100
wallclock_time:
- every: 3600
- at:
- 300
- 600
- 1800
Let's break this down: the first element in this example checkpoints definition is at_end. When this is set to true (as in the example), every instance in the workflow will create a snapshot just before the workflow finishes. This set of snapshots can be used to resume a simulation near the end and, for example, let it run for a longer time. Some caveats apply, though; see resuming from *at_end* snapshots for full details.
The other two items in the checkpoints definition are the time-based simulation time and wallclock time checkpoints. You can use two types of rules to set checkpoint moments for these:

at rules select specific moments. The example rule above requests a checkpoint at 300, 600 and 1800 seconds after the start of the simulation. You can define multiple times in one at rule, but you may also add multiple at rules. The following definitions are all equivalent:

Standard
checkpoints:
  wallclock_time:
  - at:
    - 300
    - 600
    - 1800
Inline list
checkpoints:
  wallclock_time:
  - at: [300, 600, 1800]
Multiple at rules

checkpoints:
  wallclock_time:
  - at: 300
  - at: 600
  - at: 1800
every rules define a recurring set of checkpoints. In the simplest form you indicate the interval at which checkpoints should be taken -- every hour in the wallclock_time example above. You may optionally indicate a start or stop -- as in the simulation_time example above.

Simple
checkpoints:
  wallclock_time:
  - every: 3600
Start and stop
checkpoints:
  simulation_time:
  - every: 10
    start: 0
    stop: 100
  - every: 20
    start: 100
Overlapping ranges
checkpoints:
  simulation_time:
  - every: 1
  - every: 0.25
    start: 0
    stop: 2
Note
When stop is specified, the stop time is included when stop == start + n * every, with n a positive whole number. However, this might give surprising results due to the inaccuracies of floating point computations. Compare for example:

checkpoints:
  simulation_time:
  - every: 1
    start: 0
    stop: 7

checkpoints:
  simulation_time:
  - every: 0.1
    start: 0
    stop: 0.7
Why the difference? Well, compare in Python:

>>> 7 * 1.0
7.0
>>> 7 * 0.1
0.7000000000000001

Since 0.7000000000000001 is larger than 0.7, no checkpoint will be generated for this time.
- yMMSL documentation on Checkpoints
- yMMSL API reference: ymmsl.Checkpoints, ymmsl.CheckpointAtRule, ymmsl.CheckpointRangeRule
Checkpoints defined in the simulation_time section are taken based on the time inside your simulation. This will only work correctly if all components in the simulation have a shared concept of time, which only increases during the simulation. This should be no problem for physics-based simulations, though it does require that the instances make correct use of the timestamp in MUSCLE3 messages. When this requirement is fulfilled, checkpoints based on simulation time are the most reliable way to checkpoint your workflow.
MUSCLE3 does not interpret or convert the units that you configure in the checkpoints. The units are the same as the components in the simulation use for the timestamps in the messages. Typically this will be in SI seconds, but components may deviate from this standard. MUSCLE3 assumes that all components in the workflow use the same time units in the interfaces to libmuscle.
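To illustrate, here is a minimal Python sketch of a component that passes its simulation time as the timestamp of each message it sends, which is what simulation time checkpoints rely on. The port name, time settings and state update are placeholders, not part of any real example:

```
from libmuscle import Instance, Message
from ymmsl import Operator

instance = Instance({Operator.O_I: ['state_out']})
while instance.reuse_instance():
    t_cur, dt, t_max = 0.0, 0.1, 1.0    # placeholder time settings
    while t_cur < t_max:
        state = ...                      # placeholder state update (S)
        # The timestamp carries the simulation time; MUSCLE3 uses it
        # to decide when simulation_time checkpoints trigger.
        instance.send('state_out', Message(t_cur, data=state))
        t_cur += dt
```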
Note
MUSCLE3 does not assume anything about the start time of a simulation. Your simulation time may start at any value, even a negative one! Therefore, checkpoint ranges defined by every rules include 0 and negative numbers when no start value is provided.
Because MUSCLE3 does not know what internal time your simulation starts at, an every rule without a start value will always trigger a checkpoint at the first possible moment in the simulation. You should supply a start value if you do not want this to happen.
Checkpoints defined in the wallclock_time section are taken based on the elapsed wallclock time of your simulation (also known as elapsed real time). Each component in the simulation will make a snapshot at the earliest possible moment after a checkpoint is passed. The checkpoint times in the configuration are interpreted as seconds since the initialization of muscle_manager.
Warning
Wallclock time checkpoint definitions are currently not a reliable way to create workflow snapshots. While each instance in the simulation will create a snapshot when requested, there is no guarantee that all snapshots are consistent.
When a simulation has relatively simple coupling between components, checkpointing based on wallclock time usually works fine.
However, for co-simulation (the interact coupling type) and more complex coupling, it is likely that not all checkpoints lead to a consistent workflow snapshot.
If you intend to use wallclock time checkpoints and find that you often don't get a consistent workflow snapshot, you may try the following workaround: instead of requesting a wallclock time checkpoint at (for example) 600 seconds, you can specify checkpoints at 600, 601, 602, 603, 604 and 605 seconds. The "right" interval to use will depend on the typical compute times of your components and coupling in the simulation.
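Applying this workaround to a wallclock time checkpoint intended for the 600-second mark could look like this (the exact spread is an illustration, not a recommendation):

```
checkpoints:
  wallclock_time:
  - at: [600, 601, 602, 603, 604, 605]
```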
Starting a simulation with checkpoints is no different than starting one without. You need to start the muscle_manager with the configuration yMMSL file (or files), as well as the individual components (or let muscle_manager start them for you with the --start-all flag). The sole difference is that the yMMSL configuration must contain a checkpoints section.
When muscle_manager is started with checkpoints configured, a couple of things change. First, all of the component implementations must support checkpointing: the simulation will stop with an error if this is not the case. The simulation may also stop with an error if there is an issue in the checkpointing implementation of any of the components.
Second, all components are instructed to make snapshots according to the configured checkpoints. muscle_manager keeps track of all created snapshots during the simulation, looking for workflow snapshots. When a workflow snapshot is detected, muscle_manager writes a yMMSL file that can be used to resume the simulation.
During the simulation, all of the created snapshots are stored on the file system. See the table below for the directories where MUSCLE3 stores the files. Note: a run directory is automatically created when using the --start-all flag for muscle_manager. You may also specify a custom run directory through the --run-dir DIRECTORY option. When you do not provide a run directory, the last column in the table below indicates where snapshots are stored.
Snapshot type | Run directory provided | No run directory provided
---|---|---
Workflow | run_dir/snapshots/ | Working directory of muscle_manager
Instance | run_dir/instances/<instance name>/snapshots/ | Working directory of the instance
Note
When running a distributed simulation on multiple compute nodes, MUSCLE3 assumes that the run directory is accessible to all nodes (i.e. on a shared or distributed file system). This is usually the case on HPC clusters.
The reaction-diffusion example model from the Tutorial with Python also has a variant with checkpointing enabled. To run this yourself, navigate in a command line prompt to the docs/source/examples folder in the MUSCLE3 git repository. Then execute the following commands:
$ mkdir run_rd_example
$ muscle_manager --start-all --run-dir run_rd_example rd_implementations.ymmsl rd_checkpoints_python.ymmsl rd_settings.ymmsl
Note
You may get an error: File 'rd_implementations.ymmsl' does not exist. To fix this, you need to build the examples in the MUSCLE3 source; in the root of the git repository, execute:
$ make test_examples
The above command runs the muscle_manager and starts all components (the reaction model and the diffusion model). The rd_checkpoints_python.ymmsl file contains the checkpoint definitions used in this example:
examples/rd_checkpoints_python.ymmsl
MUSCLE3 will create the run directory run_rd_example for you. In it you'll find the instance snapshots in instances/macro/snapshots and instances/micro/snapshots. The workflow snapshots are stored in the snapshots folder in the run directory.
You can resume a simulation from a workflow snapshot stored in a previous run of the simulation. This works by appending a workflow snapshot yMMSL file from a previous run to the regular yMMSL configuration. If you started your original simulation with:
$ muscle_manager --run-dir ./run1 configuration.ymmsl
You can resume it from a snapshot of this run like so:
$ muscle_manager --run-dir ./run2 configuration.ymmsl ./run1/snapshots/snapshot_20221202_112840.ymmsl
Here we choose a different run directory, and resume from the snapshot file snapshot_20221202_112840.ymmsl that was produced by the first run. This file contains the information required to resume the workflow:
- It contains a description which allows you to inspect metadata of the workflow snapshot. It indicates the trigger or triggers leading to this snapshot, and some information on the state of each component in the workflow. This data is for informational purposes only and is ignored by muscle_manager.
- It also contains the paths to the snapshots that each instance needs to resume. Note that these snapshots must still exist at the same location. If you move or delete them (or a parent directory), resuming your simulation will fail with an error message:

Unable to load snapshot: <snapshot filename> is not a file. Please ensure this path exists and can be read.
To resume the reaction-diffusion model from a snapshot created in the previous section, replace <date> and <time> in the following command to point to the snapshot you want to resume from.
$ mkdir run_rd_resume
$ muscle_manager --start-all --run-dir run_rd_resume rd_implementations.ymmsl rd_checkpoints_python.ymmsl rd_settings.ymmsl run_rd_example/snapshots/snapshot_<date>_<time>.ymmsl
When the command completes, you can see the output in the new run directory run_rd_resume.
MUSCLE3 checkpointing is designed for resuming simulations as if they never stopped. This means that resuming is only supported for consistent snapshots and for simulation configurations that have not changed.
MUSCLE3 does not support any changes to the model when resuming, such as adding or removing components, or changing conduits. Attempting this will likely lead to deadlocks or error messages.
You are allowed to change the settings of your simulation when resuming. However, whether and when changed settings take effect depends on the implementation of your components. Please ask the developers of your simulation components for this information.
Warning
When you resume from an at_end snapshot without making changes (for example letting the simulation run for longer), the simulation will immediately complete again.
MUSCLE3 checkpointing was designed for consistency: no messages between the components may be lost when restarting. When we fulfill this criterion, a simulation can resume from a checkpoint as if it had never been interrupted.
During a simulation run, each component creates snapshots independently from all other components. For simulation time checkpoints, the MUSCLE3 checkpointing algorithm is guaranteed to give consistent workflow snapshots when all components adhere to the Multiscale Modeling and Simulation Framework (MMSF).
Wallclock time checkpoints in the current implementation are less reliable: components may take snapshots while messages are still in transit. When that happens, an inconsistent state is produced and no workflow snapshot is written by muscle_manager.
MUSCLE3 does not support combining inconsistent snapshots, so it is not possible to freely mix snapshots produced during a simulation. When resuming, MUSCLE3 checks the consistency of all snapshots. The run will end with an error when an inconsistent state is detected:
Received message on <port> with unexpected message number <num>. Was
expecting <num>. Are you resuming from an inconsistent snapshot?
When resuming from a snapshot yMMSL file written by muscle_manager, you should not encounter this error.
General troubleshooting strategy:
1. First try to find the root cause of the problem that your simulation ran into. You can start by looking in the log file of the muscle_manager, located in <run directory>/muscle3_manager.log. This log file may show the error message or point you in the right direction.
2. If the muscle_manager log did not display an error, it may indicate which component failed first. Have a look at the logs of that component to figure out what went wrong. The output of an instance is usually found in <run directory>/instances/<instance name>/. Open stdout.txt and stderr.txt to find out what went wrong.
3. If the muscle_manager logs did not point to a specific instance, you should have a look at the log files of each instance (see point 2 for instructions). Note that some instances may log Broken Pipe errors -- this usually happens when a peer component has crashed and it is typically not the root cause of your simulation crash.
Once you find the root cause of your problem, check the list below for common issues and their resolutions. You may also have found a bug in MUSCLE3: please help us and your fellow MUSCLE3 users by creating an issue on GitHub.
- The simulation crashes when using checkpoints.
The first thing you should check is: does the simulation run error-free when checkpoints are disabled? You can test this by commenting out the checkpoints section of your input ymmsl file(s).
If it runs error-free without checkpoints, have a look at the error message in the log file generated by your run. MUSCLE3 attempts to have clear error messages to explain what went wrong and give you pointers to a solution.
When the error message indicates a problem with the implementation of the checkpointing API, please check with the developer of the component to fix this. If you are the developer of the component, please see the Developer tutorial section for additional resources.
- The simulation crashes when resuming.
Some common causes for this are:
- The snapshot files that the instances are resuming from no longer exist. This could for example happen when a previous run directory has been moved or deleted. For distributed execution, some compute nodes may not be able to access the directories where the instance snapshots are stored. See also Resuming a simulation.
- Your simulation configuration has incompatible changes compared to the original simulation that the snapshots were from. See Making changes to your simulation. Luckily, MUSCLE3 stores the previous simulation configuration in the run directory. If the snapshot that you resume from is stored in run1/snapshots/snapshot_xyz.ymmsl, then you can find that configuration in run1/configuration.ymmsl. Try resuming with that configuration first to see if this is the real problem:

$ muscle_manager --run-dir run2 run1/configuration.ymmsl run1/snapshots/snapshot_xyz.ymmsl

- One of your components has a bug that is triggered when resuming from a previous snapshot, or perhaps your snapshot belonged to a different version of the component. Please ask your component developer(s) for help.
This developer tutorial explains all you need to know about implementing checkpointing in your MUSCLE3 simulation component. If you're not a developer and want to learn how to define checkpoints and resume simulations, please have a look at the user tutorial.
Some details are deliberately left out in this developer tutorial, though you can read all about those in the checkpointing deep-dive.
Developer tutorial contents
In this tutorial we will add checkpointing to the reaction and diffusion components from the Python, C++ and Fortran tutorials.
Additionally, we will do the same for a generic MUSCLE3 component template. These templates illustrate the structure of a MUSCLE3 component, but they are not complete and cannot be executed.
Reaction model
Python
examples/python/reaction.py
C++
examples/cpp/reaction.cpp
Fortran
examples/fortran/reaction.f90
Diffusion model
Python
examples/python/diffusion.py
C++
examples/cpp/diffusion.cpp
Fortran
examples/fortran/diffusion.f90
Generic template
Python
templates/instance.py
C++
templates/instance.cpp
Fortran
templates/instance.f90
As a first step, you need to indicate that you intend to use the checkpoint API. You do this through the InstanceFlags.USES_CHECKPOINT_API flag when creating the instance:
Python
from libmuscle import Instance, USES_CHECKPOINT_API
...
ports = ...
instance = Instance(ports, USES_CHECKPOINT_API)
API documentation for libmuscle.InstanceFlags.USES_CHECKPOINT_API.
C++
#include <libmuscle/libmuscle.hpp>
#include <ymmsl/ymmsl.hpp>
using libmuscle::PortsDescription;
using libmuscle::Instance;
using libmuscle::InstanceFlags;
...
int main(int argc, char * argv[]) {
PortsDescription ports = ...;
Instance instance(argc, argv, ports, InstanceFlags::USES_CHECKPOINT_API);
...
}
API documentation for libmuscle::impl::InstanceFlags::USES_CHECKPOINT_API.
Fortran
use ymmsl
use libmuscle
type(LIBMUSCLE_PortsDescription) :: ports
type(LIBMUSCLE_Instance) :: instance
ports = ...
instance = LIBMUSCLE_Instance_create( &
ports, LIBMUSCLE_InstanceFlags(USES_CHECKPOINT_API=.true.))
API documentation for LIBMUSCLE_InstanceFlags.
If you do not set this flag, you'll get a runtime error when trying to use any of the checkpointing API calls on the Instance object.
The first step in implementing the checkpointing API is implementing the checkpoint hooks. These are the points where your component can make checkpoints:

Intermediate snapshots
Intermediate snapshots are taken inside the reuse loop, immediately after the S Operator of your component.

Final snapshots
Final snapshots are taken at the end of the reuse loop, after the O_F Operator of your component.
Taking intermediate snapshots is optional. However, we recommend implementing intermediate snapshots when any of the following points holds for your component:

- Your component has a loop containing O_I and S, and you communicate during Operator O_I or Operator S. Implementing intermediate checkpointing allows submodels connected to your component to also create checkpoints.

Warning
If you do not implement intermediate checkpoints in this case, then it is likely that many user-defined checkpoints will not lead to consistent workflow snapshots. Please implement intermediate snapshots to give the users of your component a good checkpointing experience.

- There is no communication during O_I and S, but the state update S is executed in a (time-integration) loop which takes a relatively long time. In this case, intermediate checkpointing allows users to create checkpoints of your component during long-running computations.

In all other cases, there is usually little or no added value in implementing intermediate snapshots in addition to final snapshots.
You implement taking intermediate snapshots as follows:

- Find out where in your code to implement the checkpointing calls. Typically there is a state update loop (e.g. a while or for loop) in a component. You should implement the checkpointing calls at the end of this state update loop. In this way, your code can resume immediately at the beginning of that loop. This allows for consistent restarts with the least amount of code.
- Ask libmuscle if you need to store your state and create an intermediate snapshot with the API call should_save_snapshot(t). You must provide the current time t in your simulation, such that MUSCLE3 can determine if Simulation time checkpoints are triggered.
- Collect the state that you need to store.
- Create a libmuscle.Message object to put your state in.
- Store the snapshot Message with the API call save_snapshot(message).
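Put together, the steps above might look like this in Python. This is a sketch only: instance, state and the time variables are assumed to exist in your component, and how you pack the state into the Message is up to you:

```
from libmuscle import Message

# ... inside the reuse loop ...
while t_cur + dt <= t_stop:
    # O_I / S: communicate and compute the state update
    t_cur += dt

    # End of the state update loop: ask libmuscle if a checkpoint
    # triggers at simulation time t_cur
    if instance.should_save_snapshot(t_cur):
        # Pack the state into a Message, with t_cur as the timestamp
        instance.save_snapshot(Message(t_cur, data={'state': state,
                                                    't_stop': t_stop}))
```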
See Example: implemented checkpoint hooks for example implementations in the reaction-diffusion models and the component template.
- Python API documentation: libmuscle.Instance.should_save_snapshot, libmuscle.Instance.save_snapshot.
- C++ API documentation: libmuscle::impl::Instance::should_save_snapshot, libmuscle::impl::Instance::save_snapshot.
- Fortran API documentation: LIBMUSCLE_Instance_should_save_snapshot, LIBMUSCLE_Instance_save_snapshot.
Final snapshots must be implemented by all components supporting checkpointing. You implement taking a final snapshot as follows:

- You must implement the checkpoint calls at the end of the reuse loop.
- Ask libmuscle if you need to store your state and create a final snapshot with the API call should_save_final_snapshot(). Contrary to the intermediate checkpoints, this call may block to determine if a checkpoint is needed (this is also the reason it must happen at the end of the reuse loop).
- Collect the state that you need to store.
- Create a libmuscle.Message object to put your state in.
- Store the snapshot Message with the API call save_final_snapshot(message).
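In Python, the final snapshot hook might be sketched as follows (again assuming instance, state and t_cur exist in your component; the elided parts are your regular model logic):

```
from libmuscle import Message

while instance.reuse_instance():
    # ... F_INIT, the state update loop and O_F ...

    # Very end of the reuse loop; this call may block while libmuscle
    # determines whether a final snapshot is needed
    if instance.should_save_final_snapshot():
        instance.save_final_snapshot(Message(t_cur, data=state))
```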
See Example: implemented checkpoint hooks for example implementations in the reaction-diffusion models and the component template.
- Python API documentation: libmuscle.Instance.should_save_final_snapshot, libmuscle.Instance.save_final_snapshot.
- C++ API documentation: libmuscle::impl::Instance::should_save_final_snapshot, libmuscle::impl::Instance::save_final_snapshot.
- Fortran API documentation: LIBMUSCLE_Instance_should_save_final_snapshot, LIBMUSCLE_Instance_save_final_snapshot.
Note that the below examples only show the changes compared to the start situation. You can view the full contents of the files in the git repository.
Reaction model
Intermediate snapshots
The state we need to store consists of three parts: the current U, the current time t_cur and the end time of the time integration t_stop. The current time is stored as the timestamp attribute of the Message object. The rest is stored in Message.data.

Final snapshots

For the final snapshot there is no state that is required for resuming. The complete state will be received with the next message on the initial_state port.
Python
tutorial_code/checkpointing_reaction_partial.py
C++
tutorial_code/checkpointing_reaction_partial.cpp
Fortran
tutorial_code/checkpointing_reaction_partial.f90
Diffusion model
Intermediate snapshots
The state we need to store consists of two parts: the current time t_cur and the history of U, called Us. Note that the last value of U is contained in Us, so we do not need to save U explicitly. The current time is stored as the timestamp attribute of the Message object. Us is stored in Message.data.
Final snapshots
The same state is stored as for intermediate snapshots.
Python
tutorial_code/checkpointing_diffusion_partial.py
C++
tutorial_code/checkpointing_diffusion_partial.cpp
Fortran
tutorial_code/checkpointing_diffusion_partial.f90
Generic template
Python
tutorial_code/checkpointing_instance_partial.py
C++
tutorial_code/checkpointing_instance_partial.cpp
Fortran
tutorial_code/checkpointing_instance_partial.f90
Now that the checkpoint hooks are implemented, we can add support for resuming from a previously created checkpoint. When resuming, there are two options: resuming from an intermediate checkpoint and resuming from a final checkpoint.
When resuming from an intermediate checkpoint, your component first loads its state from the checkpoint. Then it should continue where it left off, which is at the beginning of O_I. This means that it has to skip F_INIT in order to run as if it had never stopped.
When resuming from a final checkpoint, your component first loads its state from the checkpoint. Next, your component executes the F_INIT operator as usual, as it would have had it continued after writing the snapshot.
Steps to implement the resumption logic:

- At the start of -- but inside -- the reuse loop, you check if you need to resume from a previous snapshot with the API call resuming().

Note
This takes place inside the reuse loop. Currently, resuming can only happen during the first iteration of the reuse loop. However, additional checkpointing features are planned that would allow a model to resume multiple times inside one run. By implementing the resume logic inside the reuse loop, your component will be forwards-compatible with this.

- When resuming, you load the previously stored snapshot with load_snapshot() and restore the state of your component.
- Afterwards, check if initialization is required with should_init() and run the regular initialization logic.
- Continue with the time-integration loop.
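A Python sketch of this resume logic; the port name initial_state and the way the state is unpacked are illustrative placeholders, not prescribed by the API:

```
while instance.reuse_instance():
    if instance.resuming():
        # Restore the component state from the stored snapshot
        msg = instance.load_snapshot()
        t_cur = msg.timestamp
        state = msg.data

    if instance.should_init():
        # Regular F_INIT logic, e.g. receiving the initial state
        msg = instance.receive('initial_state')
        t_cur = msg.timestamp
        state = msg.data

    # ... continue with the time-integration loop ...
```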
See Example: implemented checkpoint hooks and resume for example implementations in the reaction-diffusion models and the component template.
- Python API documentation: libmuscle.Instance.resuming, libmuscle.Instance.load_snapshot, libmuscle.Instance.should_init.
- C++ API documentation: libmuscle::impl::Instance::resuming, libmuscle::impl::Instance::load_snapshot, libmuscle::impl::Instance::should_init.
- Fortran API documentation: LIBMUSCLE_Instance_resuming, LIBMUSCLE_Instance_load_snapshot, LIBMUSCLE_Instance_should_init.
You will notice in the examples that the resume logic is not executed first in the reuse loop. Instead, the components all retrieve settings first. The reason behind this is that it allows the user to resume a simulation with slightly different settings and have those settings take effect immediately after resuming.

It is not required to do this, so you get to decide if (and when) you reload settings after resuming. Be sure to describe the behaviour of your component in its documentation, so that users of your component know what they can expect.
Note that the below examples only show the changes compared to the start situation. You can view the full contents of the files in the git repository.
Reaction model
Resume logic
In Example: implemented checkpoint hooks we made the choice to store different data in the message for intermediate and final snapshots. When resuming, we therefore need to handle these two cases.
Python
examples/python/checkpointing_reaction.py
C++
examples/cpp/checkpointing_reaction.cpp
Fortran
examples/fortran/checkpointing_reaction.f90
Diffusion model
Resume logic
For the diffusion model we stored the same state for intermediate and final snapshots. This makes resuming easier, because we do not have to distinguish between the data stored in the loaded Message object.
Python
examples/python/checkpointing_diffusion.py
C++
examples/cpp/checkpointing_diffusion.cpp
Fortran
examples/fortran/checkpointing_diffusion.f90
Generic template
Python
templates/checkpointing_instance.py
C++
templates/checkpointing_instance.cpp
Fortran
templates/checkpointing_instance.f90
Some components do not need to keep state between reuses. An example of that is the reaction model from the above examples. In the final snapshot, no state needs to be stored to allow properly resuming this component; see Example: implemented checkpoint hooks.
Other examples of such components may be data transformers, receiving data on an F_INIT port and sending the converted data on an O_F port.
If you indicate to libmuscle that your component does not keep state between reuses, libmuscle automatically provides checkpointing for your component. You do this by providing the InstanceFlags.KEEPS_NO_STATE_FOR_NEXT_USE flag when creating the instance. See the below example for a variant of the example reaction model.
Python
examples/python/reaction_no_state_for_next_use.py
C++
examples/cpp/reaction_no_state_for_next_use.cpp
Fortran
examples/fortran/reaction_no_state_for_next_use.f90
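As a sketch of the idea, a stateless data transformer in Python could look roughly like this. The port names and the transform step are made up for illustration, and the assumption that KEEPS_NO_STATE_FOR_NEXT_USE is importable from libmuscle like USES_CHECKPOINT_API is ours; see the example files above for the real thing:

```
from libmuscle import (
        Instance, Message, USES_CHECKPOINT_API, KEEPS_NO_STATE_FOR_NEXT_USE)
from ymmsl import Operator

instance = Instance({
        Operator.F_INIT: ['data_in'],
        Operator.O_F: ['data_out']},
        USES_CHECKPOINT_API | KEEPS_NO_STATE_FOR_NEXT_USE)

while instance.reuse_instance():
    # F_INIT: all state arrives through the port; nothing is kept
    # between reuses, so libmuscle can create snapshots on our behalf
    msg = instance.receive('data_in')
    converted = transform(msg.data)     # hypothetical conversion step
    # O_F: send the converted data
    instance.send('data_out', Message(msg.timestamp, data=converted))
```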
- Python API documentation: libmuscle.InstanceFlags.
- C++ API documentation: libmuscle::impl::InstanceFlags.
- Fortran API documentation: LIBMUSCLE_InstanceFlags.
MUSCLE3's checkpointing API was carefully designed to allow consistently resuming a simulation. This is only possible when components implement the checkpointing API correctly. To support you in this task, MUSCLE3 tries to detect any issues with the checkpointing implementation. When MUSCLE3 detects a problem, an error is raised to indicate what went wrong and point you in the right direction for fixing the problem.
Checkpointing in MPI-enabled components works in the same way as for non-MPI components. The main difference is that some API methods must be called by all processes, while others may only be called from the root process.
- resuming() must be called simultaneously in all processes.
- load_snapshot() may only be called on the root process. It is up to the model code to scatter or broadcast the snapshot state to the non-root processes, if necessary.
- should_init() must be called simultaneously in all processes.
- should_save_snapshot() and should_save_final_snapshot() must be called simultaneously in all processes.
- save_snapshot() and save_final_snapshot() may only be called on the root process. It is therefore up to the model code to gather the necessary state from the non-root processes before saving the snapshot.
- C++ API documentation:
  - libmuscle::impl::Instance::resuming
  - libmuscle::impl::Instance::load_snapshot
  - libmuscle::impl::Instance::should_init
  - libmuscle::impl::Instance::should_save_final_snapshot
  - libmuscle::impl::Instance::save_final_snapshot
  - libmuscle::impl::Instance::should_save_snapshot
  - libmuscle::impl::Instance::save_snapshot
- Fortran API documentation:
  - LIBMUSCLE_Instance_resuming
  - LIBMUSCLE_Instance_load_snapshot
  - LIBMUSCLE_Instance_should_init
  - LIBMUSCLE_Instance_should_save_final_snapshot
  - LIBMUSCLE_Instance_save_final_snapshot
  - LIBMUSCLE_Instance_should_save_snapshot
  - LIBMUSCLE_Instance_save_snapshot