When you execute a long-running simulation, it can be very helpful to store the state of the simulation at certain intervals. For example, your simulation running on an HPC cluster may crash due to insufficient available memory. Instead of restarting this simulation from scratch, you could restart it -- with an increased memory allocation -- from a checkpoint, saving a lot of compute time!
Checkpointing in distributed simulations is difficult. Fortunately, MUSCLE3 comes with built-in checkpointing support. This page describes in detail how to use the MUSCLE3 checkpointing API, how to specify checkpoints in the workflow configuration and how to resume a workflow.
In the user tutorial, you can read about the checkpointing concepts and how to use the API when running and resuming MUSCLE3 simulations. This is followed by a developer tutorial, which explains how to add checkpointing capabilities to your MUSCLE3 component. Finally, the checkpointing deep-dive describes in detail the inner workings of checkpointing in MUSCLE3, though this level of detail is not required for general usage of the API.
Contents
- Checkpoint
A checkpoint is a moment during the workflow where the user wants to have the state of the whole workflow stored.
- Snapshot
A snapshot is the stored state of an instance in the workflow.
- Workflow snapshot
A workflow snapshot is a collection of snapshots for all instances in the workflow, which can be resumed from. This means that the snapshots of every combination of peer instances must be consistent.
- Peer instances
Two instances that are connected by a Conduit.
This user tutorial explains all you need to know about checkpointing for running and resuming simulations. Some details are deliberately left out, though you can read all about those in the developer tutorial or the checkpointing deep-dive.
User tutorial contents
The first step for using checkpoints is to define checkpoints in your workflow. The checkpoint definitions are for your whole workflow, and you can specify them in yMMSL as in the following example:
checkpoints:
at_end: true
simulation_time:
- every: 10
start: 0
stop: 100
- every: 20
start: 100
wallclock_time:
- every: 3600
- at:
- 300
- 600
- 1800
Let's break this down: the first element in this example checkpoints definition is at_end. When this is set to true (as in the example), every instance in the workflow will create a snapshot just before the workflow finishes. This set of snapshots can be used to resume a simulation near the end and, for example, let it run for a longer time. Some caveats apply, though; see resuming from *at_end* snapshots for full details.
The other two items in the checkpoints definition are the time-based simulation time and wallclock time checkpoints. You can use two types of rules to set checkpoint moments for these:

at rules select specific moments. The example rule above requests a checkpoint at 300, 600 and 1800 seconds after the start of the simulation. You can define multiple times in one at rule, but you may also add multiple at rules. The following definitions are all equivalent:

Standard
checkpoints:
  wallclock_time:
  - at:
    - 300
    - 600
    - 1800
Inline list
checkpoints:
  wallclock_time:
  - at: [300, 600, 1800]
Multiple at rules

checkpoints:
  wallclock_time:
  - at: 300
  - at: 600
  - at: 1800
every rules define a recurring set of checkpoints. In the simplest form you indicate the interval at which checkpoints should be taken -- every hour in the wallclock_time example above. You may optionally indicate a start or stop -- as in the simulation_time example above.

Simple
checkpoints:
  wallclock_time:
  - every: 3600
Start and stop
checkpoints:
  simulation_time:
  - every: 10
    start: 0
    stop: 100
  - every: 20
    start: 100
Overlapping ranges
checkpoints:
  simulation_time:
  - every: 1
  - every: 0.25
    start: 0
    stop: 2
Note
When stop is specified, the stop time is included when stop == start + n * every, with n a positive whole number. However, this might give surprising results due to the inaccuracies of floating point computations. Compare for example:

checkpoints:
  simulation_time:
  - every: 1
    start: 0
    stop: 7

checkpoints:
  simulation_time:
  - every: 0.1
    start: 0
    stop: 0.7
Why the difference? Well, compare in Python:

>>> 7 * 1.0
7.0
>>> 7 * 0.1
0.7000000000000001

Since 0.7000000000000001 is larger than 0.7, no checkpoint will be generated for this time.
- yMMSL documentation on Checkpoints
- yMMSL API reference: ymmsl.Checkpoints, ymmsl.CheckpointAtRule, ymmsl.CheckpointRangeRule
Checkpoints defined in the simulation_time section are taken based on the time inside your simulation. This will only work correctly if all components in the simulation have a shared concept of time, which only increases during the simulation. This should be no problem for physics-based simulations, though it does require that the instances make correct use of the timestamp in MUSCLE3 messages. When this requirement is fulfilled, checkpoints based on simulation time are the most reliable way to checkpoint your workflow.
MUSCLE3 does not interpret or convert the units that you configure in the checkpoints. The units are the same as the components in the simulation use for the timestamps in the messages. Typically this will be in SI seconds, but components may deviate from this standard. MUSCLE3 assumes that all components in the workflow use the same time units in the interfaces to libmuscle.
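To illustrate, here is a minimal Python sketch of a component that passes its simulation time as the timestamp of each message it sends, which is what simulation time checkpoints rely on. The port name, time settings and state update are placeholders, not part of any real example:

```
from libmuscle import Instance, Message
from ymmsl import Operator

instance = Instance({Operator.O_I: ['state_out']})
while instance.reuse_instance():
    t_cur, dt, t_max = 0.0, 0.1, 1.0    # placeholder time settings
    while t_cur < t_max:
        state = ...                      # placeholder state update (S)
        # The timestamp carries the simulation time; MUSCLE3 uses it
        # to decide when simulation_time checkpoints trigger.
        instance.send('state_out', Message(t_cur, data=state))
        t_cur += dt
```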
Note
MUSCLE3 does not assume anything about the start time of a simulation. Your simulation time may start at any value, even a negative one! Therefore, checkpoint ranges defined by every rules include 0 and negative numbers when no start value is provided.
Because MUSCLE3 does not know what internal time your simulation starts at, an every rule without a start value will always trigger a checkpoint at the first possible moment in the simulation. You should supply a start value if you do not want this to happen.
Checkpoints defined in the wallclock_time section are taken based on the elapsed wallclock time of your simulation (also known as elapsed real time). Each component in the simulation will make a snapshot at the earliest possible moment after a checkpoint is passed. The checkpoint times in the configuration are interpreted as seconds since the initialization of muscle_manager.
Warning
Wallclock time checkpoint definitions are currently not a reliable way to create workflow snapshots. While each instance in the simulation will create a snapshot when requested, there is no guarantee that all snapshots are consistent.
When a simulation has relatively simple coupling between components, checkpointing based on wallclock time usually works fine.
However, for co-simulation (the interact coupling type) and more complex coupling, it is likely that not all checkpoints lead to a consistent workflow snapshot.
If you intend to use wallclock time checkpoints and find that you often don't get a consistent workflow snapshot, you may try the following workaround: instead of requesting a wallclock time checkpoint at (for example) 600 seconds, you can specify checkpoints at 600, 601, 602, 603, 604 and 605 seconds. The "right" interval to use will depend on the typical compute times of your components and coupling in the simulation.
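Applying this workaround to a wallclock time checkpoint intended for the 600-second mark could look like this (the exact spread is an illustration, not a recommendation):

```
checkpoints:
  wallclock_time:
  - at: [600, 601, 602, 603, 604, 605]
```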
Starting a simulation with checkpoints is no different than starting one without. You need to start the muscle_manager with the configuration yMMSL file (or files), as well as the individual components (or let muscle_manager start them for you with the --start-all flag). The sole difference is that the yMMSL configuration must contain a checkpoints section.
When muscle_manager is started with checkpoints configured, a couple of things change. First, all of the component implementations must support checkpointing: the simulation will stop with an error if this is not the case. The simulation may also stop with an error if there is an issue in the checkpointing implementation of any of the components.
Second, all components are instructed to make snapshots according to the configured checkpoints. muscle_manager keeps track of all created snapshots during the simulation, looking for workflow snapshots. When a workflow snapshot is detected, muscle_manager writes a yMMSL file that can be used to resume the simulation.
During the simulation, all of the created snapshots are stored on the file system. See the table below for the directories where MUSCLE3 stores the files. Note: a run directory is automatically created when using the --start-all flag for muscle_manager. You may also specify a custom run directory through the --run-dir DIRECTORY option. When you do not provide a run directory, the last column in the table below indicates where snapshots are stored.
Snapshot type | Run directory provided | No run directory provided
---|---|---
Workflow | run_dir/snapshots/ | Working directory of muscle_manager
Instance | run_dir/instances/<instance name>/snapshots/ | Working directory of the instance
Note
When running a distributed simulation on multiple compute nodes, MUSCLE3 assumes that the run directory is accessible to all nodes (i.e. on a shared or distributed file system). This is usually the case on HPC clusters.
The reaction-diffusion example model from the Tutorial with Python also has a variant with checkpointing enabled. To run this yourself, navigate in a command line prompt to the docs/source/examples folder in the MUSCLE3 git repository. Then execute the following commands:
$ mkdir run_rd_example
$ muscle_manager --start-all --run-dir run_rd_example rd_implementations.ymmsl rd_checkpoints_python.ymmsl rd_settings.ymmsl
Note
You may get an error: File 'rd_implementations.ymmsl' does not exist. To fix this, you need to build the examples in the MUSCLE3 source; in the root of the git repository, execute:
$ make test_examples
The above command runs the muscle_manager and starts all components (the reaction model and the diffusion model). The rd_checkpoints_python.ymmsl file contains the checkpoint definitions used in this example:
examples/rd_checkpoints_python.ymmsl
MUSCLE3 will create the run directory run_rd_example for you. In it you'll find the instance snapshots in instances/macro/snapshots and instances/micro/snapshots. The workflow snapshots are stored in the snapshots folder in the run directory.
You can resume a simulation from a workflow snapshot stored in a previous run of the simulation. This works by appending a workflow snapshot yMMSL file from a previous run to the regular yMMSL configuration. If you started your original simulation with:
$ muscle_manager --run-dir ./run1 configuration.ymmsl
You can resume it from a snapshot of this run like so:
$ muscle_manager --run-dir ./run2 configuration.ymmsl ./run1/snapshots/snapshot_20221202_112840.ymmsl
Here we choose a different run directory, and resume from the snapshot file snapshot_20221202_112840.ymmsl that was produced by the first run. This file contains the information required to resume the workflow:
- It contains a description which allows you to inspect metadata of the workflow snapshot. It indicates the trigger or triggers leading to this snapshot, and some information on the state of each component in the workflow. This data is for informational purposes only and is ignored by muscle_manager.
- It also contains the paths to the snapshots that each instance needs to resume. Note that these snapshots must still exist at the same location. If you move or delete them (or a parent directory), resuming your simulation will fail with an error message:

Unable to load snapshot: <snapshot filename> is not a file. Please ensure this path exists and can be read.
To resume the reaction-diffusion model from a snapshot created in the previous section, replace <date> and <time> in the following command to point to the snapshot you want to resume from.
$ mkdir run_rd_resume
$ muscle_manager --start-all --run-dir run_rd_resume rd_implementations.ymmsl rd_checkpoints_python.ymmsl rd_settings.ymmsl run_rd_example/snapshots/snapshot_<date>_<time>.ymmsl
When the command completes, you can see the output in the new run directory run_rd_resume.
MUSCLE3 checkpointing is designed for resuming simulations as if they never stopped. This means that resuming is only supported for consistent snapshots and for simulation configurations that have not changed.
MUSCLE3 does not support any changes to the model when resuming, such as adding or removing components, or changing conduits. Attempting this will likely lead to deadlocks or error messages.
You are allowed to change the settings of your simulation when resuming. However, whether and when changed settings take effect depends on the implementation of your components. Please ask the developers of your simulation components for this information.
Warning
When you resume from an at_end snapshot without making changes (for example letting the simulation run for longer), the simulation will immediately complete again.
MUSCLE3 checkpointing was designed for consistency: no messages between the components may be lost when restarting. When we fulfill this criterion, a simulation can resume from a checkpoint as if it had never been interrupted.
During a simulation run, each component creates snapshots independently from all other components. For simulation time checkpoints, the MUSCLE3 checkpointing algorithm is guaranteed to give consistent workflow snapshots when all components adhere to the Multiscale Modeling and Simulation Framework (MMSF).
Wallclock time checkpoints in the current implementation are less reliable: components may take snapshots while messages are still in transit. When that happens, an inconsistent state is produced and no workflow snapshot is written by muscle_manager.
MUSCLE3 does not support combining inconsistent snapshots, so it is not possible to freely mix snapshots produced during a simulation. When resuming, MUSCLE3 checks the consistency of all snapshots. The run will end with an error when an inconsistent state is detected:
Received message on <port> with unexpected message number <num>. Was
expecting <num>. Are you resuming from an inconsistent snapshot?
When resuming from a snapshot yMMSL file written by muscle_manager, you should not encounter this error.
General troubleshooting strategy:
1. First try to find the root cause of the problem that your simulation ran into. You can start by looking in the log file of the muscle_manager, located in <run directory>/muscle3_manager.log. This log file may show the error message or point you in the right direction.
2. If the muscle_manager log did not display an error, it may indicate which component failed first. Have a look at the logs of that component to figure out what went wrong. The output of an instance is usually found in <run directory>/instances/<instance name>/. Open stdout.txt and stderr.txt to find out what went wrong.
3. If the muscle_manager logs did not point to a specific instance, you should have a look at the log files of each instance (see point 2 for instructions). Note that some instances may log Broken Pipe errors -- this usually happens when a peer component has crashed and it is typically not the root cause of your simulation crash.
Once you find the root cause of your problem, check the list below for common issues and their resolutions. You may also have found a bug in MUSCLE3: please help us and your fellow MUSCLE3 users by creating an issue on GitHub.
- The simulation crashes when using checkpoints.
The first thing you should check is: does the simulation run error-free when checkpoints are disabled? You can test this by commenting out the checkpoints section of your input ymmsl file(s).
If it runs error-free without checkpoints, have a look at the error message in the log file generated by your run. MUSCLE3 attempts to have clear error messages to explain what went wrong and give you pointers to a solution.
When the error message indicates a problem with the implementation of the checkpointing API, please check with the developer of the component to fix this. If you are the developer of the component, please see the Developer tutorial section for additional resources.
- The simulation crashes when resuming.
Some common causes for this are:
- The snapshot files that the instances are resuming from no longer exist. This could for example happen when a previous run directory has been moved or deleted. For distributed execution, some compute nodes may not be able to access the directories where the instance snapshots are stored. See also Resuming a simulation.
- Your simulation configuration has incompatible changes compared to the original simulation that the snapshots were from. See Making changes to your simulation. Luckily, MUSCLE3 stores the previous simulation configuration in the run directory. If the snapshot that you resume from is stored in run1/snapshots/snapshot_xyz.ymmsl, then you can find that configuration in run1/configuration.ymmsl. Try resuming with that configuration first to see if this is the real problem:

$ muscle_manager --run-dir run2 run1/configuration.ymmsl run1/snapshots/snapshot_xyz.ymmsl

- One of your components has a bug that is triggered when resuming from a previous snapshot, or perhaps your snapshot belonged to a different version of the component. Please ask your component developer(s) for help.
This developer tutorial explains all you need to know about implementing checkpointing in your MUSCLE3 simulation component. If you're not a developer and want to learn how to define checkpoints and resume simulations, please have a look at the user tutorial.
Some details are deliberately left out in this developer tutorial, though you can read all about those in the checkpointing deep-dive.
Developer tutorial contents
In this tutorial we will add checkpointing to the reaction and diffusion components from the Python, C++ and Fortran tutorials.
Additionally, we will do the same for a generic MUSCLE3 component template. These templates illustrate the structure of a MUSCLE3 component, but they are not complete and cannot be executed.
Reaction model
Python
examples/python/reaction.py
C++
examples/cpp/reaction.cpp
Fortran
examples/fortran/reaction.f90
Diffusion model
Python
examples/python/diffusion.py
C++
examples/cpp/diffusion.cpp
Fortran
examples/fortran/diffusion.f90
Generic template
Python
templates/instance.py
C++
templates/instance.cpp
Fortran
templates/instance.f90
As a first step, you need to indicate that you intend to use the checkpoint API. You do this through the InstanceFlags.USES_CHECKPOINT_API flag when creating the instance:
Python
from libmuscle import Instance, USES_CHECKPOINT_API
...
ports = ...
instance = Instance(ports, USES_CHECKPOINT_API)
API documentation for libmuscle.InstanceFlags.USES_CHECKPOINT_API.
C++
#include <libmuscle/libmuscle.hpp>
#include <ymmsl/ymmsl.hpp>
using libmuscle::PortsDescription;
using libmuscle::Instance;
using libmuscle::InstanceFlags;
...
int main(int argc, char * argv[]) {
PortsDescription ports = ...;
Instance instance(argc, argv, ports, InstanceFlags::USES_CHECKPOINT_API);
...
}
API documentation for libmuscle::impl::InstanceFlags::USES_CHECKPOINT_API.
Fortran
use ymmsl
use libmuscle
type(LIBMUSCLE_PortsDescription) :: ports
type(LIBMUSCLE_Instance) :: instance
ports = ...
instance = LIBMUSCLE_Instance_create( &
ports, LIBMUSCLE_InstanceFlags(USES_CHECKPOINT_API=.true.))
API documentation for LIBMUSCLE_InstanceFlags.
If you do not set this flag, you'll get a runtime error when trying to use any of the checkpointing API calls on the Instance object.
The first step in implementing the checkpointing API is implementing the checkpoint hooks. These are the points where your component can make checkpoints:

Intermediate snapshots
Intermediate snapshots are taken inside the reuse loop, immediately after the S Operator of your component.

Final snapshots
Final snapshots are taken at the end of the reuse loop, after the O_F Operator of your component.
Taking intermediate snapshots is optional. However, we recommend implementing intermediate snapshots when any of the following points holds for your component:

- Your component has a loop containing O_I and S, and you communicate during Operator O_I or Operator S. Implementing intermediate checkpointing allows submodels connected to your component to also create checkpoints.

Warning
If you do not implement intermediate checkpoints in this case, then it is likely that many user-defined checkpoints will not lead to consistent workflow snapshots. Please implement intermediate snapshots to give the users of your component a good checkpointing experience.

- There is no communication during O_I and S, but the state update S is executed in a (time-integration) loop which takes a relatively long time. In this case, intermediate checkpointing allows users to create checkpoints of your component during long-running computations.

In all other cases, there is usually little or no added value in implementing intermediate snapshots in addition to final snapshots.
You implement taking intermediate snapshots as follows:

- Find out where in your code to implement the checkpointing calls. Typically there is a state update loop (e.g. a while or for loop) in a component. You should implement the checkpointing calls at the end of this state update loop. In this way, your code can resume immediately at the beginning of that loop. This allows for consistent restarts with the least amount of code.
- Ask libmuscle if you need to store your state and create an intermediate snapshot with the API call should_save_snapshot(t). You must provide the current time t in your simulation, such that MUSCLE3 can determine if Simulation time checkpoints are triggered.
- Collect the state that you need to store.
- Create a libmuscle.Message object to put your state in.
- Store the snapshot Message with the API call save_snapshot(message).
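Put together, the steps above might look like this in Python. This is a sketch only: instance, state and the time variables are assumed to exist in your component, and how you pack the state into the Message is up to you:

```
from libmuscle import Message

# ... inside the reuse loop ...
while t_cur + dt <= t_stop:
    # O_I / S: communicate and compute the state update
    t_cur += dt

    # End of the state update loop: ask libmuscle if a checkpoint
    # triggers at simulation time t_cur
    if instance.should_save_snapshot(t_cur):
        # Pack the state into a Message, with t_cur as the timestamp
        instance.save_snapshot(Message(t_cur, data={'state': state,
                                                    't_stop': t_stop}))
```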
See Example: implemented checkpoint hooks for example implementations in the reaction-diffusion models and the component template.
- Python API documentation: libmuscle.Instance.should_save_snapshot, libmuscle.Instance.save_snapshot.
- C++ API documentation: libmuscle::impl::Instance::should_save_snapshot, libmuscle::impl::Instance::save_snapshot.
- Fortran API documentation: LIBMUSCLE_Instance_should_save_snapshot, LIBMUSCLE_Instance_save_snapshot.
Final snapshots must be implemented by all components supporting checkpointing. You implement taking a final snapshot as follows:

- You must implement the checkpoint calls at the end of the reuse loop.
- Ask libmuscle if you need to store your state and create a final snapshot with the API call should_save_final_snapshot(). Contrary to the intermediate checkpoints, this call may block to determine if a checkpoint is needed (this is also the reason it must happen at the end of the reuse loop).
- Collect the state that you need to store.
- Create a libmuscle.Message object to put your state in.
- Store the snapshot Message with the API call save_final_snapshot(message).
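In Python, the final snapshot hook might be sketched as follows (again assuming instance, state and t_cur exist in your component; the elided parts are your regular model logic):

```
from libmuscle import Message

while instance.reuse_instance():
    # ... F_INIT, the state update loop and O_F ...

    # Very end of the reuse loop; this call may block while libmuscle
    # determines whether a final snapshot is needed
    if instance.should_save_final_snapshot():
        instance.save_final_snapshot(Message(t_cur, data=state))
```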
See Example: implemented checkpoint hooks for example implementations in the reaction-diffusion models and the component template.
- Python API documentation: libmuscle.Instance.should_save_final_snapshot, libmuscle.Instance.save_final_snapshot.
- C++ API documentation: libmuscle::impl::Instance::should_save_final_snapshot, libmuscle::impl::Instance::save_final_snapshot.
- Fortran API documentation: LIBMUSCLE_Instance_should_save_final_snapshot, LIBMUSCLE_Instance_save_final_snapshot.
Note that the below examples only show the changes compared to the start situation. You can view the full contents of the files in the git repository.
Reaction model
Intermediate snapshots
The state we need to store consists of three parts: the current U, the current time t_cur and the end time of the time integration t_stop. The current time is stored as the timestamp attribute of the Message object. The rest is stored in Message.data.

Final snapshots

For the final snapshot there is no state that is required for resuming. The complete state will be received with the next message on the initial_state port.
Python
tutorial_code/checkpointing_reaction_partial.py
C++
tutorial_code/checkpointing_reaction_partial.cpp
Fortran
tutorial_code/checkpointing_reaction_partial.f90
Diffusion model
Intermediate snapshots
The state we need to store consists of two parts: the current time t_cur and the history of U, called Us. Note that the last value of U is contained in Us, so we do not need to save U explicitly. The current time is stored as the timestamp attribute of the Message object. Us is stored in Message.data.
Final snapshots
The same state is stored as for intermediate snapshots.
Python
tutorial_code/checkpointing_diffusion_partial.py
C++
tutorial_code/checkpointing_diffusion_partial.cpp
Fortran
tutorial_code/checkpointing_diffusion_partial.f90
Generic template
Python
tutorial_code/checkpointing_instance_partial.py
C++
tutorial_code/checkpointing_instance_partial.cpp
Fortran
tutorial_code/checkpointing_instance_partial.f90
Now that the checkpoint hooks are implemented, we can add support for resuming from a previously created checkpoint. When resuming, there are two options: resuming from an intermediate checkpoint and resuming from a final checkpoint.
When resuming from an intermediate checkpoint, your component first loads its state from the checkpoint. Then it should continue where it left off, which is at the beginning of O_I. This means that it has to skip F_INIT in order to run as if it had never stopped.
When resuming from a final checkpoint, your component first loads its state from the checkpoint. Next, your component executes the F_INIT operator as usual, as it would have had it continued after writing the snapshot.
Steps to implement the resumption logic:

- At the start of -- but inside -- the reuse loop, you check if you need to resume from a previous snapshot with the API call resuming().

Note
This takes place inside the reuse loop. Currently, resuming can only happen during the first iteration of the reuse loop. However, additional checkpointing features are planned that would allow a model to resume multiple times inside one run. By implementing the resume logic inside the reuse loop, your component will be forwards-compatible with this.

- When resuming, you load the previously stored snapshot with load_snapshot() and restore the state of your component.
- Afterwards, check if initialization is required with should_init() and run the regular initialization logic.
- Continue with the time-integration loop.
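A Python sketch of this resume logic; the port name initial_state and the way the state is unpacked are illustrative placeholders, not prescribed by the API:

```
while instance.reuse_instance():
    if instance.resuming():
        # Restore the component state from the stored snapshot
        msg = instance.load_snapshot()
        t_cur = msg.timestamp
        state = msg.data

    if instance.should_init():
        # Regular F_INIT logic, e.g. receiving the initial state
        msg = instance.receive('initial_state')
        t_cur = msg.timestamp
        state = msg.data

    # ... continue with the time-integration loop ...
```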
See Example: implemented checkpoint hooks and resume for example implementations in the reaction-diffusion models and the component template.
- Python API documentation: libmuscle.Instance.resuming, libmuscle.Instance.load_snapshot, libmuscle.Instance.should_init.
- C++ API documentation: libmuscle::impl::Instance::resuming, libmuscle::impl::Instance::load_snapshot, libmuscle::impl::Instance::should_init.
- Fortran API documentation: LIBMUSCLE_Instance_resuming, LIBMUSCLE_Instance_load_snapshot, LIBMUSCLE_Instance_should_init.
You will notice in the examples that the resume logic is not executed first in the reuse loop. Instead, the components all retrieve settings first. The reason behind this is that it allows the user to resume a simulation with slightly different settings and have those settings take effect immediately after resuming.

It is not required to do this, so you get to decide if (and when) you reload settings after resuming. Be sure to describe the behaviour of your component in its documentation, so that users of your component know what they can expect.
Note that the below examples only show the changes compared to the start situation. You can view the full contents of the files in the git repository.
Reaction model
Resume logic
In Example: implemented checkpoint hooks we made the choice to store different data in the message for intermediate and final snapshots. When resuming, we therefore need to handle these two cases.
Python
examples/python/checkpointing_reaction.py
C++
examples/cpp/checkpointing_reaction.cpp
Fortran
examples/fortran/checkpointing_reaction.f90
Diffusion model
Resume logic
For the diffusion model we stored the same state for intermediate and final snapshots. This makes resuming easier, because we do not have to distinguish between the data stored in the loaded Message object.
Python
examples/python/checkpointing_diffusion.py
C++
examples/cpp/checkpointing_diffusion.cpp
Fortran
examples/fortran/checkpointing_diffusion.f90
Generic template
Python
templates/checkpointing_instance.py
C++
templates/checkpointing_instance.cpp
Fortran
templates/checkpointing_instance.f90
Some components do not need to keep state between reuses. An example of that is the reaction model from the above examples. In the final snapshot, no state needs to be stored to allow properly resuming this component; see Example: implemented checkpoint hooks.
Other examples of such components may be data transformers, receiving data on an F_INIT port and sending the converted data on an O_F port.
If you indicate to libmuscle that your component does not keep state between reuses, libmuscle automatically provides checkpointing for your component. You do this by providing the InstanceFlags.KEEPS_NO_STATE_FOR_NEXT_USE flag when creating the instance. See the below example for a variant of the example reaction model.
Python
examples/python/reaction_no_state_for_next_use.py
C++
examples/cpp/reaction_no_state_for_next_use.cpp
Fortran
examples/fortran/reaction_no_state_for_next_use.f90
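As a sketch of the idea, a stateless data transformer in Python could look roughly like this. The port names and the transform step are made up for illustration, and the assumption that KEEPS_NO_STATE_FOR_NEXT_USE is importable from libmuscle like USES_CHECKPOINT_API is ours; see the example files above for the real thing:

```
from libmuscle import (
        Instance, Message, USES_CHECKPOINT_API, KEEPS_NO_STATE_FOR_NEXT_USE)
from ymmsl import Operator

instance = Instance({
        Operator.F_INIT: ['data_in'],
        Operator.O_F: ['data_out']},
        USES_CHECKPOINT_API | KEEPS_NO_STATE_FOR_NEXT_USE)

while instance.reuse_instance():
    # F_INIT: all state arrives through the port; nothing is kept
    # between reuses, so libmuscle can create snapshots on our behalf
    msg = instance.receive('data_in')
    converted = transform(msg.data)     # hypothetical conversion step
    # O_F: send the converted data
    instance.send('data_out', Message(msg.timestamp, data=converted))
```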
- Python API documentation: libmuscle.InstanceFlags.
- C++ API documentation: libmuscle::impl::InstanceFlags.
- Fortran API documentation: LIBMUSCLE_InstanceFlags.
MUSCLE3's checkpointing API was carefully designed to allow consistently resuming a simulation. This is only possible when components implement the checkpointing API correctly. To support you in this task, MUSCLE3 tries to detect any issues with the checkpointing implementation. When MUSCLE3 detects a problem, an error is raised to indicate what went wrong and point you in the right direction for fixing the problem.
Checkpointing in MPI-enabled components works in the same way as for non-MPI components. The main difference is that some API methods must be called by all processes, while others may only be called from the root process.
- resuming() must be called simultaneously in all processes.
- load_snapshot() may only be called on the root process. It is up to the model code to scatter or broadcast the snapshot state to the non-root processes, if necessary.
- should_init() must be called simultaneously in all processes.
- should_save_snapshot() and should_save_final_snapshot() must be called simultaneously in all processes.
- save_snapshot() and save_final_snapshot() may only be called on the root process. It is therefore up to the model code to gather the necessary state from the non-root processes before saving the snapshot.
- C++ API documentation:
  - libmuscle::impl::Instance::resuming
  - libmuscle::impl::Instance::load_snapshot
  - libmuscle::impl::Instance::should_init
  - libmuscle::impl::Instance::should_save_final_snapshot
  - libmuscle::impl::Instance::save_final_snapshot
  - libmuscle::impl::Instance::should_save_snapshot
  - libmuscle::impl::Instance::save_snapshot
- Fortran API documentation:
  - LIBMUSCLE_Instance_resuming
  - LIBMUSCLE_Instance_load_snapshot
  - LIBMUSCLE_Instance_should_init
  - LIBMUSCLE_Instance_should_save_final_snapshot
  - LIBMUSCLE_Instance_save_final_snapshot
  - LIBMUSCLE_Instance_should_save_snapshot
  - LIBMUSCLE_Instance_save_snapshot