Scalable Checkpoint / Restart (SCR) Library
The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing and restarting large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.
Detailed usage is provided at SCR.ReadTheDocs.io.
SCR uses the CMake build system and we recommend out-of-source builds.
    git clone git@github.com:llnl/scr.git
    mkdir build
    mkdir install
    cd build
    cmake -DCMAKE_INSTALL_PREFIX=../install ../scr
    make
    make install
    make test
Some useful CMake command line options:
-DCMAKE_INSTALL_PREFIX=[path]: Place to install the SCR library
-DCMAKE_BUILD_TYPE=[Debug/Release]: Build with debugging or optimizations
-DBUILD_PDSH=[OFF/ON]: CMake can automatically download and build the PDSH dependency
-DWITH_PDSH_PREFIX=[path to PDSH]: Path to an existing PDSH installation (should not be used with
-DWITH_DTCMP_PREFIX=[path to DTCMP]
-DWITH_YOGRT_PREFIX=[path to YOGRT]
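As an illustration, several of the options above can be combined in one configure step. The sketch below only assembles and prints the command line; it does not run CMake. The helper name `scr_cmake_cmd` and the example paths are hypothetical and chosen for this example only.

```shell
#!/bin/sh
# Hypothetical helper: prints a cmake command line built from the options
# listed above. It does not invoke cmake itself.
#   $1 = install prefix
#   $2 = build type (Debug or Release)
#   $3 = path to the SCR source directory
scr_cmake_cmd() {
    prefix=$1
    build_type=$2
    src_dir=$3
    printf 'cmake -DCMAKE_INSTALL_PREFIX=%s -DCMAKE_BUILD_TYPE=%s -DBUILD_PDSH=ON %s\n' \
        "$prefix" "$build_type" "$src_dir"
}

# Example: a Debug build installed to ../install, letting CMake build PDSH.
scr_cmake_cmd ../install Debug ../scr
```

Printing the command first makes it easy to review the chosen options before running the command from the build directory.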
- C (with support for C++ and Fortran)
- CMake, Version 2.8+
- DTCMP (optional)
- libYOGRT (optional)
- MySQL (optional)
SCR searches the following locations, in order, for a parameter value and takes the first value it finds.
- Environment variables,
- User configuration file,
- System configuration file,
- Compile-time constants.
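The search order above can be sketched as a small shell function. This is not SCR's implementation, only a model of the precedence rule. It assumes configuration files contain one `NAME=VALUE` pair per line, and the function names `conf_get` and `lookup_param`, the parameter name, and the file paths are all hypothetical. The compile-time constant is modeled as a default passed by the caller.

```shell
#!/bin/sh
# Model of the lookup order: environment, user file, system file, default.
# Assumption of this sketch: files hold one NAME=VALUE pair per line.

# Print the value of NAME ($1) from FILE ($2); fail if the file or name is absent.
conf_get() {
    [ -f "$2" ] || return 1
    line=$(grep -m 1 "^$1=" "$2") || return 1
    printf '%s\n' "${line#*=}"
}

# Resolve NAME ($1) given USER_FILE ($2), SYSTEM_FILE ($3), and DEFAULT ($4).
lookup_param() {
    # 1. Environment variable: printenv succeeds only if NAME is exported.
    printenv "$1" && return 0
    # 2. User configuration file.
    conf_get "$1" "$2" && return 0
    # 3. System configuration file.
    conf_get "$1" "$3" && return 0
    # 4. Fall back to the default, standing in for a compile-time constant.
    printf '%s\n' "$4"
}

# Example call with a hypothetical parameter name and hypothetical file paths.
lookup_param SCR_EXAMPLE_PARAM ./user.conf ./system.conf fallback
```

Because each step returns as soon as it finds a value, an exported environment variable overrides both files, and the user file overrides the system file.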
To find a user configuration file, SCR looks for a file named .scrconf in the prefix directory (note the leading dot).
Alternatively, one may specify the name and location of the user configuration file by setting the SCR_CONF_FILE environment variable at run time.
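The rule for locating the user configuration file can be written as a short sketch. It is only a model of the behavior described above: the function name `user_conf_path` is hypothetical, and the prefix directory is passed in as an argument because how SCR determines it is not covered here.

```shell
#!/bin/sh
# Print the path of the user configuration file.
#   $1 = prefix directory (supplied by the caller in this sketch)
user_conf_path() {
    if [ -n "${SCR_CONF_FILE:-}" ]; then
        # An explicit SCR_CONF_FILE setting takes priority.
        printf '%s\n' "$SCR_CONF_FILE"
    else
        # Otherwise fall back to .scrconf in the prefix directory.
        printf '%s/.scrconf\n' "$1"
    fi
}
```

For example, exporting `SCR_CONF_FILE` in the job script before launching the application points SCR at a file outside the prefix directory.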
This repository includes some example configuration files (
Numerous people have contributed to the SCR project.
To reference SCR in a publication, please cite the following paper:
- Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski, Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, LLNL-CONF-427742, Supercomputing 2010, New Orleans, LA, November 2010.
Additional information and research publications can be found here:
Developer documentation is provided at SCR-dev.ReadTheDocs.io.