add support for checkpoint/restart of mtt logfiles #916
Hi Howard -- This checkpoint/restart feature sounds handy. FWIW, we used to use
Can you describe how to use this feature, to help me understand how it works?
Looks like this needs a rebase, and I should add an example. I use this in the following scenario. As far as I can tell, the existing ASIS keyword support can't do what I need. Many of our DOE systems are set up as follows:
We tried for a while at NERSC to use the Slurm plugin in MTT, but it was a frustrating disaster. We were trying to get an allocation with --no-shell, which would keep MTT running on the front end, and we used srun for each individual test. Then there was ANL's Theta, which uses a batch system that MTT doesn't even know about, so I came up with this approach. Here's how I use it:
This feature adds support for checkpointing MTT logfiles in workflows where some stages must run on one set of system resources (in particular, the login nodes of typical HPC DOE systems, which have internet access) while other stages, such as running tests, must run on backend nodes with no internet access.
Pickle is used to serialize/deserialize the logfile irrespective of its structure.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
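The pickle-based checkpoint/restart idea described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `save_checkpoint` and `load_checkpoint`, the list-of-dicts log layout, and the stage names are all assumptions made for the example.

```python
import pickle
from pathlib import Path


def save_checkpoint(log, path):
    """Serialize the in-memory log to `path` (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump(log, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_checkpoint(path):
    """Return the log stored at `path`, or an empty list if no file exists."""
    path = Path(path)
    if not path.exists():
        return []
    with open(path, "rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "mtt_log.pkl"  # hypothetical checkpoint file name

        # Stage 1 (login node, internet access): record results, then checkpoint.
        log = [{"stage": "MiddlewareBuild", "status": 0}]  # assumed log layout
        save_checkpoint(log, ckpt)

        # Stage 2 (backend node, no internet access): restore, append, re-save.
        log = load_checkpoint(ckpt)
        log.append({"stage": "TestRun", "status": 0})
        save_checkpoint(log, ckpt)

        print(load_checkpoint(ckpt))
```

Because pickle stores arbitrary Python objects, the log can be restored without knowing its structure in advance. The flip side is that unpickling executes code embedded in the file, so a checkpoint should only be loaded from a trusted location.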
d78b1d5 to 0b37e66
Added WIP to the PR name, as this PR should come with some examples.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@ribab I finally got around to adding some example scripts and ini files showing one way to use this feature.
Any issues with merging this PR into master next week?
I'm going to test this with our test content to ensure everything still works.
Thanks very much Ricky!
pylib/Tools/Launcher/ALPS.py
@@ -38,7 +40,8 @@
 # @param allocate_cmd Command to use for allocating nodes from the resource manager
 # @param deallocate_cmd Command to use for deallocating nodes from the resource manager
 # @param dependencies List of dependencies specified as the build stage name
-# @}
+# @param checkpoint_file Optional checkpoint file
+@}
This line needs to be commented out; it is causing a syntax error.
I think I fixed this.
Signed-off-by: Howard Pritchard <howardp@lanl.gov>
This looks good now -- My test of existing content passed.
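The syntax error raised in the ALPS.py review can be reproduced in isolation. The sketch below uses Python's built-in `compile()` to compare a bare `@}` line with the commented-out `# @}` form; the `compiles` helper and the two sample strings are illustrative only and are not taken from the MTT sources.

```python
def compiles(source):
    """Return True if `source` parses as Python, False on a SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# The documentation marker left uncommented, as in the reviewed revision.
broken = "# @param checkpoint_file Optional checkpoint file\n@}\n"

# The same marker kept inside a comment.
fixed = "# @param checkpoint_file Optional checkpoint file\n# @}\n"

if __name__ == "__main__":
    print("uncommented marker compiles:", compiles(broken))
    print("commented marker compiles:", compiles(fixed))
```

Python reads a leading `@` as the start of a decorator, so the stray `}` that follows cannot be parsed, whereas the commented form is ignored entirely.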