add support for checkpoint/restart of mtt logfiles #916

Merged
merged 3 commits into from
Jun 24, 2021


hppritcha
Member

This feature adds support for checkpointing MTT logfiles within an MTT workflow
in which some stages must run on one set of system resources (in particular the
login nodes of typical DOE HPC systems, which have internet access) while other
stages, such as running tests, must run on backend nodes with no internet access.

Pickle is used to serialize/deserialize the logfile irrespective of its structure.
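
To make the pickle round trip concrete, here is a minimal sketch. It is not the code from this PR: the helper names `save_checkpoint` and `load_checkpoint`, the file name, and the shape of the log entries are all hypothetical. It only illustrates that pickle restores a nested log object without the caller having to know its layout.

```python
import os
import pickle
import tempfile


def save_checkpoint(log, path):
    """Serialize the whole log object to `path` (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(log, f)


def load_checkpoint(path):
    """Read a log object back from `path` (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical log contents; the real MTT log layout may differ.
log = [
    {"section": "MiddlewareGet:OMPI", "status": 0, "location": "/tmp/ompi"},
    {
        "section": "MiddlewareBuild:OMPI",
        "status": 0,
        "options": {"jobs": 8},
        "stdout": ["configure ok", "make ok"],
    },
]

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "mtt_log.ckpt")
    save_checkpoint(log, ckpt)
    restored = load_checkpoint(ckpt)

# The restored object is an equal but independent copy of the original.
print(restored == log, restored is log)
```

Note that `pickle.load` can execute arbitrary code from a crafted file, so a checkpoint should only be restored from a location you control.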

Signed-off-by: Howard Pritchard <howardp@lanl.gov>

@mallove79

Hi Howard -- This checkpoint/restart feature sounds handy. FWIW, we used to use mtt-relay for submitting test results from an offline network.

@ribab
Contributor

ribab commented Mar 17, 2021

Can you describe how to use this feature, to help me understand how it works?

@hppritcha
Member Author

looks like this needs a rebase and I should add an example.

I use this in the following scenario. Despite my best efforts, I've not been able to use the existing ASIS keyword support to do what I need. Many of our DOE systems are set up as follows:

  • front end nodes that can access the internet, e.g. the cherry-py server running at mtt.open-mpi.org on AWS; middleware and tests can be cloned from github, etc.
  • middleware and tests can be built on the frontend nodes
  • frontends can't run the tests
  • compute nodes can't reach the internet, so no talk with github, mtt.open-mpi.org etc.
  • compute nodes may be managed by any number of resource managers, not just slurm
  • compute nodes are obtained by requesting an allocation from the resource manager
  • tests can be run on the compute nodes

We tried for a while at NERSC to use the slurm plugin in MTT but it was a frustrating disaster. We were trying to get an allocation with --no-shell, which would keep MTT running on the front end, and we used a separate srun for each individual test.
This approach was completely unreliable. slurm would sometimes just drop the allocation out from under us, so all our sruns would hang for a long time and then come back with an error. NERSC consulting was never able to figure out why this wasn't working. Then srun support dropped out with master for that platform, since we can't control how slurm is built there and at the time the install did not support pmix.

Then there was ANL's Theta, which uses a batch system that MTT doesn't even know about, so I came up with this approach.

Here's how I use it:

  • main script runs on head node and runs an initial MTT script that downloads Open MPI, builds it, downloads ompi-tests and builds them.
  • the last stage of this MTT script stores the log data via pickle into a restart file.
  • the main script submits a blocking job to the backend compute nodes; the job runs another script (the backend script)
  • the backend script runs on the head node of the allocation (or, in the case of ANL Theta, on a MOM node)
  • the backend script launches the second stage of the MTT run; this MTT invocation reads in the restart file from above and has only sections for running the tests and, as the final step, storing a checkpoint of the logfile (which includes the first logfile)
  • the batch job ends, hopefully successfully, and the main script running on the front end resumes (it was blocked waiting for the batch job to finish)
  • the main script then runs the final MTT stage
  • this final MTT stage reads in the logfile (pickle format) created at the end of the batch job and reports results back to the cherry-py server running at mtt.open-mpi.org
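
The three-stage handoff described above can be sketched as follows. This is an illustration under stated assumptions, not the example scripts added in this PR. Each function stands in for a separate MTT invocation, the stage names and log entries are hypothetical, and the final stage returns a summary instead of contacting the reporting server. The only state shared between stages is the pickle file on disk.

```python
import os
import pickle
import tempfile


def write_log(log, path):
    """Checkpoint the accumulated log to `path`."""
    with open(path, "wb") as f:
        pickle.dump(log, f)


def read_log(path):
    """Restore a previously checkpointed log from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def frontend_build_stage(ckpt_out):
    """Stage 1 (front end, has internet): fetch and build, then checkpoint."""
    log = [
        {"stage": "get_ompi", "status": 0},
        {"stage": "build_ompi", "status": 0},
        {"stage": "build_tests", "status": 0},
    ]
    write_log(log, ckpt_out)


def backend_test_stage(ckpt_in, ckpt_out):
    """Stage 2 (compute nodes, no internet): restore, run tests, checkpoint."""
    log = read_log(ckpt_in)
    # Hypothetical test result appended to the restored log.
    log.append({"stage": "run_tests", "status": 0, "passed": 12, "failed": 1})
    write_log(log, ckpt_out)


def frontend_report_stage(ckpt_in):
    """Stage 3 (front end again): restore the full log and summarize it."""
    log = read_log(ckpt_in)
    return [entry["stage"] for entry in log]


with tempfile.TemporaryDirectory() as tmp:
    build_ckpt = os.path.join(tmp, "build.ckpt")
    test_ckpt = os.path.join(tmp, "test.ckpt")

    frontend_build_stage(build_ckpt)
    # In the real workflow this call would be a blocking batch submission.
    backend_test_stage(build_ckpt, test_ckpt)
    reported = frontend_report_stage(test_ckpt)

print(reported)
```

Because the second checkpoint contains the entries from the first, the reporting stage only needs the last file written by the batch job.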

@hppritcha hppritcha changed the title add support for checkpoint/restart of mtt logfiles [WIP]: add support for checkpoint/restart of mtt logfiles Mar 23, 2021
@hppritcha
Member Author

added WIP to PR name as this PR should come with some examples.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hppritcha hppritcha changed the title [WIP]: add support for checkpoint/restart of mtt logfiles add support for checkpoint/restart of mtt logfiles May 13, 2021
@hppritcha
Member Author

@ribab finally got around to adding some example scripts and ini files showing a way to use this feature.

@hppritcha
Member Author

Any issues with merging this PR into master next week?

@ribab
Contributor

ribab commented Jun 22, 2021

I'm going to test this with our test content to ensure everything still works

@hppritcha
Member Author

Thanks very much Ricky!

@@ -38,7 +40,8 @@
 # @param allocate_cmd Command to use for allocating nodes from the resource manager
 # @param deallocate_cmd Command to use for deallocating nodes from the resource manager
 # @param dependencies List of dependencies specified as the build stage name
-# @}
+# @param checkpoint_file Optional checkpoint file
+@}
Contributor

@ribab ribab Jun 23, 2021

this line needs to be commented out; it is causing a syntax error.

Member Author

think I fixed this.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Contributor

@ribab ribab left a comment

This looks good now -- My test of existing content passed.

@hppritcha hppritcha merged commit 1f5daed into open-mpi:master Jun 24, 2021