Graceful termination of MOOSE with files or signals #12722

Open
andrewritzmann opened this issue Jan 17, 2019 · 2 comments
Labels
C: Framework P: normal A defect affecting operation with a low possibility of significantly affects. T: task An enhancement to the software.

@andrewritzmann

Rationale

Gracefully terminating MOOSE calculations using a signal or a flag file would be extremely useful for avoiding corrupted output files when a user is forced to use a hard kill to terminate a calculation. This would ensure proper checkpoint files and avoid corrupting any data files used for visualization. While sizing jobs to fit within a cluster allocation is desirable, the fact remains that nonlinear convergence rates can change as the calculation progresses, and a best estimate for the end time may turn out to be inappropriate. A graceful stop also helps users avoid timeouts, which could erase data from local scratch directories.

I strongly encourage the use of a signal because both PBS (through the qsig command) and SLURM (through, e.g., the scancel command or #SBATCH --signal=... in the submit script) can deliver one to a running job. If MOOSE terminates cleanly, then the user's submit script can move files as needed before the job times out.
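
A minimal sketch of the signal approach, for illustration only. Nothing here is existing MOOSE code: the flag `g_stop_requested`, the handler, and the `runTimeSteps` loop are hypothetical names, and the choice of SIGTERM and SIGUSR1 (POSIX) is an assumption about what the scheduler would be configured to send.

```cpp
#include <csignal>

// Hypothetical flag: written by the handler, polled by the time-step loop.
volatile std::sig_atomic_t g_stop_requested = 0;

// Keep the handler trivial: setting a sig_atomic_t flag is all it does.
void onStopSignal(int /*signum*/) { g_stop_requested = 1; }

// Which signals to listen for is an assumption; SIGUSR1 is POSIX-only.
void installStopHandlers()
{
  std::signal(SIGTERM, onStopSignal);
  std::signal(SIGUSR1, onStopSignal);
}

// Stand-in for a transient loop (not MOOSE code).
// Returns the number of time steps that were completed.
int runTimeSteps(int max_steps)
{
  int completed = 0;
  for (int step = 0; step < max_steps && !g_stop_requested; ++step)
  {
    // ... solve and output one time step here ...
    ++completed;
  }
  // A real code would write its final outputs and checkpoint here.
  return completed;
}
```

The handler only records the request; the loop notices it at the next step boundary, so no output file is left half-written.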

Description

This is an enhancement request.
For comparison, other codes check for files or file modifications at each time step. Examples:

  1. VASP (Vienna ab Initio Simulation Package) allows users to place a file called STOPCAR in the execution directory that acts as a flag to stop the calculation. It can be used to terminate in different ways depending on the time remaining and the calculation at hand. (https://cms.mpi.univie.ac.at/wiki/index.php/STOPCAR)
  2. OpenFOAM allows the user to change the stopAt entry to "writeNow" (https://cfd.direct/openfoam/user-guide/v6-controldict/), and the calculation responds at the next iteration.
  3. NWChem has the option to integrate libslurm, allowing it to query the queueing system and make an educated guess as to whether it can finish the next step in the calculation. It terminates with a message if it determines that it cannot complete the requested calculation.
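
The flag-file variant used by codes like VASP could be sketched as below. This is an assumption-laden illustration, not MOOSE or VASP code: `stopFileRequested` and `runUntilStopFile` are hypothetical names, and the caller picks the file name. In a parallel run, presumably only one rank would check the file and share the result with the others.

```cpp
#include <filesystem>
#include <system_error>

// True if the user has created the stop file. The error_code overload
// avoids throwing if the path cannot be queried.
bool stopFileRequested(const std::filesystem::path & stop_file)
{
  std::error_code ec;
  return std::filesystem::exists(stop_file, ec);
}

// Stand-in for a transient loop (hypothetical) that polls for the stop
// file once per time step. Returns the number of steps completed.
int runUntilStopFile(int max_steps, const std::filesystem::path & stop_file)
{
  int completed = 0;
  for (int step = 0; step < max_steps; ++step)
  {
    if (stopFileRequested(stop_file))
      break; // leave the loop cleanly; final outputs would be written next
    // ... solve and output one time step here ...
    ++completed;
  }
  return completed;
}
```

Checking once per time step keeps the cost to a single filesystem query per step, at the price of reacting only at step boundaries.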

Impact

This should most strongly impact the internal API of MOOSE. The main execution loop would need to check for the termination mechanism and initiate the appropriate shutdown process. I am not sure how this would influence the MultiApp infrastructure. Documentation of this new feature would be required on the wiki to indicate how the user can make use of it.

Attachment

I have attached the relevant conversation from the moose-users list between myself, Jacob Bair, and Cody Permann.
moose-users_thread_with_Cody_Permann.txt

@permcody permcody added C: Framework T: task An enhancement to the software. P: normal A defect affecting operation with a low possibility of significantly affects. labels Jan 17, 2019
@friedmud
Contributor

Exception handling is already in MOOSE... it does gracefully handle that. This could just piggy-back on that system to handle signals during the solve. It would basically immediately "fail" the current solve and return to the previous timestep, where we could execute FINAL stuff (to get the PerfGraph, etc.) and then end.

If it happens outside of a solve (like during output) then we would just run into it at the start of the next timestep and still do the above.

Sounds good to me.
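
A rough sketch of that control flow, under stated assumptions: `SolveInterrupted`, `solveStep`, and `runTransient` are hypothetical stand-ins rather than MOOSE's actual exception classes or executioner, and the signal is raised from inside the fake solve purely to simulate it arriving mid-solve. Coordinating the exception across MPI ranks is not shown.

```cpp
#include <csignal>
#include <stdexcept>

// Hypothetical exception type used to "fail" the solve in progress.
struct SolveInterrupted : std::runtime_error
{
  SolveInterrupted() : std::runtime_error("termination signal received during solve") {}
};

volatile std::sig_atomic_t g_interrupt_flag = 0;
void flagInterrupt(int /*signum*/) { g_interrupt_flag = 1; }

// Stand-in for one solve. interrupt_at simulates a signal arriving mid-solve.
void solveStep(int step, int interrupt_at)
{
  if (step == interrupt_at)
    std::raise(SIGTERM); // runs flagInterrupt synchronously
  if (g_interrupt_flag)
    throw SolveInterrupted();
}

struct RunResult
{
  int last_accepted_step; // -1 if no step was accepted
  bool interrupted;
  bool final_executed;
};

RunResult runTransient(int num_steps, int interrupt_at = -1)
{
  g_interrupt_flag = 0;
  std::signal(SIGTERM, flagInterrupt);
  RunResult result{-1, false, false};
  try
  {
    for (int step = 0; step < num_steps; ++step)
    {
      solveStep(step, interrupt_at);
      result.last_accepted_step = step; // step accepted only if solve succeeded
    }
  }
  catch (const SolveInterrupted &)
  {
    // The failed step is discarded; state stays at the last accepted step.
    result.interrupted = true;
  }
  // FINAL-style work (performance summary, last outputs) runs either way.
  result.final_executed = true;
  return result;
}
```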

@friedmud
Contributor

friedmud commented Jan 6, 2023

As an addition here - once we can handle a signal and throw one of our parallel_exceptions... we should also implement the ability to write out a checkpoint file at that time.
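
One way the checkpoint-on-termination step might avoid leaving a corrupted file is to write to a temporary file and rename it into place. The sketch below is illustrative only: `CheckpointState`, `writeCheckpoint`, and `readCheckpoint` are hypothetical, and the two-number text format is a placeholder, not MOOSE's checkpoint format.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>

// Hypothetical minimal state; a real checkpoint holds far more.
struct CheckpointState
{
  int step;
  double time;
};

// Write to "<target>.tmp" first, then rename over the target, so an
// interrupted write never leaves a partial file under the final name.
bool writeCheckpoint(const std::filesystem::path & target, const CheckpointState & state)
{
  std::filesystem::path tmp = target;
  tmp += ".tmp";
  {
    std::ofstream out(tmp, std::ios::trunc);
    if (!out)
      return false;
    out << state.step << ' ' << state.time << '\n';
    if (!out.flush())
      return false;
  } // the stream is closed here, before the rename
  std::error_code ec;
  std::filesystem::rename(tmp, target, ec);
  return !ec;
}

// Read the state back; returns false if the file is missing or malformed.
bool readCheckpoint(const std::filesystem::path & source, CheckpointState & state)
{
  std::ifstream in(source);
  return static_cast<bool>(in >> state.step >> state.time);
}
```

A termination handler along the lines discussed above would call something like `writeCheckpoint` once the interrupted solve has been rolled back to the last accepted timestep.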

socratesgorilla added a commit to socratesgorilla/moose that referenced this issue Jan 19, 2023
socratesgorilla added a commit to socratesgorilla/moose that referenced this issue Jan 24, 2023
socratesgorilla added a commit to socratesgorilla/moose that referenced this issue Apr 5, 2023
4 participants