Signal-based checkpoint sometimes outputs unconverged solution #25755

pbehne · 2023-10-16T22:53:12Z

Please assign this issue to me, as I have a fix and am updating tests.

Bug Description

This bug concerns MOOSE's ability to write a checkpoint upon receiving a signal sent with the command kill -s USR1 <PID>. If the signal arrives while MOOSE is in the middle of solving a time step, the checkpoint contains an unconverged solution. When the problem is restarted from this checkpoint, MOOSE assumes that it holds the converged solution for the time step that was in progress when the checkpoint was written, and it solves the next time step using the unconverged solution as the solution of the preceding step. As a result, the solution is incorrect for every time step after the checkpoint.
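
To make the failure mode concrete, here is a toy sketch in Python. It is not MOOSE code: the model problem (backward Euler for du/dt = -u^3 solved by Newton iteration) and all function names are assumptions chosen for illustration. It saves a partially converged iterate as if it were the solution of a time step, continues from it, and compares against an uninterrupted run.

```python
def solve_step(u_old, dt, max_iters=50, tol=1e-12, stop_after=None):
    """Backward Euler step for du/dt = -u**3, solved by Newton iteration.

    Returns (u, converged). If stop_after is given, the solve is cut short
    after that many Newton updates, mimicking a checkpoint written mid-solve.
    """
    u = u_old
    for it in range(max_iters):
        if stop_after is not None and it == stop_after:
            return u, False  # interrupted: u is an unconverged iterate
        residual = u - u_old + dt * u**3
        if abs(residual) < tol:
            return u, True
        u -= residual / (1.0 + 3.0 * dt * u**2)
    return u, abs(u - u_old + dt * u**3) < tol


def run(u0, dt, n_steps):
    """Uninterrupted reference run: every step is solved to convergence."""
    values, u = [], u0
    for _ in range(n_steps):
        u, converged = solve_step(u, dt)
        assert converged
        values.append(u)
    return values


def run_with_bad_restart(u0, dt, n_steps, signal_step, stop_after=1):
    """Run in which step `signal_step` is checkpointed before it converges.

    The restart treats the unconverged iterate as the solution of that step
    and uses it as the starting state for all later steps.
    """
    values = run(u0, dt, signal_step)  # steps before the signal are fine
    u_prev = values[-1] if values else u0
    checkpoint, _ = solve_step(u_prev, dt, stop_after=stop_after)
    values.append(checkpoint)
    u = checkpoint
    for _ in range(n_steps - signal_step - 1):
        u, _ = solve_step(u, dt)
        values.append(u)
    return values


if __name__ == "__main__":
    reference = run(1.0, 0.5, 5)
    recovered = run_with_bad_restart(1.0, 0.5, 5, signal_step=2)
    for step, (ref, rec) in enumerate(zip(reference, recovered)):
        print(step, ref, rec, abs(ref - rec))
```

In this sketch the two runs agree up to the interrupted step and differ at every step from there on, which mirrors the behavior described above.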

Steps to Reproduce

The attached test.txt input file has a postprocessor that reports the average value of the solution at each time step. First, run this input to completion to obtain the correct postprocessor values. During this initial run, create a signal-based checkpoint using the kill command given above. To observe the incorrect behavior, the signal must be sent while MOOSE is in the middle of solving a time step. Next, restart the problem using the --recover command line option. If the checkpoint recorded an unconverged solution, the postprocessor values will differ from those of the initial run for time steps after the signal was sent. If they do not differ, the signal was not sent at the "correct" time. Try again and, if still unsuccessful, increase the mesh refinement to slow the problem down so that the signal is easier to time.
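
The comparison step above can be scripted. The following is a minimal sketch in Python, assuming the postprocessor values of both runs have already been collected into two lists aligned by time step; how they are extracted from MOOSE output is not shown. The helper names request_checkpoint and first_divergence are hypothetical and not part of MOOSE.

```python
import math
import os
import signal


def request_checkpoint(pid):
    """Send SIGUSR1 to the process with the given PID.

    This is the Python equivalent of `kill -s USR1 <PID>` (Unix only).
    """
    os.kill(pid, signal.SIGUSR1)


def first_divergence(reference, recovered, rel_tol=1e-8):
    """Return the index of the first time step whose values differ.

    Both arguments are sequences of postprocessor values, one per time step,
    aligned so that index i refers to the same time step in both runs.
    Returns None if all compared values agree within the tolerance.
    """
    for step, (ref, rec) in enumerate(zip(reference, recovered)):
        if not math.isclose(ref, rec, rel_tol=rel_tol, abs_tol=1e-12):
            return step
    return None
```

If first_divergence returns None, the signal most likely arrived between time steps rather than during a solve, and the attempt should be repeated.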

Impact

Restarting a simulation from a signal-based checkpoint can produce incorrect results, depending on when the signal was sent.

test.txt

lindsayad · 2023-10-17T15:28:07Z

@pbehne what are you charging for this work? The indirect number?

This commit patches the issue (idaholab#25755) where a checkpoint can write an unconverged solution depending on when the USR1 signal is received, thereby affecting the accuracy of recovered simulations. The fix 1) adds logic that ensures a checkpoint is not output unless _current_execute_flag is contained in _execute_on, and 2) enforces that Checkpoint’s ‘execute_on’ parameter may only be set to ‘TIMESTEP_END’ so that only converged solutions are output. Exodiff tests are added to ensure that simulations recovered from signal-based checkpoints result in the same solutions as the uninterrupted simulations.
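
The two parts of the fix described above can be modeled with a short Python sketch. This is a toy model, not the actual MOOSE C++ implementation: the class SignalCheckpoint, its methods, and the idea of holding the signal as a pending request until the next permitted execute flag are assumptions made for illustration. Only the names execute_on and TIMESTEP_END come from the commit message.

```python
TIMESTEP_END = "TIMESTEP_END"


class SignalCheckpoint:
    """Toy model of a checkpoint output that reacts to a signal request."""

    def __init__(self, execute_on=(TIMESTEP_END,)):
        # Part 2 of the fix: only TIMESTEP_END is accepted, so that a
        # checkpoint can only ever contain a converged solution.
        invalid = set(execute_on) - {TIMESTEP_END}
        if invalid:
            raise ValueError(f"execute_on may only be {TIMESTEP_END}, got {sorted(invalid)}")
        self.execute_on = frozenset(execute_on)
        self.signal_pending = False
        self.written_steps = []

    def on_signal(self):
        """Record that a checkpoint was requested; do not write anything yet."""
        self.signal_pending = True

    def output(self, current_execute_flag, step):
        """Write a checkpoint only if one was requested and the flag is allowed.

        Part 1 of the fix: nothing is written unless the current execute flag
        is contained in execute_on. Returns True if a checkpoint was written.
        """
        if not self.signal_pending:
            return False
        if current_execute_flag not in self.execute_on:
            return False  # mid-solve: keep the request pending
        self.written_steps.append(step)
        self.signal_pending = False
        return True


if __name__ == "__main__":
    checkpoint = SignalCheckpoint()
    checkpoint.on_signal()                      # signal arrives mid-solve
    print(checkpoint.output("NONLINEAR", 3))    # request stays pending
    print(checkpoint.output(TIMESTEP_END, 3))   # written once the step is done
```

The flag name "NONLINEAR" in the usage lines is only a stand-in for any execute flag other than TIMESTEP_END.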

pbehne · 2024-01-03T23:12:41Z

@pbehne what are you charging for this work? The indirect number?

@lindsayad: yes.

pbehne added the P: normal and T: defect labels Oct 16, 2023

nmnobre added a commit to farscape-project/moose that referenced this issue Oct 16, 2023

Remove unnecessary include directives (idaholab#25755)

ca641fb

lindsayad assigned pbehne Oct 17, 2023

lindsayad added the C: Framework label Oct 17, 2023

pbehne mentioned this issue Oct 18, 2023

Signal checkpoint patch #25773

Merged

loganharbour closed this as completed in #25773 Jan 3, 2024
