Signal-based checkpoint sometimes outputs unconverged solution #25755

Closed
pbehne opened this issue Oct 16, 2023 · 2 comments · Fixed by #25773
Labels: C: Framework · P: normal · T: defect

pbehne (Contributor) commented Oct 16, 2023


Please assign this issue to me, as I have a fix and am updating tests.


Bug Description

This bug concerns MOOSE's ability to write a checkpoint on receipt of a signal sent with `kill -s USR1 <PID>`. If the signal arrives while MOOSE is in the middle of solving a time step, the checkpoint contains an unconverged solution. When the problem is restarted from this checkpoint, MOOSE assumes that it holds the converged solution for the time step at which it was written, and solves the next time step using the unconverged solution as the solution for the preceding time step. As a result, the solution is incorrect for every time step after the checkpoint.
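To illustrate why a mid-solve checkpoint corrupts every later time step, here is a minimal toy sketch in Python. It is not MOOSE code: the model problem (backward Euler for du/dt = -u, solved by fixed-point iteration), the function names, and the `stop_after` parameter are all assumptions made for illustration.

```python
def solve_step(u_old, dt=0.5, max_iters=200, tol=1e-12, stop_after=None):
    """One backward-Euler step of du/dt = -u, solved by fixed-point iteration.

    Returns the converged value, or the current (unconverged) iterate if
    `stop_after` iterations have been done. `stop_after` stands in for a
    signal arriving in the middle of the solve.
    """
    u = u_old
    for k in range(1, max_iters + 1):
        u_next = u_old - dt * u          # fixed-point update for u = u_old - dt*u
        converged = abs(u_next - u) < tol
        u = u_next
        if converged or (stop_after is not None and k >= stop_after):
            break
    return u


def run(u_start, n_steps):
    """Advance n_steps fully converged time steps and return their values."""
    history = []
    u = u_start
    for _ in range(n_steps):
        u = solve_step(u)
        history.append(u)
    return history


# Uninterrupted reference run: four time steps.
reference = run(1.0, 4)

# Checkpoint written at the end of step 2 (converged), then two more steps.
good_restart = run(reference[1], 2)

# Checkpoint written after a single iteration of step 2 (unconverged),
# then treated on restart as if it were the converged step-2 solution.
bad_checkpoint = solve_step(reference[0], stop_after=1)
bad_restart = run(bad_checkpoint, 2)

print("reference steps 3-4:", reference[2:])
print("good restart       :", good_restart)
print("bad restart        :", bad_restart)
```

In this toy problem the restart from the converged checkpoint reproduces the reference values, while the restart from the unconverged iterate differs at every subsequent step.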

Steps to Reproduce

The attached test.txt input file has a postprocessor that reports the average value of the solution at each time step.

1. Run the input in its entirety to obtain the correct postprocessor values. During this initial run, create a signal-based checkpoint with `kill -s USR1 <PID>`. To trigger the incorrect behavior, send the signal while MOOSE is in the middle of solving a time step.
2. Restart the problem using the `--recover` command-line option.
3. Compare the postprocessor values with those of the initial run. If the checkpoint recorded an unconverged solution, the values differ for the time steps after the signal was sent.

If the values do not differ, the signal was not sent at the "correct" time. Try again; if that still fails, increase the mesh refinement so that each time step takes longer to solve and the signal is easier to time.
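The comparison of postprocessor values can be automated. The following is a minimal sketch, assuming the values from both runs have already been loaded into two Python lists covering the same time steps. The function name, the tolerances, and the numbers are hypothetical and are not output from test.txt.

```python
import math


def first_divergence(reference, recovered, rel_tol=1e-8, abs_tol=1e-12):
    """Return the index of the first time step whose values disagree.

    Returns None when every pair of values agrees within the tolerances.
    """
    if len(reference) != len(recovered):
        raise ValueError("both runs must report the same number of time steps")
    for step, (ref, rec) in enumerate(zip(reference, recovered)):
        if not math.isclose(ref, rec, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Illustrative numbers only: the recovered run departs from the
# reference at index 3, i.e. after the (hypothetical) signal was sent.
reference_pp = [0.10, 0.19, 0.27, 0.34, 0.40]
recovered_pp = [0.10, 0.19, 0.27, 0.31, 0.36]

step = first_divergence(reference_pp, recovered_pp)
if step is None:
    print("runs agree; the signal probably missed the solve")
else:
    print(f"runs diverge at time step index {step}")
```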

Impact

Restarting a simulation from a signal-based checkpoint can produce incorrect results, depending on when the signal was sent.

test.txt

pbehne added the P: normal and T: defect labels Oct 16, 2023
nmnobre added a commit to farscape-project/moose that referenced this issue Oct 16, 2023
lindsayad (Member) commented

@pbehne what are you charging for this work? The indirect number?

pbehne added a commit to pbehne/moose that referenced this issue Oct 18, 2023
This commit patches the issue (idaholab#25755) where a checkpoint can write an
unconverged solution depending on when the USR1 signal is received, thereby
affecting the accuracy of recovered simulations. The fix 1) adds logic that
ensures a checkpoint is not output unless _current_execute_flag is contained
in _execute_on, and 2) enforces that Checkpoint’s ‘execute_on’ parameter may
only be set to ‘TIMESTEP_END’ so that only converged solutions are output.
Exodiff tests are added to ensure that simulations recovered from signal-based
checkpoints result in the same solutions as the uninterrupted simulations.
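The gating described in this commit message can be sketched roughly as follows. This is an illustrative Python model, not the MOOSE implementation (which is C++): the class, the enum, and the method names are hypothetical stand-ins. Keeping the signal request pending until the next `TIMESTEP_END` is an assumption of this sketch; the commit message states only that nothing is written unless the current execute flag is contained in `execute_on`.

```python
from enum import Enum, auto


class ExecFlag(Enum):
    """Hypothetical stand-ins for the framework's execution flags."""
    INITIAL = auto()
    NONLINEAR = auto()      # somewhere inside the solve of a time step
    TIMESTEP_END = auto()   # the time step has been solved


class CheckpointSketch:
    def __init__(self, execute_on=(ExecFlag.TIMESTEP_END,)):
        execute_on = frozenset(execute_on)
        # Part 2 of the fix: only TIMESTEP_END is accepted.
        if execute_on != {ExecFlag.TIMESTEP_END}:
            raise ValueError("execute_on may only be set to TIMESTEP_END")
        self._execute_on = execute_on
        self._signal_pending = False

    def on_signal(self):
        """Record the request only; never write from the signal handler."""
        self._signal_pending = True

    def should_output(self, current_execute_flag):
        # Part 1 of the fix: no output unless the current flag is in execute_on.
        if current_execute_flag not in self._execute_on:
            return False
        if self._signal_pending:
            self._signal_pending = False   # assumption: one checkpoint per signal
            return True
        return False


ckpt = CheckpointSketch()
ckpt.on_signal()                                        # signal arrives mid-solve
mid_solve = ckpt.should_output(ExecFlag.NONLINEAR)      # refused
at_end = ckpt.should_output(ExecFlag.TIMESTEP_END)      # written once converged
afterwards = ckpt.should_output(ExecFlag.TIMESTEP_END)  # request already consumed
print(mid_solve, at_end, afterwards)
```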
pbehne (Contributor, Author) commented Jan 3, 2024

> @pbehne what are you charging for this work? The indirect number?

@lindsayad: yes.

maxnezdyur pushed a commit to maxnezdyur/moose that referenced this issue Jan 5, 2024