Added realtime checkpoint autosaves after 10 minutes #23989

socratesgorilla · 2023-04-05T23:55:26Z

Closes #12722

framework/src/outputs/Checkpoint.C

permcody

Add a test object in moose/test/<src|include>/outputs

moosebuild · 2023-04-06T17:16:16Z

Job Documentation on b59816b wanted to post the following:

View the site here

This comment will be updated on new commits.

moosebuild · 2023-04-06T22:47:16Z

Job Coverage on b59816b wanted to post the following:

Framework coverage

	f0c82f	#23989 b59816
	Total	Total	+/-	New
Rate	85.08%	85.08%	+0.00%	100.00%
Hits	86219	86233	+14	15
Misses	15118	15118	-	0

Modules coverage

Geochemistry

	f0c82f	#23989 b59816
	Total	Total	+/-	New
Rate	96.75%	96.38%	-0.37%	-
Hits	4999	4980	-19	0
Misses	168	187	+19	0

Misc

	f0c82f	#23989 b59816
	Total	Total	+/-	New
Rate	84.80%	34.31%	-50.49%	-
Hits	173	70	-103	0
Misses	31	134	+103	0

Phase field

	f0c82f	#23989 b59816
	Total	Total	+/-	New
Rate	89.57%	86.27%	-3.31%	-
Hits	12519	12057	-462	0
Misses	1457	1919	+462	0

Porous flow

	f0c82f	#23989 b59816
	Total	Total	+/-	New
Rate	96.26%	96.25%	-0.01%	-
Hits	10687	10686	-1	0
Misses	415	416	+1	0

Richards

	f0c82f	#23989 b59816
	Total	Total	+/-	New
Rate	97.49%	94.75%	-2.74%	-
Hits	3341	3247	-94	0
Misses	86	180	+94	0

Tensor mechanics

	f0c82f	#23989 b59816
	Total	Total	+/-	New
Rate	89.18%	85.21%	-3.97%	-
Hits	26710	25521	-1189	0
Misses	3242	4431	+1189	0

This comment will be updated on new commits.

moosebuild · 2023-04-12T19:39:58Z

Job Precheck on e2fe421 wanted to post the following:

Your code requires style changes.

A patch was auto generated and copied here
You can directly apply the patch by running, in the top level of your repository:

curl -s https://mooseframework.inl.gov/docs/PRs/23989/clang_format/style.patch | git apply -v

Alternatively, with your repository up to date and in the top level of your repository:

git clang-format f0c82f7eda772c976247c8b501be080b8380ceb5

moosebuild · 2023-04-12T22:19:54Z

Job Coverage on b59816b wanted to post the following:

The following coverage requirement(s) failed:

misc coverage rate 34.31% is less than the required 55.0%
phase_field coverage rate 86.27% is less than the required 89.0%
richards coverage rate 94.75% is less than the required 97.0%
tensor_mechanics coverage rate 85.21% is less than the required 86.0%

permcody

Lots of comments, but this is close!

permcody · 2023-04-13T14:35:24Z

framework/src/outputs/Checkpoint.C

@@ -47,13 +47,18 @@ Checkpoint::validParams()

  // Advanced settings
  params.addParam<bool>("binary", true, "Toggle the output of binary files");
+  params.addParam<int>(


Suggested change

-  params.addParam<int>(
+  params.addParam<unsigned int>(

framework/src/outputs/Checkpoint.C

permcody · 2023-04-13T14:39:03Z

framework/include/outputs/Checkpoint.h

@@ -124,6 +130,9 @@ class Checkpoint : public FileOutput
  /// Vector of checkpoint filename structures
  std::deque<CheckpointFileNames> _file_names;

+  /// Starting time compared against to see if we should automatically print out a checkpoint
+  std::chrono::time_point<std::chrono::steady_clock> start_time;


Suggested change

-  std::chrono::time_point<std::chrono::steady_clock> start_time;
+  std::chrono::time_point<std::chrono::steady_clock> _start_time;

permcody · 2023-04-13T14:39:19Z

framework/src/outputs/Checkpoint.C

+void
+Checkpoint::initialSetup()
+{
+  start_time = std::chrono::steady_clock::now();


Suggested change

-  start_time = std::chrono::steady_clock::now();
+  _start_time = std::chrono::steady_clock::now();

permcody · 2023-04-13T14:43:10Z

framework/src/outputs/Checkpoint.C

+  // Print only on timestep end to avoid weird issues
+  if (elapsed >= _autosave_time_interval and type == EXEC_TIMESTEP_END)
+  {
+    start_time = std::chrono::steady_clock::now();


_start_time is a bit of a misnomer - should we call it _time_since_autosave or something similar?

Also, I would argue that we should reset this timer every time we checkpoint, regardless of how that checkpoint was triggered. If we are already writing checkpoints every 5 time steps, we may never end up writing an autosave checkpoint at all, because the regular ones are coming out frequently enough.

socratesgorilla · 2023-04-14

I'm not sure if it's a misnomer per se, but it may cause confusion, since Output objects already contain about 8 variables that sound relatively similar. _time_since_autosave works for me.

My only concern re: the reset every checkpoint is that we are sure that's the behavior we want before I commit the changes.

permcody · 2023-04-14

Yeah, that makes a lot of sense. For a single instance (honestly, you'd be hard-pressed to find a good reason for multiple instances IMO), you really only want to receive checkpoints frequently enough that you don't waste a bunch of CPU time if something goes wrong. For checkpointing, we don't typically care which time step number these saves come out on. For output purposes, we care a lot!
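The reset-on-every-checkpoint behavior discussed in this thread can be sketched as a small wall-clock timer. This is only an illustration under stated assumptions: AutosaveTimer, due() and reset() are hypothetical names and are not part of MOOSE or of this PR's diff. The sketch reuses std::chrono::steady_clock from the diff and lets the caller pass in the current time so the logic can be checked without sleeping.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of MOOSE): tracks wall time since the last
// checkpoint of any kind and reports when an autosave would be due.
class AutosaveTimer
{
public:
  using Clock = std::chrono::steady_clock;

  AutosaveTimer(std::chrono::seconds interval, Clock::time_point now = Clock::now())
    : _interval(interval), _last_checkpoint(now)
  {
    assert(interval.count() > 0); // a zero or negative interval makes no sense here
  }

  // True once at least one full interval has elapsed since the last checkpoint.
  bool due(Clock::time_point now = Clock::now()) const
  {
    return now - _last_checkpoint >= _interval;
  }

  // Call after every checkpoint write, whether it was a regular or an autosave one.
  void reset(Clock::time_point now = Clock::now()) { _last_checkpoint = now; }

private:
  std::chrono::seconds _interval;
  Clock::time_point _last_checkpoint;
};
```

In this sketch a caller would check due() at the end of a time step, write the checkpoint, and then call reset(). Calling reset() after regular checkpoints as well is what keeps the autosave from firing when checkpoints are already frequent.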

permcody · 2023-04-13T14:48:03Z

test/tests/outputs/checkpoint/tests

+    input = autosavetimer.i
+    recover = false
+    max_parallel = 1
+    max_threads = 1


Why the restrictions? Shouldn't this work perfectly fine with threads or MPI?

socratesgorilla · 2023-04-14

Combining my response with the comment below -- the test file we are using is basically the same one used in the signal handler tests. Part of the reason I used those settings on those tests was to prevent the tests from running for a long time.

I can locally check how much time it takes to run these with parallel and threads and see.

permcody · 2023-04-13T14:48:30Z

test/tests/outputs/checkpoint/tests

+    cli_args = "Outputs/cp/test_system=true"
+    recover = false
+    max_parallel = 1
+    max_threads = 1


Same as above, why the restrictions?

permcody · 2023-04-13T14:49:23Z

test/tests/outputs/checkpoint/tests

+    expect_out = "1 seconds have elapsed, autosaving checkpoint..."
+    input = autosavetimer.i
+    cli_args = "Outputs/cp/test_system=true"
+    recover = false


I would prefer to even see this working with recover. That might mean you need to make the test longer since it'll be cut in half by the recover testing. Try it out and see.

socratesgorilla · 2023-04-14

Is recover recovering from the checkpoint objects created in the test spec, or is this that thing that runs it for half the time on its own?

If it's the second, I will admit I don't understand what the benefit of the recovery testing is -- after all, we don't actually care about the solve.

permcody · 2023-04-14

recover = false in the test spec simply means that you don't want the TestHarness to split this test into two halves on its own. It does make sense to skip it if you've explicitly set up a recover test. However, that's not what you've done here. A manual recover test typically has to not solve the whole problem in the first half (either by overriding the number of steps on the command line, or by passing the special --half-transient flag). Then the second half would pass --recover on the command line to read the checkpoint and make sure the test continues.

Now that I think about this a bit more, we probably don't want the TestHarness to help us with the split here, as we might not pick up the right checkpoint file (if we are trying to verify that we can recover from the one written out by autosave). With your testing, you are indeed checking that the screen output is working, which is good. What we aren't testing, though, is whether the checkpoint file actually works.

permcody · 2023-04-13T14:51:42Z

test/include/outputs/TestAutosaveTimedCheckpoint.h

+
+#include "Checkpoint.h"
+
+class TestAutosaveTimedCheckpoint : public Checkpoint


Wait, do we need this object? I can see your comment about testing the autosave rather than a regular checkpoint, but I don't believe you need this. Checkpoint is doing both, we should be able to test both with that object.

socratesgorilla · 2023-04-14

Combining the response to the bottom question with this comment -- the reason this test object exists is to test that the system checkpoint (the one created by the signal handler) is able to write the timed checkpoint we are adding now. Since we decided against the shortcut syntax, there's normally no way to modify the amount of time it takes for the system checkpoint to write that timed output.

So, instead, we have this test object that we can make operate exactly like a system checkpoint -- we set the TestAutosaveTimedCheckpoint object to have the same _is_autosave flag as the system checkpoint, and then change the interval to 1 second so we can test the system checkpoint is outputting correctly. This is the easiest way to verify the automatically created checkpoint is working with the wall time output correctly.

I don't know if there's a good way to avoid the test object here to accomplish this. Obviously open to suggestions if you have any, but that's the justification for its existence.

permcody · 2023-04-14

OK fair, but then wouldn't I expect to see one of your tests use type = Checkpoint (i.e. an explicit Checkpoint instance that does autosaving too)? Maybe I'm still missing something.

socratesgorilla · 2023-04-14

The TestAutosaveTimedCheckpoint is just a derived Checkpoint object, so there's no difference between it and a normal checkpoint if you don't pass test_system=true.
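The idea behind the test object can be sketched with stand-in classes. This is not the code from the PR: CheckpointSketch is a hypothetical stand-in for the real Checkpoint class, the accessor names are invented, and the defaults (autosave flag off, 600 second interval taken from the PR title) are assumptions. Only _is_autosave, _autosave_time_interval, test_system and the 1 second test interval come from the discussion above.

```cpp
#include <chrono>

// Stand-in for the real Checkpoint output (assumption: only the two members
// discussed in the thread are modeled here).
class CheckpointSketch
{
public:
  virtual ~CheckpointSketch() = default;

  bool isAutosave() const { return _is_autosave; }
  std::chrono::seconds autosaveTimeInterval() const { return _autosave_time_interval; }

protected:
  // Assumed defaults: a user-declared checkpoint is not the system autosave,
  // and the autosave interval is 10 minutes.
  bool _is_autosave = false;
  std::chrono::seconds _autosave_time_interval{600};
};

// Derived test object: behaves like the base class unless test_system is set,
// in which case it mimics the system checkpoint with a 1 second interval.
class TestAutosaveTimedCheckpointSketch : public CheckpointSketch
{
public:
  explicit TestAutosaveTimedCheckpointSketch(bool test_system)
  {
    if (test_system)
    {
      _is_autosave = true;
      _autosave_time_interval = std::chrono::seconds(1);
    }
  }
};
```

With test_system left off, the derived object in this sketch is indistinguishable from the base class, which is the point made in the reply above.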

permcody · 2023-04-13T14:54:04Z

test/tests/outputs/checkpoint/autosavetimer.i

+
+[Outputs]
+  [cp]
+    type = TestAutosaveTimedCheckpoint


This is the type of syntax I want to see, but it should just be Checkpoint right? We can and should add a whole sub-block like this for testing, overriding the autosave time - I don't have any issues with that. I don't believe we need a separate test object to verify that the default works if the override value works though.

github-actions · 2023-07-24T05:39:52Z

This pull request has been automatically marked as stale because it has not had recent activity in the last 100 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

socratesgorilla requested a review from roystgnr as a code owner April 5, 2023 23:55

socratesgorilla changed the title from "Added realtime checkpoint autosaves after 10 minutes #12722" to "Added realtime checkpoint autosaves after 10 minutes" Apr 5, 2023

socratesgorilla marked this pull request as draft April 5, 2023 23:56

Added realtime checkpoint autosaves after 10 minutes idaholab#12722

b557746

socratesgorilla force-pushed the walltimecheckpoint branch from 37669b5 to b557746 Compare April 5, 2023 23:59

permcody reviewed Apr 6, 2023

framework/src/outputs/Checkpoint.C (outdated)

permcody reviewed Apr 6, 2023

framework/src/outputs/Checkpoint.C (outdated)

permcody requested changes Apr 6, 2023

roystgnr reviewed Apr 6, 2023

framework/src/outputs/Checkpoint.C (outdated)

socratesgorilla force-pushed the walltimecheckpoint branch from a90e3f6 to b9e94b0 Compare April 6, 2023 18:39

Changed autosave timer to be a modifiable param, added tests for autosave timer

6301e6e

socratesgorilla force-pushed the walltimecheckpoint branch 3 times, most recently from 6d12d40 to 24ae5ab Compare April 6, 2023 20:37

GiudGiud assigned permcody Apr 7, 2023

socratesgorilla force-pushed the walltimecheckpoint branch from 24ae5ab to e2fe421 Compare April 12, 2023 19:35

Made autosave interval modifiable

b59816b

socratesgorilla force-pushed the walltimecheckpoint branch from e2fe421 to b59816b Compare April 12, 2023 19:40

socratesgorilla marked this pull request as ready for review April 12, 2023 20:44

permcody requested changes Apr 13, 2023

github-actions bot added the "stale" label (PRs that have reached or exceeded 90 days with no activity) Jul 24, 2023

github-actions bot closed this Aug 1, 2023

permcody assigned pbehne Oct 11, 2023
