-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added realtime checkpoint autosaves after 10 minutes #23989
Conversation
37669b5
to
b557746
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test object in moose/test/<src|include>/outputs
Job Documentation on b59816b wanted to post the following: View the site here This comment will be updated on new commits. |
a90e3f6
to
b9e94b0
Compare
6d12d40
to
24ae5ab
Compare
Job Coverage on b59816b wanted to post the following: Framework coverage
Modules coverageGeochemistry
Misc
Phase field
Porous flow
Richards
Tensor mechanics
Full coverage reportsReports
This comment will be updated on new commits. |
24ae5ab
to
e2fe421
Compare
Job Precheck on e2fe421 wanted to post the following: Your code requires style changes. A patch was auto generated and copied here
Alternatively, with your repository up to date and in the top level of your repository:
|
e2fe421
to
b59816b
Compare
Job Coverage on b59816b wanted to post the following: The following coverage requirement(s) failed:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of comments, but this is close!
@@ -47,13 +47,18 @@ Checkpoint::validParams() | |||
|
|||
// Advanced settings | |||
params.addParam<bool>("binary", true, "Toggle the output of binary files"); | |||
params.addParam<int>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
params.addParam<int>( | |
params.addParam<unsigned int>( |
@@ -124,6 +130,9 @@ class Checkpoint : public FileOutput | |||
/// Vector of checkpoint filename structures | |||
std::deque<CheckpointFileNames> _file_names; | |||
|
|||
/// Starting time compared against to see if we should automatically print out a checkpoint | |||
std::chrono::time_point<std::chrono::steady_clock> start_time; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::chrono::time_point<std::chrono::steady_clock> start_time; | |
std::chrono::time_point<std::chrono::steady_clock> _start_time; |
void | ||
Checkpoint::initialSetup() | ||
{ | ||
start_time = std::chrono::steady_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
start_time = std::chrono::steady_clock::now(); | |
_start_time = std::chrono::steady_clock::now(); |
// Print only on timestep end to avoid weird issues | ||
if (elapsed >= _autosave_time_interval and type == EXEC_TIMESTEP_END) | ||
{ | ||
start_time = std::chrono::steady_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_start_time
is a bit of a misnomer - should we call it _time_since_autosave
or something similar?
Also, I would argue that we should reset this time each time we checkpoint regardless of how that occurred. If we are outputting every 5 time steps and the checkpoint is on. We may never end up checkpointing to autosave if those values are coming out frequently enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if its a misnomer per se, but it may cause confusion since Output
objects contain like 8 variables already that sound relatively similar. _time_since_autosave
works for me.
My only concern re: the reset every checkpoint is that we are sure that's the behavior we want before I commit the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes a lot of sense. For a single instance (honestly, you'd be hard pressed to really find a good reason for multiple instances IMO), you really only want to receive outputs frequently enough that you don't have to waste a bunch of CPU time if something goes wrong. For checkpointing, we don't typically care about what time step number these saves come out on. For output purposes, we care a lot!
input = autosavetimer.i | ||
recover = false | ||
max_parallel = 1 | ||
max_threads = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the restrictions? Shouldn't this work perfectly fine with threads or MPI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combining response with bottom comment -- the test file we are using is basically the same one used in the signal handler tests. Part of the reason I used those commands on those tests was to prevent the tests from running for a long time.
I can locally check how much time it takes to run these with parallel and threads and see.
cli_args = "Outputs/cp/test_system=true" | ||
recover = false | ||
max_parallel = 1 | ||
max_threads = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, why the restrictions?
expect_out = "1 seconds have elapsed, autosaving checkpoint..." | ||
input = autosavetimer.i | ||
cli_args = "Outputs/cp/test_system=true" | ||
recover = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to even see this working with recover. That might mean you need to make the test longer since it'll be cut in half by the recover testing. Try it out and see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is recover recovering from the checkpoint objects created in the test spec, or is this that thing that runs it for half the time on its own?
If its the second, I will admit I don't understand what the benefit of the recovery testing is -- after all, we don't actually care about the solve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recover
in the test spec simply means that you don't want the TestHarness to manually split this test into two halves. It does make sense to skip it if you've explicitly setup a recover test. However, that's not what you've done here. A manual recover test typically has to not solve the whole problem on the first half (either by overriding the number of steps on the command line, or by passing the special --half-transient
flag. Then the second half would pass --recover
on the command line to read the checkpoint and make sure the test continues.
Now that I think about this a bit more, we probably don't want the TestHarness to help us with the split here as we might not pick up the right checkpoint file (if we are trying to verify that we can recover from the one written out by autosave). With your testing, you are indeed checking that the screen output is working, which is good. What we aren't testing though is if the checkpoint file actually works.
|
||
#include "Checkpoint.h" | ||
|
||
class TestAutosaveTimedCheckpoint : public Checkpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, do we need this object? I can see your comment about testing the autosave rather than a regular checkpoint, but I don't believe you need this. Checkpoint is doing both, we should be able to test both with that object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combining the response to the bottom question with this comment -- the reason this test object exists is to test that the system checkpoint (the one created by the signal handler) is able to output the timed checkpoint we are doing now. Since we decided against the shortcut syntax, that means there's normally no way to modify the amount of time it takes for the system checkpoint to output the timed output we are performing here.
So, instead, we have this test object that we can make operate exactly like a system checkpoint -- we set the TestAutosaveTimedCheckpoint
object to have the same _is_autosave
flag as the system checkpoint, and then change the interval to 1 second so we can test the system checkpoint is outputting correctly. This is the easiest way to verify the automatically created checkpoint is working with the wall time output correctly.
I don't know if there's a good way to avoid the test object here to accomplish this. Obviously open to suggestions if you have any, but that's the justification for its existence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK fair, but then wouldn't I expect to see one of your tests use type = Checkpoint
(i.e. an explicit Checkpoint instance that does autosaving too)? Maybe I'm still missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TestAutosaveTimedCheckpoint
is just a derived Checkpoint object so there's no difference between it and a normal checkpoint if you don't pass the test_system=true
|
||
[Outputs] | ||
[cp] | ||
type = TestAutosaveTimedCheckpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the type of syntax I want to see, but it should just be Checkpoint
right? We can and should add a whole sub-block like this for testing, overriding the autosave time - I don't have any issues with that. I don't believe we need a separate test object to verify that the default works if the override value works though.
This pull request has been automatically marked as stale because it has not had recent activity in the last 100 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions. |
Closes #12722