Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added realtime checkpoint autosaves after 10 minutes #23989

Closed

Conversation

socratesgorilla
Copy link
Contributor

@socratesgorilla socratesgorilla commented Apr 5, 2023

Closes #12722

@socratesgorilla socratesgorilla changed the title Added realtime checkpoint autosaves after 10 minutes #12722 Added realtime checkpoint autosaves after 10 minutes Apr 5, 2023
@socratesgorilla socratesgorilla marked this pull request as draft April 5, 2023 23:56
Copy link
Member

@permcody permcody left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test object in moose/test/<src|include>/outputs

@moosebuild
Copy link
Contributor

moosebuild commented Apr 6, 2023

Job Documentation on b59816b wanted to post the following:

View the site here

This comment will be updated on new commits.

@socratesgorilla socratesgorilla force-pushed the walltimecheckpoint branch 3 times, most recently from 6d12d40 to 24ae5ab Compare April 6, 2023 20:37
@moosebuild
Copy link
Contributor

moosebuild commented Apr 6, 2023

Job Coverage on b59816b wanted to post the following:

Framework coverage

f0c82f #23989 b59816
Total Total +/- New
Rate 85.08% 85.08% +0.00% 100.00%
Hits 86219 86233 +14 15
Misses 15118 15118 - 0

Diff coverage report

Full coverage report

Modules coverage

Geochemistry

f0c82f #23989 b59816
Total Total +/- New
Rate 96.75% 96.38% -0.37% -
Hits 4999 4980 -19 0
Misses 168 187 +19 0

Diff coverage report

Full coverage report

Misc

f0c82f #23989 b59816
Total Total +/- New
Rate 84.80% 34.31% -50.49% -
Hits 173 70 -103 0
Misses 31 134 +103 0

Diff coverage report

Full coverage report

Phase field

f0c82f #23989 b59816
Total Total +/- New
Rate 89.57% 86.27% -3.31% -
Hits 12519 12057 -462 0
Misses 1457 1919 +462 0

Diff coverage report

Full coverage report

Porous flow

f0c82f #23989 b59816
Total Total +/- New
Rate 96.26% 96.25% -0.01% -
Hits 10687 10686 -1 0
Misses 415 416 +1 0

Diff coverage report

Full coverage report

Richards

f0c82f #23989 b59816
Total Total +/- New
Rate 97.49% 94.75% -2.74% -
Hits 3341 3247 -94 0
Misses 86 180 +94 0

Diff coverage report

Full coverage report

Tensor mechanics

f0c82f #23989 b59816
Total Total +/- New
Rate 89.18% 85.21% -3.97% -
Hits 26710 25521 -1189 0
Misses 3242 4431 +1189 0

Diff coverage report

Full coverage report

Full coverage reports

Reports

This comment will be updated on new commits.

@moosebuild
Copy link
Contributor

Job Precheck on e2fe421 wanted to post the following:

Your code requires style changes.

A patch was auto generated and copied here
You can directly apply the patch by running, in the top level of your repository:

curl -s https://mooseframework.inl.gov/docs/PRs/23989/clang_format/style.patch | git apply -v

Alternatively, with your repository up to date and in the top level of your repository:

git clang-format f0c82f7eda772c976247c8b501be080b8380ceb5

@socratesgorilla socratesgorilla marked this pull request as ready for review April 12, 2023 20:44
@moosebuild
Copy link
Contributor

Job Coverage on b59816b wanted to post the following:

The following coverage requirement(s) failed:

  • misc coverage rate 34.31% is less than the required 55.0%
  • phase_field coverage rate 86.27% is less than the required 89.0%
  • richards coverage rate 94.75% is less than the required 97.0%
  • tensor_mechanics coverage rate 85.21% is less than the required 86.0%

Copy link
Member

@permcody permcody left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of comments, but this is close!

@@ -47,13 +47,18 @@ Checkpoint::validParams()

// Advanced settings
params.addParam<bool>("binary", true, "Toggle the output of binary files");
params.addParam<int>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
params.addParam<int>(
params.addParam<unsigned int>(

framework/src/outputs/Checkpoint.C Show resolved Hide resolved
@@ -124,6 +130,9 @@ class Checkpoint : public FileOutput
/// Vector of checkpoint filename structures
std::deque<CheckpointFileNames> _file_names;

/// Starting time compared against to see if we should automatically print out a checkpoint
std::chrono::time_point<std::chrono::steady_clock> start_time;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::chrono::time_point<std::chrono::steady_clock> start_time;
std::chrono::time_point<std::chrono::steady_clock> _start_time;

void
Checkpoint::initialSetup()
{
start_time = std::chrono::steady_clock::now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
start_time = std::chrono::steady_clock::now();
_start_time = std::chrono::steady_clock::now();

// Print only on timestep end to avoid weird issues
if (elapsed >= _autosave_time_interval and type == EXEC_TIMESTEP_END)
{
start_time = std::chrono::steady_clock::now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_start_time is a bit of a misnomer - should we call it _time_since_autosave or something similar?

Also, I would argue that we should reset this time each time we checkpoint regardless of how that occurred. If we are outputting every 5 time steps and the checkpoint is on. We may never end up checkpointing to autosave if those values are coming out frequently enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if its a misnomer per se, but it may cause confusion since Output objects contain like 8 variables already that sound relatively similar. _time_since_autosave works for me.

My only concern re: the reset every checkpoint is that we are sure that's the behavior we want before I commit the changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes a lot of sense. For a single instance (honestly, you'd be hard pressed to really find a good reason for multiple instances IMO), you really only want to receive outputs frequently enough that you don't have to waste a bunch of CPU time if something goes wrong. For checkpointing, we don't typically care about what time step number these saves come out on. For output purposes, we care a lot!

input = autosavetimer.i
recover = false
max_parallel = 1
max_threads = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the restrictions? Shouldn't this work perfectly fine with threads or MPI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining response with bottom comment -- the test file we are using is basically the same one used in the signal handler tests. Part of the reason I used those commands on those tests was to prevent the tests from running for a long time.

I can locally check how much time it takes to run these with parallel and threads and see.

cli_args = "Outputs/cp/test_system=true"
recover = false
max_parallel = 1
max_threads = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, why the restrictions?

expect_out = "1 seconds have elapsed, autosaving checkpoint..."
input = autosavetimer.i
cli_args = "Outputs/cp/test_system=true"
recover = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to even see this working with recover. That might mean you need to make the test longer since it'll be cut in half by the recover testing. Try it out and see.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is recover recovering from the checkpoint objects created in the test spec, or is this that thing that runs it for half the time on its own?

If its the second, I will admit I don't understand what the benefit of the recovery testing is -- after all, we don't actually care about the solve.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recover in the test spec simply means that you don't want the TestHarness to manually split this test into two halves. It does make sense to skip it if you've explicitly setup a recover test. However, that's not what you've done here. A manual recover test typically has to not solve the whole problem on the first half (either by overriding the number of steps on the command line, or by passing the special --half-transient flag. Then the second half would pass --recover on the command line to read the checkpoint and make sure the test continues.

Now that I think about this a bit more, we probably don't want the TestHarness to help us with the split here as we might not pick up the right checkpoint file (if we are trying to verify that we can recover from the one written out by autosave). With your testing, you are indeed checking that the screen output is working, which is good. What we aren't testing though is if the checkpoint file actually works.


#include "Checkpoint.h"

class TestAutosaveTimedCheckpoint : public Checkpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, do we need this object? I can see your comment about testing the autosave rather than a regular checkpoint, but I don't believe you need this. Checkpoint is doing both, we should be able to test both with that object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining the response to the bottom question with this comment -- the reason this test object exists is to test that the system checkpoint (the one created by the signal handler) is able to output the timed checkpoint we are doing now. Since we decided against the shortcut syntax, that means there's normally no way to modify the amount of time it takes for the system checkpoint to output the timed output we are performing here.

So, instead, we have this test object that we can make operate exactly like a system checkpoint -- we set the TestAutosaveTimedCheckpoint object to have the same _is_autosave flag as the system checkpoint, and then change the interval to 1 second so we can test the system checkpoint is outputting correctly. This is the easiest way to verify the automatically created checkpoint is working with the wall time output correctly.

I don't know if there's a good way to avoid the test object here to accomplish this. Obviously open to suggestions if you have any, but that's the justification for its existence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK fair, but then wouldn't I expect to see one of your tests use type = Checkpoint (i.e. an explicit Checkpoint instance that does autosaving too)? Maybe I'm still missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TestAutosaveTimedCheckpoint is just a derived Checkpoint object so there's no difference between it and a normal checkpoint if you don't pass the test_system=true


[Outputs]
[cp]
type = TestAutosaveTimedCheckpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the type of syntax I want to see, but it should just be Checkpoint right? We can and should add a whole sub-block like this for testing, overriding the autosave time - I don't have any issues with that. I don't believe we need a separate test object to verify that the default works if the override value works though.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity in the last 100 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale PRs that have reached or exceeded 90 days with no activity label Jul 24, 2023
@github-actions github-actions bot closed this Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale PRs that have reached or exceeded 90 days with no activity
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Graceful termination of MOOSE with files or signals
5 participants