Add auto_advance parameter to PicardSolve #15614

lindsayad · 2020-07-15T22:30:59Z

This allows for uniform time-step cutting across multi-app levels even
when not performing Picard between two levels. For example, to prevent a
sub-application from auto-advancing regardless of the state of its solve, a
user can specify auto_advance = false to require the master to cut its
timestep.

Closes #15166

moosebuild · 2020-07-16T01:31:28Z

Job Documentation on db89cd3 wanted to post the following:

View the site here

This comment will be updated on new commits.

moosebuild · 2020-07-20T17:06:25Z

Job App tests on d325696 : invalidated by @lindsayad

moosebuild · 2020-07-20T17:06:51Z

Job Documentation on d325696 : invalidated by @lindsayad

YaqiWang · 2020-07-20T20:43:01Z

I have to say, before reviewing this, that this auto_auto makes my eyes bleed ;-)

lindsayad · 2020-07-20T20:52:15Z

auto_advance ?

YaqiWang · 2020-07-20T21:02:58Z

Nah, I have not reviewed this yet. I may like it after looking into the code, or I may come up with other suggestions. The word auto-auto just seems complicated to me.

fdkong · 2020-07-20T21:06:56Z

I have to say, before reviewing this, that this auto_auto makes my eyes bleed ;-)

auto_square might be better :-)

lindsayad · 2020-07-20T21:08:15Z

I do not see auto auto anywhere

lindsayad · 2020-07-20T21:10:25Z

Lol, I see it was in the PR title. I changed the PR title. There is no auto auto in the code, nor auto_auto.

fdkong

Looks good. A few minor comments

fdkong · 2020-07-20T22:39:09Z

framework/doc/content/source/postprocessors/TimePostprocessor.md

@@ -0,0 +1,9 @@
+# TimePostprocessor
+


Could we explain this object using one sentence?

I don't like the redundancy that that brings. See this postprocessor's doc page, vs. the one for TimestepSize which has an explicit description in the .md file. If I didn't have the !syntax description /Postprocessors/TimePostprocessor I would absolutely agree with you.

fdkong · 2020-07-20T22:40:01Z

framework/include/executioners/PicardSolve.h

+   * Whether sub-applications are automatically advanced no matter what happens during their solves
+   */
+  bool autoAdvance() const;
+


Sounds like we should use forceAutoAdvance?

forceAutoAdvance sounds like it would be a setter-type method to me. This is just querying the state of whether we are auto-advancing.

fdkong · 2020-07-20T22:43:26Z

framework/src/executioners/PicardSolve.C

@@ -99,6 +99,9 @@ PicardSolve::validParams()
  params.addParam<bool>("update_xfem_at_timestep_begin",
                        false,
                        "Should XFEM update the mesh at the beginning of the timestep");
+  params.addParam<bool>("auto_advance",
+                        "Whether to automatically advance sub-applications regardless of whether "
+                        "their solve converges.");


It might be a good idea to document which use cases require us to advance the state even though the sub-apps fail to solve.

I don't know what those use cases are. As you know I hate multiapps 😄

Do you, @YaqiWang, or @vincentlaboure know of some? I agree that I should add documentation about those cases. Otherwise yea even I don't understand why it's there!

I can't think of an example where auto_advance=true would be desired, but that doesn't mean there isn't one.

This does sound counterintuitive. If the solve in the subapp is not successful, why advance rather than stop?

Why not name the parameter force_advance and default it to false?

The current logic auto advances sub-applications as long as you are not doing Picard. If I do what you are describing, then I will be changing the default behavior. Presumably we have tests and/or applications that rely on this default behavior. I don't know how this dumpster fire advanced to the point where we are at now, but I am terrified of modifying default behavior, as I assume the original code-writer had some reason for it being that way. This seems like a classic Chesterton's fence.

I am shocked any time I make changes in the PicardSolve/Transient/TransientMultiApp code system, and I don't break something. I suppose I could try changing the default and seeing whether any tests fail... How much should we bet that tests fail? 😄

You would have been even more shocked before I created PicardSolve ;-) This is exactly why we want to have tests. Breaking tests will force us to think about the design.

The test I added applies auto_advance = false. All other test cases in the world test the other case.

OK, the conclusion is that we still do not know why we have that option when subapps fail.

vincentlaboure

The test works as expected, thank you @lindsayad!

YaqiWang

I know the current Transient and TransientMultiApp are messy, even after I factored the Picard stuff out into the PicardSolve object. This looks like a bandage to me. I think the key is to document that extra parameter carefully and to have a test. Then in the future, when we refactor Transient and TransientMultiApp, we will know what we are dealing with.

YaqiWang · 2020-07-21T17:31:31Z

framework/src/executioners/PicardSolve.C

@@ -99,6 +99,9 @@ PicardSolve::validParams()
  params.addParam<bool>("update_xfem_at_timestep_begin",
                        false,
                        "Should XFEM update the mesh at the beginning of the timestep");
+  params.addParam<bool>("auto_advance",
+                        "Whether to automatically advance sub-applications regardless of whether "
+                        "their solve converges.");


This does sound counterintuitive. If the solve in the subapp is not successful, why advance rather than stop?

YaqiWang · 2020-07-21T17:36:24Z

framework/include/multiapps/MultiApp.h

   */
-  virtual void finishStep() {}
+  virtual void finishStep(bool /*recurse_through_multiapp_levels*/ = false) {}


This guy is doing nothing?

Oh, it is the base class, never mind.

Should we always do this recursively? Of course I do not know the implications here ;-) Just throwing wrenches.

No, we should not. We also call this from incrementStepOrReject, where we do not want to recurse. But we also call this method when the master solve is completely finished, and that is when we need to recurse; otherwise we do not finish the steps of multiapp levels deeper than the first level.
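The difference between the two call sites can be illustrated with a toy model. This is only a sketch under stated assumptions: `App`, `countFinishedBelow`, and `makeThreeLevels` are hypothetical names invented for illustration, not MOOSE classes, and the only behavior taken from the thread is that a non-recursive call finishes the first level below the caller while a recursive call finishes every level.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy model of nested application levels; names are illustrative, not MOOSE's.
struct App
{
  bool step_finished = false;
  std::vector<std::unique_ptr<App>> sub_apps;

  // Finish the step of the sub-apps owned by this app. With recurse == false only
  // the direct sub-apps are finished; with recurse == true every deeper level is too.
  void finishStep(bool recurse = false)
  {
    for (auto & sub : sub_apps)
    {
      assert(sub != nullptr && "every sub-app slot is expected to be populated");
      sub->step_finished = true;
      if (recurse)
        sub->finishStep(true);
    }
  }
};

// Count how many apps below 'app' have had their step finished.
inline int countFinishedBelow(const App & app)
{
  int n = 0;
  for (const auto & sub : app.sub_apps)
    n += (sub->step_finished ? 1 : 0) + countFinishedBelow(*sub);
  return n;
}

// Build master -> sub -> sub-sub, i.e. two levels below the master.
inline App makeThreeLevels()
{
  App master;
  master.sub_apps.push_back(std::make_unique<App>());
  master.sub_apps[0]->sub_apps.push_back(std::make_unique<App>());
  return master;
}
```

In this toy, calling `finishStep(false)` on the master leaves the sub-sub-app unfinished, which mirrors the "levels farther down than the first level" problem described above; calling `finishStep(true)` reaches both levels.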

YaqiWang · 2020-07-21T17:38:18Z

framework/src/executioners/PicardSolve.C

@@ -99,6 +99,9 @@ PicardSolve::validParams()
  params.addParam<bool>("update_xfem_at_timestep_begin",
                        false,
                        "Should XFEM update the mesh at the beginning of the timestep");
+  params.addParam<bool>("auto_advance",
+                        "Whether to automatically advance sub-applications regardless of whether "
+                        "their solve converges.");


Why not name the parameter force_advance and default it to false?

YaqiWang · 2020-07-21T21:10:18Z

I guess I only need you to update TransientMultiApp.md or maybe Transient.md, then I will approve ;-)

fdkong

I am OK with this PR, even though I really really want an example to demonstrate that we have to keep going when subapps fail and crash.

This allows for uniform time-step cutting across multi-app levels even when not performing Picard between two levels. For example, to prevent a sub-application from auto-advancing regardless of the state of its solve, a user can specify `auto_advance = false` to require the master to cut its timestep. Closes idaholab#15166

lindsayad · 2020-07-22T15:53:30Z

I just pushed up a commit to never auto-advance...let's see what the results are

lindsayad · 2020-07-22T19:53:37Z

I hate these systems with a fiery passion

YaqiWang · 2020-07-22T20:03:05Z

Which systems?

lindsayad · 2020-07-22T22:13:08Z

Ok, I think my conclusion is this: within the current design we need auto_advance = true in order for TransientMultiApp to work with restart. This is because the incrementing of the sub-application state happens well after checkpoint output has occurred. Checkpoint output happens in the master application's Transient::endStep; however, the incrementing of non-auto-advanced (Picard) sub-applications occurs in the master application's call to Transient::incrementStepOrReject. So if a sub-application is not auto-advanced, the restart data will show that sub-application one time step behind the master application.

This conundrum could probably be fixed by having PICARD_END and TIMESTEP_END. However, that is a much bigger undertaking. For now, I think we should stick with our default of auto-advancing sub-applications when not doing Picard in order to ensure that those simulations can work with restart and recover. Then we will have this additional auto_advance parameter which the user can set if they want to.

How does that sound to people?
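The ordering problem described in this comment can be sketched with two step counters. This is a toy model, not MOOSE code: `Snapshot` and `takeStep` are hypothetical names, and the only thing taken from the comment is the assumed ordering, namely that the checkpoint is written in the end-of-step stage while a non-auto-advanced sub-application is incremented only later, in the incrementStepOrReject stage.

```cpp
#include <cassert>

// What a checkpoint written at end-of-step would record (hypothetical type).
struct Snapshot
{
  int master_step;
  int sub_step;
};

// Run one master time step and return the checkpoint contents.
// Assumption: an auto-advanced sub-app is incremented before the checkpoint is
// written; a non-auto-advanced one is incremented only after it.
inline Snapshot takeStep(int & master_step, int & sub_step, bool auto_advance)
{
  assert(master_step == sub_step && "apps are expected to start the step in sync");

  ++master_step; // master completes its step
  if (auto_advance)
    ++sub_step; // sub-app state is advanced right away

  const Snapshot checkpoint{master_step, sub_step}; // written in the endStep stage

  if (!auto_advance)
    ++sub_step; // advanced later, in the incrementStepOrReject stage

  return checkpoint;
}
```

With `auto_advance` true the snapshot records both counters at the same step. With it false the snapshot records the sub-application one step behind, even though both counters agree again once the step is fully complete, which is why restart data taken from that checkpoint would be out of sync.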

lindsayad · 2020-07-22T22:46:49Z

I am OK with this PR, even though I really really want an example to demonstrate that we have to keep going when subapps fail and crash.

@fdkong hopefully I answered this in the above comment. The purpose of auto-advance is not to keep going even when subapps fail; the purpose is to keep the states of sub-applications in sync with the states of master applications whenever we can.

fdkong · 2020-07-22T22:53:31Z

framework/src/executioners/PicardSolve.C

+bool
+PicardSolve::autoAdvance() const
+{
+  bool auto_advance = !(_has_picard_its && _problem.isTransient());


Could you add your findings right here? So we know why we have auto on when there is no picard?

Yes I'll add some good documentation to the .md files of what I've outlined in my comments here on github.

Note that I have this PR on draft, so I'm guessing the auto merge label isn't going to have an effect... Actually this could be an interesting test of CIVET 😄
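The default in the `autoAdvance()` snippet from this thread can be read as a small truth table. Below is a minimal, self-contained sketch of it. `PicardState` is a hypothetical stand-in rather than the MOOSE class, and the rule that an explicit user value for `auto_advance` overrides the computed default is an assumption based on the parameter being added without a default value; the diff shown here does not include that part.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the state PicardSolve consults; not the MOOSE class.
struct PicardState
{
  bool has_picard_its;                    // Picard iterations are being performed
  bool is_transient;                      // the problem is time dependent
  std::optional<bool> user_auto_advance;  // set only if the user supplied auto_advance
};

// Default from the diff: auto-advance unless doing Picard on a transient problem.
// Assumption: an explicit user value, when present, overrides that default.
inline bool autoAdvance(const PicardState & s)
{
  const bool default_value = !(s.has_picard_its && s.is_transient);
  return s.user_auto_advance.value_or(default_value);
}

// Usage: the default follows the Picard/transient state unless the user overrides it.
inline void exampleUsage()
{
  assert(autoAdvance({false, true, std::nullopt}));  // no Picard: auto-advance
  assert(!autoAdvance({true, true, std::nullopt}));  // Picard + transient: do not
  assert(!autoAdvance({false, true, false}));        // user override wins
}
```

Using `std::optional<bool>` keeps "the user did not set the parameter" distinct from "the user set it to false", which matters here because the default depends on other state.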

lindsayad · 2020-07-22T22:56:59Z

With this PR, there are now multiple ways to approach the possibility of a failed sub-app solve.

1. Set auto_advance = false in the Executioner block of the master application. This will cause the master application to immediately cut its time step when the sub-application fails. However, setting this parameter also eliminates the possibility of doing restart/recover, because the master and sub are out of sync when checkpoint output occurs.
2. Set catch_up = true in the TransientMultiApp block. This will cause the sub-application to try to catch up to the master application after a failed sub-app solve. If catch-up is unsuccessful, then we register this as a true failure of the solve, and the master dt will then get cut. This option has the advantage of keeping the master and sub transient states in sync, enabling accurate restart/recover data.

@vincentlaboure I assume that you're aware of the catch_up parameter. You seem like a multi-app expert, whereas I am a newcomer.

lindsayad · 2020-07-23T14:41:15Z

Ok, documentation added to TransientMultiApp.md

vincentlaboure · 2020-07-23T14:46:40Z

With this PR, there are now multiple ways to approach the possibility of a failed sub-app solve.

You can set auto_advance = false in the Executioner block of the master application. This will cause the master application to immediately cut its time-step when the sub-application fails. However, setting this parameter also eliminates the possibility of doing restart/recover because the master and sub are out of sync when checkpoint output occurs.

Set catch_up = true in the TransientMultiApp block. This will cause the sub-application to try and catch up to the master application after a sub-app failed solve. If catch-up is unsuccessful, then we register this as a true failure of the solve, and the master dt will then get cut. This option has the advantage of keeping the master and sub transient states in sync, enabling accurate restart/recover data.

@vincentlaboure I assume that you're aware of the catch_up parameter. You seem like a multi-app expert, whereas I am a newcomer.

I actually have never used catch-up so I'll give it a try. Thanks for the detailed explanation!

lindsayad · 2020-07-23T21:12:32Z

This is ready for review/merge

fdkong · 2020-07-23T23:44:55Z

Great, I would like to use catch up all the time.

lindsayad force-pushed the bug_multilevel branch 2 times, most recently from dc81ecc to d325696 Compare July 15, 2020 23:57

lindsayad requested review from fdkong, rwcarlsen, YaqiWang and vincentlaboure July 20, 2020 18:16

lindsayad changed the title ~~Add auto_auto advance parameter to PicardSolve~~ Add auto_advanced advance parameter to PicardSolve Jul 20, 2020

lindsayad changed the title ~~Add auto_advanced advance parameter to PicardSolve~~ Add auto_advance parameter to PicardSolve Jul 20, 2020

fdkong suggested changes Jul 20, 2020

vincentlaboure previously approved these changes Jul 21, 2020

YaqiWang suggested changes Jul 21, 2020

Add example to show the deficiency in multilevel time step rejection

1652398

fdkong previously approved these changes Jul 22, 2020

lindsayad dismissed stale reviews from fdkong and vincentlaboure via face32d July 22, 2020 15:52

lindsayad force-pushed the bug_multilevel branch from d325696 to face32d Compare July 22, 2020 15:52

lindsayad marked this pull request as draft July 22, 2020 15:52

lindsayad force-pushed the bug_multilevel branch from 1aa091b to e9ee2d3 Compare July 22, 2020 22:24

fdkong previously approved these changes Jul 22, 2020

fdkong added the PR: Auto Merge label Jul 22, 2020

aeslaughter assigned fdkong Jul 23, 2020

Document options for combating failed sub-app solves

db89cd3

lindsayad dismissed fdkong’s stale review via db89cd3 July 23, 2020 14:40

lindsayad marked this pull request as ready for review July 23, 2020 14:40

lindsayad added the PR: Ready for review/merge label Jul 23, 2020

YaqiWang approved these changes Jul 23, 2020

fdkong approved these changes Jul 23, 2020

fdkong merged commit 7511c47 into idaholab:next Jul 23, 2020

lindsayad deleted the bug_multilevel branch July 23, 2020 23:57
