Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiApp restart bug #5695

Closed
andrsd opened this issue Sep 15, 2015 · 12 comments
Closed

MultiApp restart bug #5695

andrsd opened this issue Sep 15, 2015 · 12 comments
Assignees
Labels
C: Framework T: defect An anomaly, which is anything that deviates from expectations.

Comments

@andrsd
Copy link
Contributor

andrsd commented Sep 15, 2015

Here is a branch that shows a bug in multi-app restart.

The problem being solved is very trivial. The sub-app is computing u = x * ton a 1D domain from t = 0..1 with dt = 0.1. This solution is transferred onto master app into a variable called v. The master app computes u = x * t on a 2D domain.

For the first part, I let the t = 0..0.5 (i.e. 5 time steps) with checkpoint = true. Then the second part restarts from t = 5s and goes to t = 10 (another 5 time steps). While all is good in the master app, we end up with 5 time steps in the output file and both solutions (u and v) are correct, we have a bug in the sub-app:

  1. sub app time steps are 0, 6, 7, 8, 9, 10 (master app times steps are 0, 1, 2, 3, 4, 5) - from the console.
  2. the output file of the sup-app has 10 time steps, the initial condition is wrong (is zero everywhere), then there is 5 empty (i.e. zero) time steps and finally 5 steps with the correct solution. I looked and the exodus file using ncdump, because paraview was showing all zeros and incorrect times (because the time_whole var in the file reads 5, 0, 0, 0, 0, 0, 0.6, 0.7, 0.8, 0.9, 1, while it should be 0.5, 0.6, 0.7, 0.8, 0.9, 1).
@andrsd andrsd added C: Framework P: critical T: defect An anomaly, which is anything that deviates from expectations. labels Sep 15, 2015
@friedmud
Copy link
Contributor

I'll take a look at this later tonight... unfortunately I'm tied up all afternoon.

Thanks for the simple example and the explanation

@friedmud friedmud self-assigned this Sep 15, 2015
@andrsd
Copy link
Contributor Author

andrsd commented Sep 15, 2015

Thanks. Also, I have not seen a test for multi-app restart (maybe I did not look close enough), so this could serve as a basis for it...

@friedmud
Copy link
Contributor

Ok - I see the issue. It's a restart vs. recover issue again.

The sub-app is being "recovered"... even though the master app is being "restarted".

The problem is that the sub-app gets its exact state pushed down to it from the master app. That's the right thing to do for "recover" but wrong for "restart" (the difference is in the declareRestartableData() vs declareRecoverableData()... data that is declared as recoverable should not be overwritten when doing a restart...

While I think about this a bit... you may need to just use "recover" for the sprint problem.

I think I have a fix... but I need to stew on it for a while...

@friedmud
Copy link
Contributor

Actually - I don't think this will be that bad... stay tuned

@permcody
Copy link
Member

As a developer I'm a little confused about when I should use one or the other of these methods. Is there guidelines for how to talk to users about this? I guess users have to ask the question do I want the state to start over for "restart" but not for "recover"? Is that your thinking?

@friedmud
Copy link
Contributor

@permcody For our users it's easy: they should only ever use declareRestartableData(). In fact.. they have no choice because declareRecoverableData() is actually private.

declareRecoverableData() should only be used internal to the library. It's meant to denote data that needs to keep it's state during a Recover... but should NOT keep it's state during a Restart (because it needs to start fresh/new because a Restart is a new simulation).

Easiest one to think about is _t_step. During a Recover _t_step obviously needs to continue where the simulation left off. So if the last _t_step was 7 the next one needs to be 8... etc.

However... during Restart _t_step has to start back over at 0... because this is a new simulation. So it needs to start over just like any other new simulation.

Therefore _t_step uses declareRecoverableData(). It's data that should be Recovered but NOT Restarted. (You can see this here: https://github.com/idaholab/moose/blob/devel/framework/src/base/FEProblem.C#L116 )

So to summarize:

Restartable means: state is kept for both Restart and Recover
Recoverable means: state is only kept during a Recover

Hope that helps.

@friedmud
Copy link
Contributor

This is incredibly nasty. It's 2AM here and I'm calling off the search until tomorrow.

I think I need to take a different approach...

@YaqiWang
Copy link
Contributor

Sorry for asking here. Actually restart/recover also confuse me. It is easy to understand why we have recover. But not very for restart, why do we just start the simulation from scratch, what is the difference?

@permcody
Copy link
Member

Restart is much like the two-part simulations you run. You might use data
from a previous simulation as the initial condition for another simulation.

On Mon, Sep 21, 2015 at 8:57 AM Yaqi notifications@github.com wrote:

Sorry for asking here. Actually restart/recover also confuse me. It is
easy to understand why we have recover. But not very for restart, why do we
just start the simulation from scratch, what is the difference?


Reply to this email directly or view it on GitHub
#5695 (comment).

@YaqiWang
Copy link
Contributor

Oh thanks. Now makes sense. But is restart a subset of recover? If so we should just let user to make the difference.

@permcody
Copy link
Member

Restart is a subset of recover and I really don't think we want the user
deciding in general. As Derek pointed out there are a few key pieces of
data that are recoverable but in general we really don't want users to pick
one or the other and get themselves messed up. I'm actually really happy
that only one method is exposed. If you have special needs for Eigen
Execution, we can deal with that at the framework level as needed but let's
try to keep it out of the apps.

Cody

On Mon, Sep 21, 2015 at 9:19 AM Yaqi notifications@github.com wrote:

Oh thanks. Now makes sense. But is restart a subset of recover? If so we
should just let user to make the difference.


Reply to this email directly or view it on GitHub
#5695 (comment).

@friedmud
Copy link
Contributor

Yep. declareRestartableData() is the only function users ever need to be aware of.

For now, I'm managing access to declareRecoveeableData() using friend classes in Restartable. I have a good idea though about how to uncover it for specific systems that might need access to it in derived classes (TimeSteppers, Output, Executioners are a couple). But, so far, there has been no need for that.

All the other systems in MOOSE should never even be aware of it.

friedmud added a commit to friedmud/moose that referenced this issue Sep 24, 2015
friedmud added a commit to friedmud/moose that referenced this issue Sep 24, 2015
friedmud added a commit to friedmud/moose that referenced this issue Sep 24, 2015
friedmud added a commit to friedmud/moose that referenced this issue Sep 24, 2015
friedmud added a commit to friedmud/moose that referenced this issue Sep 24, 2015
friedmud added a commit to friedmud/moose that referenced this issue Oct 9, 2015
friedmud added a commit to friedmud/moose that referenced this issue Oct 9, 2015
permcody pushed a commit to permcody/moose that referenced this issue Oct 13, 2015
permcody added a commit to permcody/moose that referenced this issue Oct 13, 2015
@friedmud, you may want to check this out. It appears
that we still have a few weird restart startup issues
I had to remove the 'initial' execute for the postprocessors
in this test to maintain the current gold file.

refs idaholab#5695
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: Framework T: defect An anomaly, which is anything that deviates from expectations.
Projects
None yet
Development

No branches or pull requests

4 participants