This commit changes the semantics for lost movers, to avoid a segfault #121

Merged 4 commits into devel on Aug 27, 2020

Conversation

@rfbird (Contributor) commented Aug 26, 2020

Previously, if movers were lost, the following would happen:

  1. The user would get a warning,
  2. The particle->i value would be shifted by move_p to be out of bounds,
  3. The particle would never make it to boundary_p to be processed.

This either teleports the particle across the domain or causes a
segfault in the next timestep.

This PR:

  1. Adds notes telling the user how serious that warning is,
  2. Undoes the shift, so instead of segfaulting, the code now "holds" the
     particle at the cell boundary.

This breaks the physics, but to a lesser extent than the previous
teleporting of a particle. It is still crucial that the user choose a
correct value for nm.

To recreate these issues, use a deck where the initial nm is set to 1
during define_species in the input deck.
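For reference, a reproduction might look like the following input-deck fragment. This is a schematic sketch only: the exact define_species argument list varies between VPIC versions, so the argument names and order shown here are illustrative, not authoritative.

```cpp
begin_initialization {
  // ... grid, materials, and field setup elided ...

  // Deliberately undersized mover array: with nm = 1, almost any particle
  // motion overflows the movers and triggers the lost-mover path.
  // (Argument order shown schematically; check your VPIC version's
  // define_species signature.)
  species_t *electron = define_species( "electron",
                                        -ec, me,       // charge, mass
                                        max_local_np,  // particle storage
                                        1,             // nm: only ONE mover
                                        sort_interval,
                                        sort_out_of_place );
}
```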

@codecov bot commented Aug 26, 2020

Codecov Report

Merging #121 into devel will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##            devel     #121      +/-   ##
==========================================
- Coverage   83.14%   83.13%   -0.02%     
==========================================
  Files         118      118              
  Lines        7287     7287              
  Branches     1124     1124              
==========================================
- Hits         6059     6058       -1     
- Misses        752      753       +1     
  Partials      476      476              
Impacted Files Coverage Δ
...es_advance/standard/pipeline/advance_p_pipeline.cc 93.43% <0.00%> (-0.69%) ⬇️
src/sf_interface/pipeline/reduce_array_pipeline.cc 74.41% <0.00%> (-0.59%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 360186c...541574e.

@rfbird (Contributor, Author) commented Aug 26, 2020

Question to reviewers:

Fixing this segfault makes it more likely we'll burn CPU time on invalid simulations. Do we want to add some sort of optional detection that raises an ERROR instead?

If so, do we want that exit to happen after diagnostics etc. have run?

@SVLuedtke-LANL (Collaborator)

I advocate making this an error rather than a warning even though that is not the traditional behavior of VPIC.

It is possible that if only a few particles are dropped, they won't significantly affect the physics, and you could "save" a simulation by continuing. But that is unlikely because it requires a "just right" mover size that is slightly less than the minimum needed. If you're that good at picking mover sizes, you should have no problem always having enough. More likely than saving a simulation, I think, is wasting compute time on a simulation that is already wrong.

Further, users probably don't search their output for warnings, and so might not even know there is a problem. I think this makes it way too easy for wrong results to be trusted.

More philosophically, I think that when VPIC errs in its calculation of the physics, it should present an error to the user.

Holding the particles back instead of dropping them is probably massively better for the physics, but it's still wrong. This could be useful for test simulations that won't resolve the physics anyway, but I don't think continuing should be allowed unless the user specifically enables a physics-breaking option.

@SVLuedtke-LANL (Collaborator)

Ideally the user wouldn't have to worry about such an arcane thing as how many particles might be passed among ranks in any given timestep, but resizing arrays inside advance_p might not be easy or performant.

@petschge (Collaborator)

The most important thing to me is that we do not segfault, because at that point a lot of debugging starts.

I am totally ok with making this an error. I am not sure whether we want to leave in the unshifting code for the adventurous user who might want to downgrade the error to a warning if they are caught in a strange corner, or whether we should keep the code simple and just die, telling the user to arrange for more movers.

Ideally the user would never have to worry about picking the right number of movers, but I have seen several cases where the automatic number of "-1" doesn't work well. Is there at least a way to suggest a better number of movers when we error out, so that the user gets an error message like "Pipeline %i (species = %s) ran out of storage for %i movers. Try increasing the movers in define_species to %i. If you don't have sufficient memory to do that, increase the number of nodes."?

If we do stick with the warning, we should probably flag it at the end of the simulation by printing "stuck particles" instead of "normal exit", or something similar.

@rfbird (Contributor, Author) commented Aug 27, 2020

OK, it sounds like we all agree.

Plan:

  1. I'll extend this PR to add a compile-time flag for "die plz" behavior
  2. We'll do a merge review of this
  3. We can discuss in a future meeting whether we want to swap the default behavior over to dying

@rfbird (Contributor, Author) commented Aug 27, 2020

@petschge @SVLuedtke-LANL can one of you give me a sanity check on this from a low-level code perspective please?

@SVLuedtke-LANL (Collaborator)

Other than those two small errors, it looks sane to me.
