Error classification #66

Closed
dgasmith opened this issue May 7, 2019 · 5 comments

@dgasmith
Collaborator

dgasmith commented May 7, 2019

It would be good to add error classification to these models so that downstream programs can decide what should happen. Several examples:

  • InputError - (non-recoverable) error in the user input (e.g., incorrect keyword or method)
  • SetupError - (recoverable) example: scratch directory is not writeable
  • ConvergenceError - (recoverable) likely requires options tweaking to enhance iterative convergence.
  • RandomError - (recoverable) random seg fault or the like.

Recoverable and non-recoverable are meant in a distributed computing sense, where an upstream manager can make the decision to resubmit.

My initial thought here is that we build these as Exception classes so that the compute command can capture them and either process them into a proper JSON error message or let them raise. I am usually not a fan of custom error classes, but this is a good case for them: there is a variety of different behaviors that you want to elicit depending on the type of error.
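A minimal sketch of that idea, not the eventual implementation: the base class `ClassifiedError`, the `error_type` and `recoverable` attributes, the `raise_error` flag, and the shape of the returned dictionary are all assumptions made for illustration. Only the four category names come from the list above.

```python
import json


class ClassifiedError(Exception):
    """Hypothetical base class; this name does not come from the thread."""

    error_type = "unknown_error"
    recoverable = False

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class InputError(ClassifiedError):
    error_type = "input_error"
    recoverable = False


class SetupError(ClassifiedError):
    error_type = "setup_error"
    recoverable = True


class ConvergenceError(ClassifiedError):
    error_type = "convergence_error"
    recoverable = True


class RandomError(ClassifiedError):
    error_type = "random_error"
    recoverable = True


def compute(run, input_data, raise_error=False):
    """Call run(input_data) and turn classified errors into a JSON-ready dict.

    With raise_error=True the exception propagates to the caller instead.
    """
    try:
        return {"success": True, "result": run(input_data)}
    except ClassifiedError as exc:
        if raise_error:
            raise
        return {
            "success": False,
            "error": {
                "error_type": exc.error_type,
                "error_message": exc.message,
                "recoverable": exc.recoverable,
            },
        }


def failing_run(input_data):
    # Stand-in for a real program call that rejects the user input.
    raise InputError("unknown method: " + input_data["method"])


output = compute(failing_run, {"method": "hf-typo"})
print(json.dumps(output, indent=2))
```

Catching only the shared base class means unclassified exceptions still propagate, so a genuine bug is not silently converted into an error message.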

It would be good to kick around the different error types for a bit before implementing.

@loriab
Collaborator

loriab commented May 7, 2019

I am usually a fan of custom error classes for flow management, so I like this idea.

Categories look good, though I suspect they're what the upstream manager wants to know more than what the throwing program knows about its situation.

  • ResourceError - (recoverable or non-recoverable) req'd memory not allocatable (rc), available disk not sufficient (nrc if same resource; rc if alternate resource)
  • RandomError - (rc) PSIO error
  • InputError - (nrc) what were once psi4.ValidationErrors

I thought pymatgen was going to be more helpful than it turned out:

https://github.com/materialsproject/pymatgen/blob/76a66990113cf9a11f908e1aba643f19c6abca68/pymatgen/io/abinit/events.py#L767-L877

https://github.com/materialsproject/pymatgen/blob/76a66990113cf9a11f908e1aba643f19c6abca68/pymatgen/io/abinit/tasks.py#L1299-L1304

@dgasmith
Collaborator Author

dgasmith commented May 8, 2019

Yea, I had poked through there as well without finding too much.

Is the ResourceError a SetupError or vice versa? Not sure if we need to delineate between the two; either way they are recoverable and can be rerun on another worker, I think.

@loriab
Collaborator

loriab commented May 8, 2019

I was thinking ResourceError was a more concrete name for SetupError.

@dgasmith
Collaborator Author

So we have:

  • InputError - (non-recoverable) error in the user input (e.g., incorrect keyword or method)
  • ResourceError - (recoverable) example: scratch directory is not writeable, not enough memory
  • ConvergenceError - (recoverable) likely requires options tweaking to enhance iterative convergence.
  • RandomError - (recoverable) random seg fault or the like (SIGSEGV, PSIO, etc).

I think that's a pretty good list to get started?
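To illustrate how an upstream manager might act on this list, here is a rough sketch. The function `next_action`, its return labels, and the `max_attempts` limit are hypothetical; only the four category names and their recoverable flags come from the list above.

```python
# Final categories from the list above, mapped to whether a retry may help.
RECOVERABLE = {
    "InputError": False,
    "ResourceError": True,
    "ConvergenceError": True,
    "RandomError": True,
}


def next_action(error_type, attempts, max_attempts=3):
    """Return "fail", "resubmit", or "resubmit_with_changes" (hypothetical labels)."""
    # Unknown error types are treated as non-recoverable.
    if not RECOVERABLE.get(error_type, False) or attempts >= max_attempts:
        return "fail"
    if error_type == "ConvergenceError":
        # Rerunning unchanged would likely fail again; options need tweaking.
        return "resubmit_with_changes"
    return "resubmit"
```

Under these assumptions, `next_action("RandomError", 1)` asks for a plain resubmit, while `next_action("InputError", 0)` fails immediately because the user input itself has to change.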

@dgasmith
Collaborator Author

dgasmith commented Jun 1, 2019

Closed in #69.

@dgasmith dgasmith closed this as completed Jun 1, 2019