Provide an error throw class #79

crtrott · 2015-09-04T22:40:33Z

Based on discussions with production teams, we have a use case for an error catch class.
The problem is that the codes heavily use throw/catch mechanisms not just to abort but also to do things like reruns. They also want to catch more than a single error in a run. Imagine, for example, that a model runs into trouble because of a faulty setup. In this case, all problematic elements should be identified in a single run instead of just the first one.

One idea is to have a class into which one can push_back errors. This could be based on a linked list, which can be implemented in a thread-safe way. A catch mechanism after the parallel_for can then walk that list and process all thrown errors.
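A minimal sketch of the linked-list idea, assuming standard C++17 with std::thread standing in for the parallel_for. The names ErrorList, push, messages, and find_bad_elements are hypothetical, not a Kokkos API. Each push links a new node in front of the list head with a compare-and-swap loop, so concurrent threads never lose an entry; the list is read only after all workers have joined.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical error collector: a lock-free singly linked list (Treiber stack).
// Any thread may push concurrently; reading happens only after the workers join.
class ErrorList {
  struct Node {
    std::string message;
    Node* next;
  };
  std::atomic<Node*> head_{nullptr};

 public:
  ErrorList() = default;
  ErrorList(const ErrorList&) = delete;
  ErrorList& operator=(const ErrorList&) = delete;
  ~ErrorList() { clear(); }

  // Thread-safe: prepend a node, retrying the CAS until no other thread raced us.
  void push(std::string msg) {
    Node* node = new Node{std::move(msg), head_.load(std::memory_order_relaxed)};
    // On failure, compare_exchange_weak stores the current head in node->next.
    while (!head_.compare_exchange_weak(node->next, node,
                                        std::memory_order_release,
                                        std::memory_order_relaxed)) {
    }
  }

  // Not thread-safe: call only after the parallel region has finished.
  std::vector<std::string> messages() const {
    std::vector<std::string> out;
    for (Node* n = head_.load(std::memory_order_acquire); n != nullptr; n = n->next)
      out.push_back(n->message);
    return out;
  }

  // Not thread-safe: frees every node.
  void clear() {
    Node* n = head_.exchange(nullptr, std::memory_order_acquire);
    while (n != nullptr) {
      Node* next = n->next;
      delete n;
      n = next;
    }
  }
};

// Example: several threads scan slices of the elements and record every bad one.
std::vector<std::string> find_bad_elements(int n_elements, int n_threads) {
  ErrorList errors;
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; ++t) {
    workers.emplace_back([&errors, t, n_threads, n_elements] {
      for (int i = t; i < n_elements; i += n_threads)
        if (i % 7 == 0)  // stand-in for a "faulty setup" check
          errors.push("element " + std::to_string(i) + " is misconfigured");
    });
  }
  for (auto& w : workers) w.join();
  return errors.messages();  // every error from the run, not just the first
}
```

Because nodes are prepended by whichever thread wins the race, messages come back in no particular order; a caller that needs ordering can sort them after the kernel.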

mhoemmen · 2015-09-05T04:26:09Z

I would be interested in learning more about the use case. I usually handle stuff like this in Tpetra with custom code that collects all (or some subset) of the "bad" values inside of the loop, but reports them outside of the loop. If one of the goals is to stop processing early, then what people really want is early termination of a parallel kernel.

crtrott · 2015-09-05T16:30:33Z

We already have a function that comes as close as we can get to aborting a kernel. I believe this class is meant as a way to collect multiple errors. What I would envision is something similar to this:

parallel_for(..) {
  ...
  if (error_condition) {
    Kokkos::collect_error("My Error Message");  // proposed call: record the error and keep going
    // do something in response to the error
  }
}

Internally, the parallel_for would check the state of the error collector after the kernel finishes. If any errors were recorded, it could print them and throw there, and the exception could include the kernel name.
For the error messages themselves, we might even be able to determine internally which thread reported each one.
Furthermore there should be a function which you can use after the kernel to iterate over all errors.
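A sketch of the flow described above, assuming std::thread as a stand-in for the Kokkos backend; parallel_for_checked, ErrorCollector, ErrorRecord, and KernelErrors are hypothetical names. The body records errors instead of throwing, and the wrapper checks the collector once after all threads have joined, then throws a single exception that carries the kernel name and every record, including which worker reported it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One recorded error: which worker reported it, for which index, and why.
struct ErrorRecord {
  int thread_id;
  std::size_t index;
  std::string message;
};

// Hypothetical collector handed to the loop body (mutex-based for brevity).
class ErrorCollector {
  std::mutex mtx_;
  std::vector<ErrorRecord> records_;

 public:
  void collect_error(int thread_id, std::size_t index, std::string msg) {
    std::lock_guard<std::mutex> lock(mtx_);
    records_.push_back({thread_id, index, std::move(msg)});
  }
  // Call only after all workers have joined.
  const std::vector<ErrorRecord>& errors() const { return records_; }
};

// Hypothetical exception carrying every record plus the kernel name.
class KernelErrors : public std::runtime_error {
 public:
  KernelErrors(const std::string& kernel, std::vector<ErrorRecord> recs)
      : std::runtime_error("kernel '" + kernel + "' reported " +
                           std::to_string(recs.size()) + " error(s)"),
        records(std::move(recs)) {}
  std::vector<ErrorRecord> records;
};

// Stand-in for a parallel_for: split [0, n) across threads, then check the
// collector once and throw a single exception naming the kernel.
void parallel_for_checked(
    const std::string& name, std::size_t n, int n_threads,
    const std::function<void(std::size_t, int, ErrorCollector&)>& body) {
  ErrorCollector collector;
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; ++t)
    workers.emplace_back([&, t] {
      for (std::size_t i = t; i < n; i += n_threads) body(i, t, collector);
    });
  for (auto& w : workers) w.join();
  if (!collector.errors().empty())
    throw KernelErrors(name, collector.errors());
}
```

A caller would catch KernelErrors and loop over its records member to visit every error from the kernel. Whether the wrapper should throw at all is the design question the thread turns to next; a variant could simply return the records and leave the decision to the caller.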

hcedwar · 2015-09-08T13:38:45Z

Using exceptions for expected (non-error / non-exceptional) conditions is a really bad design idiom. Exceptions should be restricted to "something better than abort" usage. We can discuss a feature with respect to requirements, but immediately jumping to exceptions is a bad path.

nmhamster · 2015-09-08T14:00:19Z

The use case here is calling into virtual functions (which may include wrapped calls to Fortran). The components being called may need to report errors back to the caller, but they do so in unstructured ways (i.e., there are no common return codes). The customer who presented this issue currently throws an exception on receiving the first error; this causes a fix-up routine to run, and then the whole computation is re-run. What is desired is to collect all of the "throwing" cases. In OpenMP, exceptions cannot propagate across thread boundaries, so this cannot be parallelized effectively in the prototype code the customer is working on. The proposal is to have a Kokkos class that lets them collect the errors as they occur and then inspect them at the end of the kernel. I think the inspection of the collected errors should let the developer decide what to do, i.e., do /they/ want to throw an exception, re-run code, etc. That decision doesn't need to be in Kokkos.
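A sketch of the collect-then-decide pattern, assuming std::thread in place of OpenMP and a hypothetical run_and_capture helper. Since an exception must not escape the thread that threw it, each worker catches it where it is thrown and stores a std::exception_ptr with the element index; nothing is rethrown inside the parallel region. The caller then inspects every failure, applies a fix-up, and re-runs once, rather than re-running after each individual error. The component and its fix-up in fix_up_and_rerun are stand-ins for the customer's wrapped calls.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <mutex>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// A failure captured inside a worker: the element index and the exception.
struct CapturedFailure {
  std::size_t index;
  std::exception_ptr error;
};

// Run component(i) for every i on several threads. Each exception is caught in
// the thread that threw it and recorded; the caller decides what to do next.
template <class F>
std::vector<CapturedFailure> run_and_capture(std::size_t n, int n_threads, F component) {
  std::mutex mtx;
  std::vector<CapturedFailure> failures;
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; ++t)
    workers.emplace_back([&, t] {
      for (std::size_t i = t; i < n; i += n_threads) {
        try {
          component(i);
        } catch (...) {
          std::lock_guard<std::mutex> lock(mtx);
          failures.push_back({i, std::current_exception()});
        }
      }
    });
  for (auto& w : workers) w.join();
  return failures;
}

// Example driver: collect every misconfigured element, fix them all, rerun once.
// Returns {failures in first run, failures in rerun}.
std::pair<std::size_t, std::size_t> fix_up_and_rerun(std::vector<int>& config) {
  auto component = [&config](std::size_t i) {
    if (config[i] < 0)
      throw std::invalid_argument("bad setup at element " + std::to_string(i));
  };
  auto first = run_and_capture(config.size(), 3, component);
  for (const auto& f : first) {
    try {
      std::rethrow_exception(f.error);
    } catch (const std::invalid_argument&) {
      config[f.index] = 0;  // hypothetical fix-up; other exception types propagate
    }
  }
  auto second = run_and_capture(config.size(), 3, component);
  return {first.size(), second.size()};
}
```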

crtrott · 2015-09-08T15:47:29Z

Yup. My use of the word "throw" was a bit careless. I didn't mean that Kokkos should actually raise an exception. This is mainly about collecting errors in a parallel environment and letting users outside that parallel region inspect them and decide what to do.

mrtupek · 2015-10-08T18:09:34Z

We would be OK with having a class that we can push back errors to. In fact, we'd even be OK if that class were allocated/sized up front, with error codes/strings pushed back until that space is used up. Any errors after that can be ignored. Outside the loop we'd then process those error messages and do what we will with them.
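A minimal sketch of the pre-sized variant, with hypothetical names (FixedErrorBuffer, ErrorCode, scan_elements) and std::thread standing in for the parallel loop. Storage is allocated once, each report claims a slot with an atomic fetch_add, and reports past capacity are dropped. Counting every attempt, including dropped ones, lets the code outside the loop tell whether the buffer overflowed and, for example, enlarge it before a rerun.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical error record: which element failed and an error code.
struct ErrorCode {
  std::size_t element;
  int code;
};

// Hypothetical fixed-capacity buffer: slots are claimed with an atomic counter;
// reports beyond capacity are dropped but still counted.
class FixedErrorBuffer {
  std::vector<ErrorCode> slots_;
  std::atomic<std::size_t> attempts_{0};

 public:
  explicit FixedErrorBuffer(std::size_t capacity) : slots_(capacity) {}

  // Thread-safe; returns false if the report was dropped because the buffer is full.
  bool push(ErrorCode e) {
    std::size_t slot = attempts_.fetch_add(1, std::memory_order_relaxed);
    if (slot >= slots_.size()) return false;
    slots_[slot] = e;  // each slot index is claimed by exactly one thread
    return true;
  }

  // Read only after the parallel region has finished (e.g. after joining).
  std::size_t attempts() const { return attempts_.load(); }
  std::size_t stored() const {
    return attempts() < slots_.size() ? attempts() : slots_.size();
  }
  bool overflowed() const { return attempts() > slots_.size(); }
  const ErrorCode& operator[](std::size_t i) const { return slots_[i]; }
};

// Example: threads scan [0, n); every third element "fails" with a stand-in code.
void scan_elements(FixedErrorBuffer& errors, std::size_t n, int n_threads) {
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; ++t)
    workers.emplace_back([&errors, n, n_threads, t] {
      for (std::size_t i = t; i < n; i += n_threads)
        if (i % 3 == 0) errors.push({i, 42});
    });
  for (auto& w : workers) w.join();
}
```

Because slots are claimed with a single atomic increment and never reused, no lock is needed, which also suits device code where allocating inside a kernel is typically undesirable.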

hcedwar · 2016-11-08T18:30:52Z

ISO C++ Standards Committee decision on exception handling within C++17 parallel algorithms:
It is a trait/property of the execution policy for how exceptions thrown within the body of a parallel algorithm are handled.

crtrott · 2016-11-29T22:42:47Z

OK, I just merged in Patrick's error reporter class, which should get us reasonably far. This is based on pull request #518.

crtrott added the Feature Request label Sep 4, 2015

crtrott added this to the Backlog milestone Nov 23, 2015

crtrott self-assigned this Nov 29, 2016

crtrott modified the milestones: END 2016, Backlog Nov 29, 2016

crtrott added the Experimental and Available labels Nov 29, 2016

crtrott closed this as completed Dec 16, 2016
