MPI_ABORT called when accessing h._ref_t #1112

Closed
Helveg opened this issue Mar 22, 2021 · 28 comments

@Helveg (Contributor) commented Mar 22, 2021

The following error occurs when I access a property on my wrapper that records ._ref_t (commenting out the access makes the error go away):

(bsb3) robin@TNG2019:~/ws3/neuron_adapter_test$ mpiexec -n 8 bsb -v 3 simulate test_adapter --hdf5 adapter_test.hdf5
numprocs=8
Load balancing on node 7 took 0.0 seconds
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

This is the code of the property:

    @property
    def time(self):
        if not hasattr(self, "_time"):
            t = self.Vector()
            # Fix for upstream NEURON bug. See https://github.com/neuronsimulator/nrn/issues/416
            try:
                with catch_hoc_error(CatchSectionAccess):
                    t.record(self._ref_t)
            except HocSectionAccessError as e:
                self.__dud_section = self.Section(name="this_is_here_to_record_time")
                # Recurse to try again.
                return self.time
            self._time = t
        return self._time

So, probably because I catch the exception from #416 and continue with the simulation, some fatal error occurs later on.

@ramcdougal (Member) commented Mar 22, 2021

You can check whether it's related to the try-except by removing the try-except and just directly creating your dummy section if there are no existing sections:

@property
def time(self):
    if not hasattr(self, "_time"):
        self._time = self.Vector()
        if not any(h.allsec()):
            self.__dud_section = self.Section(name="this_is_here_to_record_time")
        self._time.record(self._ref_t)
    return self._time

Depending on what versions of Python you're targeting, consider using an @functools.cached_property instead of a @property as that transparently handles the memoization.
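
For illustration, here is a minimal sketch of the property rewritten with functools.cached_property (Python 3.8+). It assumes the same wrapper class as in the snippets above (i.e. that self.Vector, self.Section, and self._ref_t exist) and folds in the dummy-section workaround:

import functools

from neuron import h


class TimeRecorderMixin:
    # Sketch only: Vector, Section, and _ref_t are assumed to be provided
    # by the wrapper class, as in the snippets above.
    @functools.cached_property
    def time(self):
        t = self.Vector()
        if not any(h.allsec()):
            # Keep a reference so the dummy section is not garbage collected.
            self._dud_section = self.Section(name="this_is_here_to_record_time")
        t.record(self._ref_t)
        return t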

@nrnhines (Member) commented:

All hoc_execerror calls are fatal when running under MPI (they call MPI_Abort). I don't know how hoc_execerror itself can know whether it is being executed from within a try, other than requiring the author of the try to also set a flag to avoid the MPI_Abort.

@Helveg (Contributor, Author) commented Mar 23, 2021

Depending on what versions of Python you're targeting, consider using an @functools.cached_property instead of a @property as that transparently handles the memoization.

Thanks! I only recently discovered @functools.cache; this prop predates that discovery :)

All hoc_execerror calls are fatal when running under MPI (they call MPI_Abort). I don't know how hoc_execerror itself can know whether it is being executed from within a try, other than requiring the author of the try to also set a flag to avoid the MPI_Abort.

Are there no catchable alternatives to MPI_Abort, any sort of MPI exceptions rather than MPI exits?

There are also some less preferable solutions that I can think of:

  • The global flag not to MPI_Abort that you mention.
  • A new method on ParallelContext to register MPI error handlers: if they return a truthy value, just continue; if they don't, MPI_Abort (no handlers = abort).
  • Another solution could be to add a string like "HOC error occurred inside MPI context. Call MPI_Abort in case of deadlock." to the error message: if it's caught it doesn't show up, and if it isn't caught people see it and know what they can do if they don't like the program stalling after a process errors.

Exception handling is essential to Python users as Python promotes an EAFP style.
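
For reference, a minimal sketch of that EAFP pattern applied to this case. In a serial run NEURON surfaces the failure as a catchable RuntimeError (as shown later in this thread); under MPI, MPI_Abort currently fires before the except clause ever runs, which is the problem being discussed:

from neuron import h


def record_time():
    tvec = h.Vector()
    tvec.record(h._ref_t)  # needs at least one section to exist
    return tvec


try:
    tvec = record_time()
except RuntimeError:  # serial runs raise "RuntimeError: hoc error" here
    # EAFP: only create the dummy section after the attempt fails, then retry.
    dud = h.Section(name="this_is_here_to_record_time")
    tvec = record_time()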

A small PS: even without my try/except/catch-hoc-error code, the error message isn't propagated. Is that something to do with my MPI settings/stdout flushing, or on NEURON's part?

Helveg added a commit to dbbs-lab/patch that referenced this issue Mar 23, 2021
@nrnhines (Member) commented:

Are there no catchable alternatives to MPI_Abort, any sort of MPI exceptions rather than MPI exits?

I'm not aware of any, though that does not mean there are none. One reason MPI_Abort is called (and also when, during a psolve, nothing advances for a settable interval) is to avoid using too much of a user's HPC account limit (the job timeout assumes a successful job). That got eaten up fairly quickly when there were failing runs with several hundred thousand ranks on a BlueGene.

I like the idea of an at_exit, but I'm a little shaky about how to get it to work. I suppose it would be restricted to an MPI run in which Python is involved.

error message isn't propagated

When Python is the launched program, the message is printed to Python's stderr. That is capturable (see nrn/share/lib/python/neuron/expect_hocerr.py, which I'm using to help with code coverage).
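
One way to capture it from Python, as a sketch under the assumption that NEURON routes the message through sys.stderr as described above (expect_hocerr.py may do this differently):

import io
from contextlib import redirect_stderr

from neuron import h

buf = io.StringIO()
tvec = h.Vector()
try:
    with redirect_stderr(buf):  # temporarily swaps sys.stderr
        tvec.record(h._ref_t)   # fails when no section exists
except RuntimeError:
    pass

print("captured hoc error output:", buf.getvalue())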

@iraikov (Contributor) commented Mar 23, 2021

Is there any way to raise HOC errors as C++ exceptions, or to use a C library for exception handling? Then it would be relatively easy to catch them and translate to Python exceptions. I find that on occasion calling MPI_Abort causes some error output to not be written to the log file, which then makes debugging large MPI programs a difficult task.

@nrnhines (Member) commented:

@iraikov Perhaps that is possible now that all the code is compiled as C++. I don't have any familiarity with that style, so I don't have a feel for the difficulty of the transition.

@pramodk (Member) commented Mar 23, 2021

Is there any way to raise HOC errors as C++ exceptions, or to use a C library for exception handling?
@iraikov Perhaps that is possible now that all the code is compiled as C++.

I think these are good points. The error handling has become a bit messy: MPI executions want hard failures to avoid deadlocks and resource wastage, while Python-based executions expect nice error handling. With the above suggestions, it would be reasonable to achieve both!

@iraikov (Contributor) commented Mar 23, 2021

@nrnhines @pramodk Thanks, it is definitely possible to create a Python exception object from within a C++ try-catch block; there are several tutorials online that show how to do it, which I have used in the past. I am not familiar enough with HOC internals, but if hoc_execerror itself becomes a C++ routine and the entire neuron extension module is compiled as C++, then the exception should propagate to the invoking Python wrapper, where it can be captured. But I imagine this would involve a try-catch each time the HOC interpreter is invoked, which may be a burdensome refactoring task.

@Helveg (Contributor, Author) commented Mar 23, 2021

I find that on occasion calling MPI_Abort causes some error output to not be written to the log file, which then makes debugging large MPI programs a difficult task.

I cannot stress this enough; I was VERY lucky this time that I had literally only changed one line in my code since the last successful simulation, but debugging large MPI simulations on a remote HPC server without an error log? A horrible experience :s

@nrnhines (Member) commented May 5, 2021

I was wondering whether the fact that, when Python is launched, hoc_execerror sends its output to Python for printing on Python's stderr is implicated in why the error output is not appearing prior to the execution of MPI_Abort. However, I'm not seeing such an issue on my desktop, so I wonder whether the issue is specific to the HPC machine you are using.

On my desktop I'm experimenting with:

$ cat test.py
from neuron import h
h.nrnmpi_init()

tvec = h.Vector()
tvec.record(h._ref_t)

The result:

hines@hines-T7500:~/neuron/mpiabort$ mpiexec -n 4 python test.py
numprocs=4
0 NEURON: Section access unspecified
1 NEURON: Section access unspecified
1  near line 0
2 NEURON: Section access unspecified
2  near line 0
2  objref hoc_obj_[2]
3 NEURON: Section access unspecified
3  near line 0
3  objref hoc_obj_[2]
0  near line 0
0  objref hoc_obj_[2]
1  objref hoc_obj_[2]
                   ^
        3 Vector[0].record(...)
                   ^
        0 Vector[0].record(...)
                   ^
        2 Vector[0].record(...)
                   ^
        1 Vector[0].record(...)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[hines-T7500:08521] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[hines-T7500:08521] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

So I'm puzzled why you are not seeing the stderr output from hoc_execerror from at least the rank that initiated the MPI_Abort.

@Helveg (Contributor, Author) commented May 5, 2021

However, I'm not seeing such an issue on my desktop, so I wonder whether the issue is specific to the HPC machine you are using.

This is the output on the HPC machine. Interestingly, with 1 MPI process the error survives, but with multiple processes it does not. Perhaps that is because NEURON doesn't call MPI_Abort with only 1 MPI process?

bp000347@daint103:~> debug python -c "from neuron import h; h.nrnmpi_init(); tvec = h.Vector(); tvec.record(h._ref_t)"
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
numprocs=1
NEURON: Section access unspecified
 near line 0
 objref hoc_obj_[2]
                   ^
        Vector[0].record(...)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: hoc error
srun: error: nid00835: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=31062679.0
bp000347@daint103:~> debug -n 4 python -c "from neuron import h; h.nrnmpi_init(); tvec = h.Vector(); tvec.record(h._ref_t)"
srun: job 31062691 queued and waiting for resources
srun: job 31062691 has been allocated resources
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Rank 2 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
Rank 3 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
Rank 0 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Rank 1 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
srun: error: nid00835: tasks 0-2: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=31062691.0
slurmstepd: error: *** STEP 31062691.0 ON nid00835 CANCELLED AT 2021-05-05T16:06:11 ***
srun: error: nid00835: task 3: Aborted (core dumped)

@nrnhines (Member) commented May 5, 2021

This is speculative, but I wonder whether, when nrnmpi_nhost > 1, NEURON should not ask Python to do the printing to stderr. If you think that is worth an experiment on your machine, it can be done with:

hines@hines-T7500:~/neuron/mpiabort/src/oc$ git diff
diff --git a/src/oc/fileio.cpp b/src/oc/fileio.cpp
index f2912cf42..5b51780dc 100755
--- a/src/oc/fileio.cpp
+++ b/src/oc/fileio.cpp
@@ -15,7 +15,7 @@
 #include       <errno.h>
 #include       "nrnfilewrap.h"
 #include    "nrnjava.h"
-
+#include    "nrnmpi.h"
 
 
 
@@ -894,6 +894,15 @@ static int vnrnpy_pr_stdoe(FILE* stream, const char *fmt, va_list ap) {
         return size;
     }
 
+    // On some machines, when running in parallel,
+    // having Python print the stderr message, and then
+    // an immediately subsequent MPI_Abort, 
+    // means the message is not printed. So...
+    if (nrnmpi_numprocs > 1 && stream == stderr) {
+        size = vfprintf(stream, fmt, ap);
+        return size;
+    }
+
     /* Determine required size */
     va_list apc;
 #ifndef va_copy

If that works around your issue, great. It does not prejudice the desirability of being able to recover from errors in the MPI context.

@Helveg (Contributor, Author) commented May 5, 2021

In Python we can do print("error message", flush=True), ensuring that the buffer is flushed before continuing. Is there no such thing you can do when you ask Python to print the message from the API?

And I can certainly experiment with things, but I most likely won't succeed in building NEURON from source on that machine ;p Can you open a PR to trigger a wheel?

@ramcdougal (Member) commented:

Note that flush=True is Python 3+ with no direct analogue in Python 2.

I'm assuming 8.0.x still supports Python 2.7, and it's only 8.1+ that will drop it?

@Helveg (Contributor, Author) commented May 5, 2021

I think the idea here is to make a quick patch to secure error propagation until the MPI_Abort behavior can be replaced by one of the proposals earlier in this thread, so using the flush kwarg if it is available seems reasonable. Also, large-scale sims on HPC systems where this error occurs will most likely ship recent Python versions in their Environment Modules.

@nrnhines (Member) commented May 5, 2021

do print("error message", flush=True)

That you can experiment with without compiling. Just reach into the installed wheel's neuron/__init__.py and consider:

def nrnpy_pr(stdoe, s):
  if stdoe == 1:
    sys.stdout.write(s.decode())
  else:
    sys.stderr.write(s.decode())
  return 0

@ramcdougal (Member) commented:

So... would it suffice to just add a

sys.stderr.flush()

inside the else case?
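
For concreteness, that would make the fragment above look like the following (a sketch of the proposed edit to neuron/__init__.py, not the merged change):

import sys

def nrnpy_pr(stdoe, s):
    if stdoe == 1:
        sys.stdout.write(s.decode())
    else:
        sys.stderr.write(s.decode())
        # Flush so the message is emitted before a possible MPI_Abort.
        sys.stderr.flush()
    return 0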

@nrnhines (Member) commented May 5, 2021

The speculation is that the problem will go away if Helveg uses the fragment I sent to avoid delegating the stderr print to Python. But it is better if it can just be done in Python with a flush. I can't experiment myself because I have no machine that exhibits the problem.

would it suffice

Here's hoping:)

@Helveg (Contributor, Author) commented May 5, 2021

Yeah, but I really don't have the mental bandwidth to go figure out how to get that piece of code there and then get it built from source on the HPC; that could take me days to figure out 😞 Can we make a wheel? The easiest way I can think of is with a PR with that piece of code in it.

@ramcdougal (Member) commented May 5, 2021

Try my version then, which doesn't require a build from source, just a one-line addition in a Python file.

@Helveg (Contributor, Author) commented May 5, 2021

Oh sorry, I missed nrnhines's post with the Python code! Will try.

@ramcdougal (Member) commented May 5, 2021

You can change the default printing, without modifying NEURON itself at all, with a bit of messiness using nrnpy_pr_proto, nrnpy_pass_callback, and nrnpy_set_pr_etal, as in the following:

import sys
from neuron import h
from neuron import nrnpy_pr_proto, nrnpy_pass_callback, nrnpy_set_pr_etal


def my_print(stdoe, s):
    stream = sys.stdout if stdoe else sys.stderr
    print("inside myprint!")
    stream.write(s.decode())
    stream.flush()
    # the following is required
    return 0


nrnpy_pr_callback = nrnpy_pr_proto(my_print)
nrnpy_set_pr_etal(nrnpy_pr_callback, nrnpy_pass_callback)

soma = h.Section("soma")
h.topology()

This outputs:

% python neuron-print-redirect.py
inside myprint!

inside myprint!
|inside myprint!
-inside myprint!
|       soma(0-1)
inside myprint!

(Obviously, in real use, you'd leave out the printing of "inside myprint!", but it's included here to demo that we're customizing printing.)

@Helveg (Contributor, Author) commented May 5, 2021

Great success with:

import sys
from neuron import h
from neuron import nrnpy_pr_proto, nrnpy_pass_callback, nrnpy_set_pr_etal


def my_print(stdoe, s):
    stream = sys.stdout if stdoe else sys.stderr
    stream.write(s.decode())
    stream.flush()
    # the following is required
    return 0


nrnpy_pr_callback = nrnpy_pr_proto(my_print)
nrnpy_set_pr_etal(nrnpy_pr_callback, nrnpy_pass_callback)

from neuron import h
h.nrnmpi_init()

tvec = h.Vector()
tvec.record(h._ref_t)
The output:

bp000347@daint104:/scratch/snx3000/bp000347> debug -n 1 python ntest.py
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
NEURON: Section access unspecified
 near line 0
 objref hoc_obj_[2]
                   ^
        Vector[0].record(...)
numprocs=1
Traceback (most recent call last):
  File "ntest.py", line 21, in <module>
    tvec.record(h._ref_t)
RuntimeError: hoc error
srun: error: nid00009: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=31064762.0
bp000347@daint104:/scratch/snx3000/bp000347> debug -n 2 python ntest.py
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
0 NEURON: Section access unspecified
0  near line 0
0  objref hoc_obj_[2]
1 NEURON: Section access unspecified
1  near line 0
1  objref hoc_obj_[2]
                   ^
                   ^
        1 Vector[0].record(...)
        0 Vector[0].record(...)
Rank 1 [Wed May  5 18:26:56 2021] [c0-0c0s2n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Rank 0 [Wed May  5 18:26:56 2021] [c0-0c0s2n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
srun: error: nid00009: task 0: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=31064765.0
srun: error: nid00009: task 1: Aborted

Even in the second case, where before the error was swallowed, it is now flushed and shown before the MPI_Abort.

@nrnhines (Member) commented May 5, 2021

Excellent. It seems worthwhile to make the change in __init__.py in a simple PR. I can do that unless someone else would rather.

@ramcdougal (Member) commented:

Agreed, except I think we should only force the flush on stderr, not on everything as was done here, because buffering does have its advantages.

Incidentally, this becomes a non-issue beginning with Python 3.9, which line-buffers stderr by default: https://docs.python.org/3.9/whatsnew/3.9.html#sys
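
If one wanted to avoid redundant flushes on newer interpreters, a sketch of such a guard (line_buffering is an attribute of io.TextIOWrapper; the getattr default covers redirected streams):

import sys

# Flush stderr only when it is not already line-buffered
# (line buffering is the default for stderr from Python 3.9 on).
if not getattr(sys.stderr, "line_buffering", False):
    sys.stderr.flush()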

@nrnhines (Member) commented May 5, 2021

becomes a non-issue beginning with Python 3.9

Darn. I was testing on my machine with Python 3.9; I don't know whether I would have seen the issue with another version. But it works out, because I prefer the change to __init__.py over my change to fileio.cpp.

@Helveg (Contributor, Author) commented May 5, 2021

I'll make the PR :)

@iraikov (Contributor) commented May 5, 2021

Is it possible to use the logging framework for this purpose? Since I started using logging in my own MPI code, I have never had any issues with missing messages even with very large MPI jobs.
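
A minimal sketch of what that could look like in an MPI script; mpi4py is an assumption here (any way of obtaining the rank would do), and logging.StreamHandler flushes after every record it emits:

import logging
import sys

from mpi4py import MPI  # assumption: mpi4py is available alongside NEURON

rank = MPI.COMM_WORLD.Get_rank()
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format=f"[rank {rank}] %(levelname)s: %(message)s",
)
log = logging.getLogger("sim")
log.info("simulation starting")  # flushed immediately by the StreamHandler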

Helveg added a commit that referenced this issue May 6, 2021
This ensures that in MPI simulations the error message is flushed before 
MPI_Abort is called, as discussed in #1112
ramcdougal pushed a commit that referenced this issue May 6, 2021
This ensures that in MPI simulations the error message is flushed before 
MPI_Abort is called, as discussed in #1112