MPI_ABORT called when accessing h._ref_t #1112

Closed
Helveg opened this issue Mar 22, 2021 · 28 comments

@Helveg (Contributor) commented Mar 22, 2021

The following error occurs when I access a property on my wrapper that records ._ref_t (commenting out the access makes the error go away):

(bsb3) robin@TNG2019:~/ws3/neuron_adapter_test$ mpiexec -n 8 bsb -v 3 simulate test_adapter --hdf5 adapter_test.hdf5
numprocs=8
Load balancing on node 7 took 0.0 seconds
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 7 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

This is the code of the property:

    @property
    def time(self):
        if not hasattr(self, "_time"):
            t = self.Vector()
            # Fix for upstream NEURON bug. See https://github.com/neuronsimulator/nrn/issues/416
            try:
                with catch_hoc_error(CatchSectionAccess):
                    t.record(self._ref_t)
            except HocSectionAccessError as e:
                self.__dud_section = self.Section(name="this_is_here_to_record_time")
                # Recurse to try again.
                return self.time
            self._time = t
        return self._time

So, probably because I catch the exception from #416 and continue with the simulation, some fatal error occurs later on.

@ramcdougal (Member) commented Mar 22, 2021

You can check whether it's related to the try-except by removing the try-except and just directly creating your dummy section if there are no existing sections:

@property
def time(self):
    if not hasattr(self, "_time"):
        self._time = self.Vector()
        if not any(h.allsec()):
            self.__dud_section = self.Section(name="this_is_here_to_record_time")
        self._time.record(self._ref_t)
    return self._time

Depending on what versions of Python you're targeting, consider using an @functools.cached_property instead of a @property as that transparently handles the memoization.
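
For illustration, here is a minimal sketch of the property rewritten with functools.cached_property (Python 3.8+). It assumes the same wrapper class as in the snippets above (i.e. that self.Vector, self.Section, and self._ref_t exist) and folds in the dummy-section workaround:

import functools

from neuron import h


class TimeRecorderMixin:
    # Sketch only: Vector, Section, and _ref_t are assumed to be provided
    # by the wrapper class, as in the snippets above.
    @functools.cached_property
    def time(self):
        t = self.Vector()
        if not any(h.allsec()):
            # Keep a reference so the dummy section is not garbage collected.
            self._dud_section = self.Section(name="this_is_here_to_record_time")
        t.record(self._ref_t)
        return t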

@nrnhines (Member) commented:

All hoc_execerror calls are fatal when running under MPI (they call MPI_Abort). I don't know how hoc_execerror itself can know whether it is being executed from within a try, other than requiring the author of the try to also set a flag to avoid the MPI_Abort.

@Helveg (Contributor, Author) commented Mar 23, 2021

Depending on what versions of Python you're targeting, consider using an @functools.cached_property instead of a @property as that transparently handles the memoization.

Thanks! I only recently discovered @functools.cache; this prop predates that discovery :)

All hoc_execerror calls are fatal when running under MPI (they call MPI_Abort). I don't know how hoc_execerror itself can know whether it is being executed from within a try, other than requiring the author of the try to also set a flag to avoid the MPI_Abort.

Are there no catchable alternatives to MPI_Abort, any sort of MPI exceptions rather than MPI exits?

There are also some less preferable solutions that I can think of:

  • The global flag not to MPI_Abort that you mention.
  • A new method on ParallelContext to register MPI error handlers: if they return a truthy value, just continue; if they don't, MPI_Abort (no handlers = abort).
  • Another solution could be to add a string like "HOC error occurred inside MPI context. Call MPI_Abort in case of deadlock." to the error message: if it's caught it doesn't show up, and if it isn't caught people see it and know what they can do if they don't like the program stalling after a process errors.

Exception handling is essential to Python users as Python promotes an EAFP style.
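
For reference, a minimal sketch of that EAFP pattern applied to this case. In a serial run NEURON surfaces the failure as a catchable RuntimeError (as shown later in this thread); under MPI, MPI_Abort currently fires before the except clause ever runs, which is the problem being discussed:

from neuron import h


def record_time():
    tvec = h.Vector()
    tvec.record(h._ref_t)  # needs at least one section to exist
    return tvec


try:
    tvec = record_time()
except RuntimeError:  # serial runs raise "RuntimeError: hoc error" here
    # EAFP: only create the dummy section after the attempt fails, then retry.
    dud = h.Section(name="this_is_here_to_record_time")
    tvec = record_time()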

A small PS: even without my try/except/catch-hoc-error code, the error message isn't propagated. Is that something to do with my MPI settings/stdout flushing, or on NEURON's part?

Helveg added a commit to dbbs-lab/patch that referenced this issue Mar 23, 2021
@nrnhines (Member) commented:

Are there no catchable alternatives to MPI_Abort, any sort of MPI exceptions rather than MPI exits?

I'm not aware of any, though that does not mean there are none. One reason MPI_Abort is called (and also when, during a psolve, nothing advances for a settable interval) is to avoid using too much of a user's HPC account limit (the job timeout assumes a successful job). That got eaten up fairly quickly when there were failing runs with several hundred thousand ranks on a BlueGene.

I like the idea of an at_exit, but I'm a little shaky about how to get it to work. I suppose it would be restricted to an MPI run in which Python is involved.

error message isn't propagated

When Python is the launched program, the message is printed to Python's stderr. That is capturable (see nrn/share/lib/python/neuron/expect_hocerr.py, which I'm using to help with code coverage).
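
One way to capture it from Python, as a sketch under the assumption that NEURON routes the message through sys.stderr as described above (expect_hocerr.py may do this differently):

import io
from contextlib import redirect_stderr

from neuron import h

buf = io.StringIO()
tvec = h.Vector()
try:
    with redirect_stderr(buf):  # temporarily swaps sys.stderr
        tvec.record(h._ref_t)   # fails when no section exists
except RuntimeError:
    pass

print("captured hoc error output:", buf.getvalue())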

@iraikov (Contributor) commented Mar 23, 2021

Is there any way to raise HOC errors as C++ exceptions, or to use a C library for exception handling? Then it would be relatively easy to catch them and translate to Python exceptions. I find that on occasion calling MPI_Abort causes some error output to not be written to the log file, which then makes debugging large MPI programs a difficult task.

@nrnhines (Member) commented:

@iraikov Perhaps that is possible now that all the code is compiled as C++. I don't have any familiarity with that style, so I don't have a feel for the difficulty of the transition.

@pramodk (Member) commented Mar 23, 2021

Is there any way to raise HOC errors as C++ exceptions, or to use a C library for exception handling?
@iraikov Perhaps that is possible now that all the code is compiled as C++.

I think these are good points. The error handling has become a bit messy: MPI executions want hard failures to avoid deadlocks and resource wastage, while Python-based executions expect nice error handling. With the above suggestions, it would be reasonable to achieve both!

@iraikov (Contributor) commented Mar 23, 2021

@nrnhines @pramodk Thanks, it is definitely possible to create a Python exception object from within a C++ try-catch block; there are several tutorials online that show how to do it, which I have used in the past. I am not familiar enough with HOC internals, but if hoc_execerror itself becomes a C++ routine and the entire neuron extension module is compiled as C++, then the exception should propagate to the invoking Python wrapper, where it can be captured. But I imagine this would involve a try-catch each time the HOC interpreter is invoked, which may be a burdensome refactoring task.

@Helveg (Contributor, Author) commented Mar 23, 2021

I find that on occasion calling MPI_Abort causes some error output to not be written to the log file, which then makes debugging large MPI programs a difficult task.

I cannot stress this enough; I was VERY lucky this time that I had literally only changed one line in my code since the last successful simulation, but debugging large MPI simulations on a remote HPC server without an error log? A horrible experience :s

@nrnhines (Member) commented May 5, 2021

I was wondering whether the fact that, when Python is launched, hoc_execerror sends its output to Python for printing on Python's stderr is implicated in why the error output is not appearing prior to the execution of MPI_Abort. However, I'm not seeing such an issue on my desktop, so I wonder whether the issue is specific to the HPC machine you are using.

On my desktop I'm experimenting with:

$ cat test.py
from neuron import h
h.nrnmpi_init()

tvec = h.Vector()
tvec.record(h._ref_t)

The result:

hines@hines-T7500:~/neuron/mpiabort$ mpiexec -n 4 python test.py
numprocs=4
0 NEURON: Section access unspecified
1 NEURON: Section access unspecified
1  near line 0
2 NEURON: Section access unspecified
2  near line 0
2  objref hoc_obj_[2]
3 NEURON: Section access unspecified
3  near line 0
3  objref hoc_obj_[2]
0  near line 0
0  objref hoc_obj_[2]
1  objref hoc_obj_[2]
                   ^
        3 Vector[0].record(...)
                   ^
        0 Vector[0].record(...)
                   ^
        2 Vector[0].record(...)
                   ^
        1 Vector[0].record(...)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[hines-T7500:08521] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[hines-T7500:08521] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

So I'm puzzled why you are not seeing the stderr output from hoc_execerror from at least the rank that initiated the MPI_Abort.

@Helveg (Contributor, Author) commented May 5, 2021

However, I'm not seeing such an issue on my desktop, so I wonder whether the issue is specific to the HPC machine you are using.

This is the output on the HPC machine. Interestingly, with 1 MPI process the error survives, but with multiple processes it does not. Perhaps that is because NEURON doesn't call MPI_Abort with only 1 MPI process?

bp000347@daint103:~> debug python -c "from neuron import h; h.nrnmpi_init(); tvec = h.Vector(); tvec.record(h._ref_t)"
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
numprocs=1
NEURON: Section access unspecified
 near line 0
 objref hoc_obj_[2]
                   ^
        Vector[0].record(...)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: hoc error
srun: error: nid00835: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=31062679.0
bp000347@daint103:~> debug -n 4 python -c "from neuron import h; h.nrnmpi_init(); tvec = h.Vector(); tvec.record(h._ref_t)"
srun: job 31062691 queued and waiting for resources
srun: job 31062691 has been allocated resources
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Rank 2 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 2
Rank 3 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3
Rank 0 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Rank 1 [Wed May  5 16:06:10 2021] [c4-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
srun: error: nid00835: tasks 0-2: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=31062691.0
slurmstepd: error: *** STEP 31062691.0 ON nid00835 CANCELLED AT 2021-05-05T16:06:11 ***
srun: error: nid00835: task 3: Aborted (core dumped)

@nrnhines (Member) commented May 5, 2021

This is speculative, but I wonder whether, when nrnmpi_nhost > 1, NEURON should not ask Python to do the printing to stderr. If you think that is worth an experiment on your machine, it can be done with:

hines@hines-T7500:~/neuron/mpiabort/src/oc$ git diff
diff --git a/src/oc/fileio.cpp b/src/oc/fileio.cpp
index f2912cf42..5b51780dc 100755
--- a/src/oc/fileio.cpp
+++ b/src/oc/fileio.cpp
@@ -15,7 +15,7 @@
 #include       <errno.h>
 #include       "nrnfilewrap.h"
 #include    "nrnjava.h"
-
+#include    "nrnmpi.h"
 
 
 
@@ -894,6 +894,15 @@ static int vnrnpy_pr_stdoe(FILE* stream, const char *fmt, va_list ap) {
         return size;
     }
 
+    // On some machines, when running in parallel,
+    // having Python print the stderr message, and then
+    // an immediately subsequent MPI_Abort, 
+    // means the message is not printed. So...
+    if (nrnmpi_numprocs > 1 && stream == stderr) {
+        size = vfprintf(stream, fmt, ap);
+        return size;
+    }
+
     /* Determine required size */
     va_list apc;
 #ifndef va_copy

If that works around your issue, great. It does not prejudice the desirability of being able to recover from errors in the MPI context.

@Helveg (Contributor, Author) commented May 5, 2021

In Python we can do print("error message", flush=True), ensuring that the buffer is flushed before continuing. Is there no such thing you can do when you ask Python to print the message from the API?

And I can certainly experiment with things, but I most likely won't succeed in building NEURON from source on that machine ;p Can you open a PR to trigger a wheel?

@ramcdougal (Member) commented:

Note that flush=True is Python 3+ with no direct analogue in Python 2.

I'm assuming 8.0.x still supports Python 2.7, and it's only 8.1+ that will drop it?

@Helveg (Contributor, Author) commented May 5, 2021

I think the idea here is to make a quick patch to secure error propagation until the MPI_Abort behavior can be replaced by one of the proposals earlier in this thread, so using the flush kwarg if it is available seems reasonable. Also, large-scale sims on HPC systems where this error occurs will most likely ship recent Python versions in their Environment Modules.

@nrnhines (Member) commented May 5, 2021

do print("error message", flush=True)

That you can experiment with without compiling. Just reach into the installed wheel's neuron/__init__.py and consider:

def nrnpy_pr(stdoe, s):
  if stdoe == 1:
    sys.stdout.write(s.decode())
  else:
    sys.stderr.write(s.decode())
  return 0

@ramcdougal (Member) commented:

So... would it suffice to just add a

sys.stderr.flush()

inside the else case?
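
For concreteness, that would make the fragment above look like the following (a sketch of the proposed edit to neuron/__init__.py, not the merged change):

import sys

def nrnpy_pr(stdoe, s):
    if stdoe == 1:
        sys.stdout.write(s.decode())
    else:
        sys.stderr.write(s.decode())
        # Flush so the message is emitted before a possible MPI_Abort.
        sys.stderr.flush()
    return 0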

@nrnhines (Member) commented May 5, 2021

The speculation is that the problem will go away if Helveg uses the fragment I sent to avoid delegating the stderr print to Python. But it is better if it can just be done in Python with a flush. I can't experiment myself because I have no machine that exhibits the problem.

would it suffice

Here's hoping:)

@Helveg (Contributor, Author) commented May 5, 2021

Yeah, but I really don't have the mental bandwidth to go figure out how to get that piece of code there and then get it built from source on the HPC; that could take me days to figure out 😞 Can we make a wheel? The easiest way I can think of is with a PR with that piece of code in it.

@ramcdougal (Member) commented May 5, 2021

Try my version then, which doesn't require a build from source, just a one-line addition in a Python file.

@Helveg (Contributor, Author) commented May 5, 2021

Oh sorry, I missed nrnhines's post with the Python code! Will try.

@ramcdougal (Member) commented May 5, 2021

You can change the default printing, without modifying NEURON itself at all, with a bit of messiness using nrnpy_pr_proto, nrnpy_pass_callback, and nrnpy_set_pr_etal, as in the following:

import sys
from neuron import h
from neuron import nrnpy_pr_proto, nrnpy_pass_callback, nrnpy_set_pr_etal


def my_print(stdoe, s):
    stream = sys.stdout if stdoe else sys.stderr
    print("inside myprint!")
    stream.write(s.decode())
    stream.flush()
    # the following is required
    return 0


nrnpy_pr_callback = nrnpy_pr_proto(my_print)
nrnpy_set_pr_etal(nrnpy_pr_callback, nrnpy_pass_callback)

soma = h.Section("soma")
h.topology()

This outputs:

% python neuron-print-redirect.py
inside myprint!

inside myprint!
|inside myprint!
-inside myprint!
|       soma(0-1)
inside myprint!

(Obviously, in real use, you'd leave out the printing of "inside myprint!", but it's included here to demo that we're customizing printing.)

@Helveg (Contributor, Author) commented May 5, 2021

Great success with:

import sys
from neuron import h
from neuron import nrnpy_pr_proto, nrnpy_pass_callback, nrnpy_set_pr_etal


def my_print(stdoe, s):
    stream = sys.stdout if stdoe else sys.stderr
    stream.write(s.decode())
    stream.flush()
    # the following is required
    return 0


nrnpy_pr_callback = nrnpy_pr_proto(my_print)
nrnpy_set_pr_etal(nrnpy_pr_callback, nrnpy_pass_callback)

from neuron import h
h.nrnmpi_init()

tvec = h.Vector()
tvec.record(h._ref_t)
The output:

bp000347@daint104:/scratch/snx3000/bp000347> debug -n 1 python ntest.py
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
NEURON: Section access unspecified
 near line 0
 objref hoc_obj_[2]
                   ^
        Vector[0].record(...)
numprocs=1
Traceback (most recent call last):
  File "ntest.py", line 21, in <module>
    tvec.record(h._ref_t)
RuntimeError: hoc error
srun: error: nid00009: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=31064762.0
bp000347@daint104:/scratch/snx3000/bp000347> debug -n 2 python ntest.py
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
Warning: no DISPLAY environment variable.
--No graphics will be displayed.
0 NEURON: Section access unspecified
0  near line 0
0  objref hoc_obj_[2]
1 NEURON: Section access unspecified
1  near line 0
1  objref hoc_obj_[2]
                   ^
                   ^
        1 Vector[0].record(...)
        0 Vector[0].record(...)
Rank 1 [Wed May  5 18:26:56 2021] [c0-0c0s2n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
Rank 0 [Wed May  5 18:26:56 2021] [c0-0c0s2n1] application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
srun: error: nid00009: task 0: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=31064765.0
srun: error: nid00009: task 1: Aborted

Even in the second case, where before the error was swallowed, it is now flushed and shown before the MPI_Abort.

@nrnhines (Member) commented May 5, 2021

Excellent. It seems worthwhile to make the change in __init__.py in a simple PR. I can do that unless someone else would rather.

@ramcdougal (Member) commented:

Agreed, except I think we should only force the flush on stderr, not on everything as was done here, because buffering does have its advantages.

Incidentally, this becomes a non-issue beginning with Python 3.9, which line-buffers stderr by default: https://docs.python.org/3.9/whatsnew/3.9.html#sys
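
If one wanted to avoid redundant flushes on newer interpreters, a sketch of such a guard (line_buffering is an attribute of io.TextIOWrapper; the getattr default covers redirected streams):

import sys

# Flush stderr only when it is not already line-buffered
# (line buffering is the default for stderr from Python 3.9 on).
if not getattr(sys.stderr, "line_buffering", False):
    sys.stderr.flush()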

@nrnhines (Member) commented May 5, 2021

becomes a non-issue beginning with Python 3.9

Darn. I was testing on my machine with Python 3.9; I don't know whether I would have seen the issue with another version. But it works out, because I prefer the change to __init__.py over my change to fileio.cpp.

@Helveg (Contributor, Author) commented May 5, 2021

I'll make the PR :)

@iraikov (Contributor) commented May 5, 2021

Is it possible to use the logging framework for this purpose? Since I started using logging in my own MPI code, I have never had any issues with missing messages even with very large MPI jobs.
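
A minimal sketch of what that could look like in an MPI script; mpi4py is an assumption here (any way of obtaining the rank would do), and logging.StreamHandler flushes after every record it emits:

import logging
import sys

from mpi4py import MPI  # assumption: mpi4py is available alongside NEURON

rank = MPI.COMM_WORLD.Get_rank()
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format=f"[rank {rank}] %(levelname)s: %(message)s",
)
log = logging.getLogger("sim")
log.info("simulation starting")  # flushed immediately by the StreamHandler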

Helveg added a commit that referenced this issue May 6, 2021
This ensures that in MPI simulations the error message is flushed before 
MPI_Abort is called, as discussed in #1112
ramcdougal pushed a commit that referenced this issue May 6, 2021
This ensures that in MPI simulations the error message is flushed before 
MPI_Abort is called, as discussed in #1112