MPI_ABORT called when accessing h._ref_t
#1112
You can check to see if it's related to the try-except by removing the try-except and just directly creating your dummy section if there are no existing sections:

```python
@property
def time(self):
    if not hasattr(self, "_time"):
        self._time = self.Vector()
        if not any(h.allsec()):
            self.__dud_section = self.Section(name="this_is_here_to_record_time")
        self._time.record(self._ref_t)
    return self._time
```

Depending on what versions of Python you're targeting, consider using an
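One option along those lines, as an illustration only (`Recorder` and `_make_time_vector` are made-up names, and `functools.cached_property` requires Python 3.8+):

```python
from functools import cached_property


class Recorder:
    """Toy stand-in for the wrapper discussed above (names hypothetical)."""

    def _make_time_vector(self):
        # placeholder for creating self.Vector() and calling .record(...)
        return [0.0]

    @cached_property
    def time(self):
        # computed once on first access, then cached on the instance
        return self._make_time_vector()
```

With `cached_property`, the `hasattr` bookkeeping disappears: the first access runs the method body and later accesses return the stored value.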
All hoc_execerror calls are fatal when running under MPI (they call MPI_Abort). I don't know how hoc_execerror itself could know whether it is being executed from within a try, other than requiring the author of the try to also set a flag to avoid the MPI_Abort.
Thanks! I only recently discovered this.
Are there no catchable alternatives? There are also some less preferable solutions that I can think of:
Exception handling is essential to Python users, as Python promotes an EAFP style. Another small PS: even without my try/except/catch-hoc-error code, the error message isn't propagated. Is that something to do with my MPI settings/stdout-flushing, or on NEURON's part?
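For readers unfamiliar with the term, EAFP ("easier to ask forgiveness than permission") looks like this in plain Python (the config dict here is made up):

```python
# EAFP: attempt the operation and handle the failure, rather than
# checking preconditions up front (the LBYL style)
config = {"dt": 0.025}

try:
    tstop = config["tstop"]  # key may be missing
except KeyError:
    tstop = 5.0  # fall back to a default
```

The MPI_Abort behavior breaks this idiom: the `except` clause never gets a chance to run because the process is already gone.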
I'm not aware of any, which does not mean there are none. One reason MPI_Abort is called (and also if nothing advances for a settable interval during a psolve) is to avoid using too much of a user's HPC account limit (the job timeout assumes a successful job). That got eaten up fairly quickly when there were failing runs with several hundred thousand ranks on a BlueGene. I like the idea of an at_exit, but I'm a little shaky about how to get it to work. I suppose it would be restricted to an MPI run and when Python is involved.
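The at_exit idea might be sketched in plain Python like this. An important caveat: `atexit` hooks run only at normal interpreter shutdown and do NOT fire when MPI_Abort kills the process, so this shows only the registration mechanics:

```python
import atexit
import sys


def flush_streams():
    # make sure buffered output reaches the log before the process exits
    sys.stdout.flush()
    sys.stderr.flush()


# runs at normal interpreter shutdown; MPI_Abort bypasses Python cleanup,
# so this alone does not solve the swallowed-error problem
atexit.register(flush_streams)
```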
When Python is the launched program, the error is printed to Python's stderr, which is capturable. (See nrn/share/lib/python/neuron/expect_hocerr.py, which I'm using to help with code coverage.)
Is there any way to raise HOC errors as C++ exceptions, or to use a C library for exception handling? Then it would be relatively easy to catch them and translate them to Python exceptions. I find that on occasion calling MPI_Abort causes some error output not to be written to the log file, which makes debugging large MPI programs a difficult task.
@iraikov Perhaps that is possible now that all the code is compiled via C++. I don't have any familiarity with that style and so have no feeling for the difficulty of the transition.
I think these are good points. The error handling has become a bit messy: MPI executions want hard failures to avoid deadlock and resource wastage, while Python-based executions expect nice error handling. With the above suggestions, it should be reasonable to achieve both!
@nrnhines @pramodk Thanks, it is definitely possible to create a Python exception object from within a C++ try-catch block; there are several tutorials online showing how, which I have used in the past. I am not familiar enough with HOC internals, but if hoc_execerror itself becomes a C++ routine and the entire NEURON extension module is compiled via C++, then the exception should propagate to the invoking Python wrapper, where it can be captured. But I imagine this would involve a try-catch each time the HOC interpreter is invoked, which may be a burdensome refactoring task.
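At the Python level, the translation pattern being described might look like this sketch (`HocError` and `call_interpreter` are hypothetical stand-ins; the real translation work would live in the C++ layer):

```python
class HocError(RuntimeError):
    """Hypothetical Python-side exception standing in for hoc_execerror."""


def call_interpreter(fail: bool):
    # stand-in for invoking the HOC interpreter: a failure is raised as a
    # catchable Python exception instead of calling MPI_Abort
    if fail:
        raise HocError("hoc_execerror (translated)")
    return 0
```

Callers could then use ordinary `try`/`except HocError` around interpreter calls and decide themselves whether to recover or abort.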
I cannot stress this enough; I was VERY lucky this time that I had literally only changed one line in my code since the last successful simulation, but debugging large MPI simulations on a remote HPC server without an error log? Horrible experience :s
I was wondering if the fact that, when Python is launched, hoc_execerror sends its output to Python for printing on Python's stderr is implicated in why the error output does not appear prior to the execution of MPI_Abort. However, I'm not seeing such an issue on my desktop, so I wonder whether the issue is specific to the HPC machine you are using. On my desktop I'm experimenting with:

The result:

So I'm puzzled why you are not seeing the stderr output from hoc_execerror from at least the rank that initiated the MPI_Abort.
This is the output on the HPC machine. Interestingly, with 1 MPI process the error survives; with multiple it does not. But perhaps that is because NEURON doesn't call
This is speculative, but I wonder if, when nrnmpi_nhost > 1, NEURON should not ask Python to do the printing to stderr. If you think that is worth an experiment on your machine, it can be done with:

If that works around your issue, great. It does not prejudice the desirability of being able to recover from errors in the MPI context.
In Python we can do that. And I can for sure experiment with things, but I most likely won't succeed in building NEURON from source on that machine ;p Can you open a PR to trigger a wheel?
Note that I'm assuming 8.0.x still supports Python 2.7, and it's only 8.1+ that will drop it?
I think the idea here is just to make a quick patch to secure error propagation until the
That you can experiment with without compiling. Just reach into the installed wheel to
So... would it suffice to just add a sys.stderr.flush() inside the
The speculation is that the problem will go away if Helveg uses the fragment I sent to avoid delegating the stderr print to Python. But it would be better if it can just be done in Python with a flush. I can't experiment myself because I have no machine that exhibits the problem.

Here's hoping :)
Yea, but I really don't have the mental bandwidth to go figure out how to get that piece of code there and then get it built from source on the HPC; that could take me days to figure out 😞 Can we make a wheel? The easiest way I can think of is a PR with that piece of code in it.
Try my version then, which doesn't require a build from source, just a one-line addition in a Python file.

Oh sorry, I missed nrnhines's post with the Python code! Will try.
You can change the default printing, without modifying NEURON itself at all, with a bit of messiness:

```python
import sys
from neuron import h
from neuron import nrnpy_pr_proto, nrnpy_pass_callback, nrnpy_set_pr_etal

def my_print(stdoe, s):
    stream = sys.stdout if stdoe else sys.stderr
    print("inside myprint!")
    stream.write(s.decode())
    stream.flush()
    # the following is required
    return 0

nrnpy_pr_callback = nrnpy_pr_proto(my_print)
nrnpy_set_pr_etal(nrnpy_pr_callback, nrnpy_pass_callback)

soma = h.Section("soma")
h.topology()
```

This outputs:
(Obviously, in real use, you'd leave out the printing of "inside myprint!".)
Great success with:

```python
import sys
from neuron import h
from neuron import nrnpy_pr_proto, nrnpy_pass_callback, nrnpy_set_pr_etal

def my_print(stdoe, s):
    stream = sys.stdout if stdoe else sys.stderr
    stream.write(s.decode())
    stream.flush()
    # the following is required
    return 0

nrnpy_pr_callback = nrnpy_pr_proto(my_print)
nrnpy_set_pr_etal(nrnpy_pr_callback, nrnpy_pass_callback)

h.nrnmpi_init()
tvec = h.Vector()
tvec.record(h._ref_t)
```
Even in the second case, where before the error was swallowed, it is now flushed and shown before the MPI_Abort.
Excellent. Seems worthwhile to make the change in
Agreed, except I think we should only force the flush on stderr, not on everything as was done here, because buffering does have its advantages. Incidentally, this becomes a non-issue beginning with Python 3.9, which by default doesn't buffer stderr. https://docs.python.org/3.9/whatsnew/3.9.html#sys
Darn. I was testing on my machine with Python 3.9; I don't know whether I would have seen the issue with another version. But it works out, because I prefer the change to
I'll make the PR :)
Is it possible to use the logging framework for this purpose? Since I started using logging in my own MPI code, I have never had any issues with missing messages, even with very large MPI jobs.
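A minimal sketch of such a setup (the logger name and `rank` field are assumptions; in real MPI code the rank would come from the communicator, e.g. mpi4py's `comm.Get_rank()`):

```python
import logging
import sys

# logging.StreamHandler flushes after every record it emits, so messages
# are less likely to be lost if the job is killed abruptly
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(
    logging.Formatter("[rank %(rank)s] %(levelname)s: %(message)s")
)

logger = logging.getLogger("mpi_sim")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# rank hardcoded here purely for illustration
logger.info("starting run", extra={"rank": 0})
```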
This ensures that in MPI simulations the error message is flushed before MPI_Abort is called, as discussed in #1112
The following error occurs when I comment/uncomment access to a property on my wrapper that records ._ref_t. This is the code of the property:

So probably because I catch the exception that occurs in #416 and continue with the simulation, there is some fatal error later on.