faulthandler causes strange behaviour changes in my c++ extension #4
Comments
You should try to isolate which part of faulthandler modifies the behaviour of your application. For example, try just importing the module without calling faulthandler.enable(). Then try modifying faulthandler.c to disable parts of the code (recompile and reinstall faulthandler afterwards: python setup.py install). For example, try disabling some of the signals in faulthandler_handlers (remove them from the list). Does your program crash without faulthandler? Do you have a memory issue (not enough memory)? Did you run your program in a debugger like gdb (without faulthandler)? The main effect of faulthandler.enable() is to replace the handlers of the SIGBUS, SIGILL, SIGFPE, SIGABRT and SIGSEGV signals. Does your program already handle one of these signals? Maybe SIGFPE, to handle FPU errors? |
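A minimal sketch of the isolation test suggested above, assuming Python 3 (where faulthandler ships in the standard library, unlike the separately installed package discussed in this thread). `CALCULATION` is a hypothetical stand-in for the real call into the extension, and the variant names are illustrative only. The idea is to run the same code in fresh interpreters with no faulthandler, with only the import, and with faulthandler.enable(), then compare what each one prints:

```python
import subprocess
import sys

# Hypothetical stand-in for the real calculation; replace it with a call
# into the extension under test.
CALCULATION = "print(sum(i * i for i in range(1000)))"

# Preambles prepended to the calculation, from least to most intrusive.
VARIANTS = {
    "baseline": "",
    "import_only": "import faulthandler\n",
    "enabled": "import faulthandler\nfaulthandler.enable()\n",
}


def run_variant(preamble):
    """Run preamble + CALCULATION in a fresh interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", preamble + CALCULATION],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


def compare_variants():
    """Return a dict mapping each variant name to its printed result."""
    return {name: run_variant(preamble) for name, preamble in VARIANTS.items()}


if __name__ == "__main__":
    results = compare_variants()
    for name, output in results.items():
        marker = "" if output == results["baseline"] else "  <-- differs"
        print(f"{name:12s} {output}{marker}")
```

Running each variant in its own process keeps one run's heap and signal state from leaking into the next, which matters when the symptom is a changed result and not a crash.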
This is not related to a crash. (I was using faulthandler earlier to catch segfaults, but they're not happening anymore - I just left the import in just in case.) It is changing the behaviour of what I thought was a deterministic, data-driven calculation. It's quite important (for both of us, I guess) to work out where our code clashes, so that I can fix my (apparently rather bugged) calculation and you can make faulthandler more robust (or provide guidance for extension developers on how not to repeat my mistake!). I'm gathering data on it for now - I'll post more info when I have some.
header files
For now, I can say that the only headers my code includes are:
with Cython automatically adding:
It looks like OpenMP was inside an #ifdef, so it may not actually be included at compile time, and stddef, I think, just uses a macro called offsetof to help define struct shapes. Anyway, all of this is standard Cython, so I imagine that it won't be a problem (if your tool didn't work with Cython you would have more bug reports by now!). Do any of these ring alarm bells? I'll take a look at your code in the morning (UK time) and see if I can spot the clash. |
The following is enough to cause the behaviour change: import faulthandler
I've put
Taking a look at the method, the most controversial thing it does is to allocate its own stack in case of stack overflow. So, adding this to the top of the file (under the system includes):
#ifdef HAVE_SIGALTSTACK
#undef HAVE_SIGALTSTACK
#endif
and compiling without stack overflow support does indeed seem to revert the behaviour to the original case. Thus, the offending code is the following chunk:
#ifdef HAVE_SIGALTSTACK
    /* Try to allocate an alternate stack for faulthandler() signal handler to
     * be able to allocate memory on the stack, even on a stack overflow. If it
     * fails, ignore the error. */
    stack.ss_flags = 0;
    stack.ss_size = SIGSTKSZ;
    stack.ss_sp = PyMem_Malloc(stack.ss_size);
    if (stack.ss_sp != NULL) {
        err = sigaltstack(&stack, NULL);
        if (err) {
            PyMem_Free(stack.ss_sp);
            stack.ss_sp = NULL;
        }
    }
#endif
The only thing I can see that could possibly affect anything outside the scope of this file is the call to
Since my code uses no signals anywhere (see the header files used above - I don't
Do you have any ideas? |
When I read your first message, I already suspected sigaltstack(). What is the value of SIGSTKSZ on your platform? (Add a printf.) Try with a larger stack: stack.ss_size = SIGSTKSZ * 100;
A library may use signals internally. For example, the D-Bus service uses a real-time signal. Alarms or timeouts may also use SIGALRM. Please try your program in a debugger like gdb. The debugger will notify you of signals. |
SIGSTKSZ = 131072 (printf'd with %d, assuming it was signed, which the compiler thinks is true.) I'll see if I can get system gdb (6) to play nice with python... |
GDB
OK, gdb shows no difference in signals between the two versions; indeed, as I suspected, my numerical code triggers no signals at all while running (I checked briefly that C
There are a bunch of "couldn't find debugging symbols" warnings for the Fortran runtime and other
There is one signal triggered during the entire runtime, which is SIGTRAP, which seems to be a default "oh, did you want to debug me?" point at the start of any script (it is also triggered for a file containing print "hello world!".)
SIGSTKSZ * 100 seems to have solved the discrepancy
However, I am still paranoid as to why this could cause my RAM-heavy but mostly heap-allocated calculation to behave differently - I don't know which behaviour is correct, you see! |
> There is one signal triggered during the entire runtime, which is SIGTRAP, which seems to be a default "oh, did you want to debug me?" point at the start of any script (it is also triggered for a file containing print "hello world!".)
What? It's strange that you get a SIGTRAP. What is a script in your program? A Python script? What sends the SIGTRAP signal? Or did you press CTRL+c during the execution? faulthandler doesn't replace the handler of SIGTRAP. SIGTRAP may be used by gdb (I don't know what is used on Mac OS X).
> SIGSTKSZ * 100 seems to have solved the discrepancy
131 KB should be enough; it's surprising that your program is crashing with a stack of 131 KB. A simple workaround is to not allocate a stack dedicated to signal handlers (e.g. #undef HAVE_SIGALTSTACK). But it would be nice if you could investigate the issue, because other people may hit this bug. |
I'm not worried about the SIGTRAP - as I said, it's raised for all Python scripts, and may be something in Apple's Python 2.7 binary that I'm using (I can try with MacPorts 2.7 later and see if I can repeat it). I didn't press Ctrl+C at the speed of light, and wouldn't that be SIGINT and/or a KeyboardInterrupt error in Python-land anyway? To clarify again, my program is not crashing - it's calculating a different numerical answer! To underline this, I'll copy and paste from the Stack Overflow question: with faulthandler:
however, without faulthandler (and now, with the patch):
Surely the Sig Alt Stack is just that - an alternative stack which is only used for handling signals. This stack will literally never be used (the SIGTRAP that's called is before faulthandler is loaded; before it runs a single line of python) and yet it seems to affect my program's data. I think the next step is for me to investigate where exactly my routines diverge with and without the small sigaltstack - as you say, I need to understand this behaviour both for the accuracy of any results I get and so I can communicate the differences to everyone else! oh well, verbose logging and |
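One way to find where the two runs diverge is to log the same intermediate values in both and look for the first line that differs. The sketch below assumes each run has written one value per line; the helper name `first_divergence` and the sample log lines are hypothetical and do not come from the actual program.

```python
from itertools import zip_longest


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, or None.

    If one log is shorter, the missing side is reported as None.
    """
    for index, (line_a, line_b) in enumerate(zip_longest(log_a, log_b)):
        if line_a != line_b:
            return index, line_a, line_b
    return None


if __name__ == "__main__":
    # Hypothetical log lines; real ones would come from the two runs.
    with_faulthandler = ["step 0: 1.0", "step 1: 2.5", "step 2: 7.25"]
    without_faulthandler = ["step 0: 1.0", "step 1: 2.5", "step 2: 7.5"]
    print(first_divergence(with_faulthandler, without_faulthandler))
```

The step just before the reported index is the last point where both runs still agree, so the routine that produces the next value is the place to add finer-grained logging.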
Thanks for your help in debugging this - The problem was not a bug in faulthandler. I was using uninitialised RAM; clearly the |
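As a purely illustrative toy model (not the actual code from this issue), the sketch below shows one plausible way that reading uninitialised memory can make a result depend on an unrelated earlier allocation. The "heap" is a plain Python list pre-filled with stale values, and the names `make_heap`, `buggy_sum` and `fixed_sum` are hypothetical. An extra allocation made before the calculation shifts where the calculation's buffer lands, so the buggy version reads different garbage.

```python
def make_heap(size):
    """Model a heap holding stale values left behind by earlier use."""
    return [float(address % 7) for address in range(size)]


def buggy_sum(heap, start, count):
    """Bug: sums the buffer without ever writing to it first."""
    return sum(heap[start:start + count])


def fixed_sum(heap, start, count):
    """Initialise the buffer (as calloc or memset would) before using it."""
    heap[start:start + count] = [0.0] * count
    return sum(heap[start:start + count])


if __name__ == "__main__":
    buffer_len = 4
    extra_allocation = 3  # hypothetical size of an unrelated earlier allocation

    # Without the extra allocation the buffer starts at address 0;
    # with it, the buffer is pushed along to address 3.
    print(buggy_sum(make_heap(16), 0, buffer_len))
    print(buggy_sum(make_heap(16), extra_allocation, buffer_len))
    print(fixed_sum(make_heap(16), 0, buffer_len))
    print(fixed_sum(make_heap(16), extra_allocation, buffer_len))
```

In this model the buggy sums differ between the two layouts while the initialised versions agree, which mirrors a calculation whose answer changes even though the extra allocation is never touched by the calculation itself.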
see the stackoverflow question here for details, and let me know which parts of the code you'd like to see (e.g. a list of system functions that I call from C++?)
I'm using faulthandler 2.1 on OS X Lion.