This repository has been archived by the owner on Apr 24, 2020. It is now read-only.

faulthandler causes strange behaviour changes in my c++ extension #4

Closed
joe-jordan opened this issue Dec 4, 2012 · 9 comments

Comments

@joe-jordan

see the stackoverflow question here for details, and let me know which parts of the code you'd like to see (e.g. a list of system functions that I call from C++?)

I'm using faulthandler 2.1 on OS X Lion.

@vstinner
Owner

vstinner commented Dec 4, 2012

You should try to isolate which part of faulthandler modifies the behaviour of your application. For example, try just importing the module without calling faulthandler.enable(). Then try modifying faulthandler.c to disable some parts of the code (recompile and reinstall faulthandler each time: python setup.py install). For example, try disabling some of the signals in faulthandler_handlers (remove them from the list).
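
The first isolation step (import only versus import plus enable) can be scripted. The following is a minimal sketch, assuming Python 3.7+ with the standard-library faulthandler rather than the 2.1 backport discussed in this issue; `run_variant`, `compare_variants` and `VARIANTS` are hypothetical names, and `body` stands in for the code whose output you want to compare.

```python
import subprocess
import sys

# Preludes to prepend to the code under test (hypothetical names).
VARIANTS = {
    "baseline": "",
    "import only": "import faulthandler",
    "import + enable": "import faulthandler; faulthandler.enable()",
}


def run_variant(prelude, body):
    """Run prelude + body in a fresh interpreter and return its stdout."""
    code = prelude + "\n" + body
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def compare_variants(body):
    """Map each variant name to True if its stdout matches the baseline."""
    outputs = {name: run_variant(prelude, body) for name, prelude in VARIANTS.items()}
    baseline = outputs["baseline"]
    return {name: out == baseline for name, out in outputs.items()}


if __name__ == "__main__":
    # Replace this body with the code that drives the calculation.
    print(compare_variants("print(sum(range(10)))"))
```

Any variant that maps to False is the first place to look. Each variant runs in a fresh process so that one import cannot leak into the next run.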

Does your program crash without faulthandler? Do you have a memory issue (not enough memory)? Did you run your program in a debugger like gdb (without faulthandler)?

The main effect of faulthandler.enable() is to replace the handlers of the SIGBUS, SIGILL, SIGFPE, SIGABRT and SIGSEGV signals. Does your program already handle one of these signals? Maybe SIGFPE, to handle FPU errors?
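
One quick, partial check for existing handlers is to ask Python's signal module what it has registered for those five signals. This is only a sketch: `python_level_handlers` is a hypothetical helper, and signal.getsignal() reports only what the signal module itself knows. As far as I know, a handler installed directly from C or C++ code does not show up this way, so a debugger is still needed for that case.

```python
import signal

# The signals named above; some do not exist on every platform.
FATAL_SIGNAL_NAMES = ("SIGBUS", "SIGILL", "SIGFPE", "SIGABRT", "SIGSEGV")


def python_level_handlers():
    """Return {signal name: handler} as seen by Python's signal module."""
    report = {}
    for name in FATAL_SIGNAL_NAMES:
        signum = getattr(signal, name, None)
        if signum is None:
            # Signal not defined on this platform (e.g. SIGBUS on Windows).
            continue
        report[name] = signal.getsignal(signum)
    return report


for name, handler in python_level_handlers().items():
    print(name, handler)
```

A value of signal.SIG_DFL means Python has not installed anything for that signal; a callable means some Python code has.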

@joe-jordan
Author

this is not related to a crash (I was using faulthandler earlier to catch segfaults, but they're not happening anymore - I just left the import in just in case.)

It is changing the behaviour of what I thought was a deterministic, data-driven calculation - it's quite important (for both of us I guess) to work out where our code clashes so that I can fix my (apparently rather bugged) calculation and you can make faulthandler more robust (or provide guidance for extension developers about how to not repeat my mistake!)

I'm gathering data on it for now - I'll post more info when I have some.

header files

For now, I can say that the only headers my code includes are:

assert.h
stdio.h
limits.h
math.h
vector
stdlib.h

with Cython automatically adding:

omp.h
stddef.h
structmember.h

It looks like omp.h is inside an #ifdef, so it may not actually be included at compile time; stddef.h is, I think, only used for the offsetof macro, which helps define struct layouts. Anyway, all of this is standard Cython, so I imagine that it won't be a problem (if your tool didn't work with Cython you would have more bug reports by now!)

Do any of these ring alarm bells? I'll take a look at your code in the morning (UK time) and see if I can see the clash.

@joe-jordan
Author

the following is enough to cause the behaviour change:

import faulthandler

I've put printf() statements inside all the methods, and it seems that the import calls only initfaulthandler().

Taking a look at the method, the most controversial thing it does is to allocate its own stack in case of stack overflow. So, having added this to the top of the file (under the system includes):

#ifdef HAVE_SIGALTSTACK
#undef HAVE_SIGALTSTACK
#endif

and compiling without stackoverflow support does indeed seem to revert the behaviour to the original case.

Thus, the offending code is the following chunk:

#ifdef HAVE_SIGALTSTACK
    /* Try to allocate an alternate stack for faulthandler() signal handler to
     * be able to allocate memory on the stack, even on a stack overflow. If it
     * fails, ignore the error. */
    stack.ss_flags = 0;
    stack.ss_size = SIGSTKSZ;
    stack.ss_sp = PyMem_Malloc(stack.ss_size);
    if (stack.ss_sp != NULL) {
        err = sigaltstack(&stack, NULL);
        if (err) {
            PyMem_Free(stack.ss_sp);
            stack.ss_sp = NULL;
        }
    }
#endif

The only thing I can see that could possibly affect anything outside the scope of this file is the call to sigaltstack() (man page) which assigns the allocated stack pointer to be used for signal handlers.

Since my code uses no signals anywhere (see the header files used above - I don't #include <signal.h>!) I can't see how this could possibly affect anything.

Do you have any ideas?

@vstinner
Owner

vstinner commented Dec 5, 2012

When I read your first message, I already suspected sigaltstack(). What is the value of SIGSTKSZ on your platform? (Add a printf.) Try with a larger stack: stack.ss_size = SIGSTKSZ * 100;

Since my code uses no signals anywhere (...)

A library may use signals internally. For example, the D-Bus service uses a real-time signal. Alarms or timeouts may also use SIGALRM.

Please try your program in a debugger like gdb. The debugger will notify you of signals.

@joe-jordan
Author

SIGSTKSZ = 131072 (printf'd with %d, assuming it was signed, which the compiler thinks is true.)

I'll see if I can get system gdb (6) to play nice with python...

@joe-jordan
Author

GDB

OK, gdb shows no difference in signals between the two versions; indeed, as I suspected, my numerical code triggers no signals at all while running (I checked briefly that C assert() wasn't using signals, but even with all calls to it disabled I still see the behaviour change.)

There are a bunch of "couldn't find debugging symbols" warnings for the fortran runtime and other stdlib code, but these are loaded before faulthandler and occur whether or not it is loaded.

There is one signal triggered during the entire runtime, which is SIGTRAP, which seems to be a default "oh, did you want to debug me?" point at the start of any script (it is also triggered for a file containing print "hello world!".)

SIGSTKSZ

SIGSTKSZ * 100 seems to have solved the discrepancy (it also doesn't seem to use more than 100k more RAM, either, which means that SIGSTKSZ = 131072 makes even less sense than I thought (it is not bytes.))

However, I am still paranoid as to why this could cause my RAM-heavy but mostly heap allocated calculation to behave differently - I don't know which behaviour is correct, you see!

@vstinner
Owner

vstinner commented Dec 5, 2012

There is one signal triggered during the entire runtime, which is SIGTRAP, which seems to be a default "oh, did you want to debug me?" point at the start of any script (it is also triggered for a file containing print "hello world!".)

What? It's strange that you get a SIGTRAP. What is a script in your program? A Python script? What sends the SIGTRAP signal? Or did you press CTRL+c during the execution?

faulthandler doesn't replace the handler of SIGTRAP. SIGTRAP may be used by gdb (I don't know what is used on Mac OS X).

SIGSTKSZ * 100 seems to have solved the discrepancy

131 KB should be enough, it's surprising that your program is crashing with a stack of 131 KB.

A simple workaround is to not allocate a stack dedicated to signal handlers (ex: #undef HAVE_SIGALTSTACK). But it would be nice if you can investigate the issue, because other people may hit this bug.

@joe-jordan
Author

I'm not worried about the SIGTRAP - as I said, it's raised for all Python scripts, and may be something in Apple's Python 2.7 binary that I'm using (I can try with MacPorts 2.7 later and see if I can repeat it.) I didn't press Ctrl+C at the speed of light, and wouldn't that be SIGINT and/or a KeyboardInterrupt Python-land error anyway?

to clarify again, my program is not crashing - it's calculating a different numerical answer! to underline this, I'll copy and paste from the stackoverflow bug:

with faulthandler:

grain count: 1434
seemed to have 8000000 voxels, with average value 0.8398655
find cells:
running watershed algorithm...
found 1242 cells from 1434 original grains!
...

however, without faulthandler (and now, with the patch):

grain count: 1434
seemed to have 8000000 voxels, with average value 0.8398655
find cells:
running watershed algorithm...
found 927 cells from 1434 original grains!
...

Surely the Sig Alt Stack is just that - an alternative stack which is only used for handling signals. This stack will literally never be used (the SIGTRAP that's called is before faulthandler is loaded; before it runs a single line of python) and yet it seems to affect my program's data.

I think the next step is for me to investigate where exactly my routines diverge with and without the small sigaltstack - as you say, I need to understand this behaviour both for the accuracy of any results I get and so I can communicate the differences to everyone else!

oh well, verbose logging and diff it is...
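
A minimal sketch of that log-and-diff step, using only the output lines quoted above as sample data. `first_divergence` is a hypothetical helper; in practice the two lists would be read from the verbose logs of the two runs.

```python
import difflib
from itertools import zip_longest


def first_divergence(log_a, log_b):
    """Return (line number, line_a, line_b) for the first differing line, or None."""
    for number, (a, b) in enumerate(zip_longest(log_a, log_b), start=1):
        if a != b:
            return number, a, b
    return None


# Sample data: the two outputs quoted earlier in this comment.
with_fh = [
    "grain count: 1434",
    "seemed to have 8000000 voxels, with average value 0.8398655",
    "find cells:",
    "running watershed algorithm...",
    "found 1242 cells from 1434 original grains!",
]
without_fh = with_fh[:-1] + ["found 927 cells from 1434 original grains!"]

print(first_divergence(with_fh, without_fh))

# A unified diff is easier to scan once the logs get long.
for line in difflib.unified_diff(
    with_fh, without_fh,
    fromfile="with_faulthandler", tofile="without_faulthandler",
    lineterm="",
):
    print(line)
```

The more intermediate values the logs contain, the closer the first differing line will sit to the routine where the two runs actually diverge.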

@joe-jordan
Author

Thanks for your help in debugging this - The problem was not a bug in faulthandler.

I was using uninitialised RAM; clearly the malloc() call in faulthandler's initialisation was affecting which random garbage I was handed on different runs, and thus which spurious result I obtained (or something - I still have no idea why the results were apparently repeatable on two different machines!)
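
The effect can be illustrated with a toy model. This is not how malloc() works internally and not faulthandler's code: `ToyHeap` is a hypothetical bump allocator over a buffer that still holds stale bytes, used only to show how one extra allocation can shift which leftover data an uninitialised buffer ends up reading.

```python
class ToyHeap:
    """Bump allocator over memory that still contains stale bytes (a model only)."""

    def __init__(self, stale):
        self._mem = bytearray(stale)
        self._top = 0

    def malloc(self, n):
        """Hand out the next n bytes without clearing them."""
        if self._top + n > len(self._mem):
            raise MemoryError("toy heap exhausted")
        block = memoryview(self._mem)[self._top:self._top + n]
        self._top += n
        return block

    def calloc(self, n):
        """Hand out the next n bytes, zeroed first."""
        block = self.malloc(n)
        block[:] = bytes(n)
        return block


def buggy_sum(heap, n):
    """Sum a buffer that was never initialised, like the bug described above."""
    return sum(heap.malloc(n))


stale = bytes(range(1, 33))  # leftover contents: 1, 2, ..., 32

plain = ToyHeap(stale)
without_extra = buggy_sum(plain, 4)   # reads bytes 1..4

shifted = ToyHeap(stale)
shifted.malloc(8)                     # stand-in for one extra allocation at import time
with_extra = buggy_sum(shifted, 4)    # reads bytes 9..12

fixed = sum(ToyHeap(stale).calloc(4))  # zero-initialised, independent of stale data

print(without_extra, with_extra, fixed)  # 10 42 0
```

In this model both wrong answers are perfectly repeatable, because the stale contents and the allocation order are the same on every run. That may be part of why the real results looked deterministic, although the thread does not confirm it.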
