core: use synchronous signal handling on unix #7050
Issue #7044 describes multiple problems with the current signal
When using asynchronous signal handling via signal handlers, the code
This change hopes to solve the existing issues and avoid future
Move to Synchronous signal handling
The strategy used here is described in The Linux
On Linux, signals::init() will now:
Because we want to block the signals early in the process, we move
Move to a more explicit interface
This change also moves us from using a VecDeque under a Mutex for
This has two advantages:
The old check_for_signal() API looks like it intended to provide the
Windows uses a library which appears to handle the CTRL_C_EVENT.
The set of changes should not substantively change the behavior on either platform,
This more explicit interface has the disadvantage of not being
 Kerrisk, Michael. The Linux Programming Interface, 685. San
Signed-off-by: Steven Danna email@example.com
stevendanna > Note that I kept the
I was wondering about that as well. Is there any case where we need to resolve multiple instances of a signal, or the order of their arrival? If not could we simply have a static array of flags, one per signal type, and test/process them in some priority order. While I think it's safe now, getting rid of the alloc implicit in push_back and replacing it with updating a static mask would simplify things and make it easier to think about the code.
I don't think so. Also, based on my understanding of how signals work, anything that is depending on this would be broken by design since the kernel will "coalesce" signals on you.
@markan Thanks for the review! I've moved away from the VecDeque and just added a couple more AtomicBools. I thought about doing some bitmask type thing, but for now, I wanted the signal behavior we actually handle to be much more explicit in the code. I don't think a bitmask+mutex and a new API would be hard to wire up in the future if we needed it. Ultimately, given the interactions between processes, threads, signals, and the complicated 3rd party libraries we use which launch their own threads, I'd rather keep our use of signal-based behavior to a minimum.
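For comparison, the bitmask alternative raised in this thread could be sketched roughly as below. This is not code from the PR: the `Sig` enum, `mark_pending()`, and `take_pending()` are hypothetical names, and an atomic integer stands in for the "bitmask+mutex" idea so that recording a signal needs neither a lock nor an allocation.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical signal kinds; the discriminant doubles as the bit position.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Sig {
    Term = 0,
    Hup = 1,
    Chld = 2,
}

/// Order in which pending signals are handed back to the caller.
const PRIORITY: [Sig; 3] = [Sig::Term, Sig::Hup, Sig::Chld];

/// One bit per signal kind.
static PENDING: AtomicU32 = AtomicU32::new(0);

/// Record that a signal arrived. Repeated arrivals collapse into one bit,
/// much like the kernel coalesces pending standard signals.
pub fn mark_pending(sig: Sig) {
    PENDING.fetch_or(1 << (sig as u32), Ordering::SeqCst);
}

/// Atomically take every pending flag and return the signals in priority
/// order. The Vec is allocated on the consumer side only.
pub fn take_pending() -> Vec<Sig> {
    let mask = PENDING.swap(0, Ordering::SeqCst);
    PRIORITY
        .iter()
        .copied()
        .filter(|sig| (mask & (1 << (*sig as u32))) != 0)
        .collect()
}
```

The trade-off matches the one described above: the mask is compact and generic, but it hides which signals the application actually acts on.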
On your question of libraries. I defer to y'all who work on this every day. Since a huge function of launcher and sup is subprocess management, I don't think we ever escape from having to own the details, but there is still a trade-off. In this PR, I've deferred to stdlib to do all necessary cleanup before calling execvp but have taken the signal masking operations in-house. To me the big question around whether to use nix is mostly about whether we want to deal with managing the FFI bits. In the very long run, there is also a good looking library
Issue #7044 describes multiple problems with the current signal handling approach. These problems were introduced in 8e13827 and render our current error handling unsafe. Specifically, taking a mutex and allocating memory from inside a signal handler can result in a deadlock or state corruption.

When using asynchronous signal handling via signal handlers, the code in the signal handler is restricted to a limited number of safe operations (e.g. flipping an atomic, calling async-signal-safe functions).

This change hopes to solve the existing issues and avoid future problems caused by signal handlers by moving us to synchronous signal handling.

The strategy used here is described in _The Linux Programming Interface_:

> - All threads block all of the asynchronous signals that the process
>   might receive. The simplest way to do this is to block the signals
>   in the main thread before any other threads are created. Each
>   subsequently created thread will inherit a copy of the main
>   thread's signal mask.
>
> - Create a single dedicated thread that accepts incoming signals
>   using _sigwaitinfo()_, _sigtimedwait()_, or _sigwait()_...

On Linux, signals::init() will now:

- Block all signals that we plan on handling
- Start a thread that processes signals synchronously.

Because we want to block the signals early in the process, we move signals::init() much earlier into main.

Resources

- Kerrisk, Michael. _The Linux Programming Interface_, 685. San Francisco: No Starch Press, 2010.
- http://man7.org/linux/man-pages/man2/sigprocmask.2.html
- http://man7.org/linux/man-pages/man3/sigwait.3.html
- http://man7.org/linux/man-pages/man3/pthread_sigmask.3.html
- http://man7.org/linux/man-pages/man3/sigsetops.3.html
- http://man7.org/linux/man-pages/man7/signal.7.html

Signed-off-by: Steven Danna <firstname.lastname@example.org>
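The block-then-sigwait strategy in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: it declares the C functions by hand instead of using the `libc` or `nix` crates, hard-codes Linux values for `SIG_BLOCK`, `SIGHUP`, and `SIGCHLD` (x86_64 and aarch64), and uses a 128-byte stand-in for `sigset_t`. All Rust function names here are hypothetical.

```rust
use std::os::raw::c_int;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

// Assumption: Linux on x86_64 or aarch64. These values and the sigset_t
// size are platform-specific; real code should take them from a crate.
const SIG_BLOCK: c_int = 0;
const SIGHUP: c_int = 1;
const SIGCHLD: c_int = 17;

/// Opaque stand-in for the C `sigset_t` (128 bytes with glibc and musl).
#[repr(C)]
pub struct SigSet([u64; 16]);

extern "C" {
    fn sigemptyset(set: *mut SigSet) -> c_int;
    fn sigaddset(set: *mut SigSet, signum: c_int) -> c_int;
    fn pthread_sigmask(how: c_int, set: *const SigSet, old: *mut SigSet) -> c_int;
    fn sigwait(set: *const SigSet, sig: *mut c_int) -> c_int;
}

/// Set by the signal thread; polled by the rest of the program.
pub static SIGHUP_SEEN: AtomicBool = AtomicBool::new(false);

/// Build the set of signals this sketch handles.
pub fn handled_signals() -> SigSet {
    let mut set = SigSet([0; 16]);
    unsafe {
        sigemptyset(&mut set);
        sigaddset(&mut set, SIGHUP);
        sigaddset(&mut set, SIGCHLD);
    }
    set
}

/// Block the handled signals in the calling thread. Threads spawned later
/// inherit the mask, so call this before any other thread is created.
pub fn block_handled_signals() -> SigSet {
    let set = handled_signals();
    let rc = unsafe { pthread_sigmask(SIG_BLOCK, &set, std::ptr::null_mut()) };
    assert_eq!(rc, 0, "pthread_sigmask failed");
    set
}

/// Wait synchronously for one signal in `set` and return its number.
pub fn wait_for_signal(set: &SigSet) -> c_int {
    let mut sig: c_int = 0;
    let rc = unsafe { sigwait(set, &mut sig) };
    assert_eq!(rc, 0, "sigwait failed");
    sig
}

/// Start the dedicated signal thread. It runs ordinary code, so nothing
/// here is limited to async-signal-safe operations.
pub fn start_signal_thread(set: SigSet) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        match wait_for_signal(&set) {
            SIGHUP => SIGHUP_SEEN.store(true, Ordering::SeqCst),
            _ => {} // SIGCHLD: accepted, nothing to do in this sketch
        }
    })
}
```

A program would call `block_handled_signals()` at the very top of `main`, pass the returned set to `start_signal_thread()`, and then poll `SIGHUP_SEEN` from its main loop. Calling it that early is what makes every later thread inherit the blocked mask.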
This change moves us from using a VecDeque under a Mutex for tracking signals to signal-specific AtomicBools, replacing the generic `check_for_signal()` function with two more-specific `pending_sighup()` and `pending_sigchld()` functions.

This has two advantages:

1) No mutexes or memory allocations in our signal handling code. This matters less since we have also moved to synchronous handling of signals, but gives us some flexibility to change our approach more safely in the future.

2) Clearer alignment between this interface and the application behavior.

The old check_for_signal() API looks like it intended to provide the groundwork for a variety of signal-based behavior; however, in reality, we are handling a small number of signals. The generic nature of that interface and the fact that the code is shared between two processes was obscuring the fact that we actually have a small amount of signal-related behavior across the two services:

*Linux Behavior*

| signal    | hab-launch behavior | hab-sup behavior     |
|-----------+---------------------+----------------------|
| SIGINT    | Graceful shutdown   | Handled but ignored  |
| SIGTERM   | Graceful shutdown   | Handled but ignored  |
| SIGHUP    | Send to hab-sup     | Shutdown for restart |
| SIGCHLD   | Reap zombies        | Handled but ignored  |
| SIGQUIT   | Handled but ignored | Handled but ignored  |
| SIGALRM   | Handled but ignored | Handled but ignored  |
| SIGUSR1   | Handled but ignored | Handled but ignored  |
| SIGUSR2   | Handled but ignored | Handled but ignored  |
| All other | Default disposition | Default disposition  |

*Windows Behavior*

Windows uses a library which appears to handle the CTRL_C_EVENT. The set of changes should not change the behavior here, although we are now installing the handler earlier in the startup of the relevant applications.

This more explicit interface has the disadvantage of not being well-suited to a larger variety of signal-based behavior. However, it seems to me that we can always implement something smarter when that actually becomes an issue.

Signed-off-by: Steven Danna <email@example.com>
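A minimal sketch of what a per-signal flag interface of this shape might look like. Only the names `pending_sighup()` and `pending_sigchld()` come from the commit message. Their bodies, the statics, and the `note_*` setters are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// One flag per signal the application acts on (assumed layout).
static PENDING_HUP: AtomicBool = AtomicBool::new(false);
static PENDING_CHLD: AtomicBool = AtomicBool::new(false);

/// Hypothetical hook called by the signal thread when SIGHUP arrives.
pub fn note_sighup() {
    PENDING_HUP.store(true, Ordering::SeqCst);
}

/// Hypothetical hook called by the signal thread when SIGCHLD arrives.
pub fn note_sigchld() {
    PENDING_CHLD.store(true, Ordering::SeqCst);
}

/// Returns true if a SIGHUP arrived since the last call, clearing the flag.
pub fn pending_sighup() -> bool {
    PENDING_HUP.swap(false, Ordering::SeqCst)
}

/// Returns true if a SIGCHLD arrived since the last call, clearing the flag.
pub fn pending_sigchld() -> bool {
    PENDING_CHLD.swap(false, Ordering::SeqCst)
}
```

A main loop would then read as `if pending_sigchld() { /* reap zombies */ }`, which ties each signal to its row in the behavior table. `SeqCst` is used here only as the conservative default.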
We were calling libc::execvp directly here. This is problematic because ideally we would do various cleanups before calling exec, including resetting signal masks and signal dispositions. While we could have done this directly, the standard library already handles the nitty-gritty details and we already have an internal function that calls out to it.

Signed-off-by: Steven Danna <firstname.lastname@example.org>
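As a rough illustration of going through the standard library rather than `libc::execvp`, the sketch below uses `std::os::unix::process::CommandExt::exec`. The helper names `build_command()` and `exec_into()` are hypothetical and are not the internal function the commit message refers to.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::Command;

/// Build the command to exec. `program` and `args` are placeholders.
pub fn build_command(program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(args);
    cmd
}

/// Replace the current process image with `cmd`.
/// On success this never returns; the error is returned if exec fails.
pub fn exec_into(mut cmd: Command) -> io::Error {
    cmd.exec()
}
```

A caller would typically log the returned error and exit. The `exec` documentation warns that the process may be left in an altered state if the call fails.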
Something to discuss: I don't believe we need anything more than relaxed for our atomic memory ordering because the 2 threads are not sharing any other writes/reads. The assumption I'm making here is that the actions taken as a result of the atomic bool are not reordered to happen before the CAS.
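To make the ordering question concrete, here is a small sketch with the flag as the only state shared between two threads. `take_flag()` and `relaxed_flag_demo()` are hypothetical names. `Ordering::Relaxed` still guarantees that each store and compare-exchange on the flag is atomic. What it does not provide is ordering relative to other memory, so if the consumer also read data written by the setter, `Release`/`Acquire` would be needed instead.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Consume the flag: returns true only for the call that flips it from
/// true to false.
pub fn take_flag(flag: &AtomicBool) -> bool {
    flag.compare_exchange(true, false, Ordering::Relaxed, Ordering::Relaxed)
        .is_ok()
}

/// Returns once the flag set by the other thread has been observed and
/// consumed; the return value is the result of a second take attempt.
pub fn relaxed_flag_demo() -> bool {
    let flag = Arc::new(AtomicBool::new(false));
    let setter = {
        let flag = Arc::clone(&flag);
        // Stand-in for the signal thread: it writes nothing but the flag.
        thread::spawn(move || flag.store(true, Ordering::Relaxed))
    };
    // Stand-in for the main loop: poll until the flag is seen.
    while !take_flag(&flag) {
        thread::yield_now();
    }
    setter.join().unwrap();
    // The flag was consumed above, so a second take finds it cleared.
    take_flag(&flag)
}
```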
This is directed to the more experienced habitat developers:
So when I look at the history of this code, it looks like circa hab 0.50 we handled signals in a similar fashion to this PR. Then in habitat-sh/core#75 we changed that to a queue, because of concerns that multiple signals might be lost or handled out of order.
But (as @stevendanna mentioned) multiple signals to a process can be freely reordered and coalesced, in particular any sent before the receiver is scheduled. It looks like under the covers it's a set of bit flags per process, indicating whether a signal has been sent to it, and when scheduled the signals are delivered in a canonical order https://github.com/torvalds/linux/blob/master/kernel/signal.c#L217. So we can't actually guarantee the behavior that #75 above is intended to provide, especially under load or other duress.
So any behavior that depends on in-order or multiple signal delivery is broken and needs to be modified. But from habitat-sh/core#11, I'm not clear on what depends on that behavior. How do we find out?
christophermaier left a comment
This looks like a great improvement @stevendanna; thanks very much for digging into this, and for providing such a well-documented solution.
@markan Doing a bit of code archaeology, it seems that before we had the
Multiple different signals arriving in succession could thus get condensed into whichever was the last one handled, resulting in the potential for missed signals, which is what habitat-sh/core#75 was trying to address (rather than ensuring that we directly handled every instance of every kind of signal). It looks like this current PR provides a more explicit way of dealing with multiple different signals, by dealing concretely with each one we care about, rather than trying to handle them all generically.
@stevendanna Was there anything else you wanted to do with this PR? If not, I'm happy to get it merged; I just didn't want to pull the rug out from under you if you had any other tweaks you wanted to do.
I'm going to file an issue to look into
I think what you've done here in this PR is a very definite improvement, though!