Fix SIGCHLD race in signal handler setup#5206
Fix SIGCHLD race in signal handler setup#5206bitoku wants to merge 1 commit intoopencontainers:mainfrom
Conversation
|
@bitoku Thanks for the report! You're absolutely right -- there’s clearly a race condition here. I should have caught this when I opened test PR #5197. That said, I believe the direction of your patch may be misguided. As proposed, it would undo the performance improvement introduced in #4654. Instead, I think we should check whether the container process is still alive -- something along these lines: +++ b/signals.go
@@ -70,6 +70,21 @@ func (h *signalHandler) forward(process *libcontainer.Process, tty *tty, detach
return -1, err
}
+ // We need to check if the process is still alive before we start forwarding signals,
+ // otherwise we might miss the exit of the process and wait indefinitely for a SIGCHLD
+ // that will never come. This can happen if the process exits before we start forwarding
+ // signals, which can happen in cases where the process is very short lived or if there
+ // is a race between the process exiting and us starting to forward signals. By sending
+ // a signal 0, we can check if the process is still alive without actually sending a
+ // signal to it. If the process is not alive, we can return immediately with its exit
+ // status.
+ if err := process.Signal(unix.Signal(0)); err != nil {
+ if state, err := process.Wait(); err == nil {
+ return state.ExitCode(), nil
+ }
+ return -1, err
+ }
+
if h.notifySocket != nil {
if detach {
_ = h.notifySocket.run(pid1) |
|
By the way, it would be great if you could add an integration test to cover this regression. |
a10eb07 to
7ebfd5b
Compare
|
@lifubang I don't think the suggested way works, because kill -0 to a zombie process should success.
I added it though it passes even without the fix. |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Add an initial reap in forward() before entering the signal loop. Since signal.Notify runs in a goroutine, a fast-exiting process could trigger SIGCHLD before signal registration completes, causing the signal to be silently discarded. This left forward() blocking forever on a signal that would never arrive. By calling reap() once before the loop, we catch any process that already exited during the setup window. If the process is still running, the reap is a no-op and SIGCHLD arrives normally via the registered signal handler. Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
44fe5e2 to
e64a48e
Compare
|
Thanks for the update. Could you please verify if this patch resolves your issue? I believe it offers a more robust solution: diff --git a/signals.go b/signals.go
index 936d751f..649d3bcf 100644
--- a/signals.go
+++ b/signals.go
@@ -26,14 +26,18 @@ func newSignalHandler(enableSubreaper bool, notifySocket *notifySocket) chan *si
}
}
handler := make(chan *signalHandler)
+
+ // ensure that we have a large buffer size so that we do not miss any
+ // signals in case we are not processing them fast enough.
+ s := make(chan os.Signal, signalBufferSize)
+
// signal.Notify is actually quite expensive, as it has to configure the
// signal mask and add signal handlers for all signals (all ~65 of them).
// So, defer this to a background thread while doing the rest of the io/tty
// setup.
+ // But `SIGCHLD` is very important, so we need to handle it immediately.
+ signal.Notify(s, unix.SIGCHLD)
go func() {
- // ensure that we have a large buffer size so that we do not miss any
- // signals in case we are not processing them fast enough.
- s := make(chan os.Signal, signalBufferSize)
// handle all signals for the process.
signal.Notify(s)
handler <- &signalHandler{As I mentioned in the linked comment, I don't think adding a BATS test is necessary. Since I couldn't reproduce the issue even after 100,000 iterations, writing a test for it would be futile. It simply won't trigger in a realistic timeframe. |
|
Yes, I think this is a simplest possible fix. I tested it by adding @lifubang can you please open a PR (and acknowledge @bitoku contribution via |
|
Yet another way to fix this would be something like this: diff --git a/utils_linux.go b/utils_linux.go
index d7f51e20..a5b0a630 100644
--- a/utils_linux.go
+++ b/utils_linux.go
@@ -299,6 +299,7 @@ func (r *runner) run(config *specs.Process) (int, error) {
}
}
handler := <-handlerCh
+ handler.signals <- unix.SIGCHLD
status, err := handler.forward(process, tty, detach)
if err != nil {
r.terminate(process)This should have the same/similar effect as the fix in this PR but I like this approach better. PS this reminds me we should try to finish the review of #4661. |
|
closing in favor of #5210 |
Add an initial reap in forward() before entering the signal loop.
Since signal.Notify runs in a goroutine, a fast-exiting process
could trigger SIGCHLD before signal registration completes, causing
the signal to be silently discarded. This left forward() blocking
forever on a signal that would never arrive.
By calling reap() once before the loop, we catch any process that
already exited during the setup window. If the process is still
running, the reap is a no-op and SIGCHLD arrives normally via the
registered signal handler.
Fix: #5208