Fix SIGCHLD race in signal handler setup #5206

Closed
bitoku wants to merge 1 commit into opencontainers:main from bitoku:fix-sigchld-race

Conversation


bitoku commented Mar 31, 2026

Add an initial reap in forward() before entering the signal loop.
Since signal.Notify runs in a goroutine, a fast-exiting process
could trigger SIGCHLD before signal registration completes, causing
the signal to be silently discarded. This left forward() blocking
forever on a signal that would never arrive.

By calling reap() once before the loop, we catch any process that
already exited during the setup window. If the process is still
running, the reap is a no-op and SIGCHLD arrives normally via the
registered signal handler.

Fix: #5208

@lifubang
Member

@bitoku Thanks for the report! You're absolutely right -- there’s clearly a race condition here. I should have caught this when I opened test PR #5197.

That said, I believe the direction of your patch may be misguided. As proposed, it would undo the performance improvement introduced in #4654.

Instead, I think we should check whether the container process is still alive -- something along these lines:

+++ b/signals.go
@@ -70,6 +70,21 @@ func (h *signalHandler) forward(process *libcontainer.Process, tty *tty, detach
                return -1, err
        }
 
+       // We need to check if the process is still alive before we start forwarding signals,
+       // otherwise we might miss the exit of the process and wait indefinitely for a SIGCHLD
+       // that will never come. This can happen if the process exits before we start forwarding
+       // signals, which can happen in cases where the process is very short lived or if there
+       // is a race between the process exiting and us starting to forward signals. By sending
+       // a signal 0, we can check if the process is still alive without actually sending a
+       // signal to it. If the process is not alive, we can return immediately with its exit
+       // status.
+       if err := process.Signal(unix.Signal(0)); err != nil {
+               if state, err := process.Wait(); err == nil {
+                       return state.ExitCode(), nil
+               }
+               return -1, err
+       }
+
        if h.notifySocket != nil {
                if detach {
                        _ = h.notifySocket.run(pid1)

@lifubang
Member

By the way, it would be great if you could add an integration test to cover this regression.

lifubang added the backport/1.4-todo, backport/1.5-todo, and backport/1.3-todo labels (PRs in the main branch which need to be backported to release-1.4, release-1.5, and release-1.3) on Mar 31, 2026
bitoku force-pushed the fix-sigchld-race branch 2 times, most recently from a10eb07 to 7ebfd5b on March 31, 2026 11:36
@bitoku
Author

bitoku commented Mar 31, 2026

@lifubang
Thank you!
OK, so it offsets the cost by overlapping with runc init. I didn't check your PR, sorry.

I don't think the suggested approach works, because kill -0 on a zombie process still succeeds.
Instead I reused h.reap. It does a non-blocking wait and performs the same handling as when it receives SIGCHLD.
Let me know how it looks.

By the way, it would be great if you could add an integration test to cover this regression.

I added one, though it passes even without the fix.
I encountered this issue in a really busy environment, but it's hard to reproduce in CI, I guess.
Do we want to keep the integration test, or should I remove it?

@lifubang

This comment was marked as outdated.

@bitoku

This comment was marked as outdated.

Add an initial reap in forward() before entering the signal loop.
Since signal.Notify runs in a goroutine, a fast-exiting process
could trigger SIGCHLD before signal registration completes, causing
the signal to be silently discarded. This left forward() blocking
forever on a signal that would never arrive.

By calling reap() once before the loop, we catch any process that
already exited during the setup window. If the process is still
running, the reap is a no-op and SIGCHLD arrives normally via the
registered signal handler.

Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
bitoku force-pushed the fix-sigchld-race branch from 44fe5e2 to e64a48e on March 31, 2026 13:11
@lifubang
Member

Thanks for the update. Could you please verify if this patch resolves your issue? I believe it offers a more robust solution:

diff --git a/signals.go b/signals.go
index 936d751f..649d3bcf 100644
--- a/signals.go
+++ b/signals.go
@@ -26,14 +26,18 @@ func newSignalHandler(enableSubreaper bool, notifySocket *notifySocket) chan *si
                }
        }
        handler := make(chan *signalHandler)
+
+       // ensure that we have a large buffer size so that we do not miss any
+       // signals in case we are not processing them fast enough.
+       s := make(chan os.Signal, signalBufferSize)
+
        // signal.Notify is actually quite expensive, as it has to configure the
        // signal mask and add signal handlers for all signals (all ~65 of them).
        // So, defer this to a background thread while doing the rest of the io/tty
        // setup.
+       // But `SIGCHLD` is very important, so we need to handle it immediately.
+       signal.Notify(s, unix.SIGCHLD)
        go func() {
-               // ensure that we have a large buffer size so that we do not miss any
-               // signals in case we are not processing them fast enough.
-               s := make(chan os.Signal, signalBufferSize)
                // handle all signals for the process.
                signal.Notify(s)
                handler <- &signalHandler{

As I mentioned in the linked comment, I don't think adding a BATS test is necessary. Since I couldn't reproduce the issue even after 100,000 iterations, writing a test for it would be futile. It simply won't trigger in a realistic timeframe.

@kolyshkin
Contributor

Yes, I think this is the simplest possible fix. I tested it by adding time.Sleep(10 * time.Millisecond) to the beginning of the go func() in newSignalHandler, which allows me to reproduce the issue 100% of the time on my machine.

@lifubang can you please open a PR (and acknowledge @bitoku's contribution via Reported-by: Ayato Tokubi <atokubi@redhat.com>)?

@kolyshkin
Contributor

Yet another way to fix this would be something like this:

diff --git a/utils_linux.go b/utils_linux.go
index d7f51e20..a5b0a630 100644
--- a/utils_linux.go
+++ b/utils_linux.go
@@ -299,6 +299,7 @@ func (r *runner) run(config *specs.Process) (int, error) {
                }
        }
        handler := <-handlerCh
+       handler.signals <- unix.SIGCHLD
        status, err := handler.forward(process, tty, detach)
        if err != nil {
                r.terminate(process)

This should have the same/similar effect as the fix in this PR but I like this approach better.

PS this reminds me we should try to finish the review of #4661.

@lifubang
Member

lifubang commented Apr 1, 2026

@lifubang can you please open a PR (and acknowledge @bitoku's contribution via Reported-by: Ayato Tokubi <atokubi@redhat.com>)?

Thanks @bitoku, opened #5210, PTAL.

@bitoku
Author

bitoku commented Apr 1, 2026

closing in favor of #5210

bitoku closed this Apr 1, 2026

Labels

backport/1.3-todo: A PR in main branch which needs to be backported to release-1.3
backport/1.4-todo: A PR in main branch which needs to be backported to release-1.4
backport/1.5-todo: A PR in main branch which needs to be backported to release-1.5
regression


Development

Successfully merging this pull request may close these issues.

a SIGCHLD race in signal handler setup (regression since runc 1.3.0)

3 participants