Conversation

@xianzhe-databricks
Contributor

@xianzhe-databricks xianzhe-databricks commented Mar 27, 2025

Introduction

When the sandboxed application is killed externally (for example, when it requests more memory than the cgroup allows), the whole sandbox should exit as well. We found that commit 16dba7f#diff-06bbaf439d32bed64b39754ca405a6b8fe4227925bfa92c11428b6ddfc91b962 changed this behavior: after the application is killed, the sandbox keeps hanging around.

This PR fixes it.

Steps to reproduce the hang

The hang can be reproduced in any release that includes the commit above. Here we use release-20250319.0 as an example.

Application:

package main

import (
	"fmt"
	"time"
)

func main() {
	fmt.Println("Application started")
	var loop = 0
	for {
		time.Sleep(2 * time.Second)
		loop++
		fmt.Printf("Application is running %d\n", loop)
	}
	fmt.Println("Shutting down")
}

Run this application with gVisor, e.g. runsc --debug-log=/tmp/runsc/ --debug --strace run test.
Then, while the application is running, kill its process with sudo kill -9 <pid> to mimic an OOM kill. Afterwards the sandbox is still hanging.

runsc command line output

Application started
Application is running 1
Application is running 2
Application is running 3
Application is running 4
Application is running 5
Application is running 6
Application is running 7
Application is running 8
Application is running 9

Then it gets stuck after the kill command (but does not exit).

gVisor log

runsc.log.20250327-105656.496989.boot.txt (only the last relevant parts)

I0327 10:57:28.770718       1 strace.go:576] [   2:   6] client E futex(0xc000100148, FUTEX_WAIT|FUTEX_PRIVATE_FLAG, 0x0, null, 0x0, 0x0)
I0327 10:57:28.770718       1 strace.go:576] [   2:   2] client E epoll_pwait(0x3 anon_inode:[eventpoll], 0x7eb16a44c51c {}, 0x80, 0x7cf, null, 0x0)
I0327 10:57:28.770740       1 strace.go:614] [   2:   2] client X epoll_pwait(0x3 anon_inode:[eventpoll], 0x7eb16a44c51c {{events=EPOLLOUT data=[0x65b00001, 0x7ed7610c]}{events=EPOLLOUT data=[0x66b80001, 0x7ed7610c]}}, 0x80, 0x7cf, null, 0x0) = 2 (0x2) (6.004µs)
I0327 10:57:28.770767       1 strace.go:576] [   2:   2] client E epoll_pwait(0x3 anon_inode:[eventpoll], 0x7eb16a44c51c {}, 0x80, 0x7cf, null, 0x0)
I0327 10:57:28.770768       1 strace.go:602] [   2:   3] client X nanosleep(0xc00001ff18 {sec=0 nsec=20000}, null) = 0 (0x0) (62.461µs)
I0327 10:57:28.770791       1 strace.go:576] [   2:   3] client E futex(0x5571c0, FUTEX_WAIT|FUTEX_PRIVATE_FLAG, 0x0, 0xc00001feb8 {sec=1 nsec=999872650}, 0x0, 0x0)
I0327 10:57:30.770005       1 strace.go:614] [   2:   2] client X epoll_pwait(0x3 anon_inode:[eventpoll], 0x7eb16a44c51c {}, 0x80, 0x7cf, null, 0x0) = 0 (0x0) (1.999216401s)
I0327 10:57:30.771140       1 strace.go:614] [   2:   3] client X futex(0x5571c0, FUTEX_WAIT|FUTEX_PRIVATE_FLAG, 0x0, 0xc00001feb8 {sec=1 nsec=999872650}, 0x0, 0x0) = 0 (0x0) errno=110 (connection timed out) (2.000334244s)
W0327 10:57:30.780225       1 subprocess_unsafe.go:130] the subprocess 38 exited (status: %!s(unix.WaitStatus=9), err %!s(<nil>))
W0327 10:57:30.780252       1 task_run.go:358] [   2:   2] Unexpected SwitchToApp error: systrap corrupted memory: subprocess died
D0327 10:57:30.780267       1 task_exit.go:215] [   2:   2] Transitioning from exit state TaskExitNone to TaskExitInitiated
D0327 10:57:30.780294       1 task_exit.go:215] [   2:   2] Transitioning from exit state TaskExitInitiated to TaskExitZombie
W0327 10:57:30.781250       1 task_run.go:358] [   2:   3] Unexpected SwitchToApp error: systrap corrupted memory: subprocess died
D0327 10:57:30.781266       1 task_exit.go:215] [   2:   3] Transitioning from exit state TaskExitNone to TaskExitInitiated
D0327 10:57:30.781292       1 task_exit.go:215] [   2:   3] Transitioning from exit state TaskExitInitiated to TaskExitZombie
D0327 10:57:30.781297       1 task_exit.go:215] [   2:   3] Transitioning from exit state TaskExitZombie to TaskExitDead
I0327 10:57:41.715145       1 watchdog.go:294] Watchdog starting loop, tasks: 5, discount: 0s
D0327 10:58:04.664761       1 sampler.go:162] Time: Adjusting syscall overhead up to 692
D0327 10:58:08.665378       1 sampler.go:191] Time: Adjusting syscall overhead down to 606
D0327 10:58:09.664941       1 sampler.go:191] Time: Adjusting syscall overhead down to 531
D0327 10:58:10.665540       1 sampler.go:191] Time: Adjusting syscall overhead down to 465
D0327 10:58:11.664765       1 sampler.go:191] Time: Adjusting syscall overhead down to 407
D0327 10:58:15.664958       1 sampler.go:191] Time: Adjusting syscall overhead down to 357
I0327 10:58:26.715574       1 watchdog.go:294] Watchdog starting loop, tasks: 5, discount: 0s
D0327 10:58:35.665620       1 sampler.go:191] Time: Adjusting syscall overhead down to 313

gofer.txt and run.txt do not reveal anything interesting.

@xianzhe-databricks
Contributor Author

xianzhe-databricks commented Mar 27, 2025

@avagin could you take a look? Thanks

The changes in commit 16dba7f#diff-06bbaf439d32bed64b39754ca405a6b8fe4227925bfa92c11428b6ddfc91b962 do not correspond to the comment:

// When creating the new child process, we specify SIGKILL as the
// signal to deliver when the child exits. We never expect a subprocess
// to exit; they are pooled and reused. This is done to ensure that if
// a subprocess is OOM-killed, this process (and all other stubs,
// transitively) will be killed as well. It's simply not possible to
// safely handle a single stub getting killed: the exact state of
// execution is unknown and not recoverable.
return attachedThread(unix.CLONE_FILES|uintptr(unix.SIGCHLD), linux.SECCOMP_RET_TRAP)
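
For context, the low byte of the clone(2) flags is the termination signal the parent receives when the child dies. Below is a minimal sketch, not gVisor code, that only illustrates the flag layout: the call above passes SIGCHLD there, so a dying stub is merely reaped, while passing SIGKILL, as the comment describes, would make the kernel kill the parent (the sentry) whenever a stub dies.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// The low byte of the clone flags is the signal sent to the parent when the child terminates.
	withSIGCHLD := uintptr(unix.CLONE_FILES) | uintptr(unix.SIGCHLD) // what the call above passes
	withSIGKILL := uintptr(unix.CLONE_FILES) | uintptr(unix.SIGKILL) // what the comment describes
	fmt.Printf("termination signal (SIGCHLD variant): %v\n", unix.Signal(withSIGCHLD&0xff))
	fmt.Printf("termination signal (SIGKILL variant): %v\n", unix.Signal(withSIGKILL&0xff))
}
```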

@xianzhe-databricks changed the title from "Sentry: exit the parent process when the subprocess is dead" to "Sentry systrap: exit the parent process when the subprocess is dead" on Mar 27, 2025
@avagin
Collaborator

avagin commented Mar 27, 2025

first commit

We do not squash commits; therefore, all commits will be present in the upstream repository after merging. Please take this into account and write proper commit messages.

@avagin
Collaborator

avagin commented Mar 27, 2025

Then while the application is running, we kill this process with sudo kill -9 <pid> to mimic OOM. Afterwards the sandbox is still hanging.

What process did you kill? Was it a stub process? In the strace log, we see that a guest process associated with the killed stub process exited. The sandbox didn't exit because the init process was still there. Did it wait for the killed process? Why didn't it exit? Could you attach the full sentry log?

The changes in the commit 16dba7f#diff-06bbaf439d32bed64b39754ca405a6b8fe4227925bfa92c11428b6ddfc91b962 do not correspond to the comments

If we fork a stub process with SIGKILL as its termination signal, the entire sentry will be killed if a stub process crashes. The commit message says that we handle stub process crashes instead of just killing the sentry.
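
For intuition, here is a minimal sketch of the "handle the crash" approach in plain Go, not gVisor's actual systrap code; sleep 60 is just a stand-in for a stub process. The parent reaps the child, notices it was SIGKILLed (as the OOM killer would do), and shuts down deliberately instead of relying on the kernel to kill the parent.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Stand-in for a stub subprocess; kill it externally with `kill -9 <pid>`.
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	fmt.Println("child pid:", cmd.Process.Pid)
	_ = cmd.Wait() // returns an error if the child was signaled
	if ws, ok := cmd.ProcessState.Sys().(syscall.WaitStatus); ok && ws.Signaled() {
		fmt.Printf("child terminated by %v\n", ws.Signal())
		if ws.Signal() == syscall.SIGKILL {
			// The "handle it" path: escalate to a deliberate shutdown of the
			// whole process instead of carrying on with a dead subprocess.
			os.Exit(128 + int(syscall.SIGKILL))
		}
	}
}
```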

@xianzhe-databricks
Contributor Author

sentry.log
@avagin The full sentry log is really long, so I can only attach it as a file.

What process did you kill? Was it a stub process?

The process is a subprocess of runsc-sandbox; I think it is just the Go application with the infinite loop I showed in the PR description. Here are my steps:

$ ps aux | grep "runsc"                   
root      222911  0.0  0.0  11352  5284 pts/5    S+   17:15   0:00 sudo /home/xianzhe.ma/gvisor/bin/runsc --debug-log=/home/xianzhe.ma/uds_test/logs/runsc/ --debug --strace run test
root      222912  0.1  0.0 1257208 24056 pts/5   Sl+  17:15   0:00 /home/xianzhe.ma/gvisor/bin/runsc --debug-log=/home/xianzhe.ma/uds_test/logs/runsc/ --debug --strace run test
root      222919  0.1  0.0 1256696 22972 ?       Ssl  17:15   0:00 runsc-gofer --strace=true --root=/var/run/runsc --debug=true --debug-log=/home/xianzhe.ma/uds_test/logs/runsc/ --debug-log-fd=3 gofer --bundle=/home/xianzhe.ma/uds_test/container --gofer-mount-confs=lisafs:none,lisafs:none --io-fds=7,8 --mounts-fd=5 --rpc-fd=6 --spec-fd=4 --sync-chroot-fd=-1 --sync-userns-fd=-1 --proc-mount-sync-fd=16 --apply-caps=false --setup-root=false
root      222924  3.2  0.0 3433556 36324 ?       Ssl  17:15   0:00 runsc-sandbox --root=/var/run/runsc --debug=true --debug-log=/home/xianzhe.ma/uds_test/logs/runsc/ --strace=true --debug-log-fd=3 boot --apply-caps=false --attached=true --bundle=/home/xianzhe.ma/uds_test/container --controller-fd=8 --cpu-num=32 --dev-io-fd=-1 --gofer-mount-confs=lisafs:none,lisafs:none --io-fds=4,5 --mounts-fd=6 --setup-root=false --spec-fd=9 --start-sync-fd=7 --stdio-fds=10,11,12 --total-host-memory=132941541376 --total-memory=132941541376 --product-name=m6i.8xlarge --host-thp-shmem-enabled=never --host-thp-defrag=madvise --proc-mount-sync-fd=20 test
xianzhe+  223375  0.0  0.0   7964  2680 pts/9    S+   17:15   0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox --exclude-dir=.venv --exclude-dir=venv runsc

$ pstree -p 222924
exe(222924)─┬─exe(222962)
            ├─exe(222972)───exe(222973)
            ├─exe(222979)───exe(222980)
            ├─exe(222981)─┬─exe(222982)
            │             ├─exe(222984)
            │             └─exe(222997)
            ├─{exe}(222947)
            ├─{exe}(222948)
            ├─{exe}(222949)
            ├─{exe}(222950)
            ├─{exe}(222956)
            ├─{exe}(222957)
            ├─{exe}(222958)
            ├─{exe}(222959)
            ├─{exe}(222960)
            ├─{exe}(222961)
            ├─{exe}(222963)
            ├─{exe}(222964)
            ├─{exe}(222965)
            ├─{exe}(222966)
            ├─{exe}(222967)
            ├─{exe}(222968)
            ├─{exe}(222969)
            ├─{exe}(222970)
            ├─{exe}(222971)
            └─{exe}(222983)

$ sudo kill -9 222981

@xianzhe-databricks
Contributor Author

@avagin regardless of the above, this is to mimic the case in production where the application running in gVisor requests more memory than the cgroup allows and gets OOM-killed. We observed that the sandbox processes (runsc-sandbox, runsc-gofer, etc.) are still running, and the sentry logs are similar to what I showed in the toy reproduction above.

@avagin
Collaborator

avagin commented Mar 28, 2025

In the attached log, only the few threads that detected that their stub process had died were killed. We actually need to kill the entire process.
Could you try out this patch:

diff --git a/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go b/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
--- a/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
+++ b/google3/third_party/gvisor/pkg/sentry/kernel/task_run.go
@@ -356,7 +356,7 @@ func (app *runApp) execute(t *Task) task
        default:
                // What happened? Can't continue.
                t.Warningf("Unexpected SwitchToApp error: %v", err)
-               t.PrepareExit(linux.WaitStatusExit(int32(ExtractErrno(err, -1))))
+               t.PrepareGroupExit(linux.WaitStatusTerminationSignal(linux.SIGKILL))
                return (*runExit)(nil)
        }
 }
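
Roughly, PrepareExit ends only the task that hit the error, while PrepareGroupExit tears down the whole thread group, which is what lets the sandbox exit. A toy analogy in plain Go, nothing gVisor-specific: if only the worker that notices the failure stops, its siblings keep the process alive; escalating to a process-wide exit ends everything.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func worker(id int, broken bool) {
	for i := 0; ; i++ {
		time.Sleep(time.Second)
		if broken && i == 2 {
			fmt.Printf("worker %d: backing subprocess died\n", id)
			// return        // per-task exit: the other workers keep the process alive
			os.Exit(128 + 9) // group-style exit: the whole process terminates
		}
		fmt.Printf("worker %d is running\n", id)
	}
}

func main() {
	go worker(1, false)
	go worker(2, true)
	select {} // block forever; only the escalation above ends the process
}
```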

@avagin avagin self-requested a review March 28, 2025 00:11
@xianzhe-databricks force-pushed the fix-hang-with-subprocess-death branch from 5c6ea8c to 162de7b on March 28, 2025 10:29
@xianzhe-databricks changed the title from "Sentry systrap: exit the parent process when the subprocess is dead" to "Sentry systrap: exit the entire process when one stubprocess is dead" on Mar 28, 2025
@xianzhe-databricks force-pushed the fix-hang-with-subprocess-death branch from 162de7b to e7520af on March 28, 2025 10:31
@xianzhe-databricks
Contributor Author

@avagin it works as well! Thanks. I revised the PR as well as the commit message. Could we merge this PR?

copybara-service bot pushed a commit that referenced this pull request Mar 28, 2025
copybara-service bot pushed a commit that referenced this pull request Mar 28, 2025
copybara-service bot pushed a commit that referenced this pull request Mar 28, 2025
@copybara-service copybara-service bot merged commit 7a07e27 into google:master Mar 28, 2025
5 checks passed
