Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: Go can leak a forked process if main thread exits before spawn finished #33565

Open
jacobvosmaer opened this issue Aug 9, 2019 · 12 comments

Comments

@jacobvosmaer
Copy link

@jacobvosmaer jacobvosmaer commented Aug 9, 2019

Please answer these questions before submitting your issue. Thanks!

What did you do?

If I spawn a process in a goroutine, and exit the main thread before the spawn is finished, I am sometimes left with a fork of my original process that does not go away. The fork does not respond to SIGTERM, and sometimes consumes 100% CPU.

https://play.golang.org/p/1HPscvoAiwj

What did you expect to see?

If the main thread exits Go should not leave behind a forked process.

What did you see instead?

A forked process, sometimes using 100% CPU

Does this issue reproduce with the latest release (go1.12.7)?

Yes.

System details

go version go1.12.6 darwin/amd64
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/jacobvosmaer/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/jacobvosmaer/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
GOROOT/bin/go version: go version go1.12.6 darwin/amd64
GOROOT/bin/go tool compile -V: compile version go1.12.6
uname -v: Darwin Kernel Version 18.7.0: Thu Jun 20 18:42:21 PDT 2019; root:xnu-4903.270.47~4/RELEASE_X86_64
ProductName:	Mac OS X
ProductVersion:	10.14.6
BuildVersion:	18G87
lldb --version: lldb-1100.0.25.3
Apple Swift version 5.1 (swiftlang-1100.0.38.29 clang-1100.0.20.14)
@jacobvosmaer jacobvosmaer changed the title os/exec: Go can leak a fork process if main thread exits before spawn finished os/exec: Go can leak a forked process if main thread exits before spawn finished Aug 9, 2019
@jacobvosmaer
Copy link
Author

@jacobvosmaer jacobvosmaer commented Aug 9, 2019

My guess is that when doing a fork+exec, after the fork, the child wants to "communicate" with the parent; perhaps by reading or writing to a pipe. If the parent is gone the child should exit and do nothing. But it appears that with the "right" timing, the child can get stuck in a loop instead.

@ianlancetaylor ianlancetaylor changed the title os/exec: Go can leak a forked process if main thread exits before spawn finished syscall: Go can leak a forked process if main thread exits before spawn finished Aug 9, 2019
@ianlancetaylor ianlancetaylor added this to the Go1.14 milestone Aug 9, 2019
@jacobvosmaer
Copy link
Author

@jacobvosmaer jacobvosmaer commented Aug 19, 2019

There are some lldb stack traces that might help here: https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

@agnivade
Copy link
Contributor

@agnivade agnivade commented Aug 20, 2019

Can't reproduce on linux.

If the parent is gone the child should exit and do nothing.

I don't believe that is necessarily correct. If a parent exits, the child becomes a zombie and attaches to PID 1. I understand this issue is about cleaning up the child before the spawn has happened completely.

@randall77 for macOS.

@jacobvosmaer
Copy link
Author

@jacobvosmaer jacobvosmaer commented Aug 28, 2019

I also cannot reproduce on linux.

I understand this issue is about cleaning up the child before the spawn has happened completely.

Yes. In my toy example, I'm trying to spawn /bin/sh which will exit harmlessly on its own accord, even if its parent is gone. The problem is that when this bug strikes, we never get to the exec which creates the /bin/sh process.

I've learned some more about the real life bug that prompted this issue. It seems that there, even if the main goroutine does not exit, we get a stuck forked process. I was able to attach to some of these with lldb, which is what I linked to above. https://gitlab.com/gitlab-org/gitaly/issues/1850#note_205304016

I'm not sure if this is the exact same state as my reproducing example, but it might be related. To save some clicking I'm including stack traces below.

mystery fork A, 0% CPU
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7eaee86a libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff7ebad56e libsystem_pthread.dylib`_pthread_cond_wait + 722
    frame #2: 0x00007fff7d1e6467 libobjc.A.dylib`monitor_tt<false>::wait() + 19
    frame #3: 0x00007fff7d1eb462 libobjc.A.dylib`WAITING_FOR_ANOTHER_THREAD_TO_FINISH_CALLING_+initialize + 132
    frame #4: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79
    frame #5: 0x00007fff7d1eb79b libobjc.A.dylib`initializeNonMetaClass + 79
    frame #6: 0x00007fff7d1ec62f libobjc.A.dylib`initializeAndMaybeRelock(objc_class*, objc_object*, mutex_tt<false>&, bool) + 187
    frame #7: 0x00007fff7d1db690 libobjc.A.dylib`lookUpImpOrForward + 228
    frame #8: 0x00007fff7d1db114 libobjc.A.dylib`_objc_msgSend_uncached + 68
    frame #9: 0x00007fff7ebfafa1 libxpc.dylib`xpc_atfork_child + 125
    frame #10: 0x00007fff7b99eb18 libSystem.B.dylib`libSystem_atfork_child + 54
    frame #11: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40
    frame #12: 0x00000000010603ff gitaly`runtime.syscall + 31
    frame #13: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112
    frame #14: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624
    frame #15: 0x000000000104d40e gitaly`syscall.rawSyscall + 78
    frame #16: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226
    frame #17: 0x00000000010a31a0 gitaly`syscall.forkExec + 864
    frame #18: 0x00000000010c4a12 gitaly`os.startProcess + 610
    frame #19: 0x00000000010c449c gitaly`os.StartProcess + 124
    frame #20: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180
    frame #21: 0x00000000016805a3 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435
mystery fork B, 100% CPU
* thread #1: tid = 0x43f1b, 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
(lldb) bt
* thread #1, stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
  * frame #0: 0x00007fff7eba3317 libsystem_platform.dylib`_os_once_gate_corruption_abort + 23
    frame #1: 0x00007fff7eba114c libsystem_platform.dylib`_os_once_gate_wait + 196
    frame #2: 0x00007fff7eb9ea6f libsystem_platform.dylib`_os_alloc_once + 42
    frame #3: 0x00007fff7eb99d2a libsystem_notify.dylib`_notify_fork_child + 211
    frame #4: 0x00007fff7b99eb13 libSystem.B.dylib`libSystem_atfork_child + 49
    frame #5: 0x00007fff7ea0fd97 libsystem_c.dylib`fork + 40
    frame #6: 0x00000000010603ff gitaly`runtime.syscall + 31
    frame #7: 0x000000000105dde0 gitaly`runtime.asmcgocall + 112
    frame #8: 0x0000000001033930 gitaly`runtime.startTheWorldWithSema + 624
    frame #9: 0x000000000104d40e gitaly`syscall.rawSyscall + 78
    frame #10: 0x00000000010a21a2 gitaly`syscall.forkAndExecInChild + 226
    frame #11: 0x00000000010a31a0 gitaly`syscall.forkExec + 864
    frame #12: 0x00000000010c4a12 gitaly`os.startProcess + 610
    frame #13: 0x00000000010c449c gitaly`os.StartProcess + 124
    frame #14: 0x0000000001179f9c gitaly`os/exec.(*Cmd).Start + 1180
    frame #15: 0x0000000001680563 gitaly`gitlab.com/gitlab-org/gitaly/internal/supervisor.(*Process).start + 435

When I lldb the crashing example I don't get nice symbols. Is it possible I messed up a go build command?

@agnivade
Copy link
Contributor

@agnivade agnivade commented Aug 28, 2019

I would also give 1.13rc1 a shot, just to be sure that it is not something already fixed. Keith can probably comment more on what exactly is happening.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 28, 2019

The stack traces make it appear that the problem is in the Darwin libc. The forked child is trying to get some response from the parent, but the parent has exited. If that is true, it should be possible to write a similar test case in C. Anybody want to try that?

@jacobvosmaer
Copy link
Author

@jacobvosmaer jacobvosmaer commented Aug 28, 2019

it should be possible to write a similar test case in C. Anybody want to try that?

It's the sort of challenge I enjoy so I'll take a look. But a person familiar with Go internals would be much faster at it than me. I won't be offended if such a person swoops in and beats me to it.

@jacobvosmaer
Copy link
Author

@jacobvosmaer jacobvosmaer commented Aug 28, 2019

OK that was not hard. Just translated my Go example to C.

https://gist.github.com/jacobvosmaer/69fae756e88d2f5f4ef5091ceeba1d88

As before, leave it running for a while in a restart loop, and you get a weird forked process left behind.

@jacobvosmaer
Copy link
Author

@jacobvosmaer jacobvosmaer commented Aug 28, 2019

The C reproducer highlights how odd this is even more: all it tries to do is fork followed by exit, yet the leaking forked processes are not exiting.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 28, 2019

Thanks!

Does anybody want to try to report this to Apple?

To fix this in Go I think it would suffice to have exit in runtime/sys_darwin.go acquire a write lock on syscall.ForkLock. Unfortunately that will be a bit of a hassle since the runtime package doesn't depend on the syscall package, and since syscall.ForkLock is exported it can't move.

@jacobvosmaer
Copy link
Author

@jacobvosmaer jacobvosmaer commented Aug 29, 2019

It really does feel like a macOS bug. With my C reproducer, if I leave it running it leaks way more processes than the Go one. Most of the leaked processes are at 0% CPU instead of 100% CPU.

* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff6d67ef06 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff6d73bd52 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff6d7394cd libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x00007fff6d5ed2b4 libsystem_c.dylib`__cxa_finalize_ranges + 36
    frame #4: 0x00007fff6d5ed6b3 libsystem_c.dylib`exit + 55
    frame #5: 0x0000000106b09e21 fork-leak`do_fork + 33
    frame #6: 0x00007fff6d73b2eb libsystem_pthread.dylib`_pthread_body + 126
    frame #7: 0x00007fff6d73e249 libsystem_pthread.dylib`_pthread_start + 66
    frame #8: 0x00007fff6d73a40d libsystem_pthread.dylib`thread_start + 13

What is particularly worrying here is that after I remove the leaked processes with kill -9, some part of the system (kernel? userspace kernel helper?) keeps spinning the CPU. It shows in the activity monitor as constant "system" load. I can only get rid of that by rebooting.

Screenshot 2019-08-29 at 09 31 49

I don't feel confident reporting this to Apple for a number of reasons:

  1. I don't have enough experience with C and Unix programming to stand behind my C reproducer
  2. I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.14, Unplanned Sep 11, 2019
@jqmichael
Copy link

@jqmichael jqmichael commented May 25, 2020

I don't feel confident reporting this to Apple for a number of reasons:

  1. I don't have enough experience with C and Unix programming to stand behind my C reproducer
  2. I don't know to what extent the C reproducer matches the behavior of the Go runtime: even if the C reproducer is proof of a macOS bug, do we know if it's the same bug I see in Go?

If they respond to my C reproducer by saying "your code is wrong" or "you should use posix_spawn" then I have nothing to say to that.

That's certainly true. But we are still making progress even if Apple proved we were doing wrong, isn't it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.