Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: use posix_spawn (or vfork) for ForkExec when possible #5838

Closed
owenthereal opened this issue Jul 5, 2013 · 56 comments
Closed

syscall: use posix_spawn (or vfork) for ForkExec when possible #5838

owenthereal opened this issue Jul 5, 2013 · 56 comments
Assignees
Milestone

Comments

@owenthereal
Copy link

@owenthereal owenthereal commented Jul 5, 2013

Why:

At a basic level posix_spawn(2) is a subset of fork(2). A new child process from
fork(2): 1) gets an exact copy of everything that the parent process had in memory, and
2) gets a copy of all the file descriptors that the parent process had open.
posix_spawn(2) preserves #2, but not #1. In some cases, say, shelling out a command,
it's unnecessary to get a copy of memory of the parent process. With copy-on-write, fork
will be less expensive but still, not necessary.

What's out there:

https://github.com/rtomayko/posix-spawn#benchmarks

I am wondering if it makes sense to have this API and let developers decide which one to
use (fork/exec vs. posix_spawn)
@alberts
Copy link
Contributor

@alberts alberts commented Jul 5, 2013

@davecheney
Copy link
Contributor

@davecheney davecheney commented Jul 5, 2013

Comment 2:

Sounds like MADV_DONTFORK is the way to go.
@owenthereal
Copy link
Author

@owenthereal owenthereal commented Jul 7, 2013

Comment 3:

Thanks for the info. I am pasting the implementation of the posix-spawn gem in case it's
a helpful reference. Looks like it forces to use vfork on linux:
https://github.com/rtomayko/posix-spawn/blob/master/ext/posix-spawn.c#L399-L404
https://github.com/rtomayko/posix-spawn/blob/master/ext/posix-spawn.c#L418
@minux
Copy link
Member

@minux minux commented Jul 8, 2013

Comment 4:

the problem of MADV_DONTFORK heap is that it's difficult for us to make sure
ForkExec code doesn't use heap at all.
i think posix_spawn is the way to go (vfork is removed in modern POSIX standard,
so it should be avoided when possible).

Labels changed: added priority-later, performance, removed priority-triage.

Status changed to Accepted.

@rsc
Copy link
Contributor

@rsc rsc commented Jul 8, 2013

Comment 5:

The standards don't matter. What matters is what is available on the target systems.
vfork is fine if it's there. So is posix_spawn if it's there (and works and supports all
the current usages, including syscall.ProcAttr).
@rsc
Copy link
Contributor

@rsc rsc commented Nov 27, 2013

Comment 6:

Labels changed: added go1.3maybe.

@rsc
Copy link
Contributor

@rsc rsc commented Dec 4, 2013

Comment 7:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Copy link
Contributor

@rsc rsc commented Dec 4, 2013

Comment 8:

Labels changed: added repo-main.

@tarndt
Copy link

@tarndt tarndt commented Jan 23, 2015

I see this issue on a regular basis for machines that still have free physical memory (certainly enough for the process I would like to invoke), but are being called by a Go process with a very large virtual memory footprint using os.Exec. The invocation fails due to insufficient virtual memory to fork the parent Go process. I think calling this a "performance" issue isn't accurate.

Go daemons with large virtual memory footprints needing to invoke small command-line utilities is a common use case. It would be nice to get this assigned to the 1.5 release.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 23, 2015

I took a short look. On GNU/Linux posix_spawn is a C library function, not a system call. vfork is a special case of clone: you pass the CLONE_VFORK flag. This means that a program that cares can already use vfork on GNU/Linux, by setting the Cloneflags field in os/exec.Cmd.SysProcAttr or os.ProcAttr.Sys. So while this would be nice to fix I don't see a pressing need.

To fix we need to edit syscall/exec_linux.go to pass CLONE_VFORK when that is safe. It is pretty clearly safe if the only things the child needs to do after the fork is fiddle with file descriptors and call exec. If that is the case, as can be determined by looking at the sys fields, then we could add CLONE_VFORK to the clone flags. If somebody wants to try that out, that would be nice.

@tarndt
Copy link

@tarndt tarndt commented Feb 4, 2015

If somebody wants to try that out, that would be nice.

@ianlancetaylor I see this issue on a regular basis, I will give you suggestion a try- Thanks!

@napsy
Copy link

@napsy napsy commented Feb 13, 2015

I'm facing with a similar problem. I'm running a go app that allocates about 14GB of VM and can't spawn a simple 'ps' command despite having at leat 300 MB system RAM still available. It would be really great if this issue would be fixed in 1.5

@nightlyone
Copy link
Contributor

@nightlyone nightlyone commented Feb 13, 2015

Hmm, I gave this a quick try a few days ago, but gave up for now.

I failed to determine why the child hangs after the clone syscall. And if the child hangs, the parent won't continue either in the CLONE_VFORK case.

I only activated CLONE_VFORK, if everything in syscall.SysProcAttr was set to it's zero value. But even such simple cases are not so simple it seems. if someone want's to work on this with me, just ping me here.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 13, 2015

Did you pass CLONE_VM as well as CLONE_VFORK? I think that without CLONE_VM the parent won't be able to see when the child has exec'ed. Though I don't know why the child would hang.

@nightlyone
Copy link
Contributor

@nightlyone nightlyone commented Feb 14, 2015

@ianlancetaylor yes, I passed both. But I guess the systems needs to be in a single thread mode for this to work, which Go doesn't seem to do at the moment. http://ewontfix.com/7/ has more info on this, if someone wants to continue here (e.g. my future self).

@tarndt
Copy link

@tarndt tarndt commented Mar 6, 2015

@ianlancetaylor
I'm a bit confused, this appears to work: http://play.golang.org/p/Bop1efiPJ4
Is this test flawed? I hadn't gotten around to trying this with our "real" code yet.

Edit: I even tried adding runtime.GOMAXPROCS(runtime.NumCPU()) and it still works.

@nightlyone
Copy link
Contributor

@nightlyone nightlyone commented Mar 6, 2015

@tarndt CLONE_VM is not passed in your example. CLONE_VFORK without CLONE_VM will be If you add this, the go program calling execve hangs. Which is exactly what I have seen in my tests.

My current plan is to use madvise(...,MADV_DONTFORK) with the heap, but I haven't figured out yet how to do the file descriptor juggling in a safe manner without affecting the parent process and only using stack.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Mar 7, 2015

@tarndt If you use CLONE_VFORK without CLONE_VM, is that really any faster?

If it is faster, and it works, then I suppose we could use it.

@minux
Copy link
Member

@minux minux commented Mar 20, 2015

There is one reason to not use vfork. It's when the child needs to dup a
FUSE-backed file descriptor, which could block, and in the case of vfork,
also block the parent for indefinite amount of time.

See
https://groups.google.com/forum/#!msg/golang-nuts/11rdExWP6ac/rauEcCB66FUJ
([go-nuts] syscall: dup2 blocks contrary to runtime's expectations for fuse file systems)
for the discussion.

This is an edge case, but still worth considering when switching to vfork.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@neelance
Copy link
Member

@neelance neelance commented Dec 13, 2016

Wow nice, so the only change it does is to make the parent wait until the child does its exec. That sounds to me like a very safe thing to do.

Could you do a test run with my full patch to see if there is any additional gain in CLONE_VM?

@arya
Copy link

@arya arya commented Dec 13, 2016

@neelance I feel super stupid. My first test was incorrect, but CLONE_VFORK alone is marginally faster but not anywhere near CLONE_VM|CLONE_VFORK. Here are all 3:

Unmodified Go 1.7
Running https://gist.github.com/arya/7e23e8654e87a6e80608ade43ee31041#file-without_vfork-go

$ forker
0GB: 752.606µs
1GB: 6.114471ms
2GB: 8.482259ms
3GB: 12.444931ms

Unmodified Go 1.7, application code adds CLONE_VFORK
Running https://gist.github.com/arya/7e23e8654e87a6e80608ade43ee31041#file-with_vfork-go

$ forker
0GB: 977.023µs
1GB: 2.883356ms
2GB: 6.791622ms
3GB: 10.0657ms

Go 1.7 with this patch applied: neelance@b7edfba
Running https://gist.github.com/arya/7e23e8654e87a6e80608ade43ee31041#file-without_vfork-go

$ forker
0GB: 634.004µs
1GB: 624.716µs
2GB: 531.193µs
3GB: 902.282µs

I'm surprised though that memory is still copied despite the the child_stack is set to zero.

@neelance
Copy link
Member

@neelance neelance commented Dec 14, 2016

As far as I understand the manpage of clone you can use it in 3 ways:

  1. Do not specify CLONE_VM, then COW makes sure that the two processes don't interfere.
  2. Specify CLONE_VM and provide a child_stack. Memory is shared, but the child uses the given stack, thus not interfering with the stack of the parent.
  3. Specify CLONE_VM and CLONE_VFORK. Memory is shared, but the parent does not run until the child detached from the memory space via exec. You have to consider that the child modifies the stack that the parent uses afterwards (my patch).
@arya
Copy link

@arya arya commented Dec 15, 2016

@neelance That makes sense to me. AFAICT the third option (your patch) is this most feasible and performant. The first option seems to be what's in master and suffers from a large amount of copying. CLONE_VFORK avoids some of the copies, but not much apparently. The second option seems to me (as a novice to the internals) much more difficult to get right given the nature of Go and its management of memory. Is that accurate?

@neelance
Copy link
Member

@neelance neelance commented Dec 15, 2016

Yes, I also think that the second option is harder to implement.

@fasaxc
Copy link

@fasaxc fasaxc commented Feb 24, 2017

I work on an app that makes heavy use of subprocesses to manipulate iptables and ipsets (since that's their only supported API). After observing poor performance when my process is using a lot of RAM, I found this issue.

I tried adding CLONE_VFORK to our use of exec.Command but it seemed to make the throughput of my app worse! Maybe it's a matter of the process getting paused until the execve happens, lowering the throughput of goroutines running on other threads too.

FWIW, the previous version of our app was written in Python and we observed a dramatic improvement when we switched from Python's default fork/exec strategy to using posix_spawn via FFI.

@fasaxc
Copy link

@fasaxc fasaxc commented Feb 24, 2017

Measuring in an app that's under load with work going on in other threads, I see Cmd.Start() take 0-1ms at 40MB VSS vs 50-60ms at 1.4GB VSS. That amounts to 50x difference in throughput for my app and given the presence of the fork lock, there doesn't seem to be a way around it, even when using multiple goroutines.

@neelance
Copy link
Member

@neelance neelance commented Feb 24, 2017

@fasaxc Yes, only using CLONE_VFORK without CLONE_VM will only add additional waiting for the subproces to exec without saving time anywhere else.

Would you mind applying the whole patch neelance@f207709 to your GOROOT, then do go install -a syscall just to be sure and then rebuild your app with that? You don't need to modify your exec.Command. I'd be interested if the patch also improves your use case.

@neelance
Copy link
Member

@neelance neelance commented Feb 24, 2017

With "improve" I specifically mean the latency on high ram usage. You are right that in a low-RAM situation it may lower the throughput. Please check if it is still 50x when using the full patch.

@fasaxc
Copy link

@fasaxc fasaxc commented Feb 24, 2017

That patch makes a dramatic improvement. I'm measuring 99%ile latency of 1ms vs 60ms before and a drop from 100% CPU to 20% CPU usage.

@neelance
Copy link
Member

@neelance neelance commented Feb 24, 2017

Yey, I'm happy to hear that. Any downsides that you see? What about the low-memory situation?

@fasaxc
Copy link

@fasaxc fasaxc commented Feb 24, 2017

@neelance It seems to improve latency at small VSS size too (~40MB): 800us 99th %ile vs 2600us

@neelance
Copy link
Member

@neelance neelance commented Feb 24, 2017

Cool. So there are no reasons for not bringing this upstream. I'll create a CL today or tomorrow.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 25, 2017

CL https://golang.org/cl/37439 mentions this issue.

fasaxc added a commit to fasaxc/go-build that referenced this issue Feb 27, 2017
@gopherbot gopherbot closed this in 9e6b79a Mar 22, 2017
@owenthereal
Copy link
Author

@owenthereal owenthereal commented Mar 23, 2017

@neelance nice work!

@keegancsmith
Copy link
Contributor

@keegancsmith keegancsmith commented Jan 23, 2018

gitlab blogged about this patch giving a 30x improvement to the p99 latency of the git service they developed :) https://about.gitlab.com/2018/01/23/how-a-fix-in-go-19-sped-up-our-gitaly-service-by-30x/

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
You can’t perform that action at this time.