proposal: cmd/cgo: unsafe FFI calls #42469
Comments
This will produce programs that usually work but sometimes hang for inexplicable reasons. I don't think we've come close to the theoretical limit on speeding up cgo calls without removing safety. Can you point to some documentation for GHC unsafe calls? Thanks. |
@ianlancetaylor Here is a section in the GHC User Guide about guaranteed call safety: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/ffi-chap.html#guaranteed-call-safety
|
What we'd need to decide to do this is very compelling evidence that the current cgo calls are not fast enough for real workloads and cannot be made fast enough without removing safety. Does anyone have any data about these? |
Leaving open for another week, but in the absence of evidence that the current cgo isn't fast enough, this is headed for a likely decline. |
@rsc There is plenty of evidence. The Tor Project decided that CGo was a very poor choice for their incremental rewrite of the Tor daemon. Filippo Valsorda wrote rustgo to reduce call overhead when invoking Rust cryptographic routines. rustgo was a disgusting hack, but it was over 15 times faster than CGo, which translated into a significant improvement in benchmarks. Yes, deadlocking if the invoked function blocks is not great. But there are cases where the invoked function is absolutely guaranteed not to block. Fast cryptographic routines are one example. Graphics APIs such as Vulkan are another, and I recall yet another involving database access. In these cases, if performance matters, the choice isn’t “use CGo or a disgusting assembler hack”. It’s “use a disgusting assembler hack or reimplement the hot code path in a different language”. |
For reference, the current CGO overhead:
|
@DemiMarie Earlier @rsc listed three things that we would want evidence for (#42469 (comment)). The performance of calls across the cgo boundary matters most when those calls themselves--not the other code on the caller side, not the other code on the callee side--are a significant part of the performance cost. That implies that the code is making a lot of calls. How often is that the case? And I'll note that I believe we can continue to make the cgo call process faster. @egonelbre Those numbers suggest that the overhead is due to pointer checking, but this proposal doesn't address pointer checking at all. |
@ianlancetaylor, yeah, kind of. The more complicated the data structure, the more time it'll take to check. For the benchmark I tried to find the most complex struct possible (https://github.com/golang/go/blob/master/misc/cgo/test/test.go#L120); however, I would suspect that such structs are the exception (a short illustration of when the check triggers follows this comment). Nevertheless, there probably is a way to speed up such structures as well. I would suspect that for most codebases the overhead will be in the call transition itself (entersyscall/exitsyscall) rather than in the pointer checks. Both Filippo's and the Tor Project's examples seem to predate a few optimizations to cgo, so I'm not sure how applicable they are. For Filippo's example the current cgo call overhead is ~50ns, which seems negligible compared to the ~20us function cost. |
PS: while reinvestigating cgo entersyscall, I noticed that there might be ways to save at least ~4ns (https://go-review.googlesource.com/c/go/+/278793). |
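To make the pointer-check cost concrete, here is a minimal sketch (the C helpers use_pair and use_int are made up for illustration): the check only has real work to do when an argument's type contains pointers that must be walked, while plain value arguments involve no check at all.

```go
package main

/*
typedef struct { int n; int *p; } pair;
static void use_pair(pair *v) { (void)v; }
static void use_int(int v) { (void)v; }
*/
import "C"

func main() {
	var v C.pair
	// The argument points to a struct containing a pointer field, so the
	// runtime's cgoCheckPointer walks it before the call.
	C.use_pair(&v)

	// A plain value argument carries no pointers, so no check is needed.
	C.use_int(42)
}
```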
The Tor Project needed to use callbacks from C into Go extensively, and those callbacks are extremely slow. |
Can you point to some benchmarks? That would give folks something to aim at, rather than talking across each other. |
@davecheney Sadly no, but I do remember reading that they were seeing ~1-2 milliseconds at one point. |
Related: #16051. If the issue is that these functions need to be preemptible, would another alternative be to give these "non-blocking" foreign functions some way to yield to the scheduler? |
Callbacks from C to Go have gotten much faster over the last few releases (including the upcoming 1.16 release). |
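For context on what a C-to-Go callback involves, here is a minimal sketch using cgo's //export mechanism (two files, because a file that uses //export may only contain declarations, not definitions, in its C preamble; all names are illustrative):

```go
// callback.go
package main

import "C"
import "fmt"

//export goCallback
func goCallback(n C.int) {
	fmt.Println("Go was called back from C with", n)
}
```

```go
// main.go
package main

/*
// Declaration of the exported Go function, plus a C helper that calls it.
extern void goCallback(int n);
static void callIntoGo(void) { goCallback(42); }
*/
import "C"

func main() {
	C.callIntoGo()
}
```

Each such callback crosses the C-to-Go boundary, which is the transition that has been getting cheaper in recent releases.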
As an experiment, I created a trivial dummy C function and called it via cgo.
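A sketch of what such an experiment can look like, assuming a no-op C function (all names here are illustrative; cgo cannot be used directly in _test.go files, so the call is wrapped in a regular package file):

```go
// bench.go
package cgobench

/*
static void noop(void) {}
*/
import "C"

// CallNoop performs a single cgo call to a C function that does nothing,
// so its cost is essentially pure call overhead.
func CallNoop() { C.noop() }
```

```go
// bench_test.go
package cgobench

import "testing"

//go:noinline
func goNoop() {}

// Baseline: loop plus a plain Go function call.
func BenchmarkGoNoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		goNoop()
	}
}

// The cgo round trip, for comparison.
func BenchmarkCgoNoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		CallNoop()
	}
}
```

Running go test -bench=. on such a package gives a rough per-call figure comparable to the numbers quoted in this thread.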
A crude benchmark on my laptop (i7-8550U) shows call + loop overhead is roughly:
Linux Perf shows the CGO overhead breakdown is roughly (all percentages of total call time):
A couple of observations:
There are a number of cases where C libraries/drivers must be used. Eg, some OSes require using C libraries rather than directly calling the kernel (IIUC, Solaris, Darwin, recent OpenBSD). OpenGL is another example where faster calls would be useful; it's impractical to replace OpenGL drivers with Go/assembly. Eg, instrumenting an older OpenGL game (Urban Terror/Quake3) shows ~10000 OpenGL calls / frame, which at a typical vsync rate adds up to hundreds of thousands of calls per second (a rough back-of-the-envelope estimate follows this comment). In practice, CGO performance limitations can encourage people to use another language for some use cases, so there is less demand (a self-fulfilling prophecy). 2 things would go a long way to accelerating C calls:
With the planned support for multiple ABIs, it would be interesting to consider a limited interface to the C ABI for quick non-reentrant calls with limited stack usage. The extra call overhead could potentially be ~0. Perhaps this could be done via annotations in assembly files? Potential advantages:
MMU stack guard & Go pointer checks could be enabled via a separate compile mode (eg, -race or similar). |
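As a rough back-of-the-envelope estimate for the OpenGL case above (the 60Hz refresh rate and the ~60ns per-call figure are assumptions, the latter taken from overhead numbers quoted elsewhere in this thread):

```go
package main

import "fmt"

func main() {
	const (
		callsPerFrame   = 10000 // instrumented Urban Terror figure quoted above
		framesPerSecond = 60    // assumed vsync rate
		secPerCall      = 60e-9 // assumed ~60ns cgo call overhead
	)
	perFrame := callsPerFrame * secPerCall  // seconds of overhead per frame
	perSecond := perFrame * framesPerSecond // seconds of overhead per second
	fmt.Printf("cgo overhead: %.2f ms/frame, %.0f ms/s\n", perFrame*1e3, perSecond*1e3)
	// Prints roughly 0.60 ms/frame and 36 ms/s: a few percent of a 16.7ms
	// frame budget spent purely on call overhead.
}
```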
Do you have a rough breakdown of what was called? cgoCheckPointer is invoked only for data types containing a pointer, and I suspect it won't be triggered for most of the calls. Similarly, was it using gl2.1 or gl3+? There's a significant difference between those; based on the stats, I would guess gl2.1. As an example, the Oculus Quest performance target is 175 draw calls per frame. To me, 10K draw calls sounds bad regardless of the language used. |
Just a note that the Go syscall and golang.org/x/sys/unix packages already handle these cases. It should never be necessary to use cgo only for this reason. If it is, that should be addressed in one of those packages, not in cgo. |
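To illustrate the point, system calls are reachable from pure Go without cgo via the standard syscall package or golang.org/x/sys/unix; uname here is just an arbitrary example:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// No cgo involved: x/sys/unix issues the system call directly
	// (or via libc on platforms such as Darwin that require it).
	fmt.Println("pid:", unix.Getpid())

	var uts unix.Utsname
	if err := unix.Uname(&uts); err == nil {
		fmt.Println("kernel:", unix.ByteSliceToString(uts.Release[:]))
	}
}
```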
In the current runtime I think the |
@egonelbre, on reflection you're right - @ianlancetaylor, based on past experience I thought there might be some less common calls that might be missing/less desirable in the syscall / golang.org/x/sys/unix packages.
Another example for consideration: I use Go/libpcap for network monitoring/packet inspection. CGO hasn't been a problem so far since I have enough CPU to spare for the traffic load, but I imagine this application would become problematic with more network traffic. Eg, a 10G interface at line rate could generate 830k - 19M packets/sec (depending on packet size). Each packet is retrieved via:
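A sketch of what such a per-packet retrieval typically looks like with cgo and libpcap's pcap_next_ex (an illustration, not necessarily the exact call used here; the capture handle is assumed to be opened elsewhere):

```go
// Package capture sketches the per-packet cgo call into libpcap.
package capture

/*
#cgo LDFLAGS: -lpcap
#include <pcap.h>
*/
import "C"

import "unsafe"

// readPacket fetches one packet from an already-opened capture handle.
// Passing &hdr and &data (Go pointers that receive C pointers) is the kind
// of argument that makes the runtime run its pointer checks on every call.
func readPacket(handle *C.pcap_t) ([]byte, bool) {
	var hdr *C.struct_pcap_pkthdr
	var data *C.u_char
	if C.pcap_next_ex(handle, &hdr, &data) != 1 {
		return nil, false
	}
	// Copy the packet out of libpcap's buffer before it is reused.
	return C.GoBytes(unsafe.Pointer(data), C.int(hdr.caplen)), true
}
```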
This is normally a very fast call, ~40ns on my laptop from C. With CGO, it's much higher (3 cgoCheckPointer calls).
There is still significant overhead from the cgo call transition and the pointer checks. This call is safe, so ideally that overhead could be avoided.
|
With regards to the pointer checks: they can be disabled globally via GODEBUG=cgocheck=0, which at least shows how much of the overhead they account for.
|
Another potential advantage for reusing the goroutine stack (the somewhat crazy idea using multiple ABI support above): It would make analysing performance profiles with CGO much easier. |
@egonelbre, that does seem like an improvement for this case. It would be nice if there were an escape hatch to gain access to faster C calls when needed. That could make the difference between using Go or having to introduce/use another language. I don't see a significant amount of argument copying. In case it helps, here is the assembly:
|
@mpx sure, I understand. Disabling cgocheck isn't a great idea in the first place; I linked it to add more context from previous discussion on the topic. Is the code public somewhere? I'm not quite understanding why it takes such a significant amount of time with the pointer checks.
I should mention that as far as overhead goes, I'm less worried by syscalls/OS-provided frameworks, and more concerned about external C/C++/Rust libraries that have put a large engineering effort into them. Rewriting them in Go and even batching calls to them is not always feasible. |
There does not seem to be a proposal here. Nor has there been evidence of a need. @mpx measured the current overhead (thanks!) and it is down to 60 ns per call. If the 1-2ms that Tor observed was the motivation for filing this proposal, it sounds like that's now fixed. Based on the discussion above, this seems like a likely decline. |
@mpx @smasher164 do you have a concrete need for faster cgo calls? @rsc would you be willing to reopen this if someone later provided evidence of such a need? |
I will need to gather more concrete evidence/measurements to make the case. I'm okay with reopening or filing a new issue when necessary. |
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
The Glasgow Haskell Compiler (GHC) differentiates between “safe” and “unsafe” FFI calls. “safe” FFI calls are allowed to block and call back into Haskell, but have a substantial overhead. “unsafe” FFI calls are not allowed to block, but are as fast as a C function call.
While Go FFI will always be slower due to stack switching, this seems to account for only a small amount of the overhead that others have observed. If the C function is guaranteed to be short-running, a significant speedup can be obtained by making a direct call into C code, without involving the Go scheduler. Of course, if the C function blocks, this is bad, but in many cases, it can be guaranteed not to. Calling back into Go from an unsafe FFI call is undefined behavior, but in many cases such calls are known not to occur.