proposal: cmd/cgo: unsafe FFI calls #42469

Closed
DemiMarie opened this issue Nov 9, 2020 · 30 comments

Comments

@DemiMarie
Contributor

The Glasgow Haskell Compiler (GHC) differentiates between “safe” and “unsafe” FFI calls. “safe” FFI calls are allowed to block and call back into Haskell, but have a substantial overhead. “unsafe” FFI calls are not allowed to block, but are as fast as a C function call.

While Go FFI will always be slower due to stack switching, this seems to account for only a small amount of the overhead that others have observed. If the C function is guaranteed to be short-running, a significant speedup can be obtained by making a direct call into C code, without involving the Go scheduler. Of course, if the C function blocks, this is bad, but in many cases, it can be guaranteed not to. Calling back into Go from an unsafe FFI call is undefined behavior, but in many cases such calls are known not to occur.

@gopherbot gopherbot added this to the Proposal milestone Nov 9, 2020
@ianlancetaylor ianlancetaylor changed the title Proposal: unsafe FFI calls proposal: cmd/cgo: unsafe FFI calls Nov 9, 2020
@ianlancetaylor
Contributor

This will produce programs that usually work but sometimes hang for inexplicable reasons. I don't think we've come close to the theoretical limit on speeding up cgo calls without removing safety.

Can you point to some documentation for GHC unsafe calls? Thanks.

@smasher164
Member

@ianlancetaylor Here is a section in the GHC User Guide about guaranteed call safety: https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/ffi-chap.html#guaranteed-call-safety

The Haskell 2010 Report specifies that safe FFI calls must allow foreign calls to safely call into Haskell code. In practice, this means that the garbage collector must be able to run while these calls are in progress, moving heap-allocated Haskell values around arbitrarily.

This greatly constrains library authors since it implies that it is not safe to pass any heap object reference to a safe foreign function call. For instance, it is often desirable to pass an unpinned ByteArray#s directly to native code to avoid making an otherwise-unnecessary copy. However, this can only be done safely if the array is guaranteed not to be moved by the garbage collector in the middle of the call.

The Chapter does not require implementations to refrain from doing the same for unsafe calls, so strictly Haskell 2010-conforming programs cannot pass heap-allocated references to unsafe FFI calls either.

In previous releases, GHC would take advantage of the freedom afforded by the Chapter by performing safe foreign calls in place of unsafe calls in the bytecode interpreter. This meant that some packages which worked when compiled would fail under GHCi (e.g. #13730).

However, since version 8.4 this is no longer the case: GHC guarantees that garbage collection will never occur during an unsafe call, even in the bytecode interpreter, and further guarantees that unsafe calls will be performed in the calling thread.

@rsc
Contributor

rsc commented Dec 2, 2020

What we'd need to decide to do this is very compelling evidence that
(1) the speed difference is significant,
(2) the speed difference cannot be reduced by other optimization work, and
(3) this happens often in situations where the difference is critical.

Does anyone have any data about these?

@rsc
Contributor

rsc commented Dec 16, 2020

Leaving open for another week, but in the absence of evidence that the current cgo isn't fast enough, this is headed for a likely decline.

@DemiMarie
Contributor Author

@rsc There is plenty of evidence. The Tor Project decided that CGo was a very poor choice for their incremental rewrite of the Tor daemon. Filippo Valsorda wrote Rustigo to reduce call overhead when invoking Rust cryptographic routines. Rustigo was a disgusting hack, but it was over 15 times faster than CGo, which translated into a significant improvement in benchmarks.

Yes, deadlocking if the invoked function blocks is not great. But there are cases where the invoked function is absolutely guaranteed not to block. Fast cryptographic routines are one example. Graphics APIs such as Vulkan are another, and I recall yet another involving database access. In these cases, if performance matters, the choice isn’t “use CGo or a disgusting assembler hack”. It’s “use a disgusting assembler hack or reimplement the hot code path in a different language”.

@egonelbre
Contributor

For reference, the current CGO overhead:

goos: windows
goarch: amd64
pkg: misc/cgo/test
cpu: AMD Ryzen Threadripper 2950X 16-Core Processor

name                             time/op
CgoCall/add-int-32               49.3ns ± 2%
CgoCall/one-pointer-32           91.8ns ± 0%
CgoCall/eight-pointers-32         343ns ± 1%
CgoCall/eight-pointers-nil-32    89.4ns ± 2%
CgoCall/eight-pointers-array-32  3.90µs ± 1% // known bug
CgoCall/eight-pointers-slice-32  2.70µs ± 0%

GODEBUG=cgocheck=0
name                             time/op
CgoCall/add-int-32               51.1ns ± 0%
CgoCall/one-pointer-32           49.8ns ± 2%
CgoCall/eight-pointers-32        66.1ns ± 0%
CgoCall/eight-pointers-nil-32    66.2ns ± 0%
CgoCall/eight-pointers-array-32  1.62µs ± 1% // known bug
CgoCall/eight-pointers-slice-32   418ns ± 2%

@ianlancetaylor
Contributor

@DemiMarie Earlier @rsc listed three things that we would want evidence for (#42469 (comment)).

The performance of calls across the cgo boundary matters most when those calls themselves--not the other code on the caller side, not the other code on the callee side--are a significant part of the performance cost. That implies that the code is making a lot of calls. How often is that the case? And I'll note that I believe we can continue to make the cgo call process faster.

@egonelbre Those numbers suggest that the overhead is due to pointer checking, but this proposal doesn't address pointer checking at all.

@egonelbre
Contributor

@ianlancetaylor, yeah, kind-of. The more complicated the data-structure, the more time it'll take to check. For the benchmark I tried to find the most complex struct possible, https://github.com/golang/go/blob/master/misc/cgo/test/test.go#L120, however, I would suspect that such structs are the exception. Nevertheless, there probably is a way to speed up such structures as well. I would suspect that for most codebases the overhead will be in entersyscall/exitsyscall rather than cgocheck.

Both Filippo's and the Tor Project's examples seem to predate a few optimizations to cgo, so I'm not sure how applicable they are. For Filippo's example the current cgo call overhead is ~50ns, which seems negligible compared to the 20us function cost.

@egonelbre
Contributor

PS: while reinvestigating cgo entersyscall, I noticed that there might be ways to save at least ~4ns (https://go-review.googlesource.com/c/go/+/278793).

@DemiMarie
Contributor Author

Tor Project needed to use callbacks from C into Go extensively, which are extremely slow.

@davecheney
Contributor

> Tor Project needed to use callbacks from C into Go extensively, which are extremely slow.

Can you point to some benchmarks? That would give folks something to aim at rather than talking past each other.

@DemiMarie
Contributor Author

@davecheney Sadly no, but I do remember reading that they took ~1-2 milliseconds at one point.

@smasher164
Member

Related: #16051

If the issue is that these functions need to be preemptible, would another alternative be to give these "non-blocking" foreign functions some way to yield to the scheduler?

@ianlancetaylor
Contributor

ianlancetaylor commented Dec 17, 2020

Callbacks from C to Go have gotten much faster over the last few releases (including the upcoming 1.16 release).

@mpx
Contributor

mpx commented Jan 4, 2021

As an experiment, I created a trivial dummy function func Frob(int, int, int) int that returns an immediate value to highlight function call overhead. 2 implementations:

  • Go assembly (MOVQ + RET)
  • C function

A crude benchmark on my laptop (i7-8550U) shows call + loop overhead is roughly:

  • Go assembly implementation: ~2ns
  • Go assembly trampoline to C: ~3ns (just for comparison, not recommended)
  • CGO: ~60ns

Linux Perf shows the CGO overhead breakdown is roughly (all percentages of total call time):

  • 83%: runtime.entersyscall + runtime.exitsyscall
    • 30+%: runtime.casgstatus (almost all of it via lock cmpxchg)
    • 23%: runtime.exitsyscallfast
    • 9%: runtime.wirep
  • 6%: runtime.asmcgocall
  • remaining time is mostly the CGO shims/runtime.cgocall

A couple of observations:

  • The amount of time spent synchronising via lock cmpxchg might indicate the actual performance limit isn't too far away with this approach.
  • A "fast/raw" CGO call could remove a large amount of overhead (83%?)
  • As indicated above, if cgoCheckPointer is needed, it can easily become the largest cost dominating everything else (μs)
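
For a sense of scale, the percentages above can be turned into approximate nanoseconds. This is only a rough sketch: it assumes the ~60ns total and the rounded percentages quoted above, and the `share` type and `nanos` helper are names made up for the illustration.

```go
package main

import "fmt"

// share is one slice of the measured cgo call time, expressed as a
// percentage of the total. The figures are the rough ones quoted above.
type share struct {
	name    string
	percent float64
}

// nanos converts a percentage of the total call time into nanoseconds.
func nanos(totalNs, percent float64) float64 {
	return totalNs * percent / 100
}

func main() {
	const totalNs = 60.0 // approximate cost of one cgo call in the benchmark above

	shares := []share{
		{"entersyscall + exitsyscall", 83},
		{"asmcgocall", 6},
		{"shims / cgocall (remainder)", 100 - 83 - 6},
	}
	for _, s := range shares {
		fmt.Printf("%-30s %5.1f ns\n", s.name, nanos(totalNs, s.percent))
	}
}
```

Under these assumptions, removing the entersyscall/exitsyscall pair would leave roughly 10ns per call, which is still several times the ~3ns measured for the assembly trampoline.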

There are a number of cases where C libraries/drivers must be used. Eg, some OSes require using C libraries rather than directly calling the kernel (IIUC, Solaris, Darwin, recent OpenBSD).

OpenGL is another example where faster calls would be useful. It's impractical to replace OpenGL drivers with Go/Assembly. Eg, instrumenting an older OpenGL game (Urban Terror/Quake3) shows ~10000 OpenGL calls / frame. Assuming a vsync of 90Hz, this is ~900k calls/sec - likely completely impractical with cgoCheckPointer and still significant overhead without.

In practice, CGO performance limitations can encourage people to use another language for some use cases, so there is less demand (self-fulfilling prophecy).

2 things would go a long way to accelerating C calls:

  • Provide a way to avoid enter/exitsyscall overhead (83% overhead above)
  • Provide a way to programmatically choose to avoid cgoCheckPointer when the overhead is too great and it is known to be safe (GODEBUG/re-exec can't be used by library authors, enabling checks via a compile mode might be better)

With the planned support for multiple ABIs, it would be interesting to consider a limited interface to the C ABI for quick non-reentrant calls with limited stack usage. The extra call overhead could potentially be ~0. Perhaps this could be done via annotations in assembly files? Potential advantages:

  • Care clearly needs to be taken when writing assembly - not to be done lightly (higher barrier than existing CGO)
  • Calls could specify the amount of stack space required (although pragmas or some other mechanism could be added to Go as well)
  • Radically reduces overhead and performance complaints :)

MMU stack guard & Go pointer checks could be enabled via a separate compile mode (eg, -race or similar).

@egonelbre
Contributor

> Eg, instrumenting an older OpenGL game (Urban Terror/Quake3) shows ~10000 OpenGL calls / frame.

Do you have a rough breakdown of what was called? cgoCheckPointer is invoked only for data types containing a pointer, and I suspect that it won't be triggered for most of the calls. Similarly, was it using gl2.1 or gl3+? There's a significant difference between those. Based on the stats, I would guess gl2.1. As an example, the Oculus Quest performance target is 175 draw calls per frame. To me 10K draw calls sounds bad regardless of the language used.

@ianlancetaylor
Contributor

> There are a number of cases where C libraries/drivers must be used. Eg, some OSes require using C libraries rather than directly calling the kernel (IIUC, Solaris, Darwin, recent OpenBSD).

Just a note that the Go syscall and golang.org/x/sys/unix packages already handle these cases. It should never be necessary to use cgo only for this reason. If it is, that should be addressed in one of those packages, not in cgo.

@ianlancetaylor
Contributor

In the current runtime I think the casgstatus is basically there to coordinate between entering a cgo call and being preempted. But once reentersyscall increments the M's locks field it can no longer be preempted anyhow. It may be possible to simplify that code.

@mpx
Contributor

mpx commented Jan 5, 2021

@egonelbre, on reflection you're right - cgoCheckPointer shouldn't be generated in most cases since pointers to pointers are very rare in the GL API (shouldn't be on the hot path anyway). The game I instrumented used GL2.1. I've heard that "2k draw calls / frame" is normal for GL3+ on some titles due to the API improvements. I tried to find examples at the time, but struggled to find suitable GL3+ software to instrument on the platforms I use. Hence I only have concrete data for GL2.1. If anyone has access to software using more recent GL, you can use apitrace to record the GL calls. CGO overhead is probably still significant, but not as bad as I initially thought.

@ianlancetaylor, based on past experience I thought there might be some less common calls that might be missing/less desirable in the unix package. However, most of my past missing use cases (except seccomp) were added to the unix package in 2017 (Getsockopt* and Ioctl*) - I missed this, thanks.

@mpx
Contributor

mpx commented Jan 5, 2021

Another example for consideration..

I use Go/libpcap for network monitoring/packet inspection. CGO hasn't been a problem so far since I have enough CPU to spare for the traffic load, but I imagine this application would become problematic with more network traffic.

Eg, a 10G interface at line rate could generate 830k - 19M packets/sec (depending on packet size). Each packet is retrieved via:

int pcap_next_ex(pcap_t *p, struct pcap_pkthdr **pkt_header, const u_char **pkt_data);

This is normally a very fast call, ~40ns on my laptop from C. With CGO, it's much higher (3 cgoCheckPointer calls):

  • cgocheck=1: ~330-580ns, median ~510ns
  • cgocheck=0: ~270-310ns, median ~290ns
  • Disabled cgoCheckPointer and shim function in compiler: ~110-230ns, median ~180ns

There is still significant overhead from cgocheck=0.

This call is safe: pcap_next_ex takes a C handle and returns C data via the pointers. Ideally, the author could ensure this is compiled without cgoCheckPointer.

pcap_next_ex can be configured to block (or not). A "fast/raw" option could remove most of the remaining overhead, but it would need care from the developer to ensure this was handled correctly.

@egonelbre
Contributor

With regards to cgocheck=0 see #28454.

110-230ns could suggest the code is hitting the https://golang.org/cl/226517 bug, but it's hard to tell without the code. If any of the arguments is an array, then it currently makes a copy of it when calling cgoCheckPointer.

@mpx
Contributor

mpx commented Jan 5, 2021

Another potential advantage for reusing the goroutine stack (the somewhat crazy idea using multiple ABI support above): It would make analysing performance profiles with CGO much easier.

@mpx
Contributor

mpx commented Jan 5, 2021

@egonelbre, that does seem like an improvement for cgocheck=0. However, I don't think cgocheck=0 is a practical solution in general. Developers should be able to write code/modules that perform well without asking downstream users to set environment variables.

It would be nice if there was an escape hatch to gain access to faster C calls when needed. That could make the difference between using Go or having to introduce/use another language.

I don't see a significant amount of argument copying. In case it helps, here is the cgoCheckPointer shim for C.pcap_next_ex:

Assembly

TEXT monpath/pcap.pcap_next_ex.func1(SB) monpath/pcap/pcap.go
  0x4c43c0              64488b0c25f8ffffff      MOVQ FS:0xfffffff8, CX
  0x4c43c9              483b6110                CMPQ 0x10(CX), SP     
  0x4c43cd              0f86ad000000            JBE 0x4c4480                    
  0x4c43d3              4883ec28                SUBQ $0x28, SP      
  0x4c43d7              48896c2420              MOVQ BP, 0x20(SP)
  0x4c43dc              488d6c2420              LEAQ 0x20(SP), BP
  0x4c43e1              488b442430              MOVQ 0x30(SP), AX
  0x4c43e6              48890424                MOVQ AX, 0(SP)   
  0x4c43ea              e871a0f4ff              CALL runtime.convT64(SB)
  0x4c43ef              488d05ca650100          LEAQ 0x165ca(IP), AX            
  0x4c43f6              48890424                MOVQ AX, 0(SP)      
  0x4c43fa              0f57c0                  XORPS X0, X0  
  0x4c43fd              0f11442410              MOVUPS X0, 0x10(SP)
  0x4c4402              e8992df4ff              CALL runtime.cgoCheckPointer(SB)
  0x4c4407              488d05f2540100          LEAQ 0x154f2(IP), AX            
  0x4c440e              48890424                MOVQ AX, 0(SP)      
  0x4c4412              488b442438              MOVQ 0x38(SP), AX                                 
  0x4c4417              4889442408              MOVQ AX, 0x8(SP) 
  0x4c441c              0f57c0                  XORPS X0, X0     
  0x4c441f              0f11442410              MOVUPS X0, 0x10(SP)
  0x4c4424              e8772df4ff              CALL runtime.cgoCheckPointer(SB)
  0x4c4429              488d0510550100          LEAQ 0x15510(IP), AX            
  0x4c4430              48890424                MOVQ AX, 0(SP)      
  0x4c4434              488b442440              MOVQ 0x40(SP), AX                
  0x4c4439              4889442408              MOVQ AX, 0x8(SP)                                
  0x4c443e              0f57c0                  XORPS X0, X0    
  0x4c4441              0f11442410              MOVUPS X0, 0x10(SP)                                     
  0x4c4446              e8552df4ff              CALL runtime.cgoCheckPointer(SB)                        
  0x4c444b              488b442430              MOVQ 0x30(SP), AX                                       
  0x4c4450              48890424                MOVQ AX, 0(SP)                                          
  0x4c4454              488b442438              MOVQ 0x38(SP), AX                                       
  0x4c4459              4889442408              MOVQ AX, 0x8(SP)                                        
  0x4c445e              488b442440              MOVQ 0x40(SP), AX                                       
  0x4c4463              4889442410              MOVQ AX, 0x10(SP)                                       
  0x4c4468              e853e0ffff              CALL monpath/pcap._Cfunc_pcap_next_ex(SB)               
  0x4c446d              8b442418                MOVL 0x18(SP), AX                              
  0x4c4471              89442448                MOVL AX, 0x48(SP)                                       
  0x4c4475              488b6c2420              MOVQ 0x20(SP), BP                                       
  0x4c447a              4883c428                ADDQ $0x28, SP                                          
  0x4c447e              c3                      RET                                                     
  0x4c447f              90                      NOPL                                                    
  0x4c4480              e83b32faff              CALL runtime.morestack_noctxt(SB)                       
  0x4c4485              e936ffffff              JMP monpath/pcap.pcap_next_ex.func1(SB)                 

@egonelbre
Contributor

@mpx sure, I understand. Disabling cgocheck isn't a great idea in the first place. I linked to add more context for previous discussion on cgocheck=0.

Is the code public somewhere? I'm not quite understanding why it takes such a significant amount of time with cgocheck=0. I could be just mis-guessing something.

@smasher164
Member

I should mention that as far as overhead goes, I'm less worried by syscalls/OS-provided frameworks, and more concerned about external C/C++/Rust libraries that have put a large engineering effort into them. Rewriting them in Go and even batching calls to them is not always feasible.

@rsc
Contributor

rsc commented Jan 6, 2021

There does not seem to be a proposal here. Nor has there been evidence of a need. @mpx measured the current overhead (thanks!) and it is down to 60 ns per call. If the 1-2ms that Tor observed was the motivation for filing this proposal, it sounds like that's now fixed.

Based on the discussion above, this seems like a likely decline.

@DemiMarie
Contributor Author

@mpx @smasher164 do you have a concrete need for faster cgo calls? @rsc would you be willing to reopen this if someone later provided evidence of such a need?

@smasher164
Member

I will need to gather more concrete evidence/measurements to make the case. I'm okay with reopening or filing a new issue when necessary.

@rsc
Contributor

rsc commented Jan 6, 2021

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Jan 13, 2021

No change in consensus, so declined.
— rsc for the proposal review group
