cmd/cgo: avoid calls to cgoCheckPointer when debug.cgocheck=0 #28454

Open
egonelbre opened this issue Oct 29, 2018 · 10 comments

@egonelbre egonelbre commented Oct 29, 2018

With GODEBUG=cgocheck=0, Go still makes calls to cgoCheckPointer, which bails out early at this check:

if debug.cgocheck == 0 {

Every such call adds a few ns, but functions with many arguments can end up accumulating a lot of them.

https://golang.org/cl/142884 changes cgo generated code to:

defer func() func() {
    _cgo0 := x
    _cgo1 := y
    return func() {
        _cgoCheckPointer(_cgo0)
        _cgoCheckPointer(_cgo1)
        C.f(_cgo0, _cgo1)
    }
}()()

I propose that, instead of checking debug.cgocheck == 0 inside cgoCheckPointer, the generated code checks it before calling cgoCheckPointer, so cgo would generate:

defer func() func() {
    _cgo0 := x
    _cgo1 := y
    return func() {
        if debug.cgocheck != 0 {
            _cgoCheckPointer(_cgo0)
            _cgoCheckPointer(_cgo1)
        }
        C.f(_cgo0, _cgo1)
    }
}()()
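
To make the difference concrete, here is a minimal pure-Go sketch of the two shapes. It is not runtime or cgo-generated code: `cgocheck`, `checkPointer`, `callInner`, `callOuter` and `countCalls` are hypothetical stand-ins, and the sketch only counts how often the checker is entered rather than measuring time.

```go
package main

import "fmt"

// cgocheck is a hypothetical stand-in for the runtime's debug.cgocheck setting.
var cgocheck int32 = 1

// checkCalls counts how many times the checker function is entered.
var checkCalls int

// checkPointer mimics the current shape: the early return lives inside the
// callee, so the caller pays for the call even when checking is disabled.
func checkPointer(p interface{}) {
	checkCalls++
	if cgocheck == 0 {
		return
	}
	_ = p // a real checker would inspect the pointed-to memory here
}

// callInner always calls the checker, once per argument.
func callInner(x, y *int) {
	checkPointer(x)
	checkPointer(y)
}

// callOuter tests the setting once and skips every checker call when it is zero.
func callOuter(x, y *int) {
	if cgocheck != 0 {
		checkPointer(x)
		checkPointer(y)
	}
}

// countCalls runs f under the given setting and reports the number of checker calls.
func countCalls(setting int32, f func(x, y *int)) int {
	cgocheck = setting
	checkCalls = 0
	a, b := 1, 2
	f(&a, &b)
	return checkCalls
}

func main() {
	fmt.Println(countCalls(0, callInner), countCalls(0, callOuter)) // 2 0
	fmt.Println(countCalls(1, callInner), countCalls(1, callOuter)) // 2 2
}
```

With the setting at 0 the outer check replaces one call per argument with a single branch; with the setting at 1 both shapes do the same work plus that one extra branch.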
@mvdan mvdan added the Performance label Oct 29, 2018

@ianlancetaylor ianlancetaylor commented Oct 30, 2018

I see the advantage but I'm not excited about encouraging people to use GODEBUG=cgocheck=0.


@dominikh dominikh commented Oct 30, 2018

We could file a separate issue for making cgocheck=1 faster, but as it stands, it has a pretty substantial cost. I measured ~50ns per checked argument in the trivial case (struct { **int }), with ~68ns being an unchecked cgo call. In the context of APIs like Vulkan, where every function call contains a pointer to multiple pointers, and where performance is paramount, cgocheck=0 is hugely beneficial.
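
As a rough back-of-the-envelope illustration of those figures, the sketch below assumes the cost grows linearly with the number of checked arguments. That linearity is an assumption made here for illustration, not something measured in this thread, and `estimateNs` is a hypothetical helper.

```go
package main

import "fmt"

const (
	uncheckedCallNs = 68 // ~68ns for an unchecked cgo call, as quoted above
	perCheckedArgNs = 50 // ~50ns per checked argument in the trivial case, as quoted above
)

// estimateNs returns a naive linear estimate of one cgo call's cost in
// nanoseconds for the given number of checked arguments.
func estimateNs(checkedArgs int) int {
	return uncheckedCallNs + perCheckedArgNs*checkedArgs
}

func main() {
	for _, n := range []int{0, 1, 2, 4} {
		fmt.Printf("%d checked args: ~%dns\n", n, estimateNs(n))
	}
}
```

Under that assumption, two checked arguments already cost more than the unchecked call itself (100ns of checking on top of 68ns).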

In these environments, it makes much more sense to run with cgocheck=2 during development, but to use cgocheck=0 in production.


@egonelbre egonelbre commented Oct 30, 2018

Just to clarify, I'm not excited about it either and would rather see a check that has close to zero cost.

The other things I thought might be possible are:

  1. aggressive cgoCheckCall optimizer (something that elides all the type walking),
  2. single call to cgoCheckCall (still would have the overhead),
  3. inlinable cgoCheckCall.

However, all of these seem to need significantly more effort than this change, and they are also complementary to the external check.


@egonelbre egonelbre commented Oct 31, 2018

I did a bunch of experiments on https://github.com/egonelbre/exp/blob/master/bench/call/cgo.go#L50.

Experiments:

  1. Baseline is CL142884.
  2. Disabling cgoCheckPointer code generation completely (best-case scenario).
  3. Using if before calling cgoCheckPointer (proposal).
  4. Using cgoCheckPointer1 without variadic args (additional idea).
  5. Combining outer if and cgoCheckPointer1.

PS: note, my machine is somewhat noisy, so take the results with a grain of salt.

Results

Raw results https://gist.github.com/egonelbre/fe11fc2a3bf1617e1e14dfc562bb3e2a

Baseline vs disabling code generation:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  66.5ns ± 4%  -25.45%  (p=0.000 n=19+20)
CArgs2-8   113ns ± 9%    70ns ± 7%  -38.18%  (p=0.000 n=20+20)
CArgs3-8   138ns ± 9%    73ns ± 3%  -47.01%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%    70ns ± 6%  -50.92%  (p=0.000 n=17+20)
CArgs8-8   224ns ± 2%    78ns ± 1%  -65.32%  (p=0.000 n=18+16)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  66.8ns ± 4%   -6.58%  (p=0.000 n=19+20)
CArgs2-8  76.9ns ± 1%  68.1ns ± 1%  -11.35%  (p=0.000 n=19+17)
CArgs3-8  76.7ns ± 2%  74.7ns ± 3%   -2.62%  (p=0.000 n=20+17)
CArgs4-8  81.3ns ± 6%  67.9ns ± 2%  -16.56%  (p=0.000 n=20+20)
CArgs8-8  95.7ns ± 2%  77.6ns ± 2%  -18.99%  (p=0.000 n=20+20)
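
For readers unfamiliar with this table layout, the delta column is the relative change from the old time to the new time. The sketch below recomputes it for the CArgs1-8 row under cgocheck=1; `deltaPercent` is a hypothetical helper, and recomputing from the rounded times shown in the table can differ slightly from the printed delta on other rows.

```go
package main

import "fmt"

// deltaPercent returns the relative change from oldNs to newNs in percent.
// Negative values mean the new measurement is faster.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// CArgs1-8 under cgocheck=1: 89.2ns baseline vs 66.5ns without generated checks.
	fmt.Printf("%.2f%%\n", deltaPercent(89.2, 66.5)) // -25.45%
}
```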

Baseline vs outer if:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  88.8ns ± 1%    ~     (p=0.395 n=19+15)
CArgs2-8   113ns ± 9%   105ns ± 3%  -7.46%  (p=0.000 n=20+18)
CArgs3-8   138ns ± 9%   130ns ± 4%  -6.35%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   151ns ± 2%  +6.38%  (p=0.000 n=17+20)
CArgs8-8   224ns ± 2%   238ns ± 1%  +6.33%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  69.9ns ± 2%   -2.28%  (p=0.000 n=19+18)
CArgs2-8  76.9ns ± 1%  70.8ns ± 3%   -7.85%  (p=0.000 n=19+17)
CArgs3-8  76.7ns ± 2%  72.0ns ± 2%   -6.15%  (p=0.000 n=20+16)
CArgs4-8  81.3ns ± 6%  74.2ns ± 2%   -8.80%  (p=0.000 n=20+20)
CArgs8-8  95.7ns ± 2%  80.2ns ± 3%  -16.26%  (p=0.000 n=20+18)

Baseline vs cgoCheckPointer1(interface{}):

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  88.9ns ± 2%     ~     (p=0.641 n=19+20)
CArgs2-8   113ns ± 9%    99ns ± 2%  -12.38%  (p=0.000 n=20+18)
CArgs3-8   138ns ± 9%   120ns ± 5%  -13.65%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   140ns ± 2%   -1.46%  (p=0.000 n=17+19)
CArgs8-8   224ns ± 2%   219ns ± 2%   -1.98%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  74.7ns ± 3%  +4.44%  (p=0.000 n=19+19)
CArgs2-8  76.9ns ± 1%  74.5ns ± 3%  -3.07%  (p=0.000 n=19+20)
CArgs3-8  76.7ns ± 2%  75.7ns ± 2%  -1.34%  (p=0.000 n=20+20)
CArgs4-8  81.3ns ± 6%  80.7ns ± 1%    ~     (p=0.649 n=20+18)
CArgs8-8  95.7ns ± 2%  92.1ns ± 1%  -3.85%  (p=0.000 n=20+16)

Baseline vs combined:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  85.2ns ± 5%   -4.50%  (p=0.000 n=19+20)
CArgs2-8   113ns ± 9%   102ns ± 4%  -10.01%  (p=0.000 n=20+20)
CArgs3-8   138ns ± 9%   126ns ± 8%   -8.66%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   135ns ± 1%   -4.40%  (p=0.000 n=17+19)
CArgs8-8   224ns ± 2%   209ns ± 5%   -6.81%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  72.4ns ± 3%   +1.30%  (p=0.019 n=19+20)
CArgs2-8  76.9ns ± 1%  71.5ns ± 8%   -7.00%  (p=0.000 n=19+20)
CArgs3-8  76.7ns ± 2%  69.9ns ± 1%   -8.97%  (p=0.000 n=20+19)
CArgs4-8  81.3ns ± 6%  74.2ns ± 0%   -8.78%  (p=0.000 n=20+14)
CArgs8-8  95.7ns ± 2%  79.8ns ± 1%  -16.64%  (p=0.000 n=20+19)

@gopherbot gopherbot commented Oct 1, 2019

Change https://golang.org/cl/198081 mentions this issue: cmd/cgo: optimize cgoCheckPointer call


@gopherbot gopherbot commented Mar 30, 2020

Change https://golang.org/cl/226342 mentions this issue: [WIP] cmd/cgo,runtime: inline cgocheck==0 check


@egonelbre egonelbre commented Mar 30, 2020

@ianlancetaylor I finally thought it would be nice to get this closed one way or another :).

I made a proof-of-concept change in https://go-review.googlesource.com/c/go/+/226342. I guess the main question is whether it should be done at all.

The performance improvements are significant:

GODEBUG=cgocheck=0

name                             old time/op  new time/op  delta
CgoCall/add-int-32               48.2ns ± 7%  45.3ns ± 0%   -5.96%  (p=0.016 n=5+4)
CgoCall/one-pointer-32           48.8ns ± 1%  48.4ns ± 2%     ~     (p=0.127 n=5+5)
CgoCall/eight-pointers-32        69.3ns ± 1%  50.0ns ± 0%  -27.87%  (p=0.008 n=5+5)
CgoCall/eight-pointers-nil-32    67.8ns ± 1%  50.6ns ± 1%  -25.41%  (p=0.008 n=5+5)
CgoCall/eight-pointers-array-32  1.44µs ± 1%  0.05µs ± 2%  -96.52%  (p=0.008 n=5+5)
CgoCall/eight-pointers-slice-32   351ns ± 2%    49ns ± 1%  -86.00%  (p=0.008 n=5+5)

Should I try to finish the CL properly and add missing parts from gcc-go or is it better to drop this issue altogether?


@ianlancetaylor ianlancetaylor commented Mar 30, 2020

Sorry, but I'm still not at all persuaded that we should try to make GODEBUG=cgocheck=0 faster. We should absolutely try to make GODEBUG=cgocheck=1 faster, since that is the default. But cgocheck=0 was only intended to support old code that was written before the pointer checking rules were written down. (The argument of using cgocheck=0 only during production is always tempting, but it's similar to the argument that you should use a life jacket while practicing in a swimming pool but not while swimming in the ocean.)

That said I think the CL would be simpler if you introduce a new function

func cgoMuchCheckPointers() bool {
    return debug.cgocheck != 0
}

and call that. The compiler should inline that into the calling code. (If it doesn't, let's find out why not.)


@egonelbre egonelbre commented Mar 30, 2020

Fair enough, I won't bother with making cgocheck=0 faster. I understand the reasoning.

But while doing this, I also realized why the array/slice cases are still so slow.

Here's a reproducer, where a single call takes 15ms:

package main

/*
typedef struct Example {
  int value;
  int *other;
} Example;

int getValue(Example *example) {
	return example->value;
}
*/
import "C"
import (
	"fmt"
	"time"
)

func main() {
	const N = 100

	var data [1 << 20]C.Example
	start := time.Now()
	for i := 0; i < N; i++ {
		_ = C.getValue(&data[0])
	}
	finish := time.Now()
	fmt.Println(finish.Sub(start) / N)
}

Cgo ends up generating this code:

	var data [1 << 20] /*line :22:20*/ _Ctype_struct_Example /*line :22:29*/
	start := time.Now()
	for i := 0; i < N; i++ {
		_ = func() _Ctype_int {
			_cgoIndex0 := & /*line :25:19*/ data
			_cgo0 := /*line :25:18*/ &(*_cgoIndex0)[0]
			_cgoCheckPointer(_cgo0, *_cgoIndex0) // <--- this makes a copy of the array
			return _Cfunc_getValue(_cgo0)
		}()
	}
	finish := time.Now()

I'll look into fixing this and then use it as a resolution for this issue.
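
The copy can be shown without cgo. The sketch below is a pure-Go illustration, not the generated code: `example`, `firstArg` and `observe` are hypothetical names, and the array is kept small. Converting an array value to interface{} copies the whole array, while converting a pointer to it copies only the pointer.

```go
package main

import "fmt"

// example is a hypothetical Go stand-in for the C struct in the reproducer.
type example struct {
	value int32
	other *int32
}

// firstArg stands in for a checker that takes interface{}: whatever is
// passed has to be converted to interface{} at the call site first.
func firstArg(arg interface{}) interface{} { return arg }

// observe passes the array once by value and once by pointer, mutates the
// original afterwards, and reports what each interface value sees.
func observe() (byValue, byPointer int32) {
	var data [1024]example
	v := firstArg(data)  // the interface holds a copy of the entire array
	p := firstArg(&data) // the interface holds only a pointer to data
	data[0].value = 42   // change the original after both conversions
	return v.([1024]example)[0].value, p.(*[1024]example)[0].value
}

func main() {
	byValue, byPointer := observe()
	fmt.Println(byValue, byPointer) // 0 42
}
```

The by-value interface still sees 0 because it holds its own copy. In the reproducer the array has 1 << 20 elements, which on a typical 64-bit platform is roughly 16 MiB copied on every call.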


@gopherbot gopherbot commented Mar 30, 2020

Change https://golang.org/cl/226517 mentions this issue: src/cmd/cgo,src/runtime: avoid array clone during cgo call [WIP]
