cmd/cgo: avoid calls to cgoCheckPointer when debug.cgocheck=0 #28454

Open
egonelbre opened this Issue Oct 29, 2018 · 4 comments

@egonelbre
Contributor

egonelbre commented Oct 29, 2018

With GODEBUG=cgocheck=0, Go still makes calls to cgoCheckPointer, each of which bails out early at:

if debug.cgocheck == 0 {

Every such call adds a few ns, but funcs with many arguments can end up accumulating a lot of them.

https://golang.org/cl/142884 changes cgo generated code to:

defer func() func() {
    _cgo0 := x
    _cgo1 := y
    return func() {
        _cgoCheckPointer(_cgo0)
        _cgoCheckPointer(_cgo1)
        C.f(_cgo0, _cgo1)
    }
}()()

I propose that, instead of checking debug.cgocheck == 0 inside cgoCheckPointer, the generated code check it once before calling cgoCheckPointer, so cgo would generate:

defer func() func() {
    _cgo0 := x
    _cgo1 := y
    return func() {
        if debug.cgocheck != 0 {
            _cgoCheckPointer(_cgo0)
            _cgoCheckPointer(_cgo1)
        }
        C.f(_cgo0, _cgo1)
    }
}()()
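
To make the difference concrete, here is a minimal sketch in plain Go, with no cgo involved. The names cgocheck, checkPointer, callInner, callOuter and countChecks are stand-ins invented for illustration; they are not the runtime's or cgo's actual identifiers. It only counts how many checker calls each shape makes, not how long they take.

```go
package main

import "fmt"

// cgocheck is a hypothetical stand-in for the runtime's debug.cgocheck setting.
var cgocheck int32 = 1

// checks counts how many times checkPointer was entered.
var checks int

// checkPointer is a hypothetical stand-in for _cgoCheckPointer.
// Like the real function, it exits early when checking is disabled,
// but the call itself has already been paid for by then.
func checkPointer(p interface{}) {
	checks++
	if cgocheck == 0 {
		return
	}
	_ = p // the real check would walk the value looking for Go pointers
}

// callInner mirrors the current shape: every argument triggers a call.
func callInner(x, y *int) {
	checkPointer(x)
	checkPointer(y)
}

// callOuter mirrors the proposed shape: one test guards all the calls.
func callOuter(x, y *int) {
	if cgocheck != 0 {
		checkPointer(x)
		checkPointer(y)
	}
}

// countChecks runs f once under the given setting and reports
// how many times the checker was entered.
func countChecks(setting int32, f func(x, y *int)) int {
	cgocheck = setting
	checks = 0
	a, b := 1, 2
	f(&a, &b)
	return checks
}

func main() {
	fmt.Println("cgocheck=0:", countChecks(0, callInner), "vs", countChecks(0, callOuter))
	fmt.Println("cgocheck=1:", countChecks(1, callInner), "vs", countChecks(1, callOuter))
}
```

With checking disabled, the inner-check shape still enters the checker once per argument, while the outer-check shape never enters it.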

@mvdan mvdan added the Performance label Oct 29, 2018

@ianlancetaylor

Contributor

ianlancetaylor commented Oct 30, 2018

I see the advantage but I'm not excited about encouraging people to use GODEBUG=cgocheck=0.

@dominikh

Member

dominikh commented Oct 30, 2018

We could file a separate issue for making cgocheck=1 faster, but as it stands, it has a pretty substantial cost. I measured ~50ns per checked argument in the trivial case (struct { **int }), compared with ~68ns for an unchecked cgo call. In the context of APIs like Vulkan, where every function call contains a pointer to multiple pointers, and where performance is paramount, cgocheck=0 is hugely beneficial.

In these environments, it makes much more sense to run with cgocheck=2 during development, but to use cgocheck=0 in production.
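
As a rough back-of-the-envelope model of those two figures, the sketch below assumes the cost scales linearly with the number of checked arguments. That linearity is an assumption made for illustration, not something measured above; the function and constant names are hypothetical.

```go
package main

import "fmt"

const (
	baseCallNs = 68.0 // ~cost of an unchecked cgo call, from the measurement above
	perArgNs   = 50.0 // ~cost per checked argument in the trivial case
)

// estimateNs returns the modeled cost of one cgo call with the given
// number of checked arguments, assuming linear scaling.
func estimateNs(checkedArgs int) float64 {
	return baseCallNs + perArgNs*float64(checkedArgs)
}

// checkShare returns the fraction of the modeled call time spent in checks.
func checkShare(checkedArgs int) float64 {
	total := estimateNs(checkedArgs)
	return (total - baseCallNs) / total
}

func main() {
	for _, n := range []int{1, 2, 4, 8} {
		fmt.Printf("%d checked args: ~%.0fns per call, ~%.0f%% of it in checks\n",
			n, estimateNs(n), 100*checkShare(n))
	}
}
```

Under this model the checks outweigh the call itself from two checked arguments onward, which is why the cost matters for pointer-heavy APIs.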

@egonelbre

Contributor

egonelbre commented Oct 30, 2018

Just to clarify, I'm not excited about it either and would rather see a check that has close to zero cost.

Other things I thought might be possible are:

  1. aggressive cgoCheckCall optimizer (something that elides all the type walking),
  2. single call to cgoCheckCall (still would have the overhead),
  3. inlinable cgoCheckCall.

All of these seem to need significantly more effort than this change, but they are also complementary to the external check.

@egonelbre

Contributor

egonelbre commented Oct 31, 2018

I did a bunch of experiments on https://github.com/egonelbre/exp/blob/master/bench/call/cgo.go#L50.

Experiments:

  1. Baseline is CL142884.
  2. Disabling cgoCheckPointer code generation completely (best-case scenario).
  3. Using if before calling cgoCheckPointer (proposal).
  4. Using cgoCheckPointer1 without variadic args (additional idea).
  5. Combining outer if and cgoCheckPointer1.
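
For readers who want to see what experiment 4 changes, here is a minimal sketch of a variadic checker next to a single-argument one, timed with testing.Benchmark. The functions checkVariadic and checkSingle and the enabled flag are hypothetical stand-ins, not the runtime's code, and this is not the benchmark linked above. Timings from it say nothing about real cgo calls; it only isolates the call shape.

```go
package main

import (
	"fmt"
	"testing"
)

// enabled is a hypothetical stand-in for debug.cgocheck != 0.
var enabled = false

// checkVariadic has the variadic shape: one pointer plus optional extra args.
//
//go:noinline
func checkVariadic(ptr interface{}, args ...interface{}) int {
	if !enabled {
		return 0
	}
	return 1 + len(args) // number of values a real check would examine
}

// checkSingle has the non-variadic shape suggested by cgoCheckPointer1.
//
//go:noinline
func checkSingle(ptr interface{}) int {
	if !enabled {
		return 0
	}
	return 1
}

// sink keeps the compiler from discarding the results.
var sink int

func main() {
	p := new(int)
	variadic := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += checkVariadic(p)
		}
	})
	single := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += checkSingle(p)
		}
	})
	fmt.Println("variadic:", variadic)
	fmt.Println("single:  ", single)
}
```

Both functions are marked noinline so that the call itself is what gets measured; flipping enabled to true shows the enabled path instead.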

PS: note, my machine is somewhat noisy, so take the results with a grain of salt.

Results

Raw results https://gist.github.com/egonelbre/fe11fc2a3bf1617e1e14dfc562bb3e2a

Baseline vs disabling code generation:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  66.5ns ± 4%  -25.45%  (p=0.000 n=19+20)
CArgs2-8   113ns ± 9%    70ns ± 7%  -38.18%  (p=0.000 n=20+20)
CArgs3-8   138ns ± 9%    73ns ± 3%  -47.01%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%    70ns ± 6%  -50.92%  (p=0.000 n=17+20)
CArgs8-8   224ns ± 2%    78ns ± 1%  -65.32%  (p=0.000 n=18+16)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  66.8ns ± 4%   -6.58%  (p=0.000 n=19+20)
CArgs2-8  76.9ns ± 1%  68.1ns ± 1%  -11.35%  (p=0.000 n=19+17)
CArgs3-8  76.7ns ± 2%  74.7ns ± 3%   -2.62%  (p=0.000 n=20+17)
CArgs4-8  81.3ns ± 6%  67.9ns ± 2%  -16.56%  (p=0.000 n=20+20)
CArgs8-8  95.7ns ± 2%  77.6ns ± 2%  -18.99%  (p=0.000 n=20+20)
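
As a reading aid for the delta column, the sketch below recomputes a delta from the rounded old and new times using the usual (new - old) / old formula. The helper name deltaPercent is made up for illustration. benchstat works from the unrounded means, so recomputing from the rounded table values can differ slightly from the printed delta.

```go
package main

import "fmt"

// deltaPercent returns the relative change from oldNs to newNs in percent.
// Negative values mean the new version is faster.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// CArgs1-8 with cgocheck=1: 89.2ns -> 66.5ns
	fmt.Printf("CArgs1-8: %.2f%%\n", deltaPercent(89.2, 66.5))
	// CArgs8-8 with cgocheck=1: 224ns -> 78ns (rounded inputs)
	fmt.Printf("CArgs8-8: %.2f%%\n", deltaPercent(224, 78))
}
```

The first row reproduces the table's -25.45%, while the second comes out near -65.18% rather than the printed -65.32% because the inputs are rounded.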

Baseline vs outer if:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  88.8ns ± 1%    ~     (p=0.395 n=19+15)
CArgs2-8   113ns ± 9%   105ns ± 3%  -7.46%  (p=0.000 n=20+18)
CArgs3-8   138ns ± 9%   130ns ± 4%  -6.35%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   151ns ± 2%  +6.38%  (p=0.000 n=17+20)
CArgs8-8   224ns ± 2%   238ns ± 1%  +6.33%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  69.9ns ± 2%   -2.28%  (p=0.000 n=19+18)
CArgs2-8  76.9ns ± 1%  70.8ns ± 3%   -7.85%  (p=0.000 n=19+17)
CArgs3-8  76.7ns ± 2%  72.0ns ± 2%   -6.15%  (p=0.000 n=20+16)
CArgs4-8  81.3ns ± 6%  74.2ns ± 2%   -8.80%  (p=0.000 n=20+20)
CArgs8-8  95.7ns ± 2%  80.2ns ± 3%  -16.26%  (p=0.000 n=20+18)

Baseline vs cgoCheckPointer1(interface{}):

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  88.9ns ± 2%     ~     (p=0.641 n=19+20)
CArgs2-8   113ns ± 9%    99ns ± 2%  -12.38%  (p=0.000 n=20+18)
CArgs3-8   138ns ± 9%   120ns ± 5%  -13.65%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   140ns ± 2%   -1.46%  (p=0.000 n=17+19)
CArgs8-8   224ns ± 2%   219ns ± 2%   -1.98%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  74.7ns ± 3%  +4.44%  (p=0.000 n=19+19)
CArgs2-8  76.9ns ± 1%  74.5ns ± 3%  -3.07%  (p=0.000 n=19+20)
CArgs3-8  76.7ns ± 2%  75.7ns ± 2%  -1.34%  (p=0.000 n=20+20)
CArgs4-8  81.3ns ± 6%  80.7ns ± 1%    ~     (p=0.649 n=20+18)
CArgs8-8  95.7ns ± 2%  92.1ns ± 1%  -3.85%  (p=0.000 n=20+16)

Baseline vs combined:

# cgocheck=1
name      old time/op  new time/op  delta
CArgs1-8  89.2ns ± 3%  85.2ns ± 5%   -4.50%  (p=0.000 n=19+20)
CArgs2-8   113ns ± 9%   102ns ± 4%  -10.01%  (p=0.000 n=20+20)
CArgs3-8   138ns ± 9%   126ns ± 8%   -8.66%  (p=0.000 n=18+20)
CArgs4-8   142ns ± 1%   135ns ± 1%   -4.40%  (p=0.000 n=17+19)
CArgs8-8   224ns ± 2%   209ns ± 5%   -6.81%  (p=0.000 n=18+20)

# cgocheck=0
name      old time/op  new time/op  delta
CArgs1-8  71.5ns ± 7%  72.4ns ± 3%   +1.30%  (p=0.019 n=19+20)
CArgs2-8  76.9ns ± 1%  71.5ns ± 8%   -7.00%  (p=0.000 n=19+20)
CArgs3-8  76.7ns ± 2%  69.9ns ± 1%   -8.97%  (p=0.000 n=20+19)
CArgs4-8  81.3ns ± 6%  74.2ns ± 0%   -8.78%  (p=0.000 n=20+14)
CArgs8-8  95.7ns ± 2%  79.8ns ± 1%  -16.64%  (p=0.000 n=20+19)