
runtime: freedefer performance oddity #18923

Closed
petermattis opened this issue Feb 3, 2017 · 9 comments

@petermattis

commented Feb 3, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.8rc3 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build232974522=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

While profiling cockroachdb I noticed runtime.freedefer consuming a surprising amount of time (this is from a 30s profile and, yes, we make a lot of cgo calls):

(pprof) top10
127.59s of 159.49s total (80.00%)
Dropped 1044 nodes (cum <= 0.80s)
Showing top 10 nodes out of 202 (cum >= 1.74s)
      flat  flat%   sum%        cum   cum%
    54.35s 34.08% 34.08%     54.41s 34.11%  runtime.cgocall
    20.79s 13.04% 47.11%     20.81s 13.05%  syscall.Syscall
    14.99s  9.40% 56.51%     14.99s  9.40%  runtime.kevent
    13.04s  8.18% 64.69%     13.04s  8.18%  runtime.mach_semaphore_signal
     6.83s  4.28% 68.97%      6.83s  4.28%  runtime.mach_semaphore_wait
     6.80s  4.26% 73.23%      6.80s  4.26%  [cockroach]
     5.03s  3.15% 76.39%      5.03s  3.15%  runtime.usleep
     2.17s  1.36% 77.75%      4.79s  3.00%  runtime.scanobject
     1.85s  1.16% 78.91%      1.93s  1.21%  runtime.freedefer
     1.74s  1.09% 80.00%      1.74s  1.09%  runtime.duffcopy
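For reference, a CPU profile like the one above can be captured in-process with runtime/pprof. This is a minimal sketch of the general technique, not cockroachdb's actual setup; the file name cpu.prof and the stand-in defer-heavy workload are illustrative:

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

// profileWorkload runs fn under the CPU profiler and writes the
// profile to path for later inspection with `go tool pprof`.
func profileWorkload(path string, fn func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	defer pprof.StopCPUProfile()
	fn()
	return nil
}

func main() {
	err := profileWorkload("cpu.prof", func() {
		// Stand-in workload: lots of deferred calls so that
		// runtime.freedefer shows up in the profile.
		for i := 0; i < 1e6; i++ {
			func() { defer func() {}() }()
		}
	})
	fmt.Println("profile written:", err == nil)
}
```

The resulting file can then be examined with go tool pprof and the top10 / list freedefer commands shown here.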

Examining where the time is going within freedefer shows:

(pprof) list runtime.freedefer
Total: 2.66mins
...
         .          .    272:			})
         .          .    273:		}
     1.82s      1.89s    274:		*d = _defer{}
      20ms       30ms    275:		pp.deferpool[sc] = append(pp.deferpool[sc], d)

_defer is a simple structure of 7 fields. How can clearing the structure possibly take that long? As an experiment, I tweaked this code to "manually" clear each field:

diff --git a/src/runtime/panic.go b/src/runtime/panic.go
index 876bca7..c7cbd3f 100644
--- a/src/runtime/panic.go
+++ b/src/runtime/panic.go
@@ -271,7 +271,14 @@ func freedefer(d *_defer) {
                                unlock(&sched.deferlock)
                        })
                }
-               *d = _defer{}
+               // *d = _defer{}
+               d.siz = 0
+               d.started = false
+               d.sp = 0
+               d.pc = 0
+               d.fn = nil
+               d._panic = nil
+               d.link = nil
                pp.deferpool[sc] = append(pp.deferpool[sc], d)
        }
 }

With this change freedefer consumes 110ms of time for the exact same workload.

Is this a real problem, or is there some sort of profile oddity incorrectly pointing blame at the *d = _defer{} line? It seems like something real, as the above change produces a small improvement on BenchmarkDefer:

name     old time/op  new time/op  delta
Defer-8  51.2ns ± 1%  50.1ns ± 1%  -2.13%  (p=0.000 n=19+20)
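For reference, here is a simplified mirror of the seven fields involved, with the two clearing strategies side by side. The field names come from the diff above; the types are approximated stand-ins, not the runtime's actual declarations (the real type lives in the runtime package):

```go
package main

import "fmt"

// deferRecord is a simplified mirror of go1.8's runtime._defer: seven
// fields, three of which (fn, _panic, link) are pointers the garbage
// collector cares about. Types here are stand-ins, not the real ones.
type deferRecord struct {
	siz     int32
	started bool
	sp      uintptr
	pc      uintptr
	fn      *func()   // really *funcval in the runtime
	_panic  *struct{} // really *_panic
	link    *deferRecord
}

// clearBulk mimics *d = _defer{}: one bulk clear of the whole struct.
func clearBulk(d *deferRecord) { *d = deferRecord{} }

// clearFields mimics the diff above: zero each field individually.
func clearFields(d *deferRecord) {
	d.siz = 0
	d.started = false
	d.sp = 0
	d.pc = 0
	d.fn = nil
	d._panic = nil
	d.link = nil
}

func main() {
	a := deferRecord{siz: 8, started: true, sp: 1, pc: 2, link: &deferRecord{}}
	b := a
	clearBulk(&a)
	clearFields(&b)
	fmt.Println(a == b) // both approaches leave the record fully zeroed
}
```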

The above diff is still doing too much work, as many of the fields are already clear or will be overwritten by the caller of newdefer:

diff --git a/src/runtime/panic.go b/src/runtime/panic.go
index 876bca7..39db94d 100644
--- a/src/runtime/panic.go
+++ b/src/runtime/panic.go
@@ -271,7 +271,8 @@ func freedefer(d *_defer) {
                                unlock(&sched.deferlock)
                        })
                }
-               *d = _defer{}
+               d.started = false
+               d.link = nil
                pp.deferpool[sc] = append(pp.deferpool[sc], d)
        }
 }

Which results in:

name     old time/op  new time/op  delta
Defer-8  51.2ns ± 1%  49.2ns ± 1%  -4.01%  (p=0.000 n=19+20)

Despite the repeatability of the above, I'm still dubious about this change, as I don't have any explanation for why it makes a difference.
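The effect can be approximated outside the runtime with a micro-benchmark comparing a bulk struct clear against the two-field clear. This is a rough analogue using a hypothetical stand-in type rec, not the runtime change itself; absolute numbers won't match BenchmarkDefer:

```go
package main

import (
	"fmt"
	"testing"
)

// rec mirrors the shape of _defer (hypothetical stand-in type).
type rec struct {
	siz     int32
	started bool
	sp, pc  uintptr
	fn      *func()
	_panic  *struct{}
	link    *rec
}

var sink rec // package-level sink so the stores cannot be optimized away

func main() {
	// Analogue of *d = _defer{}: bulk-clear the whole struct.
	bulk := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = rec{}
		}
	})
	// Analogue of the two-field version: clear only started and link.
	partial := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink.started = false
			sink.link = nil
		}
	})
	fmt.Printf("bulk: %dns/op  partial: %dns/op\n",
		bulk.NsPerOp(), partial.NsPerOp())
}
```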

@petermattis

Author

commented Feb 3, 2017

My assembly is rusty, but this seems suspicious (from disasm runtime.freedefer of the unmodified go1.8rc3 code):

...
         .          .    402c8a4: MOVQ R10, 0x50(SP)
         .          .    402c8a9: CMPQ R9, R8
         .          .    402c8ac: JE 0x402ca1d
         .          .    402c8b2: LEAQ 0x78(SP), DI
     1.47s      1.47s    402c8b7: XORPS X0, X0
         .          .    402c8ba: ADDQ $-0x10, DI
         .          .    402c8be: MOVQ BP, -0x10(SP)
      10ms       10ms    402c8c3: LEAQ -0x10(SP), BP
         .       10ms    402c8c8: CALL 0x405be00
      10ms       10ms    402c8cd: MOVQ 0(BP), BP
...
@randall77

Contributor

commented Feb 3, 2017

Looks like that's a call to duffzero. The XORPS is just zeroing a source register for duffzero to use.
That instruction couldn't possibly take any time. Unless it is the first use of the SSE registers and they need to be initialized by the kernel somehow?
@TocarIP
It is certainly OK to replace the *d = _defer{} with explicit zeroing. I'd like to understand what is going on first, though.

@bradfitz bradfitz added this to the Go1.9 milestone Feb 3, 2017

@TocarIP

Contributor

commented Feb 3, 2017

I don't think XORPS is guilty. The slow case takes more stack space and has an extra call to duffcopy. I suspect there is some extra copying going on.

@aclements

Member

commented Jun 8, 2017

@randall77, ping.

/cc @josharian

@randall77

Contributor

commented Jun 8, 2017

I cannot reproduce this problem on either linux or darwin. On a simple tight defer benchmark I get about 10% of the time in freedefer, and 0 samples of that are on the XORPS. That all seems normal to me.

I think to make any more progress we're going to need some way to reproduce the issue.

Here's my repro attempt:

package issue18923

import "testing"

//go:noinline
func f() {
	defer func() {}()
}

func Benchmark1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f()
	}
}

I'm going to punt this to 1.10.

@randall77 randall77 modified the milestones: Go1.10, Go1.9 Jun 8, 2017

@petermattis

Author

commented Jun 12, 2017

This still reproduces for me on both go1.8.3 and tip (b0d592c) on Darwin. This profile is from tip:

(pprof) list freedefer
...
         .          .    273:		}
     900ms      1.24s    274:		*d = _defer{}
      30ms       90ms    275:		pp.deferpool[sc] = append(pp.deferpool[sc], d)
         .          .    276:	}

Is there some other debug data I can provide? I can also provide the exact steps I'm performing.

This might be a clue: disasm freedefer now shows part of the time attributed to runtime.typedmemclr.

         .          .    402e332: MOVQ DI, 0x8(SP)
         .      340ms    402e337: CALL runtime.typedmemclr(SB)	           ;runtime.freedefer panic.go:274
         .          .    402e33c: MOVQ 0x48(SP), CX	                      ;panic.go:274
@josharian

Contributor

commented Jun 13, 2017

Exact steps never hurt.

Sounds like the way forward (for 1.10) is probably just to replace with explicit zeroing. I also see that type _defer's layout could probably be improved. There may be other easy pickings as well.

I think that whether time is spent in typedmemclr vs duffzero depends on what's happening with the garbage collector at the time.

Perhaps the XORPS mystery is just due to event skid?

@mvdan

Member

commented Sep 25, 2017

Perhaps related to the recent 332719f by @ianlancetaylor.

@ianlancetaylor

Contributor

commented Sep 25, 2017

The line has been replaced by explicit zeroing for other reasons in https://golang.org/cl/64070.

I'll close this, but feel free to reopen if you want to continue investigating the general issue.

@golang golang locked and limited conversation to collaborators Sep 25, 2018
