runtime: use frame pointers for callers #16638
Comments
That could work, and be much faster.
There is no need to handle inlining at traceback time. Today, each PC corresponds to a single file/line/function. When we can inline non-leaf functions, each PC corresponds to a list of file/line/function tuples. At traceback time, we only need that PC. At symbolization time, we need to expand that PC into a list of file/line/function tuples. There is already code for handling this. The point is, traceback can be fast while still supporting non-leaf inlined functions, as long as we do the expansion when we interpret the traceback.
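(For illustration: the public API already splits the two steps this way -- runtime.Callers records raw PCs, and runtime.CallersFrames expands each PC into its file/line/function tuples, including frames created by inlining. The runtime-internal code paths differ, but a minimal sketch of the split looks like this:)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Traceback step: record one raw PC per physical frame (cheap).
	pc := make([]uintptr, 32)
	n := runtime.Callers(1, pc) // skip=1 drops the runtime.Callers frame itself

	// Symbolization step: expand each PC into its logical frames.
	// A single PC can yield several Frames when it sits inside an
	// inlined call chain.
	frames := runtime.CallersFrames(pc[:n])
	for {
		frame, more := frames.Next()
		fmt.Printf("%s\n\t%s:%d\n", frame.Function, frame.File, frame.Line)
		if !more {
			break
		}
	}
}
```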
CL https://golang.org/cl/33754 mentions this issue.
func f() { g() }

We mistakenly don't add a frame pointer for f. This means f isn't seen when walking the frame pointer linked list. That matters for kernel-gathered profiles, and is an impediment for issues like #16638.

To fix, allocate a stack frame even for otherwise frameless functions like f. It is a bit tricky because we need to avoid some runtime internals that really, really don't want one.

No test at the moment, as only kernel CPU profiles would catch it. Tests will come with the implementation of #16638.

Fixes #18103

Change-Id: I411206cc9de4c8fdd265bee2e4fa61d161ad1847
Reviewed-on: https://go-review.googlesource.com/33754
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
CL https://golang.org/cl/33895 mentions this issue.
When we copy the stack, we need to adjust all BPs. We correctly adjust the ones on the stack, but we also need to adjust the one that is in g.sched.bp.

Like CL 33754, no test as only kernel-gathered profiles will notice. Tests will come (in 1.9) with the implementation of #16638.

The invariant should hold that every frame pointer points to somewhere within its stack. After this CL, it is mostly true, but something about cgo breaks it. The runtime checks are disabled until I figure that out.

Update #16638
Fixes #18174

Change-Id: I6023ee64adc80574ee3e76491d4f0fa5ede3dbdb
Reviewed-on: https://go-review.googlesource.com/33895
Reviewed-by: Austin Clements <austin@google.com>
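(To make the invariant concrete: a moved stack shifts by a fixed delta, so every saved BP that pointed into the old stack range has to be shifted by that same delta, including the copy cached in g.sched.bp. A minimal sketch of that adjustment, with made-up names rather than the runtime's actual code:)

```go
package stackcopy // hypothetical package, for illustration only

// adjustBP is illustrative, not the runtime's implementation. If bp
// points into the old stack range [oldLo, oldHi), shift it by delta
// (the distance the stack moved); otherwise leave it alone, e.g. when
// it points outside the Go stack or is zero.
func adjustBP(bp, oldLo, oldHi, delta uintptr) uintptr {
	if bp >= oldLo && bp < oldHi {
		return bp + delta
	}
	return bp
}
```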
CL 43150 may also help speed up tracing; that list of hot functions looks familiar from when I was working on that CL.
Change https://golang.org/cl/33809 mentions this issue: |
Change https://golang.org/cl/212301 mentions this issue: |
@dvyukov I took a look at this, and it seems that if we want to get runtime.Callers() semantics fully correct (including a proper skip count, accounting for any inlined frames due to mid-stack inlining), then we lose a lot of the benefit of the optimization. See my message on my prototype code https://go-review.googlesource.com/c/go/+/212301

But then I noticed that the tracer and mprof actually use a separate routine, gcallers. We could possibly optimize only gcallers(). In that case, would it be acceptable not to expand out the logical frames that are not physically there because of mid-stack inlining? Or we could do the inline expansion only at CallersFrames()/Next() time, but do the early skip processing based on physical frames.

One really rough set of numbers that I did for your benchmark (take it with a grain of salt): 40% overhead currently [like your 55% overhead], 24% overhead if we do the optimization but still have to deal with inlined frames, and 6% overhead if we don't have to deal with inlined frames (just follow FP pointers and done).
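(For reference, a minimal sketch of the "just follow FP pointers" case on amd64, assuming the usual frame layout where each frame stores the caller's BP at [BP] and the return PC at [BP+8]. The function name, parameters, and stack-bounds check are illustrative, not the prototype's actual code:)

```go
package fpwalk // hypothetical package, for illustration only

import "unsafe"

// fpWalkPCs walks the frame-pointer linked list starting at bp and
// collects return PCs until pcbuf is full or the chain leaves the
// stack bounds [lo, hi).
func fpWalkPCs(bp, lo, hi uintptr, pcbuf []uintptr) int {
	n := 0
	for bp >= lo && bp < hi && n < len(pcbuf) {
		// On amd64 the return PC sits one word above the saved BP.
		pcbuf[n] = *(*uintptr)(unsafe.Pointer(bp + 8))
		n++
		// The word at [BP] is the caller's saved BP; following it is
		// the whole traceback -- no pcvalue/funcdata lookups needed.
		bp = *(*uintptr)(unsafe.Pointer(bp))
	}
	return n
}
```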
Hi @danscales, the most profitable and critical thing is to optimize the tracer, and maybe mprof.
Either standard unwinding becomes faster, or everything else becomes slower :)
This sounds great. I would rely on tests to make sure we correctly deal with inlining/skip. Thanks!
The problem is that the semantics of Callers (and presumably gcallers) are that the pcbuf slice must be filled as close to its capacity as possible after skipping 'skip' frames, and the skipped frames must be counted including inlined frames. So, if we don't understand the inlining at the time we fill up pcbuf initially, we don't know exactly how many physical frames to skip, and we may not grab enough frames to fill the buffer after we do the skip. (The easier approach would be to grab all physical frame PCs and do the inlining expansion later, but where do we store them all -- we only have pcbuf.) So, we need a slightly looser definition of skip and of filling in pcbuf for gcallers() if we are going to use the frame pointer optimization. But I will proceed with trying to optimize gcallers, doing the inlining interpretation later, and see how things go with your tests.
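(A concrete, hypothetical example of the mismatch -- the function names are invented, and whether b actually gets inlined depends on the compiler version and flags:)

```go
package main

import "runtime"

func c() []uintptr {
	pcs := make([]uintptr, 8)
	// skip=2 drops the runtime.Callers frame and c itself, counting
	// *logical* frames. If b is inlined into a, the physical stack is
	// c -> a -> main while the logical stack is c -> b -> a -> main,
	// so skipping 2 *physical* frames during a frame-pointer walk
	// would not start at the frame that skip=2 refers to.
	n := runtime.Callers(2, pcs)
	return pcs[:n]
}

// b is a trivial wrapper that the compiler may inline into a.
func b() []uintptr { return c() }

func a() []uintptr { return b() }

func main() { _ = a() }
```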
We don't need these strict skip semantics here. We need to remove the non-interesting frames; exactly how does not matter. So instead of removing N inlined frames, we can remove M non-inlined frames. I would assume that the relevant runtime functions should not be inlined, so doing a correct skip should be doable. Alternatively, we could consider changing what's being stripped a bit, if it makes things much simpler and faster on the unwinding side.
Traceback is the main source of slowdown for the tracer. On net/http.BenchmarkClientServerParallel4:
BenchmarkClientServerParallel4-6 200000 10627 ns/op 4482 B/op 57 allocs/op
with tracer:
BenchmarkClientServerParallel4-6 200000 16444 ns/op 4482 B/op 57 allocs/op
That's a +55% slowdown. The top functions in the profile are:
6.09% http.test http.test [.] runtime.pcvalue
5.88% http.test http.test [.] runtime.gentraceback
5.41% http.test http.test [.] runtime.readvarint
4.31% http.test http.test [.] runtime.findfunc
2.98% http.test http.test [.] runtime.step
2.12% http.test http.test [.] runtime.mallocgc
runtime.callers/gcallers/Callers are not interested in frame/func/sp/args/etc for each frame; they only need PC values. PC values can be obtained using frame pointers, which should be much faster. Note that these calls are always synchronous (they can't happen during a function prologue or in the middle of a goroutine switch), so they should be much simpler to handle.
We should use frame pointers in runtime.callers.
@aclements @ianlancetaylor @hyangah