Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: use frame pointers for callers #16638

Open
dvyukov opened this issue Aug 8, 2016 · 12 comments
Open

runtime: use frame pointers for callers #16638

dvyukov opened this issue Aug 8, 2016 · 12 comments
Assignees
Milestone

Comments

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Aug 8, 2016

Traceback is the main source of slowdown for tracer. On net/http.BenchmarkClientServerParallel4:
BenchmarkClientServerParallel4-6 200000 10627 ns/op 4482 B/op 57 allocs/op
with tracer:
BenchmarkClientServerParallel4-6 200000 16444 ns/op 4482 B/op 57 allocs/op
That's +55% slowdown. Top functions of profile are:
6.09% http.test http.test [.] runtime.pcvalue
5.88% http.test http.test [.] runtime.gentraceback
5.41% http.test http.test [.] runtime.readvarint
4.31% http.test http.test [.] runtime.findfunc
2.98% http.test http.test [.] runtime.step
2.12% http.test http.test [.] runtime.mallocgc

runtime.callers/gcallers/Callers are not interested in frame/func/sp/args/etc for each frame, they only need PC values. PC values can be obtained using frame pointers, which must be much faster. Note that there calls are always synchronous (can't happen during function prologue or in the middle of goroutine switch), so should be much simpler to handle.

We should use frame pointers in runtime.callers.

@aclements @ianlancetaylor @hyangah

@quentinmit quentinmit added this to the Go1.8 milestone Aug 8, 2016
@randall77
Copy link
Contributor

@randall77 randall77 commented Aug 11, 2016

That could work, and be much faster.
My only worry is that at some point we're going to tackle the "inline non-leaf functions" problem and then just the list of PCs from the frame pointer walk won't be enough. We'll need to somehow expand PCs that correspond to call sites that have been inlined into other functions. I'm not sure how that would work without doing everything the generic gentraceback is doing (findfunc, mainly).

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Aug 11, 2016

There is no need to handle inlining at traceback time.  Today, each PC corresponds to a single file/line/function.  When we can inline non-leaf functions, each PC corresponds to a list of file/line/function tuples.  At traceback time, we only need that PC. At symbolization time, we need to expand that PC into a list of file/line/function tuples. There is already code for handling this in runtime.Frames.Next, we just need to move from FuncForPC to something that can return multiple file/line/function tuples.

The point is, traceback can be fast while still supporting non-leaf inlined functions when we interpret the traceback.

@gopherbot
Copy link

@gopherbot gopherbot commented Dec 1, 2016

CL https://golang.org/cl/33754 mentions this issue.

gopherbot pushed a commit that referenced this issue Dec 1, 2016
func f() {
    g()
}

We mistakenly don't add a frame pointer for f.  This means f
isn't seen when walking the frame pointer linked list.  That
matters for kernel-gathered profiles, and is an impediment for
issues like #16638.

To fix, allocate a stack frame even for otherwise frameless functions
like f.  It is a bit tricky because we need to avoid some runtime
internals that really, really don't want one.

No test at the moment, as only kernel CPU profiles would catch it.
Tests will come with the implementation of #16638.

Fixes #18103

Change-Id: I411206cc9de4c8fdd265bee2e4fa61d161ad1847
Reviewed-on: https://go-review.googlesource.com/33754
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
@gopherbot
Copy link

@gopherbot gopherbot commented Dec 2, 2016

CL https://golang.org/cl/33895 mentions this issue.

gopherbot pushed a commit that referenced this issue Dec 7, 2016
When we copy the stack, we need to adjust all BPs.
We correctly adjust the ones on the stack, but we also
need to adjust the one that is in g.sched.bp.

Like CL 33754, no test as only kernel-gathered profiles will notice.
Tests will come (in 1.9) with the implementation of #16638.

The invariant should hold that every frame pointer points to
somewhere within its stack.  After this CL, it is mostly true, but
something about cgo breaks it.  The runtime checks are disabled
until I figure that out.

Update #16638
Fixes #18174

Change-Id: I6023ee64adc80574ee3e76491d4f0fa5ede3dbdb
Reviewed-on: https://go-review.googlesource.com/33895
Reviewed-by: Austin Clements <austin@google.com>
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.9Early May 3, 2017
@bradfitz bradfitz added the Performance label May 3, 2017
@josharian
Copy link
Contributor

@josharian josharian commented Jun 1, 2017

CL 43150 may also help speed up tracing; that list of hot functions looks familiar from when I was working on that CL.

@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@gopherbot
Copy link

@gopherbot gopherbot commented Sep 21, 2017

Change https://golang.org/cl/33809 mentions this issue: runtime: use frame pointers for callers

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 19, 2018
@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@randall77 randall77 modified the milestones: Go1.13, Go1.14 May 6, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@randall77
Copy link
Contributor

@randall77 randall77 commented Dec 16, 2019

@gopherbot
Copy link

@gopherbot gopherbot commented Dec 20, 2019

Change https://golang.org/cl/212301 mentions this issue: runtime: use frame pointers to implement runtime.callers()

@danscales
Copy link

@danscales danscales commented Dec 20, 2019

@dvyukov I took a look at this and it seems that if we want to get runtime.Callers() semantics fully correct (including a proper skip, based on any inlined frames due to mid-stack inlining), then we will lose a lot of the benefit of the optimization. See my message for my prototype code https://go-review.googlesource.com/c/go/+/212301

But then I noticed that actually the tracer and mprof are using a separate routine gcallers. We could possible only optimize gcallers(). In that case, would it be acceptible to not expand out the logical frames that are not physically there because of mid-stack inlining? Or do the inline expanding at CallerFrames()/Next() time only, but do the early skip processing based on physical frames?

One really rough set of numbers that I did for your benchmark (take it with a grain of salt) is 40% overhead currently [like your 55% overhead], then 24% overhead if we do the optimization but have to deal with inlined frames, and 6% overhead if we don't have to deal with inlined frames (just follow FP pointers and done).

@dvyukov
Copy link
Member Author

@dvyukov dvyukov commented Dec 21, 2019

Hi @danscales,

The most profitable and critical is to optimize tracer, maybe mprof.
Yes, expanding later is perfectly fine. And I think we already do this in trace. Inline frames don't have PCs anyway. So even if we expand and apply skip, we can't possibly shrink it back to an array of PCs.
I've added a bunch of tests for tracer that check that resulting stacks capture precisely what we want, including exact behavior of skip arguments. These should be useful, you could add more if you see other tricky cases.

40% overhead currently [like your 55% overhead]

Either standard unwinding become faster, or everything else become slower :)

and 6% overhead if we don't have to deal with inlined frames

This sounds great. I would rely on tests that we correctly deal with inlining/skip.

Thanks!

@danscales
Copy link

@danscales danscales commented Dec 24, 2019

The most profitable and critical is to optimize tracer, maybe mprof.
Yes, expanding later is perfectly fine. And I think we already do this in trace. Inline frames don't have PCs anyway. So even if we expand and apply skip, we can't possibly shrink it back to an array of PCs.

The problem is that the semantics of Callers (and presumable gcallers) is that tne pcbuf slice must be filled up to its max (as possible) after skipping 'skip' frames, and the skipped frames must be counted including inlined frames. So, if we don't understand the inlining at the time we fill up pcbuf initially, we don't know exactly how many physical frames to skip, so we may not grab enough frames to fill up the max after we do the skip. (The easier thing would be to grab all physical frame pcs until we do the inlining later, but where do we store them all -- we only have pcbuf.) So, we need a slightly looser definition for skip and for filling in pcbuf for gcallers() if we are going to use the frame pointer optimization.

But I will proceed with trying to optimizing gcallers, and doing the inlining interpretation later, and see how things go with your tests.

@dvyukov
Copy link
Member Author

@dvyukov dvyukov commented Jan 7, 2020

We don't need these strict skip semantics here. We need to remove the non-interesting frames, how exactly does not matter. So instead of removing N inlined frames, we can remove M non-inlined frames. I would assume that the relevant runtime functions should not be inlined, so doing correct skip should be doable. Alternatively, we could consider changing what's being stripped a bit, if it makes things much simpler and faster on the unwinding side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
10 participants
You can’t perform that action at this time.