cmd/compile, runtime: reduce function prologue overhead #31193
Comments
I'd really want to see more data to show that this prologue is actually a performance problem. My intuition says two L1 loads and a predictable branch are ~free.
I agree about wanting more data. :) This is all just a hunch. The other half of the hunch is that we got only 5% from @dr2chase's register-based calling convention experiments because we still had all of these memory operations being the long pole in the function call sequence.
Thinking out loud about how to tell... How about hacking in a change to make all goroutines' stacks start giant and running with GOGC=off and inlining off? Then in theory there is never any stack growth and never any preemption, and we make lots of function calls. We could then pick some CPU-bound, function-call-heavy benchmarks and run them as is and with function prologues removed.
Here's an attempt at getting an upper bound on the potential savings.
package p
import (
"fmt"
"testing"
)
func count(x uint) {
if x != 0 {
count(x - 1)
}
}
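// count2 is identical to count; the toolchain hack below strips only count2's
// stack check prologue. The benchmarks therefore pre-grow the stack by calling
// count first, since count2 itself can no longer trigger stack growth.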
func count2(x uint) {
if x != 0 {
count2(x - 1)
}
}
func BenchmarkCount(b *testing.B) {
for n := uint(1); n <= 1e6; n *= 100 {
b.Run(fmt.Sprint(n), func(b *testing.B) {
count(n) // grow stack
b.ResetTimer()
for i := 0; i < b.N; i++ {
count(n)
}
})
}
}
func BenchmarkCount2(b *testing.B) {
for n := uint(1); n <= 1e6; n *= 100 {
b.Run(fmt.Sprint(n), func(b *testing.B) {
count(n) // grow stack
b.ResetTimer()
for i := 0; i < b.N; i++ {
count2(n)
}
})
}
}
Now hack the toolchain to omit the stack check prologue for functions named count2:
diff --git a/src/cmd/internal/obj/x86/obj6.go b/src/cmd/internal/obj/x86/obj6.go
index eb0e88b494..422fa6588f 100644
--- a/src/cmd/internal/obj/x86/obj6.go
+++ b/src/cmd/internal/obj/x86/obj6.go
@@ -675,12 +675,12 @@ func preprocess(ctxt *obj.Link, cursym *obj.LSym, newprog obj.ProgAlloc) {
}
}
- if !p.From.Sym.NoSplit() || p.From.Sym.Wrapper() {
+ if (!p.From.Sym.NoSplit() || p.From.Sym.Wrapper()) && cursym.Name != `"".count2` {
p = obj.Appendp(p, newprog)
p = load_g_cx(ctxt, p, newprog) // load g into CX
}
- if !cursym.Func.Text.From.Sym.NoSplit() {
+ if !cursym.Func.Text.From.Sym.NoSplit() && cursym.Name != `"".count2` {
p = stacksplit(ctxt, cursym, p, newprog, autoffset, int32(textarg)) // emit split check
} Check with
Results:
I'm not entirely sure how to interpret this: Is the The Count2/1000000 benchmark has very high variance. Looking at the raw data, it is bimodal, starting stable at around 4.2-4.3ms for a period and then dropping and re-stabilizing at around 3.7ms, so there is some other effect occurring. I'm thus inclined to disregard that data point entirely. |
I think the upper bound for the compiler as a benchmark is around 3.5%. I used PEBS profiling to get a precisely-attributed CPU cycles profile for the compiler compiling cmd/compile/internal/ssa and wrote a tool to use the DWARF prologue markers to figure out which samples fell in the stack check prologue. Since it was easy to add, I also measured the bytes of text in the stack check prologue, which worked out to 2.0% of the text bytes. I collected the profile using:
cat >/tmp/perfwrap <<EOF
#!/bin/sh
perf record -e cycles:ppp -o /tmp/perf.data.\$\$ -- "\$@"
EOF
chmod a+x /tmp/perfwrap
cd cmd/compile/internal/ssa
# Make sure dependencies are installed
go build -i
go build -gcflags "-o /tmp/_bench.o" -toolexec /tmp/perfwrap -x
And then ran prologuer (which I just wrote):
1916 of 54450 samples (3.52%) in prologue
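For reference, a minimal sketch of how the DWARF prologue markers can be located — this is my own illustration, not the prologuer tool itself; the program name is made up, it assumes an ELF binary with DWARF line info, and it only prints the addresses marked prologue_end (matching them against function entry points and perf sample addresses is left out):

// Illustrative only: dump every address the DWARF line table marks as
// prologue_end in an ELF binary.
package main

import (
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"log"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: prologue_end_dump <binary>")
	}
	f, err := elf.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	d, err := f.DWARF()
	if err != nil {
		log.Fatal(err)
	}
	r := d.Reader()
	for {
		cu, err := r.Next()
		if err != nil {
			log.Fatal(err)
		}
		if cu == nil {
			break // no more entries
		}
		if cu.Tag != dwarf.TagCompileUnit {
			continue
		}
		r.SkipChildren() // only the CU entry itself is needed for its line table
		lr, err := d.LineReader(cu)
		if err != nil || lr == nil {
			continue // this CU has no line table
		}
		var e dwarf.LineEntry
		for lr.Next(&e) == nil {
			if e.PrologueEnd {
				fmt.Printf("prologue ends at %#x\n", e.Address)
			}
		}
	}
}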
It is promising that we got similar bounds (2-4%, 3.5%) via very different means. A few percent is nothing to sniff at, but neither is the loss of a register. I wonder whether we could combine this with #20005 and pack wb.enabled and stack information all into a single register.
I forgot earlier: I wanted to echo that sentiment, and add to it. When @navytux and I investigated #18977, we concluded (perhaps erroneously) that branch predictor interference was a non-trivial source of performance changes. Having a branch at the beginning of every function call could have non-local negative performance effects by disrupting branch predictions elsewhere. Of course, that's not relevant here, since there is no plan to eliminate the branch, just the loads.
We've been talking about putting g back into a register on amd64 (like on most other architectures). I think this would supplant that.
Ooh, that's interesting. It would be easy enough to steal the bottom bit without affecting the stack bound check. And we have to interrupt everything to enable the write barrier, so it would be easy to set that bit when needed.
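To make the bit-stealing idea concrete, here is a rough Go sketch — my illustration, not code from the thread; wbFlag, pack, and the 8-byte-alignment assumption are mine — of why taking bit 0 of the bound cannot disturb the SP <= bound comparison:

// Illustrative only: pack the stack bound and the write-barrier flag into a
// single word by stealing bit 0 of the bound.
package guard

const wbFlag uintptr = 1 // bit 0 of the packed word: write barrier enabled

// pack combines a stack bound (assumed to be at least 8-byte aligned, as SP
// is) with the write-barrier flag.
func pack(bound uintptr, wb bool) uintptr {
	p := bound &^ wbFlag
	if wb {
		p |= wbFlag
	}
	return p
}

// needMoreStack is the prologue check. Since sp and the true bound are both
// multiples of 8, raising the bound by 1 via the stolen bit can never flip
// the result of this comparison.
func needMoreStack(sp, packed uintptr) bool { return sp <= packed }

// wbEnabled recovers the stolen bit.
func wbEnabled(packed uintptr) bool { return packed&wbFlag != 0 }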
#29067, which I opened a while back, is related, although it's more about code layout.
I wonder if we could [ab]use the fact that stacks are at least 2KB-aligned and check the low bits of SP directly (e.g. shift them out so the flags tell us whether we are too close to a 2KB boundary); a rough model is sketched after this comment.
That's one instruction more (assuming we need to allocate the stack frame anyway), but no memory loads and no registers taken. However, I am not sure we can actually arrange things for such a check, because the size we want to check is not necessarily the stack frame size.
It also unties frame allocation from the check, so they can use different sizes.
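A rough Go model of what such an alignment-based check computes — my sketch, with assumed stackAlign and need values, and a function standing in for what would really be an instruction or two in the prologue:

// Illustrative only: a conservative stack check that needs no memory loads,
// relying on the stack's low end being stackAlign-aligned.
package stackcheck

const stackAlign = 2048 // assumed minimum alignment of a stack's low end

// nearBottom reports whether fewer than need bytes might remain between sp
// and the stack's low end. Because the low end is a multiple of stackAlign
// and lies at or below sp, sp&(stackAlign-1) can never exceed the space that
// is actually left, so a false result is always safe. A true result may be
// spurious on stacks larger than stackAlign; the slow path (morestack) would
// then consult the real bound. Note that need can include whatever safety
// margin we want; it is not tied to the frame size being allocated.
func nearBottom(sp, need uintptr) bool {
	return sp&(stackAlign-1) < need
}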
Rumors say that using flags from shift operations has higher latency and can cause stalls, so we might need to replace the shift with a different instruction.
Do we know the performance/object-size cost of stealing a register? We could restrict the compiler from using one register and see what happens.
How urgent is your question? I think I could run that experiment.
Not urgent at all. But it would be useful to have that number before we make any decisions on this issue. I don't know if anybody has near-term plans to attack this issue.
As of Go 1.12, the stack bound check in every function prologue loads the g pointer from thread-local storage, loads the stack bound from g, compares it against SP, and conditionally branches to morestack
(or some variation thereof). This involves a chain of three dependent instructions, the first two of which are memory loads and the last of which is a conditional branch. I don't have hard data for this, but I suspect this is really bad for the CPU pipeline. The two loads are absolutely going to be in cache, but the CPU still has to wait for the first load before issuing the second, and has to wait for the second before resolving the branch. The branch is highly predictable and can probably be speculated over, but since almost every single function has such a branch, it's probably somewhat likely the branch predictor cache will fail us here.
Function prologue overhead was also reported to be high in "The benefits and costs of writing a POSIX kernel in a high-level language" by Cutler et al.
One way we could address this is by putting the stack bound in a dedicated register (leveraging our new ability to change the internal ABI, #27539). This would make the prologue a single register/register compare and a branch. The branch would still probably have poor prediction cache locality, but the register/register comparison would happen so quickly that we would lose very little to a speculation failure. We're already moving toward implementing goroutine preemption using signals, which would make it easy to poison this stack bound register when requesting a preemption.
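A rough sketch of the poisoning idea in Go terms — my illustration, not the planned implementation; with a dedicated register, a signal handler would rewrite the bound in the interrupted goroutine's context rather than storing through a pointer as below:

// Illustrative only: force the next function prologue into the slow path by
// installing a bound that no valid stack pointer can be above.
package preempt

import "sync/atomic"

// poisonBound compares above any real stack pointer, so once it is installed
// the prologue check "SP <= bound" always sends the function into morestack.
const poisonBound = ^uintptr(0)

// bound stands in for the per-goroutine stack bound (a register in the
// proposal, a field on g today).
var bound uintptr

// requestPreempt is what the runtime would do to ask a goroutine to stop at
// its next function call.
func requestPreempt() {
	atomic.StoreUintptr(&bound, poisonBound)
}

// prologue models the check the compiler emits at function entry.
func prologue(sp uintptr) {
	if sp <= atomic.LoadUintptr(&bound) {
		morestack() // grow the stack, or notice the poison and preempt
	}
}

func morestack() { /* elided in this sketch */ }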
/cc @dr2chase @randall77 @josharian