
cmd/compile, runtime: reduce function prologue overhead #31193

Open
aclements opened this issue Apr 1, 2019 · 13 comments

@aclements (Member) commented Apr 1, 2019

As of Go 1.12, the stack bound check in every function prologue looks like

MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE 0x4018e4

(or some variation thereof). This involves a chain of three dependent instructions, the first two of which are memory loads and the last of which is a conditional branch. I don't have hard data for this, but I suspect this is really bad for the CPU pipeline. The two loads are absolutely going to be in cache, but the CPU still has to wait for the first load before issuing the second, and has to wait for the second before resolving the branch. The branch is highly predictable and can probably be speculated over, but since almost every single function has such a branch, it's probably somewhat likely the branch predictor cache will fail us here.

Function prologue overhead was also reported to be high in The benefits and costs of writing a POSIX kernel in a high-level language by Cutler et al.

One way we could address this is by putting the stack bound in a dedicated register (leveraging our new ability to change the internal ABI, #27539). This would make the prologue a single register/register compare and a branch. The branch would still probably have poor prediction cache locality, but the register/register comparison would happen so quickly that we would lose very little to a speculation failure. We're already moving toward implementing goroutine preemption using signals, which would make it easy to poison this stack bound register when requesting a preemption.
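For illustration only, with the stack bound in a dedicated register the prologue could shrink to something like the following (pseudo-assembly; the choice of R14 is an assumption for the sketch, not anything decided in this thread):

CMPQ R14, SP       // stack bound lives in a dedicated register: no loads
JBE  morestack     // signal-based preemption can poison R14 to force this path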

/cc @dr2chase @randall77 @josharian

@randall77 (Contributor) commented Apr 1, 2019

I'd really want to see more data to show that this prologue is actually a performance problem. My intuition says two L1 loads and a predictable branch are ~free.

@aclements (Member, Author) commented Apr 1, 2019

I agree about wanting more data. :) This is all just a hunch. The other half of the hunch is that we got only 5% from @dr2chase's register-based calling convention experiments because we still had all of these memory operations being the long pole in the function call sequence.

@josharian (Contributor) commented Apr 1, 2019

I'd really want to see more data to show that this prologue is actually a performance problem.

Thinking out loud about how to tell...

How about hacking in a change to make every goroutine's stack start out giant, and running with GOGC=off and inlining disabled? Then in theory there is never any stack growth and never any preemption, and we make lots of function calls. We could then pick some CPU-bound, function-call-heavy benchmarks and run them as is and with the function prologues removed.

@josharian (Contributor) commented Apr 2, 2019

Here's an attempt at getting an upper bound on the potential savings.

package p

import (
	"fmt"
	"testing"
)

func count(x uint) {
	if x != 0 {
		count(x - 1)
	}
}

func count2(x uint) {
	if x != 0 {
		count2(x - 1)
	}
}

func BenchmarkCount(b *testing.B) {
	for n := uint(1); n <= 1e6; n *= 100 {
		b.Run(fmt.Sprint(n), func(b *testing.B) {
			count(n) // grow stack
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				count(n)
			}
		})
	}
}

func BenchmarkCount2(b *testing.B) {
	for n := uint(1); n <= 1e6; n *= 100 {
		b.Run(fmt.Sprint(n), func(b *testing.B) {
			count(n) // grow stack
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				count2(n)
			}
		})
	}
}

Now hack the toolchain to omit the stack check prologue for functions named count2.

diff --git a/src/cmd/internal/obj/x86/obj6.go b/src/cmd/internal/obj/x86/obj6.go
index eb0e88b494..422fa6588f 100644
--- a/src/cmd/internal/obj/x86/obj6.go
+++ b/src/cmd/internal/obj/x86/obj6.go
@@ -675,12 +675,12 @@ func preprocess(ctxt *obj.Link, cursym *obj.LSym, newprog obj.ProgAlloc) {
                }
        }
 
-       if !p.From.Sym.NoSplit() || p.From.Sym.Wrapper() {
+       if (!p.From.Sym.NoSplit() || p.From.Sym.Wrapper()) && cursym.Name != `"".count2` {
                p = obj.Appendp(p, newprog)
                p = load_g_cx(ctxt, p, newprog) // load g into CX
        }
 
-       if !cursym.Func.Text.From.Sym.NoSplit() {
+       if !cursym.Func.Text.From.Sym.NoSplit() && cursym.Name != `"".count2` {
                p = stacksplit(ctxt, cursym, p, newprog, autoffset, int32(textarg)) // emit split check
        }

Check with go tool compile -S (sans FUNCDATA, PCDATA, NOP, etc):

"".count STEXT size=70 args=0x8 locals=0x10
	0x0000 00000 (count_test.go:8)	TEXT	"".count(SB), ABIInternal, $16-8
	0x0000 00000 (count_test.go:8)	MOVQ	(TLS), CX
	0x0009 00009 (count_test.go:8)	CMPQ	SP, 16(CX)
	0x000d 00013 (count_test.go:8)	JLS	63
	0x000f 00015 (count_test.go:8)	SUBQ	$16, SP
	0x0013 00019 (count_test.go:8)	MOVQ	BP, 8(SP)
	0x0018 00024 (count_test.go:8)	LEAQ	8(SP), BP
	0x001d 00029 (count_test.go:9)	MOVQ	"".x+24(SP), AX
	0x0022 00034 (count_test.go:9)	TESTQ	AX, AX
	0x0025 00037 (count_test.go:9)	JNE	49
	0x0027 00039 (<unknown line number>)	MOVQ	8(SP), BP
	0x002c 00044 (<unknown line number>)	ADDQ	$16, SP
	0x0030 00048 (<unknown line number>)	RET
	0x0031 00049 (count_test.go:10)	DECQ	AX
	0x0034 00052 (count_test.go:10)	MOVQ	AX, (SP)
	0x0038 00056 (count_test.go:10)	CALL	"".count(SB)
	0x003d 00061 (count_test.go:10)	JMP	39
	0x003f 00063 (count_test.go:8)	CALL	runtime.morestack_noctxt(SB)
	0x0044 00068 (count_test.go:8)	JMP	0

"".count2 STEXT size=48 args=0x8 locals=0x10
	0x0000 00000 (count_test.go:14)	TEXT	"".count2(SB), ABIInternal, $16-8
	0x0000 00000 (count_test.go:14)	SUBQ	$16, SP
	0x0004 00004 (count_test.go:14)	MOVQ	BP, 8(SP)
	0x0009 00009 (count_test.go:14)	LEAQ	8(SP), BP
	0x000e 00014 (count_test.go:15)	MOVQ	"".x+24(SP), AX
	0x0013 00019 (count_test.go:15)	TESTQ	AX, AX
	0x0016 00022 (count_test.go:15)	JNE	34
	0x0018 00024 (<unknown line number>)	MOVQ	8(SP), BP
	0x001d 00029 (<unknown line number>)	ADDQ	$16, SP
	0x0021 00033 (<unknown line number>)	RET
	0x0022 00034 (count_test.go:16)	DECQ	AX
	0x0025 00037 (count_test.go:16)	MOVQ	AX, (SP)
	0x0029 00041 (count_test.go:16)	CALL	"".count2(SB)
	0x002e 00046 (count_test.go:16)	JMP	24

Results:

name              time/op
Count/1-8         4.49ns ± 2%
Count/100-8        269ns ± 1%
Count/10000-8     25.3µs ± 1%
Count/1000000-8   4.23ms ± 2%
Count2/1-8        3.51ns ± 2%
Count2/100-8       266ns ± 2%
Count2/10000-8    24.9µs ± 1%
Count2/1000000-8  4.07ms ±11%

I'm not entirely sure how to interpret this: is the /1 number the meaningful one, or the mid-range steady-state ones? The former is pretty significant, but the latter is just 2-4%, despite the prologue being a substantial fraction of the instructions executed.

The Count2/1000000 benchmark has very high variance. Looking at the raw data, it is bimodal, starting stable at around 4.2-4.3ms for a period and then dropping and re-stabilizing at around 3.7ms, so there is some other effect occurring. I'm thus inclined to disregard that data point entirely.

@aclements (Member, Author) commented Apr 2, 2019

I think the upper bound for the compiler as a benchmark is around 3.5%. I used PEBS profiling to get a precisely attributed CPU-cycles profile of the compiler compiling cmd/compile/internal/ssa, and wrote a tool that uses the DWARF prologue markers to figure out which samples fell in the stack check prologue. Since it was easy to add, I also measured the bytes of text in the stack check prologue, which worked out to 2.0% of the text bytes.

I collected the profile using:

cat >/tmp/perfwrap <<EOF
#!/bin/sh
perf record -e cycles:ppp -o /tmp/perf.data.\$\$ -- "\$@"
EOF
chmod a+x /tmp/perfwrap

cd cmd/compile/internal/ssa
# Make sure dependencies are installed
go build -i
go build -gcflags "-o /tmp/_bench.o" -toolexec /tmp/perfwrap -x

And then ran prologuer (which I just wrote):

1916 of 54450 samples (3.52%) in prologue
202021 of 10097708 bytes (2.00%) in prologue

@josharian (Contributor) commented Apr 2, 2019

It is promising that we got similar bounds (2-4%, 3.5%) via very different means. A few percent is nothing to sniff at, but neither is the loss of a register.

I wonder whether we could combine this with #20005 and pack wb.enabled and stack information all into a single register.

The branch is highly predictable and can probably be speculated over, but since almost every single function has such a branch, it's probably somewhat likely the branch predictor cache will fail us here.

I forgot earlier: I wanted to echo that sentiment, and add to it. When @navytux and I investigated #18977, we concluded (perhaps erroneously) that branch predictor interference was a non-trivial source of performance changes. Having a branch at the beginning of every function call could have non-local negative performance effects by disrupting branch predictions elsewhere. Of course, that's not relevant here, since there is no plan to eliminate the branch, just the loads.

@aclements (Member, Author) commented Apr 2, 2019

A few percent is nothing to sniff at, but neither is the loss of a register.

We've been talking about putting g back into a register on amd64 (like on most other architectures). I think this would supplant that.

I wonder whether we could combine this with #20005 and pack wb.enabled and stack information all into a single register.

Ooh, that's interesting. It would be easy enough to steal the bottom bit without affecting the stack bound check. And we have to interrupt everything to enable the write barrier, so it would be easy to set that bit when needed.
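A minimal sketch of that bit-stealing idea, in ordinary Go for illustration (the bit position and helper names are assumptions for the sketch; the real mechanism would live in the compiler and runtime):

```go
package main

import "fmt"

// The stack bound is at least word-aligned, so its low bit is always zero
// and can carry the write-barrier-enabled flag. Setting bit 0 raises the
// bound by only one byte, which cannot change the outcome of a stack-limit
// compare against a word-aligned SP.
const wbEnabledBit = 1

func pack(stackBound uintptr, wbEnabled bool) uintptr {
	if wbEnabled {
		return stackBound | wbEnabledBit
	}
	return stackBound &^ wbEnabledBit
}

func wbEnabled(packed uintptr) bool     { return packed&wbEnabledBit != 0 }
func stackBound(packed uintptr) uintptr { return packed &^ wbEnabledBit }

func main() {
	p := pack(0x2000, true)
	fmt.Println(wbEnabled(p), stackBound(p) == 0x2000) // true true
}
```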

@CAFxX (Contributor) commented Apr 16, 2019

#29067, which I opened a while back, is related, although it's more about code layout.

@dvyukov (Member) commented Jun 27, 2019

I wonder if we could [ab]use the fact that stacks are at least 2KB-aligned and have something like:

RCX = RSP
RSP -= 0x40
RCX ^= RSP
RCX >>= 11
JNZ slowpath

That's one instruction more (assuming we need to allocate the stack frame anyway), but no memory loads and no registers taken. However, I am not sure we can actually arrange things for such a check, because the size we want to check is not necessarily the stack frame size.

Or is something like the following better?

RCX = RSP
RCX &= 0xfffffffffffff800
RCX -= 0x50
JB slowpath

It also unties frame allocation from the check, so they can use different sizes.
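To make the first snippet's arithmetic concrete, here is the same check written as ordinary Go (illustration only; the 2KB alignment and the example frame sizes are the assumptions stated above):

```go
package main

import "fmt"

// crossesBoundary reports whether subtracting frame from sp leaves the
// current 2KB-aligned block. If any bit at or above bit 11 differs between
// the old and new SP, the subtraction crossed a 2KB boundary, so the slow
// path must run and grow the stack.
func crossesBoundary(sp, frame uintptr) bool {
	newSP := sp - frame
	return (sp^newSP)>>11 != 0
}

func main() {
	fmt.Println(crossesBoundary(0x1840, 0x40)) // stays within [0x1800, 0x2000): false
	fmt.Println(crossesBoundary(0x1820, 0x40)) // drops below 0x1800: true
}
```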

@dvyukov (Member) commented Jun 28, 2019

Rumor has it that using flags from shift operations has higher latency and can cause stalls, so we might need to replace RCX >>= 11 with an AND against a mask, or something else. Or maybe that's another argument in favor of the second snippet.

@dvyukov (Member) commented Jun 28, 2019

Do we know the performance/object-size cost of stealing a register? We could restrict the compiler from using one register and see what happens.

@dr2chase (Contributor) commented Jun 28, 2019

How urgent is your question? I think I could run that experiment.

@dvyukov (Member) commented Jun 29, 2019

Not urgent at all. But it would be useful to have that number before we make any decisions on this issue. I don't know if anybody has near-term plans to attack this issue.
