Description
As of Go 1.12, the stack bound check in every function prologue looks like
MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE 0x4018e4
(or some variation thereof). This is a chain of three dependent instructions: the first two are memory loads and the last is a conditional branch. I don't have hard data for this, but I suspect this is really bad for the CPU pipeline. The two loads are almost certainly going to hit in cache, but the CPU still has to wait for the first load before issuing the second, and has to wait for the second before resolving the branch. The branch is highly predictable and can probably be speculated over, but since almost every single function has such a branch, there's a good chance the branch predictor's cache will be too crowded to track all of them, so some will be mispredicted anyway.
Function prologue overhead was also reported to be high in "The benefits and costs of writing a POSIX kernel in a high-level language" by Cutler et al.
One way we could address this is by putting the stack bound in a dedicated register (leveraging our new ability to change the internal ABI, #27539). This would make the prologue a single register/register compare and a branch. The branch would still probably have poor prediction cache locality, but the register/register comparison would happen so quickly that we would lose very little to a speculation failure. We're already moving toward implementing goroutine preemption using signals, which would make it easy to poison this stack bound register when requesting a preemption.
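As a rough illustration of the register-based idea, the Go model below keeps the bound in a simulated register and shows how poisoning it would force the next prologue onto the slow path. Everything here is an assumption made for the sketch: the `regs` type, the `poisonedGuard` value, and `requestPreempt` are hypothetical names, and the real mechanism would live in the compiler and the signal handler, not in Go code like this.

```go
package main

import "fmt"

// poisonedGuard is a hypothetical poison value chosen to be at least
// as large as any real stack pointer, so the check always fails.
const poisonedGuard = ^uintptr(0)

// regs is a hypothetical model of the registers of one thread.
type regs struct {
	sp    uintptr // stack pointer
	guard uintptr // assumed dedicated stack-bound register
}

// prologueTakesSlowPath models the proposed prologue: a single
// register/register compare followed by a branch, with no loads.
func (r *regs) prologueTakesSlowPath() bool {
	return r.sp <= r.guard
}

// requestPreempt models what a signal handler could do: overwrite
// the guard register in the interrupted thread's saved context.
func requestPreempt(r *regs) {
	r.guard = poisonedGuard
}

func main() {
	r := &regs{sp: 0x2000, guard: 0x1000}
	fmt.Println(r.prologueTakesSlowPath()) // enough stack, fast path
	requestPreempt(r)
	fmt.Println(r.prologueTakesSlowPath()) // poisoned, slow path
}
```

In this model the slow path would then have to tell a real stack overflow apart from a preemption request, for example by recognizing the poison value.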