cmd/compile: unnecessary padding in stack frames #42385
In cmd/compile/internal/gc/pgen.go:185, we do:
This code originated in commit 2ac375b (then called cmd/gc/pgen.c:474, committed by Luuk van Dijk).
There's no indication why this code was added. I suspect it is not necessary (maybe it was, but is no longer). This issue is to investigate why we do this and whether we can remove it.
Reported on go-nuts by Eric from arm.
For a simple repro, compile
The autotmps should be 2 bytes apart, not 8. They are 2 bytes apart on amd64, but not arm64.
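The "2 bytes apart, not 8" arithmetic can be sketched with a toy frame-layout model. This is not the compiler's code: the `local` type, the `layout` function, and the `padTo` parameter are hypothetical names introduced here, and offsets grow upward from 0 for simplicity. It only assumes that the extra padding behaves like rounding the running frame size up to the pointer width after each variable.

```go
package main

import "fmt"

// roundUp rounds n up to the next multiple of align.
// align must be a power of two.
func roundUp(n, align int64) int64 {
	return (n + align - 1) &^ (align - 1)
}

// local is a hypothetical description of one stack temporary.
type local struct {
	name  string
	size  int64
	align int64
}

// layout assigns each variable an offset within a toy frame.
// Each variable is placed at its natural alignment. If padTo > 0,
// the running frame size is additionally rounded up to padTo after
// every variable, which models the extra padding under discussion.
func layout(vars []local, padTo int64) []int64 {
	offs := make([]int64, len(vars))
	var size int64
	for i, v := range vars {
		size = roundUp(size, v.align)
		offs[i] = size
		size += v.size
		if padTo > 0 {
			size = roundUp(size, padTo)
		}
	}
	return offs
}

func main() {
	// Two 2-byte temporaries, as in the scenario described above.
	tmps := []local{{"autotmp_1", 2, 2}, {"autotmp_2", 2, 2}}
	fmt.Println(layout(tmps, 0)) // natural alignment only: [0 2]
	fmt.Println(layout(tmps, 8)) // rounded to an 8-byte pointer width: [0 8]
}
```

In this model the second temporary moves from offset 2 to offset 8 purely because of the extra rounding, which is the gap reported on arm64.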
Okay, I will run all the benchmarks on more linux/arm64 machines to see the current situation. Because many benchmarks fluctuate widely, it is difficult to draw conclusions. Intuitively, I think a smaller stack should be better, and gcc does the same. I tested the performance of the compiler itself on multiple linux/arm64 machines, and there was basically no change.
Hi, using the above CL, I have collected test results on linux/arm64.
First, the go1 benchmarks: basically no change.
Then I ran all the benchmarks in the standard library and processed the results as follows:
First, relative to the total number of benchmarks, not many cases show large performance changes. Second, the results contain both clear improvements and clear regressions, though the improvements seem to outweigh the regressions overall. We then analyzed several stable regression cases, one of which is utf8.ValidStringTenASCIIChars.
I agree. Even in the absence of any data, this seems to be a good change. We want to see the data just to make sure that there aren't any unexpected regressions. The fact that it randomly makes some code worse and some code better is unfortunate but not a blocker. Only consistently worse results would block this change.
Rant: Why is there so much variation here? This seems like a very minor change. I would expect the stack to be in L1 cache approximately always. Especially ValidStringTenASCIIChars, it is a leaf function and uses (on amd64) 32 bytes of stack. The only effect of packing stack frames better should be to occasionally need one less stack growth. It's not like you can get false sharing on a stack frame. Maybe L1 is only 1-way associative on arm64 and the stack and string it is operating on happen to conflict? I hate modern hardware.
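The associativity question raised above can be made concrete with a small sketch. It models a cache where the set is chosen by `(address / lineSize) % numSets` and reports whether more distinct lines land in one set than the cache has ways. The 32 KiB cache size, the 64-byte line size, and both addresses are assumptions chosen for illustration, not measurements from any machine in this thread, and real hardware may index by physical address and use a replacement policy this model ignores.

```go
package main

import "fmt"

// overSubscribed reports whether any cache set receives more distinct
// lines than the cache has ways, in which case those lines must evict
// one another. The set index is (addr / lineSize) % numSets.
func overSubscribed(addrs []uint64, lineSize, numSets, ways uint64) bool {
	lines := map[uint64]map[uint64]bool{} // set index -> distinct line numbers
	for _, a := range addrs {
		line := a / lineSize
		set := line % numSets
		if lines[set] == nil {
			lines[set] = map[uint64]bool{}
		}
		lines[set][line] = true
		if uint64(len(lines[set])) > ways {
			return true
		}
	}
	return false
}

func main() {
	const (
		cacheSize = 32 << 10 // assumed 32 KiB L1 data cache
		lineSize  = 64       // assumed 64-byte cache lines
	)
	// Hypothetical addresses: a stack slot and a string exactly one
	// cache size apart, so they map to the same set in both models.
	stackAddr := uint64(0x40000000)
	stringAddr := stackAddr + cacheSize
	addrs := []uint64{stackAddr, stringAddr}

	for _, ways := range []uint64{1, 4} {
		numSets := uint64(cacheSize) / lineSize / ways
		fmt.Printf("%d-way: conflict=%v\n", ways, overSubscribed(addrs, lineSize, numSets, ways))
	}
}
```

Under these assumptions the two lines evict each other only in the direct-mapped (1-way) case; a 4-way set holds both, so a conflict would need at least five hot lines mapping to the same set.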
Even without this change there is a lot of variation, because many benchmarks themselves fluctuate considerably, especially those in the sync package.
The test machine has a 4-way associative L1 cache. Only about 5 cache misses are attributable to BenchmarkValidStringTenASCIIChars itself; the additional cache misses occur mainly in runtime and kernel functions such as __wake_up_common_lock, arch_local_irq_restore, runtime.greyobject, and runtime.retake. I haven't studied Go's testing framework in depth, and I am also unsure how these functions affect the test. But we can indeed influence this benchmark by adding some irrelevant, useless code, and the changes in cache misses and in performance are consistent with each other. Since the benchmark code itself is unchanged, cache misses seem to be the only reasonable explanation. Some hardware-related details are involved here that I don't fully understand either.