-
Notifications
You must be signed in to change notification settings - Fork 18k
cmd/compile: iter implementations significantly slower than equivalent for loops #69015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Related Issues and Documentation
(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.) |
Could you try Go at tip of the master branch? There were some recent optimizations that didn't make into Go 1.23. Thanks! cc @golang/compiler @dr2chase |
Also, @dr2chase discovered that Apple Silicon chips may have some weird performance behaviors that we don't yet understand. Specifically, the inner loop runs faster if its address range crosses a 4K boundary, which intuitively we'd expect it to be slower... Could you try running on a different machine? And try building with |
I use go version devel go1.24-4f18477d Thu Aug 22 13:19:44 2024 +0000 windows/amd64 run benckmark, I got: goos: windows goarch: amd64 cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics BenchmarkSliceFunctions/AllForLoop-10-16 464714426 2.538 ns/op BenchmarkSliceFunctions/All-10-16 351567169 3.197 ns/op BenchmarkSliceFunctions/BackwardForLoop-10-16 466168953 2.540 ns/op BenchmarkSliceFunctions/Backward-10-16 66170752 16.75 ns/op BenchmarkSliceFunctions/ValuesForLoop-10-16 468401443 2.541 ns/op BenchmarkSliceFunctions/Values-10-16 349017703 3.192 ns/op BenchmarkSliceFunctions/AppendForLoop-10-16 10627237 104.5 ns/op BenchmarkSliceFunctions/AppendSeq-10-16 6952333 167.6 ns/op BenchmarkSliceFunctions/CollectForLoop-10-16 63369013 19.99 ns/op BenchmarkSliceFunctions/Collect-10-16 6831514 173.1 ns/op BenchmarkSliceFunctions/SortForLoop-10-16 35765378 30.09 ns/op BenchmarkSliceFunctions/Sorted-10-16 6520611 180.1 ns/op BenchmarkSliceFunctions/ChunkForLoop-10-16 1000000000 0.6359 ns/op BenchmarkSliceFunctions/Chunk-10-16 201705656 5.958 ns/op BenchmarkMapFunctions/AllForLoopMap-10-16 17116472 69.28 ns/op BenchmarkMapFunctions/AllMap-10-16 16828688 70.17 ns/op BenchmarkMapFunctions/KeysForLoopMap-10-16 17014784 69.13 ns/op BenchmarkMapFunctions/KeysMap-10-16 17204522 69.60 ns/op BenchmarkMapFunctions/ValuesForLoopMap-10-16 16612399 70.88 ns/op BenchmarkMapFunctions/ValuesMap-10-16 17106541 70.92 ns/op BenchmarkMapFunctions/InsertForLoopMap-10-16 1397740 856.9 ns/op BenchmarkMapFunctions/InsertMap-10-16 1279608 956.4 ns/op BenchmarkMapFunctions/CollectForLoopMap-10-16 3888702 298.7 ns/op BenchmarkMapFunctions/CollectMap-10-16 2691692 433.5 ns/op PASS |
The "collect"/"sort"/"insert" counterparts are not very equivalent. I removed them: https://go.dev/play/p/3BNJ1pgQNAE
It is some strange that |
Do not use |
@cherrymui I have used gotip, and the results are similar to 1.23.0. I have also tested the linux(x86) platform, the result is also similar |
The problem with backward is that there's an inlining that doesn't happen, I think the Backward iterator has cost 81. |
Change https://go.dev/cl/609095 mentions this issue: |
If a function f being considered for inlining calls one of its parameters, reduce the normal cost of that call (57) to 17 to increase the chance that f will be inlined and (with luck) that parameter will be revealed as a constant function (which unblocks escape analysis) or perhaps even be inlined. The least-change value for that was still effective for iter_test benchmarks was 32; however tests showed no particular harm even when reduced as low as 7, and there have been reports of other performance problems with rangefunc overheads and so I picked a middling number in hopes of warding off such reports. Updates #69015 Change-Id: I2a525c1beffb9f88daa14caa8a622864b023675c Reviewed-on: https://go-review.googlesource.com/c/go/+/609095 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Tim King <taking@google.com> Reviewed-by: Keith Randall <khr@golang.org>
With three recent CLs, there are some improvements:
I notice that all the benchmarks involve trivial bodies. Microbenchmarks are useful for analyzing behavior, but the benchmarks we wish we had more of are those that model the behavior of real programs that have real costs. |
It looks https://go-review.googlesource.com/c/go/+/629195 improves iterator performance much for the cases in #69015 (comment):
BTW, it looks that PR gives closures (like
|
The intent was to help with rangefunc performance. If you find it necessary to play these games for code that matters (production code, etc), please let us know, because otherwise it's just an implementation detail that might change in a future release without regard for how people decided to depend on it. That said, if a function is package-private and only called once, it might make sense boost its inlining-fu. (I just finished refactoring the inliner to help let us figure that out, but inlining in general remains unsatisfying.) |
PS - on a weekend lark, did a crude test of "if a function is package-private and only called once..." and the crude test is definitely not ready for prime time; the program text got a lot (10%-ish) bigger, no idea if it does anything for performance at all. The other changes, for existing benchmarks which are backward looking (contain no rangefunc and are light on generics) were close enough to flat to not care. By-the-way, if people are using generics or rangefunc in actually-performance-critical code, if there are benchmarks that model your use, those would be good to have, because lack of real-not-micro benchmarks for all the newer language features is a bother. |
Just a thought - if you want realistic benchmarks, perhaps ask through more public channels like the Go mailing list or twitter/bluesky, as I imagine only a handful of people subscribe to issues like this one. |
As of 1.24 freeze, the differences observed in this benchmark appear to be caused by whether the benchmarked function gets inlined into the BenchmarkXXX function. This is still a rangefunc-related problem, but it is because the rangefunc loop looks too expensive to the inliner at the time the inliner calculates the inlining cost.
|
Go version
go version go1.23.0 darwin/arm64(gotip too)
Output of
go env
in your module/workspace:What did you do?
Related Go files:
iter: https://go.dev/play/p/iRuU4kNXngq
iter_test: https://go.dev/play/p/4C_EbsSnlQH
Linux machines and x86 will also be a bit slower. Gotip was also used, with similar results.
Additionally, when examining the assembly output generated by
I noticed that certain functions contain additional instructions that appear to be unnecessary, which could be contributing to the observed performance differences.
What did you see happen?
Analysis of the generated assembly revealed that iterator-based implementations (e.g.,
slices.All
,slices.Backward
,slices.Chunk
) introduce additional overhead compared to traditional for-loops:Additional function calls:
Memory allocations:
runtime.newobject
)Additional control flow:
Indirect function calls:
CALL (R4)
observed in thechunk
function)Increased register usage and stack operations:
Additional safety checks:
slices.Chunk
Increased code size:
Specifically for
slices.Chunk
observed:runtime.newobject
calls for creating closure objectsslices.Chunk[go.shape.[]int,go.shape.int].func1
Similar issues were observed in other iterator-related function implementations.
What did you expect to see?
According to the Go Wiki's Rangefunc Experiment documentation, the optimized code structure in simple cases is almost identical to a manually written for loop.
However, assembly analysis suggests that the current implementations may introduce complexity and potential performance overhead. While these implementations are already quite effective, there is hope that further optimizations could align their performance with traditional for loops in most simple scenarios.
The text was updated successfully, but these errors were encountered: