Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: iter implementations significantly slower than equivalent for loops #69015

Open
kscooo opened this issue Aug 22, 2024 · 9 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@kscooo
Copy link

kscooo commented Aug 22, 2024

Go version

go version go1.23.0 darwin/arm64(gotip too)

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/admin/Library/Caches/go-build'
GOENV='/Users/admin/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/admin/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/admin/go'
GOPRIVATE=''
GOPROXY=''
GOROOT='/opt/homebrew/Cellar/go/1.23.0/libexec'
GOSUMDB='sum.golang.google.cn'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.23.0/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='on'
GOTELEMETRYDIR='/Users/admin/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/admin/Developer/test/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/7s/9gl4fgf97rlgr4gt47nythtw0000gn/T/go-build1880376490=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Related Go files:

iter: https://go.dev/play/p/iRuU4kNXngq
iter_test: https://go.dev/play/p/4C_EbsSnlQH

go test -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: ksco/test
cpu: Apple M3 Pro
BenchmarkSliceFunctions/AllForLoop-10-12         	321253351	         3.475 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/All-10-12                	250128255	         4.530 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-12    	344788078	         3.509 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Backward-10-12           	87433018	        13.84 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-12      	344200261	         3.476 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Values-10-12             	263804847	         4.544 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-12      	13531671	        87.18 ns/op	     248 B/op	       5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-12          	 8190546	       145.6 ns/op	     312 B/op	       8 allocs/op
BenchmarkSliceFunctions/CollectForLoop-10-12     	92614030	        13.41 ns/op	      80 B/op	       1 allocs/op
BenchmarkSliceFunctions/Collect-10-12            	 8045440	       146.8 ns/op	     312 B/op	       8 allocs/op
BenchmarkSliceFunctions/SortForLoop-10-12        	57416722	        20.59 ns/op	      80 B/op	       1 allocs/op
BenchmarkSliceFunctions/Sorted-10-12             	 7757234	       153.0 ns/op	     312 B/op	       8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-12       	1000000000	         0.7995 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Chunk-10-12              	231948748	         5.167 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-12        	15668906	        76.65 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/AllMap-10-12               	15576559	        76.58 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-12       	15780648	        75.70 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/KeysMap-10-12              	15699544	        76.53 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-12     	15928665	        75.93 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-12            	15532413	        76.55 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/InsertForLoopMap-10-12     	 1803122	       668.6 ns/op	    1401 B/op	       2 allocs/op
BenchmarkMapFunctions/InsertMap-10-12            	 1655206	       728.1 ns/op	    1489 B/op	       5 allocs/op
BenchmarkMapFunctions/CollectForLoopMap-10-12    	 5277978	       226.0 ns/op	     420 B/op	       1 allocs/op
BenchmarkMapFunctions/CollectMap-10-12           	 3718111	       321.8 ns/op	     716 B/op	       5 allocs/op
PASS
ok  	ksco/test	34.819s

Linux machines and x86 will also be a bit slower. Gotip was also used, with similar results.

Additionally, when examining the assembly output generated by

go build -gcflags="-S" iter.go

I noticed that certain functions contain additional instructions that appear to be unnecessary, which could be contributing to the observed performance differences.

What did you see happen?

Analysis of the generated assembly revealed that iterator-based implementations (e.g., slices.All, slices.Backward, slices.Chunk) introduce additional overhead compared to traditional for-loops:

  1. Additional function calls:

    • Iterator functions themselves
    • Closure function calls
    • Yield function calls
  2. Memory allocations:

    • Heap allocations for closures and iterator states (via runtime.newobject)
    • Larger stack frames
  3. Additional control flow:

    • Iterator state checks
    • Yield function return checks
  4. Indirect function calls:

    • Calls through function pointers (e.g., CALL (R4) observed in the chunk function)
  5. Increased register usage and stack operations:

    • More registers used for managing iterator state
    • More frequent stack operations for saving and restoring state
  6. Additional safety checks:

    • E.g., slice size validation in slices.Chunk
  7. Increased code size:

    • Iterator versions of functions are typically larger than their for-loop counterparts

Specifically for slices.Chunk observed:

  • runtime.newobject calls for creating closure objects
  • Closure setup, including function pointer and captured variable initialization
  • Creation and invocation of slices.Chunk[go.shape.[]int,go.shape.int].func1
  • Multiple closure calls during iteration
  • Checks on yield function return values

Similar issues were observed in other iterator-related function implementations.

What did you expect to see?

According to the Go Wiki's Rangefunc Experiment documentation, the optimized code structure in simple cases is almost identical to a manually written for loop.

However, assembly analysis suggests that the current implementations may introduce complexity and potential performance overhead. While these implementations are already quite effective, there is hope that further optimizations could align their performance with traditional for loops in most simple scenarios.

@gabyhelp
Copy link

@kscooo kscooo changed the title performance: iter implementations significantly slower than equivalent for loops cmd/compile: iter implementations significantly slower than equivalent for loops Aug 22, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 22, 2024
@cherrymui
Copy link
Member

Could you try Go at tip of the master branch? There were some recent optimizations that didn't make into Go 1.23. Thanks!

cc @golang/compiler @dr2chase

@cherrymui cherrymui added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance labels Aug 22, 2024
@cherrymui cherrymui added this to the Go1.24 milestone Aug 22, 2024
@cherrymui
Copy link
Member

cherrymui commented Aug 22, 2024

cpu: Apple M3 Pro

Also, @dr2chase discovered that Apple Silicon chips may have some weird performance behaviors that we don't yet understand. Specifically, the inner loop runs faster if its address range crosses a 4K boundary, which intuitively we'd expect it to be slower... Could you try running on a different machine? And try building with -ldflags=-randlayout=N (where N is some nonzero integer) for some randomized address layout? Thanks.

@qiulaidongfeng
Copy link
Member

I use go version devel go1.24-4f18477d Thu Aug 22 13:19:44 2024 +0000 windows/amd64 run benckmark, I got:

goos: windows
goarch: amd64
cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
BenchmarkSliceFunctions/AllForLoop-10-16                464714426                2.538 ns/op
BenchmarkSliceFunctions/All-10-16                       351567169                3.197 ns/op
BenchmarkSliceFunctions/BackwardForLoop-10-16           466168953                2.540 ns/op
BenchmarkSliceFunctions/Backward-10-16                  66170752                16.75 ns/op
BenchmarkSliceFunctions/ValuesForLoop-10-16             468401443                2.541 ns/op
BenchmarkSliceFunctions/Values-10-16                    349017703                3.192 ns/op
BenchmarkSliceFunctions/AppendForLoop-10-16             10627237               104.5 ns/op
BenchmarkSliceFunctions/AppendSeq-10-16                  6952333               167.6 ns/op
BenchmarkSliceFunctions/CollectForLoop-10-16            63369013                19.99 ns/op
BenchmarkSliceFunctions/Collect-10-16                    6831514               173.1 ns/op
BenchmarkSliceFunctions/SortForLoop-10-16               35765378                30.09 ns/op
BenchmarkSliceFunctions/Sorted-10-16                     6520611               180.1 ns/op
BenchmarkSliceFunctions/ChunkForLoop-10-16              1000000000               0.6359 ns/op
BenchmarkSliceFunctions/Chunk-10-16                     201705656                5.958 ns/op
BenchmarkMapFunctions/AllForLoopMap-10-16               17116472                69.28 ns/op
BenchmarkMapFunctions/AllMap-10-16                      16828688                70.17 ns/op
BenchmarkMapFunctions/KeysForLoopMap-10-16              17014784                69.13 ns/op
BenchmarkMapFunctions/KeysMap-10-16                     17204522                69.60 ns/op
BenchmarkMapFunctions/ValuesForLoopMap-10-16            16612399                70.88 ns/op
BenchmarkMapFunctions/ValuesMap-10-16                   17106541                70.92 ns/op
BenchmarkMapFunctions/InsertForLoopMap-10-16             1397740               856.9 ns/op
BenchmarkMapFunctions/InsertMap-10-16                    1279608               956.4 ns/op
BenchmarkMapFunctions/CollectForLoopMap-10-16            3888702               298.7 ns/op
BenchmarkMapFunctions/CollectMap-10-16                   2691692               433.5 ns/op
PASS

@zigo101
Copy link

zigo101 commented Aug 22, 2024

The "collect"/"sort"/"insert" counterparts are not very equivalent. I removed them: https://go.dev/play/p/3BNJ1pgQNAE

$ gotv :tip version
[Run]: $HOME/.cache/gotv/bra_master/bin/go version
go version devel go1.24-4f18477db6 Thu Aug 22 13:19:44 2024 +0000 linux/amd64

$ gotv :tip test -bench=.
[Run]: $HOME/.cache/gotv/bra_master/bin/go test -bench=.  -benchmem
goos: linux
goarch: amd64
pkg: example.com
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
BenchmarkSliceFunctions/AllForLoop-10-4         	125123698	         9.555 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/All-10-4                	120565822	        10.03 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-4    	125155231	         9.575 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Backward-10-4           	25645136	        45.58 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-4      	169942302	         7.087 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Values-10-4             	141663436	         8.305 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-4      	 4608756	       269.5 ns/op	     248 B/op	       5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-4          	 2763048	       432.1 ns/op	     312 B/op	       8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-4       	311692539	         3.812 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Chunk-10-4              	70037104	        15.69 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-4        	 6592899	       180.5 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/AllMap-10-4               	 6519124	       182.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-4       	 6884683	       173.1 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/KeysMap-10-4              	 6687552	       178.2 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-4     	 6934557	       171.2 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-4            	 6493563	       182.1 ns/op	       0 B/op	       0 allocs/op

It is some strange that appendSeq allocates more than appendForLoop. Their code looks equivalent.

@cherrymui
Copy link
Member

go build -gcflags="-N -l -S" iter.go

Do not use -N -l to look at optimizations. -N -l is meant to generate slower code, by definition.

@kscooo
Copy link
Author

kscooo commented Aug 22, 2024

@cherrymui I have used gotip, and the results are similar to 1.23.0. I have also tested the linux(x86) platform, the result is also similar

@dr2chase
Copy link
Contributor

The problem with backward is that there's an inlining that doesn't happen, I think the Backward iterator has cost 81.

@dr2chase dr2chase self-assigned this Aug 28, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/609095 mentions this issue: cmd/compile: tweak inlining to favor PPARAM-callers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
Status: In Progress
Development

No branches or pull requests

7 participants