cmd/compile: iter implementations significantly slower than equivalent for loops #69015

@kscooo

Go version

go version go1.23.0 darwin/arm64 (also reproduced with gotip)

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/admin/Library/Caches/go-build'
GOENV='/Users/admin/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/admin/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/admin/go'
GOPRIVATE=''
GOPROXY=''
GOROOT='/opt/homebrew/Cellar/go/1.23.0/libexec'
GOSUMDB='sum.golang.google.cn'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.23.0/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='on'
GOTELEMETRYDIR='/Users/admin/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/admin/Developer/test/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/7s/9gl4fgf97rlgr4gt47nythtw0000gn/T/go-build1880376490=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Related Go files:

iter: https://go.dev/play/p/iRuU4kNXngq
iter_test: https://go.dev/play/p/4C_EbsSnlQH

go test -bench=. -benchmem
goos: darwin
goarch: arm64
pkg: ksco/test
cpu: Apple M3 Pro
BenchmarkSliceFunctions/AllForLoop-10-12         	321253351	         3.475 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/All-10-12                	250128255	         4.530 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/BackwardForLoop-10-12    	344788078	         3.509 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Backward-10-12           	87433018	        13.84 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/ValuesForLoop-10-12      	344200261	         3.476 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Values-10-12             	263804847	         4.544 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/AppendForLoop-10-12      	13531671	        87.18 ns/op	     248 B/op	       5 allocs/op
BenchmarkSliceFunctions/AppendSeq-10-12          	 8190546	       145.6 ns/op	     312 B/op	       8 allocs/op
BenchmarkSliceFunctions/CollectForLoop-10-12     	92614030	        13.41 ns/op	      80 B/op	       1 allocs/op
BenchmarkSliceFunctions/Collect-10-12            	 8045440	       146.8 ns/op	     312 B/op	       8 allocs/op
BenchmarkSliceFunctions/SortForLoop-10-12        	57416722	        20.59 ns/op	      80 B/op	       1 allocs/op
BenchmarkSliceFunctions/Sorted-10-12             	 7757234	       153.0 ns/op	     312 B/op	       8 allocs/op
BenchmarkSliceFunctions/ChunkForLoop-10-12       	1000000000	         0.7995 ns/op	       0 B/op	       0 allocs/op
BenchmarkSliceFunctions/Chunk-10-12              	231948748	         5.167 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/AllForLoopMap-10-12        	15668906	        76.65 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/AllMap-10-12               	15576559	        76.58 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/KeysForLoopMap-10-12       	15780648	        75.70 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/KeysMap-10-12              	15699544	        76.53 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/ValuesForLoopMap-10-12     	15928665	        75.93 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/ValuesMap-10-12            	15532413	        76.55 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapFunctions/InsertForLoopMap-10-12     	 1803122	       668.6 ns/op	    1401 B/op	       2 allocs/op
BenchmarkMapFunctions/InsertMap-10-12            	 1655206	       728.1 ns/op	    1489 B/op	       5 allocs/op
BenchmarkMapFunctions/CollectForLoopMap-10-12    	 5277978	       226.0 ns/op	     420 B/op	       1 allocs/op
BenchmarkMapFunctions/CollectMap-10-12           	 3718111	       321.8 ns/op	     716 B/op	       5 allocs/op
PASS
ok  	ksco/test	34.819s

The iterator versions are also somewhat slower on Linux machines and on x86. I also tested with gotip, with similar results.

Additionally, when examining the assembly output generated by

go build -gcflags="-S" iter.go

I noticed that certain functions contain additional instructions that appear to be unnecessary, which could be contributing to the observed performance differences.

What did you see happen?

Analysis of the generated assembly revealed that iterator-based implementations (e.g., slices.All, slices.Backward, slices.Chunk) introduce additional overhead compared to traditional for loops:

  1. Additional function calls:

    • Iterator functions themselves
    • Closure function calls
    • Yield function calls
  2. Memory allocations:

    • Heap allocations for closures and iterator states (via runtime.newobject)
    • Larger stack frames
  3. Additional control flow:

    • Iterator state checks
    • Yield function return checks
  4. Indirect function calls:

    • Calls through function pointers (e.g., CALL (R4) observed in the chunk function)
  5. Increased register usage and stack operations:

    • More registers used for managing iterator state
    • More frequent stack operations for saving and restoring state
  6. Additional safety checks:

    • E.g., slice size validation in slices.Chunk
  7. Increased code size:

    • Iterator versions of functions are typically larger than their for-loop counterparts

Specifically, for slices.Chunk I observed:

  • runtime.newobject calls for creating closure objects
  • Closure setup, including function pointer and captured variable initialization
  • Creation and invocation of slices.Chunk[go.shape.[]int,go.shape.int].func1
  • Multiple closure calls during iteration
  • Checks on yield function return values

Similar issues were observed in other iterator-related function implementations.

What did you expect to see?

According to the Go Wiki's Rangefunc Experiment documentation, the optimized code structure in simple cases is almost identical to a manually written for loop.

However, the assembly analysis suggests that the current implementations introduce extra complexity and measurable overhead. While these implementations are already quite effective, I hope that further optimizations can bring their performance in line with traditional for loops in most simple scenarios.

Labels: NeedsInvestigation, Performance, compiler/runtime
