Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: suboptimal cloning/optimization in slices.Clone #53643

Open
ainar-g opened this issue Jul 1, 2022 · 3 comments
Open

cmd/compile: suboptimal cloning/optimization in slices.Clone #53643

ainar-g opened this issue Jul 1, 2022 · 3 comments
Labels
compiler/runtime NeedsDecision Performance
Milestone

Comments

@ainar-g
Copy link
Contributor

ainar-g commented Jul 1, 2022

What version of Go are you using (go version)?

$ go version
go version go1.18.3 linux/amd64

Does this issue reproduce with the latest release?

Yes with go version devel go1.19-d3ffff2790 Tue Jun 28 13:01:41 2022 +0000 linux/amd64.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ainar/.cache/go-build"
GOENV="/home/ainar/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ainar/go/pkg/mod"
GONOPROXY="REMOVED"
GONOSUMDB="REMOVED"
GOOS="linux"
GOPATH="/home/ainar/go"
GOPRIVATE="REMOVED"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/ainar/go/go1.18"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/ainar/go/go1.18/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.18.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1722739114=/tmp/go-build -gno-record-gcc-switches"

Discoveries / Proposal

slices.Clone is currently defined as:

// Clone returns a copy of the slice.
// The elements are copied using assignment, so this is a shallow clone.
func Clone[S ~[]E, E any](s S) S {
	// Preserve nil in case it matters.
	if s == nil {
		return nil
	}
	return append(S([]E{}), s...)
}

append here seems to be optimized to output what is essentially make + copy. But if we do that explicitly:

// Clone returns a copy of the slice.
// The elements are copied using assignment, so this is a shallow clone.
func Clone[S ~[]E, E any](s S) (clone S) {
	// Preserve nil in case it matters.
	if s == nil {
		return nil
	}
	clone = make(E, len(s))
	copy(clone, s)
	return clone
}

and benchmark it (see https://go.dev/play/p/WHvD0zJ33S1) then the new function generally performs better:

goos: linux
goarch: amd64
cpu: AMD Ryzen 7 PRO 4750U with Radeon Graphics
BenchmarkFoo/10/clone_std-16            11532111                95.56 ns/op           80 B/op        1 allocs/op
BenchmarkFoo/10/clone_our-16            16797295                70.03 ns/op           80 B/op        1 allocs/op
BenchmarkFoo/100/clone_std-16            2828976               425.5 ns/op           896 B/op        1 allocs/op
BenchmarkFoo/100/clone_our-16            3188673               370.0 ns/op           896 B/op        1 allocs/op
BenchmarkFoo/1000/clone_std-16            369156              3092 ns/op            8192 B/op        1 allocs/op
BenchmarkFoo/1000/clone_our-16            407413              2963 ns/op            8192 B/op        1 allocs/op
BenchmarkFoo/10000/clone_std-16            49088             23761 ns/op           81920 B/op        1 allocs/op
BenchmarkFoo/10000/clone_our-16            51192             23345 ns/op           81920 B/op        1 allocs/op
BenchmarkFoo/100000/clone_std-16            3909            267590 ns/op          802821 B/op        1 allocs/op
BenchmarkFoo/100000/clone_our-16            4609            255535 ns/op          802822 B/op        1 allocs/op
BenchmarkFoo/1000000/clone_std-16            879           1197480 ns/op         8003595 B/op        1 allocs/op
BenchmarkFoo/1000000/clone_our-16            918           1186598 ns/op         8003594 B/op        1 allocs/op
PASS
ok      command-line-arguments  15.665s

Looking at the assembly, it seems that the current version uses runtime.growslice while the new one, runtime.mallocgc. There are other changes in the assembly as well.

@gopherbot gopherbot added this to the Unreleased milestone Jul 1, 2022
@randall77
Copy link
Contributor

randall77 commented Jul 1, 2022

I believe we use the append version so that we let the runtime round up to the next size class. That will allow the caller to append some items to the result without immediately triggering a reallocation+copy.

The append version is also knows that it doesn't need to zero the allocation before overwriting it with the copy.

With both those said, it would be nice if the append version wasn't slower. It looks like it isn't for large arrays, but there's some overhead for small arrays.

@dr2chase dr2chase added the NeedsDecision label Jul 1, 2022
@ianlancetaylor ianlancetaylor changed the title x/exp/slices: suboptimal cloning/optimization opportunity missed? cmd/compile: suboptimal cloning/optimization in slices.Clone Jul 1, 2022
@ianlancetaylor
Copy link
Contributor

ianlancetaylor commented Jul 1, 2022

If there is anything to be fixed here, it is in the compiler, so retitled.

@gopherbot gopherbot added the compiler/runtime label Jul 13, 2022
@martisch
Copy link
Contributor

martisch commented Aug 14, 2022

"The append version is also knows that it doesn't need to zero the allocation before overwriting it with the copy."
The same seems true for make+copy too: https://go-review.googlesource.com/c/go/+/146719

I think what we should do is detect cases of append(X, s...) where X is known to have len and cap 0 in the compiler and rewrite this into makeextended+copy where makeextended is a make variant that calculates the new slice size like append does but can assume more constraints then the general append case. This should then be minimaly slower then make+copy. Whether the optimization is alot faster to warrant the added complexity benchmarks need to show.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime NeedsDecision Performance
Projects
Status: Triage Backlog
Development

No branches or pull requests

6 participants