
cmd/compile: detect and optimize slice insertion idiom append(sa, append(sb, sc...)...) #31592

go101 opened this issue Apr 21, 2019 · 6 comments

@go101 go101 commented Apr 21, 2019

What version of Go are you using (go version)?

go version go1.12.2 linux/amd64

Does this issue reproduce with the latest release?

yes

What did you do?

package main

import (
	"testing"
)

type T = int
var sx = []T{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}
var sy = []T{111, 222, 333, 444}
var index = 6

func SliceInsertion_OneLine(base, inserted []T, i int) []T {
	return append(base[:i], append(inserted, base[i:]...)...)
}

func SliceInsertion_Verbose(base, inserted []T, i int) []T {
	if cap(base)-len(base) >= len(inserted) {
		s := base[:len(base)+len(inserted)]
		copy(s[i+len(inserted):], s[i:])
		copy(s[i:], inserted)
		return s
	} else {
		s := make([]T, 0, len(inserted)+len(base))
		s = append(s, base[:i]...)
		s = append(s, inserted...)
		s = append(s, base[i:]...)
		return s
	}
}

var s1 []T
func Benchmark_SliceInsertion_OneLine(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s1 = SliceInsertion_OneLine(sx, sy, index)
	}
}

var s2 []T
func Benchmark_SliceInsertion_Verbose(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s2 = SliceInsertion_Verbose(sx, sy, index)
	}
}

What did you expect to see?

Small performance difference.

What did you see instead?

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: app/t
Benchmark_SliceInsertion_OneLine-4   	 5000000	       311 ns/op	     368 B/op	       2 allocs/op
Benchmark_SliceInsertion_Verbose-4   	10000000	       146 ns/op	     160 B/op	       1 allocs/op
PASS
ok  	app/t	3.535s
@go101 go101 commented Apr 21, 2019

I would say this optimization is needed more than #21266, which was made in Go 1.11, because this idiom is used more widely. In fact, even before Go 1.11 there was already a slice-extending method that is more efficient than the optimization made in Go 1.11. Evidence:

package main

import (
	"testing"
)

type T = int
var sx = []T{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}

func SliceGrow_OneLine(base []T, newCapacity int) []T {
	return append(base, make([]T, newCapacity-cap(base))...)
}

func SliceGrow_VerboseCopy(base []T, newCapacity int) []T {
	m := make([]T, newCapacity)
	copy(m, base)
	return m
}

var sa []T
func Benchmark_SliceGrow_OneLine(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sa = SliceGrow_OneLine(sx, 100)
	}
}

var sc []T
func Benchmark_SliceGrow_VerboseCopy(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sc = SliceGrow_VerboseCopy(sx, 100)
	}
}

Benchmark result:

Benchmark_SliceGrow_OneLine-4       	 3000000	       426 ns/op	     896 B/op	       1 allocs/op
Benchmark_SliceGrow_VerboseCopy-4   	 5000000	       350 ns/op	     896 B/op	       1 allocs/op


@julieqiu julieqiu commented May 28, 2019


@josharian josharian commented May 28, 2019

cc @martisch

There are a lot of variants of this. E.g. splice, introduced in a go-diff commit, offered very substantial performance improvements.

I'm genuinely not sure how many of these we want to recognize in the compiler vs., say, writing a blog post illustrating how to write these efficiently, or waiting for generics and implementing them in the standard library.


@martisch martisch commented May 28, 2019

I see at least 3 patterns here:

  • append(slice[:index], append(elements, slice[index:]...)...)
  • append(slice[:index], append(elements, slice[index+len(elements):]...)...)
  • append(slice[:index], append(elements, slice[index+amount:]...)...)

I think the first 2 should be the easiest to implement and, subjectively, also the most used.
If we have them recognized, we can look at how difficult the third case is and how much more complexity it adds.

If others don't think the first 2 are a no-go, I can have a go at it :) (but if someone else is eager to work on it, don't hold back on my account)


@ear7h ear7h commented Sep 11, 2019

I'd like to try this, but I think it could be generalized, as the original comment mentioned.

For nested appends, one would need at most 1 call to growslice (but more calls to memmove). append(s1, append(s2, append(s3, append(...)...)...)...) could be transformed to the following by walkappend:

init {
	// continue walking the node tree and place the args
	// in temporaries s1, s2, s3, ...
	newcap := len(s1) + len(s2) + len(s3) + ...
	if newcap > cap(s1) {
	    s1 = growslice(s1, newcap)
	}

	idx := len(s1)
	s1 = s1[:newcap]

	copy(s1[idx:], s2)
	idx += len(s2)
	if cap(s2)-len(s2) >= len(s3) {
	    copy(s2[len(s2):cap(s2)], s3)
	}

	copy(s1[idx:], s3)
	idx += len(s3)
	if cap(s3)-len(s3) >= len(s4) {
	    copy(s3[len(s3):cap(s3)], s4)
	}

	...
}
s1

This only allocates for the resulting slice and mimics the behavior of actually calling append, which writes between len and cap iff there's enough room. Since the writes into the inner slices are conditional, the number of copies is also reduced (along with allocations). This could also include logic for handling single values and make efficiently, by checking types and ops when generating the additions and copies; in the make case, the copy is replaced with memclr.

I should mention that I think I understand most of what's going on in the slice generation except for this bit:

// General case, with no function calls left as arguments.
// Leave for gen, except that instrumentation requires old form.
if !instrumenting || compiling_runtime {
	return n
}


@martisch martisch commented Sep 12, 2019

Careful with newcap := len(s1) + len(s2) + len(s3) + ...: this can overflow and wrap around, making the resulting capacity too small. The additions need to be done one by one, with overflow checked each time. If a problem is encountered along the way, the appends that succeeded up to that point still need to be applied, so there is no difference in semantics. There can also be logic to prove the calculation cannot overflow, e.g. for an existing length (or lower) plus a constant-size insert: append(slice[:index], append(elements, slice[index:]...)...).

I would still suggest solving the simple case first to understand the complexities involved; then we can see whether the generalization is worth even more complexity while still being used in production code.
