
cmd/compile: detect and optimize slice insertion idiom append(sa, append(sb, sc...)...) #31592

Open
go101 opened this issue Apr 21, 2019 · 10 comments
Labels
NeedsInvestigation Performance

Comments

@go101 go101 commented Apr 21, 2019

What version of Go are you using (go version)?

go version go1.12.2 linux/amd64

Does this issue reproduce with the latest release?

yes

What did you do?

package main

import (
	"testing"
)

type T = int
var sx = []T{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}
var sy = []T{111, 222, 333, 444}
var index = 6

func SliceInsertion_OneLine(base, inserted []T, i int) []T {
	return append(base[:i], append(inserted, base[i:]...)...)
}

func SliceInsertion_Verbose(base, inserted []T, i int) []T {
	if cap(base)-len(base) >= len(inserted) {
		s := base[:len(base)+len(inserted)]
		copy(s[i+len(inserted):], s[i:])
		copy(s[i:], inserted)
		return s
	} else {
		s := make([]T, 0, len(inserted)+len(base))
		s = append(s, base[:i]...)
		s = append(s, inserted...)
		s = append(s, base[i:]...)
		return s
	}
}

var s1 []T
func Benchmark_SliceInsertion_OneLine(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s1 = SliceInsertion_OneLine(sx, sy, index)
	}
}

var s2 []T
func Benchmark_SliceInsertion_Verbose(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s2 = SliceInsertion_Verbose(sx, sy, index)
	}
}

What did you expect to see?

Small performance difference.

What did you see instead?

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: app/t
Benchmark_SliceInsertion_OneLine-4   	 5000000	       311 ns/op	     368 B/op	       2 allocs/op
Benchmark_SliceInsertion_Verbose-4   	10000000	       146 ns/op	     160 B/op	       1 allocs/op
PASS
ok  	app/t	3.535s
@go101 go101 commented Apr 21, 2019

I would say this optimization is needed more than #21266 (made in Go 1.11), because this idiom is used far more widely. In fact, even before Go 1.11 there was already a slice-extending idiom that is more efficient than the one that optimization targets. Evidence:

package main

import (
	"testing"
)

type T = int
var sx = []T{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}

func SliceGrow_OneLine(base []T, newCapacity int) []T {
	return append(base, make([]T, newCapacity-cap(base))...)
}

func SliceGrow_VerboseCopy(base []T, newCapacity int) []T {
	m := make([]T, newCapacity)
	copy(m, base)
	return m
}

var sa []T
func Benchmark_SliceGrow_OneLine(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sa = SliceGrow_OneLine(sx, 100)
	}
}

var sc []T
func Benchmark_SliceGrow_VerboseCopy(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sc = SliceGrow_VerboseCopy(sx, 100)
	}
}

Benchmark result:

Benchmark_SliceGrow_OneLine-4       	 3000000	       426 ns/op	     896 B/op	       1 allocs/op
Benchmark_SliceGrow_VerboseCopy-4   	 5000000	       350 ns/op	     896 B/op	       1 allocs/op

@julieqiu julieqiu added the NeedsInvestigation label May 28, 2019
@julieqiu julieqiu commented May 28, 2019

/cc @randall77 @griesemer

@josharian josharian commented May 28, 2019

cc @martisch

There are a lot of variants of this. E.g. splice, introduced in a go-diff commit, offered very substantial performance improvements.

I'm genuinely not sure how many of these we want to recognize in the compiler vs., say, writing a blog post illustrating how to write these efficiently, or waiting for generics and implementing them in the standard library.

@martisch martisch commented May 28, 2019

I see at least 3 patterns here:

  • append(slice[:index], append(elements, slice[index:]...)...)
  • append(slice[:index], append(elements, slice[index+len(elements):]...)...)
  • append(slice[:index], append(elements, slice[index+amount:]...)...)

I think the first two should be the easiest to implement and, subjectively, also the most used.
Once those are recognized, we can look at how difficult the third case is and how much more complexity it adds.

If others don't think the first two are a no-go, I can have a go at it :) (but if someone else is eager to work on it, don't hold off on my account)

@ear7h ear7h commented Sep 11, 2019

I'd like to try this, but I think it could be generalized, as the original comment mentioned.

For nested appends, one would need at most one call to growslice (but more calls to memmove). append(s1, append(s2, append(s3, append(...)...)...)...) could be transformed by walkappend into the following:

init {
	// continue walking the node tree and place the args
	// in temporaries s1, s2, s3, ...
	newcap := len(s1) + len(s2) + len(s3) + ...
	if newcap > cap(s1) {
	    s1 = growslice(s1, newcap)
	}

	s1 = s1[:newcap]
	idx := 0

	copy(s1[idx:], s2)
	idx += len(s2)
	if cap(s2)-len(s2) >= len(s3) {
	    copy(s2[len(s2):cap(s2)], s3)
	}

	copy(s1[idx:], s3)
	idx += len(s3)
	if cap(s3)-len(s3) >= len(s4) {
	    copy(s3[len(s3):cap(s3)], s4)
	}

	...
}

This only allocates for the resulting slice and mimics the behavior of actually calling append, which writes between len and cap iff there's enough room. Since the writes are conditional, the number of copies is reduced (along with allocations). The same logic could also handle single values and make efficiently, by checking types and ops when generating the additions and copies; in the make case, the copy is replaced with a memclr.

I should mention that I think I understand most of what's going on in the slice generation except for this bit:

// General case, with no function calls left as arguments.
// Leave for gen, except that instrumentation requires old form.
if !instrumenting || compiling_runtime {
	return n
}

@martisch martisch commented Sep 12, 2019

Careful with newcap := len(s1) + len(s2) + len(s3) + ...: this can overflow and wrap around, making the resulting capacity too small. The additions need to be done one by one, with overflow checked each time. If a problem is encountered along the way, the appends that succeeded up to that point still need to be applied, so there is no difference in semantics. There can also be logic to prove the calculation cannot overflow, e.g. for an existing length (or less) plus a constant-size insert such as append(slice[:index], append(elements, slice[index:]...)...).

I would still suggest solving the simple case first, to better understand the complexities involved; then we can see whether the generalization is worth even more complexity while still being used in production code.

@quasilyte quasilyte commented Dec 6, 2021

To provide some statistics (source: gocorpus with all repositories checked):

  • append($b[:$i], append($_, $b[$i:]...)...) - 9 hits
  • append($b[:$i], append([]$_{$_}, $b[$i:]...)...) - 5 hits
  • $b = append($b, $_); copy($b[$i+1:], $b[$i:]); $b[$i] = $_ - 12 hits (1 match per ~866,636 SLOC)
1-2. output=append(output,0);copy(output[i+1:],output[i:]);output[i]=n (2 matches)
3-4. impDecl.Specs=append(impDecl.Specs,nil);copy(impDecl.Specs[insertAt+1:],impDecl.Specs[insertAt:]);impDecl.Specs[insertAt]=newImport (2 matches)
5. coords=append(coords,geom.Coord{});copy(coords[index+1:],coords[index:]);coords[index]=point.Coords()
6. l.data=append(l.data,nil);copy(l.data[i+1:],l.data[i:]);l.data[i]=s
7. t.inUseKeys=append(t.inUseKeys,writtenKeySpan{});copy(t.inUseKeys[start+1:],t.inUseKeys[start:]);t.inUseKeys[start]=writtenKeySpan{key:span,writer:w,txn:txn}
8. arr=append(arr,nil);copy(arr[idx+1:],arr[idx:]);arr[idx]=val
9. sb.endpoints=append(sb.endpoints,nil);copy(sb.endpoints[i+1:],sb.endpoints[i:]);sb.endpoints[i]=ep
10. s.Calls=append(s.Calls,Call{});copy(s.Calls[index+1:],s.Calls[index:]);s.Calls[index]=Call{Stack:gr.Stack}
11. archive.Files=append(archive.Files,txtar.File{});copy(archive.Files[gosumIdx+1:],archive.Files[gosumIdx:]);archive.Files[gosumIdx]=txtar.File{Name:"go.sum",Data:newGosumData}
12. es=append(es,muxEntry{});copy(es[i+1:],es[i:]);es[i]=e

For comparison, here are some results on the same corpus:

  • append($_, $_) - 37008 hits
  • append($_, $_, $_) - 723 hits

Which is understandable: two-argument append is far more common than any other form, but this tells us by how much (~51 times).

(The search patterns are written in gogrep syntax, if you're curious.)


Just out of curiosity, does slices.Insert perform the required operation with performance comparable to what we could achieve by teaching the compiler to make the suggested rewrite?

@dotaheor dotaheor commented Dec 7, 2021

Just out of curiosity, does slices.Insert perform the required operation with performance comparable to what we could achieve by teaching the compiler to make the suggested rewrite?

No, it doesn't. Generics don't help here: the make call still zeroes the elements in the generic code.

func Insert[S ~[]E, E any](s S, i int, v ...E) S {
	tot := len(s) + len(v)
	if tot <= cap(s) {
		s2 := s[:tot]
		copy(s2[i+len(v):], s[i:])
		copy(s2[i:], v)
		return s2
	}
	s2 := make(S, tot)
	copy(s2, s[:i])
	copy(s2[i:], v)
	copy(s2[i+len(v):], s[i:])
	return s2
}

That's why I proposed to add merge builtin function (#23905).

On the other hand, the generic code does provide a chance to simplify the potential compiler optimization.

@quasilyte quasilyte commented Dec 7, 2021

Excessive zeroing is a known issue; I think it's separate low-hanging fruit on its own.
I'm working on it by adding a memclrelim pass.
Right now it doesn't handle make/copy/etc., but there are plans to extend it (see the TODO notes; there are actually more ideas).
https://go-review.googlesource.com/c/go/+/367496

@dotaheor dotaheor commented Dec 7, 2021

Right now it doesn't handle make/copy/etc

Mostly true, except for the case handled by a Go 1.15 optimization:

var s = make([]T, n)
copy(s, x)
copy(s[len(x):], y)

In the above code, the elements within s[:len(x)] are not zeroed.

Due to real-world complexities, the optimization has several limitations:

  • the make call must take exactly two arguments,
  • x must be an identifier,
  • the elements within s[len(x):] are still zeroed even when the previous conditions are satisfied.

That means, currently, this variant of the generic code could be a little faster:

func Insert[S ~[]E, E any](s S, i int, v ...E) S {
	tot := len(s) + len(v)
	if tot <= cap(s) {
		s2 := s[:tot]
		copy(s2[i+len(v):], s[i:])
		copy(s2[i:], v)
		return s2
	}
	x := s[:i]
	s2 := make(S, tot)
	copy(s2, x)
	copy(s2[i:], v)
	copy(s2[i+len(v):], s[i:])
	return s2
}
