Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: explore optimizations for fixed-length array copies #46529

Open
mdempsky opened this issue Jun 2, 2021 · 5 comments
Open

cmd/compile: explore optimizations for fixed-length array copies #46529

mdempsky opened this issue Jun 2, 2021 · 5 comments

Comments

@mdempsky
Copy link
Member

@mdempsky mdempsky commented Jun 2, 2021

#46505 (comment) reports these results for this benchmark:

BenchmarkNamed-8        201051339                6.263 ns/op
BenchmarkCopy-8         172203873                6.071 ns/op
BenchmarkPointer-8      1000000000               1.031 ns/op

We should see if cmd/compile can optimize the first two as efficiently as the last.

One observation though is they have different semantics: SumNamed and SumCopy need to support partial copies if len(b) < Size256, but SumPointer panics in this case.

We could probably efficiently handle copy(ret[:], b[:Size256]) though, by rewriting it into ret = *(*[Size256]byte)(b) in the frontend.

Marking for Go 1.18 to at least explore possibilities here.

Related #20859.

/cc @randall77 @bradfitz @katiehockman

@tmthrgd
Copy link
Contributor

@tmthrgd tmthrgd commented Jun 2, 2021

There’s a mistake in that benchmark, BenchmarkCopy measures SumNamed instead of SumCopy. That leaves it identical to BenchmarkNamed. I have no idea if that impacts the results though.

@go101
Copy link

@go101 go101 commented Aug 7, 2021

For large slices, the performance difference of SumNamed and SumPointer is very small if they are expanded (inline manually): https://play.golang.org/p/WovKO4U-xRk

SumCopy is much slower for large slices. It looks the stack frame size of SumCopy is larger than others. Maybe it copies data twice.

Other observations:

  • Named results sometimes cause positive performance impact, sometimes cause negative impact.
  • The effects of manual inline and auto inline might be different.
  • For small slices, manuallly expanded SumNamed is slower than manuallly expanded SumPointer.

@go101
Copy link

@go101 go101 commented Aug 23, 2021

It looks, the conversion way is more efficient for very small and large enough arrays:

package arrays

import "testing"

const N = 32
var r [128][N]byte
var s = make([]byte, N)

func init() {
	println("============= N =", N)
}

func Benchmark_CopyClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		copy(r[i&127][:], s)
	}
}

func Benchmark_ConvertClone(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r[i&127] = *(*[N]byte)(s)
	}
}

Benchmark results:

============= N = 32
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
Benchmark_CopyClone-4      	258112560	         4.515 ns/op
Benchmark_ConvertClone-4   	783765790	         1.507 ns/op

============= N = 64
Benchmark_CopyClone-4      	229087599	         5.080 ns/op
Benchmark_ConvertClone-4   	589875326	         2.004 ns/op

============= N = 96
Benchmark_CopyClone-4      	181417629	         6.356 ns/op
Benchmark_ConvertClone-4   	281229399	         4.241 ns/op

============= N = 128
Benchmark_CopyClone-4      	192070240	         6.065 ns/op
Benchmark_ConvertClone-4   	224375317	         5.349 ns/op

============= N = 256
Benchmark_CopyClone-4      	137411401	         8.610 ns/op
Benchmark_ConvertClone-4   	135560024	         8.805 ns/op

============= N = 2048
Benchmark_CopyClone-4      	12372487	        94.05 ns/op
Benchmark_ConvertClone-4   	11931901	        97.88 ns/op

============= N = 2560
Benchmark_CopyClone-4      	 9082016	       127.2 ns/op
Benchmark_ConvertClone-4   	10472389	       110.6 ns/op

============= N = 4096
Benchmark_CopyClone-4      	 5103782	       226.3 ns/op
Benchmark_ConvertClone-4   	 6493408	       180.8 ns/op

============= N = 8192
Benchmark_CopyClone-4      	 2470774	       433.1 ns/op
Benchmark_ConvertClone-4   	 3447882	       344.9 ns/op

============= N = 65536
Benchmark_CopyClone-4      	  151039	      7240 ns/op
Benchmark_ConvertClone-4   	  242791	      4149 ns/op

@willbeason
Copy link
Contributor

@willbeason willbeason commented Aug 24, 2021

I actually see an 18% performance decrease when copying 2048 bytes. This seems super platform- and case-dependent.

I don't see significant improvements for my machine other than for <=64 bytes.

On further testing (I increased the number of arrays from 128 to 256), the results are also sensitive to the number of distinct arrays being copied. The main factors in whether this provides a significant speedup are:

1, Your machine's block size (64 bytes for my machine)
2. The L1 cache size for each processor (256kb / 4 processors = 64kb for me)

Once the arrays being copied exceed my machine's block size, I don't see a significant improvement.

Eventually, once the total set of memory I'm copying from/to exceeds the L1 cache size, I start to see slowdowns (number of arrays * size of array > 64kb).

The slowdowns become significant once the individual arrays being copied exceed the size of the L1 cache.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Nov 10, 2021

Since there hasn't been any movement here since August and we're in the freeze, I'm going to put it into the backlog.

@mdempsky Feel free to move it out to 1.19 if you plan to do some more exploration of this for next release.

@mknyszek mknyszek removed this from the Go1.18 milestone Nov 10, 2021
@mknyszek mknyszek added this to the Backlog milestone Nov 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants