runtime: memmove sometimes faster than memclrNoHeapPointers #23306

Open
alandonovan opened this issue Jan 2, 2018 · 6 comments
Labels: compiler/runtime, NeedsInvestigation, Performance

@alandonovan
Contributor

alandonovan commented Jan 2, 2018

Memory allocation using make([]int, K) is surprisingly slow compared to append(nil, ...), even though append does strictly more work, such as copying.

$ cat a_test.go
package main

import "testing"

const K = 1e6
var escape []int

func BenchmarkMake(b *testing.B) {
	for i := 0; i < b.N; i++ {
		escape = make([]int, K)
	}
}

var empty [K]int

func BenchmarkAppend(b *testing.B) {
	for i := 0; i < b.N; i++ {
		escape = append([]int(nil), empty[:]...)
	}
}

$ go version
go version devel +6317adeed7 Tue Jan 2 13:39:20 2018 +0000 linux/amd64

$ go test -bench=. a_test.go
BenchmarkAppend-12    	    1000	   1208800 ns/op
BenchmarkMake-12      	    1000	   1473106 ns/op

While reporting this issue, I initially used an older runtime from December 18 in which the effect was much stronger: 10x-20x slowdown. But that seems to have been fixed.

Curiously, this issue is the exact opposite of the problem reported in #14718 (now closed).

@bcmills
Member

bcmills commented Jan 2, 2018

> append does strictly more work, such as copying.

append has to copy, but make has to zero, and either of those operations may be hardware-accelerated. It's not obvious that either is strictly more work than the other.

Are you sure that the escape analysis is working as you expect? Since the escape variable is package-local the compiler could reasonably see through it (and hoist the allocations out of either or both of those loops).
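One way to check whether the allocation really happens on every iteration is to count allocations directly. The following is a minimal sketch, not something posted in this thread: it reuses `K`, `escape`, and `empty` from the benchmark above, while the helper names `allocsMake` and `allocsAppend` are made up for illustration. If the compiler had hoisted the allocation out of the loop, the reported count per run would be zero.

```go
package main

import (
	"fmt"
	"testing"
)

const K = 1_000_000

var (
	escape []int  // package-level sink, as in the original benchmark
	empty  [K]int // zeroed source array for the append variant
)

// allocsMake reports the average number of heap allocations per call
// when the slice is created with make. (Hypothetical helper name.)
func allocsMake() float64 {
	return testing.AllocsPerRun(10, func() {
		escape = make([]int, K)
	})
}

// allocsAppend reports the same figure for the append-to-nil variant.
// (Hypothetical helper name.)
func allocsAppend() float64 {
	return testing.AllocsPerRun(10, func() {
		escape = append([]int(nil), empty[:]...)
	})
}

func main() {
	fmt.Println("make allocs/op:  ", allocsMake())
	fmt.Println("append allocs/op:", allocsAppend())
}
```

A non-zero count for both variants would suggest that neither allocation is being optimized away, so the timing difference would have to come from the zeroing or copying itself.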

@mdempsky
Member

mdempsky commented Jan 2, 2018

Here's a benchmark of the underlying memory copying/clearing primitives (you'll need to put this in its own package directory, along with an empty .s file to work around #23311):

package main

import (
    "testing"
    "unsafe"
)

//go:linkname memclrNoHeapPointers runtime.memclrNoHeapPointers
func memclrNoHeapPointers(ptr unsafe.Pointer, n uintptr)

//go:linkname memmove runtime.memmove
func memmove(to, from unsafe.Pointer, n uintptr)

const K = 6e5

var a1, a2 [K]int

func BenchmarkMemclr(b *testing.B) {
    for i := 0; i < b.N; i++ {
        memclrNoHeapPointers(unsafe.Pointer(&a1), unsafe.Sizeof(a1))
    }
}

func BenchmarkMemmove(b *testing.B) {
    for i := 0; i < b.N; i++ {
        memmove(unsafe.Pointer(&a1), unsafe.Pointer(&a2), unsafe.Sizeof(a1))
    }
}

On my laptop, the relative performance seems very sensitive to the exact value of K. For example, at K=6e5, I get:

BenchmarkMemclr-4           5000            322261 ns/op
BenchmarkMemmove-4          5000            305383 ns/op

But at K=1e7, I get:

BenchmarkMemclr-4            300           4485500 ns/op
BenchmarkMemmove-4           300           5060492 ns/op
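For anyone who wants to try the size sensitivity without `go:linkname` and the extra .s file, here is a rough sketch that relies only on public language features. It rests on two assumptions: that the gc compiler lowers the zeroing range loop over a pointer-free slice to a runtime memclr call, and that the built-in `copy` ends up in memmove. The helper names (`clearInts`, `copyInts`, `perOp`) and the iteration count are illustrative, and the wall-clock timing is much cruder than `go test -bench`.

```go
package main

import (
	"fmt"
	"time"
)

// clearInts zeroes s. For pointer-free element types the gc compiler is
// expected (assumption) to turn this loop into a runtime memclr call.
func clearInts(s []int) {
	for i := range s {
		s[i] = 0
	}
}

// copyInts copies src into dst with the built-in copy and returns the
// number of elements copied.
func copyInts(dst, src []int) int {
	return copy(dst, src)
}

// perOp runs f iters times and returns the mean wall-clock time per call.
func perOp(iters int, f func()) time.Duration {
	start := time.Now()
	for i := 0; i < iters; i++ {
		f()
	}
	return time.Since(start) / time.Duration(iters)
}

func main() {
	// Element counts: 6e5 and 1e7 are the two values of K quoted above.
	for _, k := range []int{600_000, 10_000_000} {
		a1 := make([]int, k)
		a2 := make([]int, k)
		clr := perOp(20, func() { clearInts(a1) })
		mov := perOp(20, func() { copyInts(a1, a2) })
		fmt.Printf("K=%d clear=%v copy=%v\n", k, clr, mov)
	}
}
```

Adding more sizes to the slice literal in `main` gives a quick picture of where the two operations cross over on a particular machine.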

@mdempsky changed the title from "runtime: allocation using make is 40% slower than append(nil, ...)" to "runtime: memmove sometimes faster than memclrNoHeapPointers" on Jan 2, 2018
@josharian
Contributor

Probably unrelated, but this reminds me of 4k aliasing: https://lemire.me/blog/2018/01/04/dont-make-it-appear-like-you-are-reading-your-own-recent-writes/

@TocarIP
Contributor

TocarIP commented Feb 28, 2018

For the original benchmark, memmove and memclr use different strategies: memmove switches to non-temporal movs, while memclr uses regular movs. Changing the non-temporal mov threshold in memmove to match memclr makes append slower (1.36ms to 1.89ms), while make is unchanged:

Make-6    1.58ms ± 1%  1.58ms ± 1%     ~     (p=0.912 n=10+10)
Append-6  1.36ms ± 1%  1.89ms ± 1%  +39.07%  (p=0.000 n=10+10)

For the memmove benchmarks from the runtime package, switching to regular movs lowers throughput at the larger sizes:

Memmove/65536-6                 14.9GB/s ± 0%  14.9GB/s ± 0%   +0.16%  (p=0.028 n=9+10)
Memmove/1048576-6               8.67GB/s ± 1%  8.26GB/s ± 2%   -4.80%  (p=0.000 n=10+10)
Memmove/4194304-6               8.51GB/s ± 2%  8.20GB/s ± 3%   -3.74%  (p=0.000 n=10+10)
Memmove/8388608-6               8.55GB/s ± 2%  6.31GB/s ± 4%  -26.28%  (p=0.000 n=10+10)
Memmove/16777216-6              7.92GB/s ± 1%  4.33GB/s ± 2%  -45.30%  (p=0.000 n=10+10)
Memmove/67108864-6              6.56GB/s ± 2%  6.59GB/s ± 1%     ~     (p=0.315 n=10+9)

MemmoveUnalignedDst/65536-6     14.5GB/s ± 1%  14.5GB/s ± 0%     ~     (p=1.000 n=10+7)
MemmoveUnalignedDst/1048576-6   8.70GB/s ± 2%  8.14GB/s ± 1%   -6.48%  (p=0.000 n=10+9)
MemmoveUnalignedDst/4194304-6   8.64GB/s ± 2%  8.13GB/s ± 2%   -5.92%  (p=0.000 n=10+10)
MemmoveUnalignedDst/8388608-6   8.55GB/s ± 3%  6.24GB/s ± 3%  -27.00%  (p=0.000 n=10+10)
MemmoveUnalignedDst/16777216-6  7.93GB/s ± 3%  4.36GB/s ± 1%  -45.08%  (p=0.000 n=10+9)
MemmoveUnalignedDst/67108864-6  6.66GB/s ± 1%  6.76GB/s ± 2%   +1.49%  (p=0.000 n=9+10)

MemmoveUnalignedSrc/65536-6     14.5GB/s ± 1%  14.5GB/s ± 1%     ~     (p=0.796 n=10+10)
MemmoveUnalignedSrc/1048576-6   8.57GB/s ± 1%  8.20GB/s ± 2%   -4.29%  (p=0.000 n=9+10)
MemmoveUnalignedSrc/4194304-6   8.54GB/s ± 2%  8.19GB/s ± 2%   -4.18%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/8388608-6   8.53GB/s ± 2%  6.25GB/s ± 4%  -26.66%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/16777216-6  8.02GB/s ± 2%  4.36GB/s ± 2%  -45.67%  (p=0.000 n=10+10)
MemmoveUnalignedSrc/67108864-6  6.73GB/s ± 2%  6.82GB/s ± 2%   +1.32%  (p=0.035 n=10+10)
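As a reading aid for the tables above: the last numeric column is the relative change from the first value to the second. For the throughput rows a negative delta means slower, whereas for the Make/Append times a positive delta means slower. A small sketch of that arithmetic, using the rounded values from the Memmove/16777216 row (the function name `deltaPercent` is made up here, and because the inputs are rounded the result can differ slightly from the printed delta):

```go
package main

import "fmt"

// deltaPercent returns the relative change from before to after, in percent.
func deltaPercent(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Rounded GB/s values copied from the Memmove/16777216 row above.
	fmt.Printf("%+.2f%%\n", deltaPercent(7.92, 4.33))
}
```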

@ianlancetaylor added the NeedsInvestigation and Performance labels on Mar 28, 2018
@ianlancetaylor added this to the Go1.11 milestone on Mar 28, 2018
@bradfitz modified the milestones: Go1.11, Unplanned on May 18, 2018
@go101

go101 commented Sep 6, 2020

It looks like this problem has been solved in Go toolchain 1.15.

@go101

go101 commented Sep 6, 2020

Sorry, I meant that make+copy is specially optimized in Go 1.15, so it is more efficient than a single make (and also more efficient than append in every case). A single make call is still not optimized.
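For reference, the two cloning patterns being compared look roughly like this. This is only an illustrative sketch with made-up function names; as far as I understand the optimization, it applies when the `copy` immediately follows a `make` whose length is `len(src)`, which lets the runtime skip zeroing memory that is about to be overwritten.

```go
package main

import "fmt"

// cloneMakeCopy uses the make+copy pattern: make sized by len(src),
// immediately followed by copy from the same source slice.
func cloneMakeCopy(src []int) []int {
	dst := make([]int, len(src))
	copy(dst, src)
	return dst
}

// cloneAppend uses the append-to-nil pattern from the original report.
func cloneAppend(src []int) []int {
	return append([]int(nil), src...)
}

func main() {
	src := []int{1, 2, 3}
	fmt.Println(cloneMakeCopy(src), cloneAppend(src))
}
```

Both functions return an independent copy of the input; only the way the backing array is allocated and filled differs.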

@gopherbot added the compiler/runtime label on Jul 7, 2022