Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes: relatively poor performance in bytes.Compare #62648

Open
pconstantinou opened this issue Sep 14, 2023 · 5 comments
Open

bytes: relatively poor performance in bytes.Compare #62648

pconstantinou opened this issue Sep 14, 2023 · 5 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@pconstantinou
Copy link

pconstantinou commented Sep 14, 2023

What version of Go are you using (go version)?

1.21

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/phil/Library/Caches/go-build'
GOENV='/Users/phil/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/phil/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/phil/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.21.0'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/2w/sj4gll256rqfs6r4sgwcn3lh0000gn/T/go-build2454689691=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

For fixed length arrays (such as github.com/google/uuid.UUID) bytes.Compare perform poorly relative to a handwritten:

func uuidCmpByte(a, b uuid.UUID) int {
	for i := 0; i < 16; i++ {
		if d := int(a[i]) - int(b[i]); d != 0 {
			return d
		}
	}
	return 0
}

Bench results where there's a

BenchmarkBytesCompare-8   	     262	   4459605 ns/op	       0 B/op	       0 allocs/op
BenchmarkArrayCompare-8   	     890	   1437192 ns/op	       0 B/op	       0 allocs/op

What did you expect to see?

I expected the internal bytes.Compare to perform optimally.

What did you see instead?

Probably due to the conversion a slice and the inability to unroll the loop, bytes.Compare performed about 3x worse.

It would be ideal to have a function that would compare fixed-length arrays optimally.


const sliceSize = 1000

func uuidCmpByte(a, b uuid.UUID) int {
	for i := 0; i < 16; i++ {
		if d := int(a[i]) - int(b[i]); d != 0 {
			return d
		}
	}
	return 0
}

func BenchmarkBytesCompare(b *testing.B) {
	uuids := make([]uuid.UUID, 0, sliceSize)
	for i := 0; i < sliceSize; i++ {
		uuids = append(uuids, uuid.New())
	}
	for _, uid := range uuids[:10] {
		uuids = append(uuids, uid)
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		for _, a := range uuids {
			for _, b := range uuids {
				_ = bytes.Compare(a[:], b[:])
			}
		}

	}
}

func BenchmarkArrayCompare(b *testing.B) {
	uuids := make([]uuid.UUID, 0, sliceSize)
	for i := 0; i < sliceSize; i++ {
		uuids = append(uuids, uuid.New())
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		for _, a := range uuids {
			for _, b := range uuids {
				_ = uuidCmpByte(a, b)
			}
		}
	}
}

@ALTree
Copy link
Member

ALTree commented Sep 14, 2023

Your compare function is not strictly equivalent to bytes.Compare because the latter guarantees a return value of -1, 0, or +1, while yours just returns the diff. If I change it to follow the bytes.Compare interface:

func uuidCmpByte2(a, b uuid.UUID) int {
	for i := 0; i < 16; i++ {
		d := int(a[i]) - int(b[i])
		if d < 0 {
			return -1
		} else if d > 0 {
			return 1
		}
	}
	return 0
}

I get:

name             time/op
BytesCompare-4   5.03ms ± 1%
ArrayCompare-4   1.59ms ± 1%
ArrayCompare2-4  5.08ms ± 2%

@ALTree ALTree changed the title affected/package: bytes - relatively poor performance in bytes. bytes: relatively poor performance in bytes.Compare Sep 14, 2023
@randall77
Copy link
Contributor

This largely has to do with how different the UUIDs are.
When they differ on the first byte, then your hand-written comparison will be really fast because it exits the loop immediately.
When they differ later in the UUID, or are in fact equal, then the bytes version will be faster. I measure more than a factor of 2 faster. (e.g. change from uuid.New to uuid.UUID{}.)

Of course, it would be nice to have the best of both worlds. Maybe there is some early-exit things we can do to improve the obviously-different case.

@pconstantinou
Copy link
Author

I was also hoping to find a way to cast (without a performance hit) the [16]byte into [2]uint64 thinking that would improve things. I figure the internal implementation of bytes.Compare did that. I'd assumed that some of the performance gains came from being able to unroll the loop.

I agree that the ratio of hits to misses is influential. Thanks for taking a look.

@seankhliao seankhliao added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 15, 2023
@klauspost
Copy link
Contributor

I was also hoping to find a way to cast (without a performance hit) the [16]byte into [2]uint64

There is no penalty with binary.LittleEndian.Uint64(x[0:8]) on amd64.

See https://godbolt.org/z/hxbeYE5E9 - both parts are loaded with a single MOVQ and no bounds checks are needed.

@shoham-b
Copy link

shoham-b commented Jan 2, 2024

This is because of the hand written function is inlined.

goos: darwin
goarch: arm64
pkg: go-playground/pkg
BenchmarkBytesCompare
BenchmarkBytesCompare-8                  	10990532	       106.1 ns/op
BenchmarkArrayCompare
BenchmarkArrayCompare-8                  	13332518	        89.93 ns/op
BenchmarkArrayCompareNoInline
BenchmarkArrayCompareNoInline-8          	 8318890	       138.7 ns/op
BenchmarkBytesCompareAllSame
BenchmarkBytesCompareAllSame-8           	 9992674	       119.0 ns/op
BenchmarkArrayCompareAllSame
BenchmarkArrayCompareAllSame-8           	 3212157	       371.1 ns/op
BenchmarkArrayCompareAllSameNoInline
BenchmarkArrayCompareAllSameNoInline-8   	 2865070	       487.5 ns/op

We use "compare_native" to compare the bytes using architeture specific optimizations, but ASM commands cannot be inlined.

If the UUIDs are different, the overhead of the call is more significant than when they are the same at which the loop is more prominent.

I don't think there is a good way to inline ASM instructions as of today.

@seankhliao seankhliao added this to the Unplanned milestone Jul 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

6 participants