
proposal: a faster, better fnv32 implementation #19780

Closed
orcaman opened this issue Mar 30, 2017 · 16 comments

@orcaman

orcaman commented Mar 30, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.7.3 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/orcaman/github.com/orcaman/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lc/lr_hrpk55218jm8dn3s78ww00000gn/T/go-build334927550=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"

What did you do?

At the concurrent-map project we have written a faster, better implementation of the fnv32 function.

Please have a look at this issue to see the implementation and some benchmarks showing the performance improvement: orcaman/concurrent-map#48

This is the commit where we added our custom implementation:
orcaman/concurrent-map@e3b11eb

We consistently see, on several different kinds of machines, that our implementation is significantly faster. I would like to contribute this change to the Go source tree.

Cheers,
Or

@ALTree
Member

ALTree commented Mar 30, 2017

The core implementation in std looks essentially the same as yours:

func (s *sum32) Write(data []byte) (int, error) {
	hash := *s
	for _, c := range data {
		hash *= prime32
		hash ^= sum32(c)
	}
	*s = hash
	return len(data), nil
}

(sum32 is a named uint32 type, so sum32(c) is just a type conversion.)
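For context, the surrounding pieces of hash/fnv look roughly like this. This is a condensed paraphrase rather than a verbatim copy: sum32 is a named uint32 type, and the methods below are what make a *sum32 satisfy hash.Hash32.

```go
package main

import (
	"fmt"
	"hash"
)

// sum32 holds the running FNV-1 state; sum32(c) in Write is a type
// conversion, not a function call.
type sum32 uint32

const (
	offset32 = 2166136261 // FNV-1 32-bit offset basis
	prime32  = 16777619   // FNV-1 32-bit prime
)

// New32 returns the hasher behind the hash.Hash32 interface, which is
// the indirection the discussion above is about.
func New32() hash.Hash32 {
	var s sum32 = offset32
	return &s
}

func (s *sum32) Write(data []byte) (int, error) {
	hash := *s
	for _, c := range data {
		hash *= prime32
		hash ^= sum32(c)
	}
	*s = hash
	return len(data), nil
}

func (s *sum32) Sum32() uint32  { return uint32(*s) }
func (s *sum32) Reset()         { *s = offset32 }
func (s *sum32) Size() int      { return 4 }
func (s *sum32) BlockSize() int { return 1 }

// Sum appends the digest big-endian, as hash.Hash requires.
func (s *sum32) Sum(in []byte) []byte {
	v := uint32(*s)
	return append(in, byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
}

func main() {
	h := New32()
	h.Write([]byte("hello"))
	fmt.Println(h.Sum32())
}
```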

I believe your function is faster because it does not go through the Hash interface, so the runtime has less work to do when you use it.

EDIT: this is actually already noted in the issue you linked. The point is that I don't think we can change the hash package now, moving to pure functions and whatnot. That would break the Go 1 compatibility promise.

@orcaman
Author

orcaman commented Mar 30, 2017

Thanks for your response.

Indeed, it seems our implementation is faster because we are not using the Hash interface, so the runtime has less work to do.

The thing is: our benchmarks show a very big difference, as you can see in the issue. Our version was in the range of 90.8-95.6 ns/op vs 165-170 ns/op if we reuse the hasher (or 214 ns/op vs 249 ns/op if we don't!).

I think that at this level of difference it's worth giving people the option to use the faster version in native Go.

@gopherbot gopherbot added this to the Proposal milestone Mar 30, 2017
@ALTree
Member

ALTree commented Mar 30, 2017

I understand the motivation but I don't see a reasonable way to do this.

As I wrote, we can't change the current API. So we would have to add a second package with the same functionality, but implemented using plain functions, or just dump a bunch of new functions into the current package, duplicating everything that's there now.

Both solutions are IMHO undesirable.

@orcaman
Author

orcaman commented Mar 30, 2017

Well, I'm as religious about beautiful and idiomatic code as the next gopher, but if it were my application, I would rather suffer something like an extra package for such a significant performance improvement. But I see your point, and most applications are probably not performance-critical like the ones that use concurrent-map.

@bradfitz
Contributor

crc32 has separate functions for doing the operation directly and not via some interface: https://golang.org/pkg/hash/crc32/#Checksum

Same with sha1 and md5: https://golang.org/pkg/crypto/sha1/#Sum ... https://golang.org/pkg/crypto/md5/#Sum

I don't think it'd be weird to have a dedicated function in the fnv package for this.

@ALTree
Member

ALTree commented Mar 30, 2017

The sha1 and md5 packages are very small, though. They have New for the hasher and then Sum, and that's it. fnv would need either 4 new exported Sum functions (for 32, 32a, 64, and 64a) or a more complicated single function that takes another parameter (after the data) and switches on implementations. It'll certainly be more confusing to document than md5.

Still, that's a good point: since other packages do this, there's a precedent.
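To make the discussion concrete, here is a sketch of what two of those four functions might look like. The names Sum32 and Sum32a are hypothetical (hash/fnv exports no such functions); the constants are from the published FNV spec, and the variants differ only in operation order.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	offset32 uint32 = 2166136261 // FNV 32-bit offset basis
	prime32  uint32 = 16777619   // FNV 32-bit prime
)

// Sum32 is a hypothetical one-shot FNV-1 function of the kind discussed
// above: multiply first, then XOR.
func Sum32(data []byte) uint32 {
	hash := offset32
	for _, c := range data {
		hash *= prime32
		hash ^= uint32(c)
	}
	return hash
}

// Sum32a is the FNV-1a variant: XOR first, then multiply.
func Sum32a(data []byte) uint32 {
	hash := offset32
	for _, c := range data {
		hash ^= uint32(c)
		hash *= prime32
	}
	return hash
}

func main() {
	data := []byte("example")

	h := fnv.New32()
	h.Write(data)
	fmt.Println(Sum32(data) == h.Sum32()) // prints true: matches fnv.New32

	ha := fnv.New32a()
	ha.Write(data)
	fmt.Println(Sum32a(data) == ha.Sum32()) // prints true: matches fnv.New32a
}
```

Sum64 and Sum64a would be the same shape with the 64-bit constants, which is why four small functions stay easy to document.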

@bradfitz
Contributor

I'm not worried about 4 sum functions. It's a small and focused package. If somebody comes looking for fnv, they won't be overwhelmed by 4 fnv functions.

@orcaman
Author

orcaman commented Mar 30, 2017

Thank you both for the feedback. Would it be reasonable at this stage for me to implement the new fnv32 function as a separate function, in the spirit of the other hashes mentioned here, and submit the contribution for code review as per the guidelines?

@bronze1man
Contributor

bronze1man commented Mar 31, 2017

Is it possible for the compiler to optimize out the Hash interface when the user only uses one implementation?

@bradfitz
Contributor

Actually, https://go-review.googlesource.com/c/37751/ landed this cycle

@orcaman, could you profile Go 1.8 vs Go tip for the standard fnv package?

@ALTree
Member

ALTree commented Mar 31, 2017

@orcaman Also, it would be great if you could benchmark each function multiple times using -count, as in go test -bench=. -count 10 > old.txt, and then generate a report with benchstat (https://godoc.org/golang.org/x/perf/cmd/benchstat).

@orcaman
Author

orcaman commented Apr 1, 2017

@bradfitz @ALTree hey guys, I just ran a benchmark of Go 1.8 vs Go tip (BTW, for Go tip I just built Go from the latest master; is this OK?).

Here are the results:

for go 1.8:

$ go test -bench=. -count 10 > go1_8.txt
$ cat go1_8.txt 
BenchmarkFNV32Custom-8        	20000000	        72.1 ns/op
BenchmarkFNV32Custom-8        	20000000	        71.8 ns/op
BenchmarkFNV32Custom-8        	20000000	        75.2 ns/op
BenchmarkFNV32Custom-8        	20000000	        73.2 ns/op
BenchmarkFNV32Custom-8        	20000000	        69.6 ns/op
BenchmarkFNV32Custom-8        	20000000	        75.6 ns/op
BenchmarkFNV32Custom-8        	20000000	        73.2 ns/op
BenchmarkFNV32Custom-8        	20000000	        70.2 ns/op
BenchmarkFNV32Custom-8        	20000000	        72.2 ns/op
BenchmarkFNV32Custom-8        	20000000	        75.2 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       179 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       145 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       146 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       145 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       146 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       145 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       147 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       153 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       150 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       149 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       137 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       136 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       135 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       133 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       135 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       137 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       138 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       132 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       133 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       135 ns/op
PASS
ok  	github.com/orcaman/fnv-bench	73.664s

for go built from the latest master branch:

$ /usr/local/go_tip/go/bin/go test -bench=. -count 10 > go_tip.txt
$ cat go_tip.txt
goos: darwin
goarch: amd64
pkg: github.com/orcaman/fnv-bench
BenchmarkFNV32Custom-8        	20000000	        72.0 ns/op
BenchmarkFNV32Custom-8        	20000000	        70.2 ns/op
BenchmarkFNV32Custom-8        	20000000	        73.8 ns/op
BenchmarkFNV32Custom-8        	20000000	        72.1 ns/op
BenchmarkFNV32Custom-8        	20000000	        76.0 ns/op
BenchmarkFNV32Custom-8        	20000000	        77.7 ns/op
BenchmarkFNV32Custom-8        	20000000	        76.5 ns/op
BenchmarkFNV32Custom-8        	20000000	        74.3 ns/op
BenchmarkFNV32Custom-8        	20000000	        76.6 ns/op
BenchmarkFNV32Custom-8        	20000000	        75.3 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       185 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       154 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       151 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       154 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       153 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       150 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       152 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       149 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       152 ns/op
BenchmarkFNV32NativeAlloc-8   	10000000	       152 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       140 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       138 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       137 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       136 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       139 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       139 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       141 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       139 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       138 ns/op
BenchmarkFNV32NativeReuse-8   	10000000	       138 ns/op
PASS
ok  	github.com/orcaman/fnv-bench	80.214s

For reference, here is my benchmarking code:

package fnvbench

import (
	"hash/fnv"
	"math/rand"
	"testing"
)

const (
	maxElems  = 10000000
	maxStrLen = 128
	letters   = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
)

// generate random string of given length
func randString(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Int63()%int64(len(letters))]
	}
	return string(b)
}

// generate slice with maxElems strings of variable length: 0 - maxStrLen
func prepareData() *[]string {
	var strLen int
	stringArray := make([]string, maxElems)
	for i := 0; i < maxElems; i++ {
		strLen = rand.Intn(maxStrLen)
		stringArray[i] = randString(strLen)
	}
	return &stringArray
}

// custom FNV32 from concurrent-map package
func fnv32(key string) uint32 {
	hash := uint32(2166136261)
	const prime32 = uint32(16777619)
	for i := 0; i < len(key); i++ {
		hash *= prime32
		hash ^= uint32(key[i])
	}
	return hash
}

var dataSet = prepareData()

func BenchmarkFNV32Custom(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fnv32((*dataSet)[i%maxElems])
	}
}

func BenchmarkFNV32NativeAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		hasher := fnv.New32()
		hasher.Write([]byte((*dataSet)[i%maxElems]))
		hasher.Sum32()
	}
}

func BenchmarkFNV32NativeReuse(b *testing.B) {
	hasher := fnv.New32()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		hasher.Reset()
		hasher.Write([]byte((*dataSet)[i%maxElems]))
		hasher.Sum32()
	}
}

@josharian
Contributor

And here are those same results presented using benchstat. Please use it in the future.

name                old time/op  new time/op  delta
FNV32Custom-8       72.8ns ± 4%  74.5ns ± 6%    ~     (p=0.118 n=10+10)
FNV32NativeAlloc-8   147ns ± 4%   152ns ± 2%  +3.09%  (p=0.003 n=9+9)
FNV32NativeReuse-8   135ns ± 2%   138ns ± 1%  +2.52%  (p=0.001 n=10+8)

@bradfitz
Contributor

bradfitz commented Apr 1, 2017

Are you sure you're not comparing apples and oranges? Those string to byte slice conversions look like allocations in your inner loop to me.

@orcaman
Author

orcaman commented Apr 1, 2017

@bradfitz seems you are right!

I changed the benchmarking code to prepare the data as slices of bytes instead of strings (see here) and the results are now quite comparable.

This is after running the new benchmark 10 times again:

$ benchstat go1_8.txt go_tip.txt 
name                old time/op  new time/op  delta
FNV32Custom-8       89.5ns ±10%  83.0ns ± 5%  -7.25%  (p=0.010 n=10+10)
FNV32NativeAlloc-8   105ns ± 8%   110ns ± 4%    ~     (p=0.108 n=10+10)
FNV32NativeReuse-8  91.0ns ±10%  89.0ns ± 6%    ~     (p=0.382 n=10+10)
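The revised benchmark itself sits behind the "see here" link and is not quoted in the thread. The sketch below shows the likely shape of the fix, under the assumption that the data set is pre-converted to [][]byte so the string-to-[]byte allocation leaves the timed loop; the dataset size is reduced to keep the sketch quick, and the function and benchmark names are illustrative. It uses testing.Benchmark so it can run outside a _test.go file.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"testing"
)

const (
	maxElems  = 1 << 16 // reduced from 10000000 to keep the sketch fast
	maxStrLen = 128
)

// prepareBytes builds the data set as [][]byte up front, so the timed
// loops hash pre-converted byte slices instead of converting a string
// (and allocating) on every iteration.
func prepareBytes() [][]byte {
	data := make([][]byte, maxElems)
	for i := range data {
		b := make([]byte, rand.Intn(maxStrLen))
		for j := range b {
			b[j] = byte('a' + rand.Intn(26))
		}
		data[i] = b
	}
	return data
}

var byteSet = prepareBytes()

// fnv32Bytes is the custom FNV-1 loop, now taking []byte directly.
func fnv32Bytes(key []byte) uint32 {
	hash := uint32(2166136261)
	const prime32 = 16777619
	for _, c := range key {
		hash *= prime32
		hash ^= uint32(c)
	}
	return hash
}

func BenchmarkFNV32CustomBytes(b *testing.B) {
	for i := 0; i < b.N; i++ {
		fnv32Bytes(byteSet[i%maxElems])
	}
}

func BenchmarkFNV32NativeReuseBytes(b *testing.B) {
	hasher := fnv.New32()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		hasher.Reset()
		hasher.Write(byteSet[i%maxElems])
		hasher.Sum32()
	}
}

func main() {
	fmt.Println("custom:", testing.Benchmark(BenchmarkFNV32CustomBytes))
	fmt.Println("native:", testing.Benchmark(BenchmarkFNV32NativeReuseBytes))
}
```

With the conversion out of the loop, both sides measure only the hashing itself, which matches the comparable numbers in the benchstat table above.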

@bradfitz
Contributor

bradfitz commented Apr 1, 2017

Looks like there's nothing to do here, then.

You should run more iterations of the benchmark (-count=...) so the p value above is lower and says something. Right now they look comparable enough, though.

I'll close this but holler if you see room for improvement.

@bradfitz bradfitz closed this as completed Apr 1, 2017
@golang golang locked and limited conversation to collaborators Apr 1, 2018