Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash: add Hasher, Seed for hashing user data #28322

Open
alandonovan opened this issue Oct 22, 2018 · 38 comments

Comments

Projects
None yet
@alandonovan
Copy link
Contributor

commented Oct 22, 2018

[See https://github.com//issues/28322#issuecomment-498831453 below for accepted proposal.]

Some applications need an efficient non-cryptographic hash function for strings. The Go runtime contains such a function, hashString, that is optimized to use AESENC instructions on amd64, but it is not exported. The implementation is non-trivial, so clients are loathe to copy it and will tend to use the //linkname hack instead.

In the same way that math/bits exports Go functions that use the best available implementation of a given bit-twiddling function, it would be useful for a package such as hash to export a decent string hash function. The implementation should include a per-process random salt to prevent users from assuming that hashes are stable across process restarts, which has been a problem for hash libraries in other languages.

package hash

// String computes a non-cryptographic hash of the specified string.
// The hash function is not specified, but callers may assume that,
// within a single process, x==y implies hash.String(x)==hash.String(y).
func String(string) uinptr

// ditto, bytes
func Bytes([]byte) uintptr

If we can agree where this belongs, I'll send the CL.

@dsnet

This comment has been minimized.

Copy link
Member

commented Oct 22, 2018

Related to #21195.

@dgryski

This comment has been minimized.

Copy link
Contributor

commented Oct 23, 2018

It seems to me this could easily be an external package. There are already many good string hashing functions available with slightly different tradeoffs: strength, speed for short strings, speed for long strings, etc. (I have a few benchmarks here: https://github.com/dgryski/trifles/blob/master/hashbench/bench_test.go#L10 ) A "one-size-fits-all" solution is nice, but I think this ends up being a sufficiently rare it shouldn't be in the standard library. That said, packaging a copy of the runtime's string hash function into /x/ somewhere might be an idea.

@agnivade agnivade changed the title hash: add string hash function proposal: hash: add string hash function Oct 23, 2018

@gopherbot gopherbot added this to the Proposal milestone Oct 23, 2018

@gopherbot gopherbot added the Proposal label Oct 23, 2018

@rsc

This comment has been minimized.

Copy link
Contributor

commented Oct 24, 2018

This is not the right API. All hash functions must take an initial seed so that they can be chained and produce a different hash function for different seeds. The current proposed API, with no input seed, can only be used to build bad hash tables. (To avoid engineered collisions aka hash-flooding it is important that when you grow a hash table you also completely change the hash function, which you can do by having a different initial seed for the hash inputs.) On top of that we'd also want to vary the hash calculation, even for a given input seed value, for each process, to avoid people writing code dependent on the exact hash implementation.

More API design is probably needed here.

It might work to copy this to an external package and simply have two of the code. On the other hand this functionality is very much exporting what's in the runtime (unexported), so that would weigh toward being in the standard library. The right answer may depend on how much API we are talking about.

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented Oct 24, 2018

Well, adding a seed parameter is trivial, and corresponds to the underlying implementation:

func String(s string, seed uintptr) uinptr
@andybons

This comment has been minimized.

Copy link
Member

commented Oct 31, 2018

Since uintptr is architecture-dependent, this seems like it could lead to an awkward API. Would it better to pass a uint64 and on 32-bit architectures we take the perf hit?

Per discussion with @golang/proposal-review

@rsc

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2018

Does anyone have time to work out a concrete API here? Or should we put this on hold?

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented Dec 12, 2018

Does the API need to be more complex that this?

package hash

// String computes a non-cryptographic hash of s.
// The hash function is not specified but callers may assume that,
// within a single process, x==y implies hash.String(x, seed)==hash.String(y, seed)
// for any seed value.
func String(s string, seed uintptr) uinptr

// ditto, bytes
func Bytes([]byte, seed uintptr) uintptr
@dsnet

This comment has been minimized.

Copy link
Member

commented Dec 12, 2018

Is the compiler intelligent enough today to avoid allocating when doing:

var b []byte = ...
h := hash.String(string(b), seed)

such that we can avoid the bytes version?

@dgryski

This comment has been minimized.

Copy link
Contributor

commented Dec 12, 2018

Also probably want to make it clear that different processes even with the same seed may produce different hashes.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Dec 19, 2018

Also should probably be uint64; uint32 is likely not good enough.

It's unclear to me if the API should be limited to a single function like this or if we want to define a Hasher object that you can feed strings and other basic types into.

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented Dec 19, 2018

A stateful hasher object that satisfies hash.Hash64 is a good idea: it makes the state parameter invisible in the common case, it presents an API consistent with the other hash/... packages, and it allows you to write non-strings to it using Fprintf.

@gopherbot

This comment has been minimized.

Copy link

commented Dec 19, 2018

Change https://golang.org/cl/155118 mentions this issue: hash: Runtime, an API for the hash function used by the Go runtime

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented Jan 3, 2019

Somewhat related: during some recent work I had a need for a hash function for arbitrary reflect.Values that is consistent with the Go equivalence relation for those values, and I don't think it's possible to achieve in Go today. It's possible to test reflect.Values x and y for equivalence using x.Interface()==y.Interface(), but there's no way to obtain the values' hashes.

I wonder if the hash API proposed in this issue should instead allow arbitrary Go values to be hashed in a manner consistent with equals? As I said, it's only somewhat related; this issue is primarily about performance, not consistency, but perhaps there's an opportunity to kill two stones with one bird.

@josharian

This comment has been minimized.

Copy link
Contributor

commented Jan 3, 2019

Hashing arbitrary values is currently implemented with code generation in the compiler when it detects that a type requires a hash function. That obviously won’t work for an arbitrary value inside an interface or reflect.Value. We’d have to simulate instead. That suggests different API to me—something like reflect.Hash that accepts a reflect.Value (and a seed?).

@rsc

This comment has been minimized.

Copy link
Contributor

commented Feb 6, 2019

Ian, Robert, Andy, and I chatted through this and sketched an API, below.
The important concerns are:

  • Hasher has to be seeded with a different random seed by default.
    (Don't want to depend on these never changing.)
  • Need to be able to set up multiple hashers with same seed (for map lookup, for example).
  • Do not want to be able to force a particular hash value for a given input, even given a particular seed, because that locks us into one algorithm for all architectures for all time.
    (Implies seed must be opaque, not serializable.)

That would lead to an API something like:

type Hasher struct { ... opaque ..}
func (h *Hasher) Seed() Seed
func (h *Hasher) SetSeed(seed Seed) // random seed used if this is not called
func (h *Hasher) Add(x interface{}) // general method
func (h *Hasher) AddString(x string) // special cases just to avoid interface{} overhead
func (h *Hasher) AddBytes(x []byte)
func (h *Hasher) AddInt64(x int64)
func (h *Hasher) Hash() uint64
func (h *Hasher) Reset()
type Seed struct { ... opaque ... }
func NewSeed() Seed

Robert says he could use this for some of the generics experiments. Alan, does this make sense for you?

/cc @alandonovan

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented Feb 6, 2019

Thanks, that all makes sense. I assume Add would permit any value that is a valid map key, including pointers, and panic otherwise?

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented Feb 6, 2019

Hasher.SetSeed would need to disallow the zero Seed value to ensure that users can't choose a specific seed.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Feb 6, 2019

Is the call pattern part of the hash? That is, is Add("foo");Add("bar") equivalent to Add("foobar")?
Are the types part of the hash? That is, is Add(int64(5)) equivalent to Add(uint64(5))?

Presumably Add("foo") and AddString("foo") are equivalent.

What does Reset do to the seed? I think it should restore to the last argument to SetSeed, or the same random seed as before if SetSeed was never called. (This means we'd have to save the seed somewhere in addition to the internal state, so that it would be available for a Reset. We'd probably have to anyway, for Seed.)

Does SetSeed do an implicit Reset? It probably should.

@CAFxX

This comment has been minimized.

Copy link
Contributor

commented Feb 6, 2019

Is the call pattern part of the hash? That is, is Add("foo");Add("bar") equivalent to Add("foobar")?

I see good reasons for both ways:

  • When working with streams/sequences, it would be useful if Add("foo");Add("bar") was the same as Add("f");Add("oobar") or Add("foobar").
  • But if I'm trying to hash a compound key, it would be ideal if Add("foo");Add("bar") was not the same as Add("f");Add("oobar") or Add("foobar").

Both behaviors seem equally desirable. (update: see #28322 (comment) as to why I now believe that the former is more desirable than the latter)

@dgryski

This comment has been minimized.

Copy link
Contributor

commented Feb 7, 2019

For the compound key instance, the traditional solution is to include a separator in the hash calculation: Add("foo"); Add(0); Add("bar")

@rsc

This comment has been minimized.

Copy link
Contributor

commented Feb 27, 2019

Is the call pattern part of the hash?

Yes?

That is, is Add("foo");Add("bar") equivalent to Add("foobar")?

No?

I was thinking the call sequence matters, so you can just shove all the fields from a struct into the Hasher one at a time without worrying about {Name: "abc", Value: "def"} and {Name: "abcde", Value: "f"} having the same hash.

Are the types part of the hash? That is, is Add(int64(5)) equivalent to Add(uint64(5))?

Hadn't thought through that. Not sure. What do you think?

Presumably Add("foo") and AddString("foo") are equivalent.

Yes.

What does Reset do to the seed? I think it should restore to the last argument to SetSeed, or the same random seed as before if SetSeed was never called. (This means we'd have to save the seed somewhere in addition to the internal state, so that it would be available for a Reset. We'd probably have to anyway, for Seed.)

Yes.

Does SetSeed do an implicit Reset? It probably should.

Yes.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Feb 27, 2019

I think we should define that

var s string
AddString(s)
AddBytes(([]byte)(s))
Add(s)
Add(([]byte)(s))

behave identically. Similarly,

var i int64
AddInt64(i)
Add(i)

behave identically.

Given that a string or []byte argument to Add (with identical contents) behaves identically, I think it makes sense to ignore the type altogether in Add. Otherwise, the type matters sometimes and doesn't matter other times.

The more I think about the generic Add, the less I like it. Do we really want to specify how structs are hashed? Floats? The spec is much simpler if all you can hash are string, []byte, and int64. Add just isn't very useful if there are just 3 allowed types.

I like the ease of use provided by including the call sequence in the hash. All of AddString("foobar"), AddString("foo");AddString("bar"), and AddString("f");AddString("oobar") should have different hashes. The downsides are that users may have to pre-concatenate data before passing it to the hasher, and the hasher might need to do a bit more work (practically, just a bit of mixing between each call). Both don't seem to be showstoppers.

@andybons

This comment has been minimized.

Copy link
Member

commented May 14, 2019

@alandonovan thoughts on @randall77's comments?

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented May 14, 2019

Re: call pattern. It's straightforward to implement a hasher that consumes a stream of values if you have a hasher that consumes a sequence of bytes, but the reverse is not the case, so I think the call pattern should not be significant to the hash, even if that may be less convenient.

Re: generic add. We would not be specifying how structs are hashed, only that structs that compare equal hash the same. This leaves room for any specific implementation, of which the runtime alg is the obvious choice. We would of course be revealing the specific values, and that means some people will write programs that behave differently when the hash function changes.

Re: seed and reset. In a hash table, each table would need to create and save a new seed (ideally without an additional call to malloc), then each lookup operation would create its own local hasher (again ideally avoiding malloc), initialize it from the table's seed and hash the key. There's no call for Reset in this scenario. What's the value of Reset if NewHasher doesn't heap-allocate?

@rsc

This comment has been minimized.

Copy link
Contributor

commented May 28, 2019

@alandonovan I think the value of Reset is that you avoid having to manage the seed yourself. For a typical one-goroutine-at-a-time data structure, one hasher is fine, probably declared in your top-level data structure, and you just call Reset between uses.

I'm inclined to agree with Keith about leaving Add(interface{}) out at the start and have only AddBytes, AddString, AddFloat64, AddInt64, AddUint64. We can add the general interface{} case later once we understand why it is necessary.

@CAFxX

This comment has been minimized.

Copy link
Contributor

commented May 29, 2019

All of AddString("foobar"), AddString("foo");AddString("bar"), and AddString("f");AddString("oobar") should have different hashes. The downsides are that users may have to pre-concatenate data before passing it to the hasher

Thinking more about this, I'd say that providing a way to perform allocation-free/non-one-shot hashing of a (potentially non-fully-in-memory) stream of data is a good enough use case to put up with the inconvenience of having to manually delimit chunks.

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2019

@rsc I'd be happy for us to solve just the problem of efficient hashing of a stream of bytes here. The other issues of framing, structs, pointers, and so on hinted at by my Jan 3 note are a can of worms.

@martisch

This comment has been minimized.

Copy link
Member

commented May 30, 2019

Will any off AddBytes([]byte{}), AddBytes(nil), AddString("") hash to the same hash given the same seed/will that be implementation dependent?

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

Will any off AddBytes([]byte{}), AddBytes(nil), AddString("") hash to the same hash given the same seed/will that be implementation dependent?

I think we are all in agreement that the effect of all three calls should be the same. The precise hash is not specified, and depends on the seed, which is also unspecified and unpredictable.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jun 4, 2019

OK, so it sounds like we have converged on:

type Hasher struct { ... opaque ..}
func (h *Hasher) Seed() Seed
func (h *Hasher) SetSeed(seed Seed) // random seed used if this is not called
func (h *Hasher) AddBytes(x []byte)
func (h *Hasher) AddInt64(x int64)
func (h *Hasher) AddUint64(x uint64)
func (h *Hasher) AddFloat64(x float64)
func (h *Hasher) AddString(x string)
func (h *Hasher) Hash() uint64
func (h *Hasher) Reset()
type Seed struct { ... opaque ... }
func NewSeed() Seed

Boundaries between AddBytes/AddString calls are not important. AddString("h");AddString("ello") and AddString("hello") are the same.

The Hash function has no effect on state. It is OK to use AddBytes, Hash, AddBytes, etc for a rolling hash.

Should we accept this proposal?

@alandonovan

This comment has been minimized.

Copy link
Contributor Author

commented Jun 4, 2019

The Int/Uint/Float cases are not strictly required as the caller can convert them to bytes with encoding/binary. Is there any compelling reason to include them?

@randall77

This comment has been minimized.

Copy link
Contributor

commented Jun 4, 2019

Sounds good to me.

As far as semantics, I think we converged on the fact that any sequence of AddBytes and AddString calls behaves identically to a single AddString of the concatenation of all the arguments of those calls.
I don't think there's a need for such a requirement on any of the other Add calls.

I presume Hash is a read-only operation. That means you could use a Hasher for a rolling hash. You can add some stuff, call Hash, add some more, call Hash again, and the second Hash behaves as if you had added all the same stuff but didn't call the intervening Hash?

It might be nice to be able to serialize/deserialize a Seed. Probably not necessary for v1.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Jun 4, 2019

...or construct a Seed from an int64.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

Int/Uint/Float are not necessary but having to encoding/binary them just to get bytes to feed to the hash seems like a lot of effort that is not going to yield a fast hash.

I agree about @randall77's clarifications and added them above.

It sounds like we can accept this.

@rsc rsc changed the title proposal: hash: add string hash function proposal: hash: add Hasher, Seed for hashing user data Jun 11, 2019

@rsc rsc modified the milestones: Proposal, Go1.14 Jun 11, 2019

@rsc

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

Does anyone want to pick this up for Go 1.14?

@andybons andybons changed the title proposal: hash: add Hasher, Seed for hashing user data hash: add Hasher, Seed for hashing user data Jun 13, 2019

@andybons andybons added the NeedsFix label Jun 13, 2019

@andybons

This comment has been minimized.

Copy link
Member

commented Jun 13, 2019

I'm sketching this out with @randall77 and @alandonovan.

@griesemer

This comment has been minimized.

Copy link
Contributor

commented Jun 26, 2019

Re: @randall77 's comment

As far as semantics, I think we converged on the fact that any sequence of AddBytes and AddString calls behaves identically to a single AddString of the concatenation of all the arguments of those calls.
I don't think there's a need for such a requirement on any of the other Add calls.

This implies that any []byte or string can be added by adding each individual byte separately. Perhaps there should be an AddByte(x byte) method as well.

@gopherbot

This comment has been minimized.

Copy link

commented Jul 19, 2019

Change https://golang.org/cl/186877 mentions this issue: bytes/hash: add hashing package for bytes and strings

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.