Types should not use zero or other small integers as hash results #30

bos · 2012-09-22T16:49:06Z

There are some bad instances of Hashable that use small integers as hash results.

This is a bad idea, as it makes it trivially easy to exploit arithmetic properties to cause collisions.

Witness the definition of combine:

combine h1 h2 = (h1 * 16777619) `xor` h2

It's common for people combining hashes to write definitions like this. In this case, it is obvious that if you can fix the first hash to be zero, the result will be exactly the second hash.

Worse, if someone chains together a series of multiplications in an instance of their own, they can accidentally end up with hashes always being zero.

I propose two ways to address this problem:

Document it. In the Hashable class, tell people to use large random numbers, and why,
Fix the current vulnerable instances by replacing small random numbers with large.

The text was updated successfully, but these errors were encountered:

tibbe · 2012-09-24T06:12:55Z

We used to have a combine which didn't have this property:

combine h1 h2 = (h1 + h1 `shiftL` 5) `xor` h2

It got changed when I merged the FNV-1 hash patches to match the combine function used in string hashing. I probably should have kept the old one.

I'm going to be on vacation for a few days. If you feel like creating a pull request with the changes you think ought to be made I can try to review it in the next few days.

There's another related item on my TODO list; integers hash to themselves. This behavior exists in Python, Java, and STL, but I'm not sure if it's a good one always. We might instead want to multiply integers by a large prime.

tibbe · 2012-12-12T03:16:43Z

Should be fixed with the switch to SipHash etc.

jberryman · 2015-03-03T01:33:01Z

Can this be re-opened and discussed since the behavior was reverted?

jberryman · 2015-03-03T15:00:49Z

Thanks, Johan. In my data structure I need b hash bits (depending on some parameters), to be divided and used in a few different phases of the algorithm. With a "good" hashing algorithm I should be able to take a few bits here and there from anywhere in the hash value to use for these various phases without bias. The numeric instances are clearly problematic for that kind of usage.

More broadly it makes me nervous that hashable makes no guarantees about "goodness" of these hashes, and that quality of hashes seems to have fluctuated wildly in the past. Either consistently using a documented algorithm (which you've said you don't want to do), or even just having a test suite that runs some avalanche/distribution tests (see below) that we promise to pass would be acceptable for me.

https://code.google.com/p/smhasher/
http://floodyberry.com/noncryptohashzoo/
http://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/

I can open a new ticket for the latter idea, if you want.

Thanks

tibbe · 2015-03-03T20:54:17Z

hashable is designed for a particular use case, supporting hashing-based data structures, rather than for generic hashing (e.g. don't use it to compute checksums for your files). As such it makes some engineering trade-offs, the same as are done in many other languages, that work well in practice for this use case. It turns out in practice that having ints hash to themselves provides good performance (due to simplicity and locality of hash functions).

If your data structures have different needs you can employ the common technique of post-mixing, here you mix the hash a bit more before using it. This is commonly used in e.g. base-2 hash tables for pointer values, as pointer values would have the low bits be always zero.

jberryman · 2015-03-03T21:38:44Z

My application is a bloom filter variant. There are probably many other hash-based data structures for which the current hashable makes too many assumptions to be usable (or is already silently introducing bias, or breaking the time complexity of peoples' data structures without them realizing). As you say I guess I can implement a better, more mixy hash function based on hashable and make sure it passes my own tests.

FWIW this combined with #73 (users want to serialize their bloomfilter) and #74 (instances for algebraic prelude types seem unprincipled, and make me nervous), makes it unlikely I'll be able to use this for anything more than to bootstrap a v0.0 for this particular use case.

jberryman · 2015-03-03T21:48:41Z

(on re-reading, I think my last comment might across as disparaging or argumentative which isn't my intention! I understand your trade-offs argument, I just wanted to voice that they don't seem to fit my needs very well even though I think my use-case is one hashable claims to be targeting. Thanks for your work on the library!)

tibbe · 2015-03-05T01:00:01Z

@jberryman we're basically at an uneasy stand still. We don't have an alternative that works better in practice. This seems to be the same conclusion that other communities have come to, so here we are.

jberryman · 2015-08-17T02:33:34Z

If anyone's interested I've released a rewrite of hashable called hashabler which concerns itself with this issue.

tibbe closed this as completed Dec 12, 2012

hrdinka mentioned this issue Jul 13, 2013

Miserable instances for integral types #71

Closed

tibbe reopened this Mar 3, 2015

phadej mentioned this issue Jun 24, 2018

Why can the result of hash be negative? #162

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Types should not use zero or other small integers as hash results #30

Types should not use zero or other small integers as hash results #30

bos commented Sep 22, 2012

tibbe commented Sep 24, 2012

tibbe commented Dec 12, 2012

jberryman commented Mar 3, 2015

jberryman commented Mar 3, 2015

tibbe commented Mar 3, 2015

jberryman commented Mar 3, 2015

jberryman commented Mar 3, 2015

tibbe commented Mar 5, 2015

jberryman commented Aug 17, 2015

Types should not use zero or other small integers as hash results #30

Types should not use zero or other small integers as hash results #30

Comments

bos commented Sep 22, 2012

tibbe commented Sep 24, 2012

tibbe commented Dec 12, 2012

jberryman commented Mar 3, 2015

jberryman commented Mar 3, 2015

tibbe commented Mar 3, 2015

jberryman commented Mar 3, 2015

jberryman commented Mar 3, 2015

tibbe commented Mar 5, 2015

jberryman commented Aug 17, 2015