Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Types should not use zero or other small integers as hash results #30

Open
bos opened this issue Sep 22, 2012 · 9 comments
Open

Types should not use zero or other small integers as hash results #30

bos opened this issue Sep 22, 2012 · 9 comments

Comments

@bos
Copy link
Collaborator

bos commented Sep 22, 2012

There are some bad instances of Hashable that use small integers as hash results.

This is a bad idea, as it makes it trivially easy to exploit arithmetic properties to cause collisions.

Witness the definition of combine:

combine h1 h2 = (h1 * 16777619) `xor` h2

It's common for people combining hashes to write definitions like this. In this case, it is obvious that if you can fix the first hash to be zero, the result will be exactly the second hash.

Worse, if someone chains together a series of multiplications in an instance of their own, they can accidentally end up with hashes always being zero.

I propose two ways to address this problem:

  1. Document it. In the Hashable class, tell people to use large random numbers, and why,
  2. Fix the current vulnerable instances by replacing small random numbers with large.
@tibbe
Copy link
Collaborator

tibbe commented Sep 24, 2012

We used to have a combine which didn't have this property:

combine h1 h2 = (h1 + h1 `shiftL` 5) `xor` h2

It got changed when I merged the FNV-1 hash patches to match the combine function used in string hashing. I probably should have kept the old one.

I'm going to be on vacation for a few days. If you feel like creating a pull request with the changes you think ought to be made I can try to review it in the next few days.

There's another related item on my TODO list; integers hash to themselves. This behavior exists in Python, Java, and STL, but I'm not sure if it's a good one always. We might instead want to multiply integers by a large prime.

@tibbe
Copy link
Collaborator

tibbe commented Dec 12, 2012

Should be fixed with the switch to SipHash etc.

@jberryman
Copy link

Can this be re-opened and discussed since the behavior was reverted?

@tibbe tibbe reopened this Mar 3, 2015
@jberryman
Copy link

Thanks, Johan. In my data structure I need b hash bits (depending on some parameters), to be divided and used in a few different phases of the algorithm. With a "good" hashing algorithm I should be able to take a few bits here and there from anywhere in the hash value to use for these various phases without bias. The numeric instances are clearly problematic for that kind of usage.

More broadly it makes me nervous that hashable makes no guarantees about "goodness" of these hashes, and that quality of hashes seems to have fluctuated wildly in the past. Either consistently using a documented algorithm (which you've said you don't want to do), or even just having a test suite that runs some avalanche/distribution tests (see below) that we promise to pass would be acceptable for me.

https://code.google.com/p/smhasher/
http://floodyberry.com/noncryptohashzoo/
http://research.neustar.biz/2012/02/02/choosing-a-good-hash-function-part-3/

I can open a new ticket for the latter idea, if you want.

Thanks

@tibbe
Copy link
Collaborator

tibbe commented Mar 3, 2015

hashable is designed for a particular use case, supporting hashing-based data structures, rather than for generic hashing (e.g. don't use it to compute checksums for your files). As such it makes some engineering trade-offs, the same as are done in many other languages, that work well in practice for this use case. It turns out in practice that having ints hash to themselves provides good performance (due to simplicity and locality of hash functions).

If your data structures have different needs you can employ the common technique of post-mixing, here you mix the hash a bit more before using it. This is commonly used in e.g. base-2 hash tables for pointer values, as pointer values would have the low bits be always zero.

@jberryman
Copy link

My application is a bloom filter variant. There are probably many other hash-based data structures for which the current hashable makes too many assumptions to be usable (or is already silently introducing bias, or breaking the time complexity of peoples' data structures without them realizing). As you say I guess I can implement a better, more mixy hash function based on hashable and make sure it passes my own tests.

FWIW this combined with #73 (users want to serialize their bloomfilter) and #74 (instances for algebraic prelude types seem unprincipled, and make me nervous), makes it unlikely I'll be able to use this for anything more than to bootstrap a v0.0 for this particular use case.

@jberryman
Copy link

(on re-reading, I think my last comment might across as disparaging or argumentative which isn't my intention! I understand your trade-offs argument, I just wanted to voice that they don't seem to fit my needs very well even though I think my use-case is one hashable claims to be targeting. Thanks for your work on the library!)

@tibbe
Copy link
Collaborator

tibbe commented Mar 5, 2015

@jberryman we're basically at an uneasy stand still. We don't have an alternative that works better in practice. This seems to be the same conclusion that other communities have come to, so here we are.

@jberryman
Copy link

If anyone's interested I've released a rewrite of hashable called hashabler which concerns itself with this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants