New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Types should not use zero or other small integers as hash results #30
Comments
We used to have a combine h1 h2 = (h1 + h1 `shiftL` 5) `xor` h2 It got changed when I merged the FNV-1 hash patches to match the combine function used in string hashing. I probably should have kept the old one. I'm going to be on vacation for a few days. If you feel like creating a pull request with the changes you think ought to be made I can try to review it in the next few days. There's another related item on my TODO list; integers hash to themselves. This behavior exists in Python, Java, and STL, but I'm not sure if it's a good one always. We might instead want to multiply integers by a large prime. |
Should be fixed with the switch to SipHash etc. |
Can this be re-opened and discussed since the behavior was reverted? |
Thanks, Johan. In my data structure I need More broadly it makes me nervous that https://code.google.com/p/smhasher/ I can open a new ticket for the latter idea, if you want. Thanks |
hashable is designed for a particular use case, supporting hashing-based data structures, rather than for generic hashing (e.g. don't use it to compute checksums for your files). As such it makes some engineering trade-offs, the same as are done in many other languages, that work well in practice for this use case. It turns out in practice that having ints hash to themselves provides good performance (due to simplicity and locality of hash functions). If your data structures have different needs you can employ the common technique of post-mixing, here you mix the hash a bit more before using it. This is commonly used in e.g. base-2 hash tables for pointer values, as pointer values would have the low bits be always zero. |
My application is a bloom filter variant. There are probably many other hash-based data structures for which the current FWIW this combined with #73 (users want to serialize their bloomfilter) and #74 (instances for algebraic prelude types seem unprincipled, and make me nervous), makes it unlikely I'll be able to use this for anything more than to bootstrap a v0.0 for this particular use case. |
(on re-reading, I think my last comment might across as disparaging or argumentative which isn't my intention! I understand your trade-offs argument, I just wanted to voice that they don't seem to fit my needs very well even though I think my use-case is one |
@jberryman we're basically at an uneasy stand still. We don't have an alternative that works better in practice. This seems to be the same conclusion that other communities have come to, so here we are. |
If anyone's interested I've released a rewrite of hashable called |
There are some bad instances of
Hashable
that use small integers as hash results.This is a bad idea, as it makes it trivially easy to exploit arithmetic properties to cause collisions.
Witness the definition of
combine
:It's common for people combining hashes to write definitions like this. In this case, it is obvious that if you can fix the first hash to be zero, the result will be exactly the second hash.
Worse, if someone chains together a series of multiplications in an instance of their own, they can accidentally end up with hashes always being zero.
I propose two ways to address this problem:
Hashable
class, tell people to use large random numbers, and why,The text was updated successfully, but these errors were encountered: