-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
alternate, faster hash function? #2
Comments
I'm certainly open to it. The use case for what you describe sounds different than what I intended for the MD5/SHA1 cryptographic hashes in Petaca, but that's not to say it wouldn't be useful. The interface was designed to sequentially feed in arbitrary variables/arrays of intrinsic types (the components of a derived type, for example) in order to obtain a fingerprint of the data for comparison, in the same way one uses md5sum to get the checksum of a file. This is something I seldom use, but I have found it useful for verifying intermediate quantities between two versions of a code in debugging situations. It never occurred to me that one would use MD5/SHA1 as a hash function for hash tables. I agree that it is way too expensive for that and a different interface would be desired as well, I think. Regarding the unsigned integer issue, the md5/sha1 implementations rely on the same wrap-around behavior of 2's complement representation; requires avoiding certain run-time checking. I use a hash table in one situation (out of necessity; if I understood them better I expect I'd use them more often). I've attached the hash function for it -- fibonacci I pulled out of Knuth's book. The use case is specialized, handling short integer arrays (length 2 to 4, the node indices of a face of a mesh cell), with the hash invariant with respect to permutation of the array elements. |
Ah, my apologies for miss-understanding the use case; I skimmed the petaca code a few weeks ago and was under the impression The ability to checksum data is extraordinarily useful, and if you can take advantage of cache-locality to perform this when reading or writing the data, even better! You certainly want to use MD5 or better for this use case. In the past I have used HDF5 to take advantage of this functionality as well as some of their built in compression and MPI-IO capabilities. As an aside, it might be worthwhile considering using a hash table over a linked list for the I'll have to see if I can find more details about Knuth's fibonacci hash. His book is on my wish-list, but it's pretty expensive, so I haven't purchased it yet. Maybe I'll buy it once I'm at a new job. Any way, thanks for the useful software, and instructive implementations of various algorithms in modern Fortran! (Feel free to close this issue, since it seems that you use a SLL rather than a hash table for the associative array, and I was confused about the purpose of your hash functions) |
Yeah, for |
@nncarlson should we close this issue, since I misunderstood your original usage of the hash functions? For fingerprinting data, I think MD5 would be safe enough, and maybe a little bit faster, but SHA is a good choice here, and better guarantees prevention of collisions... I have not started using Petaca in my own work, so have not had time or need to help out with this, but I haven't ruled it out for the future 😄 |
The MD5 and SHA hash functions are cryptographic hash functions, which are great for ensuring a uniform distribution of entries in the hash table. However, these hash functions are relatively expensive, and the finite size of the hash table means that at some point (adding enough entities to a table of a given size) more than one KVP will start mapping into the same bin.
I would like to recommend that alternate hash functions be made available as part of Petaca. My favorite, cheap, quick and dirty hash function with a decent distribution is Bob Jenkin's "One at a time" hash: https://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-time My tests indicate a relatively uniform distribution when hashing strings.
The only complication with implementing this hash function in Fortran is that there is no unsigned integer type. However, all Fortran integer implementations are done using twos compliment which has the same over-flow, wrap-around behavior as unsigned integers. As a result the absolute value of the hash function result may be used prior to modular division for indexing into the hash table array, without adversely effecting the uniformity of the hash function distribution. (At least until the table size becomes greater than
huge(hash_result_integer)
at which point, you can just use a larger signed Fortran integer.)Here is the implementation I usually use:
I'm pretty busy at the moment, but if you're amenable to including this as an option, I may try to find the time required to submit a PR to implement this.
The text was updated successfully, but these errors were encountered: