alternate, faster hash function? #2

zbeekman · 2015-11-16T15:05:16Z

The MD5 and SHA hash functions are cryptographic hash functions, which are great for ensuring a uniform distribution of entries in the hash table. However, these hash functions are relatively expensive, and the finite size of the hash table means that at some point (adding enough entities to a table of a given size) more than one KVP will start mapping into the same bin.

I would like to recommend that alternate hash functions be made available as part of Petaca. My favorite, cheap, quick and dirty hash function with a decent distribution is Bob Jenkin's "One at a time" hash: https://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-time My tests indicate a relatively uniform distribution when hashing strings.

The only complication with implementing this hash function in Fortran is that there is no unsigned integer type. However, all Fortran integer implementations are done using twos compliment which has the same over-flow, wrap-around behavior as unsigned integers. As a result the absolute value of the hash function result may be used prior to modular division for indexing into the hash table array, without adversely effecting the uniformity of the hash function distribution. (At least until the table size becomes greater than huge(hash_result_integer) at which point, you can just use a larger signed Fortran integer.)

Here is the implementation I usually use:

  elemental function hash(str)
    !! Bob Jenkins' one-at-a-time hash function
    !! Map a given character string onto an integer. For best performance,
    !! this should be:
    !!     1. fast
    !!     2. deterministic
    !!     3. uniformly distributed over the integers
    !!     4. chaotic
    !! See <http://burtleburtle.net/bob/hash/doobs.html>, and
    !! <http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf>
    !! (slides from talk of google cityhash developers to Stanford)
    character(len=*), intent(in)   :: str
    !! string to hash
    integer(selected_int_kind(8))  :: hash
    !! Numeric hash
    character(len=len(str))        :: str2
    integer(selected_int_kind(8))  :: i, stlen
    str2 = adjustl(str)
    stlen = len(trim(str2))
    hash  = 0
    do i = 1,stlen
       hash = hash + ichar(str2(i:i))
       hash = hash + ishft(hash,10)
       hash = ieor(hash,ishft(hash,-6))
    end do
    hash = hash + ishft(hash,3)
    hash = ieor(hash,ishft(hash,-11))
    hash = abs(hash + ishft(hash,15))
  end function

I'm pretty busy at the moment, but if you're amenable to including this as an option, I may try to find the time required to submit a PR to implement this.

The text was updated successfully, but these errors were encountered:

nncarlson · 2015-11-18T15:05:17Z

I'm certainly open to it. The use case for what you describe sounds different than what I intended for the MD5/SHA1 cryptographic hashes in Petaca, but that's not to say it wouldn't be useful. The interface was designed to sequentially feed in arbitrary variables/arrays of intrinsic types (the components of a derived type, for example) in order to obtain a fingerprint of the data for comparison, in the same way one uses md5sum to get the checksum of a file. This is something I seldom use, but I have found it useful for verifying intermediate quantities between two versions of a code in debugging situations. It never occurred to me that one would use MD5/SHA1 as a hash function for hash tables. I agree that it is way too expensive for that and a different interface would be desired as well, I think. Regarding the unsigned integer issue, the md5/sha1 implementations rely on the same wrap-around behavior of 2's complement representation; requires avoiding certain run-time checking.

I use a hash table in one situation (out of necessity; if I understood them better I expect I'd use them more often). I've attached the hash function for it -- fibonacci I pulled out of Knuth's book. The use case is specialized, handling short integer arrays (length 2 to 4, the node indices of a face of a mesh cell), with the hash invariant with respect to permutation of the array elements.

facet_hash_type.txt

zbeekman · 2015-11-18T15:38:13Z

Ah, my apologies for miss-understanding the use case; I skimmed the petaca code a few weeks ago and was under the impression map_any_type used a hash table rather than a linked list for the associative array. I was working on a similar associative array implementation at the time, but have since then paused my efforts, at least until I can look at your implementation in greater detail; no point reinventing the wheel if I don't have to...

The ability to checksum data is extraordinarily useful, and if you can take advantage of cache-locality to perform this when reading or writing the data, even better! You certainly want to use MD5 or better for this use case. In the past I have used HDF5 to take advantage of this functionality as well as some of their built in compression and MPI-IO capabilities.

As an aside, it might be worthwhile considering using a hash table over a linked list for the map_any_type associative array, especially if the number of input parameters passed around with petaca is expected to be quite large, and if they are going to be retrieved randomly, rather than in the order they get put into the list.

I'll have to see if I can find more details about Knuth's fibonacci hash. His book is on my wish-list, but it's pretty expensive, so I haven't purchased it yet. Maybe I'll buy it once I'm at a new job.

Any way, thanks for the useful software, and instructive implementations of various algorithms in modern Fortran!

(Feel free to close this issue, since it seems that you use a SLL rather than a hash table for the associative array, and I was confused about the purpose of your hash functions)

nncarlson · 2015-11-18T16:47:13Z

Yeah, for map_any_type I've done just about the dumbest thing possible; I don't even think the linked-list is sorted. I've rationalized it by assuming the size will be small, accessed once, etc. But I think it would be great to have the internals redone to use a scalable algorithm, like the containers from Python or STL do -- it's just not one of my areas of expertise. I have other containers that I intend to move into Petaca that face the same issues. Now if someone could contribute in this area ...

zbeekman · 2016-06-01T15:29:59Z

@nncarlson should we close this issue, since I misunderstood your original usage of the hash functions? For fingerprinting data, I think MD5 would be safe enough, and maybe a little bit faster, but SHA is a good choice here, and better guarantees prevention of collisions... I have not started using Petaca in my own work, so have not had time or need to help out with this, but I haven't ruled it out for the future 😄

nncarlson closed this as completed Jun 2, 2016

zmiimz mentioned this issue Jun 13, 2019

gfortran 8.3.1 fails any_matrix_type test #14

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

alternate, faster hash function? #2

alternate, faster hash function? #2

zbeekman commented Nov 16, 2015

nncarlson commented Nov 18, 2015

zbeekman commented Nov 18, 2015

nncarlson commented Nov 18, 2015

zbeekman commented Jun 1, 2016

alternate, faster hash function? #2

alternate, faster hash function? #2

Comments

zbeekman commented Nov 16, 2015

nncarlson commented Nov 18, 2015

zbeekman commented Nov 18, 2015

nncarlson commented Nov 18, 2015

zbeekman commented Jun 1, 2016