Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alternate, faster hash function? #2

Closed
zbeekman opened this issue Nov 16, 2015 · 4 comments
Closed

alternate, faster hash function? #2

zbeekman opened this issue Nov 16, 2015 · 4 comments

Comments

@zbeekman
Copy link

The MD5 and SHA hash functions are cryptographic hash functions, which are great for ensuring a uniform distribution of entries in the hash table. However, these hash functions are relatively expensive, and the finite size of the hash table means that at some point (adding enough entities to a table of a given size) more than one KVP will start mapping into the same bin.

I would like to recommend that alternate hash functions be made available as part of Petaca. My favorite, cheap, quick and dirty hash function with a decent distribution is Bob Jenkin's "One at a time" hash: https://en.wikipedia.org/wiki/Jenkins_hash_function#one-at-a-time My tests indicate a relatively uniform distribution when hashing strings.

The only complication with implementing this hash function in Fortran is that there is no unsigned integer type. However, all Fortran integer implementations are done using twos compliment which has the same over-flow, wrap-around behavior as unsigned integers. As a result the absolute value of the hash function result may be used prior to modular division for indexing into the hash table array, without adversely effecting the uniformity of the hash function distribution. (At least until the table size becomes greater than huge(hash_result_integer) at which point, you can just use a larger signed Fortran integer.)

Here is the implementation I usually use:

  elemental function hash(str)
    !! Bob Jenkins' one-at-a-time hash function
    !! Map a given character string onto an integer. For best performance,
    !! this should be:
    !!     1. fast
    !!     2. deterministic
    !!     3. uniformly distributed over the integers
    !!     4. chaotic
    !! See <http://burtleburtle.net/bob/hash/doobs.html>, and
    !! <http://www.stanford.edu/class/ee380/Abstracts/121017-slides.pdf>
    !! (slides from talk of google cityhash developers to Stanford)
    character(len=*), intent(in)   :: str
    !! string to hash
    integer(selected_int_kind(8))  :: hash
    !! Numeric hash
    character(len=len(str))        :: str2
    integer(selected_int_kind(8))  :: i, stlen
    str2 = adjustl(str)
    stlen = len(trim(str2))
    hash  = 0
    do i = 1,stlen
       hash = hash + ichar(str2(i:i))
       hash = hash + ishft(hash,10)
       hash = ieor(hash,ishft(hash,-6))
    end do
    hash = hash + ishft(hash,3)
    hash = ieor(hash,ishft(hash,-11))
    hash = abs(hash + ishft(hash,15))
  end function

I'm pretty busy at the moment, but if you're amenable to including this as an option, I may try to find the time required to submit a PR to implement this.

@nncarlson
Copy link
Owner

I'm certainly open to it. The use case for what you describe sounds different than what I intended for the MD5/SHA1 cryptographic hashes in Petaca, but that's not to say it wouldn't be useful. The interface was designed to sequentially feed in arbitrary variables/arrays of intrinsic types (the components of a derived type, for example) in order to obtain a fingerprint of the data for comparison, in the same way one uses md5sum to get the checksum of a file. This is something I seldom use, but I have found it useful for verifying intermediate quantities between two versions of a code in debugging situations. It never occurred to me that one would use MD5/SHA1 as a hash function for hash tables. I agree that it is way too expensive for that and a different interface would be desired as well, I think. Regarding the unsigned integer issue, the md5/sha1 implementations rely on the same wrap-around behavior of 2's complement representation; requires avoiding certain run-time checking.

I use a hash table in one situation (out of necessity; if I understood them better I expect I'd use them more often). I've attached the hash function for it -- fibonacci I pulled out of Knuth's book. The use case is specialized, handling short integer arrays (length 2 to 4, the node indices of a face of a mesh cell), with the hash invariant with respect to permutation of the array elements.

facet_hash_type.txt

@zbeekman
Copy link
Author

Ah, my apologies for miss-understanding the use case; I skimmed the petaca code a few weeks ago and was under the impression map_any_type used a hash table rather than a linked list for the associative array. I was working on a similar associative array implementation at the time, but have since then paused my efforts, at least until I can look at your implementation in greater detail; no point reinventing the wheel if I don't have to...

The ability to checksum data is extraordinarily useful, and if you can take advantage of cache-locality to perform this when reading or writing the data, even better! You certainly want to use MD5 or better for this use case. In the past I have used HDF5 to take advantage of this functionality as well as some of their built in compression and MPI-IO capabilities.

As an aside, it might be worthwhile considering using a hash table over a linked list for the map_any_type associative array, especially if the number of input parameters passed around with petaca is expected to be quite large, and if they are going to be retrieved randomly, rather than in the order they get put into the list.

I'll have to see if I can find more details about Knuth's fibonacci hash. His book is on my wish-list, but it's pretty expensive, so I haven't purchased it yet. Maybe I'll buy it once I'm at a new job.

Any way, thanks for the useful software, and instructive implementations of various algorithms in modern Fortran!

(Feel free to close this issue, since it seems that you use a SLL rather than a hash table for the associative array, and I was confused about the purpose of your hash functions)

@nncarlson
Copy link
Owner

Yeah, for map_any_type I've done just about the dumbest thing possible; I don't even think the linked-list is sorted. I've rationalized it by assuming the size will be small, accessed once, etc. But I think it would be great to have the internals redone to use a scalable algorithm, like the containers from Python or STL do -- it's just not one of my areas of expertise. I have other containers that I intend to move into Petaca that face the same issues. Now if someone could contribute in this area ...

@zbeekman
Copy link
Author

zbeekman commented Jun 1, 2016

@nncarlson should we close this issue, since I misunderstood your original usage of the hash functions? For fingerprinting data, I think MD5 would be safe enough, and maybe a little bit faster, but SHA is a good choice here, and better guarantees prevention of collisions... I have not started using Petaca in my own work, so have not had time or need to help out with this, but I haven't ruled it out for the future 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants