Hash Libraries

Benabik edited this page May 9, 2012 · 1 revision

(from #244)

From GCI 2010 student Atanas:

I checked some hashing libraries and I chose the following: LIBHASHISH

  • Build-in key support for char arrays (strings) and uint_{8,16,32} types and support for own (possible complex) key data types
  • Support rbtree's as collision strategy instead of bucked lists (avoid worst case O(n) scenario)
  • Dynamic or manual adjustable table size (reordering if proportion entries/table_size is unfavorable)
  • Build as an static or dynamic library (.a and .so)
  • Iterator support (equivalent to ruby's hash.each() method)
  • Thread clean - fine-grained lock mechanisms (mutex locks, reader writer locks, ...)
  • Bloom filter implementation
  • Many built-in hash algorithms (from trivial algorithms till cryptographic ones)
  • Architecture clean - runs on 32bit machines as well as 64bit machines
  • As lightweight as possible - no bloated code
  • Makefile test target plus benchmark applications for comparing the different hashing algorithm
  • Dual licensed under the GNU General Public and BSD License Website is http://libhashish.sourceforge.net/. I think this is the best hashing library for your project.

nwellnhof's commentary:

LSHKIT is for locality sensitive hashing, CMPH for minimal perfect hashing and Mhash for cryptographic hash functions. That's not what we need.

libghthash would be an example for a useful hash table library.

But I don't think we could gain much by using an external library. If we want to optimize our current implementation, I'd suggest a dead simple hash table with open addressing and linear probing. That would give us at most one dcache miss for practically all reads if we make sure that the fill factor stays low enough by rehashing.

Such a hash table would only require 2 words per entry vs. 4 words per entry for our current chained implementation. So even if we limit the fill factor to 50%, we wouldn't need more memory than now.

We'd need two additional bits to mark occupied and deleted entries. For pointer keys or values we could store them in the low bits of the pointers. For integer keys and values, we'd need an additional array for those flags.

benabik added:

I was poking around for information on hash libraries and came across a nice comparison page

Of particular interest to me is the "khash (C)" line in the table, which is his one include file hash library that allows for type specific hashes in C (via preprocessor fun). It seems to perform very well in both space and time. It's MIT licensed, so I don't know if that would be a problem to include in parrot.