jhunt/hash

Hashing. It's one of the fundamentals of Computer Science, but how do you compare the performance of one hash function against another?

Here, you will find my attempts at solving that problem for multiple personal projects. I'm not (currently) in the business of defining new hashing algorithms, so this repository just focuses on hash functions built by smarter people than me, and attempts to portray their behavior and performance characteristics.

Above all, this test suite strives to be flexible, enabling new test corpora to be integrated with minimal effort.

I hope you find this useful.

Fitness Tests

In order to gauge the fitness of a particular hash function, we can check simple uniformity, or how evenly the algorithm distributes output values. A hash function that always returns the value 42 is decidedly non-uniform.

Simple uniformity must be differentiated from random uniformity. For simple uniformity, we don't care how random the distribution appears, we just care that it is even. Random uniformity has important implications for cryptographically-secure hashing functions, but for simple hash tables (my original problem space) it may not be worth the effort.
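
To make the binning concrete, here's a minimal sketch of the check in C, with fnv1a standing in for whichever hash is under test (the suite's real harness differs in the details):

```c
#include <stdio.h>
#include <stdint.h>

#define NBINS 64

/* fnv1a (32-bit) stands in for whichever hash is under test */
static uint32_t hash_fn(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    /* a stand-in corpus; the real tests use ~235k English words */
    const char *words[] = { "apple", "banana", "cherry", "durian", "elderberry" };
    size_t counts[NBINS] = { 0 };
    size_t i, n = sizeof(words) / sizeof(words[0]);

    /* drop each word into a bin via H(s) % 64, then count occupancy */
    for (i = 0; i < n; i++)
        counts[hash_fn(words[i]) % NBINS]++;

    for (i = 0; i < NBINS; i++)
        if (counts[i] > 0)
            printf("bin %2zu: %zu\n", i, counts[i]);
    return 0;
}
```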

Results

Hashing ~235k English words over 64 bins (calculated as H(s) % 64), we can look at the scatter plot of bin clustering to get a gut feel for how uniform each hash is:

Scatter Plot

Yeesh. xor looks terrible, but that's not unexpected. The rest all seem to cluster around the middle (which is good!), so let's look at a consolidated boxplot to get a feel for quartile distribution:

Simple Uniformity Boxplot

As before, xor is off the charts, and makes it difficult to see the relative differences between the other, more serious hashing algorithms. If we drop the worst three (murmur3, spooky2, and xor), the picture gets a little clearer:

Box Plot (top performers)

Timing Benchmarks

We also need to evaluate each hash algorithm according to how long it takes to perform hashing operations. To do that, we average the execution times of fifty runs of the hash function against each string in our training corpus. This set of per-operation durations is graphed below as a boxplot:

Execution Times

The y-axis is measured in nanoseconds.
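
For the curious, the measurement boils down to something like the following sketch, using POSIX clock_gettime (again with a stand-in hash; the suite's real harness differs in the details):

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define RUNS 50

/* sdbm stands in; substitute whichever hash is under test */
static uint32_t hash_fn(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = (unsigned char)*s++ + (h << 6) + (h << 16) - h;
    return h;
}

/* average wall-clock time of RUNS hashes of s, in ns/op */
static double time_hash(const char *s)
{
    struct timespec t0, t1;
    volatile uint32_t sink = 0;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < RUNS; i++)
        sink = hash_fn(s);  /* volatile sink defeats dead-code elimination */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    return ((t1.tv_sec - t0.tv_sec) * 1e9
          + (t1.tv_nsec - t0.tv_nsec)) / RUNS;
}

int main(void)
{
    printf("%.1f ns/op\n", time_hash("hello, hashing world"));
    return 0;
}
```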

Both sdbm and fnv1a have pretty dismal performance (75th percentile is no better than ~200ns/op, and can get as high as ~300ns/op) compared to all the others. If we drop the worst four (fnv1a, sdbm, lookup3, and jenkins1), we can see more nuance between the top performers:

Execution Times (top performers)

Incidentally, xor has the best timing performance. Given that it's a single assembly instruction per byte, that's not too surprising.
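
For reference, an xor hash is essentially just folding each byte into an accumulator, along these lines (a sketch, not necessarily byte-for-byte what this suite runs):

```c
#include <stdint.h>

/* one XOR per input byte: as cheap as hashing gets, but the
 * accumulator never holds more than 8 bits of state, which is
 * exactly why its bin distribution looks so bad above */
uint32_t xor_hash(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h ^= (unsigned char)*s++;
    return h;
}
```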

Testing Qualified Names

Before I started playing with hash functions just for the hell of it, I found myself in search of a good hashing function for bolo-ng and TSDP qualified names.

Qualified Names look like this:

host=web01.example.com, metric=cpu, value=0.001, ts=1474055418

That is, a set of key=value pairs, separated by commas. This is definitely not English text, so let's re-run the fitness tests and timing benchmarks against a corpus of dummy qualified names.

(Side note: I wrote a small Perl script, madlibs, that let me easily generate hundreds of thousands of test names. :+1:)
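
The script itself is Perl, but the fill-in-the-blanks idea translates to a few lines of C (the word lists here are made up, purely illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* hypothetical word lists; the real generator draws from bigger ones */
static const char *hosts[]   = { "web01", "db02", "cache03" };
static const char *domains[] = { "example.com", "test.io" };
static const char *metrics[] = { "cpu", "mem", "disk" };

#define PICK(a) ((a)[rand() % (sizeof(a) / sizeof((a)[0]))])

int main(void)
{
    int i;
    srand(42); /* fixed seed, so the corpus is reproducible */
    for (i = 0; i < 5; i++)
        printf("host=%s.%s, metric=%s, value=%.3f, ts=%d\n",
               PICK(hosts), PICK(domains), PICK(metrics),
               (rand() % 1000) / 1000.0, 1474055418 + i);
    return 0;
}
```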

Fitness Tests (Qualified Names Edition)

Hashing 100k Qualified Names over 64 bins (again calculated as H(s) % 64), we can look at the scatter plot of bin clustering to get a gut feel of uniformity given our new inputs:

Scatter Plot

Not much has changed; xor is still terrible. Let's look at the box plots:

Simple Uniformity Boxplot

We knew xor was bad, but now we can also see that murmur3 is worse than the others, so let's drop those two out and hopefully get a better picture of the top performers:

Box Plot (without xor)

Timing Benchmarks (Qualified Names Edition)

With the longer strings in the Qualified Names corpus, let's see what happens to performance. Again, we're going to average the execution times of fifty runs of the hash function against each string in the corpus. This set of per-operation durations is graphed below as a boxplot:

Execution Time

The y-axis is measured in nanoseconds.

It seems quite clear that we have three "bands" of performance; murmur3 and xor are the fastest, followed by djb2, k&r, sdbm and spooky2 in the mid-range, with fnv1a, jenkins1 and lookup3 bringing up the rear as the slowest.

Let's focus on the middle tier:

Execution Time

djb2 is the clear winner, with the least variation and the fastest median per-hash execution time, at a little over 200ns.

But why settle for the middle tier when we have two candidates for super-ultra-fast:

Execution Time

xor sure is fast. But so is murmur3, clocking in at a little under 40ns, 5x faster than djb2.

Conclusion

For my purposes, djb2 and murmur3 are the two best options. Two best options? Isn't that a cop-out? Not really. I'll probably be going with djb2 when I need uniformity, and murmur3 for raw speed of hashing.
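
For reference, the canonical djb2 is tiny, which is part of its appeal:

```c
#include <stdint.h>

/* djb2: Dan Bernstein's classic string hash,
 * h = h * 33 + c, seeded with 5381 */
uint32_t djb2(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}
```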
