jhunt/hash

Hashing. It's one of the fundamentals of Computer Science, but how do you compare the performance of one hash function against another?

Here, you will find my attempts at solving that problem for multiple personal projects. I'm not (currently) in the business of defining new hashing algorithms, so this repository just focuses on hash functions built by smarter people than me, and attempts to portray their behavior and performance characteristics.

Above all, this test suite strives to be flexible, enabling new test corpora to be integrated with minimal effort.

I hope you find this useful.

Fitness Tests

In order to gauge the fitness of a particular hash function, we can check simple uniformity, or how evenly the algorithm distributes output values. A hash function that always returns the value 42 is decidedly non-uniform.

Simple uniformity must be differentiated from random uniformity. For simple uniformity, we don't care how random the distribution appears, we just care that it is even. Random uniformity has important implications for cryptographically-secure hashing functions, but for simple hash tables (my original problem space) it may not be worth the effort.
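
To make the binning concrete, here's a minimal sketch of the check in C, with fnv1a standing in for whichever hash is under test (the suite's real harness differs in the details):

```c
#include <stdio.h>
#include <stdint.h>

#define NBINS 64

/* fnv1a (32-bit) stands in for whichever hash is under test */
static uint32_t hash_fn(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    /* a stand-in corpus; the real tests use ~235k English words */
    const char *words[] = { "apple", "banana", "cherry", "durian", "elderberry" };
    size_t counts[NBINS] = { 0 };
    size_t i, n = sizeof(words) / sizeof(words[0]);

    /* drop each word into a bin via H(s) % 64, then count occupancy */
    for (i = 0; i < n; i++)
        counts[hash_fn(words[i]) % NBINS]++;

    for (i = 0; i < NBINS; i++)
        if (counts[i] > 0)
            printf("bin %2zu: %zu\n", i, counts[i]);
    return 0;
}
```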

Results

Hashing ~235k English words over 64 bins (calculated as H(s) % 64), we can look at the scatter plot of bin clustering to get a gut feel for how uniform each hash is:

Scatter Plot

Yeesh. xor looks terrible, but that's not unexpected. The rest all seem to cluster around the middle (which is good!), so let's look at a consolidated boxplot to get a feel for quartile distribution:

Simple Uniformity Boxplot

As before, xor is off the charts, and makes it difficult to see the relative differences between the other, more serious hashing algorithms. If we drop the worst three (murmur3, spooky2, and xor), the picture gets a little clearer:

Box Plot (top performers)

Timing Benchmarks

We also need to evaluate each hash algorithm according to how long it takes to perform hashing operations. To do that, we average the execution times of fifty runs of the hash function against each string in our training corpus. This set of per-operation durations is graphed below as a boxplot:

Execution Times

The y-axis is measured in nanoseconds.
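
For the curious, the measurement boils down to something like the following sketch, using POSIX clock_gettime (again with a stand-in hash; the suite's real harness differs in the details):

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define RUNS 50

/* sdbm stands in; substitute whichever hash is under test */
static uint32_t hash_fn(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h = (unsigned char)*s++ + (h << 6) + (h << 16) - h;
    return h;
}

/* average wall-clock time of RUNS hashes of s, in ns/op */
static double time_hash(const char *s)
{
    struct timespec t0, t1;
    volatile uint32_t sink = 0;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < RUNS; i++)
        sink = hash_fn(s);  /* volatile sink defeats dead-code elimination */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    return ((t1.tv_sec - t0.tv_sec) * 1e9
          + (t1.tv_nsec - t0.tv_nsec)) / RUNS;
}

int main(void)
{
    printf("%.1f ns/op\n", time_hash("hello, hashing world"));
    return 0;
}
```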

Both sdbm and fnv1a have pretty dismal performance (75th percentile is no better than ~200ns/op, and can get as high as ~300ns/op) compared to all the others. If we drop the worst four (fnv1a, sdbm, lookup3, and jenkins1), we can see more nuance between the top performers:

Execution Times (top performers)

Incidentally, xor has the best timing performance. Given that it's a single assembly instruction per byte, that's not too surprising.
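
For reference, an xor hash is essentially just folding each byte into an accumulator, along these lines (a sketch, not necessarily byte-for-byte what this suite runs):

```c
#include <stdint.h>

/* one XOR per input byte: as cheap as hashing gets, but the
 * accumulator never holds more than 8 bits of state, which is
 * exactly why its bin distribution looks so bad above */
uint32_t xor_hash(const char *s)
{
    uint32_t h = 0;
    while (*s)
        h ^= (unsigned char)*s++;
    return h;
}
```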

Testing Qualified Names

Before I started playing with hash functions just for the hell of it, I found myself in search of a good hashing function for bolo-ng and TSDP qualified names.

Qualified Names look like this:

host=web01.example.com, metric=cpu, value=0.001, ts=1474055418

That is, a set of key=value pairs, separated by commas. This is definitely not English text, so let's re-run the fitness tests and timing benchmarks against a corpus of dummy qualified names.

(Side note: I wrote a small Perl script, madlibs, that let me easily generate hundreds of thousands of test names. :+1:)
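
The script itself is Perl, but the fill-in-the-blanks idea translates to a few lines of C (the word lists here are made up, purely illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* hypothetical word lists; the real generator draws from bigger ones */
static const char *hosts[]   = { "web01", "db02", "cache03" };
static const char *domains[] = { "example.com", "test.io" };
static const char *metrics[] = { "cpu", "mem", "disk" };

#define PICK(a) ((a)[rand() % (sizeof(a) / sizeof((a)[0]))])

int main(void)
{
    int i;
    srand(42); /* fixed seed, so the corpus is reproducible */
    for (i = 0; i < 5; i++)
        printf("host=%s.%s, metric=%s, value=%.3f, ts=%d\n",
               PICK(hosts), PICK(domains), PICK(metrics),
               (rand() % 1000) / 1000.0, 1474055418 + i);
    return 0;
}
```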

Fitness Tests (Qualified Names Edition)

Hashing 100k Qualified Names over 64 bins (again calculated as H(s) % 64), we can look at the scatter plot of bin clustering to get a gut feel of uniformity given our new inputs:

Scatter Plot

Not much has changed; xor is still terrible. Let's look at the box plots:

Simple Uniformity Boxplot

We knew xor was bad, but now we can also see that murmur3 is worse than the others, so let's drop those two out and hopefully get a better picture of the top performers:

Box Plot (without xor)

Timing Benchmarks (Qualified Names Edition)

With the longer strings in the Qualified Names corpus, let's see what happens to performance. Again, we're going to average the execution times of fifty runs of the hash function against each string in the corpus. This set of per-operation durations is graphed below as a boxplot:

Execution Time

The y-axis is measured in nanoseconds.

It seems quite clear that we have three "bands" of performance; murmur3 and xor are the fastest, followed by djb2, k&r, sdbm and spooky2 in the mid-range, with fnv1a, jenkins1 and lookup3 bringing up the rear as the slowest.

Let's focus on the middle tier:

Execution Time

djb2 is the clear winner, with the least variation and the fastest median per-hash execution time, at a little over 200ns.

But why settle for the middle tier when we have two candidates for super-ultra-fast:

Execution Time

xor sure is fast. But so is murmur3, clocking in at a little under 40ns, 5x faster than djb2.

Conclusion

For my purposes, djb2 and murmur3 are the two best options. Two best options? Isn't that a cop-out? Not really. I'll probably be going with djb2 when I need uniformity, and murmur3 for raw speed of hashing.
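
For reference, the canonical djb2 is tiny, which is part of its appeal:

```c
#include <stdint.h>

/* djb2: Dan Bernstein's classic string hash,
 * h = h * 33 + c, seeded with 5381 */
uint32_t djb2(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}
```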
