{{ message }}

# Strengthen the built-in hash function #5225

Closed
opened this issue Feb 16, 2011 · 4 comments
Closed

# Strengthen the built-in hash function#5225

opened this issue Feb 16, 2011 · 4 comments
Assignees
Labels

### vicuna commented Feb 16, 2011

Original bug ID: 5225
Reporter: mgiovann
Assigned to: @xavierleroy
Status: closed (set by @xavierleroy on 2012-09-25T18:06:17Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 3.12.0
Fixed in version: 3.13.0+dev
Category: ~DO NOT USE (was: OCaml general)
Related to: #5222
Monitored by: @ygrek "Julien Signoles" "Christoph Bauer"

## Bug description

The built-in hash function in hash.c has extremely poor statistical properties. It is of the form (a*x + b) mod M for M = 2^word size, but the modulus should be prime for best results (and for correct results in the case of Cuckoo Hash, for instance). Tested on a word list it fails Knuth's criteria for chi-square and Kolmogorov-Smirnov tests.

Since the hash value is less than 2^31, suitable prime moduli (for Combine and Combine_small) are 2^31 - 19 and 2^31 - 1. In effect, the code for hash.c would be:

#define Alpha 65599
#define Beta 19
#define M0 ((1U << 31) - 19)
#define M1 ((1U << 31) - 1)
#define Combine(new) (hash_accu = (hash_accu * Alpha + (new)) % M0)
#define Combine_small(new) (hash_accu = (hash_accu * Beta + (new)) % M1)

## File attachments

### vicuna commented Feb 17, 2011

 Comment author: mgiovann I enclose the results of Knuth's standard battery of tests against random generators. Four hashes are tested against a large (> 200.000) English words. "Std hash" is a bit-correct reimplementation of Combine_small. "Mod hash" is the following hash: let combine_small' h c = Int64.rem (Int64.add (Int64.mul h 9973L) (Int64.of_int (int_of_char c))) 4294967291L where the modulus is (2^32 - 5), a prime. "MurmurHash2" is a C binding of the hash of the same name. "Murmur+Std" is Combine_small whitened by one round of 32-bit MurmurHash2. Chi-square tests are performed by binning the hashes, taking the 2^k least-significant-bits of the hash and accumulating. Taking the 2^(31-k) most significant bits gives markedly worse results (not shown) for the linear-congruential generators, implying that the multipliers are far too small. As can be seen from the test results: Combine_small's distribution is not uniform over 31 bits (K+ too large). Also, and more worrying, it has a very high number of collisions The proposed universal hash has smoother Chi-2 statistics, although it is a bit too high as shown in the K+ statistic. However, it has a very low number of collisions, an order of magnitude better than Combine_small. MurmurHash2 has a very smooth distribution of values Whitening Combine_small gives it better statistical properties but of course it has at least as many collisions as the underlying hash function has. In an unrelated test I've found that Hashtbl under the same data set (i.e., hashing strings) using MurmurHash2 shows around 6% better performance than using Combine_small, even though the hash is computationally more intensive, which shows that a better-behaved hash function can be beneficial.

### vicuna commented Mar 4, 2011

 Comment author: @xavierleroy Thanks for the suggestions. There are other infelicities in the current implementation of hash_univ, e.g. #5222 and also http://caml.inria.fr/pub/ml-archives/caml-list/2002/08/70fc38b62a61330808f6d14a9e00dd94.en.html The next major release might be a good time to address them all. One thing we must be careful about is that changing the hash function can break programs that store hashtables to persistent storage, so it's better to batch several changes in one major release.

### vicuna commented May 27, 2011

 Comment author: @xavierleroy I'm starting to look at better hash functions and it would save me time if could have your (mgiovanni's) test harness. At your leisure, could you please e-mail it to me? (xavier.leroy AT inria.fr) Thanks.

### vicuna commented May 29, 2011

 Comment author: @xavierleroy SVN trunk now contains a complete reimplementation of OCaml's generic hash function, using the Murmur 3 algorithm. It seems to address the poor distribution issue, as well as other issues of the old generic hash. Planned to go in release 3.13.