
hashes: implement murmur3 #12022

Merged
merged 12 commits into devel from miran-murmur3 on Aug 31, 2019

Conversation

@narimiran (Member) commented Aug 24, 2019

fixes #11581

@dom96 (Contributor) left a comment:

Cool. I assume the aim here is to offer a JS/C cross-compatible hashing function?

It would be nice to create a test that verifies that the hashes match on JS and C backends.
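
A minimal sketch of such a test (file name hypothetical; assumes nodejs is available for nim js -r):

# thash_backends.nim -- compile and run with both backends, then diff:
#   nim c -r thash_backends.nim > out_c.txt
#   nim js -r thash_backends.nim > out_js.txt
import hashes

for s in ["", "a", "ab", "hi there", "a somewhat longer test string"]:
  echo hash(s)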

@mratsim (Collaborator) commented Aug 24, 2019

AFAIK it's also to fix the perf regression #11581 introduced by #11203.

Can you rerun the same benchmarks as those in #11203 (comment)?

I think we also need either a tables benchmark or a randomness analysis to make sure we don't reintroduce bucket imbalance in hash tables like with #11203

@narimiran (Member, Author) commented Aug 26, 2019

Can you rerun the same benchmarks as those in #11203 (comment)?

Here are the results of the benchmark proposed in #11581 (direct link to the benchmark), as it tests more things. Three values are reported: time to run the benchmark, entropy, and the maximal number of collisions.

Four different hashing algorithms were tested:

  1. original ("Jenkins one at a time"): the one used originally, before PR #11203 (faster hashing)
  2. current: the one currently used, after PR #11203 (faster hashing)
  3. new ("murmur32"): the one proposed in this PR
  4. murmur64_128: the one proposed in #11581 (regression: hashes makes tables 100x slower on some inputs, eg oids)
test            measure     original   current   new (murmur32)   murmur64_128
string len 4    time           0.219     0.181            0.177          0.193
                entropy       19.136    19.136           19.137         19.136
                max coll.          8         8                8             10
string len 8    time           0.218     0.151            0.188          0.203
                entropy       19.137     9.966           19.137         19.299
                max coll.          8      1000                9              8
string len 12   time           0.204     0.188            0.179          0.239
                entropy       19.137    19.137           19.137         19.136
                max coll.          9         8                9              9
string len 16   time           0.241     0.150            0.194          0.190
                entropy       19.135     9.966           19.298         19.135
                max coll.          9      1000                9              9
string len 24   time           0.252     0.153            0.185          0.222
                entropy       19.136     9.966           19.134         19.137
                max coll.          9      1000                8              8
string len 32   time           0.280     0.159            0.191          0.185
                entropy       19.136     9.958           19.135         19.137
                max coll.          8      2000                9              9
string len 48   time           0.310     0.161            0.200          0.200
                entropy       19.136     9.966           19.136         19.137
                max coll.          8      1000                8              9
string len 64   time           0.347     0.169            0.210          0.201
                entropy       19.137     9.966           19.136         19.297
                max coll.          9      1000                9              8
genOid          time           0.221     0.142            0.153          0.187
                entropy       19.136     7.935           19.138         19.138
                max coll.          8      4096                8              9

proc rotl32(x: uint32, r: int): uint32 {.inline.} =
  (x shl r) or (x shr (32 - r))

proc murmurHash[T: char|int8|byte](x: openArray[T]): Hash =
Review comment (Member):

Is it easy to make this non-generic? Only offer the char or byte variant. Should remove some code bloat.
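
For illustration, a rough sketch of the byte-only shape (murmurBytes below is a stand-in body, not this PR's murmur implementation):

import hashes

proc murmurBytes(x: openArray[byte]): Hash =
  # placeholder mixing, for illustration only
  for b in x: result = result !& int(b)
  result = !$result

proc hashStr(s: string): Hash =
  # call sites convert to bytes once, instead of instantiating a generic
  murmurBytes(toOpenArrayByte(s, 0, s.high))

echo hashStr("abc")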

@timotheecour (Member) commented Aug 26, 2019

@narimiran here are my conclusions:

  • rank 3: current is the worst in practice (because of #11581, which has very bad modes, eg for oids)
  • rank 2: original is the next best
  • rank 1: murmur64_128 and murmur32 are clear winners.
    So the only thing left to compare is murmur64_128 vs murmur32:
    from your table, murmur64_128 performs a bit better than murmur32 for longer strings (>= 32), and a bit worse for shorter strings (< 32).

I've made a more comprehensive comparison of the two contenders here:
timotheecour/vitanim@82b7356

# requires building nim on top of this PR #12022
# murmur64_128
nim c -r -d:danger -d:case_with_murmur $vitanim_D/testcases/tests/t0695.nim
# murmur32
nim c -r -d:danger $vitanim_D/testcases/tests/t0695.nim

It measures performance for several different aspects:

  • tKey: time to compute the key (irrelevant)
  • tHash: time to compute the hash (repeated K1 times for scale).
  • tInsert: time to insert the keys in a table
  • tGet: time to retrieve the value from key in a table (repeated K2 times for scale)

tHash doesn't take into account entropy, but tInsert and tGet would be affected by it.

Here are the results:

# murmur64_128
(key: "genRandOid           ", tKey: "0.0162", tHash: "0.1455", tInsert: "0.0846", tGet: "0.4825")
(key: "getRandomString:1    ", tKey: "0.1025", tHash: "0.0929", tInsert: "0.0707", tGet: "0.0664")
(key: "getRandomString:2    ", tKey: "0.0698", tHash: "0.0903", tInsert: "0.0968", tGet: "0.2243")
(key: "getRandomString:4    ", tKey: "0.0840", tHash: "0.1356", tInsert: "0.1205", tGet: "0.4458")
(key: "getRandomString:8    ", tKey: "0.1102", tHash: "0.1943", tInsert: "0.1669", tGet: "0.5790")
(key: "getRandomString:16   ", tKey: "0.1504", tHash: "0.1539", tInsert: "0.1648", tGet: "0.5852")
(key: "getRandomString:32   ", tKey: "0.1749", tHash: "0.2248", tInsert: "0.2088", tGet: "0.5541")
(key: "getRandomString:64   ", tKey: "0.2609", tHash: "0.2511", tInsert: "0.3105", tGet: "0.5512")
(key: "getRandomString:100  ", tKey: "0.3529", tHash: "0.3173", tInsert: "0.3847", tGet: "0.7094")
(key: "getRandomString:128  ", tKey: "0.4351", tHash: "0.4379", tInsert: "0.4778", tGet: "0.7872")
(key: "getRandomString:256  ", tKey: "0.7388", tHash: "0.6897", tInsert: "0.8344", tGet: "1.1721")
(key: "getRandomString:512  ", tKey: "1.3675", tHash: "1.2351", tInsert: "1.5222", tGet: "1.4350")
(key: "getRandomString:1024 ", tKey: "2.6427", tHash: "2.3164", tInsert: "2.9879", tGet: "2.0832")
(key: "getRandomString:2048 ", tKey: "5.2856", tHash: "4.5949", tInsert: "5.8851", tGet: "2.9954")
# murmur32
(key: "genRandOid           ", tKey: "0.0159", tHash: "0.1136", tInsert: "0.0857", tGet: "0.4915")
(key: "getRandomString:1    ", tKey: "0.1045", tHash: "0.0530", tInsert: "0.0737", tGet: "0.0640")
(key: "getRandomString:2    ", tKey: "0.0699", tHash: "0.0549", tInsert: "0.1045", tGet: "0.2534")
(key: "getRandomString:4    ", tKey: "0.1004", tHash: "0.0642", tInsert: "0.1246", tGet: "0.4488")
(key: "getRandomString:8    ", tKey: "0.1208", tHash: "0.0752", tInsert: "0.1396", tGet: "0.5041")
(key: "getRandomString:16   ", tKey: "0.1343", tHash: "0.1098", tInsert: "0.1534", tGet: "0.4902")
(key: "getRandomString:32   ", tKey: "0.1671", tHash: "0.1830", tInsert: "0.1969", tGet: "0.5264")
(key: "getRandomString:64   ", tKey: "0.2551", tHash: "0.3272", tInsert: "0.3077", tGet: "0.7017")
(key: "getRandomString:100  ", tKey: "0.3405", tHash: "0.4749", tInsert: "0.3918", tGet: "0.7964")
(key: "getRandomString:128  ", tKey: "0.4131", tHash: "0.5774", tInsert: "0.4585", tGet: "1.0032")
(key: "getRandomString:256  ", tKey: "0.7313", tHash: "1.1903", tInsert: "0.8415", tGet: "1.2563")
(key: "getRandomString:512  ", tKey: "1.3581", tHash: "2.4187", tInsert: "1.5216", tGet: "1.5864")
(key: "getRandomString:1024 ", tKey: "2.6210", tHash: "4.8380", tInsert: "3.0145", tGet: "2.7432")
(key: "getRandomString:2048 ", tKey: "5.2682", tHash: "9.6445", tInsert: "5.8773", tGet: "4.0281")

Conclusion: compared to murmur32, murmur64_128 is:

  • 2X faster for hash computation (tHash) for len >= 256
  • 1.3X faster for 64 <= len <= 128
  • 1.3X to 2X slower for len <= 32
  • roughly the same speed for insertion (tInsert)
  • for retrieval (tGet): 1.3X faster for len >= 512; otherwise most of the time a bit faster, sometimes a bit slower
Also note that murmur64_128 has, by definition, better randomness/entropy than murmur32, so it would likely perform better on average in terms of collisions; but that would probably only matter for certain applications / key distributions where murmur32 isn't good enough. I haven't run into such cases yet.

Based on this, this PR is a strict improvement over the current situation, so it's good enough for now in terms of performance; if needed, a future PR can achieve the "best of both worlds" as follows (see the sketch after the list):

  • apply murmur32 for len <= 32
  • apply murmur64 for len > 32
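
A minimal sketch of that dispatch (murmurHash32 and murmurHash64_128 below are stand-ins with hypothetical names, not the real implementations):

import hashes

# stand-in bodies, for illustration only; the real ones would be the
# murmur3_32 and murmur3_64_128 cores discussed in this thread
proc murmurHash32(x: openArray[byte]): Hash = hash(x)
proc murmurHash64_128(x: openArray[byte]): Hash = hash(x)

proc murmurHashBest(x: openArray[byte]): Hash {.inline.} =
  # pick the variant that benchmarked faster for each length range
  if x.len <= 32: murmurHash32(x)
  else: murmurHash64_128(x)

echo murmurHashBest([byte 1, 2, 3])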

@narimiran (Member, Author) replied:

a future PR can achieve "best of both worlds" as follows...

...while taking into account that the solution must be available for:

  1. nimvm (no casting at will; it has to give the same hash as when not in the VM)
  2. JS backend

@timotheecour (Member) commented Aug 26, 2019

nimvm (no casting at your will; it has to give the same hash as when not in VM)

registerCallback can be used for nimvm (and achieves native-code speed); that's what I did in #11767, it works well, and it likely simplifies code. I will revisit this after your PR is merged.

JS backend

Depending on whether we want the js backend to behave the same as the c backend, we may not have to support this on the js backend.

I just checked: after your PR, https://github.com/nim-lang/Nim/issues/11989 is still broken, i.e. nim js and nim c produce different hashes, including for strings, e.g. echo hash("hi there").

Also, is it possible to add a test that computes hashes at CT and RT with both nim c and nim js and ensures they're the same?

One nasty detail: nim js uses a 32-bit Hash (on all platforms, including 64-bit ones, and including when used in nimvm). We should make it clearer what guarantees we want to have:

  • do we enforce hash compatibility across Nim releases? I really think the answer should be no, to keep performance optimizations possible.
  • do we enforce that 32-bit platforms give the same hash as 64-bit platforms for nim c? For performance reasons, no would make sense, but there are arguments for yes.
  • do we enforce that nim c RT is the same as nim c CT (on a given platform)? Yes; this can be done easily with registerCallback (VM callbacks); a sketch of such a check follows this list.
  • do we enforce that nim js RT is the same as nim js CT, or the same as nim c? That's the main question.
    I think the answer could be "no" until we implement 64-bit integer support (via bigints) on the js backend (see below).
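
For the third point, the CT == RT guarantee could be checked with something as simple as this sketch:

import hashes

const ctHash = hash("hi there")  # computed by the VM at compile time
let rtHash = hash("hi there")    # computed at run time
doAssert ctHash == rtHash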

relevant issues:

possible solution for nim js Hash size

Use a 64-bit Hash on the js backend, via BigInt (https://v8.dev/features/bigint) instead of the current 32-bit hash size, as mentioned in https://stackoverflow.com/questions/9643626/does-javascript-support-64-bit-integers:

Chromium version 57 and later natively supports arbitrary-precision integers. This is called BigInt and is being worked on for other browsers as well. It is dramatically faster than JavaScript implementations.

@Araq added this to the v1 milestone Aug 27, 2019
@dumblob commented Aug 27, 2019

Just a note: currently the overall "best" known tests are to be found in https://github.com/rurban/smhasher (and even better with rurban/smhasher#62). As can be seen there, murmur3 is moderately slow though very portable. Feel free to reconsider the choice.

@timotheecour (Member) replied:

@dumblob => see reply #11581 (comment) where you posted the same question.

@mratsim (Collaborator) commented Aug 28, 2019

What is the use-case for having the same hashes in the VM, in C and in JS?

Is that really necessary? We can just say in the hashes module that the actual hash is implementation-dependent and may differ depending on the execution backend (compile-time, C, C++, JavaScript).

    murmurHash(toOpenArray(aBuf, sPos, ePos))
  when A is byte:
    when nimvm:
      result = hashVmImplByte(aBuf, 0, aBuf.high)
Review comment (Member):

Shouldn't this be hashVmImplByte(aBuf, sPos, ePos)? Ditto below.

Reply (Member):

good point. :-)

@@ -371,8 +388,16 @@ proc hash*[A](aBuf: openArray[A], sPos, ePos: int): Hash =
    let a = [1, 2, 5, 1, 2, 6]
    doAssert hash(a, 0, 1) == hash(a, 3, 4)

  when A is char|int8|byte:
    murmurHash(toOpenArray(aBuf, sPos, ePos))
  when A is byte:
Review comment (Member):

  • what about int8/uint8? At least int8 was handled before this change, IIRC
  • ditto above. Maybe: when sizeof(A) == 1 and A isnot char: ... (sketch below)
    The VM supports casting integers of the same size, so everything could be cast to one type (eg byte) without having to add overloads. Ideally (but out of scope for this PR) there are more things the VM should allow to be cast safely.
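
A rough sketch of that idea (murmurBytes is a hypothetical stand-in for the byte-only overload; the copy into a seq is for clarity only):

import hashes

proc murmurBytes(x: openArray[byte]): Hash = hash(x)  # stand-in body

proc hashSlice[A](aBuf: openArray[A], sPos, ePos: int): Hash =
  when sizeof(A) == 1:
    # same-size integer casts work in the VM too, so char/int8/uint8/byte
    # can all funnel into a single byte overload
    var bytes = newSeq[byte](ePos - sPos + 1)
    for i in 0 .. bytes.high:
      bytes[i] = cast[byte](aBuf[sPos + i])
    result = murmurBytes(bytes)
  else:
    {.error: "hashSlice: only 1-byte element types are supported".}

doAssert hashSlice(['a', 'b', 'c'], 0, 2) == hashSlice([97'u8, 98'u8, 99'u8], 0, 2)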

Reply (Member):

I didn't consider it important enough, but fair enough.

@timotheecour (Member) commented Aug 28, 2019

@mratsim

What is the use-case for having the same hashes in the VM, in C and in JS?

To prevent gotchas like this:

import hashes
import tables

proc hash*(x: int): Hash {.inline.} =
  # deliberately gives different results at compile time vs run time
  when nimvm: result = x * 2
  else: result = x

proc fun2[T](t: T) =
  for k, v in t: echo (k, v)

proc fun(): auto =
  var t = {1: 2}.toTable
  for i in 0 ..< 3: t[i] = 3 * i
  return t

let t1 = fun()    # built at run time
const t2 = fun()  # built at compile time, in the VM
echo t1 == t2 # false
echo "RT"
fun2(t1)
echo "VM"
fun2(t2)

Output:

false
RT
(1, 3)
(0, 0)
(2, 6)
VM
(0, 0)
(1, 3)
(2, 6)

Likewise with nim js instead of nim c. There are other, more problematic cases.

I think we should just guarantee that nim c RT is the same as nim c CT for now, and handle the tricky js part in a future PR, since it never worked anyway (#11989).
IMO nim js CT should be the same as nim c CT; the difficulty is that Hash is int32 for nim js (regardless of RT vs CT), which is exactly the issue raised in #11988.
Using type int = int64 (via BigInt) on nim js should solve this (and other problems).

@Araq merged commit ab48d79 into devel Aug 31, 2019
@Araq deleted the miran-murmur3 branch Aug 31, 2019 22:04
Successfully merging this pull request may close these issues.

regression: hashes makes tables 100x slower on some inputs, eg oids