
hashes: implement murmur3 #12022

Merged
merged 12 commits into devel from miran-murmur3 on Aug 31, 2019

Conversation

@narimiran (Member) commented Aug 24, 2019

fixes #11581

@dom96 (Contributor) left a comment:

Cool. I assume the aim here is to offer a JS/C cross-compatible hashing function?

It would be nice to create a test that verifies that the hashes match on JS and C backends.
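
A minimal sketch of such a test (file name hypothetical; assumes nodejs is available for nim js -r):

# thash_backends.nim -- compile and run with both backends, then diff:
#   nim c -r thash_backends.nim > out_c.txt
#   nim js -r thash_backends.nim > out_js.txt
import hashes

for s in ["", "a", "ab", "hi there", "a somewhat longer test string"]:
  echo hash(s)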

@mratsim (Collaborator) commented Aug 24, 2019

AFAIK it's also to fix the perf regression #11581 introduced by #11203.

Can you rerun the same benchmarks as those in #11203 (comment)?

I think we also need either a tables benchmark or a randomness analysis to make sure we don't reintroduce bucket imbalance in hash tables like with #11203

@narimiran (Member, Author) commented Aug 26, 2019

Can you rerun the same benchmarks as those in #11203 (comment)?

Here are the results of the benchmark proposed in #11581 (direct link to the benchmark), as it tests more things. Three values are reported: time to run the benchmark, entropy, and the maximal number of collisions.

Four different hashing algorithms were tested:

  1. original ("Jenkins one at a time"): the one used originally, before PR #11203 (faster hashing)
  2. current: the one currently used, after PR #11203 (faster hashing)
  3. new ("murmur32"): the one proposed in this PR
  4. murmur64_128: the one proposed in #11581 (regression: hashes makes tables 100x slower on some inputs, eg oids)
test            measure     original   current   new (murmur32)   murmur64_128
string len 4    time           0.219     0.181            0.177          0.193
                entropy       19.136    19.136           19.137         19.136
                max coll.          8         8                8             10
string len 8    time           0.218     0.151            0.188          0.203
                entropy       19.137     9.966           19.137         19.299
                max coll.          8      1000                9              8
string len 12   time           0.204     0.188            0.179          0.239
                entropy       19.137    19.137           19.137         19.136
                max coll.          9         8                9              9
string len 16   time           0.241     0.150            0.194          0.190
                entropy       19.135     9.966           19.298         19.135
                max coll.          9      1000                9              9
string len 24   time           0.252     0.153            0.185          0.222
                entropy       19.136     9.966           19.134         19.137
                max coll.          9      1000                8              8
string len 32   time           0.280     0.159            0.191          0.185
                entropy       19.136     9.958           19.135         19.137
                max coll.          8      2000                9              9
string len 48   time           0.310     0.161            0.200          0.200
                entropy       19.136     9.966           19.136         19.137
                max coll.          8      1000                8              9
string len 64   time           0.347     0.169            0.210          0.201
                entropy       19.137     9.966           19.136         19.297
                max coll.          9      1000                9              8
genOid          time           0.221     0.142            0.153          0.187
                entropy       19.136     7.935           19.138         19.138
                max coll.          8      4096                8              9

proc rotl32(x: uint32, r: int): uint32 {.inline.} =
  (x shl r) or (x shr (32 - r))

proc murmurHash[T: char|int8|byte](x: openArray[T]): Hash =
Review comment (Member):

Is it easy to make this non-generic? Only offer the char or byte variant. Should remove some code bloat.
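
For illustration, a rough sketch of the byte-only shape (murmurBytes below is a stand-in body, not this PR's murmur implementation):

import hashes

proc murmurBytes(x: openArray[byte]): Hash =
  # placeholder mixing, for illustration only
  for b in x: result = result !& int(b)
  result = !$result

proc hashStr(s: string): Hash =
  # call sites convert to bytes once, instead of instantiating a generic
  murmurBytes(toOpenArrayByte(s, 0, s.high))

echo hashStr("abc")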

@timotheecour (Member) commented Aug 26, 2019

@narimiran here are my conclusions:

  • rank 3: current is the worst in practice (because of #11581, which has very bad modes, eg for oids)
  • rank 2: original is the next best
  • rank 1: murmur64_128 and murmur32 are clear winners.
    So the only thing left to compare is murmur64_128 vs murmur32:
    from your table, murmur64_128 performs a bit better than murmur32 for longer strings (>= 32), and a bit worse for shorter strings (< 32).

I've made a more comprehensive comparison of the two contenders here:
timotheecour/vitanim@82b7356

# requires building nim on top of this PR #12022
# murmur64_128
nim c -r -d:danger -d:case_with_murmur $vitanim_D/testcases/tests/t0695.nim
# murmur32
nim c -r -d:danger $vitanim_D/testcases/tests/t0695.nim

It measures performance for several different aspects:

  • tKey: time to compute the key (irrelevant)
  • tHash: time to compute the hash (repeated K1 times for scale).
  • tInsert: time to insert the keys in a table
  • tGet: time to retrieve the value from key in a table (repeated K2 times for scale)

tHash doesn't take into account entropy, but tInsert and tGet would be affected by it.

Here are the results:

# murmur64_128
(key: "genRandOid           ", tKey: "0.0162", tHash: "0.1455", tInsert: "0.0846", tGet: "0.4825")
(key: "getRandomString:1    ", tKey: "0.1025", tHash: "0.0929", tInsert: "0.0707", tGet: "0.0664")
(key: "getRandomString:2    ", tKey: "0.0698", tHash: "0.0903", tInsert: "0.0968", tGet: "0.2243")
(key: "getRandomString:4    ", tKey: "0.0840", tHash: "0.1356", tInsert: "0.1205", tGet: "0.4458")
(key: "getRandomString:8    ", tKey: "0.1102", tHash: "0.1943", tInsert: "0.1669", tGet: "0.5790")
(key: "getRandomString:16   ", tKey: "0.1504", tHash: "0.1539", tInsert: "0.1648", tGet: "0.5852")
(key: "getRandomString:32   ", tKey: "0.1749", tHash: "0.2248", tInsert: "0.2088", tGet: "0.5541")
(key: "getRandomString:64   ", tKey: "0.2609", tHash: "0.2511", tInsert: "0.3105", tGet: "0.5512")
(key: "getRandomString:100  ", tKey: "0.3529", tHash: "0.3173", tInsert: "0.3847", tGet: "0.7094")
(key: "getRandomString:128  ", tKey: "0.4351", tHash: "0.4379", tInsert: "0.4778", tGet: "0.7872")
(key: "getRandomString:256  ", tKey: "0.7388", tHash: "0.6897", tInsert: "0.8344", tGet: "1.1721")
(key: "getRandomString:512  ", tKey: "1.3675", tHash: "1.2351", tInsert: "1.5222", tGet: "1.4350")
(key: "getRandomString:1024 ", tKey: "2.6427", tHash: "2.3164", tInsert: "2.9879", tGet: "2.0832")
(key: "getRandomString:2048 ", tKey: "5.2856", tHash: "4.5949", tInsert: "5.8851", tGet: "2.9954")
# murmur32
(key: "genRandOid           ", tKey: "0.0159", tHash: "0.1136", tInsert: "0.0857", tGet: "0.4915")
(key: "getRandomString:1    ", tKey: "0.1045", tHash: "0.0530", tInsert: "0.0737", tGet: "0.0640")
(key: "getRandomString:2    ", tKey: "0.0699", tHash: "0.0549", tInsert: "0.1045", tGet: "0.2534")
(key: "getRandomString:4    ", tKey: "0.1004", tHash: "0.0642", tInsert: "0.1246", tGet: "0.4488")
(key: "getRandomString:8    ", tKey: "0.1208", tHash: "0.0752", tInsert: "0.1396", tGet: "0.5041")
(key: "getRandomString:16   ", tKey: "0.1343", tHash: "0.1098", tInsert: "0.1534", tGet: "0.4902")
(key: "getRandomString:32   ", tKey: "0.1671", tHash: "0.1830", tInsert: "0.1969", tGet: "0.5264")
(key: "getRandomString:64   ", tKey: "0.2551", tHash: "0.3272", tInsert: "0.3077", tGet: "0.7017")
(key: "getRandomString:100  ", tKey: "0.3405", tHash: "0.4749", tInsert: "0.3918", tGet: "0.7964")
(key: "getRandomString:128  ", tKey: "0.4131", tHash: "0.5774", tInsert: "0.4585", tGet: "1.0032")
(key: "getRandomString:256  ", tKey: "0.7313", tHash: "1.1903", tInsert: "0.8415", tGet: "1.2563")
(key: "getRandomString:512  ", tKey: "1.3581", tHash: "2.4187", tInsert: "1.5216", tGet: "1.5864")
(key: "getRandomString:1024 ", tKey: "2.6210", tHash: "4.8380", tInsert: "3.0145", tGet: "2.7432")
(key: "getRandomString:2048 ", tKey: "5.2682", tHash: "9.6445", tInsert: "5.8773", tGet: "4.0281")

Conclusion: compared to murmur32, murmur64_128 is:

  • 2X faster for hash computation (tHash) for len >= 256
  • 1.3X faster for 64 <= len <= 128
  • 1.3X to 2X slower for len <= 32
  • roughly the same speed for insertion (tInsert)
  • for retrieval (tGet): 1.3X faster for len >= 512; otherwise most of the time a bit faster, sometimes a bit slower
Also note that murmur64_128 has, by definition, better randomness/entropy than murmur32, so it would likely perform better on average in terms of collisions; but that would probably only matter for certain applications / key distributions where murmur32 isn't good enough. I haven't run into such cases yet.

Based on this, this PR is a strict improvement over the current situation, so it's good enough for now in terms of performance; if needed, a future PR can achieve the "best of both worlds" as follows (see the sketch after the list):

  • apply murmur32 for len <= 32
  • apply murmur64 for len > 32
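
A minimal sketch of that dispatch (murmurHash32 and murmurHash64_128 below are stand-ins with hypothetical names, not the real implementations):

import hashes

# stand-in bodies, for illustration only; the real ones would be the
# murmur3_32 and murmur3_64_128 cores discussed in this thread
proc murmurHash32(x: openArray[byte]): Hash = hash(x)
proc murmurHash64_128(x: openArray[byte]): Hash = hash(x)

proc murmurHashBest(x: openArray[byte]): Hash {.inline.} =
  # pick the variant that benchmarked faster for each length range
  if x.len <= 32: murmurHash32(x)
  else: murmurHash64_128(x)

echo murmurHashBest([byte 1, 2, 3])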

@narimiran (Member, Author) replied:

a future PR can achieve "best of both worlds" as follows...

...while taking into account that the solution must be available for:

  1. nimvm (no casting at will; it has to give the same hash as when not in the VM)
  2. JS backend

@timotheecour (Member) commented Aug 26, 2019

nimvm (no casting at your will; it has to give the same hash as when not in VM)

registerCallback can be used for nimvm (and achieves native-code speed); that's what I did in #11767, it works well, and it likely simplifies code. I will revisit this after your PR is merged.

JS backend

Depending on whether we want the js backend to behave the same as the c backend, we may not have to support this on the js backend.

I just checked: after your PR, https://github.com/nim-lang/Nim/issues/11989 is still broken, i.e. nim js and nim c produce different hashes, including for strings, e.g. echo hash("hi there").

Also, is it possible to add a test that computes hashes at CT and RT with both nim c and nim js and ensures they're the same?

One nasty detail: nim js uses a 32-bit Hash (on all platforms, including 64-bit ones, and including when used in nimvm). We should make it clearer what guarantees we want to have:

  • do we enforce hash compatibility across Nim releases? I really think the answer should be no, to keep performance optimizations possible.
  • do we enforce that 32-bit platforms give the same hash as 64-bit platforms for nim c? For performance reasons, no would make sense, but there are arguments for yes.
  • do we enforce that nim c RT is the same as nim c CT (on a given platform)? Yes; this can be done easily with registerCallback (VM callbacks); a sketch of such a check follows this list.
  • do we enforce that nim js RT is the same as nim js CT, or the same as nim c? That's the main question.
    I think the answer could be "no" until we implement 64-bit integer support (via bigints) on the js backend (see below).
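
For the third point, the CT == RT guarantee could be checked with something as simple as this sketch:

import hashes

const ctHash = hash("hi there")  # computed by the VM at compile time
let rtHash = hash("hi there")    # computed at run time
doAssert ctHash == rtHash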

relevant issues:

possible solution for nim js Hash size

Use a 64-bit Hash on the js backend, via BigInt (https://v8.dev/features/bigint) instead of the current 32-bit hash size, as mentioned in https://stackoverflow.com/questions/9643626/does-javascript-support-64-bit-integers:

Chromium version 57 and later natively supports arbitrary-precision integers. This is called BigInt and is being worked on for other browsers as well. It is dramatically faster than JavaScript implementations.

@Araq added this to the v1 milestone Aug 27, 2019
@dumblob commented Aug 27, 2019

Just a note: currently the overall "best" known tests are to be found in https://github.com/rurban/smhasher (and even better with rurban/smhasher#62). As can be seen there, murmur3 is moderately slow though very portable. Feel free to reconsider the choice.

@timotheecour (Member) replied:

@dumblob => see reply #11581 (comment) where you posted the same question.

@mratsim (Collaborator) commented Aug 28, 2019

What is the use-case for having the same hashes in the VM, in C and in JS?

Is that really necessary? We can just say in the hashes module that the actual hash is implementation-dependent and may differ depending on the execution backend (compile-time, C, C++, JavaScript).

    murmurHash(toOpenArray(aBuf, sPos, ePos))
  when A is byte:
    when nimvm:
      result = hashVmImplByte(aBuf, 0, aBuf.high)
Review comment (Member):

Shouldn't this be hashVmImplByte(aBuf, sPos, ePos)? Ditto below.

Reply (Member):

good point. :-)

@@ -371,8 +388,16 @@ proc hash*[A](aBuf: openArray[A], sPos, ePos: int): Hash =
    let a = [1, 2, 5, 1, 2, 6]
    doAssert hash(a, 0, 1) == hash(a, 3, 4)

  when A is char|int8|byte:
    murmurHash(toOpenArray(aBuf, sPos, ePos))
  when A is byte:
Review comment (Member):

  • what about int8/uint8? At least int8 was handled before this change, IIRC
  • ditto above. Maybe: when sizeof(A) == 1 and A isnot char: ... (sketch below)
    The VM supports casting integers of the same size, so everything could be cast to one type (eg byte) without having to add overloads. Ideally (but out of scope for this PR) there are more things the VM should allow to be cast safely.
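
A rough sketch of that idea (murmurBytes is a hypothetical stand-in for the byte-only overload; the copy into a seq is for clarity only):

import hashes

proc murmurBytes(x: openArray[byte]): Hash = hash(x)  # stand-in body

proc hashSlice[A](aBuf: openArray[A], sPos, ePos: int): Hash =
  when sizeof(A) == 1:
    # same-size integer casts work in the VM too, so char/int8/uint8/byte
    # can all funnel into a single byte overload
    var bytes = newSeq[byte](ePos - sPos + 1)
    for i in 0 .. bytes.high:
      bytes[i] = cast[byte](aBuf[sPos + i])
    result = murmurBytes(bytes)
  else:
    {.error: "hashSlice: only 1-byte element types are supported".}

doAssert hashSlice(['a', 'b', 'c'], 0, 2) == hashSlice([97'u8, 98'u8, 99'u8], 0, 2)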

Reply (Member):

I didn't consider it important enough, but fair enough.

@timotheecour (Member) commented Aug 28, 2019

@mratsim

What is the use-case for having the same hashes in the VM, in C and in JS?

To prevent gotchas like this:

import hashes
import tables

proc hash*(x: int): Hash {.inline.} =
  # deliberately gives different results at compile time vs run time
  when nimvm: result = x * 2
  else: result = x

proc fun2[T](t: T) =
  for k, v in t: echo (k, v)

proc fun(): auto =
  var t = {1: 2}.toTable
  for i in 0 ..< 3: t[i] = 3 * i
  return t

let t1 = fun()    # built at run time
const t2 = fun()  # built at compile time, in the VM
echo t1 == t2 # false
echo "RT"
fun2(t1)
echo "VM"
fun2(t2)

Output:

false
RT
(1, 3)
(0, 0)
(2, 6)
VM
(0, 0)
(1, 3)
(2, 6)

Likewise with nim js instead of nim c. There are other, more problematic cases.

I think we should just guarantee that nim c RT is the same as nim c CT for now, and handle the tricky js part in a future PR, since it never worked anyway (#11989).
IMO nim js CT should be the same as nim c CT; the difficulty is that Hash is int32 for nim js (regardless of RT vs CT), which is exactly the issue raised in #11988.
Using type int = int64 (via BigInt) on nim js should solve this (and other problems).

@Araq merged commit ab48d79 into devel Aug 31, 2019
@Araq deleted the miran-murmur3 branch Aug 31, 2019 22:04
Successfully merging this pull request may close these issues.

regression: hashes makes tables 100x slower on some inputs, eg oids