Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Murmurhash doesn't equal known murmurhashes. #10

Open
guloic opened this issue Feb 27, 2021 · 11 comments
Open

Murmurhash doesn't equal known murmurhashes. #10

guloic opened this issue Feb 27, 2021 · 11 comments

Comments

@guloic
Copy link

guloic commented Feb 27, 2021

Environment

System: Windows 10, 64 bit.
Nodejs Version: 12.13.0

Steps to Reproduce

Then run the following code -

var murmurhash = require('./murmurhash');

murmurhash.v3('foo');
// prints 4138058784

murmurhash.v2('foo');
// prints 2414502773

Expected

One of those outputs needs to be equal to -156908512 which is the known murmurhash for foo

Note
I observed this behaviour with Gary's original work too. Please refer to this issue.

@perezd
Copy link
Owner

perezd commented Feb 27, 2021

Yeah, makes sense, this is Gary's code unchanged. Do you have a patch?

@guloic
Copy link
Author

guloic commented Feb 27, 2021

I'm checking why it behaves differently. I tried this python library and it outputs correct hashes

import pymmh3 as mmh3

mmh3.hash('foo')
# prints -156908512

@guloic
Copy link
Author

guloic commented Feb 27, 2021

Okay. I have a patch.

Patch

diff --git a/murmurhash.js b/murmurhash.js
index 6a73210..a4bbc73 100644
--- a/murmurhash.js
+++ b/murmurhash.js
@@ -117,7 +117,14 @@
     h1 = ((((h1 & 0xffff) * 0xc2b2ae35) + ((((h1 >>> 16) * 0xc2b2ae35) & 0xffff) << 16))) & 0xffffffff;
     h1 ^= h1 >>> 16;

-    return h1 >>> 0;
+    h1 = h1 >>> 0;
+
+    if (h1 & 0x80000000 == 0) {
+      return h1;
+    }
+    else {
+      return -( (h1 ^ 0xFFFFFFFF) + 1 )
+    }
   }

   var murmur = MurmurHashV3;

New Output

var mmh3 = require('./murmurhash')

mmh3.v3('foo')
// prints -156908512

This output matches the output of python code.

I can raise a pull request.

guloic added a commit to guloic/node-murmurhash that referenced this issue Feb 27, 2021
guloic added a commit to guloic/node-murmurhash that referenced this issue Feb 27, 2021
@guloic
Copy link
Author

guloic commented Feb 27, 2021

Created PR.

@guloic
Copy link
Author

guloic commented Feb 27, 2021

Actually, it's not really an issue.

node-murmurhash is producing unsigned hashes by default, while the python library is producing signed hashes by default.

Instead of a permanent code change, maybe we can drive this behaviour through a flag passed in argument?

@perezd
Copy link
Owner

perezd commented Feb 27, 2021

Yeah I was thinking that too, this would likely break hashes otherwise, lets flag control it.

@polarathene
Copy link

polarathene commented Apr 30, 2021

I would argue that this lib using unsigned integers as output is more valid. If you compare to the outputs from other hash functions or implementations of murmur3 beyond the referenced python one, in my experience they tend to be unsigned. If converting the number to a string to represent the hash, a quick way is toString(radix), for negative values that adds a - which is unnecessary.

const bit_shifts = [24, 16, 8, 0]
function to_bytes(input) {
  let str = ""
  bit_shifts.forEach(bits => {
    str += `${((input >> bits) & 0xff)} `
  })

  return str
}

let a = -1621654532
let b = 2673312764
console.log(a.toString(36)); // "-qthon8"
console.log(b.toString(36)); // "187mdbw"

// Bitwise operations will convert to uint32 regardless
console.log(to_bytes(a)) // "159 87 131 252 "
console.log(to_bytes(b)) // "159 87 131 252 "

// or if you had converted the separate bytes into base36
console.log([
  (159).toString(36),
  (87).toString(36),
  (131).toString(36),
  (252).toString(36)
].join("")) // "4f2f3n70"

A will become B just by doing -1621654532 >>> 0. It is probably better to just be clear that the expected output is u32, while other implementations such as the python one are i32, but regardless of output they are equivalent, they just require outputs being normalized to one or the other if you're going to be running mixed implementations in a way that it causes problems.

@ricardomaia
Copy link

Hi @perezd... I would like to know when you intend to merge the @guloic PR and what would be the flag needed to produce signed hashes. Thanks!

@perezd
Copy link
Owner

perezd commented May 30, 2022

we'd probably want a new constructor for signed hashes, to ensure the existing behavior is preserved. Feel free to send me an updated PR (this one is pretty old) which a proposed signed constructor + tests, I'll merge + publish.

@ricardomaia
Copy link

Unfortunately I will have to think of another approach. The javascript codes are also incompatible with the encoding defined by RFC-2045, used to perform character encoding of the aforementioned Python library. I need to apply the hash algorithm to images and icons.

@Dnkfff
Copy link
Contributor

Dnkfff commented Jan 19, 2024

I honestly don't think it should be opened for time being right now, in case @guloic is reccomending to change everything to python

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants