Switch to a different murmurhash implementation to handle Unicode characters #549

jtpio · 2020-10-08T20:51:02Z

Fixes #538.

This change switches to using the murmurhash2 implementation from: https://github.com/jtpio/murmurhash2

Before: [screenshot]

After: [screenshot]

The issue

Here is a report after digging into this and looking for different solutions. Posting a bit of information here so it's more convenient to refer to it in the future.

In #538, we noticed an issue with non-ASCII characters such as € in a code cell, which could be reproduced on the stable Binder (using 0.3.3).

The cause was that the MurmurHash2 implementation in JS and the one used in xeus-python did not give the same result for the same input and seed.

@jupyterlab/debugger was using this implementation: https://github.com/mikolalysenko/murmurhash-js/blob/f19136e9f9c17f8cddc216ca3d44ec7c5c502f60/murmurhash2_gc.js#L14-L50, which corresponds to the version published on npm: https://www.npmjs.com/package/murmurhash-js

xeus-python is using this implementation: https://github.com/xtensor-stack/xtl/blob/7d41f768787ee6405e3f1b1056e04f6fbd43f8cd/include/xtl/xhash.hpp#L47, which is based on the original in C++: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp

Hashing this code snippet:

print("€")

with 3339675911 as the seed gives:

murmurhash-js: 3511777296
xtl: 1079235191

(more info in #538)
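To make the mismatch more concrete, here is a small sketch of what each side may be looking at for that snippet. It assumes the JS implementation reads UTF-16 code units (via `charCodeAt`) while the C++ side hashes the UTF-8 bytes; the variable names are only for illustration.

```typescript
// The snippet from above, as a JS string.
const snippet = 'print("€")';

// UTF-16 code units, which is what charCodeAt exposes.
const charCodes: number[] = Array.from(snippet, c => c.charCodeAt(0));

// UTF-8 bytes, which is what a C++ implementation hashing a byte buffer sees.
const utf8Bytes: number[] = Array.from(new TextEncoder().encode(snippet));

// "€" is a single code unit (0x20AC) but three UTF-8 bytes (0xE2 0x82 0xAC),
// so both the values and the lengths of the two sequences differ.
console.log(charCodes.length, utf8Bytes.length);
console.log(charCodes[7], utf8Bytes.slice(7, 10));
```

For pure ASCII input both sequences are identical, which would explain why the problem only shows up with characters like €.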

Summary:

From here, there are two options:

WebAssembly

This sounds attractive because it would mean using the same code on both the backend and the frontend.
Reviving the initial work from xtensor-stack/xtl#171 led to being able to use the murmur2_x86 function in a node repl.

However, it started to feel a bit complicated because:

- instantiating WebAssembly on the web seems to be async only as of today. This would have to be taken into consideration when importing and loading the library, so the methods are only called once the instance is ready.
- packaging .wasm files for the web as part of an npm package is a bit complicated, at least as of today, but it might improve over time.
- one option for packaging the wasm as JS is to base64 encode it, and create a new Uint8Array to be passed to WebAssembly.instantiate in a setup hook / function or equivalent.
- the bundle size from compiling xtl/hash to wasm with emscripten is on the order of ~100KB, which is quite a lot for just one function.
- the emscripten / wasm tooling requires extra setup.
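For reference, the base64 packaging idea could look roughly like the sketch below. This is not what was built from xtensor-stack/xtl#171: `WASM_BASE64` here holds only an empty placeholder module (the 8-byte wasm header), and `decodeBase64` / `loadHashModule` are hypothetical helper names. It mostly illustrates that callers have to wait on a promise before any exported function can be used.

```typescript
// Placeholder: base64 of an empty wasm module (magic number + version).
// A real build would inline the compiled hash module here instead.
const WASM_BASE64 = 'AGFzbQEAAAA=';

// Decode a base64 string into the raw bytes of the module.
function decodeBase64(b64: string) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Cache the promise so the module is only instantiated once.
let ready: Promise<WebAssembly.Instance> | null = null;

// Hypothetical setup hook: resolves when the instance can be used.
function loadHashModule(): Promise<WebAssembly.Instance> {
  if (ready === null) {
    ready = WebAssembly.instantiate(decodeBase64(WASM_BASE64)).then(
      result => result.instance
    );
  }
  return ready;
}
```

Every consumer would need to `await loadHashModule()` before hashing, which is the async constraint mentioned in the first point.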

Fix the existing JS implementation

The implementation from murmurhash-js states it is for ascii strings only: https://github.com/garycourt/murmurhash-js/blob/0197ce38bedac0e05f40b9d7152095d06db8292c/murmurhash2_gc.js#L9

Instead, we can convert the string to a Uint8Array via a TextEncoder, and perform the operations on the Uint8Array just like the original algorithm and the one used in xeus-python.
This is implemented in https://github.com/jtpio/murmurhash2
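As a rough illustration of the approach (not the code from the linked repository), a 32-bit MurmurHash2 over UTF-8 bytes could be sketched as follows. The function name `murmur2` and its signature are assumptions made for this example; the constants and mixing steps follow the original C++ MurmurHash2.

```typescript
// Sketch of 32-bit MurmurHash2 operating on the UTF-8 bytes of a string.
function murmur2(input: string, seed: number): number {
  const data = new TextEncoder().encode(input);
  const m = 0x5bd1e995;
  const r = 24;

  let len = data.length;
  let h = seed ^ len;
  let i = 0;

  // Mix 4 bytes at a time, read as a little-endian 32-bit integer.
  while (len >= 4) {
    let k =
      data[i] | (data[i + 1] << 8) | (data[i + 2] << 16) | (data[i + 3] << 24);
    k = Math.imul(k, m);
    k ^= k >>> r;
    k = Math.imul(k, m);
    h = Math.imul(h, m) ^ k;
    i += 4;
    len -= 4;
  }

  // Handle the last 1 to 3 bytes of the input.
  if (len >= 3) {
    h ^= data[i + 2] << 16;
  }
  if (len >= 2) {
    h ^= data[i + 1] << 8;
  }
  if (len >= 1) {
    h ^= data[i];
    h = Math.imul(h, m);
  }

  // Final avalanche, then convert to an unsigned 32-bit value.
  h ^= h >>> 13;
  h = Math.imul(h, m);
  h ^= h >>> 15;
  return h >>> 0;
}
```

`Math.imul` keeps the multiplications in 32-bit integer arithmetic, which plain `*` on JS numbers would not. Hashing `print("€")` with seed 3339675911 and comparing against the xtl value quoted above (1079235191) is a reasonable sanity check for a sketch like this; that result is not asserted here.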

Additional thoughts

Instead of having yet another murmurhash2 package on npm, one way forward could be to move the implementation to @jupyterlab/coreutils instead.

The 0.3.x releases of the debugger extension (this repo) could still depend on murmur2.

jtpio · 2020-10-09T16:31:47Z

Looks like some tests are timing out.

afshin · 2020-10-09T16:47:27Z

Hi @jtpio! How come the library is hosted on your account? Was it unavailable as a distribution by the original authors? Do you have a suggestion for how we should do this? We could bring the file into this repo, maybe?

jtpio · 2020-10-09T16:53:12Z

Was it unavailable as a distribution by the original authors?

It's a modified version of the original to handle unicode. Also the one published to npm is not from the original authors but from a fork.

We could bring the file into this repo, maybe

Yes I think so too. From the comment above:

Instead of having yet another murmurhash2 package on npm, one way forward could be to move the implementation to @jupyterlab/coreutils instead.

afshin · 2020-10-09T16:59:39Z

I missed that. That seems like a good idea to me. Or even including it as a local file we do not export right in the debugger package for now.

jtpio · 2020-10-09T17:00:58Z

Or even including it as a local file we do not export right in the debugger package for now.

I think we can vendor it in @jupyterlab/debugger@0.3.x yes, if we then add it to @jupyterlab/coreutils in core for 3.0.

afshin · 2020-10-09T17:10:08Z

Well basically if no one else uses it right now, I'd prefer to keep the API surface area of @jupyterlab/coreutils smaller rather than larger. What do you think?

jtpio · 2020-10-09T17:17:01Z

Yeah at first it felt like it should be "out there", since there doesn't seem to be a variation of murmurhash2 that handles this case on npm.

But I agree that we can lower the maintenance burden by not exposing it and keep it private to the debugger package.

jtpio · 2020-10-09T19:23:48Z

ok, so tests are also failing in #550

jtpio · 2020-10-12T11:37:06Z

Tests fixed in #553

afshin

Awesome, thank you @jtpio!

jtpio added 4 commits October 12, 2020 13:31

Switch to murmurhash2 to handle Unicode characters

25924f5

Add custom jest env

225d371

Update xeus-python to the latest

a85011d

Vendor murmurhash2

cdb8d6f

jtpio force-pushed the murmur-unicode branch from dd3852d to cdb8d6f Compare October 12, 2020 11:32

afshin approved these changes Oct 12, 2020

afshin merged commit c8686e7 into jupyterlab:master Oct 12, 2020

jtpio deleted the murmur-unicode branch October 12, 2020 12:57

jtpio mentioned this pull request Oct 13, 2020

Switch to a different murmurhash2 implementation to handle unicode characters jupyterlab/jupyterlab#9158

Merged

jtpio mentioned this pull request Nov 14, 2022

Add announcements jupyterlab/jupyterlab#13365

Merged
