Skip to content
This repository has been archived by the owner on Jan 26, 2022. It is now read-only.

Switch to a different murmurhash implementation to handle Unicode characters #549

Merged
merged 4 commits into from
Oct 12, 2020

Conversation

jtpio
Copy link
Member

@jtpio jtpio commented Oct 8, 2020

Fixes #538.

This change switches to using the murmurhash2 implementation from: https://github.com/jtpio/murmurhash2

Before

unicode-before

After

unicode-after

The issue

Here is a report after digging into this and looking for different solutions. Posting a bit of information here so it's more convenient to refer to it in the future.

In #538, we noticed that having non-ascii characters such as in a code cell could be reproduced on the stable Binder (using 0.3.3).

It was because the MurmurHash2 implementation in JS and the one used in xeus-python would not give the same result for the same input and seed.

@jupyterlab/debugger was using this implementation: https://github.com/mikolalysenko/murmurhash-js/blob/f19136e9f9c17f8cddc216ca3d44ec7c5c502f60/murmurhash2_gc.js#L14-L50, which corresponds to the version published on npm: https://www.npmjs.com/package/murmurhash-js

xeus-python is using this implementation: https://github.com/xtensor-stack/xtl/blob/7d41f768787ee6405e3f1b1056e04f6fbd43f8cd/include/xtl/xhash.hpp#L47
Which is based on the original in C++: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp

Using this code snippet:

print("€")

With 3339675911 as the seed.

Would give:

  • murmurhash-js: 3511777296
  • xtl: 1079235191

(more info in #538)

Summary:

image

From here two options:

WebAssemly

This sounds attractive because it would mean using the same code on both the backend and the frontend.
Reviving the initial work from xtensor-stack/xtl#171 led to being able to use the murmur2_x86 function in a node repl.

However it started to feel a bit complicated when:

  • instantiating WebAssembly on the web seems to be async only as of today. This would have to be taken into consideration when importing and loading the library, so the methods can be called when the instance is ready.
  • packaging .wasm files for the web as part of an npm package is a bit complicated. At least as of today, but it might improve over time
  • one option for packaging the wasm as js is to base64 encode it, and create a new Uint8Array to be passed to WebAssembly.initiate in a setup hook / function or equivalent
  • the bundle size from compiling xtl/hash to wasm with emscripten is in the order of ~100KB, which is quite a lot for just one function
  • the emscripten / wasm tooling requires extra setup

Fix the existing JS implementation

The implementation from murmurhash-js states it is for ascii strings only: https://github.com/garycourt/murmurhash-js/blob/0197ce38bedac0e05f40b9d7152095d06db8292c/murmurhash2_gc.js#L9

Instead, we can convert the string to a Uint8Array via a TextEncoder, and perform the operations on the Uint8Array just like the original algorithm and the one used in xeus-python.
This is implemented in https://github.com/jtpio/murmurhash2

Additional thoughts

Instead of having yet another murmurhash2 package on npm, one way forward could be to move the implementation to @jupyterlab/coreutils instead.

The 0.3.x releases of the debugger extension (this repo) could still depend on murmur2.

@jtpio
Copy link
Member Author

jtpio commented Oct 9, 2020

Looks like some tests are timing out.

@afshin
Copy link
Member

afshin commented Oct 9, 2020

Hi @jtpio! How come the library is hosted on your account? Was it unavailable as a distribution by the original authors? Do you have a suggestion for how we should do this? We could bring the file into this repo, maybe?

@jtpio
Copy link
Member Author

jtpio commented Oct 9, 2020

Was it unavailable as a distribution by the original authors?

It's a modified version of the original to handle unicode. Also the one published to npm is not from the original authors but from a fork.

We could bring the file into this repo, maybe

Yes I think so too. From the comment above:

Instead of having yet another murmurhash2 package on npm, one way forward could be to move the implementation to @jupyterlab/coreutils instead.

@afshin
Copy link
Member

afshin commented Oct 9, 2020

I missed that. That seems like a good idea to me. Or even including it as a local file we do not export right in the debugger package for now.

@jtpio
Copy link
Member Author

jtpio commented Oct 9, 2020

Or even including it as a local file we do not export right in the debugger package for now.

I think we can vendor it in @jupyterlab/debugger@0.3.x yes, if we then add it to @jupyterlab/coreutils in core for 3.0.

@afshin
Copy link
Member

afshin commented Oct 9, 2020

Well basically if no one else uses it right now, I'd prefer to keep the API surface area of @jupyterlab/coreutils smaller rather than larger. What do you think?

@jtpio
Copy link
Member Author

jtpio commented Oct 9, 2020

Yeah at first it felt like it should be "out there", since there doesn't seem to be a variation of murmurhash2 that handles this case on npm.

But I agree that we can lower the maintenance burden by not exposing it and keep it private to the debugger package.

@jtpio
Copy link
Member Author

jtpio commented Oct 9, 2020

ok, so tests are also failing in #550

@jtpio
Copy link
Member Author

jtpio commented Oct 12, 2020

Tests fixed in #553

Copy link
Member

@afshin afshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you @jtpio!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gui error when using unicode symbols and setting a breakpoint
2 participants