Switch to a different murmurhash implementation to handle Unicode characters #549
Conversation
Looks like some tests are timing out.
Hi @jtpio! How come the library is hosted on your account? Was it unavailable as a distribution by the original authors? Do you have a suggestion for how we should do this? We could bring the file into this repo, maybe?
It's a modified version of the original to handle unicode. Also the one published to npm is not from the original authors but from a fork.
Yes I think so too. From the comment above:
I missed that. That seems like a good idea to me. Or even including it as a local file we do not export right in the debugger package for now.
I think we can vendor it in
Well basically if no one else uses it right now, I'd prefer to keep the API surface area of
Yeah at first it felt like it should be "out there", since there doesn't seem to be a variation of murmurhash2 that handles this case on npm. But I agree that we can lower the maintenance burden by not exposing it and keep it private to the debugger package.
ok, so tests are also failing in #550
Tests fixed in #553
Awesome, thank you @jtpio!
Fixes #538.
This change switches to using the murmurhash2 implementation from: https://github.com/jtpio/murmurhash2
Before
After
The issue
Here is a report after digging into this and looking for different solutions. Posting a bit of information here so it's more convenient to refer to it in the future.
In #538, we noticed a problem with non-ASCII characters such as `€` in a code cell, which could be reproduced on the stable Binder (using `0.3.3`).
It was because the MurmurHash2 implementation in JS and the one used in `xeus-python` would not give the same result for the same input and seed.
`@jupyterlab/debugger` was using this implementation: https://github.com/mikolalysenko/murmurhash-js/blob/f19136e9f9c17f8cddc216ca3d44ec7c5c502f60/murmurhash2_gc.js#L14-L50, which corresponds to the version published on npm: https://www.npmjs.com/package/murmurhash-js
`xeus-python` is using this implementation: https://github.com/xtensor-stack/xtl/blob/7d41f768787ee6405e3f1b1056e04f6fbd43f8cd/include/xtl/xhash.hpp#L47
Which is based on the original in C++: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp
Using this code snippet:
With `3339675911` as the seed.
Would give:
- `murmurhash-js`: `3511777296`
- `xtl`: `1079235191`
(more info in #538)
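
A small sketch of why the two results can diverge for `€`. It assumes the ASCII-oriented JS port reads each character with `charCodeAt` and keeps only the low byte, and that the C++ side hashes the UTF-8 bytes of the string. Both helper names are illustrative and do not come from either library.

```typescript
// Illustrative helper: what an ASCII-oriented port sees when it masks
// each UTF-16 code unit down to a single byte (assumption about the JS port).
function maskedCodeUnits(text: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < text.length; i++) {
    bytes.push(text.charCodeAt(i) & 0xff);
  }
  return bytes;
}

// Illustrative helper: the UTF-8 bytes of the same string, which is what
// a byte-oriented hash would receive (assumed here for the C++ side).
function utf8Bytes(text: string): number[] {
  const encoded = new TextEncoder().encode(text);
  const bytes: number[] = [];
  for (let i = 0; i < encoded.length; i++) {
    bytes.push(encoded[i]);
  }
  return bytes;
}

// For plain ASCII, both views contain the same values: 97, 98, 99.
console.log(maskedCodeUnits("abc"), utf8Bytes("abc"));

// For "€" (U+20AC) they differ in both length and content.
console.log(maskedCodeUnits("€")); // one value: 172
console.log(utf8Bytes("€")); // three values: 226, 130, 172
```

Since the hash mixes in both the input length and every byte, any input containing such a character would be expected to hash differently on each side under these assumptions.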
Summary:
From here two options:
WebAssembly
This sounds attractive because it would mean using the same code on both the backend and the frontend.
Reviving the initial work from xtensor-stack/xtl#171 led to being able to use the `murmur2_x86` function in a node repl.
However, it started to feel a bit complicated:
- Shipping `.wasm` files for the web as part of an npm package is a bit complicated, at least as of today, but it might improve over time
- One way is to `base64` encode it, and create a new `Uint8Array` to be passed to `WebAssembly.instantiate` in a `setup` hook / function or equivalent
- Compiling `xtl/hash` to wasm with `emscripten` gives output in the order of ~100KB, which is quite a lot for just one function
- `emscripten` / wasm tooling requires extra setup
Fix the existing JS implementation
The implementation from `murmurhash-js` states it is for ASCII strings only: https://github.com/garycourt/murmurhash-js/blob/0197ce38bedac0e05f40b9d7152095d06db8292c/murmurhash2_gc.js#L9
Instead, we can convert the string to a `Uint8Array` via a `TextEncoder`, and perform the operations on the `Uint8Array` just like the original algorithm and the one used in `xeus-python`.
This is implemented in https://github.com/jtpio/murmurhash2
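
As a rough illustration of that approach (not the code from the linked repository), here is a minimal sketch of 32-bit MurmurHash2 run over the UTF-8 bytes of a string. The names `murmur2Utf8` and `mul32` are hypothetical, and the sketch assumes the reference algorithm with little-endian word reads.

```typescript
// Multiplier constant from the reference MurmurHash2.
const M = 0x5bd1e995;

// Multiply modulo 2^32, split into 16-bit halves so every intermediate
// value stays exactly representable as a double.
function mul32(a: number, b: number): number {
  const low = a & 0xffff;
  const high = a >>> 16;
  return (low * b + (((high * b) & 0xffff) << 16)) >>> 0;
}

// Hypothetical helper: hash the UTF-8 bytes of `text` with the given seed.
function murmur2Utf8(text: string, seed: number): number {
  const data = new TextEncoder().encode(text);
  let remaining = data.length;
  let h = (seed ^ remaining) >>> 0;
  let i = 0;

  // Body: mix four bytes at a time, read as a little-endian 32-bit word.
  while (remaining >= 4) {
    let k =
      data[i] | (data[i + 1] << 8) | (data[i + 2] << 16) | (data[i + 3] << 24);
    k = mul32(k, M);
    k ^= k >>> 24;
    k = mul32(k, M);
    h = mul32(h, M) ^ k;
    i += 4;
    remaining -= 4;
  }

  // Tail: fold in the last one to three bytes, if any.
  if (remaining === 3) {
    h ^= data[i + 2] << 16;
  }
  if (remaining >= 2) {
    h ^= data[i + 1] << 8;
  }
  if (remaining >= 1) {
    h ^= data[i];
    h = mul32(h, M);
  }

  // Final avalanche, then force the result into the unsigned 32-bit range.
  h ^= h >>> 13;
  h = mul32(h, M);
  h ^= h >>> 15;
  return h >>> 0;
}

// Seed taken from the description above; the input string is only an example.
console.log(murmur2Utf8("price = '€'", 3339675911));
```

The manual `mul32` helper avoids depending on `Math.imul`; where ES2015 is available, `Math.imul(a, b) >>> 0` would be the shorter choice.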
Additional thoughts
Instead of having yet another `murmurhash2` package on npm, one way forward could be to move the implementation to `@jupyterlab/coreutils`.
The `0.3.x` releases of the debugger extension (this repo) could still depend on `murmur2`.