Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default murmurhash seed and/or expose as option? #12

Closed
boydgreenfield opened this issue Oct 7, 2015 · 5 comments
Closed

Change default murmurhash seed and/or expose as option? #12

boydgreenfield opened this issue Oct 7, 2015 · 5 comments

Comments

@boydgreenfield
Copy link

While I appreciate the Douglas Adams tribute, I wanted to raise the possibility of (1) changing the default murmurhash seed value to 0 (default in many bindings) and/or (2) exposing it as a command line flag.

The former would be useful for compatibility (both cross lang/lib, and with existing k-mer hashing code that uses murmurhash3). The latter may also be useful in facilitating easier simulation of many pseudorandom minmer subsets from the same file, and for interop with code that doesn't use a seed of 42.

Obviously (1) is a highly breaking change and therefor good to discuss and pin down prior to a preprint coming out. Thoughts on the marbl side?

@boydgreenfield
Copy link
Author

Hi – quickly following up here. @ondovb + @aphillippy any thoughts on this?

A 3rd, 100% backwards compatible option is to just store the seed in the Cap'n Proto file so at the very least it's explicit vs. an undocumented implementation detail.

And hope this isn't annoying to re-raise – just trying to ensure that interop with Mash and other sketches will be simple in the future.

@ondovb
Copy link
Member

ondovb commented Oct 22, 2015

Option 3 would be non-invasive but still pave the way for the other options in the future; I don't see any reason not add that in a future release, but the first release is pretty much locked down. Adam?

@aphillippy
Copy link
Member

The problem goes beyond the seed. Really you'd want to store a full description or identifier of the whole hash function (seed included). E.g. there's no guarantee Mash will forever use murmur3hash. So I'm in favor of exposing alternate hash functions as advanced options, but we need to think how to capture that in the sketch in a general way. Maybe pick a magic string that gets hashed and the resulting hash value is the identifier for the hash function?

@ctb
Copy link

ctb commented May 20, 2016

This is important for #27! I like the magic string approach as a verification, in addition to having something like a simple documented table of hash functions. And yes, I had to dig into the source code to figure out why my hash functions were returning different values :).

@ondovb
Copy link
Member

ondovb commented Sep 23, 2017

The seed is now store and parameterized in v2.0. As mentioned in the release notes, there is a slight caveat here in that we had to ensure old versions would hard fail with a non-legacy seed, as they were not equipped to query it. Also for this reason, we kept the default as the legacy seed (42) for now to ensure maximum cross-version compatibility for typical users. This may change in the future as adoption of seed-aware versions increases.

@ondovb ondovb closed this as completed Sep 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants