
Exporting mash hashes for interoperability #27

Open
ctb opened this Issue May 20, 2016 · 51 comments

ctb commented May 20, 2016

Hi all,

first some background:

it seems like sourmash is going to be a thing; I'm building it into a metagenomics data exploration tool, and it's already integrated into https://github.com/dib-lab/khmer/ in some interesting and useful ways. Before it becomes too much of a thing, I'm interested in harmonizing with what you've done with mash, both out of gratitude and because it'd be kind of stupid to have multiple different MinHash implementations out there - interoperability would be really handy!

So, on the topic of interop, I poked around under the hood of mash, and am happy to report that I can swizzle sourmash over to use your exact hash function and seed; I will do so forthwith.

It seems like it would be relatively simple for me to write a parser for your .msh files, but that would depend on capnproto, I think. It seems like it would be better to be part of mash. So, what do you think about a 'dump' command for sketches? This would be an explicit "data transfer" format that we could use to transition sketches between MinHash software implementations. I'd guess that something quite minimal (uniquely identified hash function + seed, k size, identifier, and hashes, all in a CSV file) would work. In our 'signature' files we also include an md5sum of the hashes.

If this is not antithetical to the very principles on which mash was founded, then great! Let me know! And I'm happy to whip up a prototype and submit a pull request - I was thinking of adding a new command, 'mash dump'. Alternate ideas very welcome.

cc @luizirber

ctb changed the title from "a 'mash dump' command?" to "Exporting mash hashes for interoperability" on May 20, 2016

aphillippy commented May 20, 2016

Hey Titus,
Glad you're finding Mash useful. Happy to help support interop however we can. Have you looked into capnproto tools (like pycapnp) for decoding the serialized format? E.g.

@aphillippy @CapnProto I successfully decoded a sketch using pycapnp. Only needed 3 lines of code, brilliant. Gonna try encode message too.

— Alex Jironkin (@biocomputerist) January 19, 2016

Should be pretty easy to slurp in the current Mash serialized format, which is defined here:
https://github.com/marbl/Mash/blob/master/src/mash/capnp/MinHash.capnp

Are you opposed to the extra capnp dependency this requires? What other format would you suggest?


aphillippy commented May 20, 2016

Looking for something binary or text based? Propose anything reasonable and should be pretty easy to add.

ctb commented May 21, 2016

Something quite minimal (uniquely identified hash function + seed, k size, identifier, and hashes, all in a CSV file) would work. In our 'signature' files we also include an md5sum of the hashes.

I'll put together a Python prototype for comments and get back to you.
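The minimal CSV format proposed above could look something like the following sketch. The column order, hash function label, and helper names here are illustrative only, not an agreed spec; it simply shows one row per sketch carrying the hash function name, seed, k size, identifier, an md5sum of the hashes (as in sourmash signature files), and the hashes themselves.

```python
# Hypothetical sketch of the minimal CSV "dump" transfer format described
# above. Field layout is illustrative, not an agreed specification.
import csv
import hashlib
import io


def dump_sketch(fh, hash_name, seed, ksize, name, hashes):
    """Write one sketch as a single CSV row; hashes are space-separated."""
    hashes = sorted(hashes)
    joined = " ".join(str(h) for h in hashes)
    md5 = hashlib.md5(joined.encode()).hexdigest()
    csv.writer(fh).writerow([hash_name, seed, ksize, name, md5, joined])


def load_sketch(fh):
    """Read one sketch back, verifying the md5 checksum of the hashes."""
    hash_name, seed, ksize, name, md5, joined = next(csv.reader(fh))
    hashes = [int(h) for h in joined.split()]
    recomputed = " ".join(str(h) for h in hashes)
    assert hashlib.md5(recomputed.encode()).hexdigest() == md5, "checksum mismatch"
    return hash_name, int(seed), int(ksize), name, hashes


# round-trip through an in-memory file
buf = io.StringIO()
dump_sketch(buf, "murmur3_x64_128", 42, 31, "example", [97, 12, 55])
buf.seek(0)
print(load_sketch(buf))
```

The embedded checksum lets an importer detect truncated or hand-edited files before using the hashes, which is the same role the md5sum plays in sourmash signatures.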

ctb commented May 21, 2016

Separately - do you have any contribution instructions, coding guidelines, or PR/code review policies I could use to guide me? thx!

aphillippy commented May 26, 2016

Wasn't presuming you'd code it up. Just propose a format for the dump and we can implement.

ctb commented Jun 3, 2016

I've implemented a dump command here: ctb#1. Let me know what you think -- I am happy to turn it into a pull request against this repo. The dumpfile format can be changed to include more information; I don't know enough about your internals to do a comprehensive job here!

I was thinking it'd be nice to implement a 'load' or 'import' command for mash as well; if you do that, I can provide a dump command from sourmash as well.

ondovb commented Jun 7, 2016

I was able to pull and run this, and the code is pretty straightforward, so I'd be fine with merging a pull request. A few comments:

  • It might make sense for this to be a function of mash info to keep the command space clean?
  • Other info we may want to throw in for future-proofing would be the alphabet and whether it's canonical.
  • As far as the format itself, we could end up with some pretty long lines e.g. for s=10,000, which isn't ideal, but single-line parsing does have some convenience to it. I'm also not crazy about nesting whitespace-separation within comma-separation (but of course these files aren't really meant to be viewed anyway).
ctb commented Jun 8, 2016

On Tue, Jun 07, 2016 at 02:43:10PM -0700, ondovb wrote:

I was able to pull and run this, and the code is pretty straightforward, so I'd be fine with merging a pull request. A few comments:

  • It might make sense for this to be a function of mash info to keep the command space clean?

I was thinking that too. Alternatively, part of the "expert" commands
that aren't shown by default? (You don't have any just yet, but you could
have more.)

  • Other info we may want to throw in for future-proofing would be the alphabet and whether it's canonical.

Absolutely - but I haven't looked into this for mash, so don't know what's
important for your current & future functionality. Specific suggestions?

  • As far as the format itself, we could end up with some pretty long lines e.g. for s=10,000, which isn't ideal, but single-line parsing does have some convenience to it. I'm also not crazy about nesting whitespace-separation within comma-separation (but of course these files aren't really meant to be viewed anyway).

As you say - agree to all of it :)

thanks!
--titus

edawson commented Jul 4, 2016

Hi all,

Has there been any more work on this? Both tools are absolutely fantastic and I'd be happy to follow suit in making yet another MinHash classifier compatible with Mash/sourmash (especially since I've contributed to the overgrowth of implementations).

I'm not opposed to incorporating capnproto but it'd be nice to have a text format as well - the sourmash sigs are great to have around!

aphillippy commented Jul 10, 2016

Hi Eric,
Yes, but summer holidays have put us behind. We should get back to this shortly.

Best,
-Adam

ondovb commented Jul 26, 2016

The master branch now has this implemented via mash info -d. I adhered to JSON syntax for more potential compatibility, but the whitespace is controlled, so a custom line-by-line parser should be pretty straightforward. I'll write up some schema docs once it's solidified; for now it should be mostly self-explanatory.

The one thing I didn't include was the MD5, since I wasn't sure of the scope of each signature (e.g. metadata or multiple sketches).

Update: forgot to print seed in original; just committed fix.
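The MD5 left out of the dump could also be computed by the consumer from the hashes themselves. A minimal sketch, assuming the JSON dump carries fields named `kmer`, `hashSeed`, and `sketches[].hashes` (field names assumed for illustration; check the actual `mash info -d` output for your Mash version):

```python
# Compute a sourmash-style md5sum over the hashes of each sketch in a
# mash-info-style JSON dump. Field names are assumed, not confirmed.
import hashlib
import json

dump = json.loads(
    '{"kmer": 21, "hashSeed": 42,'
    ' "sketches": [{"name": "genome.fna", "hashes": [12, 55, 97]}]}'
)

for sketch in dump["sketches"]:
    joined = " ".join(str(h) for h in sorted(sketch["hashes"]))
    digest = hashlib.md5(joined.encode()).hexdigest()
    print(sketch["name"], dump["kmer"], dump["hashSeed"], digest)
```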

ondovb commented Aug 11, 2016

I plan to make an official release tomorrow, which will include this feature. Any feedback on the format or what is included is certainly welcome... it will be fairly easy to add little things later, but any major changes will become more difficult once it is in use.


ondovb commented Aug 11, 2016

Ok, take your time and I will hold for your comments, since you're the primary target of the feature.

ondovb commented Aug 22, 2016

Did you get a chance to take a look at the JSON? I think I will go ahead with the release in the next couple days, but it should be easy to throw in any last minute tweaks to the format.

boydgreenfield commented Aug 24, 2016

@ondovb – One potential issue with JSON encoding here is that while serializing in C++ will work for 64-bit uints, the behavior decoding the JSON is going to be implementation specific (i.e., loading in Javascript will truncate the numbers to 64-bit floats).

For the purposes of having a maximally interchangeable format, thoughts on encoding these as strings in the JSON array?
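The truncation described above is easy to reproduce: a 64-bit hash value round-tripped through a double (which is what a JavaScript JSON parser does with numbers) silently changes, since doubles only hold 53 bits of integer precision. String-encoding the hashes sidesteps it:

```python
# Demonstrate why 64-bit hash values should be strings in interchange JSON.
import json

h = 2**63 + 1                      # a plausible 64-bit hash value
assert int(float(h)) != h          # a double cannot represent it exactly

blob = json.dumps({"hashes": [str(h)]})        # string-encode for interchange
decoded = [int(x) for x in json.loads(blob)["hashes"]]
assert decoded == [h]              # exact round trip
print(decoded[0])
```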

aphillippy commented Aug 25, 2016

Ooo, good catch, Nick. Thanks. I am not a JS person, so I don't know the best way to address this. Encoding as a string seems like a sensible solution to me...

boydgreenfield commented Aug 25, 2016

Yeah, I think string encoding is the simplest approach here (esp. since many languages may otherwise load an array of ints interspersed with longs vs. all unsigned 64-bit ints). That would also avoid really cryptic errors where folks pass this JSON blob through, e.g., Javascript services, and then end up with mysteriously truncated hash values.

Does that work for your use case @ctb?

ctb commented Aug 31, 2016

Yep, encoding as strings works for me!

michaelbarton commented Dec 9, 2016

Would it be possible to add the functionality to deserialize the JSON dump into
a mash .msh file? This would allow interoperability to work for both loading and
dumping.

I think having a standard format for mash hash interoperability is a good idea.
It may be useful to define an RFC-like document with a description of all the
fields. Nothing fancy - a simple document with the specifications of each key
and corresponding value. My small recommendation would be to add a version
field; this would help with maintaining changes if additional fields are
added in the future.

All data that might change in the future should be tagged with a version
number.
-- Joe Armstrong

michaelbarton commented Dec 9, 2016

I suppose it also follows that once you have an interchangeable format for
mashes, a centralised database of precomputed hashes would be useful for
everyone. Titus mentions something similar in the last paragraph of his blog
post MinHash signatures as ways to find samples, and collaborators?.

I would suggest maintaining the individual hashes of each organism and then
allowing each collaborator to mash paste hashes together. This would allow
the creation of custom databases, and cross-validation of tools through
leave-n-out benchmarking.

ondovb commented Dec 9, 2016

@michaelbarton this is certainly something we are open to, but the standardization you noted I think would be a prerequisite, especially with regard to the hash function, to keep things robust. We'll think about this more and keep you posted.


lgautier commented Dec 30, 2016

@ondovb : the standardization of the hash function would indeed be necessary.

However, even with such a standardization, sharing k-mers would seem preferable to sharing their hash values. This would allow added robustness through a rather simple check mechanism: check locally that a local hashing function produces the same minhashes as the one in the signature fetched from somewhere. This would also allow fancy handshakes between databases of signatures (pass the kmers and the corresponding hashes, and have an empirical assessment that the hash function is likely the same).

This would require a bit of added code upstream (tools creating signatures), with the need to store kmers in a signature-for-export (kmers and corresponding hash values) at creation time, but this does seem relatively easy to add.

PS: Sharing the k-mers might also make a JSON-based format relatively free from the concern about JS and long integers.
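The check described above can be sketched in a few lines: ship (k-mer, hash) pairs and verify locally that your own hash function reproduces them. The toy hash below is a stand-in for illustration only (Mash actually uses MurmurHash3); any real implementation would substitute its own function.

```python
# Minimal hash-function cross-check over exported (k-mer, hash) pairs.
# toy_hash is a placeholder, NOT Mash's actual hash function.
import hashlib


def toy_hash(kmer: str) -> int:
    # stand-in 64-bit hash: first 8 bytes of md5, for illustration only
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")


def verify_signature(pairs) -> bool:
    """pairs: (kmer, hash) tuples from an imported signature.

    Returns True if the local hash function reproduces every hash,
    i.e. both tools empirically agree on the hash function."""
    return all(toy_hash(kmer) == h for kmer, h in pairs)


pairs = [(k, toy_hash(k)) for k in ("ACGTA", "TTGCA", "GGGTC")]
assert verify_signature(pairs)                   # same function: check passes
assert not verify_signature([("ACGTA", 12345)])  # mismatched function: fails
```

As noted, agreement on sample inputs is evidence rather than proof of identical hash functions, but a handful of pairs catches the common failure modes (wrong seed, wrong variant, wrong canonicalization).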


lgautier commented Dec 30, 2016

140 chars version of my previous post: Kmer + hash value allow check. Kmer-only allows to overcome JS issue with integers.


lgautier commented Dec 31, 2016

(...)

If the goal is to verify the identity of the hash functions, that can be done by hashing a signal k-mer and sharing that hash specifically.

Empirical assessment of identical hashing functions by comparing output from known input is not going to be a guarantee no matter the number of inputs used, but I'd say: use as much as possible while remaining within a practical data transfer and computational burden. This would likely mean more than one, and may point towards using the list of signature k-mers with their hash values.

Should the sharing of both the k-mers and the corresponding hash values for all entries in the signature represent an excessive burden (the signatures are relatively small, but let's assume it could happen), the fact that a signature is constituted of the N lowest hash values provides a rather straightforward way to dynamically adjust the amount of k-mer/hash value pairs in a signature-for-export. One can then share only M k-mer/hash value pairs for verification (with M <= N, and the pairs chosen such that the hash values are the M lowest ones), as well as the remaining N-M entries in the signature (k-mers, or hash values), obviously. This covers everything down to the case of one signal k-mer with M=1 (or no check at all with M=0).

ctb commented Jan 1, 2017

Note that with @lgautier's sterling efforts in dib-lab/sourmash#71, sourmash is switching to JSON as a default signature format for 2.0. I plan to work with the code for a bit and then revisit this issue.

lgautier commented Jan 1, 2017

Would a joint Python package (to be used as a command line and/or library) to read/write/convert formats be of general and mutually beneficial interest?

This would provide a starting point and a rather concrete way to see whether a common format can be worked out (and even if it cannot, a common code base for conversion would already be really nice).

aphillippy commented Jan 2, 2017

I don't see any barriers to a common format and would like to minimize the number of alternate formats. So it seems like we just need to define a spec for the JSON format? @ondovb and I had also arrived at the @lgautier solution in our informal discussions (of reporting some number of kmers to cross-check the hash function).

Any strong feelings on who should take first crack at defining a spec? @ondovb , are you up for it? I know of some other groups who are interested, so after making a draft we could circulate it around for comment.


lgautier commented Jan 2, 2017

Sounds awesome. Looking forward to seeing the draft, and if helpful I am happy to look at it early, even if there are still plenty of open questions.

boydgreenfield commented Jan 3, 2017

@aphillippy @ondovb We've been thinking about this a bit too (we use a Mash implementation in our clustering functionality for One Codex), and would love to review a draft spec. We've also been moving towards storing the k-mers directly and then re-hashing at comparison time (which is really cheap) or storing k-mers and hashes.

It'd be nice to have a formal JSON schema or similar for the spec. Then validating that an implementation conforms to spec becomes trivial (this doesn't need to be a lot of work, just defining the fields and types).

Let me know how/if we can help here!
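A formal schema would make conformance checks like the one below trivial. This is a sketch only: the field names and types are illustrative (drawing on the version field, alphabet/canonical flags, and string-encoded hashes discussed above), not an agreed spec, and a real spec would more likely use JSON Schema than a hand-rolled validator.

```python
# Hand-rolled conformance check for a hypothetical sketch-interchange format.
# Field names and types are assumptions from this thread, not a real spec.
REQUIRED = {
    "version": str,        # per the version-field suggestion above
    "kmer": int,
    "alphabet": str,
    "canonical": bool,
    "hashFunction": str,
    "hashSeed": int,
    "sketches": list,
}


def validate(doc):
    """Return a list of conformance errors (empty list = valid)."""
    errors = []
    for field, ftype in REQUIRED.items():
        if field not in doc:
            errors.append(f"missing field: {field}")
        elif not isinstance(doc[field], ftype):
            errors.append(f"{field}: expected {ftype.__name__}")
    for i, sketch in enumerate(doc.get("sketches", [])):
        if not all(isinstance(h, str) for h in sketch.get("hashes", [])):
            errors.append(f"sketches[{i}]: hashes must be strings (64-bit safety)")
    return errors


doc = {"version": "1.0", "kmer": 21, "alphabet": "ACGT", "canonical": True,
       "hashFunction": "murmur3_x64_128", "hashSeed": 42,
       "sketches": [{"name": "example", "hashes": ["12", "55", "97"]}]}
assert validate(doc) == []
```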

ctb commented Jan 3, 2017

some good comments from @michaelbarton here.

lgautier commented Jan 5, 2017

bump.

I'd be able to jot down a skeleton. I happen to have a bit of time & inspiration for this, but I don't know how long both will remain true.

ondovb commented Jan 5, 2017

@lgautier I will probably get to this by tomorrow but have no JSON schema experience so feel free to contribute whatever you have in mind. Our current plan is to have kmer-hash pairs, with both values being optional, as long as each pair has one and at least some pairs have both for hash function verification.

lgautier commented Jan 5, 2017

@ondovb I have thought a little further and have a couple of additions beyond kmer-hash value pairs. What would be a proper place to put them?

Participants span several groups and institutions, so if it seems sensible to all, we could open a GitHub organization to host both the specs and reference implementation(s).

aphillippy commented Jan 5, 2017

@lgautier we will post a strawman spec and open an issue for comments. I think it's premature to go bigger for now.

@ondovb we had concluded in the initial draft to include kmer+hash for all elements of the sketch, since it will only double the size and simplify things

lgautier commented Jan 5, 2017

@aphillippy Sorry if I was too terse. Size was not what I had in mind; rather, the notion that the spec spans tools. It may be an unnecessary concern.

aphillippy commented Jan 5, 2017

@lgautier no worries, I like terse :) I meant that the number of interested groups is relatively small for now, so I don't think we need to worry about opening a new project until things get unwieldy. Perhaps we can just put the spec in a separate repo for now, to keep it separate from Mash.

ondovb commented Jan 6, 2017

Quick first pass schema is up; mostly formalizing what is output now but with hashes as strings and optional kmers. See #44 for further discussion.
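The convention described here (hashes serialized as strings, k-mers optional per element) can be sketched as a record like the one below. The field names are invented for illustration; the actual spec lives in #44.

```python
import json

# Hypothetical sketch record following the described convention:
# hashes are decimal strings (JSON numbers lose exactness above 2**53),
# and the k-mer for any given hash may be absent (null).
record = {
    "kmerSize": 21,
    "hashSeed": 42,
    "hashes": ["9611537217868716966", "13399946517446599414"],
    "kmers": ["ACGTACGTACGTACGTACGTA", None],  # optional per pair
}

def check(rec):
    """Minimal structural validation of the hypothetical record."""
    assert isinstance(rec["kmerSize"], int)
    assert all(isinstance(h, str) and h.isdigit() for h in rec["hashes"])
    kmers = rec.get("kmers")
    if kmers is not None:
        # When present, the kmers array parallels the hashes array.
        assert len(kmers) == len(rec["hashes"])
    return True

# Round-trip through JSON text to confirm nothing is lost in transit.
roundtripped = json.loads(json.dumps(record))
check(roundtripped)
```

Keeping hashes as strings sidesteps the differences in how JSON libraries across languages handle 64-bit integers.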

ctb commented Jan 6, 2017

ctb commented Jan 20, 2017

At first blush, your JSON export format works perfectly for me -- I have a script that converts from mash info -d format => sourmash format in dib-lab/sourmash#121. More testing soon.

@kescobo referenced this issue Mar 21, 2017: Genomic Distances using MASH #415 (merged)
kescobo commented Apr 6, 2017

I'm a bit late to this party, and please forgive me if this is a naive question, but where can I actually find the hash function + seed that you're using? I don't speak C++, so I might be looking in the wrong place, but I didn't find anything that looks like a function in the MurmurHash3 or Hash files in the src/ folder. Is this something that's being imported from somewhere else?

I'm asking because I made yet another Mash implementation in Julia, and I'd like to ensure interoperability, but I'm not quite sure where to start. Julia has a built-in hash() function that seems to work, but I presume it's not the same.

Also, I'm currently hashing the 4-bit DNA sequence format implemented in Bio.jl, which makes it blazing fast, but I'm assuming Mash and sourmash are hashing the string representations of the sequences?

ctb commented Apr 6, 2017

edawson commented Apr 6, 2017

Hey Kevin,

I use the 128-bit one from the MurmurHash library here and cut it down to the first 64 bits. If I remember correctly, it's exactly the same as what Mash does.

The exact call I make is MurmurHash3_x64_128(start, seq.size(), 42, khash);.

Our hash seed is 42.
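For anyone wiring this up in another language, the call above can be cross-checked with a pure-Python transcription of Appleby's public-domain MurmurHash3_x64_128 reference. This is a sketch for interoperability testing, not Mash's code; note also that by default Mash hashes the canonical k-mer (the lexicographically smaller of the k-mer and its reverse complement), which is not shown here.

```python
def murmur3_x64_128(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 x64_128; returns the first 64 bits (h1)."""
    c1, c2 = 0x87C37B91114253D5, 0x4CF5AD432745937F
    mask = 0xFFFFFFFFFFFFFFFF

    def rotl(x, r):
        return ((x << r) | (x >> (64 - r))) & mask

    def fmix(k):
        k ^= k >> 33
        k = (k * 0xFF51AFD7ED558CCD) & mask
        k ^= k >> 33
        k = (k * 0xC4CEB9FE1A85EC53) & mask
        return k ^ (k >> 33)

    length = len(data)
    h1 = h2 = seed
    # Process 16-byte blocks, two little-endian 64-bit lanes per block.
    for i in range(length // 16):
        k1 = int.from_bytes(data[i * 16:i * 16 + 8], "little")
        k2 = int.from_bytes(data[i * 16 + 8:i * 16 + 16], "little")
        k1 = (rotl((k1 * c1) & mask, 31) * c2) & mask
        h1 ^= k1
        h1 = ((rotl(h1, 27) + h2) * 5 + 0x52DCE729) & mask
        k2 = (rotl((k2 * c2) & mask, 33) * c1) & mask
        h2 ^= k2
        h2 = ((rotl(h2, 31) + h1) * 5 + 0x38495AB5) & mask
    # Tail: up to 15 remaining bytes, XORed in little-endian order.
    tail = data[(length // 16) * 16:]
    if len(tail) > 8:
        k2 = int.from_bytes(tail[8:], "little")
        h2 ^= (rotl((k2 * c2) & mask, 33) * c1) & mask
    if tail:
        k1 = int.from_bytes(tail[:8], "little")
        h1 ^= (rotl((k1 * c1) & mask, 31) * c2) & mask
    # Finalization.
    h1 ^= length
    h2 ^= length
    h1 = (h1 + h2) & mask
    h2 = (h2 + h1) & mask
    h1, h2 = fmix(h1), fmix(h2)
    h1 = (h1 + h2) & mask
    return h1  # first 64 bits of the 128-bit digest

def mash_style_hash(kmer: str, seed: int = 42) -> int:
    # Hash the raw ASCII bytes with seed 42 and keep the first 64 bits,
    # mirroring the MurmurHash3_x64_128(start, seq.size(), 42, khash) call.
    return murmur3_x64_128(kmer.encode("ascii"), seed)
```

Comparing a few hashes produced this way against mash info output is a quick way to confirm an implementation is bit-compatible.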

ondovb commented Apr 6, 2017

@edawson Yep, that's exactly what Mash does, although in the case of k <= 16 (nucleotide) or k <= 7 (protein) it will optimize by using MurmurHash3_x86_32 (all 32 bits).

@kescobo Yes, the raw (ASCII) strings are hashed to allow varied alphabets, such as protein.
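For completeness, the 32-bit variant mentioned for short k-mers can likewise be transcribed from Appleby's reference x86_32 routine. Again, this is a pure-Python sketch for cross-checking other implementations, not Mash's source:

```python
def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 x86_32 (the variant Mash uses for small k)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    m = 0xFFFFFFFF
    h = seed
    # Process 4-byte little-endian blocks.
    nblocks = len(data) // 4
    for i in range(nblocks):
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * c1) & m
        k = ((k << 15) | (k >> 17)) & m  # rotl32(k, 15)
        k = (k * c2) & m
        h ^= k
        h = ((h << 13) | (h >> 19)) & m  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & m
    # Tail: 1-3 leftover bytes, XORed in little-endian order.
    tail = data[nblocks * 4:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & m
        k = ((k << 15) | (k >> 17)) & m
        k = (k * c2) & m
        h ^= k
    # Finalization (fmix32).
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & m
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & m
    return h ^ (h >> 16)
```

Since the 32-bit path keeps all 32 bits, a k <= 16 nucleotide sketch and a longer-k sketch are not directly comparable; an interchange format needs to record which variant produced the hashes.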

kescobo commented Apr 6, 2017

Thanks for the rapid response, everyone - happily, there appears to be a Julia implementation of MurmurHash3, so I'll give that a shot and compare to @ctb's Python implementation.

@ondovb - thank you also for specifically mentioning that it's an ASCII string - Julia uses UTF-8 strings by default, and I likely would have spent a long time banging my head against that difference if you hadn't brought it up.

cc @edawson
