JSON schema #44

Open
ondovb opened this issue Jan 6, 2017 · 23 comments

Comments

@ondovb
Member

ondovb commented Jan 6, 2017

A first pass of the JSON schema is in the Mash repo:
https://github.com/marbl/Mash/blob/master/src/mash/schema.json

For now, I put k-mers as a separate array parallel to hashes rather than an array of tuples, since the latter seemed unwieldy, especially if they are optional.
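
A minimal Python sketch of the two layouts being compared. The field names `hashes` and `kmers` and all values are assumptions for illustration, not taken from the schema file.

```python
# Hypothetical sketch fragment; field names and values are illustrative only.
parallel = {
    "hashes": [11, 42, 97],
    "kmers": ["ACG", "CGT", "GTA"],  # optional: can be dropped without touching "hashes"
}


def to_tuples(sketch):
    """Convert the parallel-array layout into a list of (hash, kmer) pairs."""
    kmers = sketch.get("kmers")
    if kmers is None:
        # Without k-mers, the tuple layout needs a placeholder in every entry.
        return [(h, None) for h in sketch["hashes"]]
    if len(kmers) != len(sketch["hashes"]):
        raise ValueError("kmers and hashes must have the same length")
    return list(zip(sketch["hashes"], kmers))


pairs = to_tuples(parallel)
hashes_only = to_tuples({"hashes": [11, 42, 97]})
```

With parallel arrays, omitting the optional k-mers means omitting one key; with tuples, every entry has to carry a placeholder.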

@boydgreenfield

boydgreenfield commented Jan 6, 2017

@ondovb This looks like a great start! A few items we've been tracking here that it'd be great to include (we were literally just whiteboarding this):

  • A counts array, and extra parameters around count-based trimming of the min-hashes
  • An Object type metadata value for storing extra sample metadata (per sketch)
  • An optional metadata_schema value that could point to a schema for said metadata

A few more tactical items and questions:

  • Is name the filename for each sketch? In our implementation, we've previously also tracked the file size and an md5 checksum, which is useful for deduplicating
  • Is length the number of k-mers? Or file size? I think it could be useful to store both. One should probably only count valid k-mers per the alphabet in an implementation
  • alphabet should maybe be an enum?
  • hashSeed should probably be nullable since not all hash functions expose a seed (or one just makes it 0 in that case?)

@ondovb
Member Author

ondovb commented Jan 6, 2017

@boydgreenfield Thanks for the feedback.

A counts array, and extra parameters around count-based trimming of the min-hashes

Counts make sense, and are in the Mash Cap'n Proto. What kind of trimming parameters did you have in mind?

An Object type metadata value for storing extra sample metadata (per sketch)
An optional metadata_schema value that could point to a schema for said metadata

Seems like as good a way as any.

Is name the filename for each sketch? In our implementation, we've previously also tracked the file size and an md5 checksum, which is useful for deduplicating

Usually, but for Mash it could also be a fasta tag if -i was active (in which case any description after whitespace goes in comment). To support these two modes (and format independence, e.g. fasta/fastq), I think an MD5 would have to be based on the sequence itself.
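
A rough sketch of a sequence-based MD5 that gives the same digest for FASTA and FASTQ input. The helper names are hypothetical, and two choices are assumptions made only for this example: sequences are upper-cased, and records are concatenated without separators.

```python
import hashlib


def sequence_md5(records):
    """MD5 over sequence content only, ignoring names, comments, and qualities.

    `records` is an iterable of sequence strings in file order.
    """
    digest = hashlib.md5()
    for seq in records:
        digest.update(seq.upper().encode("ascii"))
    return digest.hexdigest()


def fasta_sequences(text):
    """Yield one sequence string per FASTA record (header lines are skipped)."""
    chunks = []
    for line in text.splitlines():
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line.strip():
            chunks.append(line.strip())
    if chunks:
        yield "".join(chunks)


def fastq_sequences(text):
    """Yield the sequence line of each 4-line FASTQ record."""
    lines = text.splitlines()
    for i in range(1, len(lines), 4):
        yield lines[i].strip()


fasta = ">read1 some comment\nACGT\nacgt\n"
fastq = "@read1\nACGTACGT\n+\nIIIIIIII\n"
same = sequence_md5(fasta_sequences(fasta)) == sequence_md5(fastq_sequences(fastq))
```

Because record boundaries are not hashed here, this digest identifies the concatenated sequence content rather than the file; a per-record digest would be needed in `-i` mode.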

Is length the number of k-mers? Or file size? I think it could be useful to store both. One should probably only count valid k-mers per the alphabet in an implementation

It is the raw sequence length (or, for read sets, a genome size estimate based on k-mer content). We only use this for p-value calculation. The number of valid k-mers is currently not tracked in Mash, but I agree that allowing more information about this would be better.

alphabet should maybe be an enum?

I assume you mean only allowing nucleotide/protein options? For Mash we kept it as generic as possible in case someone would want to do text mining or something, but I could see the argument for that.

hashSeed should probably be nullable since not all hash functions expose a seed (or one just makes it 0 in that case?)

Makes sense. Regarding the hash function itself, I have it as an enum but am not sure if that's the best way. It would ensure specificity but would also require a new schema version to use any other hash function.

In general we could also allow additional fields ("additionalProperties" : true), though I feel like that would weaken the schema as a validation tool. Of course, if people end up ignoring schema compliance to get functionality, that's not very useful either.

@bbushnell

The suggested JSON format appears to be extremely inefficient. BBMap's sketches look like this:

#SZ:30 CD:AD GS:1430 ID:393251 NM:Paenibacillus nanensis NM0:gi|343200804|ref|NR_041491.1| Paenibacillus nanensis strain MX2-3 16S ribosomal RNA gene, partial sequence
7XJemnRFJVG
HXETE>jil
18BHI?<JhP
1?ZKWJ=1CU
48anA5Vkc
1<7TlMGgOo
AT`\bZgKR
2]nnIcK;I_

...etc. They are coded in 2-bit format ASCII-48 with delta-compression so even when you gzip them the size is only reduced by ~30%. So, they are extremely efficient to store and load. I suggest, if the goal is to make an efficient interoperable standard, that you adopt something similar and abandon JSON.
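
To illustrate the general idea of delta compression with printable ASCII digits, here is a generic sketch. It is not BBMap's actual encoding, whose details are not given in this thread: the 6 bits per character, the digit order, and all function names are assumptions, and the 2-bit k-mer coding is not modeled.

```python
OFFSET = 48          # digits are emitted as chr(48 + value), i.e. starting at "0"
BITS_PER_CHAR = 6    # assumption for this sketch: 64 symbols per character


def encode_number(n):
    """Encode a non-negative integer as base-64 digits shifted into printable ASCII."""
    digits = []
    while True:
        digits.append(chr(OFFSET + (n & 63)))
        n >>= BITS_PER_CHAR
        if n == 0:
            break
    return "".join(reversed(digits))


def decode_number(text):
    """Inverse of encode_number."""
    n = 0
    for ch in text:
        n = (n << BITS_PER_CHAR) | (ord(ch) - OFFSET)
    return n


def encode_sketch(hashes):
    """Sort hashes and store each one as the difference from its predecessor."""
    lines, previous = [], 0
    for h in sorted(hashes):
        lines.append(encode_number(h - previous))
        previous = h
    return "\n".join(lines)


def decode_sketch(text):
    """Rebuild the sorted hash list by accumulating the stored differences."""
    hashes, previous = [], 0
    for line in text.splitlines():
        previous += decode_number(line)
        hashes.append(previous)
    return hashes
```

Sorted min-hashes tend to have small gaps relative to their absolute values, which is why storing differences shortens each line.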

@lgautier

lgautier commented Jan 6, 2017

@bbushnell: binary formats are pretty much always more efficient than text formats, and I'd expect tools looking for performance to use their own representation. IIUC this is an initial attempt at a data exchange format between tools. I'd say that JSON is attractive because of the ubiquitous availability of tools around it (libraries exist for all languages), and because it defines basic structures like arrays and key-value maps, it lets us focus on the content first (what the format should contain).

As use cases emerge there might be a need to optimize, maybe in incremental steps (2-bit packing of k-mers, although this is limited to {A,(T|U),C,G} sequences; encoding the minhash sketch arrays as byte-packed strings; etc.), but this would be for later?
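
A minimal sketch of 2-bit k-mer packing and its limitation to {A,(T|U),C,G}. The code assignment (A=0, C=1, G=2, T/U=3) and the function names are assumptions for this example.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}  # T and U share a code
BASES = "ACGT"


def pack_kmer(kmer):
    """Pack a nucleotide k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer.upper():
        if base not in CODES:
            # Anything outside A/C/G/T/U (e.g. N, amino acids) cannot be represented.
            raise ValueError(f"cannot 2-bit encode {base!r}")
        value = (value << 2) | CODES[base]
    return value


def unpack_kmer(value, k):
    """Recover a k-mer of length k from its packed form (U is returned as T)."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))
```

The length `k` has to be stored separately, since leading `A` bases pack to zero bits.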

@bbushnell

BBMap's sketch format is text, not binary, as you can see from my post - that is the exact, literal, first 9 lines of a BBMap sketch. Binary might be more efficient, but then you can't look at the sketches in a text editor, so I'm not really interested in that. They already are only 150% of their gzipped size, so I don't think it's much of a problem.

@bbushnell

Oh - as for noncanonical bases... yes, you're right. My sketch format can only accommodate ACGT. I don't think this is a problem, though, because... well, what is the goal of sketches? It's to rapidly evaluate whether sequences are similar. Does anyone care whether you have a poly-N sequence that matches everything? ...no.

@ondovb
Member Author

ondovb commented Jan 6, 2017

@bbushnell I agree with @lgautier that interoperability is the primary goal of this effort, above efficiency. I think the point is that if parsing requires any other custom code or less-than-mainstream libraries, one might as well maximize efficiency with a binary format. This was certainly the motivation behind our use of Cap'n Proto serialization (which actually does provide a schema and libraries for several languages). The ASCII encoding is an interesting middle ground if we want to compress the string within the JSON in the future, but any such solution would have to support the protein alphabet at the very least.

@aphillippy
Member

Agree with all points raised by @lgautier and @ondovb

@lgautier

lgautier commented Jan 6, 2017

@boydgreenfield's suggestion to add count-based trimming points out that, for DNA, RNA, or protein data, the definition of a minhash sketch extends beyond the definition of a hash function (which the redundancy of sharing k-mers/n-tuples alongside their hash values would empirically verify when sharing a sketch); it should also cover, to some extent, the nature of the data shared and the pre-processing leading to the minhash sketch. In a way this is part of the "metadata" that was also suggested. Fully defining it is a complex problem that should probably stay out of scope, but at the same time the information might be important for making use of the sketch / signature (one of the reasons they are exchanged in the first place).

For example, whether a DNA minhash sketch is built from a complete assembled genome or from shotgun sequencing reads for a given genome influences what the sketch means and how it could be used. I am thinking more specifically of the use case where the subset of k-mers constituting a sketch is used to query a database / service about whether it has a matching signature. With a convention, the server might be able to answer in the best way (e.g., prioritize / adjust thresholds when searching).

My initial feeling is that this looks like opening a Pandora's box, but I also think that the major use cases can be defined and covered well enough to have a practical exchange format.

Would the notions of hash value-level metadata and minhash sketch-level metadata seem like an interesting starting point?

  • hash value-level would be:
    • hash values sorted (say, in increasing order)
    • required hash value-level metadata is the sequence (k-mer/n-gram) from which the hash is computed
    • optional hash value-level metadata can comprise the count (can other hash value-level metadata be included?)
  • minhash sketch-level would be:
    • filter (metadata Pandora's box again here, but maybe common filters can be agreed on? e.g., count, complexity)
    • total number of k-mers/n-grams evaluated for inclusion in the minhash sketch
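
The two levels listed above could be sketched as follows. Everything here is an assumption for illustration: the property names (`hashes`, `kmers`, `counts`, `filters`, `kmersEvaluated`), the MD5-based stand-in hash (not the hash Mash uses), and the omission of canonical (reverse-complement) k-mers.

```python
import hashlib


def toy_hash(kmer):
    """Stand-in hash: first 8 bytes of MD5 as an integer (not the hash Mash uses)."""
    return int.from_bytes(hashlib.md5(kmer.encode("ascii")).digest()[:8], "big")


def build_sketch(sequence, k, sketch_size, min_count=1):
    """Bottom-s minhash sketch with hash value-level and sketch-level metadata."""
    counts = {}
    total = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
        total += 1
    # Apply the count filter, then keep the sketch_size smallest hashes.
    kept = [(toy_hash(kmer), kmer, n) for kmer, n in counts.items() if n >= min_count]
    kept.sort()
    kept = kept[:sketch_size]
    return {
        # Hash value-level: parallel arrays, sorted by increasing hash.
        "hashes": [h for h, _, _ in kept],
        "kmers": [kmer for _, kmer, _ in kept],
        "counts": [n for _, _, n in kept],
        # Sketch-level: how the sketch was constructed.
        "filters": {"minCount": min_count},
        "kmersEvaluated": total,
    }


sketch = build_sketch("ACGTACGTAC", k=3, sketch_size=3)
```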

@lgautier

lgautier commented Jan 6, 2017

alphabet should maybe be an enum?

I assume you mean only allowing nucleotide/protein options? For Mash we kept it as generic as possible in case someone would want to do text mining or something, but I could see the argument for that.

I think I agree with @ondovb: the alphabet explicitly defines the space of k-mers / n-grams. It is not space-optimal (e.g. all amino acids are repeated with each minhash sketch of polypeptides), but the minhash sketch itself likely takes much more space anyway. It would also allow exotic bases, and all sorts of oddities synthetic biology may come up with.

@lgautier

lgautier commented Jan 6, 2017

hashSeed should probably be nullable since not all hash functions expose a seed (or one just makes it 0 in that case?)

Makes sense. Regarding the hash function itself, I have it as an enum but am not sure if that's the best way. It would ensure specificity but would also require a new schema version to use any other hash function.

The definition of the hash function can be relaxed to a string. There can be commonly agreed-upon hashing functions, but even so the redundancy of sharing hash values along with their originating k-mers/n-tuples is there to empirically double-check it.
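
A small sketch of that empirical double-check. The registry, the function names in it (`crc32`, `md5-64`), and `check_hash_function` are hypothetical; a real reader would register whatever hash functions the tools agree on.

```python
import hashlib
import zlib

# Hypothetical registry mapping a free-form hashFunction string to an implementation.
HASH_FUNCTIONS = {
    "crc32": lambda kmer: zlib.crc32(kmer.encode("ascii")),
    "md5-64": lambda kmer: int.from_bytes(
        hashlib.md5(kmer.encode("ascii")).digest()[:8], "big"
    ),
}


def check_hash_function(name, kmers, hashes):
    """Return True if the named function reproduces every stored hash."""
    func = HASH_FUNCTIONS.get(name)
    if func is None:
        return False  # unknown name: cannot be verified locally
    if len(kmers) != len(hashes):
        return False
    return all(func(kmer) == h for kmer, h in zip(kmers, hashes))


example_kmers = ["ACG", "CGT"]
example_hashes = [zlib.crc32(b"ACG"), zlib.crc32(b"CGT")]
declared_ok = check_hash_function("crc32", example_kmers, example_hashes)
```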

@bbushnell

That's fine. All I care about is efficiency, which is the point of min-hash-sketch. I'm surprised that you guys are willing to compromise efficiency for a basically intangible benefit of interoperability which may or may not happen. Good luck!

@bbushnell

@ondovb
BBMap's sketch format uses 2-bit notation, but there's no problem with using 5-bit instead, to support proteins... BBMap does not currently support 5-bit format, but I could certainly add it if it would be useful. Currently, it's designed to match nucleotide sequences.

@boydgreenfield

boydgreenfield commented Jan 6, 2017

@lgautier @ondovb I agree on both fronts re: the alphabet (to be clear, we're suggesting a list of valid characters as a string, correct?) and am fine with relaxing the hashFunction to a string. I also like your suggestions re: additional "how was this sketch constructed?" information @lgautier. Perhaps we can call all of these params so there's also a place to put metadata about the file being sketched, e.g., this is a sketch of short-read NGS data from a stool sample.

@bbushnell I think the point here is to get to something easy enough to use for interoperability, and so we should try to optimize for parse-ability and ease-of-correct-implementation over efficiency. E.g., we've actually been storing all of our min-hashes as binary data in Postgres.

@bbushnell

@boydgreenfield

If you are interested in interoperability, doesn't it make more sense to store data as text? Personally, I consider binary formats to be inherently non-interoperable.

@bbushnell

To emphasize this - I have written a lot of tools. All of them support text formats. I have zero interest in writing programs to read custom binary formats that are language-specific or format-specific, when they are less efficient than a text-based protocol.

@lgautier

lgautier commented Jan 8, 2017

@bbushnell Maybe a slight misunderstanding here. Text vs. binary was maybe not the best way to describe it, and it was (inaccurately) implied that your format was the binary one. In other words, everyone has a text format, and this is not why JSON is being considered.

@ondovb
Member Author

ondovb commented Jan 9, 2017

@boydgreenfield Yes, I was suggesting the alphabet be a string, any characters in which would be considered valid. Characters that are different cases of the same letter would also be considered valid unless preserveCase were true. This is based on Mash's default case-insensitivity, but for this format it may make more sense to invert the parameter to ignoreCase.
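
The alphabet rule described above could be sketched like this. The function name and the snake_case parameter are assumptions; only `alphabet` and `preserveCase` come from the discussion.

```python
def is_valid_kmer(kmer, alphabet, preserve_case=False):
    """Check that every character of `kmer` is allowed by `alphabet`.

    Unless preserve_case is true, both cases of each alphabet letter are valid.
    """
    if preserve_case:
        allowed = set(alphabet)
    else:
        allowed = set(alphabet.upper()) | set(alphabet.lower())
    return all(ch in allowed for ch in kmer)


lower_ok = is_valid_kmer("acgt", "ACGT")
lower_strict = is_valid_kmer("acgt", "ACGT", preserve_case=True)
```

Inverting the parameter to `ignoreCase` would only flip the default branch; the check itself stays the same.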

@ondovb
Member Author

ondovb commented Jan 12, 2017

I've updated schema-1.0.0.json in the repo to address some of these issues and I've added comments, which may not be strictly valid JSON but seem to be ignored by the validator I'm using.

  • additional properties - We've changed it to allow these, with the idea that this could be a minimal schema that can have additional information (e.g. metadata or novel filters) layered on top in derived schemas. This way tools could adapt the format for their own needs while being able to convey the most necessary information to other tools, with the extra data simply being ignored.
  • version - I addressed this with a string that is expected to be the URI of the schema used, whose name would include a version.
  • filters - There is a dedicated property for these, currently inhabited only by the minimum copy number filter.
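
A reader following these points might look roughly like the sketch below. The property names `schema` and `sketches` are assumptions (the thread does not name the version property), and the version pattern is inferred only from the `schema-1.0.0.json` file name.

```python
import re


def schema_version(uri):
    """Extract (major, minor, patch) from a URI ending in schema-X.Y.Z.json."""
    match = re.search(r"schema-(\d+)\.(\d+)\.(\d+)\.json$", uri)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def read_core(document, core=("schema", "sketches")):
    """Keep only the properties this reader understands; extras are ignored."""
    return {key: document[key] for key in core if key in document}


# Hypothetical document from a tool that layers its own metadata on top.
document = {
    "schema": "https://github.com/marbl/Mash/blob/master/src/mash/schema-1.0.0.json",
    "sketches": [],
    "myToolMetadata": {"sample": "stool"},  # extra property, unknown to this reader
}
core = read_core(document)
version = schema_version(core["schema"])
```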

We plan on updating Mash to read and write the format as proposed soon, but others are welcome to continue working on standards related to metadata or to create a shared repo. For a name, I would like to propose Jam (JSON MinHash), in keeping with the edibles theme :P

@lgautier

"Jam" has a nice ring, as it can also mean an informal and spontaneous musical performance. Visibility in search engines might be another matter though.

I am about ready to write read/write code for that format, but I have a question about the license for the JSON definition being discussed: what is it released under? (A CC-like license would seem to make sense.)

@aphillippy
Member

@lgautier Public domain (I'm a govt employee). If someone else wants to open a repo and merge contributions from others, then I'd vote for CC0.

@lgautier

Thanks. Public domain is good to start. We can see later whether anything else is needed because of contributions or the like.

@lgautier

In case anyone is looking for the schema: the URL at the top of this thread appears to be no longer valid. The schema is here: https://github.com/marbl/Mash/blob/master/src/mash/schema-1.0.0.json


5 participants