More structured schema format? #12
Comments
Thanks, @jimmymathews. I think having a Frictionless Data representation would be extremely valuable, but my vote would be for "in addition" rather than "instead". In general, I completely agree that having a human-readable representation simplifies schema maintenance. This is actually what motivated us to use YAML instead of the more popular JSON format. (As you know, YAML is often used for configuration specifications, e.g., in Kubernetes deployment, because YAML files are intended to be both machine- and human-readable.) The challenge for a flat-table representation is that the specifications are ragged. For example, it is not trivial how to represent Lines 87 to 95 in 5311594
and Lines 100 to 105 in 5311594
in the same table, in part because the format and length of
My personal vote would be to continue maintaining the "source" specification in YAML format, but begin accumulating scripts that automatically translate the "source" spec into other representations, including flat tables and Frictionless Data. In that sense, I fully welcome the proposed PR, which we can wrap into another GitHub Action to keep all representations automatically synchronized. I am curious to hear what @DenisSch, @jmuhlich and @adamjtaylor think.
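As a rough illustration of the "translate the source spec into flat tables" idea, here is a minimal sketch in Python. It assumes the YAML has already been loaded into a Cerberus-style dictionary; the field names, allowed values, and the helper names `flatten_schema` and `to_csv` are hypothetical and are not taken from the MITI repository. The ragged part (the variable-length `allowed` list) is moved into a second table so that both outputs stay rectangular.

```python
import csv
import io

# Hypothetical Cerberus-style schema, as it might look after loading one YAML file.
# Field names and allowed values are invented for illustration only.
schema = {
    "immersion_medium": {
        "type": "string",
        "required": True,
        "allowed": ["Air", "Oil", "Water"],
    },
    "pixel_size_um": {"type": "float", "required": False, "min": 0},
}


def flatten_schema(table_name, schema):
    """Split a nested schema into a fields table and a valid-values table."""
    field_rows, value_rows = [], []
    for name, rules in schema.items():
        # One fixed-width row per field, regardless of which rules it carries.
        field_rows.append({
            "table": table_name,
            "field": name,
            "type": rules.get("type", ""),
            "required": rules.get("required", False),
        })
        # The variable-length list of allowed values goes to its own table.
        for value in rules.get("allowed", []):
            value_rows.append({"table": table_name, "field": name, "value": value})
    return field_rows, value_rows


def to_csv(rows):
    """Serialize a list of uniform dictionaries as CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fields, values = flatten_schema("imaging", schema)
print(to_csv(fields))
print(to_csv(values))
```

A script of this shape could run in a GitHub Action after each change to the YAML source, which would keep the YAML canonical while the tables are regenerated rather than hand-edited.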
This makes sense... A synchronized multi-format schema would be very useful across multiple domains. Although, of course, one must be chosen as the canonical reference, and synchronization is non-trivial work. You're right that my approach in your example would be to separate out the "valid values" into a separate file. But the reason for this is not to deal with the raggedness in
You write that: "To me, it's not immediately obvious that a collection of tables offers much of an advantage over a single self-contained YAML file, when it comes to by-hand editing." This sounds right, but this comment probably means I miscommunicated a bit. To clarify: I'm suggesting that the schema be located in (essentially) one file -- the fields table. This one file would be hand-edited by schema designers. The "collection of tables", multiple files, refers to the data bundles, not the schema/spec. As things stand, there are 8 YAML files comprising the spec, not "a single self-contained YAML file".
@jimmymathews Sorry, I meant that replacing any one YAML file with multiple tables does not confer an advantage for by-hand editing (in my opinion). My interpretation of your original post was that the 8 YAML files (each of which is self-contained) would be replaced by a collection of tables, with each table adhering to a particular convention, such as Tidy Data for example. However, it sounds like I am misinterpreting your proposal? To assist with future discussion, can you maybe share a small example of what you envision as a "one file, multiple table" format for by-hand maintenance?
I'm probably not the best person to comment on Your point about certain scenarios not being covered by existing definitions is very valid. However, given that MITI is still in its infancy, I would advocate for iterative refinement of the schema (even with simple additions of Maybe @santas01, @arenasg, @acraquel, @clarenceyapp and others who helped define existing
Sorry, I think I wrongly suggested the general desirability of user choice of alternate "valid values". Of course, the standard should take a hard line on valid values. My point is that standardizing names alone isn't enough.
As things stand, a prospective data provider is still left wondering what state of affairs they are claiming holds if they list This seems to be a systematic issue with the dozens of fields with mere controlled string values in the spec. On the basis of the specification, data providers do not know what they are claiming scientifically by putting out a compliant dataset, and they do not know what is claimed in datasets they encounter. Here is the autogenerated fields table I have been referring to (not perfect yet!). And the tables table.
Hi @jimmymathews. I was one of the members who selected the metadata fields for the microscopy/imaging tables, such as immersion media. I'd like to clarify what issue you're finding with some of the fields. You mentioned that data providers do not know what they are claiming scientifically by putting out a compliant dataset. How does one design an experiment without knowing what immersion media (and other settings) they've been instructed to use or have used? This would have been done prospectively before data acquisition. I understand that it would be nice to have a detailed description of every single field, but as you know, there are a lot of fields and it would take a long time to curate. In some cases, a concise description of what a field means is simply not possible without turning it into a lecture. The original point of the MITI standards was to hold metadata, not to act as instructional media. There are several microscopy websites that give thorough explanations, which we can link out to if that is useful.
Thanks @clarenceyapp, yes, I think you have really homed in on the precise issue! Also, thank you all for your patience with my somewhat unclear posts. Definitions. You are correct that the main thing I am after is a "description of every single field" (and every hard-coded value in the specification). You are also correct that this is an onerous task to do comprehensively. However:
I would also distinguish between "definitions" and "descriptions", and suggest that definitions are needed rather than "detailed descriptions". You are of course correct that a metadata specification ought not to be instructional media. But the minimum of semantic content -- one sentence definitions -- should be considered the minimally sound basis for sharing data. Best case scenario. The absolute best case scenario, that handles all the problems raised in this thread, would be cases where a pre-existing formal ontology already covers the field. For example, the
(Link phenotypic sex.) This is the best case scenario because it wholly reuses others' work on the problem, answering your complaint that this onerous annotation task isn't the point of the metadata standard. Of course, in any scenario, one still has to decide what one means, and a term/name is not enough. It might seem that what is meant by "gender" is self-evident to experienced researchers in the microscopy domain, but a moment's reflection shows this not to be the case. Some species have a karyotypic sex concept. The karyotypic sex system is different for birds and for mammals. And so on. Choosing a definition, even an incomplete one, helps to immunize against getting mired in such tangential issues, raised by pedants like myself. Clarification. You write: "You mentioned that data providers do not know what they are claiming scientifically by putting out a compliant dataset. How does one design an experiment without knowing what immersion media (and other settings) they've been instructed to use or have used? This would have been done prospectively before data acquisition." I think you misunderstood my point. The investigator of course knows what they would like to claim scientifically. What is problematic is that the MITI specification currently prevents them from communicating the content of this claim via MITI-compliant datasets. My complaint is answered in this case if
There is always going to be some discrepancy in the level of expert knowledge when it comes to specific fields and values. I don't necessarily think that introducing additional complexity to the schema is the best way to close that knowledge gap. What I propose is that we make better use of the MITI website (https://www.miti-consortium.org/) and create additional pages with links to existing microscopy resources and/or precise definitions of So, maybe to summarize the action items:
I think 3. will likely come down to the preference of MITI maintainers. To me, it makes sense to make their jobs as easy as possible and automate everything else. We will probably hear some more thoughts from maintainers and the governing board next week; I believe a number of folks are still on vacation.
This plan makes sense. It seems this Issue is more or less settled. Vote to close? I plan to submit a PR with scripts to support part of item (1). Items (2) and (3) should perhaps get their own Issues.
Thanks, @jimmymathews
I am happy to organise a call with @jimmymathews after our next governance meeting (next week) to discuss this topic in more detail.
That would be great, count me in. Thank you @DenisSch
It seems that the YAML format for table data specification is informed by the choice of Cerberus as an underlying validator.
Is the consortium willing to consider Frictionless Data instead, as the underlying data schema specification framework? It has broad tooling support, with Python, JS, and bash tools that make FD data packages natively available in various settings (local prototyping, headless/remote computation, web applications). As an example of a project that makes use of FD under the hood, see the C2M2 (Crosscut Metadata Model).
If such a move (albeit quite significant) could be on the table, in a few days I can make a PR to share some scripts that I am currently using to:
(The above (1) is not a perfect translation yet.)
Moving to FD offers the advantage that the general purpose validation functionality is high quality and maintained by someone else. A specific advantage is the possibility of foreign-key integrity checks made possible by the schema's awareness of dependencies between tables.
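To make the foreign-key point concrete, here is a minimal sketch of what such an integrity check does. The descriptor follows the `foreignKeys` layout of the Frictionless Table Schema specification, but the check itself is hand-written in plain Python rather than calling the Frictionless library, and the table names, field names, rows, and the helper `find_dangling_keys` are all hypothetical.

```python
# Hypothetical schema for an "images" table. The "foreignKeys" entry uses the
# Table Schema layout: a local field pointing at a field in another resource.
descriptor = {
    "fields": [
        {"name": "sample_id", "type": "string"},
        {"name": "channel", "type": "string"},
    ],
    "foreignKeys": [
        {
            "fields": "sample_id",
            "reference": {"resource": "samples", "fields": "sample_id"},
        }
    ],
}

# Invented example data: the second image row points at a sample that does not exist.
tables = {
    "samples": [{"sample_id": "S1"}, {"sample_id": "S2"}],
    "images": [
        {"sample_id": "S1", "channel": "DAPI"},
        {"sample_id": "S9", "channel": "DAPI"},
    ],
}


def find_dangling_keys(rows, descriptor, tables):
    """Return (row index, field, value) for every value missing from its referenced table."""
    problems = []
    for foreign_key in descriptor.get("foreignKeys", []):
        local_field = foreign_key["fields"]
        referenced_table = foreign_key["reference"]["resource"]
        referenced_field = foreign_key["reference"]["fields"]
        # Collect the values that actually exist in the referenced table.
        known = {row[referenced_field] for row in tables[referenced_table]}
        for index, row in enumerate(rows):
            if row[local_field] not in known:
                problems.append((index, local_field, row[local_field]))
    return problems


problems = find_dangling_keys(tables["images"], descriptor, tables)
print(problems)
```

A per-table validator that sees only one file at a time cannot catch this kind of error, since the dangling `sample_id` is a perfectly valid string on its own; the check only becomes possible once the schema records the dependency between tables.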
Also, I think maintaining the schema as a flat-table-of-fields behind the scenes would simplify the schema designers' work, as it relieves them of the need to hand-edit schema specification files whose syntax is really designed for the use case of machine reading.
I don't have any kind of association with FD. I'm in the Nadeem lab at MSKCC, and we're making software that would stand to benefit from the MITI standards.
The MITI spec is really comprehensive, and many groups would benefit from standardization in this domain! Thank you for your work in this important effort.