More structured schema format? #12

jimmymathews · 2022-01-03T07:40:47Z

It seems that the YAML format for table data specification is informed by the choice of Cerberus as an underlying validator.

Is the consortium willing to consider Frictionless Data (FD) instead as the underlying data schema specification framework? It has broad tooling support, with Python, JS, and bash tools that make FD data packages natively available in various settings (local prototyping, headless/remote computation, web applications). As an example of a project that uses FD under the hood, see the C2M2 (Crosscut Metadata Model).

If such a move (albeit quite significant) could be on the table, in a few days I can make a PR to share some scripts that I am currently using to:

1. Automatically convert the MITI spec into (i) a flat table of fields, (ii) a table of tables, and (iii) a few separate files for the "valid values" currently spread across various YAML files.
2. Automatically convert (i), (ii), and (iii) into an FD data package ready to be populated with data.

(The above (1) is not a perfect translation yet.)
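A minimal sketch of what conversion (1) might look like, assuming the YAML has already been parsed into nested dictionaries (table name, then field name, then attributes). The `SPEC` literal, the column names, and the `flatten` and `to_csv` helpers are all hypothetical illustrations, not the scripts from the proposed PR.

```python
import csv
import io

# Hypothetical parsed form of one MITI YAML file:
# table name -> field name -> attributes.
SPEC = {
    "file": {
        "Immersion Medium": {
            "description": "the imaging medium affects the working NA of the objective",
            "type": "string",
            "valid-values": ["Air", "Water", "Glycerin", "Oil"],
            "significance": "recommended",
        },
    },
}


def flatten(spec):
    """Split a nested spec into field rows, table rows, and valid-value rows."""
    fields, tables, values = [], [], []
    for table, table_fields in spec.items():
        tables.append({"table": table, "field_count": len(table_fields)})
        for name, attrs in table_fields.items():
            fields.append({
                "table": table,
                "field": name,
                "type": attrs.get("type", ""),
                "significance": attrs.get("significance", ""),
                "description": attrs.get("description", ""),
            })
            valid = attrs.get("valid-values")
            # Only enumerated lists go to the valid-values table; range
            # constraints such as {"min": 1} would need separate handling.
            if isinstance(valid, list):
                values.extend(
                    {"table": table, "field": name, "value": v} for v in valid
                )
    return fields, tables, values


def to_csv(rows):
    """Render a non-empty list of uniform dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fields, tables, values = flatten(SPEC)
print(to_csv(fields))
print(to_csv(values))
```

The three row lists correspond to outputs (i), (ii), and (iii); each could be written to its own CSV file.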

Moving to FD offers the advantage that the general purpose validation functionality is high quality and maintained by someone else. A specific advantage is the possibility of foreign-key integrity checks made possible by the schema's awareness of dependencies between tables.
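To illustrate the foreign-key point, here is a rough sketch. The descriptor uses the `foreignKeys` property from the Frictionless Table Schema specification, but the resource names, field names, and sample rows are hypothetical, and the check is written by hand in plain Python rather than with the FD tooling, purely to show the idea. It only handles single-field keys given as strings.

```python
# Descriptor in the style of a Frictionless data package.
# Resource and field names are hypothetical.
PACKAGE = {
    "resources": [
        {
            "name": "immersion_media",
            "schema": {
                "fields": [{"name": "name", "type": "string"}],
                "primaryKey": "name",
            },
        },
        {
            "name": "file",
            "schema": {
                "fields": [
                    {"name": "file_id", "type": "string"},
                    {"name": "immersion_medium", "type": "string"},
                ],
                "foreignKeys": [
                    {
                        "fields": "immersion_medium",
                        "reference": {"resource": "immersion_media", "fields": "name"},
                    }
                ],
            },
        },
    ]
}

# Made-up rows for illustration only.
DATA = {
    "immersion_media": [{"name": "Air"}, {"name": "Water"}],
    "file": [
        {"file_id": "f1", "immersion_medium": "Water"},
        {"file_id": "f2", "immersion_medium": "water"},  # wrong case
    ],
}


def foreign_key_violations(package, data):
    """Return (resource, row index, field, value) for every dangling reference."""
    violations = []
    for resource in package["resources"]:
        for fk in resource["schema"].get("foreignKeys", []):
            ref = fk["reference"]
            allowed = {row[ref["fields"]] for row in data[ref["resource"]]}
            for index, row in enumerate(data[resource["name"]]):
                value = row[fk["fields"]]
                if value not in allowed:
                    violations.append((resource["name"], index, fk["fields"], value))
    return violations


print(foreign_key_violations(PACKAGE, DATA))
```

In practice the FD validation tools would perform this check from the descriptor alone, which is the maintenance advantage being described.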

Also, I think maintaining the schema as a flat-table-of-fields behind the scenes would simplify the schema designers' work, as it relieves them of the need to hand-edit schema specification files whose syntax is really designed for the use case of machine reading.

I don't have any kind of association with FD. I'm in the Nadeem lab at MSKCC, and we're making software that would stand to benefit from the MITI standards.

The MITI spec is really comprehensive, and many groups would benefit from standardization in this domain! Thank you for your work in this important effort.

ArtemSokolov · 2022-01-03T15:32:07Z

Thanks, @jimmymathews. I think having a Frictionless Data representation would be extremely valuable, but my vote would be for "in addition" rather than "instead".

In general, I completely agree that having a human-readable representation simplifies schema maintenance. This is actually what motivated us to use YAML instead of the more popular JSON format. (As you know, YAML is often used for configuration specifications, e.g., in Kubernetes deployments, because YAML files are intended to be both machine- and human-readable.) The challenge for a flat-table representation is that the specifications are ragged. For example, it is not obvious how to represent

MITI/03-file.yaml

Lines 87 to 95 in 5311594

    
    Immersion Medium:
      description: the imaging medium affects the working NA of the objective
      type: string
      valid-values:
      - Air
      - Water
      - Glycerin
      - Oil
      significance: recommended

and

MITI/03-file.yaml

Lines 100 to 105 in 5311594

    
    Frame_ Averaging:
      description: Number of frames averaged together (if no averaging, set to 1)
      type: integer
      valid-values:
        min: 1.0
      significance: recommended

in the same table, in part because the format and length of valid-values: depend on type:. The standard approach (and, it sounds like, the approach you're taking) would be to use multiple tables that cross-reference each other. To me, it's not immediately obvious that a collection of tables offers much of an advantage over a single self-contained YAML file, when it comes to by-hand editing. But this could just come down to personal preference.
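One possible way to handle the raggedness, sketched here under stated assumptions: map a list-valued valid-values entry to the Table Schema `enum` constraint and a `min` entry to the `minimum` constraint. The `to_table_schema_field` helper is hypothetical, the field names are copied verbatim from the two excerpts above, and only the two shapes shown in those excerpts are handled.

```python
# The two fields from the excerpts above, as a YAML parser might return them.
FILE_TABLE = {
    "Immersion Medium": {
        "description": "the imaging medium affects the working NA of the objective",
        "type": "string",
        "valid-values": ["Air", "Water", "Glycerin", "Oil"],
        "significance": "recommended",
    },
    "Frame_ Averaging": {
        "description": "Number of frames averaged together (if no averaging, set to 1)",
        "type": "integer",
        "valid-values": {"min": 1.0},
        "significance": "recommended",
    },
}


def to_table_schema_field(name, attrs):
    """Translate one MITI-style field into a Table Schema style field descriptor."""
    field = {
        "name": name,
        "type": attrs["type"],
        "description": attrs["description"],
    }
    valid = attrs.get("valid-values")
    if isinstance(valid, list):
        # Enumerated values become an "enum" constraint.
        field["constraints"] = {"enum": list(valid)}
    elif isinstance(valid, dict) and "min" in valid:
        # Range limits become a "minimum" constraint, cast to the field type.
        cast = int if attrs["type"] == "integer" else float
        field["constraints"] = {"minimum": cast(valid["min"])}
    return field


schema_fields = [to_table_schema_field(n, a) for n, a in FILE_TABLE.items()]
for field in schema_fields:
    print(field["name"], field["constraints"])
```

Both fields end up with the same set of keys, so the ragged part is confined to the contents of `constraints`. The `significance` attribute has no obvious Table Schema counterpart and is dropped in this sketch.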

My personal vote would be to continue maintaining the "source" specification in YAML format, but begin accumulating scripts that automatically translate the "source" spec into other representations, including flat tables and Frictionless Data. In that sense, I fully welcome the proposed PR, which we can wrap into another GitHub Action to keep all representations automatically synchronized.

I am curious to hear what @DenisSch, @jmuhlich and @adamjtaylor think.

jimmymathews · 2022-01-03T15:56:59Z

This makes sense... A synchronized multi-format schema would be very useful across multiple domains. Although, of course, one must be chosen as the canonical reference, and synchronization is non-trivial work.

You're right that my approach in your example would be to separate out the "valid values" into a separate file. But the reason for this is not to deal with the raggedness in 03-file.yaml. It is rather because I would regard Air, Water, Glycerin, Oil as first-class data of its own, not schema information. Different datasets may involve different "Immersion medium" values, and some may even have additional information about specific media. I would think that hard coding data values in the schema specification would lead to habitual non-conformity to the spec.

jimmymathews · 2022-01-03T16:40:25Z

You write that: "To me, it's not immediately obvious that a collection of tables offers much of an advantage over a single self-contained YAML file, when it comes to by-hand editing."

This sounds right, but this comment probably means I miscommunicated a bit.

To clarify: I'm suggesting that the schema be located in (essentially) one file -- the fields table. This one file would be hand-edited by schema designers. The "collection of tables", multiple files, refers to the data bundles, not the schema/spec. As things stand, there are 8 YAML files comprising the spec, not "a single self-contained YAML file".

ArtemSokolov · 2022-01-03T19:42:15Z

> As things stand, there are 8 YAML files comprising the spec, not "a single self-contained YAML file".

@jimmymathews Sorry, I meant that replacing any one YAML file with multiple tables does not confer an advantage for by-hand editing (in my opinion). My interpretation of your original post was that the 8 YAML files (each of which is self-contained) would be replaced by a collection of tables, with each table adhering to a particular convention, such as Tidy Data for example. However, it sounds like I am misinterpreting your proposal?

To assist with future discussion, can you maybe share a small example of what you envision as a "one file, multiple table" format for by-hand maintenance?

> It is rather because I would regard Air, Water, Glycerin, Oil as first-class data of its own, not schema information.

I'm probably not the best person to comment on valid-values, which were defined by domain experts, but my understanding is that it is preferable to have these enforced by the schema instead of allowing data providers to define their own. Centralizing valid values in the schema enables standardization across datasets and removes a whole class of wrangling issues associated with different sources using "Water" vs. "water", etc.

Your point about certain scenarios not being covered by existing definitions is very valid. However, given that MITI is still in its infancy, I would advocate for iterative refinement of the schema (even with simple additions of Other to valid-value fields where appropriate) to help with conformance.

Maybe @santas01, @arenasg, @acraquel, @clarenceyapp and others who helped define existing valid-values can comment further.

jimmymathews · 2022-01-03T21:18:49Z

Sorry, I think I wrongly suggested the general desirability of users choosing alternate "valid values". Of course, the standard should take a hard line on valid values. My point is that standardizing names alone isn't enough.

Air, Water, etc. are indeed very much needed, for the reasons you point out, as standardized names for real things, kinds of immersion media. But these real things have still not been explained or described.

As things stand a prospective data provider is still left wondering what state of affairs they are claiming holds if they list Water on a file record. (Should they do so if distilled water was used, for example?) At the very least, a 2-column "immersion media" table is needed, with "Name" (to link up with values in other tables) and "Description" (to explain what is meant when that value is chosen).
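A rough sketch of such a 2-column table, with a small helper that reports how much of it is defined. The table layout and the `definition_coverage` helper are hypothetical. Only the Water wording comes from this thread (it appears in a later comment); the other descriptions are deliberately left empty rather than invented.

```python
# Hypothetical "immersion media" table with Name and Description columns.
IMMERSION_MEDIA = [
    {"Name": "Air", "Description": ""},
    {
        "Name": "Water",
        "Description": (
            "Distilled water or any of the commercially sold aqueous "
            "solutions, which are essentially distilled water."
        ),
    },
    {"Name": "Glycerin", "Description": ""},
    {"Name": "Oil", "Description": ""},
]


def definition_coverage(table):
    """Return the fraction of rows with a description and the names still missing one."""
    missing = [row["Name"] for row in table if not row["Description"].strip()]
    covered = (len(table) - len(missing)) / len(table)
    return covered, missing


coverage, missing = definition_coverage(IMMERSION_MEDIA)
print(f"{coverage:.0%} of values defined; still missing: {missing}")
```

The "Name" column would link up with values in other tables, and the coverage figure gives a simple way to track annotation progress over time.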

This seems to be a systematic issue with the dozens of fields with mere controlled string values in the spec. On the basis of the specification, data providers do not know what they are claiming scientifically by putting out a compliant dataset, and they do not know what is claimed in datasets they encounter.

Here is the autogenerated fields table I have been referring to (not perfect yet!). And the tables table.

clarenceyapp · 2022-01-04T16:32:01Z

Hi @jimmymathews. I was one of the members who selected the metadata fields for the microscopy/imaging tables, such as immersion media. I'd like to understand what issue you're finding with some of the fields.
For common objective lenses, there are only a limited number of possible immersion media one can physically use without damaging the lens, which we've listed as valid values. Lenses usually can only take one type of immersion medium, which is labelled on the side of the lens. If it's a water immersion lens, then the user should be using water. If it's a multi-immersion lens, this should still match one of the valid values we've included. By 'Water', we do mean distilled water or any of the commercially sold aqueous solutions, which are essentially distilled water. Are you suggesting we need to include further granularity between water and distilled water? Any other types of water are not suitable for this sort of image capture.

You mentioned that data providers do not know what they are claiming scientifically by putting out a compliant dataset. How does one design an experiment without knowing what immersion media (and other settings) they've been instructed to use or have used? This would have been done prospectively before data acquisition.

I understand that it would be nice to have a detailed description of every single field, but as you know, there are a lot of fields and it would take a long time to curate. In some cases, a concise description of what a field means is simply not possible without turning it into a lecture. The original point of the MITI standards was to hold metadata, not to act as instructional media. There are several microscopy websites that give thorough explanations, which we can link out to if that is useful.

jimmymathews · 2022-01-04T18:29:28Z

Thanks @clarenceyapp, yes, I think you have really homed in on the precise issue! Also, thank you all for your patience with my somewhat unclear posts.

Definitions. You are correct that the main thing I am after is a "description of every single field" (and every hard-coded value in the specification). You are also correct that this is an onerous task to do comprehensively. However:

The schema should not be designed to prevent the inclusion of descriptions where they are concise and known.
By making a public/community standard, as opposed to an internal/proprietary one, the standard gets to benefit from the input of many people. In the community context, the task of annotating 275 fields should not be regarded as too difficult to attempt. Even 50% coverage would be great. Frankly if they are as self-evident as you seem to hope they are, this should take an expert only a couple of minutes per field.

I would also distinguish between "definitions" and "descriptions", and suggest that definitions are needed rather than "detailed descriptions". You are of course correct that a metadata specification ought not to be instructional media. But a minimum of semantic content -- one-sentence definitions -- should be considered the baseline for sharing data on a sound footing.

Best case scenario. The absolute best case scenario, that handles all the problems raised in this thread, would be cases where a pre-existing formal ontology already covers the field. For example, the clinical file has a 'Gender' field. The definition need not be any more complicated than:

"phenotypic sex" as defined by the Phenotype and Trait Ontology: http://purl.obolibrary.org/obo/PATO_0001894

(Link phenotypic sex.)

This is the best case scenario because it wholly reuses others' work on the problem, answering your complaint that this onerous annotation task isn't the point of the metadata standard.
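As a sketch of how such a definition could travel with the schema, a Table Schema style field descriptor can carry both a short `description` and an `rdfType` pointing at the ontology term. The descriptor below and the `has_definition` helper are hypothetical illustrations; only the field name, the quoted definition, and the PATO URL come from this comment.

```python
import json

# Hypothetical field descriptor for the clinical 'Gender' field.
GENDER_FIELD = {
    "name": "Gender",
    "type": "string",
    "description": '"phenotypic sex" as defined by the Phenotype and Trait Ontology',
    "rdfType": "http://purl.obolibrary.org/obo/PATO_0001894",
}


def has_definition(field):
    """A field counts as defined if it has a description or an ontology term."""
    return bool(field.get("description")) or bool(field.get("rdfType"))


print(json.dumps(GENDER_FIELD, indent=2))
print(has_definition(GENDER_FIELD))
```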

Of course, in any scenario, one still has to decide what one means, and a term/name is not enough. It might seem that what is meant by "gender" is self-evident to experienced researchers in the microscopy domain, but a moment's reflection shows this not to be the case. Some species have a karyotypic sex concept. The karyotypic sex system is different for birds and for mammals. And so on. Choosing a definition, even an incomplete one, helps to immunize against getting mired in such tangential issues, raised by pedants like myself.

Clarification. You write:

"You mentioned that data providers do not know what they are claiming scientifically by putting out a compliant dataset. How does one design an experiment without knowing what immersion media (and other settings) they've been instructed to use or have used? This would have been done prospectively before data acquisition."

I think you misunderstood my point. The investigator of course knows what they would like to claim scientifically. What is problematic is that the MITI specification currently prevents them from communicating the content of this claim via MITI-compliant datasets. My complaint is answered in this case if Water comes with this definition text:

Distilled water or any of the commercially sold aqueous solutions, which are essentially distilled water.

ArtemSokolov · 2022-01-04T22:45:22Z

There is always going to be some discrepancy in the level of expert knowledge when it comes to specific fields and values. I don't necessarily think that introducing additional complexity to the schema is the best way to close that knowledge gap. What I propose is that we make better use of the MITI website (https://www.miti-consortium.org/) and create additional pages with links to existing microscopy resources and/or precise definitions of Water, etc. This would avoid potentially unnecessary expansion of the schema, while providing a reference resource for non-experts.

So, maybe to summarize the action items:

1. Implement scripts that can convert between the various representations, including but not limited to YAML, flat tables, and Frictionless Data.
2. Expand the MITI website with references to standard microscopy resources and/or field/value definitions.
3. Decide on a representation to be the canonical reference, and wrap all scripts that convert it to other representations into GitHub Actions that will trigger on new PRs and git merges.

I think 3. will likely come down to the preference of MITI maintainers. To me, it makes sense to make their jobs as easy as possible and automate everything else.

We will probably hear some more thoughts from maintainers and the governing board next week; I believe a number of folks are still on vacation.

jimmymathews · 2022-01-05T18:12:30Z

This plan makes sense. It seems this Issue is more or less settled. Vote to close?

I plan to submit a PR with scripts to support part of item (1). Items (2) and (3) should perhaps get their own Issues.

ArtemSokolov · 2022-01-05T18:15:51Z

Thanks, @jimmymathews
Maybe close via PR? But it's up to you if you want to close now.

DenisSch · 2022-01-06T18:10:32Z

I am happy to organise a call with @jimmymathews after our next governance meeting (next week) to discuss this topic in more detail efficiently.

jimmymathews · 2022-01-07T18:26:17Z

That would be great, count me in. Thank you @DenisSch

jimmymathews mentioned this issue Jan 6, 2022

Some schema converters #13

Closed

ArtemSokolov mentioned this issue Jun 7, 2022

Updated GHA to trigger on tag and generate a release #19

Merged

ArtemSokolov closed this as completed in #19 Jun 9, 2022
