Clarify plate and well specifications for sparse plates #24

Merged
merged 6 commits into from
Feb 2, 2022

Conversation

melissalinkert
Member

Starting point for discussion. The main scenarios to clarify are plates that are missing entire row(s)/column(s), and wells with a field in some (but not all) of the defined acquisitions.

@will-moore
Member

Looks good. 👍

@sbesson
Member

sbesson commented Dec 17, 2020

The suggested changes also read fine to me and are in line with the decisions made for the first version of the HCS specification. From my side, this commit could also be ported directly to the 0.1/index.bts specification.

As discussed recently, as we start applying the OME-Zarr HCS specification to more real-world HCS use cases, especially sparse plates, we might need to review and reconsider how we handle these in the specification. This can be captured and discussed as a separate issue.

@melissalinkert
Member Author

0c28690 expands on the sparse plate handling to explicitly identify the row and column for each well. glencoesoftware/bioformats2raw#91 is a corresponding proposed implementation.

Both are based on discussion with @kkoz and @chris-allan. In the sparse plate example where only C5 and D7 are acquired, a human reading the JSON can clearly see that C/5 means C5 and D/7 means D7, but the only way to automatically calculate that is to split the well path on / and match each token against rows and columns.
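
The path-splitting approach described above can be sketched as follows. This is only an illustration, assuming that `rows` and `columns` are lists of objects with a `name` key and that the well path is exactly `<row name>/<column name>`; the helper name `indices_from_path` is hypothetical and not part of any implementation.

```python
def indices_from_path(path, rows, columns):
    """Derive (row index, column index) from a well path such as "C/5".

    Assumes the path has exactly two tokens separated by "/".
    """
    row_name, column_name = path.split("/")
    row_names = [row["name"] for row in rows]
    column_names = [column["name"] for column in columns]
    # list.index raises ValueError if a token matches no row/column name
    return row_names.index(row_name), column_names.index(column_name)


# Hypothetical 8 x 12 plate layout (rows A-H, columns 1-12)
rows = [{"name": letter} for letter in "ABCDEFGH"]
columns = [{"name": str(number)} for number in range(1, 13)]
```

Every consumer has to repeat this token matching for each well, which is the cost that explicit row and column indexes are meant to avoid.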

Happy to split 0c28690 into a separate issue if that's easier to discuss.

@will-moore
Member

There is a proposal to simplify the specifications of "collections" #31.
I assume that this will replace the existing HCS spec with something more generic.
Currently it looks like we are nearing some consensus on the overall structure of the data.
But haven't yet decided on any specific keywords for adding e.g. HCS metadata.
I'll try and come up with a suggestion, although it may initially not include plate-acquisition info.

latest/index.bs Outdated
<dt><strong>version</strong></dt>
<dd>A string defining the version of the specification.</dd>
<dt><strong>wells</strong></dt>
<dd>A list of JSON objects defining the wells of the plate. Each well object
MUST contain a `path` key identifying the path to the well subgroup.</dd>
MUST contain a `path` key identifying the path to the well subgroup.
Each well object MUST contain both a `row_index` key identifying the index into
Member

Regarding #24 (comment), are there cases where it is not possible to recompute these indexes based on the knowledge of the individual wells path as well as the rows and names dictionaries? If recomputing is always possible (but at the cost of the consumer), my primary consideration is whether the recommendation for these new fields should be SHOULD rather than MUST.

For real-world examples, I can definitely see how row_index/column_index makes sense in terms of optimizing some of the queries. In addition to testing this with sparse plates, it will be useful to also generate a representative plate with many wells (at least 384) to check that there is no performance impact from the extra JSON metadata.

Member

In order for these indexes to be forward or reverse computable, path would need to be much more explicitly defined than it is now:

A list of JSON objects defining the fields of views for a given well. Each object MUST contain a path key identifying the path to the field of view. If multiple acquisitions were performed in the plate, it SHOULD contain an acquisition key identifying the id of the acquisition which must match one of acquisition JSON objects defined in the plate metadata.

Furthermore, the wells array would need to have null or similar padding in order for those indexes to make sense.

Neither of these things are ideal obviously. I don't think there's a way to not have these things be MUST if we want to guarantee that lookups can happen based on physical plate characteristics.
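
As a rough illustration of a lookup based on physical plate position, here is a minimal sketch. It assumes each well object carries a `path` plus the two index keys, written here in the camelCase form (`rowIndex`, `columnIndex`) that is adopted further down in this thread; the function name `build_well_lookup` is hypothetical.

```python
def build_well_lookup(wells):
    """Map (rowIndex, columnIndex) to the well path, without parsing paths."""
    return {(well["rowIndex"], well["columnIndex"]): well["path"] for well in wells}


# Sparse plate where only C5 and D7 were acquired; no padding entries needed
wells = [
    {"path": "C/5", "rowIndex": 2, "columnIndex": 4},
    {"path": "D/7", "rowIndex": 3, "columnIndex": 6},
]
lookup = build_well_lookup(wells)
```

With explicit indexes the `wells` list can stay sparse: a position that was not acquired is simply absent from the mapping.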

@melissalinkert
Member Author

5a1ddc7 is based on glencoesoftware/bioformats2raw#119 and discussion with @chris-allan earlier today, in preparation for discussion with @sbesson tomorrow. The proposed changes around well path in particular are still up for debate.

latest/index.bs Outdated
additional leading or trailing directories.
Each well object MUST contain both a `row_index` key identifying the index into
the `rows` list and a `column_index` key indentifying the index into
the `columns` list. `row_index` and `column_index` MUST be 0-based.</dd>
Member

I realise that #70 has been added after this PR was opened, but the decision there means these new attributes should now be named rowIndex and columnIndex.

Member Author

Should be fixed in 7c2536a.

@melissalinkert
Member Author

Following discussion with @sbesson and @chris-allan, 3c31c14 relaxes the "no empty groups" statement to address #24 (comment). There are also some clarifications to the row and column naming, intended to be consistent with https://www.openmicroscopy.org/Schemas/Documentation/Generated/OME-2016-06/ome_xsd.html#NamingConvention.

@sbesson
Member

sbesson commented Dec 15, 2021

Overview

The specification changes proposed in this PR closely reflect several choices made in omero-cli-zarr when implementing the first version of the HCS metadata. Effectively, this moves implementation details to the specification level, clearing up several ambiguities when dealing with sparse plates. The advantages are:

  1. reduce the divergence between writing implementations
  2. clarify the expectations for consumers when dealing with NGFF datasets implementing the HCS specification

RFC and Community call

This PR has now passed several rounds of internal review and is reaching the state where community feedback would be useful before integrating formally in an upcoming version of the specification. Given the latest announcement of the next NGFF call, I would propose to set the week of 2022-01-24 as the deadline for public comments. Ideally, we can review the state of this proposal, reach an agreement and decide on the timeline for getting these changes released as part of this community call.

Specific comments

Empty Zarr groups

A former version of the proposal forbade the existence of Zarr groups for wells and well rows containing no images. Following the feedback from #24 (comment), the latest version relaxes this to a recommendation. I can think of rationales backing both specifications. Importantly, the biggest decision factor might be at the level of the consumer library:

  • assuming a strict proposal (MUST NOT), a library can either use the existence of Zarr group or the wells metadata to determine whether wells are populated with images
  • assuming the more lenient proposal (SHOULD NOT), a library cannot rely on the existence of Zarr groups. Instead, the wells field acts as the single source of truth for whether wells are populated.
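
Under the lenient proposal, a consumer would consult only the `wells` metadata. The following is a minimal sketch of that behavior, assuming the plate metadata has already been loaded into a dictionary; the function names and the sample metadata are hypothetical.

```python
def populated_well_paths(plate):
    """Return the well paths that the plate metadata declares as populated."""
    return {well["path"] for well in plate["wells"]}


def is_populated(plate, path):
    """Decide from metadata alone; the existence of a Zarr group is ignored."""
    return path in populated_well_paths(plate)


# Hypothetical sparse plate: only C5 and D7 contain images
plate = {
    "wells": [
        {"path": "C/5", "rowIndex": 2, "columnIndex": 4},
        {"path": "D/7", "rowIndex": 3, "columnIndex": 6},
    ]
}
```

Even if an empty group happened to exist at `A/1`, this check would report the well as unpopulated because it is not listed in `wells`.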

Rows/columns names

The new requirements regarding the content of the rows and columns arrays make it possible to communicate a representation of the physical plate layout independently of whether wells are populated or not.

Constraints have been added to the name definition of rows and columns. This should make broken scenarios like duplicate row/column names invalid as per the specification. Additionally, these constraints support the ubiquitous convention in the High-Content Screening domain of using letters/numbers for rows/columns e.g. row A, column 2 while still catering for some flexibility in the naming of rows/columns.
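
A validator could flag the duplicate-name scenario along these lines. This is a sketch only, assuming rows and columns are lists of objects with a `name` key; it checks uniqueness and nothing else, and `duplicate_names` is a hypothetical helper.

```python
from collections import Counter


def duplicate_names(entries):
    """Return the sorted names that occur more than once in rows or columns."""
    counts = Counter(entry["name"] for entry in entries)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical inputs: one broken row list, one valid column list
broken_rows = [{"name": "A"}, {"name": "B"}, {"name": "A"}]
valid_columns = [{"name": "1"}, {"name": "2"}]
```

An empty result means the list satisfies the uniqueness constraint.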

Wells indices

A major change in this proposal is that each well element now requires three keys: a path AND a rowIndex AND a columnIndex. The first key is unchanged compared to the previous specs and specifies the path to the Zarr group. The two indices make it possible to link this group to the associated row/column in the plate metadata.

The examples in the specification page as well as the bioformats2raw and omero-cli-zarr implementations use systematic naming conventions where the path to the well is derived from the names of the corresponding row and column, e.g. the well corresponding to row A (rowIndex: 0) and column 2 (columnIndex: 1) is located in path A/2. This representation has obvious readability advantages, but this behavior is not enforced by the current proposal, i.e. libraries should assume that the path to individual wells is independent of the row/column names.
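
To make the independence of path and names concrete, here is a small sketch that resolves a well label through the indices only. The plate metadata is a hypothetical example and `well_label` is an illustrative helper, not part of any library.

```python
# Hypothetical 2 x 2 plate: one well follows the <row>/<column> naming
# convention, the other uses an index-style path
plate = {
    "rows": [{"name": "A"}, {"name": "B"}],
    "columns": [{"name": "1"}, {"name": "2"}],
    "wells": [
        {"path": "A/2", "rowIndex": 0, "columnIndex": 1},
        {"path": "1/0", "rowIndex": 1, "columnIndex": 0},
    ],
}


def well_label(plate, well):
    """Build a human-readable label from the indices, never from the path."""
    row_name = plate["rows"][well["rowIndex"]]["name"]
    column_name = plate["columns"][well["columnIndex"]]["name"]
    return f"{row_name}{column_name}"


labels = {well["path"]: well_label(plate, well) for well in plate["wells"]}
```

Both wells resolve to a label even though only the first path happens to spell out the row and column names.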

As discussed in #24 (comment), an alternate proposal would be to force a mapping between the path to the Zarr group of the well and the names of the row and column associated with this well. Under such a proposal, it would become superfluous to require both the path and rowIndex/columnIndex attributes in the wells array as one could be recomputed from the other.
Probably the biggest trade-offs up for discussion here are:

  • flexibility e.g. support for well paths of type 0/0
  • size of the plate metadata e.g. for a 1536-well plate
  • performance i.e. cost of the name <-> index lookup
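
The metadata-size trade-off can be estimated with a quick experiment. The sketch below assumes a 1536-well plate laid out as 32 rows by 48 columns, uses index-style paths, and measures the serialized `wells` list only; the function names are hypothetical and the numbers will vary with path naming and JSON formatting.

```python
import json


def make_wells(n_rows, n_columns, with_indices):
    """Generate a dense wells list, optionally with rowIndex/columnIndex."""
    wells = []
    for row in range(n_rows):
        for column in range(n_columns):
            well = {"path": f"{row}/{column}"}
            if with_indices:
                well["rowIndex"] = row
                well["columnIndex"] = column
            wells.append(well)
    return wells


def json_size(wells):
    """Size in bytes of the wells list serialized as UTF-8 JSON."""
    return len(json.dumps(wells).encode("utf-8"))


size_with_indices = json_size(make_wells(32, 48, with_indices=True))
size_without_indices = json_size(make_wells(32, 48, with_indices=False))
```

Comparing the two sizes gives a first estimate of the overhead of the extra keys for the largest common plate format.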

Well metadata

At the well group level, for multi-acquisition plates, the acquisition key is now a mandatory key. An alternative would be to define some default behavior if this field is absent, e.g. the first element of the top-level acquisitions array. Multi-acquisition plates are the exception rather than the norm, but even in these scenarios, this change sounds completely reasonable to me.
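
A check for this requirement might look like the following sketch. It assumes that plate metadata holds an `acquisitions` list of objects with an `id` key, that well metadata holds an `images` list, and that "multi-acquisition" means two or more acquisitions; `invalid_acquisition_images` is a hypothetical name.

```python
def invalid_acquisition_images(plate, well):
    """Return paths of images whose acquisition key is missing or unknown.

    Only applies to plates that define two or more acquisitions.
    """
    acquisitions = plate.get("acquisitions", [])
    if len(acquisitions) < 2:
        return []
    known_ids = {acquisition["id"] for acquisition in acquisitions}
    return [
        image["path"]
        for image in well["images"]
        if image.get("acquisition") not in known_ids
    ]


# Hypothetical metadata: the second field lacks the acquisition key
multi_plate = {"acquisitions": [{"id": 1}, {"id": 2}]}
single_plate = {"acquisitions": [{"id": 1}]}
well = {"images": [{"path": "0", "acquisition": 1}, {"path": "1"}]}
```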

Samples and implementations

The specification includes a few examples of metadata for sparse HCS data that complement the existing examples of dense plates. As for every release, representative real-world HCS examples should be generated covering as many features as possible.

In terms of implementation, glencoesoftware/bioformats2raw#119 contains the implementation of these changes for bioformats2raw. Changes to omero-cli-zarr are expected in order to support the new attributes. Consuming libraries like vizarr could possibly be updated to benefit from the index lookup.

@sbesson
Member

sbesson commented Dec 16, 2021

Another comment, while looking at validation this morning, is that the current specification does not define the level of requirement for the keys under plate and well.

The initial JSON schema introduced in https://github.com/ome/ngff/pull/76/files#diff-2e387106f2f394aca19236f21c170f70b94c7931f60bfc2d8f6549941105e0cfR105 defines version, rows, columns and wells as required for plate. This would leave acquisitions, field_count and name as optional. This is largely in line with the spirit of the changes here, with possibly a discussion around version as this key is marked as recommended in the other specifications. That being said, I am personally in favor of enforcing version as a requirement everywhere in the mid-term.

For well, I assume images is required and version is either required or recommended, aligning with the decision regarding plates.
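
The requirement levels discussed above could be checked with something like this sketch. The required-key tuples mirror the comment (plate: version, rows, columns, wells; well: images); the status of `version` for wells is left out because it is still undecided here, and `missing_keys` is a hypothetical helper rather than the JSON schema itself.

```python
# Required keys as described in the comment above; not an official schema
PLATE_REQUIRED = ("version", "rows", "columns", "wells")
WELL_REQUIRED = ("images",)


def missing_keys(metadata, required):
    """Return the required keys that are absent, in declaration order."""
    return [key for key in required if key not in metadata]


# Hypothetical plate metadata that omits the version key
plate_metadata = {"rows": [], "columns": [], "wells": [], "name": "test"}
well_metadata = {"images": [{"path": "0"}]}
```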

@sbesson sbesson mentioned this pull request Jan 28, 2022
13 tasks
@sbesson sbesson added this to the 0.4 milestone Feb 1, 2022
Member

@sbesson sbesson left a comment

After reviewing the discussions from the 6th OME-NGFF community call, no objection/amendment was made for this proposal. The various HCS aware OME-Zarr implementations have been updated to support the new proposed layout. Merging in preparation of the upcoming 0.4 specification announcement.

@sbesson sbesson merged commit 416a377 into ome:main Feb 2, 2022
github-actions bot added a commit that referenced this pull request Feb 2, 2022
Clarify plate and well specifications for sparse plates

SHA: 416a377
Reason: push, by @sbesson

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@imagesc-bot

This pull request has been mentioned on Image.sc Forum. There might be relevant details there:

https://forum.image.sc/t/next-call-on-next-gen-bioimaging-data-tools-2022-01-27/60885/11

@sbesson sbesson mentioned this pull request Mar 16, 2022
2 tasks