Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: don't mix arrays and groups #107

Open
d-v-b opened this issue Mar 10, 2022 · 5 comments
Open

Proposal: don't mix arrays and groups #107

d-v-b opened this issue Mar 10, 2022 · 5 comments

Comments

@d-v-b
Copy link
Contributor

d-v-b commented Mar 10, 2022

In the current spec, the zarr group for an image contains arrays (the different scale levels of the image data) as well as (optionally) a group (the labels group). E.g.,

foo.zarr                    # group, contains groups and arrays
foo.zarr/s0                 # array, scale level 0
foo.zarr/labels             # group, contains multiscale groups
foo.zarr/labels/original    # multiscale group, contains arrays
foo.zarr/labels/original/s0 # array, scale level 0

Storing arrays and groups in the same level of hierarchy can complicate semantics.
I would propose the following structure instead:

foo.zarr                    # group, contains groups
foo.zarr/image              # multiscale group, contains arrays
foo.zarr/image/s0           # array, scale level 0
foo.zarr/labels             # group, contains multiscale groups
foo.zarr/labels/original    # multiscale group, contains arrays
foo.zarr/labels/original/s0 # array, scale level 0

Under with this scheme, we can define a "multiscale group" as a collection of arrays with a specific group metadata scheme, and then re-use that definition.

I chose the name "image" arbitrarily, it could also be something like "raw" or "default", and maybe the name of this group could be reserved.

@joshmoore
Copy link
Member

Storing arrays and groups in the same level of hierarchy can complicate semantics.

In the case of labels, I can definitely see correcting the structure as it currently stands. There has been ongoing confusion as seen with the currently open "labels" issues that we could likely address at the same time.

I'm a bit more hesitant about saying generally that arrays and groups can't be at the same level. My preference would be to have metadata that clearly defines the semantics.

I chose the name "image" arbitrarily, it could also be something like "raw" or "default", and maybe the name of this group could be reserved.

Similarly, I would avoid trying to encode too much in the names because invariably there will be difficulties getting everything need in a single group or array name.

foo.zarr/labels/original/s0 # array, scale level 0

And maybe relatedly, inspecting a plate today, I did have the impulse to add col, row, field, ..., scale prefixes to the integer-named groups to make the layout clearer. This would be a convention though with the metadata still being definitive as in #48 (comment).

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 11, 2022

I'm a bit more hesitant about saying generally that arrays and groups can't be at the same level. My preference would be to have metadata that clearly defines the semantics.

Agreed, but Zarr already provides a set of operational semantics for "list-like thing" (zarr group) and "value-like thing" (zarr array). Life would be easier if the OME data model had a clean mapping onto the zarr semantics, e.g. something like type definitions.

On one extreme (most permissive typing), you could define that all elements of an OME-Zarr have the recursive type leaf = Union[Zarr.Group[leaf], Zarr.Array], in which case the layout of a single level of hierarchy doesn't convey any information, and metadata / object names must be inspected to infer the OME-Zarr semantics. At another extreme, every OME-Zarr element could be strictly typed -- e.g., MultiscaleGroup = Zarr.Group[Zarr.Array], OMEZarrRoot = Zarr.Group[MultiscaleGroup, Zarr.Group[MultiscaleGroup]]. This was my suggestion in the above comment.

The current spec is somewhere in the middle: it looks something like MultiscaleGroup = Zarr.Group[Zarr.Array, Optional[Zarr.Group[MultiscaleGroup]], i.e. a multiscale group is a collection of zarr arrays and optionally contains another group that contains more multiscale groups. Subjectively, having MultiscaleGroup optionally function as a generic container feels like an unnecessary overload of the semantics of a MultiscaleGroup that will make parsing harder.

And re: naming groups with integers -- what's the motivation for this? zarr uses integers as the names of chunk directories in nested storage, and for chunks themselves, so it seems like integer names should probably be reserved for that purpose.

@joshmoore
Copy link
Member

Zarr already provides a set of operational semantics for "list-like thing" (zarr group) and "value-like thing" (zarr array).

Guess I think of zarr groups as being more dictionary like than list like which may explain some of our differing opinion.

`Union[Zarr.Group[leaf], Zarr.Array]

This is how I think of it in the general case.

every OME-Zarr element could be strictly typed -- e.g., MultiscaleGroup = Zarr.Group[Zarr.Array], OMEZarrRoot = Zarr.Group[MultiscaleGroup, Zarr.Group[MultiscaleGroup]]. This was my suggestion in the above comment.

Gotcha. If this issue is "don't mix the multiscale zgroup" then I could see adding that typing info to the spec, even if not necessarily doing it for all zgroups. (The top-level zgroup is of course a special case)

what's the motivation for this? zarr uses integers as the names of chunk directories in nested storage, and for chunks themselves, so it seems like integer names should probably be reserved for that purpose.

That's what I was getting at as well (though in Zarr v3 chunks themselves also get prefixes)

@d-v-b
Copy link
Contributor Author

d-v-b commented Mar 11, 2022

Guess I think of zarr groups as being more dictionary like than list like which may explain some of our differing opinion.

No, you're right, groups are definitely more dict-like than list-like, that's just an oversight on my part.

Gotcha. If this issue is "don't mix the multiscale zgroup" then I could see adding that typing info to the spec, even if not necessarily doing it for all zgroups. (The top-level zgroup is of course a special case)

Right, I support this tighter framing of the issue. I wasn't considering cases of zarrays outside of a multiscale group -- those situations should be considered separately.

@constantinpape
Copy link
Contributor

I didn't have time to fully read this issue; I will try to have a look later.
But I already wanted express my full support for this:

And re: naming groups with integers -- what's the motivation for this? zarr uses integers as the names of chunk directories in nested storage, and for chunks themselves, so it seems like integer names should probably be reserved for that purpose.

These integer named groups are super confusing. They should be avoided in the spec. (As a side-note: having the scale arrays with integer names is also not helping and calling them e.g. s0, s1, s2, ... is much better).
I also think that it's not a good choice to use these things as defaults, e.g. in bioformats2raw: glencoesoftware/bioformats2raw#109.

@sbesson sbesson mentioned this issue Mar 16, 2022
2 tasks
thewtex added a commit to thewtex/multiscale-spatial-image that referenced this issue Apr 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants