Minimal proposal #3
base: add-tables-points
Conversation
I will make one small clarification to Michael's introduction above. When he says this proposal doesn't support AnnData out of the box, the way I think of it is: this proposal provides a set of generic building blocks from which one could build an AnnData-Zarr layout.

The most radical thing we are proposing is that an AnnData-Zarr layout belongs in the AnnData documentation. We believe OME-NGFF should provide data structures that AnnData and other communities can adapt to their needs. As you'll see in our proposal, we have made a very clear callout box directing OME-NGFF users to the AnnData documentation. There are two conversations here: (1) about representing tables in OME-NGFF, and (2) about representing AnnData objects in Zarr.

EDIT: Okay, I see now that I misspoke. Current AnnData-Zarr datasets need two additional metadata properties in the table's .zattrs file to be compliant: annotated-data and column-order. So while nothing would have to be moved or removed in existing AnnData-Zarr datasets, yes, you would need to add something. I'll talk more about these two properties in a comment below.
latest/index.bs (Outdated)
├── .zarray
├── .zattrs
What's in this .zattrs file? Is it needed?
I didn't plan to store any special information in this file. So I guess it can probably be deleted, or is it required by the general Zarr layout for arrays?
This would also apply for the .zattrs file of the row_names array. Thanks for pointing this out!
I think that if you have a .zarray then there's not usually a sibling .zattrs in zarr.
Then I'll just delete these two files.
I think that if you have a .zarray then there's not usually a sibling .zattrs in zarr.
@will-moore's "not usually" applies to the NGFF spec's usages of Zarr to date. Just to be clear: in Zarr itself, a .zattrs is always permitted beside a .zarray, so if there's a need, there wouldn't be objections.
latest/index.bs (Outdated)
│   ...
│   └── n
├── .zgroup
├── .zattrs
What goes in this tables/.zattrs file? Does this list the child tables in any way, e.g. {"tables": ["table1", "anotherTable"]}, in the same way that labels/.zattrs does? https://ngff.openmicroscopy.org/0.4/index.html#labels-md
This is really essential if you can't browse the subgroups.
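For illustration, a tables/.zattrs following the labels-style listing suggested here might look like the following sketch (the table names are placeholders, not part of any spec):

```json
{
    "tables": ["table1", "anotherTable"]
}
```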
So far it doesn't, in order to minimize the chances of leaving the container in an inconsistent state (e.g., if a write/remove operation fails to modify the metadata as well). However, I don't have a lot of experience with systems where you can't browse subgroups, so I'm happy for any suggestions in that direction.
I included this suggestion in a first round of incorporating feedback. How does it look to you?
Thanks, that looks good from a practical point of view (allows me to know what tables exist etc).
However, it's a bit different from what I was expecting, which was just to list the single table, tables.attrs["tables"] = ["anndata"], and for the tables spec to define what the sub-tables are called. So I guess that highlights what you're proposing: the spec doesn't define what any of the sub-tables are called (in the way that AnnData does), so you're free to name them anything? E.g. var or obs could be col_info or extra etc?
Exactly, I want to keep it as generic as possible until there is a very compelling argument to do otherwise.
latest/index.bs (Outdated)
# sparse arrays MAY be in the `uns` group or in a subgroup.
├── .zgroup
└── row_names # The table group SHOULD contain a 1D array of strings of length n called `row_names`.
It seems that SHOULD have row_names is a strong requirement. In most cases I can think of, each row will be e.g. a Cell ID, and mostly these won't have names for each row, so I can't satisfy this requirement. Perhaps MAY is better here?

Would tools that read this data be expected to handle this column differently from any other string column, e.g. called "cell_names" or "sample_names"? If I have existing columns called "cell_names" and "sample_names", then it would only duplicate data if I have to add a "row_names" column too. So maybe this isn't really needed?
From my point of view, annotating an axis of an array attributes some categorical meaning to the hyperslices obtained by slicing orthogonal to that axis. Therefore, it seemed natural to me to be able to refer to such a hyperslice by a name (or an ID). So natural, in fact, that I would suggest having default names Row 1, Row 2, ... in case this column is not present. In this regard, MAY seems too weak to me.

As for the redundancy, would it help to rename this to names, IDs, identifier, ... and allow strings or integers as the data type?
SHOULD --> MAY is a schema-weakening change that is easy to roll out at any time. I think we should start with SHOULD and if people don't find row_names useful, we can change it to MAY later.
Allowing ints as well as strings seems reasonable to me.
I don't see why redundancy is an issue; row_names is not required. If there is a column that conceptually makes sense to think of as row names, you could just make that array row_names instead of a column. Presumably, the viewer will display the row names, so you wouldn't lose that column, it would just be moved to a more prominent place.
Okay update, now that I am thinking about this more... I think we should require that IF row_names exists, it MUST be unique, i.e., cannot contain duplicates. Unique row names provide a way to query the table with a guarantee that you will not accidentally grab 2 rows when you were expecting 1. (SQL DBs use primary keys for this purpose, which usually correspond to row numbers.) In R, data frames always have row names, which must be unique. (By default, 'rownames' is just the row numbers, as a vector of strings. R's 'rownames' can be strings or integers.)
I actually would recommend changing this SHOULD to MAY, if we impose the uniqueness requirement.
I think it's probably not the file format's responsibility to provide default row names.
Considering that R is designed for tabular data and is extremely popular in the genomics community, I think copying R's design choices is not a bad idea.
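To illustrate the uniqueness requirement proposed here, a reader-side check might look like this sketch (the function name is hypothetical and not part of the spec; it accepts strings or integers, as discussed above):

```python
def validate_row_names(row_names):
    """Raise if the proposed MUST-be-unique constraint on row names is violated.

    Mirrors R's data.frame behavior, where row names must be unique and may
    be strings or integers.
    """
    seen = set()
    # seen.add() returns None, so the condition is truthy only on a repeat.
    duplicates = {name for name in row_names if name in seen or seen.add(name)}
    if duplicates:
        raise ValueError(f"duplicate row names: {sorted(duplicates)}")
    return True

# Unique names pass; duplicates would raise a ValueError.
validate_row_names(["cell_1", "cell_2", "cell_3"])
```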
One more thought: again, we have the dilemma of a strict directory name vs. a metadata object. We could either call the directory "row_names" or add a third MUST to the .zattrs file: "row_names": "", which indicates which column contains the row names, if any. Personally, I like the latter solution better. It means that software that doesn't support row names can just ignore that metadata object and treat that column like any other column. Also, you could then name that directory whatever you want, e.g. 'indptr' (if I'm understanding the AD format correctly).
To summarize this discussion: I added an attribute "row-names" that MAY be present and refers to a 1d-column of strings or integers that should be used as row names. What do you think about this?
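As a sketch, the table's .zattrs might then carry something like the following, where "cell_ids" is a hypothetical column name used only for illustration:

```json
{
    "row-names": "cell_ids"
}
```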
Excellent, but I would change how it's presented in the spec. I'll suggest some new text in a private conversation. As usual, I am concerned with story-telling. I think we should present the three new properties ("row-names", "column-order", and "annotated-data") together in the text, as well as in the tree diagram.
I would delete the row_names directory from the tree diagram. This visual design suggests that the directory should be named row_names, which is not the case. It also makes the spec look more complicated than it really is.
Generally 👍 for this proposal. A more minimal spec makes sense in order to improve cross-language support, and the AnnData-Zarr docs certainly belong in the AnnData docs. However, I don't quite understand how data that conforms to the AnnData spec is also compliant with this proposal. I can see how AnnData tables could be mostly converted to match this spec, but that doesn't seem to be what's described above?
Thanks again for this PR, @minnerbe! At a spec level, I agree with the benefits of this more minimal spec easing interoperability. I am curious about the potential impact on the implementers. Previously, we had discussed with the AnnData folks and some of the Vitessce folks about Python and JavaScript reader implementations, respectively. This conversation was centered around the fact that they had an existing need for AnnData-Zarr IO, so the proposed AnnData-based table spec wouldn't be a very heavy lift. I will forward this PR to those teams and ask if they can comment here on whether they think it is feasible for them. In the case that this is too big of a deviation for them, do we have alternative implementors?

If it's possible to bring the AnnData on-disk representation to match this (or something in the middle), that might be a nice result, but I could imagine that will take some time to get alignment from all stakeholders. What is the protocol for when implementations should be available following a spec being published?
I have the same questions as @will-moore. Would it be possible to see how an AnnData would be stored with this spec?
Sidenote, @minnerbe: there are a few extraneous commits here that you might want to make disappear. (I can't push to Kevin's fork to make that happen.)
Thanks for the constructive feedback, folks! As a short answer to @will-moore's questions: I hope that I can provide a better answer within the next two days, with an example of how I would use this spec to store an AnnData object. Unfortunately, I'm traveling right now, so I'm still busy until Friday. @joshmoore, which commits are you talking about? I can certainly rebase the whole branch to make the history cleaner.
The new .DS_Store and .html file and the change to the existing .html file on this branch. I wouldn't suggest rebasing just yet since people are committing on the diff itself.
👍 because I think @will-moore's concerns from #3 (comment) match mine. It might in fact be that this proposal gives us the optimum that I mentioned in ome#64 (comment), but that we can define a rollout that gives both the NGFF spec and the AnnData on-disk format time to adjust incrementally. For those who would like to review:
In OME-NGFF, a table is a Zarr group containing zero, one, or more Zarr arrays, where each array represents one column of the table. Columns are ordered, and each column in a table MUST have the same number of rows. While the table itself MUST be 2-dimensional,
Since columns can be n-dimensional, I would replace this by saying that each column should have the same shape, to avoid the case of one column being 1D and one 2D.
Why should this case be prohibited? As Michael pointed out, this is permitted in Matlab.
This could make the in-memory representation of the data challenging (for instance, AnnData or Pandas may not work out of the box). For instance, should a 2D column be represented as rows that are np.array() objects, or as a list of columns? In the first case, writing a Pandas dataframe back to disk would give a list of 1D columns instead of a single 2D column.
I don't really understand this objection. I understand that this structure cannot be stored in a Pandas dataframe, which is a Python-only data structure. But AnnData has means to store 1D annotations (as a Pandas dataframe, as far as I understand) as well as nD annotations (a dictionary of type string -> [np.array|<other AnnData in-memory representations>]?) in memory. Could you please elaborate on the problems that you see with AnnData?
With the previous AnnData table proposal, the main dataframe was saved as a matrix (so each column always had dimension 1), and any n-dimensional annotations were saved in obsm, which was not the main matrix but present in a different Zarr group (the obsm group).

With the new proposal, from what I understand, there is just one main dataframe. So it's true that by default we could store any column that has dimension > 1 into an obsm in AnnData, but this type of conversion is not bidirectional and can induce fragmentation. For instance, if the user has an AnnData object with an obsm column of dimension 1, when saving it to disk, should this be saved in the separate obsm group, or should it be saved in the main matrix? In that case, upon re-reading, there is no way to know that it was previously in obsm.
@virginiascarlett In principle, I agree that multi-dimensional columns are not at all intuitive (and, in this regard, Matlab really is an outlier among all languages that support some kind of table). However, I don't know how AnnData could be represented by this table spec without multi-dimensional columns. I guess that most fields in the obsm and varm groups can be split into 1D components, since they seem to represent some kind of coordinate (points, umap, ...) most of the time. But for pairwise annotations in the obsp and varp groups, this would mean having columns numbering in the millions. How would you go about storing these with just 1D columns?

In particular, I think @LucaMarconato just suggests representing obs, obsm, and obsp as separate groups (i.e., tables in our definition) within an AnnData stored with our spec. Is this correct? This would be entirely possible with our definitions.
Exactly, I would store them as separate tables (using your definition of tables).
So do we agree to restrict the spec to 1D columns?
I think that restricting to 1D columns would severely break AnnData compatibility. Rather, I support Luca's suggestion of having a table (in our definition) for each of obs, obsm, and obsp, where one can store 1D, nD (with n > 1), and quadratic arrays, respectively.
I would prefer 1D columns of tuples over having 2D columns. For now, this could be restricted to homogeneous tuples. This would allow these to be exactly mappable to higher-dimensional arrays. However, in the future, heterogeneous tuples (structs with arbitrary fields) may be useful to have.

For example, instead of having a 5-row by 6-column matrix, we have a vector, i.e. a one-dimensional array, of 6-tuples.
```python
import numpy as np

A = np.reshape(np.arange(30), (5, 6))

A  # 2D matrix representation
# array([[ 0,  1,  2,  3,  4,  5],
#        [ 6,  7,  8,  9, 10, 11],
#        [12, 13, 14, 15, 16, 17],
#        [18, 19, 20, 21, 22, 23],
#        [24, 25, 26, 27, 28, 29]])

B = [tuple(row) for row in A]

B  # 6-tuple representation
# [(0, 1, 2, 3, 4, 5),
#  (6, 7, 8, 9, 10, 11),
#  (12, 13, 14, 15, 16, 17),
#  (18, 19, 20, 21, 22, 23),
#  (24, 25, 26, 27, 28, 29)]

np.array(B)  # easy conversion from the 6-tuple representation back to a 2D matrix
# array([[ 0,  1,  2,  3,  4,  5],
#        [ 6,  7,  8,  9, 10, 11],
#        [12, 13, 14, 15, 16, 17],
#        [18, 19, 20, 21, 22, 23],
#        [24, 25, 26, 27, 28, 29]])
```
table of `tables/` MUST contain the "annotated-data" property, which is a JSON array of
Zarr array paths and dimensions (0-based indexing), as shown below:
I didn't understand this example. Could you please elaborate more on this? Why do you need to tell which dimension of an image is being annotated for each table?

Furthermore, does this allow annotating labels? For a table annotating labels, I would just put in the metadata that the table refers to the Zarr labels group; why do you need to specify which of the yx dimensions is being annotated?
By requiring the image and dimension that is annotated by a table, I see two advantages compared to having a fixed naming schema:

- We are free to name the tables whatever we like, e.g., obs for the 0-th dimension of an array X and var for the first dimension of the same array (this may also answer @will-moore's question of how to store the var array), without imposing this naming scheme on others, who might want to choose a more descriptive name for a more specific use case.
- We can annotate the same image/dimension pair with more than one table. Vice versa, one table can annotate multiple image/dimension pairs, in which case the 'orientation' of the table comes from the annotated dimension, i.e., the table can annotate the columns of one image and the rows of another one.
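As a sketch, such an "annotated-data" property in a table's .zattrs could look like the following, based on the description in this thread (the paths are illustrative; elsewhere in this PR the dimension value appears as a string rather than an integer):

```json
{
    "annotated-data": [
        { "array": "/tables/anndata/X", "dimension": 0 },
        { "array": "/some/other/array", "dimension": 1 }
    ]
}
```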
Thanks for the comment. I now understand the purpose, but I want to point out some ambiguity in the terminology that you used; I think that this causes some confusion in the proposal, and that addressing it can make the proposal clearer.

From your message above, it seems that obs is the table and X is the image. In the AnnData proposal, X is never an image. X is always a matrix of values, i.e. the table is "X + obs + var"; obs alone is not considered a table.

In the AnnData proposal, if we have a segmentation mask (labels tensor) called lab, the way to annotate it would be by a matrix X containing the annotations and by an obs dataframe containing the two columns that tell how to map the indices of the segmentation mask to the indices of the annotation matrix X. So the table "X + obs" annotates the labels lab.

This is why I was confused by the new proposal: what you call the table (obs) should now tell which dimension of X is being annotated, but X is not the image, it's the annotation table itself.
With this comment I don't want to say that obs + var + X = table is better or worse than having a more atomic concept of table (= obs). It has its advantages and its disadvantages. I just want to point out the ambiguity in the proposal, since it confused me and could confuse other readers.
Thanks for pointing this out! I agree that the terminology is not 100% precise here, and that removing all ambiguities in language is paramount to arriving at a useful spec. Part of the motivation for creating a more minimal spec was that I was confused by the terminology in the original proposal, so arriving at a common ground here would be very desirable.

The reasoning for my terminology is as follows:

- Table, for me, means a data structure that has one (heterogeneous) record schema and multiple records (the rows). In that sense, X (as well as X + obs + var) is not a table.
- I was also careful to let a table annotate "arrays" instead of "images", since this allows consistently annotating X from the AnnData schema. An AnnData object, in turn, can annotate an image in a predefined way.
I think the second sentence of the Tables spec ("Tables are an intuitive way of storing...") contributes to this confusion. I would replace it with this: "Tables are an intuitive way of organizing data or metadata consisting of variables and records, which are often organized as columns and rows, respectively."
I agree with Michael that I am pretty happy with our definition of table as it is in the minimal spec, though I am also happy to make changes that will minimize confusion. In relational databases, if you have two tables that together tell a story, we treat them as two tables, and the schema describes how they relate to one another. I think if AnnData wants to conceptualize multiple 2D data structures as a single table, that's fine, but we want to create a spec that is generic and intuitive for people who've never encountered AnnData.
We decided to add two properties. Under their proposal, any new table spec will probably have to either:

While our two MUSTs may seem cumbersome now, they actually give AnnData more flexibility to grow and change. We are essentially adding two small pieces of metadata in exchange for more flexibility in the directory names and the layout. We believe the core spec ought to provide a place for some basic information about the structure and meaning of the table, rather than forcing people to decide where to put that information. Also, being too rigid with directory names means that no one else can reuse a name, e.g. 'obs', 'X', or 'layers', once it already has a special meaning.
I finally managed to assemble an example showing how I would store an AnnData file using this spec. Since I need to rebase anyway (thanks, @joshmoore for pointing this out!), I pushed a python script to this branch:
In principle, Please let me know if there are any concerns about how I envision a potential way of AnnData using the present table spec proposal. As I incorporate the suggestions already made into the spec, I will try to keep the AnnData example updated.
@minnerbe Sorry, I can't work out where you pushed your
You're right, I didn't push. 🤦
@minnerbe Thanks for taking the time to write that script - I think I understand the data structures a bit better now... That script generates an
I'm not at all familiar with the Java side of the argument, but is it correct that handling the
Is it the hope that AnnData would migrate towards the "minimal proposal" (so as to use the same AnnData spec with OME-NGFF as without), or is it conceivable that there'd be two flavours of AnnData and ways to convert between them?

I would be more supportive of dropping sparse CSR/CSC encoding from the tables spec than I would be of dropping categorical encoding, mostly because it's harder to load chunks of the sparse array that correspond to given chunks/rows of the X data if the sparse data is encoded.
@minnerbe I made a couple of tweaks to your script, one to fix an error and the other to export anndata.zarr to help me compare on disk:

```diff
diff --git a/latest/generate_anndata_example.py b/latest/generate_anndata_example.py
index 447c27c..893575d 100644
--- a/latest/generate_anndata_example.py
+++ b/latest/generate_anndata_example.py
@@ -107,11 +107,11 @@ def write_anndata_suggestion(adata, filename, chunks):
     row_names = np.array(["X", "log_transformed", "other_data"])
     layers.create_dataset("row_names", data=row_names, dtype=object, object_codec=numcodecs.VLenUTF8())
     layers.attrs["annotated-data"] = [{"array": "/tables/anndata/X", "dimension": "2"}]
-    obs.attrs["column-order"] = ["row_names"]
```
```python
# store example in an alternative way, exploiting the properties of the
# suggested minimal table spec a bit more
write_anndata_suggestion(adata, "example_suggestion.zarr", chunks)
```
As suggested by @will-moore and @virginiascarlett:
* Change SHOULD to MAY
* Add metadata attribute to specify column containing row names
@will-moore: In principle, the idea was to have a minimal spec of structured metadata that can be used to store AnnData easily, but is also useful beyond AnnData. E.g., imagine you have a time series of pictures in a 3D image and you want to note the exact time and, say, temperature for every time slice. With readers that can deal with tables as outlined in this proposal, there is a canonical way of associating this metadata with the image (create a table with columns "time" and "temperature" and let it annotate the time dimension of the 3D image).

As far as I understand, there is already a fully developed AnnData standard (with multiple backends) independent of OME-NGFF. So, I don't see a reason to further "standardize" AnnData by including it in OME-NGFF as-is. However, as AnnData provides undeniably more complex use cases than mine outlined above, it's very desirable to have a way of easily representing AnnData within OME-NGFF. Whether AnnData should shift more toward such a representation, or should simply provide the representation as an OME-NGFF "flavour", is not for me to decide or propose.

Regarding categoricals: I agree that they are a very-nice-to-have feature that wouldn't be too hard to implement from the Java side. My motivation for dropping this particular feature was
Of course, this similarity of the
Would you suggest re-introducing AnnData encodings in general, or just keeping categoricals? In the latter case, I think it could be worthwhile to explore the redundancy mentioned above to have a uniform way of representing categorical data. To answer your other questions:
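The on-the-fly categorical conversion discussed here can be sketched in a few lines; this is only an illustration (function names are hypothetical, not part of any spec), showing that the codes/categories split can be reconstructed at read time from a plain value column:

```python
def encode_categorical(values):
    """Split a column of repeated values into (codes, categories),
    roughly what AnnData's categorical encoding stores explicitly."""
    categories = sorted(set(values))
    lookup = {cat: i for i, cat in enumerate(categories)}
    return [lookup[v] for v in values], categories

def decode_categorical(codes, categories):
    """Reconstruct the plain value column, e.g. for on-disk storage,
    where Zarr compression makes the explicit encoding largely redundant."""
    return [categories[c] for c in codes]

codes, cats = encode_categorical(["T-cell", "B-cell", "T-cell"])
assert decode_categorical(codes, cats) == ["T-cell", "B-cell", "T-cell"]
```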
fca7fda to 997b930
I suggest we add a sentence at the end of the first paragraph about the purpose of tables in OME-Zarr. Something like: "While Zarr is not designed for tabular data, the user may wish to store tables of reasonable size within a Zarr hierarchy, for clarity and convenience."
Apparently a new release of bikeshed is now unhappy with the use of `<img/>`, and `<img></img>` is required:

```
$ bikeshed spec "latest/index.bs" "latest/index.out.html"
LINE 651:1: Tag <img> wasn't closed at end of file.
✘ Did not generate, due to fatal errors
Failed
```
Fix img and JSON comments
This URL should generate a rendered preview:
Minimal table spec
This is an attempt at a minimal version of the proposed table spec based on this fork. I am aware that there has been a conscious decision against a minimal spec and for full AnnData support. However, having worked on AnnData support from the Java/ImgLib2 side, I think that a more minimal spec that does not follow the AnnData in-memory representation too closely could improve interoperability with tools from outside the Python ecosystem.
This proposal is not meant as a "counter-proposal" but rather as an addendum to the original PR that tries to distill the essence of the AnnData format in a way that is still mostly compatible with it. There is no doubt that incorporating AnnData, as the de-facto standard for spatial-omics analysis in Python, into OME-NGFF is highly beneficial.
While working on this draft I tried to balance two goals:
Thanks to @bogovicj, @d-v-b, and especially @virginiascarlett for their help with creating and revising this proposal.
AnnData compatibility
In general, my approach was to dissect AnnData into (nearly) atomic building blocks:
Please note that my proposal only deals with the second aspect and thus doesn't offer AnnData support out-of-the-box. However, I have some opinions on how AnnData could be stored with this table spec:
My idea would be to store an AnnData dataset in its own group, where the central datasets are stored as Zarr arrays and the axis annotations as tables as described by this proposal. A minimal (metadata) schema is then needed to relate this group to an image. This way, it should be easy to specify and implement readers/writers for other table-based storage formats, such as the mentioned PointTable, RegionTable, and ImageTable, and potentially also the data structures used in spatialdata.
In the following, I walk through the details of the current AnnData on-disk format to discuss how they fit into this spec proposal. I am happy to discuss the details of these ideas and I'll try to share an example AnnData file stored using my table spec proposal in the course of the week.
AnnData structure
X / layers
To me, these are not tables in the sense that tables allow for heterogeneous data, but rather homogeneous arrays. Hence, they should be stored as simple Zarr arrays. To combine any array with axis-wise annotations, we propose an annotated-data map within a table's metadata. By not restricting the annotated array to 2D, one can harness Zarr's ability to efficiently store multidimensional arrays (e.g., multi-channel data, time series) without having to store them as multiple 2D arrays (as is currently done in, e.g., AnnData and TIFF).

obs[mp] and var[mp]
All these collections can be consolidated into one tables group. While I acknowledge that it is convenient to separate 1D, nD, and quadratic annotations in the in-memory representation of AnnData, there seems to be no real advantage of doing this for the on-disk format. Dispatching the arrays to the correct AnnData fields in memory can be done easily when reading from disk, based on the array metadata. Again, by making the tables group generic, this generalizes easily to annotating datasets that are more than 2D.

uns
In the original proposal, this is essentially an (optional) group without any specific metadata, which can always be present in the current OME-NGFF spec. So I don't see a need to specify this separately.
AnnData encoding types
This proposal is essentially a dataframe in the sense of AnnData which, however, allows for multi-dimensional "columns", as is the case, e.g., for tables in Matlab. This means that all columns are arrays and no encoding-type metadata is needed. For all non-array encoding types of AnnData, I include my rationale for not including them in my proposal in the following.

Sparse arrays
Representing sparse (2D) data as CSR/CSC in-memory is a very common and powerful optimization for downstream-analysis. However, considering Zarr’s compression, this is redundant for on-disk data. Conversion to and from CSR/CSC can be done while reading/writing if necessary and has linear complexity. To do this efficiently in terms of disk access, the data could be chunked in such a way that rows/columns are contiguous within the chunks.
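As an illustration of the linear-time conversion argued for above, here is a dependency-free sketch of dense-to-CSR conversion and back (function names are hypothetical; a real implementation would typically use scipy.sparse instead):

```python
def dense_to_csr(rows):
    """Convert a dense 2D list (row-major) into CSR components
    (data, indices, indptr) in a single linear pass."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(j)
        indptr.append(len(data))
    return data, indices, indptr

def csr_to_dense(data, indices, indptr, n_cols):
    """Inverse conversion, e.g. after reading compressed chunks from Zarr."""
    rows = []
    for start, end in zip(indptr, indptr[1:]):
        row = [0] * n_cols
        for k in range(start, end):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows

# Round trip: dense -> CSR -> dense.
dense = [[0, 5, 0], [1, 0, 0], [0, 0, 2]]
assert csr_to_dense(*dense_to_csr(dense), n_cols=3) == dense
```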
Categorical arrays
The same rationale as for sparse arrays applies: this is a compression step which is redundant for on-disk storage and can be converted on the fly when reading/writing. Also, there is conceptual overlap with the labels metadata.

Nullable Integers / Booleans
These currently cannot be expressed in this minimal proposal. However, this seems to be an implementation detail of Pandas dataframes, which I argue should probably not be exposed as public API. If there is a compelling argument for having nullable arrays, it might pay to have this as a standalone spec within OME-NGFF, to facilitate using them also outside of tables. As for the previous point, masking arrays can probably be done by means of the labels metadata, so I think it would be good to sort out the redundancies first.