
Minimal proposal #3

Open
wants to merge 36 commits into base: add-tables-points

Conversation

minnerbe

Minimal table spec

This is an attempt at a minimal version of the proposed table spec based on this fork. I am aware that there has been a conscious decision against a minimal spec and for full AnnData support. However, having worked on AnnData support from the Java/ImgLib2 side, I think that a more minimal spec that does not follow the AnnData in-memory representation too closely could improve interoperability with tools from outside the Python ecosystem.
This proposal is not meant as a “counter-proposal” but rather as an addendum of the original PR that tries to distill the essence of the AnnData format in a way that is still mostly compatible with it. There is no doubt that incorporating AnnData as the de-facto standard for spatial-omics analysis in Python into OME-NGFF is highly beneficial.

While working on this draft I tried to balance two goals:

  • Making it very easy to express AnnData with this spec;
  • Making it as general as possible, so that it supports a broad range of applications.

Thanks to @bogovicj, @d-v-b, and especially @virginiascarlett for their help with creating and revising this proposal.

AnnData compatibility

In general, my approach was to dissect AnnData into (nearly) atomic building blocks:

  • The “central” datasets X / layers.
  • Annotations of one axis of these datasets (var and obs with their m and p variants).

Please note that my proposal only deals with the second aspect and thus doesn't offer AnnData support out-of-the-box. However, I have some opinions on how AnnData could be stored with this table spec:
My idea would be to store an AnnData dataset in its own group, where the central datasets are stored as Zarr arrays and the axis annotations as tables as described by this proposal. A minimal (metadata) schema is then needed to relate this group to an image. This way, it should be easy to specify and implement readers/writers for other table-based storage formats, such as the mentioned PointTable, RegionTable, and ImageTable, and potentially also the data structures used in spatialdata.
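To make this idea concrete, here is a minimal sketch of how such a group could be laid out, assuming zarr-python v2; the group names ("anndata", "obs", "var") and the "annotated-data" / "column-order" attributes mirror the example script discussed later in this thread, and the data are made up:

```python
import numcodecs
import numpy as np
import zarr

root = zarr.open_group("example.zarr", mode="w")
ad = root.create_group("tables").create_group("anndata")

# central dataset: a plain Zarr array, not a table
ad.create_dataset("X", data=np.random.rand(100, 50), chunks=(100, 50))

# axis annotations: one table (a group of column arrays) per annotated dimension
obs = ad.create_group("obs")
obs.create_dataset("cell_type",
                   data=np.array(["B cell"] * 50 + ["T cell"] * 50, dtype=object),
                   dtype=object, object_codec=numcodecs.VLenUTF8())
obs.attrs["column-order"] = ["cell_type"]
obs.attrs["annotated-data"] = [{"array": "/tables/anndata/X", "dimension": "0"}]

var = ad.create_group("var")
var.create_dataset("gene_id",
                   data=np.array(["gene_%d" % i for i in range(50)], dtype=object),
                   dtype=object, object_codec=numcodecs.VLenUTF8())
var.attrs["column-order"] = ["gene_id"]
var.attrs["annotated-data"] = [{"array": "/tables/anndata/X", "dimension": "1"}]
```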

In the following, I walk through the details of the current AnnData on-disk format to discuss how they fit into this spec proposal. I am happy to discuss the details of these ideas and I'll try to share an example AnnData file stored using my table spec proposal in the course of the week.

AnnData structure

X / layers

To me, these are not tables in the sense that tables allow for heterogeneous data; rather, they are homogeneous arrays. Hence, they should be stored as simple Zarr arrays. To combine any array with axis-wise annotations, we propose an annotated-data map within a table's metadata. By not restricting the annotated array to 2D, one can harness Zarr's ability to efficiently store multidimensional arrays (e.g., multi-channel data, time series) without having to store them as multiple 2D arrays (as is currently done in, e.g., AnnData and TIFF).

obs[mp] and var[mp]

All these collections can be consolidated into one tables group. While I acknowledge that it is convenient to separate 1D, nD and quadratic annotations in the in-memory representation of AnnData, there seems to be no real advantage to doing this for the on-disk format. Dispatching the arrays to the correct AnnData fields in memory can be done easily when reading from disk, based on the array metadata, as sketched below. Again, by making the tables group generic, this generalizes easily to annotating datasets with more than two dimensions.
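As a rough illustration of that dispatching step (not part of the proposal itself), a reader could route the columns of one consolidated table group to AnnData's obs / obsm / obsp fields based purely on array shape; this sketch assumes zarr-python v2, and the shape heuristic would need refining (e.g., via explicit array metadata) for ambiguous cases:

```python
import zarr

def dispatch_columns(table_path, n_rows):
    """Split the columns of a consolidated table group into obs / obsm / obsp candidates."""
    group = zarr.open_group(table_path, mode="r")
    obs, obsm, obsp = {}, {}, {}
    for name, column in group.arrays():
        if column.ndim == 1:
            obs[name] = column[:]                 # plain 1D annotation
        elif column.shape == (n_rows, n_rows):
            obsp[name] = column[:]                # pairwise ("quadratic") annotation
        else:
            obsm[name] = column[:]                # other multi-dimensional annotation
    return obs, obsm, obsp
```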

uns

In the original proposal, this is essentially an (optional) group without any specific metadata, and such a group can already be present under the current OME-NGFF spec. So I don't see a need to specify this separately.

AnnData encoding types

This proposal is essentially a dataframe in the sense of AnnData which, however, allows for multi-dimensional "columns", as is the case, e.g., for tables in Matlab. This means that all columns are arrays and no encoding-type metadata is needed. For each of AnnData's non-array encoding types, my rationale for not including it in this proposal follows below.

Sparse arrays

Representing sparse (2D) data as CSR/CSC in memory is a very common and powerful optimization for downstream analysis. However, considering Zarr's compression, this is redundant for on-disk data. Conversion to and from CSR/CSC can be done while reading/writing if necessary and has linear complexity. To do this efficiently in terms of disk access, the data could be chunked in such a way that rows/columns are contiguous within the chunks.
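A sketch of that conversion, assuming zarr-python v2 and scipy: the CSR matrix is densified one row-chunk at a time, so each Zarr chunk is written exactly once and the mostly-zero chunks are left to Zarr's default compressor; reading back to CSR is again linear in the number of stored values:

```python
import numpy as np
import zarr
from scipy import sparse

X = sparse.random(10_000, 500, density=0.02, format="csr", dtype=np.float32)

rows_per_chunk = 1_000
z = zarr.open_array("X_dense.zarr", mode="w", shape=X.shape,
                    chunks=(rows_per_chunk, X.shape[1]), dtype=X.dtype)

# write: densify chunk by chunk so memory usage stays bounded
for start in range(0, X.shape[0], rows_per_chunk):
    stop = min(start + rows_per_chunk, X.shape[0])
    z[start:stop, :] = X[start:stop].toarray()

# read: convert back to CSR (here for the whole array; per-chunk conversion works the same way)
X_roundtrip = sparse.csr_matrix(z[:])
assert (X_roundtrip != X).nnz == 0
```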

Categorical arrays

The same rationale as for sparse arrays applies: this is a compression step which is redundant for on-disk storage and can be converted on the fly when reading/writing. Also, there is conceptual overlap with the labels metadata.
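A sketch of the on-the-fly conversion (pandas and numpy assumed): on disk the column would simply be a plain string array, which Zarr's compressor handles well, and a reader that wants categoricals can rebuild them in linear time:

```python
import numpy as np
import pandas as pd

# writing: a pandas categorical column becomes a plain string array on disk
cat = pd.Categorical(["T cell", "B cell", "T cell", "B cell", "B cell"])
on_disk = np.asarray(cat, dtype=object)   # what would be stored in the Zarr column

# reading: plain strings are turned back into a categorical on the fly
recovered = pd.Categorical(on_disk)
assert list(recovered) == list(cat)
```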

Nullable Integers / Booleans

These cannot currently be expressed in this minimal proposal. However, this seems to be an implementation detail of pandas dataframes, which I argue should probably not be exposed as public API. If there is a compelling argument for having nullable arrays, it might pay to have this as a standalone spec within OME-NGFF, to facilitate their use outside of tables as well. As for the previous point, masking arrays can probably be done by means of the labels metadata, so I think it would be good to sort out the redundancies first.

@github-actions

Automated Review URLs

@virginiascarlett

virginiascarlett commented Jun 20, 2023

I will make one small clarification to Michael's introduction above. When he says this proposal doesn't support AnnData out-of-the-box, the way I think of it is: this proposal provides a set of generic building blocks from which one could build an AnnData-Zarr layout. To be clear: Kevin et al.'s proposal is ~~compliant~~ nearly compliant with Michael's. Existing AnnData-Zarr datasets ~~are already~~ can easily be made compliant with the proposal presented here. None of the work that Kevin et al. have done up to now needs to be discarded.

The most radical thing we are proposing is that an AnnData-Zarr layout belongs in the AnnData documentation. We believe OME-NGFF should provide data structures that AnnData and other communities can adapt to their needs. As you'll see in our proposal, we have made a very clear callout box directing OME-NGFF users to the AnnData documentation.

There are two conversations here, (1) about representing tables in OME-NGFF, and (2) about representing AnnData objects in Zarr.

EDIT:

Okay, I see now that I misspoke. Current AnnData-Zarr datasets need two additional metadata properties in the table's .zattrs file to be compliant: annotated-data and column-order. So while nothing would have to be moved or removed in existing AnnData-Zarr datasets, yes, you would need to add something. I'll talk more about these two properties in a comment below.

latest/index.bs Outdated
| |
| ├── .zarray
| ├── .zattrs


What's in this .zattrs file? Is it needed?

minnerbe (Author)

I didn't plan to store any special information in this file. So I guess it can probably be deleted, or is it required by the general Zarr layout for arrays?
This would also apply to the .zattrs file of the row_names array. Thanks for pointing this out!


I think that if you have a .zarray then there's not usually a sibling .zattrs in zarr.

minnerbe (Author)

Then I'll just delete these two files.


I think that if you have a .zarray then there's not usually a sibling .zattrs in zarr.

@will-moore's "not usually" applies to the NGFF spec's usages of Zarr to date. Just to be clear: in Zarr itself, a .zattrs is always permitted beside a .zarray, so if there's a need, there wouldn't be objections.

latest/index.bs Outdated
│ | ...
│ └── n
├── .zgroup
├── .zattrs


What goes in this tables/.zattrs file?
Does this list the child tables in any way? E.g. {"tables": ["table1", "anotherTable"]} in the same way that labels/.zattrs does? https://ngff.openmicroscopy.org/0.4/index.html#labels-md
This is really essential if you can't browse the subgroups.

minnerbe (Author)

So far it doesn't, in order to minimize the chances of leaving the container in an inconsistent state (e.g., if a write/remove operation fails to also update the metadata). However, I don't have a lot of experience with systems where you can't browse subgroups, so I'm happy to hear any suggestions in that direction.

minnerbe (Author)

I incorporated this suggestion in a first round of addressing feedback. How does it look to you?


Thanks, that looks good from a practical point of view (allows me to know what tables exist etc).
However, it's a bit different from what I was expecting, which was just to list the single table, tables.attrs["tables"] = ["anndata"], and for the tables spec to define what the sub-tables are called. So I guess that highlights what you're proposing: the spec doesn't define what any of the sub-tables are called (in the way that AnnData does), so you're free to name them anything? E.g. var or obs could be col_info or extra etc.?

minnerbe (Author)

Exactly, I want to keep it as generic as possible until there is a very compelling argument to do otherwise.

latest/index.bs Outdated
| # sparse arrays MAY be in the `uns` group or in a subgroup.
|
├── .zgroup
└── row_names # The table group SHOULD contain a 1D array of strings of length n called `row_names`.


It seems that SHOULD have row_names is a strong requirement. In most cases I can think of, each row will be e.g. a Cell ID, and mostly these won't have names for each row, so I can't satisfy this requirement.
Perhaps MAY is better here?
Would tools that read this data be expected to handle this column differently from any other string column, e.g. called "cell_names" or "sample_names"? If I have existing columns called "cell_names" and "sample_names", then it would only duplicate data if I have to add a "row_names" column too? So maybe this isn't really needed?

minnerbe (Author)

From my point of view, annotating an axis of an array attributes some categorical meaning to the hyperslices obtained by slicing orthogonal to that axis. Therefore, it seemed natural to me to be able to refer to such a hyperslice by a name (or an ID). So natural, in fact, that I would suggest having default names Row 1, Row 2, ... in case this column is not present. In this regard, MAY seems too weak to me.

As for the redundancy, would it help to rename this to names, IDs, identifier, ... and allow strings or integers as the data type?


SHOULD --> MAY is a schema-weakening change that is easy to roll out at any time. I think we should start with SHOULD and if people don't find row_names useful, we can change it to MAY later.

Allowing ints as well as strings seems reasonable to me.

I don't see why redundancy is an issue; row_names is not required. If there is a column that conceptually makes sense to think of as row names, you could just make that array row_names instead of a column. Presumably, the viewer will display the row names, so you wouldn't lose that column, it would just be moved to a more prominent place.

virginiascarlett commented Jun 28, 2023

Okay update, now that I am thinking about this more... I think we should require that IF row_names exists, it MUST be unique, i.e., cannot contain duplicates. Unique row names provide a way to query the table with a guarantee that you will not accidentally grab 2 rows when you were expecting 1. (SQL DBs use primary keys for this purpose, which usually correspond to row numbers.) In R, data frames always have row names, which must be unique. (By default, 'rownames' is just the row numbers, as a vector of strings. R's 'rownames' can be strings or integers.)

I actually would recommend changing this SHOULD to MAY, if we impose the uniqueness requirement.

I think it's probably not the file format's responsibility to provide default row names.

Considering that R is designed for tabular data and is extremely popular in the genomics community, I think copying R's design choices is not a bad idea.


One more thought: again, we have the dilemma of strict directory name vs. a metadata object. We could either call the directory "row_names" or add a third MUST to the .zattrs file: "row_names" : "", which indicates which column contains the row names, if any. Personally, I like the latter solution better. It means that software that doesn't support row names can just ignore that metadata object and treat that column like any other column. Also, you could then name that directory whatever you want, e.g. 'indptr' (if I'm understanding the AD format correctly).

minnerbe (Author)

To summarize this discussion: I added an attribute "row-names" that MAY be present and refers to a 1d-column of strings or integers that should be used as row names. What do you think about this?
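For concreteness, a sketch of how this could look with zarr-python v2 (the column name cell_id is made up for illustration):

```python
import numcodecs
import numpy as np
import zarr

table = zarr.open_group("example.zarr/tables/my_table", mode="a")
table.create_dataset("cell_id",
                     data=np.array(["cell_0", "cell_1", "cell_2"], dtype=object),
                     dtype=object, object_codec=numcodecs.VLenUTF8())
table.attrs["row-names"] = "cell_id"   # MAY be present; points at an existing 1D column
```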


Excellent, but I would change how it's presented in the spec. I'll suggest some new text in a private conversation. As usual, I am concerned with story-telling. I think we should present the three new properties ("row-names", "column-order", and "annotated-data") together in the text, as well as in the tree diagram.

I would delete the row_names directory from the tree diagram. This visual design suggests that the directory should be named row_names, which is not the case. It also makes the spec look more complicated than it really is.

@will-moore

Generally 👍 for this proposal. A more minimal spec makes sense in order to improve cross-language support and the AnnData-zarr docs certainly belong in AnnData docs.

However, I don't quite understand how data that conforms to the AnnData spec is also compliant with this proposal?
The AnnData layout has several differences and features that aren't supported here, and there are also MUST rules in this spec, such as the "annotated-data" requirement, that are not part of AnnData?

I can see how AnnData tables could be mostly converted to match this spec, but that doesn't seem to be what's described above?
I say "mostly" because I don't understand how var column annotations would be stored in this spec?
E.g. if I have AnnData where X is gene expression data for 100 Genes, would I split this up into 100 columns, each named with the Gene name? Is there a place to store metadata associated with each Gene, such as a GO accession number?

@kevinyamauchi
Owner

kevinyamauchi commented Jun 21, 2023

Thanks again for this PR, @minnerbe ! At a spec level, I agree with the benefits of this more minimal spec easing interoperability.

I am curious about the potential impact on the implementers. Previously, we had discussed with the AnnData folks and some of the Vitessce folks about python and javascript reader implementations, respectively. This conversation was centered around the fact that they had an existing need for AnnData zarr IO, so the proposed AnnData-based table spec wouldn't be a very heavy lift. I will forward this PR to those teams and ask if they can comment here if they think it is feasible for them. In the case that this is too big of a deviation for them, do we have alternative implementors?

If it's possible to bring the AnnData on-disk representation to match this (or something in the middle), that might be a nice result, but I could imagine that it will take some time to get alignment from all stakeholders. What is the protocol for when implementations should be available following a spec being published?

However, I don't quite understand how data that conforms to the AnnData spec is also compliant with this proposal?
The AnnData layout has several differences and features that aren't supported here, and there are also MUST rules in this spec, such as the "annotated-data" requirement, that are not part of AnnData?

I have the same questions as @will-moore . Would it be possible to see how an AnnData would be stored with this spec?

@joshmoore

Sidenote, @minnerbe: there are a few extraneous commits here that you might want to make disappear. (I can't push to Kevin's fork to make that happen)

@minnerbe
Author

Thanks for the constructive feedback, folks!

As a short answer to @will-moore's questions: var would also be a table, but with each column having the same length as the number of columns in X. This requires the MUST metadata to piece the two tables (one for obs and one for var) and the array (X) together.

I hope that I can provide a better answer within the next two days with an example of how I would use this spec to store an AnnData object. Unfortunately, I'm traveling right now so I'm still busy until Friday.

@joshmoore which commits are you talking about? I can certainly rebase the whole branch to make the history cleaner.

@joshmoore

joshmoore commented Jun 21, 2023

@joshmoore which commits are you talking about? I can certainly rebase the whole branch to make the history cleaner.

The new .DS_Store and .html file and the change to the existing .html file on this branch. I wouldn't suggest rebasing just yet since people are committing on the diff itself.

I hope that I can provide a better answer within the next two days with an example of how I would use this spec to store an AnnData object. Unfortunately, I'm traveling right now so I'm still busy until Friday.

👍 because I think @will-moore's concerns from #3 (comment) match mine.

It might in fact be that this proposal gives us the optimum that I mentioned in ome#64 (comment), but that we can define a rollout that gives both the NGFF spec and the AnnData on-disk format time to adjust incrementally.

For those who would like to review:


In OME-NGFF, a table is a Zarr group containing zero, one, or more Zarr arrays, where each
array represents one column of the table. Columns are ordered, and each column in a
table MUST have the same number of rows. While the table itself MUST be 2-dimensional,


Since columns can be n-dimensional, I would replace this by saying that each column should have the same shape, so as to avoid the case of one column being 1D and another 2D.


Why should this case be prohibited? As Michael pointed out, this is permitted in Matlab.


This could make the in-memory representation of the data challenging (for instance, AnnData or Pandas may not work out of the box). For instance, should a 2D column be represented as rows that are np.array()s, or as a list of columns? In the first case, writing a Pandas dataframe back to disk would give a list of 1D columns instead of a single 2D column.

minnerbe (Author)

I don't really understand this objection. I understand that this structure cannot be stored in a Pandas dataframe, which is a Python-only data structure. But AnnData has means to store 1D annotations (as a Pandas dataframe, as far as I understand) as well as nD annotations (a dictionary of type string -> [np.array|<other AnnData in-memory representations>]?) in memory. Could you please elaborate on the problems that you see with AnnData?


With the previous AnnData table proposal, the main dataframe was saved as a matrix (so each column always had dimension 1), and any n-dimensional annotations were saved in obsm, which was not part of the main matrix but stored in a different zarr group (the obsm group).

With the new proposal, from what I understand, there is just one main dataframe. So it's true that by default we could store any column that has dimension > 1 in obsm in AnnData, but this type of conversion is not bidirectional and can induce fragmentation. For instance, if the user has an AnnData object with an obsm column of dimension 1, when saving it to disk, should this be saved in the separate obsm group or in the main matrix? In the latter case, upon re-reading there is no way to know that it was previously in obsm.

minnerbe (Author)

@virginiascarlett In principle, I agree that multi-dimensional columns are not at all intuitive (and, in this regard, Matlab really is an outlier among all languages that support some kind of table). However, I don't know how AnnData could be represented by this table spec without multi-dimensional columns. I guess that most fields in the obsm and varm groups can be split into 1D components, since they seem to represent some kind of coordinate (points, umap, ...) most of the time. But for pairwise annotations in the obsp and varp groups, this would mean having column counts on the order of millions. How would you go about storing these with just 1D columns?

In particular, I think @LucaMarconato just suggests representing obs, obsm, and obsp as separate groups (i.e., tables in our definition) within an AnnData stored with our spec; is this correct? This would be entirely possible with our definitions.

LucaMarconato commented Jul 11, 2023

Exactly, I would store them as separate tables (using your definition of tables).


So do we agree to restrict the spec to 1D columns?

minnerbe (Author)

I think that restricting to 1D columns would severely break AnnData compatibility. Rather, I support Luca's suggestion of having one table (in our definition) each for obs, obsm, and obsp, where one can store 1D arrays, nD arrays with n > 1, and quadratic arrays, respectively.


I would prefer 1D columns of tuples over having 2D columns. For now, this could be restricted to homogeneous tuples. This would allow these to be exactly mappable to higher-dimensional arrays. However, in the future, heterogeneous tuples (structs with arbitrary fields) may be useful to have.

For example, instead of having a 5 row by 6 column matrix, we have a vector, a one-dimensional array, of 6-tuples.

In [26]: import numpy as np

In [27]: A = np.reshape(np.arange(30), (5,6))

In [28]: A # 2D matrix representation
Out[28]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

In [29]: B = [tuple(row) for row in A]

In [30]: B # 6-tuple representation
Out[30]:
[(0, 1, 2, 3, 4, 5),
 (6, 7, 8, 9, 10, 11),
 (12, 13, 14, 15, 16, 17),
 (18, 19, 20, 21, 22, 23),
 (24, 25, 26, 27, 28, 29)]

In [31]: np.array(B) # Easy conversion from 6-tuple representation back to a 2D matrix
Out[31]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29]])

table of `tables/` MUST contain the "annotated-data" property, which is a JSON array of
Zarr array paths and dimensions (0-based indexing), as shown below:

```json


I didn't understand this example. Could you please elaborate more on this? Why do you need to tell which dimension of an image is being annotated for each table?

Furthermore, does this allow annotating labels? For a table annotating labels, I would just put in the metadata that the table refers to the Zarr labels group; why do you need to specify which one of the yx dimensions is being annotated?

minnerbe (Author)

By requiring the image and dimension that is annotated by a table, I see two advantages compared to having a fixed naming schema:

  1. We are free to name the tables whatever we like, e.g., obs for the 0-th dimension of an array X and var for the first dimension of the same array (this may also answer @will-moore's questions of how to store the var array), without imposing this naming scheme on others, who might want to choose a more descriptive name for a more specific use case.
  2. We can annotate the same image/dimension pair by more than one table. Vice versa, one table can annotate multiple image/dimension pairs, in which case the 'orientation' of the table comes from the annotated dimension, i.e., the table can annotate the columns of one image and the rows of another one (a sketch follows below).
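A sketch of both directions, reusing the "annotated-data" layout from the example script discussed later in this thread (the table names and array paths are made up):

```python
import zarr

root = zarr.open_group("example.zarr", mode="a")
tables = root.require_group("tables")
table_a = tables.require_group("table_a")
table_b = tables.require_group("table_b")

# one table annotating two different arrays along different dimensions
table_a.attrs["annotated-data"] = [
    {"array": "/images/image_1", "dimension": "0"},   # annotates the rows of image_1 ...
    {"array": "/images/image_2", "dimension": "1"},   # ... and the columns of image_2
]

# a second table annotating the same array/dimension pair as table_a
table_b.attrs["annotated-data"] = [
    {"array": "/images/image_1", "dimension": "0"},
]
```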


Thanks for the comment. I now understand the purpose, but I want to point out some ambiguity in the terminology that you used; I think that this causes some confusion in the proposal and that addressing it can make the proposal clearer.

From your message above, it seems that obs is the table and X is the image. In the AnnData proposal, X is never an image. X is always a matrix of values, i.e., the table is "X + obs + var". obs alone is not considered a table.

In the AnnData proposal, if we have a segmentation mask (labels tensor) called lab, the way to annotate it would be with a matrix X containing the annotations and an obs dataframe containing the two columns that tell how to map the indices of the segmentation mask to the indices of the annotation matrix X. So the table "X + obs" annotates the labels lab.

This is why I was confused by the new proposal, because what you call the table (obs) should now tell which dimension of X is being annotated, but X is not the image, it's the annotation table itself.

LucaMarconato commented Jun 28, 2023

With this comment I don't want to say that obs + var + X = table is better or worse than having a more atomic concept of table (= obs). It has its advantages and its disadvantages. I just want to point out the ambiguity in the proposal since it confused me and could confuse other readers.

minnerbe (Author)

Thanks for pointing this out! I agree that the terminology is not 100% precise here and that removing all ambiguities in language is paramount to arriving at a useful spec. Part of the motivation for creating a more minimal spec was that I was confused by the terminology in the original proposal, so arriving at a common ground here would be very desirable.

The reasoning for my terminology is as follows:

  • Table, for me, means a data structure that has one (heterogeneous) record schema and multiple records (the rows). In that sense, X (as well as X + obs + var) is not a table.
  • I was also careful to let a table annotate "arrays" instead of "images", since this allows X from the AnnData schema to be annotated consistently. An AnnData object, in turn, can annotate an image in a predefined way.


I think the second sentence of the Tables spec ("Tables are an intuitive way of storing...") contributes to this confusion. I would replace it with this: "Tables are an intuitive way of organizing data or metadata consisting of variables and records, which are often organized as columns and rows, respectively."

Like Michael, I am pretty happy with our definition of a table as it is in the minimal spec, though I am also happy to make changes that will minimize confusion. In relational databases, if you have two tables that together tell a story, we treat them as two tables, and the schema describes how they relate to one another. I think if AnnData wants to conceptualize multiple 2D data structures as a single table, that's fine, but we want to create a spec that is generic and intuitive for people who've never encountered AnnData.

@virginiascarlett

We decided to add two properties, annotated-data and column-order, as MUSTs rather than MAYs because we believe it is much easier for implementers to expect a potentially empty object than to not know where that structural information is, if it exists. Kevin et al. essentially put this information in the directory names.

Under their proposal, any new table spec will probably have to either:

  1. specify new directory names and meanings for those names, or
  2. specify new metadata objects as we have.

While our two MUSTs may seem cumbersome now, they actually give AnnData more flexibility to grow and change. We are essentially adding two small pieces of metadata in exchange for more flexibility in the directory names and the layout.

We believe the core spec ought to provide a place for some basic information about the structure and meaning of the table, rather than forcing people to decide where to put that information. Also, being too rigid with directory names means that no one else can reuse that name, e.g. 'obs', 'X', or 'layers', once it already has a special meaning.
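A sketch of what this means for an implementer (zarr-python v2 assumed): both properties are always present on a table group, even when they carry no information yet, so a reader never has to guess where the structural metadata lives:

```python
import zarr

table = zarr.open_group("example.zarr/tables/my_table", mode="a")
table.attrs["annotated-data"] = []   # MUST be present, but may be empty
table.attrs["column-order"] = []     # MUST be present, but may be empty
```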

@minnerbe
Author

I finally managed to assemble an example showing how I would store an AnnData file using this spec. Since I need to rebase anyway (thanks, @joshmoore for pointing this out!), I pushed a python script to this branch: latest/generate_anndata_example.py. In this file:

  • an example AnnData object is created (slightly modified from the tutorial),
  • the object is stored on disk in a way that is mostly compatible with the current AnnData layout (example.zarr), and
  • a suggestion is given for how I would consolidate some fields of AnnData to exploit the features of the proposed table spec a bit more (example_suggestion.zarr).

In principle, example.zarr is the current AnnData layout, but without custom data types (for the reasons explained above) and with the two pieces of metadata discussed above added. The reasons why these metadata exist were already explained by @virginiascarlett, but let me reiterate that I also think this helps tables in OME-NGFF play a role beyond the current AnnData specification, e.g., as a way of storing structured metadata for images.

Please let me know if there are any concerns about how I envision AnnData potentially using the present table spec proposal. As I incorporate the suggestions already made into the spec, I will try to keep the AnnData example updated.

@will-moore

@minnerbe Sorry, I can't work out where you pushed your generate_anndata_example.py script? I don't see a latest branch under https://github.com/minnerbe/ngff/branches (or anywhere else). Could you add a link? Thanks

@minnerbe
Author

You're right, I didn't push. 🤦
It was kinda late yesterday, I'm sorry. The file should be there now.

@will-moore

@minnerbe Thanks for taking the time to write that script - I think I understand the data structures a bit better now...

That script generates an example.zarr which looks similar to AnnData but without AnnData encoding of various types (categories and sparse data). But I'm not sure if this output is for illustration purposes only, or if it's one possible option for adoption?

I'm not at all familiar with the Java side of the argument, but is it correct that handling the example.zarr generated by that script is much harder in Java than handling the example_suggestion.zarr data?
Or are the differences between them that you're proposing mainly based on data modelling arguments?

Is it the hope that AnnData would migrate towards the "minimal proposal" (so as to use the same AnnData spec with OME-NGFF as without), or is it conceivable that there'd be 2 flavours of AnnData and ways to convert between them?

I would be more supportive of dropping sparse CSR/CSC encoding from the tables spec than I would be of dropping categorical encoding, mostly because it's harder to load chunks of the sparse array that correspond to given chunks/rows of the X data if the sparse data is encoded.
I don't have a good grasp of how much bandwidth can be saved by using categorical encoding compared with regular string arrays with compression. I guess that depends on how long your strings are. Does the decoding of categorical arrays cause any issues in Java or is the proposal to drop them just in favour of "starting simple" with a really minimal spec?

@will-moore

@minnerbe I made a couple of tweaks to your script, one to fix an error and the other to export anndata.zarr to help me compare on disk:

git diff:

```diff
diff --git a/latest/generate_anndata_example.py b/latest/generate_anndata_example.py
index 447c27c..893575d 100644
--- a/latest/generate_anndata_example.py
+++ b/latest/generate_anndata_example.py
@@ -107,11 +107,11 @@ def write_anndata_suggestion(adata, filename, chunks):
     row_names = np.array(["X", "log_transformed", "other_data"])
     layers.create_dataset("row_names", data=row_names, dtype=object, object_codec=numcodecs.VLenUTF8())
     layers.attrs["annotated-data"] = [{"array": "/tables/anndata/X", "dimension": "2"}]
-    obs.attrs["column-order"] = ["row_names"]
     # obs (combines obs, obsm, obsp)
     localChunks = (chunks[0],)
     obs = adgroup.create_group("obs")
+    obs.attrs["column-order"] = ["row_names"]
     obs.create_dataset("row_names", data=np.array(adata.obs_names), dtype=object, object_codec=numcodecs.VLenUTF8())
     obs.create_dataset("cell_type", data=np.array(adata.obs["cell_type"]), chunks=localChunks, object_codec=numcodecs.VLenUTF8())
     obs.create_dataset("X_umap", data=adata.obsm["X_umap"], chunks=(chunks[0], 2))
@@ -141,3 +141,7 @@ write_anndata(adata, "example.zarr", chunks)
 
 # store example in an alternative way, exploiting the properties of the suggested minimal table spec a bit more
 write_anndata_suggestion(adata, "example_suggestion.zarr", chunks)
+
+
+store = zarr.DirectoryStore('example_anndata.zarr')
+adata.write_zarr(store)
```

As suggested by @will-moore and @virginiascarlett:
* Change SHOULD to MAY
* Add metadata attribute to specify column containing row names
@minnerbe
Author

@will-moore: In principle, the idea was to have a minimal spec for structured metadata that can be used to store AnnData easily, but is also useful beyond AnnData. E.g., imagine you have a time series of pictures stored as a 3D image, and you want to note the exact time and, say, temperature for every time slice. With readers that can deal with tables as outlined in this proposal, there is a canonical way of associating this metadata with the image (create a table with columns "time" and "temperature" and let it annotate the time dimension of the 3D image).
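A sketch of that time-series example, assuming zarr-python v2 and the attribute names from the example script (the paths, chunking, and values are made up):

```python
import numpy as np
import zarr

root = zarr.open_group("timeseries.zarr", mode="w")
root.create_dataset("image", shape=(10, 512, 512), chunks=(1, 512, 512),
                    dtype="uint16")                                   # (t, y, x)

meta = root.create_group("tables").create_group("acquisition")
meta.create_dataset("time", data=np.linspace(0.0, 9.0, 10))           # seconds, one row per time point
meta.create_dataset("temperature", data=np.full(10, 36.5))            # degrees Celsius
meta.attrs["column-order"] = ["time", "temperature"]
meta.attrs["annotated-data"] = [{"array": "/image", "dimension": "0"}]
```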

As far as I understand, there is already a fully developed AnnData standard (with multiple backends) independent of OME-NGFF. So I don't see a reason to further "standardize" AnnData by including it in OME-NGFF as-is. However, as AnnData provides undeniably more complex use cases than mine outlined above, it's very desirable to have a way of easily representing AnnData within OME-NGFF. Whether AnnData should shift more toward such a representation, or simply provide the representation as an OME-NGFF "flavour", is not for me to decide or propose.

Regarding Categoricals: I agree that they are a very-nice-to-have feature that wouldn't be too hard to implement from the Java side. My motivation for dropping this particular feature was twofold:

  1. starting simple, as you suggested;
  2. there is some redundancy with labels, as Categoricals can be represented by the image-label metadata annotating the array of category codes:
{
  "properties": [
    {
      "label-value": 1,
      "category-name": "some cell type"
    },
    {
      "label-value": 4,
      "category-name": "another cell type"
    },
    ...
  ]
}

Of course, this similarity of the image-label metadata and Categoricals is mostly conceptual and further work would need to be done to represent Categoricals properly.

Would you suggest re-introducing AnnData encodings in general, or just keeping Categoricals? In the latter case, I think it could be worthwhile to explore the redundancy mentioned above, to have a uniform way of representing categorical data.

To answer your other questions: example.zarr and example_suggestion.zarr differ only in how they use this table spec proposal. The implementation of IO in Java would probably be of very similar difficulty. In fact, the main pain point was CSR/CSC matrices, for exactly the reasons you gave (chunked loading).

@virginiascarlett

I suggest we should add a sentence at the end of the first paragraph about the purpose of tables in OME-Zarr. Something like, "While Zarr is not designed for tabular data, the user may wish to store tables of reasonable size within a Zarr hierarchy, for clarity and convenience."

minnerbe and others added 4 commits August 28, 2023 16:46
Apparently a new release of bikeshed is now unhappy with the use of
`<img/>` and `<img></img>` is required:

```
  $ bikeshed spec "latest/index.bs" "latest/index.out.html"
  LINE 651:1: Tag <img> wasn't closed at end of file.
   ✘  Did not generate, due to fatal errors

  Failed
```
@mkitti

mkitti commented Aug 28, 2023

This URL should generate a rendered preview:
http://api.csswg.org/bikeshed/?url=https://raw.githubusercontent.com/minnerbe/ngff/minimal-proposal/latest/index.bs
