adata slots for table and point spec in ngff #99

giovp · 2022-02-08T17:28:04Z

Related to #64 and https://github.com/kevinyamauchi/ome-ngff-tables-prototype. As discussed this morning with @kevinyamauchi, this is a description of the adata slots and how they are used in https://github.com/theislab/squidpy and other spatial analysis tools of the https://github.com/theislab/scanpy ecosystem.

- adata.X and adata.layers["layer"] store molecular info (gene/protein expression etc.).
- adata.obsm stores various "latent" representations of obs (e.g. PCA/UMAP coordinates), but also:
  - adata.obsm["spatial"] stores obs coordinates in space, with shape (N, 2) or (N, 3).
  - adata.obsm["molecule_spatial"] will store molecule locations in FISH-based data (with awkward arrays).
- adata.varm: no real use in spatial data afaik.
- adata.obsp stores adjacency matrices, e.g. graphs in spatial coordinates, knn graphs in latent spaces etc.
- adata.varp: no real use in spatial data afaik.
- adata.uns stores a bunch of image-related data. It is structured as follows:
  - adata.uns["spatial"] contains library_id keys that correspond to unique identifiers of images (e.g. tissue slides). These identifiers are also stored in adata.obs["library_ids"], which can be used to subset anndata based on the tissue slide of interest. Furthermore, inside adata.uns["spatial"][<library_id>] there are 2 more dictionaries:
    - images for small-size tissue images (on the order of MBs)
    - scalefactors for metadata related to scaling the original coordinates in adata.obsm["spatial"], as well as other information
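
To make the adata.uns["spatial"] layout described above a bit more concrete, here is a minimal sketch that mocks it with plain dicts and NumPy arrays rather than a real AnnData object. The slide names, image sizes and coordinates are made up, and the "tissue_hires_scalef" key follows the Visium-style naming as an assumption; other datasets may use different keys.

```python
import numpy as np

# Hypothetical data: 4 observations (e.g. spots) spread over two tissue slides.
# In a real AnnData these would live in adata.obs, adata.obsm and adata.uns.
obs_library_id = np.array(["slide_A", "slide_A", "slide_B", "slide_B"])
obsm = {
    "spatial": np.array(
        [[100.0, 200.0], [150.0, 250.0], [40.0, 60.0], [80.0, 20.0]]
    )
}
uns = {
    "spatial": {
        "slide_A": {
            "images": {"hires": np.zeros((200, 200, 3), dtype=np.uint8)},
            "scalefactors": {"tissue_hires_scalef": 0.5},  # assumed key name
        },
        "slide_B": {
            "images": {"hires": np.zeros((100, 100, 3), dtype=np.uint8)},
            "scalefactors": {"tissue_hires_scalef": 0.25},  # assumed key name
        },
    }
}


def coords_in_image(library_id, image_key="hires"):
    """Return the coordinates of one slide, scaled to the chosen image."""
    # Select the observations that belong to this slide.
    mask = obs_library_id == library_id
    # Look up the scale factor stored next to the image.
    scalefactors = uns["spatial"][library_id]["scalefactors"]
    scale = scalefactors[f"tissue_{image_key}_scalef"]
    return obsm["spatial"][mask] * scale


pixels_a = coords_in_image("slide_A")
pixels_b = coords_in_image("slide_B")
```

The point of the sketch is only that the library id ties together three things: a subset of obs, an image, and the scale factor needed to overlay one on the other.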

adata.uns also stores intermediate analysis results from several analysis tools in the ecosystem, e.g. trajectory analysis, velocity, various plotting params etc. I would therefore consider supporting it for better integration with the ecosystem.
It would also be ok to store the same type of info in metadata in ngff, and then handle this on the API side (it'd be fine for us at Squidpy, not so sure for others).

@kevinyamauchi next week I'll try out https://github.com/kevinyamauchi/ome-ngff-tables-prototype and report back, thanks again for sharing.

Just want to mention one more time that this is super exciting, and I am really looking forward to seeing how it develops!

pinging various people @ivirshup @michalk8 @hspitzer @AnnaChristina @LucaMarconato


kevinyamauchi · 2022-02-09T19:29:29Z

Thanks, @giovp ! This is very helpful. In principle, it seems like all of these attributes could be included in the table spec. I don't think they add any additional data types (potentially with the exception of awkward arrays), so I don't think it adds much extra work for the non-python implementations.

A couple of follow-up questions:

- are there any constraints to what is allowed to be stored in adata.uns? Is it generally dictionaries, numbers, strings, arrays of numbers, and arrays of strings? There wouldn't be some scanpy or squidpy object stored in there, right?
- Do you think the images in adata.uns could be stored as an image in the OME-NGFF file? If so, maybe there could be a reference in uns that says where the image is stored. I suppose the additional bonus is this could potentially open the door to linking to images in different files. From our conversation, it seemed like this would be okay, but I just wanted to double check.
- Are you already using the awkward arrays in anndata.obsm? If so, do you already have a way to write them to zarr? I was looking around and only found this issue (zarr-developers/zarr-specs#62). I think we briefly chatted about this in our call, but I can't remember what the conclusion was. Maybe @joshmoore remembers?
- I was playing around with scanpy today and learned there is also AnnData.raw. Does that also need to be saved to disk?

joshmoore · 2022-02-10T07:10:22Z

> Maybe @joshmoore remembers?

There hasn't been a zarr proposal yet. Interestingly, @eriknw joined the zarr call last night and I mentioned to him that @ivirshup might be getting in touch. I know @martindurant is interested as well. @MSanKeys963 and I can spend some time getting the existing issues cleaned up.

giovp · 2022-02-10T08:52:42Z

Thanks @kevinyamauchi for the prompt reply!

> are there any constraints to what is allowed to be stored in adata.uns? Is it generally dictionaries, numbers, strings, arrays of numbers, and arrays of strings? There wouldn't be some scanpy or squidpy object stored in there, right?

Exactly. I think there could also be pandas dataframes, and there is interest in tuples and named tuples afaik, but I think what you listed should be enough. Maybe @ivirshup can comment more on this?

> Do you think the images in adata.uns could be stored as an image in the OME-NGFF file? If so, maybe there could be a reference in uns that says where the image is stored. I suppose the additional bonus is this could potentially open the door to linking to images in different files. From our conversation, it seemed like this would be okay, but I just wanted to double check.

Yes, exactly. This is something that would be very useful, and we'd be happy to change the current API in squidpy to accommodate that eventually. The current solution is not sustainable and doesn't really scale.

> Are you already using the awkward arrays in anndata.obsm? If so, do you already have a way to write them to zarr? I was looking around and only found zarr-developers/zarr-specs#62. I think we briefly chatted about this in our call, but I can't remember what the conclusion was. Maybe @joshmoore remembers?

I'm working on adding awkward array support in both varm and obsm in scverse/anndata#647. IO is what is currently missing. I am not sure how easy it'd be to write them to zarr; maybe @ivirshup can chip in here?

> I was playing around with scanpy today and learned there is also AnnData.raw. Does that also need to be saved to disk?

I completely forgot about raw, sorry for that. Yes, I guess that would also need to be saved to disk. Afaik it's not used much anymore; it was more useful when anndata didn't have layers. Again, I'd like @ivirshup to comment on that.

kevinyamauchi · 2022-02-10T20:05:18Z

Thank you for all of the feedback, @giovp ! I will have a look at your awkward array PR.

Would the best way for me to play with an AnnData object with typical spatial data be to make one using the instructions from one of your nice tutorials? Perhaps this one?

ivirshup · 2022-02-14T13:05:29Z

Values in `.uns`

> are there any constraints to what is allowed to be stored in adata.uns? Is it generally dictionaries, numbers, strings, arrays of numbers, and arrays of strings? There wouldn't be some scanpy or squidpy object stored in there, right?

In the new release candidate you can put an AnnData in .uns. Basically anything that we're able to write can be put in uns.

ragged/awkward array storage

> I'm working on adding awkward array support in both varm and obsm here scverse/anndata#647. IO is what it is currently missing, I am not sure how easy it'd be to write them to zarr

My hope is this should be fairly straightforward with ak.to_buffers, but maybe @joshmoore or other zarr developers would know more here. Would also be happy to take a different approach, like directly using the zarr ragged array encoding.

raw

So, I would like to deprecate .raw. Still need to figure out just how feasible that is, whether it needs to be directly replaced.

One option here is using mudata for shared observations with non-shared variables. I've written a bit more on this here: scverse/mudata#13. Ambrose has also proposed this kind of approach attached around the singlecelldata/matrix-api, but I'm not sure there's much of that conversation on github.

joshmoore · 2022-02-18T20:56:41Z

> My hope is this should be fairly straightforward with ak.to_buffers, but maybe @joshmoore or other zarr developers would know more here. Would also be happy to take a different approach, like directly using the zarr ragged array encoding.

👍 Work here will likely start to ramp up soon (perhaps along with the soon to be listed https://github.com/zarr-developers/gsoc/tree/main/2022). Seems like it's a good time to get all of us working in the same direction.

ivirshup · 2022-02-23T14:43:26Z

Is there a good place to discuss the awkward array proposal? I'm wondering whether the goal here is more like ak.to_buffers or zarr ragged array. In particular, where do current needs sit on interoperability? Many languages do ragged arrays, but I think awkward adds some features on top of that.

ivirshup · 2022-02-23T14:54:14Z

@kevinyamauchi, one more point I forgot to add about potential future changes in AnnData: X may be just another layer. It's also optional at the moment.

One place this might play into the design on OME is for point data, if X is still being used to store the coordinates. Have you considered putting the coordinates in obsm instead, and having X be a sparse matrix? The points table could then have shape n_points x n_var_types. I think there could be a couple advantages here:

- There's probably more annotation per kind of point than per coordinate dimension (and the dimension's metadata is captured elsewhere)
- Less repetition of metadata per kind of point in the .obs table.
- Numeric values per point, e.g. intensities, probabilistic assignments
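
As a rough illustration of the n_points x n_var_types layout suggested above, the sketch below builds the sparse matrix as COO triplets (row, column, value) with NumPy only. The gene names, coordinates and intensities are invented, and the dense copy exists purely for inspection; in practice the triplets could be handed to something like scipy.sparse.coo_matrix((vals, (rows, cols)), shape=...).

```python
import numpy as np

# Hypothetical FISH-like data: 5 detected molecules, 3 possible gene types.
var_names = ["geneA", "geneB", "geneC"]
point_gene = ["geneB", "geneA", "geneB", "geneC", "geneA"]
intensity = np.array([0.9, 0.4, 0.7, 1.0, 0.6])

# Coordinates go to obsm["spatial"] instead of X: one row per point.
coords = np.array(
    [[1.0, 2.0], [3.5, 0.5], [2.0, 2.0], [0.0, 4.0], [5.0, 1.0]]
)

# COO triplets for X: exactly one nonzero per point, in the column of its gene.
col_of = {name: j for j, name in enumerate(var_names)}
rows = np.arange(len(point_gene))
cols = np.array([col_of[g] for g in point_gene])
vals = intensity

# Dense copy for inspection only; real data would stay sparse.
X_dense = np.zeros((len(point_gene), len(var_names)))
X_dense[rows, cols] = vals
```

With this layout, per-gene annotation lives once per column (in var) rather than being repeated for every point in obs, and the nonzero entry can carry a numeric value such as an intensity.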

giovp · 2022-03-07T16:21:34Z

@joshmoore an update regarding reading/writing awkward arrays in zarr: we ended up doing it with ak.to_buffers.

This is the relevant code from scverse/anndata#647:

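```python
from types import MappingProxyType

# NOTE: _REGISTRY, IOSpec, H5Group, ZarrGroup, AwkArray, write_elem, read_elem
# and _read_attr are anndata-internal names used in scverse/anndata#647;
# this excerpt is meant to be read in that context, not run standalone.


@_REGISTRY.register_write(H5Group, AwkArray, IOSpec("awkward-array", "0.1.0"))
@_REGISTRY.register_write(ZarrGroup, AwkArray, IOSpec("awkward-array", "0.1.0"))
def write_awkward(f, k, v, dataset_kwargs=MappingProxyType({})):
    import awkward as ak

    group = f.create_group(k)
    # Decompose the awkward array into a form, a length and flat buffers.
    form, length, container = ak.to_buffers(v)
    # The small metadata goes into attributes, the buffers into a sub-element.
    group.attrs["length"] = length
    group.attrs["form"] = form.tojson()
    write_elem(group, "container", container, dataset_kwargs=dataset_kwargs)


@_REGISTRY.register_read(H5Group, IOSpec("awkward-array", "0.1.0"))
@_REGISTRY.register_read(ZarrGroup, IOSpec("awkward-array", "0.1.0"))
def read_awkward(elem):
    import awkward as ak

    form = _read_attr(elem.attrs, "form")
    length = _read_attr(elem.attrs, "length")
    container = read_elem(elem["container"])

    # Rebuild the awkward array from the stored form, length and buffers.
    return ak.from_buffers(form, length, container)
```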

where:

- form is a JSON-formatted string
- length is an int with the array length
- container is a dict with the actual data

as per API and tutorial
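
For readers unfamiliar with the form / length / container split, the sketch below hand-rolls the same idea for a simple ragged list using NumPy only. It is a conceptual stand-in, not awkward's actual output: the function names, the simplified form dict and the buffer keys ("offsets", "content") are all made up for illustration.

```python
import numpy as np


def to_flat_buffers(rows):
    """Split a ragged list of float lists into form, length and flat buffers."""
    lengths = [len(r) for r in rows]
    # offsets[i]:offsets[i + 1] delimits row i inside the flat content buffer.
    offsets = np.concatenate([[0], np.cumsum(lengths)]).astype(np.int64)
    content = np.array([x for r in rows for x in r], dtype=np.float64)
    # Hypothetical, heavily simplified description of the layout.
    form = {"class": "ListOffsetArray", "content": "float64"}
    return form, len(rows), {"offsets": offsets, "content": content}


def from_flat_buffers(form, length, container):
    """Rebuild the ragged list from the flat buffers."""
    offsets = container["offsets"]
    content = container["content"]
    return [
        content[offsets[i]:offsets[i + 1]].tolist() for i in range(length)
    ]


ragged = [[1.0, 2.0], [], [3.0, 4.0, 5.0]]
form, length, container = to_flat_buffers(ragged)
restored = from_flat_buffers(form, length, container)
```

The useful property for zarr or HDF5 is that every buffer is a plain rectangular array, so the container can be written with the existing array IO while form and length fit in attributes.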

kevinyamauchi mentioned this issue Jun 19, 2022

Table spec proposal #64

Closed

joshmoore mentioned this issue Aug 29, 2024

Sparse Array Support #257

Open
