[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/cellxgene-census-lamin/blob/main/docs/03-cellxgene-census.ipynb)
[![census](https://img.shields.io/badge/laminlabs/cellxgene--census-mediumseagreen)](https://lamin.ai/laminlabs/cellxgene-census)

# CELLxGENE: scRNA-seq datasets

[CELLxGENE Census](https://chanzuckerberg.github.io/cellxgene-census) is a versioned data release from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) and a [TileDB-SOMA](https://github.com/single-cell-data/TileDB-SOMA) API to query it.

LaminDB makes it easy to integrate the Census data with in-house data of any kind, from omics & phenotypic data to PDFs, notebooks & ML models.

You can use Census in three ways:

1. In the current guide, you'll see how to query the data in `.h5ad` format by validated metadata.
2. In the [transfer guide](docs:transfer), you'll see how to transfer data & metadata into your LaminDB instance.
3. In the [SOMA guide](query-census), you'll see how to use LaminDB's registries to write SOMA queries with auto-complete.

If you are interested in building on Census or building similar data assets:

1. See the [scRNA guide](docs:scrna) for how to create a growing versioned queryable scRNA-seq dataset.
2. See the [validation](docs:validate) & [validator](docs:faq/validator) guides for how to validate & write validators based on ontologies.
3. [Reach out](https://lamin.ai/contact) if you are interested in a full zero-copy clone of `laminlabs/cellxgene-census` to kick-start your in-house LaminDB instances. 
4. See the [registration guide](census-registries) for how the `laminlabs/cellxgene-census` instance was created.


## Setup

Load the public LaminDB instance that mirrors CELLxGENE Census from the CLI:

In [None]:
!lamin load laminlabs/cellxgene-census

In [None]:
import lamindb as ln
import lnschema_bionty as lb

## Search & look up metadata

Let us search for a cell type:

In [None]:
lb.CellType.search("effector T cell").head()

Let's use the persistent universal `uid` to access the metadata record:

In [None]:
t_eff = lb.CellType.filter(uid="yvHkIrVI").one()

In [None]:
t_eff

Alternatively, we can use auto-complete based on a look-up object:


<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

In [None]:
cell_types = lb.CellType.lookup()
cell_types.effector_t_cell

You can create look-up objects for any registry in LaminDB, including [basic biological entities](docs:lnschema-bionty) and things like users or storage locations.

You can also arbitrarily combine queries & search results and convert them into lookups:

In [None]:
organisms = lb.Organism.lookup()  # species
genes = lb.Gene.filter(organism=organisms.human).lookup()  # ~60k human genes
features = ln.Feature.lookup()  # non-gene features, like `cell_type`, `assay`, etc.
experimental_factors = lb.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = lb.Tissue.lookup()  # tissue labels
ulabels = ln.ULabel.lookup()  # universal labels, e.g. dataset collections
suspension_types = (
    ulabels.is_suspension_type.children.all().lookup()
)  # suspension types

## Understand ontologies

View the surrounding ontology terms:

In [None]:
t_eff.view_parents(distance=2, with_children=True)

Or access them programmatically:

In [None]:
t_eff.children.df()

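To go beyond direct children, we can walk the hierarchy breadth-first. This is a minimal sketch rather than a LaminDB feature: `collect_descendants` is a hypothetical helper, and it assumes only what the cells above already rely on, namely that each record exposes `.uid`, `.name`, and `.children.all()`.

```python
from collections import deque


def collect_descendants(record, max_depth=None):
    """Return {uid: name} for every term below `record` in the ontology.

    Assumes each record exposes `.uid`, `.name`, and `.children.all()`.
    `max_depth=None` walks the full subtree.
    """
    seen = {}
    queue = deque([(record, 0)])
    while queue:
        current, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        # Each visited term triggers one lookup of its children.
        for child in current.children.all():
            # A term can have several parents, so skip terms we already saw.
            if child.uid not in seen:
                seen[child.uid] = child.name
                queue.append((child, depth + 1))
    return seen
```

Under these assumptions, `collect_descendants(t_eff, max_depth=2)` would return the children and grandchildren of the effector T cell term. Limiting the depth keeps the number of lookups small.
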
## Query data

Unlike in the [SOMA guide](query-census), here, we'll query sets of `h5ad` files, which correspond to `AnnData` objects.

To access them, we query the {class}`~lamindb.Dataset` record that links the latest LTS set of h5ad files:

In [None]:
census_version = "2023-07-25"
dataset = ln.Dataset.filter(name="cellxgene-census", version=census_version).one()

dataset

(`Dataset` is an abstraction over different ways of storing datasets, from array stores to collections of files or paths.)

You can get all linked files as a dataframe; there are 850 files in version `2023-07-25`.

In [None]:
dataset.files.df().head()

You can also query all files by arbitrary metadata combinations, for instance:

In [None]:
query = dataset.files.filter(
    organism=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)

Display the query result as a `DataFrame`:

In [None]:
query = query.order_by("size").distinct()  # drop duplicates
query.df().head()

## Load an entire array

Each file stores an array in the form of an annotated data matrix, an `AnnData` object.

Let's look at the first array and retrieve all metadata using `.describe()`:

In [None]:
file = query.first()
file.describe()

:::{dropdown} More ways of accessing metadata

Access just features:

```
file.features
```

Or get labels given a feature:

```
file.labels.get(features.tissue).df()
```

```
file.labels.get(features.collection).one()
```

:::



If you're sure that you want to load the array, you have three options:

1. Load it directly into memory via `file.load() -> AnnData`. This automatically caches the h5ad file on disk, so you only download it once.
2. Stage it in a local cache on disk via `file.stage() -> Path`.
3. Stream data from the cloud through a backed object via `file.backed() -> AnnDataAccessor`.

All three options run much faster if you run them close to the data, which is hosted on AWS S3 on the US West Coast; consider logging into hosted compute there.
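
One way to choose between loading and streaming is by file size. The sketch below is an illustration, not a LaminDB feature: `open_array` and its 2 GiB default threshold are hypothetical, and it assumes the file record exposes its size in bytes as `file.size` (the field used in `order_by("size")` above).

```python
def open_array(file, max_in_memory_bytes=2 * 1024**3):
    """Load small files into memory and stream large ones.

    Assumes `file.size` is the size in bytes; the threshold is an arbitrary example.
    Returns the result of `file.load()` or of `file.backed()`.
    """
    if file.size is not None and file.size <= max_in_memory_bytes:
        return file.load()  # downloads once, then reads from the local cache
    return file.backed()  # streams from cloud storage without a full download
```

Because both return values support the slicing shown in the next sections, downstream code can often stay the same for either case.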

## Load an array slice

Let us first work with an object in memory:

In [None]:
adata = file.load()
adata

Now we have an `AnnData` object, which stores observation annotations matching our file-level query in the `.obs` slot.

If we'd like to subset a slice, we can apply the same criteria we used to retrieve the file:

:::{dropdown} See the file-level query for comparison

```
query = dataset.files.filter(
    organism=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
```

`AnnData` uses pandas to manage metadata, so the syntax differs slightly, but the same metadata reference records are used.

:::

In [None]:
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice

If we want to aggregate these individual file-level slices, we can loop over all files and concatenate the results.
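
A minimal sketch of such a loop could look as follows. `concat_file_slices` is a hypothetical helper, not part of LaminDB; it assumes that each file supports `file.load()` as shown above and that `make_mask` returns a boolean mask over the observations. By default it concatenates with `anndata.concat`, whose default `join="inner"` keeps only the variables shared by all slices.

```python
def concat_file_slices(files, make_mask, concat=None):
    """Load each file, keep the observations selected by `make_mask`, and concatenate.

    `make_mask(obs)` receives `adata.obs` and returns a boolean mask.
    `concat` defaults to `anndata.concat`; pass another callable to change the join.
    """
    if concat is None:
        import anndata as ad  # imported lazily, only needed for the default

        concat = ad.concat
    slices = []
    for file in files:
        adata = file.load()  # loads the whole array into memory
        subset = adata[make_mask(adata.obs)]
        if subset.n_obs > 0:  # skip files without matching observations
            slices.append(subset)
    if not slices:
        raise ValueError("no observations matched the mask in any file")
    return concat(slices)
```

Each file is loaded in full before slicing, so for many or large files this can take a lot of time and memory.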

## Stream an array slice

Depending on the use case, we might prefer to directly stream the slice without downloading the entire file.

Here's how to do it:

In [None]:
adata_backed = file.backed()
adata_backed

The `AnnDataAccessor` behaves largely in the same way as `AnnData`, and hence, the query looks the same:

In [None]:
adata_slice = adata_backed[
    adata_backed.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata_backed.obs.tissue == tissues.kidney.name)
    & (adata_backed.obs.suspension_type == suspension_types.cell.name)
    & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
]

adata_slice

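Since the in-memory and the streamed slice use the same criteria, the mask can be built once by a small helper. This is a sketch under assumptions: `obs_mask` is a hypothetical function, and it relies only on `obs[column].isin(...)` and `&`, which pandas provides for the `.obs` table.

```python
from functools import reduce
from operator import and_


def obs_mask(obs, criteria):
    """Combine per-column membership tests into one boolean mask.

    `obs` is the `.obs` table; `criteria` maps a column name
    to the collection of accepted values for that column.
    """
    if not criteria:
        raise ValueError("criteria must contain at least one column")
    masks = [obs[column].isin(list(values)) for column, values in criteria.items()]
    return reduce(and_, masks)  # a row must satisfy every criterion
```

With a dictionary such as `{"tissue": [tissues.kidney.name], "assay": [experimental_factors.ln_10x_3_v2.name]}`, the same call `obs_mask(adata.obs, criteria)` or `obs_mask(adata_backed.obs, criteria)` yields the index for either object.
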
## Exploring data by collection

Alternatively, you can

- [search for a file on the LaminHub UI](https://lamin.ai/laminlabs/cellxgene-census/records/core/File) and fetch it through `ln.File.filter(uid="...").one()`, or
- query for a collection you found on [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/collections).

Let's search the collections from CELLxGENE:

In [None]:
ulabels.is_collection.search("immune zonation of the human kidney", limit=10)

Let's get the full metadata record of the top hit collection:

In [None]:
collection_13BWB722 = ln.ULabel.filter(uid="13BWB722").one()

collection_13BWB722

We see it's a Science paper and we could find more information using the [DOI](https://doi.org/10.1126/science.aat5031) or CELLxGENE [collection id](https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec).

Each collection has at least one {class}`~lamindb.File` associated with it. Let's query them for this collection:

In [None]:
ln.File.filter(ulabels=collection_13BWB722).df()

```{toctree}
:maxdepth: 1
:hidden:

census-registries
query-census
```