[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/cellxgene-census-lamin/blob/main/docs/03-cellxgene-census.ipynb)
[![census](https://img.shields.io/badge/laminlabs/cellxgene--census-mediumseagreen)](https://lamin.ai/laminlabs/cellxgene-census)

# CELLxGENE: scRNA-seq datasets

[CELLxGENE Census](https://chanzuckerberg.github.io/cellxgene-census) is a versioned data release from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) and a [TileDB-SOMA](https://github.com/single-cell-data/TileDB-SOMA) API to query it.

LaminDB makes it easy to integrate the Census data with in-house data of any kind, from omics & phenotypic data to PDFs, notebooks & ML models.

You can use Census in three ways:

1. In this guide, you'll see how to query the data in `.h5ad` format by validated metadata.
2. In the [transfer guide](docs:transfer), you'll see how to transfer data & metadata into your LaminDB instance.
3. In the [SOMA guide](query-census), you'll see how to use LaminDB's registries to write SOMA queries with auto-complete.

If you are interested in building on Census or building similar data assets:

1. See the [scRNA guide](docs:scrna) for how to create a growing versioned queryable scRNA-seq dataset.
2. See the [validation](docs:validate) & [validator](docs:faq/validator) guides for how to validate & write validators based on ontologies.
3. [Reach out](https://lamin.ai/contact) if you are interested in a full zero-copy clone of `laminlabs/cellxgene-census` to kick-start your in-house LaminDB instances. 
4. See the [registration guide](census-registries) for how the `laminlabs/cellxgene-census` instance was created.


## Setup

Use the CLI to load the public LaminDB instance that mirrors cellxgene-census:

In [None]:
!lamin load laminlabs/cellxgene-census

In [None]:
import lamindb as ln
import lnschema_bionty as lb

## Search & look up metadata

Let us search for a cell type:

In [None]:
lb.CellType.search("effector T cell").head()

Let's use the persistent universal `uid` to access the metadata record:

In [None]:
t_eff = lb.CellType.filter(uid="yvHkIrVI").one()

In [None]:
t_eff
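
The cell above ends the query with `.one()`, while a later cell in this guide uses `.first()`. As a rough mental model, `.one()` insists on exactly one match and `.first()` simply takes the first match, if any. The plain-Python sketch below illustrates that difference on a list; the helper names and the `ValueError` are hypothetical stand-ins, not LaminDB's implementation or its exception types.

```python
def one(results):
    """Return the single element; fail loudly if there are zero or several."""
    if len(results) != 1:
        # Hypothetical error type, chosen only for this illustration
        raise ValueError(f"expected exactly one result, got {len(results)}")
    return results[0]


def first(results):
    """Return the first element, or None if there is nothing to return."""
    return results[0] if results else None


# Hypothetical query results, for illustration only
unique_hit = ["effector T cell"]
several_hits = ["effector T cell", "effector memory T cell"]

print(one(unique_hit))      # effector T cell
print(first(several_hits))  # effector T cell
print(first([]))            # None

try:
    one(several_hits)
except ValueError as error:
    print(error)            # expected exactly one result, got 2
```

Querying by `uid` pairs naturally with `.one()`, because a unique identifier should never match more than one record.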

Alternatively, we can use auto-complete based on a look-up object:


<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

In [None]:
cell_types = lb.CellType.lookup()
cell_types.effector_t_cell

You can create look-up objects for any registry in LaminDB, including [basic biological entities](docs:lnschema-bionty) and things like users or storage locations.
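
To make the idea of a look-up object more concrete, here is a minimal sketch in plain Python. It normalizes record names into valid attribute names so that an editor can auto-complete them. The normalization rules, the helper names, and the example names are assumptions for illustration; the `ln_` prefix for names that start with a digit is modeled on the `assays.ln_10x_3_v2` attribute used later in this guide. This is not how LaminDB builds its look-ups internally.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name: str) -> str:
    """Turn a free-text name into a lower-case, identifier-friendly key."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Identifiers cannot start with a digit; the "ln_" prefix is an assumption
    if key and key[0].isdigit():
        key = f"ln_{key}"
    return key


def make_lookup(names):
    """Build an object whose attributes map normalized keys back to names."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})


# Hypothetical record names, for illustration only
lookup = make_lookup(["effector T cell", "dendritic cell", "10x 3' v2"])

print(lookup.effector_t_cell)  # effector T cell
print(lookup.ln_10x_3_v2)      # 10x 3' v2
```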

You can also arbitrarily combine queries & search results and convert them into lookups:

In [None]:
organisms = lb.Organism.lookup()
genes = lb.Gene.filter(organism=organisms.human).lookup()  # just human genes
features = ln.Feature.lookup()
assays = lb.ExperimentalFactor.lookup()
tissues = lb.Tissue.lookup()
ulabels = ln.ULabel.lookup()
suspension_types = ulabels.is_suspension_type.children.all().lookup()

## Understand ontologies

Visualize the ontology terms surrounding a record:

In [None]:
t_eff.view_parents(distance=2, with_children=True)

Or access them programmatically:

In [None]:
t_eff.children.df()
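
If it helps to see what "parents within a distance" and "children" mean, the following sketch walks a tiny hierarchy stored as a dictionary. The hierarchy is a simplified, hand-written toy excerpt and was not pulled from the ontology; the function names are hypothetical and unrelated to LaminDB's API.

```python
# Toy, simplified hierarchy (child -> parents), for illustration only
PARENTS = {
    "effector T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "effector CD4-positive, alpha-beta T cell": ["effector T cell"],
    "effector CD8-positive, alpha-beta T cell": ["effector T cell"],
}


def children_of(term):
    """Return the direct children of a term, sorted by name."""
    return sorted(child for child, parents in PARENTS.items() if term in parents)


def ancestors_within(term, distance):
    """Collect ancestors reachable in at most `distance` parent steps."""
    found, frontier = [], [term]
    for _ in range(distance):
        # Move one level up from every term in the current frontier
        frontier = [p for t in frontier for p in PARENTS.get(t, [])]
        for parent in frontier:
            if parent not in found:
                found.append(parent)
    return found


print(ancestors_within("effector T cell", distance=2))  # ['T cell', 'lymphocyte']
print(children_of("effector T cell"))
```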

## Query data

Unlike in the [SOMA guide](query-census), here we'll query sets of `.h5ad` files, which correspond to `AnnData` objects.

To access them, we query the {class}`~lamindb.Dataset` record that links the latest versioned set of h5ad files:

In [None]:
census_version = "2023-07-25"
dataset = ln.Dataset.filter(name="cellxgene-census", version=census_version).one()

dataset

(`Dataset` is an abstraction over different ways of storing datasets, from array stores to collections of files or paths.)

You can get all linked files as a dataframe; version `2023-07-25` contains 850 files.

In [None]:
dataset.files.df().head()

You can also query all files by arbitrary metadata combinations, for instance:

In [None]:
query = (
    dataset.files.filter(
        organism=organisms.human,
        cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
        tissues=tissues.kidney,
        ulabels=suspension_types.cell,
        experimental_factors=assays.ln_10x_3_v2,
    )
    .order_by("size")  # order by size
    .distinct()  # drop duplicated query results
)

Display query result as a `DataFrame`:

In [None]:
query.df()
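
Why does the query call `.distinct()`? When a filter spans a many-to-many link with an `__in` condition, the underlying join can return the same file once per matching label. The sketch below mimics this with a plain-Python link table. All records, identifiers, and sizes are hypothetical, and the code only illustrates relational join semantics; it is not LaminDB's implementation.

```python
# Hypothetical files and a link table (file uid, cell type), for illustration
files = {
    "f1": {"size": 300, "organism": "human"},
    "f2": {"size": 100, "organism": "human"},
    "f3": {"size": 200, "organism": "mouse"},
}
file_cell_types = [
    ("f1", "dendritic cell"),
    ("f1", "neutrophil"),
    ("f2", "neutrophil"),
    ("f2", "B cell"),
    ("f3", "neutrophil"),
]


def join_and_filter(organism, cell_types_in):
    """Return one row per (file, matching cell type) link, like a SQL join."""
    return [
        uid
        for uid, cell_type in file_cell_types
        if cell_type in cell_types_in and files[uid]["organism"] == organism
    ]


rows = join_and_filter("human", ["dendritic cell", "neutrophil"])
print(rows)  # ['f1', 'f1', 'f2'] -> f1 matches both cell types

distinct_rows = list(dict.fromkeys(rows))  # drop duplicates, keep order
ordered = sorted(distinct_rows, key=lambda uid: files[uid]["size"])
print(ordered)  # ['f2', 'f1'] -> smallest file first
```

Conditions passed as separate keyword arguments all have to hold, whereas the values inside an `__in` list are alternatives.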

Alternatively,

- you can [search for a file on the LaminHub UI](https://lamin.ai/laminlabs/cellxgene-census/records/core/File) and fetch it through `ln.File.filter(uid="...").one()`
- or query for a collection you found on [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/collections)

Each collection is stored as a {class}`~lamindb.File` record with an underlying `h5ad` file:

In [None]:
ln.File.filter(ulabels__name="Spatiotemporal immune zonation of the human kidney").df()

## Load data

Access and describe an individual file:

In [None]:
file = query.first()  # get the first file in the query result
file.describe()

You can also access its features alone:

In [None]:
file.features

Or get labels given a feature:

In [None]:
file.labels.get(features.tissue).df()

In [None]:
file.labels.get(features.collection).one()

If you're sure that you want to load these data, you have three options:

1. Load them directly into memory via `file.load() -> AnnData`
2. Stage them locally on disk via `file.stage() -> Path`
3. Access a backed object and stream data from the cloud via `file.backed() -> AnnDataAccessor`

All three options run much faster close to the data, which is stored in AWS S3 on the US West Coast; consider logging into hosted compute in that region.

In [None]:
file.backed()

In [None]:
file.load()

```{toctree}
:maxdepth: 1
:hidden:

census-registries
query-census
```