[![hub](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/cellxgene/record/core/Transform?uid=5FUyJ6RkVk0Dz8)

# CELLxGENE: scRNA-seq

[CZ CELLxGENE](https://cellxgene.cziscience.com/) hosts the world's largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, PDFs, notebooks, ML models, ...).

You can use the CELLxGENE data in three ways:

1. In this guide, you'll see how to query metadata and data based on `AnnData` objects.
2. If you want to use these datasets in your own LaminDB instance, see the [transfer guide](docs:transfer).
3. If you'd like to leverage the TileDB-SOMA API for the data subset of [CELLxGENE Census](https://chanzuckerberg.github.io/cellxgene-census), see the [Census guide](query-census).

If you are interested in building similar data assets in-house:

1. See the [scRNA guide](docs:scrna) for how to create a growing versioned queryable scRNA-seq dataset.
2. See the [Annotate guide](./cellxgene-annotate) for validating, curating and registering your own `AnnData` objects.
3. [Reach out](https://lamin.ai/contact) if you are interested in a full zero-copy clone of `laminlabs/cellxgene` to accelerate building your in-house LaminDB instances.


## Setup

On the CLI, load the public LaminDB instance that mirrors CELLxGENE:

In [None]:
!lamin load laminlabs/cellxgene

In [None]:
import lamindb as ln
import bionty as bt

## Query & understand metadata

### Auto-complete metadata

You can create look-up objects for any registry in LaminDB, including [basic biological entities](docs:bionty) and things like users or storage locations.

Let's use auto-complete to look up cell types:

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

In [None]:
cell_types = bt.CellType.lookup()
cell_types.effector_t_cell

You can also arbitrarily chain filters and create lookups from them:

In [None]:
organisms = bt.Organism.lookup()  # species
genes = bt.Gene.filter(organism=organisms.human).lookup()  # ~60k human genes
features = ln.Feature.lookup()  # non-gene features, like `cell_type`, `assay`, etc.
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
ulabels = ln.ULabel.lookup()  # universal labels, e.g. dataset collections
suspension_types = (
    ulabels.is_suspension_type.children.all().lookup()
)  # suspension types

### Search & filter metadata

You can search and filter metadata:

In [None]:
bt.CellType.search("effector T cell")

In [None]:
bt.CellType.search("CD8-positive cytokine effector T cell")

And use a `uid` to filter exactly one metadata record:

In [None]:
effector_t_cell = bt.CellType.filter(uid="3nfZTVV4").one()
effector_t_cell

### Understand ontologies

View the related ontology terms: 

In [None]:
effector_t_cell.view_parents(distance=2, with_children=True)

Or access them programmatically:

In [None]:
effector_t_cell.children.df()

## Query artifacts

Unlike in the [Census guide](query-census), which uses the TileDB-SOMA API, here we'll query sets of `h5ad` files, which correspond to `AnnData` objects.

To access them, we query the {class}`~lamindb.Collection` record that links the latest LTS set of h5ad files:

In [None]:
collection = ln.Collection.filter(name="cellxgene-census", version="2023-07-25").one()
collection

You can get all linked files as a DataFrame. There are 850 files in `cellxgene-census` version `2023-07-25`.

In [None]:
collection.artifacts.df().head()  # not tracking run & transform because read-only instance

You can query across files by arbitrary metadata combinations, for instance:

In [None]:
query = collection.artifacts.filter(
    organism=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.df().head()  # convert to DataFrame
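
To make the semantics of such a query concrete, here is a toy, in-memory sketch. It is not how LaminDB executes the query (that happens in the database). It only illustrates the assumption that keyword conditions are combined with AND, while `__in` matches any of the listed values. The records, keys and the `toy_filter` helper are hypothetical.

```python
# Toy stand-ins for artifact records; all values are made up for illustration.
toy_artifacts = [
    {"key": "a.h5ad", "organism": "human", "cell_types": {"dendritic cell", "T cell"}, "tissues": {"kidney"}, "size": 300},
    {"key": "b.h5ad", "organism": "human", "cell_types": {"neutrophil"}, "tissues": {"kidney", "blood"}, "size": 100},
    {"key": "c.h5ad", "organism": "mouse", "cell_types": {"neutrophil"}, "tissues": {"kidney"}, "size": 50},
    {"key": "d.h5ad", "organism": "human", "cell_types": {"B cell"}, "tissues": {"kidney"}, "size": 20},
    {"key": "e.h5ad", "organism": "human", "cell_types": {"dendritic cell"}, "tissues": {"lung"}, "size": 10},
]


def toy_filter(records, organism, cell_types_in, tissue):
    """Hypothetical helper: AND across conditions, 'any of' within cell_types_in."""
    wanted = set(cell_types_in)
    return [
        r
        for r in records
        if r["organism"] == organism
        and not r["cell_types"].isdisjoint(wanted)  # at least one wanted cell type
        and tissue in r["tissues"]
    ]


hits = toy_filter(toy_artifacts, "human", ["dendritic cell", "neutrophil"], "kidney")
hits = sorted(hits, key=lambda r: r["size"])  # mirrors `order_by("size")`
keys = [r["key"] for r in hits]
print(keys)
```

Only records that satisfy every condition remain, and a record qualifies on cell type as soon as one of its labels is in the list.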

## Query arrays

Each file stores an array in the form of an annotated data matrix, an `AnnData` object.

Let's look at the first array in the file query and show metadata using `.describe()`:

In [None]:
artifact = query.first()
artifact.describe()

:::{dropdown} More ways of accessing metadata

Access just features:

```
artifact.features
```

Or get labels given a feature:

```
artifact.labels.get(features.tissue).df()
```

```
artifact.labels.get(features.collection).one()
```

:::



If you want to query a slice of the array data, you have two options:

1. Cache & load the entire array into memory via `artifact.load() -> AnnData` (caches the h5ad on disk, so that you only download once)
2. Stream the array from the cloud using a cloud-backed accessor `artifact.backed() -> AnnDataAccessor`

Both options run much faster if you run them close to the data, which is stored in AWS S3 on the US West Coast. Consider logging into hosted compute there.

### 1. Cache & load

Let us first consider option 1:

In [None]:
adata = artifact.load()
adata

Now we have an `AnnData` object, which stores observation annotations matching our file-level query in the `.obs` slot, and we can re-use almost the same query on the array-level:

:::{dropdown} See the file-level query for comparison

```
query = collection.artifacts.filter(
    organism=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
```

`AnnData` uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.

:::

In [None]:
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice

### 2. Stream

Let us now consider option 2:

In [None]:
adata_backed = artifact.backed()
adata_backed

We now have an `AnnDataAccessor` object, which behaves much like an `AnnData`, and the query looks the same:

In [None]:
adata_backed_slice = adata_backed[
    adata_backed.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata_backed.obs.tissue == tissues.kidney.name)
    & (adata_backed.obs.suspension_type == suspension_types.cell.name)
    & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
]

adata_backed_slice.to_memory()
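
The boolean mask is identical for the in-memory and the streamed query, so it can be factored into a small helper. The sketch below is a suggestion rather than part of the LaminDB API: `obs_mask` is a hypothetical function, and the toy table only stands in for `.obs`, assuming `.obs` behaves like a pandas `DataFrame` with the columns used above.

```python
import pandas as pd


def obs_mask(obs, cell_type_names, tissue_name, suspension_type_name, assay_name):
    """Hypothetical helper: boolean mask over an `.obs`-like DataFrame."""
    return (
        obs["cell_type"].isin(cell_type_names)
        & (obs["tissue"] == tissue_name)
        & (obs["suspension_type"] == suspension_type_name)
        & (obs["assay"] == assay_name)
    )


# Toy stand-in for `adata.obs`; values are made up for illustration.
toy_obs = pd.DataFrame(
    {
        "cell_type": ["dendritic cell", "neutrophil", "B cell", "neutrophil"],
        "tissue": ["kidney", "kidney", "kidney", "lung"],
        "suspension_type": ["cell", "cell", "cell", "cell"],
        "assay": ["10x 3' v2", "10x 3' v2", "10x 3' v2", "10x 3' v2"],
    }
)

mask = obs_mask(
    toy_obs, ["dendritic cell", "neutrophil"], "kidney", "cell", "10x 3' v2"
)
selected = toy_obs[mask]  # rows matching all four conditions
print(selected)
```

With real data, the same helper could be applied as `adata[obs_mask(adata.obs, ...)]` or `adata_backed[obs_mask(adata_backed.obs, ...)]`, passing the `.name` of the looked-up records.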

### 3. Concatenate slices 

If we want to concatenate these individual file-level slices, loop over all files in `query` and concatenate the results.

:::{dropdown} What would this look like?

```
adata_slices = []
for artifact in query:  # stream and slice each artifact in the file-level query
    adata_backed = artifact.backed()
    adata_slice = adata_backed[
        adata_backed.obs.cell_type.isin(
            [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
        )
        & (adata_backed.obs.tissue == tissues.kidney.name)
        & (adata_backed.obs.suspension_type == suspension_types.cell.name)
        & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
    ]
    adata_slices.append(adata_slice.to_memory())

import anndata as ad

adata_query = ad.concat(adata_slices)
```

:::

## Train an ML model

See {doc}`docs:scrna5`.

## Exploring data by collection

Alternatively, you can:

- [search for a file on the LaminHub UI](https://lamin.ai/laminlabs/cellxgene/records/core/Artifact?offset=0&limit=50) and fetch it via `ln.Artifact.filter(uid="...").one()`
- or query for a collection you found on [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/collections)

Let's search the collections from CELLxGENE:

In [None]:
ln.Collection.search("immune human kidney", limit=10)

Let's get the record of the top hit collection:

In [None]:
collection = ln.Collection.filter(uid="kqiPjpzpK9H9rdtnHWas").one()

collection

We see it's a Science paper and we could find more information using the [DOI](https://doi.org/10.1126/science.aat5031) or CELLxGENE [collection id](https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec).

Each collection has at least one {class}`~lamindb.Artifact` file associated with it. Let's get the associated artifacts:

In [None]:
collection.artifacts.df()

```{toctree}
:maxdepth: 1
:hidden:

query-census
cellxgene-annotate
```