![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)
[![lamindata](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?uid=Nv48yAceNSh8z8)

# scRNA-seq

You'll learn how to manage a growing number of scRNA-seq data shards as a single queryable collection.

Along the way, you'll see how to create reports, leverage data lineage, and query individual data shards stored as files.

If you're only interested in _using_ a large curated scRNA-seq collection, see the [CELLxGENE Census guide](docs:cellxgene).

Here, you will:

1. create an {class}`~lamindb.Artifact` from an `AnnData` object and seed a growing {class}`~lamindb.Collection` with it (![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey), current page)
2. append a new data batch (a new `.h5ad` file) and create a new version of this collection ([![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)](/scrna2))
3. query & inspect artifacts by metadata individually ([![scrna3/6](https://img.shields.io/badge/scrna3/6-lightgrey)](/scrna3))
4. load the joint collection into memory and save analytical results ([![scrna4/6](https://img.shields.io/badge/scrna4/6-lightgrey)](/scrna4))
5. iterate over the collection, train a model, store a derived representation ([![scrna5/6](https://img.shields.io/badge/scrna5/6-lightgrey)](/scrna5))
6. discuss converting a number of artifacts to a single TileDB SOMA store of the same data ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna6))

```{toctree}
:maxdepth: 1
:hidden:

scrna2
scrna3
scrna4
scrna5
scrna6
```

## Setup

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = "hint"
lb.settings.organism = "human"
ln.track()

## Ingest an artifact

Let us look at the standardized data of [Conde _et al._, Science (2022)](https://doi.org/10.1126/science.abl5197), available from [CZ CELLxGENE](https://cellxgene.cziscience.com/).

By calling {func}`~lamindb.dev.collections.anndata_human_immune_cells`, we load a subsampled version of the [collection from CZ CELLxGENE](https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3) and pre-populate the corresponding LaminDB registries: {class}`~lamindb.Feature`, {class}`~lamindb.ULabel`, {class}`~lnschema_bionty.Gene`, {class}`~lnschema_bionty.CellType`, {class}`~lnschema_bionty.CellLine`, {class}`~lnschema_bionty.ExperimentalFactor`.

In [None]:
adata = ln.dev.collections.anndata_human_immune_cells(populate_registries=True)
adata

This `AnnData` object is standardized using the [CZI single-cell-curation validator](https://github.com/chanzuckerberg/single-cell-curation) with the same public ontologies that underlie {mod}`lnschema_bionty`. Because registries are pre-populated, validation passes.
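To make "validation passes" concrete, here is a minimal, library-free sketch of the underlying idea: a value counts as validated if it appears in a registry of known terms. The registry contents and the `validate` helper are hypothetical illustrations, not LaminDB's API.

```python
# Hypothetical stand-in for a registry of known cell type names;
# this is an illustration, not LaminDB's API.
cell_type_registry = {"T cell", "B cell", "macrophage"}


def validate(values, registry):
    """Return one boolean per value: True if the value is a known registry term."""
    return [value in registry for value in values]


observed = ["T cell", "B cell", "T cell", "my new cell type"]
is_valid = validate(observed, cell_type_registry)

# Collect the distinct terms that would need curation before validation passes.
unvalidated = sorted({v for v, ok in zip(observed, is_valid) if not ok})

print(is_valid)
print(unvalidated)
```

With pre-populated registries, every observed term is already known, so no terms end up in the unvalidated list.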

```{note}

In the [next guide](/scrna2), we'll curate a non-standardized collection.

```

The gene registry provides metadata for each of the 36k genes measured in the `AnnData`:

In [None]:
lb.Gene.filter().df()

When we create a {class}`~lamindb.Artifact` object from an `AnnData`, we automatically link its features:

In [None]:
artifact = ln.Artifact.from_anndata(
    adata,
    field=lb.Gene.ensembl_gene_id,  # field to validate and link features
    key="scrna/conde22.h5ad",  # optional: a relative path in your default storage
    description="Human immune cells from Conde22",  # optional: a description
)
artifact

In [None]:
artifact.save()

The artifact has two linked feature sets, one for the measured genes and one for the measured metadata:

In [None]:
artifact.features
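As a rough mental model, the two feature sets correspond to the two annotated axes of an `AnnData`: the variable index (genes) and the columns of the observation table (per-cell metadata). The sketch below uses plain Python containers with made-up miniature data; the names `var_index`, `obs_columns`, and `feature_sets` are illustrative assumptions and not part of LaminDB.

```python
# Made-up miniature of an AnnData-like layout using plain Python containers.
var_index = ["ENSG00000141510", "ENSG00000012048"]  # Ensembl gene IDs (matrix columns)
obs_columns = ["cell_type", "assay", "tissue", "donor"]  # per-cell metadata columns

# One set of features per annotated axis: genes and metadata.
feature_sets = {
    "var": frozenset(var_index),
    "obs": frozenset(obs_columns),
}

for slot, features in feature_sets.items():
    print(slot, len(features))
```

Keeping the two sets separate reflects that genes and metadata columns are validated against different registries.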

Let's now annotate the artifact with labels:

In [None]:
experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()

artifact.labels.add(organism.human, feature=features.organism)
artifact.labels.add(
    experimental_factors.single_cell_rna_sequencing, feature=features.assay
)
artifact.labels.add(adata.obs.cell_type, feature=features.cell_type)
artifact.labels.add(adata.obs.assay, feature=features.assay)
artifact.labels.add(adata.obs.tissue, feature=features.tissue)
artifact.labels.add(adata.obs.donor, feature=features.donor)

The artifact is now validated & queryable by everything we linked:

In [None]:
artifact.describe()

## Seed a collection

Let's create a first version of a collection that will grow to encompass many `h5ad` files as more data is ingested.

```{note}

To see the result of the incremental growth, take a look at the [CELLxGENE Census guide](docs:cellxgene) for an instance with ~1k h5ads and ~50 million cells.

```

In [None]:
collection = ln.Collection(
    artifact, name="My versioned scRNA-seq collection", version="1"
)
collection.save()
collection.labels.add_from(artifact)  # seed the initial labels of the collection

In version 1, the collection contains exactly one artifact, so the two hold the same data. They are nonetheless tracked and queryable independently through their registries:

In [None]:
collection.describe()
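The relationship between a collection version and its artifacts can be pictured with a small, library-free sketch. The `ToyCollection` class below is hypothetical: it only illustrates that version 1 wraps a single artifact while a later version references more, and it says nothing about how LaminDB stores collections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical, immutable snapshot of a collection at one version."""

    name: str
    version: str
    artifact_keys: tuple

    def append(self, artifact_key, version):
        """Return a new version that also references `artifact_key`."""
        return ToyCollection(self.name, version, self.artifact_keys + (artifact_key,))


v1 = ToyCollection("My versioned scRNA-seq collection", "1", ("scrna/conde22.h5ad",))
# "scrna/new_batch.h5ad" is a made-up key standing in for a later data batch.
v2 = v1.append("scrna/new_batch.h5ad", version="2")

print(len(v1.artifact_keys), len(v2.artifact_keys))
```

Because each snapshot is immutable, version 1 still references only its original artifact after version 2 is created.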

Access the underlying artifact like so:

In [None]:
collection.artifact

View the data flow:

In [None]:
collection.view_flow()