![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)
[![lamindata](https://img.shields.io/badge/laminlabs/lamindata-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?uid=Nv48yAceNSh8z8)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/laminlabs/lamin-usecases/main?labpath=lamin-usecases%2Fdocs%2Fscrna.ipynb)

# scRNA-seq

You'll learn how to manage a growing number of scRNA-seq data batches as a single queryable dataset.

Along the way, you'll see how to create reports, leverage data lineage, and query individual data batches stored as files.

If you'd rather directly use the kind of large, curated atlas of scRNA-seq datasets that this guide builds toward, see the [CELLxGENE Census guide](/cellxgene-census).

Here, you will:

1. read a single `.h5ad` file as an `AnnData` object and seed a growing dataset with it (![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey), current page)
2. append a new data batch (a new `.h5ad` file) and create a new version of this dataset ([![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)](/scrna2))
3. query and inspect individual files by metadata ([![scrna3/6](https://img.shields.io/badge/scrna3/6-lightgrey)](/scrna3))
4. load the dataset into memory and save analytical results ([![scrna4/6](https://img.shields.io/badge/scrna4/6-lightgrey)](/scrna4))
5. iterate over the dataset, train a model, store a derived representation ([![scrna5/6](https://img.shields.io/badge/scrna5/6-lightgrey)](/scrna5))
6. discuss converting a number of files to a single TileDB SOMA store of the same data ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna6))

```{toctree}
:maxdepth: 1
:hidden:

scrna2
scrna3
scrna4
scrna5
scrna6
```

## Setup

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

ln.track()

## Access ![](https://img.shields.io/badge/Access-10b981)

Let us look at the data of [Conde _et al._, Science (2022)](https://doi.org/10.1126/science.abl5197).

These data are available in standardized form from the [CZ CELLxGENE data portal](https://cellxgene.cziscience.com/).

Here, we'll use them to seed a growing in-house store of scRNA-seq data, managed together with the corresponding metadata in LaminDB registries.

```{note}

If you're not interested in managing large collections of in-house data and you'd just like to query public data, take a look at [CELLxGENE Census](docs:cellxgene-census), which exposes all datasets hosted in the data portal as a concatenated TileDB SOMA store.

```

In [None]:
lb.settings.organism = "human"

By calling {func}`~lamindb.dev.datasets.anndata_human_immune_cells`, we load a subsampled version of a [dataset from CZ CELLxGENE](https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3) and pre-populate the corresponding LaminDB registries: {class}`lamindb.Feature`, {class}`lamindb.ULabel`, {class}`lnschema_bionty.Gene`, {class}`lnschema_bionty.CellType`, {class}`lnschema_bionty.CellLine`, {class}`lnschema_bionty.ExperimentalFactor`.

In [None]:
adata = ln.dev.datasets.anndata_human_immune_cells(populate_registries=True)

In [None]:
adata

This `AnnData` is standardized using the [CZI single-cell-curation validator](https://github.com/chanzuckerberg/single-cell-curation) and the same public ontologies underlying [lnschema-bionty](docs:lnschema-bionty). Because the registries are populated, validation passes.

```{note}

In the [next guide](/scrna2), we'll curate a non-standardized dataset as you might get from an external partner.

```

The gene registry provides metadata for each of the 36k genes measured in the `AnnData`:

In [None]:
lb.Gene.filter().df().head()

When we create a `File` object from an `AnnData`, we automatically link its features:

In [None]:
modalities = ln.Modality.lookup()  # optional: label by data modality
file = ln.File.from_anndata(
    adata, description="Conde22", field=lb.Gene.ensembl_gene_id, modality=modalities.rna
)

In [None]:
file.save()
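
Because features are linked through `lb.Gene.ensembl_gene_id`, it can help to check beforehand that the variable index actually holds Ensembl gene IDs rather than gene symbols. The following is a minimal sketch using only the standard library. It assumes unversioned human Ensembl gene IDs (`ENSG` followed by 11 digits); the helper `split_by_ensembl_format` and the example index are hypothetical and not part of LaminDB. It checks the format only, not whether an ID exists in the gene registry.

```python
import re

# Assumption: unversioned human Ensembl gene IDs, i.e. "ENSG" + 11 digits.
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG\d{11}")


def split_by_ensembl_format(identifiers):
    """Return (well_formed, malformed) lists, preserving input order."""
    well_formed, malformed = [], []
    for identifier in identifiers:
        if ENSEMBL_HUMAN_GENE.fullmatch(identifier):
            well_formed.append(identifier)
        else:
            malformed.append(identifier)
    return well_formed, malformed


# Hypothetical stand-in for the values of `adata.var.index`
var_index = ["ENSG00000141510", "ENSG00000012048", "TP53", "ENSG0000014151"]
well_formed, malformed = split_by_ensembl_format(var_index)
print(f"{len(well_formed)} well-formed, {len(malformed)} malformed: {malformed}")
```

Identifiers that fail this check (gene symbols, truncated IDs) would need to be mapped or corrected before they can be matched against a registry keyed by Ensembl gene ID.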

The file has 2 linked feature sets, one for measured genes and one for measured metadata:

In [None]:
file.features

Let's now validate the corresponding label slots in the `AnnData` and annotate the file with labels:

In [None]:
experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()

file.labels.add(organism.human, feature=features.organism)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)
file.labels.add(adata.obs.cell_type, feature=features.cell_type)
file.labels.add(adata.obs.assay, feature=features.assay)
file.labels.add(adata.obs.tissue, feature=features.tissue)
file.labels.add(adata.obs.donor, feature=features.donor)
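
Conceptually, validating a label slot means comparing the unique values of a metadata column against the names a registry already knows. The sketch below illustrates that idea with plain Python. It is not LaminDB's implementation: `validate_labels`, the example column, and the set of registry names are all hypothetical.

```python
def validate_labels(values, known_terms):
    """Split the unique values into validated and non-validated labels."""
    unique_values = sorted(set(values))
    validated = [v for v in unique_values if v in known_terms]
    non_validated = [v for v in unique_values if v not in known_terms]
    return validated, non_validated


# Hypothetical stand-ins: one `obs` column and the names held by a registry
obs_cell_type = ["T cell", "B cell", "T cell", "mystery cell", "B cell"]
registry_names = {"T cell", "B cell", "macrophage"}

validated, non_validated = validate_labels(obs_cell_type, registry_names)
print("validated:", validated)
print("non-validated:", non_validated)
```

For the standardized dataset used here, every value is expected to land in the validated group, since the registries were populated when the data was loaded.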

The file is now queryable by everything we linked:

In [None]:
file.describe()

## Seed a dataset

Let's create a first version of a dataset that will encompass many `.h5ad` files as more data is ingested.

```{note}

To see the result of the incremental growth, take a look at the [CELLxGENE Census guide](/cellxgene-census) for an instance with ~1k h5ads and ~50 million cells.

```

In [None]:
dataset = ln.Dataset(file, name="My versioned scRNA-seq dataset", version="1")
dataset.save()
dataset.labels.add_from(file)  # seed the initial labels of the dataset

In version 1, the dataset and the file hold the same data. But they're tracked and queried independently through their respective registries:

In [None]:
dataset.describe()
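
To make the relationship between a versioned dataset and its files concrete, here is a conceptual sketch in plain Python. It does not reflect how LaminDB stores versions internally: the `DatasetVersion` class, its `append` method, and the file names are hypothetical and serve only to illustrate that a new version covers more files while the earlier version stays unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """Immutable snapshot: a name, a version string, and the files it covers."""

    name: str
    version: str
    files: tuple

    def append(self, new_file):
        """Return the next version; the current one is left untouched."""
        return DatasetVersion(
            name=self.name,
            version=str(int(self.version) + 1),
            files=self.files + (new_file,),
        )


# Hypothetical file names standing in for saved file records
v1 = DatasetVersion("My versioned scRNA-seq dataset", "1", ("conde22.h5ad",))
v2 = v1.append("new_batch.h5ad")
print(v1.version, v1.files)
print(v2.version, v2.files)
```

With a single file, version 1 and its file describe the same data, which mirrors the situation at this point in the guide.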

Access the file like so:

In [None]:
dataset.file

View the data flow:

In [None]:
dataset.view_flow()