![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)
[![lamindata](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/transform/Nv48yAceNSh8)

# scRNA-seq

Here, you'll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

1. create a dataset (an {class}`~lamindb.Artifact`) and seed a {class}`~lamindb.Collection` (![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey))
2. append a new dataset to the collection ([![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)](/scrna2))
3. query & analyze individual datasets ([![scrna3/6](https://img.shields.io/badge/scrna3/6-lightgrey)](/scrna3))
4. load the collection into memory ([![scrna4/6](https://img.shields.io/badge/scrna4/6-lightgrey)](/scrna4))
5. iterate over the collection to train an ML model ([![scrna5/6](https://img.shields.io/badge/scrna5/6-lightgrey)](/scrna-mappedcollection))
6. concatenate the collection to a single `tiledbsoma` array store ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna-tiledbsoma))

If you're only interested in _using_ a large curated scRNA-seq collection, see the [CELLxGENE guide](inv:docs#cellxgene).

```{toctree}
:maxdepth: 1
:hidden:

scrna2
scrna3
scrna4
scrna-mappedcollection
scrna-tiledbsoma
```

In [None]:
import lamindb as ln
import bionty as bt

ln.track("Nv48yAceNSh8")

## Populate metadata registries based on an artifact

Let us look at the standardized data of [Conde _et al._, Science (2022)](https://doi.org/10.1126/science.abl5197), [available from CELLxGENE](https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3). {func}`~lamindb.core.datasets.anndata_human_immune_cells` loads a subsampled version:

In [None]:
adata = ln.core.datasets.anndata_human_immune_cells()
adata

Before validating and annotating this dataset, we need to define the valid features and a schema.

In [None]:
# define valid features
ln.Feature(name="donor", dtype=str).save()
ln.Feature(name="tissue", dtype=bt.Tissue).save()
ln.Feature(name="cell_type", dtype=bt.CellType).save()
ln.Feature(name="assay", dtype=bt.ExperimentalFactor).save()

# define anndata schema
obs_schema = ln.Schema(itype=ln.Feature).save()
varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id).save()
schema = ln.Schema(
    name="Flexible AnnData",
    otype="AnnData",
    components={"obs": obs_schema, "var.T": varT_schema},
).save()
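To make the two schema slots concrete, here is a minimal, library-free sketch of what they conceptually check. It does not use the lamindb API. The `check_slot` helper, the toy sets, and the example values are all hypothetical and for illustration only. The point is that the `obs` slot matches column names against registered features, while the `var.T` slot matches gene identifiers, because transposing `var` turns its gene index into columns.

```python
# Toy stand-ins for the registries (hypothetical contents, not the lamindb API).
registered_features = {"donor", "tissue", "cell_type", "assay"}
known_gene_ids = {"ENSG00000141510"}  # assume only this Ensembl ID is registered

# Toy AnnData-like layout: obs has one column per feature, var has one row per gene.
obs_columns = ["donor", "tissue", "cell_type", "assay"]
var_index = ["ENSG00000141510", "ENSG00000012048"]


def check_slot(columns, valid):
    """Map each column name to whether it is found in the given registry."""
    return {column: column in valid for column in columns}


report = {
    "obs": check_slot(obs_columns, registered_features),
    # Transposing var makes the gene IDs the columns of the "var.T" slot.
    "var.T": check_slot(var_index, known_gene_ids),
}
print(report)
```

In this toy setup, every `obs` column is recognized, while one gene ID in `var.T` is not yet known to the registry.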

Let's curate this dataset:

In [None]:
curator = ln.curators.AnnDataCurator(adata, schema)

In [None]:
try:
    curator.validate()
except ln.errors.ValidationError:
    # expected at this stage: continue so that the unvalidated values can be fixed below
    pass

One cell type isn't validated because it's not part of the `CellType` registry. Let's create it.

In [None]:
bt.CellType(name="animal cell").save()

In [None]:
try:
    curator.validate()
except ln.errors.ValidationError:
    pass
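The validate, fix, validate-again loop above can be illustrated with a small, library-free sketch. It is not how lamindb implements validation. The `non_validated` helper and the toy registry contents are hypothetical. The sketch only shows the idea of comparing observed labels against a registry, adding the missing record, and checking again.

```python
# Toy stand-in for the CellType registry (hypothetical contents).
registry = {"T cell", "B cell", "macrophage"}

# Labels as they might appear in an obs column (hypothetical values).
observed = ["T cell", "animal cell", "B cell", "T cell"]


def non_validated(values, registry):
    """Return the distinct values missing from the registry, in first-seen order."""
    missing = []
    for value in values:
        if value not in registry and value not in missing:
            missing.append(value)
    return missing


missing_before = non_validated(observed, registry)
registry.update(missing_before)  # analogous to creating the missing record
missing_after = non_validated(observed, registry)
print(missing_before, missing_after)
```

After the missing label is added to the toy registry, the second check reports nothing left to fix for that column. Other slots, such as gene identifiers, can still fail independently.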

Some Ensembl gene IDs are not validated, likely because they stem from an older version of Ensembl. We create records for them in the `Gene` registry with the following convenience method.

In [None]:
curator.slots["var.T"].cat.add_new_from("columns")

Alternatively, we could import genes from an old Ensembl version into the `Gene` registry: {doc}`bio-registries.ipynb#access-any-ensembl-genes`.

When we save the curated `AnnData` as an {class}`~lamindb.Artifact`, it is automatically annotated with the validated features and labels:

In [None]:
artifact = curator.save_artifact(key="datasets/conde22.h5ad")

It is annotated with rich metadata:

In [None]:
artifact.describe()

## Seed a collection

Let's create a first version of a collection that will encompass many `h5ad` files as more data is ingested.

```{note}

To see the result of the incremental growth, take a look at the [CELLxGENE Census guide](inv:docs#cellxgene) for an instance with ~1k h5ads and ~50 million cells.

```

In [None]:
collection = ln.Collection(artifact, key="scrna/collection1").save()

In this first version, the collection contains exactly one artifact, so the two match. They are still tracked independently and are queryable through their respective registries:

In [None]:
collection.describe()
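As a rough mental model of "independently tracked", consider the following library-free sketch. It is not the lamindb data model. `ToyArtifact`, `ToyCollection`, and the two dictionaries are hypothetical stand-ins for the registries. The sketch shows one dataset reachable through two separate lookups.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArtifact:
    key: str


@dataclass
class ToyCollection:
    key: str
    artifacts: list = field(default_factory=list)


# Two separate toy registries, each queryable by key.
artifact_registry = {}
collection_registry = {}

toy_artifact = ToyArtifact(key="datasets/conde22.h5ad")
artifact_registry[toy_artifact.key] = toy_artifact

toy_collection = ToyCollection(key="scrna/collection1", artifacts=[toy_artifact])
collection_registry[toy_collection.key] = toy_collection

# Both lookups resolve to the same underlying dataset.
via_collection = collection_registry["scrna/collection1"].artifacts[0]
via_artifact = artifact_registry["datasets/conde22.h5ad"]
same_dataset = via_collection is via_artifact
print(same_dataset)
```

Appending further datasets later would extend the collection's list of artifacts without changing how each artifact is looked up on its own.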

Access the underlying artifacts like so:

In [None]:
collection.artifacts.df()

See data lineage:

In [None]:
collection.view_lineage()

Finish the run and save the notebook.

In [None]:
ln.finish()