![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/laminlabs/lamin-usecases/main?labpath=lamin-usecases%2Fdocs%2Fscrna.ipynb)

# scRNA-seq

You'll learn how to manage a growing number of scRNA-seq data batches as a single queryable dataset.

Along the way, you'll see how to create reports, leverage data lineage, and query statistics of individual data batches stored as files.

Specifically, you will:

1. read a single `.h5ad` file as an `AnnData` and seed a growing dataset with it (![scrna1/6](https://img.shields.io/badge/scrna1/6-lightgrey), current page)
2. append a new data batch (a new `.h5ad` file) and create a new version of this dataset ([![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)](/scrna1))
3. query & inspect files by metadata individually ([![scrna3/6](https://img.shields.io/badge/scrna3/6-lightgrey)](/scrna2))
4. load the dataset into memory and save analytical results as plots ([![scrna4/6](https://img.shields.io/badge/scrna4/6-lightgrey)](/scrna3))
5. iterate over the dataset, train a model, store a derived representation ([![scrna5/6](https://img.shields.io/badge/scrna5/6-lightgrey)](/scrna4))
6. discuss converting a number of files to a single TileDB SOMA store of the same data ([![scrna6/6](https://img.shields.io/badge/scrna6/6-lightgrey)](/scrna5))

```{toctree}
:maxdepth: 1
:hidden:

scrna1
scrna2
scrna3
scrna4
scrna5
```

## Setup

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

ln.track()

## Access ![](https://img.shields.io/badge/Access-10b981)

Let us look at the data of [Conde _et al._, Science (2022)](https://doi.org/10.1126/science.abl5197).

These data are available in standardized form from the [CellxGene data portal](https://cellxgene.cziscience.com/).

Here, we'll use it to seed a growing in-house store of scRNA-seq data managed with the corresponding metadata in LaminDB registries.

```{note}

If you're not interested in managing large collections of in-house data and you'd just like to query public data, please take a look at [CellxGene census](docs:cellxgene-census), which exposes all datasets hosted in the data portal as a concatenated TileDB SOMA store.

```

In [None]:
lb.settings.species = "human"

By calling `ln.dev.datasets.anndata_human_immune_cells` below, we download the dataset from the CellxGene portal [here](https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3) and pre-populate some LaminDB registries.

In [None]:
adata = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # this pre-populates registries
)

In [None]:
adata

This `AnnData` is already standardized using the same public ontologies underlying [lnschema-bionty](docs:lnschema-bionty), so we expect validation to be simple.

Nonetheless, LaminDB focuses on building clean in-house registries, so we still validate genes and metadata before registering the data.

```{note}

In the next notebook, we'll look at the more difficult case of a non-standardized dataset that requires curation.

```

### Validate ![](https://img.shields.io/badge/Validate-10b981)

#### Validate genes in `.var`

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);

148 gene identifiers can’t be validated (not currently in the `Gene` registry). Let’s inspect them to see what to do:

In [None]:
inspector = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)

The log shows that 35 of the non-validated IDs can be found in the Bionty reference. Let's register them:

In [None]:
records = lb.Gene.from_values(inspector.non_validated, lb.Gene.ensembl_gene_id)
ln.save(records)

The remaining 113 are legacy IDs, not present in the current Ensembl assembly (e.g. [ENSG00000112096](https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000112096)).

We'd still like to register them, but won't dive into the details of converting them from an old Ensembl version to the current one.
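
The bookkeeping above (148 non-validated IDs, of which 35 are found in the Bionty reference and 113 are legacy) can be illustrated without LaminDB. The following is a minimal sketch using plain Python sets; `triage`, `registry`, `reference`, and the toy identifiers are hypothetical stand-ins and not part of the LaminDB or Bionty API.

```python
def triage(ids, registry, reference):
    """Split ids into (validated, registrable, legacy), preserving input order."""
    validated, registrable, legacy = [], [], []
    for gene_id in ids:
        if gene_id in registry:
            validated.append(gene_id)      # already in the in-house registry
        elif gene_id in reference:
            registrable.append(gene_id)    # can be created from the public reference
        else:
            legacy.append(gene_id)         # known to neither; needs a manual record
    return validated, registrable, legacy


# Toy data: all identifiers are made up for illustration.
registry = {"ENSG_A", "ENSG_B"}              # stand-in for the in-house registry
reference = {"ENSG_A", "ENSG_B", "ENSG_C"}   # stand-in for the public reference
var_index = ["ENSG_A", "ENSG_C", "ENSG_OLD", "ENSG_B"]

validated, registrable, legacy = triage(var_index, registry, reference)
print(validated)    # ['ENSG_A', 'ENSG_B']
print(registrable)  # ['ENSG_C']
print(legacy)       # ['ENSG_OLD']
```

In these terms, the 35 IDs registered earlier fall into the `registrable` bucket, and the 113 remaining IDs fall into `legacy`.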

In [None]:
validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)
records = [lb.Gene(ensembl_gene_id=gene_id) for gene_id in adata.var.index[~validated]]
ln.save(records)

Now all genes pass validation:

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);

Our in-house Gene registry provides rich metadata for each gene measured in the `AnnData`:

In [None]:
lb.Gene.filter().df().head(10)

There are about 36k genes in the registry, all for species "human".

In [None]:
lb.Gene.filter().df().shape

#### Validate metadata in `.obs`

In [None]:
adata.obs.columns

In [None]:
ln.Feature.validate(adata.obs.columns)

One feature is not validated: `"donor"`. Let's register it:

In [None]:
feature = ln.Feature(name="donor", type="category", registries=[ln.ULabel])
ln.save(feature)

```{tip}

You can also use `features = ln.Feature.from_df(df)` to bulk create features with types.
```

All metadata columns are now validated:

In [None]:
ln.Feature.validate(adata.obs.columns)

Next, let's validate the corresponding labels of each feature.
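
Conceptually, validating labels means checking, for each metadata column, which of its unique values are missing from the registry that backs that column. The sketch below shows this idea with plain dictionaries and sets; `non_validated_labels`, the toy `obs` table, and the toy `registries` are hypothetical and only illustrate the logic, not how LaminDB implements `validate`.

```python
def non_validated_labels(obs, registries):
    """Return, per column, the sorted unique values missing from that column's registry."""
    missing = {}
    for column, values in obs.items():
        known = registries.get(column, set())
        unknown = sorted(set(values) - known)
        if unknown:
            missing[column] = unknown
    return missing


# Toy stand-ins for `adata.obs` columns and the corresponding registries.
obs = {
    "cell_type": ["T cell", "B cell", "T cell", "mystery cell"],
    "tissue": ["blood", "lung", "blood", "blood"],
}
registries = {
    "cell_type": {"T cell", "B cell"},
    "tissue": {"blood", "lung"},
}

print(non_validated_labels(obs, registries))  # {'cell_type': ['mystery cell']}
```

Columns with no missing values are omitted from the result, so an empty dictionary means every label is validated.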

Some of the metadata labels can be typed using dedicated registries like {class}`~docs:lnschema_bionty.CellType`:

In [None]:
validated = lb.CellType.validate(adata.obs.cell_type)

Register non-validated cell types - they can all be loaded from a public ontology through Bionty:

In [None]:
records = lb.CellType.from_values(adata.obs.cell_type[~validated], "name")
ln.save(records)

In [None]:
lb.ExperimentalFactor.validate(adata.obs.assay)
lb.Tissue.validate(adata.obs.tissue);

Because we didn't mount a [custom schema](https://lamin.ai/docs/schemas) that contains a `Donor` registry, we use the {class}`~lamindb.ULabel` registry to track donor ids:

In [None]:
ln.ULabel.validate(adata.obs.donor);

Donor labels are not validated, so let's register them:

In [None]:
donors = [ln.ULabel(name=name) for name in adata.obs.donor.unique()]
ln.save(donors)

In [None]:
ln.ULabel.validate(adata.obs.donor);

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
modalities = ln.Modality.lookup()
experimental_factors = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
features = ln.Feature.lookup()

#### Register data

When we create a `File` object from an `AnnData`, we’ll automatically link its feature sets and get information about unmapped categories:

In [None]:
file = ln.File.from_anndata(
    adata, description="Conde22", field=lb.Gene.ensembl_gene_id, modality=modalities.rna
)

In [None]:
file.save()

The file has the following 2 linked feature sets:

In [None]:
file.features

#### Register metadata links

Let us first link external labels for the entire file:

In [None]:
file.labels.add(species.human, feature=features.species)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)

Next, we parse the columns of `adata.obs` for additional metadata:

In [None]:
file.labels.add(adata.obs.cell_type, feature=features.cell_type)
file.labels.add(adata.obs.assay, feature=features.assay)
file.labels.add(adata.obs.tissue, feature=features.tissue)
file.labels.add(adata.obs.donor, feature=features.donor)

In [None]:
file.features

The file is now queryable by everything we linked:

In [None]:
file.describe()

## Create a dataset from the file

In [None]:
dataset = ln.Dataset(file, name="My versioned scRNA-seq dataset", version="1")

dataset

Let's inspect the features measured in this dataset which were inherited from the file:

In [None]:
dataset.features

This all looks good, so let's save it:

In [None]:
dataset.save()

Annotate by linking labels:

In [None]:
dataset.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)
dataset.labels.add(species.human, feature=features.species)
dataset.labels.add(adata.obs.cell_type, feature=features.cell_type)
dataset.labels.add(adata.obs.assay, feature=features.assay)
dataset.labels.add(adata.obs.tissue, feature=features.tissue)
dataset.labels.add(adata.obs.donor, feature=features.donor)

For this version 1 of the dataset, dataset and file match each other. But they're independently tracked and queryable through their registries.
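
To see why it helps to track the two separately, consider a library-free sketch of a versioned dataset that references immutable file records. The classes `FileRecord` and `DatasetRecord`, the `append` method, and the second batch are all hypothetical; this is not how LaminDB stores datasets, only an illustration of why version 1 coincides with its single file while later versions do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    description: str
    labels: frozenset = frozenset()


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    version: str
    files: tuple

    def append(self, file):
        """Return a new version that also contains `file`; the old version is untouched."""
        return DatasetRecord(self.name, str(int(self.version) + 1), self.files + (file,))


conde = FileRecord("Conde22", frozenset({"human"}))
v1 = DatasetRecord("My versioned scRNA-seq dataset", "1", (conde,))
v2 = v1.append(FileRecord("new batch"))  # hypothetical second data batch

print(v1.version, len(v1.files))  # 1 1
print(v2.version, len(v2.files))  # 2 2
```

Because `v1` is never mutated, anything that referenced version 1 still resolves to exactly one file after the second batch is appended.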

In [None]:
dataset.describe()

And we can access the file like so:

In [None]:
dataset.file

In [None]:
dataset.view_flow()