[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)

# Validate & register scRNA-seq datasets

scRNA-seq measures gene expression of individual cells. It generates datasets used to define cell states associated with phenotypes.

Their analysis is typically based on data objects like [AnnData](https://anndata.readthedocs.io/en/latest/), [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) & [Seurat objects](https://github.com/satijalab/seurat).

These objects, however, often contain non-validated metadata, making data integration hard.

In this notebook, LaminDB is used to turn `AnnData` objects into validated & queryable assets.

```{toctree}
:maxdepth: 1
:hidden:

scrna2
```

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

In [None]:
ln.track()

## Human immune cells: Conde22

In [None]:
lb.settings.species = "human"

### Transform ![](https://img.shields.io/badge/Transform-10b981)

(Here we skip typical transformation steps that involve filtering, normalizing, and formatting.)

Let’s look at an scRNA-seq count matrix in the form of an AnnData object:

In [None]:
adata = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # this pre-populates registries
)

In [None]:
adata

### Validate ![](https://img.shields.io/badge/Validate-10b981)

#### Validate genes in `.var`

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);

148 gene identifiers can’t be validated (not currently in the `Gene` registry). Let’s inspect them to see what to do:

In [None]:
inspector = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)

The log shows that 35 of the non-validated IDs can be found in the Bionty reference. Let's register them:

In [None]:
records = lb.Gene.from_values(inspector.non_validated, lb.Gene.ensembl_gene_id)
ln.save(records)

The remaining 113 are legacy IDs, not present in the current Ensembl assembly (e.g. [ENSG00000112096](https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000112096)).

We'd still like to register them:

In [None]:
validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)
records = [lb.Gene(ensembl_gene_id=id) for id in adata.var.index[~validated]]
ln.save(records)

Now all genes pass validation:

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
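The gene workflow above (validate, register what the public reference knows, register the remaining legacy IDs, validate again) can be sketched in plain Python. This is only a conceptual illustration, not LaminDB's implementation: the sets `registry` and `public_reference`, the helper `validate`, and all identifiers are hypothetical stand-ins.

```python
# Conceptual sketch: plain sets stand in for the Gene registry and the
# public reference. All names and identifiers here are hypothetical.

def validate(values, registry):
    """Return one boolean per value: True if the value is in the registry."""
    return [v in registry for v in values]

registry = {"GENE_A", "GENE_B"}                     # already registered
public_reference = {"GENE_A", "GENE_B", "GENE_C"}   # known to the reference
var_index = ["GENE_A", "GENE_B", "GENE_C", "GENE_LEGACY"]

# First pass: which identifiers are already in the registry?
mask = validate(var_index, registry)
non_validated = [v for v, ok in zip(var_index, mask) if not ok]

# Step 1: register the identifiers that the public reference knows about
from_reference = [v for v in non_validated if v in public_reference]
registry.update(from_reference)

# Step 2: register whatever is still missing (legacy IDs) as bare records
legacy = [v for v in var_index if v not in registry]
registry.update(legacy)

# Second pass: every identifier now validates
all_valid = all(validate(var_index, registry))
```

Splitting the non-validated IDs into these two groups keeps reference-backed records distinguishable from records created by hand.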

#### Validate metadata in `.obs`

In [None]:
adata.obs.columns

In [None]:
validated = ln.Feature.validate(adata.obs.columns)

One feature is not validated: `"donor"`. Let's register it:

In [None]:
feature = ln.Feature.from_df(adata.obs.loc[:, ~validated])[0]
ln.save(feature)

All metadata columns are now validated as features:

In [None]:
ln.Feature.validate(adata.obs.columns);

Next, let's validate the corresponding labels of each feature.
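To make the distinction concrete: features correspond to column names, while labels correspond to the values inside a column. The sketch below uses a plain dict as a hypothetical stand-in for `.obs` and plain sets as stand-ins for registries; none of these names come from LaminDB.

```python
# Conceptual sketch: features are column names, labels are column values.
# The table and both registries are hypothetical toy data.
obs = {
    "cell_type": ["T cell", "B cell", "T cell"],
    "donor": ["D1", "D2", "D1"],
}
feature_registry = {"cell_type", "donor"}
label_registries = {
    "cell_type": {"T cell", "B cell"},
    "donor": {"D1"},
}

# Feature validation only looks at the column names
features_ok = {col: col in feature_registry for col in obs}

# Label validation looks at the distinct values of each column
unvalidated_labels = {
    col: sorted(set(values) - label_registries.get(col, set()))
    for col, values in obs.items()
}
```

Here both columns are validated features, yet the `donor` column still contains a label (`"D2"`) that has not been registered.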

Some of the metadata labels can be typed using dedicated registries like {class}`~docs:lnschema_bionty.CellType`:

In [None]:
validated = lb.CellType.validate(adata.obs.cell_type)

Register the non-validated cell types. They can all be loaded from a public ontology through Bionty:

In [None]:
nonval_cell_type_records = lb.CellType.from_values(
    adata.obs.cell_type[~validated], "name"
)
ln.save(nonval_cell_type_records)

In [None]:
lb.ExperimentalFactor.validate(adata.obs.assay)
lb.Tissue.validate(adata.obs.tissue);

Because we didn't mount a [custom schema](https://lamin.ai/docs/schemas) that contains a `Donor` registry, we use the {class}`~lamindb.Label` registry to track donor ids:

In [None]:
ln.Label.validate(adata.obs["donor"]);

Donor labels are not validated, so let's register them:

In [None]:
donors = [ln.Label(name=name) for name in adata.obs["donor"].unique()]
ln.save(donors)

In [None]:
ln.Label.validate(adata.obs["donor"]);

#### Validate external metadata

In addition to what’s already in the file, we’d like to link this file to external features including "species" and "assay":

In [None]:
ln.Feature.validate("species")
ln.Feature.validate("assay");

Let's search for the scRNA-seq assay label:

In [None]:
lb.ExperimentalFactor.search("scRNA-seq").head(2)

In [None]:
scrna = lb.ExperimentalFactor.filter(id="068T1Df6").one()

### Register ![](https://img.shields.io/badge/Register-10b981) 

#### Register data

When we create a `File` object from an `AnnData`, we’ll automatically link its feature sets and get information about unmapped categories:

In [None]:
file = ln.File.from_anndata(
    adata, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

The file has the following 2 linked feature sets:

In [None]:
file.features

You can further annotate a feature set with a modality:

In [None]:
var_feature_set = file.features.get_feature_set("var")
modalities = ln.Modality.lookup()
var_feature_set.modality = modalities.rna
var_feature_set.save()

#### Link metadata

Let's now link observational metadata by adding labels to corresponding features.

In [None]:
cell_types = lb.CellType.from_values(adata.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(adata.obs.assay, field="name")
tissues = lb.Tissue.from_values(adata.obs.tissue, field="name")
donors = ln.Label.from_values(adata.obs["donor"])

file.add_labels(cell_types, "cell_type")
file.add_labels(efs, "assay")
file.add_labels(tissues, "tissue")
file.add_labels(donors, feature="donor")

Note that adding labels to an external feature will create an external feature set.

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(scrna, feature="assay")

In [None]:
file.features

The file is now queryable by everything we linked:

In [None]:
file.describe()
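One way to picture why linked labels make a file queryable is an inverted index from (feature, label) pairs to files. The sketch below is a toy model under that assumption, not a description of how LaminDB stores links; `add_labels` and `query` here are hypothetical helpers and the file names are made up.

```python
# Conceptual sketch: an inverted index from (feature, label) to file names.
# The helpers and data are hypothetical, not LaminDB's storage model.
from collections import defaultdict

index = defaultdict(set)

def add_labels(file_name, labels, feature):
    """Link each label to the file under the given feature."""
    for label in labels:
        index[(feature, label)].add(file_name)

def query(feature, label):
    """Return the sorted names of all files linked to this label."""
    return sorted(index.get((feature, label), set()))

add_labels("file_a", ["T cell", "B cell"], "cell_type")
add_labels("file_a", ["human"], "species")
add_labels("file_b", ["B cell"], "cell_type")

b_cell_files = query("cell_type", "B cell")
human_files = query("species", "human")
blood_files = query("tissue", "blood")  # nothing was linked under "tissue"
```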

## A less well curated dataset

### Transform ![](https://img.shields.io/badge/Transform-10b981)

Let's now consider a dataset with less well curated features:

In [None]:
pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols: 

In [None]:
pbcm68k.var.index

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

In [None]:
validated = lb.Gene.validate(pbcm68k.var.index, lb.Gene.symbol)

In this case, we only want to register data with validated genes:

In [None]:
pbcm68k_validated = pbcm68k[:, validated].copy()
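The subsetting step keeps only the columns whose gene passed validation. The same idea on a small hypothetical count matrix, written with plain lists instead of `AnnData`:

```python
# Conceptual sketch: drop the columns of non-validated genes.
# Gene names, mask and counts are hypothetical toy data.
genes = ["CD3D", "NOT_A_GENE", "MS4A1"]
validated = [True, False, True]
counts = [
    [5, 9, 0],  # cell 1
    [0, 7, 3],  # cell 2
]

# Positions of the validated genes
keep = [i for i, ok in enumerate(validated) if ok]

# Apply the same selection to the gene names and to every row of counts
genes_validated = [genes[i] for i in keep]
counts_validated = [[row[i] for i in keep] for row in counts]
```

Building new lists leaves the original data untouched, which mirrors the intent of calling `.copy()` on the subset.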

Validate cell types:

In [None]:
# inspect shows none of the terms are mappable
lb.CellType.inspect(pbcm68k_validated.obs["cell_type"])

# search each cell type name in the public ontology and take the top match,
# then add the original name from pbcm68k as a synonym of the matched record
celltype_bt = lb.CellType.bionty()
mapper = {}
for ct in pbcm68k_validated.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    record.add_synonym(ct)

# standardize cell type names in the dataset
pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)

Now, all cell types are validated:

In [None]:
lb.CellType.validate(pbcm68k_validated.obs["cell_type"]);
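The standardization pattern above (take the top search match, remember the original name as a synonym, then map the column) can be illustrated with the standard library alone. This sketch swaps in `difflib` string similarity for the ontology search, so the ranking will differ from what Bionty returns; the ontology names and dataset values are hypothetical.

```python
# Conceptual sketch: map free-text names to ontology names by similarity.
# difflib is a stand-in for the ontology search; all data is hypothetical.
import difflib

ontology_names = ["T cell", "B cell", "dendritic cell"]
cell_type_column = ["Dendritic cells", "B cells", "Dendritic cells"]

def top_match(query, names):
    """Return the name most similar to the query, ignoring case."""
    return max(
        names,
        key=lambda n: difflib.SequenceMatcher(None, query.lower(), n.lower()).ratio(),
    )

mapper = {}
synonyms = {}
for ct in dict.fromkeys(cell_type_column):  # unique values, order preserved
    name = top_match(ct, ontology_names)
    mapper[ct] = name
    synonyms.setdefault(name, set()).add(ct)  # keep the original as a synonym

# Standardize the column using the mapping
standardized = [mapper[ct] for ct in cell_type_column]
```

Because the top match is only a heuristic, it is worth reviewing `mapper` by eye before overwriting the column.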

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
file = ln.File.from_anndata(
    pbcm68k_validated, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)

In [None]:
file.save()

In [None]:
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modalities.rna
var_feature_set.save()

In [None]:
cell_types = lb.CellType.from_values(pbcm68k_validated.obs["cell_type"], "name")
file.add_labels(cell_types, "cell_type")

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(scrna, feature="assay")

In [None]:
file.features

In [None]:
file.describe()

In [None]:
file.view_lineage()

🎉 Now let's continue with data integration: {doc}`./scrna2`