[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)

# Validate & register scRNA-seq datasets

Single-cell RNA-seq (scRNA-seq) measures gene expression of individual cells and generates datasets that are often used to define cell states that are associated with functional phenotypes. Data formats such as [AnnData](https://anndata.readthedocs.io/en/latest/) and [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) objects help store metadata and data as one entity. However, the stored metadata are often not validated, which makes it hard to integrate a dataset with other datasets.

In this notebook, we show how Lamin can help manage scRNA-seq data.

```{toctree}
:maxdepth: 1
:hidden:

scrna2
```

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

In [None]:
ln.track()

## Human immune cells: Conde22

In [None]:
lb.settings.species = "human"

### Transform ![](https://img.shields.io/badge/Transform-10b981)

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

Let’s look at an scRNA-seq count matrix in the form of an AnnData object:

In [None]:
adata = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # pre-populate registries to simulate a used instance
)

In [None]:
adata

### Validate ![](https://img.shields.io/badge/Validate-10b981)

#### Validate genes in `.var`

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);

We see that 148 gene identifiers can’t be validated because they are not currently in the Gene registry. We’d like to validate all features in this dataset, so let’s inspect them to decide what to do:

In [None]:
inspect_result = lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id)

The inspect log reports that 35 of the non-validated `ensembl_gene_id`s can be found in the Bionty reference. Let's register them:

In [None]:
records_bionty = lb.Gene.from_values(
    inspect_result.non_validated, lb.Gene.ensembl_gene_id
)
ln.save(records_bionty)

The remaining 113 aren't present in the current Ensembl assembly (e.g. [ENSG00000112096](https://www.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000112096)).

We'd still like to register them, so let's create Gene records with those ensembl_gene_ids:

In [None]:
validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id, mute=True)
nonval_ensembl_ids = adata.var.index[~validated]
new_records = [
    lb.Gene(ensembl_gene_id=ens_id, species=lb.settings.species)
    for ens_id in nonval_ensembl_ids
]
ln.save(new_records)

Now all genes pass validation:

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);
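The validate, inspect, and register steps above amount to sorting identifiers into three buckets: already registered, found in a public reference, and unknown to both. Here is a minimal plain-Python sketch of that bookkeeping. It is not how Lamin implements these calls; `triage_ids`, `registry`, and `reference` are hypothetical names, and the identifiers are placeholders rather than real Ensembl IDs.

```python
def triage_ids(ids, registry, reference):
    """Sort identifiers into validated / in-reference / unknown buckets.

    registry:  identifiers already registered locally (hypothetical stand-in)
    reference: identifiers known to a public reference (hypothetical stand-in)
    """
    validated, in_reference, unknown = [], [], []
    for id_ in ids:
        if id_ in registry:
            validated.append(id_)
        elif id_ in reference:
            in_reference.append(id_)
        else:
            unknown.append(id_)
    return validated, in_reference, unknown


# Placeholder identifiers, not real Ensembl IDs.
registry = {"GENE_A", "GENE_B"}
reference = {"GENE_A", "GENE_B", "GENE_C"}
ids = ["GENE_A", "GENE_C", "GENE_RETIRED"]

validated, in_reference, unknown = triage_ids(ids, registry, reference)

# Registering both non-validated buckets makes every identifier pass.
registry.update(in_reference)
registry.update(unknown)
all_pass = all(id_ in registry for id_ in ids)
```

The two non-validated buckets differ in how much we know about them: identifiers found in the reference come with curated metadata, whereas unknown ones are registered with only the identifier itself.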

#### Validate metadata in `.obs`

In [None]:
adata.obs.columns

Validating the column names shows that one feature, `donor`, is not yet registered:

In [None]:
validated = ln.Feature.validate(adata.obs.columns)

Let's register it:

In [None]:
features = ln.Feature.from_df(adata.obs)

In [None]:
ln.save(features)

All metadata columns are now validated as features:

In [None]:
ln.Feature.validate(adata.obs.columns);

Next, let's validate the labels of each feature. Some metadata labels can be typed using dedicated registries (e.g. Bionty offers ontology-based registries for biological entities):

In [None]:
validated = lb.CellType.validate(adata.obs.cell_type)

Register non-validated cell types from Bionty:

In [None]:
nonval_cell_type_records = lb.CellType.from_values(
    adata.obs.cell_type[~validated], "name"
)
ln.save(nonval_cell_type_records)

In [None]:
lb.ExperimentalFactor.validate(adata.obs.assay)
lb.Tissue.validate(adata.obs.tissue);

For metadata that can’t be typed with dedicated registries, we can use the {class}`~lamindb.Label` registry. In this example, we didn't mount a [custom schema](https://lamin.ai/docs/schemas) that contains a Donor registry, so we use it to track donor ids:

In [None]:
ln.Label.validate(adata.obs["donor"]);

Donor labels are not validated, so let's register them:

In [None]:
donors = [ln.Label(name=name) for name in adata.obs["donor"].unique()]
ln.save(donors)

In [None]:
ln.Label.validate(adata.obs["donor"]);
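Label validation works column by column: each column's unique values are checked against the registry that types that feature. The sketch below illustrates the idea in plain Python under simplified assumptions. `validate_labels` and the `registries` dictionary are hypothetical, and the metadata values are toy placeholders.

```python
def validate_labels(columns, registries):
    """Return, per column, the unique values missing from its registry."""
    missing = {}
    for column, values in columns.items():
        known = registries.get(column, set())
        # dict.fromkeys deduplicates while preserving first-seen order
        unique_values = list(dict.fromkeys(values))
        missing[column] = [v for v in unique_values if v not in known]
    return missing


# Toy observation metadata; the values are illustrative placeholders.
obs = {
    "cell_type": ["T cell", "B cell", "T cell"],
    "donor": ["D1", "D2", "D1"],
}
registries = {"cell_type": {"T cell", "B cell"}, "donor": set()}

missing = validate_labels(obs, registries)

# Register the missing donor labels, then validate again.
registries["donor"].update(missing["donor"])
missing_after = validate_labels(obs, registries)
```

Only unique values need to be checked, so the cost depends on the number of distinct labels rather than the number of cells.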

#### Validate external metadata

In addition to what’s already in the file, we’d like to link this file with external features including "species" and "assay":

In [None]:
ln.Feature.validate("species")
ln.Feature.validate("assay");

Next, we validate the corresponding labels of these features. When we don't remember exactly what a term is called, search can help:

In [None]:
lb.ExperimentalFactor.search("scRNA-seq").head(2)

In [None]:
scrna = lb.ExperimentalFactor.filter(id="068T1Df6").one()
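To give an intuition for what a term search does, here is a small sketch that ranks registry names by string similarity using Python's standard `difflib`. This is only an illustration of fuzzy matching, not the search algorithm Lamin uses; `search_terms` and the list of names are assumptions made for the example.

```python
import difflib


def search_terms(query, names, n=2):
    """Return up to n names ranked by similarity to the query (case-insensitive)."""
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=0.3)
    # Map the lowercased hits back to their original spelling.
    return [lowered[hit] for hit in hits]


# Illustrative term names, not an excerpt from a real ontology.
names = ["single-cell RNA sequencing", "bulk RNA sequencing", "flow cytometry"]
results = search_terms("single cell RNA-seq", names)
top_hit = results[0]
```

As in the notebook, the top hits are candidates to look at before picking a record, not a guaranteed match.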

### Register ![](https://img.shields.io/badge/Register-10b981) 

#### Register data

When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:

In [None]:
file = ln.File.from_anndata(
    adata, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

The file has the following 2 linked feature sets:

In [None]:
file.features

You can further annotate your feature set with a modality:

In [None]:
var_feature_set = file.features.get_feature_set("var")
modalities = ln.Modality.lookup()
var_feature_set.modality = modalities.rna
var_feature_set.save()

#### Link metadata

Let's now link observational metadata by adding labels to corresponding features.

In [None]:
cell_types = lb.CellType.from_values(adata.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(adata.obs.assay, field="name")
tissues = lb.Tissue.from_values(adata.obs.tissue, field="name")
donors = ln.Label.from_values(adata.obs["donor"])

file.add_labels(cell_types, "cell_type")
file.add_labels(efs, "assay")
file.add_labels(tissues, "tissue")
file.add_labels(donors, feature="donor")

In [None]:
file.features

Note that adding labels to an external feature will create an external feature set.

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(scrna, feature="assay")

The file is now queryable by everything we linked:

In [None]:
file.describe()

## A less well curated dataset

### Transform ![](https://img.shields.io/badge/Transform-10b981)

Let's now consider a dataset with less well curated features:

In [None]:
pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols: 

In [None]:
pbcm68k.var.index

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

In [None]:
validated = lb.Gene.validate(pbcm68k.var.index, lb.Gene.symbol)

In this case, we only want to register data with validated genes:

In [None]:
pbcm68k_validated = pbcm68k[:, validated].copy()
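Subsetting with the boolean mask keeps only the columns (genes) whose entry is `True`, and `.copy()` detaches the result from the original object. The following plain-Python sketch shows the same selection on a list-of-rows matrix. `subset_columns` is a hypothetical helper, and the counts and symbols are made up.

```python
def subset_columns(matrix, column_names, mask):
    """Keep only the columns whose mask entry is True; returns new lists."""
    keep = [i for i, ok in enumerate(mask) if ok]
    names = [column_names[i] for i in keep]
    rows = [[row[i] for i in keep] for row in matrix]
    return rows, names


# Two cells (rows) by three genes (columns); placeholder values.
counts = [[5, 0, 2], [1, 3, 0]]
genes = ["SYMBOL_A", "SYMBOL_B", "SYMBOL_C"]
validated = [True, False, True]

subset, kept = subset_columns(counts, genes, validated)
```

The original `counts` and `genes` are left untouched, which mirrors why the notebook works on a copy.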

Validate cell types:

In [None]:
# inspect shows that none of the terms are mappable
lb.CellType.inspect(pbcm68k_validated.obs["cell_type"])

# search the public ontology for each cell type name and take the top match,
# then add the dataset's cell type name as a synonym of the matched record
celltype_bt = lb.CellType.bionty()
mapper = {}
for ct in pbcm68k_validated.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    record.add_synonym(ct)

# standardize cell type names in the dataset
pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)

Now, all cell types are validated:

In [None]:
lb.CellType.validate(pbcm68k_validated.obs["cell_type"]);
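The standardization above follows a simple pattern: build a mapper from each raw name to a canonical name, remember the raw name as a synonym, and then apply the mapper to the column. Here is a self-contained sketch of that pattern with `difflib` standing in for the ontology search. `build_mapper` and the name lists are assumptions for illustration, and taking the top hit is a heuristic whose results are worth reviewing before they are applied.

```python
import difflib


def build_mapper(raw_names, canonical_names):
    """Map each raw name to its most similar canonical name (top hit only)."""
    mapper = {}
    for raw in raw_names:
        # n=1 keeps only the top hit; cutoff=0.0 always returns a candidate
        top_hit = difflib.get_close_matches(raw, canonical_names, n=1, cutoff=0.0)[0]
        mapper[raw] = top_hit
    return mapper


# Illustrative names, not an excerpt from a real ontology or dataset.
canonical = ["dendritic cell", "CD14-positive monocyte", "B cell"]
column = ["Dendritic cells", "CD14+ Monocytes", "Dendritic cells"]

# Build the mapper from the unique raw names only.
mapper = build_mapper(dict.fromkeys(column), canonical)

# Record each raw name as a synonym of its canonical name.
synonyms = {}
for raw, name in mapper.items():
    synonyms.setdefault(name, set()).add(raw)

# Standardize the column by applying the mapper.
standardized = [mapper[value] for value in column]
```

Keeping the raw names as synonyms means that a later dataset using the same spelling can be standardized without repeating the search.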

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
file = ln.File.from_anndata(
    pbcm68k_validated, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)

In [None]:
file.save()

In [None]:
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modalities.rna
var_feature_set.save()

In [None]:
cell_types = lb.CellType.from_values(pbcm68k_validated.obs["cell_type"], "name")
file.add_labels(cell_types, "cell_type")

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(scrna, feature="assay")

In [None]:
file.features

In [None]:
file.describe()

In [None]:
file.view_lineage()

🎉 Now let's continue with data integration: {doc}`./scrna2`