[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)

# Validate & register scRNA-seq datasets

This guide illustrates how to manage scRNA-seq datasets in the absence of a custom schema.

```{toctree}
:maxdepth: 1
:hidden:

scrna2
```

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

## Mouse lymph node cells: Detmar22

In [None]:
ln.track()

We're working with mouse data:

In [None]:
lb.settings.species = "mouse"

### Transform ![](https://img.shields.io/badge/Transform-10b981)

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

Let's look at a scRNA-seq count matrix in the form of an `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node(
    populate_registries=True  # pre-populate registries to simulate a used instance
)

In [None]:
adata

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

#### Validate genes in `.var`

In [None]:
validated = lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id)

43 gene identifiers can't be validated because they are not yet in the Gene registry. We'd like to validate all features in this dataset, so let's inspect them to decide what to do:

In [None]:
lb.Gene.inspect(adata.var.index, lb.Gene.ensembl_gene_id);

The `inspect` log says that 19 of the non-validated `ensembl_gene_id`s can be found in the Bionty reference.

```{note}

`ensembl_gene_id`s that are present in the Bionty public ontology yield ontology-coupled records when created via `.from_values()`.

In this example:
- 19 records are created from Bionty with additional metadata and source tracking
- the remaining 24 records are created with only the `ensembl_gene_id` field
```
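
The split that `inspect` reports can be pictured with plain Python sets. The following is only a minimal sketch: the identifiers and the `registry` and `public_reference` sets are made up for illustration, and it does not reflect how `lnschema_bionty` implements validation internally.

```python
# Hypothetical identifiers; none of these are taken from the dataset.
registry = {"ENSMUSG0001", "ENSMUSG0002"}  # stand-in for the Gene registry
public_reference = {"ENSMUSG0001", "ENSMUSG0002", "ENSMUSG0003"}  # stand-in for the public ontology
identifiers = ["ENSMUSG0001", "ENSMUSG0003", "ENSMUSG9999"]

# Boolean mask, analogous to the value returned by `validate`
validated = [i in registry for i in identifiers]

# Everything the registry does not know yet
non_validated = [i for i, ok in zip(identifiers, validated) if not ok]

# Non-validated identifiers that the public reference knows about
in_reference = [i for i in non_validated if i in public_reference]

# Identifiers unknown everywhere: these would only carry the identifier itself
unknown = [i for i in non_validated if i not in public_reference]

print(validated)
print(in_reference)
print(unknown)
```

In the notebook, the same kind of mask is what lets `adata.var.index[~validated]` select the non-validated identifiers.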

In [None]:
non_validated = adata.var.index[~validated]
non_validated_records = lb.Gene.from_values(non_validated, lb.Gene.ensembl_gene_id)
ln.save(non_validated_records)

Now all genes pass validation:

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.ensembl_gene_id);

#### Validate metadata in `.obs`

Similarly, we'd like to validate the metadata columns as features:

In [None]:
ln.Feature.validate(adata.obs.columns);

None of them exist, so let's register them:

In [None]:
features = ln.Feature.from_df(
    adata.obs
)  # Feature.from_df creates feature records with auto-populated types

In [None]:
ln.save(features)
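
`Feature.from_df` fills in a type for each column. As a rough illustration of what inferring a type per column could look like, here is a standalone sketch. The column names, values, type labels, and the `infer_type` helper are all hypothetical and not part of `lamindb`.

```python
def infer_type(values):
    """Return a coarse type label for a list of column values (hypothetical helper)."""
    if not values:
        return "category"
    if all(isinstance(v, bool) for v in values):
        return "bool"
    # bool is a subclass of int, so exclude it explicitly
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "int"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "float"
    return "category"


# Made-up observation metadata, one list of values per column
obs = {
    "cell_type": ["B cell", "T cell"],
    "n_genes": [1200, 950],
    "percent_mito": [0.02, 0.05],
}

feature_types = {name: infer_type(column) for name, column in obs.items()}
print(feature_types)
```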

Now they are all validated:

In [None]:
ln.Feature.validate(adata.obs.columns);

Some of the metadata labels can be typed using dedicated registries:

In [None]:
lb.ExperimentalFactor.validate(adata.obs["developmental_stage"])
lb.CellType.validate(adata.obs["cell_type"])
lb.Tissue.validate(adata.obs["tissue"]);

For metadata that can't be typed with dedicated registries, we use the {class}`~lamindb.Label` registry:

In [None]:
for col in ["sex", "age", "genotype", "immunophenotype"]:
    ln.Label.validate(adata.obs[col])

#### Validate external metadata

In addition to what's already in the file, we'd like to link this file with external features:

In [None]:
ln.Feature.validate("species")
lb.Species.validate("mouse");

In [None]:
ln.Modality.validate("rna");

Let's try standardizing synonyms that can't be validated:

In [None]:
ln.Feature.validate("assay")
lb.ExperimentalFactor.validate("scRNA-seq");

In [None]:
scrna_name = lb.ExperimentalFactor.standardize("scRNA-seq")
scrna_name

In [None]:
lb.ExperimentalFactor.validate(scrna_name);
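
The validate, standardize, validate sequence can be sketched without a database. This is a simplified picture under stated assumptions: the synonym table, the canonical name in it, and both helper functions are hypothetical and only mimic the idea, not the `lnschema_bionty` implementation.

```python
# Hypothetical synonym table: canonical name -> known synonyms
synonym_table = {
    "single-cell RNA sequencing": {"scRNA-seq", "scRNAseq"},
}


def validate(value, table):
    """A value counts as validated only if it is a canonical name."""
    return value in table


def standardize(value, table):
    """Return the canonical name for `value`, or `value` unchanged if it is unknown."""
    for name, synonyms in table.items():
        if value == name or value in synonyms:
            return name
    return value


before = validate("scRNA-seq", synonym_table)  # synonym, so not validated
standardized = standardize("scRNA-seq", synonym_table)
after = validate(standardized, synonym_table)  # canonical name, so validated
print(before, standardized, after)
```

Returning unknown values unchanged keeps the function safe to apply to a whole column: anything it cannot map simply still fails validation afterwards.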

### Register ![](https://img.shields.io/badge/Register-10b981) 

#### Register data

When we create a `File` object from an `AnnData`, we'll automatically link its feature sets and get information about unmapped categories:

In [None]:
file = ln.File.from_anndata(
    adata, description="Detmar22", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

The file now has two linked feature sets:

In [None]:
file.features

The warning message suggests assigning a modality:

In [None]:
var_feature_set = file.features.get_feature_set("var")

In [None]:
var_feature_set

In [None]:
modality = ln.Modality.filter(name="rna").one()
var_feature_set.modality = modality
var_feature_set.save()

In [None]:
file.features

#### Link metadata

Let's add labels to the corresponding features:

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(lb.ExperimentalFactor.filter(name=scrna_name).one(), feature="assay")

In [None]:
dev_stages = lb.ExperimentalFactor.from_values(adata.obs["developmental_stage"], "name")
cell_types = lb.CellType.from_values(adata.obs["cell_type"], "name")
tissues = lb.Tissue.from_values(adata.obs["tissue"], "name")

file.add_labels(dev_stages, feature="developmental_stage")
file.add_labels(cell_types, feature="cell_type")
file.add_labels(tissues, feature="tissue")

In [None]:
for col in ["sex", "age", "genotype", "immunophenotype"]:
    labels = ln.Label.from_values(adata.obs[col], field="name")
    file.add_labels(labels, feature=col)
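
Conceptually, each metadata column contributes its own set of labels, which are linked under the feature of the same name. A minimal standalone sketch with made-up values follows; the `obs` dictionary and `labels_by_feature` are hypothetical stand-ins, not `lamindb` objects.

```python
# Made-up observation metadata, one list of values per column
obs = {
    "sex": ["female", "female", "male"],
    "genotype": ["wt", "wt", "wt"],
}

labels_by_feature = {}
for col, values in obs.items():
    # Build the labels for this column only, so that labels from one column
    # never end up linked under another column's feature.
    # dict.fromkeys de-duplicates while preserving first-seen order.
    labels_by_feature[col] = list(dict.fromkeys(values))

print(labels_by_feature)
```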

The file is now queryable by everything we linked:

In [None]:
file.describe()

## Human immune cells: Conde22

In [None]:
lb.settings.species = "human"

### Transform ![](https://img.shields.io/badge/Transform-10b981)

In [None]:
conde22 = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # pre-populate registries to simulate a used instance
)

In [None]:
conde22

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

In [None]:
lb.Gene.validate(conde22.var.index, lb.Gene.ensembl_gene_id);

In [None]:
validated = ln.Feature.validate(conde22.obs.columns)

One feature is not validated: `donor`.

In [None]:
conde22.obs.loc[:, ~validated].head()

Let's register it:

In [None]:
features = ln.Feature.from_df(conde22.obs)

In [None]:
ln.save(features)

All metadata columns are now validated:

In [None]:
ln.Feature.validate(conde22.obs.columns);

In [None]:
lb.CellType.validate(conde22.obs.cell_type)
lb.ExperimentalFactor.validate(conde22.obs.assay)
lb.Tissue.validate(conde22.obs.tissue);

As neither the core schema nor `lnschema_bionty` has a `Donor` table, we're using `Label` to track donor ids.

Donor labels are not validated, so we register them:

In [None]:
ln.Label.validate(conde22.obs["donor"]);

In [None]:
donors = ln.Label.from_values(conde22.obs["donor"])
ln.save(donors)

In [None]:
ln.Label.validate(conde22.obs["donor"]);

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
file = ln.File.from_anndata(
    conde22, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

We assign the modality to the `var` feature set and then look at the linked feature sets:

In [None]:
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()

In [None]:
file.features

Let's now link observational metadata.

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(lb.ExperimentalFactor.filter(name=scrna_name).one(), feature="assay")

In [None]:
cell_types = lb.CellType.from_values(conde22.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(conde22.obs.assay, field="name")
tissues = lb.Tissue.from_values(conde22.obs.tissue, field="name")

In [None]:
file.add_labels(cell_types, "cell_type")
file.add_labels(efs, "assay")
file.add_labels(tissues, "tissue")

In [None]:
donors = ln.Label.from_values(conde22.obs["donor"])
file.add_labels(donors, feature="donor")

In [None]:
file.describe()

## A less well curated dataset

### Transform ![](https://img.shields.io/badge/Transform-10b981)

Let's now consider a dataset with less well-curated features:

In [None]:
pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols: 

In [None]:
pbcm68k.var.index

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

In [None]:
validated = lb.Gene.validate(pbcm68k.var.index, lb.Gene.symbol)

In this case, we only want to register data with validated genes:

In [None]:
pbcm68k_validated = pbcm68k[:, validated].copy()
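
`AnnData` performs this slicing itself. To make explicit what subsetting columns by a boolean mask amounts to, here is a plain-Python sketch; the mask and counts are made up, and `NOT_A_GENE` is a deliberately invalid placeholder.

```python
# Made-up variable names, validation mask, and count matrix (rows = cells, columns = genes)
var_names = ["CD3D", "NOT_A_GENE", "MS4A1"]
validated = [True, False, True]
X = [
    [1, 0, 3],
    [0, 2, 5],
]

# Column positions that passed validation
keep = [j for j, ok in enumerate(validated) if ok]

# Apply the same selection to the variable names and to every row of the matrix
var_names_validated = [var_names[j] for j in keep]
X_validated = [[row[j] for j in keep] for row in X]

print(var_names_validated)
print(X_validated)
```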

Validate cell types:

In [None]:
# don't recurse through ontology hierarchies, to speed up CI
# outside of CI, we recommend setting this to True
lb.settings.auto_save_parents = False

# inspect shows none of the terms are mappable
lb.CellType.inspect(pbcm68k_validated.obs["cell_type"])

# search the public ontology for each cell type name and take the top match,
# then add the original pbcm68k cell type name as a synonym of that record
celltype_bt = lb.CellType.bionty()
mapper = {}
for ct in pbcm68k_validated.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    record.add_synonym(ct)

pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)
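
The search-then-map pattern above can be sketched in isolation. The example below is a simplification under stated assumptions: the ontology ids and terms are hypothetical, and a naive word-overlap score stands in for the ontology search, which is not how Bionty ranks matches. A top hit is not guaranteed to be the right term, so the resulting mapping is worth reviewing by eye.

```python
# Hypothetical ontology: id -> canonical name
terms = {
    "X:0001": "B cell",
    "X:0002": "T cell",
    "X:0003": "dendritic cell",
}


def search_top(query, terms):
    """Return the id whose name shares the most words with `query` (naive stand-in for a search)."""
    query_words = set(query.lower().split())
    return max(terms, key=lambda oid: len(query_words & set(terms[oid].lower().split())))


# Made-up cell type annotations as they might appear in a dataset
obs_cell_type = ["CD19+ B", "Dendritic cells", "CD19+ B"]

mapper = {}    # dataset label -> canonical name
synonyms = {}  # ontology id -> dataset labels recorded as synonyms
for ct in dict.fromkeys(obs_cell_type):  # unique labels, in first-seen order
    oid = search_top(ct, terms)
    mapper[ct] = terms[oid]
    synonyms.setdefault(oid, []).append(ct)

# Replace every dataset label by its canonical name
standardized = [mapper[ct] for ct in obs_cell_type]
print(mapper)
print(standardized)
```

Recording the original label as a synonym means the raw annotation can later be standardized to the same canonical name without repeating the search.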

Now, all cell types should be validated:

In [None]:
lb.CellType.validate(pbcm68k_validated.obs["cell_type"]);

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
file = ln.File.from_anndata(
    pbcm68k_validated, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)

In [None]:
file.save()

In [None]:
cell_types = lb.CellType.from_values(pbcm68k_validated.obs["cell_type"], "name")
file.add_labels(cell_types)

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(lb.ExperimentalFactor.filter(name=scrna_name).one(), feature="assay")

In [None]:
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()

In [None]:
file.features

In [None]:
file.describe()

In [None]:
file.view_lineage()

🎉 Now let's continue with data integration: {doc}`./scrna2`