[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna.ipynb)

# Validate & register scRNA-seq datasets

This guide illustrates how to manage scRNA-seq datasets in the absence of a [custom schema](https://lamin.ai/docs/schemas).

```{toctree}
:maxdepth: 1
:hidden:

scrna2
```

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

## Human immune cells: Conde22

In [None]:
lb.settings.species = "human"

### Transform ![](https://img.shields.io/badge/Transform-10b981)

(Here we skip the data transformation steps, which often include filtering, normalizing, or formatting data.)

Let’s look at a scRNA-seq count matrix in the form of an AnnData object:

In [None]:
conde22 = ln.dev.datasets.anndata_human_immune_cells(
    populate_registries=True  # pre-populate registries to simulate a used instance
)

In [None]:
conde22

### Validate ![](https://img.shields.io/badge/Validate-10b981)

#### Validate genes in `.var`

In [None]:
lb.Gene.validate(conde22.var.index, lb.Gene.ensembl_gene_id);

#### Validate metadata in `.obs`

In [None]:
validated = ln.Feature.validate(conde22.obs.columns)

1 feature is not validated: donor

In [None]:
conde22.obs.loc[:, ~validated].head()
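
As a rough mental model of what the boolean mask above represents, validation can be thought of as a membership test against already registered names. The sketch below is not how `lamindb` implements `validate`; the set `registered_features` and the column list are hypothetical stand-ins for the `Feature` registry and `conde22.obs.columns`.

```python
# Conceptual sketch only: a plain set stands in for the Feature registry.
registered_features = {"cell_type", "assay", "tissue"}  # hypothetical registry content


def validate(values, registry):
    """Return one boolean per value: True if the value is already registered."""
    return [value in registry for value in values]


columns = ["cell_type", "assay", "tissue", "donor"]  # hypothetical .obs columns
validated = validate(columns, registered_features)

# Invert the mask to list what still needs to be registered.
not_validated = [col for col, ok in zip(columns, validated) if not ok]
print(not_validated)  # ['donor']
```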

Let's register it by creating features from the `.obs` columns:

In [None]:
features = ln.Feature.from_df(conde22.obs)

In [None]:
ln.save(features)

All metadata columns are now validated:

In [None]:
ln.Feature.validate(conde22.obs.columns);

In [None]:
lb.CellType.validate(conde22.obs.cell_type)
lb.ExperimentalFactor.validate(conde22.obs.assay)
lb.Tissue.validate(conde22.obs.tissue);

As neither the core schema nor `lnschema_bionty` has a `Donor` table, we're using `Label` to track donor ids.

Donor labels are not validated, so we register them:

In [None]:
ln.Label.validate(conde22.obs["donor"]);

In [None]:
donors = [ln.Label(name=name) for name in conde22.obs["donor"].unique()]
ln.save(donors)

In [None]:
ln.Label.validate(conde22.obs["donor"]);

#### Validate external metadata

In addition to what’s already in the file, we’d like to link this file with external features:

In [None]:
ln.Feature.validate("species")
lb.Species.validate("human");

In [None]:
modalities = ln.Modality.lookup()
modalities.rna

Let’s try standardizing synonyms that can’t be validated:

In [None]:
ln.Feature.validate("assay")
lb.ExperimentalFactor.validate("scRNA-seq");

In [None]:
scrna_name = lb.ExperimentalFactor.standardize("scRNA-seq")
scrna_name

In [None]:
lb.ExperimentalFactor.validate(scrna_name);
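
The validate, standardize, validate sequence above can be pictured as a synonym lookup in front of a check for canonical names. This is a minimal sketch under stated assumptions, not the `lnschema_bionty` implementation: the two containers and the names in them are illustrative and are not queried from an ontology.

```python
# Conceptual sketch only: plain containers stand in for the ExperimentalFactor registry.
# The names below are illustrative assumptions, not queried from an ontology.
CANONICAL_NAMES = {"single-cell RNA sequencing"}
SYNONYM_TO_NAME = {
    "scRNA-seq": "single-cell RNA sequencing",
    "scRNAseq": "single-cell RNA sequencing",
}


def standardize(value):
    """Map a known synonym to its canonical name; leave anything else unchanged."""
    if value in CANONICAL_NAMES:
        return value
    return SYNONYM_TO_NAME.get(value, value)


def validate(value):
    """Only canonical names count as validated."""
    return value in CANONICAL_NAMES


raw = "scRNA-seq"
standardized = standardize(raw)
print(validate(raw), validate(standardized))  # False True
```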

### Register ![](https://img.shields.io/badge/Register-10b981) 

#### Register data

When we create a `File` object from an AnnData object, we’ll automatically link its feature sets and get information about unmapped categories:

In [None]:
file = ln.File.from_anndata(
    conde22, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

The file has the following 3 linked feature sets:

In [None]:
file.features

You can further annotate a feature set with a modality:

In [None]:
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modalities.rna
var_feature_set.save()

#### Link metadata

Let's now link observational metadata by adding labels to the corresponding features.

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(lb.ExperimentalFactor.filter(name=scrna_name).one(), feature="assay")

In [None]:
cell_types = lb.CellType.from_values(conde22.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(conde22.obs.assay, field="name")
tissues = lb.Tissue.from_values(conde22.obs.tissue, field="name")

In [None]:
file.add_labels(cell_types, "cell_type")
file.add_labels(efs, "assay")
file.add_labels(tissues, "tissue")

In [None]:
donors = ln.Label.from_values(conde22.obs["donor"])
file.add_labels(donors, feature="donor")

The file is now queryable by everything we linked:

In [None]:
file.describe()

## A less well curated dataset

### Transform ![](https://img.shields.io/badge/Transform-10b981)

Let's now consider a dataset with less well curated features:

In [None]:
pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols: 

In [None]:
pbcm68k.var.index

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

In [None]:
validated = lb.Gene.validate(pbcm68k.var.index, lb.Gene.symbol)

In this case, we only want to register data with validated genes:

In [None]:
pbcm68k_validated = pbcm68k[:, validated].copy()
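
Indexing with `[:, validated]` applies the boolean mask along the variable (gene) axis, so every observation is kept and only the validated columns remain. The sketch below shows the same selection on plain Python lists; the gene symbols, the mask, and the counts are hypothetical values chosen for illustration.

```python
# Hypothetical toy data: 2 observations (rows) x 4 genes (columns).
var_names = ["HES4", "TNFRSF4", "ATP6", "CD3E"]
validated = [True, True, False, True]  # hypothetical validation mask
counts = [
    [0, 1, 5, 2],
    [3, 0, 1, 0],
]

# Column positions where the mask is True.
keep = [j for j, ok in enumerate(validated) if ok]

# Keep all rows, but only the validated columns.
var_names_validated = [var_names[j] for j in keep]
counts_validated = [[row[j] for j in keep] for row in counts]

print(var_names_validated)  # ['HES4', 'TNFRSF4', 'CD3E']
```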

Validate cell types:

In [None]:
# inspect shows none of the terms are mappable
lb.CellType.inspect(pbcm68k_validated.obs["cell_type"])

# search each cell type name in the public ontology and take the top match,
# then add the original name from pbcm68k as a synonym of the registered record
celltype_bt = lb.CellType.bionty()
mapper = {}
for ct in pbcm68k_validated.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    mapper[ct] = record.name
    record.save()
    record.add_synonym(ct)

pbcm68k_validated.obs["cell_type"] = pbcm68k_validated.obs["cell_type"].map(mapper)
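
Because the loop takes the top search hit without review, it can be worth inspecting the mapping before relying on it. Note also that pandas `Series.map` with a dict returns `NaN` for values that have no entry. The following is a small sketch of such a check on plain Python data; the raw labels and the mapper content are hypothetical, not the actual search results.

```python
# Hypothetical raw labels and a hypothetical mapper, as a search-based loop might produce.
raw_labels = ["Dendritic cells", "CD14+ Monocytes", "Dendritic cells"]
mapper = {
    "Dendritic cells": "dendritic cell",
    "CD14+ Monocytes": "monocyte",
}

# Labels without an entry would end up as missing values after mapping.
unmapped = sorted(set(raw_labels) - set(mapper))

# Print the pairs so a curator can confirm each top search hit is sensible.
for raw, name in sorted(mapper.items()):
    print(f"{raw!r} -> {name!r}")

# Apply the mapping once every label is covered.
mapped = [mapper[label] for label in raw_labels]
```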

Now, all cell types are validated:

In [None]:
lb.CellType.validate(pbcm68k_validated.obs["cell_type"]);

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
file = ln.File.from_anndata(
    pbcm68k_validated, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)

In [None]:
file.save()

In [None]:
cell_types = lb.CellType.from_values(pbcm68k_validated.obs["cell_type"], "name")
file.add_labels(cell_types, "cell_type")

In [None]:
file.add_labels(lb.settings.species, feature="species")
file.add_labels(lb.ExperimentalFactor.filter(name=scrna_name).one(), feature="assay")

In [None]:
var_feature_set = file.features.get_feature_set("var")
var_feature_set.modality = modalities.rna
var_feature_set.save()

In [None]:
file.features

In [None]:
file.describe()

In [None]:
file.view_lineage()

🎉 Now let's continue with data integration: {doc}`./scrna2`