# Manage scRNA-seq datasets

This guide illustrates how to manage scRNA-seq datasets in the absence of a custom schema.

```{toctree}
:maxdepth: 1
:hidden:

scrna-1
```

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = 3  # show hints

In [None]:
ln.track()

## Preparation: registries

Let's assume this isn't the first time we're working with these experimental entities, so our registries are already populated:

In [None]:
# assume prepared registries

# strain
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0004472").save()
record = lb.ExperimentalFactor.select(ontology_id="EFO:0004472").one()
record.add_synonym("C57BL/6N")

# developmental stage
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0001272").save()

# tissue
lb.Tissue.from_bionty(ontology_id="UBERON:0001542").save()

# cell types
ln.save(lb.CellType.from_values(["CL:0000115", "CL:0000738"], "ontology_id"))

In [None]:
ln.view(schema="bionty", orms=["CellType", "ExperimentalFactor", "Tissue"])

## Detmar22: Mouse Lymph Node

In [None]:
import lamindb as ln
import lnschema_bionty as lb

We're working with mouse data:

In [None]:
lb.settings.species = "mouse"

Let's look at a scRNA-seq count matrix in the form of an `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()
# The column names are a bit lengthy, let's abbreviate them:
adata.obs.columns = (
    adata.obs.columns.str.replace("Sample Characteristic", "")
    .str.replace("Factor Value ", "Factor Value:", regex=True)
    .str.replace(r"Factor Value\[", "Factor Value:", regex=True)
    .str.replace(r" Ontology Term\[", "ontology_id:", regex=True)
    .str.strip("[]")
)
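
To make the effect of this renaming chain concrete, here is a minimal sketch that applies the same replacements to a toy `pandas.Index`. The three column names are assumptions chosen to resemble the dataset's naming pattern; they are not read from the actual file.

```python
import pandas as pd

# Toy column names (assumed for illustration, not taken from the dataset).
columns = pd.Index(
    [
        "Sample Characteristic[strain]",
        "Sample Characteristic Ontology Term[strain]",
        "Factor Value[cell type]",
    ]
)

# Same chain as above, with raw strings for the regex patterns.
short = (
    columns.str.replace("Sample Characteristic", "", regex=False)
    .str.replace("Factor Value ", "Factor Value:", regex=False)
    .str.replace(r"Factor Value\[", "Factor Value:", regex=True)
    .str.replace(r" Ontology Term\[", "ontology_id:", regex=True)
    .str.strip("[]")
)

print(list(short))
```

On this toy input, the bracketed name becomes the column name, ontology term columns gain an `ontology_id:` prefix, and factor values gain a `Factor Value:` prefix.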

Let's have a look at the annotations:

In [None]:
adata

In [None]:
adata.obs.columns

When we create a file object from an AnnData, we'll automatically link its feature sets and get information about unmapped categories:

In [None]:
file = ln.File.from_anndata(
    adata, description="Detmar22", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

The file now has two linked feature sets:

In [None]:
file.feature_sets.df()

Let's also link observational metadata:

In [None]:
adata.obs.head()

Metadata that has corresponding ORMs:

In [None]:
strains = lb.ExperimentalFactor.from_values(adata.obs["strain"], "name")
dev_stages = lb.ExperimentalFactor.from_values(adata.obs["developmental stage"], "name")
cell_types = lb.CellType.from_values(adata.obs["cell type"], "name")
tissues = lb.Tissue.from_values(adata.obs["organism part"], "name")
file.features.add_labels(strains + dev_stages + tissues + cell_types)

Metadata that doesn't have corresponding ORMs:

In [None]:
labels = ln.Label.from_values(adata.obs["sex"])
labels += ln.Label.from_values(adata.obs["age"])
labels += ln.Label.from_values(adata.obs["genotype"])
labels += ln.Label.from_values(adata.obs["immunophenotype"])
file.features.add_labels(labels)
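
Each of these `obs` columns holds one entry per cell, so it can help to inspect the distinct values per column before turning them into labels. The following is a pandas-only sketch on a made-up stand-in for `adata.obs`; the column values are assumptions for illustration and it does not call LaminDB.

```python
import pandas as pd

# Toy stand-in for `adata.obs`; the values are made up for illustration.
obs = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "female"],
        "age": ["8 week", "8 week", "8 week", "8 week"],
    }
)

# Distinct values per column, in order of first appearance.
distinct = {col: obs[col].dropna().unique().tolist() for col in obs.columns}
print(distinct)
```

A column with unexpectedly many distinct values, or with near-duplicate spellings, is usually worth cleaning up before labeling.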

The file is now queryable by everything we linked:

In [None]:
file.describe()

## Human immune cells: Conde22

In [None]:
lb.settings.species = "human"

In [None]:
conde22 = ln.dev.datasets.anndata_human_immune_cells()

In [None]:
file = ln.File.from_anndata(
    conde22, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)
file.save()

The file has the following linked feature sets:

In [None]:
file.feature_sets.df()

Let's now link observational metadata.

In [None]:
cell_types = lb.CellType.from_values(conde22.obs.cell_type, field="name")
ln.save(cell_types)
file.cell_types.set(cell_types)

In [None]:
efs = lb.ExperimentalFactor.from_values(conde22.obs.assay, field="name")
ln.save(efs)
file.experimental_factors.set(efs)

In [None]:
tissues = lb.Tissue.from_values(conde22.obs.tissue, field="name")
ln.save(tissues)
file.tissues.set(tissues)

As neither the core schema nor `lnschema_bionty` has a `Donor` table, we use `Label` to track donor IDs:

In [None]:
donor = ln.Label(name="donor", description="Parent label for all donor labels")
donor.save()
donors = ln.Label.from_values(conde22.obs["donor_id"])
ln.save(donors)
for d in donors:
    d.parents.add(donor)
file.labels.set(donors)
donor.children.df()

In [None]:
file.describe()

## A less well-curated dataset

Let's now consider a dataset with less well-curated features:

In [None]:
pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols: 

In [None]:
pbcm68k.var.index

Because a gene symbol doesn't map to a unique Ensembl gene ID, we link more feature records to this file than there are columns in the `AnnData`.

```{tip}

Use Ensembl gene IDs rather than gene symbols to index genes.

```
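
The ambiguity can be illustrated with a small pandas sketch. The symbol-to-ID table below is entirely hypothetical (the symbols and IDs are placeholders, not real Ensembl entries); it only shows how one symbol resolving to two IDs yields more matched records than `var` entries.

```python
import pandas as pd

# Hypothetical symbol-to-ID table; the IDs are placeholders, not real Ensembl IDs.
gene_table = pd.DataFrame(
    {
        "symbol": ["GENE1", "GENE2", "GENE2", "GENE3"],
        "ensembl_gene_id": ["ID_0001", "ID_0002", "ID_0003", "ID_0004"],
    }
)

# Symbols as they might appear in `adata.var.index`.
var_symbols = pd.Index(["GENE1", "GENE2", "GENE3"])

# All records whose symbol occurs in the dataset.
matched = gene_table[gene_table["symbol"].isin(var_symbols)]

# Symbols that resolve to more than one ID.
ids_per_symbol = matched.groupby("symbol")["ensembl_gene_id"].nunique()
ambiguous = ids_per_symbol[ids_per_symbol > 1].index.tolist()

print(len(var_symbols), len(matched))
print(ambiguous)
```

Indexing by the ID column instead would give exactly one record per `var` entry in this toy table.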

In [None]:
file_pbcm68k = ln.File.from_anndata(
    pbcm68k, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)
file_pbcm68k.save()

The cell type names can't be mapped directly to the public ontology:

In [None]:
lb.CellType.from_values(pbcm68k.obs["cell_type"], "name")

In [None]:
# search the public ontology for each cell type name and take the top match
# (the top match isn't guaranteed to be correct, so review it before relying on it)

celltype_bt = lb.CellType.bionty()
for ct in pbcm68k.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    record.save()
    record.add_synonym(ct)
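
Taking the top search hit is convenient, but it always returns something, even for labels that have no good counterpart. The sketch below shows one way to guard against that with a similarity cutoff. It uses the standard library's `difflib` as a stand-in for the ontology search (it is not how `search` is implemented), and the names, labels, and `top_match` helper are all hypothetical.

```python
from difflib import get_close_matches

# Made-up stand-ins for public ontology names and free-text dataset labels.
ontology_names = ["dendritic cell", "B cell", "monocyte"]
dataset_labels = ["Dendritic cells", "unknown cluster 7"]


def top_match(label, names, cutoff=0.6):
    """Return the closest ontology name, or None if nothing clears the cutoff."""
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(label.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


matches = {label: top_match(label, ontology_names) for label in dataset_labels}

# Labels without a sufficiently close match should be curated by hand.
needs_review = [label for label, hit in matches.items() if hit is None]

print(matches)
print(needs_review)
```

Labels that end up in `needs_review` are candidates for manual curation rather than automatic synonym assignment.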

In [None]:
celltypes = lb.CellType.from_values(pbcm68k.obs["cell_type"], "name")
file_pbcm68k.cell_types.set(celltypes)

In [None]:
file_pbcm68k.describe()

🎉 Now let's continue with data integration: {doc}`/biology/scrna-1`