# Manage scRNA-seq datasets

In [None]:
!lamin init --storage ./test-scrna --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = 3  # show hints

In [None]:
# assume prepared registries

# strain
child = lb.ExperimentalFactor.from_bionty(name="C57BL/6NJ")
child.save()
ef = lb.ExperimentalFactor(
    name="C57BL/6N", ontology_id="EFO:0004472"
)  # obsolete term; defining it manually becomes unnecessary once it is back in Bionty
ef.save()
ef.children.add(child)  # link the substrain C57BL/6NJ as a child of C57BL/6N

# developmental stage
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0001272").save()

# tissue
lb.Tissue.from_bionty(ontology_id="UBERON:0001542").save()

# cell types
lb.CellType.from_bionty(ontology_id="CL:0000115").save()

## Detmar22: mouse

We're working with mouse data:

In [None]:
lb.settings.species = "mouse"

Let's look at a scRNA-seq count matrix in the form of an `AnnData` object:

In [None]:
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()

In [None]:
adata

Let's have a look at the annotations:

In [None]:
adata.obs.head()

The column names are a bit lengthy, so let's abbreviate them:

In [None]:
adata.obs.columns = adata.obs.columns.str.replace(
    "Sample Characteristic", ""
).str.replace(" Ontology Term", "ontology_id")
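
To see what the two chained replacements do, here is a minimal sketch using plain Python strings. The column names below are assumptions modeled on the `Sample Characteristic[...]` pattern that the replacement targets; the actual columns in `adata.obs` may differ.

```python
# Hypothetical column names following the pattern the cell above rewrites.
columns = [
    "Sample Characteristic[organism]",
    "Sample Characteristic Ontology Term[organism]",
    "Sample Characteristic[strain]",
    "Sample Characteristic Ontology Term[strain]",
]


def abbreviate(name: str) -> str:
    """Apply the same two literal replacements, in the same order."""
    return name.replace("Sample Characteristic", "").replace(
        " Ontology Term", "ontology_id"
    )


short_columns = [abbreviate(c) for c in columns]
print(short_columns)
# ['[organism]', 'ontology_id[organism]', '[strain]', 'ontology_id[strain]']
```

The order matters: the first replacement leaves a leading space in front of `Ontology Term`, which is exactly what the second pattern matches.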

When we create a file object from an `AnnData`, its feature sets are linked automatically and we get information about unmapped categories:

In [None]:
file = ln.File.from_anndata(
    adata, description="Detmar22", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()
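
Conceptually, reporting unmapped categories amounts to comparing the values found in an annotation column against the identifiers known to a registry. The following is only an illustrative sketch of that idea in plain Python, not LaminDB's implementation; the `registry` set and the `"not available"` value are assumptions made up for the example.

```python
# Hypothetical stand-in for a registry of known ontology IDs.
registry = {"CL:0000115", "CL:0000738", "UBERON:0001542"}

# Hypothetical categories observed in one annotation column.
observed = ["CL:0000115", "CL:0000738", "not available", "CL:0000115"]

# Deduplicate, then split into values the registry knows and those it doesn't.
unique_observed = set(observed)
mapped = sorted(unique_observed & registry)
unmapped = sorted(unique_observed - registry)

print("mapped:", mapped)
print("unmapped:", unmapped)
```

Values that end up in `unmapped` are the ones that would need curation before they can be linked to a registry record.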

The file now has two linked feature sets:

In [None]:
file.feature_sets.df()

Let's also capture the observational metadata:

In [None]:
strain = lb.ExperimentalFactor.select(ontology_id="EFO:0004472").one()
file.experimental_factors.add(strain)
dev_stage = lb.ExperimentalFactor.select(ontology_id="EFO:0001272").one()
file.experimental_factors.add(dev_stage)
cell_types = lb.CellType.from_values(["CL:0000115", "CL:0000738"], "ontology_id")
file.cell_types.set(cell_types)

The file is now queryable by everything we linked:

In [None]:
file.describe()

## Add more datasets

Let's consider another dataset whose features are less well curated:

In [None]:
lb.settings.species = "human"

In [None]:
pbmc68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols:

In [None]:
pbmc68k.var.index

Because a gene symbol doesn't map to a unique Ensembl gene ID, we're linking more feature records to this file than there are columns in the `AnnData`.

```{tip}

Use Ensembl gene IDs rather than gene symbols to index genes.

```

In [None]:
ln.File.from_anndata(
    pbmc68k, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
).save()
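
The effect described above, where a symbol-indexed matrix links more gene records than it has columns, can be sketched with a toy lookup table. All identifiers below are made-up placeholders, not real gene symbols or Ensembl IDs, and the lookup is an assumption for illustration rather than how the gene registry is actually queried.

```python
from collections import defaultdict

# Made-up (gene ID, symbol) pairs; two IDs share the same symbol.
gene_records = [
    ("ID_0001", "SYMBOL_A"),
    ("ID_0002", "SYMBOL_A"),
    ("ID_0003", "SYMBOL_B"),
]

# Build a symbol -> list of gene IDs lookup.
symbol_to_ids = defaultdict(list)
for gene_id, symbol in gene_records:
    symbol_to_ids[symbol].append(gene_id)

# Hypothetical symbol index of a matrix with two columns.
var_index = ["SYMBOL_A", "SYMBOL_B"]

# Every ID that matches a symbol gets linked, so ambiguous symbols fan out.
linked_ids = [gene_id for symbol in var_index for gene_id in symbol_to_ids[symbol]]

print(len(var_index), "columns ->", len(linked_ids), "linked gene records")
```

Indexing by gene ID instead avoids the fan-out, because each column then corresponds to exactly one record.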

In [None]:
conde22 = ln.dev.datasets.anndata_human_immune_cells()

In [None]:
conde22.var.index

In [None]:
ln.File.from_anndata(
    conde22, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
).save()

In [None]:
ln.File.select().df()

In [None]:
ln.FeatureSet.select().df()

In [None]:
!lamin delete test-scrna
!rm -r ./test-scrna