![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna2.ipynb)
[![lamindata](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?uid=ManDYgmftZ8Cz8)

# Standardize and append a batch of data

Here, we'll learn:
- how to standardize a less well curated collection
- how to append it to the growing versioned collection

In [None]:
import lamindb as ln
import bionty as bt

ln.settings.verbosity = "hint"
bt.settings.auto_save_parents = False

In [None]:
ln.transform.stem_uid = "ManDYgmftZ8C"
ln.transform.version = "1"
ln.track()

## Standardize a data shard

Let's now consider a collection with less well-curated features:

In [None]:
adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata

We are still working with human data, and can globally set an organism:

In [None]:
bt.settings.organism = "human"

### Standardize & validate genes ![](https://img.shields.io/badge/Validate-10b981) 

This data shard is indexed by gene symbols, which we'll want to map to Ensembl ids:

In [None]:
adata.var.head()

Let's inspect the identifiers:

In [None]:
bt.Gene.inspect(adata.var.index, bt.Gene.symbol)

Let's first standardize the gene symbols from synonyms:

In [None]:
adata.var.index = bt.Gene.standardize(adata.var.index, bt.Gene.symbol)
validated = bt.Gene.validate(adata.var.index, bt.Gene.symbol)
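To make the two steps above concrete, here is a minimal, library-free sketch of the idea: standardizing replaces known synonyms with their canonical symbol, and validating returns a boolean mask of symbols found in a registry. The names `toy_standardize`, `toy_validate`, `KNOWN_SYMBOLS`, and `SYNONYM_TO_SYMBOL` are hypothetical and the tables are tiny illustrative stand-ins; this is not how `bionty` implements these methods.

```python
# Toy registry and synonym table (illustrative stand-ins, not bionty data structures).
KNOWN_SYMBOLS = {"TP53", "MS4A1", "ERBB2"}
SYNONYM_TO_SYMBOL = {"P53": "TP53", "CD20": "MS4A1", "HER2": "ERBB2"}


def toy_standardize(symbols):
    """Replace known synonyms with their canonical symbol; leave everything else as is."""
    return [SYNONYM_TO_SYMBOL.get(symbol, symbol) for symbol in symbols]


def toy_validate(symbols):
    """Return one boolean per symbol: True if it is present in the registry."""
    return [symbol in KNOWN_SYMBOLS for symbol in symbols]


raw_symbols = ["TP53", "CD20", "not-a-gene"]
standardized = toy_standardize(raw_symbols)
mask = toy_validate(standardized)

# Keep only validated symbols, analogous to subsetting with the boolean mask.
kept = [symbol for symbol, ok in zip(standardized, mask) if ok]
print(standardized, mask, kept)
```

Standardizing before validating matters here: a synonym such as `CD20` only passes validation once it has been mapped to its canonical symbol.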

We only want to register data with validated genes:

In [None]:
adata_validated = adata[:, validated].copy()

Now that all symbols are validated, let's convert them to Ensembl ids via {meth}`~docs:lamindb.dev.CanValidate.standardize`. Note that this is an ambiguous mapping and the first match is kept because the `keep` arg of `.standardize()` defaults to `"first"`:

In [None]:
adata_validated.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata_validated.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
adata_validated.var.index.name = "symbol"
adata_validated.var = adata_validated.var.reset_index().set_index("ensembl_gene_id")
adata_validated.var.head()
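The "first match is kept" behavior described above can be sketched without any library. The example below uses a made-up lookup table in which one symbol maps to two ids; `TOY_TABLE`, `map_keep_first`, and `ambiguous_symbols` are hypothetical names, and the ids are placeholders rather than real Ensembl ids. It is an illustration of the semantics only, not the `.standardize()` implementation.

```python
from collections import defaultdict

# Made-up (symbol, id) pairs; "GENE2" is deliberately ambiguous.
TOY_TABLE = [
    ("GENE1", "ID_0001"),
    ("GENE2", "ID_0002"),
    ("GENE2", "ID_0003"),
    ("GENE3", "ID_0004"),
]


def group_ids(table):
    """Collect all ids per symbol, preserving table order."""
    grouped = defaultdict(list)
    for symbol, identifier in table:
        grouped[symbol].append(identifier)
    return dict(grouped)


def map_keep_first(symbols, table):
    """Map each symbol to the first id listed for it."""
    grouped = group_ids(table)
    return [grouped[symbol][0] for symbol in symbols]


def ambiguous_symbols(symbols, table):
    """Return the symbols that have more than one candidate id."""
    grouped = group_ids(table)
    return sorted({symbol for symbol in symbols if len(grouped[symbol]) > 1})


symbols = ["GENE1", "GENE2"]
print(map_keep_first(symbols, TOY_TABLE))
print(ambiguous_symbols(symbols, TOY_TABLE))
```

Listing the ambiguous symbols alongside the mapping makes it easy to see which ids were chosen only because they happened to come first.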

Here, we'll also carry over `.raw`, subset to the validated genes and re-indexed by Ensembl id:

In [None]:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

### Standardize & validate cell types ![](https://img.shields.io/badge/Validate-10b981) 

Inspection shows none of the terms are validated:

In [None]:
inspector = bt.CellType.inspect(adata_validated.obs.cell_type)

Let us search the cell type names from the public ontology, and add the name found in the `AnnData` object as a synonym to the top match found in the public ontology.

In [None]:
bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_public(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
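Taking the top search hit is convenient, but a top hit is not necessarily a good hit. As a rough sketch of how one might flag names for manual review, the example below scores each name against a toy list of ontology names with the standard library's `difflib` and collects low-scoring matches. The names `TOY_ONTOLOGY_NAMES`, `top_match`, and the `0.6` threshold are assumptions made for illustration; the scoring differs from whatever ranking the `search()` method of the public ontology uses.

```python
from difflib import SequenceMatcher

# Toy stand-in for a list of ontology term names.
TOY_ONTOLOGY_NAMES = ["B cell", "dendritic cell", "monocyte"]


def top_match(name, candidates):
    """Return (best_candidate, similarity) using a case-insensitive ratio in [0, 1]."""
    scored = [
        (candidate, SequenceMatcher(None, name.lower(), candidate.lower()).ratio())
        for candidate in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


name_mapper = {}
needs_review = []
for name in ["Dendritic cells", "qqq"]:  # "qqq" is a made-up label with no sensible match
    match, score = top_match(name, TOY_ONTOLOGY_NAMES)
    name_mapper[name] = match
    if score < 0.6:  # arbitrary threshold chosen for this sketch
        needs_review.append(name)

print(name_mapper)
print(needs_review)
```

A top match is always returned, even for a label that resembles nothing in the list, which is why a review step of some kind is worth having before synonyms are saved.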

We can now standardize cell type names using the search-based mapper:

In [None]:
adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)

Now, all cell types are validated:

In [None]:
validated = bt.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)

We don't want to store any of the other metadata columns:

In [None]:
# drop() returns a new DataFrame, so assign the result back
adata_validated.obs = adata_validated.obs.drop(
    columns=["n_genes", "percent_mito", "louvain"]
)

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
experimental_factors = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()

In [None]:
artifact = ln.Artifact.from_anndata(
    adata_validated,
    description="10x reference adata"
)

As we do not want to manage the remaining unvalidated terms in registries, we can save and annotate the artifact:

In [None]:
artifact.save()
artifact.features.add_from_anndata(field=bt.Gene.ensembl_gene_id)
artifact.labels.add(adata_validated.obs.cell_type, features.cell_type)
artifact.labels.add(organism.human, feature=features.organism)
artifact.labels.add(
    experimental_factors.single_cell_rna_sequencing, feature=features.assay
)
artifact.describe()

In [None]:
artifact.view_lineage()

## Append the shard to the collection

Query the previous collection:

In [None]:
collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new `artifact` and the artifact underlying version 1 of the collection:

In [None]:
collection_v2 = ln.Collection(
    [artifact, collection_v1.artifact],
    is_new_version_of=collection_v1,
)
collection_v2.save()
collection_v2.labels.add_from(artifact)
collection_v2.labels.add_from(collection_v1)
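Conceptually, appending a shard creates a new version that references the new artifact alongside the existing ones, while the previous version stays untouched. The toy model below illustrates that idea only; `ToyCollection` and `append_shard` are hypothetical names, the artifact strings are placeholders, and nothing here reflects how `ln.Collection` is implemented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Immutable toy model of a versioned collection of artifacts."""

    name: str
    version: str
    artifacts: Tuple[str, ...]
    previous: Optional["ToyCollection"] = None


def append_shard(collection, artifact):
    """Return a new version that adds `artifact` and links back to the old version."""
    return ToyCollection(
        name=collection.name,
        version=str(int(collection.version) + 1),
        artifacts=(artifact,) + collection.artifacts,
        previous=collection,
    )


v1 = ToyCollection("My versioned scRNA-seq collection", "1", ("artifact_v1",))
v2 = append_shard(v1, "artifact_new_shard")

print(v2.version, v2.artifacts)
print(v1.version, v1.artifacts)  # version 1 is unchanged
```

Because the toy collection is immutable, existing shards are never rewritten; a new version only adds a reference to the appended shard.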

Version 2 of the collection covers significantly more conditions.

In [None]:
collection_v2.describe()

View data lineage:

In [None]:
collection_v2.view_lineage()