![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna2.ipynb)
[![lamindata](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/transform/ManDYgmftZ8Cz8)

# Standardize and append a batch of data

Here, we'll learn:
- how to standardize a less well-curated collection
- how to append it to the growing versioned collection

In [None]:
import lamindb as ln
import bionty as bt

ln.settings.verbosity = "hint"
bt.settings.auto_save_parents = False
ln.settings.transform.stem_uid = "ManDYgmftZ8C"
ln.settings.transform.version = "1"
ln.track()

Let's now consider a less well-curated dataset:

In [None]:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata

We are still working with human data, and can globally set an organism:

In [None]:
bt.settings.organism = "human"

In [None]:
annotate = ln.Annotate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={"cell_type": bt.CellType.name},
)

## Standardize & validate genes ![](https://img.shields.io/badge/Validate-10b981) 

Let's convert gene symbols to Ensembl IDs via {meth}`~docs:lamindb.core.CanValidate.standardize`. Note that this mapping is not unique: a symbol can match more than one Ensembl ID, and only the first match is kept because the `keep` parameter of `.standardize()` defaults to `"first"`:

In [None]:
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# keep the original symbols as a column and use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
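
To make the effect of `keep="first"` on a non-unique mapping concrete, here is a minimal sketch in plain Python. It does not use bionty. The symbol table, the placeholder IDs, and the helper names `build_first_match_mapper` and `standardize_symbols` are all hypothetical, and leaving unmatched symbols unchanged is an assumption of this toy example.

```python
# Toy symbol -> ID table with a deliberately ambiguous symbol.
# All IDs are made-up placeholders, not real Ensembl IDs.
records = [
    ("GENE_A", "ID_0001"),
    ("GENE_B", "ID_0002"),
    ("GENE_B", "ID_0003"),  # second match for GENE_B, dropped when keeping the first
    ("GENE_C", "ID_0004"),
]


def build_first_match_mapper(pairs):
    """Return a dict that keeps only the first ID seen for each symbol."""
    mapper = {}
    for symbol, gene_id in pairs:
        mapper.setdefault(symbol, gene_id)  # later duplicates are ignored
    return mapper


def standardize_symbols(symbols, mapper):
    """Map symbols to IDs; unmatched symbols are returned unchanged (toy assumption)."""
    return [mapper.get(symbol, symbol) for symbol in symbols]


mapper = build_first_match_mapper(records)
standardized = standardize_symbols(["GENE_B", "GENE_C", "UNKNOWN_1"], mapper)

# In this sketch, a value counts as validated only if it resolved to a known ID.
known_ids = set(mapper.values())
validated_mask = [value in known_ids for value in standardized]

print(standardized)
print(validated_mask)
```

The boolean mask plays the same role as `validated` above: it selects only the entries that resolved to a known identifier.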

Here, we'll also keep `.raw`, restricted to the validated genes and indexed by the same Ensembl IDs:

In [None]:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

In [None]:
annotate = ln.Annotate.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name},
)

In [None]:
annotate.validate()

## Standardize & validate cell types ![](https://img.shields.io/badge/Validate-10b981) 

Since none of the cell types are validated, let's search the public ontology for each cell type name and add the name found in the `AnnData` object as a synonym to the top match.

In [None]:
bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_public(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
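
Taking the top search hit works well when names are close to ontology terms, but a top hit can still be a poor match. The sketch below illustrates the idea of a search-based mapper in plain Python. It is not bionty's search algorithm: the token-overlap score, the tiny vocabulary, the observed labels, and the helper names are all hypothetical and serve only to show how unmatched names can be set aside for manual review.

```python
def normalize(name):
    """Lowercase a name, split on whitespace, and strip one trailing 's' per token."""
    tokens = set()
    for token in name.lower().split():
        if len(token) > 1 and token.endswith("s"):
            token = token[:-1]
        tokens.add(token)
    return tokens


def overlap_score(query, term):
    """Jaccard index of the normalized token sets of two names."""
    q, t = normalize(query), normalize(term)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def top_match(query, vocabulary):
    """Return the best-scoring term, or None if no term shares a token with the query."""
    best = max(vocabulary, key=lambda term: overlap_score(query, term))
    return best if overlap_score(query, best) > 0 else None


# Hypothetical stand-ins for ontology terms and for labels found in a dataset
vocabulary = ["dendritic cell", "monocyte", "B cell", "natural killer cell"]
observed = ["Dendritic cells", "CD14+ Monocytes", "CD19+ B", "CD56+ NK"]

name_mapper = {}
unmatched = []
for name in observed:
    match = top_match(name, vocabulary)
    if match is None:
        unmatched.append(name)  # needs manual curation
    else:
        name_mapper[name] = match

print(name_mapper)
print(unmatched)
```

In this toy setup the abbreviation `"CD56+ NK"` shares no token with any vocabulary term, so it ends up in `unmatched` rather than being mapped to an arbitrary term. Inspecting the mapper before applying it is a cheap safeguard for the same reason.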

We can now standardize cell type names using the search-based mapper:

In [None]:
adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)

Now, all cell types are validated:

In [None]:
annotate.validate()

## Register ![](https://img.shields.io/badge/Register-10b981)

In [None]:
artifact = annotate.save_artifact(description="10x reference adata")

In [None]:
artifact.view_lineage()

## Append the shard to the collection

Query the previous collection:

In [None]:
collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new `artifact` and the artifact underlying version 1 of the collection:

In [None]:
collection_v2 = ln.Collection(
    [artifact, collection_v1.artifacts[0]],
    is_new_version_of=collection_v1,
)
collection_v2.save()
collection_v2.labels.add_from(artifact)
collection_v2.labels.add_from(collection_v1)
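
Conceptually, appending a shard creates a new version that references the previous one and leaves it untouched. The following toy model sketches that idea with a frozen dataclass. `ToyCollection` and `new_version` are hypothetical names for illustration and do not reflect lamindb's data model or API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for a versioned collection (not lamindb's model)."""

    name: str
    version: str
    artifacts: Tuple[str, ...]
    previous: Optional["ToyCollection"] = None


def new_version(previous, extra_artifacts):
    """Create the next version from new artifacts plus those of the previous version."""
    return ToyCollection(
        name=previous.name,
        version=str(int(previous.version) + 1),
        artifacts=tuple(extra_artifacts) + previous.artifacts,
        previous=previous,
    )


v1 = ToyCollection("My versioned scRNA-seq collection", "1", ("artifact_a",))
v2 = new_version(v1, ["artifact_b"])

print(v2.version, v2.artifacts)
print(v1.version, v1.artifacts)  # version 1 is unchanged
```

Because the toy class is immutable, version 1 stays queryable exactly as it was, while version 2 spans both artifacts.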

Version 2 of the collection covers significantly more conditions than version 1:

In [None]:
collection_v2.describe()

View data lineage:

In [None]:
collection_v2.view_lineage()