![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna2.ipynb)
[![lamindata](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/transform/ManDYgmftZ8C65cN/D7mKQfDEiaoZl8mhdHYX)

# Standardize and append a dataset

Here, we'll learn 
- how to standardize a less well curated dataset
- how to append it to the growing versioned collection

In [None]:
import lamindb as ln
import bionty as bt

ln.track("ManDYgmftZ8C0003")

Let's now consider a less well curated dataset:

In [None]:
adata = ln.core.datasets.anndata_pbmc68k_reduced()
# we don't trust the cell type annotation in this dataset
adata.obs.rename(columns={"cell_type": "cell_type_untrusted"}, inplace=True)
adata

Create a curator:

In [None]:
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)

## Standardize & validate genes ![](https://img.shields.io/badge/Validate-10b981) 

Let's convert gene symbols to Ensembl IDs via {meth}`~docs:lamindb.core.CanValidate.standardize`. Note that this mapping is not unique: the first match is kept because the `keep` parameter of `.standardize()` defaults to `"first"`:

In [None]:
adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
    organism="human",
)
# use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
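
To make the "first match is kept" behavior concrete, here is a plain-Python sketch of the idea. It is not how `.standardize()` is implemented; the helper `standardize_keep_first` and all gene identifiers are hypothetical placeholders, and we assume that symbols without a match pass through unchanged, which is why a validation step follows.

```python
# Toy symbol -> Ensembl ID table with one ambiguous symbol.
# All identifiers are placeholders, not real gene records.
toy_reference = [
    ("GENE_A", "ENSG_PLACEHOLDER_1"),
    ("GENE_B", "ENSG_PLACEHOLDER_2"),
    ("GENE_B", "ENSG_PLACEHOLDER_3"),  # second ID for the same symbol
]


def standardize_keep_first(symbols, table):
    """Map each symbol to the first ID listed for it; unknown symbols stay as they are."""
    first_match = {}
    for symbol, ensembl_id in table:
        first_match.setdefault(symbol, ensembl_id)  # later duplicates are ignored
    return [first_match.get(symbol, symbol) for symbol in symbols]


toy_symbols = ["GENE_A", "GENE_B", "GENE_C"]
toy_standardized = standardize_keep_first(toy_symbols, toy_reference)

# Validate against the known IDs: the unmapped symbol fails and would be dropped.
known_ids = {ensembl_id for _, ensembl_id in toy_reference}
toy_validated = [value in known_ids for value in toy_standardized]

print(toy_standardized)
print(toy_validated)
```

The boolean list plays the role of `validated` above: it is what lets us subset the data to genes that resolved to a known identifier.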

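Because we subset the genes, we also subset `.raw` to the validated genes and align its variable index with the new Ensembl-based index: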

In [None]:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

In [None]:
curate = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type_untrusted": bt.CellType.name},
    organism="human",
)

In [None]:
curate.validate()

In [None]:
curate.add_validated_from_var_index()

## Standardize & validate cell types ![](https://img.shields.io/badge/Validate-10b981) 

None of the cell type names are valid.

We'll now search the public ontology and map each name found in the dataset to the top match. Optionally, the name found in the dataset can be added as a synonym of that top match.

In [None]:
bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapping = {}
for invalid_name in adata_validated.obs["cell_type_untrusted"].unique():
    ontology_id = bionty.search(invalid_name).iloc[0].ontology_id  # top search hit through iloc[0]
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapping[invalid_name] = record.name  # map the original name to standardized name
    record.save()
    # record.add_synonym(invalid_name)  # optionally save the invalid name as a synonym so that it becomes searchable
# print the mapping
print(name_mapping)
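
Taking the top search hit is convenient but can silently pick a wrong term, so the mapping deserves a review. The sketch below mimics the pattern with the standard library only: `difflib` stands in for the ontology search, and the vocabulary, the example labels, and the helper `top_hit` are all hypothetical.

```python
import difflib

# Stand-in vocabulary; a real ontology contains thousands of terms.
ontology_names = ["B cell", "cytotoxic T cell", "dendritic cell", "monocyte"]


def top_hit(query, vocabulary):
    """Return the vocabulary entry most similar to `query` (case-insensitive)."""
    lowered = {name.lower(): name for name in vocabulary}
    # cutoff=0.0 always returns the best candidate, even if it is a poor one
    best = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[best[0]]


example_labels = ["Dendritic cells", "CD14+ Monocytes", "CD19+ B"]
toy_search_mapping = {label: top_hit(label, ontology_names) for label in example_labels}

for label, name in toy_search_mapping.items():
    print(f"{label!r} -> {name!r}")  # review every pair before trusting it
```

Every label receives some match, which is exactly why a wrong assignment can go unnoticed without a manual check.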

We can now standardize cell type names using the search-based mapper:

In [None]:
adata_validated.obs["cell_type_untrusted_original"] = adata_validated.obs["cell_type_untrusted"]  # copy the original annotations
adata_validated.obs["cell_type_untrusted"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)

Now, all cell types are validated:

In [None]:
curate.validate()

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
artifact = curate.save_artifact(description="10x reference adata")

In [None]:
artifact.view_lineage()

In [None]:
artifact.describe()

## Re-curate

We review the dataset and find all annotations trustworthy except for a `'CD38-positive naive B cell'`.

Inspecting the `name_mapping` in detail tells us that `'CD8+/CD45RA+ Naive Cytotoxic'` was erroneously mapped to a B cell.

Let us correct this and create a `'cell_type'` feature that we can now trust.

In [None]:
# the mapping is keyed by the original dataset label, so we override that key
name_mapping['CD8+/CD45RA+ Naive Cytotoxic'] = 'cytotoxic T cell'

In [None]:
adata_validated.obs["cell_type"] = adata_validated.obs["cell_type_untrusted_original"].map(name_mapping)
adata_validated.obs["cell_type"].unique()
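
Because the mapping is keyed by the original dataset labels, a correction has to target the original label rather than the wrong standardized name. Here is a small self-contained sketch of that lookup and override; the `"Dendritic cells"` entry and the toy label list are hypothetical.

```python
# Toy stand-ins for the search-derived mapping and the original annotations.
mapping_sketch = {
    "CD8+/CD45RA+ Naive Cytotoxic": "CD38-positive naive B cell",  # wrong top hit
    "Dendritic cells": "dendritic cell",  # hypothetical entry
}
original_labels = [
    "CD8+/CD45RA+ Naive Cytotoxic",
    "Dendritic cells",
    "CD8+/CD45RA+ Naive Cytotoxic",
]

# Find which original label produced the suspicious standardized name.
culprits = [
    original
    for original, name in mapping_sketch.items()
    if name == "CD38-positive naive B cell"
]

# Override by original label, then re-apply the mapping to the original annotations.
for original in culprits:
    mapping_sketch[original] = "cytotoxic T cell"
trusted_labels = [mapping_sketch[label] for label in original_labels]

print(culprits)
print(trusted_labels)
```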

In [None]:
artifact_trusted = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={"cell_type": bt.CellType.name, "cell_type_untrusted": bt.CellType.name},
    organism="human",
).save_artifact(
    description="10x reference adata, trusted cell type annotation",
    revises=artifact,
)

In [None]:
artifact_trusted.describe()

## Append the dataset to the collection

Query the previous collection:

In [None]:
collection_v1 = ln.Collection.get(name="My versioned scRNA-seq collection", is_latest=True)

Create a new version of the collection by sharding it across the new `artifact_trusted` and the artifact underlying version 1 of the collection:

In [None]:
collection_v2 = collection_v1.append(artifact_trusted).save()
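
As a mental model for what appending does, consider the toy class below. It is a hypothetical stand-in and not lamindb's data model: `ToyCollection` and the artifact names are made up, and the only point is that appending yields a new object while the previous version stays intact.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for a versioned collection of artifacts."""

    name: str
    artifacts: Tuple[str, ...]

    def append(self, artifact: str) -> "ToyCollection":
        # Return a new object; the previous version is left untouched.
        return ToyCollection(self.name, self.artifacts + (artifact,))


toy_v1 = ToyCollection("My versioned scRNA-seq collection", ("artifact_v1",))
toy_v2 = toy_v1.append("artifact_trusted")

print(toy_v1.artifacts)
print(toy_v2.artifacts)
```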

If you want, you can label the collection's version by setting `.version`.

In [None]:
collection_v2.version = "2"
collection_v2.save()

Version 2 of the collection covers significantly more conditions.

In [None]:
collection_v2.describe()

View data lineage:

In [None]:
collection_v2.view_lineage()