![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna2.ipynb)
[![lamindata](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?uid=ManDYgmftZ8Cz8)

# Standardize and append a batch of data

Here, we'll learn:
- how to standardize a less well curated dataset
- how to append it as a new batch of data to the growing versioned dataset

In [None]:
import lamindb as ln
import lnschema_bionty as lb

ln.track()

## Standardize a data batch

Let's now consider a dataset with less well curated features:

In [None]:
adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata

We are still working with human data, and can globally instruct `bionty` to assume human:

In [None]:
lb.settings.organism = "human"

### Standardize & validate genes ![](https://img.shields.io/badge/Validate-10b981) 

This data batch is indexed by gene symbols, which we'll want to map to Ensembl ids:

In [None]:
adata.var.head()

Let's inspect the identifiers:

In [None]:
lb.Gene.inspect(adata.var.index, lb.Gene.symbol)

Let's first standardize the gene symbols by mapping synonyms to their standard symbols, and then validate the result:

In [None]:
adata.var.index = lb.Gene.standardize(adata.var.index, lb.Gene.symbol)
validated = lb.Gene.validate(adata.var.index, lb.Gene.symbol)
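
To make the two steps concrete, here is a minimal pandas-only sketch of what "standardize from synonyms, then validate" means conceptually. It does not use or reproduce the `lamindb` implementation: the `reference` table and the helper functions `standardize_symbols` and `validate_symbols` are hypothetical names introduced for illustration only.

```python
import pandas as pd

# Hypothetical reference table: one row per official symbol,
# with "|"-separated synonyms (a toy stand-in for a gene registry).
reference = pd.DataFrame(
    {
        "symbol": ["PTPRC", "CD3E", "MS4A1"],
        "synonyms": ["CD45|LCA", "T3E", "CD20|B1"],
    }
)

# Build a synonym -> official symbol lookup.
synonym_to_symbol = {
    synonym: row.symbol
    for row in reference.itertuples()
    for synonym in row.synonyms.split("|")
}


def standardize_symbols(index: pd.Index) -> pd.Index:
    """Replace known synonyms by their official symbol; leave other values untouched."""
    return pd.Index([synonym_to_symbol.get(value, value) for value in index])


def validate_symbols(index: pd.Index):
    """Return a boolean mask that is True where the value is an official symbol."""
    return index.isin(reference["symbol"])


raw_index = pd.Index(["CD45", "CD3E", "CD20", "NOT_A_GENE"])
standardized = standardize_symbols(raw_index)
validated_mask = validate_symbols(standardized)
print(pd.DataFrame({"raw": raw_index, "standardized": standardized, "validated": validated_mask}))
```

In this sketch, standardization never removes anything: values that are neither a symbol nor a synonym pass through unchanged and simply fail validation afterwards, which is why the boolean mask is needed for subsetting.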

We only want to register data with validated genes:

In [None]:
adata_validated = adata[:, validated].copy()

Now that all symbols are validated, let's convert them to Ensembl ids via {meth}`~docs:lamindb.dev.CanValidate.standardize`. Note that this mapping can be ambiguous, because a symbol may match more than one Ensembl id. The first match is kept because the `keep` arg of `.standardize()` defaults to `"first"`:

In [None]:
adata_validated.var["ensembl_gene_id"] = lb.Gene.standardize(
    adata_validated.var.index,
    field=lb.Gene.symbol,
    return_field=lb.Gene.ensembl_gene_id,
)
adata_validated.var.index.name = "symbol"
adata_validated.var = adata_validated.var.reset_index().set_index("ensembl_gene_id")
adata_validated.var.head()
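
The effect of keeping the first match in an ambiguous mapping can be illustrated with plain pandas. This is only a sketch of the `keep` semantics as pandas defines them in `drop_duplicates`, not the `lamindb` implementation; the `reference` table, the toy identifiers, and the `map_symbols` helper are all made up for the example.

```python
import pandas as pd

# Hypothetical reference in which GENE_A maps to two ids (all values are made up).
reference = pd.DataFrame(
    {
        "symbol": ["GENE_A", "GENE_A", "GENE_B"],
        "ensembl_gene_id": ["ENSG_TOY_0001", "ENSG_TOY_0002", "ENSG_TOY_0003"],
    }
)


def map_symbols(symbols, keep="first"):
    """Map symbols to ids, resolving duplicated symbols with the given `keep` rule."""
    deduplicated = reference.drop_duplicates(subset="symbol", keep=keep)
    lookup = deduplicated.set_index("symbol")["ensembl_gene_id"]
    return pd.Index(symbols).map(lookup)


first_match = map_symbols(["GENE_A", "GENE_B"])
last_match = map_symbols(["GENE_A", "GENE_B"], keep="last")
print(list(first_match))
print(list(last_match))
```

Because the outcome depends on row order in the reference, it is worth checking which symbols are duplicated before relying on the default.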

Here, we'll also populate `.raw`, subset to the same validated genes and indexed by the same Ensembl ids:

In [None]:
adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index

### Standardize & validate cell types ![](https://img.shields.io/badge/Validate-10b981) 

Inspection shows none of the terms are validated:

In [None]:
inspector = lb.CellType.inspect(adata_validated.obs.cell_type)

Let's search the public ontology for each cell type name and take the top match. We then add the original name found in the `AnnData` object as a synonym of that match:

In [None]:
bionty = lb.CellType.bionty()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()  # save the record
    # add the original name as a synonym, so that next time, we can just run .standardize()
    record.add_synonym(name)

We can now standardize cell type names using the search-based mapper:

In [None]:
adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
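
The "search, take the top match, then map" pattern can be sketched without any registry. The example below swaps in Python's standard-library `difflib` string similarity as a stand-in for the ontology search, which is an assumption made for illustration and not how the public ontology search is implemented. The `ontology_names` list and the `top_match` helper are hypothetical.

```python
import difflib

import pandas as pd

# Hypothetical list of ontology names; a real ontology has thousands of terms.
ontology_names = ["dendritic cell", "monocyte", "B cell", "T cell", "megakaryocyte"]


def top_match(name, candidates):
    """Return the candidate most similar to `name`, compared case-insensitively."""
    lowered = {candidate.lower(): candidate for candidate in candidates}
    # cutoff=0.0 guarantees that a best candidate is always returned
    matches = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=0.0)
    return lowered[matches[0]]


obs = pd.DataFrame({"cell_type": ["Dendritic cells", "CD14+ Monocytes", "Dendritic cells"]})

# Build the mapper once per unique name, then apply it to the whole column.
name_mapper = {name: top_match(name, ontology_names) for name in obs["cell_type"].unique()}
obs["cell_type"] = obs["cell_type"].map(name_mapper)
print(name_mapper)
```

A top match is only a best guess, so it is a good idea to review the mapper by eye before applying it, especially for short or abbreviated labels.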

Now, all cell types are validated:

In [None]:
validated = lb.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)

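We don't want to store any of the other metadata columns, so we drop them from the validated object:

In [None]:
# drop returns a new DataFrame, so assign the result back
adata_validated.obs = adata_validated.obs.drop(
    columns=["n_genes", "percent_mito", "louvain"]
)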
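
One pandas detail matters when removing columns: `DataFrame.drop` returns a new frame by default and leaves the original untouched unless the result is assigned back (or `inplace=True` is passed). The sketch below shows this on a toy frame; the values are made up and only the column names mirror the ones above.

```python
import pandas as pd

# Toy stand-in for an `.obs` table (values are made up).
obs = pd.DataFrame(
    {
        "cell_type": ["monocyte", "B cell"],
        "n_genes": [1200, 950],
        "percent_mito": [0.02, 0.05],
        "louvain": ["0", "1"],
    }
)
unwanted = ["n_genes", "percent_mito", "louvain"]

# Calling drop without using its return value leaves `obs` unchanged.
obs.drop(columns=unwanted)
columns_after_discarded_call = list(obs.columns)

# Assigning the result back is what actually removes the columns.
obs = obs.drop(columns=unwanted)
columns_after_reassignment = list(obs.columns)

print(columns_after_discarded_call)
print(columns_after_reassignment)
```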

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
experimental_factors = lb.ExperimentalFactor.lookup()
organism = lb.Organism.lookup()
features = ln.Feature.lookup()

In [None]:
file = ln.File.from_anndata(
    adata_validated,
    description="10x reference adata",
    field=lb.Gene.ensembl_gene_id,
)

We don't want to manage the remaining unvalidated terms in registries, so we can save the file as is:

In [None]:
file.save()

In [None]:
file.labels.add(adata_validated.obs.cell_type, features.cell_type)
file.labels.add(organism.human, feature=features.organism)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)

In [None]:
file.describe()

In [None]:
file.view_flow()

## Append the batch to the dataset

Query the previous dataset:

In [None]:
dataset_v1 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="1").one()

Create a new version of the dataset by sharding it across the new `file` and the file underlying version 1 of the dataset:

In [None]:
dataset_v2 = ln.Dataset(
    [file, dataset_v1.file],
    is_new_version_of=dataset_v1,
)
dataset_v2.save()

# annotate the dataset
dataset_v2.labels.add_from(file)
dataset_v2.labels.add_from(dataset_v1)

Version 2 of the dataset covers significantly more conditions.

In [None]:
dataset_v2.describe()

View the flow:

In [None]:
dataset_v2.view_flow()