![scrna2/6](https://img.shields.io/badge/scrna2/6-lightgrey)
[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/scrna2.ipynb)
[![lamindata](https://img.shields.io/badge/laminlabs/lamindata-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?uid=ManDYgmftZ8Cz8)

# Append a new batch of data

Here, we'll learn:
- how to standardize a less well curated dataset
- how to append it as a new batch of data to the growing versioned dataset

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

ln.track()

## Standardize a less-well curated dataset

### Access ![](https://img.shields.io/badge/Access-10b981)

Let's now consider a dataset with less-well curated features:

In [None]:
adata = ln.dev.datasets.anndata_pbmc68k_reduced()
adata

We see that this dataset is indexed by gene symbols. Because we assume that all in-house datasets are indexed by Ensembl gene IDs, we'll need to re-curate:

In [None]:
adata.var.head()

We are still working with human data, and can globally instruct `bionty` to assume human:

In [None]:
lb.settings.species = "human"

### Validate ![](https://img.shields.io/badge/Validate-10b981) 

#### Curate & validate genes

In [None]:
lb.Gene.validate(adata.var.index, lb.Gene.symbol);

In [None]:
lb.Gene.inspect(adata.var.index, lb.Gene.symbol);

Standardize symbols and register additional symbols from Bionty:

In [None]:
adata.var.index = lb.Gene.standardize(adata.var.index, lb.Gene.symbol)
gene_records = lb.Gene.from_values(adata.var.index, lb.Gene.symbol)
ln.save(gene_records)

We only want to register data with validated genes: data related to other features wouldn't be useful to us, anyway.

Hence, we subset the `AnnData` object to the validated genes:

In [None]:
validated = lb.Gene.validate(adata.var.index, lb.Gene.symbol)
adata_validated = adata[:, validated].copy()

We also subset `.raw` of the `AnnData` object to the validated genes:

In [None]:
adata_validated.raw = adata.raw[:, validated].to_adata()

Now, we need to convert gene symbols into Ensembl gene IDs:

In [None]:
records = lb.Gene.filter(id__in=[record.id for record in gene_records])
mapper = pd.DataFrame(records.values_list("symbol", "ensembl_gene_id")).set_index(0)[1]
adata_validated.var.insert(0, "gene_symbol", adata_validated.var.index)
adata_validated.var.rename(index=mapper, inplace=True)

In [None]:
adata_validated.var.head()
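A dict- or Series-based `rename` leaves any label without an entry in the mapper unchanged, and it can produce duplicate index values if two symbols map to the same ID. The plain-Python sketch below mirrors that behavior on made-up data so the conversion can be sanity-checked; `convert_symbols`, the toy symbols, and the placeholder IDs are hypothetical and not part of lamindb or Bionty.

```python
from collections import Counter


def convert_symbols(symbols, mapper):
    """Map symbols to IDs like a dict-based rename: unmapped symbols are kept as-is."""
    converted = [mapper.get(symbol, symbol) for symbol in symbols]
    unmapped = [symbol for symbol in symbols if symbol not in mapper]
    duplicated = sorted(key for key, n in Counter(converted).items() if n > 1)
    return converted, unmapped, duplicated


# placeholder IDs, NOT real Ensembl gene IDs
toy_mapper = {"GENE_A": "ENSG_TOY_0001", "GENE_B": "ENSG_TOY_0002"}
toy_symbols = ["GENE_A", "GENE_B", "NOT_IN_MAPPER"]

converted, unmapped, duplicated = convert_symbols(toy_symbols, toy_mapper)
print("converted: ", converted)
print("unmapped:  ", unmapped)  # symbols that silently kept their old label
print("duplicated:", duplicated)  # IDs that occur more than once after mapping
```

If either `unmapped` or `duplicated` is non-empty for real data, the index would mix symbols and IDs or contain repeated IDs, which is worth resolving before registering the file.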

`.raw` has the same genes, so we set its index to the Ensembl gene IDs as well:

In [None]:
adata_validated.raw.var.index = adata_validated.var.index

#### Curate & validate cell types

Inspection shows none of the terms are validated:

In [None]:
inspector = lb.CellType.inspect(adata_validated.obs.cell_type)

Let us search for the cell type names in the public ontology, and add each name found in the `AnnData` object as a synonym to its top match in the public ontology.

In [None]:
bionty = lb.CellType.bionty()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    ontology_id = (
        bionty.search(name).iloc[0].ontology_id
    )  # search the public ontology and use the ontology id of the top match
    record = lb.CellType.from_bionty(
        ontology_id=ontology_id
    )  # create a record by loading the top match from bionty
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()  # save the record
    record.add_synonym(
        name
    )  # add the original name as a synonym, so that next time, we can just run .standardize()
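Taking the top search hit is a heuristic, so it can be worth reviewing the resulting mapping before trusting it. The sketch below illustrates the idea of a search-based mapper with the standard library's `difflib` as a stand-in; it is not how `bionty.search` ranks results, and `top_match`, the ontology excerpt, and the labels are made up for illustration.

```python
import difflib


def top_match(query, ontology_names):
    """Return the ontology name most similar to `query` (case-insensitive)."""
    by_lowercase = {name.lower(): name for name in ontology_names}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=1, cutoff=0.0
    )
    return by_lowercase[hits[0]] if hits else None


# made-up excerpt of ontology names and dataset labels, for illustration only
toy_ontology = ["dendritic cell", "monocyte", "B cell", "natural killer cell"]
toy_labels = ["Dendritic cells", "CD14+ Monocytes", "CD19+ B"]

toy_name_mapper = {label: top_match(label, toy_ontology) for label in toy_labels}
for original, matched in toy_name_mapper.items():
    print(f"{original!r} -> {matched!r}")  # review each pair before saving synonyms
```

Printing the pairs makes questionable matches visible early, before they are stored as synonyms and reused by later `.standardize()` calls.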

We can now standardize cell type names using the search-based mapper:

In [None]:
adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)

Now, all cell types are validated:

In [None]:
validated = lb.CellType.validate(adata_validated.obs.cell_type)
assert all(validated)

We don't want to store any of the other metadata columns:

In [None]:
adata_validated.obs = adata_validated.obs.drop(
    columns=["n_genes", "percent_mito", "louvain"]
)  # drop() returns a new DataFrame, so assign the result back

### Register ![](https://img.shields.io/badge/Register-10b981) 

In [None]:
modalities = ln.Modality.lookup()
experimental_factors = lb.ExperimentalFactor.lookup()
species = lb.Species.lookup()
features = ln.Feature.lookup()

In [None]:
file = ln.File.from_anndata(
    adata_validated,
    description="10x reference adata",
    field=lb.Gene.ensembl_gene_id,
    modality=modalities.rna,
)

We do not want to manage the remaining unvalidated terms in registries, so we can save the file directly:

In [None]:
file.save()

In [None]:
file.labels.add(adata_validated.obs.cell_type, features.cell_type)
file.labels.add(species.human, feature=features.species)
file.labels.add(experimental_factors.single_cell_rna_sequencing, feature=features.assay)

In [None]:
file.describe()

In [None]:
file.view_flow()

## Append a file to the growing dataset

In [None]:
import lamindb as ln

Query the previous dataset and the file we just created:

In [None]:
dataset_v1 = ln.Dataset.filter(name="My versioned scRNA-seq dataset", version="1").one()

new_file = ln.File.filter().order_by("-created_at").first()

Create a new version of the dataset by sharding it across the new file and the file in the previous version of the dataset:

In [None]:
dataset_v2 = ln.Dataset(
    [new_file, dataset_v1.file],
    is_new_version_of=dataset_v1,
)
dataset_v2.save()

# annotate the dataset
dataset_v2.labels.add_from(new_file)
dataset_v2.labels.add_from(dataset_v1)
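Conceptually, the new version references both files and accumulates the labels of its parts, while version 1 stays untouched. The toy model below sketches that idea in plain Python; `ToyDataset` and `new_version` are hypothetical names and do not reflect lamindb's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for a versioned dataset; not lamindb's data model."""

    name: str
    version: str
    files: tuple
    labels: frozenset = field(default_factory=frozenset)


def new_version(previous, new_files, new_labels):
    """Return a new version sharded across the new files and the previous ones."""
    return ToyDataset(
        name=previous.name,
        version=str(int(previous.version) + 1),
        files=tuple(new_files) + previous.files,
        labels=previous.labels | frozenset(new_labels),
    )


v1 = ToyDataset("My versioned scRNA-seq dataset", "1", ("file_a",), frozenset({"human"}))
v2 = new_version(v1, ["file_b"], {"human", "dendritic cell"})
print(v2.version, v2.files, sorted(v2.labels))
print(v1.version, v1.files, sorted(v1.labels))  # the previous version is unchanged
```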

Version 2 of the dataset now spans both files and carries the labels of each, so it covers more conditions than version 1:

In [None]:
dataset_v2.describe()

View the flow:

In [None]:
dataset_v2.view_flow()