# Curate `AnnData` based on the CELLxGENE schema

This guide shows how to curate an AnnData object with the help of [`laminlabs/cellxgene`](https://lamin.ai/laminlabs/cellxgene) against the [CELLxGENE schema v5.3.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.3.0/schema.md).

Initialize or load the instance in which you want to register the curated AnnData object:

In [None]:
# pip install 'lamindb[bionty,jupyter]' pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.3.2
!lamin init --storage ./test-cellxgene-curate --modules bionty

In [None]:
import lamindb as ln
import bionty as bt

ln.settings.verbosity = "error"

Let's start with an AnnData object that we'd like to inspect and curate.
We write it to disk so that we can run [CZI's cellxgene-schema CLI tool](https://github.com/chanzuckerberg/single-cell-curation), which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.

In [None]:
adata = ln.core.datasets.small_dataset3_cellxgene(with_obs_defaults=True)
adata.write_h5ad("small_cxg.h5ad")
adata

Initially, CZI's cellxgene-schema validator fails on this dataset, so we need to curate it.

In [None]:
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad

## Set up instance

CELLxGENE supports several default values, such as "normal" for Mondo Disease, which we need to save to our instance:

In [None]:
ln.examples.cellxgene.save_cxg_defaults()

## Validate and curate metadata

CELLxGENE requires ontology-based `obs` metadata to be stored as ontology IDs in `<column>_ontology_term_id` columns.
Therefore, we first translate the name-based `obs` columns into the required format:

In [None]:
adata.obs

In [None]:
standardization_map = {
    "organism": (bt.Organism, "organism_ontology_term_id"),
    "assay": (bt.ExperimentalFactor, "assay_ontology_term_id"),
    "tissue": (bt.Tissue, "tissue_ontology_term_id"),
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}

for col, (bt_class, new_col) in standardization_map.items():
    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id"
    )

adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))
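
Conceptually, the standardization in the cell above is a lookup from names to ontology IDs. The following is a minimal sketch of that idea in plain Python, not bionty's implementation: `NAME_TO_ONTOLOGY_ID` and `standardize_names` are hypothetical names, and the small hand-written dictionary stands in for the public ontologies that bionty queries. It assumes that values without a match are passed through unchanged, which is consistent with the misspelled `lungg` surviving until the manual fix later in this guide.

```python
# Hypothetical lookup table; bionty resolves names against full ontologies instead.
NAME_TO_ONTOLOGY_ID = {
    "lung": "UBERON:0002048",
    "T cell": "CL:0000084",
}


def standardize_names(values, lookup):
    """Map names to ontology IDs, leaving values without a match unchanged."""
    return [lookup.get(value, value) for value in values]


tissues = ["lung", "lungg"]
standardized = standardize_names(tissues, NAME_TO_ONTOLOGY_ID)
print(standardized)
```

Because unmatched values are kept rather than dropped, they remain visible to the validation step, which can then report them.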

In [None]:
schema = ln.examples.cellxgene.get_cxg_schema("5.3.0", field_types="ontology_id")

In [None]:
curator = ln.curators.AnnDataCurator(adata, schema)

In [None]:
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)

Features whose `var` index could not be validated are listed in `curator.slots["var"].cat.non_validated["index"]`. We drop them from `adata` and, if present, from `adata.raw`:

In [None]:
adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data
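
The filtering in the cell above can be pictured with plain lists. This is a minimal sketch under stated assumptions: `drop_non_validated` is a hypothetical helper, and the identifiers are illustrative placeholders standing in for `adata.var.index`, `adata.raw.var.index`, and the curator's non-validated list. It shows the same set of flagged identifiers being removed from both the main and the raw feature lists.

```python
def drop_non_validated(features, non_validated):
    """Return the features that are not flagged as non-validated, preserving order."""
    flagged = set(non_validated)
    return [feature for feature in features if feature not in flagged]


# Illustrative stand-ins for adata.var.index and adata.raw.var.index
var_index = ["ENSG00000153563", "ENSG00000010610", "invalid_ensembl_id"]
raw_var_index = var_index + ["ENSG00000170458"]
non_validated = ["invalid_ensembl_id"]

filtered_var = drop_non_validated(var_index, non_validated)
filtered_raw = drop_non_validated(raw_var_index, non_validated)
print(filtered_var)
print(filtered_raw)
```

Applying one flagged set to both lists means an identifier that fails validation cannot remain in the raw layer after it has been removed from the main matrix.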

In [None]:
curator = ln.curators.AnnDataCurator(adata, schema)

In [None]:
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)

In [None]:
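# "lungg" is a misspelled tissue name that could not be mapped to an ontology ID; fix the typo
adata.obs["tissue_ontology_term_id"] = adata.obs["tissue_ontology_term_id"].replace(
    {"lungg": "lung"}
)
# Standardize again so that the corrected name is translated into its ontology ID
adata.obs["tissue_ontology_term_id"] = bt.Tissue.standardize(
    adata.obs["tissue_ontology_term_id"], field="name", return_field="ontology_id"
)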
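
When a value such as `lungg` fails validation, a fuzzy comparison against known names can suggest the intended term. The following is a minimal sketch using Python's standard `difflib` module. `suggest_fix` is a hypothetical helper, the `valid_tissues` list is an illustrative stand-in for the names in the tissue ontology, and the `cutoff` of 0.8 is an arbitrary choice rather than a recommended value.

```python
import difflib


def suggest_fix(term, valid_names, cutoff=0.8):
    """Return the closest valid name for a term, or None if nothing is close enough."""
    matches = difflib.get_close_matches(term, valid_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Illustrative stand-in for the names available in the tissue ontology
valid_tissues = ["lung", "liver", "heart"]
suggestion = suggest_fix("lungg", valid_tissues)
print(suggestion)
```

A suggestion found this way should still be reviewed by hand before it is written back to `adata.obs`, since close spelling does not guarantee the same biological meaning.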

In [None]:
curator.validate()

## Save artifact

In [None]:
artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")

In [None]:
artifact.describe()

## Validate using cellxgene-schema

In [None]:
adata.write("small_cxg_curated.h5ad")

In [None]:
%%bash -e
MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad

```{note}

The `AnnDataCurator` class is designed to validate all metadata for adherence to ontologies.
It does not reimplement all rules of the CELLxGENE schema, so we recommend running the [cellxgene-schema](https://github.com/chanzuckerberg/single-cell-curation) validator if full adherence beyond metadata is required.
```