# Curate `AnnData` based on the CELLxGENE schema

This guide shows how to curate an AnnData object with the help of [`laminlabs/cellxgene`](https://lamin.ai/laminlabs/cellxgene) against the [CELLxGENE schema v5.2.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md).

In [None]:
# pip install 'lamindb[bionty,jupyter]' pronto
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.2.3

!lamin init --storage ./test-cellxgene-curate --modules bionty

In [None]:
import lamindb as ln
import bionty as bt

ln.track()

## The CELLxGENE schema

As a first step, we generate the CELLxGENE schema for the chosen version, which also adds missing sources to the instance:

In [None]:
cxg_schema = ln.examples.cellxgene.get_cxg_schema("5.2.0")

In [None]:
cxg_schema.describe()

The schema has two components, one for the `var` slot and one for the `obs` slot:

In [None]:
cxg_schema.slots["var"].describe()

In [None]:
cxg_schema.slots["obs"].describe()

In the following, we will validate a dataset against the CELLxGENE schema and curate it.

## Validate and curate metadata

Let's start with an AnnData object that we would like to curate.
We write it to disk so that we can run [CZI's cellxgene-schema CLI tool](https://github.com/chanzuckerberg/single-cell-curation), which verifies whether an on-disk h5ad dataset adheres to all requirements of CELLxGENE, including the CELLxGENE schema.

In [None]:
adata = ln.core.datasets.small_dataset3_cellxgene(with_obs_typo=True)
adata.write_h5ad("small_cxg.h5ad")
adata

Initially, CZI's cellxgene-schema validator does not pass, so we need to curate the dataset.

In [None]:
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg.h5ad

CELLxGENE requires all observations to be annotated.
If information for a specific column like `disease_ontology_term_id` is not available, CELLxGENE requires falling back to default values like "normal" or "unknown".
Let's save these defaults to the instance using {func}`lamindb.examples.cellxgene.save_cxg_defaults`:

In [None]:
ln.examples.cellxgene.save_cxg_defaults()
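How such defaults might be applied to an `obs` table can be sketched with plain pandas. This is a minimal illustration that is independent of lamindb: the column names, the `CXG_DEFAULTS` mapping, and the assignment of "normal" to disease and "unknown" to sex are assumptions made for the example, not values read from the schema.

```python
import pandas as pd

# Illustrative defaults (an assumption for this sketch, not read from lamindb
# or from the CELLxGENE schema files)
CXG_DEFAULTS = {
    "disease": "normal",
    "sex": "unknown",
}

# A toy `obs` table with missing annotations
obs = pd.DataFrame(
    {
        "disease": ["normal", None, None],
        "sex": ["female", None, "male"],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Fill missing values column by column with the matching default.
# For categorical columns, the default would first have to be added as a
# category (e.g. via `.cat.add_categories`), otherwise pandas raises an error.
obs_filled = obs.fillna(value=CXG_DEFAULTS)
print(obs_filled)
```

After this step, no observation is left without an annotation in the columns covered by the mapping.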

Now we can start curating the dataset:

In [None]:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)

The error shows that the dataset contains genes that could not be validated.
Let's remove them from both the `adata` and `adata.raw` objects:

In [None]:
adata = adata[
    :, ~adata.var.index.isin(curator.slots["var"].cat.non_validated["index"])
].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var.index.isin(curator.slots["var"].cat.non_validated["index"])
    ].copy()
    adata.raw = raw_data
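The masking logic used above can be checked in isolation on a toy table. The following is a minimal sketch in plain pandas and NumPy, assuming a cells-by-genes count table; the identifier `invalid_ensembl_id` and the `non_validated` list are stand-ins for whatever the curator reports.

```python
import numpy as np
import pandas as pd

# Toy gene index with one identifier that would fail validation (stand-in value)
gene_ids = pd.Index(["ENSG00000000003", "ENSG00000000005", "invalid_ensembl_id"])
counts = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["cell_1", "cell_2"],
    columns=gene_ids,
)

# Stand-in for the list of non-validated identifiers reported by the curator
non_validated = ["invalid_ensembl_id"]

# Boolean mask that is True for every gene we want to keep
keep = ~gene_ids.isin(non_validated)
filtered = counts.loc[:, keep]
print(filtered.columns.tolist())
```

The same mask is applied along the gene axis of both the main matrix and the raw matrix so that the two stay consistent.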

Because we subset the AnnData object, we have to recreate the `AnnDataCurator` to validate again:

In [None]:
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)

The validation error tells us that we're missing several columns.
The reason is simple:
CELLxGENE requires all `obs` metadata to be stored as ontology IDs in `entity_ontology_term_id` columns.
Therefore, we first translate the name-based `obs` columns into the required format.

In [None]:
adata.obs

In [None]:
# Add missing assay column
adata.obs["assay_ontology_term_id"] = "EFO:0005684"
# Add `entity_ontology_term_id` columns by translating names to ontology IDs
standardization_map = {
    "organism": (bt.Organism, "organism_ontology_term_id"),
    "self_reported_ethnicity": (
        bt.Ethnicity,
        "self_reported_ethnicity_ontology_term_id",
    ),
    "cell_type": (bt.CellType, "cell_type_ontology_term_id"),
}

for col, (bt_class, new_col) in standardization_map.items():
    adata.obs[new_col] = bt_class.standardize(
        adata.obs[col], field="name", return_field="ontology_id"
    )
# Drop the name columns because CELLxGENE disallows them
adata.obs = adata.obs.drop(columns=list(standardization_map.keys()))
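The resulting column layout can be illustrated without a database. This is a minimal sketch that replaces the ontology lookup with a small hand-written dictionary; `NAME_TO_ID` is a hypothetical stand-in and covers only the three names used in the toy table.

```python
import pandas as pd

# Hand-written, illustrative lookup (a stand-in for the ontology-backed
# standardization used in the guide)
NAME_TO_ID = {
    "cell_type": {"T cell": "CL:0000084", "B cell": "CL:0000236"},
    "organism": {"human": "NCBITaxon:9606"},
}

obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "B cell", "T cell"],
        "organism": ["human", "human", "human"],
    }
)

# Add one `<entity>_ontology_term_id` column per name-based column
for col, lookup in NAME_TO_ID.items():
    obs[f"{col}_ontology_term_id"] = obs[col].map(lookup)

# Drop the name-based columns so that only the ID columns remain
obs = obs.drop(columns=list(NAME_TO_ID))
print(obs)
```

Names that are missing from the lookup end up as missing values, which makes unmapped labels easy to spot before validating again.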

In [None]:
# Recreate the curator because we dropped `obs` columns
curator = ln.curators.AnnDataCurator(adata, cxg_schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    print(e)

An error is shown for the tissue label `UBERON:0002048XXX` because it contains a few extra `X` characters, which is a typo.
Let's fix it:

In [None]:
adata.obs["tissue_ontology_term_id"] = adata.obs["tissue_ontology_term_id"].replace(
    {"UBERON:0002048XXX": "UBERON:0002048"}
)
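A simple format check can surface typos of this kind before running the full validation. The sketch below assumes that ontology IDs follow a `PREFIX:digits` pattern, which is a simplification; the helper `find_malformed_ids` and the `allowed_literals` parameter are hypothetical and not part of lamindb.

```python
import re

# Simplified pattern: an alphabetic prefix, a colon, then digits only
ONTOLOGY_ID_PATTERN = re.compile(r"[A-Za-z_]+:\d+")


def find_malformed_ids(values, allowed_literals=("unknown", "na")):
    """Return the sorted unique values that do not look like ontology IDs."""
    return sorted(
        {
            value
            for value in values
            if value not in allowed_literals
            and not ONTOLOGY_ID_PATTERN.fullmatch(value)
        }
    )


tissue_ids = ["UBERON:0002048", "UBERON:0002048XXX", "UBERON:0000178", "unknown"]
malformed = find_malformed_ids(tissue_ids)
print(malformed)
```

A check like this only looks at the shape of the identifier. Whether an ID actually exists in the ontology is still decided by the curator.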

Now `validate` should pass.

In [None]:
curator.validate()

## Save artifact

We can now save the curated artifact:

In [None]:
artifact = curator.save_artifact(key="examples/dataset-curated-against-cxg.h5ad")

In [None]:
artifact.describe()

## Validating using cellxgene-schema

To validate the now-curated AnnData object using [CZI's cellxgene-schema CLI tool](https://github.com/chanzuckerberg/single-cell-curation), we need to write the AnnData object to disk.

In [None]:
adata.write("small_cxg_curated.h5ad")

In [None]:
# %%bash -e
!MPLBACKEND=agg uvx cellxgene-schema validate small_cxg_curated.h5ad
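Outside of a notebook, the same command can be launched from a Python script. The following is a minimal sketch, assuming `uvx` is on the `PATH` and that the CLI signals a failed validation through a non-zero exit code; the helpers `build_validate_command` and `run_validator` are hypothetical names introduced for this example.

```python
import os
import subprocess


def build_validate_command(h5ad_path):
    """Build the same command that the notebook cell runs."""
    return ["uvx", "cellxgene-schema", "validate", h5ad_path]


def run_validator(h5ad_path):
    """Run the validator and return (passed, combined output).

    Assumes a zero exit code means the file passed validation.
    """
    env = {**os.environ, "MPLBACKEND": "agg"}
    result = subprocess.run(
        build_validate_command(h5ad_path),
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


# Only the command is built here; call `run_validator(...)` to execute it
command = build_validate_command("small_cxg_curated.h5ad")
print(" ".join(command))
```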

```{note}

The `AnnDataCurator` class is designed to validate all metadata for adherence to ontologies.
It does not reimplement all rules of the CELLxGENE schema. We therefore recommend running the [cellxgene-schema](https://github.com/chanzuckerberg/single-cell-curation) CLI tool if full adherence beyond metadata is required.
```