# Curate `AnnData` based on the CELLxGENE schema

This guide shows how to curate an AnnData object with the help of [`laminlabs/cellxgene`](https://lamin.ai/laminlabs/cellxgene) against the [CELLxGENE schema v5.1.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.1.0/schema.md).

Load your instance where you want to register the curated AnnData object:

In [None]:
# pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.0
!lamin init --storage ./test-cellxgene-curate --modules bionty

In [None]:
import lamindb as ln


def get_semi_curated_dataset():
    adata = ln.core.datasets.anndata_human_immune_cells()
    adata.obs["sex_ontology_term_id"] = "PATO:0000384"
    adata.obs["organism"] = "human"
    adata.obs["sex"] = "unknown"
    # create some typos in the metadata
    adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories({"lung": "lungg"})
    # new donor ids
    adata.obs["donor"] = adata.obs["donor"].astype(str) + "-1"
    # drop animal cell
    adata = adata[adata.obs["cell_type"] != "animal cell", :]
    # remove columns that are reserved in the cellxgene schema
    adata.var.drop(columns=["feature_reference", "feature_biotype"], inplace=True)
    adata.raw.var.drop(
        columns=["feature_name", "feature_reference", "feature_biotype"], inplace=True
    )
    return adata

Let's start with an AnnData object that we'd like to inspect and curate.
We write it to disk so that we can run [CZI's cellxgene-schema CLI tool](https://github.com/chanzuckerberg/single-cell-curation), which verifies whether an on-disk h5ad dataset adheres to the cellxgene schema.

In [None]:
adata = get_semi_curated_dataset()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata

Initially, the dataset does not pass CZI's cellxgene-schema validator, so we need to curate it.

In [None]:
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1

## Validate and curate metadata

We create a `CellxGeneAnnDataCatCurator` object that references the `AnnData` object.
During instantiation, any {class}`~lamindb.Feature` records are saved.

In [None]:
curator = ln.curators.CellxGeneAnnDataCatCurator(
    adata, organism="human", schema_version="5.1.0"
)

The donor column is called "donor" but should be called "donor_id". Let's rename it:

In [None]:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)
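
The effect of a rename like this can be checked before re-running validation. The following is a minimal sketch using plain Python; `EXPECTED_OBS_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the column set is an illustrative subset, not the schema's actual list of required columns.

```python
# Illustrative subset only; NOT the full list of columns the CELLxGENE schema requires.
EXPECTED_OBS_COLUMNS = {"donor_id", "tissue", "cell_type", "assay"}


def missing_columns(present, expected=EXPECTED_OBS_COLUMNS):
    """Return the expected column names that are absent from `present`, sorted."""
    return sorted(set(expected) - set(present))


# Toy stand-in for `adata.obs.columns` before the rename.
obs_columns = ["donor", "tissue", "cell_type", "assay"]
before = missing_columns(obs_columns)

# Apply the same mapping as the rename: "donor" becomes "donor_id".
rename_map = {"donor": "donor_id"}
renamed_columns = [rename_map.get(name, name) for name in obs_columns]
after = missing_columns(renamed_columns)

print("missing before rename:", before)
print("missing after rename:", after)
```

With a real object, `list(adata.obs.columns)` would take the place of the toy list.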

In [None]:
validated = curator.validate()

For the missing columns, we can pass the default values suggested by CELLxGENE, which automatically adds these columns to the AnnData object:

In [None]:
ln.curators.CellxGeneAnnDataCatCurator._get_categoricals_defaults()

```{note}
CELLxGENE requires columns `tissue`, `organism`, and `assay` to have existing values from the ontologies.
Therefore, these columns need to be added and populated manually.
```
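
Conceptually, passing defaults amounts to adding each missing column with a constant value while leaving existing columns untouched. The sketch below illustrates that idea with pandas only. It is not the curator's implementation: `add_default_columns` and the contents of `hypothetical_defaults` are assumptions made for this example.

```python
import pandas as pd


def add_default_columns(obs, defaults):
    """Return a copy of `obs` with each missing column filled by its default value."""
    out = obs.copy()
    for column, value in defaults.items():
        if column not in out.columns:
            # Only columns that do not exist yet are added; existing data is kept.
            out[column] = value
    return out


# Toy stand-in for `adata.obs`.
obs = pd.DataFrame({"donor_id": ["D1-1", "D2-1"], "tissue": ["lung", "blood"]})

# Hypothetical defaults; the real mapping comes from the curator class.
hypothetical_defaults = {"disease": "normal", "tissue": "unknown"}

filled = add_default_columns(obs, hypothetical_defaults)
print(filled)
```

Because `tissue` already exists in the toy frame, its values are preserved. Per the note above, a missing `tissue`, `organism`, or `assay` column is not something defaults can supply.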

In [None]:
curator = ln.curators.CellxGeneAnnDataCatCurator(
    adata,
    defaults=ln.curators.CellxGeneAnnDataCatCurator._get_categoricals_defaults(),
    organism="human",
    schema_version="5.1.0",
)

In [None]:
validated = curator.validate()
validated

## Remove unvalidated values

We remove all unvalidated genes.
These genes may exist in a different Ensembl release but are not valid for the Ensembl version used by cellxgene schema 5.1.0 (Ensembl release 110).

In [None]:
curator.non_validated

In [None]:
# adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
# if adata.raw is not None:
#     raw_data = adata.raw.to_adata()
#     raw_data = raw_data[
#         :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
#     ].copy()
#     adata.raw = raw_data
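
The filtering step above relies on a boolean mask built with `Index.isin`. Here is a minimal sketch of that masking logic on a toy `var` table, assuming the non-validated identifiers are available as a plain list; the identifiers and the `non_validated_ids` name are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `adata.var`, indexed by gene identifier.
var = pd.DataFrame(
    {"feature_is_filtered": [False, False, False]},
    index=["ENSG00000000003", "ENSG_FAKE_0001", "ENSG00000000419"],
)

# Hypothetical list standing in for `curator.non_validated["var_index"]`.
non_validated_ids = ["ENSG_FAKE_0001"]

# True for genes to keep, False for genes that failed validation.
keep = ~var.index.isin(non_validated_ids)
filtered = var[keep]

print(filtered.index.tolist())
```

On an `AnnData` object the same mask selects columns (`adata[:, keep]`), as in the commented cell above.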

In [None]:
curator = ln.curators.CellxGeneAnnDataCatCurator(
    adata, organism="human", schema_version="5.1.0"
)

## Register new metadata labels

Following the suggestions above, we register the genes and labels that aren't yet present in the current instance.

(Note that our instance is rather empty. Once you have filled up the registries, you will rarely need to register new labels.)

For donors, we register the new labels:

In [None]:
curator.add_new_from("donor_id")

An error is shown for the tissue label "lungg", which is a typo for "lung". Let's fix it:

In [None]:
tissues = curator.lookup().tissue
tissues.lung

In [None]:
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
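
Here the typo was spotted and corrected by hand. With many non-validated labels, a fuzzy match against the valid names can suggest candidates to review. The following is a minimal sketch using the standard library's `difflib`. It is not part of the curator API: `suggest_label`, the cutoff, and the toy vocabulary are assumptions for illustration.

```python
import difflib


def suggest_label(value, valid_labels, cutoff=0.8):
    """Return the closest valid label for `value`, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(value, valid_labels, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy vocabulary; in practice the valid names would come from the tissue registry.
valid_tissues = ["lung", "blood", "liver", "spleen"]

suggestion_for_typo = suggest_label("lungg", valid_tissues)
suggestion_for_unknown = suggest_label("kidney", valid_tissues)

print(suggestion_for_typo)
print(suggestion_for_unknown)
```

A suggestion is only a hint: confirm it against the ontology before renaming categories, since similar spellings can refer to different terms.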

Let's validate the object again:

In [None]:
validated = curator.validate()
validated

In [None]:
adata.obs.head()

## Save artifact

In [None]:
artifact = curator.save_artifact(
    description=f"dataset curated against cellxgene schema {curator.schema_version}"
)

In [None]:
artifact.describe()

## Return an input h5ad file for cellxgene-schema

In [None]:
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg

In [None]:
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")

In [None]:
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1

```{note}

The curator class is designed to validate all metadata for adherence to ontologies.
It does not reimplement all rules of the cellxgene schema. We therefore recommend running [cellxgene-schema](https://github.com/chanzuckerberg/single-cell-curation) if full adherence beyond metadata is required.
```