# Curate `AnnData` based on the CELLxGENE schema

This guide shows how to curate an AnnData object with the help of [`laminlabs/cellxgene`](https://lamin.ai/laminlabs/cellxgene) against the [CELLxGENE schema v5.1.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.1.0/schema.md).

Initialize the instance in which you want to register the curated AnnData object:

In [None]:
# pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.0
!lamin init --storage ./test-cellxgene-curate --modules bionty

In [1]:
import lamindb as ln
import bionty as bt


def get_semi_curated_dataset():
    adata = ln.core.datasets.anndata_human_immune_cells()
    adata.obs["sex_ontology_term_id"] = "PATO:0000384"
    adata.obs["organism"] = "human"
    adata.obs["sex"] = "unknown"
    # create some typos in the metadata
    adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories({"lung": "lungg"})
    # new donor ids
    adata.obs["donor"] = adata.obs["donor"].astype(str) + "-1"
    # drop animal cell
    adata = adata[adata.obs["cell_type"] != "animal cell", :]
    # remove columns that are reserved in the cellxgene schema
    adata.var.drop(columns=["feature_reference", "feature_biotype"], inplace=True)
    adata.raw.var.drop(
        columns=["feature_name", "feature_reference", "feature_biotype"], inplace=True
    )
    return adata

→ connected lamindb: sunnyosun/test-cellxgene-curate


Let's start with an AnnData object that we'd like to inspect and curate.
We write it to disk so that we can run [CZI's cellxgene-schema CLI tool](https://github.com/chanzuckerberg/single-cell-curation), which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.

In [2]:
adata = get_semi_curated_dataset()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata

AnnData object with n_obs × n_vars = 1626 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay', 'sex_ontology_term_id', 'organism', 'sex'
    var: 'feature_is_filtered'
    uns: 'default_embedding'
    obsm: 'X_umap'

Initially, the dataset does not pass CZI's cellxgene-schema validator, so we need to curate it.

In [None]:
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1
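
The same check can also be driven from Python, which can be convenient in scripts or CI. The following is a minimal sketch and not part of lamindb or cellxgene-schema: the helper name `validate_h5ad` and its injectable `run` parameter are hypothetical, and it assumes `uvx` is available on the `PATH`. It mirrors the shell command above, including `MPLBACKEND=agg`, and treats a non-zero exit status as a failed validation.

```python
import os
import subprocess


def validate_h5ad(path, run=subprocess.run):
    """Run `uvx cellxgene-schema validate <path>`; return True on exit status 0.

    `run` is injectable (a hypothetical convenience) so the helper can be
    exercised without the CLI being installed.
    """
    cmd = ["uvx", "cellxgene-schema", "validate", path]
    # Same environment tweak as the shell cell: a non-interactive matplotlib backend.
    env = {**os.environ, "MPLBACKEND": "agg"}
    result = run(cmd, env=env, check=False)
    return result.returncode == 0


# Example call (needs uv and network access, so it is left commented out):
# ok = validate_h5ad("anndata_human_immune_cells.h5ad")
```

Returning a boolean instead of raising leaves it to the caller to decide whether a failed validation should stop the pipeline, as `|| exit 1` does in the shell cell.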

## Validate and curate metadata

We create a curator object (a `CellxGeneAnnDataCatManager`) that references the `AnnData` object.
During instantiation, any `lamindb.Feature` records are saved.

In [3]:
curator = ln.curators.CellxGeneAnnDataCatManager(
    adata, organism="human", schema_version="5.1.0"
)

In [4]:
validated = curator.validate()

✗ missing required obs columns development_stage, disease, donor_id, self_reported_ethnicity, suspension_type, tissue_type
• consider initializing a Curate object like 'Curate(adata, defaults=cxg.CellxGeneAnnDataCatManager._get_categoricals_defaults())'to automatically add these columns with default values.


Let's fix the "donor_id" column name:

In [7]:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)

For the missing columns, we can pass the default values suggested by CELLxGENE, which automatically adds them to the AnnData object:

In [8]:
ln.curators.CellxGeneAnnDataCatManager._get_categoricals_defaults()

{'cell_type': 'unknown',
 'development_stage': 'unknown',
 'disease': 'normal',
 'donor_id': 'unknown',
 'self_reported_ethnicity': 'unknown',
 'sex': 'unknown',
 'suspension_type': 'cell',
 'tissue_type': 'tissue'}

```{note}
CELLxGENE requires the columns `tissue`, `organism`, and `assay` to contain values that exist in the corresponding ontologies.
These columns therefore have no defaults and need to be added and populated manually.
```
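
To see which defaults would apply to this dataset, the following sketch compares the defaults dictionary printed above with the `obs` columns of the dataset as first loaded (before `donor` is renamed). It illustrates the logic on plain column names only and is not lamindb's actual implementation; the variable names are illustrative.

```python
# Default values as printed by `_get_categoricals_defaults()` above.
defaults = {
    "cell_type": "unknown",
    "development_stage": "unknown",
    "disease": "normal",
    "donor_id": "unknown",
    "self_reported_ethnicity": "unknown",
    "sex": "unknown",
    "suspension_type": "cell",
    "tissue_type": "tissue",
}

# obs columns of the dataset as first loaded (before renaming "donor").
obs_columns = [
    "donor",
    "tissue",
    "cell_type",
    "assay",
    "sex_ontology_term_id",
    "organism",
    "sex",
]

# Only columns that are absent would receive a default; existing ones are untouched.
to_add = {col: value for col, value in defaults.items() if col not in obs_columns}

for col, value in to_add.items():
    print(f"{col!r} would be filled with {value!r}")
```

The resulting columns are the same six that the validator reported as missing.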

In [5]:
curator = ln.curators.CellxGeneAnnDataCatManager(
    adata,
    defaults=ln.curators.CellxGeneAnnDataCatManager._get_categoricals_defaults(),
    organism="human",
    schema_version="5.1.0",
)

→ added default value 'unknown' to the adata.obs['development_stage']
→ added default value 'normal' to the adata.obs['disease']
→ added default value 'unknown' to the adata.obs['donor_id']
→ added default value 'unknown' to the adata.obs['self_reported_ethnicity']
→ added default value 'cell' to the adata.obs['suspension_type']
→ added default value 'tissue' to the adata.obs['tissue_type']


In [6]:
validated = curator.validate()
validated

• mapping "var_index" on Gene.ensembl_gene_id
!   113 terms are not validated: 'ENSG00000269933', 'ENSG00000261737', 'ENSG00000259834', 'ENSG00000256374', 'ENSG00000263464', 'ENSG00000203812', 'ENSG00000272196', 'ENSG00000272880', 'ENSG00000270188', 'ENSG00000287116', 'ENSG00000237133', 'ENSG00000224739', 'ENSG00000227902', 'ENSG00000239467', 'ENSG00000272551', 'ENSG00000280374', 'ENSG00000236886', 'ENSG00000229352', 'ENSG00000286601', 'ENSG00000227021', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex_ontolog

False

## Remove unvalidated values

We remove all unvalidated genes.
These genes may exist in a different Ensembl release but are not valid for the Ensembl release used by CELLxGENE schema 5.1.0 (Ensembl release 110).

In [7]:
curator.non_validated

{'tissue': ['lungg'],
 'var_index': ['ENSG00000269933',
  'ENSG00000261737',
  'ENSG00000259834',
  'ENSG00000256374',
  'ENSG00000263464',
  'ENSG00000203812',
  'ENSG00000272196',
  'ENSG00000272880',
  'ENSG00000270188',
  'ENSG00000287116',
  'ENSG00000237133',
  'ENSG00000224739',
  'ENSG00000227902',
  'ENSG00000239467',
  'ENSG00000272551',
  'ENSG00000280374',
  'ENSG00000236886',
  'ENSG00000229352',
  'ENSG00000286601',
  'ENSG00000227021',
  'ENSG00000259855',
  'ENSG00000273301',
  'ENSG00000271870',
  'ENSG00000237838',
  'ENSG00000286996',
  'ENSG00000269028',
  'ENSG00000286699',
  'ENSG00000273370',
  'ENSG00000261490',
  'ENSG00000272567',
  'ENSG00000270394',
  'ENSG00000272370',
  'ENSG00000272354',
  'ENSG00000251044',
  'ENSG00000272040',
  'ENSG00000182230',
  'ENSG00000204092',
  'ENSG00000261068',
  'ENSG00000236740',
  'ENSG00000236996',
  'ENSG00000232295',
  'ENSG00000271734',
  'ENSG00000236673',
  'ENSG00000227220',
  'ENSG00000236166',
  'ENSG00000112096',
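
As the output above shows, `non_validated` is a plain dictionary that maps each field to its list of unvalidated terms, so it can be summarized before deciding what to remove or fix. The sketch below uses a shortened stand-in dictionary (only the first three gene IDs from the output) and a hypothetical helper `summarize`. In the notebook you would pass `curator.non_validated` instead.

```python
# A shortened stand-in for `curator.non_validated` (first gene IDs only).
non_validated = {
    "tissue": ["lungg"],
    "var_index": ["ENSG00000269933", "ENSG00000261737", "ENSG00000259834"],
}


def summarize(non_validated):
    """Count the unvalidated terms per field."""
    return {field: len(terms) for field, terms in non_validated.items()}


summary = summarize(non_validated)
for field, count in summary.items():
    print(f"{field}: {count} unvalidated term(s)")
```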

In [45]:
adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data

In [46]:
curator = ln.curators.CellxGeneAnnDataCatManager(
    adata, organism="human", schema_version="5.1.0"
)

## Register new metadata labels

Next, we follow the suggestions above and register genes and labels that aren't yet present in the current instance.

(Note that our instance is rather empty. Once you have filled up the registries, you will rarely need to register new labels.)

The tissue label "lungg" is reported as not validated; it is a typo of "lung". Let's fix it:

In [47]:
tissues = curator.lookup(public=True).tissue
tissues.lung

Tissue(ontology_id='UBERON:0002048', name='lung', definition='Respiration Organ That Develops As An Outpocketing Of The Esophagus.', synonyms='pulmo', parents=array(['UBERON:0015212', 'UBERON:0005178', 'UBERON:0000171',
       'UBERON:0004119'], dtype=object))

In [48]:
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
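
When a dataset contains many misspelled categories, candidate corrections can be suggested programmatically rather than looked up one by one. The following is a minimal sketch using Python's standard-library `difflib`, not a lamindb feature: the helper `suggest` and the `cutoff` of 0.8 are illustrative choices, and the candidate list is limited to the tissue names that appear in this dataset.

```python
import difflib

# Tissue names that occur in this dataset; in practice this could be the
# list of names from the public ontology lookup.
known_tissues = [
    "blood",
    "lung",
    "spleen",
    "thoracic lymph node",
    "mesenteric lymph node",
]


def suggest(term, candidates, cutoff=0.8):
    """Return the closest candidate for `term`, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest("lungg", known_tissues))  # prints "lung"
```

A suggestion found this way should still be reviewed before renaming, since string similarity alone says nothing about biological correctness.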

Let's validate the object again:

In [49]:
validated = curator.validate()
validated

✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against ULabel.name
✓ "tissue" is validated against Tissue.name
✓ "tissue_type" is validated against ULabel.name
✓ "organism" is validated against Organism.name


True

In [50]:
adata.obs.head()

| | donor_id | tissue | cell_type | assay | sex_ontology_term_id | organism | sex | development_stage | disease | self_reported_ethnicity | suspension_type | tissue_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CZINY-0109_CTGGTCTAGTCTGTAC | D496-1 | blood | classical monocyte | 10x 3' v3 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT | 621B-1 | thoracic lymph node | T follicular helper cell | 10x 5' v2 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935491_CTGGTCTGTACATGTC | A29-1 | spleen | memory B cell | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7980367_GGGCATCCAGGTGGAT | A36-1 | lung | alveolar macrophage | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935494_ATCATGGTCTACCTGC | A29-1 | mesenteric lymph node | naive thymus-derived CD4-positive, alpha-beta ... | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |


## Save artifact

In [None]:
artifact = curator.save_artifact(
    key=f"my_datasets/dataset-curated-against-cellxgene-schema-{curator.schema_version}"
)

In [None]:
artifact.describe()

## Return an input h5ad file for cellxgene-schema

In [None]:
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg

In [None]:
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")

In [None]:
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1

```{note}

The curator class is designed to validate all metadata for adherence to ontologies.
It does not reimplement all rules of the CELLxGENE schema, so we recommend running the [cellxgene-schema](https://github.com/chanzuckerberg/single-cell-curation) validator if full adherence beyond metadata is required.
```