# Curate `AnnData` based on the CELLxGENE schema

This guide shows how to curate an AnnData object with the help of [`laminlabs/cellxgene`](https://lamin.ai/laminlabs/cellxgene) against the [CELLxGENE schema v5.1.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.1.0/schema.md).

Initialize or load the LaminDB instance in which you want to register the curated AnnData object:

In [1]:
# !pip install 'lamindb[bionty,jupyter]' cellxgene-lamin
# cellxgene-schema has pinned dependencies. Therefore we recommend installing it into a separate environment using `uv` or `pipx`
# uv tool install cellxgene-schema==5.1.0
!lamin init --storage ./test-cellxgene-curate --name test-cellxgene-curate --schema bionty

→ connected lamindb: zethson/test-cellxgene-curate


In [4]:
import lamindb as ln
import cellxgene_lamin as cxg

→ connected lamindb: zethson/test-cellxgene-curate


Let's start with an AnnData object that we'd like to inspect and curate.
We write it to disk so that we can run [CZI's cellxgene-schema CLI tool](https://github.com/chanzuckerberg/single-cell-curation), which verifies whether an on-disk h5ad dataset adheres to the CELLxGENE schema.

In [5]:
adata = cxg.datasets.anndata_human_immune_cells()
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata

AnnData object with n_obs × n_vars = 1626 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay', 'sex_ontology_term_id', 'organism', 'sex'
    var: 'feature_is_filtered'
    uns: 'default_embedding'
    obsm: 'X_umap'

Initially, the dataset does not pass CZI's cellxgene-schema validator, so we need to curate it.

In [6]:
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells.h5ad || exit 1

Loading dependencies
Loading validator modules
Starting validation...
Unable to open 'anndata_human_immune_cells.h5ad' with AnnData
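
The same check can be scripted from Python, which is convenient when several files need validating. The following is a minimal sketch, assuming `uvx` is available on the `PATH` and that the validator signals failure through a non-zero exit code (as the `|| exit 1` in the shell call suggests). The helper names `build_validate_command` and `run_validator` are hypothetical and not part of cellxgene-schema or cellxgene-lamin.

```python
import os
import subprocess


def build_validate_command(h5ad_path: str) -> list[str]:
    """Build the same CLI call as the shell cell above (hypothetical helper)."""
    return ["uvx", "cellxgene-schema", "validate", h5ad_path]


def run_validator(h5ad_path: str) -> bool:
    """Run the validator and return True if it exits with code 0."""
    # Same matplotlib backend override as in the shell call.
    env = {**os.environ, "MPLBACKEND": "agg"}
    result = subprocess.run(
        build_validate_command(h5ad_path),
        env=env,
        capture_output=True,
        text=True,
        check=False,  # a failed validation is an expected outcome, not an exception
    )
    print(result.stdout or result.stderr)
    return result.returncode == 0
```

Calling `run_validator("anndata_human_immune_cells.h5ad")` would then return `False` for a dataset that fails validation; if `uvx` is not installed, `subprocess.run` raises `FileNotFoundError` instead.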


## Validate and curate metadata

We create a `Curator` object that references the `AnnData` object.
During instantiation, any :class:`~lamindb.Feature` records are saved.

In [5]:
curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")

! source schema has additional modules: {'ourprojects'}
consider mounting these schema modules to transfer all metadata


The dataset stores donors in a column named `donor`, whereas the schema expects `donor_id`. Let's rename the column:

In [6]:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)

In [7]:
validated = curator.validate()

✗ missing required obs columns development_stage, disease, self_reported_ethnicity, suspension_type, tissue_type
• consider initializing a Curate object like 'Curate(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS)'to automatically add these columns with default values.


For the missing columns, we can pass the default values suggested by CELLxGENE, which are then added to the AnnData object automatically:

In [8]:
cxg.CellxGeneFields.OBS_FIELD_DEFAULTS

{'cell_type': 'unknown',
 'development_stage': 'unknown',
 'disease': 'normal',
 'donor_id': 'unknown',
 'self_reported_ethnicity': 'unknown',
 'sex': 'unknown',
 'suspension_type': 'cell',
 'tissue_type': 'tissue'}

```{note}
CELLxGENE requires columns `tissue`, `organism`, and `assay` to have existing values from the ontologies.
Therefore, these columns need to be added and populated manually.
```
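
To make the effect of `defaults` concrete, here is a minimal sketch in plain Python, assuming (as the column list reported by `curator.validate()` suggests) that a default is only added when the column is absent. The `apply_defaults` helper is hypothetical and a dict of column lists stands in for `adata.obs`; this is not the cellxgene-lamin implementation.

```python
# Defaults copied from `cxg.CellxGeneFields.OBS_FIELD_DEFAULTS` as printed above.
OBS_FIELD_DEFAULTS = {
    "cell_type": "unknown",
    "development_stage": "unknown",
    "disease": "normal",
    "donor_id": "unknown",
    "self_reported_ethnicity": "unknown",
    "sex": "unknown",
    "suspension_type": "cell",
    "tissue_type": "tissue",
}


def apply_defaults(obs: dict, defaults: dict, n_obs: int) -> list:
    """Add each default as a constant column if the column is missing.

    Hypothetical helper: returns the names of the columns that were added.
    """
    added = []
    for column, value in defaults.items():
        if column not in obs:
            obs[column] = [value] * n_obs
            added.append(column)
    return added


# Toy stand-in for `adata.obs`: two cells with the columns of the example dataset.
obs = {
    "donor_id": ["D496-1", "A29-1"],
    "tissue": ["blood", "spleen"],
    "cell_type": ["classical monocyte", "memory B cell"],
    "assay": ["10x 3' v3", "10x 5' v1"],
    "sex_ontology_term_id": ["PATO:0000384", "PATO:0000384"],
    "organism": ["human", "human"],
    "sex": ["unknown", "unknown"],
}

added_columns = apply_defaults(obs, OBS_FIELD_DEFAULTS, n_obs=2)
print(added_columns)
```

In this sketch, existing columns such as `donor_id` are left untouched, and only the five columns that `curator.validate()` reported as missing are added.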

In [9]:
curator = cxg.Curator(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS, organism="human", schema_version="5.1.0")

→ added default value 'unknown' to the adata.obs['development_stage']
→ added default value 'normal' to the adata.obs['disease']
→ added default value 'unknown' to the adata.obs['self_reported_ethnicity']
→ added default value 'cell' to the adata.obs['suspension_type']
→ added default value 'tissue' to the adata.obs['tissue_type']
! source schema has additional modules: {'ourprojects'}
consider mounting these schema modules to transfer all metadata

In [10]:
validated = curator.validate()
validated

→ validating metadata using registries of instance laminlabs/cellxgene
! source schema has additional modules: {'ourprojects'}
consider mounting these schema modules to transfer all metadata
• mapping "var_index" on Gene.ensembl_gene_id
!   113 terms are not validated: 'ENSG00000269933', 'ENSG00000261737', 'ENSG00000259834', 'ENSG00000256374', 'ENSG00000263464', 'ENSG00000203812', 'ENSG00000272196', 'ENSG00000272880', 'ENSG00000270188', 'ENSG00000287116', 'ENSG00000237133', 'ENSG00000224739', 'ENSG00000227902', 'ENSG00000239467', 'ENSG00000272551', 'ENSG00000280374', 'ENSG00000236886', 'ENSG00000229352', 'ENSG00000286601', 'ENSG00000227021', ...
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is

False

## Remove unvalidated values

We remove all unvalidated genes.
These genes may exist in a different Ensembl release but are not valid for the Ensembl release used by CELLxGENE schema 5.1.0 (Ensembl release 110).

In [11]:
adata = adata[:, ~adata.var.index.isin(curator.non_validated["var_index"])].copy()
if adata.raw is not None:
    raw_data = adata.raw.to_adata()
    raw_data = raw_data[
        :, ~raw_data.var_names.isin(curator.non_validated["var_index"])
    ].copy()
    adata.raw = raw_data
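
The filtering step boils down to a boolean mask over the variable index. Here is a minimal sketch in plain Python, in which lists stand in for `adata.var.index` and `curator.non_validated["var_index"]`. The two unvalidated IDs are taken from the validation log above; the other two IDs are illustrative and not necessarily part of this dataset.

```python
# Stand-ins for `adata.var.index` and `curator.non_validated["var_index"]`.
var_index = ["ENSG00000141510", "ENSG00000269933", "ENSG00000139618", "ENSG00000261737"]
non_validated = {"ENSG00000269933", "ENSG00000261737"}

# Boolean mask: True for genes that are not flagged and should be kept.
keep_mask = [gene not in non_validated for gene in var_index]
kept_genes = [gene for gene, keep in zip(var_index, keep_mask) if keep]

print(keep_mask)   # [True, False, True, False]
print(kept_genes)  # ['ENSG00000141510', 'ENSG00000139618']
```

Applied to the full dataset, the same logic removes the 113 unvalidated genes from the 36503 variables, leaving 36390.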

In [12]:
# We must create the Curator object again to ensure that it references the correct AnnData object
curator = cxg.Curator(adata, organism="human", schema_version="5.1.0")

! source schema has additional modules: {'ourprojects'}
consider mounting these schema modules to transfer all metadata

## Register new metadata labels

Following the suggestions above, we register the genes and labels that aren't yet present in the current instance.

(Note that our instance is rather empty. Once you have filled up the registries, registering new labels won't be needed as often.)

For donors, we register the new labels:

In [13]:
curator.add_new_from("donor_id")

An error is shown for the tissue label "lungg", which is a typo for "lung". Let's fix it:

In [14]:
tissues = curator.lookup().tissue
tissues.lung

! source schema has additional modules: {'ourprojects'}
consider mounting these schema modules to transfer all metadata


Tissue(uid='7Tt4iEKc', name='lung', ontology_id='UBERON:0002048', synonyms='pulmo', description='Respiration Organ That Develops As An Outpocketing Of The Esophagus.', created_by_id=1, source_id=47, created_at=2023-11-28 22:50:53 UTC)

In [15]:
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
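
When an unvalidated label looks like a typo, a fuzzy match against the valid names can suggest the intended term. This is a minimal sketch using Python's standard `difflib`. The `valid_tissues` list is a small stand-in for the `Tissue` registry (using tissue names from the example dataset), and `suggest_fix` is a hypothetical helper, not part of the `Curator` API.

```python
from difflib import get_close_matches
from typing import Optional

# Small stand-in for the names in the Tissue registry.
valid_tissues = ["blood", "lung", "spleen", "thoracic lymph node", "mesenteric lymph node"]


def suggest_fix(label: str, valid_names: list) -> Optional[str]:
    """Return the closest valid name for a label, or None if nothing is similar enough."""
    matches = get_close_matches(label, valid_names, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(suggest_fix("lungg", valid_tissues))  # lung
print(suggest_fix("xyz", valid_tissues))    # None
```

A suggestion like this is only a hint; looking up the record via `curator.lookup()` as shown above remains the way to confirm that the term actually exists in the registry.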

Let's validate the object again:

In [16]:
validated = curator.validate()
validated

→ validating metadata using registries of instance laminlabs/cellxgene
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "assay" is validated against ExperimentalFactor.name
✓ "cell_type" is validated against CellType.name
✓ "development_stage" is validated against DevelopmentalStage.name
✓ "disease" is validated against Disease.name
✓ "donor_id" is validated against ULabel.name
✓ "self_reported_ethnicity" is validated against Ethnicity.name
✓ "sex_ontology_term_id" is validated against Phenotype.ontology_id
✓ "suspension_type" is validated against ULabel.name
✓ "tissue" is validated against Tissue.name
✓ "tissue_type" is validated against ULabel.name
✓ "organism" is validated against Organism.name


True

In [17]:
adata.obs.head()

| | donor_id | tissue | cell_type | assay | sex_ontology_term_id | organism | sex | development_stage | disease | self_reported_ethnicity | suspension_type | tissue_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CZINY-0109_CTGGTCTAGTCTGTAC | D496-1 | blood | classical monocyte | 10x 3' v3 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT | 621B-1 | thoracic lymph node | T follicular helper cell | 10x 5' v2 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935491_CTGGTCTGTACATGTC | A29-1 | spleen | memory B cell | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7980367_GGGCATCCAGGTGGAT | A36-1 | lung | alveolar macrophage | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |
| Pan_T7935494_ATCATGGTCTACCTGC | A29-1 | mesenteric lymph node | naive thymus-derived CD4-positive, alpha-beta ... | 10x 5' v1 | PATO:0000384 | human | unknown | unknown | normal | unknown | cell | tissue |


## Save artifact

In [18]:
artifact = curator.save_artifact(description=f"dataset curated against cellxgene schema {curator.schema_version}")

! no run & transform got linked, call `ln.track()` & re-run
→ returning existing artifact with same hash: Artifact(uid='ZI5uEYLi3fDDDzoM0000', is_latest=True, description='dataset curated against cellxgene schema 5.1.0', suffix='.h5ad', type='dataset', size=54670616, hash='VYhEnkViOhtD-7kN2odUGw', n_observations=1626, _hash_type='sha1-fl', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, created_by_id=1, created_at=2025-01-10 13:30:15 UTC)
! run input wasn't tracked, call `ln.track()` and re-run


In [19]:
artifact.describe()

The following step is optional. It mimics the way CELLxGENE creates collections of `AnnData` objects to link them to studies.

In [20]:
# register a new collection
title = "Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)"
collection = ln.Collection(
    [artifact],  # registered artifact above, can also pass a list of artifacts
    name=title,  # title of the publication
    description="10.1126/science.abl5197",  # DOI of the publication
    reference="E-MTAB-11536",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
).save()

! no run & transform got linked, call `ln.track()` & re-run
! returning existing collection with same hash: Collection(uid='KjJVvmu6aOEtxisK0000', is_latest=True, name='Cross-tissue immune cell analysis reveals tissue-specific features in humans (for test demo only)', description='10.1126/science.abl5197', hash='A8JQoolQ3UhcVU1i1JLItw', reference='E-MTAB-11536', reference_type='ArrayExpress', visibility=1, created_by_id=1, created_at=2025-01-10 13:30:18 UTC)
! run input wasn't tracked, call `ln.track()` and re-run


## Return an input h5ad file for cellxgene-schema

In [21]:
adata_cxg = curator.to_cellxgene_anndata(is_primary_data=True, title=title)
adata_cxg

! source schema has additional modules: {'ourprojects'}
consider mounting these schema modules to transfer all metadata


AnnData object with n_obs × n_vars = 1626 × 36390
    obs: 'donor_id', 'sex_ontology_term_id', 'suspension_type', 'tissue_type', 'tissue_ontology_term_id', 'cell_type_ontology_term_id', 'assay_ontology_term_id', 'organism_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data'
    var: 'feature_is_filtered'
    uns: 'default_embedding', 'title', 'cxg_lamin_schema_reference', 'cxg_lamin_schema_version'
    obsm: 'X_umap'
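
The exported object replaces name columns such as `tissue` with `*_ontology_term_id` columns. Below is a rough sketch of that kind of mapping in plain Python, assuming a simple name-to-ID lookup. The `to_ontology_ids` helper is hypothetical and does not describe how `to_cellxgene_anndata` is implemented. The ID for lung is the one returned by `curator.lookup()` earlier.

```python
# Name-to-ID lookup; a tiny stand-in for the Tissue registry.
tissue_ids = {
    "lung": "UBERON:0002048",  # as shown by `tissues.lung` earlier
    "blood": "UBERON:0000178",
}


def to_ontology_ids(names: list, lookup: dict) -> list:
    """Map names to ontology IDs, failing loudly on anything unmapped."""
    missing = sorted(set(names) - set(lookup))
    if missing:
        raise KeyError(f"no ontology ID for: {missing}")
    return [lookup[name] for name in names]


tissue_ontology_term_id = to_ontology_ids(["blood", "lung", "lung"], tissue_ids)
print(tissue_ontology_term_id)
```

Raising on unmapped names mirrors the idea that every label should be validated against a registry before export, so nothing silently ends up without an ontology term.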

In [22]:
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")

In [23]:
!MPLBACKEND=agg uvx cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad || exit 1

Loading dependencies
Loading validator modules
Starting validation...
Unable to open 'anndata_human_immune_cells_cxg.h5ad' with AnnData


```{note}

The `Curator` class is designed to validate all metadata for adherence to ontologies.
It does not reimplement all rules of the CELLxGENE schema, so we recommend running the [cellxgene-schema](https://github.com/chanzuckerberg/single-cell-curation) validator if full adherence beyond metadata is required.
```