# Subclassing Curator

Teams often agree that particular datasets must include a set of predefined columns curated against specific ontologies.
This guide shows how to subclass {class}`~lamindb.Curator` to enforce such rules, as the cellxgene curator ({doc}`docs:cellxgene-curate`) does for specific versions of the cellxgene schema.

In [None]:
!lamin init --storage ./subclass-curator --name subclass-curator --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd

ln.track("2XEr2IA4n1w40000")

## EHRCurator implementation

A clinical team wants to ensure that every electronic health record (EHR) dataset has at least the columns 'disease', 'phenotype', 'developmental_stage', and 'age'.
Furthermore, the values in these columns must map to terms from specific ontology versions.

Any of {class}`~lamindb.core.DataFrameCurator`, {class}`~lamindb.core.AnnDataCurator`, and {class}`~lamindb.core.MuDataCurator` can serve as the base class for a custom Curator.
Since we're dealing with tabular EHR data, we use {class}`~lamindb.core.DataFrameCurator`.

```{eval-rst}
.. literalinclude:: subclass_curator.py
   :language: python
   :caption: Custom EHR Curator
```
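
The general pattern, a subclass that checks for mandatory columns before deferring to the base class, can be sketched without lamindb. This is only an illustration under stated assumptions: `TableCurator` and `RequiredColumnsCurator` are hypothetical names, the table is a plain mapping of column names to value lists, and none of this reflects lamindb's actual API or the contents of `subclass_curator.py`.

```python
from typing import Mapping, Sequence


class TableCurator:
    """Hypothetical stand-in for a curator base class (not lamindb's API)."""

    def __init__(self, table: Mapping[str, Sequence]) -> None:
        # The table is a plain mapping of column name -> list of values.
        self.table = table

    def validate(self) -> bool:
        # The stand-in base class has no rules of its own, so every table passes.
        return True


class RequiredColumnsCurator(TableCurator):
    """Adds a team-specific rule: a fixed set of columns must be present."""

    REQUIRED_COLUMNS = frozenset(
        {"disease", "phenotype", "developmental_stage", "age"}
    )

    def missing_columns(self) -> list:
        # Mandatory columns that the table does not contain, sorted by name.
        return sorted(self.REQUIRED_COLUMNS - set(self.table))

    def validate(self) -> bool:
        missing = self.missing_columns()
        if missing:
            print(f"Missing mandatory columns: {missing}")
            return False
        # Defer to the base-class checks only once the schema is complete.
        return super().validate()


records = {
    "disease": ["Asthma"],
    "phenotype": ["Airway inflammation"],
    "developmental_stage": ["Child"],
    "patient_age": [12],  # misnamed on purpose
}
curator = RequiredColumnsCurator(records)
is_valid_before = curator.validate()

# Rename the column in place; the curator holds a reference to the same object.
records["age"] = records.pop("patient_age")
is_valid_after = curator.validate()
```

Keeping the column check in an overridden `validate()` means callers use the same method as with the base class, and the stricter rule applies automatically.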

## Curate with EHRCurator

In [None]:
# Create an example DataFrame in which one mandatory column ('age') is misnamed as 'patient_age'
data = {
    'disease': ['Alzheimer disease', 'Diabetes mellitus', 'Breast cancer', 'Hypertension', 'Asthma'],
    'phenotype': ['Cognitive decline', 'Hyperglycemia', 'Tumor growth', 'Increased blood pressure', 'Airway inflammation'],
    'developmental_stage': ['Adult', 'Adult', 'Adult', 'Adult', 'Child'],
    'patient_age': [70, 55, 60, 65, 12],
}
df = pd.DataFrame(data)
df

In [None]:
from subclass_curator import EHRCurator

ehrcurator = EHRCurator(df)

In [None]:
ehrcurator.validate()

In [None]:
# Rename the misnamed column
df.columns = df.columns.str.replace("patient_age", "age")

In [None]:
ehrcurator.validate()

In [None]:
# Create lookup objects to curate the values
disease_lo = bt.Disease.public().lookup()
phenotype_lo = bt.Phenotype.public().lookup()
developmental_stage_lo = bt.DevelopmentalStage.public().lookup()

In [None]:
df["disease"] = df["disease"].replace({"Hypertension": disease_lo.hypertensive_disorder.name})
df["phenotype"] = df["phenotype"].replace({"Tumor growth": phenotype_lo.neoplasm.name,
                                           "Airway inflammation": phenotype_lo.bronchitis.name})
df["developmental_stage"] = df["developmental_stage"].replace({"Adult": developmental_stage_lo.adolescent_stage.name,
                                                               "Child": developmental_stage_lo.child_stage.name})

In [None]:
ehrcurator.validate()
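
The replace-and-revalidate cycle above can be illustrated with plain Python. This is a sketch under assumptions, not a description of how `validate()` works internally: the helper names are hypothetical, and `vocabulary` is an invented stand-in rather than terms read from a real ontology.

```python
def apply_mapping(values, mapping):
    # Mapped values are replaced; values without a mapping are kept unchanged.
    return [mapping.get(value, value) for value in values]


def non_validated(values, vocabulary):
    # Unique values that are not in the vocabulary, in first-seen order.
    unknown = []
    for value in values:
        if value not in vocabulary and value not in unknown:
            unknown.append(value)
    return unknown


# Invented stand-in vocabulary (assumption: NOT taken from any ontology release).
vocabulary = {"Cognitive decline", "Hyperglycemia", "Neoplasm", "Bronchitis"}

raw = [
    "Cognitive decline",
    "Hyperglycemia",
    "Tumor growth",
    "Increased blood pressure",
    "Airway inflammation",
]
mapping = {"Tumor growth": "Neoplasm", "Airway inflammation": "Bronchitis"}

before = non_validated(raw, vocabulary)
curated = apply_mapping(raw, mapping)
after = non_validated(curated, vocabulary)
```

In this toy example, one value remains outside the vocabulary after the mapping is applied, which is why it helps to validate again after each round of replacements.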

In [None]:
ehrcurator.add_validated_from("disease")
ehrcurator.add_validated_from("phenotype")
ehrcurator.add_validated_from("developmental_stage")

In [None]:
ehrcurator.validate()

In [None]:
!rm -rf subclass-curator
!lamin delete --force subclass-curator