# Subclassing Curator

Teams often agree that specific datasets must include a set of predefined columns curated against particular ontologies.
This guide shows how to subclass {class}`~lamindb.Curator` to enforce such rules, like the cellxgene curator ({doc}`docs:cellxgene-curate`) does for specific cellxgene schema versions.

lamindb provides several {class}`~lamindb.Curator` subclasses tailored to specific data types, such as {class}`~lamindb.core.DataFrameCurator`, {class}`~lamindb.core.AnnDataCurator`, and {class}`~lamindb.core.MuDataCurator`.

In [None]:
!lamin init --storage subclass-curator --name subclass-curator --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd

# EHRCurator is defined in ehrcurator.py (included in the section "Implement EHR Curator")
from ehrcurator import EHRCurator

ln.track("2XEr2IA4n1w40000")

## Scenario

A clinical team requires every electronic health record dataset to contain at least the columns 'disease', 'phenotype', and 'developmental_stage'.
In addition, the values in these columns must map to terms from specific ontology versions.
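
The column requirement on its own can be checked with plain pandas, independent of lamindb. The following is a minimal sketch of that idea. `MANDATORY_COLUMNS` and `missing_mandatory_columns` are hypothetical names introduced for illustration; they are not part of lamindb or of `EHRCurator`.

```python
import pandas as pd

# Hypothetical constant for illustration: the columns named in the scenario.
MANDATORY_COLUMNS = ("disease", "phenotype", "developmental_stage")


def missing_mandatory_columns(df: pd.DataFrame) -> list:
    """Return the mandatory columns that are absent from `df`, in declaration order."""
    return [col for col in MANDATORY_COLUMNS if col not in df.columns]


# Toy record that lacks one of the mandatory columns.
example = pd.DataFrame({"disease": ["Asthma"], "phenotype": ["Airway inflammation"]})
print(missing_mandatory_columns(example))  # ['developmental_stage']
```

A check like this only covers column presence; mapping the values to ontology terms is the part a curator adds on top.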

In [None]:
# create an example DataFrame with the three mandatory columns from the scenario and two additional columns
data = {
    'disease': ['Alzheimer disease', 'Diabetes mellitus', 'Breast cancer', 'Hypertension', 'Asthma'],
    'phenotype': ['Cognitive decline', 'Hyperglycemia', 'Tumor growth', 'Increased blood pressure', 'Airway inflammation'],
    'developmental_stage': ['Adult', 'Adult', 'Adult', 'Adult', 'Child'],
    'patient_age': [70, 55, 60, 65, 12],
    'treatment_outcome': ['Improved', 'Stable', 'Improved', 'Worsened', 'Stable']
}
df = pd.DataFrame(data)
df

## Implement EHR Curator

Any of {class}`~lamindb.core.DataFrameCurator`, {class}`~lamindb.core.AnnDataCurator`, and {class}`~lamindb.core.MuDataCurator` can serve as the base class for a custom Curator.
Since we're dealing with tabular EHR data, we subclass {class}`~lamindb.core.DataFrameCurator`.

```{eval-rst}
.. literalinclude:: ehrcurator.py
   :language: python
   :caption: Custom EHR Curator
```
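
The subclassing pattern itself can be illustrated with a toy stand-in that runs without a lamindb instance. The sketch below does not use the lamindb API: `ToyCurator` is a hypothetical base class that only mimics the idea of a `validate()` method, and `ToyEHRCurator` overrides it to enforce the mandatory columns before delegating to the parent. The `allowed` values are made-up toy terms, and the actual `EHRCurator` in `ehrcurator.py` may be structured differently.

```python
import pandas as pd


class ToyCurator:
    """Hypothetical stand-in for a curator base class (not the lamindb API)."""

    def __init__(self, df: pd.DataFrame, allowed: dict):
        self.df = df
        # Maps a column name to the set of values considered valid for it.
        self.allowed = allowed

    def validate(self) -> bool:
        """Return True if every checked column that is present holds only allowed values."""
        for column, values in self.allowed.items():
            if column in self.df.columns and not set(self.df[column]) <= values:
                return False
        return True


class ToyEHRCurator(ToyCurator):
    """Adds a team-specific schema rule on top of the generic value check."""

    MANDATORY = ("disease", "phenotype", "developmental_stage")

    def validate(self) -> bool:
        # Team-specific rule first: all mandatory columns must exist.
        missing = [c for c in self.MANDATORY if c not in self.df.columns]
        if missing:
            print(f"missing mandatory columns: {missing}")
            return False
        # Then fall back to the generic validation of the base class.
        return super().validate()


# Toy vocabulary and data, made up for this illustration.
allowed = {
    "disease": {"asthma"},
    "phenotype": {"bronchitis"},
    "developmental_stage": {"child stage"},
}
complete = pd.DataFrame(
    {"disease": ["asthma"], "phenotype": ["bronchitis"], "developmental_stage": ["child stage"]}
)
incomplete = complete.drop(columns="phenotype")

print(ToyEHRCurator(complete, allowed).validate())    # True
print(ToyEHRCurator(incomplete, allowed).validate())  # False
```

Putting the schema rule in the subclass keeps the generic value validation in one place, so several team-specific curators could share it.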

## Curate with EHRCurator

In [None]:
ehrcurator = EHRCurator(df)

In [None]:
ehrcurator.add_new_from_columns()

In [None]:
ehrcurator.validate()

In [None]:
# Rename the 'patient_age' column to 'age'
df = df.rename(columns={"patient_age": "age"})
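
One pandas detail is worth keeping in mind at this step: `DataFrame.rename` returns a new object by default, so any object that stored a reference to the original DataFrame still sees the old column names. This is presumably why the curator is re-created from `df` later in this guide. A minimal illustration, where `Holder` is a hypothetical stand-in for any object that keeps a DataFrame and `records` is toy data:

```python
import pandas as pd


class Holder:
    """Hypothetical stand-in for any object that keeps a reference to a DataFrame."""

    def __init__(self, df: pd.DataFrame):
        self.df = df


records = pd.DataFrame({"patient_age": [70, 55]})
holder = Holder(records)

# rename() returns a new DataFrame; the name `records` is rebound to it.
records = records.rename(columns={"patient_age": "age"})

print(list(records.columns))    # ['age']
print(list(holder.df.columns))  # ['patient_age']
```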

In [None]:
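# `rename` returned a new DataFrame, so re-create the curator before validating again
ehrcurator = EHRCurator(df)
ehrcurator.validate()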

In [None]:
# Look up terms in the public ontologies to fix the values that failed validation
disease_lo = bt.Disease.public().lookup()
phenotype_lo = bt.Phenotype.public().lookup()
developmental_stage_lo = bt.DevelopmentalStage.public().lookup()

In [None]:
df["disease"] = df["disease"].replace({"Hypertension": disease_lo.hypertensive_disorder.name})
df["phenotype"] = df["phenotype"].replace({"Tumor growth": phenotype_lo.neoplasm.name,
                                           "Airway inflammation": phenotype_lo.bronchitis.name})
df["developmental_stage"] = df["developmental_stage"].replace({"Adult": developmental_stage_lo.adolescent_stage.name,
                                                               "Child": developmental_stage_lo.child_stage.name})
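
`Series.replace` with a dictionary only changes exact matches and passes every other value through unchanged, so it is easy to check which values would still fail after a mapping step. The following plain-pandas sketch shows the idea; the `allowed` set is a hypothetical stand-in for the ontology terms, not data retrieved from bionty.

```python
import pandas as pd

# Hypothetical set of accepted terms; in the guide these come from the ontology lookups.
allowed = {"Alzheimer disease", "hypertensive disorder", "asthma"}

disease = pd.Series(["Alzheimer disease", "Hypertension", "Asthma"])

# Map non-standard labels to accepted terms; unmatched values pass through unchanged.
fixed = disease.replace({"Hypertension": "hypertensive disorder"})

# Values that are still outside the accepted set and need further mapping.
remaining = sorted(set(fixed) - allowed)
print(remaining)  # ['Asthma']
```

Here 'Asthma' remains because the comparison is case-sensitive, which is the kind of mismatch a subsequent `validate()` call would be expected to surface.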

In [None]:
ehrcurator = EHRCurator(df)
ehrcurator.validate()

In [None]:
ehrcurator.add_validated_from("disease")
ehrcurator.add_validated_from("phenotype")
ehrcurator.add_validated_from("developmental_stage")

In [None]:
ehrcurator.validate()

In [None]:
# Clean up the storage location and instance created by `lamin init` at the top
!rm -rf subclass-curator
!lamin delete --force subclass-curator