# Custom Curators

Teams often agree that certain datasets must always contain a specific set of columns, each curated against a specific ontology.
Here we show how to define a custom Curator class that encodes such a set of rules.
An example is the cellxgene curator ({doc}`docs:cellxgene-curate`), which curates data against specific cellxgene schema versions.

Scenario: A clinical team wants to ensure that every electronic health record has at least the columns 'disease', 'phenotype', and 'developmental_stage'.
Further, these columns must be mapped against specific versions of ontologies that we'll define throughout this tutorial.
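
The column requirement from this scenario can be sketched in plain Python before any curator is involved. This is only an illustration: `REQUIRED_COLUMNS` and `missing_required_columns` are hypothetical names that are not part of lamindb.

```python
# Hypothetical helper, not part of lamindb: the mandatory columns from the scenario
REQUIRED_COLUMNS = ("disease", "phenotype", "developmental_stage")


def missing_required_columns(columns):
    """Return the required column names that are absent, in their declared order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# A table that only carries one of the three mandatory columns
print(missing_required_columns(["disease", "patient_age", "treatment_outcome"]))
```

Because the function only iterates over the names it receives, it also accepts the `columns` attribute of a pandas DataFrame.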

lamindb provides several :class:`lamindb.Curator` subclasses tailored to specific data types, such as :class:`lamindb.core.DataFrameCurator`, :class:`lamindb.core.AnnDataCurator`, and :class:`lamindb.core.MuDataCurator`.
Here, we assume that the electronic health record data is stored in tabular form in DataFrames.

In [None]:
!lamin init --storage custom-curator --name custom-curator --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd

ln.context.uid = "2XEr2IA4n1w40000"
ln.context.track()

## Example data

In [None]:
# create example data that has all 3 mandatory columns and two additional columns
data = {
    'disease': ['Alzheimer disease', 'Diabetes mellitus', 'Breast cancer', 'Hypertension', 'Asthma'],
    'phenotype': ['Cognitive decline', 'Hyperglycemia', 'Tumor growth', 'Increased blood pressure', 'Airway inflammation'],
    'developmental_stage': ['Adult', 'Adult', 'Adult', 'Adult', 'Child'],
    'patient_age': [70, 55, 60, 65, 12],
    'treatment_outcome': ['Improved', 'Stable', 'Improved', 'Worsened', 'Stable']
}
df = pd.DataFrame(data)

## Implement custom EHR Curator

We can use any of :class:`lamindb.core.DataFrameCurator`, :class:`lamindb.core.AnnDataCurator`, and :class:`lamindb.core.MuDataCurator` as the base class for our custom Curator.
Since we're dealing with tabular EHR data, we go with :class:`lamindb.core.DataFrameCurator`.

In [None]:
from lamindb.core import DataFrameCurator
from lamindb_setup.core.types import UPathStr
from lnschema_core import Record
from lnschema_core.types import FieldAttr

# Curate these columns against the specified fields
DEFAULT_CATEGORICALS = {
    "disease": bt.Disease.name,
    "phenotype": bt.Phenotype.name,
    "developmental_stage": bt.DevelopmentalStage.name,
}

# If columns or values are missing, we substitute with these defaults
DEFAULT_VALUES = {
    "disease": "normal",
    "developmental_stage": "unknown",
    "phenotype": "unknown"
}

# Curate against these specified sources
FIXED_SOURCES = {
    "disease": bt.Source.filter(entity="bionty.Disease", name="mondo", version="2023-04-04").one(),
    "developmental_stage": bt.Source.filter(entity="bionty.DevelopmentalStage", name="hsapdv", version="2020-03-10").one(),
    "phenotype": bt.Source.filter(entity="bionty.Phenotype", name="hp", version="2023-06-17", organism="human").one()
}

class EHRCurator(DataFrameCurator):
    """Custom curation flow for electronic health record data."""

    def __init__(
        self,
        data: pd.DataFrame | UPathStr,
        categoricals: dict[str, FieldAttr] = DEFAULT_CATEGORICALS,
        *,
        defaults: dict[str, str] | None = None,
        sources: dict[str, Record] = FIXED_SOURCES,
        organism="human"
    ):  
        self.data = data
        self.organism = organism
        
        # If defaults are provided, add missing columns with the default value
        # and fill missing values in existing columns with it
        if defaults:
            for col, default in defaults.items():
                if col not in data.columns:
                    data[col] = default
                else:
                    # assign back instead of using inplace=True on a column selection
                    data[col] = data[col].fillna(default)


        super().__init__(
            df=data,
            categoricals=categoricals,
            sources=sources,
            organism=organism
        )
    
    def validate(self) -> bool:
        """Further custom validation."""
        # --- Custom validation logic goes here --- #        
        return super().validate(organism=self.organism)

## Curate with EHRCurator

In [None]:
ehrcurator = EHRCurator(df)

In [None]:
ehrcurator.add_new_from_columns()

In [None]:
# Catch the exception so that the notebook runs through without errors
try:
    ehrcurator.validate()
except ValueError as e:
    print(e)
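
Conceptually, validating a categorical column amounts to comparing its values with a reference vocabulary. The sketch below illustrates that idea with a hypothetical hard-coded vocabulary and a hypothetical helper `non_validated`; lamindb checks against the registered ontology sources instead, and this is not its implementation.

```python
def non_validated(values, vocabulary):
    """Return the distinct values that are absent from the vocabulary, in first-seen order."""
    seen = []
    for value in values:
        if value not in vocabulary and value not in seen:
            seen.append(value)
    return seen


# Hypothetical vocabulary for illustration only; real terms come from the ontology source
stage_vocabulary = {"adolescent stage", "child stage"}
print(non_validated(["Adult", "Adult", "child stage", "Child"], stage_vocabulary))
```

Values reported this way are the ones that need to be renamed to an ontology term or registered as new terms.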

In [None]:
# Rename the 'patient_age' column to 'age'; the curator is re-created from the updated DataFrame further below
df = df.rename(columns={"patient_age": "age"})

In [None]:
ehrcurator.validate()

In [None]:
# Create lookups of public ontology terms to fix the values that did not validate
disease_lo = bt.Disease.public().lookup()
phenotype_lo = bt.Phenotype.public().lookup()
developmental_stage_lo = bt.DevelopmentalStage.public().lookup()

In [None]:
df["disease"] = df["disease"].replace({"Hypertension": disease_lo.hypertensive_disorder.name})
df["phenotype"] = df["phenotype"].replace({"Tumor growth": phenotype_lo.neoplasm.name,
                                           "Airway inflammation": phenotype_lo.bronchitis.name})
df["developmental_stage"] = df["developmental_stage"].replace({"Adult": developmental_stage_lo.adolescent_stage.name,
                                                               "Child": developmental_stage_lo.child_stage.name})

In [None]:
ehrcurator = EHRCurator(df)
ehrcurator.validate()

In [None]:
ehrcurator.add_validated_from("disease")
ehrcurator.add_validated_from("phenotype")
ehrcurator.add_validated_from("developmental_stage")

In [None]:
ehrcurator.validate()

In [None]:
ln.context.finish()