# Curate dataframes with an EHR schema

In a [previous guide](./curate-df), you defined generic {class}`~lamindb.Schema` objects for `DataFrame` and other data structures.
This guide walks through an example schema for electronic health record (EHR) data.

For a comparable schema related to scRNA-seq data, see the CELLxGENE schema ({doc}`docs:cellxgene-curate`).

In [None]:
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-ehrschema --modules bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd

ln.track("2XEr2IA4n1w40000")

We want to ensure that

1. the dataframe has the columns `disease`, `phenotype`, `developmental_stage`, and `age`
2. missing columns or values are filled in with default values when we standardize the dataframe
3. any values that are present map to terms in specific versions of pre-defined ontologies
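
To make these three requirements concrete, here is a minimal plain-Python sketch of what each one amounts to. It is only an illustration and not how lamindb implements validation: `REQUIRED`, `DEFAULTS`, `ALLOWED`, `check`, and `fill_defaults` are hypothetical names, and the tiny `ALLOWED` set stands in for a pinned ontology version.

```python
# Plain-Python illustration of the three requirements. This is NOT lamindb's
# implementation; every name below is hypothetical.
REQUIRED = ("disease", "phenotype", "developmental_stage", "age")
DEFAULTS = {"disease": "normal"}
ALLOWED = {"disease": {"normal", "asthma"}}  # stand-in for an ontology version


def check(table: dict) -> list:
    """Return a list of human-readable problems found in `table`."""
    problems = []
    # Requirement 1: every required column must be present.
    for column in REQUIRED:
        if column not in table:
            problems.append(f"missing column: {column}")
    # Requirement 3: every present value must be a known term.
    for column, allowed in ALLOWED.items():
        for value in table.get(column, []):
            if value is not None and value not in allowed:
                problems.append(f"{column}: unknown term {value!r}")
    return problems


def fill_defaults(table: dict) -> dict:
    """Requirement 2: return a copy with None replaced by the column default."""
    return {
        column: [DEFAULTS.get(column, v) if v is None else v for v in values]
        for column, values in table.items()
    }


table = {"disease": ["asthma", None, "flu"], "age": [12, 70, 55]}
filled = fill_defaults(table)
problems = check(filled)
for problem in problems:
    print(problem)
```

The rest of this guide expresses the same three checks declaratively through a schema instead of hand-written functions.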

## Define a schema

Let us first define the ontology versions we want to use.

In [None]:
disease_ontology = bt.Source.get(
    entity="bionty.Disease", name="mondo", version="2023-04-04"
)
developmental_stage_ontology = bt.Source.get(
    entity="bionty.DevelopmentalStage", name="hsapdv", version="2020-03-10"
)
phenotype_ontology = bt.Source.get(
    entity="bionty.Phenotype",
    name="hp",
    version="2023-06-17",
    organism="human",
)

Let us now create a schema by defining the features that it measures. Each ontology-backed feature is pinned to its ontology version by passing the source's `uid` in `cat_filters`.

In [None]:
schema = ln.Schema(
    name="My EHR schema",
    features=[
        ln.Feature(name="age", dtype=int).save(),
        ln.Feature(
            name="disease",
            dtype=bt.Disease,
            default_value="normal",
            nullable=False,
            cat_filters={"source__uid": disease_ontology.uid},
        ).save(),
        ln.Feature(
            name="developmental_stage",
            dtype=bt.DevelopmentalStage,
            default_value="unknown",
            nullable=False,
            cat_filters={"source__uid": developmental_stage_ontology.uid},
        ).save(),
        ln.Feature(
            name="phenotype",
            dtype=bt.Phenotype,
            default_value="unknown",
            nullable=False,
            cat_filters={"source__uid": phenotype_ontology.uid},
        ).save(),
    ],
).save()
# look at a dataframe of the features that are part of the schema
schema.features.df()

## Curate an example dataset

Create an example `DataFrame` that contains all required columns, but with one of them misnamed (`patient_age` instead of `age`).

In [None]:
dataset = {
    "disease": pd.Categorical(
        [
            "Alzheimer disease",
            "diabetes mellitus",
            pd.NA,
            "Hypertension",
            "asthma",
        ]
    ),
    "phenotype": pd.Categorical(
        [
            "Mental deterioration",
            "Hyperglycemia",
            "Tumor growth",
            "Increased blood pressure",
            "Airway inflammation",
        ]
    ),
    "developmental_stage": pd.Categorical(
        ["Adult", "Adult", "Adult", "Adult", "Child"]
    ),
    "patient_age": [70, 55, 60, 65, 12],
}
df = pd.DataFrame(dataset)
df

Let's validate it.

In [None]:
curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("column 'age' not in dataframe")
    print(e)

Fix the name of the `patient_age` column to be `age`.

In [None]:
df.columns = df.columns.str.replace("patient_age", "age")
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("non-nullable series 'disease' contains null values")
    print(e)
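
One caveat about renaming via `str.replace`: it replaces a substring in every column name, so it also changes any other column whose name happens to contain `patient_age`. The sketch below contrasts it with an exact-match rename on a small hypothetical frame; the `patient_age_group` column is invented for illustration and is not part of the dataset above.

```python
import pandas as pd

# Hypothetical frame with two column names that share a prefix.
demo = pd.DataFrame({"patient_age": [70, 55], "patient_age_group": ["old", "mid"]})

# Substring replacement touches every column name containing the pattern.
by_substring = demo.columns.str.replace("patient_age", "age", regex=False)

# An exact-match rename only touches the column that is named explicitly.
by_exact_name = demo.rename(columns={"patient_age": "age"}).columns

print(list(by_substring))
print(list(by_exact_name))
```

For the dataset in this guide both approaches give the same result, because only one column name contains `patient_age`.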

Standardize the dataframe so that the missing value gets populated with the default value.

In [None]:
curator.standardize()
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith(
        "2 terms are not validated: 'Tumor growth', 'Airway inflammation'"
    )
    print(e)
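
To see what filling a default into a categorical column involves at the pandas level, consider the sketch below. It shows general pandas mechanics on a small hypothetical series and makes no claim about what `curator.standardize()` does internally.

```python
import pandas as pd

# A categorical column with one missing entry, similar to `disease` above.
s = pd.Series(pd.Categorical(["asthma", pd.NA, "Hypertension"]))

default = "normal"  # mirrors the default value declared on the feature

# A categorical only accepts values from its categories, so the default
# has to be registered as a category before it can be used as a fill value.
if default not in s.cat.categories:
    s = s.cat.add_categories(default)
filled = s.fillna(default)

print(filled.tolist())
```

Calling `fillna` without first adding the category would fail, because the fill value is not among the existing categories.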

Add the `normal` term, which serves as the default value for `disease`, to the disease registry.

In [None]:
bt.Disease(name="normal", description="Healthy condition").save()

Curate the remaining mismatches manually.

In [None]:
diseases = bt.Disease.public().lookup()
phenotypes = bt.Phenotype.public().lookup()
developmental_stages = bt.DevelopmentalStage.public().lookup()

df["disease"] = df["disease"].cat.rename_categories(
    {"Hypertension": diseases.hypertensive_disorder.name}
)
df["phenotype"] = df["phenotype"].cat.rename_categories(
    {
        "Tumor growth": phenotypes.neoplasm.name,
        "Airway inflammation": phenotypes.bronchitis.name,
    }
)
df["developmental_stage"] = df["developmental_stage"].cat.rename_categories(
    {
        "Adult": developmental_stages.adolescent_stage.name,
        "Child": developmental_stages.child_stage.name,
    }
)

curator.validate()
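
The manual fixes above rely on `Series.cat.rename_categories`, which returns a new series rather than modifying the column in place; this is why each result is assigned back to `df`. A minimal sketch on a hypothetical series follows, where the label `"child stage"` is only an illustrative target and not necessarily the ontology's actual term name.

```python
import pandas as pd

stages = pd.Series(pd.Categorical(["Adult", "Child", "Adult"]))

# Only the categories listed in the mapping are renamed; others pass through.
renamed = stages.cat.rename_categories({"Child": "child stage"})

print(renamed.tolist())
print(stages.tolist())  # the original series is unchanged
```

Because the rename operates on categories rather than on individual rows, every row carrying the old label picks up the new one in a single step.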

In [None]:
!rm -rf test-ehrschema
!lamin delete --force test-ehrschema