# Pydantic & Pandera vs. LaminDB

This document explains the conceptual differences between validating data with `pydantic`, `pandera`, and `lamindb`.

In [None]:
!lamin init --storage test-pydantic-pandera --modules bionty

In [None]:
import pandas as pd
import pydantic
from typing import Literal
import lamindb as ln
import bionty as bt

df = ln.core.datasets.small_dataset1()
df

## A pydantic model

In [None]:
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal["T cell", "B cell"]
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    donor: str | None
    concentration: str
    treatment_time_h: int

    class Config:
        title = "My immuno schema"

## A lamindb schema

In [None]:
ln.ULabel(name="DMSO").save()  # define a DMSO label
ln.ULabel(name="IFNG").save()  # define an IFNG label

# leverage ontologies through types ln.ULabel, bt.CellType, bt.ExperimentalFactor
schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
        ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
        ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
        ln.Feature(name="concentration", dtype=str).save(),
        ln.Feature(name="treatment_time_h", dtype=int).save(),
    ],
).save()

A pandera schema looks essentially the same as a LaminDB schema. LaminDB extends pandera.

## Validate a dataframe with pydantic

In [None]:
class DataFrameValidationError(Exception):
    pass


def validate_dataframe(df: pd.DataFrame, model: type[pydantic.BaseModel]) -> None:
    """Validate each row of `df` against `model`, collecting all row-level errors."""
    errors = []

    for i, row in enumerate(df.to_dict(orient="records")):
        try:
            model(**row)
        except pydantic.ValidationError as e:
            errors.append(f"row {i} failed validation: {e}")

    if errors:
        error_message = "\n".join(errors)
        raise DataFrameValidationError(
            f"DataFrame validation failed with the following errors:\n{error_message}"
        )

In [None]:
try:
    validate_dataframe(df, ImmunoSchema)
except DataFrameValidationError as e:
    print(e)

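Validation fails because the dataset contains a cell type label that is not listed in the `CellType` literal. To fix this, we need to update the `Literal` and re-run the model definition.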

In [None]:
Perturbation = Literal["DMSO", "IFNG"]
CellType = Literal[
    "T cell", "B cell", "CD8-positive, alpha-beta T cell"
]  # <-- This was updated
OntologyID = Literal["EFO:0008913"]


class ImmunoSchema(pydantic.BaseModel):
    perturbation: Perturbation
    cell_type_by_model: CellType
    cell_type_by_expert: CellType
    assay_oid: OntologyID
    donor: str | None
    concentration: str
    treatment_time_h: int

    class Config:
        title = "My immuno schema"

In [None]:
validate_dataframe(df, ImmunoSchema)
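
Working out which labels are missing from a `Literal` can also be done programmatically. The sketch below uses `typing.get_args` to compare the allowed values with the values observed in a column. The `observed` list is a hypothetical stand-in for something like `df["cell_type_by_expert"].unique()`, and `OriginalCellType` is an illustrative name for the literal before the update.

```python
from typing import Literal, get_args

# the literal as it was defined before the update
OriginalCellType = Literal["T cell", "B cell"]

# hypothetical stand-in for the unique values of a dataframe column
observed = ["T cell", "B cell", "CD8-positive, alpha-beta T cell"]

# get_args returns the values listed inside the Literal
allowed = set(get_args(OriginalCellType))
missing = sorted(set(observed) - allowed)
print(missing)  # ['CD8-positive, alpha-beta T cell']
```

Every label found this way still has to be added to the `Literal` by hand, after which the model class has to be redefined.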

## Validate a dataframe with lamindb

In [None]:
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()

What was the validation based on? Let's inspect the `CellType` ontology.

In [None]:
bt.CellType.df()

In [None]:
bt.CellType.get(name="CD8-positive, alpha-beta T cell").view_parents()
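
The key difference from the `Literal` approach is that valid labels live in a database rather than in code. The toy sketch below illustrates that idea with the standard library's `sqlite3`. It is not how LaminDB is implemented; the table name `cell_type` and the helper `validate_labels` are made up for illustration.

```python
import sqlite3

# a tiny in-memory "registry" of valid cell type names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_type (name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO cell_type (name) VALUES (?)",
    [("T cell",), ("B cell",)],
)


def validate_labels(values):
    """Return the values that are not registered in the cell_type table."""
    registered = {row[0] for row in conn.execute("SELECT name FROM cell_type")}
    return sorted(set(values) - registered)


values = ["T cell", "CD8-positive, alpha-beta T cell"]
print(validate_labels(values))  # the CD8 label is not registered yet

# register the new term: the validation code itself does not change
conn.execute(
    "INSERT INTO cell_type (name) VALUES (?)",
    ("CD8-positive, alpha-beta T cell",),
)
print(validate_labels(values))  # now an empty list
```

Because the set of valid terms is data, it can be extended by anyone with write access to the registry, without touching the schema definition.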

## Overview of differences in validation properties

Importantly, LaminDB offers not only a `DataFrameCurator`, but also an `AnnDataCurator`, a `MuDataCurator`, a `SpatialDataCurator`, and a `TiledbsomaCurator`.

The overview below only concerns validating dataframes.

### Experience of data engineer

property | `pydantic` | `pandera` | `lamindb`
--- | --- | --- | ---
define schema as code | yes, in the form of a `pydantic.BaseModel` | yes, in the form of a `pandera.DataFrameSchema` | yes, in the form of a `lamindb.Schema`
update labels outside of code | not possible because labels are enums/literals | not possible because labels are hard-coded in `Check` | possible by adding new terms to a registry
easily import valid labels from public ontologies | no | no | yes
sync ELN/LIMS systems into label registries | no | no | yes
can re-use fields/columns/features across schemas | no | only in the same Python session | yes, because features are persisted in the database
can update the schema without fearing that previous datasets are now invalid | no | no | yes, because LaminDB allows querying datasets that were validated with a given schema version
can use columnar organization of dataframe | no, need to iterate over potentially millions of rows | yes | yes
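
The last row of the table above can be made concrete with plain pandas. The sketch below checks a label column with a single vectorized `isin` call instead of constructing one model instance per row. The example frame and the `allowed` set are illustrative.

```python
import pandas as pd

allowed = {"T cell", "B cell"}
example_df = pd.DataFrame(
    {"cell_type_by_expert": ["T cell", "B cell", "CD8-positive, alpha-beta T cell"]}
)

# one vectorized membership test over the whole column
is_valid = example_df["cell_type_by_expert"].isin(allowed)

# collect the distinct labels that failed the check
invalid_values = example_df.loc[~is_valid, "cell_type_by_expert"].unique().tolist()
print(invalid_values)  # ['CD8-positive, alpha-beta T cell']
```

A row-wise validator such as `validate_dataframe` performs the same membership test once per row, which becomes costly for dataframes with millions of rows.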

### Experience of data consumer

property | `pydantic` | `pandera` | `lamindb`
--- | --- | --- | ---
dataset is queryable / findable | no | no | yes, by querying for labels & features
dataset is annotated | no | no | yes
user knows what the validation constraints were | no, because they might not have access to the code and don't know which code was run | no (same as pydantic) | yes, `artifact.schema` shows the schema used for validation

## Annotation & queryability

### Data engineer: annotate the dataset

In [None]:
artifact = curator.save_artifact(key="our_datasets/dataset1.parquet")

### Data consumer: see annotations

In [None]:
artifact.describe()

### Data consumer: query the artifact by labels

In [None]:
ln.Artifact.features.filter(perturbation="IFNG").df()

### Data consumer: understand _how_ the artifact was validated

In [None]:
artifact.schema

In [None]:
artifact.schema.features.df()