# Curate datasets

Data curation with LaminDB ensures your datasets are **validated**, **standardized**, and **queryable**. This guide shows you how to transform messy, real-world data into clean, annotated datasets.

Curating a dataset with LaminDB means three things:
- ✅ **Validate** that the dataset matches a desired schema
- 🔧 **Standardize** the dataset (e.g., by fixing typos, mapping synonyms) or update registries if validation fails
- 🏷️ **Annotate** the dataset by linking it against metadata entities so that it becomes queryable
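
The three steps above can be sketched without any LaminDB calls. The snippet below is a toy illustration only: `REGISTRY`, `SYNONYMS`, and the three helper functions are hypothetical stand-ins, not part of the LaminDB API.

```python
# Toy illustration of validate -> standardize -> annotate (not the LaminDB API).
REGISTRY = {"B cell", "T cell"}    # hypothetical registry of valid names
SYNONYMS = {"B-cell": "B cell"}    # hypothetical synonym -> canonical name mapping


def validate(values, registry):
    """Return the distinct values that are not found in the registry, sorted."""
    return sorted(set(values) - registry)


def standardize(values, synonyms):
    """Replace known synonyms with their canonical name; keep other values as-is."""
    return [synonyms.get(value, value) for value in values]


def annotate(values, registry):
    """Return the registry entries that the dataset can be linked against."""
    return sorted(set(values) & registry)


column = ["B-cell", "T cell", "T cell"]
not_validated_before = validate(column, REGISTRY)   # the synonym fails validation
column = standardize(column, SYNONYMS)              # map the synonym to its canonical name
not_validated_after = validate(column, REGISTRY)    # nothing left to fix
labels = annotate(column, REGISTRY)                 # entries to link the dataset against
print(not_validated_before, not_validated_after, labels)
```

In LaminDB, the registry and synonym lookups live in the database rather than in Python dictionaries, which is what makes the annotated dataset queryable afterwards.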

In this guide we'll curate common data structures. Here is a [guide](/faq/curate-any) for the underlying low-level API.

Note: If you know either `pydantic` or `pandera`, here is an [FAQ](/faq/pydantic-pandera) that compares LaminDB with both of these tools.

In [None]:
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty

In [None]:
import lamindb as ln

ln.track("MCeA3reqZG2e")

## Schema design patterns

A {class}`~lamindb.Schema` in LaminDB is a specification that defines the expected structure, data types, and validation rules for a dataset.

Schemas ensure data consistency by defining:
- What features (columns/dimensions) should exist in your data
- What data types those features should have
- What values are valid for categorical features
- Which features are required vs optional

Key components of a schema:
```python
import bionty as bt
import lamindb as ln

schema = ln.Schema(
    name="experiment_schema",           # Human-readable name
    features=[                          # Required features
        ln.Feature(name="cell_type", dtype=bt.CellType),
        ln.Feature(name="treatment", dtype=str),
    ],
    flexible=True,                      # Allow additional features?
    otype="DataFrame"                   # Object type (DataFrame, AnnData, etc.)
)
```

For complex data structures:
```python
# AnnData with multiple "slots"; `cell_metadata_schema` and `gene_id_schema`
# are schemas that were defined beforehand
adata_schema = ln.Schema(
    otype="AnnData",
    slots={
        "obs": cell_metadata_schema,     # Cell annotations
        "var.T": gene_id_schema          # Gene features  
    }
)
```

Before diving into curation, let's understand the different schema approaches and when to use each one. Think of schemas as rules that define what valid data should look like.
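
To make "schemas as rules" concrete, here is a minimal, library-free sketch. `ToySchema` and its `check` method are hypothetical and only mimic the ideas of required features, data types, and the `flexible` switch. They are not how {class}`~lamindb.Schema` is implemented.

```python
from dataclasses import dataclass


@dataclass
class ToySchema:
    """Hypothetical stand-in for a schema: required columns with expected Python types."""

    required: dict          # column name -> expected Python type
    flexible: bool = True   # if False, columns outside `required` are reported

    def check(self, rows):
        """Return a list of problems; an empty list means the rows conform."""
        problems = []
        columns = set().union(*(row.keys() for row in rows)) if rows else set()
        for name, expected in self.required.items():
            if name not in columns:
                problems.append(f"missing column: {name}")
                continue
            for i, row in enumerate(rows):
                if not isinstance(row.get(name), expected):
                    problems.append(f"row {i}: {name} is not {expected.__name__}")
        if not self.flexible:
            for extra in sorted(columns - set(self.required)):
                problems.append(f"unexpected column: {extra}")
        return problems


rows = [
    {"cell_type": "B cell", "treatment": "DMSO", "batch": 1},
    {"cell_type": "T cell", "treatment": 5, "batch": 2},  # wrong type for treatment
]
rules = {"cell_type": str, "treatment": str}
flexible_problems = ToySchema(required=rules).check(rows)
strict_problems = ToySchema(required=rules, flexible=False).check(rows)
print(flexible_problems)
print(strict_problems)
```

The flexible variant tolerates the extra `batch` column, whereas the strict variant reports it. This is the same trade-off the schema patterns in this section let you choose between.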

### Flexible schema

Validates against any features in your existing registries.

```{eval-rst}
.. literalinclude:: scripts/define_valid_features.py
   :language: python
```

### Minimal required schema

If we'd like to curate the dataframe with a minimal set of required columns, we can use the following schema.

```{eval-rst}
.. literalinclude:: scripts/define_mini_immuno_schema_flexible.py
   :language: python
```

## DataFrame

### Step 1: Load and examine your data

We'll be working with the mini immuno dataset:

In [None]:
df = ln.core.datasets.mini_immuno.get_dataset1(
    with_cell_type_synonym=True, with_cell_type_typo=True
)
df

### Step 2: Set up your metadata registries

Before creating a schema, ensure your registries have the right features and labels:

```{eval-rst}
.. literalinclude:: scripts/define_mini_immuno_features_labels.py
   :language: python
```

### Step 3: Create your schema

In [None]:
schema = ln.core.datasets.mini_immuno.define_mini_immuno_schema_flexible()
schema.describe()

### Step 4: Initialize the curator and validate

If you expect the validation to pass, you can directly register an artifact by providing the schema:
```python
artifact = ln.Artifact.from_df(
    df, key="examples/my_curated_dataset.parquet", schema=schema
).save()
```

The {meth}`~lamindb.curators.core.Curator.validate` method validates that your dataset adheres to the criteria defined by the `schema`. It identifies which values are already validated (exist in the registries) and which are potentially problematic (do not yet exist in our registries).

In [None]:
try:
    curator = ln.curators.DataFrameCurator(df, schema)
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

### Step 5: Fix validation issues

In [None]:
# check the non-validated terms
curator.cat.non_validated

For `cell_type_by_expert`, two terms are not validated.

First, let's standardize the synonym "B-cell" as suggested:

In [None]:
curator.cat.standardize("cell_type_by_expert")

In [None]:
# now we have only one non-validated cell type left
curator.cat.non_validated

For "CD8-pos alpha-beta T cell", let's understand which cell type in the public ontology might be the actual match.

In [None]:
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup

In [None]:
# here is an example for the "cell_type_by_expert" column
cell_types = lookup["cell_type_by_expert"]
cell_types.cd8_positive_alpha_beta_t_cell
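
The attribute `cd8_positive_alpha_beta_t_cell` is an identifier-friendly form of the ontology name. The sketch below shows one way such a lookup object could be built. The normalization rule and the helpers `to_attribute_name` and `build_lookup` are assumptions made for illustration and are not necessarily what LaminDB does internally.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name):
    """Lowercase a name and turn each run of non-alphanumeric characters into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def build_lookup(names):
    """Build an object whose attributes map normalized keys to the original names."""
    return SimpleNamespace(**{to_attribute_name(name): name for name in names})


# Example names; the second is the Cell Ontology spelling the typo should be replaced with.
toy_lookup = build_lookup(["B cell", "CD8-positive, alpha-beta T cell"])
print(toy_lookup.cd8_positive_alpha_beta_t_cell)
```

Because the keys are plain attributes, tab completion in a notebook lets you find the correct spelling without typing the full ontology name.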

In [None]:
# fix the cell type name
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": cell_types.cd8_positive_alpha_beta_t_cell.name}
)

For `perturbation`, we want to add the new values "DMSO" and "IFNG":

In [None]:
# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")

In [None]:
# validate again
curator.validate()

### Step 6: Save your curated dataset

In [None]:
artifact = curator.save_artifact(key="examples/my_curated_dataset.parquet")

In [None]:
artifact.describe()

## AnnData

`AnnData`, like all other data structures that follow, is a composite structure that stores different arrays in different `slots`.

### Allow a flexible schema

We can also allow a flexible schema for an `AnnData` and only require that it's indexed with Ensembl gene IDs.

```{eval-rst}
.. literalinclude:: scripts/curate_anndata_flexible.py
   :language: python
   :caption: curate_anndata_flexible.py
```

Let's run the script.

In [None]:
!python scripts/curate_anndata_flexible.py

Under the hood, this used the following schema:

```{eval-rst}
.. literalinclude:: scripts/define_schema_anndata_ensembl_gene_ids_and_valid_features_in_obs.py
   :language: python
```

This schema transposes the `var` DataFrame during curation, so that one validates and annotates against the features of `var.T`, i.e., `[ENSG00000153563, ENSG00000010610, ENSG00000170458]`.
Without transposing, one would annotate with the features of `var`, i.e., `[gene_symbol, gene_type]`.
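
The effect of the transposition can be seen on a toy table stored as nested dictionaries. This is a plain-Python sketch, not AnnData or LaminDB code: the helpers `columns` and `transpose` are hypothetical, and the gene symbols are filled in for illustration.

```python
# Toy `var` table: one row per gene (the index), one column per gene annotation.
var = {
    "ENSG00000153563": {"gene_symbol": "CD8A", "gene_type": "protein_coding"},
    "ENSG00000010610": {"gene_symbol": "CD4", "gene_type": "protein_coding"},
    "ENSG00000170458": {"gene_symbol": "CD14", "gene_type": "protein_coding"},
}


def columns(table):
    """Column names of a table stored as {row_label: {column: value}}, in first-seen order."""
    names = []
    for row in table.values():
        for column in row:
            if column not in names:
                names.append(column)
    return names


def transpose(table):
    """Swap rows and columns of a table stored as nested dictionaries."""
    transposed = {}
    for row_label, row in table.items():
        for column, value in row.items():
            transposed.setdefault(column, {})[row_label] = value
    return transposed


var_features = columns(var)               # annotation names
var_t_features = columns(transpose(var))  # gene identifiers
print(var_features)
print(var_t_features)
```

Validating the `var.T` slot therefore checks the gene identifiers themselves, while validating `var` would check the names of the annotation columns.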

```{eval-rst}
.. image:: https://lamin-site-assets.s3.amazonaws.com/.lamindb/gLyfToATM7WUzkWW0001.png
    :width: 800px
```

### Fix validation issues

In [None]:
import lamindb as ln

In [None]:
adata = ln.core.datasets.mini_immuno.get_dataset1(
    with_gene_typo=True, with_cell_type_typo=True, otype="AnnData"
)
adata

In [None]:
schema = ln.examples.schemas.anndata_ensembl_gene_ids_and_valid_features_in_obs()
schema.describe()

Check the slots of a schema:

In [None]:
schema.slots

In [None]:
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

As above, we leverage a lookup object with valid cell types to find the correct name.

In [None]:
valid_cell_types = curator.slots["obs"].cat.lookup()["cell_type_by_expert"]
adata.obs["cell_type_by_expert"] = adata.obs[
    "cell_type_by_expert"
].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": valid_cell_types.cd8_positive_alpha_beta_t_cell.name}
)

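Before saving, we inspect the `obs` columns, add the not-yet-validated gene identifiers of the `var.T` slot to the registry, and validate again. The validated `AnnData` can then be saved as an {class}`~lamindb.Artifact`: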

In [None]:
adata.obs.columns

In [None]:
curator.slots["var.T"].cat.add_new_from("columns")

In [None]:
curator.validate()

In [None]:
artifact = curator.save_artifact(key="examples/my_curated_anndata.h5ad")

Access the schema for each slot:

In [None]:
artifact.features.slots

The saved artifact has been annotated with validated features and labels:

In [None]:
artifact.describe()

## MuData

```{eval-rst}
.. literalinclude:: scripts/curate_mudata.py
   :language: python
   :caption: curate_mudata.py
```

In [None]:
!python scripts/curate_mudata.py

## SpatialData

```{eval-rst}
.. literalinclude:: scripts/define_schema_spatialdata.py
   :language: python
   :caption: define_schema_spatialdata.py
```

In [None]:
!python scripts/define_schema_spatialdata.py

```{eval-rst}
.. literalinclude:: scripts/curate_spatialdata.py
   :language: python
   :caption: curate_spatialdata.py
```

In [None]:
!python scripts/curate_spatialdata.py

## TiledbsomaExperiment

```{eval-rst}
.. literalinclude:: scripts/curate_soma_experiment.py
   :language: python
   :caption: curate_soma_experiment.py
```

In [None]:
!python scripts/curate_soma_experiment.py

## Other data structures

If you have other data structures, read: {doc}`/faq/curate-any`.

In [None]:
!rm -rf ./test-curate
!rm -rf ./small_dataset.tiledbsoma
!lamin delete --force test-curate