# Curate datasets

Curating a dataset with LaminDB means three things:

1. Validate that the dataset matches a desired schema
2. If the dataset doesn't validate, standardize it, e.g., by fixing typos or mapping synonyms
3. Annotate the dataset by linking it against metadata entities so that it becomes queryable

In this guide we'll curate common data structures. Here is a [guide](/faq/curate-any) for the underlying low-level API.
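
The three steps listed at the top can be sketched in plain Python. This is a toy illustration with a hypothetical in-memory registry and synonym table, not the LaminDB API: the names `TOY_REGISTRY`, `TOY_SYNONYMS`, `find_non_validated`, `map_synonyms`, and `link_to_registry`, as well as the identifiers `ct-001` and `ct-002`, are made up for this example.

```python
# Hypothetical stand-ins for a metadata registry and its synonym table.
TOY_REGISTRY = {"B cell": "ct-001", "T cell": "ct-002"}
TOY_SYNONYMS = {"B-cell": "B cell", "T-cell": "T cell"}


def find_non_validated(values):
    """Step 1 (validate): return the distinct values missing from the registry."""
    return sorted({v for v in values if v not in TOY_REGISTRY})


def map_synonyms(values):
    """Step 2 (standardize): replace known synonyms, leave other values as-is."""
    return [TOY_SYNONYMS.get(v, v) for v in values]


def link_to_registry(values):
    """Step 3 (annotate): link each distinct registered value to its identifier."""
    return {v: TOY_REGISTRY[v] for v in sorted(set(values)) if v in TOY_REGISTRY}


toy_column = ["B-cell", "T cell", "B-cell"]
toy_non_validated = find_non_validated(toy_column)
toy_standardized = map_synonyms(toy_column)
toy_annotations = link_to_registry(toy_standardized)
print(toy_non_validated)
print(toy_standardized)
print(toy_annotations)
```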

In [None]:
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd

ln.track("MCeA3reqZG2e")

## DataFrame

### The simple case

In [None]:
df = ln.core.datasets.mini_immuno.get_dataset1()
df

```{eval-rst}
.. literalinclude:: scripts/curate_dataframe_flexible.py
   :language: python
```

In [None]:
!python scripts/curate_dataframe_flexible.py

Under the hood, this used the following schema:

```{eval-rst}
.. literalinclude:: scripts/define_valid_features.py
   :language: python
```

Valid features & labels were defined as:

```{eval-rst}
.. literalinclude:: scripts/define_mini_immuno_features_labels.py
   :language: python
```

### The complicated case

In [None]:
df = ln.core.datasets.mini_immuno.get_dataset1(
    with_cell_type_synonym=True, with_cell_type_typo=True
)
df

Define a schema that lists the minimal set of (required) columns we expect in the dataframe.

In [None]:
schema = ln.Schema(
    name="My immuno schema",
    features=[
        ln.Feature.get(name="perturbation"),
        ln.Feature.get(name="cell_type_by_model"),
        ln.Feature.get(name="cell_type_by_expert"),
        ln.Feature.get(name="assay_oid"),
        ln.Feature.get(name="donor"),
        ln.Feature.get(name="concentration"),
        ln.Feature.get(name="treatment_time_h"),
    ],
).save()
schema.features.df()
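
The idea of a minimal set of required columns can be illustrated without LaminDB. The sketch below assumes that columns beyond the required ones are tolerated, which is how "minimal" reads here; `missing_columns` is a hypothetical helper and `sample_note` is a made-up extra column.

```python
# Column names taken from the schema defined above.
TOY_REQUIRED_COLUMNS = [
    "perturbation",
    "cell_type_by_model",
    "cell_type_by_expert",
    "assay_oid",
    "donor",
    "concentration",
    "treatment_time_h",
]


def missing_columns(columns, required=TOY_REQUIRED_COLUMNS):
    """Return the required column names that are absent, in schema order."""
    present = set(columns)
    return [name for name in required if name not in present]


# A table that lacks "donor" and carries one extra, non-required column.
toy_columns = [
    "perturbation",
    "cell_type_by_model",
    "cell_type_by_expert",
    "assay_oid",
    "concentration",
    "treatment_time_h",
    "sample_note",
]
print(missing_columns(toy_columns))
```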

Create a `Curator` using the dataset & the schema.

In [None]:
curator = ln.curators.DataFrameCurator(df, schema)

The {meth}`~lamindb.curators.Curator.validate` method validates that your dataset adheres to the criteria defined by the `schema`. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

In [None]:
try:
    curator.validate()
except ln.errors.ValidationError:
    pass  # expected to fail here; the non-validated values are inspected below

In [None]:
# check the non-validated terms
curator.cat.non_validated

For `cell_type_by_expert`, we saw that "B-cell" and "CD8-pos alpha-beta T cell" are not validated.

First, let's standardize the synonym "B-cell" as suggested.

In [None]:
curator.cat.standardize("cell_type_by_expert")

In [None]:
# now we have only one non-validated cell type left
curator.cat.non_validated

For "CD8-pos alpha-beta T cell", let's understand which cell type in the public ontology might be the actual match.

In [None]:
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup

In [None]:
# here is an example for the "cell_type_by_expert" column
cell_types = lookup["cell_type_by_expert"]
cell_types.cd8_positive_alpha_beta_t_cell
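
The attribute name `cd8_positive_alpha_beta_t_cell` suggests that lookup keys are derived from record names by lowercasing them and replacing non-alphanumeric characters with underscores. The sketch below mimics that convention with the standard library. It is an assumption about the naming scheme, not LaminDB's implementation, and `to_lookup_key` and `build_toy_lookup` are hypothetical helpers.

```python
import re
from types import SimpleNamespace


def to_lookup_key(name):
    """Turn a display name into a lowercase, identifier-like key."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Python identifiers cannot start with a digit, so prefix an underscore.
    return f"_{key}" if key[:1].isdigit() else key


def build_toy_lookup(names):
    """Expose each name as an attribute, which enables tab completion."""
    return SimpleNamespace(**{to_lookup_key(n): n for n in names})


toy_lookup = build_toy_lookup(["B cell", "CD8-positive, alpha-beta T cell"])
print(toy_lookup.cd8_positive_alpha_beta_t_cell)
```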

In [None]:
# fix the cell type name
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": cell_types.cd8_positive_alpha_beta_t_cell.name}
)

For `perturbation`, we want to add the new values "DMSO" and "IFNG".

In [None]:
# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")
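
Conceptually, adding new values registers everything that failed validation so that a later validation pass accepts it. A toy version with a plain set as the registry might look like the following. The set is a hypothetical stand-in for a registry, `register_missing` is a made-up helper, and "PBS" is an invented pre-existing label.

```python
def register_missing(registry, values):
    """Add values that are missing from the registry and return what was added."""
    added = sorted(set(values) - registry)
    registry.update(added)
    return added


toy_label_registry = {"PBS"}  # hypothetical label that already exists
toy_added = register_missing(toy_label_registry, ["DMSO", "IFNG", "DMSO"])
print(toy_added)
print(sorted(toy_label_registry))
```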

In [None]:
# validate again
curator.validate()

Save a curated artifact.

In [None]:
artifact = curator.save_artifact(key="examples/my_curated_dataset.parquet")

In [None]:
artifact.describe()

## AnnData

### The simple case

```{eval-rst}
.. literalinclude:: scripts/curate_anndata_flexible.py
   :language: python
```

In [None]:
!python scripts/curate_anndata_flexible.py

Under the hood, this used the following schema:

```{eval-rst}
.. literalinclude:: scripts/define_schema_anndata_ensembl_gene_ids_and_valid_features_in_obs.py
   :language: python
```

This schema transposes the `var` DataFrame during curation, so that one validates and annotates the schema of `var.T`, i.e., `[ENSG00000153563, ENSG00000010610, ENSG00000170458]`.
Without transposing, one would annotate with the schema of `var`, i.e., `[gene_symbol, gene_type]`.

```{eval-rst}
.. image:: https://lamin-site-assets.s3.amazonaws.com/.lamindb/gLyfToATM7WUzkWW0001.png
    :width: 800px
```
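
To make the difference between the schema of `var` and the schema of `var.T` concrete, here is a small stand-in for `var` as a dictionary of rows keyed by Ensembl gene ID. The gene symbols and gene types are illustrative values, and `column_names` and `transpose` are hypothetical helpers written with plain dictionaries rather than pandas.

```python
# Rows of a toy `var`: index = Ensembl gene IDs, columns = gene_symbol, gene_type.
toy_var = {
    "ENSG00000153563": {"gene_symbol": "CD8A", "gene_type": "protein_coding"},
    "ENSG00000010610": {"gene_symbol": "CD4", "gene_type": "protein_coding"},
    "ENSG00000170458": {"gene_symbol": "CD14", "gene_type": "protein_coding"},
}


def column_names(table):
    """Return the column names of a row-keyed table."""
    first_row = next(iter(table.values()))
    return list(first_row)


def transpose(table):
    """Swap rows and columns of a row-keyed table."""
    transposed = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            transposed.setdefault(col_key, {})[row_key] = value
    return transposed


print(column_names(toy_var))  # columns of `var`
print(column_names(transpose(toy_var)))  # columns of `var.T`
```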

### The more complex example

In [None]:
import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSG00000198851": [7, 8, 9],
        "ENSG00000010610": [10, 11, 12],
        "ENSG00000153563": [13, 14, 15],
        "ENSG00corrupted": [16, 17, 18],
    },
    index=df.index,  # because we already curated the dataframe above, it will validate
)
adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
# define var schema
var_schema = ln.Schema(
    name="my_var_schema",
    itype=bt.Gene.ensembl_gene_id,  # identifier type
    dtype=int,
).save()

# define composite schema
anndata_schema = ln.Schema(
    name="small_dataset1_anndata_schema",
    otype="AnnData",  # object type
    slots={"obs": schema, "var": var_schema},
).save()

Check the slots of a schema:

In [None]:
anndata_schema.slots
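
A composite schema can be thought of as a mapping from slot names to per-slot checks. The sketch below is a loose analogy in plain Python with hypothetical checker functions. It does not reflect how LaminDB implements slots, and the `var` check is only a crude format test, not a lookup against a gene registry.

```python
def check_obs(obs_columns):
    """Report required obs columns that are missing."""
    required = {"perturbation", "donor"}
    return sorted(required - set(obs_columns))


def check_var(gene_ids):
    """Report identifiers that are not 'ENSG' followed by digits only."""
    return [g for g in gene_ids if not (g.startswith("ENSG") and g[4:].isdigit())]


# One check per slot, mirroring the {"obs": ..., "var": ...} layout above.
toy_slots = {"obs": check_obs, "var": check_var}


def validate_slots(slots, data):
    """Run each slot's check and keep only the slots that report problems."""
    problems = {name: check(data[name]) for name, check in slots.items()}
    return {name: found for name, found in problems.items() if found}


toy_data = {
    "obs": ["perturbation", "donor", "concentration"],
    "var": ["ENSG00000081059", "ENSG00corrupted"],
}
print(validate_slots(toy_slots, toy_data))
```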

In [None]:
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass  # expected to fail here because of the corrupted gene identifier

Subset the `AnnData` to validated genes only:

In [None]:
adata_validated = adata[:, ~adata.var.index.isin(["ENSG00corrupted"])].copy()
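
The cell above hard-codes the identifier to drop. The same column filtering can be driven by a set of validated identifiers instead. The sketch below uses plain lists rather than `AnnData`; `keep_columns` and the `toy_*` names are hypothetical, and the set of valid identifiers is assumed to come from an earlier validation step.

```python
def keep_columns(matrix, column_ids, valid_ids):
    """Drop the columns of a row-major matrix whose ID is not in valid_ids."""
    mask = [cid in valid_ids for cid in column_ids]
    kept_ids = [cid for cid, keep in zip(column_ids, mask) if keep]
    kept_rows = [[v for v, keep in zip(row, mask) if keep] for row in matrix]
    return kept_rows, kept_ids


toy_ids = ["ENSG00000081059", "ENSG00000276977", "ENSG00corrupted"]
toy_matrix = [[1, 4, 16], [2, 5, 17], [3, 6, 18]]
toy_valid = {"ENSG00000081059", "ENSG00000276977"}  # assumed to have validated
toy_kept_rows, toy_kept_ids = keep_columns(toy_matrix, toy_ids, toy_valid)
print(toy_kept_ids)
print(toy_kept_rows)
```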

Now let's validate the subsetted object:

In [None]:
curator = ln.curators.AnnDataCurator(adata_validated, anndata_schema)
curator.validate()

The validated `AnnData` can be subsequently saved as an {class}`~lamindb.Artifact`:

In [None]:
artifact = curator.save_artifact(key="examples/my_curated_anndata.h5ad")

Access the schema for each slot:

In [None]:
artifact.features.slots

The saved artifact has been annotated with validated features and labels:

In [None]:
artifact.describe()

If you need more control, you can access `DataFrameCurator` objects for the `"var"` and `"obs"` slots, respectively.

In [None]:
curator.slots

In [None]:
# revert the previous cell type standardization
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
    {"B cell": "B-cell"}
)
# an AnnData where a cell type matches a synonym
adata_with_synonym = ad.AnnData(X=adata_validated.X, var=adata_validated.var, obs=df)
adata_with_synonym

In [None]:
curator = ln.curators.AnnDataCurator(adata_with_synonym, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass  # expected to fail here because "B-cell" is a synonym, not a validated name

In [None]:
curator.slots["obs"].cat.standardize("cell_type_by_expert")

In [None]:
curator.validate()

## MuData

```{eval-rst}
.. literalinclude:: scripts/curate-mudata.py
   :language: python
```

## SpatialData

```{eval-rst}
.. literalinclude:: scripts/define_schema_spatialdata.py
   :language: python
   :caption: define_schema_spatialdata.py
```

In [None]:
!python scripts/define_schema_spatialdata.py

```{eval-rst}
.. literalinclude:: scripts/curate_spatialdata.py
   :language: python
   :caption: curate_spatialdata.py
```

In [None]:
!python scripts/curate_spatialdata.py

## Other data structures

If you have other data structures, read: {doc}`/faq/curate-any`.

In [None]:
!rm -rf ./test-curate
!lamin delete --force test-curate