# Curate DataFrames and AnnDatas

Curating a dataset with LaminDB means three things:

1. Validate that the dataset matches a desired schema
2. Standardize the dataset if it doesn't validate, e.g., by fixing typos or mapping synonyms
3. Annotate the dataset by linking it against metadata entities so that it becomes queryable
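
To make the three steps concrete, here is a minimal sketch in plain Python. It does not use LaminDB: `REGISTRY`, `SYNONYMS`, and the three helper functions are hypothetical stand-ins that only illustrate the idea of checking values against known entities, mapping synonyms, and linking a dataset to the entities it contains.

```python
# Toy stand-ins for a registry; these names are illustrative, not LaminDB objects.
REGISTRY = {"astrocyte", "oligodendrocyte"}
SYNONYMS = {"astrocytic glia": "astrocyte"}


def validate(values):
    """Return the values that are not found in the registry."""
    return [v for v in values if v not in REGISTRY]


def standardize(values):
    """Map known synonyms to their canonical names; leave other values as-is."""
    return [SYNONYMS.get(v, v) for v in values]


def annotate(dataset_key, values):
    """Link a dataset to the registry entries it contains so it can be queried."""
    return {dataset_key: sorted(set(values) & REGISTRY)}


cell_types = ["astrocytic glia", "oligodendrocyte"]
print(validate(cell_types))  # the synonym is flagged as not validated
cell_types = standardize(cell_types)
print(validate(cell_types))  # nothing is left to fix
print(annotate("my_dataset", cell_types))
```

The rest of this guide performs the same steps with real registries and a schema.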

## Curate a DataFrame

In [None]:
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty

Let's start with a DataFrame that we'd like to validate.

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "perturbation": pd.Categorical(["DMSO", "IFNG", "DMSO"]),
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": pd.Categorical(
            [
                "cerebral pyramidal neuron",
                "astrocytic glia",
                "oligodendrocyte",
            ]
        ),
        "assay_ontology_id": pd.Categorical(
            ["EFO:0008913", "EFO:0008913", "EFO:0008913"]
        ),
        "donor": ["D0001", "D0002", None],
    },
    index=["obs1", "obs2", "obs3"],
)
df

Define a schema to validate this dataset.

In [None]:
schema = ln.Schema(
    name="My example schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="temperature", dtype=float).save(),
        ln.Feature(name="cell_type", dtype=bt.CellType).save(),
        ln.Feature(
            name="assay_ontology_id", dtype=bt.ExperimentalFactor.ontology_id
        ).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()
# display the associated features as a dataframe
schema.features.df()
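
Conceptually, each feature in a schema constrains one column: its name, its data type, and whether missing values are allowed. The following plain-Python sketch illustrates that idea under simplifying assumptions. `FeatureSpec` and `check_column` are hypothetical helpers, not part of LaminDB, and they ignore categorical features that are validated against registries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical, simplified stand-in for a schema feature."""

    name: str
    dtype: type
    nullable: bool


def check_column(spec, values):
    """Return a list of human-readable problems for one column."""
    problems = []
    for i, value in enumerate(values):
        if value is None:
            if not spec.nullable:
                problems.append(f"{spec.name}[{i}] is null but the feature is not nullable")
        elif not isinstance(value, spec.dtype):
            problems.append(
                f"{spec.name}[{i}] has type {type(value).__name__}, "
                f"expected {spec.dtype.__name__}"
            )
    return problems


donor = FeatureSpec("donor", str, nullable=True)
temperature = FeatureSpec("temperature", float, nullable=False)

print(check_column(donor, ["D0001", "D0002", None]))  # the missing donor is allowed
print(check_column(temperature, [37.2, "36.3", 38.2]))  # the string value is reported
```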

Create a `Curator` from the dataset and the schema.

In [None]:
curator = ln.curators.DataFrameCurator(df, schema)

The {meth}`~lamindb.curators.Curator.validate` method validates that your dataset adheres to the criteria defined by the `schema`. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

In [None]:
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

In [None]:
# check the non-validated terms
curator.cat.non_validated

For `cell_type`, "cerebral pyramidal neuron" and "astrocytic glia" are not validated.

First, let's standardize the synonym "astrocytic glia" as suggested:

In [None]:
curator.cat.standardize("cell_type")

In [None]:
# now we have only one non-validated cell type left
curator.cat.non_validated

For "cerebral pyramidal neuron", let's understand which cell type in the public ontology might be the actual match.

In [None]:
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup

In [None]:
# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
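
If browsing a lookup object is impractical, fuzzy string matching can suggest candidates for a misspelled or incomplete name. The sketch below uses only the standard library's `difflib`. The list `ontology_names` is a tiny, hand-picked assumption standing in for the names of a public ontology, and this is not how LaminDB resolves terms.

```python
from difflib import get_close_matches

# A tiny, hand-picked list of names; a real ontology has thousands of terms.
ontology_names = [
    "astrocyte",
    "oligodendrocyte",
    "cerebral cortex pyramidal neuron",
    "T cell",
]

# Rank names by similarity to the non-validated value and keep the best few.
candidates = get_close_matches(
    "cerebral pyramidal neuron", ontology_names, n=3, cutoff=0.6
)
print(candidates)
```

A suggestion found this way still needs a manual check against the ontology before the value is renamed.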

In [None]:
# fix the cell type
df.cell_type = df.cell_type.cat.rename_categories(
    {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name}
)

For `perturbation`, we want to add the new values "DMSO" and "IFNG":

In [None]:
# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")

In [None]:
# validate again
curator.validate()

Save a curated artifact.

In [None]:
artifact = curator.save_artifact(key="my_datasets/my_curated_dataset.parquet")

In [None]:
artifact.describe()

## Curate an AnnData

Here we additionally define a schema for the variables, which specifies the identifier type that the `var` index is validated against.

In [None]:
import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSG00000198851": [7, 8, 9],
        "ENSG00000010610": [10, 11, 12],
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18],
    },
    index=df.index,  # because we already curated the dataframe above, it will validate
)
adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
# define var schema
var_schema = ln.Schema(
    name="my_var_schema",
    itype=bt.Gene.ensembl_gene_id,  # identifier type
    dtype=int,
).save()

# define composite schema
anndata_schema = ln.Schema(
    name="small_dataset1_anndata_schema",
    otype="AnnData",  # object type
    components={"obs": schema, "var": var_schema},
).save()
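
A composite schema can be pictured as a mapping from slot names to per-slot checks, each applied to the matching part of the object. The sketch below illustrates that idea in plain Python. The names `KNOWN_GENE_IDS`, `check_obs`, `check_var`, and the dictionary layout are assumptions for illustration, not LaminDB internals.

```python
# Toy registry of gene identifiers (illustrative only).
KNOWN_GENE_IDS = {"ENSG00000081059", "ENSG00000198851"}
REQUIRED_OBS_COLUMNS = ["perturbation", "cell_type"]


def check_obs(columns):
    """Report required columns that are missing from the observations."""
    return [name for name in REQUIRED_OBS_COLUMNS if name not in columns]


def check_var(identifiers):
    """Report variable identifiers that are not in the toy registry."""
    return [gene_id for gene_id in identifiers if gene_id not in KNOWN_GENE_IDS]


# Hypothetical composite schema: slot name -> check for that slot.
composite = {"obs": check_obs, "var": check_var}

# Hypothetical dataset: slot name -> the names found in that slot.
dataset = {
    "obs": ["perturbation", "cell_type", "donor"],
    "var": ["ENSG00000081059", "ENSGcorrupted"],
}

# Run every slot's check on its part of the dataset.
report = {slot: check(dataset[slot]) for slot, check in composite.items()}
print(report)
```

Keeping the checks per slot means a problem in the variables can be reported and fixed without touching the observations.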

In [None]:
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
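
Some invalid identifiers can be spotted by a simple format check before any registry is involved. The sketch below assumes human Ensembl gene IDs without version suffixes, which consist of the prefix `ENSG` followed by 11 digits. It is only a pre-check: a well-formed ID that is absent from the gene registry still fails validation by the curator.

```python
import re

# Human Ensembl gene IDs: the prefix "ENSG" followed by 11 digits.
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG\d{11}")

var_index = [
    "ENSG00000081059",
    "ENSG00000276977",
    "ENSG00000198851",
    "ENSG00000010610",
    "ENSG00000153563",
    "ENSGcorrupted",
]

# Collect identifiers that do not match the expected format.
malformed = [
    gene_id for gene_id in var_index if not ENSEMBL_HUMAN_GENE.fullmatch(gene_id)
]
print(malformed)
```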

Subset the `AnnData` to validated genes only:

In [None]:
adata_validated = adata[:, ~adata.var.index.isin(["ENSGcorrupted"])].copy()

Now let's validate the subsetted object:

In [None]:
curator = ln.curators.AnnDataCurator(adata_validated, anndata_schema)
curator.validate()

The validated `AnnData` can then be saved as an {class}`~lamindb.Artifact`:

In [None]:
artifact = curator.save_artifact(key="my_datasets/my_curated_anndata.h5ad")

The saved artifact has been annotated with validated features and labels:

In [None]:
artifact.describe()

## Standardize an AnnData

If you need more control, you can access the underlying `"var"` and `"obs"` `DataFrameCurator` objects directly.

In [None]:
curator.slots["var"]
curator.slots["obs"]

In [None]:
# revert the previous cell type standardization
df["cell_type"] = df["cell_type"].cat.rename_categories(
    {"astrocyte": "astrocytic glia"}
)
# an AnnData where a cell type matches a synonym
adata_with_synonym = ad.AnnData(X=adata_validated.X, var=adata_validated.var, obs=df)
adata_with_synonym

In [None]:
curator = ln.curators.AnnDataCurator(adata_with_synonym, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)

In [None]:
curator.slots["obs"].cat.standardize("cell_type")

In [None]:
curator.validate()

## Summary

We've validated, standardized, and annotated datasets in these key steps:

1. Defining validation criteria
2. Validating data against existing registries
3. Adding new validated entries to registries
4. Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If your datasets are neither DataFrame-like nor AnnData-like, see {doc}`curate-any`.

In [None]:
!rm -rf ./test-curate
!lamin delete --force test-curate