# Curate DataFrame and AnnData objects

When we talk about "curating datasets", we typically mean three distinct actions:

1. Validate: ensure a dataset meets predefined validation criteria
2. Standardize: transform a dataset so that it meets validation criteria, e.g., by fixing typos or using standardized identifiers
3. Annotate: link a dataset against metadata records
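
To make the three actions concrete, here is a minimal sketch in plain pandas. It does not use LaminDB; the `registered_cell_types` set is a hypothetical stand-in for a registry, and the whitespace and lowercase cleanup is just one assumed example of standardization.

```python
import pandas as pd

# Hypothetical registry contents; in practice these live in a database
registered_cell_types = {"astrocyte", "oligodendrocyte"}

cells = pd.Series(["astrocyte", "Astrocyte ", "oligodendrocyte"], name="cell_type")

# 1. Validate: check each value for membership in the registry
is_valid = cells.isin(registered_cell_types)

# 2. Standardize: fix formatting so that values match registered names
standardized = cells.str.strip().str.lower()

# 3. Annotate: record which registered labels the dataset uses
annotations = sorted(set(standardized) & registered_cell_types)
```

Only the second value fails validation before standardization; afterwards, every value maps to a registered label.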

## Key Concepts

- **Registries** store valid metadata records. For instance, if the string `"Experiment 1"` is registered as the `name` of a `ULabel` record, it passes validation against `ULabel.name`.

- **Artifacts** are the data objects that you manage with LaminDB. They can be validated and curated against metadata records from registries.

In [None]:
#! pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

## Validate and standardize metadata from a DataFrame

Let's start with a DataFrame that we'd like to validate:

In [None]:
df = pd.DataFrame({
    "temperature": [37.2, 36.3, 38.2],
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df

First, let's define the validation criteria:

In [None]:
# define validation criteria for categorical variables
# in the dictionary, each key is a column name of the dataframe, and each value is a registry field onto which values are mapped
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

# create a Curate object to guide validation and annotation
# this object will use our DataFrame and the defined categorical criteria
curate = ln.Curate.from_df(df, categoricals=categoricals)

The `validate()` method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are new or potentially problematic.

In [None]:
curate.validate()
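
Conceptually, validation splits the values of each column into those found in a registry and those that are not. The following plain-pandas sketch illustrates that idea under stated assumptions: the `registries` dictionary and the `non_validated` helper are hypothetical and are not part of LaminDB.

```python
import pandas as pd

df = pd.DataFrame({
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "donor": ["D0001", "D0002", "DOOO3"],
})

# Hypothetical registry contents, keyed by column name
registries = {
    "cell_type": {"astrocyte", "oligodendrocyte"},
    "donor": set(),  # an empty registry: nothing validates yet
}

def non_validated(frame, registries):
    """Return, per column, the unique values that are missing from its registry."""
    report = {}
    for column, known in registries.items():
        unique_values = frame[column].unique()
        report[column] = [value for value in unique_values if value not in known]
    return report

report = non_validated(df, registries)
# The dataset is fully validated only if no column has missing values
all_valid = not any(report.values())
```

A report like this shows, column by column, which values need to be either corrected or registered.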

## Register new metadata values

If you see any "non-validated" values, you'll need to decide whether to add them to your registries or correct them in your data.

Because our database instance is still empty, we'll add the values to the registries defined in the validation criteria.

In [None]:
# this adds assays that were validated (via a public ontology)
curate.add_validated_from("assay_ontology_id")

In [None]:
# this adds cell types that were _not_ validated
curate.add_new_from("cell_type")

In [None]:
# use a lookup object to get the correct spelling of categories from a public reference
lookup = curate.lookup("public")
lookup

In [None]:
cell_types = lookup[df.cell_type.name]
cell_types.cerebral_cortex_pyramidal_neuron
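
The attribute-style access above relies on names being turned into valid Python identifiers. The sketch below shows one way such a lookup could be built; `make_lookup` is a hypothetical helper, its normalization rule is an assumption, and it returns plain strings rather than records, so it is not how LaminDB implements `lookup()`.

```python
import re
from types import SimpleNamespace

def make_lookup(names):
    """Build an attribute-style lookup from a list of names (hypothetical helper)."""
    entries = {}
    for name in names:
        # Turn the name into an identifier-like key, e.g. spaces become underscores
        key = re.sub(r"\W+", "_", name.strip().lower())
        entries[key] = name
    return SimpleNamespace(**entries)

cell_types = make_lookup(["cerebral cortex pyramidal neuron", "astrocyte"])
canonical = cell_types.cerebral_cortex_pyramidal_neuron
```

Attribute access makes it easy to discover the standardized spelling through tab completion instead of typing the string by hand.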

In [None]:
# curate the cell type
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
# register validated cell types
curate.add_validated_from(df.cell_type.name)
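
The `replace` call is ordinary pandas. As a small standalone illustration, with the standardized name written out as a string literal instead of taken from the lookup object:

```python
import pandas as pd

cell_type = pd.Series(["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"])

# Map the non-standard label to the standardized name; other values are unchanged
fixed = cell_type.replace(
    {"cerebral pyramidal neuron": "cerebral cortex pyramidal neuron"}
)
```

`replace` returns a new Series, which is why the tutorial assigns the result back to `df.cell_type`.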

In [None]:
# register non-validated donors
curate.add_new_from(df.donor.name)

In [None]:
# validate again
validated = curate.validate()
validated

## Validate an AnnData object

Here we additionally specify, via `var_index`, the registry field to validate the variable index against.

In [None]:
df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
curate = ln.Curate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals=categoricals,
    organism="human",
)
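
For an AnnData object built from a DataFrame, the variable index corresponds to the DataFrame's columns. The following plain-pandas sketch shows what validating that index against gene symbols amounts to; the `registered_symbols` set and the `GENE_X` column are hypothetical and only serve as an illustration.

```python
import pandas as pd

X = pd.DataFrame(
    {"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "GENE_X": [7, 8, 9]},
    index=["obs1", "obs2", "obs3"],
)

# Hypothetical set of registered gene symbols
registered_symbols = {"TCF7", "PDCD1", "CD3E", "CD4", "CD8A"}

# Check every column name (the future variable index) against the symbols
var_is_valid = X.columns.isin(registered_symbols)
unknown_genes = X.columns[~var_is_valid].tolist()
```

Here, `unknown_genes` collects the column names that would not validate as gene symbols.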

In [None]:
curate.validate()

In [None]:
curate.add_validated_from("all")

In [None]:
curate.validate()

## Save a curated artifact

The validated object can then be saved as an {class}`~lamindb.Artifact`:

In [None]:
artifact = curate.save_artifact(description="test AnnData")

Validated features and labels are linked to the artifact:

In [None]:
artifact.describe()

We've walked through validating, standardizing, and annotating datasets in these key steps:

1. Defining validation criteria
2. Validating data against existing registries
3. Adding validated and new entries to registries
4. Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren't DataFrame-like or AnnData-like, read: {doc}`curate-any`.