# Validate, standardize & annotate

This guide focuses on three crucial aspects of data management:

1. Validation: Ensuring your data meets predefined criteria
2. Standardization: Conforming data to consistent formats and terminologies
3. Annotation: Enriching data with metadata for improved organization and analysis

## Key Concepts

- **Registries**: In LaminDB, registries are collections of validated metadata. They serve as the "source of truth" for your data annotations. For instance, if "Experiment 1" has been registered as the `name` of a `ULabel` record, it is a validated value for field `ULabel.name`.

- **Artifacts**: These are the data objects that you manage with LaminDB. Artifacts can be annotated with validated metadata from registries.

- **Annotation**: The process of attaching metadata to your data objects, enhancing their context and searchability.
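
To make these three concepts concrete, here is a minimal pure-Python sketch. It is not LaminDB's implementation: `ToyArtifact` and `ulabel_names` are hypothetical stand-ins, and the registry is assumed to be a plain in-memory set.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory stand-in for a registry: a set of validated names.
ulabel_names = {"Experiment 1", "Experiment 2"}


@dataclass
class ToyArtifact:
    """Illustrative data object; not LaminDB's Artifact class."""

    description: str
    labels: set[str] = field(default_factory=set)

    def annotate(self, label: str) -> None:
        # Only values that exist in the registry may be attached.
        if label not in ulabel_names:
            raise ValueError(f"{label!r} is not a validated ULabel name")
        self.labels.add(label)


artifacts = [ToyArtifact("run A"), ToyArtifact("run B")]
artifacts[0].annotate("Experiment 1")

# Annotation makes artifacts searchable by their validated labels.
hits = [a.description for a in artifacts if "Experiment 1" in a.labels]
print(hits)  # ['run A']
```

Rejecting labels that are missing from the registry is what keeps annotations consistent across data objects.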

In this guide, we'll walk through validating, standardizing, and annotating a `DataFrame` and an `AnnData` object.

```{toctree}
:maxdepth: 1
:hidden:

annotate-flexible
```

Install the `lamindb` Python package:
```shell
pip install 'lamindb[bionty]'
```

In [None]:
!lamin init --storage ./test-annotate --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

## Validate and standardize metadata from a DataFrame

Let's start with a DataFrame that we'd like to validate:

In [None]:
df = pd.DataFrame({
    "temperature": [37.2, 36.3, 38.2],
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df

First, let's define the validation criteria:

In [None]:
# define validation criteria for categorical variables
# each key is a column name, and each value is the registry field to validate against
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

# create an Annotate object to guide validation and annotation
# this object will use our DataFrame and the defined categorical criteria
annotate = ln.Annotate.from_df(df, categoricals=categoricals)

The `validate()` method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are new or potentially problematic.

In [None]:
annotate.validate()
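
Conceptually, validation is a membership check of each column's values against its registry. The sketch below illustrates that idea in plain Python. The registry contents are hypothetical (the instance in this guide starts out empty), and `find_non_validated` is an illustrative helper, not part of LaminDB.

```python
# Hypothetical registry contents; a real instance would query its database.
registries = {
    "cell_type": {"astrocyte", "oligodendrocyte"},
    "donor": {"D0001", "D0002"},
}

columns = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "donor": ["D0001", "D0002", "DOOO3"],
}


def find_non_validated(columns, registries):
    """Return, per column, the unique values that are absent from its registry."""
    result = {}
    for name, values in columns.items():
        # dict.fromkeys removes duplicates while preserving order
        missing = [v for v in dict.fromkeys(values) if v not in registries[name]]
        if missing:
            result[name] = missing
    return result


non_validated = find_non_validated(columns, registries)
print(non_validated)
# {'cell_type': ['cerebral pyramidal neuron'], 'donor': ['DOOO3']}
```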

## Curate and register new metadata labels

If you see any "non-validated" entries, you'll need to decide whether to add them to your registries or correct them in your data.
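
One way to inform that decision is to look for near matches among known names. The sketch below uses the standard library's `difflib` for this, which is a swapped-in technique for illustration and not what LaminDB does internally. The `known` names are assumed example values.

```python
from difflib import get_close_matches

# Hypothetical reference names; a real check would draw on a public ontology
# or on the registry itself.
known = {
    "cell_type": ["cerebral cortex pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "donor": ["D0001", "D0002"],
}
non_validated = {
    "cell_type": "cerebral pyramidal neuron",
    "donor": "DOOO3",
}

# For each non-validated value, keep the single best match above the cutoff.
suggestions = {
    column: get_close_matches(value, known[column], n=1, cutoff=0.6)
    for column, value in non_validated.items()
}
print(suggestions)
# {'cell_type': ['cerebral cortex pyramidal neuron'], 'donor': []}
```

A close match suggests correcting the value in the data. An empty list means no near match exists, so the value is either genuinely new or needs manual review.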

Our current database instance is empty. Once you have populated its registries, you will rarely need to save new labels; you'll mostly use your LaminDB instance to validate and annotate incoming data.

In [None]:
# this adds assays that were validated via the public ontology
annotate.add_validated_from("assay_ontology_id")

In [None]:
# this adds cell types that were validated via the public ontology
annotate.add_validated_from("cell_type")

In [None]:
# use a lookup object to get the correct spelling of categories
# pass "public" to look them up in the public reference
lookup = annotate.lookup("public")
lookup

In [None]:
cell_types = lookup[df.cell_type.name]
cell_types.cerebral_cortex_pyramidal_neuron

In [None]:
# curate the cell type: replace the non-standard name with the public reference name
df.cell_type = df.cell_type.replace(
    {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name}
)
# register validated cell types
annotate.add_validated_from(df.cell_type.name)
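
The lookup-and-replace step above can be pictured with a toy lookup object. This is only a sketch of the idea: `build_lookup` is a hypothetical helper, the reference names are assumed, and LaminDB's own lookup returns records rather than plain strings.

```python
import re
from types import SimpleNamespace


def build_lookup(names):
    """Expose each name under an attribute-safe key, e.g. 'T cell' -> 't_cell'."""
    keys = {re.sub(r"\W+", "_", name).strip("_").lower(): name for name in names}
    return SimpleNamespace(**keys)


# Hypothetical reference names standing in for a public ontology.
cell_types = build_lookup(["cerebral cortex pyramidal neuron", "astrocyte"])

# Map the non-standard spelling to the reference name, leave others untouched.
corrections = {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron}
values = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]
standardized = [corrections.get(v, v) for v in values]
print(standardized)
# ['cerebral cortex pyramidal neuron', 'astrocyte', 'oligodendrocyte']
```

Attribute access with tab completion avoids retyping long ontology names, which is where spelling errors tend to creep in.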

In [None]:
# register non-validated donors
annotate.add_new_from(df.donor.name)

In [None]:
# validate again
validated = annotate.validate()
validated

## Validate an AnnData object

Here we additionally specify `var_index`, the registry field against which the variable index (here, gene symbols) is validated.

In [None]:
df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
annotate = ln.Annotate.from_anndata(
    adata, 
    var_index=bt.Gene.symbol,
    categoricals=categoricals, 
    organism="human",
)

In [None]:
annotate.validate()

In [None]:
annotate.add_validated_from("all")

In [None]:
annotate.validate()
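
The validate, add, validate sequence above follows a simple pattern: a first check fails against an empty registry, values recognized by a public reference are registered, and a second check passes. A minimal sketch of that pattern for the variable index, with assumed stand-ins for both the registry and the public reference:

```python
# Hypothetical, initially empty registry of gene symbols.
gene_symbols = set()

# Stand-in for a public reference; a real one would be an ontology.
public_symbols = {"TCF7", "PDCD1", "CD3E", "CD4", "CD8A"}
var_index = ["TCF7", "PDCD1", "CD3E", "CD4", "CD8A"]


def validate(values, registry):
    """Return True only if every value is present in the registry."""
    return all(v in registry for v in values)


before = validate(var_index, gene_symbols)

# Register only the values that the public reference recognizes.
gene_symbols.update(v for v in var_index if v in public_symbols)

after = validate(var_index, gene_symbols)
print(before, after)  # False True
```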

## Save an annotated artifact

The validated object can then be saved as an {class}`~lamindb.Artifact`:

In [None]:
artifact = annotate.save_artifact(description="test AnnData")

Validated features and labels are linked to the artifact:

In [None]:
artifact.describe()

We've walked through the process of validating, standardizing, and annotating data using LaminDB. Key steps include:

1. Defining validation criteria
2. Validating data against existing registries
3. Adding new validated entries to registries
4. Annotating data objects (Artifacts) with validated metadata

By following these steps, you can ensure your data is clean, standardized, and well-annotated, setting a strong foundation for further analysis and collaboration.

If your datasets are in other formats, see [Validate, standardize & annotate data of flexible formats](./annotate-flexible).