# Validate, standardize & annotate

This guide walks you through the following flow:
1. Define validation criteria
2. Validate & standardize metadata
3. Save validated & annotated artifacts

:::{dropdown} How do we validate metadata?

Registries in your database define the "truth" for metadata.

For instance, if "Experiment 1" has been registered as the `name` of a `ULabel` record, it is a validated value for field `ULabel.name`.

:::
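
The idea described in the dropdown can be pictured as a membership check against the values stored in a registry field. The sketch below uses a plain Python set as a stand-in for `ULabel.name`. Both `demo_registry` and `split_validated` are hypothetical names made up for illustration and are not part of lamindb, which queries the database instead.

```python
# Hypothetical stand-in for the values registered under ULabel.name.
demo_registry = {"Experiment 1", "Experiment 2"}


def split_validated(values, registered):
    """Split values into those found in the registry and those that are not."""
    found = [v for v in values if v in registered]
    missing = [v for v in values if v not in registered]
    return found, missing


found, missing = split_validated(["Experiment 1", "Experiment 3"], demo_registry)
print(found)    # ['Experiment 1']
print(missing)  # ['Experiment 3']
```

Values that land in `missing` are the ones that would need to be corrected or registered before the column validates.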

```{toctree}
:maxdepth: 1
:hidden:

can-validate
annotate-for-developers
```

Install the `lamindb` Python package with the `bionty` extra:
```shell
pip install 'lamindb[bionty]'
```

In [None]:
!lamin init --storage ./test-annotate --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

## Validate and standardize metadata from a DataFrame

Let's start with a DataFrame that we'd like to validate:

In [None]:
df = pd.DataFrame({
    "temperature": [37.2, 36.3, 38.2],
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df

In [None]:
# define validation criteria for the categoricals
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}
# create an object to guide validation and annotation
annotate = ln.Annotate.from_df(df, categoricals=categoricals)
# validate
annotate.validate()

## Register new metadata labels

Our current database instance is empty. Once you have populated its registries, you will rarely need to save new labels. You'll mostly use your lamindb instance to validate incoming data and annotate it.
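
Why registration becomes rare can be sketched in plain Python: once a label is in the registry, later datasets that reuse it validate without further action. This is only a conceptual illustration. The in-memory `demo_registry` set and the helpers `find_missing` and `register` are hypothetical and do not correspond to lamindb functions.

```python
# Hypothetical in-memory registry; a real instance stores records in a database.
demo_registry = set()


def find_missing(values, registered):
    """Return the unique values that are absent from the registry, in input order."""
    return [v for v in dict.fromkeys(values) if v not in registered]


def register(values, registered):
    """Add new values to the registry (similar in spirit to saving new labels)."""
    registered.update(values)


first_batch = ["D0001", "D0002"]
missing_first = find_missing(first_batch, demo_registry)  # everything is new
register(missing_first, demo_registry)

second_batch = ["D0002", "D0001", "D0002"]
missing_second = find_missing(second_batch, demo_registry)  # nothing left to register
print(missing_first)   # ['D0001', 'D0002']
print(missing_second)  # []
```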

In [None]:
# this adds assays that were validated via the public ontology
annotate.add_validated_from("assay_ontology_id")

In [None]:
# this adds cell types that were validated via the public ontology
annotate.add_validated_from("cell_type")

In [None]:
# use a lookup object to get the correct spelling of categories from the public reference
# pass "public" to look up terms in the public reference
lookup = annotate.lookup("public")
lookup

In [None]:
cell_types = lookup[df.cell_type.name]

In [None]:
cell_types.cerebral_cortex_pyramidal_neuron
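
A lookup object of this kind can be pictured as a namespace whose attribute names are identifier-friendly versions of the registered names, so that tab completion reveals the canonical spelling. The following is only a rough sketch of that idea: `to_key`, `build_lookup`, and the two-entry `demo_names` list are hypothetical, and lamindb's actual key normalization may differ.

```python
import re
from types import SimpleNamespace


def to_key(name):
    """Turn a display name into a lowercase, underscore-separated attribute key."""
    return re.sub(r"\W+", "_", name.strip().lower()).strip("_")


def build_lookup(names):
    """Expose each name as an attribute of a namespace, keyed by its normalized form."""
    return SimpleNamespace(**{to_key(n): n for n in names})


# Illustrative names only; a real lookup is populated from the reference ontology.
demo_names = ["cerebral cortex pyramidal neuron", "astrocyte"]
demo_lookup = build_lookup(demo_names)
print(demo_lookup.cerebral_cortex_pyramidal_neuron)  # cerebral cortex pyramidal neuron
```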

In [None]:
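# replace the misspelled name with the canonical name from the lookup
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})

# add the corrected cell type, which can now be validated via the public ontology
annotate.add_validated_from(df.cell_type.name)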

In [None]:
# register non-validated terms
annotate.add_new_from(df.donor.name)

In [None]:
# validate again
validated = annotate.validate()
validated

## Validate an AnnData object

Here we specify the field to validate the `var` index against (via `var_index`) and the `obs` columns to validate (via `categoricals`).

In [None]:
df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
annotate = ln.Annotate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals=categoricals,
    organism="human",
)

In [None]:
annotate.validate()

In [None]:
# add validated terms for all keys at once
annotate.add_validated_from("all")

In [None]:
annotate.validate()

## Save an annotated artifact

The validated object can then be saved as an {class}`~lamindb.Artifact`:

In [None]:
artifact = annotate.save_artifact(description="test AnnData")

Validated features and labels are linked to the artifact:

In [None]:
artifact.describe()

## Annotate artifacts in flexible formats

In [None]:
# Let's try to annotate an image file
data_path = ln.core.datasets.file_jpg_paradisi05()
data_path

In [None]:
annotate = ln.Annotate()

In [None]:
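# by default, only features that are already validated are saved
features = ["experiment", "project", "tissue"]
annotate.save_features(features, field=ln.Feature.name, slot="external")

In [None]:
# pass validated_only=False to also save features that are not yet validated
annotate.save_features(features, field=ln.Feature.name, slot="external", validated_only=False)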

In [None]:
annotate.save_labels(["Experiment 001"], field=ln.ULabel.name, feature="experiment", validated_only=False)
annotate.save_labels(["Project 001"], field=ln.ULabel.name, feature="project", validated_only=False)
annotate.save_labels(["UBERON:0000948"], field=bt.Tissue.ontology_id, feature="tissue")

In [None]:
artifact = annotate.save_artifact(data_path, description="a metadata-annotated image")
artifact.describe()