# Validate, standardize & annotate

We'll walk you through the following flow:
1. define validation criteria
2. validate & standardize metadata
3. save validated & annotated artifacts

:::{dropdown} How do we validate metadata?

Registries in your database define the "truth" for metadata.

For instance, if "Experiment 1" has been registered as the `name` of a `ULabel` record, it is a validated value for field `ULabel.name`.

:::
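
To make the idea concrete, here is a minimal, library-free sketch of registry-based validation. It is not how lamindb implements it: the set `registered_ulabel_names` and the helper `split_by_registry` are hypothetical stand-ins for a registry field and a validation call.

```python
# Hypothetical stand-in for a registry field: the names already registered
# (think of the `name` field of `ULabel`).
registered_ulabel_names = {"Experiment 1", "Experiment 2"}


def split_by_registry(values, registry):
    """Return (known, unknown) lists, preserving the input order."""
    known = [v for v in values if v in registry]
    unknown = [v for v in values if v not in registry]
    return known, unknown


known, unknown = split_by_registry(
    ["Experiment 1", "Experiment 3"], registered_ulabel_names
)
print(known)    # ['Experiment 1']
print(unknown)  # ['Experiment 3']
```

A value counts as validated only because it is already present in the registry; nothing about the value itself is inspected.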

```{toctree}
:maxdepth: 1
:hidden:

can-validate
annotate-for-developers
```

In [None]:
!lamin init --storage ./test-annotate --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

ln.settings.verbosity = "hint"

Let's start with a DataFrame that we'd like to validate:

In [None]:
df = pd.DataFrame({
    "temperature": [37.2, 36.3, 38.2],
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df

## Validate and standardize metadata

In [None]:
# define validation criteria for the categoricals
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}
# create an object to guide validation and annotation
annotate = ln.Annotate.from_df(df, categoricals=categoricals)
# validate
validated = annotate.validate()
validated

## Validate using registries in another instance

Sometimes you want to validate against registries that already exist in another instance, for example [cellxgene](https://lamin.ai/laminlabs/cellxgene).

Passing that instance via `using` also lets you transfer values that are currently missing from your own registries directly from it.

In [None]:
annotate = ln.Annotate.from_df(
    df, 
    categoricals=categoricals,
    using="laminlabs/cellxgene",  # pass the instance slug
)
annotate.validate()
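
Conceptually, validating against a second instance adds a fallback registry to check. The sketch below illustrates that idea with plain Python sets; it makes no claim about lamindb's internals, and `local_cell_types`, `other_instance_cell_types`, and `transfer_missing` are hypothetical names made up for illustration.

```python
# Hypothetical stand-ins: two registries modeled as sets of cell type names.
local_cell_types = {"astrocyte"}
other_instance_cell_types = {"astrocyte", "oligodendrocyte", "microglial cell"}


def transfer_missing(values, local, other):
    """Copy values found in `other` but not in `local` into `local`.

    Returns the values that are in neither registry.
    """
    still_missing = []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        if value in local:
            continue  # already validated locally
        if value in other:
            local.add(value)  # "transfer" the value from the other registry
        else:
            still_missing.append(value)
    return still_missing


observed = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]
still_missing = transfer_missing(
    observed, local_cell_types, other_instance_cell_types
)
print(sorted(local_cell_types))  # ['astrocyte', 'oligodendrocyte']
print(still_missing)             # ['cerebral pyramidal neuron']
```

Values found in neither registry stay non-validated and have to be fixed or registered by hand, which the next section covers.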

## Register new metadata labels

Our current database instance is empty. Once you have populated its registries, you will rarely need to save new labels. You'll mostly use your lamindb instance to validate and annotate incoming data.

In [None]:
annotate.add_validated_from(df.cell_type.name)

In [None]:
# use a lookup object to get the correct spelling of categories
# by default, the lookup uses the instance passed via `using` above
# call annotate.lookup("public") to use the public reference instead
lookup = annotate.lookup()
lookup

In [None]:
cell_types = lookup[df.cell_type.name]

In [None]:
cell_types.cerebral_cortex_pyramidal_neuron

In [None]:
# replace the non-validated name with the correctly spelled one from the lookup
df.cell_type = df.cell_type.replace(
    {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name}
)

annotate.add_validated_from(df.cell_type.name)
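
The lookup-based fix above relies on registered names being reachable as auto-completable attributes. As a rough illustration, assuming names are simply lower-cased and non-alphanumeric runs become underscores, such an object could be built as follows. The helper `make_lookup` is hypothetical and is not the lamindb implementation, and it returns plain strings rather than records.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Build an object whose attributes are identifier-style versions of names."""
    entries = {}
    for name in names:
        # lower-case, then collapse runs of non-alphanumerics into underscores
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        entries[key] = name
    return SimpleNamespace(**entries)


demo_lookup = make_lookup(["cerebral cortex pyramidal neuron", "astrocyte"])

# replace a misspelled value with the canonical name taken from the lookup
values = ["cerebral pyramidal neuron", "astrocyte"]
corrections = {
    "cerebral pyramidal neuron": demo_lookup.cerebral_cortex_pyramidal_neuron
}
fixed = [corrections.get(v, v) for v in values]
print(fixed)  # ['cerebral cortex pyramidal neuron', 'astrocyte']
```

Taking the replacement from the lookup rather than typing it by hand guarantees that it matches a registered name exactly.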

In [None]:
# register non-validated terms
annotate.add_new_from(df.donor.name)

In [None]:
# validate again
validated = annotate.validate()
validated

## Validate an AnnData object

Here we specify the field to validate the variable index against (`var_index`) and the fields to validate the `obs` columns against (`categoricals`).

In [None]:
df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
annotate = ln.Annotate.from_anndata(
    adata, 
    var_index=bt.Gene.symbol,
    categoricals=categoricals, 
    organism="human",
)

In [None]:
annotate.validate()

In [None]:
annotate.add_validated_from("all")

In [None]:
annotate.validate()

## Save an artifact

Once validated, the object can be saved as an {class}`~lamindb.Artifact`:

In [None]:
artifact = annotate.save_artifact(description="test AnnData")

In [None]:
artifact.describe()

## Save a collection

Register a new collection for the registered artifact:

In [None]:
# register a new collection
collection = annotate.save_collection(
    artifact,  # registered artifact above, can also pass a list of artifacts
    name="Experiment X in brain",  # title of the publication
    description="10.1126/science.xxxxx",  # DOI of the publication
    reference="E-MTAB-xxxxx",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)

In [None]:
collection.artifacts.df()