# Validate, standardize & annotate data of flexible formats

Our [previous guide](./annotate) explained how to validate, standardize & annotate `DataFrame` and `AnnData` objects. In this guide, we'll look at the basic APIs that let you work with data in any format.

LaminDB makes it easy to validate categorical variables based on registries ({class}`~lamindb.core.CanValidate`).

:::{dropdown} How do I validate based on a public ontology?

{class}`~lamindb.core.CanValidate` methods validate against the registries in your LaminDB instance.
In {doc}`./bio-registries`, you'll see how to extend standard validation to validation against _public references_ using a `ReferenceTable` ontology object: `public = Registry.public()`.
By default, {meth}`~lamindb.core.Registry.from_values` considers a match in a public reference a validated value for any {mod}`bionty` entity.

:::

:::{dropdown} What to do for non-validated values?

Be aware that in a _freshly initialized instance_, nothing validates because no records have been registered yet.
Run `inspect` to get instructions on how to register non-validated values. You may need to standardize your values, fix typos, or simply register them.

:::

## Setup

Install the `lamindb` Python package:
```shell
pip install 'lamindb[bionty]'
```

In [None]:
!lamin init --storage ./test-annotate-flexible --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd

ln.settings.verbosity = "info"

Pre-populate registries:

In [None]:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.Artifact.from_df(df, description="test data").save()
ln.ULabel(name="Project A").save()
ln.ULabel(name="Project B").save()
ln.save(bt.Disease.from_values(["MONDO:0004975", 'MONDO:0004980'], field=bt.Disease.ontology_id))

In [None]:
import zarr
import numpy as np

z = zarr.create((10,), dtype=[('value', 'f8'), ("gene", "U15"), ('disease', 'U16')], store='data.zarr')
z["gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703", "ENSG00000157764", "ENSG00000171862", "ENSG00000091831", "ENSG00000141736", "ENSG00000133056", "ENSG00000146648", "ENSG00000118523"]
z["disease"] = np.random.choice(['MONDO:0004975', 'MONDO:0004980'], 10)

## Define validation criteria

Any entity that doesn't have its own dedicated registry ("is not typed") can be validated & registered using {class}`~lamindb.ULabel`:

In [None]:
criteria = {
    "disease": bt.Disease.ontology_id,
    "project": ln.ULabel.name,
    "gene": bt.Gene.ensembl_gene_id,
}
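
To illustrate how a mapping from metadata keys to reference values can drive validation, here is a minimal, library-free sketch. It does not call LaminDB: the sets in `reference_values` are hypothetical stand-ins for registries, and `find_non_validated` is an illustrative helper, not part of any API.

```python
# Conceptual sketch only: plain sets stand in for registries (an assumption).
reference_values = {
    "disease": {"MONDO:0004975", "MONDO:0004980"},
    "project": {"Project A", "Project B"},
}

# Hypothetical metadata to check; "MONDO:9999999" is a made-up identifier.
metadata = {
    "disease": ["MONDO:0004975", "MONDO:9999999"],
    "project": ["Project A"],
}


def find_non_validated(metadata, reference_values):
    """Return, per key, the values that have no exact match in the reference."""
    report = {}
    for key, accepted in reference_values.items():
        missing = [v for v in metadata.get(key, []) if v not in accepted]
        if missing:
            report[key] = missing
    return report


report = find_non_validated(metadata, reference_values)
print(report)  # {'disease': ['MONDO:9999999']}
```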

## Validate and standardize metadata

{meth}`~lamindb.core.CanValidate.validate` validates passed values against reference values in a registry.
It returns a boolean vector indicating whether a value has an exact match in the reference values.

In [None]:
bt.Disease.validate(z["disease"], field=bt.Disease.ontology_id)
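
The semantics described above (one boolean per value, `True` only on an exact match) can be pictured with a small stand-alone sketch. This is not LaminDB's implementation; `validate_sketch` and the example values are assumptions made for illustration.

```python
# Conceptual sketch only: NOT how lamindb implements validation.
def validate_sketch(values, reference):
    """Return one boolean per value: True if it exactly matches a reference value."""
    reference_set = set(reference)  # set membership keeps each lookup cheap
    return [value in reference_set for value in values]


# Hypothetical registry content and input values
registered_ids = ["MONDO:0004975", "MONDO:0004980"]
observed = ["MONDO:0004975", "mondo:0004980", "MONDO:0004980"]

flags = validate_sketch(observed, registered_ids)
print(flags)  # [True, False, True]
```

Because matching is exact in this sketch, the lowercase `mondo:0004980` does not validate.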

When validation fails, you can call {meth}`~lamindb.core.CanValidate.inspect` to figure out what to do.

{meth}`~lamindb.core.CanValidate.inspect` applies the same definition of validation as {meth}`~lamindb.core.CanValidate.validate`, but returns a richer result, an {class}`~lamindb.core.InspectResult`. Most importantly, it logs recommended curation steps that would render the data validated.

Note: you can use {meth}`~lamindb.core.CanValidate.standardize` to standardize synonyms.

In [None]:
bt.Disease.inspect(z["disease"], field=bt.Disease.ontology_id);
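
As a rough mental model of how inspecting and standardizing relate, the sketch below splits values into validated and non-validated ones and flags those that a synonym lookup could fix. It is a simplification under stated assumptions: `inspect_sketch`, `standardize_sketch`, the reference list and the synonym table are all hypothetical and do not reflect LaminDB internals.

```python
# Conceptual sketch only: names and data below are hypothetical.
def inspect_sketch(values, reference, synonyms):
    """Split unique values into validated / non-validated and flag fixable ones."""
    reference_set = set(reference)
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        if value in reference_set:
            validated.append(value)
        else:
            non_validated.append(value)
    # A non-validated value is "standardizable" if its synonym target is in the reference
    standardizable = [v for v in non_validated if synonyms.get(v) in reference_set]
    return {
        "validated": validated,
        "non_validated": non_validated,
        "standardizable": standardizable,
    }


def standardize_sketch(values, synonyms):
    """Replace known synonyms by their standard name; leave other values untouched."""
    return [synonyms.get(value, value) for value in values]


reference = ["Alzheimer disease"]            # hypothetical registered names
synonyms = {"AD": "Alzheimer disease"}       # hypothetical synonym table
values = ["AD", "Alzheimer disease", "unknown condition"]

result = inspect_sketch(values, reference, synonyms)
standardized = standardize_sketch(values, synonyms)
print(result["non_validated"])  # ['AD', 'unknown condition']
print(standardized)  # ['Alzheimer disease', 'Alzheimer disease', 'unknown condition']
```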

Follow the suggestions to register new labels.

Bulk-creating records with {meth}`~lamindb.core.Registry.from_values` returns only validated records:

Note: Terms validated against a public reference are also created by `.from_values`; see {doc}`/bio-registries` for details.

In [None]:
diseases = bt.Disease.from_values(z["disease"], field=bt.Disease.ontology_id)
ln.save(diseases)

Repeat the process for the other labels:

In [None]:
projects = ln.ULabel.from_values(
    ["Project A", "Project B"], 
    field=ln.ULabel.name, 
    create=True, # create non-existing labels
)
ln.save(projects)

In [None]:
genes = bt.Gene.from_values(z["gene"], field=bt.Gene.ensembl_gene_id)
ln.save(genes)
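
The behavior used in the cells above (records are returned only for validated values unless `create=True` is passed) can be summarized in a short library-free sketch. It is an illustration under assumptions, not LaminDB code: `from_values_sketch` is a hypothetical helper, and plain strings stand in for records.

```python
# Conceptual sketch only: strings stand in for records (an assumption).
def from_values_sketch(values, registered, public_reference=frozenset(), create=False):
    """Return the unique values for which a record would be returned."""
    records = []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        is_validated = value in registered or value in public_reference
        if is_validated or create:
            records.append(value)
    return records


registered = {"Project A"}              # hypothetical registry content
values = ["Project A", "Project C"]     # "Project C" is not registered

only_validated = from_values_sketch(values, registered)
with_create = from_values_sketch(values, registered, create=True)
print(only_validated)  # ['Project A']
print(with_create)     # ['Project A', 'Project C']
```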

## Annotate and register dataset with validated metadata

Register the dataset as an artifact:

In [None]:
artifact = ln.Artifact("data.zarr", description="a zarr object").save()

Link the artifact to validated labels through features:

In [None]:
try:
    artifact.features.add_values({"project": projects, "disease": diseases})
except Exception as e:
    print(e)

Follow the suggestion to register new features:

In [None]:
ln.Feature(name='project', dtype='cat[ULabel]').save()
ln.Feature(name='disease', dtype='cat[bionty.Disease]').save();

In [None]:
artifact.features.add_values({"project": projects[0], "disease": diseases[0]})

In [None]:
artifact.describe()

Since genes are the measurements, we register them as features:

In [None]:
feature_set = ln.FeatureSet(genes)
feature_set.save()

In [None]:
artifact.features.add_feature_set(feature_set, slot="gene")

In [None]:
artifact.describe()

In [None]:
# clean up test instance
!lamin delete --force test-annotate-flexible
!rm -r test-annotate-flexible