# Curate datasets of any format

The [previous guide](./curate) explained how to validate, standardize, and annotate `DataFrame` and `AnnData` objects. This guide walks through the basic API that lets you curate data of any format.

:::{dropdown} How do I validate based on a public ontology?

LaminDB makes it easy to validate categorical variables based on registries that inherit from {class}`~lamindb.core.CanCurate`.

{class}`~lamindb.core.CanCurate` methods validate against the registries in your LaminDB instance.
In {doc}`./bio-registries`, you'll see how to extend standard validation to validation against _public references_ using a `ReferenceTable` ontology object: `public = Record.public()`.
By default, {meth}`~lamindb.core.CanCurate.from_values` considers a match in a public reference a validated value for any {mod}`bionty` entity.

:::

In [None]:
# !pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import zarr
import numpy as np

data = zarr.create((10,), dtype=[('value', 'f8'), ("gene", "U15"), ('disease', 'U16')], store='data.zarr')
data["gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703", "ENSG00000157764", "ENSG00000171862", "ENSG00000091831", "ENSG00000141736", "ENSG00000133056", "ENSG00000146648", "ENSG00000118523"]
data["disease"] = np.random.choice(['MONDO:0004975', 'MONDO:0004980'], 10)
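
The compound dtype passed to `zarr.create` follows NumPy's structured-dtype conventions, so each named field can be read or assigned like a column. Here is a minimal NumPy-only sketch of that layout, assuming a shortened three-row array (the name `records` and its values are illustrative and not part of the dataset above):

```python
import numpy as np

# Same field layout as the zarr array above: one float and two fixed-width strings.
dtype = [("value", "f8"), ("gene", "U15"), ("disease", "U16")]

# Illustrative three-row array; fields start out as 0.0 / empty strings.
records = np.zeros(3, dtype=dtype)

# Assigning to a field name fills that column for all rows.
records["gene"] = ["ENSG00000139618", "ENSG00000141510", "ENSG00000133703"]
records["disease"] = ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"]

gene_column = records["gene"]  # one field across all rows
first_row = records[0]         # all fields of a single row
print(records.dtype.names)
print(first_row["gene"], first_row["disease"])
```

Note that the `U15` width fits an Ensembl gene ID exactly; longer strings would be truncated silently on assignment.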

## Define validation criteria

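Map each field to the registry field it should be validated against. Entities that don't have a dedicated registry (that is, entities that are not typed), such as `project` below, can be validated and registered using {class}`~lamindb.ULabel`: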

In [None]:
criteria = {
    "disease": bt.Disease.ontology_id,
    "project": ln.ULabel.name,
    "gene": bt.Gene.ensembl_gene_id,
}
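
To illustrate how a field-to-reference mapping like this can drive curation, here is a minimal sketch that swaps the registries for plain Python sets, so it runs without a LaminDB instance. The names `toy_criteria`, `toy_columns`, and `find_unvalidated`, as well as all values, are hypothetical stand-ins and not part of the LaminDB API:

```python
# Hypothetical stand-ins for registries: field name -> set of allowed values.
toy_criteria = {
    "disease": {"MONDO:0004975", "MONDO:0004980"},
    "project": {"Project A", "Project B"},
}

# Hypothetical metadata columns to check against the criteria.
toy_columns = {
    "disease": ["MONDO:0004975", "MONDO:0000001"],
    "project": ["Project A", "Project C"],
}


def find_unvalidated(columns, criteria):
    """Return, per field, the values without an exact match in its reference set."""
    return {
        field: [value for value in columns[field] if value not in allowed]
        for field, allowed in criteria.items()
    }


unvalidated = find_unvalidated(toy_columns, toy_criteria)
print(unvalidated)
```

In the guide itself, the per-field check is performed by each registry's `validate` method, as shown in the next section.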

## Validate and standardize metadata

{meth}`~lamindb.core.CanCurate.validate` validates passed values against reference values in a registry.
It returns a boolean vector indicating, for each value, whether it has an exact match in the reference values.

In [None]:
bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
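
To make the shape of that return value concrete, here is a minimal NumPy sketch of exact-match validation. It is an illustration of the idea only, not LaminDB's implementation; `reference_ids` and `values` are hypothetical toy arrays:

```python
import numpy as np

# Hypothetical reference values, standing in for a registry field.
reference_ids = np.array(["MONDO:0004975", "MONDO:0004980"])

# Values to check: one exact match, one unknown ID, one with the wrong case.
values = np.array(["MONDO:0004975", "MONDO:9999999", "mondo:0004980"])

# One boolean per input value; matching is exact, hence case-sensitive.
is_validated = np.isin(values, reference_ids)
print(is_validated)

# The mask can be inverted to select the values that still need curation.
needs_curation = values[~is_validated]
print(needs_curation)
```

Because the result is aligned with the input, it can be used directly as a mask on the original values.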

When validation fails, you can call {meth}`~lamindb.core.CanCurate.inspect` to figure out what to do.

{meth}`~lamindb.core.CanCurate.inspect` applies the same definition of validation as {meth}`~lamindb.core.CanCurate.validate`, but returns a richer result, an {class}`~lamindb.core.InspectResult` object. Most importantly, it logs recommended curation steps that would render the data validated.

Note: you can use {meth}`~lamindb.core.CanCurate.standardize` to standardize synonyms.

In [None]:
bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id);
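
The relationship between inspecting and standardizing can be sketched in plain Python. This is a toy model under stated assumptions, not how LaminDB implements either method: `toy_inspect`, the reference set, and the synonym table are all hypothetical.

```python
def toy_inspect(values, reference, synonyms):
    """Split values into validated / non-validated and suggest synonym fixes."""
    validated = [v for v in values if v in reference]
    non_validated = [v for v in values if v not in reference]
    # Only suggest a fix when the synonym maps to a value in the reference.
    suggestions = {
        v: synonyms[v]
        for v in non_validated
        if v in synonyms and synonyms[v] in reference
    }
    return validated, non_validated, suggestions


# Hypothetical toy data: reference names and a synonym table.
reference = {"Alzheimer disease", "atopic eczema"}
synonyms = {"Alzheimer's disease": "Alzheimer disease"}
values = ["Alzheimer disease", "Alzheimer's disease", "unknown disease"]

validated, non_validated, suggestions = toy_inspect(values, reference, synonyms)

# Standardizing replaces known synonyms and leaves everything else untouched.
standardized = [suggestions.get(v, v) for v in values]
print(standardized)
```

Values that are neither in the reference nor a known synonym, like `"unknown disease"` here, remain non-validated after standardization and need to be registered or corrected by hand.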

Following the suggestions logged by `inspect`, register the new labels. Bulk-creating records with {meth}`~lamindb.core.CanCurate.from_values` only returns validated records.

Note: Terms validated against a public reference are also created by `.from_values`; see {doc}`/bio-registries` for details.

In [None]:
diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id)
ln.save(diseases)

Repeat the process for more labels:

In [None]:
projects = ln.ULabel.from_values(
    ["Project A", "Project B"],
    field=ln.ULabel.name,
    create=True,  # create non-existing labels rather than attempting to load them from the database
)
ln.save(projects)

In [None]:
genes = bt.Gene.from_values(data["gene"], field=bt.Gene.ensembl_gene_id)
ln.save(genes)

## Annotate and save dataset with validated metadata

Register the dataset as an artifact:

In [None]:
artifact = ln.Artifact("data.zarr", description="a zarr object").save()

Link the artifact to validated labels. You could do this directly, e.g., via `artifact.ulabels.add(projects)` or `artifact.diseases.add(diseases)`.

Often, however, you also want to track which feature each label is a value of. Hence, let's try to associate our labels with features:

In [None]:
from lamindb.core.exceptions import ValidationError

try:
    artifact.features.add_values({"project": projects, "disease": diseases})
except ValidationError as e:
    print(e)

This raised an error because we hadn't yet registered the features. After copying the suggested code from the error message, things work out:

In [None]:
ln.Feature(name='project', dtype='cat[ULabel]').save()
ln.Feature(name='disease', dtype='cat[bionty.Disease]').save()
artifact.features.add_values({"project": projects, "disease": diseases})
artifact.features

Since genes are the measurements, we register them as features:

In [None]:
feature_set = ln.FeatureSet(genes)
feature_set.save()
artifact.features.add_feature_set(feature_set, slot="genes")
artifact.describe()

In [None]:
# clean up test instance
!lamin delete --force test-curate-any
!rm -r data.zarr