# Validate & standardize for developers

LaminDB makes it easy to validate categorical variables based on registries ({class}`~lamindb.core.CanValidate`).

:::{dropdown} How do I validate based on a public ontology?

{class}`~lamindb.core.CanValidate` methods validate against the registries in your LaminDB instance.
In {doc}`./bio-registries`, you'll see how to extend standard validation to validation against _public references_ using a `ReferenceTable` ontology object: `public = Registry.public()`.
By default, {meth}`~lamindb.core.Registry.from_values` considers a match in a public reference a validated value for any {mod}`bionty` entity.

:::

:::{dropdown} What should I do with non-validated values?

In a _freshly initialized instance_, nothing validates because no records have been registered yet.
Run `inspect` to get instructions on how to register non-validated values. You may need to standardize your values, fix typos, or simply register them.

:::

## Setup

Install the `lamindb` Python package:
```shell
pip install 'lamindb[bionty]'
```

In [None]:
!lamin init --storage ./test-validate --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd

In [None]:
ln.settings.verbosity = "info"

Pre-populate registries:

In [None]:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.Artifact.from_df(df, description="test data").save()
ln.ULabel(name="Project A").save()
ln.ULabel(name="Project B").save()
bt.Disease.from_public(ontology_id="MONDO:0004975").save()

## Standard validation

### Name duplication

Creating a record with a `name` that already exists automatically returns the existing record instead of a duplicate:

In [None]:
ln.ULabel(name="Project A")

Bulk creating records using {meth}`~lamindb.core.Registry.from_values` only returns validated records:

Note: Terms that validate against a public reference are also created by `.from_values`; see {doc}`/bio-registries` for details.

In [None]:
projects = ["Project A", "Project B", "Project D", "Project E"]
ln.ULabel.from_values(projects)

(Versioned records also account for `version` in addition to `name`. Also see: [idempotency](faq/idempotency).)

### Data duplication

Creating an artifact or collection with the same content automatically returns the existing record:

In [None]:
ln.Artifact.from_df(df, description="same data")
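
As a rough mental model of content-based deduplication, here is a plain-Python sketch that looks up records by a hash of their content. It is not LaminDB's implementation: the `registry` dictionary, the `get_or_create` helper, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical in-memory stand-in for an artifact registry, keyed by content hash.
registry = {}


def get_or_create(content: bytes, description: str) -> dict:
    """Return the record registered for this content, creating it only if missing."""
    digest = hashlib.sha256(content).hexdigest()  # assumed hash function
    if digest in registry:
        # Same content was seen before: hand back the existing record unchanged.
        return registry[digest]
    record = {"hash": digest, "description": description}
    registry[digest] = record
    return record


first = get_or_create(b"A,B\n1,2\n", "test data")
second = get_or_create(b"A,B\n1,2\n", "same data")
print(second is first)
```

In this sketch the second call returns the first record, so the original description is kept and the new description is ignored.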

### Schema-based validation

Type checks, constraint checks, and [Django validators](https://docs.djangoproject.com/en/4.2/ref/validators/) can be configured in the [schema](https://lamin.ai/docs/schemas).

## Registry-based validation

{meth}`~lamindb.core.CanValidate.validate` validates passed values against reference values in a registry.
It returns a boolean vector indicating whether a value has an exact match in the reference values.
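
Conceptually, exact-match validation is a membership test against the reference values. The following plain-Python sketch illustrates the idea only; the helper `validate_exact` and the reference set are hypothetical and do not reflect how LaminDB queries its registries.

```python
def validate_exact(values, reference):
    """Return one boolean per value: True if it exactly matches a reference value."""
    reference_set = set(reference)  # set membership keeps each lookup cheap
    return [value in reference_set for value in values]


# Hypothetical reference values standing in for the name field of a registry.
reference_names = {"Alzheimer disease", "Parkinson disease"}
flags = validate_exact(
    ["Alzheimer disease", "Alzheimer's disease", "AD"], reference_names
)
print(flags)
```

Only the first value is an exact match, so synonyms and abbreviations fail validation until they are standardized.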

### Using dedicated registries

For instance, {mod}`bionty` provides types for basic biological entities: every entity has its own registry, a Python class.
By default, the first string field is used for validation. For {class}`~bionty.Disease`, it's `name`:

In [None]:
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = bt.Disease.validate(diseases)
validated

Validate against a non-default field:

In [None]:
bt.Disease.validate(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], bt.Disease.ontology_id
)

### Using the `ULabel` registry

Any entity that doesn't have its own dedicated registry ("is not typed") can be validated & registered using {class}`~lamindb.ULabel`:

In [None]:
ln.ULabel.validate(["Project A", "Project B", "Project C"])

## Inspect & standardize

When validation fails, you can call {meth}`~lamindb.core.CanValidate.inspect` to figure out what to do.

{meth}`~lamindb.core.CanValidate.inspect` applies the same definition of validation as {meth}`~lamindb.core.CanValidate.validate`, but returns a richer result, an {class}`~lamindb.core.InspectResult`. Most importantly, it logs the recommended curation steps that would make the data pass validation.

In [None]:
result = bt.Disease.inspect(diseases)

In this case, it suggests calling {meth}`~lamindb.core.CanValidate.standardize` to standardize synonyms:

In [None]:
bt.Disease.standardize(result.non_validated)
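
The idea behind synonym standardization can be sketched in plain Python as a lookup from synonym to standardized name. The helper `standardize_synonyms` and the synonym table below are hypothetical and only illustrate the concept, not the library's implementation.

```python
def standardize_synonyms(values, synonyms):
    """Replace each known synonym with its standardized name; keep other values as-is."""
    return [synonyms.get(value, value) for value in values]


# Hypothetical synonym table mapping synonyms to a standardized name.
synonym_to_name = {
    "Alzheimer's disease": "Alzheimer disease",
    "AD": "Alzheimer disease",
}
standardized = standardize_synonyms(
    ["Alzheimer's disease", "AD", "unknown term"], synonym_to_name
)
print(standardized)
```

Values without a known synonym pass through unchanged, so they still need to be fixed or registered separately.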

For more, see {doc}`./bio-registries`.

## Extend registries

Sometimes, we simply want to register new records to extend the content of registries:

In [None]:
result = ln.ULabel.inspect(projects)

In [None]:
new_labels = [ln.ULabel(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels

## Validate features

When calling `Artifact.from_...` and `Collection.from_...`, features are automatically validated.
Validated features are grouped in "feature sets" indexed by "slots".
For a basic example, see {doc}`/tutorial2`.

For an overview of data formats used to model different data types, see {doc}`docs:by-datatype`.

## Bulk validation

In [None]:
# clean up test instance
!lamin delete --force test-validate
!rm -r test-validate