# Validate data

To make data more reusable, LaminDB validates categorical variables against registries ({class}`~lamindb.dev.CanValidate`) in addition to standard data validation. Data validation also includes curating non-validated data and amending registries.

:::{dropdown} What does "validated" mean?

A **validated value** refers to a _registered record_ of your working LaminDB instance. It is a _categorical_ value that exists in a specified _field_ (defaults to "name") of a _registry_.

For instance, if "Experiment 1" has been registered as the "name" of a `ULabel` record, it is a validated value for `ULabel.name`.

The {class}`~lamindb.dev.CanValidate` methods {meth}`~lamindb.dev.CanValidate.validate`, {meth}`~lamindb.dev.CanValidate.inspect`, and {meth}`~lamindb.dev.CanValidate.standardize`, as well as {meth}`~lamindb.dev.Registry.from_values`, primarily take two arguments: "values" and "field". The argument "values" takes an iterable of input categorical values, and the argument "field" takes a typed field of a registry.

:::

:::{dropdown} What does "Bionty-validated" mean?

{class}`~lamindb.dev.CanValidate` methods validate against the _in-house reference_, that is, the records in your instance.

For {doc}`./bio-registries`, you can additionally validate against _public references_ using the {meth}`~lnschema_bionty.dev.BioRegistry.bionty` object: `bionty_object = Registry.bionty()`

Note that {meth}`~lamindb.dev.Registry.from_values` is aware of Bionty.

:::

:::{dropdown} What to do for non-validated values?

Be aware that in a _newly initialized instance_, nothing is validated because no records have been registered yet.

Run `inspect` to get instructions on how to register non-validated values. You may need to standardize your values, fix typos, or simply register them.

:::

## Setup

In [None]:
!lamin init --storage ./test-validate --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

In [None]:
ln.settings.verbosity = "info"

Pre-populate registries:

In [None]:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.File(df, description="test data").save()
ln.ULabel(name="Project A").save()
ln.ULabel(name="Project B").save()
lb.Disease.from_bionty(ontology_id="MONDO:0004975").save()

## Standard validation

### Name duplication

Creating a record with a `name` that is already registered automatically returns the existing record:

In [None]:
ln.ULabel(name="Project A")
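
The behavior above resembles a "get or create" lookup keyed by name. The following is a minimal in-memory sketch of that idea, not LaminDB's implementation: `Label`, `LabelRegistry`, and `get_or_create` are hypothetical names, and unlike the `ULabel` constructor, this toy version also stores new records immediately.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Label:
    """Hypothetical stand-in for a registry record."""

    name: str


@dataclass
class LabelRegistry:
    """Toy in-memory registry keyed by name (an assumption, not LaminDB code)."""

    records: Dict[str, Label] = field(default_factory=dict)

    def get_or_create(self, name: str) -> Label:
        # Only create a record when the name is not registered yet.
        if name not in self.records:
            self.records[name] = Label(name)
        # Otherwise hand back the record that already exists.
        return self.records[name]


registry = LabelRegistry()
first = registry.get_or_create("Project A")
second = registry.get_or_create("Project A")
same_record = first is second  # the second call did not create a duplicate
```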

Bulk-creating records with {meth}`~lamindb.dev.Registry.from_values` returns only validated records:

Note: Bionty-validated terms are also created with `.from_values`; see {doc}`/bio-registries` for details.

In [None]:
projects = ["Project A", "Project B", "Project D", "Project E"]
ln.ULabel.from_values(projects)

(For versioned records, `version` is taken into account in addition to `name`. See also: [idempotency](faq/idempotency).)

### Data duplication

Creating a file or dataset with the same content automatically returns the existing record:

In [None]:
ln.File(df, description="same data")
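
One common way to detect duplicated content is to key records by a hash of their bytes. The sketch below illustrates that idea under the assumption that content is compared via a SHA-256 digest; `FileStore` and `register` are hypothetical names, and this is not a description of how LaminDB computes or stores hashes.

```python
import hashlib
from typing import Dict


def content_hash(data: bytes) -> str:
    """Return a hex digest that identifies the content."""
    return hashlib.sha256(data).hexdigest()


class FileStore:
    """Toy store that deduplicates records by content hash (an assumption)."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, dict] = {}

    def register(self, data: bytes, description: str) -> dict:
        key = content_hash(data)
        existing = self._by_hash.get(key)
        if existing is not None:
            # Same content: return the existing record, ignore the new description.
            return existing
        record = {"hash": key, "description": description}
        self._by_hash[key] = record
        return record


store = FileStore()
payload = b"A,B\n1,2\n"
original = store.register(payload, "test data")
duplicate = store.register(payload, "same data")
```

In this sketch, `duplicate` is the very same object as `original`, so the description stays "test data".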

### Schema-based validation

Type checks, constraint checks, and [Django validators](https://docs.djangoproject.com/en/4.2/ref/validators/) can be configured in the [schema](https://lamin.ai/docs/schemas).

## Registry-based validation

{meth}`~lamindb.dev.CanValidate.validate` validates passed values against reference values in a registry.

It returns a boolean vector indicating whether a value has an exact match in the reference values.
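
Conceptually, exact-match validation is a membership test of each value against the reference values. A minimal sketch, assuming a plain list of registered names (the function `validate_exact` and the reference list are illustrative, not part of LaminDB):

```python
from typing import Iterable, List


def validate_exact(values: Iterable[str], reference: Iterable[str]) -> List[bool]:
    """Return one boolean per value: True if it exactly matches a reference value."""
    known = set(reference)
    return [value in known for value in values]


# Hypothetical reference values standing in for a registry field.
registered_names = ["Alzheimer disease"]
flags = validate_exact(
    ["Alzheimer disease", "Alzheimer's disease", "AD"], registered_names
)
# flags == [True, False, False]
```

Because the match is exact, close variants and abbreviations do not validate.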

### Using dedicated registries

For instance, {mod}`lnschema_bionty` types basic biological entities: every entity has its own registry, a Python class.

By default, the first string field is used for validation. For {class}`~lnschema_bionty.Disease`, it's `name`:

In [None]:
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = lb.Disease.validate(diseases)
validated

Validate against a non-default field:

In [None]:
lb.Disease.validate(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], lb.Disease.ontology_id
)
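
Choosing a field amounts to choosing which attribute of the records supplies the reference values. The following sketch assumes records are plain dictionaries; `validate_field` is a hypothetical helper and the single record is illustrative data, not a query against a real registry.

```python
from typing import Dict, Iterable, List

# Illustrative stand-in for registered records.
records: List[Dict[str, str]] = [
    {"name": "Alzheimer disease", "ontology_id": "MONDO:0004975"},
]


def validate_field(
    values: Iterable[str], records: List[Dict[str, str]], field: str = "name"
) -> List[bool]:
    """Validate values against the chosen field of the records."""
    known = {record[field] for record in records}
    return [value in known for value in values]


by_id = validate_field(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"],
    records,
    field="ontology_id",
)
# by_id == [True, False, False]
```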

### Using the `ULabel` registry

Any entity that doesn't have a dedicated registry ("is not typed") can be validated and registered using {class}`~lamindb.ULabel`:

In [None]:
ln.ULabel.validate(["Project A", "Project B", "Project C"])

## Inspect & standardize

When validation fails, you can call {meth}`~lamindb.dev.CanValidate.inspect` to figure out what to do.

{meth}`~lamindb.dev.CanValidate.inspect` applies the same definition of validation as {meth}`~lamindb.dev.CanValidate.validate`, but returns a richer result, an {class}`~lamindb.dev.InspectResult`. Most importantly, it logs recommended curation steps that would render the data validated.

In [None]:
result = lb.Disease.inspect(diseases)

In this case, it suggests calling {meth}`~lamindb.dev.CanValidate.standardize` to standardize synonyms:

In [None]:
lb.Disease.standardize(result.non_validated)
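
Standardizing synonyms can be pictured as a lookup that maps each known synonym to its standardized name and leaves unknown values unchanged. A minimal sketch, assuming a hand-written synonym table (the table and `standardize_names` are hypothetical, not the data or code used by the registry):

```python
from typing import Dict, Iterable, List

# Hypothetical synonym table: synonym -> standardized name.
synonyms: Dict[str, str] = {
    "Alzheimer's disease": "Alzheimer disease",
    "AD": "Alzheimer disease",
}


def standardize_names(values: Iterable[str], synonyms: Dict[str, str]) -> List[str]:
    """Replace known synonyms with standardized names; keep other values as-is."""
    return [synonyms.get(value, value) for value in values]


standardized = standardize_names(["Alzheimer's disease", "AD", "unknown term"], synonyms)
# standardized == ["Alzheimer disease", "Alzheimer disease", "unknown term"]
```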

For more, see {doc}`./bio-registries`.

## Extend registries

Sometimes, we simply want to register new records to extend the content of registries:

In [None]:
result = ln.ULabel.inspect(projects)

In [None]:
new_labels = [ln.ULabel(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels
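
The inspect-then-register workflow can be sketched in plain Python as: split the values into validated and non-validated, add the non-validated ones to the registry, then inspect again. `InspectSketch` and `inspect_values` are hypothetical names, and the registry is assumed to be a simple set of names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class InspectSketch:
    """Toy result object holding both partitions of the input values."""

    validated: List[str]
    non_validated: List[str]


def inspect_values(values: Iterable[str], registry: Set[str]) -> InspectSketch:
    """Partition values by whether they are present in the registry."""
    values = list(values)
    return InspectSketch(
        validated=[v for v in values if v in registry],
        non_validated=[v for v in values if v not in registry],
    )


registry = {"Project A", "Project B"}
projects = ["Project A", "Project B", "Project D", "Project E"]

result = inspect_values(projects, registry)
# Register the values that did not validate, extending the registry.
registry.update(result.non_validated)
after = inspect_values(projects, registry)
# after.non_validated == []
```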

## Validate features

When calling `File.from_...` and `Dataset.from_...`, features are automatically validated.

Validated features are grouped in "feature sets" indexed by "slots".
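
As a rough mental model, this grouping can be pictured as a mapping from a slot name to the set of features that validated for that slot. The sketch below is an assumption-laden illustration: the slot name "columns", the helper `build_feature_sets`, and the registered feature names are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List


def build_feature_sets(
    candidates_by_slot: Dict[str, List[str]], registered_features: Iterable[str]
) -> Dict[str, FrozenSet[str]]:
    """Keep only registered features and group them by slot."""
    registered = set(registered_features)
    feature_sets: Dict[str, FrozenSet[str]] = {}
    for slot, names in candidates_by_slot.items():
        validated = frozenset(name for name in names if name in registered)
        if validated:
            # A slot only gets a feature set if at least one feature validated.
            feature_sets[slot] = validated
    return feature_sets


feature_sets = build_feature_sets(
    {"columns": ["A", "B", "unregistered"]},
    registered_features=["A", "B"],
)
# feature_sets == {"columns": frozenset({"A", "B"})}
```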

For a basic example, see {doc}`/tutorial2`.

For an overview of data formats used to model different data types, see {doc}`docs:by-datatype`.

In [None]:
!lamin delete --force test-validate
!rm -r test-validate