# Validate data

To make data more reusable, LaminDB validates categorical variables against registries (see {class}`~lamindb.dev.CanValidate`), in addition to standard data validation.

The typical data management process has 3 steps:
1. ![](https://img.shields.io/badge/Transform-10b981) Transform data, including formatting, normalizing, subsetting, concatenating, modeling, etc.
2. ![](https://img.shields.io/badge/Validate-10b981) Validate data, including curating non-validated data & amending registries.
3. ![](https://img.shields.io/badge/Register-10b981) Save data using the `File` or `Dataset` registry and link it to validated & registered metadata.

## Setup

In [None]:
!lamin init --storage ./test-validate --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

Pre-populate registries:

In [None]:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.File(df, description="test data").save()
ln.Label(name="Project A").save()
ln.Label(name="Project B").save()
lb.Disease.from_bionty(ontology_id="MONDO:0004975").save()

## Standard validation

### Name duplication

Creating a record with a name that already exists in the registry returns the existing record instead of a new one:

In [None]:
ln.Label(name="Project A")

Bulk creating records using {meth}`~lamindb.dev.Registry.from_values` only returns validated (existing) records:

In [None]:
projects = ["Project A", "Project B", "Project D", "Project E"]
ln.Label.from_values(projects)

(Versioned records also account for `version` in addition to `name`. Also see: [idempotency](faq/idempotency).)

### Data duplication

Creating a file or dataset with the same content automatically returns the existing record:

In [None]:
ln.File(df, description="same data")
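
To illustrate the idea behind content-based deduplication, here is a minimal sketch that is independent of LaminDB. It assumes content is identified by a SHA-256 hash and uses a hypothetical in-memory `registry` dict and `register` helper; how LaminDB actually hashes and looks up content is not specified here.

```python
import hashlib

# Hypothetical in-memory registry: content hash -> description of the first record
registry = {}


def register(content: bytes, description: str) -> str:
    """Return the description of the record that owns this content."""
    digest = hashlib.sha256(content).hexdigest()
    # setdefault stores the description only if the hash has not been seen before
    return registry.setdefault(digest, description)


first = register(b"A,B\n1,2\n", "test data")
second = register(b"A,B\n1,2\n", "same data")  # identical bytes, new description
print(first, second, len(registry))
```

The second call returns the record created by the first call, so the description `"same data"` is never stored.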

### Schema-based validation

Type checks, constraint checks, and [Django validators](https://docs.djangoproject.com/en/4.2/ref/validators/) can be configured in the [schema](https://lamin.ai/docs/schemas).

## Registry-based validation

{meth}`~lamindb.dev.CanValidate.validate` validates passed values against reference values in a registry.

It returns a boolean vector indicating, for each value, whether it has an exact match among the reference values.
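
Conceptually, exact-match validation is a membership test per value. The following is a minimal sketch of that idea in plain Python, not LaminDB's implementation; the `validate_exact` helper and the reference names are hypothetical.

```python
def validate_exact(values, reference):
    """Return one boolean per value: True if it exactly matches a reference value."""
    reference_set = set(reference)  # set lookup keeps each membership test O(1)
    return [value in reference_set for value in values]


# Hypothetical registry content, for illustration only
reference_names = ["Alzheimer disease", "Parkinson disease"]

flags = validate_exact(
    ["Alzheimer disease", "Alzheimer's disease", "AD"], reference_names
)
print(flags)  # [True, False, False]
```

Because matching is exact, spelling variants and abbreviations do not validate even when they refer to the same entity.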

### Using dedicated registries

For instance, {mod}`lnschema_bionty` types basic biological entities: every entity has its own registry, a Python class.

By default, the first string field is used for validation. For {class}`~lnschema_bionty.Disease`, it's `name`:

In [None]:
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = lb.Disease.validate(diseases)
validated

Validate against a non-default field:

In [None]:
lb.Disease.validate(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], lb.Disease.ontology_id
)

### Using the `Label` registry

Any entity that doesn't have its dedicated registry ("is not typed") can be validated & registered using {class}`~lamindb.Label`:

In [None]:
ln.Label.validate(["Project A", "Project B", "Project C"])

## Inspect & standardize

When validation fails, you can call {meth}`~lamindb.dev.CanValidate.inspect` to figure out what to do.

{meth}`~lamindb.dev.CanValidate.inspect` applies the same definition of validation as {meth}`~lamindb.dev.CanValidate.validate`, but returns a richer result, an {class}`~lamindb.dev.InspectResult`. Most importantly, it logs recommended curation steps that would make the data pass validation.

In [None]:
result = lb.Disease.inspect(diseases)

In this case, it suggests calling {meth}`~lamindb.dev.CanValidate.standardize` to map synonyms to their standardized names:

In [None]:
diseases_standardized = lb.Disease.standardize(result.non_validated)
diseases_standardized
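
The idea behind synonym standardization can be sketched as a lookup from known synonyms to a canonical name. This is an illustrative sketch only, assuming a hypothetical `standardize_synonyms` helper and a hand-written synonym table; it does not show how LaminDB stores or resolves synonyms.

```python
def standardize_synonyms(values, synonyms_by_name):
    """Replace each known synonym with its canonical name; keep other values as-is."""
    # Invert the table: synonym -> canonical name
    lookup = {}
    for name, synonyms in synonyms_by_name.items():
        for synonym in synonyms:
            lookup[synonym] = name
    return [lookup.get(value, value) for value in values]


# Hypothetical synonym table, for illustration only
synonyms_by_name = {"Alzheimer disease": ["Alzheimer's disease", "AD"]}

standardized = standardize_synonyms(
    ["Alzheimer's disease", "AD", "unknown term"], synonyms_by_name
)
print(standardized)  # ['Alzheimer disease', 'Alzheimer disease', 'unknown term']
```

Values without a known synonym pass through unchanged, so they would still fail validation and need separate curation.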

For more, see {doc}`./bio-registries`.

## Extend registries

Sometimes, we want to extend the content of registries with new records:

In [None]:
result = ln.Label.inspect(projects)

In [None]:
new_labels = [ln.Label(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels

## Validate features

When calling `File.from_...` and `Dataset.from_...`, features are automatically validated.

Validated features are grouped in "feature sets" indexed by "slots".

For a basic example, see {doc}`/tutorial1`.

For an overview of data formats used to model different data types, see {doc}`docs:by-datatype`.

In [None]:
!lamin delete --force test-validate
!rm -r test-validate