# Validate data

To make data more reusable, LaminDB validates categorical variables against registries ({class}`~lamindb.dev.CanValidate`).

The typical data management process has 3 steps:
1. ![](https://img.shields.io/badge/Transform-10b981) Transform data, including formatting, normalizing, subsetting, concatenating, modeling, etc.
2. ![](https://img.shields.io/badge/Validate-10b981) Validate data, including curating non-validated data & amending registries.
3. ![](https://img.shields.io/badge/Register-10b981) Save data using the `File` or `Dataset` registry and link it to validated & registered metadata.

## Setup

In [None]:
!lamin init --storage ./test-validate --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

Pre-populate registries:

In [None]:
df = pd.DataFrame({"A": 1, "B": 2}, index=["i1"])
ln.File(df, description="test data").save()
ln.Feature(name="project", type="category").save()
ln.Feature(name="experiment", type="category").save()
ln.Label(name="Project A").save()
ln.Label(name="Project B").save()
ln.Label(name="Experiment A001").save()
ln.Label(name="Experiment A002").save()
lb.Disease.from_bionty(ontology_id="MONDO:0004975").save()

## Built-in validation

### Name duplication

Creating a record whose `name` matches an existing record automatically returns the existing record:

In [None]:
ln.Label(name="Project A")

Bulk creating records using {meth}`~lamindb.dev.Registry.from_values` only returns validated (existing) records:

In [None]:
projects = ["Project A", "Project B", "Project D", "Project E"]
ln.Label.from_values(projects)

(For versioned records, `version` is taken into account in addition to `name`. Also see [idempotency](faq/idempotency).)
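The behavior described above can be pictured with a small in-memory sketch. This is not LaminDB code: `ToyRegistry`, `get_or_create` and the dictionary-based records are hypothetical names used only to illustrate how a name-keyed registry can return an existing record instead of a duplicate, and how a bulk lookup can return only the names that are already registered.

```python
# Hypothetical in-memory stand-in for a registry; not LaminDB's implementation.
class ToyRegistry:
    def __init__(self):
        self._by_name = {}

    def get_or_create(self, name):
        """Return the existing record for `name`, or create and store a new one."""
        if name not in self._by_name:
            self._by_name[name] = {"name": name}
        return self._by_name[name]

    def from_values(self, names):
        """Return only records that already exist; unknown names are skipped."""
        return [self._by_name[n] for n in names if n in self._by_name]


registry = ToyRegistry()
first = registry.get_or_create("Project A")
second = registry.get_or_create("Project A")  # same name: existing record is returned
registry.get_or_create("Project B")

existing = registry.from_values(["Project A", "Project B", "Project D", "Project E"])
print(first is second)
print([record["name"] for record in existing])
```

Keying on the name makes record creation idempotent: running the same cell twice leaves the toy registry unchanged.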

### Data duplication

Creating a file or dataset with the same content automatically returns the existing record:

In [None]:
ln.File(df, description="same data")

### Database validations

Types, constraints and [Django validators](https://docs.djangoproject.com/en/4.2/ref/validators/) can be configured in the [schema](https://lamin.ai/docs/schemas).

## Validate against a categorical field

{meth}`~lamindb.dev.CanValidate.validate` _strictly_ validates terms/categories against a field: only exact matches pass. It returns a boolean vector with one entry per term.
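As a minimal sketch of what strict validation means, the plain-Python function below checks each term for an exact match against a set of registered values. `strict_validate` and the registered values are assumptions made for illustration, not part of LaminDB.

```python
def strict_validate(terms, registered):
    """Return one boolean per term: True only for an exact match."""
    registered = set(registered)
    return [term in registered for term in terms]


# Assumed registry content for this sketch: a single disease name.
registered_names = ["Alzheimer disease"]
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]

mask = strict_validate(diseases, registered_names)
print(mask)  # [True, False, False]
```

Synonyms and abbreviations fail here because no fuzzy matching or synonym lookup is involved.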

### With typed registries

Typed registries correspond to tables in the schema with their own entities. For instance, the [bionty schema](https://lamin.ai/docs/lnschema-bionty) offers registries for basic biological entities.

By default, the first string field is used for validation. For {class}`~lnschema_bionty.Disease`, it's the "name".

In [None]:
diseases = ["Alzheimer disease", "Alzheimer's disease", "AD"]
validated = lb.Disease.validate(diseases)
validated

Validate against a non-default field:

In [None]:
lb.Disease.validate(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], lb.Disease.ontology_id
)
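Choosing the field simply changes which column of the registry the terms are compared against. The sketch below uses hypothetical dictionary records and a hypothetical `validate_field` helper to show the idea; it does not reflect how LaminDB queries its database.

```python
# Hypothetical registry rows with two string fields each.
records = [
    {"name": "Alzheimer disease", "ontology_id": "MONDO:0004975"},
]


def validate_field(terms, records, field="name"):
    """Exact-match each term against the values of `field` across all records."""
    known = {record[field] for record in records}
    return [term in known for term in terms]


by_name = validate_field(["Alzheimer disease", "AD"], records)
by_id = validate_field(
    ["MONDO:0004975", "MONDO:0004976", "MONDO:0004977"], records, field="ontology_id"
)
print(by_name)  # [True, False]
print(by_id)  # [True, False, False]
```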

### Without typed registries

Any non-typed entities can be registered using {class}`~lamindb.Label`.

In [None]:
ln.Label.validate(["Project A", "Project B", "Project C"])

## Inspect and standardize non-validated terms

When validation fails, you can call {meth}`~lamindb.dev.CanValidate.inspect` to figure out what to do.

- Same as `.validate`, it _strictly_ validates terms/categories against a field.
- However, it returns a rich {class}`~lamindb.dev.InspectResult`.
- Most importantly, it logs what users can do in order to pass validation.

Also see {doc}`./bio-registries` for more examples.

In [None]:
result = lb.Disease.inspect(diseases)

In this case, it suggests calling {meth}`~lamindb.dev.CanValidate.standardize` to standardize synonyms:

In [None]:
diseases_standardized = lb.Disease.standardize(result.non_validated)
diseases_standardized
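Conceptually, standardizing synonyms amounts to mapping each known synonym back to its canonical name and leaving unknown terms untouched. The sketch below assumes a hand-written synonym table; `SYNONYMS` and `standardize_terms` are hypothetical and only illustrate the idea.

```python
# Hypothetical synonym table: canonical name -> list of synonyms.
SYNONYMS = {"Alzheimer disease": ["Alzheimer's disease", "AD"]}


def standardize_terms(terms, synonyms):
    """Replace known synonyms with their canonical name; keep other terms as-is."""
    lookup = {syn: name for name, syns in synonyms.items() for syn in syns}
    return [lookup.get(term, term) for term in terms]


standardized = standardize_terms(["Alzheimer's disease", "AD", "unknown term"], SYNONYMS)
print(standardized)  # ['Alzheimer disease', 'Alzheimer disease', 'unknown term']
```

Terms without a known synonym pass through unchanged, so they still fail a subsequent strict validation and can be handled separately.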

## Register non-validated terms

In [None]:
result = ln.Label.inspect(projects)

In [None]:
new_labels = [ln.Label(name=name) for name in result.non_validated]
ln.save(new_labels)
new_labels
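The inspect-then-register pattern above can be sketched in plain Python: split the terms into validated and non-validated ones, add the non-validated ones to the registry, then check again. `inspect_terms` and the set-based registry are hypothetical stand-ins, not LaminDB's `InspectResult`.

```python
def inspect_terms(terms, registered):
    """Split terms into those with an exact match and those without."""
    return {
        "validated": [t for t in terms if t in registered],
        "non_validated": [t for t in terms if t not in registered],
    }


# Assumed starting state of the toy registry.
registered = {"Project A", "Project B"}
projects = ["Project A", "Project B", "Project D", "Project E"]

result = inspect_terms(projects, registered)
print(result["non_validated"])  # ['Project D', 'Project E']

# Register the non-validated terms, then inspect again.
registered.update(result["non_validated"])
after = inspect_terms(projects, registered)
print(after["non_validated"])  # []
```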

## Validate features upon registering `File`/`Dataset`

When calling `File.from_...` and `Dataset.from_...`, features are validated according to the data formats. Validated features create feature sets depending on slots.

Check out [use cases](https://lamin.ai/docs/by-datatype) of your favorite data type.

In [None]:
!lamin delete --force test-validate
!rm -r test-validate