# Annotate data

This guide shows how to annotate data within a few minutes: define clear validation criteria, validate and curate metadata, and register annotated artifacts.

By the end, you'll have validated data objects backed by LaminDB registries.

:::{dropdown} What does "validating a categorical variable based on registries" mean?

The records in your LaminDB instance define the validated reference values for any entity managed in your schema.

Validated categorical values are stored in a field of a registry, that is, in a column of the registry table.

The default field to label an entity record is the `name` field.

For instance, if "Experiment 1" has been registered as the `name` of a `ULabel` record, it is a validated value for field `ULabel.name`.

The {class}`~lamindb.core.CanValidate` methods {meth}`~lamindb.core.CanValidate.validate`, {meth}`~lamindb.core.CanValidate.inspect`, and {meth}`~lamindb.core.CanValidate.standardize`, as well as {meth}`~lamindb.core.Registry.from_values`, take two important parameters: `values` and `field`. The parameter `values` takes an iterable of input categorical values, and the parameter `field` takes a typed field of a registry.

:::
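
The idea of validating values against a registry field can be sketched in plain Python. This is a toy stand-in, not LaminDB's implementation: the list-of-dicts registry and the `validate` helper are assumptions made for illustration only.

```python
# Toy stand-in for a registry: each record is a dict, each key is a field.
ulabel_registry = [
    {"name": "Experiment 1"},
    {"name": "Experiment 2"},
]


def validate(values, registry, field):
    """Return one boolean per input value: True if it matches a registered value of `field`."""
    reference = {record[field] for record in registry}
    return [value in reference for value in values]


result = validate(["Experiment 1", "Experiment 3"], ulabel_registry, field="name")
print(result)
```

Only "Experiment 1" is flagged as validated here, because it is the only input that matches a registered `name`.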

```{toctree}
:maxdepth: 1
:hidden:

can-validate
annotate-for-developers
```

## Set up

In [None]:
!lamin init --storage ./test-annotate --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad

ln.settings.verbosity = "hint"

## A DataFrame with labels

Let's start with a DataFrame object that we'd like to validate and curate:

In [None]:
df = pd.DataFrame({
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "DOOO3"],
})
df

## Validate and curate metadata

Define validation criteria for the columns:

In [None]:
fields = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

Validate the pandas DataFrame:

In [None]:
annotate = ln.Annotate.from_df(df, fields=fields)

In [None]:
validated = annotate.validate()

In [None]:
validated
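
Conceptually, column-wise validation checks every mapped column against its own set of reference values and passes only if all columns pass. The sketch below illustrates this with plain dictionaries; the reference sets and the `find_non_validated` helper are hypothetical and do not reflect how LaminDB performs the check internally.

```python
# Hypothetical reference values per column; in the guide they come from registries.
reference_values = {
    "cell_type": {"astrocyte", "oligodendrocyte", "cerebral cortex pyramidal neuron"},
    "donor": {"D0001", "D0002"},
}

# Toy table: column name -> list of values.
table = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "donor": ["D0001", "D0002", "DOOO3"],
}


def find_non_validated(table, reference_values):
    """Map each checked column to the values that are missing from its reference set."""
    report = {}
    for column, reference in reference_values.items():
        missing = [value for value in table[column] if value not in reference]
        if missing:
            report[column] = missing
    return report


non_validated = find_non_validated(table, reference_values)
all_validated = not non_validated  # True only if every column passed
print(non_validated)
print(all_validated)
```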

## Validate using registries in another instance

Sometimes you want to validate against registries that others have already created.

Here we curate against the registries of the [cellxgene instance](https://lamin.ai/laminlabs/cellxgene). You will notice that more terms are validated than above.

Values that are currently missing in our instance can then be registered directly from the cellxgene instance.
Keeping our own registries while also validating against the cellxgene instance lets us add new registry values, while the cellxgene instance stays focused on the [cellxgene schema](https://github.com/chanzuckerberg/single-cell-curation/tree/main/schema).

In [None]:
annotate = ln.Annotate.from_df(
    df,
    fields=fields,
    using="laminlabs/cellxgene",  # pass the instance slug
)
annotate.validate()
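
The reasoning behind validating against a second instance can be sketched as a three-way triage: a value is either already in the local registry, available only in the reference registry, or unknown to both. The sets and the `triage` helper below are made up for illustration and are not LaminDB API.

```python
def triage(values, local_registry, reference_registry):
    """Sort values by where (if anywhere) they are already registered."""
    result = {"validated": [], "in_reference_only": [], "unknown": []}
    for value in values:
        if value in local_registry:
            result["validated"].append(value)
        elif value in reference_registry:
            result["in_reference_only"].append(value)
        else:
            result["unknown"].append(value)
    return result


local_registry = set()  # a fresh instance starts out empty
reference_registry = {"astrocyte", "oligodendrocyte"}  # made-up subset of a shared registry

triaged = triage(
    ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    local_registry,
    reference_registry,
)
print(triaged)
```

Values in the `in_reference_only` group are the ones that could be pulled into the local registry, while `unknown` values need a manual fix or an explicit decision to register them.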

## Register new metadata labels

Follow the suggestions above to register labels that aren't yet present in the current instance:

(Note that our current instance is empty. Once the registries are populated, you'll rarely need to register new labels.)

In [None]:
annotate.update_registry("cell_type")

Fix the typo and register again:

In [None]:
# use a lookup object to get the correct spelling of categories
# pass "public" to lookup() to use the public reference instead
lookup = annotate.lookup()

In [None]:
lookup

In [None]:
cell_types = lookup["cell_type"]

In [None]:
cell_types.cerebral_cortex_pyramidal_neuron
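
A lookup object makes registered names reachable through auto-completable attributes. The sketch below shows one way such an object could be built; the normalization rule (lowercase, non-word characters replaced by underscores) is an assumption, and unlike the lookup above, this toy version returns plain strings rather than records.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Build an object whose attributes map normalized keys to the original names."""
    entries = {}
    for name in names:
        key = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
        if key and key[0].isdigit():
            key = f"_{key}"  # attribute names cannot start with a digit
        entries[key] = name
    return SimpleNamespace(**entries)


toy_lookup = make_lookup(["cerebral cortex pyramidal neuron", "astrocyte", "T cell"])
print(toy_lookup.cerebral_cortex_pyramidal_neuron)
```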

In [None]:
# fix the typo
df["cell_type"] = df["cell_type"].replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})

annotate.update_registry("cell_type")
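
When browsing a lookup object is impractical, fuzzy string matching can hint at the intended spelling. The sketch below uses `difflib` from the Python standard library; it is not part of LaminDB, and both the `suggest` helper and the list of registered names are made up for illustration.

```python
from difflib import get_close_matches

# Made-up list of registered names to match against.
registered_names = ["cerebral cortex pyramidal neuron", "astrocyte", "oligodendrocyte"]


def suggest(value, candidates, cutoff=0.6):
    """Return the closest registered name, or None if nothing is similar enough."""
    matches = get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = suggest("cerebral pyramidal neuron", registered_names)
print(suggestion)
```

A suggestion like this is only a hint; it still needs a human check before the value in the DataFrame is replaced.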

In [None]:
annotate.update_registry("donor")

To register non-validated terms, pass `validated_only=False`:

In [None]:
annotate.update_registry("donor", validated_only=False)

Let's validate again:

In [None]:
validated = annotate.validate()

In [None]:
validated

## Validate an AnnData object

LaminDB offers an AnnData-specific annotate flow that is aware of the variables in addition to the observations DataFrame.

Here we specify which `var_field` and `obs_fields` to validate against.

In [None]:
df.index = ["obs1", "obs2", "obs3"]

X = pd.DataFrame({"TCF7": [1, 2, 3], "PDCD1": [4, 5, 6], "CD3E": [7, 8, 9], "CD4": [10, 11, 12], "CD8A": [13, 14, 15]}, index=["obs1", "obs2", "obs3"])

adata = ad.AnnData(X=X, obs=df)
adata

In [None]:
annotate = ln.Annotate.from_anndata(
    adata, 
    obs_fields=fields, 
    var_field=bt.Gene.symbol, # specify the field for the var
    organism="human",
    )

In [None]:
annotate.validate()

In [None]:
annotate.update_registry("all")

In [None]:
annotate.validate()

## Register file

The validated object can subsequently be registered as an {class}`~lamindb.Artifact` in your LaminDB instance:

In [None]:
ln.transform.stem_uid = "WOK3vP0bNGLx"
ln.transform.version = "0"
ln.track()

In [None]:
artifact = annotate.register_artifact(description="test AnnData")

View the registered artifact with metadata:

In [None]:
artifact.describe()

## Register collection

Register a new collection for the registered artifact:

In [None]:
# register a new collection
collection = annotate.register_collection(
    artifact,  # registered artifact above, can also pass a list of artifacts
    name="Experiment X in brain",  # title of the publication
    description="10.1126/science.xxxxx",  # DOI of the publication
    reference="E-MTAB-xxxxx",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)

In [None]:
collection.artifact

In [None]:
artifact.collection