# Annotate data

While data is the primary information or raw facts that are collected and stored, metadata is the supporting information that provides context and meaning to that data.

LaminDB lets you annotate data with metadata in two ways: features and labels. (Also see the [tutorial](/tutorial2).)

This guide extends [Quickstart](/introduction) to explain the details of annotating data.

## Setup

Let us create an instance that has {mod}`lnschema_bionty` mounted:

In [None]:
!lamin init --storage ./test-annotate --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
import anndata as ad

In [None]:
lb.settings.organism = "human"  # globally set organism
lb.settings.auto_save_parents = False  # ignores ontological hierarchy
ln.settings.verbosity = "info"

## Register a dataset

Let's use the same example data as in the [Quickstart](/introduction):

In [None]:
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
    index=["sample1", "sample2", "sample3"],
)

In addition to the data, we also have two types of metadata as follows:

In [None]:
# observational metadata (1:1 correspondence with samples)
obs_meta = pd.DataFrame(
    {
        "cell_type": ["T cell", "T cell", "Monocyte"],
        "tissue": ["capillary blood", "arterial blood", "capillary blood"],
    },
    index=["sample1", "sample2", "sample3"],
)

# external metadata (describes the entire dataset)
external_meta = {
    "organism": "human",
    "assay": "scRNA-seq",
    "experiment": "EXP0001",
    "project": "PRJ0001",
}

To store both data and observational metadata, we use an [`AnnData` object](https://anndata.readthedocs.io/):

In [None]:
# note that we didn't add external metadata to adata.uns, because we will use LaminDB to store it
adata = ad.AnnData(df, obs=obs_meta)
adata

Now let's register the `AnnData` object without annotating it with any metadata:

In [None]:
ln.track()

dataset = ln.Dataset(adata, name="my RNA-seq")
dataset.save()

We don't see any metadata in the registered dataset yet:

In [None]:
dataset.describe()

## Define features and labels

Features and labels are records from their respective registries.

You can define them schema-less using {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` registries, or schema-full using dedicated registries.

### Define data features

Data features refer to individual measurable properties or characteristics of a phenomenon being observed. In data analysis and machine learning, features are the input variables used to predict or classify an outcome.

Data features are often numeric, but can also be categorical. For example, in gene expression data, the features are the expression levels of individual genes. They are often stored as columns in a data table (`adata.var_names` for `AnnData` objects).

Here we define them using the {class}`~lnschema_bionty.Gene` registry:

In [None]:
data_features = lb.Gene.from_values(adata.var_names)
ln.save(data_features)
data_features

### Define metadata features

Metadata features refer to descriptive or contextual information about the data. They don't directly describe the content of the data but rather its characteristics.

In this example, the metadata features are "cell_type" and "tissue", which describe observations (stored in `adata.obs.columns`), and "organism", "assay", "experiment", and "project", which describe the entire dataset.
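
To make the distinction concrete, here is a minimal plain-Python sketch (no LaminDB calls) that separates observation-level from dataset-level metadata features. The values are copied from the example in this guide; the variable names `obs_feature_names` and `external_feature_names` are illustrative only.

```python
# Plain-Python sketch; values copied from the example in this guide.
obs_meta = {
    "cell_type": ["T cell", "T cell", "Monocyte"],
    "tissue": ["capillary blood", "arterial blood", "capillary blood"],
}
external_meta = {
    "organism": "human",
    "assay": "scRNA-seq",
    "experiment": "EXP0001",
    "project": "PRJ0001",
}

# Observation-level features: one value per sample (the columns of adata.obs).
obs_feature_names = list(obs_meta)

# Dataset-level features: a single value for the whole dataset.
external_feature_names = list(external_meta)

print(obs_feature_names)
print(external_feature_names)
```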

Here we define them using the {class}`~lamindb.Feature` registry:

In [None]:
# obs metadata features
obs_meta_features = ln.Feature.from_df(adata.obs)
ln.save(obs_meta_features)
obs_meta_features

In [None]:
# external metadata features
external_meta_features = [
    ln.Feature(name=name, type="category") for name in external_meta.keys()
]
ln.save(external_meta_features)
external_meta_features

### Define metadata labels

Metadata labels are the categorical values of metadata features. They are more specific than features and are often used in classification.

In this example, the metadata labels of feature "cell_type" are "T cell" and "Monocyte"; the labels of feature "tissue" are "capillary blood" and "arterial blood"; the label of feature "organism" is "human"; and so on.
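
As a rough illustration of "labels are the categorical values of a feature", the following plain-Python sketch collects the distinct values of each metadata feature. It does not use LaminDB; `unique_in_order` and `labels_by_feature` are hypothetical names introduced for this example.

```python
# Plain-Python sketch; values copied from the example in this guide.
obs_meta = {
    "cell_type": ["T cell", "T cell", "Monocyte"],
    "tissue": ["capillary blood", "arterial blood", "capillary blood"],
}
external_meta = {
    "organism": "human",
    "assay": "scRNA-seq",
    "experiment": "EXP0001",
    "project": "PRJ0001",
}


def unique_in_order(values):
    """Return the distinct values, keeping first-seen order."""
    return list(dict.fromkeys(values))


# Labels of observation-level features: distinct values of each column.
labels_by_feature = {
    name: unique_in_order(values) for name, values in obs_meta.items()
}
# Labels of dataset-level features: the single value of each key.
labels_by_feature.update({name: [value] for name, value in external_meta.items()})

for name, labels in labels_by_feature.items():
    print(name, labels)
```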

Let's define them with their respective registries:

In [None]:
cell_types = lb.CellType.from_values(adata.obs["cell_type"])
ln.save(cell_types)
cell_types

In [None]:
tissues = lb.Tissue.from_values(adata.obs["tissue"])
ln.save(tissues)
tissues

In [None]:
organism = lb.Organism.from_bionty(name=external_meta["organism"])
organism.save()
organism

In [None]:
assay = lb.ExperimentalFactor.from_bionty(name=external_meta["assay"])
assay.save()
assay

In [None]:
experiment = ln.ULabel(name=external_meta["experiment"], description="An experiment")
experiment.save()
experiment

In [None]:
project = ln.ULabel(name=external_meta["project"], description="A project")
project.save()
project

## Annotate with features

(See the "Annotate with labels stratified by metadata features" section below for adding external features.)

Non-external features are annotated when registering datasets using the `.from_df` or `.from_anndata` methods:

In [None]:
dataset = ln.Dataset.from_anndata(
    adata,
    name="my RNA-seq",
    field=lb.Gene.symbol,  # the registry field to use for the data features
)
dataset.save()

This dataset is now annotated with features:

In [None]:
dataset.describe()

You can see that two types of features are annotated and organized as feature sets by slot:
- "var": data features
- "obs": observational metadata features

In [None]:
dataset.features

Use slots to retrieve corresponding annotated features:

In [None]:
dataset.features["var"].df()

In [None]:
dataset.features["obs"].df()
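
As a mental model only (this is not how LaminDB stores feature sets), the slots can be pictured as a dictionary that maps a slot name to the feature names it holds. The sketch below is plain Python using the names from this guide; `feature_sets` and `slot_of` are hypothetical helpers.

```python
# Mental model of feature sets organized by slot (not LaminDB's storage format).
feature_sets = {
    "var": ["CD8A", "CD4", "CD14"],  # data features (gene symbols)
    "obs": ["cell_type", "tissue"],  # observational metadata features
}


def slot_of(feature_name):
    """Return the slot that contains the given feature, or None if absent."""
    for slot, names in feature_sets.items():
        if feature_name in names:
            return slot
    return None


print(slot_of("CD4"))
print(slot_of("tissue"))
```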

## Annotate with labels

If you simply want to tag a dataset with some descriptive labels, you can pass them to `.labels.add`. For example, let's add the experiment label "EXP0001" and project label "PRJ0001" to the dataset:

In [None]:
dataset.labels.add(experiment)
dataset.labels.add(project)

Now you see that the dataset is annotated with the labels "EXP0001" and "PRJ0001":

In [None]:
dataset.describe()

To view all annotated labels:

In [None]:
dataset.labels

Since we didn't specify which features the labels belong to, they are accessible only through the default accessor `.ulabels` of the {class}`~lamindb.ULabel` registry.

You may have noticed that labels can be difficult to interpret without features when they belong to the same registry:

In [None]:
dataset.ulabels.df()

## Annotate with labels stratified by metadata features

For labels associated with metadata features, you can pass `feature` to `.labels.add` to stratify them by feature. (Another way to stratify labels is through the ontological hierarchy, which is covered in the [Quickstart](/introduction).)

Let's add the experiment label "EXP0001" and project label "PRJ0001" to the dataset again, this time specifying their features:

In [None]:
# an auto-complete object of registered features
features = ln.Feature.lookup()

dataset.labels.add(experiment, feature=features.experiment)
dataset.labels.add(project, feature=features.project)

You now see that a third feature set has been added to the dataset at slot "external", and that the labels are stratified by two features:

In [None]:
dataset.describe()

With feature-stratified labels, you can retrieve labels by feature:

In [None]:
dataset.labels.get(features.experiment).df()
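
Conceptually, stratifying by feature turns a flat collection of labels into a mapping from feature to labels. The plain-Python sketch below contrasts the two; it is not LaminDB's internal representation, and `flat_labels`, `stratified`, and `labels_for` are hypothetical names.

```python
# Flat labels: nothing says which one is the experiment and which the project.
flat_labels = ["EXP0001", "PRJ0001"]

# Feature-stratified labels: each label is filed under the feature it belongs to.
stratified = {
    "experiment": ["EXP0001"],
    "project": ["PRJ0001"],
}


def labels_for(feature_name):
    """Return the labels filed under a feature, or an empty list if none."""
    return stratified.get(feature_name, [])


print(labels_for("experiment"))
print(labels_for("project"))
```

Both structures hold the same labels; only the stratified one can answer "which label is the experiment?".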

Note that adding feature-stratified labels also lets you retrieve the labels through the default accessor of their respective registries:

In [None]:
dataset.labels.add(assay, feature=features.assay)

In [None]:
# access labels directly via default accessor "experimental_factors"
dataset.experimental_factors.df()

In [None]:
# access labels via feature
dataset.labels.get(features.assay).df()

Let's finish annotating the remaining labels:

In [None]:
# labels of obs metadata features
dataset.labels.add(cell_types, feature=features.cell_type)
dataset.labels.add(tissues, feature=features.tissue)

# labels of external metadata features
dataset.labels.add(organism, feature=features.organism)

Now you've annotated your dataset with all features and labels:

In [None]:
dataset.describe()

In [None]:
# clean up test instance (created above with --storage ./test-annotate)
!lamin delete --force test-annotate
!rm -r test-annotate