![](https://img.shields.io/badge/tutorial2/2-lightgrey)
[![](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial2.ipynb)
[![](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/transform/dMtrt8YMSdl6z8)

# Tutorial: Features & labels

In {doc}`/tutorial`, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

1. Findability: Which collections measured expression of cell marker `CD14`? Which characterized cell line `K562`? Which collections have a test & train split? Etc.
2. Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.

:::{hint}

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you're just looking to readily validate and annotate a dataset with features and labels, see this guide: {doc}`annotate`.

:::

In [None]:
import lamindb as ln
import pandas as pd
import pytest

ln.settings.verbosity = "hint"

## TLDR

### Annotate by labels

In [None]:
# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()

### Annotate by features

Features are buckets for labels, numbers and other data types.

Often, data that you want to ingest comes with metadata.

Here, three metadata features `species`, `scientist`, `instrument` were collected.

In [None]:
df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
df.head()

There are only a few values for features `species`, `scientist` & `instrument`, and we'd like to label the artifact with these values:

In [None]:
df.nunique()

Let's annotate the artifact with features & values and also add a `temperature` measurement that Barbara & Edgar forgot to include in their CSV file:

In [None]:
with pytest.raises(ln.core.exceptions.ValidationError) as error:
    artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
print(error.exconly())

Because none of the features or labels had been registered, validation failed with an error that tells us to register them first:

In [None]:
ln.Feature(name='species', dtype='cat[ULabel]').save()
ln.Feature(name='scientist', dtype='cat[ULabel]').save()
ln.Feature(name='instrument', dtype='cat[ULabel]').save()
ln.Feature(name='study', dtype='cat[ULabel]').save()
ln.Feature(name='temperature', dtype='float').save()
species = ln.ULabel.from_values(df['species'].unique(), create=True)
ln.save(species)
authors = ln.ULabel.from_values(df['scientist'].unique(), create=True)
ln.save(authors)
instruments = ln.ULabel.from_values(df['instrument'].unique(), create=True)
ln.save(instruments)

Now everything works:

In [None]:
artifact.features.add_values({"species": df.species.unique(), "scientist": df.scientist.unique(), "instrument": df.instrument.unique(), "temperature": 27.6, "study": "Study 0: initial plant gathering"})
artifact.describe()

Because we also re-annotated with the study label `Study 0: initial plant gathering`, it now appears under the `study` feature.
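
The validation behavior seen above can be sketched in plain Python. This is a minimal illustration of the idea, not LaminDB's implementation: the function `find_unvalidated` and the two in-memory registries (`feature_dtypes`, a set of known labels) are hypothetical stand-ins for the `Feature` and `ULabel` registries.

```python
def find_unvalidated(values: dict, feature_dtypes: dict, known_labels: set) -> dict:
    """Report feature names and categorical values that are not registered yet."""
    report = {"features": [], "labels": []}
    for name, value in values.items():
        if name not in feature_dtypes:
            report["features"].append(name)
            continue
        if feature_dtypes[name] == "cat":
            # accept a single label or a collection of labels
            candidates = [value] if isinstance(value, str) else list(value)
            report["labels"] += [v for v in candidates if v not in known_labels]
    return report


values = {"species": ["setosa", "virginica"], "temperature": 27.6}

# hypothetical empty registries: every feature name is reported
nothing_registered = find_unvalidated(values, {}, set())

# features registered, but one label is still missing
feature_dtypes = {"species": "cat", "temperature": "float"}
one_label_missing = find_unvalidated(values, feature_dtypes, {"setosa"})

print(nothing_registered)
print(one_label_missing)
```

In this sketch, the report only comes back empty once every feature name and every categorical value is registered, which roughly corresponds to the point where `add_values` succeeds in the cells above.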

### Retrieve features

In [None]:
artifact.features.get_values()

### Query by features

In [None]:
artifact = ln.Artifact.features.filter(temperature=27.6).one()
artifact

## Register metadata

Features and labels are the primary ways of registering metadata that encodes domain knowledge in LaminDB.

Features represent measurement dimensions (e.g. `"species"`) and labels represent measured values (e.g. `"iris setosa"`, `"iris versicolor"`, `"iris virginica"`).

In statistics, you'd say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
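
To make the distinction concrete, here is a minimal plain-Python sketch, not LaminDB code. It models a feature as a named, typed variable and, for categorical features, a set of allowed labels. The class `FeatureSpec` and its method `is_valid` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for a feature: a named, typed variable."""

    name: str
    dtype: str  # "cat" for categorical, "float" for numerical
    categories: set = field(default_factory=set)  # allowed labels if categorical

    def is_valid(self, value) -> bool:
        """Check a single measured value against this feature."""
        if self.dtype == "cat":
            return value in self.categories
        if self.dtype == "float":
            # bool is a subclass of int, so exclude it explicitly
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        return False


species_feature = FeatureSpec("species", "cat", {"setosa", "versicolor", "virginica"})
temperature_feature = FeatureSpec("temperature", "float")

print(species_feature.is_valid("setosa"))    # a registered label
print(species_feature.is_valid("setsoa"))    # a typo is not a registered label
print(temperature_feature.is_valid(27.6))    # a numerical value
```

The labels live in the set of categories, while the feature is the variable that draws its values from that set.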

### Register labels

We study 3 species of the Iris plant: `setosa`, `versicolor` & `virginica`. Let's create 3 labels with {class}`~lamindb.ULabel`.

{class}`~lamindb.ULabel` enables you to maintain an in-house ontology for all kinds of generic labels.

:::{dropdown} What are alternatives to ULabel?

In a complex project, you'll likely want dedicated typed registries for selected label types, e.g., {class}`~bionty.Gene`, {class}`~bionty.Tissue`, etc. See: {doc}`/bio-registries`.

{class}`~lamindb.ULabel`, however, will get you quite far and scale to ~1M labels.

:::

Anticipating that we'll have many different labels when working with more data, we'd like to express that all 3 labels are species labels:

In [None]:
is_species = ln.ULabel(name="is_species").save()
is_species.children.set(species)
is_species.view_parents(with_children=True)
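
Conceptually, grouping labels under a parent label is a small parent-to-children mapping. The following plain-Python sketch illustrates that idea under simplifying assumptions. It is not how LaminDB stores the hierarchy, and the names `children_of`, `set_children` and `labels_with_parent` are hypothetical.

```python
# parent label name -> set of child label names (hypothetical in-memory registry)
children_of = {}


def set_children(parent: str, children: list) -> None:
    """Declare that all `children` are labels of type `parent`."""
    children_of[parent] = set(children)


def labels_with_parent(parent: str) -> list:
    """Return all labels grouped under `parent`, sorted by name."""
    return sorted(children_of.get(parent, set()))


set_children("is_species", ["setosa", "versicolor", "virginica"])
print(labels_with_parent("is_species"))
```

Looking up all children of a parent is what lets you later retrieve every species label at once, no matter how many other labels the registry holds.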

### Query artifacts by labels

Using the new annotations, you can now query image artifacts by species & study labels:

In [None]:
ln.ULabel.df()

In [None]:
ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.study_0_initial_plant_gathering).one()

## Run an ML model

Let's now run a mock ML model that transforms the images into 4 high-level features.

In [None]:
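def run_ml_model() -> pd.DataFrame:
    # make the raw images available locally; the mock model doesn't actually read them
    image_file_dir = artifact.cache()
    # stand-in for real model output: a precomputed example dataframe
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data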

transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
run = ln.track(transform=transform)
df = run_ml_model()

The output is a dataframe:

In [None]:
df.head()

And this is the pipeline that produced the dataframe:

In [None]:
run

In [None]:
run.transform.view_parents()

### Register the output data

Let's first register the features of the transformed data:

In [None]:
new_features = ln.Feature.from_df(df)
ln.save(new_features)

:::{dropdown} How to track units of features?

Use the `unit` field of {class}`~lamindb.Feature`. In the above example, you'd do:

```python
for feature in new_features:
    if feature.dtype in {"float", "int"}:  # numerical features only
        feature.unit = "m"  # SI unit symbol for meter
        feature.save()
```

:::

We can now validate & register the dataframe in one line:

In [None]:
artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()

There is one categorical feature, `species`. Let's add the species labels:

In [None]:
features = ln.Feature.lookup()

In [None]:
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.species)

In [None]:
species_labels

Let's now add study labels:

In [None]:
artifact.labels.add(ulabels.study_0_initial_plant_gathering, feature=features.study)

This is the context for our artifact:

In [None]:
artifact.describe()
artifact.view_lineage()

See the database content:

In [None]:
ln.view(registries=["Feature", "ULabel"])

This is it! 😅

If you're interested, please check out the guides & use cases or open an issue on GitHub to [discuss](https://github.com/laminlabs/lamindb/issues/new).

## Appendix

### Manage metadata

#### Avoid duplicates

Let's create a label `"project1"`:

In [None]:
ln.ULabel(name="project1").save()

We've just created a `project1` label. Let's see what happens if we try to create it again:

In [None]:
label = ln.ULabel(name="project1")
label.save()

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell "project 1" with a space:

In [None]:
ln.ULabel(name="project 1")

To avoid inserting duplicates, LaminDB searches for similar existing records whenever you create a new one.

You can switch it off for performance gains via {attr}`~lamindb.core.Settings.upon_create_search_names`.
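
The idea of searching for similar names before creating a record can be sketched with Python's standard library. This is an assumption-laden illustration that uses `difflib` for fuzzy matching. LaminDB's actual search may work differently, and `similar_names` and the `cutoff` value are hypothetical choices.

```python
import difflib


def similar_names(new_name: str, existing_names: list, cutoff: float = 0.8) -> list:
    """Return existing names that closely resemble `new_name` (case-insensitive)."""
    lowered = {name.lower(): name for name in existing_names}
    matches = difflib.get_close_matches(
        new_name.lower(), list(lowered), n=3, cutoff=cutoff
    )
    return [lowered[match] for match in matches]


existing = ["project1", "is_species", "setosa"]

# "project 1" differs from "project1" only by a space, so it is flagged
print(similar_names("project 1", existing))
# an unrelated name produces no warning candidates
print(similar_names("virginica", existing))
```

Comparing every new name against all existing names has a cost that grows with the size of the registry, which is one reason to make such a check optional.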

#### Update & delete records

In [None]:
label = ln.ULabel.filter(name="project1").first()
label

In [None]:
label.name = "project1a"
label.save()
label

In [None]:
label.delete()

### Manage storage

#### Change default storage

The default storage location is:

In [None]:
ln.settings.storage

You can change it by setting `ln.settings.storage = "s3://my-bucket"`.

#### See all storage locations

In [None]:
ln.Storage.df()

In [None]:
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial