![](https://img.shields.io/badge/tutorial2/2-lightgrey)
[![](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial2.ipynb)
[![](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/transform/dMtrt8YMSdl6z8)

# Tutorial: Features & labels

In {doc}`/tutorial`, we learned about provenance and data access. Here, we walk through validating & annotating datasets with features & labels to improve:

1. Findability: Which collections measured expression of cell marker `CD14`? Which characterized cell line `K562`? Which collections have a test & train split? Etc.
2. Usability: Are there typos in feature names? Are there typos in sampled labels? Are types and units of features consistent? Etc.

:::{hint}

This is a low-level tutorial aimed at a basic understanding of registering features and labels for annotation & validation.

If you're just looking to readily validate and annotate a dataset with features and labels, see this guide: {doc}`annotate`.

:::

In [None]:
import lamindb as ln
import pandas as pd

ln.settings.verbosity = "hint"

## Re-cap

Let's briefly re-cap what we learned in {doc}`/introduction`. We started with simple labeling:

In [None]:
# create a label
study0 = ln.ULabel(name="Study 0: initial plant gathering", description="My initial study").save()
# query an artifact from the previous tutorial
artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images").one()
# label the artifact
artifact.labels.add(study0)
# look at artifact metadata
artifact.describe()

In general, it's good practice to associate labels with features so that we can later feed them into learning algorithms with a defined dimension:

In [None]:
feature = ln.Feature(name="study_name", dtype="cat").save()
artifact.labels.add(study0, feature)
artifact.describe()

## Register metadata

Features and labels are the primary ways of registering domain-knowledge related metadata in LaminDB.

Features represent measurement dimensions (e.g. `"species"`) and labels represent measured values (e.g. `"iris setosa"`, `"iris versicolor"`, `"iris virginica"`).

In statistics, you'd say a feature is a categorical or numerical variable while a label is a simple category. Categorical variables draw their values from a set of categories.
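
To make the distinction concrete, here is a minimal plain-Python sketch that does not use LaminDB. A hypothetical `CategoricalFeature` class stands in for a feature and plain strings stand in for labels; the class name and its method are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CategoricalFeature:
    """Hypothetical stand-in for a categorical variable (not a LaminDB class)."""

    name: str
    categories: set[str] = field(default_factory=set)

    def is_valid(self, value: str) -> bool:
        # a measured value is valid if it is one of the known categories
        return value in self.categories


# the feature is the measurement dimension, the labels are its possible values
species = CategoricalFeature(
    name="species",
    categories={"iris setosa", "iris versicolor", "iris virginica"},
)

print(species.is_valid("iris setosa"))
print(species.is_valid("iris setsoa"))  # a typo is not a known category
```

The feature carries the name of the dimension, while each label is just one admissible value along it.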

### Register labels

We study 3 species of the Iris plant: `setosa`, `versicolor` & `virginica`. Let's create 3 labels with {class}`~lamindb.ULabel`.

In [None]:
speciess = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(speciess)

{class}`~lamindb.ULabel` lets you maintain an in-house ontology for all kinds of generic labels.

:::{dropdown} What are alternatives to ULabel?

In a complex project, you'll likely want dedicated typed registries for selected label types, e.g., {class}`~bionty.Gene`, {class}`~bionty.Tissue`, etc. See: {doc}`/bio-registries`.

{class}`~lamindb.ULabel`, however, will get you quite far and scale to ~1M labels.

:::

Anticipating that we'll have many different labels when working with more data, we'd like to express that all 3 labels are species labels:

In [None]:
is_species = ln.ULabel(name="is_species").save()
is_species.children.set(speciess)
is_species.view_parents(with_children=True)

In [None]:
studies = [ln.ULabel(name=name) for name in ["study0", "study1", "study2"]]
ln.save(studies)
is_study = ln.ULabel(name="is_study").save()
is_study.children.set(studies)
is_study.view_parents(with_children=True)
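
The parent-child grouping above can be pictured with a small plain-Python sketch. It is not how LaminDB stores records: the `children` dictionary and the two helper functions are hypothetical names used only to illustrate grouping labels under a type label such as `is_species`.

```python
# hypothetical in-memory picture of the hierarchy: parent label -> child labels
children: dict[str, list[str]] = {
    "is_species": ["setosa", "versicolor", "virginica"],
    "is_study": ["study0", "study1", "study2"],
}


def labels_of_type(parent: str) -> list[str]:
    """Return all labels grouped under a parent label, or an empty list."""
    return children.get(parent, [])


def parents_of(label: str) -> list[str]:
    """Return every parent label that lists the given label as a child."""
    return [parent for parent, kids in children.items() if label in kids]


print(labels_of_type("is_species"))
print(parents_of("study1"))
```

Grouping labels this way is what later makes it possible to ask for "all species labels" without listing them by name.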

### Register features

For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.

When we integrate datasets, feature names will label columns that store data.

Let's create and save a {class}`~lamindb.Feature` record to identify measurements of the iris species label (the `study_name` feature was already saved in the re-cap):

In [None]:
ln.Feature(name="iris_species_name", dtype="cat").save()

# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()

## Validate & link labels

We already looked at the metadata for `study0` before:

In [None]:
meta_artifact = ln.Artifact.filter(key="iris_studies/study0_raw_images/meta.csv").one()
meta = meta_artifact.load(index_col=0)  # load a dataframe
meta.head()

### Validate metadata

Depending on the data generation process, such metadata might or might not match the labels we defined in our registries.

Let's validate the labels by mapping the values stored in the artifact against the {class}`~lamindb.ULabel` registry:

In [None]:
ln.ULabel.validate(meta["1"], field="name")

Everything passed and no fixes are needed!

If validation doesn't pass, {meth}`~lamindb.core.CanValidate.standardize` and {meth}`~lamindb.core.CanValidate.inspect` will help standardize data.
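
Conceptually, validation checks each value against the names stored in a registry, and standardization maps known synonyms to registered names. The following plain-Python sketch illustrates that idea under stated assumptions: the `registered` set, the `synonyms` mapping, and both helper functions are hypothetical and do not reflect LaminDB's implementation.

```python
# hypothetical registry content
registered = {"setosa", "versicolor", "virginica"}
# hypothetical synonym table: non-standard spelling -> registered name
synonyms = {"Setosa": "setosa", "iris versicolor": "versicolor"}


def validate(values: list[str]) -> list[bool]:
    """Flag, per value, whether it exactly matches a registered name."""
    return [value in registered for value in values]


def standardize(values: list[str]) -> list[str]:
    """Replace known synonyms with registered names; leave other values as-is."""
    return [synonyms.get(value, value) for value in values]


raw = ["setosa", "Setosa", "iris versicolor", "virginca"]
print(validate(raw))
cleaned = standardize(raw)
print(validate(cleaned))  # the typo "virginca" still fails and needs a manual fix
```

Standardization only resolves spellings the synonym table knows about; genuine typos remain flagged and have to be corrected at the source.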

### Label artifacts

You can label an artifact by calling `artifact.labels.add()` and passing one or more labels and, optionally, the corresponding feature.

Let's do this based on the labels in `meta.csv`:

In [None]:
ln.Artifact.df()

In [None]:
study_artifacts = ln.Artifact.filter(key__startswith="iris_studies/", suffix="").all()
study_labels = ln.ULabel.filter(name="is_study").one().children.all()
for artifact, study in zip(study_artifacts, study_labels):
    artifact.labels.add(study, feature=features.study_name)
    df = pd.read_csv(artifact.path / "meta.csv", index_col=0)
    species_labels = ln.ULabel.from_values(df["1"].unique())
    artifact.labels.add(species_labels, feature=features.iris_species_name)

### Query artifacts by labels

Using the new annotations, you can now query image artifacts by species & study labels:

In [None]:
ulabels = ln.ULabel.lookup()
artifact = ln.Artifact.filter(ulabels=ulabels.study0).first()

We also see them when calling {meth}`~lamindb.core.Data.describe`:

In [None]:
artifact.describe()

### Label collections

Labeling collections works in the same way as labeling artifacts:

In [None]:
collection = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
collection.labels.add(ulabels.study0, feature=features.study_name)
all_species_labels = ln.ULabel.filter(parents__name="is_species").all()
collection.labels.add(all_species_labels, feature=features.iris_species_name)

In [None]:
collection.describe()

## Run an ML model

Let's now run a mock ML model that transforms the images into 4 high-level features.

In [None]:
def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform=transform)
    input_data = ln.Collection.filter(name__startswith="Iris collection", version="1").one()
    input_paths = [
        path.download_to(path.name) for path in input_data.artifacts[0].path.glob("*")
    ]
    # apply ML model
    output_data = ln.core.datasets.df_iris_in_meter_study1()
    return output_data


df = run_ml_model()

The output is a dataframe:

In [None]:
df.head()

And this is the pipeline that produced the dataframe:

In [None]:
ln.core.run_context.transform.view_parents()

### Register the output data

Let's first register the features of the transformed data:

In [None]:
new_features = ln.Feature.from_df(df)
ln.save(new_features)

:::{dropdown} How to track units of features?

Use the `unit` field of {class}`~lamindb.Feature`. In the above example, you'd do:

```python
# iterate over the features created from the dataframe above
for feature in new_features:
    if feature.dtype == "number":
        feature.unit = "m"  # SI unit symbol for meter
        feature.save()
```

:::

We can now validate & register the dataframe in one line:

In [None]:
artifact = ln.Artifact.from_df(
    df,
    description="Iris study 1 - after measuring sepal & petal metrics",
)
artifact.save()

In [None]:
artifact.features.add_feature_set(ln.FeatureSet(new_features), slot="columns")

### Feature sets

Get an overview of linked features:

In [None]:
artifact.features

You'll see that they're always grouped in sets that correspond to records of {class}`~lamindb.FeatureSet`.

:::{dropdown} Why does LaminDB model feature sets, not just features?

1. Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you'll only need to store 1M instead of 1M x 20k = 20B links.
2. Interpretation: Model protein panels, gene panels, etc.
3. Data integration: Feature sets provide the currency that determines whether two collections can be easily concatenated.

These reasons do not hold for label sets. Hence, LaminDB does not model label sets.

:::
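
The performance argument in the dropdown above is simple arithmetic. The sketch below reproduces it with the numbers from the text. The variable names are illustrative, and as an assumption of this sketch it also counts the one-time links between the feature set and its features, in addition to the 1M sample links mentioned in the text.

```python
n_samples = 1_000_000  # 1M samples
n_features = 20_000  # panel of 20k transcripts

# linking every sample to every feature individually
links_without_sets = n_samples * n_features

# linking every sample to one shared feature set instead,
# plus a one-time set of links from the feature set to its features
links_with_sets = n_samples + n_features

print(f"{links_without_sets:,} links without feature sets")
print(f"{links_with_sets:,} links with a shared feature set")
```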

A `slot` provides a string key to access feature sets. It's typically the accessor within the registered data object, here `pd.DataFrame.columns`.

Let's use it to access all linked features:

In [None]:
artifact.features["columns"].df()

There is one categorical feature, so let's add the species labels:

In [None]:
species_labels = ln.ULabel.filter(parents__name="is_species").all()
artifact.labels.add(species_labels, feature=features.iris_species_name)

Let's now add study labels:

In [None]:
artifact.labels.add(ulabels.study0, feature=features.study_name)

In addition to the `columns` feature set, we now have an `external` feature set:

In [None]:
artifact.features

This is the context for our artifact:

In [None]:
artifact.describe()
artifact.view_lineage()

See the database content:

In [None]:
ln.view(registries=["Feature", "FeatureSet", "ULabel"])

## Manage follow-up data

Assume that a couple of weeks later, we receive a new dataset in a follow-up study 2.

Let's track a new analysis:

In [None]:
ln.settings.transform.stem_uid = "dMtrt8YMSdl6"
ln.settings.transform.version = "1"
ln.track()

### Register a joint collection

Assume we already ran all preprocessing including the ML model.

We get a DataFrame and store it as an artifact:

In [None]:
df = ln.core.datasets.df_iris_in_meter_study2()
ln.Artifact.from_df(df, description="Iris study 2 - transformed").save()

Let's load it:

In [None]:
artifact2 = ln.Artifact.filter(description="Iris study 2 - transformed").one()

We can now store the joint collection:

In [None]:
collection = ln.Collection(
    [artifact, artifact2], name="Iris flower study 1 & 2 - transformed"
)
collection.save()

### Auto-concatenate datasets

Because both datasets measured the same validated feature set, we can auto-concatenate the collection:

In [None]:
collection.load().tail()
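
The idea behind auto-concatenation can be sketched without LaminDB: two tables are safe to stack when they share the same set of column names. The helper `concat_if_same_features` and the toy column names and values below are hypothetical; they only illustrate the check and say nothing about how `collection.load()` is implemented.

```python
Row = dict[str, float]


def concat_if_same_features(a: list[Row], b: list[Row]) -> list[Row]:
    """Stack two tables only if every row has the same column names."""
    columns = {frozenset(row) for row in a + b}
    if len(columns) > 1:
        raise ValueError("tables measure different feature sets")
    return a + b


# toy tables with made-up values
study1 = [{"sepal_length": 0.051, "petal_length": 0.014}]
study2 = [{"sepal_length": 0.063, "petal_length": 0.049}]
joint = concat_if_same_features(study1, study2)
print(len(joint))

try:
    concat_if_same_features(study1, [{"sepal_width": 0.035}])
except ValueError as error:
    print(error)
```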

We can also access & query the underlying two artifact objects:

In [None]:
collection.artifacts.df()

Or look at their data lineage:

In [None]:
collection.view_lineage()

Or look at the database:

In [None]:
ln.view()

This is it! 😅

If you're interested, please check out the guides & use cases or open an issue on GitHub to [discuss](https://github.com/laminlabs/lamindb/issues/new).

## Appendix

### Manage metadata

#### Avoid duplicates

Let's create a label `"project1"`:

In [None]:
ln.ULabel(name="project1").save()

We already created a `project1` label before. Let's see what happens if we try to create it again:

In [None]:
label = ln.ULabel(name="project1")
label.save()

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell "project 1" with a space:

In [None]:
ln.ULabel(name="project 1")

To avoid inserting duplicates, LaminDB searches for records with similar names whenever you create a new record.

You can switch it off for performance gains via {attr}`~lamindb.core.Settings.upon_create_search_names`.
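
A rough plain-Python picture of this behavior, using the standard library's `difflib` for fuzzy matching: an exact match returns the existing name, and a near match triggers a warning. The `registry` list, the `create_label` helper, and the 0.8 similarity cutoff are assumptions for illustration; LaminDB's actual search is not shown here.

```python
import difflib
import warnings

# hypothetical registry of existing label names
registry = ["project1", "study0", "setosa"]


def create_label(name: str) -> str:
    """Return the existing name on an exact match, else register a new one."""
    if name in registry:
        return name  # no duplicate is created
    similar = difflib.get_close_matches(name, registry, n=3, cutoff=0.8)
    if similar:
        warnings.warn(f"records with similar names exist: {similar}")
    registry.append(name)
    return name


create_label("project1")  # exact match, registry is unchanged
print(len(registry))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    create_label("project 1")  # near match, warns before registering
print(len(caught), len(registry))
```

Skipping the similarity search saves one lookup per created record, which is the performance gain the setting above trades against the duplicate warning.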

#### Update & delete records

In [None]:
label = ln.ULabel.filter(name="project1").first()
label

In [None]:
label.name = "project1a"
label.save()
label

In [None]:
label.delete()

### Manage storage

#### Change default storage

The default storage location is:

In [None]:
ln.settings.storage

You can change it by setting `ln.settings.storage = "s3://my-bucket"`.

#### See all storage locations

In [None]:
ln.Storage.df()

In [None]:
# clean up what we wrote in this notebook
!rm -r lamin-tutorial
!lamin delete --force lamin-tutorial