![](https://img.shields.io/badge/tutorial2/2-lightgrey)
[![](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial1.ipynb)
[![](https://img.shields.io/badge/laminlabs/lamindata-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?id=dMtrt8YMSdl6z8)

# Tutorial: Features & labels

In the previous tutorial ({doc}`/tutorial`), we learned how to leverage basic metadata for files & datasets to access data (query, search, stage & load).

Here, we walk through annotating & validating data with features & labels to improve:

1. Finding data: Which datasets measured expression of cell marker `CD14`? Which characterized cell line `K562`? Which datasets have a test & train split? Etc.
2. Using data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

:::{dropdown} What was LaminDB's most basic inspiration?

The pydata family of objects is at the heart of most data science, ML & comp bio workflows: `DataFrame`, `AnnData`, `pytorch.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, ...

And still, we couldn’t find a tool to link these objects to context so that they could be analyzed in context!

Context relevant for analyses includes anything that's needed to interpret & model data.

So, `lamindb.File` and `lamindb.Dataset` track:

- data sources, data transformations, models, users & pipelines that performed transformations (provenance)
- any entity of the domain in which data is generated and modeled (features & labels)

:::

In [None]:
import lamindb as ln
import pandas as pd

In [None]:
ln.settings.verbosity = "hint"

## Register metadata

### Register labels

We study 3 organisms of the Iris plant: `setosa`, `versicolor` & `virginica`.

Let's populate the universal (untyped) label registry ({class}`~lamindb.ULabel`) for them:

In [None]:
labels = [ln.ULabel(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(labels)

labels

Anticipating that we'll have many different labels when working with more data, we'd like to express that all 3 labels are organism labels:

In [None]:
parent = ln.ULabel(name="is_organism")
parent.save()

for label in labels:
    label.parents.add(parent)

parent.view_parents(with_children=True)

{class}`~lamindb.ULabel` enables you to maintain an in-house ontology for all kinds of _untyped_ labels.

If you'd like to leverage pre-built _typed_ ontologies for basic biological entities in the same way, see: {doc}`/bio-registries`.
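
To make the parent-child idea concrete, here is a minimal plain-Python sketch of a label hierarchy. It is only an illustration of the concept, not how LaminDB stores labels; the `parents` mapping and the `children_of` helper are hypothetical names.

```python
# Plain-Python sketch of a label hierarchy; not LaminDB's implementation.
from collections import defaultdict

# child label -> set of parent labels (hypothetical in-memory structure)
parents = defaultdict(set)
for name in ["setosa", "versicolor", "virginica"]:
    parents[name].add("is_organism")


def children_of(parent: str) -> list:
    """Return all labels that list `parent` among their parents, sorted by name."""
    return sorted(child for child, ps in parents.items() if parent in ps)


organism_labels = children_of("is_organism")
print(organism_labels)
```

Because a label can have several parents, the same structure also covers cases where one label belongs to more than one group.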

In addition to organism, we'd like to track the studies that produced the data:

In [None]:
ln.ULabel(name="study0").save()

:::{dropdown} Why label a data batch by study?

We can then

1. query all files linked to this experiment
2. model it as a confounder when we analyze similar data from a follow-up experiment, and concatenate data using the label as a feature in a data matrix

:::

### Register features

For every set of studied labels (measured values), we typically also want an identifier for the corresponding measurement dimension: the feature.

When we integrate data batches, feature names will label columns that store data.

Let's create and save two {class}`~lamindb.Feature` records to identify measurements of the iris organism label and the study:

In [None]:
ln.Feature(name="iris_organism_name", type="category").save()
ln.Feature(name="study_name", type="category").save()
# create a lookup object so that we can access features with auto-complete
features = ln.Feature.lookup()

## Validate & link labels

We already looked at the metadata for `study0` in the previous tutorial:

In [None]:
meta_file = ln.File.filter(key="iris_studies/study0_raw_images/meta.csv").one()
meta = meta_file.load(index_col=0)  # load a dataframe

meta.head()

### Validate metadata

Depending on the data generation process, such metadata might or might not match the labels we defined in our registries.

Let's validate the labels by mapping the values stored in the file onto the {class}`~lamindb.ULabel` registry:

In [None]:
ln.ULabel.validate(meta["1"], field="name")

Everything passed and no fixes are needed!

If validation doesn't pass, {meth}`~lamindb.dev.CanValidate.standardize` and {meth}`~lamindb.dev.CanValidate.inspect` will help curate data.
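
Conceptually, the three operations can be pictured with plain Python as follows. This is a rough sketch under assumptions, not LaminDB's implementation: the misspelled value `versicolour` and the `synonyms` table are made up for illustration.

```python
# Conceptual sketch only; LaminDB's validate/inspect/standardize do more than this.
registry_names = {"setosa", "versicolor", "virginica"}

# Hypothetical incoming values, one of them with a typo
values = ["setosa", "versicolour", "virginica", "setosa"]

# validate: one boolean per value, True if the value exists in the registry
validated = [value in registry_names for value in values]

# inspect: collect the distinct values that failed validation
non_validated = sorted({v for v, ok in zip(values, validated) if not ok})

# standardize: map known variants onto registered names (hypothetical synonym table)
synonyms = {"versicolour": "versicolor"}
standardized = [synonyms.get(v, v) for v in values]

print(validated, non_validated, standardized)
```

After standardizing, every value matches a registered name, so a second validation pass would succeed.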

### Label files

Labeling a set of files is useful if we want to make the set queryable among a large number of files.

You can label a file by calling `file.labels.add()` and passing one or more label records.

Let's do this based on the labels in `meta.csv`:

In [None]:
image_files = ln.File.filter(
    key__startswith="iris_studies/study0_raw_images", suffix=".jpg"
)

study_label = ln.ULabel.filter(name="study0").one()
for file in image_files:
    file.labels.add(study_label, feature=features.study_name)
    # get organism name from metadata file
    organism_name = meta.loc[file.path.name == meta["0"], "1"].values[0]
    organism_label = ln.ULabel.filter(name=organism_name).one()
    file.labels.add(organism_label, feature=features.iris_organism_name)
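
The loop above looks up each file's organism by matching the file name against column `"0"` of the metadata table and reading column `"1"`. A minimal standard-library sketch of that lookup, using a hypothetical two-row excerpt with made-up file names, could look like this:

```python
import csv
import io

# Hypothetical two-row excerpt; the layout (an index column plus columns "0" for
# the file name and "1" for the organism name) is inferred from the loop above.
csv_text = ",0,1\n0,iris-a.jpg,setosa\n1,iris-b.jpg,versicolor\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Build the lookup once instead of scanning the table for every file
organism_by_filename = {row["0"]: row["1"] for row in rows}

organism_name = organism_by_filename["iris-b.jpg"]
print(organism_name)
```

Building the dictionary once avoids re-scanning the table for every file, which can matter when a study contains many images.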

### Query files by labels

Using the new annotations, you can now query image files by organism & study labels:

In [None]:
labels = ln.ULabel.lookup()
file = ln.File.filter(ulabels__in=[labels.versicolor, labels.study0]).first()

We also see them when calling {meth}`~lamindb.dev.Data.describe`:

In [None]:
file.describe()
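
A note on the `ulabels__in` query above: assuming the filter follows the semantics of Django's `__in` lookup, it matches files that carry *any* of the listed labels, not necessarily all of them. The difference is sketched below in plain Python with hypothetical file keys and label sets:

```python
# Plain-Python sketch of "any of" vs. "all of" matching; hypothetical data.
file_labels = {
    "img_001.jpg": {"setosa", "study0"},
    "img_002.jpg": {"versicolor", "study0"},
    "img_003.jpg": {"virginica"},
}
wanted = {"versicolor", "study0"}

# any-of: at least one wanted label is attached
any_match = sorted(k for k, labels in file_labels.items() if labels & wanted)
# all-of: every wanted label is attached
all_match = sorted(k for k, labels in file_labels.items() if wanted <= labels)

print(any_match, all_match)
```

If you need files that carry every listed label, one option is to chain one filter per label.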

### Label datasets

Labeling datasets works in the same way as labeling files:

In [None]:
# query the dataset
dataset = ln.Dataset.filter(name="Iris study 1").one()

# add study label
dataset.labels.add(study_label, feature=features.study_name)

# get all organism labels
all_organism_labels = ln.ULabel.filter(parents__name="is_organism").all()
dataset.labels.add(all_organism_labels, feature=features.iris_organism_name)

Check that the dataset is labeled:

In [None]:
dataset.describe()

## Run an ML model

Let's now run an ML model that transforms the images into 4 high-level features.

In [None]:
def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform)
    input_dataset = ln.Dataset.filter(name="Iris study 1").one()
    input_paths = [file.stage() for file in input_dataset.files.all()]
    # transform the data...
    output_dataset = ln.dev.datasets.df_iris_in_meter_study1()
    return output_dataset


df = run_ml_model()

The output is a dataframe:

In [None]:
df.head()

And this is the ML pipeline that produced the dataframe:

In [None]:
ln.run_context.transform.view_parents()

### Register the output data

Let's first register the features of the transformed data:

In [None]:
new_features = ln.Feature.from_df(df)
ln.save(new_features)

:::{dropdown} How to track units of features?

Use the `unit` field of {class}`~lamindb.Feature`. In the above example, you'd do:

```python
for feature in new_features:  # the Feature records created from the dataframe
    if feature.type == "number":
        feature.unit = "m"  # SI symbol for meter
        feature.save()
```

:::

We can now validate & register the dataframe in one line by creating a {class}`~lamindb.Dataset` record:

In [None]:
dataset = ln.Dataset.from_df(
    df,
    name="Iris study 1 - transformed",
    description="Iris dataset after measuring sepal & petal metrics",
)

dataset.save()

### Feature sets

Get an overview of linked features:

In [None]:
dataset.features

You'll see that they're always grouped in sets that correspond to records of {class}`~lamindb.FeatureSet`.

:::{dropdown} Why does LaminDB model feature sets, not just features?

1. Performance: Imagine you measure the same panel of 20k transcripts in 1M samples. By modeling the panel as a feature set, you'll only need to store 1M instead of 1M x 20k = 20B links.
2. Interpretation: Model protein panels, gene panels, etc.
3. Data integration: Feature sets provide the currency that determines whether two datasets can be easily concatenated.

These reasons do not hold for label sets. Hence, LaminDB does not model label sets.

:::
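
The arithmetic behind the performance argument can be checked in a few lines. The variable names are ours; the numbers are the ones from the example above.

```python
n_samples = 1_000_000  # 1M samples
n_features = 20_000    # panel of 20k transcripts

# Linking every sample to every feature individually
links_per_feature = n_samples * n_features
# Linking every sample to one shared feature set instead
links_per_feature_set = n_samples

print(f"{links_per_feature:,} vs. {links_per_feature_set:,} links")
print(f"reduction factor: {links_per_feature // links_per_feature_set:,}")
```

The reduction factor equals the size of the panel, so the saving grows with the number of features measured per sample.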

A `slot` provides a string key to access feature sets. It's typically the accessor within the registered data object, here `pd.DataFrame.columns`.

Let's use it to access all linked features:

In [None]:
dataset.features["columns"].df()

There is one categorical feature. Let's add the organism labels for it:

In [None]:
organism_labels = ln.ULabel.filter(parents__name="is_organism").all()
dataset.labels.add(organism_labels, feature=features.iris_organism_name)

Let's now add study labels:

In [None]:
dataset.labels.add(study_label, feature=features.study_name)

In addition to the `columns` feature set, we now have an `external` feature set:

In [None]:
dataset.features

This is the context for our dataset:

In [None]:
dataset.describe()

In [None]:
dataset.file.view_flow()

See the database content:

In [None]:
ln.view(registries=["Feature", "FeatureSet", "ULabel"])

## Manage follow-up data

Assume that a couple of weeks later, we receive a new batch of data in a follow-up study 2.

Let's track a new analysis:

In [None]:
ln.track()

### Register a joint dataset

Assume we already ran all preprocessing including the ML model.

We get a DataFrame and store it as a file:

In [None]:
df = ln.dev.datasets.df_iris_in_meter_study2()
ln.File.from_df(df, description="Iris study 2 - transformed").save()

Let's load both data batches as files:

In [None]:
dataset1 = ln.Dataset.filter(name="Iris study 1 - transformed").one()

file1 = dataset1.file
file2 = ln.File.filter(description="Iris study 2 - transformed").one()

We can now store the joint dataset:

In [None]:
dataset = ln.Dataset([file1, file2], name="Iris flower study 1 & 2 - transformed")

dataset.save()

### Auto-concatenate data batches

Because both data batches measured the same validated feature set, we can auto-concatenate the sharded dataset.

This means we can load it as if it were stored in a single file:

In [None]:
dataset.load().tail()
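
To illustrate why a shared feature set makes concatenation safe, here is a small plain-Python sketch. It is not how LaminDB loads a sharded dataset; the rows, column names, and the helpers `feature_set` and `concat` are hypothetical.

```python
# Sketch with plain lists of dicts; hypothetical column names and values.
batch1 = [{"sepal_length": 0.051, "iris_organism_name": "setosa"}]
batch2 = [{"sepal_length": 0.070, "iris_organism_name": "versicolor"}]


def feature_set(batch: list) -> frozenset:
    """Return the set of column names that appear in a batch."""
    return frozenset().union(*(row.keys() for row in batch))


def concat(*batches: list) -> list:
    """Concatenate batches row-wise, refusing batches with differing feature sets."""
    sets = {feature_set(batch) for batch in batches}
    if len(sets) != 1:
        raise ValueError(f"Feature sets differ: {sets}")
    return [row for batch in batches for row in batch]


joint = concat(batch1, batch2)
print(len(joint))
```

Comparing feature sets up front turns a silent misalignment of columns into an explicit error.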

We can also access & query the underlying two file objects:

In [None]:
dataset.files.list()

Or look at their data flow:

In [None]:
dataset.view_flow()

Or look at the database:

In [None]:
ln.view()

This is it! 😅

If you're interested, please check out the guides & use cases, or open an issue on GitHub to [discuss](https://github.com/laminlabs/lamindb/issues/new).

## Appendix

### Manage metadata

#### Hierarchical ontologies

Say we want to express that `study0` belongs to project 1 and is a study. We can use `.parents` for this:

In [None]:
project1 = ln.ULabel(name="project1")
project1.save()
is_study = ln.ULabel(name="is_study")
is_study.save()
study_label.parents.set([project1, is_study])
study_label.view_parents()

For more info, see {meth}`~lamindb.dev.HasParents.view_parents`.

#### Avoid duplicates

We already created a `project1` label before. Let's see what happens if we try to create it again:

In [None]:
label = ln.ULabel(name="project1")

label.save()

Instead of creating a new record, LaminDB loads and returns the existing record from the database.

If there is no exact match, LaminDB will warn you about potential duplicates when you create a record.

Say, we spell "project 1" with a white space:

In [None]:
ln.ULabel(name="project 1")

To avoid inserting duplicates when creating new records, a search checks whether a similar record already exists.

You can switch it off for performance gains via {attr}`~lamindb.dev.Settings.upon_create_search_names`.
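
The idea of such a similarity check can be sketched with Python's standard library. This uses `difflib` as a stand-in and is an assumption for illustration only; LaminDB's actual search may work differently. The `similar_names` helper and the 0.8 cutoff are hypothetical.

```python
import difflib

# Names assumed to exist in the registry already
existing_names = ["project1", "study0", "is_study"]


def similar_names(name: str, cutoff: float = 0.8) -> list:
    """Return existing names that closely resemble `name` (exact matches excluded)."""
    matches = difflib.get_close_matches(name, existing_names, n=3, cutoff=cutoff)
    return [m for m in matches if m != name]


candidates = similar_names("project 1")
print(candidates)
```

A fuzzy comparison like this has to run against existing names on every record creation, which is why switching it off can speed up bulk inserts.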

#### Update & delete records

In [None]:
label = ln.ULabel.filter(name="project1").first()

label

In [None]:
label.name = "project1a"

label.save()

label

In [None]:
label.delete()

### Manage storage

#### Change default storage

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"`.

#### See all storage locations

In [None]:
ln.Storage.filter().df()

### Set verbosity

To reduce the number of logging messages, set {attr}`~lamindb.dev.Settings.verbosity`:

In [None]:
ln.settings.verbosity = 3  # only show info, no hints

In [None]:
# clean up what we wrote in this notebook
!lamin delete --force lamin-tutorial
!rm -r lamin-tutorial