[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial1.ipynb)

# Manage features & labels

Why care about features & labels?

1. Finding data: Which datasets measured expression of cell marker `CD14`? Which characterized cell line `K562`? Which datasets have a test & train split? Etc.
2. Validating data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

:::{dropdown} A perspective on contextualizing data objects

We love the pydata family of data objects: `DataFrame`, `AnnData`, `pytorch.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, ...

But we couldn’t find an object for linking data objects to context!

So, we made `lamindb.File` and `lamindb.Dataset` to model how data objects relate to their context:

- other data objects, data transformations, models, users & pipelines that performed transformations (provenance)
- any entity of the domain in which data is generated and modeled (biology)

:::

```{note}

This notebook uses the instance created in part 1 of the tutorial: {doc}`/tutorial`.

```

In [None]:
import lamindb as ln

ln.track()

## Populate the label registry

We're studying 3 species of the Iris plant: `setosa`, `versicolor` & `virginica`.

Let's populate the {class}`~lamindb.Label` registry for them:

In [None]:
labels = [ln.Label(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(labels)

labels

Anticipating that we'll have many different types of labels when working with more data, we'd like to express that all 3 labels are species labels:

In [None]:
parent = ln.Label(name="is_species")
parent.save()

for label in labels:
    label.parents.add(parent)

parent.view_parents(with_children=True)

{class}`~lamindb.Label` lets you maintain an in-house ontology for all kinds of labels.

If you'd like to leverage pre-built ontologies for basic biological entities in the same way, see: {doc}`/bio-registries`.

In addition to species, we'd like to track the studies that produced the data:

In [None]:
ln.Label(name="study0").save()

## Validate labels

We already looked at the metadata for `study0`:

In [None]:
meta_file = ln.File.filter(key="iris_studies/study0_raw_images/meta.csv").one()
meta = meta_file.load(index_col=0)  # load a dataframe

meta.head()

Validate the content of that file by mapping it onto the {class}`~lamindb.Label` registry:

In [None]:
ln.Label.validate(meta["1"])

Everything passed and no fixes are needed!
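
Conceptually, validation is a membership check of each value against the names held in a registry. The following is a minimal sketch in plain Python, not LaminDB's implementation; `validate_values` and `registered_names` are hypothetical names used for illustration only:

```python
def validate_values(values, registered_names):
    """Return one boolean per value: True if the value is a registered name."""
    registered = set(registered_names)
    return [value in registered for value in values]


# hypothetical registry content and metadata column
registered_names = ["setosa", "versicolor", "virginica", "study0"]
column = ["setosa", "versicolor", "virginca"]  # the last value contains a typo

result = validate_values(column, registered_names)
print(result)
```

A value that fails the check is either a typo to fix or a new label to register.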

## Populate the feature registry

For every set of studied labels (measured values), we typically also want a feature (a measurement dimension, also known as a "column name").

Let's populate the {class}`~lamindb.Feature` registry:

In [None]:
ln.Feature(name="iris_species_name", type="category").save()
ln.Feature(name="study_name", type="category").save()

## Link labels to files

Labeling a set of files makes that set easy to find among a large number of files.

In [None]:
image_files = ln.File.filter(
    key__startswith="iris_studies/study0_raw_images", suffix=".jpg"
)

You can label a file by calling `file.add_labels()` and passing one or multiple label records.

Let's do this based on the labels in `meta.csv`:

In [None]:
study_label = ln.Label.filter(name="study0").one()
for file in image_files:
    species_name = meta.loc[file.path.name == meta["0"], "1"].values[0]
    species_label = ln.Label.filter(name=species_name).one()
    file.add_labels(species_label, feature="iris_species_name")
    file.add_labels(study_label, feature="study_name")
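
The lookup inside the loop matches each file name against the first metadata column and reads the species from the second. A plain-Python sketch of the same mapping, using made-up file names and rows (the real content of `meta.csv` is not reproduced here):

```python
# hypothetical rows of (file name, species), standing in for meta.csv
rows = [
    ("img_001.jpg", "setosa"),
    ("img_002.jpg", "versicolor"),
    ("img_003.jpg", "setosa"),
]

# build the lookup once instead of scanning the table for every file
species_by_file = dict(rows)

for file_name in ["img_002.jpg", "img_003.jpg"]:
    print(file_name, "->", species_by_file[file_name])
```

Building the dictionary up front is a design choice that pays off when there are many files; for a handful of images, the row-wise lookup above is just as good.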

## Query labeled files

Using the new annotations, you can now query image files by species & study labels:

In [None]:
labels = ln.Label.lookup()
ln.File.filter(labels__in=[labels.versicolor, labels.study0]).df().head()
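
Assuming the usual Django-style meaning of an `__in` lookup, the filter above matches files linked to any of the listed labels, not only files linked to all of them. A small sketch with plain sets shows the difference; the file names and label links are hypothetical:

```python
# hypothetical file -> linked label names
labels_by_file = {
    "img_001.jpg": {"setosa", "study0"},
    "img_002.jpg": {"versicolor", "study0"},
    "batch2.parquet": {"virginica"},
}

wanted = {"versicolor", "study0"}

# keep files linked to at least one of the wanted labels
any_match = [name for name, linked in labels_by_file.items() if linked & wanted]
# keep files linked to all of the wanted labels
all_match = [name for name, linked in labels_by_file.items() if wanted <= linked]

print(any_match)
print(all_match)
```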

## Describe files

In [None]:
file.describe()

## Label datasets

Labeling datasets works in the same way as labeling files.

In [None]:
dataset = ln.Dataset.filter(name="Iris study 1").one()

In [None]:
dataset.add_labels(study_label, feature="study_name")

In [None]:
dataset.describe()

## Dataframes

Consider a batch of the Iris flower dataset (a `DataFrame`):

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch1()

df.head()

## Validate & link features

Let's use {meth}`~lamindb.File.from_df` to track this DataFrame along with its columns as features:

In [None]:
file = ln.File.from_df(df, description="Iris flower dataset batch 1")

Features couldn't be validated and are ignored because this is an empty LaminDB instance without a single registered feature.

But all features here are meaningful and well curated, so let's create records for them:

In [None]:
features = ln.Feature.from_df(df)

features

As soon as we save them, they'll serve as the reference for validating further data batches.

In [None]:
ln.save(features)

:::{dropdown} How to track units of features?

It's easy using {class}`~lamindb.Feature.unit`. In the above example, you'd do:

```python
for feature in features:
    if feature.type == "float":
        feature.unit = "m"  # "m" is the SI symbol for the meter
        feature.save()
```

:::
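
Once units are recorded, checking whether two batches agree becomes a simple comparison. This is a sketch in plain Python under the assumption that units are available as a mapping from feature name to unit string; the feature names and the helper `inconsistent_units` are hypothetical:

```python
# hypothetical feature -> unit records for two data batches
units_batch1 = {"sepal_length": "m", "sepal_width": "m"}
units_batch2 = {"sepal_length": "cm", "sepal_width": "m"}


def inconsistent_units(reference, candidate):
    """Return features whose unit in `candidate` differs from `reference`."""
    return {
        name: (reference[name], unit)
        for name, unit in candidate.items()
        if name in reference and reference[name] != unit
    }


print(inconsistent_units(units_batch1, units_batch2))
```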

If we create the `File` now, we'll see that features are validated based on the registry content:

In [None]:
file = ln.File.from_df(df, description="Iris flower dataset batch 1")

Let's register the file along with its linked features.

In [None]:
file.save()

Get an overview of linked feature sets:

In [None]:
file.features

A `slot` provides a string key to access feature sets. It's typically the accessor of feature identifiers in the data object we're validating & registering (here, a `DataFrame`).

Let's use it to access all linked features:

In [None]:
file.features["columns"].df()
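
To make the idea of a slot concrete, here is a sketch that treats a table as a plain dictionary of columns and stores its feature identifiers under the slot `"columns"`. It only illustrates the concept; the column names other than `iris_species_name` are assumptions, and this is not how LaminDB stores feature sets internally:

```python
# hypothetical stand-in for a data object with named columns
table = {
    "sepal_length": [0.051, 0.049],
    "sepal_width": [0.035, 0.030],
    "iris_species_name": ["setosa", "setosa"],
}

# slot name -> feature identifiers reachable through that accessor
feature_sets = {"columns": list(table.keys())}

print(feature_sets["columns"])
```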

## Validate & link labels

The Iris dataset comes with labels within the data object.

In [None]:
species_labels = ln.Label.from_values(df["iris_species_name"])

species_labels

Let's save them to the {class}`~lamindb.Label` registry so that they get validated going forward:

In [None]:
ln.save(species_labels)

And annotate the file with the labels for feature `iris_species_name`:

In [None]:
file.add_labels(species_labels, feature="iris_species_name")

Now we can get linked labels from a feature:

In [None]:
file.get_labels("iris_species_name").df()

We can now query & search the file by whether `"setosa"` is linked to it:

In [None]:
ln.File.filter(labels__name="setosa").df()

In addition to features present _within_ a data object like a `DataFrame`, a file can be labeled with external metadata.

Let's label this file with `"experiment_1"`:

In [None]:
experiment1 = ln.Label(name="experiment_1")
experiment1.save()
experiment1

:::{dropdown} Why label a data batch by experiment?

We can then

1. query all files linked to this experiment
2. model it as a confounder when we analyze similar data from a follow-up experiment, and concatenate data using the label as a feature in a data matrix

:::
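
The second point can be sketched in plain Python: when rows from several experiments are concatenated, the experiment label becomes an additional column. The rows and the label `experiment_2` are hypothetical:

```python
# hypothetical rows from two experiments
rows_by_experiment = {
    "experiment_1": [{"sepal_length": 0.051}],
    "experiment_2": [{"sepal_length": 0.049}, {"sepal_length": 0.055}],
}

# concatenate and record the experiment label as an extra column
combined = [
    {**row, "experiment": experiment}
    for experiment, rows in rows_by_experiment.items()
    for row in rows
]

print(combined[0])
```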

Let's also register a feature that holds experiment labels in concatenated datasets:

In [None]:
ln.Feature(name="experiment", type="category").save()

In [None]:
file.add_labels(experiment1, feature="experiment")

We now have the original feature set and the external feature set:

In [None]:
file.features

This is the context for our file:

In [None]:
file.describe()

See the database content:

In [None]:
ln.view(registries=["Feature", "FeatureSet", "Label", "Modality"])

## Manage datasets

In the simple cases we just saw, we can use files to store datasets.

In more complex cases, we'd like to store collections of files and data in mutable storage backends (zarr, TileDB, DuckDB, etc.) or in SQL tables in BigQuery, Snowflake, or Postgres.

Hence, we need a second central class for data storage: {class}`~lamindb.Dataset`.

Let's say we have a second batch of the Iris dataset:

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch2()
ln.File.from_df(df, description="Iris flower dataset batch 2").save()

And load both files storing separate batches:

In [None]:
file1 = ln.File.filter(description="Iris flower dataset batch 1").one()
file2 = ln.File.filter(description="Iris flower dataset batch 2").one()

We can now create a sharded dataset from these two batches:

In [None]:
dataset = ln.Dataset([file1, file2], name="The combined Iris dataset")

In [None]:
dataset.save()

You can load the sharded dataset as if it were a single dataset:

In [None]:
dataset.load().tail()
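
Loading a sharded dataset amounts to presenting the rows of all shards as one table. A minimal sketch of that idea with plain Python records follows; the values are made up, and how LaminDB concatenates shards internally is not shown in this guide:

```python
from itertools import chain

# hypothetical shards: each file holds a list of row records
batch1 = [{"sepal_length": 0.051, "iris_species_name": "setosa"}]
batch2 = [
    {"sepal_length": 0.070, "iris_species_name": "versicolor"},
    {"sepal_length": 0.063, "iris_species_name": "virginica"},
]

# present the shards as one table by chaining their rows in order
combined = list(chain(batch1, batch2))

print(len(combined))
print(combined[-1]["iris_species_name"])
```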

Access the underlying two file objects:

In [None]:
dataset.files.list()

In [None]:
dataset.files.list()[0].view_lineage()

Or see the registries:

In [None]:
ln.view(registries=["Dataset", "File"])

For a more interesting data lineage graph, let's pretend we're now running a pipeline:

In [None]:
pipeline = ln.Transform(name="Iris Postprocessor", version="0.7.2")
ln.track(pipeline)  # create & track a pipeline
input_files = ln.File.filter(transform__name__contains="files & datasets").all()
[file.stage() for file in input_files]  # let's load the input files

## Manage metadata

To conclude this guide to basic file & metadata tracking, let's see how to update registry records.

### Hierarchical ontologies

Say we want to express that `experiment_1` belongs to `project_1`. We can use `.parents`:

In [None]:
project1 = ln.Label(name="project_1")
project1.save()
experiment1.parents.add(project1)
experiment1.view_parents()

For more info, see {meth}`~lamindb.dev.ParentsAware.view_parents`.
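
A parent relation like this forms a directed graph that can be walked upwards. The sketch below models the links from this guide as a plain dictionary and collects all ancestors of a label; the helper `ancestors` is hypothetical and independent of LaminDB:

```python
# hypothetical child -> parents mapping mirroring the labels in this guide
parents = {
    "experiment_1": ["project_1"],
    "setosa": ["is_species"],
    "project_1": [],
    "is_species": [],
}


def ancestors(name, parents):
    """Collect all ancestors of a label by walking the parent links."""
    found = []
    stack = list(parents.get(name, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(parents.get(current, []))
    return found


print(ancestors("experiment_1", parents))
```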

### Validate records upon creation

We already created a `project_1` label. Let's see what happens if we try to create it again:

In [None]:
label = ln.Label(name="project_1")

label.save()

Instead of creating a new record, LaminDB will load and return the existing record from the database.

If there is no exact match, LaminDB will warn you about potential duplicates when you create a record.

Say, we spell "project_1" without an underscore:

In [None]:
ln.Label(name="project 1")

You see that for every record creation, a search checks whether a similar record already exists!

This avoids inserting duplicate records.

You can switch it off (for performance gains) via `ln.settings.upon_create_search_names = False`.
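
To illustrate the kind of check involved, here is a sketch of a similarity search over existing names using Python's standard `difflib` module. This is a stand-in technique chosen for the example; LaminDB's actual search may work differently, and `similar_names` is a hypothetical helper:

```python
from difflib import get_close_matches

# names assumed to be present in the registry already
existing_names = ["project_1", "experiment_1", "study0"]


def similar_names(name, existing, cutoff=0.8):
    """Return existing names that closely resemble `name`."""
    return get_close_matches(name, existing, n=3, cutoff=cutoff)


print(similar_names("project 1", existing_names))
```

Running such a comparison against every existing name on each record creation has a cost that grows with the registry, which is why switching it off can improve performance.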

### Update records

In [None]:
label = ln.Label.filter(name="project_1").first()

In [None]:
label

In [None]:
label.name = "project_1a"

In [None]:
label.save()

In [None]:
label

### Delete records

Delete records like so:

In [None]:
label.delete()

## Other topics

### Change default storage

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"` and see all storage locations via:

In [None]:
ln.Storage.filter().df()

### Set verbosity

In [None]:
ln.settings.verbosity = 3  # only show info, no hints

In [None]:
# clean up what we wrote in this notebook
!lamin delete --force mydata
!rm -r mydata