[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial1.ipynb)

# Manage features & labels

Why care about features & labels?

1. Finding data: Which datasets measured expression of cell marker `CD14`? Which characterized cell line `K562`? Which datasets have a test & train split? Etc.
2. Validating data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

:::{dropdown} A perspective on contextualizing data objects

We love the pydata family of data objects: `DataFrame`, `AnnData`, `pytorch.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, ...

But we couldn’t find an object for linking data objects to context!

So, we made `lamindb.File` and `lamindb.Dataset` to model how data objects relate to their context:

- other data objects, data transformations, models, users & pipelines that performed transformations (provenance)
- any entity of the domain in which data is generated and modeled (biology)

:::

```{note}

This notebook uses the instance created in part 1 of the tutorial: {doc}`/tutorial`.

```

In [None]:
import lamindb as ln
import pandas as pd

## Register metadata

### Register labels

We're studying 3 species of the Iris plant: `setosa`, `versicolor` & `virginica`.

Let's populate the {class}`~lamindb.Label` registry for them:

In [None]:
labels = [ln.Label(name=name) for name in ["setosa", "versicolor", "virginica"]]
ln.save(labels)

labels

Anticipating that we'll have many different types of labels when working with more data, we'd like to express that all 3 labels are species labels:

In [None]:
parent = ln.Label(name="is_species")
parent.save()

for label in labels:
    label.parents.add(parent)

parent.view_parents(with_children=True)

{class}`~lamindb.Label` lets you maintain an in-house ontology for all kinds of labels.
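
To make the parent-child idea concrete, here is a minimal plain-Python sketch of such a label hierarchy. It does not use LaminDB: the `parents` dictionary and the `children` and `ancestors` helpers are hypothetical stand-ins for what the registry stores in the database.

```python
# Hypothetical in-memory stand-in for a label hierarchy (not the LaminDB API).
# Each label name maps to the set of its direct parents.
parents = {
    "setosa": {"is_species"},
    "versicolor": {"is_species"},
    "virginica": {"is_species"},
    "is_species": set(),
}


def children(parent):
    """Return all labels that list `parent` as a direct parent, sorted by name."""
    return sorted(name for name, ps in parents.items() if parent in ps)


def ancestors(label):
    """Collect the direct and indirect parents of `label`."""
    found = set()
    stack = list(parents.get(label, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


species = children("is_species")
print(species)
print(ancestors("setosa"))
```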

If you'd like to leverage pre-built ontologies for basic biological entities in the same way, see: {doc}`/bio-registries`.

In addition to species, we'd like to track the studies that produced the data:

In [None]:
ln.Label(name="study0").save()

:::{dropdown} Why label a data batch by study?

We can then

1. query all files linked to this experiment
2. model it as a confounder when we analyze similar data from a follow-up experiment, and concatenate data using the label as a feature in a data matrix

:::
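
The second point in the dropdown above can be sketched in plain Python: each batch gets its study label as an extra column before the batches are concatenated. The rows, the values, and the `with_study` helper are made up for illustration, and `study1` is a hypothetical follow-up study.

```python
# Made-up measurements for two batches; the values are illustrative only.
batch_study0 = [
    {"sepal_length": 0.051, "petal_length": 0.014},
    {"sepal_length": 0.049, "petal_length": 0.014},
]
batch_study1 = [
    {"sepal_length": 0.063, "petal_length": 0.060},
]


def with_study(rows, study_name):
    """Return copies of `rows` with the study label added as a column."""
    return [{**row, "study_name": study_name} for row in rows]


# Concatenate the batches; the study label is now a feature (column)
# that a downstream analysis can treat as a potential confounder.
combined = with_study(batch_study0, "study0") + with_study(batch_study1, "study1")

for row in combined:
    print(row)
```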

### Register features

For every set of studied labels (measured values), we typically also want a feature (a measurement dimension aka "column name").

Let's populate the {class}`~lamindb.Feature` registry with two categorical features:

In [None]:
ln.Feature(name="iris_species_name", type="category").save()
ln.Feature(name="study_name", type="category").save()

## Validate & link labels

We already looked at the metadata for `study0` before:

In [None]:
meta_file = ln.File.filter(key="iris_studies/study0_raw_images/meta.csv").one()
meta = meta_file.load(index_col=0)  # load a dataframe

meta.head()

### Validate metadata

Depending on the data generation process, such metadata might or might not match the labels we defined in our registries.

Let's validate the labels by mapping the values stored in the file against the {class}`~lamindb.Label` registry:

In [None]:
ln.Label.validate(meta["1"], field="name")

Everything passed and no fixes are needed!

If validation doesn't pass, {meth}`~lamindb.dev.CanValidate.standardize` and {meth}`~lamindb.dev.CanValidate.inspect` will help curate data.
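
Conceptually, validating against a registry field is a membership check of each value against the registered names. The following plain-Python sketch illustrates that idea. `registered_names` and `validate_values` are hypothetical helpers, not LaminDB's implementation, and the real method may behave differently in detail.

```python
# Hypothetical stand-in for the names stored in the Label registry.
registered_names = {"setosa", "versicolor", "virginica", "is_species", "study0"}


def validate_values(values, registry):
    """Return one boolean per value: True if the value is registered."""
    return [value in registry for value in values]


# A made-up metadata column with one typo ("virginia" instead of "virginica").
column = ["setosa", "versicolor", "virginia", "setosa"]

flags = validate_values(column, registered_names)
not_validated = sorted({value for value, ok in zip(column, flags) if not ok})

print(flags)
print(not_validated)
```

Values that fail such a check are the ones you would then fix by hand or with the curation helpers mentioned above.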

### Label files

Labeling a set of files is useful if we want to make the set queryable among a large number of files.

You can label a file by calling `file.add_labels()` and passing one or more label records.

Let's do this based on the labels in `meta.csv`:

In [None]:
image_files = ln.File.filter(
    key__startswith="iris_studies/study0_raw_images", suffix=".jpg"
)

study_label = ln.Label.filter(name="study0").one()
for file in image_files:
    file.add_labels(study_label, feature="study_name")
    # get species name from metadata file
    species_name = meta.loc[file.path.name == meta["0"], "1"].values[0]
    species_label = ln.Label.filter(name=species_name).one()
    file.add_labels(species_label, feature="iris_species_name")
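
The loop above finds each file's species by matching the file name against column `"0"` of `meta` and reading column `"1"`. When the metadata has many rows, building the mapping once can be easier to read. Here is a minimal sketch with made-up file names, using plain Python lists in place of the dataframe columns:

```python
# Made-up stand-ins for the two metadata columns used above:
# column "0" holds file names, column "1" holds species names.
file_names = ["iris-0001.jpg", "iris-0002.jpg", "iris-0003.jpg"]
species_names = ["setosa", "versicolor", "setosa"]

# Build the file-name -> species mapping once instead of filtering per file.
species_by_file = dict(zip(file_names, species_names))


def species_for(file_name):
    """Return the species for a file name, or None if it is not in the metadata."""
    return species_by_file.get(file_name)


print(species_for("iris-0002.jpg"))
print(species_for("missing.jpg"))
```

Unlike indexing with `.values[0]`, which raises an `IndexError` when no row matches, this lookup returns `None` for unknown file names, so the caller can decide how to handle files without metadata.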

### Query files by labels

Using the new annotations, you can now query image files by species & study labels:

In [None]:
labels = ln.Label.lookup()
file = ln.File.filter(labels__in=[labels.versicolor, labels.study0]).first()

We also see them when calling {meth}`~lamindb.dev.Data.describe`:

In [None]:
file.describe()
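
The query above passes two labels to `labels__in`. Assuming this lookup follows the usual Django-style `__in` semantics, it matches files linked to any of the listed labels, not only files linked to all of them. The plain-Python sketch below contrasts the two readings on made-up data; it does not query a registry.

```python
# Made-up files and their linked labels (not real registry content).
file_labels = {
    "iris-0001.jpg": {"setosa", "study0"},
    "iris-0002.jpg": {"versicolor", "study0"},
    "iris-0003.jpg": {"virginica", "study0"},
}

wanted = {"versicolor", "study0"}

# "any" semantics: the file carries at least one of the wanted labels.
match_any = sorted(name for name, linked in file_labels.items() if linked & wanted)

# "all" semantics: the file carries every wanted label.
match_all = sorted(name for name, linked in file_labels.items() if wanted <= linked)

print(match_any)
print(match_all)
```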

## Label datasets

Labeling datasets works in the same way as labeling files:

In [None]:
# query the dataset
dataset = ln.Dataset.filter(name="Iris study 1").one()

# add study label
dataset.add_labels(study_label, feature="study_name")

# get all species labels
all_species_labels = ln.Label.filter(parents__name="is_species").all()
dataset.add_labels(all_species_labels, feature="iris_species_name")

Check that the dataset is labeled:

In [None]:
dataset.describe()

## Transform images into a DataFrame

Let's now run an ML model that transforms the images into 4 high-level features.

In [None]:
def run_ml_model() -> pd.DataFrame:
    transform = ln.Transform(name="Petal & sepal regressor", type="pipeline")
    ln.track(transform)
    input_dataset = ln.Dataset.filter(name="Iris study 1").one()
    input_paths = [file.stage() for file in input_dataset.files.all()]
    # transform the data...
    output_dataset = ln.dev.datasets.df_iris_in_meter_study1()
    return output_dataset


df = run_ml_model()

The output is a dataframe that looks as follows:

In [None]:
df.head()

## Register the output data

Let's first register the new features of the transformed data:

In [None]:
features = ln.Feature.from_df(df)
ln.save(features)

:::{dropdown} How to track units of features?

It's easy using the {attr}`~lamindb.Feature.unit` attribute. In the above example, you'd do:

```python
for feature in features:
    if feature.type == "float":
        feature.unit = "m"  # SI unit for meters
        feature.save()
```

:::

We can now validate & register the dataframe in one line:

In [None]:
dataset = ln.Dataset.from_df(
    df,
    name="Iris study 1 - transformed",
    description="Iris dataset after measuring sepal & petal metrics",
)

In [None]:
dataset.save()

Get an overview of linked feature sets:

In [None]:
dataset.features

A `slot` provides a string key to access feature sets. It's typically the accessor of feature identifiers in the data object we're validating & registering (here, a `DataFrame`).

Let's use it to access all linked features:

In [None]:
dataset.features["columns"].df()

The Iris dataset comes with labels within the data object.

In [None]:
species_labels = ln.Label.filter(parents__name="is_species").all()

species_labels

In [None]:
dataset.add_labels(species_labels, feature="iris_species_name")

In [None]:
dataset.add_labels(study_label, feature="study_name")

We now have the original feature set and the external feature set:

In [None]:
dataset.features

This is the context for our dataset:

In [None]:
dataset.describe()

In [None]:
dataset.file.view_lineage()

See the database content:

In [None]:
ln.view(registries=["Feature", "FeatureSet", "Label"])

## Manage data of study2

In [None]:
ln.track()

In [None]:
df = ln.dev.datasets.df_iris_in_meter_study2()
ln.File.from_df(df, description="Iris flower dataset study 2").save()

Now load both files, which store separate batches:

In [None]:
file1 = dataset.file
file2 = ln.File.filter(description="Iris flower dataset study 2").one()

We can now create a sharded dataset from these two batches:

In [None]:
dataset = ln.Dataset([file1, file2], name="Iris datasets study 1 & 2")

In [None]:
dataset.save()

You can load the sharded dataset as if it were one dataset:

In [None]:
dataset.load().tail()

Access the underlying two file objects:

In [None]:
dataset.files.list()

In [None]:
dataset.files.list()[0].view_lineage()

For a more interesting data lineage graph, let's pretend we're now running a pipeline:

In [None]:
pipeline = ln.Transform(name="Iris Postprocessor", version="0.7.2")
ln.track(pipeline)  # create & track a pipeline
input_files = ln.File.filter(transform__name__contains="files & datasets").all()
[file.stage() for file in input_files]  # let's load the input files

## Manage metadata

To end this guide through basic file & metadata tracking, let's see how to update registry records.

### Hierarchical ontologies

Say we want to express that `study0` belongs to project 1. We can use `.parents`:

In [None]:
project1 = ln.Label(name="project1")
project1.save()
study_label.parents.add(project1)
study_label.view_parents()

For more info, see {meth}`~lamindb.dev.HasParents.view_parents`.

### Validate records upon creation

We already created a `project1` label above. Let's see what happens if we try to create it again:

In [None]:
label = ln.Label(name="project1")

label.save()

Instead of creating a new record, LaminDB will load and return the existing record from the database.

If there is no exact match, LaminDB will warn you about potential duplicates when you create a record.

Say we spell "project1" with a space:

In [None]:
ln.Label(name="project 1")

You see that for every record creation, a search checks whether a similar record already exists!

This is to avoid inserting duplicated records.

You can switch it off (for performance gains) via `ln.settings.upon_create_search_names = False`.
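
The behavior described in this section can be approximated in plain Python with the standard library's `difflib`: reuse the existing name on an exact match, and otherwise report similar names before creating a new one. This is only an illustration. `existing_names`, `create_label`, and the `cutoff` value are hypothetical, and LaminDB's actual search and matching rules may differ.

```python
import difflib

# Hypothetical stand-in for the names already in the registry.
existing_names = ["setosa", "versicolor", "virginica", "study0", "project1"]


def create_label(name, names, cutoff=0.8):
    """Return (name, similar), where `similar` lists near-duplicates of a new name.

    An exact match reuses the existing name and reports no near-duplicates.
    """
    if name in names:
        return name, []
    similar = difflib.get_close_matches(name, names, n=3, cutoff=cutoff)
    return name, similar


print(create_label("project1", existing_names))
print(create_label("project 1", existing_names))
```

A search like this runs on every record creation, which is why switching it off can save time when creating many records at once.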

### Update records

In [None]:
label = ln.Label.filter(name="project1").first()

In [None]:
label

In [None]:
label.name = "project1a"

In [None]:
label.save()

In [None]:
label

### Delete records

Delete records like so:

In [None]:
label.delete()

## Other topics

### Change default storage

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"` and see all storage locations via:

In [None]:
ln.Storage.filter().df()

### Set verbosity

In [None]:
ln.settings.verbosity = 4  # only show info, no hints

In [None]:
# clean up what we wrote in this notebook
!lamin delete --force mydata
!rm -r mydata