# What does the key parameter do under the hood?

LaminDB is designed around associating biological metadata with files and datasets.
This enables querying for them in storage by metadata and removes the need for semantic file and dataset names.

Here, we discuss the trade-offs of using the `key` parameter, which allows for semantic keys, in various scenarios.

## Setup

We simulate a file system with several nested folders and files.
Similar structures appear in, for example, the {doc}`docs:rxrx` guide.

In [None]:
import random
import string
from pathlib import Path


def create_complex_biological_hierarchy(root_folder):
    root_path = Path(root_folder)

    if root_path.exists():
        print("Folder structure already exists. Skipping...")
    else:
        root_path.mkdir()

        raw_folder = root_path / "raw"
        preprocessed_folder = root_path / "preprocessed"
        raw_folder.mkdir()
        preprocessed_folder.mkdir()

        for i in range(1, 5):
            file_name = f"raw_data_{i}.txt"
            with (raw_folder / file_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)

        for i in range(1, 3):
            dataset_folder = raw_folder / f"Dataset_{i}"
            dataset_folder.mkdir()

            for j in range(1, 5):
                file_name = f"raw_data_{j}.txt"
                with (dataset_folder / file_name).open("w") as f:
                    random_text = "".join(
                        random.choice(string.ascii_letters) for _ in range(10)
                    )
                    f.write(random_text)

        for i in range(1, 5):
            file_name = f"result_{i}.txt"
            with (preprocessed_folder / file_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)


root_folder = "complex_biological_project"
create_complex_biological_hierarchy(root_folder)

In [None]:
!lamin init --storage ./key-eval

In [None]:
import lamindb as ln


ln.settings.verbosity = "hint"

In [None]:
ln.UPath.view_tree("complex_biological_project")

In [None]:
ln.track()

## Storing files using `Storage`, `File`, and `Dataset`

Lamin has three storage classes that manage different types of in-memory and on-disk objects:

1. {class}`~lamindb.Storage`: Manages the default storage root that can be either local or in the cloud. For more details we refer to {doc}`docs:faq/storage`.
2. {class}`~lamindb.File`: Manages data batches with an optional `key` that acts as a relative path within the current default storage root (see {class}`~lamindb.Storage`). An example is a single h5 file.
3. {class}`~lamindb.Dataset`: Manages a collection of data batches and is identified by a `name` rather than a `key`. An example is a collection of h5 files.

For more details we refer to {doc}`docs:tutorial`.

The current storage root is:

In [None]:
ln.settings.storage

By default, Lamin uses virtual `keys`, which are recorded only in the database and not reflected in storage.
This behavior can be turned off by setting `ln.settings.file_use_virtual_keys = False`.
We generally discourage changing this setting manually. For more details, see {doc}`docs:faq/storage`.

In [None]:
ln.settings.file_use_virtual_keys
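
The difference between virtual and real storage keys can be sketched in plain Python. The `FileRecord` class, the `resolve_path` helper, the uid `a1b2c3`, and the `<root>/<uid><suffix>` layout are assumptions made for illustration and do not describe LaminDB's actual implementation:

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class FileRecord:
    """Toy stand-in for a database row (hypothetical, not LaminDB's schema)."""

    uid: str
    suffix: str
    key: Optional[str] = None


def resolve_path(
    root: str, record: FileRecord, use_virtual_keys: bool = True
) -> PurePosixPath:
    """Return where the object would live in storage under this toy model."""
    if record.key is not None and not use_virtual_keys:
        # Real storage key: the key is the relative path below the root.
        return PurePosixPath(root) / record.key
    # Virtual key (or no key): the location depends only on the uid.
    return PurePosixPath(root) / f"{record.uid}{record.suffix}"


record = FileRecord(uid="a1b2c3", suffix=".txt", key="raw/raw_data_3.txt")
before = resolve_path("key-eval", record)

# Rename the key only; under virtual keys the location must not move.
record.key = "bad_samples/sample_data_3.txt"
after = resolve_path("key-eval", record)
real = resolve_path("key-eval", record, use_virtual_keys=False)

print(before, after, real)
```

In this model, renaming a virtual key changes only the database record, whereas a real storage key determines the resolved location.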

We will now create `File` objects with and without semantic keys using `key` and also save them as `Datasets`.

In [None]:
file_no_key_1 = ln.File("complex_biological_project/raw/raw_data_1.txt")
file_no_key_2 = ln.File("complex_biological_project/raw/raw_data_2.txt")

The logging suggests that the files will be saved to our current default storage with auto-generated storage keys.

In [None]:
file_no_key_1.save()
file_no_key_2.save()

In [None]:
file_key_3 = ln.File(
    "complex_biological_project/raw/raw_data_3.txt", key="raw/raw_data_3.txt"
)
file_key_4 = ln.File(
    "complex_biological_project/raw/raw_data_4.txt", key="raw/raw_data_4.txt"
)
file_key_3.save()
file_key_4.save()

Because of virtual `keys`, `Files` with keys are not stored in different locations than `Files` without keys.
However, they are still queryable by `key`.

In [None]:
ln.File.filter(key__contains="raw").df().head()

`Dataset` does not have a `key` parameter because it does not store any additional data in `Storage`.
In contrast, it has a `name` parameter that serves as a semantic identifier of the dataset.

In [None]:
ds_1 = ln.Dataset(data=[file_no_key_1, file_no_key_2], name="no key collection")
ds_2 = ln.Dataset(data=[file_key_3, file_key_4], name="sample collection")
ds_1

## Advantages and disadvantages of semantic keys

Semantic keys have several advantages and disadvantages that we will discuss and demonstrate in the remainder of this notebook:

### Advantages

- Simple: It can be easier to refer to specific datasets in conversations
- Familiarity: Most people are familiar with the concept of semantic names

### Disadvantages

- Length: Semantic names can be long with limited aesthetic appeal
- Inconsistency: Lack of naming conventions can lead to confusion
- Limited metadata: Semantic keys can contain some, but usually not all metadata
- Inefficiency: Writing lengthy semantic names is a repetitive process and can be time-consuming
- Ambiguity: Overly descriptive file names may introduce ambiguity and redundancy
- Clashes: Semantic keys are not inherently unique, so several people may attempt to use the same one

## Renaming files

Renaming `Files` that have associated keys can be done on several levels.

### In storage

A file can be locally moved or renamed:

In [None]:
file_key_3.path

In [None]:
loaded_file = file_key_3.load()

In [None]:
!mkdir complex_biological_project/moved_files
!mv complex_biological_project/raw/raw_data_3.txt complex_biological_project/moved_files

In [None]:
file_key_3.path

After moving the local source file, the storage location (the path) has not changed and the file can still be loaded, because the copy that was saved to the storage root is untouched.

In [None]:
file_3 = file_key_3.load()

The same applies to the `key`, which has not changed.

In [None]:
file_key_3.key

### By key

Besides moving the file in storage, the `key` can also be renamed.

In [None]:
file_key_4.key

In [None]:
file_key_4.key = "bad_samples/sample_data_4.txt"
file_key_4.key

Because `keys` are virtual, modifying the key does not change the storage location, and the file stays accessible.

In [None]:
file_key_4.path

In [None]:
file_4 = file_key_4.load()

### Modifying the `path` attribute

However, modifying the `path` directly is not allowed:

In [None]:
try:
    file_key_4.path = f"{ln.settings.storage}/here_now/sample_data_4.txt"
except AttributeError as e:
    print(e)
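
One common way to get this behavior in Python is a property without a setter. The following is a minimal sketch under that assumption; the class `ReadOnlyPathFile` and its fields are hypothetical, and whether LaminDB implements `path` this way is not confirmed here:

```python
class ReadOnlyPathFile:
    """Hypothetical record whose path is derived from other fields."""

    def __init__(self, root: str, uid: str, suffix: str):
        self._root = root
        self._uid = uid
        self._suffix = suffix

    @property
    def path(self) -> str:
        # Computed on access; no setter is defined, so assignment fails.
        return f"{self._root}/{self._uid}{self._suffix}"


toy_file = ReadOnlyPathFile("key-eval", "a1b2c3", ".txt")
assignment_failed = False
try:
    toy_file.path = "key-eval/here_now/sample_data_4.txt"
except AttributeError as error:
    # The exact message differs between Python versions.
    assignment_failed = True
    print(error)
```

Deriving the path instead of storing it keeps the database record and the storage location from drifting apart.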

## Clashing semantic keys

Semantic keys should not clash. Let's attempt to use the same semantic key twice:

In [None]:
print(file_key_3.key)
print(file_key_4.key)

In [None]:
file_key_4.key = "raw/raw_data_3.txt"

In [None]:
print(file_key_3.key)
print(file_key_4.key)

When filtering for this semantic key, it is now unclear which file we are referring to:

In [None]:
ln.File.filter(key__icontains="raw_data_3").df()

When querying by `key`, LaminDB cannot resolve which file we actually wanted.
In fact, we only get a single hit, which does not paint a complete picture.

In [None]:
print(file_key_3.uid)
print(file_key_4.uid)

Both files still exist, though, with unique `uids` that can be used to access them.
Most importantly, saving these files to the database results in an `IntegrityError`, which prevents this issue.

In [None]:
try:
    file_key_3.save()
    file_key_4.save()
except Exception as e:
    print(
        "It is not possible to save files to the same key. This results in an Integrity"
        " Error!"
    )

We refer to {doc}`docs:faq/idempotency` for more detailed explanations of behavior when attempting to save files multiple times.
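
A clash of this kind can be rejected at the database level with a uniqueness constraint. The following sketch uses an in-memory SQLite database; the table layout, the uids, and the `UNIQUE (storage, key)` constraint are assumptions for illustration and are not taken from LaminDB's schema:

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE file ("
    "uid TEXT PRIMARY KEY, storage TEXT NOT NULL, key TEXT, "
    "UNIQUE (storage, key))"
)

# Records without a key: SQLite treats NULL values as distinct within a
# UNIQUE constraint, so these two rows do not clash.
connection.execute("INSERT INTO file VALUES ('uid1', 'key-eval', NULL)")
connection.execute("INSERT INTO file VALUES ('uid2', 'key-eval', NULL)")

# First record with a semantic key.
connection.execute(
    "INSERT INTO file VALUES ('uid3', 'key-eval', 'raw/raw_data_3.txt')"
)

try:
    # Second record that reuses the same key in the same storage.
    connection.execute(
        "INSERT INTO file VALUES ('uid4', 'key-eval', 'raw/raw_data_3.txt')"
    )
    clash_rejected = False
except sqlite3.IntegrityError as error:
    clash_rejected = True
    print(f"Rejected duplicate key: {error}")

row_count = connection.execute("SELECT COUNT(*) FROM file").fetchone()[0]
connection.close()
```

Only the clashing insert is rejected; the three earlier rows remain in the table.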

## Hierarchies

Another common use-case of `keys` is file hierarchies.
It can be useful to mirror the file structure of "complex_biological_project" from above in LaminDB to allow queries for files that were stored in specific folders.
Common examples are folders that specify different processing stages such as `raw`, `preprocessed`, or `annotated`.

Note that this use-case may overlap with `Dataset`, which also allows for grouping `Files`.
However, `Dataset` cannot model hierarchical groupings.

### Key

In [None]:
import os

for root, _, files in os.walk("complex_biological_project/raw"):
    for filename in files:
        file_path = os.path.join(root, filename)
        # Strip the project folder including the separator to get a relative key
        key_path = file_path.removeprefix("complex_biological_project/")
        ln_file = ln.File(file_path, key=key_path)
        ln_file.save()

In [None]:
ln.File.filter(key__startswith="raw").df()
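
As a possible alternative for deriving keys in a loop like the one above, `pathlib` can compute paths relative to the project folder. `Path.relative_to` yields a path without a leading separator, and `as_posix()` uses forward slashes on every platform. The helper `relative_keys` and the temporary folder layout are hypothetical and only mimic the structure used in this notebook:

```python
import tempfile
from pathlib import Path


def relative_keys(root_folder: Path, subfolder: str) -> list:
    """Return sorted POSIX-style keys of all files below root_folder/subfolder."""
    return sorted(
        path.relative_to(root_folder).as_posix()
        for path in (root_folder / subfolder).rglob("*")
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a tiny nested structure to derive keys from.
    root = Path(tmp_dir) / "project"
    (root / "raw" / "Dataset_1").mkdir(parents=True)
    (root / "raw" / "raw_data_1.txt").write_text("a")
    (root / "raw" / "Dataset_1" / "raw_data_1.txt").write_text("b")
    keys = relative_keys(root, "raw")

print(keys)
```

Keys derived this way all start with the folder name, so a prefix query such as `key__startswith="raw"` matches them.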

### Dataset

Alternatively, it would have been possible to create a `Dataset` with a corresponding name:

In [None]:
all_data_paths = []
for root, _, files in os.walk("complex_biological_project/raw"):
    for filename in files:
        file_path = os.path.join(root, filename)
        all_data_paths.append(file_path)

all_data_files = []
for path in all_data_paths:
    all_data_files.append(ln.File(path))

data_ds = ln.Dataset(all_data_files, name="data")
data_ds.save()

In [None]:
ln.Dataset.filter(name__icontains="data").df()

This approach will likely lead to clashes. Alternatively, `ULabels` can be added to `Files` to mirror hierarchies.

### ULabels

In [None]:
# Create the label once and reuse it for every file
data_label = ln.ULabel(name="data")
data_label.save()

for root, _, files in os.walk("complex_biological_project/raw"):
    for filename in files:
        file_path = os.path.join(root, filename)
        # Strip the project folder including the separator to get a relative key
        key_path = file_path.removeprefix("complex_biological_project/")
        ln_file = ln.File(file_path, key=key_path)
        ln_file.save()
        ln_file.ulabels.add(data_label)

In [None]:
labels = ln.ULabel.lookup()

In [None]:
ln.File.filter(ulabels__in=[labels.data]).df()

However, `ULabels` are too versatile for such an approach, and clashes are to be expected here as well.

### Metadata

Because the chance of clashes is rather high for the aforementioned approaches, we generally recommend against storing hierarchical data solely with semantic keys.
Biological metadata makes `Files` and `Datasets` unambiguous and easily queryable.
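
The following plain-Python sketch illustrates why metadata narrows a query further than a key prefix can. The records, the field names `stage` and `batch`, and the `query` helper are hypothetical toy data, not LaminDB objects or its query API:

```python
# Toy records: the key encodes the processing stage but not the batch.
records = [
    {"uid": "uid1", "key": "raw/raw_data_1.txt", "stage": "raw", "batch": "Dataset_1"},
    {"uid": "uid2", "key": "raw/raw_data_2.txt", "stage": "raw", "batch": "Dataset_2"},
    {"uid": "uid3", "key": "preprocessed/result_1.txt", "stage": "preprocessed", "batch": "Dataset_1"},
]


def query(rows, **conditions):
    """Return all rows whose fields equal every given condition."""
    return [
        row
        for row in rows
        if all(row.get(field) == value for field, value in conditions.items())
    ]


# A key prefix can only select what the key happens to encode.
by_prefix = [row for row in records if row["key"].startswith("raw/")]

# Metadata fields can be combined freely and pin down a single record.
by_metadata = query(records, stage="raw", batch="Dataset_2")

print(len(by_prefix), [row["uid"] for row in by_metadata])
```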


## Legacy data and multiple storage roots

### Distributed Datasets

LaminDB can ingest legacy data that already has a structure in its storage.
In such cases, it disables `file_use_virtual_keys` and the files are ingested with their actual storage location.
It is therefore possible that `Files` stored in different storage roots are associated with a single `Dataset`.
To simulate this, we disable `file_use_virtual_keys` and ingest files stored in a different path (the "legacy data").

In [None]:
ln.settings.file_use_virtual_keys = False

In [None]:
for root, _, files in os.walk("complex_biological_project/preprocessed"):
    for filename in files:
        file_path = os.path.join(root, filename)
        key_path = file_path.removeprefix("complex_biological_project")

        print(file_path)
        print()

        ln_file = ln.File(file_path, key=f"./{key_path}")
        ln_file.save()

In [None]:
ln.File.filter().df()

In [None]:
file_from_raw = ln.File.filter(key__icontains="Dataset_2/raw_data_1").first()
file_from_preprocessed = ln.File.filter(key__icontains="preprocessed/result_1").first()

print(file_from_raw.path)
print(file_from_preprocessed.path)

Let's create our `Dataset`:

In [None]:
ds = ln.Dataset(
    [file_from_raw, file_from_preprocessed], name="raw_and_processed_dataset_2"
)
ds.save()

In [None]:
ds.files.df()

### Modeling folders

In [None]:
ln.settings.file_use_virtual_keys = True

In [None]:
dir_path = ln.dev.datasets.dir_scrnaseq_cellranger("sample_001")
ln.UPath.view_tree(dir_path)

There are two ways to create `File` objects from folders: {func}`~lamindb.File.from_dir` and {class}`~lamindb.Dataset`.

In [None]:
cellranger_raw_file = ln.File.from_dir("sample_001/raw_feature_bc_matrix/")

In [None]:
for file in cellranger_raw_file:
    file.save()

In [None]:
cellranger_raw_ds = ln.Dataset(
    "sample_001/raw_feature_bc_matrix/", name="cellranger raw"
)

In [None]:
cellranger_raw_ds.save()

In [None]:
ln.File.filter(key__icontains="raw_feature_bc_matrix").df()

In [None]:
ln.File.filter(key__icontains="raw_feature_bc_matrix/matrix.mtx.gz").one().path

In [None]:
input_paths = [
    file.stage() for file in ln.Dataset.filter(name="cellranger raw").one().files.all()
]
# We expect the input_paths to be empty
input_paths

While `File.from_dir` creates explicit `File` objects with the default constructor, the `Dataset` constructor only returns a `Dataset` without any `File` records.
The latter is particularly useful when only a reference to the dataset is needed rather than to individual files.
For large datasets with many files, this saves a lot of transactions.

### Messing with the storage root

In [None]:
ln.settings.storage.root

In [None]:
# TODO: this likely requires an instance backed by Postgres or similar instead of SQLite
ln.settings.storage = "/filtered"

In [None]:
ln.settings.storage.view_tree()

## Discussion points

### Changing the storage root

Question: Will things work well when setting the current storage to `ln.settings.storage = "s3://theislab/raw"`?
In other words, if people use a semantic key, can we change the storage root so that the key prefix determines where the data is stored?

Answer: I currently cannot change the storage root to an S3-based storage because I'm running a SQLite instance. I would have to use LaminData or similar.
In general, this should work. However, we discourage changing the storage location and recommend relying only on the `virtual keys`.
There should be no use-case for this, with a single exception: people uploaded legacy data to Lamin and reuse the storage for a different application where they also need to preserve the structure in the future. In such cases, `file_use_virtual_keys` should still be switched off, even though we considered no longer exposing it.

### The rxrx1 use-case

Question: Does the rxrx1 use-case work well with our current `key` design?

Answer: The rxrx1 use-case currently has an immutable parquet file with metadata associated with the `Dataset`.
Since we did not register the `File` objects themselves, we have to query for the files and their paths through the parquet file.
This requires the paths to be stable remote URLs that are kept in sync with the metadata parquet `File`.
The curation of the images is described in https://github.com/laminlabs/rxrx-lamin/blob/main/docs/notebooks/02-rxrx1.ipynb.
There, a `Dataset` is created without any `File` objects.
Only `File` objects have a `key`; the `Dataset` does not, and no `File` objects are created there.
Any hierarchy reflected in the `paths` and `path` (why 2?) columns of the metadata DataFrame is therefore not reflected in `Storage` (via `key`), because no `Files` exist.

### On the entire doc pretty much only using virtual keys

Question: An overarching remark is that the entire doc in fact only uses virtual keys and no real storage keys.

Answer: Yes, this is the default, and we discussed removing the option to change it.
Only legacy data may use real storage keys.