# What does the key parameter do under the hood?

LaminDB is designed around associating biological metadata to files and datasets.
This enables querying for them in storage by metadata and removes the requirement for semantic file and dataset names.

Here, we will discuss trade-offs for using the `key` parameter, which allows for semantic keys, in various scenarios.

## Setup

We're simulating a file system with several nested folders and files.
Similar structures appear in, for example, the {doc}`docs:rxrx` guide.

In [None]:
# This code will eventually be moved to a small script function
import os
import random
import string


def write_random_text(path, length=10):
    """Write `length` random ASCII letters to `path`."""
    with open(path, "w") as f:
        f.write("".join(random.choice(string.ascii_letters) for _ in range(length)))


def create_complex_biological_hierarchy(root_folder):
    """Create nested `data` and `analysis` folders filled with small text files."""
    data_folder = os.path.join(root_folder, "data")
    analysis_folder = os.path.join(root_folder, "analysis")
    # exist_ok=True makes this cell safe to re-run
    os.makedirs(data_folder, exist_ok=True)
    os.makedirs(analysis_folder, exist_ok=True)

    # Four sample files directly under data/
    for i in range(1, 5):
        write_random_text(os.path.join(data_folder, f"sample_data_{i}.txt"))

    # Two dataset folders with four sample files each
    for i in range(1, 3):
        dataset_folder = os.path.join(data_folder, f"Dataset_{i}")
        os.makedirs(dataset_folder, exist_ok=True)
        for j in range(1, 5):
            write_random_text(os.path.join(dataset_folder, f"sample_data_{j}.txt"))

    # Four result files under analysis/nested_analysis/
    nested_analysis_folder = os.path.join(analysis_folder, "nested_analysis")
    os.makedirs(nested_analysis_folder, exist_ok=True)
    for i in range(1, 5):
        write_random_text(os.path.join(nested_analysis_folder, f"result_{i}.txt"))


root_folder = "complex_biological_project"
create_complex_biological_hierarchy(root_folder)

In [None]:
!tree complex_biological_project

In [None]:
!lamin init --storage ./test-key

In [None]:
import lamindb as ln

ln.settings.verbosity = "hint"

In [None]:
ln.track()

## Storing files using `Storage`, `File`, and `Dataset`

Lamin has three storage classes that manage different types of in-memory and on-disk objects:

1. {class}`~lamindb.Storage`: Manages the default storage root that can be either local or in the cloud. For more details, see {doc}`docs:faq/storage`.
2. {class}`~lamindb.File`: Manages data batches with an optional `key` that acts as a relative path within the current default storage root (see {class}`~lamindb.Storage`). An example is a single h5 file.
3. {class}`~lamindb.Dataset`: Manages a collection of data batches with an optional `key` that acts as a relative path within the current default storage root (see {class}`~lamindb.Storage`). An example is a collection of h5 files.

For more details we refer to {doc}`docs:tutorial`.

The current storage root is:

In [None]:
ln.settings.storage

We will now create `File` objects with and without semantic keys using `key` and also save them as `Datasets`.

In [None]:
file_no_key_1 = ln.File("complex_biological_project/data/sample_data_1.txt")
file_no_key_2 = ln.File("complex_biological_project/data/sample_data_2.txt")

The logging suggests that the files will be saved to our current default storage with auto-generated storage keys.

In [None]:
file_no_key_1.save()
file_no_key_2.save()

In [None]:
file_key_3 = ln.File(
    "complex_biological_project/data/sample_data_3.txt", key="samples/sample_data_3.txt"
)
file_key_4 = ln.File(
    "complex_biological_project/data/sample_data_4.txt", key="samples/sample_data_4.txt"
)
file_key_3.save()
file_key_4.save()

As can be seen, `Files` with keys are stored in different locations (as specified by `key`) than their keyless counterparts.
In addition, keys enable semantic filtering for those files:

In [None]:
ln.File.filter(key__contains="samples").df().head()

`Dataset` does not have a `key` parameter because it does not store any additional data in `Storage`.
In contrast, it has a `name` parameter that serves as a semantic identifier of the dataset.

In [None]:
ds_1 = ln.Dataset(data=[file_no_key_1, file_no_key_2], name="no key collection")
ds_2 = ln.Dataset(data=[file_key_3, file_key_4], name="sample collection")
ds_1

## Advantages and disadvantages of semantic keys

Semantic keys have several advantages and disadvantages that we will discuss and demonstrate in the remainder of this notebook:

### Advantages

- Simple: It can be easier to refer to specific datasets in conversations
- Familiarity: Most people are familiar with the concept of semantic names

### Disadvantages

- Length: Semantic names can be long with limited aesthetic appeal
- Inconsistency: Lack of naming conventions can lead to confusion
- Limited metadata: Semantic keys can contain some, but usually not all metadata
- Inefficiency: Writing lengthy semantic names is a repetitive process and can be time-consuming
- Ambiguity: Overly descriptive file names may introduce ambiguity and redundancy
- Clashes: Semantic keys are not unique, so several people may attempt to use the same one

## Semantic key ambiguity

The current implementations of search and filter are based on fuzzy matching.
Fuzzy matching can fail for long semantic keys: the query matches a smaller share of the key's characters, which lowers the match ratio.

![title](fuzzy_matching_fail.png)

The files that we were actually looking for are not the top, but the bottom hits.
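
The effect of key length on a match ratio can be sketched with the standard library alone. This example uses `difflib.SequenceMatcher` as a stand-in scorer, which is an assumption: the algorithm behind LaminDB's search may differ. The helper `match_ratio` and the two keys are illustrative.

```python
from difflib import SequenceMatcher

query = "sample_data_3"
short_key = "samples/sample_data_3.txt"
long_key = "complex_biological_project/data/Dataset_1/sample_data_3.txt"


def match_ratio(query: str, key: str) -> float:
    """Similarity in [0, 1]: 2 * matched characters / total characters."""
    return SequenceMatcher(None, query, key).ratio()


short_ratio = match_ratio(query, short_key)
long_ratio = match_ratio(query, long_key)

# Both keys contain the full query, but the longer key scores lower
print(f"{short_key}: {short_ratio:.2f}")
print(f"{long_key}: {long_ratio:.2f}")
```

Both keys contain the query verbatim, yet the longer key receives the lower ratio because the unmatched characters count against it.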

## Renaming files

Renaming `Files` that have associated keys can be done on several levels.

### In storage

A file can be locally moved or renamed:

In [None]:
file_key_3.path

In [None]:
!mkdir complex_biological_project/moved_files
!mv complex_biological_project/data/sample_data_3.txt complex_biological_project/moved_files

In [None]:
file_key_3.path

After moving the file locally, the storage location (the path) has not been updated in the database.

In [None]:
file_key_3.key

The same applies to the `key`, which has not been updated either.
If storage locations were meant to stay in sync with a semantic `key`, moving files in storage violates this assumption.
The same holds when changing the default storage location.
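
One way to notice such drift is to check whether each recorded path still exists. The following is a minimal sketch with the standard library only. The `records` mapping and the helper `find_stale_records` are hypothetical bookkeeping for illustration, not part of LaminDB.

```python
import shutil
import tempfile
from pathlib import Path


def find_stale_records(records: dict[str, Path]) -> list[str]:
    """Return the keys whose recorded path no longer exists on disk."""
    return [key for key, path in records.items() if not path.exists()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "samples" / "sample_data_3.txt"
    original.parent.mkdir(parents=True)
    original.write_text("ACGT")

    # Hypothetical bookkeeping: key -> path recorded at save time
    records = {"samples/sample_data_3.txt": original}
    stale_before = find_stale_records(records)

    # Move the file in storage without updating the record
    moved_dir = root / "moved_files"
    moved_dir.mkdir()
    shutil.move(str(original), str(moved_dir / original.name))
    stale_after = find_stale_records(records)

print(stale_before)
print(stale_after)
```

Before the move no record is stale. Afterwards the key still points at the old location, which no longer exists.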

### By key

Besides moving the file in storage, the `key` can also be renamed.

In [None]:
file_key_4.key

In [None]:
file_key_4.key = "bad_samples/sample_data_4.txt"
file_key_4.key

This does, however, have an effect on the storage path of the `File`:

In [None]:
file_key_4.path

### Modifying the `path` attribute

Modifying the `path` directly, in contrast, is not allowed:

In [None]:
try:
    file_key_4.path = f"{ln.settings.storage}/here_now/sample_data_4.txt"
except AttributeError as e:
    print(e)
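
The behavior above is what Python produces for a read-only property. The sketch below assumes that `path` is derived from the storage root and the `key`, which the source does not confirm. `ToyFile` is a hypothetical stand-in, not LaminDB's `File` class.

```python
from pathlib import PurePosixPath


class ToyFile:
    """Hypothetical stand-in for a file record; not LaminDB's `File` class."""

    def __init__(self, storage_root: str, key: str):
        self.storage_root = storage_root
        self.key = key

    @property
    def path(self) -> PurePosixPath:
        # Derived on access, so it always reflects the current key
        return PurePosixPath(self.storage_root) / self.key


toy = ToyFile("/tmp/test-key", "samples/sample_data_4.txt")

# Changing the key changes the derived path
toy.key = "bad_samples/sample_data_4.txt"
new_path = str(toy.path)

# A property without a setter rejects direct assignment
try:
    toy.path = "/tmp/test-key/here_now/sample_data_4.txt"
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```

In this toy model the `key` is the single source of truth for the location, so the two cannot drift apart through attribute assignment.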

## Clashing semantic keys

Semantic keys should not clash. Let's attempt to use the same semantic key twice:

In [None]:
print(file_key_3.key)
print(file_key_4.key)

In [None]:
file_key_4.key = "samples/sample_data_3.txt"

In [None]:
print(file_key_3.key)
print(file_key_4.key)

When filtering for this semantic key, it is now unclear which file we are referring to:

In [None]:
ln.File.filter(key__icontains="sample_data_3").df()

When querying by `key`, LaminDB cannot resolve which file we actually wanted.
In fact, we only get a single hit, which does not paint a complete picture.

In [None]:
print(file_key_3.uid)
print(file_key_4.uid)

Both files still exist, though, each with a unique `uid` that can be used to access it.
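
The difference between a non-unique key and a unique identifier can be illustrated with plain dictionaries. This is a toy model, not LaminDB's registry, and the `uid` values are made up.

```python
from collections import defaultdict

# Hypothetical records: (uid, key). The uids are made up for illustration.
records = [
    ("uid-aaa", "samples/sample_data_3.txt"),
    ("uid-bbb", "samples/sample_data_3.txt"),
]

by_key = defaultdict(list)  # one key may map to several records
by_uid = {}  # each uid maps to exactly one record
for uid, key in records:
    by_key[key].append(uid)
    by_uid[uid] = key

# Keys that refer to more than one record are ambiguous
clashing_keys = {key: uids for key, uids in by_key.items() if len(uids) > 1}
print(clashing_keys)
```

A lookup by key has to choose among several candidates, while a lookup by `uid` always identifies one record.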

We refer to {doc}`docs:faq/idempotency` for more detailed explanations of behavior when attempting to save files multiple times.

## Hierarchies

Another common use case for `keys` is file hierarchies.
It can be useful to mirror the file structure of "complex_biological_project" from above in LaminDB to allow queries for specific subsets.
Note that this use case may overlap with `Dataset`, which also allows for grouping `Files` (but is usually used in a different context).

### Key

In [None]:
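for root, _, files in os.walk("complex_biological_project/data"):
    for filename in files:
        file_path = os.path.join(root, filename)
        # Key relative to the project root, e.g. "data/sample_data_1.txt",
        # without a leading separator so that prefix filters on "data" match
        key_path = os.path.relpath(file_path, "complex_biological_project")
        ln_file = ln.File(file_path, key=key_path)
        ln_file.save()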

In [None]:
ln.File.filter(key__startswith="data").df()
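
A prefix filter such as `key__startswith="data"` only matches keys that begin with `data`, so it matters whether a derived key keeps a leading separator. The sketch below uses `pathlib` only. The variable names are illustrative, and `PurePosixPath` is used so that the result is the same on every platform.

```python
from pathlib import PurePosixPath

project_root = PurePosixPath("complex_biological_project")
file_path = project_root / "data" / "Dataset_1" / "sample_data_1.txt"

# relative_to drops the root together with the separator that follows it
key = file_path.relative_to(project_root).as_posix()

# Stripping only the root name leaves a leading "/"
naive_key = str(file_path).removeprefix("complex_biological_project")

print(key)
print(naive_key)
```

Only the first variant yields a key that a prefix filter on `data` would match.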

### Dataset

Alternatively, it would have been possible to create a `Dataset` with a corresponding name:

In [None]:
all_data_paths = []
for root, _, files in os.walk("complex_biological_project/data"):
    for filename in files:
        file_path = os.path.join(root, filename)
        all_data_paths.append(file_path)

all_data_files = []
for path in all_data_paths:
    all_data_files.append(ln.File(path))

data_ds = ln.Dataset(all_data_files, name="data")
data_ds.save()

In [None]:
ln.Dataset.filter(name__icontains="data").df()

This approach is likely to lead to clashes, because a generic name such as "data" is easily reused. Alternatively, `ULabels` can be added to `Files` to mirror hierarchies.

### ULabels

In [None]:
# Create the label once instead of once per file
data_label = ln.ULabel(name="data")
data_label.save()

for root, _, files in os.walk("complex_biological_project/data"):
    for filename in files:
        file_path = os.path.join(root, filename)
        # Key relative to the project root, e.g. "data/sample_data_1.txt"
        key_path = os.path.relpath(file_path, "complex_biological_project")
        ln_file = ln.File(file_path, key=key_path)
        ln_file.save()
        ln_file.ulabels.add(data_label)

In [None]:
labels = ln.ULabel.lookup()

In [None]:
ln.File.filter(ulabels__in=[labels.data]).df()

However, `ULabels` are too versatile for such an approach, and clashes are to be expected here as well.

### Metadata

Because the chance of clashes is rather high for all of the aforementioned approaches, we generally recommend not storing hierarchical data with semantic keys alone.
Biological metadata makes `Files` and `Datasets` unambiguous and easily queryable.
