# Manage files & datasets

Here, you'll learn LaminDB's basic data management workflow.

Starting with a [lake](https://en.wikipedia.org/wiki/Data_lake) of files, you'll arrive at a [warehouse](https://en.wikipedia.org/wiki/Data_warehouse) of analysis & ML-ready datasets (a [feature store](https://en.wikipedia.org/wiki/Feature_engineering#Feature_stores)).

While this tutorial is all about basic metadata, you'll later see that `lamindb` gives you a framework for linking complex metadata related to biology and any custom schema.

```{warning}

This is still work-in-progress.

```

```{tip}

This tutorial is a [Jupyter notebook](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial1.ipynb).

```

## Set up an instance

[Installation and sign-up](./guide.md#setup) take very little time: run `pip install lamindb` and `lamin signup <email>` on the command line.

Using the CLI, let's create a LaminDB instance with a directory `./mydata` for storing files and a SQLite database for managing metadata:


In [None]:
!lamin init --storage ./mydata  # or "s3://my-bucket" or "gs://my-bucket"

(Think of initializing a LaminDB instance as analogous to initializing a git repository.)

We're now ready to import `lamindb`:

In [None]:
import lamindb as ln

## Track a data source

Knowing where a batch of data comes from helps with finding & understanding it.

We call the code that generated it a _transform_. The code can be a data pipeline, a notebook or an app/instrument upload.

With {class}`~lamindb.Transform`, LaminDB maintains a registry of transforms and makes it easy to link data against them.

Here, we're running a Jupyter notebook. Let's track it:

In [None]:
ln.track()

By calling {func}`~lamindb.track`, the notebook is automatically linked as the source of all data that's about to be saved.

:::{dropdown} What happened under the hood?

Logging informed us about

1. the package versions that the notebook imports
2. the automatic detection of notebook metadata (title, filename, version, timestamp, creator) and creation of a {class}`~lamindb.Transform` object
3. the automatic creation of a {class}`~lamindb.Run` object (timestamp, transform, creator)

:::
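
As a mental model only (this is not LaminDB's implementation), tracking can be pictured as setting a global run context that every newly created record picks up. The sketch below is plain Python; `ToyTransform`, `ToyRun`, `ToyRecord`, `track` and `new_record` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ToyTransform:
    """Hypothetical stand-in for a registered piece of code (notebook, pipeline)."""

    name: str
    version: str


@dataclass
class ToyRun:
    """Hypothetical stand-in for one execution of a transform."""

    transform: ToyTransform
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class ToyRecord:
    """Hypothetical stand-in for a piece of data with optional provenance."""

    description: str
    run: Optional[ToyRun] = None


_current_run: Optional[ToyRun] = None  # module-level "tracking context"


def track(transform: ToyTransform) -> ToyRun:
    """Start a new run of `transform` and make it the current context."""
    global _current_run
    _current_run = ToyRun(transform=transform)
    return _current_run


def new_record(description: str) -> ToyRecord:
    """Create a record that is linked to the current run, if there is one."""
    return ToyRecord(description=description, run=_current_run)


untracked = new_record("created before tracking")
run = track(ToyTransform(name="Manage files & datasets", version="0"))
tracked = new_record("created after tracking")
```

Only the record created after `track()` carries a run, which is why tracking is best called at the top of a notebook.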

:::{dropdown} How do I track a versioned pipeline?

If you'd like to track one of your versioned pipelines as a data source:

```python
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
```

:::

:::{dropdown} Why do we care about notebooks?

Most people advocate for "not using notebooks in production" or similar. And we agree! Anything that can be a pipeline, should be a pipeline.

But we also think that a lot of the downstream insight & value generated from biological data is driven by computational biologists interacting with it.

And we think this is very much akin to the prose-heavy design of biological experiments documented in an ELN.

A notebook that's run a single time on specific data batches is not a pipeline: it's a _document_ that produced an insight or some other form of data representation.

Unfortunately, most mistakes happen when using notebooks. `ln.track()` tries to help avoid some of them.

An early blog post on this is [here](https://lamin.ai/blog/2022/nbproject).

:::

## Manage files

We'll need some dummy data:

In [None]:
# put a file "mini.csv" into default storage
filepath = ln.dev.datasets.file_mini_csv(in_storage_root=True)
# put a directory "sample_001" into default storage
ln.dev.datasets.dir_scrnaseq_cellranger("sample_001", ln.settings.storage)

### Register a file

Here, we have an existing file in our default storage location: `./mydata/mini.csv`

Let's create a {class}`~lamindb.File` object from the path:

In [None]:
file = ln.File("./mydata/mini.csv")  # or "s3://my-bucket/my-folder/my-file.csv"

:::{dropdown} What is a File object in LaminDB?

It's an object for managing a file's metadata, searching & querying for the file, and accessing its data in different ways.

Basic metadata is:

- `id`: a universal ID (also serves as a primary key in the underlying SQL table of the instance)
- `key`: an optional storage key, i.e., the relative path of the file in `storage`
- `description`: an optional string description
- `storage`: the storage location (the root, say, an S3 bucket or network location)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: a hash useful to check for integrity and collisions (is this file already stored?)
- `hash_type`: the type of the hash (usually, an MD5 or SHA1 checksum)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related metadata is:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the file

Managing the underlying data:

- `load()`: load the file to memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `path()`: the path (cloud or local)
- `stage()`: a local path to a cached object
- `replace()`: replace the content of the file

For a full reference, see {class}`~lamindb.File`.

:::
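
To make fields like `key`, `suffix`, `size` and `hash` more concrete, here is a minimal sketch that derives them for a local file using only the standard library. The helper `describe_file` is hypothetical and the choice of MD5 is an assumption for illustration; LaminDB's own logic may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path: Path, storage_root: Path) -> dict:
    """Collect basic metadata for a file located under `storage_root`."""
    data = path.read_bytes()  # fine for small files; stream in chunks for large ones
    return {
        "key": path.relative_to(storage_root).as_posix(),  # path relative to the root
        "suffix": path.suffix,
        "size": len(data),  # size in bytes
        "hash": hashlib.md5(data).hexdigest(),
        "hash_type": "md5",
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    csv_path = root / "mini.csv"
    csv_path.write_bytes(b"test\n1\n2\n")
    meta = describe_file(csv_path, root)
```

Two files with the same `hash` and `size` very likely have identical content, which is what makes the hash useful for spotting files that are already stored.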

By saving a file object, metadata & data are saved to database & storage in a single [ACID](/faq/acid) transaction:

In [None]:
file.save()  # as the file is already in the desired storage location, only metadata is written

Because we called `ln.track()`, we know where the file came from. It has linked {class}`~lamindb.Transform` and {class}`~lamindb.Run` objects:

In [None]:
file.transform

In [None]:
file.run

### Add a new file

Here's a local file that's not yet in a registered storage location:

In [None]:
filepath = ln.dev.datasets.file_jpg_paradisi05().resolve()

filepath

Because it's not found in a storage location, we're now getting a hint that tells us it will be copied into default storage upon `.save()`.

This behavior is useful when you're working with local caches and want to upload "final" data to the cloud:

In [None]:
file = ln.File(filepath, description="paradisi05 laminopathic nuclei image")

# Optionally, you may specify the target path for storing the file by passing the `key` argument
# this will store the file as `./mydata/images/paradisi05_laminopathic_nuclei.jpg`
# file = ln.File(filepath, key="images/paradisi05_laminopathic_nuclei.jpg")

In [None]:
file.save()
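
The copy-on-save behavior can be pictured with the following sketch: a file that already sits under the storage root is left alone, while any other file is copied to `root/key`. The function `save_to_storage` is hypothetical, and falling back to the original file name when no `key` is given is an assumption, not necessarily what LaminDB does.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def save_to_storage(path: Path, storage_root: Path, key: Optional[str] = None) -> Path:
    """Return the file's location in storage, copying it there if necessary."""
    path, storage_root = path.resolve(), storage_root.resolve()
    try:
        path.relative_to(storage_root)  # raises ValueError if outside the root
        return path  # already in storage: nothing to copy
    except ValueError:
        target = storage_root / (key or path.name)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        return target


with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / "mydata"
    storage.mkdir()
    local_file = Path(tmp) / "image.jpg"
    local_file.write_bytes(b"not really a jpg")

    # a file outside the storage root gets copied to the location given by `key`
    stored = save_to_storage(local_file, storage, key="images/image.jpg")
    stored_key = stored.relative_to(storage.resolve()).as_posix()
    copied_ok = stored.read_bytes() == local_file.read_bytes()

    # a file that is already in storage is returned unchanged
    unchanged = save_to_storage(stored, storage) == stored
```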

Looking into the default storage, we see:

In [None]:
ln.File.tree()

Looking into the database, you'll see:

In [None]:
ln.view()

### Access a file

{meth}`~lamindb.File.path` will give you the filepath:

In [None]:
file.path()

If the file is in the cloud, you typically stage a cached file ({meth}`~lamindb.File.stage`) or stream its data ({meth}`~lamindb.File.backed`).

### Search or query the file

You can search for the file based on the fields in the `File` registry:

In [None]:
ln.File.search("paradisi")

Alternatively, you can query the file by any metadata combination: 

In [None]:
users = ln.User.lookup()  # auto-complete users
transform = ln.Transform.filter(
    name__contains="files & datasets"
).one()  # query name field of Transform registry, expect *exactly* one result

ln.File.filter(
    suffix=".jpg",
    created_by=users.testuser1,
    transform=transform,
).df()

You can also chain `.filter()` and `.search()` statements, e.g. `ln.File.filter(suffix=".jpg").search("my image")`.
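
If the double-underscore syntax is new to you, the sketch below shows how `field__lookup=value` conditions can be interpreted on plain dictionaries. It only covers exact, `contains` and `startswith` matches, the helpers `matches` and `filter_records` are hypothetical, and the records are made up; the real registry translates such conditions into SQL queries.

```python
from typing import Any, Dict, List


def matches(record: Dict[str, Any], **conditions: Any) -> bool:
    """Check a record against `field=value` or `field__lookup=value` conditions."""
    for expression, expected in conditions.items():
        field, _, lookup = expression.partition("__")
        value = record.get(field)
        if lookup == "":  # no lookup given: exact match
            ok = value == expected
        elif lookup == "contains":
            ok = isinstance(value, str) and expected in value
        elif lookup == "startswith":
            ok = isinstance(value, str) and value.startswith(expected)
        else:
            raise ValueError(f"unsupported lookup: {lookup!r}")
        if not ok:
            return False  # all conditions must hold (logical AND)
    return True


def filter_records(records: List[Dict[str, Any]], **conditions: Any) -> List[Dict[str, Any]]:
    """Return the records that satisfy all conditions."""
    return [record for record in records if matches(record, **conditions)]


# made-up records for illustration
records = [
    {"key": "mini.csv", "suffix": ".csv", "description": None},
    {"key": "images/nuclei.jpg", "suffix": ".jpg", "description": "nuclei image"},
    {"key": "sample_001/summary.html", "suffix": ".html", "description": None},
]

jpgs = filter_records(records, suffix=".jpg")
chained = filter_records(jpgs, description__contains="nuclei")  # filter a filtered result
in_sample = filter_records(records, key__startswith="sample_001/")
```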

An empty filter gives you the entire registry content:

In [None]:
ln.File.filter().df()

## Manage directories

Use {meth}`~lamindb.File.from_dir` to create files from a directory:

In [None]:
files = ln.File.from_dir("./mydata/sample_001/")

Let's save them:

In [None]:
ln.save(files)

View as a tree:

In [None]:
ln.File.tree("./mydata/sample_001")

Or as a query:

In [None]:
ln.File.filter(key__startswith="sample_001/").df().head()

```{note}

LaminDB treats directories similarly to AWS S3: as a prefix in the storage `key`, queryable with `key__startswith`.

```
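
To see how a flat set of keys can still behave like a directory tree, consider this small sketch. The function `list_dir` and the example keys are hypothetical; the point is only that "directories" emerge from shared prefixes.

```python
from typing import Iterable, List


def list_dir(keys: Iterable[str], prefix: str) -> List[str]:
    """List the immediate children of `prefix`, marking sub-directories with '/'."""
    children = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key equals the prefix itself
        head, sep, _ = rest.partition("/")
        children.add(head + sep)  # sep is "/" only if something follows
    return sorted(children)


# made-up keys for illustration
keys = [
    "mini.csv",
    "sample_001/summary.html",
    "sample_001/raw/matrix.h5",
    "sample_001/raw/barcodes.tsv",
]

top_level = list_dir(keys, "")
sample_dir = list_dir(keys, "sample_001/")
```

Note that the trailing slash in the prefix matters: without it, `"sample_001"` would also match a key like `"sample_0010/..."`.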

## Manage features & labels

:::{dropdown} Why do we care about managing features & labels?

1. Finding data: Which datasets measured expression of cell marker CD14? Which datasets have a test & train split? Which characterized cell line K562? Etc.
2. Validating data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

:::

:::{dropdown} A perspective on contextualizing data objects

We have come to love the pydata family of data objects: `DataFrame`, `AnnData`, `pytorch.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, ...

But we couldn’t find an object for linking data objects to context!

So, we made `lamindb.File` and `lamindb.Dataset` to model how data objects relate to their context.

Context can be other data objects, data transformations, ML models, users & pipelines that performed transformations (all aspects of data lineage).

Context can also be any entity of the domain in which data is generated and modeled.

:::

Consider a batch of the Iris flower dataset in the form of a DataFrame:

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch1()

df.head()

### Validate & link features

Let's use {meth}`~lamindb.File.from_df` to track this DataFrame along with its columns as features:

In [None]:
file = ln.File.from_df(df, description="Iris flower dataset batch 1")

Because this is an empty LaminDB instance without a single registered feature, we are informed that features couldn't be validated and are ignored.

But all features here are meaningful and well-curated, so let's create records for them:

In [None]:
features = ln.Feature.from_df(df)

features

As soon as we save them, they'll serve as the reference for validating data batches.

In [None]:
ln.save(features)

:::{dropdown} How to track units of features?

It's easy using {class}`~lamindb.Feature.unit`. In the above example, you'd do:

```python
for feature in features:
    if feature.type == "float":
        feature.unit = "m"
        feature.save()
```

:::

If we now create a `File` object again, we'll see that features are validated based on the registry content:

In [None]:
file = ln.File.from_df(df, description="Iris flower dataset batch 1")

Let's register the file along with its linked features.

In [None]:
file.save()
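
Conceptually, validating features boils down to comparing column names against the names in the registry. Here is a minimal sketch of that comparison; `validate_features` is a hypothetical helper, and all column names except `iris_species_name` are assumed for illustration.

```python
from typing import Iterable, List, Set, Tuple


def validate_features(columns: Iterable[str], registry: Set[str]) -> Tuple[List[str], List[str]]:
    """Split column names into those found in the registry and those not found."""
    validated, unknown = [], []
    for name in columns:
        (validated if name in registry else unknown).append(name)
    return validated, unknown


columns = ["sepal_length", "sepal_width", "iris_species_name"]  # assumed names

# with an empty registry, nothing can be validated
before = validate_features(columns, set())

# once the features are registered, they validate; a typo does not
registry = set(columns)
after = validate_features(columns + ["sepal_lenght"], registry)
```

This is also why a registry helps with catching typos: the misspelled `sepal_lenght` ends up in the list of unknown names instead of silently becoming a new feature.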

We can get an overview of all linked feature sets by `slot`:

In [None]:
file.features

A `slot` provides a string key to access feature sets. It's typically the accessor of feature identifiers in the data object we're validating & registering (here, a `DataFrame`).

Let's use it to access all linked features:

In [None]:
file.features["columns"].df()

### Validate & link labels

The Iris dataset comes with labels within the data object.

In [None]:
species_labels = ln.Label.from_values(df["iris_species_name"])

ln.save(species_labels)

species_labels

Let's also annotate the file with the labels for `iris_species_name` that are sampled in it:

In [None]:
file.add_labels(species_labels)

This lets you query & search for the file by whether "setosa" was sampled in it:

In [None]:
ln.File.filter(labels__name="setosa").df()

Or to see which labels are present for a given feature of a file:

In [None]:
file.get_labels("iris_species_name").df()
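
One way to think about this is as a mapping from file to feature to labels. The sketch below models that with nested dictionaries; the helpers and the column values are hypothetical, and in LaminDB these links live in database tables rather than in memory.

```python
from typing import Dict, Iterable, List, Set


def unique_labels(values: Iterable[str]) -> List[str]:
    """Reduce a column of values to sorted, de-duplicated label names."""
    return sorted(set(values))


# file description -> feature name -> label names (a stand-in for link tables)
annotations: Dict[str, Dict[str, Set[str]]] = {}


def add_labels(file: str, labels: Iterable[str], feature: str) -> None:
    """Link labels to a file under a given feature."""
    annotations.setdefault(file, {}).setdefault(feature, set()).update(labels)


def get_labels(file: str, feature: str) -> List[str]:
    """Return the labels linked to a file for a given feature."""
    return sorted(annotations.get(file, {}).get(feature, set()))


def files_with_label(label: str) -> List[str]:
    """Return all files that carry `label` under any feature."""
    return sorted(
        file
        for file, by_feature in annotations.items()
        if any(label in labels for labels in by_feature.values())
    )


column = ["setosa", "setosa", "versicolor", "virginica", "versicolor"]  # made-up values
species = unique_labels(column)
add_labels("batch 1", species, feature="iris_species_name")
add_labels("batch 1", ["experiment_1"], feature="experiment")
```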

In addition to features present in columns, a file can be labeled with additional metadata.

Let's say this file belongs to `"experiment_1"` and we'd like to track this information for two reasons:

1. later we'd like to query all files linked to this experiment
2. we consider it a potential confounder when we analyze similar data from a follow-up experiment

In [None]:
experiment1 = ln.Label(name="experiment_1")

experiment1.save()

experiment1

In [None]:
ln.Feature(name="experiment", type="category").save()

In [None]:
file.add_labels(experiment1, feature="experiment")

You'll notice that a new feature set was created for slot "ext" (external):

In [None]:
file.features

In [None]:
file.describe()

See the database content:

In [None]:
ln.view(registries=["Feature", "FeatureSet", "Label"])

## Manage datasets

In simple cases like the ones we just saw, we can use files to store datasets.

In more complex cases, however, we'd like to store collections of images, collections of data objects, or SQL tables in BigQuery, Snowflake, or Postgres.

Hence, we need a second central class for data storage: {class}`~lamindb.Dataset`.

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch2()
ln.File.from_df(df, description="Iris flower dataset batch 2").save()

In [None]:
file1 = ln.File.filter(description="Iris flower dataset batch 1").one()
file2 = ln.File.filter(description="Iris flower dataset batch 2").one()

In [None]:
dataset = ln.Dataset.from_files(name="The combined Iris dataset", files=[file1, file2])

In [None]:
dataset.save()

:::{dropdown} What is a Dataset?

Basic dataset metadata is:

- `id`: a universal ID that also serves as a primary key in the SQL table
- `name`: a name
- `hash`: an MD5 hash useful to check for integrity and collisions
- `file`: a link to a single file, if the dataset consists of a single file
- `files`: a link to several files, if the dataset consists of several files (is "sharded")
- `created_at`: time of creation
- `updated_at`: time of last update
- `created_by`: the {class}`~lamindb.User` who created the dataset

Managing the underlying data:

- `load()`: load the data to memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `backed()`: the path (cloud or local)

For a full reference, see {class}`~lamindb.Dataset`.

:::

You can load the sharded dataset as if it were a single data object:

In [None]:
dataset.load()
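
Loading a sharded dataset can be thought of as reading each shard and concatenating the results. The following sketch does this for two small CSV files with the standard library; `load_shards`, the file names and the values are hypothetical, and the actual `load()` returns richer objects such as DataFrames.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def load_shards(paths: List[Path]) -> List[Dict[str, str]]:
    """Read several CSV shards and concatenate their rows in order."""
    rows: List[Dict[str, str]] = []
    for path in paths:
        with path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    batch1 = Path(tmp) / "batch1.csv"
    batch2 = Path(tmp) / "batch2.csv"
    # made-up measurements in meters
    batch1.write_text("sepal_length,iris_species_name\n0.051,setosa\n")
    batch2.write_text(
        "sepal_length,iris_species_name\n0.07,versicolor\n0.063,virginica\n"
    )
    combined = load_shards([batch1, batch2])
```

Concatenation like this only makes sense if the shards share the same columns, which is one reason why validating features before registering files pays off.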

Access the underlying two file objects:

In [None]:
dataset.files.list()

Or see the registries:

In [None]:
ln.view(registries=["Dataset", "File"])

## Manage metadata

To end this guide through basic file & metadata tracking, let's see how to update records storing metadata for any entity.

### Validate records upon creation

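Earlier, we created an `experiment_1` label. Let's now create and save a `project_1` label, and then see what happens if we try to create it again:

In [None]:
ln.Label(name="project_1").save()  # first creation: inserts a new record

label = ln.Label(name="project_1")  # second creation with the same name

label.save()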

Instead of creating a new record, LaminDB will load and return the existing record from the database.

If there is no exact match, LaminDB will warn you upon creating a record about potential duplicates.

Say, we spell "project_1" without an underscore:

In [None]:
ln.Label(name="project 1")

You see that for every record creation, a search checks whether a similar record already exists!

This is to avoid inserting duplicate records.

You can switch it off (for performance gains) via `ln.settings.upon_create_search_names = False`.
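
To get a feeling for what such a similarity check involves, here is a minimal sketch based on `difflib` from the standard library. The function `check_name`, the cutoff of 0.8 and the use of `difflib` are assumptions for illustration; LaminDB's search may work quite differently.

```python
from difflib import get_close_matches
from typing import List, Tuple


def check_name(name: str, existing: List[str], cutoff: float = 0.8) -> Tuple[str, List[str]]:
    """Classify a name as 'existing' or 'new' and list similar existing names."""
    if name in existing:
        return "existing", []  # exact match: load the record instead of duplicating it
    similar = get_close_matches(name, existing, n=3, cutoff=cutoff)
    return "new", similar  # a non-empty list is worth a warning


registry = ["experiment_1", "project_1"]

exact = check_name("project_1", registry)
near = check_name("project 1", registry)
novel = check_name("sample", registry)
```

Comparing every new name against the whole registry costs time, which is the motivation for being able to switch the check off.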

### Update records

In [None]:
label = ln.Label.filter(name="project_1").first()

In [None]:
label

In [None]:
label.name = "project_1a"

In [None]:
label.save()

### Delete records

Delete records like so:

In [None]:
label.delete()

## Default storage

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"` and see all storage locations via:

In [None]:
ln.Storage.filter().df()

In [None]:
# clean up what we wrote in this notebook
!lamin delete mydata
!rm -r mydata
!rm paradisi05_laminopathic_nuclei.jpg