# Manage files & datasets

Here, you'll learn LaminDB's basic data management workflow.

Starting with a [lake](https://en.wikipedia.org/wiki/Data_lake) of files, you'll arrive at a [warehouse](https://en.wikipedia.org/wiki/Data_warehouse) of analysis & ML-ready datasets (a [feature store](https://en.wikipedia.org/wiki/Feature_engineering#Feature_stores)).

While this tutorial is all about basic metadata, you'll later see that `lamindb` gives you a framework for linking complex metadata related to biology and any custom schema.

```{tip}

This tutorial is a [Jupyter notebook](https://github.com/laminlabs/lamindb/blob/main/docs/guide/tutorial1.ipynb).

```

## Set up an instance

[Installation and sign-up](./index.md#setup) take no time: Run `pip install lamindb` and `lamin signup <email>` on the command line.

Using the CLI, let's create a LaminDB instance with a directory `./mydata` for storing files and a SQLite database for managing metadata:


In [None]:
!lamin init --storage ./mydata  # or "s3://my-bucket" or "gs://my-bucket"

(Think of initializing a LaminDB instance as analogous to initializing a git repository.)

We're now ready to import `lamindb`:

In [None]:
import lamindb as ln

ln.settings.verbosity = 3  # show hints

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"`.

And you can see all storage locations by querying {class}`~lamindb.Storage`:

In [None]:
ln.Storage.select().df()  # more on select statements later!

## Track a data source

Knowing where a batch of data comes from helps you find & understand it.

We call the code that generated it a _transform_. The code can be a data pipeline, a notebook or an app/instrument upload.

With {class}`~lamindb.Transform`, LaminDB maintains a registry of transforms and makes it easy to link data against them.

Here, we're running a Jupyter notebook. Let's track it:

In [None]:
ln.track()

By calling {func}`~lamindb.track`, the notebook is automatically linked as the source of all data that's about to be saved.

:::{dropdown} What happened under the hood?

Logging informed us about

1. the package versions that the notebook imports
2. the automatic detection of notebook metadata (title, filename, version, timestamp, creator) and creation of a {class}`~lamindb.Transform` object
3. the automatic creation of a {class}`~lamindb.Run` object (timestamp, transform, creator)

:::
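
To make this bookkeeping concrete, here is a minimal, purely illustrative sketch of how a tracked run could be attached to every record created afterwards. The dataclasses and the `track()` and `new_file()` helpers are hypothetical stand-ins built on the standard library. They are not LaminDB's implementation, and the real `Transform` and `Run` objects carry more fields.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Transform:
    """Hypothetical stand-in for the code that generates data."""
    name: str
    version: str = "0"


@dataclass
class Run:
    """Hypothetical stand-in for one execution of a transform."""
    transform: Transform
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class FileRecord:
    """Hypothetical stand-in for file metadata with provenance."""
    key: str
    run: Optional[Run] = None

    @property
    def transform(self) -> Optional[Transform]:
        # The transform is reachable through the run that created the file.
        return self.run.transform if self.run else None


# Module-level state: the run that new records get linked to.
_current_run: Optional[Run] = None


def track(transform: Transform) -> Run:
    """Start a new run of `transform` and remember it as the current run."""
    global _current_run
    _current_run = Run(transform=transform)
    return _current_run


def new_file(key: str) -> FileRecord:
    """Create a record that is linked to the current run, if there is one."""
    return FileRecord(key=key, run=_current_run)


run = track(Transform(name="Manage files & datasets"))
record = new_file("mini.csv")
print(record.transform.name, record.run.created_at.isoformat())
```

The point of the sketch is only that tracking happens once, and every record created afterwards picks up its source without further arguments.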

:::{dropdown} How do I track a versioned pipeline?

If you'd like to track one of your versioned pipelines as a data source:

```python
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
```

:::

:::{dropdown} Why do we care about notebooks?

Most people advocate for "not using notebooks in production" or similar. And we agree! Anything that can be a pipeline, should be a pipeline.

But we also think that a lot of the downstream insight & value generated from biological data is driven by computational biologists interacting with it.

And we think this is very much akin to the prose-heavy design of biological experiments documented in an ELN.

A notebook that's run a single time on specific data batches is not a pipeline, it's a _document_ that produced an insight or some other form of data representation.

Unfortunately, most mistakes happen when using notebooks. `ln.track()` helps you avoid some of them.

An early blog post on this is [here](https://lamin.ai/blog/2022/nbproject).

:::

## Manage files

### Track an existing file

Here, we have an existing file in our storage location: `./mydata/mini.csv`

In [None]:
# put a file "mini.csv" into our default storage
filepath = ln.dev.datasets.file_mini_csv()
filepath.rename(ln.setup.settings.storage.root / filepath.name)

Let's create a {class}`~lamindb.File` object from the path:

In [None]:
file = ln.File("./mydata/mini.csv")  # or "s3://my-bucket/my-folder/my-file.csv"

:::{dropdown} What is a File object in LaminDB?

It's an object for managing a file's metadata, for searching & querying the file, and for accessing it in different ways.

Basic metadata is:

- `id`: a universal ID (also serves as a primary key in the underlying SQL table of the instance)
- `key`: an optional storage key, i.e., the relative path of the file in `storage`
- `description`: an optional string description
- `storage`: the storage location (the root, say, an S3 bucket or network location)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: a hash useful to check for integrity and collisions (is this file already stored?)
- `hash_type`: the type of the hash (usually, an MD5 or SHA1 checksum)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related metadata is:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the file

Managing the underlying data:

- `load()`: load the file to memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `path()`: the path (cloud or local)
- `stage()`: a local path to a cached object
- `replace()`: replace the content of the file

For a full reference, see {class}`~lamindb.File`.

:::
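
The `hash` field mentioned above can be illustrated with a short sketch: hashing a file's content in chunks and using the digest to detect that the same content was already registered. The helpers `file_md5` and `find_duplicate` are hypothetical and use only `hashlib`; how LaminDB computes and compares its hashes internally is not shown in this tutorial.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate(path, known_hashes):
    """Return the key already registered for this content, or None."""
    return known_hashes.get(file_md5(path))


workdir = Path(tempfile.mkdtemp())
original = workdir / "mini.csv"
copy = workdir / "mini_copy.csv"
other = workdir / "other.csv"
original.write_bytes(b"a,b\n1,2\n")
copy.write_bytes(b"a,b\n1,2\n")
other.write_bytes(b"a,b\n3,4\n")

# A toy registry mapping content hash -> storage key.
known_hashes = {file_md5(original): "mini.csv"}
duplicate_of = find_duplicate(copy, known_hashes)
unrelated = find_duplicate(other, known_hashes)
print(duplicate_of, unrelated)
```

Because the digest depends only on content, a renamed copy is still recognized, while a file with different content is not.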

By saving a file object, metadata & data are saved to database & storage in a single [ACID](/faq/acid) transaction:

In [None]:
file.save()  # as the file is already in the desired storage location, only metadata is written
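
As a rough illustration of the "both or neither" idea behind saving metadata and data together, the following sketch pairs a SQLite transaction with a file copy. It is an assumption-laden toy: the `file` table, the `save_file` helper, and the cleanup strategy are hypothetical and say nothing about how LaminDB implements its transaction.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def save_file(conn, source, storage_root, key):
    """Insert the metadata row and copy the data; keep both or neither."""
    target = Path(storage_root) / key
    existed_before = target.exists()
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("INSERT INTO file (key) VALUES (?)", (key,))
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(source, target)
    except Exception:
        # Remove a copy that this call created; never touch older data.
        if not existed_before:
            target.unlink(missing_ok=True)
        raise


workdir = Path(tempfile.mkdtemp())
storage = workdir / "storage"
source = workdir / "mini.csv"
source.write_text("a,b\n1,2\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (key TEXT PRIMARY KEY)")
conn.commit()

save_file(conn, source, storage, "mini.csv")  # metadata row and data both written

try:
    save_file(conn, workdir / "missing.csv", storage, "missing.csv")
except FileNotFoundError:
    pass  # the copy failed, so the INSERT for "missing.csv" was rolled back

keys = [row[0] for row in conn.execute("SELECT key FROM file")]
print(keys)
```

The failed second call leaves no metadata row behind, which is the property that makes the database a reliable index of what is actually in storage.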

Because we called `ln.track()`, we know where the file came from. It has linked {class}`~lamindb.Transform` and {class}`~lamindb.Run` objects:

In [None]:
file.transform

In [None]:
file.run

### Add a new file

Here's a local file that's not yet in LaminDB storage:

In [None]:
filepath = ln.dev.datasets.file_jpg_paradisi05().resolve()

filepath

The way you indicate the target path for storing the file is by passing the `key` argument:

In [None]:
file = ln.File(filepath, key="images/paradisi05_laminopathic_nuclei.jpg")

In [None]:
file.save()

Looking into our default storage, we see:

In [None]:
ln.File.tree()  # this also shows the LaminDB-managed SQLite database `mydata.lndb`

Looking into the database, you'll see:

In [None]:
ln.view()

### Access a file

{meth}`~lamindb.File.path` will give you a filepath within the storage location.

In [None]:
file.path()

If the file is in the cloud, you typically stage a cached file ({meth}`~lamindb.File.stage`) or stream its data ({meth}`~lamindb.File.backed`).
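
The idea of staging can be pictured with a small sketch: copy the remote object into a local cache once, and reuse the cached copy while it is up to date. Everything below is hypothetical, including the `stage` helper and the use of a local directory as a stand-in for a bucket; it is not the logic of {meth}`~lamindb.File.stage`.

```python
import shutil
import tempfile
from pathlib import Path


def stage(remote_path, cache_dir):
    """Return a local cached copy of `remote_path`, copying only if needed.

    Simplification: the cache is keyed by file name only.
    """
    remote_path = Path(remote_path)
    local_path = Path(cache_dir) / remote_path.name
    is_stale = (
        not local_path.exists()
        or local_path.stat().st_mtime < remote_path.stat().st_mtime
    )
    if is_stale:
        local_path.parent.mkdir(parents=True, exist_ok=True)
        # copy2 preserves the modification time, so the next call sees a fresh cache.
        shutil.copy2(remote_path, local_path)
    return local_path


workdir = Path(tempfile.mkdtemp())
remote = workdir / "bucket" / "images" / "nuclei.jpg"  # stand-in for a cloud object
remote.parent.mkdir(parents=True)
remote.write_bytes(b"not really a jpg")
cache = workdir / "cache"

first = stage(remote, cache)   # copies the object into the cache
second = stage(remote, cache)  # reuses the cached copy
print(first)
```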

### Search or query the file

You can search for the file by its metadata:

In [None]:
ln.File.search("paradisi")

Alternatively, you can query the file by any metadata combination: 

In [None]:
ln.File.select(key="images/paradisi05_laminopathic_nuclei.jpg").df()

In [None]:
users = ln.User.lookup("handle")
ln.File.select(created_by=users.testuser1).df()

In [None]:
transform = ln.Transform.select(id="NJvdsWWbJlZSz8").one()
transform

In [None]:
ln.File.select(transform=transform).df()

## Manage in-memory data objects

A `File` object can also be created from an in-memory data object like a `DataFrame`, e.g.,

```python
file = ln.File(df, description="Data batch X")  # serialize df to storage, default is `.parquet`
file = ln.File.from_df(df, description="Data batch X")  # additionally track columns as features
```

Consider a batch of the Iris flower dataset:

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch1()

df.head()

Let's use {meth}`~lamindb.File.from_df` to track this DataFrame along with its columns as features:

In [None]:
file = ln.File.from_df(df, description="Iris flower dataset batch 1")

In [None]:
file.save()

### Query by features

Because the file is linked to the features it measured, we can query or search for them:

In [None]:
ln.Feature.search("iris_species")

In [None]:
iris_species_name = ln.Feature.select(name="iris_species_name").one()
feature_set = ln.FeatureSet.select(features=iris_species_name).one()
ln.File.select(feature_sets=feature_set).df()

### Annotate & query by labels

Let's register the species labels and annotate the file:

In [None]:
iris_species = ln.Label.from_values(
    df["iris_species_name"]
)  # create records for the sampled labels within the dataframe column
ln.save(iris_species)  # save labels to database
file.labels.set(iris_species)  # annotate the file

This lets us query & search for the file by whether "setosa" was sampled in it:

In [None]:
setosa = ln.Label.select(name="setosa").one()
ln.File.select(labels=setosa).df()

Or, for any given file, see which labels were sampled:

In [None]:
file.labels.df()

Using the `ref_id`, `ref_orm`, and `ref_schema` fields, we'll be able to derive labels from other registries!

## Understand data objects in context 

We have come to love the pydata family of data objects: `DataFrame`, `AnnData`, `pytorch.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, and others.

But we couldn’t find an object for linking data objects to context! 😠

So, we made `lamindb.File` and `lamindb.Dataset` to model how data objects relate to their context.

Context can be other data objects, data transformations, ML models, users & pipelines that performed transformations (all aspects of data lineage).

Context can also be any entity of the domain in which data is generated and modeled.

We focused on linking `File` and `Dataset` to data lineage & biological concepts. You'll learn about them further down the guide.

## Manage features

As we just saw, by using {meth}`~lamindb.File.from_df`, `lamindb` automatically linked features and warned us about the creation of new {class}`~lamindb.Feature` and {class}`~lamindb.Label` records.

We see the result in the database overview:

In [None]:
ln.view(orms=["Label", "Feature", "FeatureSet"])

:::{dropdown} Why do we care about managing features?

1. Finding a dataset: Which datasets measured expression of cell marker CD14? Which datasets have an out-of-domain split?
2. Validating integrity: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent?

:::
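
To illustrate the "typos in feature names" check, here is a small sketch that compares incoming column names against already registered names and suggests the closest match. It uses `difflib` from the standard library; the `suggest_corrections` helper, the cutoff of 0.8, and the misspelled and extra column names are assumptions made for the example, not part of LaminDB.

```python
import difflib


def suggest_corrections(names, registered, cutoff=0.8):
    """Map each unregistered name to its closest registered name, or None."""
    suggestions = {}
    for name in names:
        if name in registered:
            continue  # exact match: nothing to flag
        matches = difflib.get_close_matches(name, registered, n=1, cutoff=cutoff)
        suggestions[name] = matches[0] if matches else None
    return suggestions


registered = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
incoming = ["sepal_length", "sepal_lenght", "batch_id"]  # one typo, one new name

suggestions = suggest_corrections(incoming, registered)
print(suggestions)
```

A name with a close registered neighbor is probably a typo, while a name with no suggestion is more likely a genuinely new feature.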

We see an evident place where we might want to further populate `Feature` records: the unit of measure for float-typed features.

In [None]:
features_in_meters = ln.Feature.select(type="float").all()

In [None]:
for feature in features_in_meters:
    feature.unit = "m"
    feature.save()

In [None]:
ln.Feature.select().df()

## Manage datasets

In simple cases like the ones we just saw, we can use files to store datasets.

In more complex cases, however, we'd like to store collections of images, collections of data objects, or SQL tables in BigQuery, Snowflake, or Postgres.

Hence, we need a second central class for data storage: {class}`~lamindb.Dataset`.

We'll start with the simplest case: A dataset that's stored in a single file.

### A single DataFrame

Let us look at a second batch of the iris dataset:

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch2()

df.head()

In [None]:
dataset = ln.Dataset(df, name="Iris flower dataset batch 2")

In [None]:
dataset

In [None]:
dataset.save()

Get the dataframe back:

In [None]:
df = dataset.load()

df.head()

The `Dataset` object has a 1:1 correspondence to an underlying file object, accessible via:

In [None]:
dataset.file

So, you can stage the underlying parquet file:

In [None]:
dataset.file.stage()

The data got added with a storage key based on the `id`, because here, we didn't pass the `key` argument.

In [None]:
ln.File.tree()

In the database, we're now seeing the following:

In [None]:
ln.view()

:::{dropdown} Dataset overview

Basic dataset metadata is:

- `id`: a universal ID that also serves as a primary key in the SQL table
- `name`: a name
- `hash`: an MD5 hash useful to check for integrity and collisions
- `file`: a link to a single file, if the dataset consists of a single file
- `files`: a link to several files, if the dataset consists of several files (is "sharded")
- `created_at`: time of creation
- `updated_at`: time of last update
- `created_by`: the {class}`~lamindb.User` who created the dataset

Managing the underlying data:

- `load()`: load the dataset into memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `backed()`: stream the data from storage (cloud or local) without loading it fully into memory

For a full reference, see {class}`~lamindb.Dataset`.

:::

### Multiple DataFrames - sharded datasets

Often, we measure data in batches and want to store these batches separately.

Let us look at how to construct a `Dataset` from two (or more) files corresponding to these batches (or "shards").

In [None]:
file1 = ln.File.select(description="Iris flower dataset batch 1").one()
file2 = dataset.file

In [None]:
dataset = ln.Dataset.from_files(name="The combined Iris dataset", files=[file1, file2])

In [None]:
dataset.save()

You can load the sharded dataset as if it were one dataset:

In [None]:
dataset.load()
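
One way to picture loading a sharded dataset is as reading the shards in order and concatenating their rows, after checking that the columns agree. The sketch below does this for CSV shards with the standard library. The `load_sharded` helper, the file names, and the values are made up for illustration, and the sketch does not claim to mirror what `dataset.load()` does internally.

```python
import csv
import tempfile
from pathlib import Path


def load_sharded(paths):
    """Read CSV shards in order and return their rows as one list of dicts."""
    rows = []
    header = None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            if header is None:
                header = reader.fieldnames
            elif reader.fieldnames != header:
                raise ValueError(
                    f"{path} has columns {reader.fieldnames}, expected {header}"
                )
            rows.extend(reader)
    return rows


workdir = Path(tempfile.mkdtemp())
shard1 = workdir / "batch1.csv"
shard2 = workdir / "batch2.csv"
# Made-up values, loosely modeled on the Iris batches used in this tutorial.
shard1.write_text("sepal_length,iris_species_name\n0.051,setosa\n0.049,setosa\n")
shard2.write_text("sepal_length,iris_species_name\n0.063,virginica\n")

rows = load_sharded([shard1, shard2])
print(len(rows), rows[-1])
```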

In storage, you see:

In [None]:
ln.File.tree()

In the database, you see:

In [None]:
ln.view()

## Manage directories

In [None]:
# generate some files in default storage
ln.dev.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)

We can pass an existing directory to {meth}`~lamindb.File.from_dir`:

In [None]:
files = ln.File.from_dir("./mydata/sample_001/")

In [None]:
print(files[:2])

In [None]:
ln.save(files)

View the files as a tree:

In [None]:
ln.File.tree()  # to subset, call ln.File.tree("sample_001")

Under the hood, the following records got written:

In [None]:
ln.File.select(key__startswith="sample_001/").df().head()

Query a specific file by passing the full key to `ln.File.select`:

In [None]:
ln.File.select(key="sample_001/metrics_summary.csv").df()

You see that LaminDB treats directories similarly to S3: as a plain prefix in the storage `key`.

If you want to group files flexibly, consider using labels ({class}`~lamindb.Label`).
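
To see what "a directory is just a key prefix" means in practice, consider this small sketch over a flat list of storage keys. The helpers `list_directory` and `immediate_children` are hypothetical, and the key `sample_001/web_summary.html` is made up for the example; the prefix filter plays the role that `key__startswith` plays in the query above.

```python
def list_directory(keys, prefix):
    """Return all keys under `prefix`, treating it as a directory."""
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # avoid matching "sample_0010/..." for "sample_001"
    return sorted(key for key in keys if key.startswith(prefix))


def immediate_children(keys, prefix=""):
    """Return the names directly under `prefix`; directories end with '/'."""
    children = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        name, separator, _ = rest.partition("/")
        children.add(name + separator)
    return sorted(children)


keys = [
    "mini.csv",
    "images/paradisi05_laminopathic_nuclei.jpg",
    "sample_001/metrics_summary.csv",
    "sample_001/web_summary.html",  # made-up key for illustration
]

in_sample = list_directory(keys, "sample_001")
top_level = immediate_children(keys)
print(in_sample)
print(top_level)
```

No directory objects exist anywhere in this model: the tree view is derived entirely from the keys.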

## Label files

Say, we want to tag the files related to `sample_001` independent of where they are in storage.

Let's create and save a tag:

In [None]:
label = ln.Label(name="Sample 0001")
label.save()

Let's now label each file in `files` with this tag and save the update:

In [None]:
for file in files:
    file.labels.add(label)
ln.save(files)

We can now query by this tag (and arbitrarily more):

In [None]:
ln.File.select(labels=label).df().head()

We can also group tags in a hierarchy using the same principles with which we can manage arbitrary ontologies.

Let's create a super class for the sample tag and label our tag with it.

In [None]:
sample_label = ln.Label(name="Sample")
sample_label.save()
label.parents.add(sample_label)

We can now see the hierarchical structure of tags and easily query for all files that have _any_ sample tag:

In [None]:
label.view_parents()

In [None]:
ln.File.select(labels__parents=sample_label).df().head()
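
The query above follows two links: from a file to its labels, and from each label to its parents. A minimal sketch of that traversal with plain dictionaries looks as follows. The function `files_with_label_under` and the example assignments, such as labeling `mini.csv` with "Project A", are hypothetical and only mirror the shape of the query, not its implementation.

```python
def files_with_label_under(parent, file_labels, label_parents):
    """Return files carrying at least one label whose direct parents include `parent`."""
    return sorted(
        file
        for file, labels in file_labels.items()
        if any(parent in label_parents.get(label, set()) for label in labels)
    )


# label -> set of direct parent labels
label_parents = {"Sample 0001": {"Sample"}, "Project A": set()}

# file key -> set of labels (assignments made up for illustration)
file_labels = {
    "sample_001/metrics_summary.csv": {"Sample 0001"},
    "mini.csv": {"Project A"},
    "images/paradisi05_laminopathic_nuclei.jpg": set(),
}

sample_files = files_with_label_under("Sample", file_labels, label_parents)
print(sample_files)
```

Note that the sketch only looks at direct parents, just as `labels__parents` names a single hop in the hierarchy.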

## Create, update & delete validated metadata

To conclude this guide to basic file & metadata tracking, let's see how to create, update & delete the records that store metadata for any entity.

### Create & save records

A single record:

In [None]:
label = ln.Label(name="Project A")

In [None]:
label.save()

Multiple records:

In [None]:
labels = [ln.Label(name=name) for name in ["Project B", "Project C", "Project D"]]

You see that for every record creation, a search checks whether a similar record already exists!

This is to avoid inserting duplicate records.

You can switch it off (for performance gains) via `ln.settings.upon_create_search_names = False`.

In [None]:
ln.save(labels)

Similarly, if you try to create the same record again, the existing record is loaded instead of being re-created:

In [None]:
ln.Label(name="Project A")
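
This "load instead of re-create" behavior is commonly called get-or-create. A minimal sketch of the pattern, assuming a registry keyed by exact name, is shown below. The `LabelRegistry` class is hypothetical and much simpler than a database-backed registry.

```python
class LabelRegistry:
    """Hypothetical in-memory registry that never stores a name twice."""

    def __init__(self):
        self._by_name = {}
        self._next_id = 1

    def get_or_create(self, name):
        """Return (record, created): the existing record if present, else a new one."""
        record = self._by_name.get(name)
        if record is not None:
            return record, False
        record = {"id": self._next_id, "name": name}
        self._by_name[name] = record
        self._next_id += 1
        return record, True


registry = LabelRegistry()
first, first_created = registry.get_or_create("Project A")
again, again_created = registry.get_or_create("Project A")
print(first_created, again_created, first["id"] == again["id"])
```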

You'll learn about more advanced data validation in further guides.

### Update records

In [None]:
label = ln.Label.select(name="Project A").first()

In [None]:
label

In [None]:
label.name = "Project a"

In [None]:
label.save()

### Delete records

In [None]:
label = ln.Label.select(name="Project B").first()

In [None]:
label.delete()

In [None]:
# clean up what we wrote in this notebook
!lamin delete mydata
!rm -r mydata
!rm paradisi05_laminopathic_nuclei.jpg