# Files & datasets, pipelines & analyses

In this first tutorial you'll learn LaminDB's basic data management workflow.

You'll start with a [data lake](https://en.wikipedia.org/wiki/Data_lake) of files and arrive at a [warehouse](https://en.wikipedia.org/wiki/Data_warehouse) of analysis & ML-ready datasets (a [feature store](https://en.wikipedia.org/wiki/Feature_engineering#Feature_stores)).

All accessible through one API & App!

```{tip}

You can run this tutorial as a [Jupyter notebook](https://github.com/laminlabs/lamindb/blob/main/docs/guide/01-tutorial1.ipynb).

```

## Set up a LaminDB instance

[Installing LaminDB and signing up](./index#setup) takes 2 min. ✅

Using the CLI, let's create a LaminDB instance with a directory `./mydata` for storing files and a SQLite database for managing metadata:


In [None]:
!lamin init --storage ./mydata  # or "s3://my-bucket" or "gs://my-bucket"

(Think of initializing a LaminDB instance as analogous to initializing a git repository.)

We're now ready to import `lamindb`:

In [None]:
import lamindb as ln

ln.settings.verbosity = 3  # show hints

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"`.

And you can see all storage locations by querying {class}`~lamindb.Storage`:

In [None]:
ln.Storage.select().df()  # SQLite only supports a single location, Postgres supports an arbitrary number

## Overview

In this tutorial, we'll walk through managing:

1. _files_ (as in a data lake)
2. _datasets_ in the form of collections of files, _data objects_ (`DataFrame`, `AnnData`) or SQL tables (as in a feature store or data warehouse)
3. _metadata_ in the form of _transforms_ (notebooks & pipelines), _runs_, _features_, _tags_, _projects_ & _users_

In later material, you'll see that `lamindb` gives you a full framework for linking metadata related to [data lineage](./data-lineage), [biology](./registries) and any [custom schema](https://github.com/laminlabs/lnschema-lamin1).

## Track a data source

Knowing where a piece of data comes from greatly helps with finding & understanding it. 👍

The data source is the code that generates the data, which we model as a "transform", a {class}`~lamindb.Transform` object.

A transform can be a data pipeline, a notebook or an app or instrument upload.

With {class}`~lamindb.Transform`, LaminDB maintains a registry of your transforms and makes it easy to link data against them.

Here, we're running a Jupyter notebook. Let's track it:

In [None]:
ln.track()

By calling {func}`~lamindb.track`, the notebook will **automatically** be linked as the source of all data that's about to be saved.

What happens under the hood? Logging informed us about:

1. the package versions that the notebook imports
2. the automatic detection of notebook metadata (id, title, filename, version, timestamp, creator) and creation of a {class}`~lamindb.Transform` object with id `NJvdsWWbJlZSry`
3. the automatic creation of a {class}`~lamindb.Run` object (id, timestamp, transform, creator)
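
The bookkeeping in this list can be pictured with a small data model. The following is a minimal sketch in plain Python, not LaminDB's implementation: the classes `TransformRecord`, `RunRecord`, `FileRecord` and `Tracker` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransformRecord:  # hypothetical stand-in for a registered notebook or pipeline
    name: str
    version: str


@dataclass
class RunRecord:  # hypothetical stand-in for one execution of a transform
    transform: TransformRecord


@dataclass
class FileRecord:  # hypothetical stand-in for a tracked file
    key: str
    run: Optional[RunRecord] = None


class Tracker:
    """Remembers the current run and links newly created files to it."""

    def __init__(self) -> None:
        self.current_run: Optional[RunRecord] = None

    def track(self, transform: TransformRecord) -> RunRecord:
        # starting to track a transform creates a new run for it
        self.current_run = RunRecord(transform)
        return self.current_run

    def new_file(self, key: str) -> FileRecord:
        # every file created from now on points back to the current run
        return FileRecord(key=key, run=self.current_run)


tracker = Tracker()
run = tracker.track(TransformRecord(name="My pipeline", version="1.2.0"))
record = tracker.new_file("mini.csv")
print(record.run.transform.name, record.run.transform.version)
```

The point of the design is that the link from file to run to transform is set once, when tracking starts, rather than being passed along with every file.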

:::{note}

If you'd like to track one of your versioned pipelines as a data source:

```{python}
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
```

:::

## Manage files

### Track an existing file

In [None]:
# put a file "mini.csv" into our default storage
filepath = ln.dev.datasets.file_mini_csv()
filepath.rename(ln.setup.settings.storage.root / filepath.name)

We have an existing file in our storage location: `./mydata/mini.csv`

Create a {class}`~lamindb.File` object from the path:

In [None]:
file = ln.File("./mydata/mini.csv")

:::{dropdown} File overview

Basic file metadata is:

- `id`: a universal ID that also serves as a primary key in the SQL table
- `key`: the storage key, i.e., the relative path of the file in the storage location
- `storage`: the storage location (the root, say, an S3 bucket)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: an MD5 checksum useful to check for integrity and collisions (is this file already stored?)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related metadata is:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the file

Managing the underlying data:

- `load()`: load the file to memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `path()`: the path (cloud or local)
- `stage()`: a local path to a cached object
- `replace()`: replace the content of the file

For a full reference, see {class}`~lamindb.File`.

:::
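
To make the role of `hash` concrete, here is a minimal sketch of how an MD5 checksum can answer "is this file already stored?". It only uses Python's standard library; the helper names `md5_of_file` and `is_already_stored` are assumptions for illustration and not part of LaminDB's API.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Set


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_already_stored(path: Path, known_hashes: Set[str]) -> bool:
    """Check whether a file with identical content was registered before."""
    return md5_of_file(path) in known_hashes


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "mini.csv"
    copy = Path(tmp) / "mini_copy.csv"  # same content under a different name
    original.write_text("a,b\n1,2\n")
    copy.write_text("a,b\n1,2\n")
    known = {md5_of_file(original)}
    duplicate_found = is_already_stored(copy, known)

print(duplicate_found)
```

Because the checksum depends only on the content, the copy is detected even though its name differs.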

By saving a file object, metadata & data are saved to database & storage in a single [ACID](/faq/acid) transaction:

In [None]:
file.save()  # as the file is already in the desired storage location, only metadata is written
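
As a reminder of what the atomicity part of ACID means, here is a generic sketch using Python's built-in `sqlite3` module. It does not show how LaminDB coordinates database and storage; the table `file` and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (id TEXT PRIMARY KEY, key TEXT NOT NULL)")

try:
    # "with conn" commits if the block succeeds and rolls back if it raises
    with conn:
        conn.execute("INSERT INTO file VALUES (?, ?)", ("id1", "mini.csv"))
        # the second insert violates the primary key and raises
        conn.execute("INSERT INTO file VALUES (?, ?)", ("id1", "other.csv"))
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

n_rows = conn.execute("SELECT COUNT(*) FROM file").fetchone()[0]
print(n_rows)  # 0: the first insert was rolled back together with the second
```

Either all writes of a transaction become visible or none do, which is what prevents half-saved records.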

The file now has linked transform and run objects:

In [None]:
file.transform

In [None]:
file.run

### Add a new file

In [None]:
filepath = ln.dev.datasets.file_jpg_paradisi05().resolve()

Here's a local file that's not yet in LaminDB storage:

In [None]:
filepath

The way you indicate the target path for storing the file is by passing the `key` argument:

In [None]:
file = ln.File(filepath, key="images/paradisi05_laminopathic_nuclei.jpg")

In [None]:
file.save()

Looking into our default storage, we see:

In [None]:
ln.File.tree()  # this also shows the SQLite database `mydata.lndb` holding metadata

You'll also see your files in the SQL database, together with entries for storage and users (and, further down this guide, many other entities):

In [None]:
ln.view()

### Access a file

{meth}`~lamindb.File.stage` will give you a filepath to a local file, also for a cloud-based file (it will cache a cloud object):

In [None]:
file.stage()

If we want the full `path` within the storage location (say, in an S3 bucket), we use {meth}`~lamindb.File.path`.
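
The relation between storage root, `key` and full path can be sketched as follows. This is an illustration under the assumption that the full path is simply the root joined with the key; the helper `full_path` is hypothetical and not a LaminDB function.

```python
from pathlib import PurePosixPath


def full_path(storage_root: str, key: str) -> str:
    """Join a storage root and a storage key into a full path."""
    if "://" in storage_root:  # cloud roots such as "s3://my-bucket"
        return storage_root.rstrip("/") + "/" + key
    return str(PurePosixPath(storage_root) / key)


key = "images/paradisi05_laminopathic_nuclei.jpg"
local = full_path("./mydata", key)
cloud = full_path("s3://my-bucket", key)
print(local)
print(cloud)
```

The same key therefore identifies the file regardless of whether the storage location is a local directory or a bucket.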

### Query or search a file

You can query the file by its metadata. The simplest way is by `key`:

In [None]:
file = ln.File.select(key="images/paradisi05_laminopathic_nuclei.jpg").one()

file

You can also search for the file by its metadata:

In [None]:
ln.File.search("paradisi")

### In-memory data objects

A `File` object can also be created from an in-memory data object like a `DataFrame` or an `AnnData`.

For this, you'd call one of:

- `file = ln.File(df, name="My dataset X")`
- `file = ln.File(df, key="my_folder/my_file.parquet")`
- `file = ln.File.from_df(df, name="My dataset X")  # will track column names as features`
- `file = ln.File.from_df(df, key="my_folder/my_file.parquet")  # will track column names as features`


Under the hood, the object will be serialized into a configurable storage format (e.g. `DataFrame` → `.parquet`, `AnnData` → `.h5ad`/`.zrad`, ...).

However, while this is the way to go for "auxiliary datasets", if your dataset is relevant for a later analysis, you should create a `Dataset` rather than a `File`.

## Manage datasets

### A single DataFrame

Let us look at the simplest case in which a dataset corresponds to a single `DataFrame`, which we'll store as a `File` object (a `.parquet` file in storage).

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch1()

In [None]:
df.head()

In [None]:
dataset = ln.Dataset(df, name="Iris flower dataset batch1")

In [None]:
dataset

In [None]:
dataset.save()

Get the dataframe back:

In [None]:
dataset.load().head()

The `Dataset` object has a 1:1 correspondence to an underlying file object, accessible via:

In [None]:
dataset.file

So, you can stage the underlying parquet file:

In [None]:
dataset.file.stage()

The data got added with a storage key based on the `id`, because here, we didn't pass the `key` argument.

In [None]:
ln.File.tree()

In the database, we're now seeing the following:

In [None]:
ln.view()

:::{dropdown} Dataset overview

Basic dataset metadata is:

- `id`: a universal ID that also serves as a primary key in the SQL table
- `name`: a name
- `hash`: an MD5 hash useful to check for integrity and collisions
- `file`: a link to a single file, if the dataset consists of a single file
- `files`: a link to several files, if the dataset consists of several files (is "sharded")
- `created_at`: time of creation
- `updated_at`: time of last update
- `created_by`: the {class}`~lamindb.User` who created the dataset

Managing the underlying data:

- `load()`: load the file to memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `backed()`: the path (cloud or local)

For a full reference, see {class}`~lamindb.Dataset`.

:::

### Multiple DataFrames

Often, we measure data in batches and want to store these batches separately.

Let us look at how to construct a `Dataset` from two (or more) files corresponding to these batches (or "shards").

In [None]:
file1 = dataset.file
file2 = ln.File.from_df(
    ln.dev.datasets.df_iris_in_meter_batch2(), description="Iris batch 2"
)
file2.save()  # we have to save a file before using it to compose a dataset

In [None]:
dataset = ln.Dataset.from_files(name="The combined Iris dataset", files=[file1, file2])

In [None]:
dataset.save()

You can load the sharded dataset as if it were one dataset:

In [None]:
dataset.load()

In storage, you see:

In [None]:
ln.File.tree()

In the database, you see:

In [None]:
ln.view()

## Understand data objects in context 

We have come to love the pydata family of data objects: `DataFrame`, `AnnData`, `pytorch.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, and others.

But we couldn't find an object for linking data objects to context! 😠

So, we made `lamindb.File` and `lamindb.Dataset` to model how data objects relate to their context.

Context can be other data objects, data transformations, ML models, users & pipelines that performed transformations (all aspects of data lineage).

Context can also be any entity of the domain in which data is generated and modeled.

We focused on linking `File` and `Dataset` to data lineage & biological concepts. You'll learn about them further down the guide.

## Manage directories

In [None]:
# generate some files in default storage
ln.dev.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)

We can pass an existing directory to {meth}`~lamindb.File.from_dir`:

In [None]:
files = ln.File.from_dir("./mydata/sample_001/")

In [None]:
print(files[:2])

In [None]:
ln.save(files)

View the files as a tree:

In [None]:
ln.File.tree()  # to subset, call ln.File.tree("sample_001")

Under the hood, the following records got written:

In [None]:
ln.File.select(key__startswith="sample_001/").df().head()

Query a specific file by passing the full key to `ln.File.select`:

In [None]:
ln.File.select(key="sample_001/metrics_summary.csv").df()

You see that LaminDB treats directories the way S3 does: as a plain prefix in the storage `key`.
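
The "directory as prefix" idea can be sketched with plain strings. This is only an illustration: the helper `list_dir` is hypothetical, and apart from `sample_001/metrics_summary.csv` the keys below are made up.

```python
from typing import List

keys = [
    "mini.csv",
    "images/paradisi05_laminopathic_nuclei.jpg",
    "sample_001/metrics_summary.csv",
    "sample_001/example_output.html",  # made-up key for illustration
]


def list_dir(all_keys: List[str], prefix: str) -> List[str]:
    """Return the keys under a 'directory', which is just a shared key prefix."""
    prefix = prefix.rstrip("/") + "/"
    return [key for key in all_keys if key.startswith(prefix)]


in_sample = list_dir(keys, "sample_001")
print(in_sample)
```

No directory record exists anywhere in this model; the grouping follows entirely from how the keys are spelled, which is the same logic as the `key__startswith` query.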

If you want to flexibly group files, consider tags ({class}`~lamindb.Tag`).

## Tag files

Say we want to tag the files related to `sample_001`, independent of where they are in storage.

Let's create and save a tag:

In [None]:
tag = ln.Tag(name="Sample 0001")
tag.save()

Let's now label each file in `files` with this tag and save the update:

In [None]:
for file in files:
    file.tags.add(tag)
ln.save(files)

We can now query by this tag (and arbitrarily more):

In [None]:
ln.File.select(tags=tag).df()
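
Conceptually, tags form a many-to-many relation between files and labels that is independent of storage keys. A minimal sketch in plain Python, assuming that a query by several tags means "files carrying all of them"; the helpers `add_tag` and `keys_with_tags`, the second tag and the second key are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Set

tag_to_keys: Dict[str, Set[str]] = defaultdict(set)


def add_tag(tag: str, key: str) -> None:
    """Attach a tag to a file, identified here by its storage key."""
    tag_to_keys[tag].add(key)


def keys_with_tags(*tags: str) -> Set[str]:
    """Return the keys that carry every one of the given tags."""
    if not tags:
        return set()
    result = set(tag_to_keys.get(tags[0], set()))
    for tag in tags[1:]:
        result &= tag_to_keys.get(tag, set())
    return result


add_tag("Sample 0001", "sample_001/metrics_summary.csv")
add_tag("Sample 0001", "archive/raw_counts.h5")  # made-up key in another folder
add_tag("QC passed", "sample_001/metrics_summary.csv")  # made-up second tag

print(sorted(keys_with_tags("Sample 0001")))
print(sorted(keys_with_tags("Sample 0001", "QC passed")))
```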

## Create, update & delete validated metadata

To conclude this guide to basic file & metadata tracking, let's see how to create, update & delete records that store metadata for any entity.

### Create & save records

A single record:

In [None]:
project = ln.Project(name="Project A")

In [None]:
project.save()

Multiple records:

In [None]:
projects = [ln.Project(name=name) for name in ["Project B", "Project C", "Project D"]]

You see that for every record creation, a search checks whether a similar record already exists!

This is to avoid inserting duplicated records.

You can switch it off (for performance gains) via `ln.settings.upon_create_search_names = False`.
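
To give a feel for what such a similarity check involves, here is a sketch based on Python's standard `difflib` module. It is not LaminDB's search implementation; the helper `find_similar` and the cutoff of `0.8` are assumptions chosen for illustration.

```python
from difflib import get_close_matches
from typing import List


def find_similar(name: str, existing: List[str], cutoff: float = 0.8) -> List[str]:
    """Return existing names that look similar to a new name."""
    return get_close_matches(name, existing, n=3, cutoff=cutoff)


existing_names = ["Project A"]
similar = find_similar("Project B", existing_names)
unrelated = find_similar("Experiment", existing_names)
print(similar)
print(unrelated)
```

A check like this runs against all existing names on every creation, which is why switching it off can be faster for bulk inserts.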

In [None]:
ln.save(projects)

Similarly, if you try to create the same record again, the existing record is loaded instead of being re-created:

In [None]:
ln.Project(name="Project A")

You'll learn about more advanced data validation in further guides.

### Update records

In [None]:
project = ln.Project.select(name="Project A").first()

In [None]:
project

In [None]:
project.name = "Project 1"

In [None]:
project.save()

### Delete records

In [None]:
project = ln.Project.select(name="Project B").first()

In [None]:
project.delete()

In [None]:
# clean up what we wrote in this notebook
!lamin delete mydata
!rm -r mydata
!rm paradisi05_laminopathic_nuclei.jpg