![](https://img.shields.io/badge/tutorial1/2-lightgrey)
[![](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial.ipynb)
[![](https://img.shields.io/badge/laminlabs/lamindata-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?id=NJvdsWWbJlZSz8)

# Tutorial: Files & datasets

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

Biology is measured in samples that generate data batches, and you'll almost always start out with files.

LaminDB assists while you transform files into more useful representations: validated, queryable datasets or analytical insights.

The best way to build a mental map of the API is to embed it into an iterative data warehousing or learning process (graphic).

The tutorial has two parts, each a Jupyter notebook:

1. {doc}`/tutorial` - register & access
2. {doc}`/tutorial2` - validate & annotate



## Setup

1. Install the `lamindb` Python package:
    ```shell
    pip install 'lamindb[jupyter,aws]'
    ```
2. [Sign up](https://lamin.ai/signup) for a free account (see more [info](https://lamin.ai/docs/setup)) and copy the API key.
3. Log in on the command line:
    ```shell
    lamin login <email> --key <API-key>
    ```

You can now init a LaminDB instance with a directory `./lamin-tutorial` for storing data:

In [None]:
!lamin init --storage ./lamin-tutorial  # or "s3://my-bucket" or "gs://my-bucket"

:::{dropdown} What else can I configure during setup?

1. Instead of the default SQLite database, use PostgreSQL:
    ```shell
    --db postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>
    ```
2. Instead of a default instance name derived from storage, provide a custom name:
    ```shell
    --name myinstance
    ```
3. Beyond the core schema, use bionty and other schemas:
    ```shell
    --schema bionty,custom1,template1
    ```

For more, see {doc}`/setup`.

:::

## Track a data source

In [None]:
import lamindb as ln

If you're new to LaminDB, set {attr}`~lamindb.dev.Settings.verbosity` to the hint level:

In [None]:
ln.settings.verbosity = "hint"

The code that generates a batch of data is a transform ({class}`~lamindb.Transform`). It could be a pipeline, a notebook or an app upload.

Let's track the notebook that's being run:

In [None]:
ln.track()

Calling {func}`~lamindb.track` automatically links the notebook as the source of all data that's about to be saved!

:::{dropdown} What happened under the hood?

1. Imported package versions of the current notebook were detected
2. Notebook metadata was detected and stored in a {class}`~lamindb.Transform` record
3. Run metadata was detected and stored in a {class}`~lamindb.Run` record

The {class}`~lamindb.Transform` class registers data transformations: a notebook, a pipeline or a UI operation.

The {class}`~lamindb.Run` class registers executions of transforms. Several runs can be linked to the same transform if executed with different context (time, user, input data, etc.).

:::
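To make the relationship between transforms and runs concrete, here is a minimal sketch in plain Python. The `TransformSketch` and `RunSketch` dataclasses are hypothetical stand-ins invented for illustration; they are not LaminDB's classes and only mirror the idea that several runs can point to one transform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransformSketch:
    """Hypothetical stand-in for a transform: a notebook, pipeline or UI operation."""

    name: str
    version: str


@dataclass
class RunSketch:
    """Hypothetical stand-in for one execution of a transform."""

    transform: TransformSketch
    user: str
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


pipeline = TransformSketch(name="My pipeline", version="1.2.0")

# Two executions of the same transform in different contexts (here: different users).
runs = [
    RunSketch(pipeline, user="testuser1"),
    RunSketch(pipeline, user="testuser2"),
]

# Every run points back to the single transform record.
print(all(run.transform is pipeline for run in runs))
```

The transform describes *what* code exists, while each run records one concrete execution of it.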

:::{dropdown} How do I track a pipeline instead of a notebook?

```python
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
```

:::

:::{dropdown} Why should I care about tracking notebooks?

If you can, avoid interactive notebooks: anything that can be a deterministic pipeline should be a pipeline.

That said, much insight generated from biological data is driven by computational biologists _interacting_ with it.

A notebook that's run a single time on specific data is not a pipeline: it's a (versioned) **document** that produced insight or some other form of data representation (with parallels to an ELN in the wetlab).

Because humans are in the loop, most mistakes happen when using notebooks: {func}`~lamindb.track` helps avoid some of them.

(An early blog post on this is [here](https://lamin.ai/blog/2022/nbproject).)

:::

## Manage files

We'll work with a toy dataset of image files and transform it into higher-level features for downstream analysis.

(For other data types: see {doc}`docs:by-datatype`.)

Consider 3 directories storing images & metadata of Iris flowers, generated in 3 subsequent studies:

In [None]:
ln.File.view_tree("s3://lamindb-dev-datasets/iris_studies")

Our goal is to turn these files into a validated & queryable dataset that can be used alongside many other datasets.

### Register a file

LaminDB uses the {class}`~lamindb.File` class to model files with their metadata and access. It's a registry that manages search, queries, validation & access of files through metadata.

Let's create a {class}`~lamindb.File` record from one of the files:

In [None]:
file = ln.File("s3://lamindb-dev-datasets/iris_studies/study0_raw_images/meta.csv")

file

:::{dropdown} Which fields are populated when creating a File record?

Basic fields:

- `id`: a universal ID (serves as a primary key in the underlying SQL table of the instance)
- `key`: an optional storage key, i.e., the relative path of the file in `storage`
- `description`: an optional string description
- `storage`: the storage location (the root, say, an S3 bucket or network location)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: a hash useful to check for integrity and collisions (is this file already stored?)
- `hash_type`: the type of the hash (usually, an MD5 or SHA1 checksum)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related fields:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the file

For a full reference, see {class}`~lamindb.File`.

:::
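To get a feel for where fields like `suffix`, `size` and `hash` come from, the sketch below derives them for a small local file using only the Python standard library. It is an illustration under assumptions, not LaminDB's implementation: the CSV content is made up, and LaminDB's own hash encoding may differ from a plain MD5 hex digest.

```python
import hashlib
import tempfile
from pathlib import Path

# Made-up CSV content, written to a temporary directory to keep the example self-contained.
content = b"file_name,species\niris-0.jpg,setosa\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "meta.csv"
    path.write_bytes(content)

    # Derive descriptive fields from the file itself.
    fields = {
        "suffix": path.suffix,  # file suffix, including the dot
        "size": path.stat().st_size,  # file size in bytes
        "hash": hashlib.md5(path.read_bytes()).hexdigest(),  # content checksum
        "hash_type": "md5",
    }

print(fields)
```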

Upon `.save()`, file metadata is written to the database:

In [None]:
file.save()

:::{dropdown} What happens during save?

In the database: A file record is inserted into the `File` registry. If the file record exists already, it's updated.

In storage:
- If the default storage is in the cloud, `.save()` triggers an upload for a local file.
- If the file is already in a registered storage location, only the metadata of the record is saved to the `File` registry.

:::

The `meta.csv` file is now registered in the database:

In [None]:
ln.File.filter().df()

### View data flow

Because we called {func}`~lamindb.track`, we know that the file was saved in the current notebook ({meth}`~lamindb.dev.Data.view_flow`):

In [None]:
file.view_flow()

We can also directly access its linked {class}`~lamindb.Transform` & {class}`~lamindb.Run` records:

In [None]:
file.transform

In [None]:
file.run

(For a comprehensive example with data flow through app uploads, pipelines & notebooks of multiple data types, see {doc}`docs:project-flow`.)

### Access a file

{attr}`~lamindb.File.path` gives you the filepath:

In [None]:
file.path

To download the file to a local cache, call {meth}`~lamindb.File.stage`:

In [None]:
file.stage()

To load a file into memory with a default loader, call {meth}`~lamindb.File.load`: 

In [None]:
df = file.load(index_col=0)  # calls `pd.read_csv` and passes `index_col=0` to it

df.head()

If the file is large, you'll likely want to query it via {meth}`~lamindb.File.backed`. For more on this, see: {doc}`data`.

:::{dropdown} How do I update a file?

If you'd like to replace the underlying stored object, use {meth}`~lamindb.File.replace`.

If you'd like to update metadata:
```python
file.description = "My new description"
file.save()  # save the change to the database
```

:::


### Register directories

With {meth}`~lamindb.File.from_dir` we now register the entire directory of the first study:

In [None]:
files = ln.File.from_dir("s3://lamindb-dev-datasets/iris_studies/study0_raw_images")

(We see that we already registered one of the files. Instead of creating a new file record, the existing one is returned: see [idempotency](/faq/idempotency)).
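The idea behind this behavior can be sketched in plain Python: if a registry is keyed by content hash, registering the same content twice returns the existing record instead of creating a duplicate. The `register` function and the dictionary-based `registry` are hypothetical and only illustrate the principle; LaminDB's actual logic lives in its database layer and may consider more than the hash.

```python
import hashlib

registry = {}  # maps content hash -> record


def register(key: str, content: bytes) -> tuple[dict, bool]:
    """Return (record, created); reuse the existing record if the content is known."""
    content_hash = hashlib.md5(content).hexdigest()
    if content_hash in registry:
        return registry[content_hash], False
    record = {"key": key, "hash": content_hash}
    registry[content_hash] = record
    return record, True


# Registering identical content twice yields the same record.
first, created_first = register("study0_raw_images/meta.csv", b"a,b\n1,2\n")
second, created_second = register("study0_raw_images/meta.csv", b"a,b\n1,2\n")

print(created_first, created_second, first is second)
```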

Let's only register the first 5 records to keep things simple:

In [None]:
files_subset = files[:5]
ln.save(files_subset)

### Query & search files

You can search files directly based on the {class}`~lamindb.File` registry:

In [None]:
ln.File.search("meta").head()

You can also query & search files by any metadata combination.

For instance, look up a user with auto-complete from the {class}`~lamindb.User` registry:

In [None]:
users = ln.User.lookup()
users.testuser1

Filter the {class}`~lamindb.Transform` registry for a name:

In [None]:
transform = ln.Transform.filter(
    name__icontains="files & datasets"
).one()  # get exactly one result
transform

:::{dropdown} What does a double underscore mean?

For any field, the double underscore defines a comparator, e.g.,

* `name__icontains="Martha"`: `name` contains `"Martha"` when ignoring case
* `name__startswith="Martha"`: `name` starts with `"Martha"`
* `name__in=["Martha", "John"]`: `name` is `"John"` or `"Martha"`

For more info, see: {doc}`meta`.

:::
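If the comparator syntax is new to you, the following plain-Python sketch mimics the three comparators above on a list of dictionaries. The `matches` helper and the sample records are hypothetical; real queries are translated to SQL by the registry rather than evaluated in Python like this.

```python
def matches(record: dict, **conditions) -> bool:
    """Check a record against `field__comparator=value` style conditions."""
    for lookup, expected in conditions.items():
        # Split "name__icontains" into the field name and the comparator.
        field_name, _, comparator = lookup.partition("__")
        value = record[field_name]
        if comparator == "":
            ok = value == expected
        elif comparator == "icontains":
            ok = expected.lower() in value.lower()
        elif comparator == "startswith":
            ok = value.startswith(expected)
        elif comparator == "in":
            ok = value in expected
        else:
            raise ValueError(f"unsupported comparator: {comparator}")
        if not ok:
            return False
    return True


records = [{"name": "Martha"}, {"name": "martha jones"}, {"name": "John"}]

icontains = [r["name"] for r in records if matches(r, name__icontains="Martha")]
startswith = [r["name"] for r in records if matches(r, name__startswith="Martha")]
is_in = [r["name"] for r in records if matches(r, name__in=["Martha", "John"])]

print(icontains)   # ['Martha', 'martha jones']
print(startswith)  # ['Martha']
print(is_in)       # ['Martha', 'John']
```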

Use these results to filter the {class}`~lamindb.File` registry:

In [None]:
ln.File.filter(
    created_by=users.testuser1,
    transform=transform,
    suffix=".jpg",
).df().head()

You can also query for directories using `key__startswith` (LaminDB treats directories like AWS S3, as the prefix of the storage `key`): 

In [None]:
ln.File.filter(key__startswith="iris_studies/study0_raw_images/").df().head()
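Because a directory is just a key prefix, the trailing slash in the query above matters. The sketch below shows this with plain string matching; the keys and the `filter_by_prefix` helper are made up for illustration.

```python
# Hypothetical storage keys; only the prefix logic matters here.
keys = [
    "iris_studies/study0_raw_images/meta.csv",
    "iris_studies/study0_raw_images/iris-0.jpg",
    "iris_studies/study0_raw_images_extra/iris-1.jpg",
    "iris_studies/study1_raw_images/meta.csv",
]


def filter_by_prefix(keys: list, prefix: str) -> list:
    """Return all keys that start with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


# With a trailing slash, only the contents of that one directory match.
with_slash = filter_by_prefix(keys, "iris_studies/study0_raw_images/")

# Without it, a sibling directory sharing the same name stem matches as well.
without_slash = filter_by_prefix(keys, "iris_studies/study0_raw_images")

print(len(with_slash), len(without_slash))  # 2 3
```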

```{note}

You can look up, filter & search any registry ({class}`~lamindb.dev.Registry`).

You can chain {meth}`~lamindb.dev.Registry.filter` statements and {meth}`~lamindb.dev.QuerySet.search`: `ln.File.filter(suffix=".jpg").search("my image")`

An empty filter returns the entire registry: `ln.File.filter()`
```

For more info, see: {doc}`meta`.

## Describe files

Get an overview of what happened:

In [None]:
file.describe()

In [None]:
file.view_flow()

## Version files

If you'd like to version a file or transform, either provide the `version` parameter when creating it or create new versions through `is_new_version_of`.

For instance:
```python
new_file = ln.File(data, is_new_version_of=old_file)
```

Are there remaining questions about storing files? If so, see: {doc}`docs:faq/storage`.

## Create a dataset

The registered image files, together with their metadata annotations, form a dataset. Let's track it as such:

In [None]:
dataset = ln.Dataset(
    files_subset, name="Iris study 1", description="50 image files and metadata"
)

In [None]:
dataset.save()

Most functionality that you just learned about files - e.g., queries & provenance - also applies to {class}`~lamindb.Dataset`.

The important difference is that a `Dataset` does not have a `key` field: it's an abstraction over storing data in one or several files or other storage backends.

We'll learn more about datasets in the next part of the tutorial.

## View changes

With {func}`~lamindb.view`, you can see the latest changes to the database:

In [None]:
ln.view()  # link tables in the database are not shown

## Save notebook

When you've completed the work on the notebook, you can save an execution report and notebook source code in your storage location like so:

lamin save <notebook_path>  # e.g., tutorial.ipynb

This will enable you to query the report and source code via `transform.latest_report` and `transform.source_file` and see it in the hub, e.g., [here](https://lamin.ai/laminlabs/lamindata/record/core/Transform?id=NJvdsWWbJlZSz8).

## Read on

Now, you already know about 6 out of 10 LaminDB core classes! The two most central are:

- {class}`~lamindb.File`: data batches
- {class}`~lamindb.Dataset`: collections of data batches

And the four registries related to provenance:

- {class}`~lamindb.Transform`: transforms of files & datasets
- {class}`~lamindb.Run`: runs of transforms
- {class}`~lamindb.User`: users
- {class}`~lamindb.Storage`: storage locations like S3/GCP buckets or local directories

If you want to validate data, label files & datasets and manage features, read on: {doc}`/tutorial2`.