# Manage files & datasets

Turn messy files into validated, queryable datasets.

```{note}

- This tutorial is available as a [Jupyter notebook](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial1.ipynb).
- This tutorial manages metadata without a schema, but `lamindb` also gives you a framework for managing complex, typed metadata with a schema.

```

## Set up an instance

[Installation and sign-up](./guide.md#setup) are quick: run `pip install lamindb` and `lamin signup <email>` on the command line.

Using the CLI, create a LaminDB instance with a directory `./mydata` for storing data and a SQLite database for managing it:


In [None]:
!lamin init --storage ./mydata  # or "s3://my-bucket" or "gs://my-bucket"

:::{dropdown} What else can I configure during setup?

1. Instead of SQLite, use PostgreSQL: `--db postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>`
2. Beyond the core schema, use bionty and other schemas: `--schema bionty,custom1,template1`
3. Instead of an instance name that's derived from default storage, provide a custom name: `--name myinstance`

For more, see {doc}`./guide/setup`.

:::

## Track a data source

In [None]:
import lamindb as ln

Knowing where a batch of data comes from helps you find & understand it.

Let's call the code that generated it a transform. This code can be a data pipeline, a notebook or an app upload.

With {class}`~lamindb.Transform`, LaminDB maintains a registry of transforms and makes it easy to link data against them.

Here, we're running a Jupyter notebook. Let's track it:

In [None]:
ln.track()

Calling {func}`~lamindb.track` automatically links the notebook as the source of all data that's about to be saved.

:::{dropdown} What happened under the hood?

1. Imported package versions were detected
2. Notebook metadata was detected and stored in a {class}`~lamindb.Transform` record (title, filename, version, timestamp, creator)
3. A {class}`~lamindb.Run` record was created (timestamp, transform, creator)

:::
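
To make the relationship between the two record types more tangible, here is a minimal sketch in plain Python. `TransformRecord` and `RunRecord` are hypothetical stand-ins that mirror only the fields listed above; they are not LaminDB's classes, and the version string is an illustrative assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> datetime:
    """Timezone-aware timestamp used as the default creation time."""
    return datetime.now(timezone.utc)


@dataclass
class TransformRecord:
    """Hypothetical stand-in for a transform (notebook, pipeline, app)."""

    name: str
    version: str
    created_by: str
    created_at: datetime = field(default_factory=_now)


@dataclass
class RunRecord:
    """Hypothetical stand-in for one execution of a transform."""

    transform: TransformRecord
    created_by: str
    created_at: datetime = field(default_factory=_now)


# One transform (this notebook) can be executed many times.
notebook = TransformRecord(
    name="Manage files & datasets",
    version="0",  # illustrative value, not taken from the tutorial
    created_by="testuser1",
)
run1 = RunRecord(transform=notebook, created_by="testuser1")
run2 = RunRecord(transform=notebook, created_by="testuser1")

# Both runs point to the same transform record.
print(run1.transform is run2.transform)
```

The point of the split is that data saved during a run can be traced both to the specific execution and to the code that was executed.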

:::{dropdown} How do I track a pipeline instead of a notebook?

```python
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
```

:::

:::{dropdown} Why do we care about tracking notebooks?

Most people advocate for "not using notebooks in production" or similar. And we agree! Anything that can be a pipeline should be a pipeline.

But we also think that a lot of the downstream insight & value generated from biological data is driven by computational biologists interacting with it.

And we think this is very much akin to the prose-heavy design of biological experiments documented in an ELN.

A notebook that's run a single time on specific data batches is not a pipeline, it's a _document_ that produced an insight or some other form of data representation.

Unfortunately, most mistakes happen when using notebooks. `ln.track()` helps you avoid some of them.

An early blog post on this is [here](https://lamin.ai/blog/2022/nbproject).

:::

## Manage files

We'll need some dummy data:

In [None]:
# a file "./mydata/mini.csv" in default storage
ln.dev.datasets.file_mini_csv(in_storage_root=True)
# a directory "./mydata/sample_001" in default storage
ln.dev.datasets.dir_scrnaseq_cellranger("sample_001", ln.settings.storage)
# a file "paradisi05_laminopathic_nuclei.jpg" in the working directory
ln.dev.datasets.file_jpg_paradisi05().resolve()

### Register a file

There is already a file in default storage: `./mydata/mini.csv`

Let's create a {class}`~lamindb.File` object:

In [None]:
file = ln.File("./mydata/mini.csv")  # or "s3://my-bucket/my-folder/my-file.csv"

:::{dropdown} What is the File class?

It manages file metadata to enable search & queries of the file, validation of file content, and different ways of access.

Basic fields are:

- `id`: a universal ID (also serves as a primary key in the underlying SQL table of the instance)
- `key`: an optional storage key, i.e., the relative path of the file in `storage`
- `description`: an optional string description
- `storage`: the storage location (the root, say, an S3 bucket or network location)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: a hash useful to check for integrity and collisions (is this file already stored?)
- `hash_type`: the type of the hash (usually, an MD5 or SHA1 checksum)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related fields are:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the file

Access the path through a property:

- `path`: the file path

Access the underlying data:

- `load()`: load the file into memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `backed()`: stream/query the file from the cloud
- `stage()`: get a local path to a cached copy of the file
- `replace()`: replace the content of the file

For a full reference, see {class}`~lamindb.File`.

:::
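
To illustrate what the `hash` and `hash_type` fields are useful for, here is a minimal sketch of a content checksum that flags a file as already stored. It uses only Python's standard library. The helper `md5_of_file` and the in-memory `known_hashes` dictionary are hypothetical and do not reflect how LaminDB computes or stores hashes internally.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "mini.csv"
    duplicate = Path(tmp) / "mini_copy.csv"
    original.write_text("a,b\n1,2\n")
    duplicate.write_text("a,b\n1,2\n")  # same content, different name

    # Hypothetical registry: maps content hash -> name of the stored file.
    known_hashes = {md5_of_file(original): original.name}

    # Identical content yields the same hash regardless of the file name.
    is_duplicate = md5_of_file(duplicate) in known_hashes

print(is_duplicate)
```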

Saving the file writes its metadata to the database and its data to storage:

In [None]:
file.save()  # as the file is already in a registered storage location, only metadata is written

Because we called {func}`~lamindb.track`, we know where the file came from ({class}`~lamindb.Transform` & {class}`~lamindb.Run`):

In [None]:
file.transform

In [None]:
file.run

### Add a new file

Here's a local file that's not yet in a registered storage location: `./paradisi05_laminopathic_nuclei.jpg`

Upon `.save()` it will be copied (or uploaded) to default storage (here, `./mydata`):

In [None]:
file = ln.File(
    "paradisi05_laminopathic_nuclei.jpg",
    description="paradisi05 laminopathic nuclei image",
)  # or with key="images/paradisi05_laminopathic_nuclei.jpg"

In [None]:
file.save()

Storage now contains two files: one with a human-readable key (`'mini.csv'`) and one that LaminDB auto-keyed with its file id:

In [None]:
ln.File.tree()

The database stores context for both files:

In [None]:
ln.view()  # link tables in the database are not shown

### Access a file

{attr}`~lamindb.File.path` gives you a filepath:

In [None]:
file.path  # or "s3://my-bucket/my-folder/my-file.jpg"

If the file is in the cloud, you typically stage a cached file ({meth}`~lamindb.File.stage`) or stream its data ({meth}`~lamindb.File.backed`).

### Search or query the file

You can search for the file based on the fields in the {class}`~lamindb.File` registry:

In [None]:
ln.File.search("paradisi")

Alternatively, query the file by any metadata combination:

In [None]:
users = ln.User.lookup()  # look up users with auto-complete
transform = ln.Transform.filter(
    name__contains="files & datasets"
).one()  # query transforms, expect *exactly* one result

ln.File.filter(
    suffix=".jpg",
    created_by=users.testuser1,
    transform=transform,
).df()

You can also chain `.filter()` and `.search()` statements, e.g. `ln.File.filter(suffix=".jpg").search("my image")`.

An empty filter returns the entire registry:

In [None]:
ln.File.filter().df()

For more info, see: {doc}`guide/select`.

## Data lineage

In [None]:
file.view_lineage()

For a comprehensive example with handovers from app uploads, pipelines & notebooks of multiple data types, see {doc}`guide/data-lineage`.

## Manage features & labels

:::{dropdown} Why care about features & labels?

1. Finding data: Which datasets measured expression of cell marker CD14? Which characterized cell line K562? Which datasets have a test & train split? Etc.
2. Validating data: Are there typos in feature names? Are there typos in sampled labels? Are units of features consistent? Etc.

:::

:::{dropdown} A perspective on contextualizing data objects

We love the pydata family of data objects: `DataFrame`, `AnnData`, `pytorch.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, ...

But we couldn’t find an object for linking data objects to context!

So, we made `lamindb.File` and `lamindb.Dataset` to model how data objects relate to their context:

- other data objects, data transformations, models, users & pipelines that performed transformations (provenance)
- any entity of the domain in which data is generated and modeled (biology)

:::

Consider a batch of the Iris flower dataset (a `DataFrame`):

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch1()

df.head()

### Validate & link features

Let's use {meth}`~lamindb.File.from_df` to track this DataFrame along with its columns as features:

In [None]:
file = ln.File.from_df(df, description="Iris flower dataset batch 1")

Features couldn't be validated and are ignored because this is an empty LaminDB instance without a single registered feature.

But all features here are meaningful and well-curated, so let's create records for them:

In [None]:
features = ln.Feature.from_df(df)

features

As soon as we save them, they'll serve as the reference for validating future data batches.

In [None]:
ln.save(features)

:::{dropdown} How to track units of features?

Use the {attr}`~lamindb.Feature.unit` attribute. In the above example, you'd do:

```python
for feature in features:
    if feature.type == "float":
        feature.unit = "m"  # SI unit for meters
        feature.save()
```

:::

If we create the `File` now, we'll see that features are validated based on the registry content:

In [None]:
file = ln.File.from_df(df, description="Iris flower dataset batch 1")
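
Conceptually, this validation compares the column names of the `DataFrame` against the names already present in the feature registry. The sketch below illustrates the idea in plain Python. `validate_columns` and `registered_features` are hypothetical stand-ins rather than LaminDB's API, and apart from `iris_species_name` the column names are made up for illustration.

```python
def validate_columns(columns, registered_features):
    """Split column names into validated and unvalidated ones (hypothetical helper)."""
    validated = [c for c in columns if c in registered_features]
    unvalidated = [c for c in columns if c not in registered_features]
    return validated, unvalidated


# Illustrative column names; only "iris_species_name" appears in this tutorial.
columns = ["sepal_length", "sepal_width", "iris_species_name"]

# With an empty registry, no column can be validated.
validated_before, unvalidated_before = validate_columns(columns, set())

# After registering all columns as features, every column validates.
registered_features = set(columns)
validated_after, unvalidated_after = validate_columns(columns, registered_features)

print(validated_before, unvalidated_before)
print(validated_after, unvalidated_after)
```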

Let's register the file along with its linked features.

In [None]:
file.save()

Get an overview of linked feature sets:

In [None]:
file.features

A `slot` provides a string key to access feature sets. It's typically the accessor of feature identifiers in the data object we're validating & registering (here, a `DataFrame`).

Let's use it to access all linked features:

In [None]:
file.features["columns"].df()

### Validate & link labels

The Iris dataset comes with labels within the data object.

In [None]:
species_labels = ln.Label.from_values(df["iris_species_name"])

species_labels

Let's save them to the {class}`~lamindb.Label` registry so that they get validated going forward:

In [None]:
ln.save(species_labels)

And annotate the file with the labels for feature `iris_species_name`:

In [None]:
file.add_labels(species_labels)

Now we can get linked labels from a feature:

In [None]:
file.get_labels("iris_species_name").df()

We can now query for files by whether `"setosa"` is linked to them:

In [None]:
ln.File.filter(labels__name="setosa").df()

In addition to features present _within_ a data object like a `DataFrame`, a file can be labeled with external metadata.

Let's label this file with `"experiment_1"`!

In [None]:
experiment1 = ln.Label(name="experiment_1")
experiment1.save()
experiment1

:::{dropdown} Why label a data batch by experiment?

We can then

1. query all files linked to this experiment
2. model it as a confounder when we analyze similar data from a follow-up experiment, and concatenate data using the label as a feature in a data matrix

:::

Let's also register a feature that holds experiment labels in concatenated datasets:

In [None]:
ln.Feature(name="experiment", type="category").save()

In [None]:
file.add_labels(experiment1, feature="experiment")

We now have the original feature set and the external feature set:

In [None]:
file.features

This is the context for our file:

In [None]:
file.describe()

See the database content:

In [None]:
ln.view(registries=["Feature", "FeatureSet", "Label", "Modality"])

## Manage datasets

In simple cases, as we just saw, we can use files to store datasets.

In more complex cases, we'd like to store collections of files, data in mutable storage backends (zarr, TileDB, DuckDB, etc.) or in SQL tables in BigQuery, Snowflake, or Postgres.

Hence, we need a second central class for data storage: {class}`~lamindb.Dataset`.

Let's say we have a second batch of the Iris dataset:

In [None]:
df = ln.dev.datasets.df_iris_in_meter_batch2()
ln.File.from_df(df, description="Iris flower dataset batch 2").save()

Now query both files that store these batches:

In [None]:
file1 = ln.File.filter(description="Iris flower dataset batch 1").one()
file2 = ln.File.filter(description="Iris flower dataset batch 2").one()

We can now create a sharded dataset from these two batches:

In [None]:
dataset = ln.Dataset.from_files(name="The combined Iris dataset", files=[file1, file2])

In [None]:
dataset.save()

You can load the sharded dataset as if it were a single dataset:

In [None]:
dataset.load().tail()
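
One way to picture a sharded dataset is as an ordered list of batches that get concatenated when loaded. The following is a minimal sketch of that idea using lists of row dictionaries and the standard library. `load_shards` and the sample rows are hypothetical and not how LaminDB loads data.

```python
from itertools import chain

# Two illustrative batches, each a list of rows.
batch1 = [
    {"iris_species_name": "setosa", "batch": 1},
    {"iris_species_name": "versicolor", "batch": 1},
]
batch2 = [
    {"iris_species_name": "setosa", "batch": 2},
    {"iris_species_name": "virginica", "batch": 2},
]


def load_shards(shards):
    """Concatenate shards in order into one list of rows (hypothetical helper)."""
    return list(chain.from_iterable(shards))


combined = load_shards([batch1, batch2])

print(len(combined))
print(combined[-2:])  # the last rows come from the second batch, like `.tail()`
```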

Access the underlying two file objects:

In [None]:
dataset.files.list()

In [None]:
dataset.files.list()[0].view_lineage()

Or see the registries:

In [None]:
ln.view(registries=["Dataset", "File"])

For a more interesting data lineage graph, let's pretend we're now running a pipeline:

In [None]:
pipeline = ln.Transform(name="Iris Postprocessor", version="0.7.2")
ln.track(pipeline)  # create & track a pipeline
input_files = ln.File.filter(transform__name__contains="files & datasets").all()
[file.stage() for file in input_files]  # stage the input files locally

## Manage directories

Use {meth}`~lamindb.File.from_dir` to create files from a directory:

In [None]:
files = ln.File.from_dir("./mydata/sample_001/")

Let's save them:

In [None]:
ln.save(files)

View as a tree:

In [None]:
ln.File.tree("./mydata/sample_001")

Or as a query:

In [None]:
ln.File.filter(key__startswith="sample_001/").df().head()

```{note}

LaminDB treats directories the way AWS S3 does: as a prefix in the storage `key`, queryable with `key__startswith`.

```
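
To show why a key prefix is enough to model a directory, here is a minimal sketch in plain Python. The keys and the helper `keys_in_directory` are hypothetical. The sketch assumes nothing about LaminDB beyond the prefix convention described in the note.

```python
# Illustrative storage keys; the file names are made up.
keys = [
    "mini.csv",
    "sample_001/web_summary.html",
    "sample_001/raw_feature_bc_matrix.h5",
    "sample_0010/other.txt",
]


def keys_in_directory(keys, directory):
    """Return all keys under a directory, treating it as a key prefix (hypothetical helper)."""
    # The trailing slash prevents "sample_001" from also matching "sample_0010/...".
    prefix = directory.rstrip("/") + "/"
    return [key for key in keys if key.startswith(prefix)]


print(keys_in_directory(keys, "sample_001"))
```

Including the trailing slash in the prefix, as in the `key__startswith="sample_001/"` query, avoids accidentally matching sibling directories whose names share the same beginning.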

Here's a summary for all files ingested from this notebook (interactive exploration will be possible in the app):

In [None]:
files[0].view_lineage()

## Manage metadata

To conclude this guide to basic file & metadata tracking, let's see how to update registry records.

### Metadata ontologies

Say we want to express that `experiment_1` belongs to `project_1`. This is how we can do it:

In [None]:
project1 = ln.Label(name="project_1")
project1.save()
experiment1.parents.add(project1)
experiment1.view_parents()

For more info, see {meth}`~lamindb.dev.Registry.view_parents`.

### Validate records upon creation

We already created a `project_1` label. Let's see what happens if we try to create it again:

In [None]:
label = ln.Label(name="project_1")

label.save()

Instead of creating a new record, LaminDB will load and return the existing record from the database.

If there is no exact match, LaminDB warns you about potential duplicates when you create a record.

Say we spell "project_1" with a space instead of an underscore:

In [None]:
ln.Label(name="project 1")

As you can see, every record creation triggers a search for similar records that already exist.

This helps avoid inserting duplicate records.

You can switch it off (for performance gains) via `ln.settings.upon_create_search_names = False`.
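
To give a feel for this kind of check, here is a minimal sketch of a similarity search over existing names using `difflib` from Python's standard library. This is an assumption-laden illustration: LaminDB's actual search may work differently, and `find_similar`, `existing_names`, and the cutoff value are hypothetical.

```python
import difflib

# Hypothetical registry content.
existing_names = ["project_1", "experiment_1", "setosa"]


def find_similar(name, existing, cutoff=0.8):
    """Return existing names that closely resemble `name` (hypothetical helper)."""
    return difflib.get_close_matches(name, existing, n=3, cutoff=cutoff)


new_name = "project 1"
is_exact_match = new_name in existing_names
similar = find_similar(new_name, existing_names)

if not is_exact_match and similar:
    print(f"'{new_name}' is not registered, but similar names exist: {similar}")
```

Running such a comparison on every record creation costs time, which is why a switch to turn it off can be useful for bulk inserts.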

### Update records

In [None]:
label = ln.Label.filter(name="project_1").first()

In [None]:
label

In [None]:
label.name = "project_1a"

In [None]:
label.save()

In [None]:
label

### Delete records

Delete records like so:

In [None]:
label.delete()

## Default storage

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"` and see all storage locations via:

In [None]:
ln.Storage.filter().df()

## Verbosity

In [None]:
ln.settings.verbosity = 3  # only show info, no hints

In [None]:
# clean up what we wrote in this notebook
!lamin delete mydata
!rm -r mydata
!rm paradisi05_laminopathic_nuclei.jpg