# Track files & records

This first guide walks through tracking files together with the kind of basic metadata you'll find in many tools.

But unlike other tools, `lamindb` gives you a SQL-based framework for linking complex metadata related to [data lineage](./data-lineage), [biology](./registries) or any [custom schema](https://github.com/laminlabs/lnschema-lamin1).

When you first import `lamindb`, it'll show you a warning:

In [None]:
import lamindb as ln

## Setup

Let's create a LaminDB instance with a directory `./mydata` for storing files and a SQLite database for managing metadata:


In [None]:
ln.setup.init(storage="./mydata")  # or "s3://my-bucket"

(You can think of initializing an instance as similar to initializing a Git repository.)

The default storage location is:

In [None]:
ln.settings.storage  # your "working data directory"

You can change it by setting `ln.settings.storage = "s3://my-bucket"`.

And you can see all storage locations by querying {class}`~lamindb.Storage`:

In [None]:
ln.Storage.select().df()  # more on such calls later!

## Files

In [None]:
ln.settings.verbosity = 3  # show hints

### Track an existing file

In [None]:
# put a file "mini.csv" into our default storage
filepath = ln.dev.datasets.file_mini_csv()
filepath.rename(ln.setup.settings.storage.root / filepath.name)

We have an existing file in our storage location: `./mydata/mini.csv`

Create a {class}`~lamindb.File` object from the path:

In [None]:
file = ln.File("./mydata/mini.csv")

:::{dropdown} Quick overview

Basic file metadata is:

- `id`: a universally unique persistent ID that also serves as a primary key in the SQL table
- `name`: a name (e.g., the original file name)
- `key`: the storage key, i.e., the relative path of the file in the storage location
- `storage`: the storage location (the root, say, an S3 bucket)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: an MD5 checksum useful to check for integrity and collisions (is this file already stored?)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related metadata is:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the file

Managing the underlying data:

- `load()`: load the file to memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `path()`: the path (cloud or local)
- `stage()`: a local path to a cached object
- `replace()`: replace the content of the file

For a full reference, see {class}`~lamindb.File`.

:::
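To make the basic fields in the overview concrete, here is a minimal sketch that derives a few of them with the Python standard library only. It is not how `lamindb` computes these values internally; the helper `describe_file` and its dictionary layout are hypothetical and exist only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path: Path, storage_root: Path) -> dict:
    """Collect basic metadata for a file located under storage_root (hypothetical helper)."""
    data = path.read_bytes()
    return {
        "name": path.name,  # e.g., the original file name
        "key": path.relative_to(storage_root).as_posix(),  # path relative to the storage root
        "suffix": path.suffix,
        "size": len(data),  # size in bytes
        "hash": hashlib.md5(data).hexdigest(),  # MD5 checksum of the content
    }


# Demo on a throwaway directory that stands in for a storage location
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = root / "mini.csv"
    sample.write_bytes(b"a,b\n1,2\n")
    meta = describe_file(sample, root)

print(meta)
```

Two files with identical content produce the same `hash`, which is what makes a checksum useful for detecting that a file is already stored.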

By saving a file object, metadata & data are saved to database & storage in a single [ACID](/faq/acid) transaction:

In [None]:
file.save()  # as the file is already in the desired storage location, only metadata is written
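The all-or-nothing idea behind such a save can be sketched with `sqlite3` and a local directory. This is only an illustration of atomicity under simplifying assumptions, not `lamindb`'s implementation; the `file` table, the `save_file` helper, and the file names are hypothetical.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def save_file(conn: sqlite3.Connection, source: Path, storage_root: Path, key: str) -> None:
    """Record `key` in the database and copy `source` into storage, all or nothing."""
    target = storage_root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    copy_started = False
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("INSERT INTO file (key) VALUES (?)", (key,))
            copy_started = True
            shutil.copy(source, target)
    except Exception:
        if copy_started:
            target.unlink(missing_ok=True)  # discard a partial copy, if any
        raise


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "mydata"
    root.mkdir()
    source = Path(tmp) / "mini.csv"
    source.write_text("a,b\n1,2\n")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file (key TEXT PRIMARY KEY)")

    save_file(conn, source, root, "tables/mini.csv")  # succeeds
    try:
        save_file(conn, Path(tmp) / "missing.csv", root, "tables/missing.csv")
    except FileNotFoundError:
        pass  # the copy failed, so the INSERT was rolled back

    saved_keys = [key for (key,) in conn.execute("SELECT key FROM file")]
    stored = (root / "tables/mini.csv").exists()
    conn.close()

print(saved_keys, stored)
```

After the failed second call, the database holds a record only for the file that actually reached storage.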

### Add a new file

In [None]:
filepath = ln.dev.datasets.file_jpg_paradisi05().resolve()

Here's a local file that's not yet in LaminDB storage:

In [None]:
filepath

To specify the target path for storing the file, pass the `key` argument:

In [None]:
file = ln.File(filepath, key="images/paradisi05_laminopathic_nuclei.jpg")

In [None]:
file.save()

Looking into our default storage, we see:

In [None]:
ln.File.tree()

Your files also appear in the SQL database, together with entries for storage and users (and, later in this guide, many other entities):

In [None]:
ln.view()

### Access a file

There are several ways of accessing a file.

For instance, `.stage()` returns a local filepath (it will cache a cloud object):

In [None]:
file.stage()

To get the full path within the storage location (say, in an S3 bucket), call `.path()`.

### Query a file

You can query the file by its metadata. The simplest way is by key:

In [None]:
file = ln.File.select(key="images/paradisi05_laminopathic_nuclei.jpg").one()

file

## In-memory objects

A `File` object can also be created from an in-memory object.

Under the hood, the object is serialized into a configurable storage format (e.g., `DataFrame` → `.parquet`, `AnnData` → `.h5ad`/`.zrad`, ...).

In [None]:
df = ln.dev.datasets.df_iris()

In [None]:
df.head()

In [None]:
file = ln.File(df, name="Anderson's Iris flower dataset")

In [None]:
file.save()

Because we didn't pass the `key` argument here, the data was stored under a key based on the `id`.

In [None]:
ln.File.tree()

Get the dataframe back:

In [None]:
file.load().head()

Or stage the underlying parquet file:

In [None]:
file.stage()

## Data objects in context

We have come to love the pydata family of data objects (`DataFrame`, `torch.utils.data.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, `AnnData`, and others).

But we couldn’t find an object for linking data objects to context and storing it at arbitrary scale.

So, we made `lamindb.File` to help with modeling data objects in relation to their context.

Context can be other data objects, data transformations, ML models, users & pipelines that performed transformations (all aspects of data lineage).

Context can also be any entity of the domain in which data is generated and modeled.

We focused on linking `File` to data lineage & biological concepts. You'll learn about them further down the guide.

## Directories

In [None]:
# generate some files in default storage
ln.dev.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)

We can pass an existing directory to {meth}`~lamindb.File.from_dir`:

In [None]:
files = ln.File.from_dir("./mydata/sample_001/")

In [None]:
print(files[:2])

In [None]:
ln.save(files)

View the files as a tree:

In [None]:
ln.File.tree()  # to subset, call ln.File.tree("sample_001")

Under the hood, the following records got written:

In [None]:
ln.File.select(key__startswith="sample_001/").df().head()

Query a specific file by passing the full key to `ln.File.select`:

In [None]:
ln.File.select(key="sample_001/metrics_summary.csv").df()

As you can see, LaminDB treats directories the way S3 does: as a plain prefix in the storage `key`.
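A minimal sketch of what "directory as prefix" means, in plain Python: there is no directory record, only keys that share a leading string. The key list is illustrative; apart from `sample_001/metrics_summary.csv`, the `sample_001`-like keys are made up, and `list_dir` is a hypothetical helper.

```python
keys = [
    "mini.csv",
    "images/paradisi05_laminopathic_nuclei.jpg",
    "sample_001/metrics_summary.csv",
    "sample_001/web_summary.html",  # made-up key for illustration
    "sample_001_backup/notes.txt",  # made-up key: shares the text, not the directory
]


def list_dir(keys: list[str], prefix: str) -> list[str]:
    """Return the keys 'inside' a directory, i.e., those starting with 'prefix/'."""
    prefix = prefix.rstrip("/") + "/"  # the trailing slash excludes sample_001_backup/
    return [key for key in keys if key.startswith(prefix)]


sample_keys = list_dir(keys, "sample_001")
print(sample_keys)
```

The trailing slash matters: without it, a prefix match on `sample_001` would also pick up keys under `sample_001_backup/`.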

If you want to flexibly group files, consider tags ({class}`~lamindb.Tag`).

## Tag files

Say we want to tag the files related to `sample_001`, independent of where they are in storage.

Let's create and save a tag:

In [None]:
tag = ln.Tag(name="Sample 0001")
tag.save()

Let's now label each file in `files` with this tag and save the update:

In [None]:
for file in files:
    file.tags.add(tag)
ln.save(files)

We can now query by this tag (and by arbitrarily many more):

In [None]:
ln.File.select(tags=tag).df()
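Conceptually, a tag query of this kind corresponds to a join through a many-to-many link table. The sketch below shows that idea with `sqlite3`; the table and column names (`file`, `tag`, `file_tag`) and the third file key are assumptions for illustration and do not claim to match the actual LaminDB schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE file (id INTEGER PRIMARY KEY, key TEXT UNIQUE);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- link table: one row per (file, tag) pair
    CREATE TABLE file_tag (
        file_id INTEGER REFERENCES file (id),
        tag_id INTEGER REFERENCES tag (id),
        PRIMARY KEY (file_id, tag_id)
    );
    """
)
conn.executemany(
    "INSERT INTO file (id, key) VALUES (?, ?)",
    [
        (1, "mini.csv"),
        (2, "sample_001/metrics_summary.csv"),
        (3, "archive/sample_001_metrics.csv"),  # made-up key in a different "directory"
    ],
)
conn.execute("INSERT INTO tag (id, name) VALUES (1, 'Sample 0001')")
conn.executemany("INSERT INTO file_tag (file_id, tag_id) VALUES (?, 1)", [(2,), (3,)])

# Find all files carrying the tag, regardless of their storage key
rows = conn.execute(
    """
    SELECT file.key FROM file
    JOIN file_tag ON file_tag.file_id = file.id
    JOIN tag ON tag.id = file_tag.tag_id
    WHERE tag.name = ?
    ORDER BY file.key
    """,
    ("Sample 0001",),
).fetchall()
tagged_keys = [key for (key,) in rows]
conn.close()

print(tagged_keys)
```

Because the grouping lives in the link table rather than in the key, files can be moved or stored anywhere without losing their tag.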

## Save, update & delete metadata

To conclude this guide to basic file & metadata tracking, let's see how to save, update, and delete the records that store metadata for any entity.

### Save records

A single record:

In [None]:
project = ln.Project(name="Project A")

In [None]:
project.save()

Multiple records:

In [None]:
projects = [ln.Project(name=name) for name in ["Project B", "Project C", "Project D"]]

In [None]:
ln.save(projects)

### Update records

In [None]:
project = ln.Project.select(name="Project A").first()

In [None]:
project

In [None]:
project.name = "Project 1"

In [None]:
project.save()

### Delete records

In [None]:
project = ln.Project.select(name="Project B").first()

In [None]:
project.delete()

In [None]:
# clean up what we wrote in this notebook
!lamin delete mydata
!rm -r mydata
!rm paradisi05_laminopathic_nuclei.jpg