# Track files

In [None]:
import lamindb as ln

## Usage

Let us first track the data source. Here, it's a Jupyter notebook, so we can run:

In [None]:
ln.track()

:::{dropdown} Track a pipeline instead of a notebook

If this is run in a pipeline, we need to pass a {class}`~lamindb.Transform` object of `type` "pipeline":

```python
transform = ln.Transform("My script", type="pipeline")
ln.track(transform=transform)
```

This creates a {class}`~lamindb.Run` for the pipeline.

A pipeline can be any Python script or workflow tool.

:::

Here's a file on local storage:

In [None]:
filepath = ln.dev.datasets.file_jpg_paradisi05().resolve()

In [None]:
filepath

In LaminDB, you track files in two steps.

First, create a {class}`~lamindb.File` object. Here, we pass an optional storage key:

In [None]:
file = ln.File(filepath, key="images/paradisi05_laminopathic_nuclei.jpg")

In [None]:
file

:::{dropdown} Quick overview

A {class}`~lamindb.File` object manages any serialized data object.

Basic file metadata is:

- `id`: a universally unique persistent ID that also serves as a primary key in the SQL table
- `name`: a name (e.g., the original file name)
- `key`: the storage key, i.e., the relative path of the file in the storage location
- `storage`: the storage location (the root, say, an S3 bucket)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: an MD5 checksum useful to check for integrity and collisions (is this file already stored?)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related metadata is:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the general {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the specific {class}`~lamindb.Run` of the transform that generated the file

Managing the underlying data:

- `load()`: load the file into memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `path()`: the path (cloud or local)
- `stage()`: a local path to a cached object
- `replace()`: replace the content of the file

For a full reference, see {class}`~lamindb.File`.

:::

The `file` object also links to the current notebook run:

In [None]:
file.run

In [None]:
# a few checks
assert file.hash == "r4tnqmKI_SjrkdLzpuWp4g"
assert file.run == ln.context.run
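The 22-character hash in the check above has the shape of a URL-safe base64 encoding of a 16-byte MD5 digest with the `=` padding stripped. The sketch below shows how such a checksum could be computed with the standard library alone. It assumes that encoding and is not LaminDB's actual implementation; the helper name `b64_md5` is hypothetical.

```python
import base64
import hashlib


def b64_md5(data: bytes) -> str:
    """Return the MD5 digest of `data` as unpadded URL-safe base64 (hypothetical helper)."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    # 16 bytes encode to 24 base64 characters, the last two being "=" padding
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


# Hash some example bytes; for a real file, pass the bytes read from disk
checksum = b64_md5(b"")
print(checksum, len(checksum))
```

Because identical content always yields the same digest, comparing checksums is a cheap way to detect whether a file has already been stored.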

Second, add the `file` object to the LaminDB instance. The metadata is written to the database and the data to storage in a single ACID transaction:

In [None]:
file = ln.add(file)

If you don't want to move the file, you can also track it in its existing storage location: {doc}`/guide/existing`.

## What happens under the hood?

### In the SQL database

Adding the file creates three records:

1. a `File` record
2. a `Transform` record
3. a `Run` record

All three records are linked so that you can find the file using any of the metadata fields.

In [None]:
ln.select(ln.File).df()

In [None]:
ln.select(ln.Transform).df()

In [None]:
ln.select(ln.Run).df()
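To make the linkage between the three tables concrete, here is a minimal sketch using the standard library's `sqlite3`. The schema, column names, and IDs are hypothetical simplifications for illustration and do not reproduce LaminDB's actual tables. The sketch inserts the three records in one transaction and then finds the file through a field on the transform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema: a file points to a run, a run points to a transform
conn.executescript(
    """
    CREATE TABLE transform (id TEXT PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE run (id TEXT PRIMARY KEY, transform_id TEXT REFERENCES transform(id));
    CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT, run_id TEXT REFERENCES run(id));
    """
)

# Using the connection as a context manager wraps the inserts in one transaction:
# it commits if the block succeeds and rolls back if any statement raises
with conn:
    conn.execute("INSERT INTO transform VALUES (?, ?, ?)", ("t1", "Track files", "notebook"))
    conn.execute("INSERT INTO run VALUES (?, ?)", ("r1", "t1"))
    conn.execute(
        "INSERT INTO file VALUES (?, ?, ?)",
        ("f1", "paradisi05_laminopathic_nuclei.jpg", "r1"),
    )

# Find files by a field of the linked transform
rows = conn.execute(
    """
    SELECT file.name
    FROM file
    JOIN run ON file.run_id = run.id
    JOIN transform ON run.transform_id = transform.id
    WHERE transform.type = ?
    """,
    ("notebook",),
).fetchall()
print(rows)
```

The same join works in the other direction, for example to list every run that produced a file with a given name.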

### In storage

In [None]:
!ls ./mydata

`./mydata` is your "working storage location", akin to a "working directory". It can also be a cloud storage location, such as an S3 or Google Cloud Storage bucket.

In [None]:
ln.setup.settings.storage.root

## Retrieve a file

Retrieve the data with `.stage()`, which here returns a local filepath:

In [None]:
file.stage()

To get the full path within the storage location, call `.path()`:

In [None]:
file.path()
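The relationship between the storage root, the storage key, and the full path can be sketched with `pathlib`. This is only an illustration of how the pieces compose, assuming a local root of `./mydata` and the key used in this guide; it does not call LaminDB.

```python
from pathlib import PurePosixPath

# Values mirroring this guide: the storage root and the key passed to ln.File
storage_root = PurePosixPath("./mydata")
key = "images/paradisi05_laminopathic_nuclei.jpg"

# The key is the path of the file relative to the storage root
full_path = storage_root / key

print(full_path)         # root joined with key
print(full_path.name)    # file name, the last component of the key
print(full_path.suffix)  # file suffix, including the leading dot
```

Keeping the key relative to the root means the same key stays valid if the root is later swapped for a cloud bucket.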

## Query a file

You can also query the `File` record by its metadata. One of the simplest ways is by name:

In [None]:
file = ln.select(ln.File, name="paradisi05_laminopathic_nuclei.jpg").one()

file

Learn more: {doc}`/guide/select`.