# Track files, in-memory objects & folders

Let us create a test instance with a local folder and local SQLite as storage & SQL backends, respectively.



In [None]:
!lamin init --storage ./myobjects

## Files

In [None]:
import lamindb as ln

### Track an existing file

`./myobjects` is your "working storage location", akin to a "working directory". It can also be a cloud storage location (an S3 or GCS bucket), and you can switch between tracked locations via {func}`lamindb.setup.set.storage`.

In [None]:
ln.setup.settings.storage.root

In [None]:
filepath = ln.dev.datasets.file_mini_csv()
filepath = filepath.rename(ln.setup.settings.storage.root / filepath.name)

Assume we have an existing file in our storage location:

In [None]:
!ls ./myobjects/mini.csv

In LaminDB, you track files in two steps.

First, create a {class}`~lamindb.File` object. Here, we pass the path of the file in the storage location:

In [None]:
file = ln.File("./myobjects/mini.csv")

:::{dropdown} Quick overview

A {class}`~lamindb.File` object manages any serialized data object.

Basic file metadata is:

- `id`: a universally unique persistent ID that also serves as a primary key in the SQL table
- `name`: a name (e.g., the original file name)
- `key`: the storage key, i.e., the relative path of the file in the storage location
- `storage`: the storage location (the root, say, an S3 bucket)
- `suffix`: the file suffix
- `size`: the file size in bytes
- `hash`: an MD5 checksum useful to check for integrity and collisions (is this file already stored?)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related metadata is:

- `created_by`: the {class}`~lamindb.User` who created the file
- `transform`: the general {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the specific {class}`~lamindb.Run` of the transform that generated the file

Managing the underlying data:

- `load()`: load the file to memory for formats like `.parquet`, `.zarr`, and `.h5ad`
- `path()`: the path (cloud or local)
- `stage()`: a local path to a cached object
- `replace()`: replace the content of the file

For a full reference, see {class}`~lamindb.File`.

:::

Second, add the `file` object to the LaminDB instance. Metadata and data are added to the database and to storage in a single ACID transaction:

In [None]:
ln.add(file)

In [None]:
assert str(filepath.resolve()) == str(file.path())
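To make the idea of a single transaction concrete, here is a minimal sketch that uses only the standard library. It is not LaminDB's implementation: the `file` table, the `add_file` helper, and the cleanup strategy are assumptions made for illustration. It shows that the metadata row and the stored copy either both persist or are both undone.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def add_file(conn: sqlite3.Connection, source: Path, storage_root: Path, key: str) -> Path:
    """Copy `source` into storage and insert its metadata row, or do neither."""
    target = storage_root / key
    if target.exists():
        raise FileExistsError(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        # `with conn` commits on success and rolls back the INSERT on any exception
        with conn:
            conn.execute(
                "INSERT INTO file (key, size) VALUES (?, ?)",
                (key, source.stat().st_size),
            )
            shutil.copy2(source, target)
    except Exception:
        # undo the storage side as well, so that no orphaned object remains
        target.unlink(missing_ok=True)
        raise
    return target


# hypothetical setup: a throwaway storage folder and an in-memory database
workdir = Path(tempfile.mkdtemp())
storage_root = workdir / "storage"
source = workdir / "mini.csv"
source.write_text("a,b\n1,2\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (key TEXT PRIMARY KEY, size INTEGER)")

stored = add_file(conn, source, storage_root, "mini.csv")
n_rows = conn.execute("SELECT COUNT(*) FROM file").fetchone()[0]
```

If the copy fails after the row was inserted, the row is rolled back and any partial copy is removed, so the database and storage stay in sync.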

### Add a new file

In [None]:
filepath = ln.dev.datasets.file_jpg_paradisi05().resolve()

Here's a local file that's not yet in LaminDB storage:

In [None]:
filepath

In [None]:
file = ln.File(filepath, key="images/paradisi05_laminopathic_nuclei.jpg")

In [None]:
file

In [None]:
# a few checks
assert file.hash == "r4tnqmKI_SjrkdLzpuWp4g"
assert file.run == ln.context.run
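The 22-character hash checked above is consistent with an MD5 digest encoded as unpadded URL-safe base64. That encoding is an assumption here, not something this guide states, and `content_hash` is a hypothetical helper. The sketch shows how such a checksum lets you detect whether identical content is already stored.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """Return the MD5 digest of `data` as unpadded URL-safe base64 (22 characters)."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


original = content_hash(b"col1,col2\n1,2\n")
duplicate = content_hash(b"col1,col2\n1,2\n")
modified = content_hash(b"col1,col2\n1,3\n")

print(len(original))          # 22
print(original == duplicate)  # True: identical content gives an identical hash
print(original == modified)   # False: the content differs
```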

In [None]:
file = ln.add(file)

#### What happens under the hood?

In storage:

In [None]:
!ls -R ./myobjects

In the SQL database, for each object in storage, there is a row in the File table:

In [None]:
ln.view()

You will see how to link provenance, biology and arbitrary metadata to files later in this guide!

## Retrieve a file

There are several ways of accessing a file.

For instance, `.stage()` returns a local filepath (it will cache a cloud object):

In [None]:
file.stage()

If we want the full `path` within the storage location, we'll call `.path()`. Because the file is in local storage, `.stage()` and `.path()` return the same result.
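The relationship between the two calls can be sketched as follows. This is a simplified model, not LaminDB's code: the `stage` function, its `is_remote` flag, and the cache layout are assumptions, and a local copy stands in for a cloud download.

```python
import shutil
import tempfile
from pathlib import Path


def stage(path: Path, is_remote: bool, cache_dir: Path) -> Path:
    """Return a local path for `path`; only remote objects are copied into the cache."""
    if not is_remote:
        return path  # local storage: staging is a no-op
    cached = cache_dir / path.name
    if not cached.exists():
        shutil.copy2(path, cached)  # stand-in for downloading a cloud object
    return cached


# hypothetical demo: a temporary folder plays the role of the storage location
workdir = Path(tempfile.mkdtemp())
cache_dir = workdir / "cache"
cache_dir.mkdir()
stored = workdir / "mini.csv"
stored.write_text("a,b\n1,2\n")

local_result = stage(stored, is_remote=False, cache_dir=cache_dir)
remote_result = stage(stored, is_remote=True, cache_dir=cache_dir)
```

For the local case the staged path is the storage path itself. For the simulated remote case it points into the cache.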

## Query a file

You can query the file by its metadata. Two of the simplest ways are by name or key:

In [None]:
file = ln.select(ln.File, name="mini.csv").one()

file

In [None]:
file = ln.select(ln.File, key="images/paradisi05_laminopathic_nuclei.jpg").one()

file

Learn more: {doc}`/guide/select`.

## Replace a file

Say we made a change to the content of a file (e.g., edited the image `paradisi05_laminopathic_nuclei.jpg`).

In [None]:
# a dummy change to the file
!cp index.md paradisi05_laminopathic_nuclei.jpg

This is how we replace the old file in storage with the new file:

In [None]:
file.replace("paradisi05_laminopathic_nuclei.jpg")

In [None]:
ln.add(file)

The file record now has a value in its `updated_at` field:

In [None]:
assert (
    ln.select(ln.File, name="paradisi05_laminopathic_nuclei.jpg").one().updated_at
    is not None
)

## Delete a file

In [None]:
file = ln.select(ln.File, name="paradisi05_laminopathic_nuclei.jpg").one()

In [None]:
ln.delete(file, delete_data_from_storage=True)

```{important}

By default, only the record is deleted and you are asked to confirm whether to also delete the data from storage.

Set `delete_data_from_storage=True` to confirm the deletion from storage automatically.
```

In [None]:
from pathlib import Path

assert ln.select(ln.File, name="paradisi05_laminopathic_nuclei.jpg").first() is None
assert not Path("./myobjects/images/paradisi05_laminopathic_nuclei.jpg").exists()

## In-memory objects

A `File` object can also be created from an in-memory object, which is serialized into a configurable storage format (e.g., `DataFrame` → `.parquet`, `AnnData` → `.h5ad`/`.zarr`, ...).

In [None]:
import sklearn.datasets

df = sklearn.datasets.load_iris(as_frame=True).frame

In [None]:
df.head()

In [None]:
file = ln.File(df, name="iris")

In [None]:
ln.add(file)

The data was added with a storage key based on the `id` because we didn't pass the `key` argument.

In [None]:
!ls ./myobjects
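The fallback from an explicit key to an ID-based key can be sketched as below. The `storage_key` helper and the exact `<id><suffix>` format are assumptions for illustration, and the ID shown is made up.

```python
from typing import Optional


def storage_key(file_id: str, suffix: str, key: Optional[str] = None) -> str:
    """Return the explicit `key` if given, else a key derived from the ID and suffix."""
    if key is not None:
        return key
    return f"{file_id}{suffix}"


# hypothetical ID, for illustration only
auto_key = storage_key("a1B2c3D4", ".parquet")
explicit_key = storage_key(
    "a1B2c3D4", ".jpg", key="images/paradisi05_laminopathic_nuclei.jpg"
)

print(auto_key)      # a1B2c3D4.parquet
print(explicit_key)  # images/paradisi05_laminopathic_nuclei.jpg
```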

Get the dataframe back:

In [None]:
file.load()

Stage the underlying parquet file:

In [None]:
file.stage()

## Data objects in context

We have come to love the pydata family of `DataFrame`, `AnnData`, `torch.utils.data.DataLoader`, `zarr.Array`, `pyarrow.Table`, `xarray.Dataset`, and others for accessing lower-level data objects.

But we couldn’t find an object for accessing how data objects are linked to context.
So, we made `lamindb.File` to help with modeling and understanding data objects in relation to their context.

Context can be other data objects, data transformations, ML models, users & pipelines who performed transformations, and all aspects of data lineage.
Context can also be hypotheses and any entity of the domain in which data is generated and modeled.

Depending on how files are linked to context, they give rise to features of data lakes, warehouses and knowledge graphs.

We focused on linking `File` to biological concepts: entities, their types, records, transformations, and relations.
You'll learn about them further down the guide.

## Track folders

You can track folders in the configured storage location.

### Real folders

In [None]:
ln.dev.datasets.generate_cell_ranger_files(
    "sample_001", ln.setup.settings.instance.storage.root
)

In [None]:
!ls -l './myobjects/sample_001/'

Let's pass this directory to `ln.Folder`:

In [None]:
folder = ln.Folder("./myobjects/sample_001/")

In [None]:
folder.files[:2]

In [None]:
ln.add(folder)

View the files as a tree:

In [None]:
folder.tree()

Under the hood, the following records were written:

In [None]:
ln.select(ln.Folder, name=folder.name).one()

In [None]:
ln.select(ln.File).join(ln.File.folders).where(
    ln.Folder.name == "sample_001"
).df().head()

A `Folder` can be subset to files via their relative path in the directory:

In [None]:
folder.subset(prefix="raw_feature_bc_matrix/")

In [None]:
folder.subset(prefix="raw_feature_bc_matrix", suffix=".mtx.gz")
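Conceptually, subsetting filters the relative paths of the files in the folder. A minimal sketch, assuming plain string matching on a prefix and a suffix; the `subset` function and the file list below are hypothetical and only loosely modeled on Cell Ranger output:

```python
from typing import List, Optional


def subset(
    keys: List[str], prefix: Optional[str] = None, suffix: Optional[str] = None
) -> List[str]:
    """Keep the relative paths that start with `prefix` and end with `suffix`."""
    return [
        key
        for key in keys
        if (prefix is None or key.startswith(prefix))
        and (suffix is None or key.endswith(suffix))
    ]


# hypothetical folder content
keys = [
    "metrics_summary.csv",
    "raw_feature_bc_matrix/barcodes.tsv.gz",
    "raw_feature_bc_matrix/features.tsv.gz",
    "raw_feature_bc_matrix/matrix.mtx.gz",
    "filtered_feature_bc_matrix/matrix.mtx.gz",
]

by_prefix = subset(keys, prefix="raw_feature_bc_matrix/")
by_both = subset(keys, prefix="raw_feature_bc_matrix", suffix=".mtx.gz")

print(by_both)  # ['raw_feature_bc_matrix/matrix.mtx.gz']
```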

Query a specific file from a folder using `ln.select`:

In [None]:
ln.select(ln.File, key="sample_001/metrics_summary.csv").df()

### Virtual folders

You can also create virtual folders that group files together, independent of their storage key.

In that case, you'll need to write richer queries to retrieve files, e.g.,

In [None]:
ln.select(ln.File, name="metrics_summary.csv").join(ln.File.folders).where(
    ln.Folder.name == "sample_001"
).df()
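Such a query corresponds to a join across a link table between files and folders. The sketch below models this with plain `sqlite3`. The table and column names are assumptions and do not reflect LaminDB's actual schema. Note that the storage keys do not contain the folder name, which is what makes the folder "virtual".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE file (id INTEGER PRIMARY KEY, name TEXT, key TEXT);
    CREATE TABLE folder (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE file_folder (
        file_id INTEGER REFERENCES file (id),
        folder_id INTEGER REFERENCES folder (id)
    );
    INSERT INTO file VALUES (1, 'metrics_summary.csv', 'runs/a/metrics_summary.csv');
    INSERT INTO file VALUES (2, 'metrics_summary.csv', 'runs/b/metrics_summary.csv');
    INSERT INTO folder VALUES (1, 'sample_001');
    INSERT INTO folder VALUES (2, 'sample_002');
    INSERT INTO file_folder VALUES (1, 1);
    INSERT INTO file_folder VALUES (2, 2);
    """
)

# the file name alone is ambiguous; the folder membership disambiguates it
rows = conn.execute(
    """
    SELECT file.key
    FROM file
    JOIN file_folder ON file_folder.file_id = file.id
    JOIN folder ON folder.id = file_folder.folder_id
    WHERE file.name = ? AND folder.name = ?
    """,
    ("metrics_summary.csv", "sample_001"),
).fetchall()

print(rows)  # [('runs/a/metrics_summary.csv',)]
```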

## Add, update & delete metadata

### Add records

In [None]:
project = ln.Project(name="B1")

In [None]:
ln.add(project)

A list of records:

In [None]:
projects = [ln.Project(name=name) for name in ["B2", "B3", "B4"]]

In [None]:
ln.add(projects)

### Update records

In [None]:
file = ln.select(ln.File, name="iris").first()
file

Update the name "iris" to "iris_new":

In [None]:
file.name = "iris_new"

Add the updated record to the database:

In [None]:
ln.add(file)
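In this guide, the same `ln.add` call is used for new and for modified records. One way to picture that behavior is an insert-or-update keyed on the record's ID. The following is a sketch under that assumption, with a hypothetical `add` helper, table, and ID; it is not LaminDB's implementation.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (id TEXT PRIMARY KEY, name TEXT, updated_at TEXT)")


def add(conn: sqlite3.Connection, file_id: str, name: str) -> None:
    """Insert the record if its ID is new, otherwise update it and stamp `updated_at`."""
    exists = conn.execute("SELECT 1 FROM file WHERE id = ?", (file_id,)).fetchone()
    with conn:  # commit on success, roll back on error
        if exists is None:
            conn.execute("INSERT INTO file (id, name) VALUES (?, ?)", (file_id, name))
        else:
            now = datetime.now(timezone.utc).isoformat()
            conn.execute(
                "UPDATE file SET name = ?, updated_at = ? WHERE id = ?",
                (name, now, file_id),
            )


add(conn, "a1B2c3D4", "iris")      # first call inserts; `updated_at` stays NULL
add(conn, "a1B2c3D4", "iris_new")  # second call updates the same row

n_rows = conn.execute("SELECT COUNT(*) FROM file").fetchone()[0]
name, updated_at = conn.execute("SELECT name, updated_at FROM file").fetchone()
```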

### Delete records

In [None]:
project = ln.select(ln.Project, name="B2").first()

In [None]:
ln.delete(project)

In [None]:
!lamin delete myobjects
!rm -r myobjects
!rm paradisi05_laminopathic_nuclei.jpg