![](https://img.shields.io/badge/tutorial1/2-lightgrey)
[![](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial.ipynb)
[![](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/record/core/Transform?uid=NJvdsWWbJlZSz8)

# Tutorial: Artifacts

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

Biology is measured in samples that generate batches of data.

LaminDB provides a framework to transform these batches into more useful representations: validated, queryable datasets, machine learning models, and analytical insights (graphic).

The tutorial has two parts, each a Jupyter notebook:

1. {doc}`/tutorial` - register & access
2. {doc}`/tutorial2` - validate & annotate



## Setup

1. Install the `lamindb` Python package:
    ```shell
    pip install 'lamindb[jupyter,aws]'
    ```
2. [Sign up](https://lamin.ai/signup) for a free account (see more [info](https://lamin.ai/docs/setup)) and copy the API key.
3. Log in on the command line:
    ```shell
    lamin login <email> --key <API-key>
    ```

You can now init a LaminDB instance with a directory `./lamin-tutorial` for storing data:

In [3]:
!lamin init --storage ./lamin-tutorial  # or "s3://my-bucket" or "gs://my-bucket"

💡 found cached instance metadata: /Users/falexwolf/.lamin/instance--testuser1--lamin-tutorial.env
💡 loaded instance: testuser1/lamin-tutorial

:::{dropdown} What else can I configure during setup?

1. Instead of the default SQLite database, use PostgreSQL:
    ```shell
    --db postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>
    ```
2. Instead of a default instance name derived from storage, provide a custom name:
    ```shell
    --name myinstance
    ```
3. Beyond the core schema, use bionty and other schemas:
    ```shell
    --schema bionty,custom1,template1
    ```

For more, see {doc}`/setup`.

:::

## Track a data source

In [4]:
import lamindb as ln

💡 lamindb instance: testuser1/lamin-tutorial


If you're new to LaminDB, set {attr}`~lamindb.dev.Settings.verbosity` to hint level:

In [5]:
ln.settings.verbosity = "hint"

The code that generates a batch of data is a transform ({class}`~lamindb.Transform`). It could be a pipeline, a notebook or an app upload.

Let's track the notebook that's being run:

In [6]:
ln.track()

💡 notebook imports: lamindb==0.63.3
💡 loaded: Transform(uid='NJvdsWWbJlZSz8', name='Tutorial: Artifacts', short_name='tutorial', version='0', type='notebook', updated_at=2023-12-11 21:13:26 UTC, created_by_id=1)
💡 loaded: Run(uid='O2mhZmQZrxbnyLReqR9h', run_at=2023-12-12 20:45:39 UTC, transform_id=1, created_by_id=1)


By calling {func}`~lamindb.track`, the notebook is automatically linked as the source of all data that's about to be saved!

:::{dropdown} What happened under the hood?

1. The versions of the packages imported in the current notebook were detected
2. Notebook metadata was detected and stored in a {class}`~lamindb.Transform` record
3. Run metadata was detected and stored in a {class}`~lamindb.Run` record

The {class}`~lamindb.Transform` class registers data transformations: a notebook, a pipeline or a UI operation.

The {class}`~lamindb.Run` class registers executions of transforms. Several runs can be linked to the same transform if executed with different context (time, user, input data, etc.).

:::
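
The relationship between transforms and runs can be pictured with a small stand-alone sketch. `ToyTransform` and `ToyRun` below are hypothetical stand-ins written for this illustration, not LaminDB's classes; they only show that several runs, each with its own context, can point to the same transform.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ToyTransform:
    """Hypothetical stand-in for a transform record (not the LaminDB class)."""

    name: str
    type: str  # e.g. "notebook" or "pipeline"


@dataclass
class ToyRun:
    """Hypothetical stand-in for a run record: one execution of a transform."""

    transform: ToyTransform
    user: str
    run_at: datetime


notebook = ToyTransform(name="Tutorial: Artifacts", type="notebook")

# Two executions by different users: two runs, one shared transform
runs = [
    ToyRun(notebook, user="testuser1", run_at=datetime.now(timezone.utc)),
    ToyRun(notebook, user="testuser2", run_at=datetime.now(timezone.utc)),
]

runs_of_notebook = [run for run in runs if run.transform is notebook]
print(len(runs_of_notebook))
```

In this picture, the transform describes *what* code ran, while each run records *when* and *by whom* it was executed.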

:::{dropdown} How do I track a pipeline instead of a notebook?

```python
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
```

:::

:::{dropdown} Why should I care about tracking notebooks?

If you can, avoid interactive notebooks: anything that can be a deterministic pipeline should be a pipeline.

Still, much insight generated from biological data is driven by computational biologists _interacting_ with it.

A notebook that's run a single time on specific data is not a pipeline: it's a (versioned) **document** that produced insight or some other form of data representation (with parallels to an ELN in the wetlab).

Because humans are in the loop, most mistakes happen when using notebooks: {func}`~lamindb.track` helps avoid some of them.

(An early blog post on this is [here](https://lamin.ai/blog/2022/nbproject).)

:::

## Manage artifacts

We'll work with a toy dataset of image files and transform it into higher-level features for downstream analysis.

(For other data types: see {doc}`docs:by-datatype`.)

Consider 3 directories storing images & metadata of Iris flowers, generated in 3 subsequent studies:

In [7]:
ln.UPath("s3://lamindb-dev-datasets/iris_studies").view_tree()

iris_studies (3 sub-directories & 151 files with suffixes '.jpg', '.csv'): 
├── study0_raw_images
│   ├── iris-0337d20a3b7273aa0ddaa7d6afb57a37a759b060e4401871db3cefaa6adc068d.jpg
│   ├── iris-0797945218a97d6e5251b4758a2ba1b418cbd52ce4ef46a3239e4b939bd9807b.jpg
│   ├── iris-0f133861ea3fe1b68f9f1b59ebd9116ff963ee7104a0c4200218a33903f82444.jpg
│   ├── iris-0fec175448a23db03c1987527f7e9bb74c18cffa76ef003f962c62603b1cbb87.jpg
│   ├── iris-125b6645e086cd60131764a6bed12650e0f7f2091c8bbb72555c103196c01881.jpg
│   ├── iris-13dfaff08727abea3da8cfd8d097fe1404e76417fefe27ff71900a89954e145a.jpg
│   ...
│   └── meta.csv
├── study1_raw_images
│   ├── iris-0879d3f5b337fe512da1c7bf1d2bfd7616d744d3eef7fa532455a879d5cc4ba0.jpg
│   ├── iris-0b486eebacd93e114a6ec24264e035684cebe7d2074eb71eb1a71dd70bf61e8f.jpg
│   ├── iris-0ff5ba898a0ec179a25ca217af45374fdd06d606bb85fc29294291facad1776a.jpg
│   ├── iris-1175239c07a943d89a6335fb4b99a9fb5aabb2137c4d96102f10b25260ae523f.jpg
│   ├── iris-1289c57b571e8e98e4feb3

Our goal is to turn these directories into a validated & queryable dataset that can be used alongside many other datasets.

### Register an artifact

LaminDB uses the {class}`~lamindb.Artifact` class to model files, folders & arrays in storage with their metadata. It's a registry to manage search, queries, validation & access of storage locations.

Let's create a {class}`~lamindb.Artifact` record from one of the files:

In [None]:
artifact = ln.Artifact(
    "s3://lamindb-dev-datasets/iris_studies/study0_raw_images/meta.csv"
)
artifact

:::{dropdown} Which fields are populated when creating an artifact record?

Basic fields:

- `uid`: universal ID
- `key`: storage key, a relative path of the artifact in `storage`
- `description`: an optional string description
- `storage`: the storage location (the root, say, an S3 bucket or a local directory)
- `suffix`: an optional file/path suffix
- `size`: the artifact size in bytes
- `hash`: a hash useful to check for integrity and collisions (is this artifact already stored?)
- `hash_type`: the type of the hash (usually, an MD5 or SHA1 checksum)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related fields:

- `created_by`: the {class}`~lamindb.User` who created the artifact
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the artifact

For a full reference, see {class}`~lamindb.Artifact`.

:::
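
To make the `suffix`, `size`, `hash` and `hash_type` fields more concrete, here is a minimal sketch that computes comparable values for a local file using only the standard library. The helper `describe_file` is hypothetical, and the compact base64 form of the checksum is an encoding chosen for this sketch; how LaminDB computes and encodes its hashes may differ.

```python
import base64
import hashlib
import tempfile
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Compute suffix, size and an MD5 checksum for a local file (toy helper)."""
    content = path.read_bytes()
    digest = hashlib.md5(content).digest()
    return {
        "suffix": path.suffix,
        "size": len(content),  # size in bytes
        "hash_hex": digest.hex(),
        # Assumption for this sketch: URL-safe base64 without padding
        "hash_b64": base64.urlsafe_b64encode(digest).decode().rstrip("="),
        "hash_type": "md5",
    }


with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "meta.csv"
    file_path.write_bytes(b"hello")
    info = describe_file(file_path)

print(info["suffix"], info["size"], info["hash_hex"])
```

Two files with identical content yield the same checksum, which is what makes a hash useful for detecting whether an object is already stored.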

Upon `.save()`, artifact metadata is written to the database:

In [None]:
artifact.save()

:::{dropdown} What happens during save?

In the database: An artifact record is inserted into the `artifact` registry. If the artifact record already exists, it's updated.

In storage:
- If the default storage is in the cloud, `.save()` triggers an upload for a local artifact.
- If the artifact is already in a registered storage location, only the metadata of the record is saved to the `artifact` registry.

:::
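
The save behavior described above can be sketched in plain Python. Both functions are hypothetical simplifications written for this tutorial: `save_record` mimics "insert, or update if it exists" with a dictionary in place of a database, and `needs_upload` reduces the storage rule to a check on path prefixes.

```python
def save_record(registry: dict, record: dict) -> str:
    """Insert the record if its uid is new, otherwise update the stored one."""
    uid = record["uid"]
    if uid in registry:
        registry[uid].update(record)
        return "updated"
    registry[uid] = dict(record)
    return "inserted"


def needs_upload(path: str, default_storage: str) -> bool:
    """A local file needs an upload only if the default storage is in the cloud."""
    cloud_prefixes = ("s3://", "gs://")
    is_local = not path.startswith(cloud_prefixes)
    return is_local and default_storage.startswith(cloud_prefixes)


registry = {}
first = save_record(registry, {"uid": "abc", "description": None})
second = save_record(registry, {"uid": "abc", "description": "My new description"})
print(first, second, registry["abc"]["description"])

print(needs_upload("./meta.csv", "s3://my-bucket"))
print(needs_upload("s3://my-bucket/meta.csv", "s3://my-bucket"))
```

Saving the same record twice leaves a single entry in the registry, with the latest metadata.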

We can get an overview of all artifacts in the database by calling {meth}`~lamindb.dev.Registry.df`:

In [None]:
ln.Artifact.df()

### View data flow

Because we called {func}`~lamindb.track`, we know that the artifact was saved in the current notebook ({meth}`~lamindb.dev.Data.view_flow`):

In [None]:
artifact.view_flow()

We can also directly access its linked {class}`~lamindb.Transform` & {class}`~lamindb.Run` records:

In [None]:
artifact.transform

In [None]:
artifact.run

(For a comprehensive example with data flow through app uploads, pipelines & notebooks of multiple data types, see {doc}`docs:project-flow`.)

### Access an artifact

{attr}`~lamindb.Artifact.path` gives you the file path ({class}`~lamindb.UPath`):

In [None]:
artifact.path

To download the artifact to a local cache, call {meth}`~lamindb.Artifact.stage`:

In [None]:
artifact.stage()

To load data into memory with a default loader, call {meth}`~lamindb.Artifact.load`: 

In [None]:
df = artifact.load(index_col=0)
df.head()

If the data is large, you'll likely want to query it via {meth}`~lamindb.Artifact.backed`. For more on this, see: {doc}`data`.

:::{dropdown} How do I update an artifact?

If you'd like to replace the underlying stored object, use {meth}`~lamindb.Artifact.replace`.

If you'd like to update metadata:
```python
artifact.description = "My new description"
artifact.save()  # save the change to the database
```

:::


### Register directories as artifacts

We now register the entire directory for study 0 as an artifact:

In [7]:
study0_data = ln.Artifact("s3://lamindb-dev-datasets/iris_studies/study0_raw_images")
study0_data.save()
ln.Artifact.df()  # see the registry content

❗ no run & transform get linked, consider passing a `run` or calling ln.track()


| id | uid | storage_id | key | suffix | accessor | description | version | size | hash | hash_type | n_objects | n_observations | transform_id | run_id | initial_version_id | visibility | key_is_virtual | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5vQ7PAw21PywI3KPtLCV | 2 | iris_studies/study0_raw_images/meta.csv | .csv |  |  |  | 4355 | ZpAEpN0iFYH6vjZNigic7g | md5 |  |  | 1.0 | 1.0 |  | 1 | False | 2023-12-11 21:13:37.166476+00:00 | 1 |
| 3 | hDG8dS8TdmT5QyvulQ4H | 2 | iris_studies/study0_raw_images |  |  |  |  | 656692 | d8_SjrP3V5tGetN8LQZC7w | md5-d | 51.0 |  |  |  |  | 1 | False | 2023-12-12 21:04:31.590187+00:00 | 1 |


### Filter & search artifacts

You can search artifacts directly based on the {class}`~lamindb.Artifact` registry:

In [None]:
ln.Artifact.search("meta").head()

You can also query & search the artifact by any metadata combination.

For instance, look up a user with auto-complete from the {class}`~lamindb.User` registry:

In [None]:
users = ln.User.lookup()
users.testuser1

Filter the {class}`~lamindb.Transform` registry for a name:

In [None]:
transform = ln.Transform.filter(
    name__icontains="Artifacts"
).one()  # get exactly one result
transform

:::{dropdown} What does a double underscore mean?

For any field, the double underscore defines a comparator, e.g.,

* `name__icontains="Martha"`: `name` contains `"Martha"` when ignoring case
* `name__startswith="Martha"`: `name` starts with `"Martha"`
* `name__in=["Martha", "John"]`: `name` is `"John"` or `"Martha"`

For more info, see: {doc}`meta`.

:::
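
To see what these comparators mean in practice, here is a toy re-implementation in plain Python. The `matches` function is hypothetical and only mirrors the three comparators listed above on dictionaries; it is not how LaminDB evaluates queries, which run against the database.

```python
def matches(record: dict, **lookups) -> bool:
    """Check a record against `field__comparator=value` lookups (toy version)."""
    for lookup, expected in lookups.items():
        field, _, comparator = lookup.partition("__")
        value = record[field]
        if comparator == "":  # no double underscore: exact match
            ok = value == expected
        elif comparator == "icontains":  # case-insensitive substring
            ok = expected.lower() in value.lower()
        elif comparator == "startswith":  # case-sensitive prefix
            ok = value.startswith(expected)
        elif comparator == "in":  # membership in a list of values
            ok = value in expected
        else:
            raise ValueError(f"unsupported comparator: {comparator}")
        if not ok:
            return False
    return True


people = [{"name": "Martha Smith"}, {"name": "martha"}, {"name": "John"}]

icontains = [p["name"] for p in people if matches(p, name__icontains="Martha")]
startswith = [p["name"] for p in people if matches(p, name__startswith="Martha")]
within = [p["name"] for p in people if matches(p, name__in=["Martha", "John"])]
print(icontains, startswith, within)
```

Note that `name__in` requires an exact match with one of the listed values, so `"Martha Smith"` is not selected by `name__in=["Martha", "John"]`.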

Use these results to filter the {class}`~lamindb.Artifact` registry:

In [None]:
ln.Artifact.filter(
    created_by=users.testuser1,
    transform=transform,
    suffix=".jpg",
).df().head()

You can also query for directories using `key__startswith` (LaminDB treats directories like AWS S3, as the prefix of the storage `key`): 

In [None]:
ln.Artifact.filter(key__startswith="iris_studies/study0_raw_images/").df().head()

```{note}

You can look up, filter & search any registry ({class}`~lamindb.dev.Registry`).

You can chain {meth}`~lamindb.dev.Registry.filter` statements and {meth}`~lamindb.dev.QuerySet.search`: `ln.Artifact.filter(suffix=".jpg").search("my image")`

An empty filter returns the entire registry: `ln.Artifact.filter()`
```

For more info, see: {doc}`meta`.
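
The idea of a directory as a key prefix can be illustrated without a database. In the sketch below, the `.jpg` key and the `study0_raw_images_v2` key are made-up examples, and `in_directory` is a hypothetical helper; the point is that the trailing slash in the prefix keeps sibling directories with a similar name out of the result.

```python
keys = [
    "iris_studies/study0_raw_images/meta.csv",
    "iris_studies/study0_raw_images/iris-example.jpg",  # made-up file name
    "iris_studies/study0_raw_images_v2/meta.csv",  # made-up sibling directory
    "iris_studies/study1_raw_images/meta.csv",
]


def in_directory(key: str, directory: str) -> bool:
    """Return True if `key` lies under `directory`, treated as a key prefix."""
    prefix = directory.rstrip("/") + "/"  # normalize to exactly one trailing slash
    return key.startswith(prefix)


with_slash = [k for k in keys if in_directory(k, "iris_studies/study0_raw_images")]
without_slash = [k for k in keys if k.startswith("iris_studies/study0_raw_images")]
print(len(with_slash), len(without_slash))
```

Without the trailing slash, the prefix also matches `study0_raw_images_v2`, which is why the query above ends the prefix with `/`.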

:::{dropdown} Filter & search on LaminHub

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/L188T2JjzZHWHfv2S0ib.png" width="700px">

:::

## Describe artifacts

Get an overview of what happened:

In [None]:
artifact.describe()

In [None]:
artifact.view_flow()

## Version artifacts

If you'd like to version an artifact or transform, either provide the `version` parameter when creating it or create new versions through `is_new_version_of`.

For instance:
```python
new_artifact = ln.Artifact(data, is_new_version_of=old_artifact)
```

Are there remaining questions about storing artifacts? If so, see: {doc}`docs:faq/storage`.

## Datasets

An artifact can model anything that's in storage: a file, a dataset, an array, a machine learning model.

Often, several artifacts together represent a dataset.

Let's store the artifact for `study0_data` as a {class}`~lamindb.Dataset`:

In [None]:
dataset = ln.Dataset(
    study0_data,
    name="Iris dataset",
    version="1",
    description="50 image files and metadata",
)
dataset

And save it:

In [None]:
dataset.save()

Now, we perform subsequent studies by collecting more data.

We'd like to keep track of their data as part of a growing versioned dataset:

In [None]:
artifacts = [study0_data]
for folder_name in ["study1_raw_images", "study2_raw_images"]:
    # create an artifact for the folder
    artifact = ln.Artifact(f"s3://lamindb-dev-datasets/iris_studies/{folder_name}")
    artifact.save()
    artifacts.append(artifact)
    # create a new version of the dataset
    dataset = ln.Dataset(
        artifacts, is_new_version_of=dataset, description="Another 50 images"
    )
    dataset.save()

See all artifacts:

In [None]:
ln.Artifact.df()

See all datasets:

In [None]:
ln.Dataset.df()

Most functionality that you just learned about artifacts - e.g., queries & provenance - also applies to {class}`~lamindb.Dataset`.

But `Dataset` is an abstraction over storing data in one or several artifacts and does not have a `key` field.

We'll learn more about datasets in the next part of the tutorial.
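
To trace which folders end up in each version of the growing dataset, the loop above can be replayed in plain Python. This sketch only tracks membership with lists; it makes no claim about how LaminDB stores or numbers versions.

```python
folders = ["study0_raw_images", "study1_raw_images", "study2_raw_images"]

versions = []  # each entry lists the folders contained in one dataset version
members = []
for folder in folders:
    members.append(folder)
    # Copy the list so that earlier versions are not changed by later appends
    versions.append(list(members))

for number, content in enumerate(versions, start=1):
    print(number, content)
```

Each new version contains everything from the previous one plus the newly registered folder, so the third version spans all three studies.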

## View changes

With {func}`~lamindb.view`, you can see the latest changes to the database:

In [None]:
ln.view()  # link tables in the database are not shown

## Save notebook & scripts

When you've completed the work on a notebook or script, you can save the source code and, for notebooks, an execution report to your storage location like so:

```shell
lamin save <file_path>  # e.g., my_script.py, my_notebook.ipynb
```

This enables you to query source code and report via `transform.source_code` and `transform.latest_report` and see it in the hub, e.g., [here](https://lamin.ai/laminlabs/lamindata/record/core/Transform?uid=NJvdsWWbJlZSz8).

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8aBoM.png" width="700px">

## Read on

Now, you already know about 6 out of 9 LaminDB core classes! The two most central are:

- {class}`~lamindb.Artifact`
- {class}`~lamindb.Dataset`

And the four registries related to provenance:

- {class}`~lamindb.Transform`: transforms of artifacts
- {class}`~lamindb.Run`: runs of transforms
- {class}`~lamindb.User`: users
- {class}`~lamindb.Storage`: storage locations like S3/GCP buckets or local directories

If you want to validate data, label artifacts, and manage features, read on: {doc}`/tutorial2`.