![](https://img.shields.io/badge/tutorial1/2-lightgrey)
[![](https://img.shields.io/badge/Source%20on%20GitHub-orange)](https://github.com/laminlabs/lamindb/blob/main/docs/tutorial.ipynb)
[![](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/lamindata/transform/NJvdsWWbJlZSz8)

# Tutorial: Artifacts

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

Biology is measured in samples that generate batches of data.

LaminDB provides a framework to transform these batches into more useful representations: validated, queryable datasets, machine learning models, and analytical insights.

The tutorial has two parts, each a Jupyter notebook:

1. {doc}`/tutorial` - register & access
2. {doc}`/tutorial2` - validate & annotate

## Setup

Install the `lamindb` Python package:
```shell
pip install 'lamindb[jupyter,aws]'
```

Init a LaminDB instance with a directory `./lamin-tutorial` for storing data:

In [None]:
!lamin init --storage ./lamin-tutorial  # or "s3://my-bucket" or "gs://my-bucket"

In [None]:
import lamindb as ln

ln.settings.verbosity = "hint"

:::{dropdown} What else can I configure during setup?

1. Instead of the default SQLite database, use PostgreSQL:
    ```python
    db=postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>
    ```
2. Instead of a default instance name derived from storage, provide a custom name:
    ```python
    name=myinstance
    ```
3. Beyond the core schema, use bionty and other schemas:
    ```python
    schema=bionty,custom1,template1
    ```

For more, see {doc}`/setup`.
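
The PostgreSQL connection string in option 1 follows the usual URL layout. As a minimal sketch (the helper `parse_db_url` and the example credentials are hypothetical and not part of LaminDB), the standard library can split such a string into its components to check it before passing it on:

```python
from urllib.parse import urlsplit


def parse_db_url(url: str) -> dict:
    """Split a postgresql://<user>:<pwd>@<hostname>:<port>/<dbname> string."""
    parts = urlsplit(url)
    if parts.scheme != "postgresql":
        raise ValueError(f"expected a postgresql:// URL, got {parts.scheme!r}")
    return {
        "user": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,  # parsed as an int, or None if omitted
        "dbname": parts.path.lstrip("/"),
    }


# Hypothetical example values, for illustration only.
config = parse_db_url("postgresql://alice:secret@localhost:5432/lamin")
```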

:::

## Track a data source

The code that generates a dataset is a transform ({class}`~lamindb.Transform`). It could be a script, a notebook, a pipeline or a UI interaction like an upload.

Let's track the notebook that's being run:

In [None]:
ln.settings.transform.stem_uid = "NJvdsWWbJlZS"
ln.settings.transform.version = "0"
ln.track()

By calling {func}`~lamindb.track`, the notebook is automatically linked as the source of all data that's about to be saved!

:::{dropdown} What happened under the hood?

1. Imported package versions of the current notebook were detected
2. Notebook metadata was detected and stored in a {class}`~lamindb.Transform` record
3. Run metadata was detected and stored in a {class}`~lamindb.Run` record

The {class}`~lamindb.Transform` class registers data transformations: a notebook, a pipeline or a UI operation.

The {class}`~lamindb.Run` class registers executions of transforms. Several runs can be linked to the same transform if executed with different context (time, user, input data, etc.).
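
The one-to-many relationship between a transform and its runs can be pictured with a small, self-contained sketch. The `TransformSketch` and `RunSketch` dataclasses below are hypothetical stand-ins written for illustration; they are not LaminDB's classes and only model the link described above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TransformSketch:
    """Hypothetical stand-in for a transform record (not the LaminDB class)."""

    name: str
    version: str
    runs: list = field(default_factory=list, repr=False)


@dataclass
class RunSketch:
    """Hypothetical stand-in for a run record: one execution of a transform."""

    transform: TransformSketch
    user: str
    started_at: datetime


def start_run(transform: TransformSketch, user: str) -> RunSketch:
    # Every execution creates a new run that points back to the same transform.
    run = RunSketch(transform, user, datetime.now(timezone.utc))
    transform.runs.append(run)
    return run


notebook = TransformSketch(name="Tutorial: Artifacts", version="0")
first = start_run(notebook, user="anonymous")
second = start_run(notebook, user="anonymous")
```

After two executions, `notebook.runs` holds two distinct runs that differ in context (here, only the start time) but share the same transform.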

:::

:::{dropdown} How do I track a pipeline instead of a notebook?

```python
transform = ln.Transform(name="My pipeline", version="1.2.0")
ln.track(transform)
```

:::

:::{dropdown} Why should I care about tracking notebooks?

If you can, avoid interactive notebooks: anything that can be a deterministic pipeline should be a pipeline.

However, much of the insight generated from biological data comes from computational biologists _interacting_ with it.

A notebook that's run a single time on specific data is not a pipeline: it's a (versioned) **document** that produced insight or some other form of data representation (with parallels to an ELN in the wetlab).

Because humans are in the loop, most mistakes happen when using notebooks: {func}`~lamindb.track` helps avoid some of them.

(An early blog post on this is [here](https://lamin.ai/blog/2022/nbproject).)

:::

## Manage artifacts

We'll work with a toy collection of image files and transform them into higher-level features for downstream analysis.

(For other data types: see {doc}`docs:by-datatype`.)

Consider 3 directories storing images & metadata of Iris flowers, generated in 3 subsequent studies:

In [None]:
ln.UPath("s3://lamindb-dev-datasets/iris_studies").view_tree()

Our goal is to turn these directories into a validated & queryable dataset that can be used alongside many other datasets.

### Register an artifact

LaminDB uses the {class}`~lamindb.Artifact` class to manage datasets & models that are stored as files, folders, or arrays. {class}`~lamindb.Artifact` is a registry to manage search, queries, validation & storage access.

Let's create a {class}`~lamindb.Artifact` object from one of the files:

In [None]:
artifact = ln.Artifact(
    "s3://lamindb-dev-datasets/iris_studies/study0_raw_images/meta.csv"
)
artifact

:::{dropdown} Which fields are populated when creating an artifact record?

Basic fields:

- `uid`: universal ID
- `key`: storage key, a relative path of the artifact in `storage`
- `description`: an optional string description
- `storage`: the storage location (the root, say, an S3 bucket or a local directory)
- `suffix`: an optional file/path suffix
- `size`: the artifact size in bytes
- `hash`: a hash useful to check for integrity and collisions (is this artifact already stored?)
- `hash_type`: the type of the hash (usually, an MD5 or SHA1 checksum)
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related fields:

- `created_by`: the {class}`~lamindb.User` who created the artifact
- `transform`: the {class}`~lamindb.Transform` (pipeline, notebook, instrument, app) that was run
- `run`: the {class}`~lamindb.Run` of the transform that created the artifact

For a full reference, see {class}`~lamindb.Artifact`.
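
To make `suffix`, `size`, and `hash` concrete, here is a minimal sketch that derives comparable values for a local file with the standard library. The helper `describe_file` and the file contents are hypothetical, and the plain hexadecimal MD5 digest is an assumption for illustration; LaminDB's own hash encoding may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Compute a few artifact-like fields for a local file (illustrative only)."""
    content = path.read_bytes()
    return {
        "suffix": path.suffix,  # file suffix, including the dot
        "size": len(content),  # size in bytes
        "hash": hashlib.md5(content).hexdigest(),  # checksum of the content
        "hash_type": "md5",
    }


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "meta.csv"
    original.write_bytes(b"id,species\n0,setosa\n")
    duplicate = Path(tmp) / "meta_copy.csv"
    duplicate.write_bytes(b"id,species\n0,setosa\n")

    fields = describe_file(original)
    # Identical content yields an identical hash, regardless of the file name.
    same_content = fields["hash"] == describe_file(duplicate)["hash"]
```

Because the hash depends only on the content, comparing hashes is enough to answer the "is this artifact already stored?" question from the list above.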

:::

Upon `.save()`, artifact metadata is written to the database:

In [None]:
artifact.save()

:::{dropdown} What happens during save?

In the database: An artifact record is inserted into the `artifact` registry. If the artifact record exists already, it's updated.

In storage:
- If the default storage is in the cloud, `.save()` triggers an upload for a local artifact.
- If the artifact is already in a registered storage location, only the metadata of the record is saved to the `artifact` registry.

:::

We can get an overview of all artifacts in the database by calling {meth}`~lamindb.core.Registry.df`:

In [None]:
ln.Artifact.df()

### View data lineage

Visualize data lineage with {meth}`~lamindb.core.Data.view_lineage`:

In [None]:
artifact.view_lineage()

Or directly access its linked {class}`~lamindb.Transform` & {class}`~lamindb.Run` records:

In [None]:
artifact.transform

In [None]:
artifact.run

(For a comprehensive example with data lineage through UI uploads, pipelines & notebooks of multiple data types, see {doc}`docs:project-flow`.)

### Access an artifact

{attr}`~lamindb.Artifact.path` gives you the file path, a {class}`~lamindb.UPath` object:

In [None]:
artifact.path

Typically, your artifact is in cloud storage - to cache it locally, call {meth}`~lamindb.Artifact.cache`:

In [None]:
artifact.cache()

Many artifacts have default in-memory representations. To load artifacts into memory with a default loader, call {meth}`~lamindb.Artifact.load`: 

In [None]:
df = artifact.load(index_col=0)
df.head()

If the data is large, you'll likely want to query it via {meth}`~lamindb.Artifact.backed` or shard the data across many array-like artifacts. For more on this, see: {doc}`data`.

:::{dropdown} How do I update an artifact?

If you'd like to update metadata:
```python
artifact.description = "My new description"
artifact.save()  # save the change to the database
```

If you'd like to replace the underlying stored object, use {meth}`~lamindb.Artifact.replace`.

:::


### Register directories as artifacts

Register an entire folder as an artifact:

In [None]:
study0_data = ln.Artifact("s3://lamindb-dev-datasets/iris_studies/study0_raw_images")
study0_data.save()
ln.Artifact.df()

### Filter & search artifacts

You can search artifacts directly based on the {class}`~lamindb.Artifact` registry:

In [None]:
ln.Artifact.search("meta").df().head()

You can also query & search artifacts by any combination of metadata.

For instance, look up a user with auto-complete from the {class}`~lamindb.User` registry:

In [None]:
users = ln.User.lookup()
users.anonymous

:::{dropdown} How do I act non-anonymously?

1. [Sign up](https://lamin.ai/signup) for a free account (see more [info](https://lamin.ai/docs/setup)) and copy the API key.
2. Log in on the command line:
   ```shell
   lamin login <email> --key <API-key>
   ```
:::

Filter the {class}`~lamindb.Transform` registry for a name:

In [None]:
transform = ln.Transform.filter(
    name__icontains="Artifacts"
).one()  # get exactly one result
transform

:::{dropdown} What does a double underscore mean?

For any field, the double underscore defines a comparator, e.g.,

* `name__icontains="Martha"`: `name` contains `"Martha"` when ignoring case
* `name__startswith="Martha"`: `name` starts with `"Martha"`
* `name__in=["Martha", "John"]`: `name` is `"John"` or `"Martha"`

For more info, see: {doc}`meta`.
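
To illustrate how such `field__comparator` keywords can be interpreted, here is a minimal pure-Python sketch that applies the three comparators above to a list of dictionaries. The function `matches` and the sample records are hypothetical; this is not how LaminDB evaluates queries, which run against the database.

```python
def matches(record: dict, **lookups) -> bool:
    """Return True if `record` satisfies every `field__comparator=value` lookup."""
    comparators = {
        "exact": lambda actual, expected: actual == expected,
        "icontains": lambda actual, expected: expected.lower() in actual.lower(),
        "startswith": lambda actual, expected: actual.startswith(expected),
        "in": lambda actual, expected: actual in expected,
    }
    for lookup, expected in lookups.items():
        # Split "name__icontains" into the field "name" and the comparator
        # "icontains"; a bare field name falls back to exact equality.
        field_name, _, comparator = lookup.partition("__")
        if not comparators[comparator or "exact"](record[field_name], expected):
            return False
    return True


records = [{"name": "Martha"}, {"name": "John"}, {"name": "martha-2"}]

icontains = [r["name"] for r in records if matches(r, name__icontains="martha")]
startswith = [r["name"] for r in records if matches(r, name__startswith="Martha")]
isin = [r["name"] for r in records if matches(r, name__in=["Martha", "John"])]
```

Here `icontains` matches both `"Martha"` and `"martha-2"`, while `startswith` is case-sensitive and matches only `"Martha"`.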

:::

Use these results to filter the {class}`~lamindb.Artifact` registry:

In [None]:
ln.Artifact.filter(
    created_by=users.anonymous,
    transform=transform,
    suffix=".csv",
).df().head()

You can also query for directories using `key__startswith` (like AWS S3, LaminDB treats a directory as a prefix of the storage `key`):

In [None]:
ln.Artifact.filter(key__startswith="iris_studies/study0_raw_images/").df().head()
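
The prefix idea can be sketched without any database. In the example below, the helper `list_directory` and all keys except the two `meta.csv` paths from the Iris studies are hypothetical. It also shows why the trailing slash in the query above matters.

```python
def list_directory(keys: list[str], directory: str) -> list[str]:
    """Return all keys stored under `directory`, treating it as a key prefix."""
    # Normalize to exactly one trailing slash so that "study0_raw_images" does
    # not also match a sibling such as "study0_raw_images_backup".
    prefix = directory.rstrip("/") + "/"
    return [key for key in keys if key.startswith(prefix)]


keys = [
    "iris_studies/study0_raw_images/meta.csv",
    "iris_studies/study0_raw_images/iris-0.jpg",  # hypothetical file name
    "iris_studies/study0_raw_images_backup/meta.csv",  # hypothetical sibling
    "iris_studies/study1_raw_images/meta.csv",
]

study0 = list_directory(keys, "iris_studies/study0_raw_images")
everything = list_directory(keys, "iris_studies/")
```

Without the trailing slash, a plain `startswith("iris_studies/study0_raw_images")` would also return the key in the `_backup` folder.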

```{note}

You can look up, filter & search any registry ({class}`~lamindb.core.Registry`).

You can chain {meth}`~lamindb.core.Registry.filter` statements and {meth}`~lamindb.core.QuerySet.search`: `ln.Artifact.filter(suffix=".jpg").search("my image")`

An empty filter returns the entire registry: `ln.Artifact.filter()`
```

For more info, see: {doc}`meta`.

:::{dropdown} Filter & search on LaminHub

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/L188T2JjzZHWHfv2S0ib.png" width="700px">

:::

## Describe artifacts

Get an overview of what happened:

In [None]:
artifact.describe()

In [None]:
artifact.view_lineage()

## Version artifacts

If you'd like to version an artifact or transform, either provide the `version` parameter when creating it or create new versions through `is_new_version_of`.

For instance:
```python
new_artifact = ln.Artifact(data, is_new_version_of=old_artifact)
```

If you'd like to add a registered artifact to a version family, use `add_to_version_family`.

For instance:
```python
new_artifact.add_to_version_family(old_artifact)
```

Are there remaining questions about storing artifacts? If so, see: {doc}`docs:faq/storage`.

## Collections

Often, several artifacts together represent a collection.

Let's seed a growing {class}`~lamindb.Collection` of artifacts:

In [None]:
collection = ln.Collection(
    study0_data,
    name="Iris collection",
    version="1",
    description="Iris study 0",
)
collection.save()

Now, we collect more data in subsequent studies.

We want to keep track of their data as part of a growing versioned collection:

In [None]:
artifacts = [study0_data]
for folder_name in ["study1_raw_images", "study2_raw_images"]:
    # create an artifact for the folder
    artifact = ln.Artifact(f"s3://lamindb-dev-datasets/iris_studies/{folder_name}")
    artifact.save()
    artifacts.append(artifact)
    # create a new version of the collection
    collection = ln.Collection(
        artifacts, is_new_version_of=collection, description=f"Now includes {folder_name}"
    )
    collection.save()
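
The pattern in this loop, where each new version keeps a link to its predecessor and holds its own snapshot of artifacts, can be sketched in plain Python. `CollectionSketch` and `new_version` are hypothetical stand-ins, not LaminDB classes, and bumping integer version strings by one is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollectionSketch:
    """Hypothetical stand-in for one immutable version of a collection."""

    artifacts: tuple
    version: str
    description: str
    previous: Optional["CollectionSketch"] = None


def new_version(previous: CollectionSketch, artifacts, description: str) -> CollectionSketch:
    # Assumption for this sketch: versions are integer strings bumped by one.
    version = str(int(previous.version) + 1)
    # Store a tuple so later appends to the caller's list don't alter this version.
    return CollectionSketch(tuple(artifacts), version, description, previous)


artifacts = ["study0_raw_images"]
collection = CollectionSketch(tuple(artifacts), "1", "Iris study 0")
for folder_name in ["study1_raw_images", "study2_raw_images"]:
    artifacts.append(folder_name)
    collection = new_version(collection, artifacts, f"Now includes {folder_name}")

# Walk back from the latest version to the first one.
history = []
node = collection
while node is not None:
    history.append((node.version, len(node.artifacts)))
    node = node.previous
```

Walking the `previous` links yields versions `"3"`, `"2"`, and `"1"` with 3, 2, and 1 artifacts respectively: earlier versions stay unchanged while the collection grows.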

See all artifacts:

In [None]:
ln.Artifact.df()

See all collections:

In [None]:
ln.Collection.df()

Most functionality that you just learned about artifacts - e.g., queries & provenance - also applies to {class}`~lamindb.Collection`.

Collections become powerful if you directly leverage them for training models: {doc}`docs:scrna5`.

## View changes

With {func}`~lamindb.view`, you can see the latest changes to the database:

In [None]:
ln.view()  # link tables in the database are not shown

## Save notebook & scripts

When you've completed work on a notebook or script, you can save the source code and, for notebooks, an execution report to your storage location like so:

```python
ln.finish()
```

This lets you query the execution report & source code via `transform.latest_report` and `transform.source_code`.

If you registered the instance on LaminHub, you can share it like [here](https://lamin.ai/laminlabs/lamindata/transform/NJvdsWWbJlZSz8).

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8aBoM.png" width="700px">


## Get notebooks & scripts

If you want to cache a notebook or script, call:

```shell
lamin get https://lamin.ai/laminlabs/lamindata/transform/NJvdsWWbJlZSz8
```


## Read on

Now, you already know about 6 out of 9 LaminDB core classes! The two most central are:

- {class}`~lamindb.Artifact`
- {class}`~lamindb.Collection`

And the four registries related to provenance:

- {class}`~lamindb.Transform`: transforms of artifacts
- {class}`~lamindb.Run`: runs of transforms
- {class}`~lamindb.User`: users
- {class}`~lamindb.Storage`: storage locations like S3/GCP buckets or local directories

If you want to validate data, label artifacts, and manage features, read on: {doc}`/tutorial2`.