[![stars](https://img.shields.io/github/stars/laminlabs/lamindb?logo=GitHub&color=yellow)](https://github.com/laminlabs/lamindb)
[![pypi](https://img.shields.io/pypi/v/lamindb?color=blue&label=pypi%20package)](https://pypi.org/project/lamindb)
[![cran](https://www.r-pkg.org/badges/version/laminr?color=green)](https://cran.r-project.org/package=laminr)

# Introduction

```{include} includes/preface.md

```

## Track notebooks & scripts

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQck.svg" width="330px" style="background: transparent; margin-top: 1em;" align="right">

LaminDB provides a framework to transform datasets into more useful representations: validated & queryable datasets, machine learning models, and analytical insights. The transformations can be notebooks, scripts, pipelines, or functions.

The metadata involved in this process are stored in a _LaminDB instance_, a database that manages datasets in storage. For the following walkthrough of LaminDB's core features, we'll work with a local instance.

::::{tab-set}
:::{tab-item} Py
:sync: python
```bash
lamin init --storage ./lamin-intro --modules bionty
```
:::
:::{tab-item} R
:sync: r
```R
library(laminr)
lamin_init(storage = "./laminr-intro", modules = c("bionty"))
```
:::
::::

In [None]:
!lamin init --storage ./lamin-intro --modules bionty

:::{dropdown} What else can I configure during setup?

1. You can pass a cloud storage location to `--storage` (S3, GCP, R2, HF, etc.):
    ```bash
    --storage s3://my-bucket
    ```
2. Instead of the default SQLite database, pass a Postgres connection string to `--db`:
    ```bash
    --db postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>
    ```
3. Instead of a default instance name derived from the storage location, provide a custom name:
    ```bash
    --name my-name
    ```
4. Mount additional schema modules:
    ```bash
    --modules bionty,wetlab,custom1
    ```

For more info, see {doc}`/setup`.

:::

:::{dropdown} If you decide to connect your instance to the hub, you will see data & metadata in a UI.

<a href="https://lamin.ai/laminlabs/lamindata">
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/YuefPQlAfeHcQvtq0000.png" width="700px">
</a>

:::

Let's now track the notebook that's being run.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
import lamindb as ln

ln.track()  # track the current notebook or script
```
:::
:::{tab-item} R
:sync: r
```R
library(laminr)
ln <- import_module("lamindb")  # instantiate the central `ln` object of the API

ln$track()  # track a run of your notebook or script
```
:::
::::

In [None]:
import lamindb as ln

ln.track()  # track the current notebook or script

By calling {meth}`~lamindb.track`, the notebook gets automatically linked as the source of all data that's about to be saved! You can see all your transforms and their runs in the {class}`~lamindb.Transform` and {class}`~lamindb.Run` registries.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
ln.Transform.df()
```
:::
:::{tab-item} R
:sync: r
```R
ln$Transform$df()
```
:::
::::

In [None]:
ln.Transform.df()

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
ln.Run.df()
```
:::
:::{tab-item} R
:sync: r
```R
ln$Run$df()
```
:::
::::

In [None]:
ln.Run.df()

:::{dropdown} What happened under the hood?

1. The full run environment and the imported package versions of the current notebook were detected
2. Notebook metadata was detected and stored in a {class}`~lamindb.Transform` record with a unique id
3. Run metadata was detected and stored in a {class}`~lamindb.Run` record with a unique id

The {class}`~lamindb.Transform` registry stores data transformations: scripts, notebooks, pipelines, functions.

The {class}`~lamindb.Run` registry stores executions of transforms. Many runs can be linked to the same transform if executed with different context (time, user, input data, etc.).

:::
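
To make the relationship between the two registries concrete, here is a minimal plain-Python sketch. The `ToyTransform` and `ToyRun` classes and the `toy_track` function are hypothetical stand-ins, not LaminDB's implementation; they only illustrate that many runs can point to the same transform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

_run_ids = count(1)  # hypothetical id generator, not LaminDB's uid scheme


@dataclass
class ToyTransform:
    """A piece of code: a notebook, script, pipeline, or function."""

    key: str
    # exclude from repr/compare to avoid walking the transform <-> run cycle
    runs: list = field(default_factory=list, repr=False, compare=False)


@dataclass
class ToyRun:
    """One execution of a transform in a specific context."""

    transform: ToyTransform
    user: str
    started_at: datetime
    id: int = field(default_factory=lambda: next(_run_ids))


def toy_track(transform: ToyTransform, user: str) -> ToyRun:
    """Record a new run and link it to its transform."""
    run = ToyRun(transform=transform, user=user, started_at=datetime.now(timezone.utc))
    transform.runs.append(run)
    return run


notebook = ToyTransform(key="introduction.ipynb")
run1 = toy_track(notebook, user="alice")  # made-up user names
run2 = toy_track(notebook, user="bob")
print(len(notebook.runs))
```

Executing the same notebook twice yields one transform with two runs, which is the one-to-many relationship described above.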

:::{dropdown} How do I track a pipeline instead of a notebook?

Leverage a pipeline integration, see: {doc}`/pipelines`. Or manually add code as seen below.

```python
transform = ln.Transform(name="My pipeline")
transform.version = "1.2.0"  # tag the version
ln.track(transform)
```

:::

:::{dropdown} Why should I care about tracking notebooks?

Because notebooks are interactive and keep humans in the loop, most mistakes happen when using them.

{func}`~lamindb.track` makes notebooks & derived results reproducible & auditable, enabling you to learn from mistakes.

This is important as much insight generated from biological data is driven by computational biologists _interacting_ with it. An early blog post on this is [here](https://blog.lamin.ai/nbproject).

:::

:::{dropdown} Is this compliant with OpenLineage?

Yes. What OpenLineage calls a "job", LaminDB calls a "transform". What OpenLineage calls a "run", LaminDB calls a "run".

:::

## Manage artifacts

The {class}`~lamindb.Artifact` class manages datasets & models that are stored as files, folders, or arrays. {class}`~lamindb.Artifact` is a registry to manage search, queries, validation & storage access. 

You can register data objects (`DataFrame`, `AnnData`, ...) and files or folders in local storage, AWS S3 (`s3://`), Google Cloud (`gs://`), Hugging Face (`hf://`), or any other file system supported by `fsspec`.

### Manage Dataframes

Let's first look at an example dataframe.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
df = ln.core.datasets.small_dataset1(with_typo=True)
df
```
:::
:::{tab-item} R
:sync: r
```R
df <- ln$core$datasets$small_dataset1(otype = "DataFrame", with_typo = TRUE)
df
```
:::
::::

In [None]:
df = ln.core.datasets.small_dataset1(with_typo=True)
df

This is how you create an artifact from a dataframe.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()
artifact.describe()
```
:::
:::{tab-item} R
:sync: r
```R
artifact <- ln$Artifact$from_df(df, key = "my_datasets/rnaseq1.parquet")$save()
artifact$describe()
```
:::
::::

In [None]:
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()
artifact.describe()

And this is how you load it back into memory.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact.load()
```
:::
:::{tab-item} R
:sync: r
```R
artifact$load()
```
:::
::::

In [None]:
artifact.load()

### Trace data lineage

You can understand where an artifact comes from by looking at its {class}`~lamindb.Transform` & {class}`~lamindb.Run` records:

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact.transform
```
:::
:::{tab-item} R
:sync: r
```R
artifact$transform
```
:::
::::

In [None]:
artifact.transform

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact.run
```
:::
:::{tab-item} R
:sync: r
```R
artifact$run
```
:::
::::

In [None]:
artifact.run

Or visualize deeper data lineage with the `view_lineage()` method. Here we're only one step deep.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact.view_lineage()
```
:::
:::{tab-item} R
:sync: r
```R
artifact$view_lineage()
```
:::
::::

In [None]:
artifact.view_lineage()

:::::{dropdown} Show me a more interesting example, please!

::::{tab-set}
:::{tab-item} Py

Explore and load the notebook from [here](https://lamin.ai/laminlabs/lamindata/transform/F4L3oC6QsZvQ0002).

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Yk0003.png" width="800">
:::
:::{tab-item} Hub

Explore data lineage interactively [here](https://lamin.ai/laminlabs/lamindata/artifact/W1AiST5wLrbNEyVq0000).

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/0bXenaC9F24iP3Iy0000.png" width="800">
:::
::::

:::::

:::{dropdown} I just want to see the transforms.

```python
artifact.transform.view_lineage()  # Python only
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMO0001.png" width="400">

:::

Data lineage also helps to understand what a dataset is being used for. Many datasets are being used over and over for different purposes.

Once you're done, at the end of your notebook or script, call {meth}`~lamindb.finish`. Here, we're not yet done so we're commenting it out.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
# ln.finish()  # mark run as finished, save execution report, source code & environment
```
:::
:::{tab-item} R
:sync: r
```R
# ln$finish()  # mark run as finished, save execution report & source code
```

If you did _not_ use RStudio's notebook mode, you have to render an HTML file externally.

1. Render the notebook to HTML via one of:

    - In RStudio, click the "Knit" button
    - From the command line, run

        ```bash
        Rscript -e 'rmarkdown::render("introduction.Rmd")'
        ```

    - Use the `rmarkdown` package in R

        ```r
        rmarkdown::render("introduction.Rmd")
        ```

2. Save it to your LaminDB instance via one of:

    - Using the `lamin_save()` function in R

        ```r
        lamin_save("introduction.Rmd")
        ```

    - Using the `lamin` CLI

        ```bash
        lamin save introduction.Rmd
        ```

:::
::::

:::{dropdown} Here is how a notebook looks on the hub.

[Explore](https://lamin.ai/laminlabs/lamindata/transform/PtTXoc0RbOIq65cN).

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J80003.png" width="900px">

To create a new version of a notebook or script, run `lamin load` in the terminal, e.g.,

```bash
$ lamin load https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0004
→ notebook is here: mcfarland_2020_preparation.ipynb
```

:::

### Manage versioning

Just like transforms, artifacts are versioned. Let's create a new version by revising the dataset.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
# keep the dataframe with a typo around - we'll need it later
df_typo = df.copy()

# fix the "IFNJ" typo
df["perturbation"] = df["perturbation"].cat.rename_categories({"IFNJ": "IFNG"})

# create a new version
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()

# see all versions of an artifact
artifact.versions.df()
```
:::
:::{tab-item} R
:sync: r
```R
# keep the dataframe with a typo around - we'll need it later
df_typo <- df

# fix the "IFNJ" typo
levels(df$perturbation) <- c("DMSO", "IFNG")
df["sample2", "perturbation"] <- "IFNG"

# create a new version
artifact <- ln$Artifact$from_df(df, key = "my_datasets/rnaseq1.parquet")$save()

# see all versions of an artifact
artifact$versions$df()
```
:::
::::

In [None]:
# keep the dataframe with a typo around - we'll need it later
df_typo = df.copy()

# fix the "IFNJ" typo
df["perturbation"] = df["perturbation"].cat.rename_categories({"IFNJ": "IFNG"})

# create a new version
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()

# see all versions of an artifact
artifact.versions.df()

:::{dropdown} Can I also create new versions without passing `key`?

That works, too, you can use `revises`:

```python
artifact_v1 = ln.Artifact.from_df(df, description="Just a description").save()
# below revises artifact_v1
artifact_v2 = ln.Artifact.from_df(df_updated, revises=artifact_v1).save()
```

<br>

The good thing about passing `revises: Artifact` is that you don't need to worry about coming up with naming conventions for paths.

The good thing about versioning based on `key` is that it's how most data versioning tools do it.

:::
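
The two versioning strategies can be sketched in a few lines of plain Python. This is a toy model under stated assumptions: `ToyArtifact` and `ToyRegistry` are hypothetical names, and real LaminDB version handling (uids, version strings, hashing) is more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class ToyArtifact:
    content: str
    key: Optional[str] = None
    revises: Optional["ToyArtifact"] = None

    @property
    def version_number(self) -> int:
        """Count how many artifacts precede this one in the revision chain."""
        number, node = 1, self
        while node.revises is not None:
            number += 1
            node = node.revises
        return number


class ToyRegistry:
    def __init__(self):
        self._latest_by_key = {}

    def save(self, artifact: ToyArtifact) -> ToyArtifact:
        if artifact.key is not None:
            if artifact.revises is None:
                # key-based versioning: revise the latest artifact under this key
                artifact.revises = self._latest_by_key.get(artifact.key)
            self._latest_by_key[artifact.key] = artifact
        return artifact


registry = ToyRegistry()

# strategy 1: same key, new content
v1 = registry.save(ToyArtifact("IFNJ", key="my_datasets/rnaseq1.parquet"))
v2 = registry.save(ToyArtifact("IFNG", key="my_datasets/rnaseq1.parquet"))

# strategy 2: no key, explicit `revises`
a1 = registry.save(ToyArtifact("first draft"))
a2 = registry.save(ToyArtifact("second draft", revises=a1))

print(v2.version_number, a2.version_number)
```

In both cases the newer record points back to the one it revises; the only difference is whether the link is inferred from the key or passed explicitly.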

### Manage files & folders

Let's look at a folder in the cloud that contains 3 sub-folders storing images & metadata of Iris flowers, generated in 3 subsequent studies.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
# we use anon=True here in case no aws credentials are configured
ln.UPath("s3://lamindata/iris_studies", anon=True).view_tree()
```
:::
:::{tab-item} R
:sync: r
```R
# we use anon=True here in case no aws credentials are configured
ln$UPath("s3://lamindata/iris_studies", anon = TRUE)$view_tree()
```
:::
::::

In [None]:
# we use anon=True here in case no aws credentials are configured
ln.UPath("s3://lamindata/iris_studies", anon=True).view_tree()

Let's create an artifact for the first sub-folder.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact = ln.Artifact("s3://lamindata/iris_studies/study0_raw_images").save()
artifact
```
:::
:::{tab-item} R
:sync: r
```R
artifact = ln$Artifact("s3://lamindata/iris_studies/study0_raw_images")$save()
artifact
```
:::
::::

In [None]:
artifact = ln.Artifact("s3://lamindata/iris_studies/study0_raw_images").save()
artifact

As you see from {attr}`~lamindb.Artifact.path`, the folder was merely registered in its present storage location without copying it.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact.path
```
:::
:::{tab-item} R
:sync: r
```R
artifact$path
```
:::
::::

In [None]:
artifact.path

LaminDB keeps track of all your storage locations.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
ln.Storage.df()
```
:::
:::{tab-item} R
:sync: r
```R
ln$Storage$df()
```
:::
::::

In [None]:
ln.Storage.df()

To cache the cloud folder locally, call {meth}`~lamindb.Artifact.cache`.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
artifact.cache()
```
:::
:::{tab-item} R
:sync: r
```R
artifact$cache()
```
:::
::::

In [None]:
artifact.cache()

If the data is large, you might not want to download but stream it via {meth}`~lamindb.Artifact.open`. For more on this, see: {doc}`arrays`.

:::{dropdown} How do I update or delete an artifact?

```python
artifact.description = "My new description"  # change description
artifact.save()  # save the change to the database
artifact.delete()  # move to trash
artifact.delete(permanent=True)  # permanently delete
```

:::

:::{dropdown} How do I create an artifact for a local file or folder?

Source path is local:

```python
ln.Artifact("./my_data.fcs", key="my_data.fcs")
ln.Artifact("./my_images/", key="my_images")
```
<br>

Upon `artifact.save()`, the source path will be copied or uploaded into your instance's current storage, visible & changeable via `ln.settings.storage`.

If the source path is remote _or_ already in a registered storage location (one that's registered in `ln.Storage`), `artifact.save()` will _not_ trigger a copy or upload but register the existing path.

```python
ln.Artifact("s3://my-bucket/my_data.fcs")  # key is auto-populated from S3, you can optionally pass a description
ln.Artifact("s3://my-bucket/my_images/")  # key is auto-populated from S3, you can optionally pass a description
```
<br>
You can use any storage location supported by `fsspec`.

:::
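
The rule for when `artifact.save()` copies data and when it merely registers a path can be sketched as a small decision function. This is a simplified illustration, not LaminDB's code: `toy_plan_save`, the set of remote schemes, and the example paths are assumptions.

```python
from urllib.parse import urlparse

# illustrative subset of remote schemes; fsspec supports many more
REMOTE_SCHEMES = {"s3", "gs", "hf"}


def is_remote(path: str) -> bool:
    """Treat a path as remote if it carries one of the known URL schemes."""
    return urlparse(path).scheme in REMOTE_SCHEMES


def toy_plan_save(source: str, registered_roots: list) -> str:
    """Return 'register' if the data stays where it is, else 'copy-or-upload'."""
    if is_remote(source):
        return "register"
    for root in registered_roots:
        # the source already lives inside a registered storage location
        if source.startswith(root.rstrip("/") + "/"):
            return "register"
    return "copy-or-upload"


roots = ["/data/lamin-intro"]  # hypothetical registered storage location
print(toy_plan_save("./my_data.fcs", roots))
print(toy_plan_save("s3://my-bucket/my_data.fcs", roots))
print(toy_plan_save("/data/lamin-intro/my_data.fcs", roots))
```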

:::{dropdown} Which fields are populated when creating an artifact record?

Basic fields:

- `uid`: universal ID
- `key`: a (virtual) relative path of the artifact in `storage`
- `description`: an optional string description
- `storage`: the storage location (the root, say, an S3 bucket or a local directory)
- `suffix`: an optional file/path suffix
- `size`: the artifact size in bytes
- `hash`: a hash useful to check for integrity and collisions (is this artifact already stored?)
- `hash_type`: the type of the hash
- `created_at`: time of creation
- `updated_at`: time of last update

Provenance-related fields:

- `created_by`: the {class}`~lamindb.User` who created the artifact
- `run`: the {class}`~lamindb.Run` of the {class}`~lamindb.Transform` that created the artifact

For a full reference, see {class}`~lamindb.Artifact`.

:::

:::{dropdown} What exactly happens during save?

In the database: An artifact record is inserted into the `Artifact` registry. If the artifact record exists already, it's returned.

In storage:
- If the default storage is in the cloud, `.save()` triggers an upload for a local artifact.
- If the artifact is already in a registered storage location, only the metadata of the record is saved to the `artifact` registry.

:::
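
The "if the artifact record exists already, it's returned" behavior can be pictured as a lookup by content hash. The sketch below is a toy under assumptions: `ToyArtifactRegistry` is a hypothetical class, and SHA-256 is chosen only for illustration (LaminDB records its own `hash_type`).

```python
import hashlib


class ToyArtifactRegistry:
    def __init__(self):
        self._by_hash = {}

    def save(self, key: str, data: bytes) -> dict:
        digest = hashlib.sha256(data).hexdigest()
        existing = self._by_hash.get(digest)
        if existing is not None:
            # same content was saved before: return the existing record
            return existing
        record = {
            "key": key,
            "size": len(data),  # size in bytes
            "hash": digest,
            "hash_type": "sha256",
        }
        self._by_hash[digest] = record
        return record


registry = ToyArtifactRegistry()
first = registry.save("my_data.txt", b"hello")
duplicate = registry.save("copy_of_my_data.txt", b"hello")
other = registry.save("other.txt", b"world")
print(duplicate is first, other is first)
```

The second call does not create a new record because the content hash already exists in the registry.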

:::{dropdown} How does LaminDB compare to AWS S3?

LaminDB provides a database on top of AWS S3 (or GCP storage, file systems, etc.).

Similar to organizing files with paths, you can organize artifacts using the `key` parameter of {class}`~lamindb.Artifact`.

However, you'll see that you can more conveniently query data by entities you care about: people, code, experiments, genes, proteins, cell types, etc.

:::

:::{dropdown} Are artifacts aware of array-like data?

Yes.

You can make artifacts from paths referencing array-like objects:

```python
ln.Artifact("./my_anndata.h5ad", key="my_anndata.h5ad")
ln.Artifact("./my_zarr_array/", key="my_zarr_array")
```

Or from in-memory objects:

```python
ln.Artifact.from_df(df, key="my_dataframe.parquet")
ln.Artifact.from_anndata(adata, key="my_anndata.h5ad")
```

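You can open large artifacts for slicing from the cloud or load small artifacts directly into memory via:

```python
artifact.open()  # open a large artifact for slicing from the cloud
artifact.load()  # load a small artifact directly into memory
```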

:::

## Query & search registries

To get an overview of all artifacts in your instance, call {class}`~lamindb.models.Record.df`.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
ln.Artifact.df()
```
:::
:::{tab-item} R
:sync: r
```R
ln$Artifact$df()
```
:::
::::

In [None]:
ln.Artifact.df()

LaminDB's central classes are registries that store records ({class}`~lamindb.models.Record` objects). If you want to see the fields of a registry, look at the class or auto-complete.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
ln.Artifact
```
:::
:::{tab-item} R
:sync: r
```R
ln$Artifact
```
:::
::::

In [None]:
ln.Artifact

Each registry is a table in the relational schema of the underlying database. With {func}`~lamindb.view`, you can see the latest changes to the database.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
ln.view()
```
:::
:::{tab-item} R
:sync: r
```R
ln$view()
```
:::
::::

In [None]:
ln.view()

:::{dropdown} Which registries have I already learned about? 🤔

- {class}`~lamindb.Artifact`: datasets & models stored as files, folders, or arrays
- {class}`~lamindb.Transform`: transforms of artifacts
- {class}`~lamindb.Run`: runs of transforms
- {class}`~lamindb.User`: users
- {class}`~lamindb.Storage`: local or cloud storage locations

:::

Every registry supports arbitrary relational queries using the class methods {class}`~lamindb.models.Record.get` and {class}`~lamindb.models.Record.filter`.
The syntax for it is Django's query syntax.

Here are some simple query examples.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
# get a single record (here the current notebook)
transform = ln.Transform.get(key="introduction.ipynb")

# get a set of records by filtering for a directory (LaminDB treats directories like AWS S3, as the prefix of the storage key)
ln.Artifact.filter(key__startswith="my_datasets/").df()

# query all artifacts ingested from a transform
artifacts = ln.Artifact.filter(transform=transform).all()

# query all artifacts ingested from a notebook with "intro" in the title
artifacts = ln.Artifact.filter(
    transform__description__icontains="intro",
).all()
```
:::
:::{tab-item} R
:sync: r
```R
# get a single record (here the current notebook)
transform <- ln$Transform$get(key = "introduction.Rmd")

# get a set of records by filtering for a directory (LaminDB treats directories like AWS S3, as the prefix of the storage key)
ln$Artifact$filter(key__startswith = "my_datasets/")$df()

# query all artifacts ingested from a transform
artifacts <- ln$Artifact$filter(transform = transform)$all()

# query all artifacts ingested from a notebook with "intro" in the title
artifacts <- ln$Artifact$filter(
  transform__description__icontains = "intro",
)$all()
```
:::
::::

In [None]:
# get a single record (here the current notebook)
transform = ln.Transform.get(key="introduction.ipynb")

# get a set of records by filtering for a directory (LaminDB treats directories like AWS S3, as the prefix of the storage key)
ln.Artifact.filter(key__startswith="my_datasets/").df()

# query all artifacts ingested from a transform
artifacts = ln.Artifact.filter(transform=transform).all()

# query all artifacts ingested from a notebook with "intro" in the title
artifacts = ln.Artifact.filter(
    transform__description__icontains="intro",
).all()

:::{dropdown} What does a double underscore mean?

For any field, the double underscore defines a comparator, e.g.,

* `name__icontains="Martha"`: `name` contains `"Martha"` when ignoring case
* `name__startswith="Martha"`: `name` starts with `"Martha"`
* `name__in=["Martha", "John"]`: `name` is `"John"` or `"Martha"`

For more info, see: {doc}`registries`.

:::
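
To see how a `field__comparator` expression maps to an ordinary comparison, here is a minimal plain-Python sketch. `toy_filter` is a hypothetical helper operating on dictionaries; it handles only the three comparators listed above plus exact matches, and it does not cover traversing relations (as in `transform__description__icontains`), which Django's query syntax also expresses with double underscores.

```python
def toy_matches(record: dict, **lookups) -> bool:
    """Check one record against `field__comparator=value` expressions."""
    for expression, expected in lookups.items():
        field, _, comparator = expression.partition("__")
        value = record[field]
        if comparator == "":  # no comparator means exact match
            ok = value == expected
        elif comparator == "icontains":
            ok = expected.lower() in value.lower()
        elif comparator == "startswith":
            ok = value.startswith(expected)
        elif comparator == "in":
            ok = value in expected
        else:
            raise ValueError(f"unsupported comparator: {comparator}")
        if not ok:
            return False
    return True


def toy_filter(records: list, **lookups) -> list:
    """Return all records that satisfy every lookup."""
    return [record for record in records if toy_matches(record, **lookups)]


people = [{"name": "Martha"}, {"name": "martha stewart"}, {"name": "John"}]
print(toy_filter(people, name__icontains="martha"))
print(toy_filter(people, name__startswith="Martha"))
print(toy_filter(people, name__in=["Martha", "John"]))
```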

:::{dropdown} Can I chain filters and searches?

Yes: `ln.Artifact.filter(suffix=".jpg").search("my image")`

:::

The class methods {class}`~lamindb.models.Record.search` and {class}`~lamindb.models.Record.lookup` help with approximate matches.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
# search artifacts
ln.Artifact.search("iris").df().head()

# search transforms
ln.Transform.search("intro").df()

# look up records with auto-complete
ulabels = ln.ULabel.lookup()
```
:::
:::{tab-item} R
:sync: r
```R
# search artifacts
ln$Artifact$search("iris")$df()

# search transforms
ln$Transform$search("intro")$df()

# look up records with auto-complete
ulabels = ln$ULabel$lookup()
```
:::
::::

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

For more info, see: {doc}`registries`.

## Features & labels

Features & labels make it easier to find datasets and help standardize them so that they're re-usable by analysts and machine learning models alike. Features are measurement dimensions (e.g. `"species"`, `"temperature"`) and labels are measured values (e.g. `"human"`, `"mouse"`). In stats, a feature is a variable while a label is a category.

:::{dropdown} Can you give me examples of what findability and usability mean?

1. Findability: Which datasets measured expression of cell marker `CD14`? Which characterized cell line `K562`? Which have a test & train split? Etc.
2. Usability: Are there typos in feature names? Are there typos in labels? Are types and units of features consistent? Etc.

:::

Let's annotate an artifact with a {class}`~lamindb.ULabel`, a built-in universal label ontology.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
# create & save a typed label
experiment_type = ln.ULabel(name="InVitroStudy", is_type=True).save()
my_experiment = ln.ULabel(name="My experiment", type=experiment_type).save()

# annotate the artifact with a label
artifact.ulabels.add(my_experiment)

# describe the artifact
artifact.describe()
```
:::
:::{tab-item} R
:sync: r
```R
# create & save a typed label
experiment_type = ln$ULabel(name = "InVitroStudy", is_type = TRUE)$save()
my_experiment = ln$ULabel(name = "My experiment", type = experiment_type)$save()

# annotate the artifact with a label
artifact$ulabels$add(my_experiment)

# describe the artifact
artifact$describe()
```
:::
::::

In [None]:
# create & save a typed ulabel
experiment_type = ln.ULabel(name="InVitroStudy", is_type=True).save()
my_experiment = ln.ULabel(name="My experiment", type=experiment_type).save()

# annotate the artifact with a ulabel
artifact.ulabels.add(my_experiment)

# describe the artifact
artifact.describe()

This is how you query artifacts by ulabels.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
ln.Artifact.filter(ulabels=my_experiment).df()
```
:::
:::{tab-item} R
:sync: r
```R
ln$Artifact$filter(ulabels=my_experiment)$df()
```
:::
::::

In [None]:
ln.Artifact.filter(ulabels=my_experiment).df()

You can also label based on another module, like the biological ontologies in the bionty module.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
import bionty as bt

# create a cell type label from the source ontology
cell_type = bt.CellType.from_source(name="effector T cell").save()

# annotate the artifact with a cell type
artifact.cell_types.add(cell_type)

# describe the artifact
artifact.describe()
```
:::
:::{tab-item} R
:sync: r
```R
bt <- import_module("bionty")

# create a cell type label from the source ontology
cell_type <- bt$CellType$from_source(name = "effector T cell")$save()

# annotate the artifact with a cell type
artifact$cell_types$add(cell_type)

# describe the artifact
artifact$describe()
```
:::
::::

In [None]:
import bionty as bt

# create a cell type label from the source ontology
cell_type = bt.CellType.from_source(name="effector T cell").save()

# annotate the artifact with a cell type
artifact.cell_types.add(cell_type)

# describe the artifact
artifact.describe()

If you want to annotate by non-categorical metadata or indicate the feature for a label, annotate via features.

::::{tab-set}
:::{tab-item} Py
:sync: python
```python
# define the "temperature" & "experiment" features
ln.Feature(name="temperature", dtype=float).save()
ln.Feature(name="experiment", dtype=ln.ULabel).save()

# annotate the artifact
artifact.features.add_values({"temperature": 21.6, "experiment": "My experiment"})

# describe the artifact
artifact.describe()
```
:::
:::{tab-item} R
:sync: r
```R
# define the "temperature" & "experiment" features
ln$Feature(name = "temperature", dtype = "float")$save()
ln$Feature(name = "experiment", dtype = ln$ULabel)$save()

# annotate the artifact
artifact$features$add_values(
  list("temperature" = 21.6, "experiment" = "My experiment")
)

# describe the artifact
artifact$describe()
```
:::
::::

In [None]:
# define the "temperature" & "experiment" features
ln.Feature(name="temperature", dtype=float).save()
ln.Feature(name="experiment", dtype=ln.ULabel).save()

# annotate the artifact
artifact.features.add_values({"temperature": 21.6, "experiment": "My experiment"})

# describe the artifact
artifact.describe()

## Curate datasets

You already saw how to ingest datasets without validation.
This is often enough if you're prototyping or working with one-off studies.
But if you want to create a big body of standardized data, you have to invest the time to curate your datasets.

Let's define a {class}`~lamindb.Schema` to curate a `DataFrame`.

In [None]:
# define valid labels
perturbation_type = ln.ULabel(name="Perturbation", is_type=True).save()
ln.ULabel(name="DMSO", type=perturbation_type).save()
ln.ULabel(name="IFNG", type=perturbation_type).save()

# define the schema
schema = ln.Schema(
    name="My DataFrame schema",
    features=[
        ln.Feature(name="ENSG00000153563", dtype=int).save(),
        ln.Feature(name="ENSG00000010610", dtype=int).save(),
        ln.Feature(name="ENSG00000170458", dtype=int).save(),
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
    ],
).save()

With a `Curator`, we can save an _annotated_ & _validated_ artifact with a single line of code.

In [None]:
curator = ln.curators.DataFrameCurator(df, schema)

# save curated artifact
artifact = curator.save_artifact(key="my_curated_dataset.parquet")  # calls .validate()

# see the parsed annotations
artifact.describe()

# query for a ulabel that was parsed from the dataset
ln.Artifact.get(ulabels__name="IFNG")

If we feed a dataset with an invalid dtype or typo, we'll get a `ValidationError`.

In [None]:
curator = ln.curators.DataFrameCurator(df_typo, schema)

# validate the dataset
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(str(error))
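
Conceptually, validation checks every column against the dtype or the set of valid labels defined in the schema. The following plain-Python sketch illustrates that idea; `toy_validate` and `ToyValidationError` are hypothetical names, the rows are made-up values, and the real `Curator` does considerably more (for example, parsing annotations).

```python
class ToyValidationError(ValueError):
    """Raised when a dataset does not conform to the toy schema."""


def toy_validate(rows: list, schema: dict) -> None:
    """Check each row against a schema of dtypes and valid label sets."""
    problems = []
    for index, row in enumerate(rows):
        for column, rule in schema.items():
            value = row.get(column)
            if isinstance(rule, type):
                # the rule is a dtype such as int
                if not isinstance(value, rule):
                    problems.append(
                        f"row {index}: {column!r} should be {rule.__name__}, got {value!r}"
                    )
            elif value not in rule:
                # the rule is a set of valid labels
                problems.append(f"row {index}: {value!r} is not a valid label for {column!r}")
    if problems:
        raise ToyValidationError("; ".join(problems))


schema = {"ENSG00000153563": int, "perturbation": {"DMSO", "IFNG"}}
rows_typo = [
    {"ENSG00000153563": 1, "perturbation": "DMSO"},
    {"ENSG00000153563": 2, "perturbation": "IFNJ"},  # typo: should be IFNG
]

try:
    toy_validate(rows_typo, schema)
except ToyValidationError as error:
    message = str(error)
    print(message)
```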

## Manage biological registries

The generic {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` registries will get you pretty far.

But let's now look at what you can do with a dedicated biological registry like {class}`~bionty.Gene`.

Every {py:mod}`bionty` registry is based on configurable public ontologies (>20 of them).

In [None]:
import bionty as bt

cell_types = bt.CellType.public()
cell_types

In [None]:
cell_types.search("gamma-delta T cell").head(2)

Define an `AnnData` schema.

In [None]:
# define var schema
var_schema = ln.Schema(
    name="my_var_schema",
    itype=bt.Gene.ensembl_gene_id,
    dtype=int,
).save()

obs_schema = ln.Schema(
    name="my_obs_schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
    ],
).save()

# define composite schema
anndata_schema = ln.Schema(
    name="my_anndata_schema",
    otype="AnnData",
    components={"obs": obs_schema, "var": var_schema},
).save()

Validate & annotate an `AnnData`.

In [None]:
import anndata as ad
import bionty as bt

# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(
    df[["ENSG00000153563", "ENSG00000010610", "ENSG00000170458"]],
    obs=df[["perturbation"]],
)

# save curated artifact
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
artifact = curator.save_artifact(description="my RNA-seq")
artifact.describe()

Query for typed features.

In [None]:
# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# query all artifacts linked to these feature sets
ln.Artifact.filter(feature_sets__in=feature_sets).df()

Update ontologies, e.g., create a cell type record and add a new cell state.

In [None]:
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_source(name="neuron").save()

# create a record to track a new cell state
new_cell_state = bt.CellType(
    name="my neuron cell state", description="explains X"
).save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)
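
Viewing parents up to a given distance amounts to walking the parent links of the ontology level by level. Here is a minimal sketch of that traversal; `toy_ancestors_within` is a hypothetical helper, and the terms above "neuron" are placeholders rather than the actual Cell Ontology hierarchy.

```python
def toy_ancestors_within(term: str, parents: dict, distance: int) -> dict:
    """Map each ancestor within `distance` steps to its distance from `term`."""
    found = {}
    frontier = [term]
    for depth in range(1, distance + 1):
        next_frontier = []
        for node in frontier:
            for parent in parents.get(node, []):
                if parent not in found:
                    found[parent] = depth
                    next_frontier.append(parent)
        frontier = next_frontier
    return found


# child -> list of parents; names above "neuron" are placeholders
toy_parents = {
    "my neuron cell state": ["neuron"],
    "neuron": ["placeholder parent"],
    "placeholder parent": ["placeholder grandparent"],
}

print(toy_ancestors_within("my neuron cell state", toy_parents, distance=2))
```

With `distance=2`, the traversal stops after the parent of "neuron" and does not reach the grandparent term.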

## Scale learning

How do you integrate new datasets with your existing datasets? Leverage {class}`~lamindb.Collection`.

In [None]:
# a new dataset
df2 = ln.core.datasets.small_dataset2(otype="DataFrame")
adata = ad.AnnData(
    df2[["ENSG00000153563", "ENSG00000010610", "ENSG00000004468"]],
    obs=df2[["perturbation"]],
)
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
artifact2 = curator.save_artifact(key="my_datasets/my_rnaseq2.h5ad")

Create a collection using {class}`~lamindb.Collection`.

In [None]:
collection = ln.Collection([artifact, artifact2], key="my-RNA-seq-collection").save()
collection.describe()
collection.view_lineage()

In [None]:
# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()

# typically, it's too big, hence, open it for streaming (if the backend allows it)
# collection.open()

# or iterate over its artifacts
collection.artifacts.all()

# or look at a DataFrame listing the artifacts
collection.artifacts.df()

Directly train models on collections of `AnnData`.

```python
# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["cell_medium"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("cell_medium"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass
```
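
The weighted sampler above needs one weight per observation. A common choice is the inverse frequency of each observation's label, so that rare labels are drawn as often as frequent ones. The sketch below illustrates that idea with the standard library only; whether `get_label_weights` computes exactly these weights is an assumption, and the label values are made up.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels: list) -> list:
    """Weight each observation by 1 / (number of observations with its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = ["DMEM", "DMEM", "DMEM", "RPMI"]  # made-up cell media
weights = inverse_frequency_weights(labels)

# draw with replacement; both labels should appear about equally often
rng = random.Random(0)
sampled = rng.choices(labels, weights=weights, k=1000)
print(Counter(sampled))
```

Each label's weights sum to 1, so the sampler balances the two media even though one of them is three times more frequent in the data.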

Read this [blog post](https://lamin.ai/blog/arrayloader-benchmarks) for more on training models on sharded datasets.