```{include} includes/preface.md

```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQck.svg" width="300px" style="background: transparent" align="right">

## Walkthrough

Features of biological systems are measured in samples that generate batched datasets.

LaminDB provides a framework to transform batched datasets into more useful representations: validated, queryable datasets, machine learning models, and analytical insights.

All data involved in this process are stored in a _LaminDB instance_, a database that manages datasets in different storage locations through their metadata. Let's create one.

In [None]:
!lamin init --storage ./lamin-intro --modules bionty

:::{dropdown} What else can I configure during setup?

1. You can pass a cloud storage location to `--storage` (S3, GCP, R2, HF, etc.)
    ```python
    --storage s3://my-bucket
    ```
2. Instead of the default SQLite database, pass a Postgres database connection string to `--db`:
    ```python
    --db postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>
    ```
3. Instead of a default instance name derived from the storage location, provide a custom name:
    ```python
    --name myinstance
    ```
4. Mount additional schema modules:
    ```python
    --modules bionty,wetlab,custom1
    ```

For more, see {doc}`/setup`.

:::

```{dropdown} If you decide to connect your instance to the hub, you will see data & metadata in a UI.

<a href="https://lamin.ai/laminlabs/lamindata">
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/YuefPQlAfeHcQvtq0000.png" width="700px">
</a>

```

### Track data transformations

The code that generates a dataset is a transform ({class}`~lamindb.Transform`). It could be a script, a notebook, a pipeline, or a function. Let's track the notebook that's being run:

In [None]:
import lamindb as ln
import pandas as pd

ln.track()  # track the current notebook or script

By calling {meth}`~lamindb.track`, the notebook gets automatically linked as the source of all data that's about to be saved!

:::{dropdown} What happened under the hood?

1. The full run environment and imported package versions of the current notebook were detected
2. Notebook metadata was detected and stored in a {class}`~lamindb.Transform` record
3. Run metadata was detected and stored in a {class}`~lamindb.Run` record

The {class}`~lamindb.Transform` registry stores data transformations: scripts, notebooks, pipelines, functions.

The {class}`~lamindb.Run` registry stores executions of transforms. Many runs can be linked to the same transform if executed with different context (time, user, input data, etc.).

:::
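
To make the relationship between the two registries concrete, here is a minimal sketch in plain Python. `ToyTransform`, `ToyRun`, and `start_run` are hypothetical names invented for this illustration; they are not part of LaminDB and only mirror the idea that many runs can point to one transform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ToyTransform:
    """Hypothetical stand-in for a transform: a piece of code."""

    key: str
    runs: list = field(default_factory=list)


@dataclass
class ToyRun:
    """Hypothetical stand-in for a run: one execution of a transform."""

    transform: ToyTransform
    user: str
    started_at: datetime


def start_run(transform: ToyTransform, user: str) -> ToyRun:
    # every execution creates a new run that is linked to the same transform
    run = ToyRun(transform=transform, user=user, started_at=datetime.now(timezone.utc))
    transform.runs.append(run)
    return run


notebook = ToyTransform(key="introduction.ipynb")
run1 = start_run(notebook, user="alice")
run2 = start_run(notebook, user="bob")

print(len(notebook.runs))  # 2
```

The transform is created once, while each execution adds a new run with its own context (here: user and start time).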

:::{dropdown} How do I track a pipeline instead of a notebook?

You need to integrate calls along the lines below into your pipeline or leverage a pipeline integration, see: {doc}`/pipelines`.

```python
transform = ln.Transform(name="My pipeline")
transform.version = "1.2.0"  # tag the version
ln.track(transform)
```

:::

:::{dropdown} Why should I care about tracking notebooks?

If you can, avoid interactive notebooks: anything that can be a deterministic pipeline should be a pipeline.

That said, much insight generated from biological data is driven by computational biologists _interacting_ with it.

A notebook that's run a single time on specific data is not a pipeline: it's a (versioned) document that produced insight or some other form of data representation (with parallels to an ELN in the wetlab).

Because humans are in the loop, most mistakes happen when using notebooks: {func}`~lamindb.track` helps avoid some.

(An early blog post on this is [here](https://lamin.ai/blog/2022/nbproject).)

:::

:::{dropdown} Is this compliant with OpenLineage?

Yes. What OpenLineage calls a "job", LaminDB calls a "transform". What OpenLineage calls a "run", LaminDB calls a "run".

:::

You can see all your transforms and their runs in the {class}`~lamindb.Transform` and {class}`~lamindb.Run` registries.

In [None]:
ln.Transform.df()

In [None]:
ln.Run.df()

### Artifacts

An {class}`~lamindb.Artifact` stores a dataset or model as a file or folder.

In [None]:
# an example dataset
df = ln.core.datasets.small_dataset1(otype="DataFrame", with_typo=True)
df

In [None]:
# create & save an artifact from a DataFrame
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()

# describe the artifact
artifact.describe()

Copy or download the artifact into a local cache.

In [None]:
artifact.cache()

Open the artifact for streaming.

In [None]:
dataset = artifact.open()  # returns a pyarrow.dataset.Dataset
dataset.head(2).to_pandas()

Cache & load the artifact into memory.

In [None]:
artifact.load()

View data lineage.

In [None]:
artifact.view_lineage()

:::{dropdown} How do I create an artifact for a file or folder?

Source path is local:

```python
ln.Artifact("./my_data.fcs", key="my_data.fcs")
ln.Artifact("./my_images/", key="my_images")
```
<br>

Upon `artifact.save()`, the source path will be copied or uploaded into your instance's current storage, visible & changeable via `ln.settings.storage`.

If the source path is remote or already in a registered storage location, `artifact.save()` won't trigger a copy or upload but will register the existing path.

```python
ln.Artifact("s3://my-bucket/my_data.fcs")  # key is auto-populated from S3, you can optionally pass a description
ln.Artifact("s3://my-bucket/my_images/")  # key is auto-populated from S3, you can optionally pass a description
```
<br>
You can also use other remote file systems supported by `fsspec`.

:::

```{dropdown} How does LaminDB compare to AWS S3?

LaminDB provides a database on top of AWS S3 (or GCP storage, file systems, etc.).

Similar to organizing files with paths, you can organize artifacts using the `key` parameter of {class}`~lamindb.Artifact`.

However, you'll see that you can more conveniently query data by entities you care about: people, code, experiments, genes, proteins, cell types, etc.

```

:::{dropdown} Are artifacts aware of array-like data?

Yes.

You can make artifacts from paths referencing array-like objects:

```python
ln.Artifact("./my_anndata.h5ad", key="my_anndata.h5ad")
ln.Artifact("./my_zarr_array/", key="my_zarr_array")
```

Or from in-memory objects:

```python
ln.Artifact.from_df(df, key="my_dataframe.parquet")
ln.Artifact.from_anndata(adata, key="my_anndata.h5ad")
```

You can open large artifacts for slicing from the cloud or load small artifacts directly into memory.

:::

Just like transforms, artifacts are versioned. Let's create a new version by revising the dataset.

In [None]:
# keep the dataframe with a typo around - we'll need it later
df_typo = df.copy()

# fix the "IFNJ" typo
df["perturbation"] = df["perturbation"].cat.rename_categories({"IFNJ": "IFNG"})

# create a new version
artifact = ln.Artifact.from_df(df, key="my_datasets/rnaseq1.parquet").save()

# see all versions of an artifact
artifact.versions.df()

:::{dropdown} Can I also create new versions independent of `key`?

That works, too. You can use `revises`:

```python
artifact_v1 = ln.Artifact.from_df(df, description="Just a description").save()
# below revises artifact_v1
artifact_v2 = ln.Artifact.from_df(df_updated, revises=artifact_v1).save()
```

<br>

The good thing about passing `revises: Artifact` is that you don't need to worry about coming up with naming conventions for paths.

The good thing about versioning based on `key` is that it's how most data versioning tools do it.

:::
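
If key-based versioning is new to you, the following toy model may help. It is a sketch in plain Python, not LaminDB's implementation: `ToyArtifactRegistry` is a hypothetical class, and the rule that identical content reuses an existing version is an assumption made for this illustration.

```python
import hashlib


class ToyArtifactRegistry:
    """Hypothetical in-memory registry; not LaminDB's implementation."""

    def __init__(self):
        self._versions = {}  # key -> list of content hashes, oldest first

    def save(self, key: str, content: bytes) -> int:
        """Store content under key and return its 1-based version number."""
        digest = hashlib.sha256(content).hexdigest()
        versions = self._versions.setdefault(key, [])
        if digest in versions:
            # identical content was already saved: reuse the existing version
            return versions.index(digest) + 1
        versions.append(digest)
        return len(versions)


registry = ToyArtifactRegistry()
v1 = registry.save("my_datasets/rnaseq1.parquet", b"perturbation=IFNJ")
v2 = registry.save("my_datasets/rnaseq1.parquet", b"perturbation=IFNG")
v2_again = registry.save("my_datasets/rnaseq1.parquet", b"perturbation=IFNG")

print(v1, v2, v2_again)  # 1 2 2
```

Saving changed content under the same `key` yields a new version, which is the pattern used above when the typo was fixed.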

### Labels

Annotate an artifact with a {class}`~lamindb.ULabel` and a {class}`bionty.CellType`. The same works for any entity in any custom schema module.

In [None]:
import bionty as bt

# create & save a typed label
experiment_type = ln.ULabel(name="Experiment", is_type=True).save()
candidate_marker_experiment = ln.ULabel(
    name="Candidate marker experiment", type=experiment_type
).save()

# label the artifact
artifact.ulabels.add(candidate_marker_experiment)

# repeat for a bionty entity
cell_type = bt.CellType.from_source(name="effector T cell").save()
artifact.cell_types.add(cell_type)

# describe the artifact
artifact.describe()

For annotating datasets with parsed labels like the perturbations `DMSO` & `IFNG`, jump to "Curate datasets".

### Registries

LaminDB's central classes are registries that store records ({class}`~lamindb.models.Record` objects).

The easiest way to see the latest records for a registry is to call the _class method_ {class}`~lamindb.models.Record.df`.

In [None]:
ln.ULabel.df()

A record and its registry share the same fields, which define the metadata you can query for. If you want to see them, look at the class or auto-complete.

In [None]:
ln.Artifact

### Query & search

You can write arbitrary relational queries using the class methods {class}`~lamindb.models.Record.get` and {class}`~lamindb.models.Record.filter`.
The syntax for it is Django's query syntax.

In [None]:
# get a single record (here the current notebook)
transform = ln.Transform.get(key="introduction.ipynb")

# get a set of records by filtering on a key prefix
ln.Artifact.filter(key__startswith="my_datasets/").df()

# query all artifacts ingested from a transform
artifacts = ln.Artifact.filter(transform=transform).all()

# query all artifacts ingested from a notebook with "intro" in the title and labeled "Candidate marker experiment"
artifacts = ln.Artifact.filter(
    transform__description__icontains="intro", ulabels=candidate_marker_experiment
).all()
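
If you haven't used Django before, the double-underscore keywords above may look unusual. The sketch below shows the idea in plain Python: a keyword like `key__startswith` is split into a field name and a lookup. `matches` is a hypothetical helper written for this illustration; it handles only single fields and three lookups, and it does not traverse relations such as `transform__description__icontains`.

```python
def matches(record: dict, **conditions) -> bool:
    """Return True if the record satisfies all `field__lookup=value` conditions."""
    for expression, expected in conditions.items():
        field, _, lookup = expression.partition("__")
        actual = record[field]
        if lookup == "":  # plain `field=value` means exact match
            ok = actual == expected
        elif lookup == "startswith":
            ok = actual.startswith(expected)
        elif lookup == "icontains":  # case-insensitive containment
            ok = expected.lower() in actual.lower()
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


records = [
    {"key": "my_datasets/rnaseq1.parquet", "description": "Intro dataset"},
    {"key": "other/notes.txt", "description": "Meeting notes"},
]

hits = [r["key"] for r in records if matches(r, key__startswith="my_datasets/")]
print(hits)  # ['my_datasets/rnaseq1.parquet']
```

In a real registry, the same expressions are translated into database queries rather than evaluated record by record in Python.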

The class methods {class}`~lamindb.models.Record.search` and {class}`~lamindb.models.Record.lookup` help with approximate matches.

In [None]:
# search in a registry
ln.Transform.search("intro").df()

# look up records with auto-complete
ulabels = ln.ULabel.lookup()
cell_types = bt.CellType.lookup()

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

### Features

You can annotate datasets by associated features.

In [None]:
# define the "temperature" & "experiment" features
ln.Feature(name="temperature", dtype=float).save()
ln.Feature(
    name="experiment", dtype=ln.ULabel
).save()  # categorical values are validated against the ULabel registry

# annotate
artifact.features.add_values(
    {"temperature": 21.6, "experiment": "Candidate marker experiment"}
)

# describe the artifact
artifact.describe()

Query artifacts by features.

In [None]:
ln.Artifact.features.filter(experiment__contains="marker experiment").df()

The easiest way to validate & annotate a dataset by the features it measures is via a `Curator`: jump to "Curate datasets".

## Key use cases

### Understand data lineage

Understand where a dataset comes from and what it's used for ([background](inv:docs#project-flow)).

```python
artifact.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Ykitjn.svg" width="800">

:::{dropdown} I just want to see the transforms.

```python
transform.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMOOPay.svg" width="400">

:::

You don't need a workflow manager to track data lineage (if you want to use one, see {doc}`docs:pipelines`). All you need is:

```python
import lamindb as ln

ln.track()  # track your run, start tracking inputs & outputs

# your code

ln.finish()  # mark run as finished, save execution report, source code & environment
```

```{dropdown} On the hub.

Below is how a single transform ([a notebook](https://lamin.ai/laminlabs/lamindata/transform/PtTXoc0RbOIq65cN)) with its run report looks on the hub.

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8dJfH.png" width="900px">

```

To create a new version of a notebook or script, run `lamin load` in the terminal, e.g.,

```bash
$ lamin load https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0004
→ notebook is here: mcfarland_2020_preparation.ipynb
```

### Curate datasets

You already saw how to ingest datasets without validation.
This is often enough if you're prototyping or working with one-off studies.
But if you want to create a big body of standardized data, you have to invest the time to curate your datasets.

Let's define a {class}`~lamindb.Schema` to curate a `DataFrame`.

In [None]:
# define valid labels
perturbation_type = ln.ULabel(name="Perturbation", is_type=True).save()
ln.ULabel(name="DMSO", type=perturbation_type).save()
ln.ULabel(name="IFNG", type=perturbation_type).save()

# define the schema
schema = ln.Schema(
    name="My DataFrame schema",
    features=[
        ln.Feature(name="ENSG00000153563", dtype=int).save(),
        ln.Feature(name="ENSG00000010610", dtype=int).save(),
        ln.Feature(name="ENSG00000170458", dtype=int).save(),
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
    ],
).save()

With a `Curator`, we can save an _annotated_ & _validated_ artifact with a single line of code.

In [None]:
curator = ln.curators.DataFrameCurator(df, schema)

# save curated artifact
artifact = curator.save_artifact(key="my_curated_dataset.parquet")  # calls .validate()

# see the parsed annotations
artifact.describe()

# query for a ulabel that was parsed from the dataset
ln.Artifact.get(ulabels__name="IFNG")

If we feed a dataset with an invalid dtype or typo, we'll get a `ValidationError`.

In [None]:
curator = ln.curators.DataFrameCurator(df_typo, schema)

# validate the dataset
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(str(error))
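
Conceptually, validating a categorical column means checking its values against a set of registered labels. Here is a minimal sketch of that idea in plain Python; `ToyValidationError` and `validate_column` are hypothetical names, and the error message format is made up for this illustration rather than copied from LaminDB.

```python
class ToyValidationError(Exception):
    """Hypothetical error type for this sketch."""


def validate_column(values, allowed, column):
    """Raise if any value in the column is not among the allowed labels."""
    invalid = sorted(set(values) - set(allowed))
    if invalid:
        raise ToyValidationError(
            f"{len(invalid)} term(s) not validated in '{column}': {invalid}"
        )


allowed_perturbations = {"DMSO", "IFNG"}

# passes silently: all values are registered labels
validate_column(["DMSO", "IFNG", "DMSO"], allowed_perturbations, "perturbation")

# fails: "IFNJ" is a typo and not a registered label
try:
    validate_column(["DMSO", "IFNJ", "DMSO"], allowed_perturbations, "perturbation")
except ToyValidationError as error:
    message = str(error)
    print(message)  # 1 term(s) not validated in 'perturbation': ['IFNJ']
```

This is why the dataset with the `IFNJ` typo is rejected while the corrected one is accepted.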

### Manage biological registries

The generic {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` registries will get you pretty far.

But let's now look at what you can do with a dedicated biological registry like {class}`~bionty.Gene`.

Every {py:mod}`bionty` registry is based on configurable public ontologies (>20 of them).

In [None]:
cell_types = bt.CellType.public()
cell_types

In [None]:
cell_types.search("gamma-delta T cell").head(2)

Define an `AnnData` schema.

In [None]:
# define var schema
var_schema = ln.Schema(
    name="my_var_schema",
    itype=bt.Gene.ensembl_gene_id,
    dtype=int,
).save()

obs_schema = ln.Schema(
    name="my_obs_schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
    ],
).save()

# define composite schema
anndata_schema = ln.Schema(
    name="my_anndata_schema",
    otype="AnnData",
    components={"obs": obs_schema, "var": var_schema},
).save()

Validate & annotate an `AnnData`.

In [None]:
import anndata as ad
import bionty as bt

# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(
    df[["ENSG00000153563", "ENSG00000010610", "ENSG00000170458"]],
    obs=df[["perturbation"]],
)

# save curated artifact
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
artifact = curator.save_artifact(description="my RNA-seq")
artifact.describe()

Query for typed features.

In [None]:
# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# write the query
ln.Artifact.filter(feature_sets__in=feature_sets).df()

Update ontologies, e.g., create a cell type record and add a new cell state.

In [None]:
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_source(name="neuron").save()

# create a record to track a new cell state
new_cell_state = bt.CellType(
    name="my neuron cell state", description="explains X"
).save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)
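
To illustrate what a `distance` argument means when walking up a hierarchy, here is a small sketch in plain Python. `ancestors_within` and `toy_parents` are hypothetical, and listing "cell" as the parent of "neuron" is a simplifying assumption for this example, not a statement about a specific ontology release.

```python
def ancestors_within(parents: dict, term: str, distance: int) -> set:
    """Collect all ancestors of `term` reachable in at most `distance` steps."""
    found = set()
    frontier = {term}
    for _ in range(distance):
        # move one level up the hierarchy, skipping terms we already collected
        frontier = {p for t in frontier for p in parents.get(t, ())} - found
        found |= frontier
    return found


# hypothetical child -> parents mapping
toy_parents = {
    "my neuron cell state": ["neuron"],
    "neuron": ["cell"],
}

print(sorted(ancestors_within(toy_parents, "my neuron cell state", distance=1)))
# ['neuron']
print(sorted(ancestors_within(toy_parents, "my neuron cell state", distance=2)))
# ['cell', 'neuron']
```

Adding a custom record as a child of an ontology-coupled record places it in the same hierarchy, so it can be reached by the same kind of traversal.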

### Scale learning

How do you integrate new datasets with your existing datasets? Leverage {class}`~lamindb.Collection`.

In [None]:
# a new dataset
df2 = ln.core.datasets.small_dataset2(otype="DataFrame")
adata = ad.AnnData(
    df2[["ENSG00000153563", "ENSG00000010610", "ENSG00000004468"]],
    obs=df2[["perturbation"]],
)
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
artifact2 = curator.save_artifact(key="my_datasets/my_rnaseq2.h5ad")

Create a collection using {class}`~lamindb.Collection`.

In [None]:
collection = ln.Collection([artifact, artifact2], key="my-RNA-seq-collection").save()
collection.describe()
collection.view_lineage()

In [None]:
# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()

# typically, it's too big, hence, open it for streaming (if the backend allows it)
# collection.open()

# or iterate over its artifacts
collection.artifacts.all()

# or look at a DataFrame listing the artifacts
collection.artifacts.df()

Directly train models on collections of `AnnData`.

```python
# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass
```
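
Why pass weights to the sampler at all? A common way to balance classes is to weight each sample by the inverse frequency of its label. The sketch below shows that idea with the standard library only; `inverse_frequency_weights` is a hypothetical helper, and whether `get_label_weights` computes exactly these values is an assumption not confirmed here.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample the weight 1 / (count of its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = ["DMSO", "DMSO", "DMSO", "IFNG"]
weights = inverse_frequency_weights(labels)

# the total weight per label is equal, so both labels are drawn about equally often
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)
print(Counter(drawn))
```

Without weighting, `DMSO` would be drawn about three times as often as `IFNG` in this toy example.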

Read this [blog post](https://lamin.ai/blog/arrayloader-benchmarks) for more on training models on sharded datasets.

```{include} includes/epilogue.md

```