[![stars](https://img.shields.io/github/stars/laminlabs/lamindb?logo=GitHub&color=yellow)](https://github.com/laminlabs/lamindb)
[![pypi](https://img.shields.io/pypi/v/lamindb?color=blue&label=pypi%20package)](https://pypi.org/project/lamindb)
[![cran](https://www.r-pkg.org/badges/version/laminr?color=green)](https://cran.r-project.org/package=laminr)

# Introduction

Biological data are often poorly organized. 
It's often difficult to reproduce analytical results or understand how a dataset was processed. 
And it's typically hard to apply models to historical data, orthogonal assays, or datasets generated by other teams.

LaminDB is an open-source framework that makes working with biological datasets more robust, scalable, and understandable.
Instead of managing datasets with nested file systems, a LaminDB instance provides a database with metadata structures to organize files, folders, and arrays across any number of storage locations.

:::{dropdown} LaminDB specs

```{include} includes/features-lamindb.md
```
:::

LaminHub is a data collaboration hub built on LaminDB, similar to how GitHub is built on git.

:::{dropdown} LaminHub overview

```{include} includes/features-laminhub.md
```
:::

## Quickstart

Install the `lamindb` Python package.

```shell
# install with support for notebooks, biological entities & AWS
pip install 'lamindb[jupyter,bionty,aws]'
```

Connect to a LaminDB instance.

```shell
lamin connect account/instance  # <-- replace with your instance
```

Access an input dataset and save an output dataset.

::::{tab-set}
:::{tab-item} Python
```python
import lamindb as ln
	
ln.track()  # track a run for a notebook or script 
artifact = ln.Artifact.get("3TNCsZZcnIBv2WGb0001")  # get an artifact record
artifact.describe()   # show metadata of artifact
df = artifact.load()  # load artifact into memory, e.g., a DataFrame

# work with the dataset

ln.Artifact("./my_result_folder", description="My result").save()  # save a folder
ln.finish()  # mark the run as finished
```
:::
:::{tab-item} R

```R
install.packages("laminr", dependencies = TRUE)  # install the R package
library(laminr)

db <- connect()  # create an instance object
db$track(path = "./my-analysis.Rmd")  # track your .Rmd, .R or .qmd code
artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu")
adata <- artifact$load() # load the dataset into memory

# work with the dataset

db$Artifact("./my_result_folder", description="My result")$save()  # save a folder
db$finish()  # soon
```

Save an HTML export of a `.qmd` or `.Rmd` file as a report.

```shell
lamin save my-analysis.Rmd
```

For more, see the [R docs](https://laminr.lamin.ai/).

:::
::::

## Concepts

### LaminDB instance

A LaminDB instance is a single relational database that manages metadata for datasets across any number of storage locations, conforming to LaminDB's schema management. You can readily create a local instance to manage data in a local folder.
Here, you create one that mounts schema module {py:mod}`bionty`.

In [None]:
# manage artifacts in local directory `./lamin-intro`
!lamin init --storage ./lamin-intro --schema bionty

You can also connect your cloud storage locations (S3, GCP, R2, HuggingFace, etc.) and databases (Postgres & SQLite). See {doc}`setup`.
If you decide to connect your LaminDB instance to LaminHub, you will see data & metadata in a GUI.

<a href="https://lamin.ai/laminlabs/lamindata">
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/YuefPQlAfeHcQvtq0000.png" width="700px">
</a>

### Data transformation

A data transformation (a "transform") is any piece of code (script, notebook, pipeline, function) that can be applied to input data to produce output data.
When you call {meth}`~lamindb.track`, you register a transform in the {class}`~lamindb.Transform` registry and start auto-tracking its inputs and outputs; each run is stored in {class}`~lamindb.Run`.

In [None]:
import lamindb as ln

# --> `ln.track()` generates a uid for your code
# --> `ln.track(uid)` initiates a tracked run
ln.track("FPnfDtJz8qbE0000")  

:::{dropdown} Is this compliant with OpenLineage?

Yes. What OpenLineage calls a "job", LaminDB calls a "transform". What OpenLineage calls a "run", LaminDB calls a "run".

:::

:::{dropdown} What is the `uid`?

To tie a piece of code to a record in a database in a way that survives name and content changes, you need to attach it to an immutable identifier, e.g., LaminDB's `uid`.

git, by comparison, identifies code by its content hash & file name. If you rename a notebook or script file and change the content, you lose the identity of the file. 

To version transforms, LaminDB generates `uid = f"{stem_uid}{version_suffix}"` so that different versions of a transform are grouped by a "stem uid" while the last four `uid` characters encode the version. All versioned entities in LaminDB are versioned in this way, including artifacts and collections.
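
As a minimal sketch of this scheme, the snippet below splits a `uid` into its stem and version suffix and groups uids by stem. The helpers `split_uid` and `group_by_stem` are hypothetical and not part of LaminDB's API; the only assumption taken from the text above is that the last four characters encode the version.

```python
from collections import defaultdict

VERSION_SUFFIX_LENGTH = 4  # per the text: the last four characters encode the version


def split_uid(uid: str) -> tuple[str, str]:
    """Split a uid into (stem_uid, version_suffix). Hypothetical helper."""
    if len(uid) <= VERSION_SUFFIX_LENGTH:
        raise ValueError(f"uid too short to carry a version suffix: {uid!r}")
    return uid[:-VERSION_SUFFIX_LENGTH], uid[-VERSION_SUFFIX_LENGTH:]


def group_by_stem(uids: list[str]) -> dict[str, list[str]]:
    """Group uids so that all versions of one entity share a stem uid."""
    groups = defaultdict(list)
    for uid in uids:
        stem_uid, _ = split_uid(uid)
        groups[stem_uid].append(uid)
    return dict(groups)


stem_uid, version_suffix = split_uid("FPnfDtJz8qbE0000")
groups = group_by_stem(["13VINnFk89PE0004", "13VINnFk89PE0005", "FPnfDtJz8qbE0000"])
```

Here, the two `13VINnFk89PE` uids end up in the same group, which is what lets a registry list all versions of one transform.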

:::

### Artifact

An {class}`~lamindb.Artifact` stores a dataset or model as a file, folder or array.

In [None]:
import pandas as pd

# a sample dataset
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNJ", "DMSO"],},
    index=["sample1", "sample2", "sample3"],
)

# create & save an artifact from a DataFrame -- delete via artifact.delete(permanent=True)
artifact = ln.Artifact.from_df(df, description="my RNA-seq").save()

# describe the artifact
artifact.describe()

Load the artifact into memory.

In [None]:
artifact.load()

View data lineage.

In [None]:
artifact.view_lineage()

:::{dropdown} How do I create an artifact for a file or folder?

Source path is local:

```python
ln.Artifact("./my_data.fcs", description="my flow cytometry file")
ln.Artifact("./my_images/", description="my folder of images")
```
<br>

Upon `artifact.save()`, the source path will be copied or uploaded into your instance's current default storage.

If the source path is remote or already in a registered storage location, `artifact.save()` doesn't duplicate the data but registers the existing path.

```python
ln.Artifact("s3://my-bucket/my_data.fcs", description="my flow cytometry file")
ln.Artifact("s3://my-bucket/my_images/", description="my folder of images")
```
<br>
You can also use other remote file systems supported by `fsspec`.

:::

```{dropdown} How does LaminDB compare to AWS S3?

LaminDB provides a relational metadata layer on top of AWS S3 (or GCP storage, file system, etc.).

Similar to organizing files in file systems & object stores with paths, you can organize artifacts using the `key` parameter of {class}`~lamindb.Artifact`.
However, LaminDB encourages you to **not** rely on semantic keys but instead organize your data based on metadata.

Rather than memorizing names of folders and files, you find data via the entities you care about: people, code, experiments, genes, proteins, cell types, etc.

LaminDB indexes artifacts in storage via the `uid`.
This scales much better than semantic keys, which lead to deep hierarchical information structures that can become hard to navigate.

Because metadata is typed and relational, you can work with more structure, more integrity, and richer queries compared to leveraging S3's JSON-like metadata.
You'll learn more about this below.

```

:::{dropdown} Are artifacts aware of array-like data?

Yes.

You can make artifacts from paths referencing array-like objects:

```python
ln.Artifact("./my_anndata.h5ad", description="curated array")
ln.Artifact("./my_zarr_array/", description="my zarr array store")
```

Or from in-memory objects:

```python
ln.Artifact.from_df(df, description="my dataframe")
ln.Artifact.from_anndata(adata, description="annotated array")
```

You can open large artifacts for slicing from the cloud or load small artifacts directly into memory.

:::

Just like transforms, artifacts are versioned. Let's create a new version by revising the dataset.

In [None]:
# keep the dataframe with a typo around - we'll need it later
df_typo = df.copy()

# fix the "IFNJ" typo
df.loc["sample2", "perturbation"] = "IFNG"

# create a new version by revising the artifact
artifact = ln.Artifact.from_df(df, revises=artifact).save()

# see all versions of an artifact
artifact.versions.df()

:::{dropdown} I'd rather control versioning through a key or file path like on S3.

That works, too, and you won't need to pass an old version via `revises`:

```python
artifact_v1 = ln.Artifact.from_df(df, key="my_datasets/my_study1.parquet").save()
# below automatically creates a new version of artifact_v1 because the `key` matches
artifact_v2 = ln.Artifact.from_df(df_updated, key="my_datasets/my_study1.parquet").save()
```

<br>

The good thing about passing `revises: Artifact` is that it works for entities that don't come with a file path and you don't need to worry about coming up with naming conventions for paths.
You'll see that LaminDB makes it easy to organize data by entities, rather than file paths.

:::

### Label

Label an artifact with a {class}`~lamindb.ULabel` and a {class}`bionty.CellType`. The same works for any entity in any custom schema module.

In [None]:
import bionty as bt

# create & save a ulabel record
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()

# label the artifact
artifact.ulabels.add(candidate_marker_study)

# repeat for a bionty entity
cell_type = bt.CellType.from_source(name="effector T cell").save()
artifact.cell_types.add(cell_type)

# describe the artifact
artifact.describe()

### Registry

LaminDB's central classes are registries that store records ({class}`~lamindb.core.Record` objects).
We've already seen how to create new `artifact`, `transform` and `ulabel` records.

The easiest way to see the latest records of a given type is to call the _class method_ {class}`~lamindb.core.Record.df`.

In [None]:
ln.ULabel.df()

Existing records are stored in the record's registry (metaclass {class}`~lamindb.core.Registry`), which maps 1:1 to a SQL table.

A record and its registry share the same fields, which define the metadata you can query for. If you want to see them, look at the class or auto-complete.

In [None]:
ln.Artifact
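
To make the 1:1 mapping between a registry and a SQL table concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table name, column names, and the uid value are hypothetical and chosen for illustration; LaminDB's actual schema may differ.

```python
import sqlite3

# hypothetical table and column names, for illustration only
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE ulabel (id INTEGER PRIMARY KEY, uid TEXT UNIQUE, name TEXT UNIQUE)"
)

# saving a record corresponds to inserting a row
connection.execute(
    "INSERT INTO ulabel (uid, name) VALUES (?, ?)",
    ("hypoUid1", "Candidate marker study"),
)
connection.commit()

# the fields of a record are the columns of its table
fields = [row[1] for row in connection.execute("PRAGMA table_info(ulabel)")]

# querying the registry corresponds to selecting rows
rows = connection.execute(
    "SELECT name FROM ulabel WHERE name LIKE ?", ("%marker%",)
).fetchall()
```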

### Query & search

You can write arbitrary relational queries using the class methods {class}`~lamindb.core.Record.get` and {class}`~lamindb.core.Record.filter`.
Queries use the query syntax of Django, whose ORM is one of the two most popular in Python (the other is SQLAlchemy).

In [None]:
# get a single record by uid (here, the latest version of the current notebook)
transform = ln.Transform.get("FPnfDtJz8qbE")

# get a single record by matching a field
transform = ln.Transform.get(name="Introduction")

# get a set of records by filtering on description
ln.Artifact.filter(description="my RNA-seq").df()

# query all artifacts ingested from the current notebook
artifacts = ln.Artifact.filter(transform=transform).all()

# query all artifacts ingested from a notebook with "intro" in the name and labeled "Candidate marker study"
artifacts = ln.Artifact.filter(
    transform__name__icontains="intro", ulabels=candidate_marker_study
).all()
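
If Django's double-underscore syntax is new to you, the following toy sketch shows how an expression like `transform__name__icontains` can be read: a path through related records followed by an optional lookup name. It evaluates such filters on plain dictionaries; the function `matches` and the sample records are hypothetical, and this is not how Django or LaminDB implement queries.

```python
def matches(record: dict, **filters) -> bool:
    """Return True if `record` satisfies every Django-style keyword filter."""
    lookups = {
        "exact": lambda value, target: value == target,
        "icontains": lambda value, target: str(target).lower() in str(value).lower(),
        "startswith": lambda value, target: str(value).startswith(str(target)),
    }
    for expression, target in filters.items():
        *path, last = expression.split("__")
        if last in lookups:
            lookup = last  # the final segment names a lookup
        else:
            path.append(last)  # the final segment is a field, compared exactly
            lookup = "exact"
        value = record
        for field in path:  # follow the path through nested records
            value = value[field]
        if not lookups[lookup](value, target):
            return False
    return True


# hypothetical records standing in for rows of an artifact registry
records = [
    {"description": "my RNA-seq", "transform": {"name": "Introduction"}},
    {"description": "my flow cytometry file", "transform": {"name": "Preprocessing"}},
]
selected = [r for r in records if matches(r, transform__name__icontains="intro")]
```

Only the first record is selected, because its related transform has a name that contains "intro" when case is ignored.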

The class methods {class}`~lamindb.core.Record.search` and {class}`~lamindb.core.Record.lookup` help find sets of approximately matching records.

In [None]:
# search in a registry
ln.Transform.search("intro").df()

# look up records with auto-complete
ulabels = ln.ULabel.lookup()
cell_types = bt.CellType.lookup()

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

### Feature

What fields are to metadata records, features are to datasets.
You can annotate datasets by the features they measure.

But because LaminDB validates all user input against its registries, annotating with a `"temperature"` feature doesn't work right away.

In [None]:
import pytest

with pytest.raises(ln.core.exceptions.ValidationError) as e:
    artifact.features.add_values({"temperature": 21.6})

print(e.exconly())

Following the hint in the error message, create & save a {class}`~lamindb.Feature`.

In [None]:
# create & save the "temperature" feature (only required once)
ln.Feature(name="temperature", dtype="float").save()

# now we can annotate with the feature & the value
artifact.features.add_values({"temperature": 21.6})

# describe the artifact
artifact.describe()
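
The idea of validating input against a registry can be sketched in a few lines of plain Python. The registry dictionary, the `add_values` function, and the `ValidationError` class below are hypothetical stand-ins for illustration and do not reproduce LaminDB's implementation.

```python
class ValidationError(Exception):
    """Raised when a value refers to an unregistered or mistyped feature."""


# hypothetical in-memory stand-in for a feature registry: name -> expected type
feature_registry: dict[str, type] = {}


def add_values(annotations: dict, values: dict) -> None:
    """Add feature values to `annotations` only if all of them validate."""
    for name, value in values.items():
        if name not in feature_registry:
            raise ValidationError(f"feature {name!r} is not registered; register it first")
        expected = feature_registry[name]
        if not isinstance(value, expected):
            raise ValidationError(f"feature {name!r} expects a {expected.__name__}")
    annotations.update(values)


annotations: dict = {}

# annotating fails as long as the feature is unknown
try:
    add_values(annotations, {"temperature": 21.6})
    first_error = None
except ValidationError as error:
    first_error = str(error)

# after registering the feature once, the same annotation passes
feature_registry["temperature"] = float
add_values(annotations, {"temperature": 21.6})
```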

We can also annotate with categorical features:

In [None]:
# register a categorical feature
ln.Feature(name="study", dtype="cat").save()

# add a categorical value
artifact.features.add_values({"study": "Candidate marker study"})

# describe the artifact with type information
artifact.describe(print_types=True)

This is how you query artifacts by features.

In [None]:
ln.Artifact.features.filter(study__contains="marker study").df()

Features organize labels by how they're measured in datasets, independently of how labels are stored in metadata registries.

## Key use cases

### Understand data lineage

Understand where a dataset comes from and what it's used for ([background](inv:docs#project-flow)).

```python
artifact.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Ykitjn.svg" width="800">

:::{dropdown} I just want to see the transformations.

```python
transform.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMOOPay.svg" width="400">

:::

You don't need a workflow manager to track data lineage (if you want to use one, see {doc}`docs:pipelines`). All you need is:

```python
import lamindb as ln

ln.track()  # track your run

# your code

ln.finish()  # mark run as finished, save execution report, source code & environment
```
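
Conceptually, lineage follows from each run recording what it read and what it wrote. The sketch below illustrates that idea with plain dictionaries; the file names, the `runs` structure, and the `upstream` function are hypothetical and are not LaminDB's data model.

```python
# each run of a transform records which artifacts it read and which it wrote
runs = [
    {"transform": "preprocess.py", "inputs": ["raw.fcs"], "outputs": ["clean.parquet"]},
    {"transform": "analysis.ipynb", "inputs": ["clean.parquet"], "outputs": ["result.parquet"]},
]


def upstream(artifact: str, runs: list[dict]) -> list[tuple[str, str]]:
    """Return (transform, input) pairs that led to `artifact`, nearest first."""
    lineage = []
    frontier = [artifact]
    seen = set()
    while frontier:
        current = frontier.pop(0)
        for run in runs:
            if current in run["outputs"]:
                for source in run["inputs"]:
                    lineage.append((run["transform"], source))
                    if source not in seen:  # visit each artifact only once
                        seen.add(source)
                        frontier.append(source)
    return lineage


lineage = upstream("result.parquet", runs)
```

Walking the recorded runs backwards from `result.parquet` recovers both the notebook that produced it and the script that produced the notebook's input.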

Below is how a single transform ([a notebook](https://lamin.ai/laminlabs/lamindata/transform/PtTXoc0RbOIq65cN)) with its run report looks on the hub.

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8dJfH.png" width="900px">

To create a new version of a notebook or script, run `lamin load` in the terminal, e.g.,

```bash
$ lamin load https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0004
→ connected lamindb: laminlabs/lamindata
→ updated uid: 13VINnFk89PE0004 → 13VINnFk89PE0005
→ notebook is here: mcfarland_2020_preparation.ipynb
```

### Curate datasets

In the quickstart, you just saw how to ingest & annotate datasets without validation.
This is often enough if you're prototyping or working with one-off studies.
But if you want to create a big body of standardized data, you have to invest the time to curate your datasets.

Let's use a {class}`~lamindb.Curator` object to curate a `DataFrame`.

In [None]:
# construct a Curator object to validate & annotate a DataFrame
curator = ln.Curator.from_df(
    df,
    # define validation criteria as mappings
    columns=ln.Feature.name,  # map column names
    categoricals={"perturbation": ln.ULabel.name},  # map categories
)

# validate the dataset
curator.validate()

The validation did not pass because LaminDB's registries don't yet know about the features `"CD8A", "CD4", "CD14", "perturbation"` and the labels `"DMSO", "IFNG"` in this dataset.
Hence, we need to initially populate them.

In [None]:
# add non-validated features based on the DataFrame columns
curator.add_new_from_columns()

# add non-validated labels based on the perturbation column of the dataframe
curator.add_new_from("perturbation")

# see the updated content of the ULabel registry
ln.ULabel.df()

With the {class}`~lamindb.ULabel` and {class}`~lamindb.Feature` registries now containing meaningful reference values, validation passes and we can automatically parse features & labels to save an _annotated_ & _curated_ artifact.

In [None]:
# given the updated registries, the validation passes
curator.validate()

# save curated artifact
artifact = curator.save_artifact(description="my RNA-seq")

# see the parsed annotations
artifact.describe()

# query for a ulabel that was parsed from the dataset
ln.Artifact.get(ulabels__name="IFNG")

Had we used `ln.Curator` from the beginning, we would have caught the typo.

In [None]:
# construct a Curator object to validate & annotate a DataFrame
curator = ln.Curator.from_df(
    df_typo,
    columns=ln.Feature.name,
    categoricals={"perturbation": ln.ULabel.name},
)

# validate the dataset
curator.validate()
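
What such a validation step checks can be sketched without any dependencies: column names are compared against known features and categorical values against known labels. The function `find_non_validated` and the plain-dictionary table are hypothetical and only illustrate the principle, not the `Curator` implementation.

```python
def find_non_validated(
    table: dict[str, list],
    known_features: set[str],
    known_labels: dict[str, set[str]],
) -> dict[str, list[str]]:
    """Report column names and categorical values missing from the registries."""
    report = {}
    unknown_columns = [column for column in table if column not in known_features]
    if unknown_columns:
        report["columns"] = unknown_columns
    for column, allowed in known_labels.items():
        unknown_values = sorted(set(table[column]) - allowed)
        if unknown_values:
            report[column] = unknown_values
    return report


# the dataset with the typo, as a plain column -> values mapping
table_typo = {
    "CD8A": [1, 2, 3],
    "CD4": [3, 4, 5],
    "CD14": [5, 6, 7],
    "perturbation": ["DMSO", "IFNJ", "DMSO"],
}
report = find_non_validated(
    table_typo,
    known_features={"CD8A", "CD4", "CD14", "perturbation"},
    known_labels={"perturbation": {"DMSO", "IFNG"}},
)
```

The report flags `"IFNJ"` as the only value that is not a known perturbation label, while all column names pass.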

### Manage biological registries

The generic {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` registries will get you pretty far.

But let's now look at what you can do with a dedicated biological registry like {class}`~bionty.Gene`.

Every {py:mod}`bionty` registry is based on configurable public ontologies (>20 of them).

In [None]:
cell_types = bt.CellType.public()
cell_types

In [None]:
cell_types.search("gamma delta T cell").head(2)

Validate & annotate with typed features.

In [None]:
import anndata as ad

# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(
    df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation"]]
)

# create an annotation flow for an AnnData object
curate = ln.Curator.from_anndata(
    adata,
    # define validation criteria
    var_index=bt.Gene.symbol,  # map .var.index onto Gene registry
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human",  # specify the organism for the Gene registry
)
curate.validate()

# save curated artifact
artifact = curate.save_artifact(description="my RNA-seq")
artifact.describe()

Query for typed features.

In [None]:
# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# query all artifacts that link to these feature sets
ln.Artifact.filter(feature_sets__in=feature_sets).df()

Update ontologies, e.g., create a cell type record and add a new cell state.

In [None]:
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_source(name="neuron").save()

# create a record to track a new cell state
new_cell_state = bt.CellType(name="my neuron cell state", description="explains X").save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)

### Scale learning

How do you integrate new datasets with your existing datasets? Leverage {class}`~lamindb.Collection`.

In [None]:
# a new dataset
df = pd.DataFrame(
    {"CD8A": [2, 3, 3], "CD4": [3, 4, 5], "CD38": [4, 2, 3], "perturbation": ["DMSO", "IFNG", "IFNG"],},
    index=["sample4", "sample5", "sample6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])

# validate, curate and save a new artifact
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human",
)
curate.validate()
artifact2 = curate.save_artifact(description="my RNA-seq dataset 2")

Create a collection using {class}`~lamindb.Collection`.

In [None]:
collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection").save()
collection.describe()
collection.view_lineage()

In [None]:
# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()

# typically, it's too big, hence, iterate over its artifacts
collection.artifacts.all()

# or look at a DataFrame listing the artifacts
collection.artifacts.df()

Directly train models on collections of `AnnData`.

```python
# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass
```
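
A weighted sampler needs one weight per sample. One common choice, which this sketch assumes without claiming that it is what `get_label_weights` computes, is the inverse frequency of each sample's label so that every label is drawn equally often in expectation. The function name and the example labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels: list[str]) -> list[float]:
    """Weight each sample by 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# an imbalanced toy example: three DMSO samples, one IFNG sample
labels = ["DMSO", "DMSO", "DMSO", "IFNG"]
weights = inverse_frequency_weights(labels)

# total weight per label, which is equal across labels by construction
total_per_label = {
    label: sum(w for w, l in zip(weights, labels) if l == label)
    for label in set(labels)
}
```

With these weights, the single `IFNG` sample carries as much total sampling weight as the three `DMSO` samples combined.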

Read this [blog post](https://lamin.ai/blog/arrayloader-benchmarks) for more on training models on sharded datasets.

## Design

### Why?

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQck.svg" width="350px" style="background: transparent" align="right">

Objects like `pd.DataFrame` are at the heart of many data science workflows but there hasn't been a tool to manage these objects in the rich context that collaborative biological research requires:

- data lineage: data sources, data transformations, models, users
- domain knowledge & experimental metadata: the features & labels derived from domain entities

In this [blog post](https://lamin.ai/blog/problems), we discuss how the complexity of modern R&D data often blocks realizing the scientific progress it promises.

### World model

1. Teams need to have enough freedom to initiate work independently but enough structure to easily integrate datasets later on
2. Batched datasets ({class}`~lamindb.Artifact`) from physical instruments are transformed ({class}`~lamindb.Transform`) into useful representations
3. Learning needs features ({class}`~lamindb.Feature`, {class}`~bionty.CellMarker`, ...) and labels ({class}`~lamindb.ULabel`, {class}`~bionty.CellLine`, ...)
4. Insights connect dataset representations with experimental metadata and knowledge (ontologies)

### Architecture

LaminDB is a distributed system like git that can be run or hosted anywhere. As infrastructure, you merely need a database (SQLite/Postgres) and a storage location (file system, S3, GCP, HuggingFace, ...).

You can easily create your new local instance:

::::{tab-set}
:::{tab-item} Shell
```bash
lamin init --storage ./my-data-folder
```
:::
:::{tab-item} Python
```python
import lamindb as ln
ln.setup.init(storage="./my-data-folder")
```
:::
::::

Or you can let collaborators connect to a cloud-hosted instance:

::::{tab-set}
:::{tab-item} Shell
```bash
lamin connect account-handle/instance-name
```
:::
:::{tab-item} Python
```python
import lamindb as ln
ln.connect("account-handle/instance-name")
```
:::
:::{tab-item} R
```R
library(laminr)
ln <- connect("account-handle/instance-name")
```
:::
::::

To learn more about how to create & host LaminDB instances on distributed infrastructure, see {doc}`setup`. LaminDB instances work standalone but can optionally be managed by LaminHub. For an architecture diagram of LaminHub, [reach out](https://lamin.ai/contact)!


### Metadata schema & API

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/XoTQFCmmj2uU4d2xyj9u.png" width="350px" style="background: transparent" align="right">

LaminDB provides a SQL schema for common metadata entities: {class}`~lamindb.Artifact`, {class}`~lamindb.Collection`, {class}`~lamindb.Transform`, {class}`~lamindb.Feature`, {class}`~lamindb.ULabel` etc. - see the [API reference](/api) or the [source code](https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py).

The core metadata schema is extendable through plugins (see green vs. red entities in **graphic**), e.g., with basic biological ({class}`~bionty.Gene`, {class}`~bionty.Protein`, {class}`~bionty.CellLine`, etc.) & operational entities (`Biosample`, `Techsample`, `Treatment`, etc.).

```{dropdown} What is the metadata schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables.
[Django](https://github.com/django/django) is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained since its first public release in 2005.

```

On top of the metadata schema, LaminDB is a Python API that models datasets as artifacts, abstracts over storage & database access, data transformations, and (biological) ontologies.

Note that the schema of datasets (e.g., `.parquet` files or `.h5ad` arrays) is modeled through the `Feature` registry, so updating it does not require migrations.

### Custom schemas and plugins

LaminDB can be customized & extended with schema & app plugins building on the [Django](https://github.com/django/django) ecosystem. Examples are:

- [bionty](./bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): Registries for samples, treatments, etc.

If you'd like to create your own schema or app:

1. Create a git repository with registries similar to [wetlab](https://github.com/laminlabs/wetlab)
2. Create & deploy migrations via `lamin migrate create` and `lamin migrate deploy`

### Repositories

LaminDB and its plugins consist of open-source Python libraries & publicly hosted metadata assets:

- [lamindb](https://github.com/laminlabs/lamindb): Core package.
- [bionty](https://github.com/laminlabs/bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): Registries for samples, treatments, etc.
- [usecases](https://github.com/laminlabs/lamin-usecases): Use cases as visible on the docs.

All immediate dependencies are available as git submodules [here](https://github.com/laminlabs/lamindb/tree/main/sub), for instance,

- [lnschema-core](https://github.com/laminlabs/lnschema-core): Core schema.
- [lamindb-setup](https://github.com/laminlabs/lamindb-setup): Setup & configure LaminDB.
- [lamin-cli](https://github.com/laminlabs/lamin-cli): CLI for `lamindb` and `lamindb-setup`.

For a comprehensive list of open-sourced software, browse our [GitHub account](https://github.com/laminlabs). It includes, among others:

- [lamin-utils](https://github.com/laminlabs/lamin-utils): Generic utilities, e.g., a logger.
- [readfcs](https://github.com/laminlabs/readfcs): FCS file reader.
- [nbproject](https://github.com/laminlabs/nbproject): Light-weight Jupyter notebook tracker.
- [bionty-assets](https://github.com/laminlabs/bionty-assets): Assets for public biological ontologies.

LaminHub is not open-sourced.

### Influences

LaminDB was influenced by many other projects, see {doc}`docs:influences`.