[![Stars](https://img.shields.io/github/stars/laminlabs/lamindb?logo=GitHub&color=yellow)](https://github.com/laminlabs/lamindb)
[![pypi](https://img.shields.io/pypi/v/lamindb?color=blue&label=pypi%20package)](https://pypi.org/project/lamindb)

# Introduction

LaminDB is an open-source data framework for biology.

:::{dropdown} LaminDB features

```{include} includes/features-lamindb.md
```
:::

LaminHub is a data collaboration hub built on LaminDB, similar to how GitHub is built on git.

:::{dropdown} LaminHub features

```{include} includes/features-laminhub.md
```
:::

## Quickstart

You'll ingest a small dataset while tracking data lineage, and see how to validate, annotate, query & search.

### Setup

Install the `lamindb` Python package.

```shell
# install with notebook support & biological entities
pip install 'lamindb[jupyter,bionty]'
```

Initialize a LaminDB instance that stores data locally and mounts the plugin {py:mod}`bionty`.

In [None]:
# store artifacts in local directory `./lamin-intro`
!lamin init --storage ./lamin-intro --schema bionty
# (optional) make Django's unnecessary functionality private for clean auto-complete
!lamin set private-django-api true

### Track transformations

When you first call {meth}`~lamindb.core.Context.track`, it auto-generates a `uid` to identify a notebook or script.

When you call it a second time, it registers a data transformation and a run: {class}`~lamindb.Transform` stores your notebooks, scripts, functions, and pipelines. {class}`~lamindb.Run` stores their executions.

In [None]:
import lamindb as ln

# tag your code with an auto-generated uid
ln.context.uid = "FPnfDtJz8qbE0000"  # <-- auto-generated by ln.context.track()

# track the execution of your notebook or script with inputs & outputs
ln.context.track()

:::{dropdown} Is this compliant with OpenLineage?

Yes. What OpenLineage calls a "job", LaminDB calls a "transform". What OpenLineage calls a "run", LaminDB calls a "run".

:::

:::{dropdown} What is `ln.context.uid`?

To tie a piece of code to a record in a database in a way that survives name and content changes, you need to attach it to an immutable identifier, e.g., LaminDB's `uid`.

git, by comparison, identifies code by its content hash & file name. If you rename a notebook or script file and change the content, you lose the identity of the file. Notebook platforms like Google Colab and DeepNote support renaming and changing content of a given notebook, but they do not support versioning in a simple queryable way: every notebook version comes with the same [notebook id](https://lamin.ai/blog/nbproject#metadata-tracking).

To enable versioning, LaminDB auto-generates `uid` values so that different versions of a transform are grouped by a random "stem uid" `suid`, which consists of the first 12 characters of the `uid`. The remaining 4 characters encode a revision in a `ruid`; hence, `uid = f"{suid}{ruid}"`. You can optionally label any given version with a semantic tag via the `transform.version` field.

Datasets and all other versioned entities in lamindb are versioned in the same way.

:::
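
The `uid` layout described in the dropdown above can be sketched in plain Python. This is a minimal illustration, not LaminDB's implementation: `split_uid` is a hypothetical helper, and the default stem length of 12 is taken from the description of transform uids above.

```python
def split_uid(uid: str, stem_length: int = 12) -> tuple[str, str]:
    """Split a uid into its stem (`suid`) and revision (`ruid`) parts.

    Hypothetical helper for illustration; not part of the lamindb API.
    """
    return uid[:stem_length], uid[stem_length:]


uid = "FPnfDtJz8qbE0000"  # the transform uid used in this guide
suid, ruid = split_uid(uid)
print(suid, ruid)

# the two parts concatenate back to the full uid
assert uid == f"{suid}{ruid}"
```

All versions of one transform would share the same `suid` and differ only in the `ruid`.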

### Artifacts

An {class}`~lamindb.Artifact` stores a dataset or model as a file, folder or array.

In [None]:
import pandas as pd

# a sample dataset
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"]},
    index=["observation1", "observation2", "observation3"],
)

# create an artifact from a DataFrame
artifact = ln.Artifact.from_df(df, description="my RNA-seq", version="1")

# artifacts come with typed, relational metadata
artifact.describe()

# save data & metadata in one operation
artifact.save()

View data lineage:

In [None]:
artifact.view_lineage()

Load an artifact:

In [None]:
artifact.load()

:::{dropdown} How does this look for a file or folder?

Local:

```python
ln.Artifact("./my_data.fcs", description="my flow cytometry file")
ln.Artifact("./my_images/", description="my folder of images")
```

Remote:

```python
ln.Artifact("s3://my-bucket/my_data.fcs", description="my flow cytometry file")
ln.Artifact("s3://my-bucket/my_images/", description="my folder of images")
```

You can also use other remote file systems supported by `fsspec`.

:::

```{dropdown} How does LaminDB compare to AWS S3?

LaminDB is a layer on top of a storage backend (AWS S3, GCP storage, local filesystem, etc.) and a database (Postgres, SQLite) for managing metadata.

Similar to organizing files in file systems & object stores with paths, you can organize artifacts using the `key` parameter of {class}`~lamindb.Artifact`.

However, LaminDB encourages you to **not** rely on semantic keys but instead organize your data based on metadata.

Rather than memorizing names of folders and files, you find data via the entities you care about: people, code, experiments, genes, proteins, cell types, etc.

LaminDB embeds each artifact into rich relational metadata and indexes them in storage with a universal ID (`uid`).

This scales much better than semantic keys, which lead to deep hierarchical information structures that can become hard to navigate.

Because metadata is typed and relational, you can work with more structure, more integrity, and richer queries compared to leveraging S3's JSON-like metadata. You'll learn more about this below.

```

:::{dropdown} Are artifacts aware of array-like data?

Yes.

You can make artifacts from paths referencing array-like objects:

```python
ln.Artifact("./my_anndata.h5ad", description="curated array")
ln.Artifact("./my_zarr_array/", description="my zarr array store")
```

Or from in-memory objects:

```python
ln.Artifact.from_df(df, description="my dataframe")
ln.Artifact.from_anndata(adata, description="annotated array")
```

You can open large artifacts for slicing from the cloud or load small artifacts directly into memory.

:::

:::{dropdown} How to version artifacts?

Every artifact is auto-versioned by its `hash` and the last four characters of the `uid`.

You can optionally pass a human-readable `version` field when you create new versions via:

```python
artifact_v2 = ln.Artifact("my_path", is_new_version_of=artifact_v1)
```

Artifacts of the same version family share the same uid stem (the first 16 characters of the `uid`).

You can see all versions of an artifact via `artifact.versions`.

:::
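
To make the notion of a version family concrete, here is a small sketch that groups artifact uids by their 16-character stem. The uids are invented for illustration and `group_by_stem` is a hypothetical helper; in LaminDB itself you would use `artifact.versions` instead.

```python
from collections import defaultdict

# hypothetical 20-character artifact uids, invented for illustration
uids = [
    "Xk3pQ9mTz7LbYw2a0000",
    "Xk3pQ9mTz7LbYw2a0001",
    "Hn5vR8cJd1FgUe4s0000",
]


def group_by_stem(uids, stem_length=16):
    """Group uids into version families keyed by their uid stem."""
    families = defaultdict(list)
    for uid in uids:
        families[uid[:stem_length]].append(uid)
    return dict(families)


families = group_by_stem(uids)
for stem, members in families.items():
    print(stem, members)
```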

### Labels

Label an artifact with a {class}`~lamindb.ULabel` and a {class}`bionty.CellLine`. The same works for any entity in any custom schema module.

In [None]:
import bionty as bt

# create & save a ulabel record
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()

# label the artifact
artifact.ulabels.add(candidate_marker_study)

# repeat for a bionty entity
cell_line = bt.CellLine.from_source(name="HEK293").save()
artifact.cell_lines.add(cell_line)

# describe the artifact
artifact.describe()

### Registries, records & fields

LaminDB's central classes are related records that inherit from {class}`~lamindb.core.Record`. We've already seen how to create new `artifact`, `transform` and `ulabel` records.

The easiest way to see all existing records of a given type is to call the _class method_ {class}`~lamindb.core.Record.df`.

In [None]:
ln.ULabel.df() 

Existing records are stored in the record's registry, {class}`~lamindb.core.Record`'s metaclass {class}`~lamindb.core.Registry`, which maps 1:1 to a SQL table in the SQLite or Postgres backend.

A record and its registry share the same fields, which define the metadata you can query for. If you want to see them, look at the class or auto-complete.

In [None]:
ln.Artifact

### Queries

You can write arbitrary relational queries using the class methods {class}`~lamindb.core.Record.get` and {class}`~lamindb.core.Record.filter`. The syntax is Django's query syntax; Django is one of the two most popular ORMs in Python (the other is SQLAlchemy).

In [None]:
# get a single record by uid (here, the latest version of the current notebook)
transform = ln.Transform.get("FPnfDtJz8qbE")

# get a single record by matching a field
transform = ln.Transform.get(name="Introduction")

# get a set of records by filtering on description
ln.Artifact.filter(description="my RNA-seq").df()

# query all artifacts ingested from the current notebook
artifacts = ln.Artifact.filter(transform=transform).all()

# query all artifacts ingested from a notebook with "intro" in the name and labeled "Candidate marker study"
artifacts = ln.Artifact.filter(
    transform__name__icontains="intro",
    ulabels=candidate_marker_study
).all()
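
If the double-underscore syntax is new to you, the following toy sketch shows how a lookup such as `transform__name__icontains="intro"` can be read: the leading segments walk across related records and the final segment names a comparison. This is only an illustration over plain dictionaries. Django and LaminDB translate lookups into SQL instead, and `toy_filter`, `matches` and `OPERATORS` are hypothetical names.

```python
# Toy illustration only: NOT how Django or LaminDB evaluate queries.
OPERATORS = {
    "exact": lambda value, target: value == target,
    "icontains": lambda value, target: target.lower() in value.lower(),
    "gt": lambda value, target: value > target,
}


def matches(record: dict, lookup: str, target) -> bool:
    """Check one `field__subfield__operator` lookup against a nested dict."""
    *path, last = lookup.split("__")
    if last in OPERATORS:
        operator = OPERATORS[last]
    else:
        # no explicit operator: the last segment is a field, compare exactly
        path.append(last)
        operator = OPERATORS["exact"]
    value = record
    for field in path:
        value = value[field]
    return operator(value, target)


def toy_filter(records, **lookups):
    """Keep the records that satisfy all lookups."""
    return [r for r in records if all(matches(r, k, v) for k, v in lookups.items())]


records = [
    {"description": "my RNA-seq", "transform": {"name": "Introduction"}},
    {"description": "my images", "transform": {"name": "Image analysis"}},
]
hits = toy_filter(records, transform__name__icontains="intro")
print([r["description"] for r in hits])
```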

### Search

The class methods {class}`~lamindb.core.Record.search` and {class}`~lamindb.core.Record.lookup` help you find sets of approximately matching records.

In [None]:
# search in a registry
ln.Transform.search("intro").df()

# look up records with auto-complete
ulabels = ln.ULabel.lookup()

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

### Datasets & features

What fields are to metadata records, features are to datasets. You can annotate datasets by the features they measure.

But because LaminDB validates all user input against its registries, annotating with a `"temperature"` feature doesn't work right away.

In [None]:
import pytest

with pytest.raises(ln.core.exceptions.ValidationError) as e:
    artifact.features.add_values({"temperature": 21.6})

print(e.exconly())

Following the hint in the error message, create & save a feature.

In [None]:
# create & save the "temperature" feature (only required once)
ln.Feature(name='temperature', dtype='float').save()

# now we can annotate with the feature & the value
artifact.features.add_values({"temperature": 21.6})

# describe the artifact
artifact.describe()

We can also annotate with categorical features:

In [None]:
# register a categorical feature
ln.Feature(name='study', dtype='cat').save()

# add a categorical value
artifact.features.add_values({"study": "Candidate marker study"})

# describe the artifact with type information
artifact.describe(print_types=True)

This is how you query for features.

In [None]:
ln.Artifact.features.filter(temperature__gt=21)

Features organize labels by how they're measured in datasets, independently of how labels are stored in metadata registries.
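
The register-then-annotate pattern from this section can be sketched without a database. The classes below are hypothetical stand-ins, not LaminDB's implementation: they only illustrate why annotating with an unregistered feature fails and why it succeeds after the feature has been saved once.

```python
class ToyValidationError(Exception):
    """Raised when a value refers to an unknown feature or has the wrong type."""


class ToyFeatureRegistry:
    """Hypothetical stand-in for a feature registry: maps names to Python types."""

    def __init__(self):
        self._dtypes: dict[str, type] = {}

    def register(self, name: str, dtype: type) -> None:
        self._dtypes[name] = dtype

    def validate(self, values: dict) -> None:
        for name, value in values.items():
            if name not in self._dtypes:
                raise ToyValidationError(f"feature {name!r} is not registered")
            if not isinstance(value, self._dtypes[name]):
                expected = self._dtypes[name].__name__
                raise ToyValidationError(f"feature {name!r} expects {expected}")


registry = ToyFeatureRegistry()

# annotating before registering fails
try:
    registry.validate({"temperature": 21.6})
except ToyValidationError as error:
    print(error)

# register once, then the same annotation validates
registry.register("temperature", float)
registry.validate({"temperature": 21.6})
```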

## Curate datasets

LaminDB validates & annotates categorical metadata by mapping categories on registries.

### Curate a DataFrame

Let's use the high-level {class}`~lamindb.Curate` class to curate a `DataFrame`.

In [None]:
# construct a Curate object to validate & annotate a DataFrame
curate = ln.Curate.from_df(
    df,
    # define validation criteria
    columns=ln.Feature.name,  # map column names
    categoricals={"perturbation": ln.ULabel.name},  # map categories
)

# the dataframe doesn't validate because registries don't contain the categories
curate.validate()

In [None]:
# add non-validated features based on the DataFrame columns
curate.add_new_from_columns()

# see the updated content of the features registry
ln.Feature.df()

In [None]:
# add non-validated labels based on the perturbations
curate.add_new_from("perturbation")

# see the updated content of the ULabel registry
ln.ULabel.df()

In [None]:
# given the updated registries, the validation passes
curate.validate()

# save curated artifact
artifact = curate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()
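
The validate, add, validate loop above boils down to a set-membership check of categories against a registry. Here is a minimal sketch under that assumption; `non_validated` is a hypothetical helper and the set `label_registry` is a toy stand-in for the `ULabel` registry.

```python
def non_validated(values, registry):
    """Return the values missing from the registry, in first-seen order."""
    missing = []
    for value in values:
        if value not in registry and value not in missing:
            missing.append(value)
    return missing


perturbations = ["DMSO", "IFNG", "DMSO"]  # the `perturbation` column from above
label_registry = {"Candidate marker study"}  # toy stand-in for the ULabel registry

# first validation: both categories are unknown
missing = non_validated(perturbations, label_registry)
print(missing)

# analogous to adding the new labels, after which validation passes
label_registry.update(missing)
assert non_validated(perturbations, label_registry) == []
```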

### Query for annotations

In [None]:
ulabels = ln.ULabel.lookup()
ln.Artifact.get(ulabels=ulabels.ifng)

## Biological registries

The generic {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` registries will get you pretty far.

But let's now look at what you can do with a dedicated biological registry like {class}`~bionty.Gene`.

### Access public ontologies

Every {py:mod}`bionty` registry is based on configurable public ontologies.

In [None]:
cell_types = bt.CellType.public()
cell_types

In [None]:
cell_types.search("gamma delta T cell").head(2)

### Validate & annotate with typed features

In [None]:
import anndata as ad

# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation"]])

# create an annotation flow for an AnnData object
curate = ln.Curate.from_anndata(
    adata,
    # define validation criteria
    var_index=bt.Gene.symbol, # map .var.index onto Gene registry
    categoricals={adata.obs.perturbation.name: ln.ULabel.name}, 
    organism="human",  # specify the organism for the Gene registry
)
curate.add_validated_from_var_index()
curate.validate()

# save curated artifact
artifact = curate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()

### Query for typed features

In [None]:
# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# write the query
ln.Artifact.filter(feature_sets__in=feature_sets).df()

### Add new records

Create a cell type record and add a new cell state.

In [None]:
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_source(name="neuron")
neuron.save()

In [None]:
# create a record to track a new cell state
new_cell_state = bt.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)
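
To illustrate what a parent hierarchy with a distance cutoff means, the sketch below walks a toy parent graph upward for a fixed number of steps. It assumes that `distance` counts edges from the starting record. The graph is invented: the entries above "neuron" are placeholders and do not reflect the actual Cell Ontology.

```python
# toy parent graph; entries above "neuron" are placeholders, not real ontology terms
parents = {
    "my neuron cell state": ["neuron"],
    "neuron": ["toy parent A", "toy parent B"],
    "toy parent A": ["toy root"],
    "toy parent B": ["toy root"],
}


def ancestors_within(node: str, distance: int) -> dict[str, int]:
    """Map each ancestor reachable within `distance` steps to its step count."""
    seen: dict[str, int] = {}
    frontier = [node]
    for step in range(1, distance + 1):
        next_frontier = []
        for current in frontier:
            for parent in parents.get(current, []):
                if parent not in seen:
                    seen[parent] = step
                    next_frontier.append(parent)
        frontier = next_frontier
    return seen


print(ancestors_within("my neuron cell state", distance=2))
```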

## Scale up data & learning

How do you learn from new datasets that extend your previous data history? Leverage {class}`~lamindb.Collection`.

In [None]:
# a new dataset
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
        "perturbation": ["DMSO", "IFNG", "IFNG"]
    },
    index=["observation4", "observation5", "observation6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])

# validate, curate and save a new artifact
curate = ln.Curate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human"
)
curate.validate()
artifact2 = curate.save_artifact(description="my RNA-seq dataset 2")

### Collections of artifacts

Create a collection using {class}`~lamindb.Collection`.

In [None]:
collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection", version="1")
collection.save()
collection.describe()
collection.view_lineage()

In [None]:
# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()

# typically, it's too big, hence, iterate over its artifacts
collection.artifacts.all()

# or look at a DataFrame listing the artifacts
collection.artifacts.df()

### Data loaders

```python
# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass
```

Read this [blog post](https://lamin.ai/blog/arrayloader-benchmarks) for more on training models on sharded datasets.
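
The weighted sampling in the data loader above can be sketched with the standard library. This assumes that label weights are inverse label frequencies, which is a common choice for balancing classes; `inverse_frequency_weights` is a hypothetical helper and `random.choices` stands in for `WeightedRandomSampler`.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each observation by 1 / (number of observations with its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = ["DMSO", "IFNG", "DMSO"]  # perturbations of the first dataset
weights = inverse_frequency_weights(labels)
print(weights)

# draw observation indices with replacement, proportional to the weights
rng = random.Random(0)
sampled = rng.choices(range(len(labels)), weights=weights, k=len(labels))
print(sampled)
```

With these weights, each label carries the same total sampling weight regardless of how many observations it has.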

## Data lineage

### Save notebooks & scripts

If you call {func}`~lamindb.core.Context.finish()`, you save the run report, source code, and compute environment to your default storage location.

```python
ln.finish()
```

See an example for this introductory notebook [here](https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE65cN/fMfi7IHYSylS5AY6GXtW).

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8QksL.png" width="900px">

:::

If you want to cache a notebook or script, call:

```bash
lamin get https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE65cN
```


### Data lineage across entire projects

View the sequence of data transformations ({class}`~lamindb.Transform`) in a project (from [a use case](inv:docs#project-flow), based on [Schmidt _et al._, 2022](https://pubmed.ncbi.nlm.nih.gov/35113687/)):

```python
transform.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMOOPay.svg" width="400">

Or, the generating flow of an artifact:

```python
artifact.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Ykitjn.svg" width="800">


Both figures are based on mere calls to `ln.context.track()` in notebooks, pipelines & apps.

## Distributed databases

### Easily create & access databases

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can connect to your instance via:

```python
ln.connect("account-handle/instance-name")
```

Or you load an instance on the command line for auto-connecting in a Python session:

```shell
lamin load "account-handle/instance-name"
```

Or you create a new instance:

```shell
lamin init --storage ./my-data-folder
```

### Custom schemas and plugins

LaminDB can be customized & extended with schema & app plugins building on the [Django](https://github.com/django/django) ecosystem. Examples are:

- [bionty](./bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): Exemplary custom schema to manage samples, treatments, etc. 

If you'd like to create your own schema or app:

1. Create a git repository with registries similar to [wetlab](https://github.com/laminlabs/wetlab)
2. Create & deploy migrations via `lamin migrate create` and `lamin migrate deploy`

It's fastest if we do this for you based on our templates within an [enterprise plan](https://lamin.ai/pricing).

## Design

### Why?

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

Objects like `pd.DataFrame` are at the heart of many data science workflows but there hasn't been a tool to manage these objects in the rich context that collaborative biological research requires:

- provenance: data sources, data transformations, models, users
- domain knowledge & experimental metadata: the features & labels derived from domain entities

In this [blog post](https://lamin.ai/blog/problems), we discuss how the complexity of modern R&D data often blocks realizing the scientific progress it promises.

### Assumptions

1. Batched datasets from physical instruments are transformed ({class}`~lamindb.Transform`) into useful representations ({class}`~lamindb.Artifact`)
2. Learning needs features ({class}`~lamindb.Feature`, {class}`~bionty.CellMarker`, ...) and labels ({class}`~lamindb.ULabel`, {class}`~bionty.CellLine`, ...)
3. Insights connect representations to experimental metadata and knowledge (ontologies)

### Schema & API

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/XoTQFCmmj2uU4d2xyj9t.png" width="350px" style="background: transparent" align="right">

LaminDB provides a SQL schema for common entities: {class}`~lamindb.Artifact`, {class}`~lamindb.Collection`, {class}`~lamindb.Transform`, {class}`~lamindb.Feature`, {class}`~lamindb.ULabel` etc. - see the [API reference](/api) or the [source code](https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py).

The core schema is extendable through plugins (see blue vs. red entities in **graphic**), e.g., with basic biological ({class}`~bionty.Gene`, {class}`~bionty.Protein`, {class}`~bionty.CellLine`, etc.) & operational entities (`Biosample`, `Techsample`, `Treatment`, etc.).

```{dropdown} What is the schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables.
[Django](https://github.com/django/django) is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

```

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

### Repositories

LaminDB and its plugins consist of open-source Python libraries & publicly hosted metadata assets:

- [lamindb](https://github.com/laminlabs/lamindb): Core package.
- [bionty](https://github.com/laminlabs/bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): Default wetlab schema.
- [guides](https://github.com/laminlabs/lamindb/tree/main/docs): Guides.
- [usecases](https://github.com/laminlabs/lamin-usecases): Use cases.

All immediate dependencies are available as git submodules [here](https://github.com/laminlabs/lamindb/tree/main/sub), for instance,

- [lnschema-core](https://github.com/laminlabs/lnschema-core): Core schema.
- [lamindb-setup](https://github.com/laminlabs/lamindb-setup): Setup & configure LaminDB.
- [lamin-cli](https://github.com/laminlabs/lamin-cli): CLI for `lamindb` and `lamindb-setup`.

For a comprehensive list of open-sourced software, browse our [GitHub account](https://github.com/laminlabs). Examples are:

- [lamin-utils](https://github.com/laminlabs/lamin-utils): Generic utilities, e.g., a logger.
- [readfcs](https://github.com/laminlabs/readfcs): FCS artifact reader.
- [nbproject](https://github.com/laminlabs/nbproject): Light-weight Jupyter notebook tracker.
- [bionty-assets](https://github.com/laminlabs/bionty-assets): Assets for public biological ontologies.

LaminHub is not open-sourced.

### Influences

LaminDB was influenced by many other projects, see {doc}`docs:influences`.