[![Stars](https://img.shields.io/github/stars/laminlabs/lamindb?logo=GitHub&color=yellow)](https://github.com/laminlabs/lamindb)
[![pypi](https://img.shields.io/pypi/v/lamindb?color=blue&label=pypi%20package)](https://pypi.org/project/lamindb)

# Introduction

LaminDB is an open-source data framework for biology.

```{include} ../README.md
:start-line: 6
:end-line: -4
```

:::{dropdown} LaminDB features

```{include} features-lamindb.md
```
:::

LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.

:::{dropdown} LaminHub features

```{include} features-laminhub.md
```
:::

Basic features of LaminHub are free.
Enterprise features, hosted in your infrastructure or ours, are available on a [paid plan](https://lamin.ai/pricing).

## Quickstart

You'll ingest a small dataset while tracking data lineage, and learn how to validate, annotate, query & search.

### Setup

Install the `lamindb` Python package:

```shell
pip install 'lamindb[jupyter,bionty]'
```

Initialize a LaminDB instance that mounts the {py:mod}`bionty` plugin for biological types.

In [None]:
import lamindb as ln

# artifacts are stored in a local directory `./lamin-intro`
ln.setup.init(schema="bionty", storage="./lamin-intro")

### Track

Run {meth}`~lamindb.track` to track the input and output data of your code.

In [None]:
# tag your notebook or script with auto-generated identifiers
ln.settings.transform.stem_uid = "FPnfDtJz8qbE"
ln.settings.transform.version = "1"

# track the execution of your code
ln.track()

### Artifacts

Use {class}`~lamindb.Artifact` to manage data in local or remote storage.

In [None]:
import pandas as pd

# a sample dataset
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"]},
    index=["observation1", "observation2", "observation3"],
)

# create an artifact from a DataFrame
artifact = ln.Artifact.from_df(df, description="my RNA-seq", version="1")

# artifacts come with typed, relational metadata
artifact.describe()

# save data & metadata in one operation
artifact.save()

# view data lineage
artifact.view_lineage()

# load an artifact
artifact.load()

An artifact stores a dataset or model as either a file or a folder.

:::{dropdown} How do I register a file or folder?

Local:

```python
ln.Artifact("./my_data.fcs", description="my flow cytometry file")
ln.Artifact("./my_images/", description="my folder of images")
```

Remote:

```python
ln.Artifact("s3://my-bucket/my_data.fcs", description="my flow cytometry file")
ln.Artifact("s3://my-bucket/my_images/", description="my folder of images")
```

You can also use other remote file systems supported by `fsspec`.

:::

```{dropdown} Does LaminDB give me a file system?

You can organize artifacts using the `key` parameter of {class}`~lamindb.Artifact` as you would in cloud storage.

However, LaminDB encourages you to **not** rely on semantic keys.

Rather than memorizing names of folders and files, you find data via the entities you care about: people, code, experiments, genes, proteins, cell types, etc.

LaminDB embeds each artifact into rich relational metadata and indexes them in storage with a universal ID (`uid`).

This scales much better than semantic keys, which lead to deep hierarchical information structures that are hard to navigate for humans & machines.

```
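
To illustrate the idea of finding data through metadata rather than through folder names, here is a minimal sketch in plain Python. It does not use LaminDB's API: the `records` dictionary, the `find` helper, the uids, and the metadata fields are all hypothetical.

```python
# hypothetical metadata records keyed by uid; the storage path is just another field
records = {
    "uid0001": {"path": "store/uid0001.parquet", "cell_type": "T cell", "creator": "alice"},
    "uid0002": {"path": "store/uid0002.parquet", "cell_type": "neuron", "creator": "bob"},
    "uid0003": {"path": "store/uid0003.parquet", "cell_type": "T cell", "creator": "bob"},
}


def find(**criteria):
    """Return the uids whose metadata matches all given criteria."""
    return sorted(
        uid
        for uid, meta in records.items()
        if all(meta.get(field) == value for field, value in criteria.items())
    )


# no folder hierarchy needs to encode "cell type" or "creator"
t_cell_by_bob = find(cell_type="T cell", creator="bob")
```

The storage paths carry no meaning here. Every question is answered from the metadata, which is the pattern the paragraph above recommends.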

:::{dropdown} Are artifacts aware of array-like data?

Yes.

You can make artifacts from paths referencing array-like objects:

```python
ln.Artifact("./my_anndata.h5ad", description="annotated array")
ln.Artifact("./my_zarr_array/", description="my zarr array store")
```

Or from in-memory objects:

```python
ln.Artifact.from_df(df, description="my dataframe")
ln.Artifact.from_anndata(adata, description="annotated array")
```

:::

:::{dropdown} How do I version artifacts?

Every artifact is auto-versioned by its `hash`.

You can also pass a human-readable `version` field and make new versions via:

```python
artifact_v2 = ln.Artifact("my_path", is_new_version_of=artifact_v1)
```

Artifacts of the same version family share the same stem uid (the first 16 characters of the `uid`).

You can see all versions of an artifact via `artifact.versions`.

:::
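
The two versioning ideas, content hashes and shared stem uids, can be sketched in plain Python. This is not LaminDB's implementation: `content_hash` and `stem_of` are hypothetical helpers, SHA-256 is an arbitrary choice of hash function, and the uids are made up. Only the 16-character stem length is taken from the text above.

```python
import hashlib
from collections import defaultdict

STEM_LENGTH = 16  # per the text above: the stem is the first 16 characters of the uid


def content_hash(data: bytes) -> str:
    """Hash the content; identical bytes always give the same digest (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def stem_of(uid: str) -> str:
    """Return the part of a uid that all versions of one family share."""
    return uid[:STEM_LENGTH]


# made-up 20-character uids: two versions of one artifact and one unrelated artifact
uids = ["aaaaBBBBccccDDDD0001", "aaaaBBBBccccDDDD0002", "eeeeFFFFggggHHHH0001"]

# group uids into version families by their stem
families = defaultdict(list)
for uid in uids:
    families[stem_of(uid)].append(uid)

# unchanged content yields the same hash, changed content a different one
same = content_hash(b"1,2,3") == content_hash(b"1,2,3")
changed = content_hash(b"1,2,3") != content_hash(b"1,2,4")
```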

### Labels

Label an artifact with a {class}`~lamindb.ULabel`.

In [None]:
# create & save a label
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()

# label an artifact
artifact.labels.add(candidate_marker_study)
artifact.describe()

# the ULabel registry
ln.ULabel.df() 

### Queries

Write arbitrary relational queries (under the hood, LaminDB is SQL & Django).

In [None]:
# get an entity by uid
transform = ln.Transform.get("FPnfDtJz8qbE")

# filter by description
ln.Artifact.filter(description="my RNA-seq").df()

# query all artifacts ingested from a notebook named "Introduction"
artifacts = ln.Artifact.filter(transform__name="Introduction").all()

# query all artifacts ingested from a notebook with "intro" in the name and labeled "Candidate marker study"
artifacts = ln.Artifact.filter(transform__name__icontains="intro", ulabels=candidate_marker_study).all()
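
Because the query layer is SQL underneath, a Django-style filter such as `transform__name__icontains="intro"` corresponds roughly to a join plus a `LIKE` condition. The sketch below shows that correspondence with Python's built-in `sqlite3` module. The table and column names are hypothetical and much simpler than LaminDB's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transform (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE artifact (
        id INTEGER PRIMARY KEY,
        description TEXT,
        transform_id INTEGER REFERENCES transform(id)
    );
    INSERT INTO transform VALUES (1, 'Introduction'), (2, 'Other analysis');
    INSERT INTO artifact VALUES (1, 'my RNA-seq', 1), (2, 'unrelated data', 2);
    """
)

# rough SQL analogue of filter(transform__name__icontains="intro"):
# the double underscore follows the foreign key, icontains becomes a LIKE pattern
# (SQLite's LIKE is case-insensitive for ASCII characters by default)
rows = conn.execute(
    """
    SELECT artifact.description
    FROM artifact
    JOIN transform ON artifact.transform_id = transform.id
    WHERE transform.name LIKE '%' || ? || '%'
    """,
    ("intro",),
).fetchall()
conn.close()
```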

### Search

In [None]:
# search in a registry
ln.Transform.search("intro")

# look up records with auto-complete
labels = ln.ULabel.lookup()

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

## Validate & annotate

In LaminDB, validation & annotation of categoricals are closely related: both map categories onto registry content.

Let's validate a `DataFrame` by passing validation criteria while constructing an {class}`~lamindb.Annotate` flow object.

### Validate

In [None]:
# construct an object to validate & annotate a DataFrame
annotate = ln.Annotate.from_df(
    df,
    # define validation criteria
    columns=ln.Feature.name,  # map column names
    categoricals={df.perturbation.name: ln.ULabel.name},  # map categories
)

# the dataframe doesn't validate because registries don't contain the identifiers
annotate.validate()

### Update registries

In [None]:
# add non-validated identifiers to their mapped registries
annotate.add_new_from_columns()
annotate.add_new_from(df.perturbation.name)

# the registered labels & features that will from now on be used for validation
ln.ULabel.df()
ln.Feature.df()

### Annotate

In [None]:
# given the updated registries, the validation passes
annotate.validate()

# save annotated artifact
artifact = annotate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()
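
The validate, update, validate cycle above can be pictured with a minimal sketch in plain Python. It is a conceptual illustration, not LaminDB's implementation: the two sets stand in for the `Feature` and `ULabel` registries, and the helpers `non_validated` and `validate` are hypothetical.

```python
# hypothetical in-memory "registries"; in LaminDB these are database tables
feature_registry = set()
label_registry = set()

columns = ["CD8A", "CD4", "CD14", "perturbation"]
perturbations = ["DMSO", "IFNG", "DMSO"]


def non_validated(values, registry):
    """Return the distinct values that are missing from the registry, in order."""
    return [v for v in dict.fromkeys(values) if v not in registry]


def validate():
    """Validation passes only if every column name and category is registered."""
    return not non_validated(columns, feature_registry) and not non_validated(
        perturbations, label_registry
    )


passed_before = validate()  # registries are empty, so this fails

# register the missing identifiers, mirroring the "Update registries" step
missing_labels = non_validated(perturbations, label_registry)
feature_registry.update(non_validated(columns, feature_registry))
label_registry.update(missing_labels)

passed_after = validate()  # all identifiers are now registered
```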

### Query for annotations

In [None]:
ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.ifng).one()

## Biological registries

The generic {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` registries will get you pretty far.

But let's now look at what you can do with a dedicated biological registry like {class}`~bionty.Gene`.

### Access public ontologies

Every {py:mod}`bionty` registry is based on configurable public ontologies.

In [None]:
import bionty as bt

cell_types = bt.CellType.public()
cell_types

In [None]:
cell_types.search("gamma delta T cell").head(2)

### Validate & annotate with typed features

In [None]:
import anndata as ad

# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation"]])

# create an annotation flow for an AnnData object
annotate = ln.Annotate.from_anndata(
    adata,
    # define validation criteria
    var_index=bt.Gene.symbol, # map .var.index onto Gene registry
    categoricals={adata.obs.perturbation.name: ln.ULabel.name}, 
    organism="human",  # specify the organism for the Gene registry
)
annotate.validate()

# save annotated artifact
artifact = annotate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()

### Query for typed features

In [None]:
# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# query all artifacts linked to these feature sets
ln.Artifact.filter(feature_sets__in=feature_sets).df()

### Add new records

Create a cell type record and add a new cell state.

In [None]:
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_public(name="neuron")
neuron.save()

In [None]:
# create a record to track a new cell state
new_cell_state = bt.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)
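
Conceptually, adding a parent creates an edge in a hierarchy that can then be walked upwards. The sketch below illustrates this in plain Python, assuming that `distance` limits how many parent levels are followed. The `parents` mapping and the `ancestors_within` helper are hypothetical, and the terms above `neuron` are invented for illustration rather than taken from the ontology.

```python
# hypothetical child -> parents mapping
parents = {
    "my neuron cell state": ["neuron"],
    "neuron": ["neural cell"],
    "neural cell": ["cell"],
}


def ancestors_within(name, distance):
    """Collect the ancestors reachable in at most `distance` parent steps."""
    found = []
    frontier = [name]
    for _ in range(distance):
        # move one level up from every record in the current frontier
        frontier = [p for child in frontier for p in parents.get(child, [])]
        found.extend(p for p in frontier if p not in found)
    return found


two_levels = ancestors_within("my neuron cell state", distance=2)
```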

## Scale up data & learning

How do you learn from new datasets that extend your previous data history? Leverage {class}`~lamindb.Collection`.

In [None]:
# a new dataset
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
        "perturbation": ["DMSO", "IFNG", "IFNG"]
    },
    index=["observation4", "observation5", "observation6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])

# validate, annotate and save a new artifact
annotate = ln.Annotate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human"
)
annotate.validate()
artifact2 = annotate.save_artifact(description="my RNA-seq dataset 2")

### Collections of artifacts

Create a collection using {class}`~lamindb.Collection`.

In [None]:
collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection", version="1")
collection.save(transfer_labels=True)  # transfer labels from artifacts to collection
collection.describe()
collection.view_lineage()

In [None]:
# if it's small enough, you can load the entire collection into memory as if it were one dataset
collection.load()

# typically, it's too big, hence, iterate over its artifacts
collection.artifacts.all()

# or look at a DataFrame listing the artifacts
collection.artifacts.df()

### Data loaders

```
# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass
```

Read this [blog post](https://lamin.ai/blog/arrayloader-benchmarks) for more on training models on sharded datasets.
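
The weighted sampling in the data loader above is meant to balance classes during training. The following sketch shows one common scheme, inverse-frequency weights, using only the standard library. Whether `get_label_weights` computes exactly these weights is an assumption, and the label list is made up and deliberately imbalanced.

```python
import random
from collections import Counter

labels = ["DMSO", "DMSO", "DMSO", "IFNG"]  # made-up, deliberately imbalanced

counts = Counter(labels)
# inverse-frequency weights: every class ends up with the same total weight
weights = [1.0 / counts[label] for label in labels]

rng = random.Random(0)  # seeded for reproducibility
# sample with replacement, proportionally to the weights
sample = rng.choices(labels, weights=weights, k=1000)
sampled_counts = Counter(sample)
```

Although `IFNG` makes up only a quarter of the observations, it is drawn about half of the time, so a model sees both perturbations at a similar rate.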

## Data lineage

### Save notebooks & scripts

If you call {func}`~lamindb.finish()`, you save the run report, source code, and compute environment to your default storage location.

```
ln.finish()
```

See an example for this introductory notebook [here](https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE5zKv).

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8aBoM.png" width="700px">

:::

If you want to cache a notebook or script, call:

```bash
lamin get https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE5zKv
```


### Data lineage across entire projects

View the sequence of data transformations ({class}`~lamindb.Transform`) in a project (from [here](docs:project-flow), based on [Schmidt _et al._, 2022](https://pubmed.ncbi.nlm.nih.gov/35113687/)):

```python
transform.view_parents()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMOOPay.svg" width="400">

Or, the generating flow of an artifact:

```python
artifact.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Ykitjn.svg" width="800">


Both figures are based on mere calls to `ln.track()` in notebooks, pipelines & apps.
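
Lineage graphs like these can be derived once every run records its inputs and outputs. The sketch below shows the principle in plain Python. It is not LaminDB's data model: the `runs` records, the file names, and the `upstream_transforms` helper are hypothetical.

```python
# hypothetical run records: each run of a transform lists its inputs and outputs
runs = [
    {"transform": "ingest", "inputs": [], "outputs": ["raw.parquet"]},
    {"transform": "normalize", "inputs": ["raw.parquet"], "outputs": ["norm.parquet"]},
    {"transform": "cluster", "inputs": ["norm.parquet"], "outputs": ["clusters.csv"]},
]

# index: which run produced which artifact
produced_by = {out: run for run in runs for out in run["outputs"]}


def upstream_transforms(artifact):
    """Walk back from an artifact through the runs that generated it."""
    lineage = []
    seen = set()
    stack = [artifact]
    while stack:
        current = stack.pop()
        run = produced_by.get(current)
        if run is None or run["transform"] in seen:
            continue
        seen.add(run["transform"])
        lineage.append(run["transform"])
        stack.extend(run["inputs"])  # continue with whatever this run consumed
    return lineage


lineage = upstream_transforms("clusters.csv")
```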

## Distributed databases

### Easily create & access databases

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can connect to your instance via:

```python
ln.connect("account-handle/instance-name")
```

Or load an instance on the command line so that Python sessions auto-connect to it:

```shell
lamin load "account-handle/instance-name"
```

Or create a new instance of your own:

```shell
lamin init --storage ./my-data-folder
```

### Custom schemas and plugins

LaminDB can be customized & extended with schema & app plugins building on the [Django](https://github.com/django/django) ecosystem. Examples are:

- [bionty](./bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): Exemplary custom schema to manage samples, treatments, etc. 

If you'd like to create your own schema or app:

1. Create a git repository with registries similar to [wetlab](https://github.com/laminlabs/wetlab)
2. Create & deploy migrations via `lamin migrate create` and `lamin migrate deploy`

It's fastest if we do this for you based on our templates within an [enterprise plan](https://lamin.ai/pricing).

## Design

### Why?

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

The complexity of modern R&D data often blocks realizing the scientific progress it promises.

See this [blog post](https://lamin.ai/blog/problems).

### Assumptions

1. Batched datasets from physical instruments are transformed ({class}`~lamindb.Transform`) into useful representations ({class}`~lamindb.Artifact`)
2. Learning needs features ({class}`~lamindb.Feature`, {class}`~bionty.CellMarker`, ...) and labels ({class}`~lamindb.ULabel`, {class}`~bionty.CellLine`, ...)
3. Insights connect representations to experimental metadata and knowledge (ontologies)

### Schema & API

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/XoTQFCmmj2uU4d2xyj9t.png" width="350px" style="background: transparent" align="right">

LaminDB provides a SQL schema for common entities: {class}`~lamindb.Artifact`, {class}`~lamindb.Collection`, {class}`~lamindb.Transform`, {class}`~lamindb.Feature`, {class}`~lamindb.ULabel` etc. - see the [API reference](reference) or the [source code](https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py).

The core schema is extendable through plugins (see blue vs. red entities in **graphic**), e.g., with basic biological ({class}`~bionty.Gene`, {class}`~bionty.Protein`, {class}`~bionty.CellLine`, etc.) & operational entities (`Biosample`, `Techsample`, `Treatment`, etc.).

```{dropdown} What is the schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables.
[Django](https://github.com/django/django) is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

```

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

### Repositories

LaminDB and its plugins consist of open-source Python libraries & publicly hosted metadata assets:

- [lamindb](https://github.com/laminlabs/lamindb): Core API, which builds on the [core schema](https://github.com/laminlabs/lnschema-core).
- [bionty](https://github.com/laminlabs/bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): An (exemplary) wetlab schema.
- [guides](https://github.com/laminlabs/lamindb/tree/main/docs/): Guides.
- [usecases](https://github.com/laminlabs/lamin-usecases): Use cases.

LaminHub is not open-sourced.

<!-- [lamindb-setup](https://github.com/laminlabs/lamindb-setup): Setup & configure LaminDB, client for LaminHub. -->
<!-- [lamin-cli](https://github.com/laminlabs/lamin-cli): CLI for `lamindb` and `lamindb-setup`. -->
<!-- [lamin-utils](https://github.com/laminlabs/lamin-utils): Generic utilities, e.g., a logger. -->
<!-- [readfcs](https://github.com/laminlabs/readfcs): FCS artifact reader. -->
<!-- [bionty-assets](https://github.com/laminlabs/bionty-assets): Hosted assets of parsed public biological ontologies. -->

### Influences

LaminDB was influenced by many other projects; see {doc}`docs:influences`.