# Introduction

LaminDB is an open-source data framework for biology.

```{include} ../README.md
:start-line: 6
:end-line: -4
```

:::{dropdown} LaminDB features

```{include} features-lamindb.md
```
:::

LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.

:::{dropdown} LaminHub features

```{include} features-laminhub.md
```
:::

Basic features of LaminHub are free.
Enterprise features, hosted in your infrastructure or ours, are available on a [paid plan](https://lamin.ai/pricing).

## Quickstart

```{warning}

Public beta: We are close to converging on a stable API, but some breaking changes might still occur.

```

You'll ingest a small dataset while tracking data lineage, and learn how to validate, annotate, query & search.

### Setup

Install the `lamindb` Python package:

```shell
pip install 'lamindb[jupyter,bionty]'
```

Initialize a LaminDB instance that mounts the plugin {py:mod}`bionty` for biological types.

In [None]:
import lamindb as ln

# artifacts are stored in a local directory `./lamin-intro`
ln.setup.init(schema="bionty", storage="./lamin-intro")

### Provenance

Run {meth}`~lamindb.track` to auto-generate IDs to track data lineage.

In [None]:
# tag your code with auto-generated identifiers for a script or notebook
ln.settings.transform.stem_uid = "FPnfDtJz8qbE"
ln.settings.transform.version = "1"

# track the execution of a transform with a global run context
ln.track()

### Artifacts

With {class}`~lamindb.Artifact`, you can manage data batches & models in storage as files, folders or arrays.

In [None]:
import pandas as pd

df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"]},
    index=["observation1", "observation2", "observation3"],
)

# create an artifact from a DataFrame
artifact = ln.Artifact.from_df(df, description="my RNA-seq", version="1")

# any artifact comes with typed, relational metadata
artifact.describe()

In [None]:
# if you save an artifact, you save data & metadata in one operation
artifact.save()

# for any artifact, you can view its data lineage
artifact.view_lineage()

In [None]:
# load an artifact
artifact.load()

:::{dropdown} Provenance on the hub

The screenshot shows a notebook with its latest report, runs, output files, and parent notebooks. On the run view, you'll see input files.

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8aBoM.png" width="700px">

:::

### Labels

Add a universal label ({class}`~lamindb.ULabel`) to the artifact.

In [None]:
candidate_marker_study = ln.ULabel(name="Candidate marker study")
candidate_marker_study.save()
artifact.labels.add(candidate_marker_study)
artifact.describe()

### Query

Because LaminDB is built on SQL & Django under the hood, you can write arbitrarily complex relational queries.

In [None]:
# a simple query
ln.Artifact.filter(description="my RNA-seq").df()

# query all artifacts ingested from a notebook titled "Introduction"
artifacts = ln.Artifact.filter(transform__name="Introduction").all()

# query all artifacts ingested from a notebook whose title contains "intro" (case-insensitive) and that are labeled by "Candidate marker study"
artifacts = ln.Artifact.filter(transform__name__icontains="intro", ulabels=candidate_marker_study).all()

:::{dropdown} Query on the hub

If you work with a remote instance on LaminHub, you can compose queries as shown below.

Because LaminDB's metadata management is based on SQL, registries can easily hold tens of millions of rows.

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/L188T2JjzZHWHfv2S0ib.png" width="700px">

:::
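
The double-underscore syntax in `transform__name__icontains` follows Django's field-lookup convention: each `__` either follows a relation or names a comparison. The sketch below imitates that resolution on plain dictionaries to illustrate the idea. The `matches` helper, the `OPERATORS` table and the record layout are hypothetical and are not part of LaminDB or Django.

```python
from typing import Any

# Hypothetical comparison operators, named after two of Django's field lookups.
OPERATORS = {
    "exact": lambda value, target: value == target,
    "icontains": lambda value, target: str(target).lower() in str(value).lower(),
}


def matches(record: dict[str, Any], **lookups: Any) -> bool:
    """Return True if the nested dict `record` satisfies every lookup."""
    for path, target in lookups.items():
        parts = path.split("__")
        # A trailing operator name is optional; default to an exact match.
        operator = parts.pop() if parts[-1] in OPERATORS else "exact"
        value: Any = record
        for part in parts:  # follow relations, e.g. artifact -> transform -> name
            value = value[part]
        if not OPERATORS[operator](value, target):
            return False
    return True


# Made-up records standing in for rows of an artifact registry.
artifacts = [
    {"description": "my RNA-seq", "transform": {"name": "Introduction"}},
    {"description": "other data", "transform": {"name": "Preprocessing"}},
]

hits = [a for a in artifacts if matches(a, transform__name__icontains="intro")]
print([a["description"] for a in hits])
```

In the real system the lookups are translated into SQL joins and `WHERE` clauses rather than evaluated in Python, which is what keeps such queries fast on large registries.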

### Search

Search records in a registry.

In [None]:
ln.Transform.search("intro")

### Look up

Look up records in a registry with auto-complete.

In [None]:
labels = ln.ULabel.lookup()

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::
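
To give an intuition for how a lookup object enables auto-complete, here is a minimal sketch that exposes record names as attributes. The normalization rule (lower-casing and replacing non-word characters with underscores) is an assumption inferred from accessors such as `perturbations.ifng` and `genes.cd8a` in this guide; `make_lookup` is a hypothetical helper, not LaminDB's implementation.

```python
import re
from types import SimpleNamespace


def make_lookup(names: list[str]) -> SimpleNamespace:
    """Expose each name as an attribute so that editors can auto-complete it."""
    attributes = {}
    for name in names:
        # Lower-case and replace runs of non-word characters with an underscore.
        key = re.sub(r"\W+", "_", name.strip().lower())
        if key[:1].isdigit():  # Python identifiers cannot start with a digit
            key = f"_{key}"
        attributes[key] = name
    return SimpleNamespace(**attributes)


example_lookup = make_lookup(["Candidate marker study", "DMSO", "IFNG"])
print(example_lookup.ifng)
print(example_lookup.candidate_marker_study)
```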

### Public ontologies

Every {py:mod}`bionty` registry is based on public ontologies.

In [None]:
import bionty as bt

cell_types = bt.CellType.public()
cell_types

In [None]:
cell_types.search("gamma delta T cell").head(2)

## Validate & annotate

Let's validate the columns measured in a `DataFrame`.

In [None]:
annotate = ln.Annotate.from_df(
    df, 
    fields={"perturbation": ln.ULabel.name}, # validate categories in the perturbation column
    feature_field=ln.Feature.name, # validate features using the Feature registry
)

annotate.validate()

Let's register the features and labels so that they are considered validated from now on.

In [None]:
annotate.register_features(validated_only=False)
annotate.update_registry("perturbation")

View the registered features and labels.

In [None]:
ln.Feature.df()

In [None]:
ln.ULabel.df()

In [None]:
# now the validation passes
annotate.validate()
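
The validate, register, validate loop above can be pictured as a membership check against a registry. The following sketch uses an in-memory set as a stand-in for the label registry. The `validate` function and the initial registry content are hypothetical and only illustrate the logic, not how `ln.Annotate` is implemented.

```python
def validate(values: list[str], registry: set[str]) -> list[str]:
    """Return the distinct values missing from the registry, in order of appearance."""
    missing: list[str] = []
    for value in values:
        if value not in registry and value not in missing:
            missing.append(value)
    return missing


# Hypothetical in-memory stand-in for a label registry.
registry: set[str] = {"DMSO"}
perturbation = ["DMSO", "IFNG", "DMSO"]

first_pass = validate(perturbation, registry)
print(first_pass)  # values that are not registered yet

# Registering the missing values makes the next validation pass.
registry.update(first_pass)
second_pass = validate(perturbation, registry)
print(second_pass)
```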

In [None]:
# create, annotate and save an artifact
artifact = annotate.register_artifact(description="my RNA-seq", version="1")
artifact.describe()

:::{dropdown} Annotated artifacts on the hub

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/DjVOPEBiAcGlt3Gq7Qc1.png" width="700px">

:::

In [None]:
# get lookup object for the entities of interest
lookups = annotate.lookup()
lookups

In [None]:
# filter artifacts with specific labels
perturbations = lookups["perturbation"]
ln.Artifact.filter(ulabels=candidate_marker_study).filter(ulabels=perturbations.ifng).one()

## Biological types

{class}`~lamindb.Feature` and {class}`~lamindb.ULabel` will get you pretty far.
However, if you frequently use a specific entity, you'll want a dedicated registry.

Let’s look at the example of {class}`~bionty.Gene` and use it to register features.

### Validate typed features

In [None]:
import anndata as ad

adata = ad.AnnData(df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation"]])

annotate = ln.Annotate.from_anndata(
    adata,
    obs_fields={"perturbation": ln.ULabel.name},
    var_field=bt.Gene.symbol,  # note that we are using the Gene registry
    organism="human",  # specify the organism for the Gene registry
)
annotate.validate()

In [None]:
artifact = annotate.register_artifact(description="my RNA-seq", version="1")
artifact.describe()

In [None]:
# query for genes & the linked artifacts
genes = bt.Gene.filter(organism__name="human").lookup()
feature_sets_with_cd8a = ln.FeatureSet.filter(genes=genes.cd8a).all()
ln.Artifact.filter(feature_sets__in=feature_sets_with_cd8a).df()

### Manage biological registries

Create a cell type record and add a new cell state.

In [None]:
# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_public(name="neuron")
neuron.save()

In [None]:
# create a record to track a new cell state
new_cell_state = bt.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)
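
Conceptually, viewing parents up to a given distance is a bounded walk up the ontology graph. The sketch below shows such a walk on a hand-written mapping. The `PARENTS` mapping is illustrative only and not taken from the Cell Ontology, and whether `view_parents(distance=2)` traverses the graph exactly this way is an assumption.

```python
# Hypothetical child -> parents mapping; terms above "neuron" are placeholders.
PARENTS = {
    "my neuron cell state": ["neuron"],
    "neuron": ["neural cell"],
    "neural cell": ["cell"],
}


def ancestors(term: str, distance: int) -> dict[str, int]:
    """Map each ancestor within `distance` steps to its number of steps from `term`."""
    found: dict[str, int] = {}
    frontier = [term]
    for step in range(1, distance + 1):
        next_frontier = []
        for node in frontier:
            for parent in PARENTS.get(node, []):
                if parent not in found:  # keep the shortest distance only
                    found[parent] = step
                    next_frontier.append(parent)
        frontier = next_frontier
    return found


print(ancestors("my neuron cell state", distance=2))
```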

## Collections of artifacts

In [None]:
# access a new batch of data
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
        "perturbation": ["DMSO", "IFNG", "IFNG"]
    },
    index=["observation4", "observation5", "observation6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])

# validate, annotate and register a new artifact
annotate = ln.Annotate.from_anndata(
    adata,
    obs_fields={"perturbation": ln.ULabel.name},
    var_field=bt.Gene.symbol,
    organism="human",
)
annotate.validate()
artifact2 = annotate.register_artifact(description="my RNA-seq batch 2")

Create a collection using {class}`~lamindb.Collection`.

In [None]:
collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection", version="1")
collection.save()
collection.describe()
collection.view_lineage()

In [None]:
# if it's small enough, you can load the entire collection into memory as if it were a single artifact
collection.load()

In [None]:
# list its artifacts as a DataFrame
collection.artifacts.df()

Using {class}`~lamindb.core.MappedCollection` you can train machine learning models on large collections of artifacts:

```python
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(label_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
dl = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in dl:
    pass
```
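
A weighted sampler needs one weight per observation. One common choice is the inverse frequency of the observation's label, so that every class contributes equally in expectation. The sketch below shows that calculation in plain Python. Whether `get_label_weights` computes exactly this is an assumption, and `inverse_frequency_weights` and the example labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels: list[str]) -> list[float]:
    """Return one weight per observation, inversely proportional to its label count."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Hypothetical, deliberately imbalanced labels.
example_labels = ["DMSO", "DMSO", "DMSO", "IFNG"]
weights = inverse_frequency_weights(example_labels)

# The weights of each class sum to the same total, so sampling is balanced across classes.
for label in sorted(set(example_labels)):
    total = sum(w for w, l in zip(weights, example_labels) if l == label)
    print(label, round(total, 6))
```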

## Save notebooks & scripts

If you call {func}`~lamindb.finish()`, you save the run report, source code, and compute environment to your default storage location.

```python
ln.finish()
```

See an example for this introductory notebook [here](https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE5zKv).

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8aBoM.png" width="700px">

:::

If you want to download a notebook or script, call:

```bash
lamin stage https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE5zKv
```


## Data lineage

View the sequence of data transformations ({class}`~lamindb.Transform`) in a project (from [here](docs:project-flow), based on [Schmidt _et al._, 2022](https://pubmed.ncbi.nlm.nih.gov/35113687/)):

```python
transform.view_parents()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMOOPay.svg" width="400">

Or, the generating flow of an artifact:

```python
artifact.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Ykitjn.svg" width="800">


Both figures are based on mere calls to `ln.track()` in notebooks, pipelines & apps.
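
To see why a single tracking call per script can be enough to reconstruct lineage, consider this toy model: a global run context records which artifacts a run loads and saves, and lineage falls out by walking those records backwards. The `Tracker` class and the file names are hypothetical; this is a sketch of the idea, not LaminDB's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    transform: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


class Tracker:
    """Toy run context: every load and save is attributed to the current run."""

    def __init__(self) -> None:
        self.runs: list[Run] = []

    def track(self, transform: str) -> None:
        self.runs.append(Run(transform))

    def load(self, artifact: str) -> None:
        self.runs[-1].inputs.append(artifact)

    def save(self, artifact: str) -> None:
        self.runs[-1].outputs.append(artifact)

    def lineage(self, artifact: str) -> list[tuple[str, str, str]]:
        """Return (input, transform, output) edges leading to `artifact`."""
        edges = []
        for run in self.runs:
            if artifact in run.outputs:
                for source in run.inputs:
                    edges.extend(self.lineage(source))  # upstream edges first
                    edges.append((source, run.transform, artifact))
        return edges


tracker = Tracker()
tracker.track("ingest.py")
tracker.save("raw.parquet")
tracker.track("normalize.ipynb")
tracker.load("raw.parquet")
tracker.save("normalized.parquet")
tracker.track("cluster.ipynb")
tracker.load("normalized.parquet")
tracker.save("clusters.parquet")

for edge in tracker.lineage("clusters.parquet"):
    print(" -> ".join(edge))
```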

## Loading instances

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can load your instance on the command line using:

```shell
lamin load myhandle/myinstance
```

## Custom schemas and plugins

LaminDB can be customized & extended with schema & app plugins building on the [Django](https://github.com/django/django) ecosystem. Examples are:

- [bionty](./bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): Exemplary custom schema to manage samples, treatments, etc. 

If you'd like to create your own schema or app:

1. Create a git repository with registries similar to [wetlab](https://github.com/laminlabs/wetlab)
2. Create & deploy migrations via `lamin migrate create` and `lamin migrate deploy`

It's fastest if we do this for you based on our templates within an [enterprise plan](https://lamin.ai/pricing).

## Design

### Why?

The complexity of modern R&D data often blocks realizing the scientific progress it promises.

See this [blog post](https://lamin.ai/blog/problems).

### Assumptions

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

1. Data comes in batches from physical instruments and is transformed ({class}`~lamindb.Transform`) into useful representations ({class}`~lamindb.Artifact`)
2. Learning needs features ({class}`~lamindb.Feature`, {class}`~bionty.CellMarker`, ...) and labels ({class}`~lamindb.ULabel`, {class}`~bionty.CellLine`, ...)
3. Insights connect representations to experimental metadata and knowledge (ontologies)

### Schema & API

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/XoTQFCmmj2uU4d2xyj9t.png" width="350px" style="background: transparent" align="right">

LaminDB provides a SQL schema for common entities: {class}`~lamindb.Artifact`, {class}`~lamindb.Collection`, {class}`~lamindb.Transform`, {class}`~lamindb.Feature`, {class}`~lamindb.ULabel` etc. - see the [API reference](reference) or the [source code](https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py).

The core schema is extendable through plugins (see blue vs. red entities in **graphic**), e.g., with basic biological ({class}`~bionty.Gene`, {class}`~bionty.Protein`, {class}`~bionty.CellLine`, etc.) & operational entities (`Biosample`, `Techsample`, `Treatment`, etc.).

```{dropdown} What is the schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables.
[Django](https://github.com/django/django) is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

```
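
To illustrate what it means for a Python data model to be translated into a SQL table, here is a deliberately simplified sketch that uses a dataclass and the standard-library `sqlite3` module in place of the Django ORM. The `Treatment` fields, the `SQL_TYPES` mapping and `create_table_sql` are hypothetical; Django's own translation handles relations, constraints and migrations that this sketch ignores.

```python
import sqlite3
from dataclasses import dataclass, fields

# Minimal mapping from Python annotations to SQLite column types.
SQL_TYPES = {int: "INTEGER", str: "TEXT", float: "REAL"}


@dataclass
class Treatment:  # hypothetical registry with made-up fields
    id: int
    name: str
    concentration: float


def create_table_sql(model: type) -> str:
    """Derive a CREATE TABLE statement from the fields of a dataclass."""
    columns = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(model))
    return f"CREATE TABLE {model.__name__.lower()} ({columns})"


statement = create_table_sql(Treatment)
print(statement)

# Run the generated statement against an in-memory database.
connection = sqlite3.connect(":memory:")
connection.execute(statement)
connection.execute("INSERT INTO treatment VALUES (?, ?, ?)", (1, "IFNG", 0.5))
rows = connection.execute("SELECT name, concentration FROM treatment").fetchall()
print(rows)
connection.close()
```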

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

### Repositories

LaminDB and its plugins consist of open-source Python libraries & publicly hosted metadata assets:

- [lamindb](https://github.com/laminlabs/lamindb): Core API, which builds on the [core schema](https://github.com/laminlabs/lnschema-core).
- [bionty](https://github.com/laminlabs/bionty): Registries for basic biological entities, coupled to public ontologies.
- [wetlab](https://github.com/laminlabs/wetlab): Exemplary custom schema to manage samples, treatments, etc.
- [guides](https://github.com/laminlabs/lamindb/tree/main/docs/): Guides.
- [usecases](https://github.com/laminlabs/lamin-usecases): Use cases.

The guides and use cases in notebooks can be run on [Saturn Cloud](https://github.com/laminlabs/run-lamin-on-saturn), Google Vertex AI, Google Colab, and others.

LaminHub is not open-sourced.

<!--- [lamindb-setup](https://github.com/laminlabs/lamindb-setup): Setup & configure LaminDB, client for LaminHub. -->
<!-- - [lamin-cli](https://github.com/laminlabs/lamin-cli): CLI for `lamindb` and `lamindb-setup`. -->
<!--- [lamin-utils](https://github.com/laminlabs/lamin-utils): Generic utilities, e.g., a logger. -->
<!--- [readfcs](https://github.com/laminlabs/readfcs): FCS artifact reader. -->
<!-- [bionty-assets](https://github.com/laminlabs/bionty-assets): Hosted assets of parsed public biological ontologies. -->

### Influences

LaminDB was influenced by many other projects; see {doc}`docs:influences`.

In [None]:
# clean up test instance
!lamin delete --force lamin-intro
!rm -r lamin-intro