# Introduction

```{include} ../README.md
:start-line: 6
:end-line: -4
```

:::{dropdown} LaminDB features

```{include} features-lamindb.md
```
:::

LaminHub is a data collaboration hub built on LaminDB, similar to how GitHub is built on git.

:::{dropdown} LaminHub features

```{include} features-laminhub.md
```
:::

Basic features of LaminHub are free. Enterprise features, hosted in your infrastructure or ours, are available on a [paid plan](https://lamin.ai/pricing).

## Quickstart

```{warning}

Public beta: The API is close to stable, but some breaking changes might still occur.

```

### Setup

Install the `lamindb` Python package:

```shell
pip install 'lamindb[jupyter,bionty]'
```

Init a LaminDB instance:

In [None]:
!lamin init --schema bionty --storage ./lamin-intro

To access public biological ontologies, we passed `--schema bionty`, which mounted plug-in {mod}`lnschema_bionty`.

Because we passed a local directory `./lamin-intro` to `--storage`, artifacts are stored locally by default. You could pass an AWS or GCP bucket instead: `s3://my-bucket` or `gs://my-bucket`.
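
To make the three kinds of storage roots concrete, here is a minimal sketch that classifies a root by its URL scheme. The helper `storage_kind` is hypothetical and not part of LaminDB; it only illustrates how a local path differs from a bucket URL.

```python
from urllib.parse import urlparse


def storage_kind(root: str) -> str:
    """Classify a storage root by its URL scheme (hypothetical helper)."""
    scheme = urlparse(root).scheme
    if scheme == "s3":
        return "AWS S3"
    if scheme == "gs":
        return "GCP Storage"
    # anything without a bucket scheme is treated as a local path
    return "local"


for root in ["./lamin-intro", "s3://my-bucket", "gs://my-bucket"]:
    print(root, "->", storage_kind(root))
```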

### Track data lineage

With {func}`~lamindb.track`, create a global run context to track data lineage:

In [None]:
import lamindb as ln
import pandas as pd

ln.track()

### Create artifacts

With {class}`~lamindb.Artifact`, you can manage data batches & models in storage as files, folders or arrays.

In [None]:
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"], "note": ["bad", "good", "ok"]},
    index=["observation1", "observation2", "observation3"],
)
artifact = ln.Artifact(df, description="my RNA-seq", version="1")

Any artifact comes with typed, relational metadata:

In [None]:
artifact.describe()

If you save an artifact, you'll save data & metadata in one operation:

In [None]:
artifact.save()

For any artifact, you can view its data lineage:

In [None]:
artifact.view_lineage()

:::{dropdown} Data provenance in the UI

The screenshot shows a notebook with its latest report, runs, output files, and parent notebooks. On the run view, you'll see input files.

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/RGXj5wcAf7EAc6J8aBoM.png" width="700px">

:::

Loading an artifact returns an object determined by its `.accessor` and `.suffix`:

In [None]:
artifact.load()
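
The idea of choosing a loader from a file suffix can be sketched with the standard library alone. The `LOADERS` table and the `load` function below are assumptions made for illustration; they are not LaminDB's implementation and they ignore the `.accessor` part of the dispatch.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_json(path: Path):
    return json.loads(path.read_text())


def load_csv(path: Path):
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical suffix-to-loader table; LaminDB's real mapping is not shown here.
LOADERS = {".json": load_json, ".csv": load_csv}


def load(path: Path):
    """Pick a loader based on the file suffix."""
    try:
        loader = LOADERS[path.suffix]
    except KeyError:
        raise ValueError(f"no loader registered for suffix {path.suffix!r}") from None
    return loader(path)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "counts.csv"
    csv_path.write_text("gene,count\nCD8A,1\n")
    rows = load(csv_path)

print(rows)
```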

### Query

A simple query:

In [None]:
ln.Artifact.filter(description="my RNA-seq").df()

To query all artifacts ingested from a notebook with title `"Introduction"`:

In [None]:
transform = ln.Transform.filter(name="Introduction").one()
artifacts = ln.Artifact.filter(transform=transform).all()

Because LaminDB is built on SQL & Django under the hood, you can write arbitrarily complex relational queries.

The following query gives you the gist:

In [None]:
artifacts = ln.Artifact.filter(transform__name__icontains="intro", created_by__handle="anonymous").all()

:::{dropdown} Query in the UI

If you work with a remote instance on LaminHub, you can compose queries as shown below.

Because LaminDB's metadata-management is based on SQL, you'll find that it scales to very large tables.

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/L188T2JjzZHWHfv2S0ib.png" width="700px">

:::
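
The double-underscore lookups used in the queries above (for example `transform__name__icontains`) follow Django's convention: each `__` either traverses a relation or names a comparison. The sketch below mimics that convention on plain nested dictionaries. It is only an in-memory illustration with a hypothetical `matches` helper; Django itself translates such lookups into SQL joins and conditions.

```python
def matches(record: dict, **lookups) -> bool:
    """Check a nested dict against Django-style lookups (illustrative only)."""
    for key, expected in lookups.items():
        *path, last = key.split("__")
        if last in {"icontains", "exact"}:
            op = last
        else:
            # no explicit operator: the last part is a field, compare exactly
            op = "exact"
            path.append(last)
        value = record
        for field in path:
            value = value[field]
        if op == "icontains":
            if str(expected).lower() not in str(value).lower():
                return False
        elif value != expected:
            return False
    return True


# Made-up records standing in for rows of the artifact registry
records = [
    {"description": "my RNA-seq", "transform": {"name": "Introduction"}, "created_by": {"handle": "anonymous"}},
    {"description": "other", "transform": {"name": "Cell Ranger"}, "created_by": {"handle": "anonymous"}},
]

hits = [
    r for r in records
    if matches(r, transform__name__icontains="intro", created_by__handle="anonymous")
]
print([r["description"] for r in hits])
```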

### Search

Search the {class}`~lamindb.Artifact` registry:

In [None]:
ln.Artifact.search("RNAseq")

Or search any other registry, e.g., {class}`~lamindb.Transform`:

In [None]:
ln.Transform.search("intro")

### Look up

We can look up records in any registry with auto-complete, as long as the registry has no more than 200k entries:

In [None]:
users = ln.User.lookup()

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::
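
A lookup object exposes record names as attributes so that an editor can auto-complete them. A rough sketch of the idea, assuming names are lower-cased and non-alphanumeric runs become underscores (the helper `make_lookup` is hypothetical and LaminDB's actual normalization may differ):

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Build an attribute-accessible lookup from record names (hypothetical helper)."""
    def to_key(name: str) -> str:
        # lower-case and replace runs of non-word characters with "_"
        return re.sub(r"\W+", "_", name.strip().lower()).strip("_")

    return SimpleNamespace(**{to_key(name): name for name in names})


studies = make_lookup(["Candidate marker study"])
perturbations = make_lookup(["DMSO", "IFNG"])

print(studies.candidate_marker_study)
print(perturbations.ifng)
```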

## Validate & annotate

To define a validation framework for your data, you can define features and labels using their registries.

### Features & labels

For instance, populate the feature registry ({class}`~lamindb.Feature`) based on the columns measured in the `DataFrame`:

In [None]:
features = ln.Feature.from_df(df)
features.df()

We don't want to register `"note"` as a feature, so we delete it from the list and save all other features:

In [None]:
del features[4]
ln.save(features)

The registry now looks like this:

In [None]:
ln.Feature.df()

Let's also create a label using {class}`~lamindb.ULabel`, LaminDB's universal label registry.

(Later, we'll use typed labels to deal with, e.g., 100k gene identifiers.)

In [None]:
study = ln.ULabel(name="Candidate marker study")
study.save()
ln.ULabel.df()

We can model hierarchical labels like so:

In [None]:
is_study = ln.ULabel(name="is_study")
is_study.save()
is_study.children.add(study)
study.view_parents()

Let us do the same for perturbation labels: 

In [None]:
perturbations = [ln.ULabel(name=label) for label in df["perturbation"].unique()]
ln.save(perturbations)
is_perturbation = ln.ULabel(name="is_perturbation")
is_perturbation.save()
is_perturbation.children.add(*perturbations)
is_perturbation.view_parents(with_children=True)
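
Conceptually, such a hierarchy is a directed graph of parent and child labels. The following sketch models it with plain dictionaries and walks the parent links; the function names are hypothetical and the code does not touch a LaminDB instance.

```python
from collections import defaultdict

parents = defaultdict(set)  # child name -> set of parent names


def add_child(parent: str, child: str) -> None:
    parents[child].add(parent)


def ancestors(label: str) -> set:
    """Collect all labels reachable by following parent links."""
    seen = set()
    stack = [label]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


add_child("is_study", "Candidate marker study")
for name in ["DMSO", "IFNG"]:
    add_child("is_perturbation", name)

print(ancestors("IFNG"))
```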

### Validate & annotate data

Now that we defined features, we can validate a data batch:

In [None]:
artifact = ln.Artifact.from_df(df, description="my RNA-seq")
artifact.save()

(Because we already saved the same data, LaminDB retrieves it instead of creating a new artifact.)
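
One common way to implement this kind of get-or-create behavior is to key records by a hash of the content. The sketch below assumes SHA-256 and an in-memory dictionary as a stand-in for the database; both are assumptions for illustration, not a description of LaminDB's internals.

```python
import hashlib

registry = {}  # content hash -> artifact record (stand-in for a database table)


def get_or_create(content: bytes, description: str) -> dict:
    """Return the existing record for identical content, else create one."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in registry:
        return registry[digest]
    record = {"hash": digest, "description": description}
    registry[digest] = record
    return record


first = get_or_create(b"CD8A,CD4\n1,3\n", "my RNA-seq")
second = get_or_create(b"CD8A,CD4\n1,3\n", "my RNA-seq")
print(first is second)
```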

Saving the artifact linked its validated features:

In [None]:
artifact.describe()

Annotating an artifact with a label works like so:

In [None]:
artifact.labels.add(study)
artifact.describe()

We can also associate labels with a feature:

In [None]:
features = ln.Feature.lookup()
artifact.labels.add(perturbations, feature=features.perturbation)
artifact.describe()

:::{dropdown} Artifacts with context in the UI

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/DjVOPEBiAcGlt3Gq7Qc1.png" width="700px">

:::

### Query for annotations

Get lookup objects for the entities of interest:

In [None]:
studies = is_study.children.lookup()
perturbations = is_perturbation.children.lookup()

In [None]:
artifact = ln.Artifact.filter(ulabels__in=[studies.candidate_marker_study, perturbations.ifng]).distinct().one()

Delete an artifact:

In [None]:
artifact.delete(permanent=True)

## Biological types

The generic {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` will get you pretty far.

But if you use an entity many times, you typically want a dedicated registry, which you can use to type your code & as an interface for public ontologies.

Let's do this with {class}`~lnschema_bionty.Gene` and {class}`~lnschema_bionty.Tissue` from plug-in {mod}`lnschema_bionty`:

### Access public ontologies 

Import gene records from a public ontology, which we'll use to validate features:

In [None]:
import lnschema_bionty as lb

genes = lb.Gene.from_values(df.columns, organism="human")
ln.save(genes)
lb.Gene.df()

### Validate typed features

To manage features of different types, let us use an `AnnData` object:

In [None]:
import anndata as ad

In [None]:
adata = ad.AnnData(df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation", "note"]])

Create an artifact & validate features using the symbol field of `Gene`:

In [None]:
artifact = ln.Artifact.from_anndata(
    adata, description="my RNA-seq", field=lb.Gene.symbol, organism="human"
)
artifact.save()

### Annotate with typed labels

Search the public tissue ontology from the bionty store:

In [None]:
lb.Tissue.public().search("umbilical blood").head(2)

Define a tissue label:

In [None]:
tissue = lb.Tissue.from_public(name="umbilical cord blood")
tissue.save()
tissue.view_parents(distance=2)

Annotate the artifact:

In [None]:
artifact.labels.add(study)
artifact.labels.add(adata.obs.perturbation, feature=features.perturbation)
artifact.labels.add(tissue)
artifact.describe()

Query for genes & the linked artifacts:

In [None]:
genes = lb.Gene.filter(organism__name="human").lookup()

# all gene sets measuring CD8A
genesets_with_cd8a = ln.FeatureSet.filter(genes=genes.cd8a).all()

# all artifacts measuring CD8A
ln.Artifact.filter(feature_sets__in=genesets_with_cd8a).df()

### Append a new batch of data

Assume we now run a pipeline in which we access a new batch of data:

In [None]:
transform = ln.Transform(name="Cell Ranger", type="pipeline", version="1")
ln.track(transform)

Access a new batch of data:

In [None]:
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
        "perturbation": ["DMSO", "IFNG", "IFNG"]
    },
    index=["observation4", "observation5", "observation6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])

Because gene `"CD38"` is not yet registered, it doesn't yet validate:

In [None]:
artifact2 = ln.Artifact.from_anndata(
    adata, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)
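
At its core, validating features against a registry is a membership check. A minimal sketch, assuming the registry currently holds the three gene symbols from the first batch (the `validate` helper is hypothetical):

```python
registered = {"CD8A", "CD4", "CD14"}  # symbols assumed to be in the registry
columns = ["CD8A", "CD4", "CD38"]     # gene columns of the new batch


def validate(values, registry):
    """Split values into validated and non-validated, preserving order."""
    validated = [v for v in values if v in registry]
    missing = [v for v in values if v not in registry]
    return validated, missing


validated, missing = validate(columns, registered)
print("validated:", validated)
print("not validated:", missing)

# once the missing symbol is registered, every column validates
registered.add("CD38")
_, missing_after = validate(columns, registered)
print("not validated after registering:", missing_after)
```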

Let's add it to the `Gene` registry and re-create the artifact - now all features validate:

In [None]:
lb.Gene.from_public(symbol="CD38", organism="human").save()
artifact2 = ln.Artifact.from_anndata(
    adata, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)
artifact2.save()

### Collections of artifacts

Create a collection using {class}`~lamindb.Collection`:

In [None]:
collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection", version="1")
collection.save()
collection.describe()
collection.view_lineage()

If it's small enough, you can load the entire collection into memory as if it were a single artifact:

In [None]:
collection.load()

List its artifacts:

In [None]:
collection.artifacts.df()

### Train a machine learning model

Using {class}`~lamindb.dev.MappedCollection` you can train machine learning models on large collections of artifacts:

```python
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(label_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
dl = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in dl:
    pass
```
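
Weighted sampling is typically used to balance classes. A common scheme is inverse-frequency weighting, sketched below on a made-up, imbalanced label list; whether `get_label_weights` uses exactly this scheme is an assumption.

```python
from collections import Counter

# Hypothetical, imbalanced labels for illustration
labels = ["DMSO", "DMSO", "DMSO", "IFNG"]


def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (number of samples sharing its label)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


weights = inverse_frequency_weights(labels)

# With these weights, every class carries the same total weight
totals = Counter()
for label, weight in zip(labels, weights):
    totals[label] += weight
print(dict(totals))
```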

## More examples

### Understand data lineage

View the sequence of data transformations ({class}`~lamindb.Transform`) in a project (from [here](docs:project-flow), based on [Schmidt _et al._, 2022](https://pubmed.ncbi.nlm.nih.gov/35113687/)):

```python
transform.view_parents()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMOOPay.svg" width="400">

Or, the generating flow of an artifact:

```python
artifact.view_lineage()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Ykitjn.svg" width="800">


Both figures are based on mere calls to `ln.track()` in notebooks, pipelines & apps.


### Manage biological registries

Create a cell type registry from public knowledge and add a new cell state (from [here](bio-registries)):

In [None]:
import lnschema_bionty as lb

# create an ontology-coupled cell type record and save it
lb.CellType.from_public(name="neuron").save()

# create a record to track a new cell state
new_cell_state = lb.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
cell_types = lb.CellType.lookup()
new_cell_state.parents.add(cell_types.neuron)

In [None]:
# view ontological hierarchy
new_cell_state.view_parents(distance=2)

### Leverage a mesh of instances

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can load your instance on the command-line using:

```shell
lamin load myhandle/myinstance
```

If you run `lamin save <notebook_path>`, you will save the notebook to your default storage location.

You can explore the notebook report corresponding to the quickstart [here](https://lamin.ai/laminlabs/lamindata/record/core/Transform?id=FPnfDtJz8qbEz8) in LaminHub.

### Manage custom schemas

LaminDB can be customized & extended with schema & app plug-ins building on the [Django](https://github.com/django/django) ecosystem. Examples are

- [lnschema_bionty](lnschema_bionty): Registries for basic biological entities, coupled to public ontologies.
- [lnschema_lamin1](https://github.com/laminlabs/lnschema-lamin1): Exemplary custom schema to manage samples, treatments, etc. 

If you'd like to create your own schema or app:

1. Create a git repository with registries similar to [lnschema_lamin1](https://github.com/laminlabs/lnschema-lamin1)
2. Create & deploy migrations via `lamin migrate create` and `lamin migrate deploy`

It's fastest if we do this for you based on our templates within an enterprise plan.

## Design

### Why?

See this [blog post](https://lamin.ai/blog/2022/problems).

### Schema & API

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/XoTQFCmmj2uU4d2xyj9t.png" width="350px" style="background: transparent" align="right">

LaminDB provides a SQL schema for common entities: {class}`~lamindb.Artifact`, {class}`~lamindb.Collection`, {class}`~lamindb.Transform`, {class}`~lamindb.Feature`, {class}`~lamindb.ULabel` etc. - see the [API reference](reference) or the [source code](https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py).

The core schema is extendable through plug-ins (see blue vs. red entities in **graphic**), e.g., with basic biological ({class}`~lnschema_bionty.Gene`, {class}`~lnschema_bionty.Protein`, {class}`~lnschema_bionty.CellLine`, etc.) & operational entities (`Biosample`, `Techsample`, `Treatment`, etc.).

```{dropdown} What is the schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables.

[Django](https://github.com/django/django) is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

In the first year, LaminDB used SQLModel/SQLAlchemy -- we might bring back compatibility.

```
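
The idea of deriving SQL tables from Python classes can be illustrated without Django. The toy sketch below maps a simplified, hypothetical `ULabel` dataclass to a `CREATE TABLE` statement and runs it against an in-memory SQLite database. It is not how Django generates migrations and not LaminDB's actual schema.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class ULabel:  # hypothetical, simplified model
    id: int
    name: str
    description: str


SQL_TYPES = {int: "INTEGER", str: "TEXT"}


def create_table_sql(model) -> str:
    """Derive a CREATE TABLE statement from the dataclass fields."""
    columns = ", ".join(f"{f.name} {SQL_TYPES[f.type]}" for f in fields(model))
    return f"CREATE TABLE {model.__name__.lower()} ({columns})"


ddl = create_table_sql(ULabel)
print(ddl)

con = sqlite3.connect(":memory:")
con.execute(ddl)
con.execute("INSERT INTO ulabel VALUES (?, ?, ?)", (1, "is_study", "parent label"))
rows = con.execute("SELECT name FROM ulabel").fetchall()
con.close()
print(rows)
```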

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

The code for this is open-source & accessible through the dependencies & repositories listed below.
 
### Dependencies

- Data is stored in a platform-independent way: 
    - location → local, on AWS S3 or GCP Storage, accessed through `fsspec`
    - format → blob-like artifacts or queryable formats like parquet, zarr, HDF5, TileDB, ...
- Metadata is stored in SQL: current backends are SQLite (small teams) and Postgres (any team size).
- Django ORM for schema management & metadata queries.
- Biological knowledge sources & ontologies: see [Bionty](https://lamin.ai/docs/bionty).

For more details, see the [pyproject.toml](https://github.com/laminlabs/lamindb/blob/main/pyproject.toml) file in lamindb & the linked repositories below.

### Repositories

LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets:

- [lamindb](https://github.com/laminlabs/lamindb): Core API, which builds on the [core schema](https://github.com/laminlabs/lnschema-core).
- [lnschema-bionty](https://github.com/laminlabs/lnschema-bionty): Registries for basic biological entities, coupled to public ontologies.
- [lnschema-lamin1](https://github.com/laminlabs/lnschema-lamin1): Exemplary custom schema to manage samples, treatments, etc.
- [lamindb-setup](https://github.com/laminlabs/lamindb-setup): Setup & configure LaminDB, client for LaminHub.
- [lamin-cli](https://github.com/laminlabs/lamin-cli): CLI for `lamindb` and `lamindb-setup`.
- [bionty](https://github.com/laminlabs/bionty): Accessor for public biological ontologies.
- [nbproject](https://github.com/laminlabs/nbproject): Metadata parser for Jupyter notebooks.
- [lamin-utils](https://github.com/laminlabs/lamin-utils): Generic utilities, e.g., a logger.
- [readfcs](https://github.com/laminlabs/readfcs): FCS file reader.
<!-- [bionty-assets](https://github.com/laminlabs/bionty-assets): Hosted assets of parsed public biological ontologies. -->

LaminHub is not open-sourced, and neither are plug-ins that model lab operations.


### Assumptions & principles

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

1. Data is generated in batches by instruments that process physical samples
2. Batches are transformed into more useful representations
3. Semantics of high-level embeddings ("inflammatory", "lipophile") are anchored in experimental metadata and knowledge (ontologies)
4. Experimental metadata is another ontology type
5. Experiments measure features ({class}`~lamindb.Feature`, {class}`~lnschema_bionty.CellMarker`, ...)
6. Samples are annotated by labels ({class}`~lamindb.ULabel`, {class}`~lnschema_bionty.CellLine`, ...)
7. Learning and data warehousing both iterate transformations (see **graphic**, {class}`~lamindb.Transform`)
8. Basic biological entities should have the same meaning to anyone and across any data platform

### Influences

LaminDB was influenced by many other projects, see {doc}`docs:influences`.

## Notebooks

- Find all tutorial & guide notebooks [here](https://github.com/laminlabs/lamindb/tree/main/docs/) and use cases [here](https://github.com/laminlabs/lamin-usecases).
- You can run these notebooks in hosted versions of JupyterLab, e.g., [Saturn Cloud](https://github.com/laminlabs/run-lamin-on-saturn), Google Vertex AI, Google Colab, and others.

In [None]:
# clean up test instance
!lamin delete --force lamin-intro
!rm -r lamin-intro