# Introduction

```{include} ../README.md
:start-line: 6
:end-line: -4
```

:::{dropdown} LaminDB features

```{include} features-lamindb.md
```
:::

LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.

:::{dropdown} LaminHub features

```{include} features-laminhub.md
```
:::

Basic features of LaminHub are free. Enterprise features, support, integration tests & wetlab plug-ins, hosted in your infrastructure or ours, are available on a [paid plan](https://lamin.ai/pricing): please [reach out](https://lamin.ai/contact)!

## Quickstart

```{warning}

Public beta: We are close to converging on a stable API, but some breaking changes might still occur.

```

### Set up LaminDB

1. Install the `lamindb` Python package:
    ```shell
    pip install 'lamindb[jupyter,bionty]'
    ```
2. [Sign up](https://lamin.ai/signup) for a free account (see more [info](https://lamin.ai/docs/setup)) and copy the API key.
3. Log in on the command line (data remains in your infrastructure, with Lamin having no access to it):
    ```shell
    lamin login <email> --key <API-key>
    ```

You can now init LaminDB instances like you init git repositories:

In [None]:
!lamin init --schema bionty --storage ./lamin-intro  # or s3://my-bucket, gs://my-bucket as default storage

Because we passed `--schema bionty`, this instance mounts the plug-in {mod}`lnschema_bionty`.

### Register a file

Track files using the {class}`~lamindb.File` registry:

In [None]:
import lamindb as ln
import pandas as pd

# track run context
ln.track()

# access a batch of data
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
    index=["observation1", "observation2", "observation3"],
)

# create a file (versioning is optional)
file = ln.File(df, description="my RNA-seq", version="1")
# register file
file.save()

### Access a file

In [None]:
# search a file
ln.File.search("RNAseq")

# query a file
file = ln.File.filter(description__contains="RNA-seq").first()

# view data flow
file.view_flow()

# describe metadata
file.describe()

# load the file
df = file.load()

### Define features & labels

Define features and labels using {class}`~lamindb.Feature` and {class}`~lamindb.ULabel`:

In [None]:
# define features
features = ln.Feature.from_df(df)
ln.save(features)

# define tissue label
tissue = ln.ULabel(name="umbilical blood")
tissue.save()

# define a parent label
is_tissue = ln.ULabel(name="is_tissue")
is_tissue.save()
is_tissue.children.add(tissue)

# view hierarchy
tissue.view_parents()
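
To make the parent/child relationship concrete, here is a minimal pure-Python sketch of a label hierarchy. The `Label` class and the `ancestors` helper are hypothetical stand-ins for illustration; they are not how {class}`~lamindb.ULabel` is implemented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class Label:
    """Hypothetical stand-in for a label record (not LaminDB's ULabel)."""

    name: str
    parents: list[Label] = field(default_factory=list)
    children: list[Label] = field(default_factory=list)

    def add_child(self, child: Label) -> None:
        # keep both directions of the relationship in sync
        self.children.append(child)
        child.parents.append(self)


def ancestors(label: Label) -> list[str]:
    """Collect the names of all ancestors, nearest first."""
    names: list[str] = []
    queue = list(label.parents)
    while queue:
        parent = queue.pop(0)
        if parent.name not in names:
            names.append(parent.name)
            queue.extend(parent.parents)
    return names


tissue = Label("umbilical blood")
is_tissue = Label("is_tissue")
is_tissue.add_child(tissue)

print(ancestors(tissue))
```

Storing both directions on each record is one simple design choice; a registry backed by a database would more likely store the link once and derive the reverse direction through a query.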

### Validate & annotate data

In [None]:
# create file & validate features
file = ln.File.from_df(df, description="my RNA-seq")
# register file & link validated features
file.save()

# annotate with a label
file.labels.add(tissue)
# show metadata
file.describe()

### Query for annotations

In [None]:
# a look-up object for all the children of "is_tissue" in ULabel registry
tissues = is_tissue.children.lookup()

# query for exactly one result annotated with umbilical blood
file = ln.File.filter(ulabels=tissues.umbilical_blood).one()

# permanently delete the file (without the permanent flag, moves to trash)
file.delete(permanent=True)

### Use biological types

The generic {class}`~lamindb.Feature` and {class}`~lamindb.ULabel` will get you pretty far.

But if you use an entity many times, you typically want a dedicated registry, which you can use to type your code & as an interface for public ontologies.

Let's do this with {class}`~lnschema_bionty.Gene` and {class}`~lnschema_bionty.Tissue` from plug-in {mod}`lnschema_bionty`:

In [None]:
import lnschema_bionty as lb

# create gene records from the public ontology as features
genes = lb.Gene.from_values(df.columns, organism="human")
ln.save(genes)

# query the entire Gene registry content as a DataFrame
lb.Gene.filter().df()

# create file & validate features using the symbol field of Gene
file = ln.File.from_df(
    df, description="my RNA-seq", field=lb.Gene.symbol, organism="human"
)
file.save()

# search the public tissue ontology from the bionty store
lb.Tissue.bionty().search("umbilical blood").head(2)

# define tissue label
tissue = lb.Tissue.from_bionty(name="umbilical cord blood")
tissue.save()

# ontological hierarchy comes by default
tissue.view_parents(distance=2)

# annotate with tissue label
file.labels.add(tissue)

# show metadata
file.describe()

Query for gene sets & the linked files:

In [None]:
# an object to auto-complete human genes
genes = lb.Gene.filter(organism__name="human").lookup()

# all gene sets measuring CD8A
genesets_with_cd8a = ln.FeatureSet.filter(genes=genes.cd8a).all()

# all files measuring CD8A
ln.File.filter(feature_sets__in=genesets_with_cd8a).df()
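
The query above goes from a gene to the feature sets that contain it, and from those feature sets to the files linked to them. The following sketch mimics that two-step lookup with plain dictionaries. All names and the in-memory layout are assumptions made for illustration, not LaminDB's data model.

```python
# hypothetical in-memory registries; identifiers are illustrative only
feature_sets = {
    "fs1": {"CD8A", "CD4", "CD14"},
    "fs2": {"CD4", "CD38"},
}
file_feature_sets = {
    "my RNA-seq": ["fs1"],
    "other file": ["fs2"],
}


def feature_sets_measuring(gene):
    """Step 1: all feature sets that contain the gene."""
    return {fs_id for fs_id, genes in feature_sets.items() if gene in genes}


def files_measuring(gene):
    """Step 2: all files linked to at least one of those feature sets."""
    hits = feature_sets_measuring(gene)
    return sorted(
        name for name, fs_ids in file_feature_sets.items() if hits.intersection(fs_ids)
    )


print(files_measuring("CD8A"))
print(files_measuring("CD4"))
```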

### Append a new batch of data

In [None]:
# assume we now run a pipeline in which we access a new batch of data
transform = ln.Transform(name="RNA-seq file ingestion", type="pipeline", version="1")
ln.track(transform)

# access a new batch of data with a different schema
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
    },
    index=["observation4", "observation5", "observation6"],
)

# because gene `"CD38"` is not yet registered, it doesn't yet validate
file2 = ln.File.from_df(
    df, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)

# let's add it to the `Gene` registry and re-create the file - now everything passes
lb.Gene.from_bionty(symbol="CD38", organism="human").save()

# now we can validate all features
file2 = ln.File.from_df(
    df, description="my RNA-seq batch 2", field=lb.Gene.symbol, organism="human"
)
file2.save()
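
The validation step can be pictured as checking column names against the set of registered symbols. The sketch below uses only built-in types; `validate` and `gene_registry` are hypothetical names, and the real validation in LaminDB may do considerably more (for example, handling synonyms).

```python
def validate(columns, registry):
    """Split column names into validated and not-yet-registered ones."""
    validated = [c for c in columns if c in registry]
    missing = [c for c in columns if c not in registry]
    return validated, missing


# hypothetical registry content after the first batch was registered
gene_registry = {"CD8A", "CD4", "CD14"}
batch2_columns = ["CD8A", "CD4", "CD38"]

# first attempt: CD38 is unknown, so it does not validate
validated, missing = validate(batch2_columns, gene_registry)
print(validated, missing)

# registering the missing symbol makes the whole batch pass
gene_registry.update(missing)
validated, missing = validate(batch2_columns, gene_registry)
print(validated, missing)
```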

Create a dataset using {class}`~lamindb.Dataset` by linking both batches in a "sharded dataset":

In [None]:
dataset = ln.Dataset([file, file2], name="my RNA-seq dataset")
dataset.save()
dataset.describe()
dataset.view_flow()

You can load the entire dataset into memory as if it were a single file:

In [None]:
dataset.load()
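
Because the two batches have different columns, loading them as one table has to reconcile the schemas. A reasonable mental model, assuming an outer-join concatenation (this is an assumption, not a description of how `dataset.load()` is implemented), looks like this in plain pandas:

```python
import pandas as pd

batch1 = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7]},
    index=["observation1", "observation2", "observation3"],
)
batch2 = pd.DataFrame(
    {"CD8A": [2, 3, 3], "CD4": [3, 4, 5], "CD38": [4, 2, 3]},
    index=["observation4", "observation5", "observation6"],
)

# an outer join keeps the union of columns;
# values that a batch did not measure become NaN
combined = pd.concat([batch1, batch2], join="outer")
print(combined.shape)
```

Under this model, `CD14` is missing for the second batch and `CD38` is missing for the first.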

Or list its files:

In [None]:
dataset.files.df()

## More examples

### Understand data flow

View the sequence of data transformations ({class}`~lamindb.Transform`) in a project (from [here](docs:project-flow), based on [Schmidt _et al._, 2022](https://pubmed.ncbi.nlm.nih.gov/35113687/)):

```python
transform.view_parents()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/b0geN1HDHXlORqMOOPay.svg" width="400">

Or, the generating flow of a file or dataset:

```python
file.view_flow()
```

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/KQmzmmLOeBN0C8Ykitjn.svg" width="800">


Both figures are based on mere calls to `ln.track()` in notebooks, pipelines & apps.
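
As a rough illustration of why tracking runs is enough to draw such graphs: if every run records its inputs and outputs, the generating flow of any output can be recovered by walking backwards. The records and the `generating_flow` helper below are hypothetical and do not reflect LaminDB's internal data model.

```python
# hypothetical run records: which transform produced which outputs from which inputs
runs = [
    {"transform": "ingest", "inputs": [], "outputs": ["raw.parquet"]},
    {"transform": "normalize", "inputs": ["raw.parquet"], "outputs": ["norm.parquet"]},
    {"transform": "embed", "inputs": ["norm.parquet"], "outputs": ["embedding.parquet"]},
]

# index: output name -> the run that produced it
produced_by = {out: run for run in runs for out in run["outputs"]}


def generating_flow(artifact):
    """Return the transforms upstream of `artifact`.

    The walk goes backwards from the artifact and the result is reversed,
    so for a linear chain the earliest transform comes first.
    """
    flow = []
    seen = set()
    pending = [artifact]
    while pending:
        current = pending.pop()
        run = produced_by.get(current)
        if run is None or run["transform"] in seen:
            continue
        seen.add(run["transform"])
        flow.append(run["transform"])
        pending.extend(run["inputs"])
    return list(reversed(flow))


print(generating_flow("embedding.parquet"))
```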


### Manage biological registries

Create a cell type registry from public knowledge and add a new cell state (from [here](bio-registries)):

In [None]:
import lnschema_bionty as lb

# create an ontology-coupled cell type record and save it
lb.CellType.from_bionty(name="neuron").save()

# create a record to track a new cell state
new_cell_state = lb.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
cell_types = lb.CellType.lookup()
new_cell_state.parents.add(cell_types.neuron)

In [None]:
# view ontological hierarchy
new_cell_state.view_parents(distance=2)

### Leverage a mesh of instances

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can load your instance on the command line using:

```shell
lamin load myhandle/myinstance
```

If you run `lamin save <notebook_path>`, you will save the notebook to your default storage location.

You can explore the notebook report corresponding to the quickstart [here](https://lamin.ai/laminlabs/lamindata/record/core/Transform?id=FPnfDtJz8qbEz8) in LaminHub.

### Manage custom schemas

LaminDB can be customized & extended with schema & app plug-ins building on the [Django](https://github.com/django/django) ecosystem. Examples are:

- [lnschema_bionty](lnschema_bionty): Registries for basic biological entities, coupled to public ontologies.
- [lnschema_lamin1](https://github.com/laminlabs/lnschema-lamin1): Exemplary custom schema to manage samples, treatments, etc. 

If you'd like to create your own schema or app:

1. Create a git repository with registries similar to [lnschema_lamin1](https://github.com/laminlabs/lnschema-lamin1)
2. Create & deploy migrations via `lamin migrate create` and `lamin migrate deploy`

It's fastest if we do this for you based on our templates within an enterprise plan.

## Design

### Why?

When we started working on Lamin, we wrote a [blog post](https://lamin.ai/blog/2022/problems) about the key problems it tries to solve.

### Schema & API

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/XoTQFCmmj2uU4d2xyj9t.png" width="350px" style="background: transparent" align="right">

LaminDB provides a SQL schema for common entities: {class}`~lamindb.File`, {class}`~lamindb.Dataset`, {class}`~lamindb.Transform`, {class}`~lamindb.Feature`, {class}`~lamindb.ULabel` etc. - see the [API reference](reference) or the [source code](https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py).

The core schema is extendable through plug-ins (see blue vs. red entities in **graphic**), e.g., with basic biological ({class}`~lnschema_bionty.Gene`, {class}`~lnschema_bionty.Protein`, {class}`~lnschema_bionty.CellLine`, etc.) & operational entities (`Biosample`, `Techsample`, `Treatment`, etc.).

```{dropdown} What is the schema language?

Data models are defined in Python using the Django ORM. Django translates them to SQL tables.

[Django](https://github.com/django/django) is one of the most-used & highly-starred projects on GitHub (~1M dependents, ~73k stars) and has been robustly maintained for 15 years.

In the first year, LaminDB used SQLModel/SQLAlchemy -- we might bring back compatibility.

```

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

The code for this is open-source & accessible through the dependencies & repositories listed below.
 
### Dependencies

- Data is stored in a platform-independent way: 
    - location → local, on AWS S3 or GCP Storage, accessed through `fsspec`
    - format → blob-like files or queryable formats like parquet, zarr, HDF5, TileDB, ...
- Metadata is stored in SQL: current backends are SQLite (small teams) and Postgres (any team size).
- Django ORM for schema management & metadata queries.
- Biological knowledge sources & ontologies: see [Bionty](https://lamin.ai/docs/bionty).

For more details, see the [pyproject.toml](https://github.com/laminlabs/lamindb/blob/main/pyproject.toml) file in lamindb & the linked repositories below.
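
The split between data in storage and metadata in SQL can be sketched with the standard library alone. In the example below, the table layout, column names, and the `register` helper are assumptions made for illustration; LaminDB's actual schema is defined in the core schema repository and differs from this.

```python
import hashlib
import sqlite3

# metadata lives in SQL; here an in-memory SQLite database
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE file (
        id INTEGER PRIMARY KEY,
        storage_root TEXT NOT NULL,  -- e.g. a local path, s3://..., gs://...
        path TEXT NOT NULL,          -- location relative to the storage root
        content_hash TEXT NOT NULL,  -- lets us detect identical content
        description TEXT
    )
    """
)


def register(storage_root, path, content, description):
    """Record metadata for a file; the bytes themselves stay in storage."""
    digest = hashlib.sha256(content).hexdigest()
    cur = conn.execute(
        "INSERT INTO file (storage_root, path, content_hash, description) "
        "VALUES (?, ?, ?, ?)",
        (storage_root, path, digest, description),
    )
    conn.commit()
    return cur.lastrowid


register("./lamin-intro", "batch1.parquet", b"batch 1 bytes", "my RNA-seq")
register("s3://my-bucket", "batch2.parquet", b"batch 2 bytes", "my RNA-seq batch 2")

# metadata queries never touch the stored bytes
rows = conn.execute(
    "SELECT storage_root, path FROM file WHERE description LIKE ? ORDER BY id",
    ("%RNA-seq%",),
).fetchall()
print(rows)
```

Keeping only a storage root and a relative path in the database is what makes the metadata independent of where the bytes live.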

### Repositories

LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets:

- [lamindb](https://github.com/laminlabs/lamindb): Core API, which builds on the [core schema](https://github.com/laminlabs/lnschema-core).
- [lnschema-bionty](https://github.com/laminlabs/lnschema-bionty): Registries for basic biological entities, coupled to public ontologies.
- [lnschema-lamin1](https://github.com/laminlabs/lnschema-lamin1): Exemplary custom schema to manage samples, treatments, etc.
- [lamindb-setup](https://github.com/laminlabs/lamindb-setup): Setup & configure LaminDB, client for LaminHub.
- [lamin-cli](https://github.com/laminlabs/lamin-cli): CLI for `lamindb` and `lamindb-setup`.
- [bionty](https://github.com/laminlabs/bionty): Accessor for public biological ontologies.
- [nbproject](https://github.com/laminlabs/nbproject): Metadata parser for Jupyter notebooks.
- [lamin-utils](https://github.com/laminlabs/lamin-utils): Generic utilities, e.g., a logger.
- [readfcs](https://github.com/laminlabs/readfcs): FCS file reader.
<!-- [bionty-assets](https://github.com/laminlabs/bionty-assets): Hosted assets of parsed public biological ontologies. -->

LaminHub is not open-sourced, and neither are plug-ins that model lab operations.


### Assumptions & principles

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BunYmHkyFLITlM5MYQci.svg" width="350px" style="background: transparent" align="right">

1. Data is generated by instruments that process physical samples: it comes in batches stored as immutable files.
2. Files are transformed into more useful data representations, e.g.:
   - Summary statistics, e.g., count matrices for fastq files
   - Arrays of non-array-like input data (e.g., images)
   - Higher-level embeddings for lower-level array, text or graph representations
   - Concatenated arrays for large-scale atlas-like datasets
3. Semantics of high-level embeddings ("inflammatory", "lipophile") are anchored in experimental metadata and knowledge (ontologies)
4. Experimental metadata is another ontology type
5. Experiments measure features ({class}`~lamindb.Feature`, {class}`~lnschema_bionty.CellMarker`, ...)
6. Samples are annotated by labels ({class}`~lamindb.ULabel`, {class}`~lnschema_bionty.CellLine`, ...)
7. Learning and data warehousing both iterate transformations (see **graphic**, {class}`~lamindb.Transform`)
8. Basic biological entities should have the same meaning to anyone and across any data platform
9. Schema migrations should be easy

### Influences

LaminDB was influenced by many other projects, see {doc}`docs:influences`.

## Notebooks

- Find all tutorial & guide notebooks [here](https://github.com/laminlabs/lamindb/tree/main/docs/) and use cases [here](https://github.com/laminlabs/lamin-usecases).
- You can run these notebooks in hosted versions of JupyterLab, e.g., [Saturn Cloud](https://github.com/laminlabs/run-lamin-on-saturn), Google Vertex AI, Google Colab, and others.
- Jupyter Lab & Notebook offer a fully interactive experience; VS Code & others require using the CLI to track notebooks: `lamin track my-notebook.ipynb`

In [None]:
!lamin delete --force lamin-intro