# Manage interactive analyses

Capturing and documenting the origin and flow of biological data throughout its lifecycle makes data and insights traceable and reliable. It helps verify experimental outcomes, meet stringent regulatory standards, and foster the reproducibility of scientific discoveries.

Tracking data lineage is comparatively easy when data flows through deterministic pipelines. It becomes hard once interactive, human-driven analyses are involved.

This use case walks through how LaminDB helps with the latter: calling `ln.track()` records data flow through notebooks and across teams of analysts.

## Setup

```{warning}

Please ensure that you have created or loaded a LaminDB instance before running the remaining part of this notebook!
```

In [None]:
# Initialize a LaminDB instance with the Bionty schema (skip if you already loaded your instance)
!lamin init --storage ./analysis-usecase --schema bionty

Import `lamindb` and `lnschema_bionty`. The latter connects [Bionty](https://github.com/laminlabs/bionty) with [LaminDB](https://github.com/laminlabs/lamindb), which lets us map AnnData metadata annotations against ontologies and create SQL records within LaminDB so that they become queryable.

In [None]:
import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = 3  # show hints
lb.settings.species = "human"  # globally set species

## Track cell types, tissues and diseases

Let's track the current notebook as a transform using {func}`docs:lamindb.track`:

In [None]:
ln.track()

We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:

In [None]:
adata = ln.dev.datasets.anndata_with_obs()

In [None]:
adata

In [None]:
adata.var_names[:5]

In [None]:
adata.obs[["tissue", "cell_type", "disease"]].value_counts()
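
Calling `value_counts()` on several columns counts how often each distinct combination of values occurs. A minimal standard-library sketch of the same idea, using made-up rows rather than the actual dataset:

```python
from collections import Counter

# Hypothetical annotation rows (tissue, cell_type, disease); not the real dataset.
rows = [
    ("kidney", "T cell", "chronic kidney disease"),
    ("kidney", "T cell", "chronic kidney disease"),
    ("liver", "hematopoietic stem cell", "liver lymphoma"),
    ("liver", "T cell", "liver lymphoma"),
]

# Count each distinct combination, most frequent first.
combination_counts = Counter(rows).most_common()

for combination, count in combination_counts:
    print(combination, count)
```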

### Register biological metadata and link to the dataset

As a first step, we register the AnnData object with LaminDB by creating a {func}`docs:lamindb.File` from it and calling `file.save()`:

In [None]:
file = ln.File.from_anndata(
    adata, key="mini_anndata_with_obs.h5ad", var_ref=lb.Gene.ensembl_gene_id
)

In [None]:
file.save()

Using `from_values`, we can map the cell types, tissues, and diseases in the dataset to ontology records:

In [None]:
cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)
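
Conceptually, mapping free-text annotations to an ontology means looking up each unique value in a controlled vocabulary and flagging the values that do not match. The sketch below illustrates only that idea in plain Python. It is not how `from_values` is implemented, and the vocabulary, the placeholder IDs, and the `map_to_vocabulary` helper are all hypothetical.

```python
# Hypothetical mini vocabulary: names mapped to placeholder IDs (not real ontology IDs).
VOCABULARY = {
    "T cell": "EX:0001",
    "hematopoietic stem cell": "EX:0002",
}


def map_to_vocabulary(values, vocabulary):
    """Split unique values into those found in the vocabulary and those that are not."""
    mapped, unmapped = {}, []
    for value in dict.fromkeys(values):  # de-duplicate while keeping first-seen order
        if value in vocabulary:
            mapped[value] = vocabulary[value]
        else:
            unmapped.append(value)
    return mapped, unmapped


# Hypothetical annotation column with one value the vocabulary does not know.
observed = ["T cell", "T cell", "hematopoietic stem cell", "my new cell type"]
mapped, unmapped = map_to_vocabulary(observed, VOCABULARY)
print(mapped)
print(unmapped)
```

Values that end up in `unmapped` are the ones a curator would need to review, for example by correcting a typo or adding a new term.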

Associate the cell types, tissues, and diseases with the {func}`docs:lamindb.File` object via features:

In [None]:
file.features.add_labels(cell_types)
file.features.add_labels(tissues)
file.features.add_labels(diseases)

In [None]:
file.describe()

### Your vocabulary store

Examine the currently available cell types and tissues:

In [None]:
lb.CellType.filter().df()

In [None]:
lb.Tissue.filter().df()

## Processing of the dataset

In the following we will modify the AnnData object to demonstrate data lineage tracking with LaminDB.

To track our data transformation we create a new {func}`docs:lamindb.Transform` of type "pipeline":

In [None]:
transform = ln.Transform(
    type="pipeline", name="subset_to_T_cells_and_liver_lymphoma", version="0.1.0"
)

Switch tracking to the new transform using {func}`docs:lamindb.track`:

In [None]:
ln.track(transform)

### Get a cloud-backed AnnData object

In [None]:
file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()

In [None]:
adata = file.backed()
adata

In [None]:
adata.obs[["cell_type", "disease"]].value_counts()

### Subset dataset to specific cell types and diseases

Create the subset:

In [None]:
subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)

In [None]:
adata_subset = adata[subset_obs]
adata_subset

In [None]:
adata_subset.obs[["cell_type", "disease"]].value_counts()
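
The subset above combines two membership tests with `&`, so an observation is kept only if both its cell type and its disease are in the allowed lists. A minimal plain-Python sketch of that logic, using hypothetical observations in place of `adata.obs`:

```python
# Hypothetical observations standing in for rows of adata.obs.
observations = [
    {"cell_type": "T cell", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "some other disease"},
    {"cell_type": "some other cell type", "disease": "chronic kidney disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "chronic kidney disease"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per observation: True only if BOTH conditions hold.
mask = [
    obs["cell_type"] in keep_cell_types and obs["disease"] in keep_diseases
    for obs in observations
]

# Apply the mask to select the matching observations.
subset = [obs for obs, keep in zip(observations, mask) if keep]
print(mask)
print(len(subset))
```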

### Add the subset `AnnData` to LaminDB

This subset can now be registered with LaminDB.

In [None]:
file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    key="subset/mini_anndata_with_obs.h5ad",
    var_ref=lb.Gene.ensembl_gene_id,
)

In [None]:
file_subset.save()

Add labels to features:

In [None]:
# Derive labels from the subset so that only annotations present in it are linked
cell_types = lb.CellType.from_values(adata_subset.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata_subset.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata_subset.obs.disease, lb.Disease.name)

file_subset.features.add_labels(cell_types)
file_subset.features.add_labels(tissues)
file_subset.features.add_labels(diseases)

In [None]:
file_subset.describe()

## Examining data lineage

Common questions that might arise are:

- Which h5ad file is in the `subset` subfolder?
- Which notebook ingested this file?
- By whom?
- And which file is its parent?

Let's answer these questions using LaminDB.

To learn which h5ad file is in the `subset` subfolder, query for an `.h5ad` file whose key starts with `subset` and that is annotated with "hematopoietic stem cell" and "T cell":

In [None]:
cell_types_bt_lookup = lb.CellType.lookup()

In [None]:
file_subset = ln.File.filter(
    suffix=".h5ad",
    key__startswith="subset",
    cell_types__in=[
        cell_types_bt_lookup.hematopoietic_stem_cell,
        cell_types_bt_lookup.t_cell,
    ],
).first()

In [None]:
file_subset

In [None]:
file.view_lineage()

Which notebook ingested this file?

In [None]:
file_subset.transform

Who ingested this file?

In [None]:
file_subset.created_by

What are the parent files?

In [None]:
file_subset.run.input_files.list("key")
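
The lineage questions above can be answered because every file is linked to the run that created it, and every run records its transform, its user, and its input files. The sketch below models that idea with plain dataclasses. It is a simplified illustration, not LaminDB's actual schema, and the class names, user name, and `ancestors` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of a transform (notebook or pipeline) by a user."""

    transform: str
    created_by: str
    input_keys: list = field(default_factory=list)


@dataclass
class FileRecord:
    """A stored file together with the run that created it."""

    key: str
    run: Run


# Hypothetical records that only mirror the structure of this use case.
ingest_run = Run(transform="ingest notebook", created_by="analyst_a")
parent = FileRecord(key="mini_anndata_with_obs.h5ad", run=ingest_run)

subset_run = Run(
    transform="subset_to_T_cells_and_liver_lymphoma",
    created_by="analyst_a",
    input_keys=[parent.key],
)
child = FileRecord(key="subset/mini_anndata_with_obs.h5ad", run=subset_run)

registry = {record.key: record for record in (parent, child)}


def ancestors(key, registry):
    """Walk input links from a file back to every upstream file key."""
    found = []
    for input_key in registry[key].run.input_keys:
        found.append(input_key)
        found.extend(ancestors(input_key, registry))
    return found


print(child.run.transform, child.run.created_by)
print(ancestors(child.key, registry))
```

Because the links are stored per run, the same traversal works for longer chains of notebooks and pipelines without any extra bookkeeping by the analyst.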

## Try it yourself

This notebook is available at [https://github.com/laminlabs/lamin-usecases](https://github.com/laminlabs/lamin-usecases).

In [None]:
!lamin delete analysis-usecase
!rm -r ./analysis-usecase