[![Jupyter Notebook](https://img.shields.io/badge/Jupyter%20Notebook-orange)](https://github.com/laminlabs/lamin-usecases/blob/main/docs/birds-eye.ipynb)

# Bird's eye view

## Background

Data lineage tracks data's journey, detailing its origins, transformations, and interactions to trace the source of biological insights, verify experimental outcomes, meet regulatory standards, and increase the robustness of research.
While tracking data lineage is easier when it is governed by deterministic pipelines, it becomes hard when it's governed by interactive human-driven analyses.

Here, we'll backtrace file transformations through notebooks, pipelines & app uploads in a research project based on [Schmidt22](https://pubmed.ncbi.nlm.nih.gov/35113687/).
The study conducted genome-wide CRISPR activation and interference screens in primary human T cells to identify gene networks controlling IL-2 and IFN-γ production, revealing insights into cytokine regulation.

## Setup

We need an instance:

In [None]:
!lamin init --storage ./mydata

Import lamindb:

In [None]:
import lamindb as ln

We simulate the raw data processing of Schmidt22 with toy data in a real-world setting with multiple collaborators (here, testuser1 and testuser2):

In [None]:
assert ln.setup.settings.user.handle == "testuser1"

bfx_run_output = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
ln.File(bfx_run_output.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(bfx_run_output.parent / "fastq/perturbseq_R2_001.fastq.gz").save()

## Track a bioinformatics pipeline

When working with a pipeline, we'll register it before running it.

This only happens once and could be done by anyone on your team.

In [None]:
ln.setup.login("testuser2")

In [None]:
transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")

In [None]:
ln.User.filter().df()

In [None]:
transform

In [None]:
ln.track(transform)

Now, let's stage a few files from an instrument upload:

In [None]:
files = ln.File.filter(key__startswith="fastq/perturbseq").all()
filepaths = [file.stage() for file in files]

Assume we processed them and obtained 3 output files in a folder `'filtered_feature_bc_matrix'`:

In [None]:
output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)

Let's look at the data lineage at this stage:

In [None]:
output_files[0].view_lineage()

And let's keep running the Cell Ranger pipeline in the background.

In [None]:
transform = ln.Transform(
    name="Preprocess Cell Ranger outputs", version="2.0", type="pipeline"
)
ln.track(transform)
[f.stage() for f in output_files]
filepath = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
file = ln.File(filepath, description="perturbseq counts")
file.save()

## Track app upload & analytics

The hidden cell below simulates additional analytic steps including:

* uploading phenotypic screen data
* scRNA-seq analysis
* analyses of the integrated datasets

In [None]:
# app upload
ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)
filepath = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
file = ln.File(filepath, description="Raw data of schmidt22 crispra GWS")
file.save()

# upload and analyze the GWS data
ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRIPSRa analysis", type="notebook")
ln.track(transform)
file_wgs = ln.File.filter(key="schmidt22-crispra-gws-IFNG.csv").one()
df = file_wgs.load().set_index("id")
hits_df = df[df["pos|fdr"] < 0.01].copy()
file_hits = ln.File(hits_df, description="hits from schmidt22 crispra GWS")
file_hits.save()

Let's see what the data lineage of this looks like:

In [None]:
file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()

In the background, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:

In [None]:
# Let us add analytics on top of the cell ranger pipeline and the phenotypic screening
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
screen_hits = file_hits.load()

import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()

The outcome is a few figures stored as image files. Further down, we'll query one of them and look at its data lineage.

## Track notebooks

We'd now like to track the current Jupyter notebook to continue the work:

In [None]:
ln.track()

## Visualize data lineage

Let's load one of the plots:

In [None]:
file = ln.File.filter(key__contains="figures/matrixplot").one()
file.stage()

In [None]:
from IPython.display import Image, display

display(Image(filename=file.stage()))

We see that the image file is tracked as an input of the current notebook. The input is highlighted, and the notebook follows at the bottom:

In [None]:
file.view_lineage()

Alternatively, we can look only at the sequence of transforms and ignore the files:

In [None]:
transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()

In [None]:
transform.parents.df()

In [None]:
transform.view_parents()
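
To make the idea of a transform-only lineage more concrete, here is a minimal sketch in plain Python. It does not use lamindb: the `PARENTS` dictionary and the `ancestors` helper are hypothetical stand-ins for what `transform.parents` exposes, and the edges are an assumption that simply mirrors the steps run earlier in this guide.

```python
# Toy parent graph (assumption, not lamindb's data model): each transform
# name maps to the transforms whose outputs it consumed in this guide.
PARENTS = {
    "Chromium 10x upload": [],
    "Cell Ranger": ["Chromium 10x upload"],
    "Preprocess Cell Ranger outputs": ["Cell Ranger"],
    "Upload GWS CRISPRa result": [],
    "GWS CRIPSRa analysis": ["Upload GWS CRISPRa result"],
    "Perform single cell analysis, integrating with CRISPRa screen": [
        "Preprocess Cell Ranger outputs",
        "GWS CRIPSRa analysis",
    ],
}


def ancestors(name, graph):
    """Return all upstream transforms of `name`, nearest first, without duplicates."""
    seen = []
    queue = list(graph[name])
    while queue:
        parent = queue.pop(0)  # breadth-first: direct parents come first
        if parent not in seen:
            seen.append(parent)
            queue.extend(graph[parent])
    return seen


lineage = ancestors(
    "Perform single cell analysis, integrating with CRISPRa screen", PARENTS
)
for step in lineage:
    print(step)
```

Walking the graph breadth-first lists direct parents before more distant ancestors, which roughly matches reading a lineage plot from the bottom up.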

## Understand runs

We tracked pipeline and notebook runs through {class}`~docs:lamindb.dev.run_context`, which stores a {class}`~docs:lamindb.Transform` and a {class}`~docs:lamindb.Run` record as a global context.

{class}`~lamindb.File` objects are the inputs and outputs of runs. 

:::{dropdown} What if I don't want a global context?

Sometimes we don't want to create a global run context and instead pass a run manually when creating a file:
```python
run = ln.Run(transform=transform)
ln.File(filepath, run=run)
```

:::
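
The difference between a global run context and an explicitly passed run can be illustrated with a small, library-independent sketch. Everything here (`Run`, `File`, `current_run`, `make_file`) is a hypothetical toy model written for this illustration, not lamindb's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    transform: str  # name of the transform this run belongs to


@dataclass
class File:
    path: str
    run: Optional[Run] = None


# Hypothetical module-level variable playing the role of a global run context.
current_run: Optional[Run] = None


def make_file(path: str, run: Optional[Run] = None) -> File:
    """Attach the explicit run if given, else fall back to the global context."""
    return File(path, run=run if run is not None else current_run)


# No global context is set: the run is passed explicitly.
explicit = make_file("hits.csv", run=Run(transform="GWS analysis"))

# With a global context set, new files pick up the run implicitly.
current_run = Run(transform="Cell Ranger")
implicit = make_file("matrix.mtx.gz")
```

An explicit run keeps provenance local to one call, which can be convenient in library code or tests where mutating shared state is undesirable.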

:::{dropdown} When does a file appear as a run input?

When accessing a file via `stage()`, `load()` or `backed()`, two things happen:

1. The current run gets added to `file.input_of`
2. The transform of that file gets added as a parent of the current transform

You can switch off auto-tracking of run inputs by setting `ln.settings.track_run_inputs = False`: {doc}`docs:faq/track-run-inputs`

You can also track run inputs on a case-by-case basis via `is_run_input=True`, e.g., here:
```python
file.load(is_run_input=True)
```

:::
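
The two bookkeeping steps described for run inputs can be sketched with a toy model. This is an illustration under stated assumptions, not lamindb's code: the classes below are hypothetical, and the current run is passed as an argument rather than read from a global context.

```python
class Transform:
    def __init__(self, name):
        self.name = name
        self.parents = []  # upstream transforms


class Run:
    def __init__(self, transform):
        self.transform = transform


class File:
    def __init__(self, key, run):
        self.key = key
        self.run = run  # run that created the file
        self.input_of = []  # runs that consumed the file

    def load(self, current_run, is_run_input=True):
        """Pretend to load the file and record lineage if requested."""
        if is_run_input:
            # Step 1: register the current run as a consumer of this file.
            if current_run not in self.input_of:
                self.input_of.append(current_run)
            # Step 2: link the creating transform as a parent of the current one.
            upstream = self.run.transform
            if upstream not in current_run.transform.parents:
                current_run.transform.parents.append(upstream)
        return f"contents of {self.key}"


upload_run = Run(Transform("Upload GWS CRISPRa result"))
analysis_run = Run(Transform("GWS CRIPSRa analysis"))

raw = File("schmidt22-crispra-gws-IFNG.csv", run=upload_run)
raw.load(current_run=analysis_run)  # tracked as a run input
raw.load(current_run=analysis_run, is_run_input=False)  # access without tracking
```

Both steps are idempotent in this sketch, so loading the same file repeatedly within one run does not duplicate lineage entries.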

## Query by provenance

We can query or search for the notebook that created the file:

In [None]:
transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

In [None]:
ln.File.filter(transform=transform).df()

Which transform ingested a given file?

In [None]:
file = ln.File.filter().first()
file.transform

And which user?

In [None]:
file.created_by

Which transforms were created by a given user?

In [None]:
users = ln.User.lookup()

In [None]:
ln.Transform.filter(created_by=users.testuser2).df()

Which notebooks were created by a given user?

In [None]:
ln.Transform.filter(created_by=users.testuser2, type="notebook").df()

We can also view all recent additions to the entire database:

In [None]:
ln.view()

In [None]:
!lamin login testuser1
!lamin delete --force mydata
!rm -r ./mydata