# Track data lineage across pipelines, app uploads & notebooks

LaminDB makes it easy to know where your files & datasets came from.

In this guide, you'll learn how to backtrace file transformations through notebooks, pipelines & app uploads.

In [None]:
# initialize a test instance for this notebook
# this should be run before importing lamindb in Python
!lamin login testuser1
!lamin init --storage ./mydata
!lamin login testuser2
!lamin load testuser1/mydata

In [None]:
import lamindb as ln

ln.settings.verbosity = 3  # show hints

In [None]:
# To make the example of this guide richer, let's create data registered in uploads and pipeline runs by testuser1:
bfx_run_output = ln.dev.datasets.generate_cell_ranger_files(
    "schmidt22_perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.setup.login("testuser1")
transform = ln.Transform(name="Chromium 10x upload", type="pipeline")
ln.track(transform)
file1 = ln.File(bfx_run_output.parent / "fastq/schmidt22_perturbseq_R1_001.fastq.gz")
file1.save()
file2 = ln.File(bfx_run_output.parent / "fastq/schmidt22_perturbseq_R2_001.fastq.gz")
file2.save()
# now log in as testuser2 to start the guide
ln.setup.login("testuser2")

## Track a bioinformatics pipeline

When working with a pipeline, we'll register it before running it.

This only happens once and could be done by anyone on your team.

In [None]:
ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline").save()

A user then queries or searches for the pipeline:

In [None]:
transform = ln.Transform.select(name="Cell Ranger", version="7.2.0").one()

And passes the record to {func}`~lamindb.track` to create a global run context ({class}`lamindb.context`):

In [None]:
ln.track(transform)

Let's stage a few files from an instrument upload:

In [None]:
files = ln.File.select(key__startswith="fastq/schmidt22_perturbseq_").all()
filepaths = [file.stage() for file in files]

Assume we processed them and obtained 3 output files in a folder 'filtered_feature_bc_matrix':

In [None]:
ln.File.tree("schmidt22_perturbseq/filtered_feature_bc_matrix/")

In [None]:
out_files = ln.File.from_dir(
    "./mydata/schmidt22_perturbseq/filtered_feature_bc_matrix/"
)
ln.save(out_files)

Each of these files now has transform and run records:

In [None]:
out_files[0].transform

In [None]:
out_files[0].run

Let's look at the data lineage at this stage:

In [None]:
out_files[0].view_lineage()

Let's keep running the Cell Ranger pipeline in the background:

In [None]:
# continue with more processing steps on the Cell Ranger output data
transform = ln.Transform(
    name="Preprocess Cell Ranger outputs", version="2.0", type="pipeline"
)
ln.track(transform)

[f.stage() for f in out_files]
filepath = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
file = ln.File(filepath, description="schmidt22_perturbseq counts")
file.save()

## Track app upload & analytics

The hidden cell below simulates additional analytic steps including:

* uploading phenotypic screen data
* scRNA-seq analysis
* analyses of the integrated datasets

In [None]:
# app upload
ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)

# upload and analyze the GWS data
filepath = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
file = ln.File(filepath, description="Raw data of schmidt22 crispra GWS")
file.save()
ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRISPRa analysis", type="notebook")
ln.track(transform)

file_wgs = ln.File.select(key="schmidt22-crispra-gws-IFNG.csv").one()
df = file_wgs.load().set_index("id")
hits_df = df[df["pos|fdr"] < 0.01].copy()
file_hits = ln.File(hits_df, description="hits from schmidt22 crispra GWS")
file_hits.save()

Let's see how the data lineage of this looks:

In [None]:
file = ln.File.select(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()

## Track notebooks

In the background, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:

In [None]:
# Let us add analytics on top of the cell ranger pipeline and the phenotypic screening
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

file_ps = ln.File.select(key="schmidt22_perturbseq.h5ad").one()
adata = file_ps.load()
screen_hits = file_hits.load()
import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()

The outcome is a few figures stored as image files. Let's query one of them and look at its data lineage:

In [None]:
file = ln.File.select(key__contains="figures/matrixplot").one()
file.view_lineage()

We'd now like to track the current Jupyter notebook to continue the work:

In [None]:
ln.track()

Let's load the image file:

In [None]:
file.stage()

We see that the image file is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

In [None]:
file.view_lineage()

We can also look at just the sequence of transforms:

In [None]:
transform = ln.Transform.search("Track data lineage", return_queryset=True).first()

In [None]:
transform.parents.df()

In [None]:
transform.view_parents()
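
To make the idea of "parents" concrete, here is a minimal, LaminDB-free sketch of walking a transform graph upstream. The `PARENTS` dictionary is a hand-written approximation of the graph built in this guide (the two notebook names are shortened), and `ancestors` is a hypothetical helper, not a LaminDB function:

```python
from collections import deque

# Hand-written approximation of the transform graph from this guide
# (child -> direct parents). It is NOT read from a LaminDB instance.
PARENTS = {
    "lineage notebook": ["single cell analysis"],
    "single cell analysis": [
        "Preprocess Cell Ranger outputs",
        "GWS CRISPRa analysis",
    ],
    "Preprocess Cell Ranger outputs": ["Cell Ranger"],
    "Cell Ranger": ["Chromium 10x upload"],
    "GWS CRISPRa analysis": ["Upload GWS CRISPRa result"],
}


def ancestors(name, parents=PARENTS):
    """Return all upstream transforms of `name` in breadth-first order."""
    seen = []
    queue = deque(parents.get(name, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another branch
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen


print(ancestors("Cell Ranger"))  # ['Chromium 10x upload']
print(len(ancestors("lineage notebook")))  # 6
```

Direct parents correspond to what `transform.parents` lists; the full upstream set is roughly what the lineage graph draws.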

And if you or another user re-runs a notebook, they'll be informed about parents in the logging:

In [None]:
ln.track()

## Understand runs

Under the hood, we already tracked pipeline and notebook runs through the global context: `context.run`.

You can see this most easily by looking at the `File.run` attribute (in addition to `File.transform`).

{class}`~lamindb.File` objects are the `inputs` and `outputs` of such runs. 

Sometimes, we don't want to create a global run context but manually pass a run when creating a file:
```
ln.File(filepath, run=ln.Run(transform=transform))
```

When you access a file (_staging_, _loading_, etc.), two things happen:

1. The current run gets added to `file.input_of` of the file that is accessed from the transform
2. The transform of that file gets linked as a parent of the current transform
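
The two steps above can be pictured with a small toy model. This is a sketch in plain Python, not LaminDB's implementation: the `Transform`, `Run`, and `File` dataclasses and the `access` function are hypothetical stand-ins that mirror only the attribute names used in this guide.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity, like database records
class Transform:
    name: str
    parents: list = field(default_factory=list)


@dataclass(eq=False)
class Run:
    transform: Transform


@dataclass(eq=False)
class File:
    key: str
    transform: Transform  # the transform that created the file
    input_of: list = field(default_factory=list)  # runs that consumed it


def access(file: File, current_run: Run) -> None:
    """Mimic the bookkeeping described above for staging or loading a file."""
    # 1. register the current run as a consumer of the file
    if current_run not in file.input_of:
        file.input_of.append(current_run)
    # 2. link the file's transform as a parent of the current transform
    current = current_run.transform
    if file.transform is not current and file.transform not in current.parents:
        current.parents.append(file.transform)


upload = Transform("Chromium 10x upload")
cell_ranger = Transform("Cell Ranger")
fastq = File("fastq/schmidt22_perturbseq_R1_001.fastq.gz", transform=upload)

run = Run(cell_ranger)
access(fastq, run)

print([r.transform.name for r in fastq.input_of])  # ['Cell Ranger']
print([t.name for t in cell_ranger.parents])  # ['Chromium 10x upload']
```

Accessing the same file twice in one run leaves both lists unchanged in this sketch, which is the behavior you would want from any such bookkeeping.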

While run outputs are _automatically_ tracked as data sources once you call `ln.track()`, you can switch off auto-tracking of run inputs by setting `ln.settings.track_run_inputs = False`.

You can also track run inputs on a case-by-case basis via `is_run_input=True`, e.g.:
```
file.load(is_run_input=True)
```
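
How the global setting and the per-call argument combine can be sketched as follows. This is a toy model that assumes an explicit `is_run_input` takes precedence over `track_run_inputs`; the `Settings` class and `should_track_input` function are hypothetical and only illustrate that assumed precedence, not LaminDB's actual code.

```python
from typing import Optional


class Settings:
    """Hypothetical stand-in for the global settings object."""

    track_run_inputs: bool = True


settings = Settings()


def should_track_input(is_run_input: Optional[bool] = None) -> bool:
    """Decide whether an accessed file is recorded as a run input."""
    if is_run_input is None:
        # no per-call choice: fall back to the global setting
        return settings.track_run_inputs
    # assumed: an explicit per-call choice wins over the global setting
    return is_run_input


settings.track_run_inputs = False
print(should_track_input())  # False
print(should_track_input(is_run_input=True))  # True
```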

## Query by provenance

We can query or search for the notebook that created the file:

In [None]:
transform = ln.Transform.search("Track data lineage", return_queryset=True).first()

And then find all the files created by that notebook:

In [None]:
ln.File.select(transform=transform).df()

Which transform ingested a given file?

In [None]:
file = ln.File.select().first()
file.transform

And which user?

In [None]:
file.created_by

Which transforms were created by a given user?

In [None]:
users = ln.User.lookup(field="handle")

In [None]:
ln.Transform.select(created_by=users.testuser1).df()

Which notebooks were created by a given user?

In [None]:
ln.Transform.select(created_by=users.testuser1, type="notebook").df()

And of course, we can also view all recent additions to the entire database:

In [None]:
ln.view()

In [None]:
!lamin login testuser1
!lamin delete mydata
!rm -r ./mydata