# Track data lineage

Tracking data lineage tells you where your data came from.

Knowing where a file came from means tracing its transformations back through notebooks, pipelines, apps, and users.

Let's see how LaminDB helps with this!

In [None]:
# initialize a test instance for this notebook
# this needs to be called *before* importing lamindb in Python
# if you'd like to load or init an instance after, use the Python API: ln.setup.init(...)
!lamin init --storage ./mydata

In [None]:
import lamindb as ln

ln.settings.verbosity = 3  # show hints

## Notebooks

Let us first track the data source.

Here, it's a Jupyter notebook, so we can run:

In [None]:
ln.track()

This creates a global run context, which you can access via:

In [None]:
ln.context.transform

In [None]:
ln.context.run

Let's now register a file in LaminDB:

In [None]:
filepath = ln.dev.datasets.file_fastq()
file = ln.File(filepath, key="fastqs/input.fastq.gz")
file.save()

The file now has transform and run records that are not `None`:

In [None]:
file.transform

In [None]:
file.run

So, whenever we use this file, we'll know where it came from! ✅

Conversely, we can query or search for the notebook that created the file:

In [None]:
transform = ln.Transform.search("Track data lineage", top_hit=True)

And then find all the files created by that notebook:

In [None]:
ln.File.select(transform=transform).df()
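
To illustrate why a file created after `ln.track()` carries a transform and a run, here is a minimal pure-Python sketch of a global run context. It is a toy model under stated assumptions, not LaminDB's implementation: `Context`, `track`, and `ToyFile` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Context:
    """Toy stand-in for a global run context (hypothetical, not LaminDB's)."""

    transform: Optional[str] = None
    run: Optional[int] = None


context = Context()
_run_counter = 0


def track(transform_name: str) -> None:
    """Set the global context: remember the transform and start a new run."""
    global _run_counter
    _run_counter += 1
    context.transform = transform_name
    context.run = _run_counter


@dataclass
class ToyFile:
    key: str
    # Provenance fields are read from the global context at creation time.
    transform: Optional[str] = field(default_factory=lambda: context.transform)
    run: Optional[int] = field(default_factory=lambda: context.run)


untracked = ToyFile("fastqs/early.fastq.gz")  # created before track()
track("Track data lineage")
tracked = ToyFile("fastqs/input.fastq.gz")  # created after track()

print(untracked.transform, untracked.run)
print(tracked.transform, tracked.run)
```

In this sketch, only the file created after `track()` picks up provenance, which mirrors the behavior shown in the cells above.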

## Pipelines

When working with a pipeline, we'll register it before running it.

In [None]:
# below, `type` defaults to `"pipeline"` outside of an ipython environment
ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline").save()

Before running the pipeline, query for it:

In [None]:
transform = ln.Transform.select(name="Cell Ranger", version="7.2.0").one()

And pass the record to {func}`~lamindb.track`, as before:

In [None]:
ln.track(transform)

## Runs

All of this is great already! But why do we need {class}`~lamindb.Run` then?

{class}`~lamindb.File` objects are the `inputs` and `outputs` of runs!

This allows tracking which pipeline or notebook run a file came from.

We can also pass a run manually instead of using the global run context set by `ln.track`:

```python
# create a run record for the given transform
run = ln.Run(transform=transform, name="ingest-fastq")
# link the file to this run explicitly
ln.File(filepath, run=run)
```
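
To make the relationship between files and runs concrete, the following pure-Python sketch models files as inputs and outputs of runs and walks the chain backwards. It is a toy model, not LaminDB's data model or API: `ToyRun`, `ToyFile`, and `trace_lineage` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyRun:
    transform: str  # name of the notebook or pipeline that was run
    inputs: List["ToyFile"] = field(default_factory=list)


@dataclass
class ToyFile:
    key: str
    run: Optional[ToyRun] = None  # the run that produced this file


def trace_lineage(file: ToyFile) -> List[str]:
    """Walk back from a file through the runs that produced it and its inputs."""
    steps = []
    stack = [file]
    seen = set()
    while stack:
        current = stack.pop()
        # Skip files already visited and files without a known producing run.
        if id(current) in seen or current.run is None:
            continue
        seen.add(id(current))
        steps.append(f"{current.key} <- {current.run.transform}")
        stack.extend(current.run.inputs)
    return steps


# A notebook run produced the fastq; a pipeline run consumed it and produced the bam.
fastq = ToyFile("fastqs/input.fastq.gz", run=ToyRun("Track data lineage"))
bam = ToyFile("bams/output.bam", run=ToyRun("Cell Ranger", inputs=[fastq]))

for step in trace_lineage(bam):
    print(step)
```

Because every file points to the run that produced it and every run lists its inputs, following these two links repeatedly recovers the full history of a file.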

## Track run inputs

While run outputs are _automatically_ tracked as data sources, run inputs aren't, unless you set `ln.settings.track_run_inputs = True`.

To track a single file as an input of the current run, pass `is_run_input=True` when staging it.

In [None]:
file_fastq = ln.File.select(key="fastqs/input.fastq.gz").one()
file_fastq.stage(is_run_input=True)
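
The interplay between the global setting and the per-call flag can be sketched as follows. This is a toy model only: the fallback rule (an explicit flag wins, otherwise the global setting applies) is an assumption made for illustration and is not taken from LaminDB's source, and `should_track_as_input` is a hypothetical helper.

```python
from typing import Optional

# Hypothetical global setting, playing the role of `ln.settings.track_run_inputs`.
track_run_inputs = False


def should_track_as_input(is_run_input: Optional[bool] = None) -> bool:
    """Decide whether a file access is recorded as a run input.

    Assumed rule: an explicit per-call flag wins; otherwise use the global setting.
    """
    if is_run_input is not None:
        return is_run_input
    return track_run_inputs


print(should_track_as_input())  # no flag given: falls back to the global setting
print(should_track_as_input(is_run_input=True))  # explicit opt-in for one file
```

Under this assumption, leaving the global setting off and opting in per file keeps input tracking explicit, which is the pattern used in the cell above.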

Let's get an example output filepath:

In [None]:
output_filepath = ln.dev.datasets.file_bam()

In [None]:
output_filepath

Because a run context is set, the data source is linked automatically and we can simply save the file:

In [None]:
ln.File(output_filepath, key="bams/output.bam").save()

If we query the file later on:

In [None]:
file = ln.File.select(key="bams/output.bam").one()

We'll have access to the run that produced it:

In [None]:
run = file.run

run

And from this run object, we see which input files were used:

In [None]:
run.inputs.all()

And which other outputs were produced:

In [None]:
run.outputs.all()

In [None]:
assert run.inputs.all()[0].key == "fastqs/input.fastq.gz"

## Query by provenance

Which notebook ingested a given file?

In [None]:
file = ln.File.select(key="fastqs/input.fastq.gz").first()
file.transform

Which transforms were created by a given user?

In [None]:
users = ln.User.lookup(field="handle")

In [None]:
ln.Transform.select(created_by=users.testuser1).df()

Which notebooks were created by a given user?

In [None]:
ln.Transform.select(created_by=users.testuser1, type="notebook").df()

Which user ingested the file `input.fastq.gz`?

In [None]:
ln.File.select(key="fastqs/input.fastq.gz").one().created_by

In [None]:
!lamin delete mydata
!rm -r ./mydata