# Track data lineage / provenance

Tracking data lineage tells you where your files and datasets came from.

In this guide, you'll learn how to backtrace file transformations through notebooks, pipelines, apps, and users.

In [None]:
# initialize a test instance for this notebook
# this should be called *before* importing lamindb in Python
# if you'd like to load or init an instance after, use the Python API: ln.setup.init(...)
!lamin init --storage ./mydata

In [None]:
import lamindb as ln

ln.settings.verbosity = 3  # show hints

In [None]:
# let's simulate an instrument upload for the sake of this guide
transform = ln.Transform(name="Chromium 10x upload", type="pipeline")
transform.save()
ln.track(transform)
ln.dev.datasets.generate_cell_ranger_files(
    sample_name="sample_1",
    basedir=ln.settings.storage / "cellranger_run_001",
    output_only=False,
)
files = ln.File.from_dir(ln.settings.storage / "cellranger_run_001" / "fastq")
ln.save(files)

## Track pipelines

When working with a pipeline, we'll register it before running it.

Registration happens only once and can be done by anyone on your team.

In [None]:
ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline").save()

The user who runs the pipeline queries or searches for it: 

In [None]:
transform = ln.Transform.select(name="Cell Ranger", version="7.2.0").one()
# or search
# transform = ln.Transform.search("Cell Ranger", return_queryset=True).first()

And passes the record to {func}`~lamindb.track`:

In [None]:
ln.track(transform)

This creates a global run context:

In [None]:
ln.context.transform

In [None]:
ln.context.run

Let's stage a few files from an instrument upload:

In [None]:
files = ln.File.select(key__startswith="cellranger_run_001/fastq").all()
filepaths = [file.stage() for file in files]

Assume we processed them and obtained output files in two folders, "sample_1" and "sample_2". Here, we register the outputs in "sample_1":

In [None]:
out_files = ln.File.from_dir(ln.settings.storage / "cellranger_run_001" / "sample_1")
ln.save(out_files)

Each of these files now has transform and run records that are not `None`!

In [None]:
out_files[0].transform

In [None]:
out_files[0].run

## Track notebooks

Let's now track a notebook. In many editors you can simply call:

In [None]:
ln.track()

Let's load one of the files produced in the pipeline we ran before!

In [None]:
file = ln.File.select(key__contains="sample_1/filtered", suffix=".h5").one()

In [None]:
file.stage()

## Visualize data lineage

There are two simple ways to visualize data lineage.

### From a transform

The first way starts from a notebook:

In [None]:
transform = ln.Transform.search("Track data lineage", return_queryset=True).first()

Visualizing parent transforms and data is straightforward:

In [None]:
transform.parents.all()

In [None]:
transform.view_parents()

If you or another user re-runs a notebook, they'll immediately be informed about parents:

In [None]:
ln.track()

### From a file

In [None]:
file.view_lineage()
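
Conceptually, a lineage view walks backward from a file to the transform that produced it, then to that transform's inputs, and so on. The sketch below illustrates this idea in plain Python. It is not how lamindb implements `view_lineage()`: the dictionaries, the file keys, and the `backtrace` helper are all hypothetical and only mirror the names used in this guide.

```python
from typing import Dict, List

# Hypothetical records: which transform produced each file,
# and which files each transform consumed as inputs.
produced_by: Dict[str, str] = {
    "fastq/sample_1_R1.fastq.gz": "Chromium 10x upload",
    "sample_1/filtered.h5": "Cell Ranger",
}
inputs_of: Dict[str, List[str]] = {
    "Chromium 10x upload": [],
    "Cell Ranger": ["fastq/sample_1_R1.fastq.gz"],
}


def backtrace(file_key: str) -> List[str]:
    """Return 'file <- transform' steps upstream of `file_key`, nearest first."""
    lineage: List[str] = []
    seen = set()
    stack = [file_key]
    while stack:
        key = stack.pop()
        if key in seen:
            continue  # guard against visiting a file twice
        seen.add(key)
        transform = produced_by.get(key)
        if transform is None:
            continue  # no recorded provenance for this file
        lineage.append(f"{key} <- {transform}")
        stack.extend(inputs_of.get(transform, []))
    return lineage


lineage = backtrace("sample_1/filtered.h5")
```

In this toy example, `lineage` lists the Cell Ranger step first and the instrument upload second.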

In [None]:
assert len(file.input_of.all()) > 0
assert len(ln.context.transform.parents.all()) > 0

## Understand runs

Under the hood, we already tracked pipeline and notebook runs through the global context: `ln.context.run`.

You can see this most easily by looking at the `File.run` attribute (in addition to `File.transform`).

{class}`~lamindb.File` objects are the `inputs` and `outputs` of such runs. 

Sometimes, we don't want to create a global run context but manually pass a run when creating a file:
```
ln.File(filepath, run=ln.Run(transform=transform))
```

When you access a file (_staging_, _loading_, etc.), two things happen:

1. The current run gets added to `file.input_of` of the file that is accessed from the transform
2. The transform of that file gets linked as a parent to the current transform
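
The two effects listed above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not lamindb's implementation: the `Transform`, `Run`, and `File` dataclasses and the `access` helper are hypothetical stand-ins that only borrow the attribute names `input_of` and `parents` from this guide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare records by identity, like distinct database rows
class Transform:
    name: str
    parents: List["Transform"] = field(default_factory=list)


@dataclass(eq=False)
class Run:
    transform: Transform


@dataclass(eq=False)
class File:
    key: str
    transform: Optional[Transform] = None  # transform that created the file
    run: Optional[Run] = None  # run that created the file
    input_of: List[Run] = field(default_factory=list)  # runs that consumed the file


def access(file: File, current_run: Run) -> None:
    """Record that `current_run` used `file` as an input (hypothetical helper)."""
    # 1. The current run is added to file.input_of.
    if current_run not in file.input_of:
        file.input_of.append(current_run)
    # 2. The file's transform becomes a parent of the current transform.
    parent = file.transform
    current = current_run.transform
    if parent is not None and parent is not current and parent not in current.parents:
        current.parents.append(parent)


pipeline = Transform("Cell Ranger")
pipeline_run = Run(pipeline)
h5 = File("sample_1/filtered.h5", transform=pipeline, run=pipeline_run)

notebook = Transform("Track data lineage")
notebook_run = Run(notebook)
access(h5, notebook_run)
access(h5, notebook_run)  # accessing the file again does not duplicate links
```

After these calls, the notebook run appears once in `h5.input_of`, and the pipeline transform appears once in `notebook.parents`.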

Once you call `ln.track()`, run outputs are _automatically_ tracked as data sources. You can still switch off auto-tracking of run inputs by setting `ln.settings.track_run_inputs = False`.

You can also track run inputs on a case-by-case basis via `is_run_input=True`, for example:
```
file.load(is_run_input=True)
```
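
One way to picture how the global setting and the per-call argument interact is a simple precedence rule: an explicit `is_run_input` wins, otherwise the global default applies. The sketch below assumes this precedence. The `Settings` class and the `should_track_input` helper are hypothetical and not part of lamindb's API.

```python
from typing import Optional


class Settings:
    # Hypothetical stand-in that mirrors the idea of ln.settings.track_run_inputs.
    track_run_inputs: bool = True


settings = Settings()


def should_track_input(is_run_input: Optional[bool] = None) -> bool:
    """An explicit per-call choice wins; otherwise use the global setting."""
    if is_run_input is not None:
        return is_run_input
    return settings.track_run_inputs


default_on = should_track_input()  # global default applies
settings.track_run_inputs = False  # switch off auto-tracking of inputs
default_off = should_track_input()  # the global setting now says "don't track"
forced_on = should_track_input(is_run_input=True)  # per-call override
```

With auto-tracking switched off, only accesses that explicitly pass `is_run_input=True` count as run inputs in this model.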

## Query by provenance

We can query or search for the notebook that created the file:

In [None]:
transform = ln.Transform.search("Track data lineage", return_queryset=True).first()

And then find all the files created by that notebook:

In [None]:
ln.File.select(transform=transform).df()

We now see two transform records in the `Transform` registry:

In [None]:
ln.view()

Which transform ingested a given file?

In [None]:
file = ln.File.select().first()
file.transform

And which user?

In [None]:
file.created_by

Which transforms were created by a given user?

In [None]:
users = ln.User.lookup(field="handle")

In [None]:
ln.Transform.select(created_by=users.testuser1).df()

Which notebooks were created by a given user?

In [None]:
ln.Transform.select(created_by=users.testuser1, type="notebook").df()

In [None]:
!lamin delete mydata
!rm -r ./mydata