# Track data lineage

Reference: {class}`~lamindb.Transform`, {class}`~lamindb.Run`

## What is a `Run`?

{class}`~lamindb.File` objects are transformed by {class}`~lamindb.Run` objects: they are the {attr}`~lamindb.Run.inputs` and {attr}`~lamindb.Run.outputs` of runs.

For each `File` object, you can access the generating {class}`~lamindb.Run` and {class}`~lamindb.Transform` objects via {attr}`~lamindb.File.run` and {attr}`~lamindb.File.transform`!
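
To make these relationships concrete, here is a minimal plain-Python sketch of how a file points to its generating run and, through it, to the transform. It is a toy model, not lamindb's implementation: `ToyTransform`, `ToyRun`, `ToyFile`, and `emit` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyTransform:
    name: str
    type: str  # e.g. "notebook" or "pipeline"


@dataclass(eq=False)
class ToyRun:
    transform: ToyTransform
    inputs: List["ToyFile"] = field(default_factory=list)
    outputs: List["ToyFile"] = field(default_factory=list)


@dataclass(eq=False)
class ToyFile:
    name: str
    run: Optional[ToyRun] = None  # the run that generated this file

    @property
    def transform(self) -> Optional[ToyTransform]:
        # The transform is reached through the generating run.
        return self.run.transform if self.run else None


def emit(run: ToyRun, name: str) -> ToyFile:
    """Create a file and record it as an output of `run` (hypothetical helper)."""
    file = ToyFile(name=name, run=run)
    run.outputs.append(file)
    return file


run = ToyRun(transform=ToyTransform(name="ingest-notebook", type="notebook"))
iris = emit(run, "iris_new")
print(iris.transform.name)
```

In this sketch, `iris.run` and `iris.transform` play the role that {attr}`~lamindb.File.run` and {attr}`~lamindb.File.transform` play for a real `File`.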

In [None]:
import lamindb as ln

## Notebook run

Notebook metadata is detected automatically, and `ln.Run` assumes `global_context=True`. We therefore don't need to keep track of the run record ourselves; we can access it via `ln.context`:

In [None]:
ln.track()

In [None]:
ln.context.run

Let's query which transform ingested the `File` named "iris_new":

In [None]:
ln.select(ln.Transform).join(ln.Run).join(ln.File, name="iris_new").first()

Alternatively, you can query the file itself and follow its `run` to the transform:

In [None]:
with ln.Session() as ss:
    file = ss.select(ln.File, name="iris_new").one()
    print(file.run.transform)

## Pipeline run

In [None]:
filepath = ln.dev.datasets.file_fastq()

When working with a pipeline, we'll register it before running it.

In [None]:
transform = ln.Transform(name="10x scRNA-seq nextseq", type="pipeline")

In [None]:
ln.track(transform)

We then call {func}`~lamindb.track` as before, this time passing the transform (if no pipeline with the correct name is registered, we'll be asked to register one). Files created afterwards are linked to the tracked run:

In [None]:
file_fastq = ln.File(filepath)

In [None]:
ln.add(file_fastq)

We can also manually pass a run and not use the global run context (`ln.context`) set by `ln.track`:
```
run = ln.Run(transform=transform, name="ingest-fastq")
ln.File(filepath, run=run)
```
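
The difference between relying on the global context and passing a run explicitly can be sketched in plain Python. This is a toy model under stated assumptions, not lamindb's code: `ToyContext`, `ToyRun`, `ToyFile`, and this `track` function are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class ToyRun:
    name: str


class ToyContext:
    """Holds the run that files fall back to when none is passed."""

    run: Optional[ToyRun] = None


context = ToyContext()


def track(name: str) -> ToyRun:
    """Start a run and store it as the global default (hypothetical)."""
    context.run = ToyRun(name)
    return context.run


@dataclass(eq=False)
class ToyFile:
    path: str
    run: Optional[ToyRun] = None

    def __post_init__(self) -> None:
        # No explicit run given: fall back to the global context.
        if self.run is None:
            self.run = context.run


track("global-run")
implicit = ToyFile("a.fastq.gz")
explicit = ToyFile("b.fastq.gz", run=ToyRun("ingest-fastq"))
print(implicit.run.name, explicit.run.name)
```

An explicitly passed run takes precedence here, so several runs can coexist in one session without overwriting the global default.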

## Track run inputs

While run outputs are automatically tracked as data sources, run inputs aren't.

However, you can track a `File` as a run input by passing `is_run_input=True` when loading it.

Let's register a downstream pipeline:

In [None]:
ln.track(ln.Transform(name="Cell Ranger", version="7", type="pipeline"))

Let's query the input data for this pipeline, a FASTQ file.

To process it in the pipeline, we need to `load()` it (download it from the cloud and access its on-disk or in-memory representation).

To track it as an input of the current run, set `is_run_input=True`.

In [None]:
with ln.Session() as ss:
    file_fastq = ss.select(ln.File, name="input.fastq.gz").one()
    file_fastq.load(is_run_input=True)

In [None]:
file_fastq.input_of

In [None]:
ln.add(file_fastq)
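
What `is_run_input=True` records can be pictured with a small plain-Python sketch: loading with the flag set links the file and the current run in both directions. This is a toy model, not lamindb's implementation; `ToyContext`, `ToyRun`, and `ToyFile` are hypothetical names, and the `load` method only pretends to read data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyRun:
    name: str
    inputs: List["ToyFile"] = field(default_factory=list)


class ToyContext:
    """Stand-in for a global run context; `run` is the current run."""

    run: Optional[ToyRun] = None


@dataclass(eq=False)
class ToyFile:
    name: str
    input_of: List[ToyRun] = field(default_factory=list)

    def load(self, is_run_input: bool = False) -> str:
        """Pretend to load the file; optionally record it as a run input."""
        if is_run_input:
            if ToyContext.run is None:
                raise RuntimeError("no run is being tracked")
            # Link both directions: run -> inputs and file -> input_of.
            ToyContext.run.inputs.append(self)
            self.input_of.append(ToyContext.run)
        return f"<contents of {self.name}>"


ToyContext.run = ToyRun("Cell Ranger")
fastq = ToyFile("input.fastq.gz")
fastq.load(is_run_input=True)
print([run.name for run in fastq.input_of])
```

Without the flag, `load` leaves both lists untouched, which mirrors the statement above that run inputs are not tracked automatically.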

In [None]:
output_filepath = ln.dev.datasets.file_bam()

In [None]:
output_filepath

In [None]:
file = ln.File(output_filepath)

ln.add(file)

## Data lineage

Now let's trace which files the `output.bam` file was generated from, that is, the inputs of the run that produced `output.bam`:

In [None]:
with ln.Session() as ss:
    run = ss.select(ln.Run).join(ln.File, name="output.bam").one()
    assert run.inputs[0].name == "input.fastq.gz"
    print(run.inputs)
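
The query above goes one step upstream. Following the same links repeatedly yields the full lineage of a file. Here is a minimal plain-Python sketch of that traversal, assuming each file knows its generating run and each run knows its inputs; `ToyRun`, `ToyFile`, and `upstream` are hypothetical names and not part of lamindb.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyRun:
    name: str
    inputs: List["ToyFile"] = field(default_factory=list)


@dataclass(eq=False)
class ToyFile:
    name: str
    run: Optional[ToyRun] = None  # run that generated the file


def upstream(file: ToyFile) -> List[str]:
    """Return names of all files that `file` was derived from, nearest first."""
    names: List[str] = []
    seen = set()
    frontier = [file]
    while frontier:
        current = frontier.pop(0)
        if current.run is None:
            continue  # a source file with no recorded generating run
        for parent in current.run.inputs:
            if id(parent) not in seen:  # visit each file only once
                seen.add(id(parent))
                names.append(parent.name)
                frontier.append(parent)
    return names


fastq = ToyFile("input.fastq.gz", run=ToyRun("10x scRNA-seq nextseq"))
bam = ToyFile("output.bam", run=ToyRun("Cell Ranger", inputs=[fastq]))
print(upstream(bam))
```

The breadth-first order lists direct inputs before more distant ancestors, and the `seen` set keeps a file that feeds several runs from being reported twice.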