# Track data lineage: `Pipeline`, `Notebook`, `Run`


## What is a `Run`?

Instances of {class}`~lamindb.DObject` are transformed by instances of {class}`~lamindb.Run`: they are the {attr}`~lamindb.Run.inputs` and {attr}`~lamindb.Run.outputs` of runs.

Conversely, the {attr}`~lamindb.DObject.source` of a {class}`~lamindb.DObject` is always a run: the run that produced it as an output.

In the default schema, a `Run` can be created from a `Notebook` or a `Pipeline`.
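The relationship between runs and data objects can be pictured with a toy model. The sketch below uses plain dataclasses; `ToyRun` and `ToyDObject` are hypothetical stand-ins rather than LaminDB classes, and they only mirror the `inputs`, `outputs`, and `source` attributes named above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class ToyRun:
    """Hypothetical stand-in for a run; not the LaminDB class."""

    name: str
    inputs: list[ToyDObject] = field(default_factory=list)
    outputs: list[ToyDObject] = field(default_factory=list)


@dataclass(eq=False)
class ToyDObject:
    """Hypothetical stand-in for a data object; not the LaminDB class."""

    name: str
    source: ToyRun  # the run that produced this data object

    def __post_init__(self) -> None:
        # Creating a data object registers it as an output of its source run.
        self.source.outputs.append(self)


ingest = ToyRun(name="ingest")
raw = ToyDObject(name="raw", source=ingest)

process = ToyRun(name="process")
process.inputs.append(raw)  # in this toy model, inputs are linked explicitly
clean = ToyDObject(name="clean", source=process)
```

In this toy model, `raw` is an output of `ingest` and an input of `process`, while `clean.source` points back to `process`.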

In [None]:
import lamindb as ln
import lamindb.schema as lns

## Notebook run

The metadata of Jupyter notebooks is automatically detected and `ln.Run` assumes `global_context=True`: we don't need to keep track of the run record ourselves, but can access it via `ln.context`:

In [None]:
ln.Run()

In [None]:
ln.context.run
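The idea of a global run context can be illustrated with a small, self-contained sketch. It assumes nothing about how LaminDB implements `ln.context`; `ToyContext`, `ToyRun`, and `start_run` are hypothetical names used only to show the pattern of storing the active run on a shared object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRun:
    """Hypothetical stand-in for a run record."""

    name: str


class ToyContext:
    """Holds the run that is currently active in this Python process."""

    run: Optional[ToyRun] = None


context = ToyContext()


def start_run(name: str, global_context: bool = True) -> ToyRun:
    """Create a run and, if requested, store it on the shared context."""
    run = ToyRun(name=name)
    if global_context:
        context.run = run
    return run


start_run("my-notebook")  # no need to keep the return value around
active = context.run  # the run is available from the shared context

# With global_context=False the caller has to keep track of the run itself.
detached = start_run("side-run", global_context=False)
```

The trade-off is convenience against explicitness: a shared context saves passing the run around, but only one run can be active at a time.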

Let us query for the notebook in which the `DObject` named "iris_new" was ingested:

In [None]:
ln.select(lns.Notebook).join(lns.Run).join(ln.DObject, name="iris_new").first()

Alternatively, you can query for the `DObject` itself and access the notebook through its source run:

In [None]:
with ln.Session() as ss:
    dobject = ss.select(ln.DObject, name="iris_new").one()
    print(dobject.source.notebook)

## Pipeline run

In [None]:
filepath = ln.dev.datasets.file_fastq()

When working with a pipeline, we'll register it before running it.

In [None]:
pipeline = ln.add(lns.Pipeline(v="1", name="10x scRNA-seq nextseq"))

pipeline

We can then use {class}`~lamindb.context` as before. If no pipeline with the given name has been registered, we'll be asked to register one:

In [None]:
ln.Run(pipeline_name="10x scRNA-seq nextseq", global_context=True)
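Why registration has to come first can be seen in a minimal sketch of a name-based lookup. This is an illustration under stated assumptions, not LaminDB's behavior: `ToyPipeline`, `registry`, `register`, and `pipeline_for_run` are hypothetical, and the sketch raises an error where the real library is described as asking the user to register the pipeline.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ToyPipeline:
    """Hypothetical stand-in for a registered pipeline."""

    name: str
    v: str


# In-memory registry keyed by pipeline name (an assumption of this sketch).
registry: Dict[str, ToyPipeline] = {}


def register(pipeline: ToyPipeline) -> ToyPipeline:
    """Store the pipeline so that later runs can refer to it by name."""
    registry[pipeline.name] = pipeline
    return pipeline


def pipeline_for_run(pipeline_name: str) -> ToyPipeline:
    """Resolve a pipeline by name, failing loudly if it was never registered."""
    try:
        return registry[pipeline_name]
    except KeyError:
        raise LookupError(
            f"Pipeline {pipeline_name!r} is not registered; register it first"
        ) from None


register(ToyPipeline(name="10x scRNA-seq nextseq", v="1"))
found = pipeline_for_run("10x scRNA-seq nextseq")
```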

In [None]:
dobject_fastq = ln.DObject(filepath)

In [None]:
ln.add(dobject_fastq)

We can also pass a run manually:

```python
run = lns.Run(pipeline=pipeline, name="ingest-fastq")
ln.DObject(filepath, source=run)
```

## Track run inputs

While run outputs are automatically tracked as data sources, run inputs aren't.

However, you can track them by passing `is_run_input=True` when loading a `DObject`.

Let's register a downstream pipeline:

In [None]:
pipeline = ln.add(lns.Pipeline(name="Cell Ranger", v="7"))

And a run context for it:

In [None]:
ln.Run(pipeline_name="Cell Ranger", global_context=True)

Let's query input data for this pipeline, a fastq.

To process it in the pipeline, we need to `load()` it, that is, download it from the cloud and access its on-disk or in-memory representation.

To track it as an input for the current run, set `is_run_input=True`.

In [None]:
with ln.Session() as ss:
    dobject_fastq = ss.select(ln.DObject, name="input").one()
    dobject_fastq.load(is_run_input=True)
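The effect of the flag can be sketched with a toy model. The classes below are hypothetical and do not reproduce LaminDB's implementation: the run is passed explicitly rather than taken from a global context, and `targets` is assumed to hold the runs that consume the object.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyRun:
    """Hypothetical stand-in for a run."""

    name: str
    inputs: List["ToyDObject"] = field(default_factory=list)


@dataclass(eq=False)
class ToyDObject:
    """Hypothetical stand-in for a data object."""

    name: str
    content: str
    # Runs that consumed this object (an assumption of this sketch).
    targets: List[ToyRun] = field(default_factory=list)

    def load(self, run: Optional[ToyRun] = None, is_run_input: bool = False) -> str:
        """Return the content; optionally record this object as an input of `run`."""
        if is_run_input:
            if run is None:
                raise ValueError("is_run_input=True needs a run to attach to")
            run.inputs.append(self)
            self.targets.append(run)
        return self.content


run = ToyRun(name="Cell Ranger")
untracked = ToyDObject(name="other", content="TTTT")
fastq = ToyDObject(name="input", content="ACGT")

untracked.load()  # plain load: nothing is recorded
data = fastq.load(run=run, is_run_input=True)  # recorded as an input of the run
```

Only the object loaded with `is_run_input=True` ends up in `run.inputs`; the plain `load()` leaves no trace.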

In [None]:
dobject_fastq.targets

In [None]:
ln.add(dobject_fastq)

In [None]:
output_filepath = ln.dev.datasets.file_bam()

In [None]:
output_filepath

In [None]:
dobject = ln.DObject(output_filepath)

ln.add(dobject)

## Data lineage

Now let's trace which files `output.bam` was generated from, that is, the inputs of the run that produced `output.bam`:

In [None]:
with ln.Session() as ss:
    run = ss.select(lns.Run).join(ln.DObject, name="output", suffix=".bam").one()
    assert run.inputs[0].name == "input"
    print(run.inputs)
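The same idea extends across several steps: following `source` and `inputs` repeatedly yields every upstream data object. The sketch below shows this traversal on a toy model; `ToyRun`, `ToyDObject`, `ancestors`, and the third `count-reads` step are hypothetical and not part of LaminDB.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class ToyRun:
    """Hypothetical stand-in for a run."""

    name: str
    inputs: List["ToyDObject"] = field(default_factory=list)


@dataclass(eq=False)
class ToyDObject:
    """Hypothetical stand-in for a data object."""

    name: str
    source: ToyRun


def ancestors(dobject: ToyDObject) -> List[ToyDObject]:
    """Collect all upstream data objects, nearest ancestors first (breadth-first)."""
    found: List[ToyDObject] = []
    queue = [dobject]
    while queue:
        current = queue.pop(0)
        # The inputs of the run that produced `current` are its direct parents.
        for parent in current.source.inputs:
            if parent not in found:
                found.append(parent)
                queue.append(parent)
    return found


ingest = ToyRun(name="ingest-fastq")
fastq = ToyDObject(name="input.fastq", source=ingest)

cell_ranger = ToyRun(name="Cell Ranger", inputs=[fastq])
bam = ToyDObject(name="output.bam", source=cell_ranger)

count_reads = ToyRun(name="count-reads", inputs=[bam])  # hypothetical third step
counts = ToyDObject(name="counts.csv", source=count_reads)

lineage = [d.name for d in ancestors(counts)]
```

Here `lineage` lists `output.bam` before `input.fastq`, because the traversal visits direct parents before more distant ancestors.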