# Track data from notebook & pipeline runs


## What is a `Run`?

{class}`lamindb.DObject` records represent atomic datasets in object storage: jointly measured observations of variables (features). Each data object is generated by running a data transformation, and each such run is an instance of {class}`lamindb.schema.Run`.

Examples of runs:

- Jupyter notebook runs
- Pipeline (workflow) runs
- Physical instruments making measurements
- Human decisions based on data visualizations

A run links to the data objects it consumed via {meth}`~lamindb.schema.Run.inputs` and to the data objects it generated via {meth}`~lamindb.schema.Run.outputs`.
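
To make this relationship concrete, here is a minimal, library-independent sketch using plain dataclasses. `ToyRun` and `ToyDObject` are hypothetical stand-ins, not lamindb classes, and the file names are made up for illustration. The sketch only mirrors the idea that every data object has one source run and that a run lists its inputs and outputs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyRun:
    """Hypothetical stand-in for a run: one execution of a data transformation."""

    name: str
    inputs: List["ToyDObject"] = field(default_factory=list, repr=False)
    outputs: List["ToyDObject"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class ToyDObject:
    """Hypothetical stand-in for a data object generated by exactly one run."""

    name: str
    source: Optional[ToyRun] = None

    def __post_init__(self):
        # Register this object as an output of the run that generated it.
        if self.source is not None:
            self.source.outputs.append(self)


# A first run ingests a raw file, a second run consumes it and emits a result.
ingest = ToyRun(name="ingest")
fastq = ToyDObject(name="input.fastq", source=ingest)
align = ToyRun(name="align", inputs=[fastq])
bam = ToyDObject(name="output.bam", source=align)

print(bam.source.name, [d.name for d in bam.source.inputs])
```

Following `source` and then `inputs` from any object leads one step upstream, which is the basis for the lineage queries later in this guide.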

## Jupyter notebooks

```{important}

`ln.nb.header()` tracks notebooks and takes care of the following:
1. Adds a `Notebook` record `notebook` to the database
2. Adds a `Run` record `run`, linked to `notebook`, to the database
3. Exposes `run` as `ln.nb.run`
4. Ensures that `ln.add(dobject)` sets `dobject.source = run`
5. Enables tracking run inputs via `ln.DObject.load(is_run_input=True)`

```
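
The pattern behind these five steps can be illustrated with a toy tracker. This is not how lamindb implements `ln.nb.header()`; every name below (`ToyTracker`, `ToyRun`, `ToyDObject`, the notebook and dataset names) is hypothetical. The sketch only assumes the general idea of a "current run" that save and load helpers consult to record provenance.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyRun:
    """Hypothetical run record tied to one notebook."""

    notebook: str
    inputs: List["ToyDObject"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class ToyDObject:
    """Hypothetical data object with a pointer to the run that generated it."""

    name: str
    source: Optional[ToyRun] = None


class ToyTracker:
    """Hypothetical tracker that holds the run of the current notebook."""

    def __init__(self):
        self.run: Optional[ToyRun] = None
        self.saved: List[ToyDObject] = []

    def header(self, notebook: str) -> ToyRun:
        # Steps 1-3: create a run for this notebook and expose it.
        self.run = ToyRun(notebook=notebook)
        return self.run

    def add(self, dobject: ToyDObject) -> ToyDObject:
        # Step 4: stamp the current run as source unless one was set explicitly.
        if dobject.source is None:
            dobject.source = self.run
        self.saved.append(dobject)
        return dobject

    def load(self, dobject: ToyDObject, is_run_input: bool = False) -> ToyDObject:
        # Step 5: optionally record the loaded object as an input of the run.
        if is_run_input and self.run is not None:
            self.run.inputs.append(dobject)
        return dobject


tracker = ToyTracker()
run = tracker.header("my-analysis")
result = tracker.add(ToyDObject(name="result_table"))
raw = tracker.load(ToyDObject(name="raw_table"), is_run_input=True)
```

The point of the design is that the user never passes the run around by hand: calling the header once is enough for later saves and loads to be attributed to the notebook.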

In [None]:
import lamindb as ln
import lamindb.schema as lns

ln.nb.header()

Let's find the notebook from which the data object named `iris_new` was ingested:

In [None]:
ln.select(lns.Notebook).join(lns.Run).join(ln.DObject, name="iris_new").one()

Alternatively, you can query for the run and access its `notebook` attribute:

```{admonition} What is ln.Session()?
:class: important

Why do we need a session here? Find out in our [Session guide](https://lamin.ai/docs/db/faq/session).

```

In [None]:
with ln.Session() as ss:
    source_run = ss.select(lns.Run).join(ln.DObject, name="iris_new").one()
    print(source_run.notebook)

## Ingest data from a pipeline run

### Ingest raw data

In [None]:
filepath = ln.dev.datasets.file_fastq()

filepath

Create a bioinformatics (BFX) pipeline:

In [None]:
pipeline = ln.add(lns.Pipeline(v="1", name="10x scRNA-seq nextseq"))

In [None]:
pipeline

And a pipeline run:

In [None]:
run = lns.Run(pipeline=pipeline, name="ingest-fastq")

In [None]:
run

We see the run points to the pipeline:

In [None]:
run.pipeline

Let us ingest data from this pipeline run.

In [None]:
dobject_fq = ln.DObject(filepath, source=run)

In [None]:
dobject_fq

In [None]:
dobject_fq.source

In [None]:
dobject_fq = ln.add(dobject_fq)

We can now select data objects by the run that generated them:

In [None]:
ln.select(ln.DObject).join(lns.Run, name="ingest-fastq").df()

### Ingest and track pipeline outputs

In [None]:
output_filepath = ln.dev.datasets.file_bam()

output_filepath

Let's now register another pipeline, which uses Cell Ranger to analyze the scRNA-seq data from the input FASTQ file.

In [None]:
pipeline = ln.add(lns.Pipeline(v="7", name="Cell Ranger v7"))
run = lns.Run(pipeline=pipeline, name="cellranger scRNA-seq")

run

```{note}

Linking input files to a run enables data lineage tracking.
```

In [None]:
run.inputs.append(dobject_fq)

In [None]:
dobject = ln.DObject(output_filepath, source=run)

ln.add(dobject)

### Track data lineage

Now let's find out from which files `output.bam` was generated, that is, the inputs of the run that produced `output.bam`:

In [None]:
with ln.Session() as ss:
    run = ss.select(lns.Run).join(ln.DObject, name="output", suffix=".bam").one()
    print(run.inputs)
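
The query above goes one step upstream. Repeating the same two hops (data object to source run, run to inputs) yields the full lineage. Below is a minimal sketch of that traversal over plain dictionaries. It does not use lamindb, and the function `upstream`, the mapping names, and the file and run names (including `counts.h5ad` and `quantify`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def upstream(
    dobject: str,
    produced_by: Dict[str, str],
    run_inputs: Dict[str, List[str]],
) -> List[str]:
    """Return every data object that `dobject` transitively depends on."""
    found: List[str] = []
    stack = [dobject]
    while stack:
        current = stack.pop()
        # Hop 1: which run generated the current object (None for raw data)?
        run = produced_by.get(current)
        # Hop 2: which objects did that run consume?
        for parent in run_inputs.get(run, []):
            if parent not in found:
                found.append(parent)
                stack.append(parent)
    return found


# Hypothetical records: data object -> source run, and run -> input objects.
produced_by = {
    "output.bam": "cellranger scRNA-seq",
    "counts.h5ad": "quantify",
}
run_inputs = {
    "cellranger scRNA-seq": ["input.fastq"],
    "quantify": ["output.bam"],
}

print(upstream("output.bam", produced_by, run_inputs))
print(upstream("counts.h5ad", produced_by, run_inputs))
```

The `found` list doubles as a visited set, so the traversal terminates even if two runs share an input.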

You can query the notebook that a run belongs to by the run's id like this:

In [None]:
nb = ln.select(lns.Notebook).join(lns.Run, id=ln.nb.run.id).one()

nb