# Track provenance

{class}`~lamindb.File` objects are the {attr}`~lamindb.Run.inputs` and {attr}`~lamindb.Run.outputs` of {class}`~lamindb.Run` objects. Every run executes a {class}`~lamindb.Transform`, which is either a notebook or a pipeline.

From any `File` object, access the {class}`~lamindb.Run` and {class}`~lamindb.Transform` that generated it via {attr}`lamindb.File.run` and {attr}`lamindb.File.transform`.
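To make these relationships concrete, here is a minimal plain-Python sketch of the three record types. It is an illustration only, not how lamindb implements them: the dataclasses and their fields are assumptions that mirror the attribute names used in this guide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Toy stand-ins for the lamindb records (hypothetical, for illustration only).
# eq=False: records reference each other, so compare them by identity.
@dataclass(eq=False)
class Transform:
    name: str
    type: str = "notebook"  # "notebook" or "pipeline"


@dataclass(eq=False)
class Run:
    transform: Transform
    inputs: List["File"] = field(default_factory=list)
    outputs: List["File"] = field(default_factory=list)


@dataclass(eq=False)
class File:
    name: str
    run: Optional[Run] = None  # the run that generated this file

    @property
    def transform(self) -> Optional[Transform]:
        # A file reaches its transform through the run that generated it.
        return self.run.transform if self.run is not None else None


notebook = Transform(name="Track provenance")
run = Run(transform=notebook)
file = File(name="mini.csv", run=run)
run.outputs.append(file)

print(file.transform.name)
```

In this sketch a file points to exactly one generating run, while a run can list many inputs and outputs.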

In [None]:
# initialize a test instance for this notebook
!lamin init --storage ./myobjects

In [None]:
import lamindb as ln

## Notebooks

First, track the data source. Here, the source is a Jupyter notebook, so we run:

In [None]:
ln.track()

:::{dropdown} Track a pipeline instead of a notebook

When running in a pipeline, pass a {class}`~lamindb.Transform` object of `type` `"pipeline"`:

```
transform = ln.Transform("My script")  # optionally pass type="pipeline"
ln.track(transform)
```

This creates a {class}`~lamindb.Run` for the pipeline.

A pipeline is any non-interactive session: any Python script or workflow tool you may use.

:::

Calling `ln.track()` creates a global run context:

In [None]:
ln.context.transform

In [None]:
ln.context.run

Let's add a file:

In [None]:
filepath = ln.dev.datasets.file_mini_csv()
filepath = filepath.rename(ln.setup.settings.storage.root / filepath.name)

In [None]:
file = ln.File(filepath)
ln.add(file)

The file now has transform and run records that are not `None`:

In [None]:
file.transform

Hence, we can query for the file by its transform:

In [None]:
ln.select(ln.File).where(ln.File.transform == ln.context.transform).one()

## Pipelines

In [None]:
filepath = ln.dev.datasets.file_fastq()

When working with a pipeline, we'll register it before running it.

In [None]:
transform = ln.Transform(
    name="10x scRNA-seq nextseq", type="pipeline"
)  # `type` defaults to `"pipeline"` outside an interactive (IPython) environment

We can then use {func}`~lamindb.track` as before:

In [None]:
ln.track(transform)

In [None]:
file_fastq = ln.File(filepath)

In [None]:
ln.add(file_fastq)

:::{dropdown}

We can also manually pass a run and not use the global run context set by `ln.track`:
```
run = ln.Run(transform=transform, name="ingest-fastq")
ln.File(filepath, run=run)
```

:::
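The difference between relying on the global run context and passing a run explicitly can be sketched in plain Python. This is a toy model under stated assumptions, not lamindb's code: `Context`, `track`, and this `File` class are hypothetical stand-ins that only mimic the behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Run:
    name: str


class Context:
    """Hypothetical stand-in for a global run context."""

    run: Optional[Run] = None


def track(name: str) -> Run:
    # Store the new run globally so later File objects can pick it up.
    Context.run = Run(name)
    return Context.run


@dataclass(eq=False)
class File:
    path: str
    run: Optional[Run] = None

    def __post_init__(self) -> None:
        # Fall back to the global run only if no run was passed explicitly.
        if self.run is None:
            self.run = Context.run


track("global-run")
implicit = File("a.fastq.gz")
explicit = File("b.fastq.gz", run=Run("ingest-fastq"))

print(implicit.run.name, explicit.run.name)
```

An explicitly passed run takes precedence in this sketch, which is useful when one session registers files on behalf of several runs.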

## Track run inputs

While run outputs are tracked automatically, run inputs are not: they have to be flagged explicitly.
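The asymmetry can be sketched with a toy model in plain Python. Everything here is hypothetical: the `add` helper and the `stage` method only borrow names from this guide, and their signatures are assumptions made for illustration rather than lamindb's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Run:
    name: str
    inputs: List["File"] = field(default_factory=list)
    outputs: List["File"] = field(default_factory=list)


@dataclass(eq=False)
class File:
    name: str
    run: Optional[Run] = None  # the run that produced this file

    def stage(self, current_run: Run, is_run_input: bool = False) -> str:
        # Reading a file records nothing unless the caller opts in.
        if is_run_input:
            current_run.inputs.append(self)
        return self.name


def add(file: File, current_run: Run) -> None:
    # Registering a file always records it as an output of the current run.
    file.run = current_run
    current_run.outputs.append(file)


upstream = Run("10x scRNA-seq nextseq")
fastq = File("input.fastq.gz")
add(fastq, upstream)

cell_ranger = Run("Cell Ranger")
fastq.stage(cell_ranger, is_run_input=True)
bam = File("output.bam")
add(bam, cell_ranger)

print([f.name for f in bam.run.inputs])
```

Note that flagging the file as an input of the second run leaves its own generating run unchanged.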

Let's register a pipeline that takes the fastq file as an input:

In [None]:
ln.track(ln.Transform(name="Cell Ranger", version="7", type="pipeline"))

To track the fastq file as an input for the current run, set `is_run_input=True` when staging it:

In [None]:
file_fastq = ln.select(ln.File, name="input.fastq.gz").one()
file_fastq.stage(is_run_input=True)

Let's get an example output filepath:

In [None]:
output_filepath = ln.dev.datasets.file_bam()

In [None]:
output_filepath

In [None]:
file = ln.File(output_filepath)

ln.add(file)

Let's query the input file of the run that produced file `output.bam`:

In [None]:
with ln.Session() as ss:
    file = ss.select(ln.File, name="output.bam").one()
    print(file.run.inputs)

In [None]:
assert file.run.inputs[0].name == "input.fastq.gz"

## Query by provenance

### Run inputs & outputs

Which run does the file `output.bam` come from?

In [None]:
with ln.Session() as ss:
    file = ss.select(ln.File, name="output.bam").first()
    print(file.run)

Which other files did this run have as inputs and outputs?

In [None]:
with ln.Session() as ss:
    file = ss.select(ln.File, name="output.bam").first()
    print(file.run.inputs)
    print(file.run.outputs)

### Notebooks

Which notebook ingested a given file?

In [None]:
file = ln.select(ln.File, name="mini.csv").first()
print(file.transform)

Which notebooks were created by testuser2?

In [None]:
ln.select(ln.Transform, type="notebook").join(ln.Transform.created_by).where(
    ln.User.handle == "testuser2"
).df()

### Pipelines

Which pipeline produced the file `input.fastq.gz`?

In [None]:
file = ln.select(ln.File, name="input.fastq.gz").one()
print(file.transform)

Which pipelines were created by testuser1?

In [None]:
ln.select(ln.Transform, type="pipeline").join(ln.Transform.created_by).where(
    ln.User.handle == "testuser1"
).df()

### Users

Which users have interacted with the database?

In [None]:
ln.select(ln.User).df()

Which user ingested the file `input.fastq.gz`?

In [None]:
with ln.Session() as ss:
    file = ss.select(ln.File, name="input.fastq.gz").one()
    print(file.created_by)

Which users created notebooks with "lineage" in the title?

In [None]:
ln.select(ln.User.handle, ln.Transform.title).join(ln.Transform.created_by).where(
    ln.Transform.title.contains("lineage")
).df()

Which user created the `Cell Ranger` pipeline?

In [None]:
ln.select(ln.Transform).df()

In [None]:
ln.select(ln.User).join(ln.Transform, name="Cell Ranger", version="7").one()

In [None]:
!lamin delete myobjects