# Track Nextflow workflows

[Nextflow](https://www.nextflow.io/) is a workflow management system for executing scientific workflows across platforms in a scalable, portable, and reproducible way.

Here, we'll run `nf-core/rnaseq` to process `.fastq` files from bulk RNA sequencing. The pipeline uses STAR, RSEM, HISAT2, and Salmon to produce gene and isoform counts along with extensive quality control ([reference](https://nf-co.re/rnaseq/3.12.0)).

![](https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0/docs/images/nf-core-rnaseq_metro_map_grey.png)


Let's create a test instance:

In [None]:
!lamin init --storage . --name nextflow-bulkrna

In [None]:
import lamindb as ln

## Download test data

Download test data using git:

In [None]:
!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3 --depth 1

Track the download:

In [None]:
download = ln.Transform(name="Download")
download_url = "https://github.com/nf-core/test-datasets"
ln.track(download, reference=download_url, reference_type="url")

Register the input files; they are automatically linked to the download run:

In [None]:
sample_sheet = ln.File("test-datasets/samplesheet/v3.10/samplesheet_test.csv")
ln.save(sample_sheet)
input_fastqs = ln.File.from_dir("test-datasets/testdata/GSE110004/")
ln.save(input_fastqs)

Visualize data lineage for one of the files:

In [None]:
sample_sheet.view_lineage()

## Track the Nextflow run

(We'd start here if input files were tracked in the cloud with LaminDB rather than downloaded through git.)

Track the Nextflow pipeline and its run:

In [None]:
nextflow_bulkrna = ln.Transform(
    name="nf-core rnaseq",
    version="3.11.2",
    type="pipeline",
    reference="https://github.com/laminlabs/nextflow-lamin-usecases",
)
ln.track(nextflow_bulkrna)

If we now stage input files, they'll be tracked as run inputs.

(As data is already locally available in this test case, staging won't download anything.)

In [None]:
sample_sheet.stage()
[input_fastq.stage() for input_fastq in input_fastqs]

All data is now in place and we can run the Nextflow pipeline:

In [None]:
!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name {ln.dev.run_context.run.id}

Here, we passed the LaminDB run ID to Nextflow as the run name (`-name`), so that we can query for it from within Nextflow.
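When the pipeline is launched from a script rather than a notebook cell, the same invocation can be assembled as an argument list. The following is a minimal sketch: `build_nextflow_command` and its parameters are hypothetical names introduced for illustration, and it uses only the flags that already appear in the command above.

```python
import shlex


def build_nextflow_command(run_name, revision="3.11.2", outdir="rna-seq-results"):
    """Assemble the `nextflow run` call used above as an argument list.

    Hypothetical helper for illustration; not part of LaminDB or Nextflow.
    """
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "-r", revision,
        "-profile", "test,docker",
        "--outdir", outdir,
        "-name", run_name,
    ]


# "my_run_name" is a placeholder; the notebook passes the LaminDB run id instead
command = build_nextflow_command("my_run_name")

# Print the shell-quoted command for inspection
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues when the run name or output directory is built dynamically.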

## Register outputs

### QC

In [None]:
multiqc_file = ln.File("rna-seq-results/multiqc/star_salmon/multiqc_report.html")
multiqc_file.save()

:::{dropdown} How would I register all QC files?

```python
multiqc_results = ln.File.from_dir("rna-seq-results/multiqc/")
ln.save(multiqc_results)
```

:::

### Count matrix

In [None]:
count_matrix = ln.File("rna-seq-results/salmon/salmon.merged.gene_counts.tsv")
count_matrix.save()
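Before registering outputs, it can be useful to confirm that the pipeline actually produced the expected files. A minimal sketch using only the standard library: `missing_outputs` is a hypothetical helper, the relative paths are the two registered above, and a temporary directory stands in for `rna-seq-results`.

```python
from pathlib import Path
import tempfile

# Relative paths of the two outputs registered above
EXPECTED_OUTPUTS = [
    "multiqc/star_salmon/multiqc_report.html",
    "salmon/salmon.merged.gene_counts.tsv",
]


def missing_outputs(outdir, expected=EXPECTED_OUTPUTS):
    """Return the expected relative paths that do not exist under `outdir`."""
    root = Path(outdir)
    return [rel for rel in expected if not (root / rel).is_file()]


# Demonstration: create only the MultiQC report in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / EXPECTED_OUTPUTS[0]
    report.parent.mkdir(parents=True)
    report.write_text("<html></html>")
    still_missing = missing_outputs(tmp)

print(still_missing)
```

In the notebook, calling `missing_outputs("rna-seq-results")` and checking for an empty list would serve as a guard before the `ln.File(...)` calls.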

## Link biological entities

To make the count matrix queryable by biological entities (genes, experimental metadata, etc.), we can now proceed with: {doc}`docs:bulkrna`

## Register the Nextflow execution id

To make the Nextflow execution ID queryable in LaminDB, we can read it from `nextflow log` and store it on the run:

In [None]:
import subprocess

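# Find this run in the Nextflow log by its name (the LaminDB run id) and print
# the 8th whitespace-separated field, where the session ID is expected
command = f"nextflow log | grep -F '{ln.dev.run_context.run.id}' | awk '{{print $8}}'"
session_id = subprocess.getoutput(command)

# Attach the session ID to the most recent run of the pipeline transform
run = ln.Run.filter(transform__name="nf-core rnaseq").order_by("-run_at").first()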
run.reference = session_id
run.reference_type = "nextflow_id"
run.save()
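The `awk` field index above depends on how many whitespace-separated fields precede the session ID in the `nextflow log` output. An alternative sketch matches the session ID by its shape instead, assuming it is printed as a UUID on the same line as the run name. `extract_session_id` is a hypothetical helper, and the sample log text is made up for illustration.

```python
import re

# Lowercase hexadecimal UUID, e.g. 0a1b2c3d-4e5f-4a6b-8c7d-9e0f1a2b3c4d
UUID_PATTERN = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"
)


def extract_session_id(log_text, run_name):
    """Return the first UUID on a line that mentions `run_name`, else None."""
    for line in log_text.splitlines():
        if run_name in line:
            match = UUID_PATTERN.search(line)
            if match:
                return match.group(0)
    return None


# Made-up log text; real input would be the captured output of `nextflow log`
sample_log = (
    "TIMESTAMP\tDURATION\tRUN NAME\tSTATUS\tREVISION ID\tSESSION ID\tCOMMAND\n"
    "2023-08-01 10:00:00\t12m 3s\tmy_run_name\tOK\t5671b65af9\t"
    "0a1b2c3d-4e5f-4a6b-8c7d-9e0f1a2b3c4d\tnextflow run nf-core/rnaseq\n"
)
example_session_id = extract_session_id(sample_log, "my_run_name")
print(example_session_id)
```

In the notebook, `log_text` could come from `subprocess.getoutput("nextflow log")` and `run_name` from the LaminDB run id.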

## Visualize

View data lineage:

In [None]:
count_matrix.view_lineage()

View the database content:

In [None]:
ln.view()

Clean up the test instance:

In [None]:
!lamin delete --force nextflow-bulkrna