# Track Nextflow workflows

[Nextflow](https://www.nextflow.io/) is a workflow management system used for executing scientific workflows across platforms scalably, portably, and reproducibly.

The [nf-core rnaseq](https://nf-co.re/rnaseq/3.12.0) workflow is a widely used pipeline for bulk RNA sequencing. It supports STAR, RSEM, HISAT2, or Salmon and produces gene/isoform counts along with extensive quality control.


## Setup

To run this notebook, you need to load a LaminDB instance that has the `bionty` schema mounted.

Here, we’ll create a test instance (skip this step if you’d rather use your own instance):

In [None]:
!lamin init --storage . --name nextflow-bulkrna

In [None]:
import lamindb as ln
from pathlib import Path

## Download test data

In [None]:
!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3 --depth 1

To keep track of the download, let's create a "Download" transform and track a run pointing to the reference URL:

In [None]:
download = ln.Transform(name="Download")
ln.track(
    download, reference="https://github.com/nf-core/test-datasets", reference_type="url"
)

Let's register the files we need from the download. They'll automatically be linked to the download run:

In [None]:
input_fastqs_file = ln.File.from_dir("test-datasets/testdata/GSE110004/")
ln.save(input_fastqs_file)
sample_sheet_file = ln.File("test-datasets/samplesheet/v3.10/samplesheet_test.csv")
ln.save(sample_sheet_file)
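
Before moving on, it can be useful to check that the FASTQ files referenced in the sample sheet match the files that were just registered. The following is a minimal sketch using only the standard library. It assumes the sample sheet is a CSV file with a header row and that the FASTQ locations live in columns named `fastq_1` and `fastq_2`; these column names are an assumption about the sample sheet layout, and both helper functions are hypothetical.

```python
import csv


def read_samplesheet(path):
    """Return the rows of a CSV sample sheet as a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def referenced_fastqs(rows, columns=("fastq_1", "fastq_2")):
    """Collect the non-empty FASTQ file names found in the given columns.

    The default column names are an assumption, not taken from the notebook.
    """
    names = []
    for row in rows:
        for column in columns:
            value = (row.get(column) or "").strip()
            if value:
                # Keep only the file name so URLs and local paths compare equal.
                names.append(value.rsplit("/", 1)[-1])
    return names
```

Calling `referenced_fastqs(read_samplesheet("test-datasets/samplesheet/v3.10/samplesheet_test.csv"))` should then give a list of file names that can be compared against the contents of the downloaded directory.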

Let's visualize data lineage for one of the files:

In [None]:
sample_sheet_file.view_lineage()

## Track the nf-core rnaseq run

Let's now track the Nextflow workflow:

In [None]:
nextflow_bulkrna = ln.Transform(
    name="nf-core rnaseq",
    version="3.11.2",
    type="pipeline",
    reference="https://github.com/laminlabs/nextflow-lamin-usecases",
)

ln.track(nextflow_bulkrna)

If we now stage the input files, they'll be tracked as run inputs (if the input data is stored in the cloud and registered in LaminDB, this is where we'd typically start):

In [None]:
sample_sheet_file.stage()
[input_fastq.stage() for input_fastq in input_fastqs_file]

We'll pass the LaminDB run id to the Nextflow run via the `-name` option, so that we can easily find it from within Nextflow:

In [None]:
!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name {ln.dev.run_context.run.id} -resume
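
Outside a notebook, where the `!` shell magic isn't available, the same command line could be assembled in plain Python. The sketch below only builds the argument list from the flags used above; `build_nextflow_command` is a hypothetical helper, and `example_run` is a placeholder for the LaminDB run id.

```python
import shlex


def build_nextflow_command(
    run_name,
    revision="3.11.2",
    profile="test,docker",
    outdir="rna-seq-results",
    resume=True,
):
    """Assemble the nf-core/rnaseq command line as a list of arguments."""
    command = [
        "nextflow", "run", "nf-core/rnaseq",
        "-r", revision,
        "-profile", profile,
        "--outdir", outdir,
        "-name", run_name,
    ]
    if resume:
        command.append("-resume")
    return command


# "example_run" is a placeholder; in the notebook this is the LaminDB run id.
command = build_nextflow_command("example_run")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues because each argument is handed over separately.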

## Register outputs

### QC

In [None]:
# this would register 240 files; we don't need them here
# multiqc_results = ln.File.from_dir("rna-seq-results/multiqc/")
# ln.save(multiqc_results)
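
To get a rough idea of how many files a directory holds before registering it wholesale, a quick count with the standard library can help. This is a minimal sketch: `count_files` is a hypothetical helper, and it assumes that every regular file below the directory, including those in subdirectories, would be picked up.

```python
from pathlib import Path


def count_files(directory):
    """Count regular files below `directory`, recursing into subdirectories."""
    return sum(1 for path in Path(directory).rglob("*") if path.is_file())
```

For instance, `count_files("rna-seq-results/multiqc/")` gives a number to check before deciding whether to register the whole directory or only selected files.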

In [None]:
multiqc_file = ln.File("rna-seq-results/multiqc/star_salmon/multiqc_report.html")
multiqc_file.save()

### Count matrix

In [None]:
count_matrix = ln.File("rna-seq-results/salmon/salmon.merged.gene_counts.tsv")
count_matrix.save()
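
To sanity-check the count matrix before or after registering it, the first few rows can be read with the standard library. The sketch below assumes the file is tab-separated with a single header row, which the `.tsv` extension suggests but the notebook doesn't confirm; `preview_tsv` is a hypothetical helper.

```python
import csv


def preview_tsv(path, n_rows=5):
    """Return the header and up to `n_rows` data rows of a tab-separated file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

Calling `preview_tsv("rna-seq-results/salmon/salmon.merged.gene_counts.tsv")` returns the column names and the first rows, which is enough to confirm that the file holds one row per gene and one column per sample.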

To make it queryable by biological entities (genes, etc.), we can now proceed with: {doc}`docs:bulkrna`

## Visualize

View data lineage:

In [None]:
count_matrix.view_lineage()

View the database content:

In [None]:
ln.view()

Clean up the test instance:

In [None]:
!lamin delete --force nextflow-bulkrna