# Nextflow

```{note}
In practice, workflows are not run from notebooks.
This notebook serves solely as a demo; we encourage users to run workflows from the command line or on cloud solutions.
```

[Nextflow](https://www.nextflow.io/) is a workflow management system for running scientific workflows scalably, portably, and reproducibly across platforms.

Here, we'll run `nf-core/rnaseq`, which processes `.fastq` files from bulk RNA sequencing with STAR, RSEM, HISAT2, or Salmon and produces gene/isoform counts along with extensive quality control ([reference](https://nf-co.re/rnaseq/3.12.0)).

![](https://raw.githubusercontent.com/nf-core/rnaseq/3.12.0//docs/images/nf-core-rnaseq_metro_map_grey.png)


## Setup

Let's create a test instance:

In [None]:
!lamin init --storage . --name nextflow-bulkrna

In [None]:
import lamindb as ln
from subprocess import getoutput

## Download test data

Download test data using git:

In [None]:
!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3 --depth 1

Track the download:

In [None]:
download = ln.Transform(name="Download")
download_url = "https://github.com/nf-core/test-datasets"
# create global run containing the download_url
ln.track(download, reference=download_url, reference_type="url")

Register the input files; they are automatically linked to the download run:

In [None]:
sample_sheet = ln.File("test-datasets/samplesheet/v3.10/samplesheet_test.csv").save()
input_fastqs = ln.File.from_dir("test-datasets/testdata/GSE110004/")
ln.save(input_fastqs)

Visualize data lineage for one of the files:

In [None]:
sample_sheet.view_flow()

## Track the Nextflow run

Track the Nextflow workflow & run:

In [None]:
transform = ln.Transform(
    name="nf-core rnaseq",
    version="3.11.2",
    type="pipeline",
    reference="https://github.com/laminlabs/nextflow-lamin-usecases",
)
ln.track(transform)
# let's grab the run of the global run context
run = ln.dev.run_context.run

If we now stage input files, they'll be tracked as inputs for the global run:

In [None]:
input_sample_sheet_path = sample_sheet.stage()
input_paths = [input_fastq.stage() for input_fastq in input_fastqs]

Run the Nextflow pipeline:

In [None]:
!nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name {run.id}

Here, we passed the LaminDB run ID to Nextflow via `-name` so that we can query it from within Nextflow.
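
Outside a notebook, the same invocation can be assembled in Python. The following is a minimal sketch that reuses only the flags from the cell above; `nextflow_command` is a hypothetical helper, not part of Nextflow or LaminDB, and it only builds the argument list without launching anything.

```python
import shlex


def nextflow_command(run_name, revision="3.11.2", outdir="rna-seq-results"):
    """Build the argument list for the nf-core/rnaseq call used in this notebook.

    Hypothetical helper for illustration: it mirrors the flags of the cell above.
    """
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "-r", revision,
        "-profile", "test,docker",
        "--outdir", outdir,
        "-name", str(run_name),
    ]


# "my-run-id" is a placeholder; in the notebook you would pass run.id
command = nextflow_command("my-run-id")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would launch the pipeline and raise an error if Nextflow exits with a non-zero status.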

## Register outputs

### QC

In [None]:
multiqc_file = ln.File("rna-seq-results/multiqc/star_salmon/multiqc_report.html").save()

:::{dropdown} How would I register all QC files?

```python
multiqc_results = ln.File.from_dir("rna-seq-results/multiqc/")
ln.save(multiqc_results)
```

:::

### Count matrix

In [None]:
count_matrix = ln.File("rna-seq-results/salmon/salmon.merged.gene_counts.tsv").save()
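
Both registrations assume that the pipeline actually wrote these files. The following is a minimal sketch of a pre-flight check using only the standard library; `missing_outputs` is a hypothetical helper, and the relative paths are the ones used in the cells above.

```python
from pathlib import Path

# Output files registered in this notebook, relative to --outdir
EXPECTED_OUTPUTS = [
    "multiqc/star_salmon/multiqc_report.html",
    "salmon/salmon.merged.gene_counts.tsv",
]


def missing_outputs(outdir, relative_paths=EXPECTED_OUTPUTS):
    """Return the expected output files that do not exist under outdir."""
    root = Path(outdir)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


missing = missing_outputs("rna-seq-results")
if missing:
    print("Missing pipeline outputs:", ", ".join(missing))
```

Running this before the registration cells surfaces a failed or incomplete pipeline run early, instead of as a file-not-found error during registration.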

## Track Nextflow ID

Let us look at the Nextflow logs:

In [None]:
!nextflow log

Let us add the Nextflow session ID to our `run` record:

In [None]:
# select the log line that contains our run ID and print its 8th whitespace-separated field
nextflow_id = getoutput(f"nextflow log | awk '/{run.id}/{{print $8}}'")
run.reference = nextflow_id
run.reference_type = "nextflow_id"
run.save()
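
The `awk` expression above selects log lines that contain the run ID and prints their eighth whitespace-separated field. A rough Python equivalent is sketched below; `field_of_matching_lines` is a hypothetical helper, and which column holds the session ID depends on the `nextflow log` layout, so it is worth checking the output of the previous cell before relying on it.

```python
def field_of_matching_lines(text, needle, field=8):
    """Mimic awk '/needle/{print $field}' for a plain-text needle.

    Returns the selected field (1-based, whitespace-separated) of every
    line that contains needle; lines with too few fields yield "".
    """
    values = []
    for line in text.splitlines():
        if needle in line:
            parts = line.split()
            values.append(parts[field - 1] if len(parts) >= field else "")
    return values


# Neutral tokens rather than real Nextflow output, to show the selection logic
sample = "a b c d e f g h i\nx1 x2 target x4 x5 x6 x7 x8 x9"
print(field_of_matching_lines(sample, "target"))
```

An empty list means that no log line matched, in which case `run.reference` should not be overwritten.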

## Link biological entities

To make the count matrix queryable by biological entities (genes, experimental metadata, etc.), we can now proceed with: {doc}`docs:bulkrna`

## Visualize

View data lineage:

In [None]:
count_matrix.view_flow()

View the database content:

In [None]:
ln.view()

Clean up the test instance:

In [None]:
!lamin delete --force nextflow-bulkrna