# Tracking Bulk RNA-seq Nextflow runs

## Background

[Nextflow](https://www.nextflow.io/) is a workflow management system for orchestrating and executing scientific workflows across different computational environments. Its key features are scalability, portability, and reproducibility: researchers define complex workflows in a platform-agnostic manner and run them efficiently on various computing infrastructures.

While Nextflow together with nf-tower focuses on executing reproducible and trackable bioinformatics pipelines, LaminDB offers a provenance-aware data lake.

Here, we will demonstrate how to track Nextflow workflow execution and generated biological entities with [lamin](https://lamin.ai/).

## Setup

To run this notebook, you need to load a LaminDB instance that has the `bionty` schema mounted.

Here, we’ll create a test instance (skip if you’d like to run it using your instance):

In [None]:
!lamin init --storage ./nextflow_rna_seq --schema bionty

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
import os
from pathlib import Path

ln.settings.verbosity = 3  # show hints

In [None]:
lb.settings.species = "human"

## Tracking nf-core rnaseq

[nf-core rnaseq](https://nf-co.re/rnaseq/3.12.0) is one of the most popular pipelines for bulk RNA sequencing. It uses STAR, RSEM, HISAT2, or Salmon and produces gene/isoform counts along with extensive quality control.

First, we create a new Transform object for our pipeline run.

In [None]:
rna_seq_transform = ln.Transform(
    name="nf-core rnaseq",
    version="3.11.2",
    type="pipeline",
    reference="https://github.com/laminlabs/nextflow-lamin-usecases/",
).save()

In [None]:
ln.track(rna_seq_transform)

We download the [test data](https://github.com/nf-core/test-datasets/tree/rnaseq3) for the pipeline to track it with Lamin.

In [None]:
!git clone https://github.com/nf-core/test-datasets --single-branch --branch rnaseq3

In [None]:
input_fastqs_file = ln.File.from_dir(
    "test-datasets/testdata/GSE110004/", storage_root=Path(".")
)
sample_sheet_file = ln.File("test-datasets/samplesheet/v3.10/samplesheet_test.csv")
ln.save(input_fastqs_file)
ln.save(sample_sheet_file)
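
For orientation, the sample sheet is a plain CSV file. The following is a minimal sketch of grouping its FASTQ entries by sample using only the standard library. It assumes the nf-core rnaseq column names `sample`, `fastq_1`, and `fastq_2`; the helper name `fastqs_by_sample` and the inline example rows are hypothetical and not part of Lamin or the pipeline.

```python
import csv
import io
from collections import defaultdict


def fastqs_by_sample(handle):
    """Group FASTQ entries by sample name from an open sample sheet."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        # fastq_2 is empty for single-end libraries, so skip blank entries
        for column in ("fastq_1", "fastq_2"):
            value = (row.get(column) or "").strip()
            if value:
                grouped[row["sample"]].append(value)
    return dict(grouped)


# Hypothetical rows for illustration only, not the real test sample sheet
example_sheet = io.StringIO(
    "sample,fastq_1,fastq_2,strandedness\n"
    "sampleA,a_1.fastq.gz,a_2.fastq.gz,reverse\n"
    "sampleB,b_1.fastq.gz,,reverse\n"
)
print(fastqs_by_sample(example_sheet))
```

To apply this to the downloaded sample sheet, open `test-datasets/samplesheet/v3.10/samplesheet_test.csv` and pass the file handle to the function.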

Let’s set the input files for our run:

In [None]:
run = ln.Run.select(created_by_id="DzTjkKse").one()
run

In [None]:
run.input_files.set(input_fastqs_file)
run.reference_type = "nextflow_name"

To sync the workflow execution name with Lamin, we export it as an environment variable.

In [None]:
os.environ["LAMINDB_RUN_ID"] = "lamin_rnaseq"

Next, we run the pipeline on its test dataset and track the output files and features with Lamin.
We already ran the pipeline beforehand, so the run command below is commented out.

In [None]:
# !nextflow run nf-core/rnaseq -r 3.11.2 -profile test,docker --outdir rna-seq-results -name $LAMINDB_RUN_ID -resume
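
If you prefer to launch the pipeline from Python rather than from a shell cell, the same command can be assembled as an argument list. This is only a sketch: the helper `build_nextflow_command` is hypothetical, its defaults mirror the command shown above, and it reads the run name from the `LAMINDB_RUN_ID` environment variable set earlier.

```python
import os
import shlex


def build_nextflow_command(run_name, revision="3.11.2", outdir="rna-seq-results"):
    """Assemble the nf-core/rnaseq test command as an argument list."""
    return [
        "nextflow", "run", "nf-core/rnaseq",
        "-r", revision,
        "-profile", "test,docker",
        "--outdir", outdir,
        "-name", run_name,
        "-resume",
    ]


# Fall back to the name used in this notebook if the variable is unset
run_name = os.environ.get("LAMINDB_RUN_ID", "lamin_rnaseq")
command = build_nextflow_command(run_name)
print(shlex.join(command))
# To actually launch the pipeline: subprocess.run(command, check=True)
```

Building the command as a list avoids shell quoting issues and makes the run name explicit, which is the value we want to keep in sync with Lamin.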

As a first step, we ingest the MultiQC results from the pipeline run.

In [None]:
multiqc_results = ln.File.from_dir(
    "rna-seq-results/multiqc/", storage_root=Path("."), run=run
)
ln.save(multiqc_results)

In [None]:
multiqc_file = ln.File.select(key__icontains="multiqc_report.html").one()
multiqc_file

Let's examine the MultiQC report:

In [None]:
import shutil
from IPython.display import IFrame

# Copying file to a directory accessible by the IPython Tornado web server
shutil.copy(multiqc_file.stage(), "./multiqc_report.html")
IFrame(src="multiqc_report.html", width=1000, height=600)

We also ingest the merged Salmon gene counts, since we plan to work with the count table further:

In [None]:
salmon_gene_counts_table = ln.File(
    "rna-seq-results/salmon/salmon.merged.gene_counts.tsv", run=run
)
ln.save(salmon_gene_counts_table)

In [None]:
gene_counts_df = pd.read_csv(salmon_gene_counts_table.stage(), sep="\t")

We also track all genes in the count table as a feature set.

In [None]:
feature_set_genes = ln.FeatureSet.from_values(
    gene_counts_df["gene_name"], lb.Gene.symbol
)
feature_set_genes.save()
salmon_gene_counts_table.feature_sets.add(feature_set_genes)

The dataset contains yeast samples, but our species is set to human. Most gene symbols therefore cannot be matched to existing human gene records, and many new gene records are created. Bionty will soon also support yeast genes.
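
To get a feel for how large this mismatch is before saving a feature set, you could compare the symbols against a reference vocabulary. The sketch below is illustrative only: `split_by_known_symbols` is a hypothetical helper using exact string matching, the symbol lists are toy inputs, and it does not reflect how Bionty validates genes internally.

```python
def split_by_known_symbols(symbols, known_symbols):
    """Separate symbols found in a reference vocabulary from those that are not."""
    known = set(known_symbols)
    matched, unmatched = [], []
    for symbol in dict.fromkeys(symbols):  # de-duplicate while keeping order
        if symbol in known:
            matched.append(symbol)
        else:
            unmatched.append(symbol)
    return matched, unmatched


# Hypothetical inputs: a tiny human vocabulary and a mix of human and yeast symbols
human_reference = ["TP53", "BRCA1", "ACTB"]
table_symbols = ["TP53", "GAL4", "ADH1", "ACTB", "GAL4"]
matched, unmatched = split_by_known_symbols(table_symbols, human_reference)
print(f"{len(matched)} matched, {len(unmatched)} unmatched: {unmatched}")
```

With the real count table, `gene_counts_df["gene_name"]` would take the place of `table_symbols`.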

## Conclusion

Lamin makes it easy to track pipeline executions and to ingest input and output files that can subsequently be used for custom downstream analyses. This is complementary to nf-tower.