# Tracking Bulk RNA-seq Nextflow runs

## Background

[Nextflow](https://www.nextflow.io/) is a workflow management system for orchestrating and executing scientific workflows across different computational environments. Its core strengths are scalability, portability, and reproducibility: researchers define complex workflows in a platform-agnostic manner and run them efficiently on various computing infrastructures.

Here, we will demonstrate how to track Nextflow workflow execution and generated biological entities with [lamin](https://lamin.ai/).

## Setup

To run this notebook, you need to load a LaminDB instance that has the `bionty` schema mounted.

Here, we load the `laminlabs/lamindata` instance (replace it with your own instance if you prefer):

In [None]:
import lamindb as ln
import lnschema_bionty as lb
import pandas as pd

ln.setup.load("laminlabs/lamindata")

ln.settings.verbosity = 3  # show hints

In [None]:
lb.settings.species = "human"

## Fetching and tracking NGS files

A pipeline run is fundamentally one set of transformations. Hence, we create a {func}`docs:lamindb.Transform` object using Bionty's lookup for the [nf-core fetchngs](https://nf-co.re/fetchngs) pipeline.
`lnschema-bionty` currently does not have a `BFXPipeline` ORM, so we access Bionty directly.

In [None]:
import bionty as bt

bfx_lookup = bt.BFXPipeline().lookup()

We plan to use version 1.9. Let's find it with our lookup object...

In [None]:
fetch_ngs_1_9 = bfx_lookup.fetchngs_v1_9
fetch_ngs_1_9

...and create a {func}`docs:lamindb.Transform` object.

In [None]:
fetch_ngs_1_9_transform = ln.Transform(
    name=fetch_ngs_1_9.name, version=fetch_ngs_1_9.versions, type="pipeline"
).save()

In [None]:
ln.track(fetch_ngs_1_9_transform)

First, we fetch several FASTQ files using [nf-core fetchngs](https://nf-co.re/fetchngs). For simplicity, we use the `test` profile, which has a small, pre-defined set of FASTQ files. Later, we'll use them as input for [nf-core rnaseq](https://nf-co.re/rnaseq).

In [None]:
!nextflow run nf-core/fetchngs -r 1.9 --nf_core_pipeline=rnaseq -profile test,docker --outdir fetchngs-results -resume

In [None]:
!tree fetchngs-results

The pipeline run results in several FASTQ files with associated md5 sum files, metadata files, samplesheets, and execution reports. We ingest all files with Lamin.

In [None]:
fetchngs_results = ln.File.from_dir("fetchngs-results")
ln.save(fetchngs_results)

Now that all output files of our run are ingested, we can easily access and work with them.
Lamin tracks data lineage:

In [None]:
execution_report_file = ln.File.select(key__icontains="execution_report").one()
execution_report_file.view_lineage()

We can now easily examine, for example, the execution report:

In [None]:
import shutil
from IPython.display import IFrame

# Copying file to a directory accessible by the IPython Tornado web server
shutil.copy(execution_report_file.stage(), "./execution_report.html")
IFrame(src="execution_report.html", width=1000, height=600)

Alternatively, we can calculate various metrics of interest for the downloaded FASTQ files.

In [None]:
a_fastq_file = ln.File.select(
    key__icontains="fetchngs-results/fastq", suffix=".fastq.gz"
).first()
a_fastq_file

In [None]:
import gzip
from Bio import SeqIO

total_gc_count = total_bases = 0

with gzip.open(a_fastq_file.stage(), "rt") as handle:
    for record in SeqIO.parse(handle, "fastq"):
        sequence = str(record.seq)
        gc_count = sequence.count("G") + sequence.count("C")
        total_gc_count += gc_count
        total_bases += len(sequence)

total_gc_content = (total_gc_count / total_bases) * 100

print(f"Total GC Content: {total_gc_content:.2f}%")

In [None]:
a_fastq_file.view_lineage()
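
The md5 sum files that accompany the FASTQ files can also serve as an integrity check. The following is a minimal sketch, assuming each `.md5` file starts with the hex digest followed by whitespace (the layout `md5sum` writes); the helper names `md5_of_file` and `verify_md5` are hypothetical and not part of Lamin or nf-core.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(data_path, md5_path):
    """Compare a file's md5 digest with the one stored in md5_path.

    Assumption: the first whitespace-separated token of the md5 file
    is the expected hex digest.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```

With a staged FASTQ file and its md5 file on disk, `verify_md5(fastq_path, md5_path)` returns `True` when the digests match.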

## Analysing raw FASTQ files and generating a count table

The fetched FASTQ files serve as input for [nf-core rnaseq](https://nf-co.re/rnaseq).

For faster processing, we down-sample the FASTQ files.
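
One possible way to down-sample is reservoir sampling over the 4-line FASTQ records. The sketch below is not necessarily the approach intended for this notebook: the function names `read_fastq` and `downsample_fastq` are hypothetical, and it assumes well-formed, gzip-compressed FASTQ files with exactly four lines per record. Using the same `seed` and `n_reads` for both mates of a paired-end sample selects the same record positions, provided both files hold the same number of reads.

```python
import gzip
import random
from itertools import islice


def read_fastq(path):
    """Yield FASTQ records as 4-line tuples from a gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        while True:
            record = tuple(islice(handle, 4))
            if not record:
                return
            if len(record) != 4:
                raise ValueError(f"Truncated FASTQ record in {path}")
            yield record


def downsample_fastq(in_path, out_path, n_reads, seed=0):
    """Write a uniform random sample of at most n_reads records to out_path.

    Returns the number of records written.
    """
    rng = random.Random(seed)
    reservoir = []  # list of (original index, record)
    for index, record in enumerate(read_fastq(in_path)):
        if index < n_reads:
            reservoir.append((index, record))
        else:
            # Reservoir sampling: replace a slot with probability n_reads / (index + 1)
            slot = rng.randint(0, index)
            if slot < n_reads:
                reservoir[slot] = (index, record)
    reservoir.sort(key=lambda item: item[0])  # restore original read order
    with gzip.open(out_path, "wt") as handle:
        for _, record in reservoir:
            handle.writelines(record)
    return len(reservoir)
```

The reservoir keeps only `n_reads` records in memory, so the input file never has to be loaded in full.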

3. Run [nf-core/rnaseq](https://github.com/nf-core/rnaseq/) on the FASTQs
4. Track all output files
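
Running nf-core/rnaseq on a custom set of FASTQ files requires a samplesheet. The following sketch writes one with the `csv` module, assuming the `sample,fastq_1,fastq_2,strandedness` column layout described in the nf-core/rnaseq usage documentation; the helper name `write_samplesheet`, the sample IDs, and the file paths are hypothetical, and the accepted `strandedness` values should be checked against the pipeline version in use.

```python
import csv


def write_samplesheet(samples, out_path, strandedness="auto"):
    """Write an rnaseq-style samplesheet.

    samples maps a sample ID to a (fastq_1, fastq_2) tuple;
    fastq_2 is None or "" for single-end data.
    """
    fieldnames = ["sample", "fastq_1", "fastq_2", "strandedness"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for sample, (fastq_1, fastq_2) in samples.items():
            writer.writerow(
                {
                    "sample": sample,
                    "fastq_1": fastq_1,
                    "fastq_2": fastq_2 or "",
                    "strandedness": strandedness,
                }
            )
    return out_path


# Hypothetical sample IDs and paths, for illustration only
samples = {
    "sample_a": ("downsampled/sample_a_1.fastq.gz", "downsampled/sample_a_2.fastq.gz"),
    "sample_b": ("downsampled/sample_b.fastq.gz", None),
}
```

Calling `write_samplesheet(samples, "samplesheet.csv")` produces a file that could then be passed to the pipeline through its `--input` parameter.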

## Downstream analysis of RNA counts

5. Run [nf-core/differentialabundance](https://github.com/nf-core/differentialabundance) on the count table
6. Track all output files and use Bionty wherever we can

## Conclusion