# Track Snakemake workflows

[Snakemake](https://snakemake.readthedocs.io/en/stable/) is a workflow management system for running scientific workflows across platforms in a scalable, portable, and reproducible way.

Here, we’ll run `snakemake-workflows/rna-seq-star-deseq2` to perform differential gene expression analysis with STAR and DESeq2 ([reference](https://github.com/snakemake-workflows/rna-seq-star-deseq2)).

## Setup

Let’s create a test instance:

In [None]:
!lamin init --storage . --name snakemake-bulkrna

In [None]:
import lamindb as ln

## Download test data

The Snakemake pipeline ships with test data, so we clone the whole repository with git:

In [None]:
!git clone https://github.com/snakemake-workflows/rna-seq-star-deseq2 --single-branch --branch v2.0.0

In [None]:
root_dir = "rna-seq-star-deseq2"

Track the download:

In [None]:
download = ln.Transform(name="Download")
download_url = "https://github.com/snakemake-workflows/rna-seq-star-deseq2"
# start a global run that records download_url as its reference
ln.track(download, reference=download_url, reference_type="url")

Register the input files. They’ll automatically be linked to the download run:

In [None]:
sample_sheet = ln.File(f"{root_dir}/.test/config_basic/samples.tsv")
ln.save(sample_sheet)
input_fastqs = ln.File.from_dir(f"{root_dir}/ngs-test-data/reads/")
ln.save(input_fastqs)
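
Before or after registering, it can help to check that the sample sheet looks as expected. The following is a minimal sketch using only the standard library; `summarize_tsv` is a hypothetical helper, not part of LaminDB or Snakemake, and it assumes the file is tab-separated with a single header row.

```python
import csv
from pathlib import Path


def summarize_tsv(path):
    """Return (column_names, n_rows) for a tab-separated file with a header row."""
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no lines
        n_rows = sum(1 for row in reader if row)  # blank lines are not counted
    return header, n_rows


# Example call, using the sample sheet path from the cell above:
# columns, n_samples = summarize_tsv("rna-seq-star-deseq2/.test/config_basic/samples.tsv")
```

The helper only reads the file; it does not touch the LaminDB instance.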

Visualize data lineage for one of the files:

In [None]:
sample_sheet.view_flow()

## Track Snakemake run

(We’d start here if input files were tracked in the cloud with LaminDB rather than downloaded through git.)

Track the Snakemake workflow and its run:

In [None]:
transform = ln.Transform(
    name="snakemake-workflows/rna-seq-star-deseq2",
    version="2.0.0",
    type="pipeline",
    reference="https://github.com/laminlabs/snakemake-lamin-usecases",
)
transform.save()
run = ln.Run(transform=transform)

If we now stage input files, they’ll be tracked as run inputs.

(In this test case, data is already locally available and staging won’t download anything.)

In [None]:
input_sample_sheet_path = sample_sheet.stage()
input_paths = [input_fastq.stage() for input_fastq in input_fastqs]

All data is now locally available, and we can run the Snakemake pipeline:

In [None]:
!snakemake \
    --directory rna-seq-star-deseq2/.test \
    --snakefile rna-seq-star-deseq2/workflow/Snakefile \
    --configfile rna-seq-star-deseq2/.test/config_basic/config.yaml \
    --use-conda \
    --show-failed-logs \
    --cores 2 \
    --conda-cleanup-pkgs cache
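
Outside a notebook, the same invocation could be launched from a Python script. The sketch below assembles the flags from the shell cell above into an argument list and runs it with `subprocess`; `build_snakemake_command` and `run_snakemake` are hypothetical helper names, and the sketch assumes `snakemake` is on the `PATH`.

```python
import subprocess


def build_snakemake_command(root_dir="rna-seq-star-deseq2", cores=2):
    """Assemble the same snakemake invocation as the shell cell above."""
    return [
        "snakemake",
        "--directory", f"{root_dir}/.test",
        "--snakefile", f"{root_dir}/workflow/Snakefile",
        "--configfile", f"{root_dir}/.test/config_basic/config.yaml",
        "--use-conda",
        "--show-failed-logs",
        "--cores", str(cores),
        "--conda-cleanup-pkgs", "cache",
    ]


def run_snakemake(root_dir="rna-seq-star-deseq2", cores=2):
    """Run the pipeline; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_snakemake_command(root_dir, cores), check=True)
```

Passing `check=True` makes a failed pipeline run raise an exception, so a script stops before it tries to register outputs that were never produced.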

## Register outputs
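
The files registered in this section only exist if the pipeline finished successfully. A minimal sketch of a check that could be run first is shown here; `missing_outputs` and `EXPECTED_OUTPUTS` are hypothetical names, and the relative paths are simply the ones used in the cells of this section.

```python
from pathlib import Path

# Relative paths of the outputs registered in this section.
EXPECTED_OUTPUTS = [
    "results/qc/multiqc_report.html",
    "results/counts/all_symbol.tsv",
]


def missing_outputs(root_dir, relative_paths=EXPECTED_OUTPUTS):
    """Return the expected output paths that do not exist under root_dir."""
    root = Path(root_dir)
    return [rel for rel in relative_paths if not (root / rel).exists()]


# Example call, with root_dir as defined earlier in this notebook:
# missing = missing_outputs("rna-seq-star-deseq2")
```

An empty list means every expected file was found; otherwise the list names the files to look for in the Snakemake logs.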

### QC

In [None]:
multiqc_file = ln.File(f"{root_dir}/results/qc/multiqc_report.html")
multiqc_file.save()

:::{dropdown} How would I register all QC files?

```python
multiqc_results = ln.File.from_dir(f"{root_dir}/results/qc/multiqc_report_data/")
ln.save(multiqc_results)
```

:::

### Count matrix

In [None]:
count_matrix = ln.File(f"{root_dir}/results/counts/all_symbol.tsv")
count_matrix.save()

## Track Snakemake ID TBD

## Link biological entities

To make the count matrix queryable by biological entities (genes, experimental metadata, etc.), we can now proceed with: {doc}`docs:bulkrna`

## Visualize

View data lineage:

In [None]:
count_matrix.view_flow()

View the database content:

In [None]:
ln.view()

Clean up the test instance:

In [None]:
!lamin delete --force snakemake-bulkrna