# Snakemake

```{note}
This notebook demonstrates Python code that you could run before and after a Snakemake run.
Typically, you would run workflows from the command line or on a cloud platform.
```

[Snakemake](https://snakemake.readthedocs.io/en/stable/) is a workflow manager for executing scientific workflows across platforms scalably, portably, and reproducibly.

This guide shows how to register a Snakemake run, including its inputs and outputs, from a Python script, using the [snakemake-workflows/rna-seq-star-deseq2](https://github.com/snakemake-workflows/rna-seq-star-deseq2) pipeline as an example.

This approach could be automated by deploying the script behind a serverless trigger (e.g., AWS Lambda).
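
As a rough illustration of the serverless idea, the sketch below shows what an entry point in the style of an AWS Lambda handler might look like, assuming the trigger is an S3 event notification. The function `register_run`, the bucket name, and the object key are hypothetical placeholders and not part of LaminDB or of this guide's workflow; in a real deployment, `register_run` would contain the registration steps shown in the rest of this guide.

```python
from urllib.parse import unquote_plus


def register_run(bucket: str, key: str) -> dict:
    """Hypothetical placeholder for the registration steps shown in this guide."""
    # A real implementation would call LaminDB here; this stub only echoes its input.
    return {"bucket": bucket, "key": key, "status": "registered"}


def handler(event: dict, context: object) -> list[dict]:
    """Entry point in the style of an AWS Lambda handler for S3 event notifications."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(register_run(bucket, key))
    return results


# Example invocation with a hand-written event (bucket and key are made up).
example_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "results/counts/all.symbol.tsv"},
            }
        }
    ]
}
lambda_results = handler(example_event, None)
```

Keeping the handler thin and the registration logic in its own function means the same code can be called from a notebook, a script, or a trigger.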

In [None]:
!lamin init --storage . --name snakemake-bulkrna

In [None]:
import lamindb as ln

## Download test data

We clone the Snakemake pipeline with git to access the included test data.

In [None]:
!git clone https://github.com/snakemake-workflows/rna-seq-star-deseq2 --single-branch --branch v3.1.0

In [None]:
root_dir = "rna-seq-star-deseq2"

Track the download:

In [None]:
download = ln.Transform(key="Download")
ln.track(transform=download)

Register the input files; they are automatically linked to the download run:

In [None]:
sample_sheet = ln.Artifact(f"{root_dir}/.test/config_basic/samples.tsv").save()
input_fastqs = ln.Artifact.from_dir(f"{root_dir}/.test/ngs-test-data/reads/")
ln.save(input_fastqs)

Visualize data lineage for one of the files:

In [None]:
sample_sheet.view_lineage()

## Track Snakemake run

(We’d start here if input files were tracked in the cloud with LaminDB rather than downloaded through git.)

Track the Snakemake workflow & run:

In [None]:
transform = ln.Transform(
    name="snakemake-workflows/rna-seq-star-deseq2",
    version="2.0.0",
    type="pipeline",
    reference="https://github.com/laminlabs/snakemake-lamin-usecases",
)
ln.track(transform)
run = ln.context.run  # let's grab the global run record

If we now stage input files, they’ll be tracked as run inputs.

(In this test case, data is already locally available and staging won’t download anything.)

In [None]:
input_sample_sheet_path = sample_sheet.cache()
input_paths = [input_fastq.cache() for input_fastq in input_fastqs]

All data is now locally available, and we can run the Snakemake pipeline:

In [None]:
!snakemake \
    --directory rna-seq-star-deseq2/.test \
    --snakefile rna-seq-star-deseq2/workflow/Snakefile \
    --configfile rna-seq-star-deseq2/.test/config_basic/config.yaml \
    --use-conda \
    --show-failed-logs \
    --cores 2 \
    --conda-frontend conda \
    --conda-cleanup-pkgs cache
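
Outside a notebook, the same invocation could be launched from the Python script itself. The following is a minimal sketch using the standard-library `subprocess` module; the helper names `build_snakemake_command` and `run_snakemake` are made up for illustration, and the flags are copied from the shell command above.

```python
import subprocess


def build_snakemake_command(root_dir: str, cores: int = 2) -> list[str]:
    """Assemble the Snakemake invocation shown above as an argument list."""
    return [
        "snakemake",
        "--directory", f"{root_dir}/.test",
        "--snakefile", f"{root_dir}/workflow/Snakefile",
        "--configfile", f"{root_dir}/.test/config_basic/config.yaml",
        "--use-conda",
        "--show-failed-logs",
        "--cores", str(cores),
        "--conda-frontend", "conda",
        "--conda-cleanup-pkgs", "cache",
    ]


def run_snakemake(root_dir: str, cores: int = 2) -> None:
    """Run the pipeline; raises CalledProcessError if Snakemake exits non-zero."""
    subprocess.run(build_snakemake_command(root_dir, cores), check=True)


# Building the command has no side effects; call run_snakemake(...) to execute it.
command = build_snakemake_command("rna-seq-star-deseq2")
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` stops the script before any outputs are registered if the pipeline fails.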

## Register outputs

### Quality control

In [None]:
multiqc_file = ln.Artifact(f"{root_dir}/.test/results/qc/multiqc_report.html").save()

:::{dropdown} How would I register all QC files?

```python
multiqc_results = ln.Artifact.from_dir(f"{root_dir}/.test/results/qc/multiqc_report_data/")
ln.save(multiqc_results)
```

:::

### Count matrix

In [None]:
count_matrix = ln.Artifact(f"{root_dir}/.test/results/counts/all.symbol.tsv")
count_matrix.save()
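
When this registration runs unattended, it can help to confirm that the pipeline actually produced the files before passing them to `ln.Artifact`. The sketch below is one way to do that with the standard library only; `EXPECTED_OUTPUTS` and `missing_outputs` are hypothetical names, and the two paths are the ones registered above.

```python
from pathlib import Path

# Output paths taken from the cells above, relative to the cloned repository.
EXPECTED_OUTPUTS = (
    ".test/results/qc/multiqc_report.html",
    ".test/results/counts/all.symbol.tsv",
)


def missing_outputs(root_dir: str, relative_paths=EXPECTED_OUTPUTS) -> list[str]:
    """Return the expected outputs that do not exist as files under root_dir."""
    return [rel for rel in relative_paths if not (Path(root_dir) / rel).is_file()]
```

An empty list from `missing_outputs("rna-seq-star-deseq2")` would indicate that both files are present; a non-empty list names the outputs to investigate before registering anything.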

## Track Snakemake ID

Snakemake does not have an easily accessible ID that is associated with a run.
Therefore, we need to extract it from the log files.
We're planning to simplify this process in the future.

In [None]:
import pathlib
from datetime import datetime

PATH_TO_DOT_SNAKEMAKE_LOG = "rna-seq-star-deseq2/.test/.snakemake/log"
# Collect the log file names; Path.name works regardless of the path separator
log_file_names = [
    log_file.name
    for log_file in pathlib.Path(PATH_TO_DOT_SNAKEMAKE_LOG).glob("*.snakemake.log")
]

# The first dot-separated segment of each file name is a timestamp
timestamps = [
    datetime.strptime(filename.split(".")[0], "%Y-%m-%dT%H%M%S")
    for filename in log_file_names
]
# Use the second segment of the most recent log file name as the ID
snakemake_id = log_file_names[timestamps.index(max(timestamps))].split(".")[1]
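
To make the parsing concrete, here is a small self-contained sketch that applies the same logic to a list of file names instead of a directory. The helper `latest_snakemake_id` and the example file names are hypothetical; the names are assumed to follow the pattern the cell above relies on, with a timestamp as the first dot-separated segment and the ID as the second.

```python
from datetime import datetime


def latest_snakemake_id(log_file_names: list[str]) -> str:
    """Return the ID segment of the most recent log, judged by its timestamp prefix."""

    def timestamp(name: str) -> datetime:
        return datetime.strptime(name.split(".")[0], "%Y-%m-%dT%H%M%S")

    return max(log_file_names, key=timestamp).split(".")[1]


# Hypothetical file names following the assumed pattern.
example_names = [
    "2023-09-01T101530.123456.snakemake.log",
    "2023-09-02T081500.654321.snakemake.log",
]
example_id = latest_snakemake_id(example_names)  # "654321", from the later log
```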

Add the Snakemake ID to the run record:

In [None]:
run.reference = snakemake_id
run.reference_type = "snakemake_id"
run.save()

## Link biological entities

To make the count matrix queryable by biological entities (genes, experimental metadata, etc.), we can now proceed with: {doc}`docs:bulkrna`

## Visualize

View data lineage:

In [None]:
count_matrix.view_lineage()

Clean up the test instance:

In [None]:
!lamin delete --force snakemake-bulkrna