# Basics: An example workflow

- https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

## Environment Setting

In [3]:
!conda env list

# conda environments:
#
base                     /home/jingwora/mambaforge
snakemake-tutorial    *  /home/jingwora/mambaforge/envs/snakemake-tutorial



In [2]:
!python --version

Python 3.8.5


## File preparation

In [7]:
# Download the tutorial data archive
!curl -L https://github.com/snakemake/snakemake-tutorial-data/archive/v5.24.1.tar.gz -o snakemake-tutorial-data.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 7085k    0 7085k    0     0  2517k      0 --:--:--  0:00:02 --:--:-- 2948k


In [8]:
# Extract only the data directory and environment.yaml from the archive
!tar --wildcards -xf snakemake-tutorial-data.tar.gz --strip 1 "*/data" "*/environment.yaml"

## Overview

A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (for example, the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.

### Background
- __DNA sequencing__ produces gigabytes of data from a single biological sample.
- For technical reasons, DNA sequencing cuts the DNA of a sample into millions of small pieces, called __reads__.
- To recover the genome of the sample, these reads have to be mapped against a known reference genome (for example, the human one obtained during the Human Genome Project). This task is called __read mapping__.
- By investigating the differences between the mapped reads and the reference sequence at a particular genome position, __variants__ can be detected.

The rule below maps the reads of one sample onto the reference genome:

- `bwa_map`: the rule name.
- `input`, `output`: directives followed by lists of files that the rule is expected to use or create.
- `shell`: directive containing the shell command to execute.
- The shell command invokes `bwa mem` with the reference genome and the reads, and pipes the output into `samtools`, which creates a compressed BAM file containing the alignments. The output of `samtools` is redirected with `>` into the output file defined by the rule.
- Snakemake applies the rules given in the Snakefile in a top-down way.
- `{ }`: wildcards. All output files of a rule have to contain exactly the same wildcards.

```
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}" 

```
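
To make the wildcard mechanics concrete, here is a rough Python sketch of how a requested target could be matched against the output pattern of `bwa_map` and turned into a concrete shell command. It is an illustration only, not Snakemake's actual implementation: the helper names `match_output` and `render_job` are hypothetical, and the sketch assumes each wildcard appears once in the pattern and may match any non-empty text.

```python
import re

def match_output(pattern, target):
    """Return the wildcard values if target matches pattern, else None."""
    # Split "mapped_reads/{sample}.bam" into literal text and wildcard names.
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(piece) if i % 2 == 0 else f"(?P<{piece}>.+)"
        for i, piece in enumerate(pieces)
    )
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

# Patterns copied from the bwa_map rule above.
INPUTS = ["data/genome.fa", "data/samples/{sample}.fastq"]
OUTPUT = "mapped_reads/{sample}.bam"
SHELL = "bwa mem {input} | samtools view -Sb - > {output}"

def render_job(target):
    """Build the concrete input files and shell command for one target."""
    wildcards = match_output(OUTPUT, target)
    if wildcards is None:
        raise ValueError(f"no wildcard assignment produces {target}")
    inputs = [pattern.format(**wildcards) for pattern in INPUTS]
    # Several input files are joined with spaces in the command.
    command = SHELL.format(input=" ".join(inputs), output=target)
    return wildcards, inputs, command

wildcards, inputs, command = render_job("mapped_reads/A.bam")
print(wildcards)  # {'sample': 'A'}
print(command)
```

Requesting `mapped_reads/A.bam` therefore binds `sample` to `A`, and the printed command reads both `data/genome.fa` and `data/samples/A.fastq`.
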
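### Execution

Dry-run for the target `mapped_reads/A.bam`, printing the shell commands that would be executed: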

```
snakemake -np mapped_reads/A.bam
```

Execute the workflow with the target `D1.sorted.txt` (a generic file name; no rule in this tutorial produces it):

`snakemake D1.sorted.txt`

Execute the workflow without a target; the first rule in the Snakefile defines the default target:

`snakemake`

Dry-run (show what would be done without executing anything):

`snakemake -n`

Dry-run, printing the shell commands:

`snakemake -n -p`

Dry-run, printing the reason each job would be executed:

`snakemake -n -r`

Execute the workflow on a single core:

`snakemake --cores 1 mapped_reads/A.bam`

Specify multiple targets (the second form relies on Bash brace expansion):
```
snakemake -np mapped_reads/A.bam mapped_reads/B.bam
snakemake -np mapped_reads/{A,B}.bam
```

Visualize the DAG of jobs (requires the `dot` command from Graphviz):

`snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg`

### Job execution
- The application of a rule to generate a set of output files is called a job.
- A job is executed if and only if at least one of the following holds:
  - its output file is a target and does not exist
  - its output file is needed by another executed job and does not exist
  - an input file is newer than the output file
  - the rule has been modified
  - an input file will be updated by another job
  - execution is enforced
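
As an illustration of some of these conditions, the following Python sketch checks whether a job would need to run based on file existence and modification times. It is a simplified assumption-laden model, not Snakemake's real scheduler: the names `needs_run` and `demo` are hypothetical, and the sketch covers only missing outputs, inputs newer than outputs, and enforced execution (not modified rules or pending upstream updates).

```python
import os
import tempfile

def needs_run(inputs, outputs, forced=False):
    """Simplified re-run check covering three of the listed conditions."""
    if forced:
        return True  # execution is enforced
    if any(not os.path.exists(path) for path in outputs):
        return True  # an output file does not exist
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    newest_input = max(os.path.getmtime(path) for path in inputs)
    return newest_input > oldest_output  # an input is newer than an output

def demo():
    """Exercise needs_run on throwaway files with explicit timestamps."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        reads = os.path.join(tmp, "A.fastq")
        bam = os.path.join(tmp, "A.bam")
        open(reads, "w").close()
        results.append(needs_run([reads], [bam]))  # output missing
        open(bam, "w").close()
        os.utime(reads, (100, 100))
        os.utime(bam, (200, 200))
        results.append(needs_run([reads], [bam]))  # output newer than input
        os.utime(reads, (300, 300))
        results.append(needs_run([reads], [bam]))  # input newer than output
        os.utime(reads, (100, 100))
        results.append(needs_run([reads], [bam], forced=True))  # enforced
    return results

print(demo())  # [True, False, True, True]
```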

### Steps

1. Mapping reads
2. Sorting read alignments 
3. Indexing read alignments and visualizing the DAG of jobs

In [10]:
%%writefile Snakefile
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"        

Overwriting Snakefile
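
The way the three rules chain together can be sketched in plain Python. The snippet below is only an approximation of how dependencies might be resolved by matching file names, not Snakemake's own algorithm: `RULES`, `match`, and `plan` are hypothetical names, the rule patterns are copied from the Snakefile above, and files that no rule can produce are assumed to be existing source files.

```python
import re

# Rule name -> (input patterns, output pattern), copied from the Snakefile.
RULES = {
    "bwa_map": (["data/genome.fa", "data/samples/{sample}.fastq"],
                "mapped_reads/{sample}.bam"),
    "samtools_sort": (["mapped_reads/{sample}.bam"],
                      "sorted_reads/{sample}.bam"),
    "samtools_index": (["sorted_reads/{sample}.bam"],
                       "sorted_reads/{sample}.bam.bai"),
}

def match(pattern, target):
    """Return wildcard values if target matches the output pattern, else None."""
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(pieces)
    )
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

def plan(target, jobs=None):
    """Collect (rule, output file) pairs in an order that satisfies dependencies."""
    jobs = [] if jobs is None else jobs
    for name, (inputs, output) in RULES.items():
        wildcards = match(output, target)
        if wildcards is None:
            continue
        # Plan the jobs that create this rule's inputs first.
        for pattern in inputs:
            plan(pattern.format(**wildcards), jobs)
        if (name, target) not in jobs:
            jobs.append((name, target))
        break  # the first matching rule wins in this sketch
    return jobs

for rule, output in plan("sorted_reads/A.bam.bai"):
    print(rule, "->", output)
```

For the target `sorted_reads/A.bam.bai`, this prints `bwa_map`, `samtools_sort`, and `samtools_index` in that order, which mirrors the chain drawn in the DAG.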


In [11]:
!snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

Building DAG of jobs...
