# Basics: An example workflow

- https://snakemake.readthedocs.io/en/stable/tutorial/basics.html

## Environment Setting

In [None]:
# # Installing Mambaforge
# !curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh -o Mambaforge-Linux-x86_64.sh
# !bash Mambaforge-Linux-x86_64.sh

In [None]:
# # Preparing a working directory
# !mkdir snakemake-tutorial
# # Use the %cd magic: a `!cd` runs in a subshell and does not change the notebook's directory
# %cd snakemake-tutorial

In [None]:

# !curl -L https://github.com/snakemake/snakemake-tutorial-data/archive/v5.24.1.tar.gz -o snakemake-tutorial-data.tar.gz
# !tar --wildcards -xf snakemake-tutorial-data.tar.gz --strip 1 "*/data" "*/environment.yaml"


In [None]:
# # Creating an environment with the required software
# # (the mamba install is only needed if mamba is not already in base; Mambaforge ships it)
# !conda install -n base -c conda-forge mamba
# !conda activate base
# !mamba env create --name snakemake-tutorial --file environment.yaml

In [None]:
# # Activating the environment
# !conda activate snakemake-tutorial

### Version

In [3]:
!conda env list

# conda environments:
#
base                     /home/jingwora/mambaforge
snakemake-tutorial    *  /home/jingwora/mambaforge/envs/snakemake-tutorial



In [2]:
!python --version

Python 3.8.5


In [42]:
!samtools --version

samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.


In [43]:
import pysam
pysam.__version__

'0.15.2'

In [44]:
!bwa


Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are

In [45]:
!bcftools --version

bcftools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


In [46]:
!snakemake --version

6.15.5


## Overview

A Snakemake workflow is defined by specifying rules in a Snakefile. Rules decompose the workflow into small steps (for example, the application of a single tool) by specifying how to create sets of output files from sets of input files. Snakemake automatically determines the dependencies between the rules by matching file names.

### Background
- __DNA sequencing__ produces gigabytes of data from a single biological sample.
- For technical reasons, DNA sequencing cuts the DNA of a sample into millions of small pieces, called __reads__.
- In order to recover the genome of the sample, one has to map these reads against a known reference genome (for example, the human one obtained during the famous human genome project). This task is called __read mapping__.
- By investigating the differences between the mapped reads and the reference sequence at a particular genome position, __variants__ can be detected. 

- `bwa_map`: the rule name.
- `input`, `output`: directives followed by lists of files that the rule is expected to use or create.
- `shell`: directive containing the shell command to execute.
- The shell command invokes `bwa mem` with the reference genome and reads, and pipes the output into `samtools`, which creates a compressed BAM file containing the alignments. The output of `samtools` is redirected into the output file defined by the rule with `>`.
- Snakemake applies the rules given in the Snakefile in a top-down way.
- `{ }`: wildcards. All output files of a rule have to contain exactly the same wildcards.

```
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}" 

```
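As a rough illustration of what happens when `mapped_reads/A.bam` is requested, the following Python sketch infers the `sample` wildcard from the target and fills in the shell command. It is not Snakemake's implementation; the function name `render_bwa_map` and the hand-written regular expression are assumptions made for this example.

```python
import re

# Patterns copied from the bwa_map rule above.
INPUTS = ["data/genome.fa", "data/samples/{sample}.fastq"]
OUTPUT = "mapped_reads/{sample}.bam"
SHELL = "bwa mem {input} | samtools view -Sb - > {output}"


def render_bwa_map(target):
    """Return the shell command for `target`, or None if the rule cannot produce it.

    Hypothetical helper: the regex is hand-written for this one output pattern.
    """
    match = re.fullmatch(r"mapped_reads/(?P<sample>.+)\.bam", target)
    if match is None:
        return None
    wildcards = match.groupdict()  # e.g. {"sample": "A"}
    inputs = [pattern.format(**wildcards) for pattern in INPUTS]
    output = OUTPUT.format(**wildcards)
    # Multiple input files are joined with spaces when substituted for {input}.
    return SHELL.format(input=" ".join(inputs), output=output)


print(render_bwa_map("mapped_reads/A.bam"))
# bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
```

The printed command should correspond to what a dry run with `-p` shows for the same target.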
Execution

```
snakemake -np mapped_reads/A.bam
```

execute the workflow with target D1.sorted.txt

`snakemake D1.sorted.txt`

execute the workflow without target: first rule defines target

`snakemake`

dry-run

`snakemake -n`

dry-run, print shell commands

`snakemake -n -p`

dry-run, print execution reason for each job

`snakemake -n -r`

executing a workflow

`$ snakemake --cores 1 mapped_reads/A.bam`

specify multiple targets
```
$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
$ snakemake -np mapped_reads/{A,B}.bam
```

visualization of the DAG

`snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg`
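The graph that `--dag` prints comes from working backwards from the requested targets and matching file names against rule outputs. A minimal sketch of that idea, assuming a simplified rule table and the hypothetical helpers `pattern_to_regex` and `plan` (neither is part of Snakemake, and the real resolver handles many more cases):

```python
import re

# Three rules from the tutorial workflow: (name, input patterns, output pattern).
RULES = [
    ("bwa_map", ["data/genome.fa", "data/samples/{sample}.fastq"], "mapped_reads/{sample}.bam"),
    ("samtools_sort", ["mapped_reads/{sample}.bam"], "sorted_reads/{sample}.bam"),
    ("samtools_index", ["sorted_reads/{sample}.bam"], "sorted_reads/{sample}.bam.bai"),
]


def pattern_to_regex(pattern):
    """Turn 'dir/{name}.ext' into a regex with one named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)  # alternates literal text and wildcard names
    regex = ""
    for i, part in enumerate(parts):
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(regex)


def plan(target, jobs=None):
    """Return (rule, sample) jobs needed for `target`, dependencies first."""
    jobs = [] if jobs is None else jobs
    for name, inputs, output in RULES:
        match = pattern_to_regex(output).fullmatch(target)
        if match is None:
            continue
        wildcards = match.groupdict()
        # Resolve every input of this rule before scheduling the rule itself.
        for pattern in inputs:
            plan(pattern.format(**wildcards), jobs)
        job = (name, wildcards.get("sample"))
        if job not in jobs:
            jobs.append(job)
        break  # the first matching rule wins in this sketch
    # Files that match no rule output are assumed to be existing source files.
    return jobs


print(plan("sorted_reads/A.bam.bai"))
# [('bwa_map', 'A'), ('samtools_sort', 'A'), ('samtools_index', 'A')]
```

The sketch ignores file existence and timestamps; it only shows how the chain of jobs follows from file-name matching.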

### Job execution
- The application of a rule to generate a set of output files is called a job.
- A job is executed if and only if at least one of the following holds:
  - an output file is a target and does not exist
  - an output file is needed by another executed job and does not exist
  - an input file is newer than the output file
  - the rule has been modified
  - an input file will be updated by another job
  - execution is enforced
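
Some of the conditions above can be sketched with file modification times. This is a simplified illustration, not Snakemake's actual logic: the function `needs_run` is hypothetical, and it covers only the missing-output, newer-input, and enforced-execution cases.

```python
import os


def needs_run(inputs, output, forced=False):
    """Decide whether a job should run (hypothetical helper, three conditions only)."""
    if forced:
        return True  # execution is enforced
    if not os.path.exists(output):
        return True  # output file does not exist
    output_mtime = os.path.getmtime(output)
    # Rerun if any input was modified after the output was written.
    return any(os.path.getmtime(path) > output_mtime for path in inputs)
```

For example, `needs_run(["mapped_reads/A.bam"], "sorted_reads/A.bam")` would return `True` when the sorted file is missing or older than the mapped one. Changes to the rule itself and updates propagated from upstream jobs are not modeled here.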

### Steps

1. Mapping reads
2. Sorting read alignments 
3. Indexing read alignments and visualizing the DAG of jobs
4. Calling genomic variants
5. Using custom scripts
6. Adding a target rule

In [39]:
%%writefile Snakefile

SAMPLES = ["A", "B"]


rule all:
    input:
        "plots/quals.svg"


rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"


rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"


rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"


rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"


rule plot_quals:
    input:
        "calls/all.vcf"
    output:
        "plots/quals.svg"
    script:
        "scripts/plot-quals.py"

Overwriting Snakefile
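
The `bcftools_call` rule relies on `expand` to turn one pattern into a list of files, one per sample. The sketch below mimics that behavior for the simple case of list-valued wildcards. `expand_pattern` is a hypothetical stand-in written for illustration, not Snakemake's own function.

```python
from itertools import product


def expand_pattern(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*wildcards.values())
    ]


SAMPLES = ["A", "B"]
print(expand_pattern("sorted_reads/{sample}.bam", sample=SAMPLES))
# ['sorted_reads/A.bam', 'sorted_reads/B.bam']
```

With more than one wildcard, the sketch produces the Cartesian product of the values.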


In [40]:
!mkdir -p scripts calls plots


In [31]:
%%writefile scripts/plot-quals.py

import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from pysam import VariantFile

quals = [record.qual for record in VariantFile(snakemake.input[0])]
plt.hist(quals)

plt.savefig(snakemake.output[0])

Overwriting scripts/plot-quals.py
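
The script reads one quality value per variant record. To make clear where that value lives in the file, here is a standard-library sketch that pulls the QUAL column (the sixth tab-separated field of a VCF data line) without `pysam`. The function `read_quals` and the example records are made up for illustration and are not taken from the tutorial data.

```python
def read_quals(lines):
    """Collect QUAL values from VCF text lines, skipping header lines."""
    quals = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # meta-information, column header, or blank line
        qual = line.rstrip("\n").split("\t")[5]
        # A '.' marks a missing value in VCF.
        quals.append(None if qual == "." else float(qual))
    return quals


# Illustrative records only (hypothetical values).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t42.5\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\n",
]
print(read_quals(example))
# [42.5, None]
```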


### Create the DAG

In [35]:
!snakemake --dag sorted_reads/{A,B}.bam.bai | dot -Tsvg > dag.svg

Building DAG of jobs...


In [36]:
!snakemake --dag all | dot -Tsvg > dag_all.svg

Building DAG of jobs...


### Execute

In [48]:
!snakemake -n

Building DAG of jobs...
Traceback (most recent call last):
  File "/home/jingwora/mambaforge/envs/snakemake-tutorial/lib/python3.8/site-packages/snakemake/__init__.py", line 701, in snakemake
    success = workflow.execute(
  File "/home/jingwora/mambaforge/envs/snakemake-tutorial/lib/python3.8/site-packages/snakemake/workflow.py", line 1066, in execute
    logger.run_info("\n".join(dag.stats()))
  File "/home/jingwora/mambaforge/envs/snakemake-tutorial/lib/python3.8/site-packages/snakemake/dag.py", line 2191, in stats
    yield tabulate(rows, headers="keys")
  File "/home/jingwora/mambaforge/envs/snakemake-tutorial/lib/python3.8/site-packages/tabulate/__init__.py", line 2048, in tabulate
    list_of_lists, headers = _normalize_tabular_data(
  File "/home/jingwora/mambaforge/envs/snakemake-tutorial/lib/python3.8/site-packages/tabulate/__init__.py", line 1471, in _normalize_tabular_data
    rows = list(map(lambda r: r if _is_separating_line(r) else list(r), rows))
  File "/home