# Snakemake

Before starting work on a pipeline, make sure that all the required tools and software are installed within a single environment. Here we will write a small pipeline in Snakemake.

https://anaconda.org/bioconda/snakemake
https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html#

Notes on the direction of the lecture:

1. First, we build a hands-on implementation in Snakemake.
2. Then we can look at different kinds of Python implementations, for example my thesis work.

### Setting up the conda environment

conda create -n smk

conda activate smk

conda install bioconda::snakemake

conda install -n smk ipykernel

conda install -n smk nb_conda_kernels

conda install -n smk anaconda::jupyter
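
Once the environment is active, it can be worth confirming that the command-line tools the pipeline relies on are actually on the PATH. Below is a minimal Python sketch of such a check. The helper `missing_tools` is hypothetical (it is not part of Snakemake or conda), and the tool names are only examples.

```python
import shutil

def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Example tool names (assumption): adjust the list to your own pipeline.
required = ["snakemake", "dot"]
not_found = missing_tools(required)
if not_found:
    print("Missing from this environment:", ", ".join(not_found))
else:
    print("All required tools are available.")
```

Run it from a notebook cell that uses the smk kernel, so that the PATH being checked is the one of the environment created above.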

https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html

Let's have a look at what we are trying to achieve here.

In [5]:
!pwd

/home/mrinalmanu/Documents/notebooks_python/snakemake_zero


In [10]:
!ls

config	config.json  input  notebook_8_snakemake.ipynb	snakefile.txt


In [12]:
!mkdir input

mkdir: cannot create directory ‘input’: File exists


In [13]:
!touch input/instructions.txt

In [15]:
!ls input

instructions.txt


In [16]:
!head input/instructions.txt

In [20]:
!echo "Some text here." > input/instructions.txt

In [21]:
!head input/instructions.txt

Some text here.


What if we write to a file that does not exist yet? The > redirection simply creates it.

In [22]:
!echo "Some more text here." > input/instructions2.txt

In [23]:
!head input/instructions2.txt

Some more text here.
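
The two shell redirections used in these cells differ only when the file already exists: `>` truncates it, `>>` appends to it, and both create the file if it is missing. The same behaviour can be sketched in Python with the file modes "w" and "a". The helper names below are illustrative only, and a temporary directory is used so that nothing in the notebook folder is touched.

```python
import tempfile
from pathlib import Path

def write_overwrite(path, text):
    """Rough equivalent of `echo text > path`: create or truncate, then write."""
    with open(path, "w") as handle:
        handle.write(text + "\n")

def write_append(path, text):
    """Rough equivalent of `echo text >> path`: create if missing, then append."""
    with open(path, "a") as handle:
        handle.write(text + "\n")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "instructions2.txt"
    write_overwrite(target, "Some more text here.")  # file did not exist: it is created
    write_append(target, "A second line.")           # existing content is kept
    after_append = target.read_text()
    write_overwrite(target, "Fresh start.")          # existing content is discarded
    after_overwrite = target.read_text()

print(after_append)
print(after_overwrite)
```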


We can look a bit at logging in python as well.

In [29]:
!ls input
!mkdir output

instructions.txt  instructions2.txt


In [30]:
!cat input/*.txt >> output/combined_instructions.txt

In [31]:
!head output/combined_instructions.txt

Some text here.
Some more text here.
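
For comparison, the same concatenation step can be sketched in plain Python. This assumes the files should be joined in name order, which matches the output above (the shell's own glob order depends on the locale). The function name `concat_text_files` is made up for this example, and the demo runs in a temporary directory.

```python
import tempfile
from pathlib import Path

def concat_text_files(input_dir, output_file):
    """Append every *.txt file in input_dir (sorted by name) to output_file."""
    output_file = Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, "a") as out:  # "a" mirrors the shell's >>
        for path in sorted(Path(input_dir).glob("*.txt")):
            out.write(path.read_text())
    return output_file

with tempfile.TemporaryDirectory() as tmp:
    input_dir = Path(tmp) / "input"
    input_dir.mkdir()
    (input_dir / "instructions.txt").write_text("Some text here.\n")
    (input_dir / "instructions2.txt").write_text("Some more text here.\n")
    combined = concat_text_files(input_dir, Path(tmp) / "output" / "combined_instructions.txt")
    combined_text = combined.read_text()

print(combined_text)
```

Because the output is opened in append mode, calling the function twice duplicates the content, exactly as running the cat cell twice would.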


Let's implement a rule in Snakemake for the same task.

You can also set a minimum Snakemake version requirement:

In [32]:
from snakemake.utils import min_version
min_version("2.3.0")

First we describe what the final output will be.

In [33]:
name = ['{}.txt'.format(i) for i in range(10)]

In [34]:
name

['0.txt',
 '1.txt',
 '2.txt',
 '3.txt',
 '4.txt',
 '5.txt',
 '6.txt',
 '7.txt',
 '8.txt',
 '9.txt']

We need to redirect output to each of these files.


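rule append_text:
    """
    Append text to a file.
    """
    output:
        "input/{name}"
    shell:
        # inside shell, the wildcard is available as {wildcards.name}
        # and the target file as {output}
        """
        echo "Appended {wildcards.name}" >> {output}
        """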

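When a file such as input/3.txt is requested, Snakemake compares it with the output pattern of each rule and fills in the wildcards from the parts that match. The sketch below imitates that step with a regular expression. It is a simplified illustration, not Snakemake's actual implementation: it assumes each wildcard appears only once in the pattern and matches one or more characters, and the function name `match_wildcards` is made up.

```python
import re

def match_wildcards(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None.

    Simplified: every {name} field is turned into a named group matching
    one or more characters, and each wildcard may appear only once.
    """
    regex = ""
    position = 0
    for field in re.finditer(r"\{(\w+)\}", pattern):
        regex += re.escape(pattern[position:field.start()])  # literal text
        regex += "(?P<{}>.+)".format(field.group(1))          # the wildcard
        position = field.end()
    regex += re.escape(pattern[position:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None

print(match_wildcards("input/{name}", "input/3.txt"))   # {'name': '3.txt'}
print(match_wildcards("input/{name}", "output/3.txt"))  # None
```

The returned dictionary plays the role of the wildcards object: its values are what a rule can refer to in its shell command.
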
Now we can write a rule for combining these files.

rule concat_files:
    """
    Concatenate the input files into one file.
    """
    input:
        expand("input/{name}", name=name)
    output:
        "output/combined_instructions.txt"
    shell:
        """
        cat {input} >> {output}
        """

Finally, we can generate a rulegraph to check how the rules connect.

rule generate_rulegraph:
    """
    Generate a rulegraph for the workflow.
    """
    output:
        "results/rulegraph.png"
    shell:
        """
        snakemake --snakefile snakefile.smk --config max_reads=0 --rulegraph | dot -Tpng > {output}
        """

Now we can look at how to launch this pipeline.

In [42]:
!ls

config	config.json  input  notebook_8_snakemake.ipynb	output	snakefile.txt


In [44]:
!head -300 config.json

{"samples": ["1","2","3"]}

In [45]:
!head -300 snakefile.txt

from snakemake.utils import min_version
min_version("2.3.0") 
# actually, you should use a version higher than 5.11, which includes lint
# linting helps a lot with debugging

configfile: "config.json"

# Snakemake checks whether an output already exists to avoid re-running
# finished steps; the "all" rule names the final targets to build

SAMPLES = config['samples']
        
rule all:
    """
    Collect the main inputs and outputs of the workflow.
    """
    input:
        "combined_instructions.txt"

    
rule touch_files:
    """
    Create a file.
    """
    output:
        "{sample}.txt"
    shell:
        """
        touch {output} && printf "Hello! {output} " >> {output}
        """

rule concat_files:
    input:
        files=expand("{sample}.txt", sample=SAMPLES),
    output:
        "combined_instructions.txt",
    params:
        cmd="cat",
    shell:
        """
        {params.cmd} {input.files} >> {output}
        """

rule generate_rulegra
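
The Snakefile above loads config.json through configfile: and builds the input list of concat_files with expand. The sketch below mimics both steps in plain Python to show which file names result. `expand_pattern` is a hypothetical stand-in written for this example and covers only the basic behaviour (every combination of the given values); it is not Snakemake's own expand.

```python
import json
from itertools import product

def expand_pattern(pattern, **wildcards):
    """Fill `pattern` once for every combination of the given wildcard values."""
    keys = list(wildcards)
    combos = product(*(wildcards[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]

# Same content as config.json; Snakemake exposes the loaded file
# as a dictionary called `config`.
config = json.loads('{"samples": ["1","2","3"]}')
SAMPLES = config["samples"]

expected_files = expand_pattern("{sample}.txt", sample=SAMPLES)
print(expected_files)  # ['1.txt', '2.txt', '3.txt']
```

These three names are the inputs that concat_files asks for, which is why touch_files ends up running once per sample.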

To run Snakemake with a specific Snakefile, pass it with the -s or --snakefile command-line argument.

In [47]:
!snakemake -h

usage: snakemake [-h] [--snakefile FILE] [--gui [PORT]] [--cores [N]]
                 [--local-cores N] [--resources [NAME=INT [NAME=INT ...]]]
                 [--config [KEY=VALUE [KEY=VALUE ...]]] [--configfile FILE]
                 [--list] [--list-target-rules] [--directory DIR] [--dryrun]
                 [--printshellcmds] [--debug-dag] [--dag]
                 [--force-use-threads] [--rulegraph] [--d3dag] [--summary]
                 [--detailed-summary] [--archive FILE] [--touch]
                 [--keep-going] [--force] [--forceall]
                 [--forcerun [TARGET [TARGET ...]]]
                 [--prioritize TARGET [TARGET ...]]
                 [--until TARGET [TARGET ...]]
                 [--omit-from TARGET [TARGET ...]] [--allow-ambiguity]
                 [--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
                 [--drmaa-log-dir DIR] [--cluster-config FILE]
                 [--immediate-submit] [--jobscript SCRIPT] [--jobname NAME]
  

In [48]:
!snakemake -s snakefile.txt

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1	concat_files
	3	touch_files
	5

rule touch_files:
    output: 3.txt
    jobid: 3
    wildcards: sample=3

Finished job 3.
1 of 5 steps (20%) done

rule touch_files:
    output: 2.txt
    jobid: 4
    wildcards: sample=2

Finished job 4.
2 of 5 steps (40%) done

rule touch_files:
    output: 1.txt
    jobid: 2
    wildcards: sample=1

Finished job 2.
3 of 5 steps (60%) done

rule concat_files:
    input: 1.txt, 2.txt, 3.txt
    output: combined_instructions.txt
    jobid: 1

Finished job 1.
4 of 5 steps (80%) done

localrule all:
    input: combined_instructions.txt
    jobid: 0

Finished job 0.
5 of 5 steps (100%) done

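The job counts in the log (1 all, 1 concat_files, 3 touch_files) follow from working backwards from the final target: each target is produced by one rule, and the inputs of that rule become new targets. The toy planner below reproduces the count for this workflow. It is a sketch with the three rules hard-coded. In particular, it checks the literal output name before the wildcard pattern, which is a simplification and not a description of Snakemake's resolution logic.

```python
from collections import Counter

SAMPLES = ["1", "2", "3"]

def producer(target):
    """Return (rule name, input files) for a target in this toy workflow."""
    if target == "combined_instructions.txt":  # literal output of concat_files
        return "concat_files", ["{}.txt".format(s) for s in SAMPLES]
    if target.endswith(".txt"):                # "{sample}.txt" of touch_files
        return "touch_files", []
    raise ValueError("no rule produces {}".format(target))

def plan(target, jobs):
    """Depth-first walk: schedule the inputs of a target before the target."""
    rule, inputs = producer(target)
    for needed in inputs:
        plan(needed, jobs)
    jobs.append((rule, target))
    return jobs

jobs = plan("combined_instructions.txt", [])
jobs.append(("all", None))  # rule all only collects inputs; it has no output

counts = Counter(rule for rule, _ in jobs)
for rule in sorted(counts):
    print(counts[rule], rule)
print(len(jobs))
```

The real run executes the three touch_files jobs in a different order, which is fine: they do not depend on each other, only concat_files has to wait for all of them.
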

We are now ready to do some advanced pipelining.

Example workflow: https://snakemake.readthedocs.io/en/stable/tutorial/basics.html#step-1-mapping-reads

Implementation of GATK-4 pipeline for exome analyses.
https://github.com/mrinalmanu/gatk4_exome_scripts

# Assignment

Put 1.txt, 2.txt, and 3.txt in a folder called **input_directory**, and create an output directory called **output_directory**. The Snakemake pipeline should read its input from **input_directory** and write its output to **output_directory**.

While appending the text, put your name after "Hello!". You can also name one of the files after yourself.

Hint: You can add variables called **input_dir** and **output_dir** to set the paths, either in the config or in the Snakefile.
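
As a starting point for the hint, here is one way the two directory settings could be turned into file paths in plain Python. This is only a sketch of the path handling, not a finished Snakefile. The dictionary stands in for a config that you would extend yourself, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical config: "samples" is as in config.json, while input_dir and
# output_dir are the new keys suggested by the hint.
config = {
    "samples": ["1", "2", "3"],
    "input_dir": "input_directory",
    "output_dir": "output_directory",
}

def sample_inputs(config):
    """Paths of the per-sample files inside the input directory."""
    base = PurePosixPath(config["input_dir"])
    return [str(base / "{}.txt".format(sample)) for sample in config["samples"]]

def combined_output(config):
    """Path of the combined file inside the output directory."""
    return str(PurePosixPath(config["output_dir"]) / "combined_instructions.txt")

print(sample_inputs(config))
print(combined_output(config))  # output_directory/combined_instructions.txt
```

The same idea carries over to the Snakefile: build the input and output patterns from the config values instead of writing the directory names into every rule.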
