# Setup

First things first, let's check our working directory.

In [None]:
pwd  # some bash commands will work without additional characters in Jupyter notebooks if automagic is on

To build our pipeline, we'll need to make sure we have access to some dependencies first.  You have a few options here:
1. Use the JupyterLab binder button on the [GitHub repo](https://github.com/marskar/snakemake/tree/master).  All of the dependencies are included in the environment.yml file used to build the binder container.
2. Install dependencies locally:
    - Clone the repo (`git clone https://github.com/marskar/snakemake.git`)
    - Use conda to install the dependencies locally (`conda install -c bioconda <tool=version>`)
3. Create a local environment:
    - Clone the repo (`git clone https://github.com/marskar/snakemake.git`)
    - Use the environment file to build a conda environment locally (`conda env create -f environment.yml`)
    - Be sure to start up `jupyter notebook` from within this environment!

Let's check the versions of the dependencies to ensure they're there and accessible.

In [None]:
# if using option 3 above, check your environment:
!conda env list

In [None]:
# check that we have the required dependencies
!conda list | grep -Ei "snakemake|samtools|picard|bwa|gatk|python"
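
The same check can be done from Python.  This is a minimal sketch that assumes each tool is installed under an executable of the same name as in the grep pattern above (the list `REQUIRED_TOOLS` and the helper `find_tools` are illustrative names, not part of the repo).  It only reports whether an executable is on the `PATH`; it does not check versions.

```python
import shutil

# Tool names taken from the grep pattern above; adjust if your executables differ.
REQUIRED_TOOLS = ["snakemake", "samtools", "picard", "bwa", "gatk"]


def find_tools(tools):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


tool_paths = find_tools(REQUIRED_TOOLS)
for tool, path in tool_paths.items():
    print(f"{tool:10s} {path if path else 'NOT FOUND'}")
```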

# Input data

We're going to start with FASTQ files from real human samples, but instead of whole genome data, we're focusing on chromosome 5, position 100000000-101500000.  Our reference genome file also only includes a portion of chr5 (5:99900000-104100000).  This is to keep file sizes small and run times short!  Now, let's make sure we can see the raw files we intend to work on.

In [None]:
ls -lh raw_data/

In [None]:
# what does a fastq look like?
!head -n8 raw_data/Patient_A.r1.fastq  # to run a single line of bash, prepend with "!"
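
Each FASTQ record normally spans four lines: a header starting with `@`, the sequence, a separator starting with `+`, and one quality character per base.  That is why `head -n8` shows two reads.  The sketch below parses that layout; it assumes unwrapped records (exactly four lines each), and the two reads in `example` are made up for illustration rather than taken from the course data.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


# Made-up two-record example (not taken from the course data).
example = """@read1
ACGTACGT
+
IIIIIIII
@read2
TTGCA
+
FFFFF
""".splitlines()

records = list(read_fastq(example))
for read_id, seq, qual in records:
    print(read_id, len(seq), len(qual))
```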

In [None]:
ls -lh ref_genome/

In [None]:
# what does a reference genome look like?
!head -n4 ref_genome/chr5_ref.fasta

# Plan

__Goal:__ assemble a working DNA-seq pipeline!

- Align sequencing data to a reference genome
- Call variants in the aligned data

# First rule: indexing

Our first rule is going to take our reference genome file, and index it so that the alignment tool can read it.  We can write the rule in any text editor, but for this class, we'll write it here in the notebook and save it to a file.
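
Before writing the rule, it helps to know which files the three indexing commands should leave behind.  The sketch below just builds those file names in plain Python, assuming the reference path used later in this notebook; the variable names are illustrative.

```python
import os

ref = "ref_genome/chr5_ref.fasta"  # reference path used later in this notebook

# bwa index appends its extensions to the full file name
bwa_indices = [ref + ext for ext in (".amb", ".ann", ".bwt", ".pac", ".sa")]
# samtools faidx appends .fai to the full file name
faidx_index = ref + ".fai"
# the sequence dictionary replaces the .fasta extension with .dict
seq_dict = os.path.splitext(ref)[0] + ".dict"

for path in bwa_indices + [faidx_index, seq_dict]:
    print(path)
```

These seven paths are what the `output` section of an indexing rule would need to list.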

In [None]:
%%writefile snakefile_test1

'''
Commands to run:
bwa index <ref> --> indices with the endings .amb, .ann, .bwt, .pac, .sa
samtools faidx <ref> --> index with ending .fai
picard CreateSequenceDictionary REFERENCE=<ref> OUTPUT=<file> --> index that removes the original ending and appends .dict
''' 




Let's test our first rule!  You can run this rule from the command line or here in the notebook using cell magic.  For our first test, we're going to try a __dry-run__ by using the `-n` flag.  If your snakefile is called something other than `Snakefile`, use `-s <filename>`.

In [None]:
!snakemake -ns snakefile_test1

Looks great!  Note that the dry-run does not actually execute any jobs; it shows the execution plan.  

Now let's try running our pipeline for real.  Add the `-p` flag to print the shell command that's run for each job.

In [None]:
!snakemake -ps snakefile_test1

In [None]:
ls -lh ref_genome

Hooray!!  Our first rule worked.  Note that Snakemake stdout provides a beautiful log of steps run and errors encountered.

What happens if we try to run it again?

In [None]:
!snakemake -ps snakefile_test1

# Second rule: alignment

Our next rule is going to take the short reads in our FASTQ files, and align them to a reference sequence using a tool called [__bwa__](https://github.com/lh3/bwa).  

In [None]:
'''
Command to run:
bwa mem -R "@RG\tID:<readgroup>\tSM:<sample>\tPL:<seq_platform>" <reference> <r1 fastq> <r2 fastq> | samtools sort -o <output file>

bwa requires the indices with endings .amb, .ann, .bwt, .pac, and .sa be present.

First, try it by hard-coding a sample name.
'''



Note that not all of the listed input files appear in the command itself, but they are still required for the command to run successfully (i.e. bwa will give an error if the index files are not there).  Snakemake recommends that you include all file dependencies in the input section, even if they're not used in the command invocation.

This is a good example of the use of `params` in a rule.  Here, they're used to define the metadata for the bam file.
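
To make the metadata concrete, here is a small sketch of how the `-R` read group string is assembled from those values.  The helper `read_group` is a hypothetical name, and the `_rg` suffix simply mirrors the rule shown later in this notebook.  Note that the fields are joined with a backslash followed by `t`, not with a tab character, because bwa expects to receive the two-character escape.

```python
def read_group(sample, platform="ILLUMINA"):
    """Build a bwa -R argument with escaped tabs (backslash + 't', not a tab)."""
    fields = ["@RG", f"ID:{sample}_rg", f"SM:{sample}", f"PL:{platform}"]
    return "\\t".join(fields)


rg = read_group("PatientA")
print(rg)
```

In a Snakefile this is why the shell string is written with `\\t`: Python turns `\\t` into the two characters bwa wants, whereas a plain `\t` would become a literal tab.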

Hold up.  We don't want to hard-code our sample files into a pipeline, or else we have to change the code for every sample! How do we handle this?

![xkcd](https://imgs.xkcd.com/comics/the_general_problem.png)

In [None]:
'''
Now, try it again using a list of samples.
'''



What if we use a list, as above?  This would run an alignment on each fastq individually, which would be fine if we had single-end reads.  But we have paired-end reads, which means each DNA fragment was sequenced from both ends, and you need to align two related fastq files per sample.

In [None]:
'''
Now, try it using a dict.
'''



What if we use a dict?  Much better!  Now our `rule align_fastqs` is generalizable.  If you were using this pipeline in real life, you'd probably require the user to provide a sample file where each line has the sample name, fastq1, and fastq2, and you'd read that into a dict (rather than explicitly defining a dict like we did).
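
Reading such a sample file could look roughly like this.  The sketch assumes a tab-separated file with exactly three columns per line; the helper name `load_samples` and the in-memory `sheet` are illustrative stand-ins for a real file on disk.

```python
import csv
import io


def load_samples(handle):
    """Read tab-separated lines of sample, fastq1, fastq2 into a dict."""
    samples = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        name, fastq1, fastq2 = row
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = [fastq1, fastq2]
    return samples


# In practice this would be open('samples.tsv'); the file name is hypothetical.
sheet = io.StringIO(
    "PatientA\tPatient_A.r1.fastq\tPatient_A.r2.fastq\n"
    "PatientB\tPatient_B.r1.fastq\tPatient_B.r2.fastq\n"
)
sampleDict = load_samples(sheet)
print(sampleDict)
```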

Note that input (or params) can be the return value of a function, as in this example.
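
An input function is an ordinary Python function that receives the wildcards of the job and returns one or more file paths.  The sketch below exercises one outside of Snakemake: `get_fastqs` is a hypothetical variant that returns both reads at once, and `SimpleNamespace` stands in for the wildcards object Snakemake would pass.

```python
from types import SimpleNamespace

rawDataPath = "raw_data/"
sampleDict = {
    "PatientA": ["Patient_A.r1.fastq", "Patient_A.r2.fastq"],
    "PatientB": ["Patient_B.r1.fastq", "Patient_B.r2.fastq"],
}


def get_fastqs(wildcards):
    """Return both FASTQ paths for the sample named in the wildcards."""
    return [rawDataPath + name for name in sampleDict[wildcards.sample]]


# SimpleNamespace stands in for the wildcards object Snakemake would pass.
fake_wildcards = SimpleNamespace(sample="PatientB")
fastqs = get_fastqs(fake_wildcards)
print(fastqs)
```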

Let's put the two rules together, and then try running them.

In [None]:
%%writefile snakefile_test2

import os

ref = 'ref_genome/chr5_ref.fasta'
refNoExt = os.path.splitext(ref)[0]

rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2


rule index_ref:
    input:
        ref
    output:
        o1 = ref + '.amb',
        o2 = ref + '.ann',
        o3 = ref + '.bwt',
        o4 = ref + '.pac',
        o5 = ref + '.sa',
        o6 = ref + '.fai',
        o7 = refNoExt + '.dict'
    shell:
        'bwa index {input};'
        'samtools faidx {input};'
        'picard CreateSequenceDictionary REFERENCE={input} OUTPUT={output.o7}'

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    params:
        pl = 'ILLUMINA',
        rg = '{sample}_rg',
        sm = '{sample}'
    shell:
        'bwa mem -R "@RG\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}" {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'

In [None]:
!snakemake -ps snakefile_test2

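Oh no!  What went wrong?  We haven't given snakemake a target file.  Without an explicit target, Snakemake uses the first rule in the file as its target, and nothing requests the `{sample}` outputs of `align_fastqs`, so no alignments are scheduled.  Let's add a `rule all`.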

In [None]:
%%writefile snakefile_test3

import os

ref = 'ref_genome/chr5_ref.fasta'
refNoExt = os.path.splitext(ref)[0]

rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2


rule all:
    '''
    What goes here?
    '''

rule index_ref:
    input:
        ref
    output:
        o1 = ref + '.amb',
        o2 = ref + '.ann',
        o3 = ref + '.bwt',
        o4 = ref + '.pac',
        o5 = ref + '.sa',
        o6 = ref + '.fai',
        o7 = refNoExt + '.dict'
    shell:
        'bwa index {input};'
        'samtools faidx {input};'
        'picard CreateSequenceDictionary REFERENCE={input} OUTPUT={output.o7}'

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    params:
        pl = 'ILLUMINA',
        rg = '{sample}_rg',
        sm = '{sample}'
    shell:
        'bwa mem -R "@RG\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}" {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'

Now snakemake knows that `{sample}` should expand to PatientA and PatientB, and that the pipeline should end up producing the files `'aligned/{sample}.bam'`.  Let's try running it (this will take a minute or two):
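
The target list itself is nothing more than a list of file names, one per sample.  A plain-Python sketch of that expansion is below; `expand_samples` is a hypothetical stand-in written only to show the idea.  In a Snakefile you would use Snakemake's own `expand` helper instead, in the same form that appears in the final pipeline further down.

```python
sampleDict = {
    "PatientA": ["Patient_A.r1.fastq", "Patient_A.r2.fastq"],
    "PatientB": ["Patient_B.r1.fastq", "Patient_B.r2.fastq"],
}


def expand_samples(pattern, samples):
    """Fill the {sample} placeholder once per sample name."""
    return [pattern.format(sample=name) for name in samples]


targets = expand_samples("aligned/{sample}.bam", sampleDict.keys())
print(targets)
```

Listing these paths as the `input` of `rule all` is what tells Snakemake which concrete values `{sample}` should take.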

In [None]:
!snakemake -ps snakefile_test3

In [None]:
ls -lh aligned/

Yay!  A few things of note:
- We now have one aligned bam file per patient!
- Snakemake automatically created the directory `aligned/` for us.
- The stdout also includes a bunch of information from bwa.  There are ways to clean this up, but we're going to skip over that for now.
- Snakemake saw that the indexed reference files were already created, so it did not re-run that rule.
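
The idea behind skipping finished work can be illustrated with file modification times.  This is a deliberately simplified sketch, not Snakemake's actual implementation (which is more involved); `needs_rerun` is a hypothetical helper and the files are throwaway temporary files.

```python
import os
import tempfile


def needs_rerun(inputs, outputs):
    """True if any output is missing or older than the newest input."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output < newest_input


with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "ref.fasta")
    index = ref + ".fai"
    open(ref, "w").close()
    before = needs_rerun([ref], [index])  # index does not exist yet
    open(index, "w").close()
    os.utime(ref, (1000, 1000))           # pretend the input is old
    os.utime(index, (2000, 2000))         # and the output is newer
    after = needs_rerun([ref], [index])
    os.utime(ref, (3000, 3000))           # input modified after the output
    stale = needs_rerun([ref], [index])

print(before, after, stale)  # True False True
```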

# Third rule: index bams

Like the reference genome, the aligned bam files need to be indexed for the next tool to be able to read them.  We'll need to write the rule and update `rule all` with the new target file.

In [None]:
%%writefile snakefile_test4

import os

ref = 'ref_genome/chr5_ref.fasta'
refNoExt = os.path.splitext(ref)[0]

rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2


rule all:
    input:
        '''
        What goes here?
        '''

rule index_ref:
    input:
        ref
    output:
        o1 = ref + '.amb',
        o2 = ref + '.ann',
        o3 = ref + '.bwt',
        o4 = ref + '.pac',
        o5 = ref + '.sa',
        o6 = ref + '.fai',
        o7 = refNoExt + '.dict'
    shell:
        'bwa index {input};'
        'samtools faidx {input};'
        'picard CreateSequenceDictionary REFERENCE={input} OUTPUT={output.o7}'

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    params:
        pl = 'ILLUMINA',
        rg = '{sample}_rg',
        sm = '{sample}'
    shell:
        'bwa mem -R "@RG\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}" {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'

'''
Command to run:
samtools index <input file>  --> index with ending .bai
'''

In [None]:
!snakemake -ps snakefile_test4

In [None]:
ls -lh aligned/

# Fourth rule: calling

This fourth rule will compare our aligned sequences to the reference genome, look at places where there's a discrepancy, and report back those variants.  We'll need to write the rule and update `rule all` with the new target file.

In [None]:
%%writefile snakefile_test5

import os

ref = 'ref_genome/chr5_ref.fasta'
refNoExt = os.path.splitext(ref)[0]

rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2


rule all:
    input:
        '''
        What goes here?
        '''

rule index_ref:
    input:
        ref
    output:
        o1 = ref + '.amb',
        o2 = ref + '.ann',
        o3 = ref + '.bwt',
        o4 = ref + '.pac',
        o5 = ref + '.sa',
        o6 = ref + '.fai',
        o7 = refNoExt + '.dict'
    shell:
        'bwa index {input};'
        'samtools faidx {input};'
        'picard CreateSequenceDictionary REFERENCE={input} OUTPUT={output.o7}'

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    params:
        pl = 'ILLUMINA',
        rg = '{sample}_rg',
        sm = '{sample}'
    shell:
        'bwa mem -R "@RG\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}" {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'
        
rule index_bams:
    input:
        'aligned/{sample}.bam'
    output:
        'aligned/{sample}.bam.bai'
    shell:
        'samtools index {input}'
        
        
'''
Command to run:
gatk HaplotypeCaller -I <input file> -O <output file> -R <reference>
'''

In [None]:
!snakemake -ps snakefile_test5

In [None]:
ls -lh called/

In [None]:
!grep -A5 "^#CHROM" called/PatientA.vcf

# Put it all together

Add some comments and save the finalized pipeline.

In [None]:
%%writefile Genomics_pipeline

# AUTHOR: BB

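'''
DNA-seq pipeline: index the reference genome, align paired-end
FASTQ files with bwa, index the resulting bams, and call variants
with GATK HaplotypeCaller.
'''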

# get user data:
import os

ref = 'ref_genome/chr5_ref.fasta'
refNoExt = os.path.splitext(ref)[0]

rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2


rule all:
    input:
        expand('called/{sample}.vcf', sample=sampleDict.keys())

rule index_ref:
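    '''
    Index the reference genome for bwa (.amb, .ann, .bwt, .pac, .sa),
    samtools (.fai), and picard (.dict).
    '''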
    input:
        ref
    output:
        o1 = ref + '.amb',
        o2 = ref + '.ann',
        o3 = ref + '.bwt',
        o4 = ref + '.pac',
        o5 = ref + '.sa',
        o6 = ref + '.fai',
        o7 = refNoExt + '.dict'
    shell:
        'bwa index {input};'
        'samtools faidx {input};'
        'picard CreateSequenceDictionary REFERENCE={input} OUTPUT={output.o7}'

rule align_fastqs: 
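    '''
    Align each sample's paired-end reads to the reference with
    bwa mem and write a coordinate-sorted bam via samtools sort.

    bwa complains if using literal tabs, so
    make sure your snakemake command prints
    "\t".
    '''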
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    params:
        pl = 'ILLUMINA',
        rg = '{sample}_rg',
        sm = '{sample}'
    shell:
        'bwa mem -R "@RG\\tID:{params.rg}\\tSM:{params.sm}\\tPL:{params.pl}" {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'
        
rule index_bams:
    '''
    Index each aligned bam (creates .bam.bai) so the
    variant caller can read it.
    '''
    input:
        'aligned/{sample}.bam'
    output:
        'aligned/{sample}.bam.bai'
    shell:
        'samtools index {input}'
        
rule call_variants:
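    '''
    Call variants for each sample against the reference with
    GATK HaplotypeCaller.  The .fai, .dict, and .bai inputs are
    not named in the command, but GATK needs them to be present.
    '''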
    input:
        ref = ref,
        r1 = ref + '.fai',
        r2 = refNoExt + '.dict',
        bam = 'aligned/{sample}.bam',
        bai = 'aligned/{sample}.bam.bai'
    output:
        'called/{sample}.vcf'
    shell:
        'gatk HaplotypeCaller -I {input.bam} -O {output} -R {input.ref}'

Let's remove all the files we've generated and run the pipeline as one sequence of rules.

In [None]:
rm -r aligned/ called/ ref_genome/chr5_ref.dict ref_genome/chr5_ref.fasta.*

Now, let's do a dry-run of our complete pipeline.

In [None]:
!snakemake -nps Genomics_pipeline

Great!  Now let's see the DAG (directed acyclic graph).

In [None]:
!snakemake -nps Genomics_pipeline --dag | dot -Tsvg > dag2.svg

In [None]:
ls

![dag](dag2.svg)

Looks good - no unexpected recursion or weird relationships.  Now let's run it for real!
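
The same dependency structure can be written down by hand and sorted into a valid execution order.  The sketch below does that at the level of rules, read off the `input` sections of the pipeline above; the real DAG has one job per sample for the wildcard rules.  It uses the standard-library `graphlib` module (Python 3.9 or newer), not anything from Snakemake.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each rule maps to the rules whose outputs it needs as input.
dependencies = {
    "index_ref": set(),
    "align_fastqs": {"index_ref"},
    "index_bams": {"align_fastqs"},
    "call_variants": {"index_ref", "align_fastqs", "index_bams"},
    "all": {"call_variants"},
}

# static_order() raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Because every rule here depends on the one before it, there is only one valid order, and it ends with `all`.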

In [None]:
!snakemake -ps Genomics_pipeline

In [None]:
# take a peek at the final output files
!for i in called/Patient*vcf; do echo $i; grep -A5 "^#CHROM" $i; echo ""; done

In [None]:
# how many variants were called for each patient?
!grep -cv "^#" called/*vcf
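
The same count can be done in Python, which is handy if you want to go on and work with the numbers.  In this sketch `count_variants` is a hypothetical helper and the VCF text is a made-up example, not output from this pipeline.  Unlike `grep -cv "^#"`, it also ignores blank lines.

```python
def count_variants(vcf_lines):
    """Count records in a VCF: every non-blank line that is not a header."""
    return sum(
        1 for line in vcf_lines
        if line.strip() and not line.startswith("#")
    )


# Made-up VCF content for illustration only.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "5\t100000123\t.\tA\tG\t50\t.\t.\n"
    "5\t100000456\t.\tC\tT\t60\t.\t.\n"
).splitlines()

n_variants = count_variants(example_vcf)
print(n_variants)  # 2
```

To run it on real output, pass an open file handle instead, for example `count_variants(open('called/PatientA.vcf'))`.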