# Setup

First things first, let's check our working directory.

In [1]:
pwd()

'/home/jovyan/Genomics_practice'

In [None]:

# wget https://github.com/broadinstitute/gatk/releases/download/4.1.1.0/gatk-4.1.1.0.zip

To build our pipeline, we'll need to make sure we have access to some dependencies first.  All of these are included in the environment.yaml file used to build the binder container, but you can use conda to install them if you're working outside of the binder container (`conda install -c bioconda <tool>=<version>`).  Let's check the versions of the dependencies to ensure they're installed and on our path.

In [31]:
%%bash
echo -e "snakemake:  $(snakemake --version)"
echo -e "samtools:   $(samtools --version | head -n1)"
echo -e "bwa:        $(bwa |& grep Version)"
echo -e "gatk:       $(gatk --version)"
echo -e "python:     $(python3 --version 2>&1)"

snakemake:  5.4.5
samtools:   samtools 1.9
bwa:        Version: 0.7.17-r1188
python:     Python 3.6.8 :: Anaconda, Inc.
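
If you'd rather do a quick check from Python, here's a minimal sketch that only looks for each tool on the `PATH` (it doesn't check versions).  The helper name `find_tools` is made up for this example.

```python
import shutil

def find_tools(names):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}

# Tool names taken from the version check above.
for tool, path in find_tools(["snakemake", "samtools", "bwa", "gatk"]).items():
    print(f"{tool:<10} {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` would need to be installed before the pipeline can run.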


# Input data

We're going to start with FASTQ files from real human samples, but instead of whole genome data, we're focusing on chromosome 5, position 100000000-101500000.  Our reference genome file also only includes a portion of chr5.  This is to keep file sizes small and run times short!  Now, let's make sure we can see the raw files we intend to work on.

In [12]:
%%bash
ls -lh raw_data/

total 284M
-rw-r--r-- 1 jovyan jovyan 66M Apr 17 16:44 Patient_A.r1.fastq
-rw-r--r-- 1 jovyan jovyan 66M Apr 17 16:44 Patient_A.r2.fastq
-rw-r--r-- 1 jovyan jovyan 77M Apr 17 16:44 Patient_B.r1.fastq
-rw-r--r-- 1 jovyan jovyan 77M Apr 17 16:44 Patient_B.r2.fastq


In [9]:
%%bash
# what does a fastq look like?
head -n8 raw_data/Patient_A.r1.fastq

@E00572:97:H5YN2CCXY:5:1101:1773:44327/1
GATGGAGATGAGGAACTTGATGGAAACTGGAGCAAACGTGACTCTTGTTATGCTTTAGCAAAAATACCGGCAGGATTTTGTCCCTGCCCTAGAGATCTGTGGAATTTTGAACTTGAGAGAGAGGATTTAGAGCATCTGCCCCAAGAAAAT
+
<AAFF<JFFFFJJFJJJJ<JJJJFJJJJF-FJJ7FJFFJJJJJFJJJJJJJFJJJJJJ77FFFJJFFFJ<JFFF7-FFJJJFJJJFFFJJJ-7--<<FJJJJ--AA<-<7<F<7F--7A77J-7A-FF-<<A--7F<F--)-)))----7
@E00572:97:H5YN2CCXY:5:1101:1803:49127/1
GCCAAGGGAACCCCCAGCCCTACCCAGGGAAACCGGGAGTGATTGTGTAACTCCAGGAAACCATGCTTCTACCATGGATCTTTGCAACCCATGGATCAGGAGATCCCCCTGTGAGCTCATGCCACCAGGACCTTGGGTCTGACACACAGC
+
AAAAFJJJF<FJ-FJAFFJJJA<FFJAFFFFFFJJFAAJJJJJJFJJJFJJFJJJFJFJJJJJJJJJJJJJJJJJJFFJJJJJFA<7JFAAFJJ-AF7FFFJ-FF-F-AF<J77-A-7F-7FF-AA<JA<A-<<A)-AAF<F<FF--A7)
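
Each FASTQ record spans four lines: an `@` header, the bases, a `+` separator, and one quality character per base.  Here's a rough sketch of a reader for that layout.  It assumes Phred+33 quality encoding (the notebook doesn't state the encoding, so treat that as an assumption), and `read_fastq` is a hypothetical helper name.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (name, sequence, quality scores) for each four-line FASTQ record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of file (or a truncated final record)
        header, sequence, _plus, quality = record
        # Assumed Phred+33: score = ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores

# A tiny made-up record, just to show the shape of the result.
example = io.StringIO("@read1/1\nGATG\n+\n<AFJ\n")
for name, sequence, scores in read_fastq(example):
    print(name, sequence, scores)
```

To try it on the real data, you could pass `open('raw_data/Patient_A.r1.fastq')` instead of the in-memory example.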


In [15]:
%%bash
ls -lh ref_genome/

total 4.1M
-rw-r--r-- 1 jovyan jovyan 4.1M Apr 17 16:44 chr5_ref.fasta


In [11]:
%%bash
# what does a reference genome look like?
head -n4 ref_genome/chr5_ref.fasta

>5:99900000-104100000
AATAGGAAATCAAAGGGAATTTTAAGAGCTATTTTGAGACAAAAAAAAAATGGCATAACA
AAACTTATGGGATGCAGCAAAAGCATTGCTAGGAGAGAAGTTTATAGCAATAAATGCTTA
TGCTATGAAAGAAGAAAGACTTCAAATAAACAACCTAGCTTTACCCTTTCAGAAAGTGGC


# Plan

__Goal:__ assemble a working DNA-seq pipeline!

- Align sequencing data to a reference genome
- Call variants in the aligned data
- Annotate variants

# First rule: indexing

Our first rule is going to take our reference genome file, and index it so that the alignment tool can read it.  We can write the rule in any text editor, but for this class, we'll write it here in the notebook and save it to a file.

In [12]:
%%writefile snakefile_test1

ref = 'ref_genome/chr5_ref.fasta'

rule index_ref:
    '''
    This rule creates the indices needed by
    bwa in order to use a reference fasta.
    '''
    input:
        ref
    output:
        ref + '.amb',
        ref + '.ann',
        ref + '.bwt',
        ref + '.pac',
        ref + '.sa'
    shell:
        'bwa index {input}'

Overwriting snakefile_test1
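
Because a Snakefile is Python underneath, the five index paths don't have to be typed out one by one.  A small sketch of building them with a list comprehension (the names `BWA_INDEX_EXTENSIONS` and `bwa_index_files` are made up for this example):

```python
ref = 'ref_genome/chr5_ref.fasta'

# The five index files listed as outputs of rule index_ref.
BWA_INDEX_EXTENSIONS = ['.amb', '.ann', '.bwt', '.pac', '.sa']
bwa_index_files = [ref + ext for ext in BWA_INDEX_EXTENSIONS]

print(bwa_index_files)
```

In the Snakefile, `output: bwa_index_files` could then stand in for the five separate lines.  We'll keep the explicit version in this class so every file is visible.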


Let's test our first rule!  You can run this rule from the command line or here in the notebook using cell magic.  For our first test, we're going to try a __dry-run__ by using the `-n` flag.  If your snakefile is called something other than "Snakefile," use `-s <filename>`.

In [23]:
%%bash
snakemake -ns snakefile_test1

Building DAG of jobs...
Job counts:
	count	jobs
	1	index_ref
	1

[Wed Apr 17 18:26:17 2019]
rule index_ref:
    input: ref_genome/chr5_ref.fasta
    output: ref_genome/chr5_ref.fasta.amb, ref_genome/chr5_ref.fasta.ann, ref_genome/chr5_ref.fasta.bwt, ref_genome/chr5_ref.fasta.pac, ref_genome/chr5_ref.fasta.sa
    jobid: 0

Job counts:
	count	jobs
	1	index_ref
	1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.


Looks great!  Note that the dry-run does not actually execute any jobs; it shows the execution plan.  

Now let's try running our pipeline for real.  Add the `-p` flag to print the shell command that's run for each job.

In [24]:
%%bash
snakemake -ps snakefile_test1

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	index_ref
	1

[Wed Apr 17 18:29:38 2019]
rule index_ref:
    input: ref_genome/chr5_ref.fasta
    output: ref_genome/chr5_ref.fasta.amb, ref_genome/chr5_ref.fasta.ann, ref_genome/chr5_ref.fasta.bwt, ref_genome/chr5_ref.fasta.pac, ref_genome/chr5_ref.fasta.sa
    jobid: 0

bwa index ref_genome/chr5_ref.fasta
[bwa_index] Pack FASTA... 0.07 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.74 seconds elapse.
[bwa_index] Update BWT... 0.05 sec
[bwa_index] Pack forward-only FASTA... 0.04 sec
[bwa_index] Construct SA from BWT and Occ... 0.56 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index ref_genome/chr5_ref.fasta
[main] Real time: 2.515 sec; CPU: 2.460 sec
[Wed Apr 17 18:29:40 2019]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /home/jovyan/Genomics_practice/.snakemake/log/2019-04-17T182938.288375.snakemake.log


Hooray!!  Our first rule worked.  Note that Snakemake's console output provides a readable log of the steps run and any errors encountered.

What happens if we try to run it again?

In [25]:
%%bash
snakemake -ps snakefile_test1

Building DAG of jobs...
Nothing to be done.
Complete log: /home/jovyan/Genomics_practice/.snakemake/log/2019-04-17T183110.819471.snakemake.log
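
Nothing to be done!  The index files already exist and are newer than the reference, so Snakemake skips the job.  The idea can be sketched in a few lines of Python.  This is a simplified model based only on file timestamps, not Snakemake's actual implementation, and `needs_rerun` is a hypothetical name.

```python
import os

def needs_rerun(inputs, outputs):
    """Simplified check: rerun if an output is missing or older than an input."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output

ref = 'ref_genome/chr5_ref.fasta'
if os.path.exists(ref):
    # After a successful `bwa index`, we'd expect this to print False.
    print(needs_rerun([ref], [ref + '.bwt']))
```

Touching the reference file (or deleting one of the index files) would flip the answer, and Snakemake would run `index_ref` again.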


# Second rule: aligning

In [None]:
# include trimming?

Our next rule is going to take the short reads in our FASTQ files, and align them to a reference sequence using a tool called [__bwa__](https://github.com/lh3/bwa).  

In [13]:
rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = 'raw_data/Patient_A.r1.fastq',
        fq2 = 'raw_data/Patient_A.r2.fastq'
    output:
        'aligned/PatientA.bam'
    shell:
        'bwa mem {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'


Note that not all of the listed input files appear in the shell command.  bwa finds the index files on its own from the reference path, but they must exist for the command to run successfully.  Snakemake recommends that you include all file dependencies in the input section, even if they're not used in the command invocation.

Hold up.  We don't want to hard-code our sample files into a pipeline, or else we have to change code for every sample! How do we handle this?

![xkcd](images/xkcd_generalization.png)

In [None]:
SAMPLES = ['Patient_A.r1', 'Patient_A.r2', 'Patient_B.r1', 'Patient_B.r2']

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq = 'raw_data/{sample}.fastq'
    output:
        'aligned/{sample}.bam'
    shell:
        'bwa mem {input.ref} {input.fq} | samtools sort -o {output}'

What if we use a list, as above?  This would run an alignment on each FASTQ individually, which would be fine if we had single-end reads.  But we have paired-end reads: each DNA fragment was sequenced from both ends, so every sample has two related FASTQ files that need to be aligned together.
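
To make the pairing concrete, here's a small sketch that groups the read 1 and read 2 files by sample.  It assumes the `<sample>.r1.fastq` / `<sample>.r2.fastq` naming seen in `raw_data/`, and `pair_fastqs` is a hypothetical helper name.

```python
from collections import defaultdict

def pair_fastqs(filenames):
    """Group '<sample>.r1.fastq' / '<sample>.r2.fastq' names by sample."""
    pairs = defaultdict(dict)
    for filename in filenames:
        sample, read, extension = filename.rsplit('.', 2)
        if extension != 'fastq' or read not in ('r1', 'r2'):
            raise ValueError(f'unexpected FASTQ name: {filename}')
        pairs[sample][read] = filename
    # Return [read 1, read 2] per sample, in that order.
    return {sample: [reads['r1'], reads['r2']] for sample, reads in pairs.items()}

files = ['Patient_A.r1.fastq', 'Patient_A.r2.fastq',
         'Patient_B.r1.fastq', 'Patient_B.r2.fastq']
print(pair_fastqs(files))
```

Note that the sample names here come straight from the file names (`Patient_A`, `Patient_B`).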

In [10]:
rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    shell:
        'bwa mem {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'

Appending to snakefile_test1


What if we use a dict?  Much better!  Now our `rule align_fastqs` is generalizable.  If you were using this pipeline in real life, you'd probably require the user to provide a sample file where each line has the sample name, fastq1, and fastq2, and you'd read that into a dict (rather than explicitly defining a dict like we did).
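
Reading such a sample file could look something like the sketch below.  The tab-separated layout and the helper name `load_sample_sheet` are assumptions made for this example; the in-memory `sheet` stands in for a real file on disk.

```python
import csv
import io

def load_sample_sheet(handle):
    """Read 'sample<TAB>fastq1<TAB>fastq2' lines into {sample: [fastq1, fastq2]}."""
    samples = {}
    for row in csv.reader(handle, delimiter='\t'):
        if not row or row[0].startswith('#'):
            continue  # skip blank lines and comments
        sample, fastq1, fastq2 = row
        samples[sample] = [fastq1, fastq2]
    return samples

# Stand-in for something like open('samples.tsv') (hypothetical file name).
sheet = io.StringIO(
    'PatientA\tPatient_A.r1.fastq\tPatient_A.r2.fastq\n'
    'PatientB\tPatient_B.r1.fastq\tPatient_B.r2.fastq\n'
)
sampleDict = load_sample_sheet(sheet)
print(sampleDict)
```

The result has the same shape as the hand-written `sampleDict`, so the rest of the Snakefile wouldn't need to change.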

Note that an input (or params) entry can be a function instead of a fixed path, as in this example: Snakemake calls the function with the job's wildcards and uses the return value.
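
We can try an input function by hand to see what Snakemake would get back.  In this sketch, `SimpleNamespace` is only a stand-in for the wildcards object that Snakemake passes in; the rest mirrors the cell above.

```python
from types import SimpleNamespace

rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA': ['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB': ['Patient_B.r1.fastq', 'Patient_B.r2.fastq'],
}

def get_read1_fastq(wildcards):
    read1, _read2 = sampleDict[wildcards.sample]
    return rawDataPath + read1

# Stand-in for the wildcards of a job that produces aligned/PatientB.bam.
wildcards = SimpleNamespace(sample='PatientB')
print(get_read1_fastq(wildcards))
```

This prints `raw_data/Patient_B.r1.fastq`, which is the path that would be filled in for `{input.fq1}`.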

Let's put the two rules together, and then try running them.

In [14]:
%%writefile snakefile_test2

ref = 'ref_genome/chr5_ref.fasta'
rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2


rule index_ref:
    '''
    This rule creates the indices needed by
    bwa in order to use a reference fasta.
    '''
    input:
        ref
    output:
        ref + '.amb',
        ref + '.ann',
        ref + '.bwt',
        ref + '.pac',
        ref + '.sa'
    shell:
        'bwa index {input}'

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    shell:
        'bwa mem {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'

Writing snakefile_test2


In [15]:
%%bash
snakemake -ps snakefile_test2

Building DAG of jobs...
Nothing to be done.
Complete log: /home/jovyan/Genomics_practice/.snakemake/log/2019-04-17T192800.221988.snakemake.log


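Oh no!  What went wrong?  We haven't given snakemake a target file.  With no target on the command line, Snakemake builds the first rule in the file (`index_ref`), whose outputs already exist, and nothing has asked for an `aligned/{sample}.bam` file yet.  Let's add a `rule all`.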

In [18]:
%%writefile snakefile_test3

ref = 'ref_genome/chr5_ref.fasta'
rawDataPath = 'raw_data/'
sampleDict = {
    'PatientA':['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB':['Patient_B.r1.fastq', 'Patient_B.r2.fastq']
}

def get_read1_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read1

def get_read2_fastq(wildcards):
    (read1, read2) = sampleDict[wildcards.sample]
    return rawDataPath + read2


rule all:
    input:
        expand('aligned/{sample}.bam', sample=sampleDict.keys())

rule index_ref:
    '''
    This rule creates the indices needed by
    bwa in order to use a reference fasta.
    '''
    input:
        ref
    output:
        ref + '.amb',
        ref + '.ann',
        ref + '.bwt',
        ref + '.pac',
        ref + '.sa'
    shell:
        'bwa index {input}'

rule align_fastqs: 
    input:
        ref = ref,
        r1 = ref + '.amb',
        r2 = ref + '.ann',
        r3 = ref + '.bwt',
        r4 = ref + '.pac',
        r5 = ref + '.sa',
        fq1 = get_read1_fastq,
        fq2 = get_read2_fastq
    output:
        'aligned/{sample}.bam'
    shell:
        'bwa mem {input.ref} {input.fq1} {input.fq2} | samtools sort -o {output}'

Overwriting snakefile_test3
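
The `expand` call in `rule all` turns one pattern into a list of concrete file names.  Here's a plain-Python stand-in for the single-wildcard case, to show roughly what it produces.  `expand_samples` is a made-up helper, not Snakemake's implementation; the real `expand` also handles several wildcards at once.

```python
def expand_samples(pattern, samples):
    """Fill the {sample} placeholder once per sample name."""
    return [pattern.format(sample=sample) for sample in samples]

sampleDict = {
    'PatientA': ['Patient_A.r1.fastq', 'Patient_A.r2.fastq'],
    'PatientB': ['Patient_B.r1.fastq', 'Patient_B.r2.fastq'],
}
targets = expand_samples('aligned/{sample}.bam', sampleDict.keys())
print(targets)
```

These two paths are the target files that `rule all` asks for.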


Now snakemake knows that {sample} should expand to PatientA and PatientB, and that the pipeline should end up producing the files `'aligned/{sample}.bam'`.  Let's try running it (this will take a minute or two):

In [19]:
%%bash
snakemake -ps snakefile_test3

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	2	align_fastqs
	1	all
	3

[Wed Apr 17 19:31:18 2019]
rule align_fastqs:
    input: ref_genome/chr5_ref.fasta, ref_genome/chr5_ref.fasta.amb, ref_genome/chr5_ref.fasta.ann, ref_genome/chr5_ref.fasta.bwt, ref_genome/chr5_ref.fasta.pac, ref_genome/chr5_ref.fasta.sa, raw_data/Patient_B.r1.fastq, raw_data/Patient_B.r2.fastq
    output: aligned/PatientB.bam
    jobid: 2
    wildcards: sample=PatientB

bwa mem ref_genome/chr5_ref.fasta raw_data/Patient_B.r1.fastq raw_data/Patient_B.r2.fastq | samtools sort -o aligned/PatientB.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 66674 sequences (10000123 bp)...
[M::process] read 66674 sequences (10000097 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 31731, 0, 2)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size

Yay!  A few things of note:
- We now have one aligned bam file per patient!
- Snakemake automatically created the directory `aligned/` for us.
- The stdout also includes a bunch of information from bwa.  There are ways to clean this up, but we're going to skip over that for now.
- Snakemake saw that the indexed reference files were already created, so it did not re-run that rule.
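
Notice the `wildcards: sample=PatientB` line in the log.  Snakemake worked that out by matching the requested file against the output pattern of `align_fastqs`.  A rough sketch of the idea using a regular expression (`match_wildcards` is a hypothetical helper, and this is a simplification of what Snakemake really does):

```python
import re

def match_wildcards(pattern, path):
    """Return the wildcard values that make `pattern` equal `path`, or None."""
    regex = ''
    position = 0
    for placeholder in re.finditer(r'\{(\w+)\}', pattern):
        # Literal text before the placeholder must match exactly.
        regex += re.escape(pattern[position:placeholder.start()])
        # Each wildcard becomes a named group matching one or more characters.
        regex += f'(?P<{placeholder.group(1)}>.+)'
        position = placeholder.end()
    regex += re.escape(pattern[position:])
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None

print(match_wildcards('aligned/{sample}.bam', 'aligned/PatientB.bam'))
```

The value found for `sample` is then handed to the input functions, which look up the matching FASTQ files.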

# Third rule: calling

This third rule will compare our aligned sequences to the reference genome, look at places where there's a discrepancy, and report back those variants.

In [None]:
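rule call_variants:
    input:
        ref = ref,
        bam = 'aligned/{sample}.bam'
    output:
        # Placeholder path (an assumption); choose any location you like.
        'called/{sample}.vcf'
    shell:
        # GATK4 takes space-separated arguments, not the -I:<file> form.
        # It also expects a .fai/.dict for the reference, an indexed BAM,
        # and read groups in the BAM, none of which exist yet at this point.
        'gatk HaplotypeCaller -I {input.bam} -O {output} -R {input.ref}'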