# Variant Calling

Now that the sequences have been aligned to the reference, we can look for differences between the aligned sequences and the reference. This process is known as variant calling.

Let's review the files that we have.

In [1]:
ls -lh

total 480M
-rw-rw-r-- 1 jupyter jupyter  14K May 19 16:54 01 - Preparations for Finding a Disease Mutation.ipynb
-rw-rw-r-- 1 jupyter jupyter  16K May 19 17:04 02 - Aligning the FASTQ File.ipynb
-rw-rw-r-- 1 jupyter jupyter 5.4K May 19 11:12 03 - Variant Calling.ipynb
-rw-rw-r-- 1 jupyter jupyter 4.8K May 19 11:12 04 - Annotation of Variants.ipynb
-rw-r--r-- 1 jupyter jupyter 176M May 19 11:12 chr5.fa
-rw-rw-r-- 1 jupyter jupyter  131 May 19 11:12 chr5.fa.amb
-rw-rw-r-- 1 jupyter jupyter   43 May 19 11:12 chr5.fa.ann
-rw-rw-r-- 1 jupyter jupyter 173M May 19 11:12 chr5.fa.bwt
-rw-rw-r-- 1 jupyter jupyter   23 May 19 11:12 chr5.fa.fai
-rw-rw-r-- 1 jupyter jupyter  44M May 19 11:12 chr5.fa.pac
-rw-rw-r-- 1 jupyter jupyter  87M May 19 11:12 chr5.fa.sa
-rw-r--r-- 1 jupyter jupyter 820K May 19 11:12 input.fq
-rw-rw-r-- 1 jupyter jupyter  25K May 19 16:53 input.png
-rw-rw-r-- 1 jupyter jupyter 6.0K May 19 16:53 input.qual
-rw-rw-r-- 1 jupyter jupyter 963K May 

## Preparing for variant calling

To perform variant calling, we first need to preprocess the SAM file and the reference chromosome 5 fasta file. We will perform the following steps:

* Converting SAM to BAM format (this is the binary compressed version)
* Sorting the BAM file
* Indexing the BAM file
* Indexing the reference fasta file


### Processing the SAM file generated by BWA

We will process the SAM file using samtools, which can be accessed after loading the module.

In [2]:
module load bio/samtools
samtools view


Usage:   samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options: -b       output BAM
         -C       output CRAM (requires -T)
         -1       use fast BAM compression (implies -b)
         -u       uncompressed BAM output (implies -b)
         -h       include header in SAM output
         -H       print SAM header only (no alignments)
         -c       print only the count of matching records
         -o FILE  output file name [stdout]
         -U FILE  output reads not selected by filters to FILE [null]
         -t FILE  FILE listing reference names and lengths (see long help) [null]
         -T FILE  reference sequence FASTA FILE [null]
         -L FILE  only include reads overlapping this BED FILE [null]
         -r STR   only include reads in read group STR [null]
         -R FILE  only include reads with read group listed in FILE [null]
         -q INT   only include reads with mapping quality >= INT [0]
         -l STR   only include re

We will begin by converting the SAM file to the compressed BAM format. To do this, we pass samtools view both the reference fasta file (option -T) and the SAM file, and request BAM output (option -b). The output is redirected to a new BAM file.

In [3]:
samtools view -bT chr5.fa mapped.sam > mapped.bam



In [4]:
ls -l

total 491496
-rw-rw-r-- 1 jupyter jupyter     13805 May 19 16:54 01 - Preparations for Finding a Disease Mutation.ipynb
-rw-rw-r-- 1 jupyter jupyter     15890 May 19 17:04 02 - Aligning the FASTQ File.ipynb
-rw-rw-r-- 1 jupyter jupyter      9039 May 19 17:08 03 - Variant Calling.ipynb
-rw-rw-r-- 1 jupyter jupyter      4877 May 19 11:12 04 - Annotation of Variants.ipynb
-rw-r--r-- 1 jupyter jupyter 184533572 May 19 11:12 chr5.fa
-rw-rw-r-- 1 jupyter jupyter       131 May 19 11:12 chr5.fa.amb
-rw-rw-r-- 1 jupyter jupyter        43 May 19 11:12 chr5.fa.ann
-rw-rw-r-- 1 jupyter jupyter 180915336 May 19 11:12 chr5.fa.bwt
-rw-rw-r-- 1 jupyter jupyter        23 May 19 11:12 chr5.fa.fai
-rw-rw-r-- 1 jupyter jupyter  45228817 May 19 11:12 chr5.fa.pac
-rw-rw-r-- 1 jupyter jupyter  90457680 May 19 11:12 chr5.fa.sa
-rw-r--r-- 1 jupyter jupyter    839362 May 19 11:12 input.fq
-rw-rw-r-- 1 jupyter jupyter     25235 May 19 16:53 input.png
-rw-rw-r-- 1 jupyter jupyter  

Notice that the BAM file is much smaller than the SAM file.
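
One way to put a number on the size difference is to compare byte counts. The following is a minimal sketch: size_percent is a hypothetical helper (not part of samtools), and the comparison assumes that mapped.sam and mapped.bam are both in the current directory.

```shell
# Hypothetical helper: print the size of file $1 as a whole-number
# percentage of the size of file $2 (integer division, rounds down).
size_percent() {
    small_bytes=$(wc -c < "$1")
    large_bytes=$(wc -c < "$2")
    echo $(( 100 * small_bytes / large_bytes ))
}

# Only run the comparison if both files from the previous step exist.
if [ -f mapped.bam ] && [ -f mapped.sam ]; then
    echo "mapped.bam is $(size_percent mapped.bam mapped.sam)% of the size of mapped.sam"
fi
```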

Next, we will sort the BAM file (by coordinate, the samtools default) and then index it.

In [5]:
samtools sort mapped.bam mapped.sort
samtools index mapped.sort.bam
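
The two-argument form of samtools sort used here (input BAM followed by an output prefix) is the legacy syntax. To our knowledge, samtools 1.3 and later no longer accept it and expect the output file to be named with -o. The sketch below picks a form from a version string. Here sort_command_for is a hypothetical helper, and the version numbers in the last two lines are examples only.

```shell
# Hypothetical helper: print the sort command suited to a samtools version.
# Assumption: releases before 1.3 accept "sort in.bam out.prefix",
# while 1.3 and later need "sort -o out.bam in.bam".
sort_command_for() {
    version="$1"
    major="${version%%.*}"      # text before the first dot
    rest="${version#*.}"        # text after the first dot
    minor="${rest%%.*}"         # text before the next dot, if any
    if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 3 ]; }; then
        echo "samtools sort -o mapped.sort.bam mapped.bam"
    else
        echo "samtools sort mapped.bam mapped.sort"
    fi
}

sort_command_for 1.2    # prints the legacy form used in this notebook
sort_command_for 1.9    # prints the -o form
```

Both forms write mapped.sort.bam, so the samtools index step is the same either way.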



In [6]:
ls -l

total 491780
-rw-rw-r-- 1 jupyter jupyter     13805 May 19 16:54 01 - Preparations for Finding a Disease Mutation.ipynb
-rw-rw-r-- 1 jupyter jupyter     15890 May 19 17:04 02 - Aligning the FASTQ File.ipynb
-rw-rw-r-- 1 jupyter jupyter      9039 May 19 17:08 03 - Variant Calling.ipynb
-rw-rw-r-- 1 jupyter jupyter      4877 May 19 11:12 04 - Annotation of Variants.ipynb
-rw-r--r-- 1 jupyter jupyter 184533572 May 19 11:12 chr5.fa
-rw-rw-r-- 1 jupyter jupyter       131 May 19 11:12 chr5.fa.amb
-rw-rw-r-- 1 jupyter jupyter        43 May 19 11:12 chr5.fa.ann
-rw-rw-r-- 1 jupyter jupyter 180915336 May 19 11:12 chr5.fa.bwt
-rw-rw-r-- 1 jupyter jupyter        23 May 19 11:12 chr5.fa.fai
-rw-rw-r-- 1 jupyter jupyter  45228817 May 19 11:12 chr5.fa.pac
-rw-rw-r-- 1 jupyter jupyter  90457680 May 19 11:12 chr5.fa.sa
-rw-r--r-- 1 jupyter jupyter    839362 May 19 11:12 input.fq
-rw-rw-r-- 1 jupyter jupyter     25235 May 19 16:53 input.png
-rw-rw-r-- 1 jupyter jupyter  

### Indexing the reference fasta file

To prepare the reference fasta file for variant calling, we need to index it.

In [7]:
module load bio/samtools
samtools faidx chr5.fa



In [8]:
ls -l

total 491780
-rw-rw-r-- 1 jupyter jupyter     13805 May 19 16:54 01 - Preparations for Finding a Disease Mutation.ipynb
-rw-rw-r-- 1 jupyter jupyter     15890 May 19 17:04 02 - Aligning the FASTQ File.ipynb
-rw-rw-r-- 1 jupyter jupyter      9039 May 19 17:08 03 - Variant Calling.ipynb
-rw-rw-r-- 1 jupyter jupyter      4877 May 19 11:12 04 - Annotation of Variants.ipynb
-rw-r--r-- 1 jupyter jupyter 184533572 May 19 11:12 chr5.fa
-rw-rw-r-- 1 jupyter jupyter       131 May 19 11:12 chr5.fa.amb
-rw-rw-r-- 1 jupyter jupyter        43 May 19 11:12 chr5.fa.ann
-rw-rw-r-- 1 jupyter jupyter 180915336 May 19 11:12 chr5.fa.bwt
-rw-rw-r-- 1 jupyter jupyter        23 May 19 17:09 chr5.fa.fai
-rw-rw-r-- 1 jupyter jupyter  45228817 May 19 11:12 chr5.fa.pac
-rw-rw-r-- 1 jupyter jupyter  90457680 May 19 11:12 chr5.fa.sa
-rw-r--r-- 1 jupyter jupyter    839362 May 19 11:12 input.fq
-rw-rw-r-- 1 jupyter jupyter     25235 May 19 16:53 input.png
-rw-rw-r-- 1 jupyter jupyter  

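Notice that the timestamp of chr5.fa.fai has changed: samtools faidx has regenerated the index. Let's look at its contents.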

In [9]:
head chr5.fa.fai

chr5	180915260	6	50	51
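
The five tab-separated columns of a .fai index are the sequence name, its length in bases, the byte offset of the first base in the fasta file, the number of bases per line, and the number of bytes per line (including the newline). As a small worked check, the sketch below rebuilds the expected size of chr5.fa from these numbers. It assumes the file holds a single sequence and ends with a newline. The helper fai_expected_size is hypothetical.

```shell
# Hypothetical helper: expected fasta file size in bytes for a single-sequence
# file, given the LENGTH, OFFSET, LINEBASES and LINEWIDTH columns of its index.
fai_expected_size() {
    length="$1"; offset="$2"; linebases="$3"; linewidth="$4"
    # Number of sequence lines, rounding up for a partial last line.
    lines=$(( (length + linebases - 1) / linebases ))
    # Header bytes + bases + line terminator bytes for every sequence line.
    echo $(( offset + length + lines * (linewidth - linebases) ))
}

# Values copied from the chr5.fa.fai line shown above.
fai_expected_size 180915260 6 50 51    # prints 184533572
```

The result matches the size of chr5.fa reported by ls -l (184533572 bytes), which suggests that the index and the fasta file are consistent with each other.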


## Variant calling using Platypus

There are different methods for variant calling. Here, we will use Platypus, a haplotype-based variant caller designed to increase the sensitivity and specificity of variant calls (http://www.well.ox.ac.uk/platypus).

<img src="https://bchdb.nus.edu.sg/media/notebook/ng.3036-F1.jpg" width=800></img>

In [None]:
module load bio/platypus
Platypus.py callVariants -h

The variant caller requires the reference file and the BAM file with its index.
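
Before starting the caller, it can help to confirm that these inputs are in place. The following is a minimal sketch, assuming that samtools index wrote its default index name, mapped.sort.bam.bai. The helper missing_files is hypothetical.

```shell
# Hypothetical helper: print each named file that does not exist.
missing_files() {
    for path in "$@"; do
        [ -e "$path" ] || echo "$path"
    done
}

# Reference, reference index, sorted BAM and BAM index from the steps above.
missing=$(missing_files chr5.fa chr5.fa.fai mapped.sort.bam mapped.sort.bam.bai)
if [ -n "$missing" ]; then
    echo "Missing input files:"
    echo "$missing"
else
    echo "All input files are present"
fi
```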

In [13]:
Platypus.py callVariants --refFile=chr5.fa --bamFiles=mapped.sort.bam --assemble=1 --nCPU=8 -o result.vcf

2017-05-19 18:49:04,914 - INFO - Beginning variant calling
2017-05-19 18:49:04,915 - INFO - Output will go to result.vcf
2017-05-19 18:49:04,928 - INFO - Processing region chr5:0-100000. (Only printing this message every 10 regions of size 100000)
2017-05-19 18:49:04,929 - INFO - Processing region chr5:100000-200000. (Only printing this message every 10 regions of size 100000)
2017-05-19 18:49:04,929 - INFO - Processing region chr5:200000-300000. (Only printing this message every 10 regions of size 100000)
2017-05-19 18:49:04,930 - INFO - Processing region chr5:400000-500000. (Only printing this message every 10 regions of size 100000)
2017-05-19 18:49:04,930 - INFO - Processing region chr5:300000-400000. (Only printing this message every 10 regions of size 100000)
2017-05-19 18:49:04,931 - INFO - Processing region chr5:500000-600000. (Only printing this message every 10 regions of size 100000)
2017-05-19 18:49:04,931 - INFO - Processing region chr5:600000-700000. (Only printin

Let's look at the first 50 lines of the VCF file.

In [None]:
head -n 50 result.vcf
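
In a VCF file, lines starting with ## hold metadata, the single line starting with #CHROM names the columns, and every other line describes one variant. A quick way to count the calls is therefore to count the lines that do not start with #. The sketch below is a minimal illustration: count_variants is a hypothetical helper, and the two records in the toy file are made up for the demonstration; they are not Platypus output.

```shell
# Hypothetical helper: count the variant records (non-header lines) in a VCF.
count_variants() {
    grep -vc '^#' "$1"
}

# Toy VCF with made-up records, written to a scratch file for illustration.
toy_vcf=$(mktemp)
printf '##fileformat=VCFv4.0\n' > "$toy_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$toy_vcf"
printf 'chr5\t1000\t.\tA\tG\t50\tPASS\t.\n' >> "$toy_vcf"
printf 'chr5\t2000\t.\tC\tT\t60\tPASS\t.\n' >> "$toy_vcf"

count_variants "$toy_vcf"    # prints 2
rm -f "$toy_vcf"
```

On the real output, the equivalent call would be count_variants result.vcf.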