# Variant Calling

Now that the sequences are aligned to the reference, we can look for differences between the aligned sequences and the reference. This process is known as variant calling.

Let's review the files that we have.

In [None]:
ls -lh

## Preparing for variant calling

![](images/workflow-preprocess.png)

To perform variant calling, we need to preprocess the SAM file as well as the reference chromosome 5 FASTA file. The following steps will be performed:

* Converting SAM to BAM format (this is the binary compressed version)
* Sorting the BAM file
* Marking duplicates in the BAM file
* Indexing the BAM file
* Indexing the reference fasta file
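
The steps above can be chained in a small script. The following is a minimal sketch rather than a definitive pipeline: the `run` and `preprocess` helpers and the `DRY_RUN` switch are hypothetical names introduced here, the file names are the ones used in this notebook, and the commands appear in the order in which the cells below run them. With `DRY_RUN=1` the commands are only printed, so the plan can be reviewed before anything is executed.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: preprocess a SAM file and its reference for variant calling.
# Usage: preprocess <reference.fa> <alignments.sam>
preprocess() {
  ref=$1
  sam=$2
  prefix=${sam%.sam}
  run samtools view -b -T "$ref" -o "$prefix.bam" "$sam" &&
  run samtools sort -o "$prefix.sort.bam" "$prefix.bam" &&
  run samtools index "$prefix.sort.bam" &&
  run sambamba markdup "$prefix.sort.bam" "$prefix.sort.dedup.bam" &&
  run samtools faidx "$ref"
}

# Print the plan without executing anything.
plan=$(DRY_RUN=1 preprocess chr5.fa mapped.sam)
echo "$plan"
```

Calling `preprocess chr5.fa mapped.sam` without `DRY_RUN=1` runs the commands, assuming `samtools` and `sambamba` are available on the `PATH`.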


### Processing the SAM file generated by BWA

We will process the SAM file using samtools, which can be accessed after loading the module.

In [None]:
samtools view

We will begin by converting the SAM file to the compressed BAM format. To do this, we will need to refer to the reference fasta file and the SAM file. The output will be redirected to a new BAM file.

In [None]:
samtools view -bT chr5.fa mapped.sam > mapped.bam

In [None]:
ls -lh

Notice that the BAM file is much smaller than the SAM file.

Next, we will sort the BAM file by coordinate and index the sorted file.

In [None]:
samtools sort -o mapped.sort.bam mapped.bam 
samtools index mapped.sort.bam
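
One way to confirm that the sort took effect is to read the sort order from the BAM header, where `samtools sort` records `SO:coordinate` on the `@HD` line. The sketch below assumes a hypothetical helper named `sort_order` that reads header text on standard input; it is exercised on a hand-written toy header so that it runs without `samtools`.

```shell
# Hypothetical helper: read SAM header text on stdin and print the
# sort order stored in the SO: tag of the @HD line.
sort_order() {
  awk -F'\t' '$1 == "@HD" { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) print substr($i, 4) }'
}

# Toy header (hand-written for illustration, not taken from mapped.sort.bam).
toy_order=$(printf '@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr5\tLN:1000\n' | sort_order)
echo "$toy_order"

# With samtools loaded, the same check on the real file would be:
#   samtools view -H mapped.sort.bam | sort_order
```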

We will mark any duplicates in the sorted alignments.

In [None]:
sambamba markdup mapped.sort.bam mapped.sort.dedup.bam

In [None]:
ls -lh

### Indexing the reference fasta file

To prepare the reference FASTA file for variant calling, we need to index the file.

In [None]:
samtools faidx chr5.fa

In [None]:
ls -lh

Notice the new `.fai` file, which is the index of the reference.

In [None]:
head chr5.fa.fai
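
Each line of a `.fai` index describes one sequence with five tab-separated columns: name, length in bases, byte offset of the first base, bases per line, and bytes per line. As a small illustration, the sketch below reads those columns with `awk`. It uses a hand-written toy record with hypothetical values, not the contents of `chr5.fa.fai`.

```shell
# Hand-written toy record (hypothetical values, not taken from chr5.fa.fai).
toy_fai=$(mktemp)
printf 'toy\t6\t5\t4\t5\n' > "$toy_fai"

# Columns: 1 NAME, 2 LENGTH, 3 OFFSET, 4 LINEBASES, 5 LINEWIDTH
fai_summary=$(awk -F'\t' '{ printf "%s: %d bases, %d bases per line\n", $1, $2, $4 }' "$toy_fai")
echo "$fai_summary"

# Clean up the temporary file.
rm -f "$toy_fai"
```

The same `awk` command can be pointed at `chr5.fa.fai` to report the length of chromosome 5.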

## Variant calling using freebayes

![](images/workflow-variant.png)

There are different methods for variant calling. Here, we will use `freebayes`, a haplotype-based variant caller: it calls variants based on the literal sequences of the reads aligned to a target, not on their precise alignment.

![](https://github.com/ekg/freebayes/raw/v1.3.0/paper/haplotype_calling.png)

We run `freebayes` with `-h` to see the options.

In [None]:
freebayes -h

The variant caller requires the reference file and the BAM file with its index. We save the results to `result.vcf` with the `>` operator.

In [None]:
freebayes -f chr5.fa mapped.sort.dedup.bam > result.vcf

# VCF Format

The Variant Call Format (VCF) is a text-based format for specifying variants. An example is shown below:

![](images/vcfexample.png)

The basic fields are as follows:

![](images/vcf.png)

Let's look at the first 100 lines of the VCF output file.

In [None]:
head -n 100 result.vcf
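
Because VCF is plain text, standard shell tools are enough for a first summary. The following minimal sketch separates header lines from variant records and counts the records above a quality threshold. The toy file, its records, and the threshold of 20 are hypothetical choices made for illustration; they are not `freebayes` output or a recommended cutoff.

```shell
# Toy VCF with hypothetical records (not output from freebayes).
toy_vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr5\t100\t.\tA\tG\t50\t.\t.\n'
  printf 'chr5\t200\t.\tC\tT\t12\t.\t.\n'
} > "$toy_vcf"

# Lines starting with '#' are header lines; every other line is one variant record.
n_records=$(grep -vc '^#' "$toy_vcf")

# Column 6 is QUAL; 20 is an arbitrary threshold chosen for this illustration.
n_confident=$(awk -F'\t' '!/^#/ && $6 >= 20' "$toy_vcf" | grep -c '')

echo "records: $n_records, QUAL >= 20: $n_confident"

# Clean up the temporary file.
rm -f "$toy_vcf"
```

Replacing `"$toy_vcf"` with `result.vcf` applies the same counts to the real output.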