# Annotation of Variants

We have identified variants that differ from the reference genome, but we do not yet know whether they affect genes or regions of the genome that could explain a disease or phenotype.

To find out, we will annotate the VCF file with ANNOVAR, a tool that cross-references the variants against databases of gene regions, population variants, functional mutations, and others.

We will first take a look at the list of files again:

In [1]:
ls -lh

total 481M
-rw-rw-r-- 1 jupyter jupyter  14K May 19 16:54 01 - Preparations for Finding a Disease Mutation.ipynb
-rw-rw-r-- 1 jupyter jupyter  16K May 19 17:04 02 - Aligning the FASTQ File.ipynb
-rw-rw-r-- 1 jupyter jupyter  61K May 19 17:16 03 - Variant Calling.ipynb
-rw-rw-r-- 1 jupyter jupyter 4.8K May 19 11:12 04 - Annotation of Variants.ipynb
-rw-r--r-- 1 jupyter jupyter 176M May 19 11:12 chr5.fa
-rw-rw-r-- 1 jupyter jupyter  131 May 19 11:12 chr5.fa.amb
-rw-rw-r-- 1 jupyter jupyter   43 May 19 11:12 chr5.fa.ann
-rw-rw-r-- 1 jupyter jupyter 173M May 19 11:12 chr5.fa.bwt
-rw-rw-r-- 1 jupyter jupyter   23 May 19 17:09 chr5.fa.fai
-rw-rw-r-- 1 jupyter jupyter  44M May 19 11:12 chr5.fa.pac
-rw-rw-r-- 1 jupyter jupyter  87M May 19 11:12 chr5.fa.sa
-rw-r--r-- 1 jupyter jupyter 820K May 19 11:12 input.fq
-rw-rw-r-- 1 jupyter jupyter  25K May 19 16:53 input.png
-rw-rw-r-- 1 jupyter jupyter 6.0K May 19 16:53 input.qual
-rw-rw-r-- 1 jupyter jupyter 651K May 

We will annotate the corrected VCF file with ANNOVAR. First, we need to convert the VCF file to ANNOVAR's internal input format (.avinput) using the 'convert2annovar.pl' program.
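To make the conversion concrete, here is a rough sketch of what the first five columns of an .avinput record look like for a simple SNV: chromosome, start, end, reference allele, and alternate allele, with start and end both equal to the VCF position. The two VCF lines are illustrative records built from coordinates that appear in the annotated output later in this notebook, and the awk one-liner is only a stand-in to show the layout. It is not how 'convert2annovar.pl' is implemented, and it ignores indels, multi-allelic sites, and the extra columns added by options such as -withfreq.

```shell
# Illustrative tab-separated VCF body lines (CHROM POS ID REF ALT QUAL FILTER INFO).
# These are hypothetical minimal records, not the contents of result.vcf.
vcf_lines=$(printf 'chr5\t148386525\t.\tT\tG\t.\t.\t.\nchr5\t148389763\t.\tG\tA\t.\t.\t.\n')

# For single-base substitutions, emit Chr, Start, End, Ref, Alt with Start == End == POS.
# Header lines (starting with #) and anything that is not a simple SNV are skipped.
avinput=$(printf '%s\n' "$vcf_lines" |
    awk -F'\t' -v OFS='\t' '!/^#/ && length($4) == 1 && length($5) == 1 { print $1, $2, $2, $4, $5 }')

printf '%s\n' "$avinput"
```

The real converter should always be used for actual data; the sketch only explains why Start and End are identical for every SNV in the annotated table.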

After creating the .avinput file, we use it to create an annotated table with the 'table_annovar.pl' program. For the annotation, we need to provide several important parameters:

- buildver - the genome build (hg19 in this example)
- annovar database - the directory that holds the downloaded annotation database files
- protocol - the comma-separated list of databases to annotate against

In this example, we will use three databases in the protocol specification:
- refGene - tells us whether a variant falls in a gene
- snp138 - a catalog of known variants (dbSNP version 138)
- clinvar_20150629 - a catalog of clinically important disease mutations

For each database, we also need to specify the operation type, listed in the same order as the protocol. In this case:

- refGene - gene-based (g)
- snp138 - filter-based (f)
- clinvar_20150629 - filter-based (f)
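Because the protocol and operation lists are matched by position, a mismatch in length or order is an easy mistake to make. The snippet below is a small sanity check, not part of ANNOVAR itself: it counts the entries in both lists and prints each database next to its operation code. The variable names (`protocol`, `operation`, `pairs`) are chosen here for illustration.

```shell
# The same lists that are passed to -protocol and -operation in this notebook.
protocol="refGene,snp138,clinvar_20150629"
operation="g,f,f"

# Count the comma-separated entries in each list.
n_protocol=$(printf '%s\n' "$protocol" | tr ',' '\n' | wc -l | tr -d ' ')
n_operation=$(printf '%s\n' "$operation" | tr ',' '\n' | wc -l | tr -d ' ')

pairs=""
if [ "$n_protocol" -ne "$n_operation" ]; then
    echo "protocol has $n_protocol entries but operation has $n_operation" >&2
else
    # Walk both lists by position and pair each database with its operation.
    i=1
    while [ "$i" -le "$n_protocol" ]; do
        db=$(printf '%s\n' "$protocol" | cut -d, -f"$i")
        op=$(printf '%s\n' "$operation" | cut -d, -f"$i")
        pairs="${pairs}${db}:${op} "
        i=$((i + 1))
    done
    pairs=${pairs% }   # drop the trailing space
fi

printf '%s\n' "$pairs"
```

If the two counts differ, the check prints a warning and leaves `pairs` empty, which is a cue to fix the lists before running the annotation.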

In [2]:
ls -lh /data/reference/human/annovar

total 59G
-rw-r--r-- 1 root root  892 Jan  6  2016 annovar_downdb.log
-rw-r--r-- 1 root root 853M Jan  6  2016 hg19_AFR.sites.2012_04.txt
-rw-r--r-- 1 root root  84M Jan  6  2016 hg19_AFR.sites.2012_04.txt.idx
-rw-r--r-- 1 root root 1.4G Jan  6  2016 hg19_AFR.sites.2014_10.txt
-rw-r--r-- 1 root root  87M Jan  6  2016 hg19_AFR.sites.2014_10.txt.idx
-rw-r--r-- 1 root root 1.5G Jan  6  2016 hg19_AFR.sites.2015_08.txt
-rw-r--r-- 1 root root  87M Jan  6  2016 hg19_AFR.sites.2015_08.txt.idx
-rw-r--r-- 1 root root 1.3G Jan  6  2016 hg19_ALL.sites.2012_04.txt
-rw-r--r-- 1 root root  86M Jan  6  2016 hg19_ALL.sites.2012_04.txt.idx
-rw-r--r-- 1 root root 2.8G Jan  6  2016 hg19_ALL.sites.2014_10.txt
-rw-r--r-- 1 root root  89M Jan  6  2016 hg19_ALL.sites.2014_10.txt.idx
-rw-r--r-- 1 root root 3.2G Jan  6  2016 hg19_ALL.sites.2015_08.txt
-rw-r--r-- 1 root root  89M Jan  6  2016 hg19_ALL.sites.2015_08.txt.idx
-rw-r--r-- 1 root root 660M Jan  6  2016 hg19_AMR.sites.2012_04.txt
-rw-r--

In [3]:
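module load bio/annovar

# Directory holding the ANNOVAR annotation databases
ANNOVAR=/data/reference/human/annovar

# Convert the VCF to ANNOVAR input format: process all samples (-allsample),
# report allele frequencies across them (-withfreq), and keep the original
# VCF fields in the output (-includeinfo)
convert2annovar.pl \
    -format vcf4 result.vcf \
    -allsample \
    -withfreq \
    -includeinfo \
    -outfile result.avinput

# Annotate against the three databases; -remove deletes intermediate files,
# -nastring . marks missing values with ".", and -csvout writes CSV output
table_annovar.pl result.avinput "$ANNOVAR" \
    -buildver hg19 \
    -out result \
    -remove \
    -protocol refGene,snp138,clinvar_20150629 \
    -operation g,f,f \
    -nastring . \
    -csvout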

NOTICE: Finished reading 60 lines from VCF file
NOTICE: A total of 12 locus in VCF file passed QC threshold, representing 12 SNPs (9 transitions and 3 transversions) and 0 indels/substitutions
NOTICE: Finished writing allele frequencies based on 12 SNP genotypes (9 transitions and 3 transversions) and 0 indels/substitutions for 1 samples
-----------------------------------------------------------------
NOTICE: Processing operation=g protocol=refGene

NOTICE: Running with system command <annotate_variation.pl -geneanno -buildver hg19 -dbtype refGene -outfile result.refGene -exonsort result.avinput /data/reference/human/annovar>
NOTICE: Reading gene annotation from /data/reference/human/annovar/hg19_refGene.txt ... Done with 49665 transcripts (including 10886 without coding sequence annotation) for 25936 unique genes
NOTICE: Reading FASTA sequences from /data/reference/human/annovar/hg19_refGeneMrna.fa ... Done with 1 sequences
NOTICE: Finished gene-based annotation on 12 geneti

# Taking a look at the annotated variant file

In [4]:
head result.hg19_multianno.csv

Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,snp138,clinvar_20150629
chr5,148386525,148386525,T,G,"exonic","SH3TC2",".","synonymous SNV","SH3TC2:NM_024577:exon16:c.A3594C:p.P1198P","rs6871030","CLINSIG=non-pathogenic,probable-non-pathogenic;CLNDBN=not_provided,not_specified;CLNREVSTAT=criteria_provided\x2c_single_submitter,no_assertion_criteria_provided;CLNACC=RCV000128035.1,RCV000118337.2;CLNDSDB=MedGen,MedGen;CLNDSDBID=CN221809,CN169374"
chr5,148389763,148389763,G,A,"intronic","SH3TC2",".",.,.,"rs1025476",.
chr5,148389868,148389868,T,G,"exonic","SH3TC2",".","nonsynonymous SNV","SH3TC2:NM_024577:exon14:c.A3292C:p.T1098P","rs77636085",.
chr5,148406032,148406032,C,T,"intronic","SH3TC2",".",.,.,"rs10075404",.
chr5,148406386,148406386,T,C,"intronic","SH3TC2",".",.,.,"rs17722209",.
chr5,148406435,148406435,G,A,"exonic","SH3TC2",".","stopgain","SH3TC2:NM_024577:exon11:c.C2860T:p.R954X","rs80338933","CLINSIG=pathogenic|pathogeni

We can download the CSV file and open it in Microsoft Excel.
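The same table can also be summarized on the command line. The sketch below tallies variants by Func.refGene and pulls out the protein-changing ones. To stay self-contained it works on a temporary file holding four complete rows copied from the output above; in practice the file would be result.hg19_multianno.csv. It assumes the first seven columns contain no embedded commas, which holds for these rows but may not hold for every annotation, so a proper CSV parser is the safer choice for larger files.

```shell
# Sample rows copied from the head output above (only rows shown in full are used).
csv=$(mktemp)
cat > "$csv" <<'EOF'
Chr,Start,End,Ref,Alt,Func.refGene,Gene.refGene,GeneDetail.refGene,ExonicFunc.refGene,AAChange.refGene,snp138,clinvar_20150629
chr5,148389763,148389763,G,A,"intronic","SH3TC2",".",.,.,"rs1025476",.
chr5,148389868,148389868,T,G,"exonic","SH3TC2",".","nonsynonymous SNV","SH3TC2:NM_024577:exon14:c.A3292C:p.T1098P","rs77636085",.
chr5,148406032,148406032,C,T,"intronic","SH3TC2",".",.,.,"rs10075404",.
chr5,148406386,148406386,T,C,"intronic","SH3TC2",".",.,.,"rs17722209",.
EOF

# Tally variants by genomic context (column 6, Func.refGene), skipping the header.
func_counts=$(tail -n +2 "$csv" | cut -d, -f6 | tr -d '"' | sort | uniq -c | awk '{ print $2, $1 }')

# Keep rows whose ExonicFunc.refGene value changes the protein. Matching the
# surrounding quotes ties the pattern to a whole field value.
protein_changing=$(grep -E '"(nonsynonymous SNV|stopgain)"' "$csv" | cut -d, -f1,2,4,5,7)

printf 'Counts by Func.refGene:\n%s\n' "$func_counts"
printf 'Protein-changing variants:\n%s\n' "$protein_changing"

rm -f "$csv"
```

For a disease-gene search, the protein-changing rows are usually the first ones worth cross-checking against the clinvar_20150629 column.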

# Preparing the VCF file for visualization

In [5]:
module load bio/htslib
bgzip -c result.vcf > result.vcf.gz



In [6]:
tabix -p vcf result.vcf.gz
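tabix writes its index next to the compressed file as result.vcf.gz.tbi, and it expects the records to be sorted by position within each chromosome. If indexing ever fails, a check along the following lines can help locate the problem. This is a minimal sketch: `is_position_sorted` is a helper name invented here, the two VCF records are illustrative rather than taken from result.vcf, and the check only compares consecutive records on the same chromosome.

```shell
# Build a tiny illustrative VCF (hypothetical records, tab-separated).
vcf=$(mktemp)
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$vcf"
printf 'chr5\t148386525\t.\tT\tG\t.\t.\t.\n' >> "$vcf"
printf 'chr5\t148389763\t.\tG\tA\t.\t.\t.\n' >> "$vcf"

# Succeeds (exit status 0) when no record has a smaller position than the
# record just before it on the same chromosome; header lines are ignored.
is_position_sorted() {
    awk -F'\t' '
        /^#/ { next }
        $1 == chrom && $2 + 0 < pos { bad = 1; exit }
        { chrom = $1; pos = $2 + 0 }
        END { exit bad + 0 }
    ' "$1"
}

if is_position_sorted "$vcf"; then
    sorted_status="sorted"
else
    sorted_status="unsorted"
fi
printf '%s\n' "$sorted_status"
```

On the real data the function would be called as `is_position_sorted result.vcf` before running bgzip and tabix.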



In [7]:
ls -l

total 492476
-rw-rw-r-- 1 jupyter jupyter     13805 May 19 16:54 01 - Preparations for Finding a Disease Mutation.ipynb
-rw-rw-r-- 1 jupyter jupyter     15890 May 19 17:04 02 - Aligning the FASTQ File.ipynb
-rw-rw-r-- 1 jupyter jupyter     14148 May 19 17:17 03 - Variant Calling.ipynb
-rw-rw-r-- 1 jupyter jupyter     17722 May 19 17:22 04 - Annotation of Variants.ipynb
-rw-r--r-- 1 jupyter jupyter 184533572 May 19 11:12 chr5.fa
-rw-rw-r-- 1 jupyter jupyter       131 May 19 11:12 chr5.fa.amb
-rw-rw-r-- 1 jupyter jupyter        43 May 19 11:12 chr5.fa.ann
-rw-rw-r-- 1 jupyter jupyter 180915336 May 19 11:12 chr5.fa.bwt
-rw-rw-r-- 1 jupyter jupyter        23 May 19 17:09 chr5.fa.fai
-rw-rw-r-- 1 jupyter jupyter  45228817 May 19 11:12 chr5.fa.pac
-rw-rw-r-- 1 jupyter jupyter  90457680 May 19 11:12 chr5.fa.sa
-rw-r--r-- 1 jupyter jupyter    839362 May 19 11:12 input.fq
-rw-rw-r-- 1 jupyter jupyter     25235 May 19 16:53 input.png
-rw-rw-r-- 1 jupyter jupyter  

## Linking to a web-based genome browser

Go to http://chromozoom.org and paste this line under custom track:

```track type=vcfTabix name="My VCF" bigDataUrl=http://bchdb.nus.edu.sg/media/notebook/result.vcf.gz```
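If the files are hosted somewhere else, the track line can be assembled from its parts. The sketch below rebuilds the line shown above from three variables (`base_url`, `vcf_name`, and `track_name` are names chosen here for illustration). It also prints where the index would sit, on the assumption that the browser looks for the tabix index at the same URL with .tbi appended, as tabix-based tracks generally do.

```shell
# Values taken from the track line above; change base_url to your own host.
base_url="http://bchdb.nus.edu.sg/media/notebook"
vcf_name="result.vcf.gz"
track_name="My VCF"

# Assemble the custom track line.
track_line=$(printf 'track type=vcfTabix name="%s" bigDataUrl=%s/%s' "$track_name" "$base_url" "$vcf_name")
printf '%s\n' "$track_line"

# Assumed location of the index produced by tabix (result.vcf.gz.tbi).
index_url="$base_url/$vcf_name.tbi"
printf 'index expected at: %s\n' "$index_url"
```

Both result.vcf.gz and result.vcf.gz.tbi would then need to be uploaded to the same directory on the web server.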