# Annotation of Variants

We have identified variants that differ from the reference genome, but we do not yet know whether they affect genes or regions of the genome that could explain a disease or a phenotype.

To find out, we will annotate the VCF file with ANNOVAR, a tool that cross-references the variants against databases of gene regions, population variants, functional mutations, and others.

We will first take a look at the list of files again:

In [9]:
ls -lh

total 534M
-rw-r--r-- 1 root root  17K Jul 23 08:58 '01 - Preparations for Finding a Disease Mutation.ipynb'
-rw-r--r-- 1 root root  17K Jul 23 09:00 '02 - Aligning the FASTQ File.ipynb'
-rw-r--r-- 1 root root  62K Jul 23 06:26 '03 - Variant Calling.ipynb'
-rw-r--r-- 1 root root  41K Jul 23 08:44 '04 - Annotation of Variants.ipynb'
-rw-r--r-- 1 root root 177M Jul 23 08:49  chr5.fa
-rw-r--r-- 1 root root  588 Jul 23 08:51  chr5.fa.amb
-rw-r--r-- 1 root root   44 Jul 23 08:51  chr5.fa.ann
-rw-r--r-- 1 root root 174M Jul 23 08:51  chr5.fa.bwt
-rw-r--r-- 1 root root   23 Jul 23 09:01  chr5.fa.fai
-rw-r--r-- 1 root root  44M Jul 23 08:51  chr5.fa.pac
-rw-r--r-- 1 root root  87M Jul 23 08:52  chr5.fa.sa
-rw-r--r-- 1 root root  32M Jul 21 03:06  clinvar_20200720.vcf.gz
-rw-r--r-- 1 root root 284K Jul 21 03:06  clinvar_20200720.vcf.gz.tbi
-rw-r--r-- 1 root root 820K Jul 22 21:34  input.fq
-rw-r--r-- 1 root root 225K Jul 23 08:49  input_fastqc.html
-rw-r--r-- 1 root root 235K Jul 23 08:49  inpu

We will annotate the corrected VCF file with ANNOVAR. First, we need to convert the VCF file to ANNOVAR's internal .avinput format using the 'convert2annovar.pl' program.
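
The conversion step might look like the following minimal sketch. The input name `result.vcf` is an assumption (the text above does not name the corrected VCF file), and the output name is simply derived from it. `-format vcf4` is the `convert2annovar.pl` option for VCF input. The command runs only if ANNOVAR is on the PATH and the input file exists.

```shell
# Assumed input name; substitute the corrected VCF file from the previous notebook.
vcf="result.vcf"
# Derive the .avinput name by swapping the file extension.
avinput="${vcf%.vcf}.avinput"

if command -v convert2annovar.pl >/dev/null 2>&1 && [ -f "$vcf" ]; then
    # Convert the VCF records to ANNOVAR's tab-delimited input format.
    convert2annovar.pl -format vcf4 "$vcf" > "$avinput"
else
    echo "convert2annovar.pl or $vcf not found; would write $avinput"
fi
```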

After creating the .avinput file, we use it to create an annotated table using the 'table_annovar.pl' program. For the annotation, we need to provide several important parameters:

- buildver - the genome version (typically hg19)
- annovar database - this is the directory for all the indexes
- protocol - this is to specify the databases for annotation

In this example, we will use 3 databases in the protocol specification:
- refGene - this tells us if the mutations occur in a gene
- snp138 - this is the catalog of variants (dbSNP version 138)
- clinvar_20150629 - this is a catalog of clinically important disease mutations

For each database, we need to specify the operation type. In this case:

- refGene - gene-based (g)
- snp138 - filter-based (f)
- clinvar_20150629 - filter-based (f)
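
Putting the parameters above together, an invocation could be sketched as follows. The file name `result.avinput`, the database directory `humandb/`, and the output prefix `annotated` are assumptions for illustration. The `-protocol` and `-operation` lists must be given in the same order. `hg19` follows the text above; if the reference is a GRCh38 build, `hg38` would be the matching value. The command is only printed unless `table_annovar.pl` is installed and the input exists.

```shell
# Databases and their operation types, kept in matching order.
protocol="refGene,snp138,clinvar_20150629"
operation="g,f,f"

# Assumed names: result.avinput (input), humandb/ (database directory),
# annotated (output prefix). -csvout requests a comma-separated table.
cmd="table_annovar.pl result.avinput humandb/ -buildver hg19 -out annotated -remove -protocol $protocol -operation $operation -nastring . -csvout"

# Sanity check: there must be exactly one operation per database.
n_protocol=$(echo "$protocol" | tr ',' '\n' | wc -l | tr -d ' ')
n_operation=$(echo "$operation" | tr ',' '\n' | wc -l | tr -d ' ')
[ "$n_protocol" -eq "$n_operation" ] || echo "protocol/operation count mismatch" >&2

if command -v table_annovar.pl >/dev/null 2>&1 && [ -f result.avinput ]; then
    $cmd
else
    # ANNOVAR is not available here, so just show the command.
    echo "$cmd"
fi
```

With `-csvout` and this prefix, the annotated table would be expected in a file named like `annotated.hg19_multianno.csv`.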

In [11]:
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz.tbi

--2020-07-23 09:02:40--  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 2607:f220:41e:250::7, 2607:f220:41e:250::10, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32772597 (31M) [application/x-gzip]
Saving to: ‘clinvar_20200720.vcf.gz.1’


2020-07-23 09:02:54 (2.58 MB/s) - ‘clinvar_20200720.vcf.gz.1’ saved [32772597/32772597]

--2020-07-23 09:02:54--  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz.tbi
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.13, 2607:f220:41e:250::7, 2607:f220:41e:250::10, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 290018 (283K) [application/x-gzip]
Saving to: ‘clinvar_20200720.vcf.gz.tbi.1’


2020-07-23 09:

In [12]:
SnpSift annotate -v clinvar_20200720.vcf.gz result.vcf

00:00:00	SnpSift version 4.3t (build 2017-11-24 10:18), by Pablo Cingolani
00:00:00	Command: 'annotate'
00:00:00	Reading configuration file 'snpEff.config'
00:00:00	done
00:00:00	Annotating entries from: 'result.vcf'
00:00:00	Opening VCF input 'result.vcf'
00:00:00	Annotating:	Input file    : 'result.vcf'	Database file : 'clinvar_20200720.vcf.gz'
00:00:00	Annotating method: TABIX
##fileformat=VCFv4.2
##fileDate=20200723
##source=freeBayes v1.3.2-dirty
##reference=chr5.fa
##contig=<ID=chr5,length=181538259>
##phasing=none
##commandline="freebayes -f chr5.fa mapped.sort.bam"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AN,Numbe
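
As the output above shows, `SnpSift annotate` writes the annotated VCF to standard output, so in practice it is usually redirected to a file. The sketch below assumes an output name `result.clinvar.vcf` (hypothetical) and then counts the records that picked up a ClinVar `CLNSIG` field. The helper `count_clnsig` is illustrative and not part of SnpSift.

```shell
# Count non-header VCF records whose INFO column carries a CLNSIG tag.
count_clnsig() {
    grep -v '^#' "$1" | grep -c 'CLNSIG='
}

annotated="result.clinvar.vcf"   # assumed output name

if command -v SnpSift >/dev/null 2>&1 && [ -f result.vcf ]; then
    # Same command as the cell above, with stdout captured in a new file.
    SnpSift annotate clinvar_20200720.vcf.gz result.vcf > "$annotated"
    count_clnsig "$annotated"
fi
```

A count of zero does not necessarily mean that no variant is in ClinVar. One possible cause is a contig-naming mismatch, since the ClinVar VCF names chromosomes without the `chr` prefix while the reference used here is `chr5`.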

# Taking a look at the annotated variant file

We can download the annotated CSV file and open it in a spreadsheet program such as Microsoft Excel.

# Preparing the VCF file for visualization

In [13]:
bgzip -c result.vcf > result.vcf.gz

In [14]:
tabix -p vcf result.vcf.gz
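
`bgzip -c` writes the block-compressed VCF to standard output, and `tabix -p vcf` creates the `result.vcf.gz.tbi` index next to it. A small check like the sketch below (the helper name `check_tabix_pair` is hypothetical) can confirm that both files exist and that the index is not older than the data before the file is published for a genome browser.

```shell
# Return 0 if the compressed VCF and its .tbi index both exist
# and the index is not older than the data file.
check_tabix_pair() {
    vcf_gz="$1"
    idx="$vcf_gz.tbi"
    [ -f "$vcf_gz" ] || { echo "missing $vcf_gz" >&2; return 1; }
    [ -f "$idx" ] || { echo "missing $idx" >&2; return 1; }
    if [ "$idx" -ot "$vcf_gz" ]; then
        echo "$idx is older than $vcf_gz; rerun tabix" >&2
        return 1
    fi
    return 0
}

if check_tabix_pair result.vcf.gz; then
    echo "result.vcf.gz is indexed"
fi
```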

In [15]:
ls -l

total 579612
-rw-r--r-- 1 root root     16662 Jul 23 08:58 '01 - Preparations for Finding a Disease Mutation.ipynb'
-rw-r--r-- 1 root root     16745 Jul 23 09:00 '02 - Aligning the FASTQ File.ipynb'
-rw-r--r-- 1 root root     64157 Jul 23 09:02 '03 - Variant Calling.ipynb'
-rw-r--r-- 1 root root     40409 Jul 23 09:02 '04 - Annotation of Variants.ipynb'
-rw-r--r-- 1 root root 185169031 Jul 23 08:49  chr5.fa
-rw-r--r-- 1 root root       588 Jul 23 08:51  chr5.fa.amb
-rw-r--r-- 1 root root        44 Jul 23 08:51  chr5.fa.ann
-rw-r--r-- 1 root root 181538356 Jul 23 08:51  chr5.fa.bwt
-rw-r--r-- 1 root root        23 Jul 23 09:01  chr5.fa.fai
-rw-r--r-- 1 root root  45384566 Jul 23 08:51  chr5.fa.pac
-rw-r--r-- 1 root root  90769184 Jul 23 08:52  chr5.fa.sa
-rw-r--r-- 1 root root  32772597 Jul 21 03:06  clinvar_20200720.vcf.gz
-rw-r--r-- 1 root root  32772597 Jul 21 03:06  clinvar_20200720.vcf.gz.1
-rw-r--r-- 1 root root    290018 Jul 21 03:06  clinvar_20200720.vcf.gz.tbi
-rw-r--r-- 1 root

## Linking to web-based genome browser

Go to http://chromozoom.org

Paste this line under custom track: 

```
track type=vcfTabix name="My VCF" bigDataUrl=http://bchdb.nus.edu.sg/media/notebook/result.vcf.gz
```