# Annotation of Variants

We have called variants that differ from the reference genome, but we do not yet know whether any of them fall in genes or regions that could explain a disease or phenotype.

To find out, we will annotate the VCF file with ANNOVAR, a tool that cross-references the variants against databases of gene regions, population variants, functional mutations, and others.

We will first take a look at the list of files again:

In [1]:
ls -lh

total 483M
-rw-r--r-- 1 root root  15K Jul 23 14:50 '01 - Preparations for Finding a Disease Mutation.ipynb'
-rw-r--r-- 1 root root  17K Jul 23 14:44 '02 - Aligning the FASTQ File.ipynb'
-rw-r--r-- 1 root root  63K Jul 23 14:44 '03 - Variant Calling.ipynb'
-rw-r--r-- 1 root root  50K Jul 23 14:44 '04 - Annotation of Variants.ipynb'
-rw-r--r-- 1 root root 177M Jul 23 14:45  chr5.fa
-rw-r--r-- 1 root root  588 Jul 23 14:48  chr5.fa.amb
-rw-r--r-- 1 root root   44 Jul 23 14:48  chr5.fa.ann
-rw-r--r-- 1 root root 174M Jul 23 14:48  chr5.fa.bwt
-rw-r--r-- 1 root root   23 Jul 23 14:50  chr5.fa.fai
-rw-r--r-- 1 root root  44M Jul 23 14:48  chr5.fa.pac
-rw-r--r-- 1 root root  87M Jul 23 14:49  chr5.fa.sa
-rw-r--r-- 1 root root 820K Jul 23 14:44  input.fq
-rw-r--r-- 1 root root 225K Jul 23 14:46  input_fastqc.html
-rw-r--r-- 1 root root 235K Jul 23 14:46  input_fastqc.zip
-rw-r--r-- 1 root root 212K Jul 23 14:49  mapped.bam
-rw-r--r-- 1 root root 963K Jul 23 14:49  mapped.sam
-rw-r--r-- 1 root

We will annotate the corrected VCF file with ANNOVAR. To do this, we first need to convert the VCF file to ANNOVAR's internal .avinput format using the 'convert2annovar.pl' program.

After creating the .avinput file, we use it to create an annotated table with the 'table_annovar.pl' program. The annotation requires several important parameters:

- buildver - the genome version (typically hg19)
- annovar database - the directory that holds the downloaded annotation databases
- protocol - the comma-separated list of databases to annotate against

In this example, we will use 3 databases in the protocol specification:
- refGene - this tells us if the mutations occur in a gene
- snp138 - this is the catalog of variants (dbSNP version 138)
- clinvar_20150629 - this is a catalog of clinically important disease mutations

For each database, we need to specify the operation type. In this case:

- refGene - gene-based (g)
- snp138 - filter-based (f)
- clinvar_20150629 - filter-based (f)
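
The two ANNOVAR steps and the parameters listed above could be put together roughly as follows. This is a minimal sketch, not the notebook's own cell: the names `result.avinput` and `humandb/`, and the output prefix `result`, are assumptions chosen for illustration, and the commands only run if the ANNOVAR scripts and `result.vcf` are actually present.

```shell
# Sketch only: result.avinput, humandb/ and the "result" prefix are assumed names.
VCF=result.vcf                 # VCF used elsewhere in this notebook
AVINPUT=result.avinput         # hypothetical name for the converted file
HUMANDB=humandb/               # assumed location of the ANNOVAR databases
BUILDVER=hg19
PROTOCOL=refGene,snp138,clinvar_20150629
OPERATION=g,f,f                # one operation per protocol entry, same order

# table_annovar.pl expects as many operations as protocols.
n_protocol=$(printf '%s\n' "$PROTOCOL" | awk -F, '{ print NF }')
n_operation=$(printf '%s\n' "$OPERATION" | awk -F, '{ print NF }')

if [ "$n_protocol" -ne "$n_operation" ]; then
    echo "protocol/operation count mismatch: $n_protocol vs $n_operation" >&2
elif command -v convert2annovar.pl >/dev/null 2>&1 &&
     command -v table_annovar.pl >/dev/null 2>&1 &&
     [ -f "$VCF" ]; then
    # Step 1: convert the VCF to ANNOVAR's .avinput format.
    convert2annovar.pl -format vcf4 "$VCF" > "$AVINPUT"
    # Step 2: annotate and write a CSV table (result.hg19_multianno.csv).
    table_annovar.pl "$AVINPUT" "$HUMANDB" -buildver "$BUILDVER" \
        -out result -remove \
        -protocol "$PROTOCOL" -operation "$OPERATION" \
        -nastring . -csvout
else
    echo "ANNOVAR scripts or $VCF not found; skipping annotation" >&2
fi
```

The order of the `-operation` values must match the order of the `-protocol` entries, which is why the sketch checks that both lists have the same length before calling ANNOVAR.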

In [3]:
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz.tbi

--2020-07-23 14:50:57--  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 2607:f220:41e:250::7, 2607:f220:41e:250::10, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32772597 (31M) [application/x-gzip]
Saving to: ‘clinvar_20200720.vcf.gz’


2020-07-23 14:51:13 (2.16 MB/s) - ‘clinvar_20200720.vcf.gz’ saved [32772597/32772597]

--2020-07-23 14:51:13--  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz.tbi
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 2607:f220:41e:250::7, 2607:f220:41e:250::10, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 290018 (283K) [application/x-gzip]
Saving to: ‘clinvar_20200720.vcf.gz.tbi’


2020-07-23 14:51:17 

In [5]:
SnpSift annotate -v clinvar_20200720.vcf.gz result.vcf > result.annotate.vcf

00:00:00	SnpSift version 4.3t (build 2017-11-24 10:18), by Pablo Cingolani
00:00:00	Command: 'annotate'
00:00:00	Reading configuration file 'snpEff.config'
00:00:00	done
00:00:00	Annotating entries from: 'result.vcf'
00:00:00	Opening VCF input 'result.vcf'
00:00:00	Annotating:	Input file    : 'result.vcf'	Database file : 'clinvar_20200720.vcf.gz'
00:00:00	Annotating method: TABIX
00:00:00	Done.
	Total annotated entries : 11
	Total entries           : 40
	Percent                 : 27.50%
	Errors (bad references) : 0
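
As a quick cross-check of the summary printed by SnpSift, the annotated records can be counted directly in the output file. The sketch below assumes that a ClinVar hit shows up as a `CLNSIG=` tag in the INFO column; the helper name `count_clinvar_hits` is made up for this example, and its tally may differ slightly from SnpSift's, which also counts records that only received an ID.

```shell
# Print "<annotated> <total>" for a VCF file: total data lines, and the
# subset whose INFO column (8th field) carries a CLNSIG tag.
count_clinvar_hits() {
    awk -F'\t' '
        /^#/ { next }                             # skip header lines
        { total++ }
        $8 ~ /(^|;)CLNSIG=/ { hits++ }
        END { printf "%d %d\n", hits, total }
    ' "$1"
}

# Only run on the real file if it exists in the working directory.
if [ -f result.annotate.vcf ]; then
    count_clinvar_hits result.annotate.vcf
fi
```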


# Taking a look at the annotated variant file

We can download the CSV file and open it in Microsoft Excel.
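
If only the SnpSift output is at hand, a small spreadsheet-friendly table can also be derived from `result.annotate.vcf`. The following is a sketch under assumptions: the function name `vcf_to_csv`, the column selection, and the output file name `result.annotate.csv` are illustrative choices, and the ClinVar significance is assumed to live in a `CLNSIG=` INFO tag.

```shell
# Convert VCF data lines to CSV with the columns
# chrom,pos,id,ref,alt,clnsig ("." when no CLNSIG tag is present).
vcf_to_csv() {
    awk -F'\t' '
        BEGIN { print "chrom,pos,id,ref,alt,clnsig" }
        /^#/  { next }                            # skip header lines
        {
            clnsig = "."
            n = split($8, info, ";")
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^CLNSIG=/) clnsig = substr(info[i], 8)
            # ALT and CLNSIG may contain commas, so they are quoted.
            printf "%s,%s,%s,%s,\"%s\",\"%s\"\n", $1, $2, $3, $4, $5, clnsig
        }
    ' "$1"
}

if [ -f result.annotate.vcf ]; then
    vcf_to_csv result.annotate.vcf > result.annotate.csv
fi
```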

# Preparing the VCF file for visualization

In [9]:
bgzip -c result.annotate.vcf > result.annotate.vcf.gz
tabix -p vcf result.annotate.vcf.gz
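
Before loading the file into a viewer, it can help to confirm that the compressed VCF and its index were both written. The sketch below uses a hypothetical helper, `check_indexed`, that only inspects the files on disk; it does not validate their contents.

```shell
# Succeed if the bgzipped VCF and its .tbi index both exist, are non-empty,
# and the index is not older than the data file.
check_indexed() {
    vcf_gz=$1
    [ -s "$vcf_gz" ] && [ -s "$vcf_gz.tbi" ] && [ ! "$vcf_gz.tbi" -ot "$vcf_gz" ]
}

if check_indexed result.annotate.vcf.gz; then
    echo "result.annotate.vcf.gz is compressed and indexed"
else
    echo "result.annotate.vcf.gz or its index is missing or stale" >&2
fi
```

An index older than its data file usually means the VCF was regenerated without rerunning `tabix`, which is why the modification times are compared.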

In [10]:
ls -lh

total 514M
-rw-r--r-- 1 root root  15K Jul 23 14:50 '01 - Preparations for Finding a Disease Mutation.ipynb'
-rw-r--r-- 1 root root  16K Jul 23 14:51 '02 - Aligning the FASTQ File.ipynb'
-rw-r--r-- 1 root root  61K Jul 23 14:51 '03 - Variant Calling.ipynb'
-rw-r--r-- 1 root root  12K Jul 23 14:52 '04 - Annotation of Variants.ipynb'
-rw-r--r-- 1 root root 177M Jul 23 14:45  chr5.fa
-rw-r--r-- 1 root root  588 Jul 23 14:48  chr5.fa.amb
-rw-r--r-- 1 root root   44 Jul 23 14:48  chr5.fa.ann
-rw-r--r-- 1 root root 174M Jul 23 14:48  chr5.fa.bwt
-rw-r--r-- 1 root root   23 Jul 23 14:50  chr5.fa.fai
-rw-r--r-- 1 root root  44M Jul 23 14:48  chr5.fa.pac
-rw-r--r-- 1 root root  87M Jul 23 14:49  chr5.fa.sa
-rw-r--r-- 1 root root  32M Jul 21 03:06  clinvar_20200720.vcf.gz
-rw-r--r-- 1 root root 284K Jul 21 03:06  clinvar_20200720.vcf.gz.tbi
-rw-r--r-- 1 root root 820K Jul 23 14:44  input.fq
-rw-r--r-- 1 root root 225K Jul 23 14:46  input_fastqc.html
-rw-r--r-- 1 root root 235K Jul 23 14:46  inpu