# Annotation of Variants

We have called variants that differ from the reference genome, but we do not yet know whether any of them affect genes or regions that could explain a disease or phenotype.

To find out, we will annotate the VCF file with `SnpEff/SnpSift`:

http://snpeff.sourceforge.net

We will first take a look at the list of files again:

In [1]:
ls -lh

total 483M
-rw-r--r-- 1 root root  15K Jul 24 09:02 '01 - Preparations for Finding a Disease Mutation.ipynb'
-rw-r--r-- 1 root root  17K Jul 24 11:07 '02 - Aligning the FASTQ File.ipynb'
-rw-r--r-- 1 root root  62K Jul 24 11:56 '03 - Variant Calling.ipynb'
-rw-r--r-- 1 root root  11K Jul 24 11:56 '04 - Annotation of Variants.ipynb'
-rw-r--r-- 1 root root 177M Jul 24 08:53  chr5.fa
-rw-r--r-- 1 root root  588 Jul 24 08:59  chr5.fa.amb
-rw-r--r-- 1 root root   44 Jul 24 08:59  chr5.fa.ann
-rw-r--r-- 1 root root 174M Jul 24 08:59  chr5.fa.bwt
-rw-r--r-- 1 root root   23 Jul 24 11:42  chr5.fa.fai
-rw-r--r-- 1 root root  44M Jul 24 08:59  chr5.fa.pac
-rw-r--r-- 1 root root  87M Jul 24 09:00  chr5.fa.sa
-rw-r--r-- 1 root root 820K Jul 24 08:52  input.fq
-rw-r--r-- 1 root root 225K Jul 24 08:54  input_fastqc.html
-rw-r--r-- 1 root root 235K Jul 24 08:54  input_fastqc.zip
-rw-r--r-- 1 root root 212K Jul 24 11:38  mapped.bam
-rw-r--r-- 1 root root 214K Jul 24 11:42  mapped.dedup.bam
-rw-r--r-- 

We will annotate the VCF file against ClinVar:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5753237/

ClinVar aggregates variant interpretations submitted by laboratories and expert panels.

We will download the GRCh38 version of the ClinVar VCF file and its tabix index (see https://www.ncbi.nlm.nih.gov/variation/docs/ClinVar_vcf_files/):


In [2]:
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz.tbi

--2020-07-24 11:57:48--  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::7, 2607:f220:41e:250::10, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32772597 (31M) [application/x-gzip]
Saving to: ‘clinvar_20200720.vcf.gz’


2020-07-24 11:58:00 (3.23 MB/s) - ‘clinvar_20200720.vcf.gz’ saved [32772597/32772597]

--2020-07-24 11:58:01--  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20200720.vcf.gz.tbi
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::7, 2607:f220:41e:250::10, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 290018 (283K) [application/x-gzip]
Saving to: ‘clinvar_20200720.vcf.gz.tbi’


2020-07-24 11:58:03 

We can now annotate `result.vcf` against the ClinVar file with `SnpSift annotate` and write the output to `result.annotate.vcf`:

In [4]:
SnpSift annotate -v clinvar_20200720.vcf.gz result.vcf > result.annotate.vcf

00:00:00	SnpSift version 4.3t (build 2017-11-24 10:18), by Pablo Cingolani
00:00:00	Command: 'annotate'
00:00:00	Reading configuration file 'snpEff.config'
00:00:00	done
00:00:00	Annotating entries from: 'result.vcf'
00:00:00	Opening VCF input 'result.vcf'
00:00:01	Annotating:	Input file    : 'result.vcf'	Database file : 'clinvar_20200720.vcf.gz'
00:00:01	Annotating method: TABIX
00:00:01	Done.
	Total annotated entries : 12
	Total entries           : 39
	Percent                 : 30.77%
	Errors (bad references) : 0
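
The summary above reports how many of our variants received an annotation. As a rough cross-check, the same kind of count can be sketched with `awk`. This is only an illustration: the toy VCF below contains hypothetical records, and it assumes that an annotated record can be recognised by an `ALLELEID=` key in the INFO column. The function name `count_annotated` is our own.

```shell
# Toy stand-in for result.annotate.vcf: four hypothetical records, one annotated.
vcf=$(mktemp)
trap 'rm -f "$vcf"' EXIT
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr5\t100\t.\tA\tG\t50\t.\tDP=10\n'
  printf 'chr5\t200\t111\tC\tT\t60\t.\tDP=12;ALLELEID=1;CLNSIG=Benign\n'
  printf 'chr5\t300\t.\tG\tA\t70\t.\tDP=45\n'
  printf 'chr5\t400\t.\tT\tC\t80\t.\tDP=20\n'
} > "$vcf"

# Count records and those whose INFO column (field 8) carries an ALLELEID key.
count_annotated() {
  awk -F'\t' '
    /^#/ { next }                            # skip header lines
    { total++ }
    $8 ~ /(^|;)ALLELEID=/ { annotated++ }    # assumption: ALLELEID marks a ClinVar hit
    END {
      if (total > 0)
        printf "%d/%d (%.2f%%)\n", annotated, total, 100 * annotated / total
      else
        print "0/0"
    }
  ' "$1"
}

summary=$(count_annotated "$vcf")
echo "$summary"
```

On the real data this would be called as `count_annotated result.annotate.vcf`.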


# Taking a look at the annotated variant file

![](images/clinvar.png)

In [7]:
tail result.annotate.vcf

chr5	149028538	130296	A	G	6.51745	.	AB=0.214286;ABP=12.937;AC=1;AF=0.5;AN=2;AO=3;CIGAR=1X;DP=14;DPB=14;DPRA=0;EPP=3.73412;EPPR=4.78696;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=1.24842;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=109;QR=368;RO=11;RPL=1;RPP=3.73412;RPPR=4.78696;RPR=2;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=11;SRP=26.8965;SRR=0;TYPE=snp;AF_ESP=0.48470;AF_EXAC=0.44561;AF_TGP=0.43710;ALLELEID=135743;CLNDISDB=MONDO:MONDO:0011113,MedGen:C1866636,OMIM:601596,Orphanet:ORPHA99949|MONDO:MONDO:0013237,MedGen:C3150596,OMIM:613353|MONDO:MONDO:0015626,MedGen:C0007959,Orphanet:ORPHA166,SNOMED_CT:50548001|MONDO:MONDO:0018995,MedGen:C4082197,Orphanet:ORPHA64749,SNOMED_CT:715795005|MedGen:CN169374;CLNDN=Charcot-Marie-Tooth_disease,_type_4C|Mononeuropathy_of_the_median_nerve,_mild|Charcot-Marie-Tooth_disease|Charcot-Marie-Tooth_disease_type_4|not_specified;CLNHGVS=NC_000005.10:g.149028538A>G;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Benign;CLNVC=sing
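
Annotated records are long and hard to read in raw form. The sketch below prints the ClinVar-related INFO keys of each record on separate lines. It is a minimal illustration, not part of SnpSift: the helper name `clinvar_fields` is our own, it only selects `ALLELEID` and keys starting with `CLN`, and the example record is an abbreviated copy of the one shown above.

```shell
# Read VCF records on stdin and list their ClinVar keys, one per line.
clinvar_fields() {
  awk -F'\t' '
    /^#/ { next }                                  # skip header lines
    {
      printf "%s:%s %s>%s\n", $1, $2, $4, $5       # position and alleles
      n = split($8, kv, ";")                       # INFO is field 8, ";"-separated
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^(ALLELEID|CLN[A-Z]+)=/) print "  " kv[i]
    }
  '
}

# Abbreviated record (most INFO keys omitted for brevity).
out=$(printf 'chr5\t149028538\t130296\tA\tG\t6.51745\t.\tDP=14;TYPE=snp;ALLELEID=135743;CLNSIG=Benign\n' | clinvar_fields)
printf '%s\n' "$out"
```

With the real file, the same helper could be used as `tail result.annotate.vcf | clinvar_fields`.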

We can look for annotations that contain the keyword `Pathogenic`. Note that `grep` is case-sensitive by default, so this search will not match `Likely_pathogenic`.

In [11]:
grep Pathogenic result.annotate.vcf

chr5	149026872	2482	G	A	358.298	.	AB=0.377778;ABP=8.84915;AC=1;AF=0.5;AN=2;AO=17;CIGAR=1X;DP=45;DPB=45;DPRA=0;EPP=3.13803;EPPR=10.7656;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=82.5012;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=657;QR=1065;RO=28;RPL=8;RPP=3.13803;RPPR=10.7656;RPR=9;RUN=1;SAF=17;SAP=39.9253;SAR=0;SRF=28;SRP=63.8115;SRR=0;TYPE=snp;AF_EXAC=0.00088;ALLELEID=17521;CLNDISDB=MONDO:MONDO:0011113,MedGen:C1866636,OMIM:601596,Orphanet:ORPHA99949|MONDO:MONDO:0013237,MedGen:C3150596,OMIM:613353|MONDO:MONDO:0015626,MedGen:C0007959,Orphanet:ORPHA166,SNOMED_CT:50548001|MONDO:MONDO:0018995,MedGen:C4082197,Orphanet:ORPHA64749,SNOMED_CT:715795005|MeSH:D030342,MedGen:C0950123|MedGen:CN169374|MedGen:CN239303|MedGen:CN517202;CLNDN=Charcot-Marie-Tooth_disease,_type_4C|Mononeuropathy_of_the_median_nerve,_mild|Charcot-Marie-Tooth_disease|Charcot-Marie-Tooth_disease_type_4|Inborn_genetic_diseases|not_specified|SH3TC2-Related_Disorders|not_provided;CLNHGVS=NC_000005.10:g.1490268
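
A plain `grep` matches the keyword anywhere on the line and prints the whole record. As a possible alternative, the sketch below reads the `CLNSIG` key from the INFO column and reports only records whose clinical significance starts with `Pathogenic` or `Likely_pathogenic`. The toy VCF holds hypothetical records, and the function name `pathogenic_records` is our own.

```shell
# Toy stand-in for result.annotate.vcf (hypothetical records).
vcf=$(mktemp)
trap 'rm -f "$vcf"' EXIT
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr5\t100\t.\tA\tG\t50\t.\tDP=10\n'
  printf 'chr5\t200\t11\tC\tT\t60\t.\tDP=12;CLNSIG=Benign\n'
  printf 'chr5\t300\t22\tG\tA\t70\t.\tDP=45;CLNSIG=Pathogenic\n'
  printf 'chr5\t400\t33\tT\tC\t80\t.\tDP=20;CLNSIG=Likely_pathogenic\n'
} > "$vcf"

# Print CHROM, POS, REF, ALT and CLNSIG for (likely) pathogenic records.
pathogenic_records() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                                  # skip header lines
    {
      sig = ""
      n = split($8, kv, ";")                       # INFO is field 8
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^CLNSIG=/) sig = substr(kv[i], 8)   # text after "CLNSIG="
      if (sig ~ /^(Pathogenic|Likely_pathogenic)/) print $1, $2, $4, $5, sig
    }
  ' "$1"
}

hits=$(pathogenic_records "$vcf")
printf '%s\n' "$hits"
```

On the real data this would be called as `pathogenic_records result.annotate.vcf`.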

# Preparing the VCF file for visualization

We can compress the VCF file with `bgzip` and index it with `tabix` so that it can be visualized in the IGV browser.

In [9]:
bgzip -c result.annotate.vcf > result.annotate.vcf.gz
tabix -p vcf result.annotate.vcf.gz

In [10]:
ls -lh

total 515M
-rw-r--r-- 1 root root  15K Jul 24 09:02 '01 - Preparations for Finding a Disease Mutation.ipynb'
-rw-r--r-- 1 root root  17K Jul 24 11:07 '02 - Aligning the FASTQ File.ipynb'
-rw-r--r-- 1 root root  62K Jul 24 11:56 '03 - Variant Calling.ipynb'
-rw-r--r-- 1 root root  21K Jul 24 13:21 '04 - Annotation of Variants.ipynb'
-rw-r--r-- 1 root root 177M Jul 24 08:53  chr5.fa
-rw-r--r-- 1 root root  588 Jul 24 08:59  chr5.fa.amb
-rw-r--r-- 1 root root   44 Jul 24 08:59  chr5.fa.ann
-rw-r--r-- 1 root root 174M Jul 24 08:59  chr5.fa.bwt
-rw-r--r-- 1 root root   23 Jul 24 11:42  chr5.fa.fai
-rw-r--r-- 1 root root  44M Jul 24 08:59  chr5.fa.pac
-rw-r--r-- 1 root root  87M Jul 24 09:00  chr5.fa.sa
-rw-r--r-- 1 root root  32M Jul 21 03:06  clinvar_20200720.vcf.gz
-rw-r--r-- 1 root root 284K Jul 21 03:06  clinvar_20200720.vcf.gz.tbi
-rw-r--r-- 1 root root 820K Jul 24 08:52  input.fq
-rw-r--r-- 1 root root 225K Jul 24 08:54  input_fastqc.html
-rw-r--r-- 1 root root 235K Jul 24 08:54  inpu

To visualize the aligned reads together with the variants, we need to download four files:
- mapped.dedup.sort.bam
- mapped.dedup.sort.bam.bai
- result.annotate.vcf.gz
- result.annotate.vcf.gz.tbi

We will then load these files into the IGV browser, with GRCh38 selected as the reference genome.
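
Before downloading, it can help to confirm that all four files exist and are not empty. The helper below is a small sketch for that purpose. The name `check_files` is our own, and the function returns the number of missing or empty files as its exit status.

```shell
# Report each file as ok or MISSING; exit status is the number of problems.
check_files() {
  missing=0
  for f in "$@"; do
    if [ -s "$f" ]; then              # -s: file exists and has a size > 0
      printf 'ok       %s\n' "$f"
    else
      printf 'MISSING  %s\n' "$f"
      missing=$((missing + 1))
    fi
  done
  return "$missing"
}

check_files mapped.dedup.sort.bam mapped.dedup.sort.bam.bai \
            result.annotate.vcf.gz result.annotate.vcf.gz.tbi \
  || echo "At least one file is missing or empty."
```

An exit status of 0 means all listed files are present.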

# Using a web tool (VarMap)

https://academic.oup.com/bioinformatics/article/35/22/4854/5514476
https://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/DisaStr/GetPage.pl?varmap=TRUE