ExAC DNPs #37

Closed
alexpenson opened this issue Feb 25, 2016 · 6 comments

@alexpenson
Member

DNPs composed of two ExAC SNPs are not annotated.

Solution: add these DNPs (and TNPs?) to the vcf.

@ckandoth ckandoth self-assigned this Feb 29, 2016
@alexpenson
Member Author

@ckandoth
Collaborator

Sample VCF where 1000g, NHLBI EVS, and ExAC allele freqs are only provided for the SNPs, but not for the DNP:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=G,Type=Integer,Description="Allelic Depths of REF and ALT(s) in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TUMOR   NORMAL
6   41903782    .   A   C   .   .   .   GT:AD:DP    0/1:10,10:20    0/0:22,0:22
6   41903783    .   G   A   .   .   .   GT:AD:DP    0/1:11,10:21    0/0:21,1:22
6   41903782    .   AG  CA  .   .   .   GT:AD:DP    0/1:10,10:20    0/0:21,0:21

@ckandoth
Collaborator

I figure I could make this part of this gist for VEP installation, where I already require the user to edit the ExAC VCF with bcftools. bcftools concat has a feature to --ligate phased VCFs by matching overlapping haplotypes... which would have been perfect iff everyone reported phased VCFs. 🙄

Really, we just need a simple script that merges adjacent ExAC SNPs with similar allele-counts. The ExAC VCF is already sorted by position. So a one-liner is possible.
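A minimal Python sketch of that merge idea, rather than the one-liner itself. It assumes the SNPs have already been parsed into position-sorted records; the `Snp` tuple, the `merge_adjacent_snps` name, the 10% `max_rel_diff` threshold, and the coordinates in the usage note are all hypothetical. Similar allele counts only hint that two SNPs travel together, they do not prove phase, so the merged count (the smaller of the two ACs) is at best an upper bound.

```python
from itertools import groupby, product
from typing import Iterable, Iterator, NamedTuple


class Snp(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    ac: int  # allele count


def merge_adjacent_snps(snps: Iterable[Snp], max_rel_diff: float = 0.1) -> Iterator[Snp]:
    """Yield candidate DNPs built from SNPs at adjacent positions.

    `snps` must be sorted by chromosome and position. Two SNPs are merged
    when their allele counts differ by at most `max_rel_diff`, relative to
    the larger count. This is a heuristic (an assumption), not phasing.
    """
    prev_group = []
    for _, group in groupby(snps, key=lambda s: (s.chrom, s.pos)):
        # Keep single-base substitutions only; indels are skipped.
        cur_group = [s for s in group if len(s.ref) == 1 and len(s.alt) == 1]
        for a, b in product(prev_group, cur_group):
            if a.chrom == b.chrom and b.pos == a.pos + 1:
                hi, lo = max(a.ac, b.ac), min(a.ac, b.ac)
                if hi > 0 and (hi - lo) / hi <= max_rel_diff:
                    # Report the smaller AC: the DNP cannot be more common
                    # than the rarer of its two SNPs.
                    yield Snp(a.chrom, a.pos, a.ref + b.ref, a.alt + b.alt, lo)
        prev_group = cur_group
```

For example, with made-up records `Snp("1", 100, "A", "C", 500)`, `Snp("1", 101, "G", "A", 490)` and `Snp("1", 102, "T", "G", 20)`, only the first pair is merged, into `Snp("1", 100, "AG", "CA", 490)`; the second pair is rejected because the allele counts are too far apart.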

@ckandoth
Collaborator

ckandoth commented Nov 1, 2016

I just figured out VEP's internal solution when reporting allele freqs from 1000g, ESP, or ExAC. They apparently report the freq of all non-reference alleles at the given POS, without trying to match the variant allele to the input. So in the above example, the DNP at position 41903782 would get assigned the allele freq of SNPs and/or indels at the same position. This would include the allele freq of the SNP at position 41903782.

@ckandoth
Collaborator

ckandoth commented Nov 2, 2016

Just hit a bottleneck. 6-41903782-A-C has AC=65104, and 6-41903783-G-A has AC=5264, but they are found in phase in only 2432 individuals. These phased ACs are only reported on the ExAC web pages, and are not in their downloadable VCF. Ideally, a future release will list multiallelic loci with separate allele counts for SNPs and MNPs. But right now, we just don't have the correct ACs for DNPs or MNPs.

So as a workaround, I will report the sum of ACs of all non-reference alleles at a given position, even if they don't match the input allele. Same as explained in my last comment here. So for an input of 6-41903782-AG-CA, the allele count reported will be the sum of ACs of 6-41903782-A-C and 6-41903782-A-T.
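A rough Python sketch of this workaround, not the code from the actual commit. The function names `sum_ac_by_position` and `lookup_ac` are hypothetical, records are assumed to be pre-parsed `(chrom, pos, ref, alt, ac)` tuples, and the AC used for 6-41903782-A-T in the usage note is a placeholder because the thread does not give it.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, str, str, int]  # (chrom, pos, ref, alt, ac)


def sum_ac_by_position(records: Iterable[Record]) -> Dict[Tuple[str, int], int]:
    """Sum the allele counts of every non-reference allele at each (chrom, pos)."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for chrom, pos, _ref, _alt, ac in records:
        totals[(chrom, pos)] += ac
    return dict(totals)


def lookup_ac(totals: Dict[Tuple[str, int], int], chrom: str, pos: int) -> int:
    """Return the summed AC at the input variant's start position.

    The input's REF and ALT are deliberately ignored: the lookup is by
    position only, so a DNP picks up the ACs of SNPs at the same POS.
    """
    return totals.get((chrom, pos), 0)
```

With `("6", 41903782, "A", "C", 65104)`, `("6", 41903783, "G", "A", 5264)` and a placeholder `("6", 41903782, "A", "T", 3)`, looking up the DNP 6-41903782-AG-CA returns 65107. The SNP at 41903783 does not contribute, since only the start position is matched, and the result overstates the true DNP count for the reason given above.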

@ckandoth
Collaborator

ckandoth commented Nov 3, 2016

This is now handled in 86e58e3

@ckandoth ckandoth closed this as completed Nov 3, 2016