cse identify not giving correct indel ranges #83

Open
yang-yangfeng opened this issue Jul 25, 2017 · 1 comment

Comments

@yang-yangfeng

It seems like cse identify is not inferring the actual length of indels from the VCF and is instead treating every variant as an SNV.

For example:

clinseq_7$ pwd
/gscmnt/gc2602/griffithlab/regtools/yafeng/hcc1395/clinseq_7
clinseq_7$ grep -R -i -n "13408142" hcc1395_filtered.vcf
25161:6 13408142 . CCAA . . PASS .
clinseq_7$ grep -R -i -n '1340814' ../output/cse_identify_filtered.tsv
184:6 13365894 13408142 JUNC00000183 1 - GT-AG 1 0 0 DA 1 1 1 GFOD1 ENST00000379284,ENST00000379287 6:13408141-13408142

Another example:

clinseq_7$ grep -R -i -n '45438295' hcc1395_filtered.vcf
494:1 45438295 . . ACAC . PASS .
clinseq_7$ grep -R -i -n '45438295' ../output/cse_identify_filtered_i50e5.tsv
85:1 45435716 45443987 JUNC00000084 36 - GT-AG 2 1 2 DA 1 1 1 EIF2B3 ENST00000360403,ENST00000372182,ENST00000372183,ENST00000477953,ENST00000480675,ENST00000487532,ENST00000497010 1:45438294-45438295
86:1 45438246 45443987 JUNC00000085 2 - GT-AG 1 0 1 DA 1 1 1 EIF2B3 ENST00000360403,ENST00000372182,ENST00000372183,ENST00000477953,ENST00000480675,ENST00000487532,ENST00000497010 1:45438294-45438295

This seems to be happening consistently: all of the variants listed in the last column of the cse identify output TSVs are reported as SNVs (i.e. stop - start = 1).

clinseq_7$ awk '{ $4 = $3 - $2 } 1' ../output/variants_filtered_E.bed | awk '{print $4}' | sort | uniq
1
clinseq_7$ awk '{ $4 = $3 - $2 } 1' ../output/variants_filtered_i50e5.bed | awk '{print $4}' | sort | uniq
1

This could well be related to the VCF, since we were having trouble with it before. Investigating.

@malachig

So there are two related problems here:

1. Currently regtools only takes into account the first two columns (chr, start) when reading VCF files. Because the length of a deletion is never looked at, large deletions might be discounted when only their start position is considered.

2. Since only the start position is considered, insertions and deletions are misrepresented as SNVs in the regtools output. This creates confusion when comparing back to the input VCF.
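To make the two points above concrete, here is a minimal Python sketch of how a variant's interval could be derived from POS, REF and ALT instead of from POS alone. This is not regtools code: the function name `classify_and_span` is hypothetical, the treatment of `.` as an empty allele is an assumption based on the two VCF records quoted earlier in this issue, and the choice to keep a one-base interval for insertions is just one possible convention.

```python
def classify_and_span(pos, ref, alt):
    """Return (kind, bed_start, bed_end) for one VCF record.

    pos is the 1-based VCF POS; the returned interval is 0-based, half-open,
    matching the start-end style seen in the cse identify output above.
    """
    # Assumption: "." stands for an empty allele, as in the quoted records.
    ref = "" if ref == "." else ref
    alt = "" if alt == "." else alt

    if len(ref) == 1 and len(alt) == 1:
        kind = "snv"
    elif len(ref) > len(alt):
        kind = "deletion"
    elif len(ref) < len(alt):
        kind = "insertion"
    else:
        kind = "mnv"

    start = pos - 1
    # Cover every reference base the record touches. Keep at least one base
    # so that an insertion still has a usable anchor interval (a convention
    # chosen for this sketch only).
    end = start + max(len(ref), 1)
    return kind, start, end


# The two records from this issue: (chrom, POS, REF, ALT).
records = [("6", 13408142, "CCAA", "."), ("1", 45438295, ".", "ACAC")]
for chrom, pos, ref, alt in records:
    kind, start, end = classify_and_span(pos, ref, alt)
    print(f"{chrom}:{start}-{end}\t{kind}")
```

Under these assumptions the first record would be reported as 6:13408141-13408145 (a four-base deletion) rather than 6:13408141-13408142, while the insertion keeps its one-base anchor but is labelled as an insertion rather than being indistinguishable from an SNV.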
