Currently regtools reads only the first two columns (chr, start) of each VCF record. Because only the start position is used, a large deletion can be missed unless the length of the deletion is taken into account as well.
Since only the start position is considered, insertions and deletions are misrepresented as SNVs in the regtools output. This creates confusion when comparing the output back to the input VCF.
It seems like cse identify is having trouble inferring the actual length of indels from the VCF and is simply assuming that all variants are SNVs.
For example:
clinseq_7$ pwd
/gscmnt/gc2602/griffithlab/regtools/yafeng/hcc1395/clinseq_7
clinseq_7$ grep -R -i -n "13408142" hcc1395_filtered.vcf
25161:6 13408142 . CCAA . . PASS .
clinseq_7$ grep -R -i -n '1340814' ../output/cse_identify_filtered.tsv
184:6 13365894 13408142 JUNC00000183 1 - GT-AG 1 0 0 DA 1 1 1 GFOD1 ENST00000379284,ENST00000379287 6:13408141-13408142
Another example:
clinseq_7$ grep -R -i -n '45438295' hcc1395_filtered.vcf
494:1 45438295 . . ACAC . PASS .
clinseq_7$ grep -R -i -n '45438295' ../output/cse_identify_filtered_i50e5.tsv
85:1 45435716 45443987 JUNC00000084 36 - GT-AG 2 1 2 DA 1 1 1 EIF2B3 ENST00000360403,ENST00000372182,ENST00000372183,ENST00000477953,ENST00000480675,ENST00000487532,ENST00000497010 1:45438294-45438295
86:1 45438246 45443987 JUNC00000085 2 - GT-AG 1 0 1 DA 1 1 1 EIF2B3 ENST00000360403,ENST00000372182,ENST00000372183,ENST00000477953,ENST00000480675,ENST00000487532,ENST00000497010 1:45438294-45438295
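To make the expected coordinates concrete, here is a minimal sketch of how a variant interval could be derived from POS and REF together rather than from POS alone. This is not the regtools implementation; `variant_interval` is a hypothetical helper, the 0-based start (POS - 1) is inferred from the `6:13408141-13408142` value in the output above, and treating `.` as an empty allele is an assumption based on how this VCF writes its indels.

```python
def variant_interval(pos, ref):
    """Return a 0-based, half-open (start, end) interval for one VCF record.

    Hypothetical helper for illustration only, not part of regtools.
    """
    # Assumption: "." means a missing/empty allele, as in the records above.
    ref_len = 0 if ref in (".", "") else len(ref)
    start = pos - 1                 # VCF POS is 1-based
    end = start + max(ref_len, 1)   # span at least one base so insertions stay visible
    return start, end


# Deletion from the first example (REF = CCAA)
print(variant_interval(13408142, "CCAA"))  # (13408141, 13408145)
# Insertion from the second example (REF = ".")
print(variant_interval(45438295, "."))     # (45438294, 45438295)
```

Under these assumptions the deletion would span four bases (13408141-13408145) instead of the single base currently reported, while the insertion would still occupy one position.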
It seems like this is happening consistently, as all of the variants listed in the last column of the cse identify output TSVs are SNVs (i.e. stop - start = 1).
clinseq_7$ awk '{ $4 = $3 - $2 } 1' ../output/variants_filtered_E.bed | awk '{print $4}' | sort | uniq
1
clinseq_7$ awk '{ $4 = $3 - $2 } 1' ../output/variants_filtered_i50e5.bed | awk '{print $4}' | sort | uniq
1
This could definitely be related to the VCF, since we were having trouble with it before. Investigating.
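One way to check the VCF side is to count how many records are actually indels and compare that with the all-ones length distribution from the BED files. The sketch below is a rough illustration, not an existing regtools feature: `classify` and `count_variant_types` are hypothetical names, `.` is assumed to mean an empty allele, and multi-allelic ALT fields are not split.

```python
from collections import Counter


def classify(ref, alt):
    """Label a record as snv, insertion, deletion, or other (hypothetical helper)."""
    # Assumption: "." stands for an empty allele, as in the records shown above.
    ref = "" if ref == "." else ref
    alt = "" if alt == "." else alt
    if len(ref) == 1 and len(alt) == 1:
        return "snv"
    if len(ref) > len(alt):
        return "deletion"
    if len(ref) < len(alt):
        return "insertion"
    return "other"


def count_variant_types(lines):
    """Count variant types over VCF data lines, skipping headers and blanks."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()  # handles tab- or space-separated columns
        counts[classify(fields[3], fields[4])] += 1
    return counts


# The two records quoted earlier in this issue
records = [
    "6 13408142 . CCAA . . PASS .",
    "1 45438295 . . ACAC . PASS .",
]
print(count_variant_types(records))
```

To run it on the real file, pass an open file handle instead of the list, e.g. `count_variant_types(open("hcc1395_filtered.vcf"))`. A non-zero insertion or deletion count alongside BED lengths that are all 1 would point at the length being dropped downstream of the VCF.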