Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
85 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
# Comparing VCF files | ||
|
||
Between two different pipeline runs we ended up with different VCF | ||
files. The starting point (BAMs) was the same, but in each pipeline | ||
different procedures may have been followed, i.e. the processing steps | ||
for variant calling were not exactly the same. The first | ||
freebayes+varscan2 pipeline (P1) I wrote after testing many callers | ||
including somatic sniper and strelka, so I should know that well | ||
enough. The second pipeline (P2) includes more variant callers. To | ||
find out how they compared and which output was preferred I decided to | ||
do some analysis using bio-vcf. | ||
|
||
## Comparing freebayes output | ||
|
||
The freebayes somatic variant calling files differed in size. Just | ||
looking at one sample P1 had 479 lines and P2 had 729. The germline | ||
calls, however, where comparable in size. But when I ran a diff it | ||
showed these differed significantly too: | ||
|
||
wc -l germline*vcf | ||
3527 germline1.vcf | ||
3500 germline2.vcf | ||
cat germline1.vcf|bio-vcf -e "[r.chr,r.pos]"|sort > germline1.txt | ||
cat germline2.vcf|bio-vcf -e "[r.chr,r.pos]"|sort > germline2.txt | ||
diff germline1.txt germline2.txt |wc | ||
1751 | ||
|
||
To zoom in on settings, lets look at read depth on chromosome 7 (-v | ||
and --num-threads=1 options I typically use when trying new filters | ||
because they give digestable output): | ||
|
||
cat germline1.vcf|bio-vcf -v --num-threads 1 --filter 'rec.chr=="7"' -e '[r.chr,r.pos,r.info.dp]'|sort | ||
7 90225928 34 | ||
7 95216394 69 | ||
7 97821397 97 | ||
7 97821398 98 | ||
7 97822115 96 | ||
7 97822210 94 | ||
7 98503849 109 | ||
7 98543545 68 | ||
7 98543546 69 | ||
7 98582562 38 | ||
7 98650051 48 | ||
7 99690690 78 | ||
7 99690747 27 | ||
7 99693552 34 | ||
|
||
cat germline2.vcf|bio-vcf -v --num-threads 1 --filter 'rec.chr=="7"' -e '[r.chr,r.pos,r.info.dp]'|sort | ||
7 90225928 121 | ||
7 95216394 534 | ||
7 97822115 1053 | ||
7 97822210 1044 | ||
7 97834704 249 | ||
7 98503849 1579 | ||
7 98547176 59 | ||
7 98553739 21 | ||
7 98648517 75 | ||
7 98650051 344 | ||
7 99690690 455 | ||
7 99690747 168 | ||
7 99693552 107 | ||
|
||
OK, this is informative. P1 called variants after removing duplicate reads. P2 | ||
did not. That explains the different in number of variants called. | ||
|
||
Unfortunately the sequencing concerns an FFPE dataset. FFPE degrades | ||
over time and DNA changes. In itself this is not a problem because we | ||
sequence many cells and the changed ones do not necessarily | ||
dominate. We do, however, amplify the DNA before sequencing through a | ||
PCR-type process. This means that randomly these FFPE changes may | ||
become dominant at a certain position and variant callers score them | ||
as genuine variants. I have studied this data and there is ample | ||
evidence of this effect. The only way to address this is by removing | ||
duplicate reads - so the amplified reads get compressed into one | ||
(theoretically, because sometimes there are multiple errors confusing | ||
things a bit). Removing duplicates is the *only* way and this can not | ||
happen *after* variant calling. | ||
|
||
This means P2 is out of the window. It is useless data also for the | ||
other variant callers. I don't even have to check the somatic calling. | ||
|
||
## Conclusion | ||
|
||
A simple read depth check with bio-vcf proved that P2 had no | ||
merit. Either we rerun it after removing duplicates or we rely on P1. |