-
Notifications
You must be signed in to change notification settings - Fork 703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can we generate .VCF file with information for all position including homozygous references and not in block format #318
Comments
Hi @dhwani2410 The VCF DeepVariant generates does contain hom-ref calls, which are labeled with "RefCall" instead of "PASS" in the FILTER column. These are the positions that DeepVariant evaluated candidate variants at (due to signal in some reads) but found to be hom-ref. The gVCF includes many more positions that were not evaluated and is only needed for specific applications. It uses blocks to keep the file size reasonably short. For more information on the gVCF format, see our documentation page on that topic: https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-gvcf-support.md From your question, it sounds like the VCF file has the information you need and works with the downstream tools you mentioned, so you may not need the gVCF file. |
Is there an option where we can force the deep variant to give information for all base position? and later filter if they are bad quality of not
When we see allele depth(AD) for both rows, we observed that both the row has the almost same number of reads supporting it(107,106 and 106,107). Why is no call(./.) given for first and not for second? Can you please elaborate in why are multiple places given ./. despite reads supporting it |
Hi @dhwani2410 Is there any chance that this is from a sample for which publicly available data is present (e.g. HG001). I could look at pileups in the region for a more informed opinion. To your question about different variants - DeepVariant classifies event on a position-by-position basis, so the classifier does not have information about what output it gave at nearby variants. This can cause you to know a bit more than the classifier when looking at its full output. But it explains why the classifier can have a different result when two variants are putatively phased. One phenomenon that we observe with DeepVariant is that it often calls positions with substantial support as reference when they are nearby to other variants. This seems to be related to regions of segmental duplication, where the apparent variants come from reads that are mismapped to the region. We have a poster which documents this effect: https://pbs.twimg.com/media/ERe2bSyWsAcE00h?format=jpg&name=4096x4096 So DeepVariant seems to learn something of the signature of segmental duplication and is likely not calling these positions as variant for this reason. On average, DeepVariant seems to be correct in these cases (that they are not true variants) more often than it is incorrect, which is why it has learned this pattern. However, it will not be correct in all cases. Either way, when you observe this pattern, it is a good idea to look at the region and consider whether the apparent variants may come from a duplication of related sequence. |
@AndrewCarroll Thanks for your response and the VCF example I have shared is my own data and not public data. Could you please refer to which part of my query you have answered? Maybe, I am not clear with your replies and which part does it address? |
Hi @dhwani2410
I think this request corresponds to wanting gVCF output. When you use the --output_gvcf option. This will write reference ranges for regions of the genome that did not produce any candidates for variant calling. I did not answer this question in the prior response.
My prior answer addresses this question - that there are sometimes regions which may have markers of segmental duplication instead of variants, and DeepVariant will sometimes call positions as reference. It makes this decision on a position-by-position basis and as a result, some of these calls may be inconsistent with what appears from a visual phasing of the reads. |
@AndrewCarroll I agree GVCF gives more information, but I need to process the VCF file and downstream processing tools accepts only VCF format. Just like in GATK we can get information for all bases using "-ERC BP resolution", is it possible here in deep variant to get information of all variant as well as non-variant sites in VCF format |
Hi @dhwani2410 No, DeepVariant does not have a BP resolution option. |
I would like to have this feature as well |
I have a few queries about your output:-
Can we generate VCF file with each position information?
Is it possible to convert g.VCF to VCF since commonly used software used VCF format? What does QUAL "0" mean in VCF? Some FILTER column with QUAL 0 is marked as "RefCall" while some as "." Can you please explain how is filter decided in VCF?
Both the row has QUAL value "0". Then why is first row has given homozygous call and the second row with now call?
RefCall:- Which would mean Homozygous reference
PASS:- Can be homozygous/Heterozygous
. :- what does this mean?
The text was updated successfully, but these errors were encountered: