Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we generate .VCF file with information for all position including homozygous references and not in block format #318

Closed
dhwani2410 opened this issue Jun 17, 2020 · 8 comments

Comments

@dhwani2410
Copy link

dhwani2410 commented Jun 17, 2020

I have a few queries about your output:-

  1. Can we generate VCF file with each position information?

  2. Is it possible to convert g.VCF to VCF since commonly used software used VCF format? What does QUAL "0" mean in VCF? Some FILTER column with QUAL 0 is marked as "RefCall" while some as "." Can you please explain how is filter decided in VCF?

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DRR015476
6 1 . A <*> 0 . END=587 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
-- -- -- -- -- -- -- -- -- --
6 588 . A <*> 0 . END=588 GT:GQ:MIN_DP:PL ./.:0:1:29,3,0

Both the row has QUAL value "0". Then why is first row has given homozygous call and the second row with now call?

  1. I have observed that there are three options available in filter column

RefCall:- Which would mean Homozygous reference
PASS:- Can be homozygous/Heterozygous
. :- what does this mean?

@dhwani2410 dhwani2410 changed the title Can we generate .VCF file with information for all bases? Particularly homozygous references Can we generate .VCF file with information for all position including homozygous references and not in block format Jun 17, 2020
@MariaNattestad
Copy link
Collaborator

Hi @dhwani2410

The VCF DeepVariant generates does contain hom-ref calls, which are labeled with "RefCall" instead of "PASS" in the FILTER column. These are the positions that DeepVariant evaluated candidate variants at (due to signal in some reads) but found to be hom-ref.

The gVCF includes many more positions that were not evaluated and is only needed for specific applications. It uses blocks to keep the file size reasonably short. For more information on the gVCF format, see our documentation page on that topic: https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-gvcf-support.md

From your question, it sounds like the VCF file has the information you need and works with the downstream tools you mentioned, so you may not need the gVCF file.

@dhwani2410
Copy link
Author

dhwani2410 commented Jun 17, 2020

https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-gvcf-support.md
@MariaNattestad Thank you for the quick response.I guess my question was not clear so, I will try to rewrite my query.

  1. I wanted to know the base at all coordinates in a list of interval regions. The VCF file does not output all the positions in that interval region. If VCF file is able to give hom-ref, the. why are multiple locations missed? and out of around 1000bp, I am able to get information of around 200 bp?

Is there an option where we can force the deep variant to give information for all base position? and later filter if they are bad quality of not
Just like GATK has BP resolution option when we run the variant caller.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DRR015476
6 5285 . T G 1.6 RefCall . GT:GQ:DP:AD:VAF:PL ./.:5:213:107,106:0.497653:0,3,31
-- -- -- -- -- -- -- -- -- --
6 5288 . G A 22.1 PASS . GT:GQ:DP:AD:VAF:PL 0/1:20:220:106,107:0.486364:22,0,25

When we see allele depth(AD) for both rows, we observed that both the row has the almost same number of reads supporting it(107,106 and 106,107). Why is no call(./.) given for first and not for second? Can you please elaborate in why are multiple places given ./. despite reads supporting it

@AndrewCarroll
Copy link
Collaborator

Hi @dhwani2410

Is there any chance that this is from a sample for which publicly available data is present (e.g. HG001). I could look at pileups in the region for a more informed opinion.

To your question about different variants - DeepVariant classifies event on a position-by-position basis, so the classifier does not have information about what output it gave at nearby variants. This can cause you to know a bit more than the classifier when looking at its full output. But it explains why the classifier can have a different result when two variants are putatively phased.

One phenomenon that we observe with DeepVariant is that it often calls positions with substantial support as reference when they are nearby to other variants. This seems to be related to regions of segmental duplication, where the apparent variants come from reads that are mismapped to the region. We have a poster which documents this effect: https://pbs.twimg.com/media/ERe2bSyWsAcE00h?format=jpg&name=4096x4096

So DeepVariant seems to learn something of the signature of segmental duplication and is likely not calling these positions as variant for this reason. On average, DeepVariant seems to be correct in these cases (that they are not true variants) more often than it is incorrect, which is why it has learned this pattern. However, it will not be correct in all cases.

Either way, when you observe this pattern, it is a good idea to look at the region and consider whether the apparent variants may come from a duplication of related sequence.

@dhwani2410
Copy link
Author

@AndrewCarroll Thanks for your response and the VCF example I have shared is my own data and not public data.

Could you please refer to which part of my query you have answered? Maybe, I am not clear with your replies and which part does it address?

@AndrewCarroll
Copy link
Collaborator

Hi @dhwani2410

I wanted to know the base at all coordinates in a list of interval regions. The VCF file does not output all the positions in that interval region. If VCF file is able to give hom-ref, the. why are multiple locations missed? and out of around 1000bp, I am able to get information of around 200 bp? Is there an option where we can force the deep variant to give information for all base position? and later filter if they are bad quality of not
Just like GATK has BP resolution option when we run the variant caller.

I think this request corresponds to wanting gVCF output. When you use the --output_gvcf option. This will write reference ranges for regions of the genome that did not produce any candidates for variant calling. I did not answer this question in the prior response.

Why is no call(./.) given for first and not for second? Can you please elaborate in why are multiple places given ./. despite reads supporting it

My prior answer addresses this question - that there are sometimes regions which may have markers of segmental duplication instead of variants, and DeepVariant will sometimes call positions as reference. It makes this decision on a position-by-position basis and as a result, some of these calls may be inconsistent with what appears from a visual phasing of the reads.

@dhwani2410
Copy link
Author

@AndrewCarroll
I think this request corresponds to wanting gVCF output. When you use the --output_gvcf option. This will write reference ranges for regions of the genome that did not produce any candidates for variant calling. I did not answer this question in the prior response.

I agree GVCF gives more information, but I need to process the VCF file and downstream processing tools accepts only VCF format. Just like in GATK we can get information for all bases using "-ERC BP resolution", is it possible here in deep variant to get information of all variant as well as non-variant sites in VCF format

@MariaNattestad
Copy link
Collaborator

Hi @dhwani2410

No, DeepVariant does not have a BP resolution option.
We are not currently planning to add one, but we can keep this in mind in case we see more demand for such a feature in the future.

@ESDeutekom
Copy link

ESDeutekom commented Feb 13, 2023

I would like to have this feature as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants