-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Medaka variant calling, annotation, classification: zigosity determination; supporting reads; #244
Comments
Hi Maria, Regarding your question 1) about the annotation tool, there is a known bug in the annotation tool affecting the DP field, but not the DPS field. As you say, the DPS field is a count of reads spanning the variant pos -25 -> pos + 25. Hence they do overlap the variant and have sufficient anchoring sequence either side of the variant to allow reliable realignment to the reference and alternative sequence(s). Indeed one cannot assume that all of these reads support the alternative allele - this is what the SR field is for, which bring us onto question 2). With regards to how medaka calculates zygosity, medaka factors the diploid variant calling problem into two independent sets of consensus and variant calls on each haplotype (see the docs for more information). Whatshap is used to partition reads by haplotype. Reads which can't be haplotyped (e.g. because they don't overlap with a heterozygous SNP) will be used in forming the consensus for both haplotypes. Variant calls on each haplotype are finally combined into a diploid VCF. If a variant is found on both haplotypes it is homozygous, if it is found on just one it will be heterozygous. It is possible that your variant call is homozygous because the reads supporting either A/G are spread across both hapolotypes, and medaka predicts A on both. This is consistent with the medaka quality scores for the consensus call on each haplotype (q1 and q2) are both around 26. What were the genotypes of the neighboring variants (e.g. C->T and C->A)? It is also worth mentioning that medaka's neural network model can take into account several other factors such as local sequence context and strand bias when predicting the consensus sequence. This means it can make predictions which differ to the result of a naive base-counting consensus. Whilst this does make it's predictions harder to interpret, it also the reason why medaka provides higher accuracy overall. It's worth mentioning that medaka variant calls can be further refined by regenotyping them with DeepVariant. We have not tested this on non-human datasets, but it might be worth trying. A tutorial on how to do this can be found here. Regarding the read counts, if you are able to share a bam containing the reads around that variant and your reference and vcf I can look into this and see if there is a bug in the code. |
In case you are able to share these files and would prefer to do so privately, I've just emailed you a secure file-sharing link. |
Dear Mike,
I will get back to you as soon as possible with a few comments and questions. In the meantime I have sent you the relevant files via the upload link you shared.
Thank you for your help.
Kind regards.
Maria
…--------------------------------------#NoVAT4Science----------------------------------------
Maria João Mendes de Carvalho, PhD
Genome Sequencing and Analysis Lab - iBiGEN
Department of Medical Sciences & Institute for Biomedicine – DCM & iBiMED
University of Aveiro
Agras do Crasto
3810-193 Aveiro
Portugal
https://www.ua.pt/dcm/
http://www.ua.pt/ibimed/<https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.ua.pt%2Fibimed%2F&data=01%7C01%7CCarvalhoM%40cardiff.ac.uk%7Ca33a87b4c26442ca698e08d6c8cda90c%7Cbdb74b3095684856bdbf06759778fcbc%7C1&sdata=hXIxQXdITqSdGL%2F%2BA%2FyoWurwzMx0CFBj9ow540ty%2FDk%3D&reserved=0>
Phone number: +351234370200 (xt. 22126)
Email address: mjcarvalho@ua.pt<mailto:mjcarvalho@ua.pt>
Science should be a fundamental activity in every country. Research in Portugal pays 23%VAT challenging the advances in RTD and the competitiveness in an already EU widening country. #NoVAT4Science
From: mwykes <notifications@github.com>
Reply to: nanoporetech/medaka <reply@reply.github.com>
Date: Friday, 15 January 2021 at 15:38
To: nanoporetech/medaka <medaka@noreply.github.com>
Cc: Maria João Carvalho <mjcarvalho@ua.pt>, Author <author@noreply.github.com>
Subject: Re: [nanoporetech/medaka] Medaka variant calling, annotation, classification: zigosity determination; supporting reads; (#244)
Hi Maria,
Thanks for your questions. One thing before that though - you mention that you are using canu for read-correction; are you using corrected reads for medaka variant calling? If so, I would not recommend doing this; Medaka is trained on uncorrected reads and giving medaka corrected reads may result in lower accuracy variant calls unless you retrain a medaka model on corrected reads. Furthermore, our medaka variant calling models are only trained on depth up to 60X, so may not work as well at the very high depth you are using.
Regarding your question 1) about the annotation tool, there is a known bug in the annotation tool affecting the DP field, but not the DPS field. As you say, the DPS field is a count of reads spanning the variant pos -25 -> pos + 25. Hence they do overlap the variant and have sufficient anchoring sequence either side of the variant to allow reliable realignment to the reference and alternative sequence(s). Indeed one cannot assume that all of these reads support the alternative allele - this is what the SR field is for, which bring us onto question 2).
With regards to how medaka calculates zygosity, medaka factors the diploid variant calling problem into two independent sets of consensus and variant calls on each haplotype (see the docs<https://nanoporetech.github.io/medaka/snp.html> for more information). Whatshap is used to partition reads by haplotype. Reads which can't be haplotyped (e.g. because they don't overlap with a heterozygous SNP) will be used in forming the consensus for both haplotypes. Variant calls on each haplotype are finally combined into a diploid VCF. If a variant is found on both haplotypes it is homozygous, if it is found on just one it will be heterozygous. It is possible that your variant call is homozygous because the reads supporting either A/G are spread across both hapolotypes, and medaka predicts A on both. This is consistent with the medaka quality scores for the consensus call on each haplotype (q1 and q2) are both around 26. What were the genotypes of the neighboring variants (e.g. C->T and C->A)? It is also worth mentioning that medaka's neural network model can take into account several other factors such as local sequence context and strand bias when predicting the consensus sequence. This means it can make predictions which differ to the result of a naive base-counting consensus. Whilst this does make it's predictions harder to interpret, it also the reason why medaka provides higher accuracy overall.
It's worth mentioning that medaka variant calls can be further refined by regenotyping them with DeepVariant. We have not tested this on non-human datasets, but it might be worth trying. A tutorial on how to do this can be found here.
Regarding the read counts, if you are able to share a bam containing the reads around that variant and your reference and vcf I can look into this and see if there is a bug in the code.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#244 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ASO6EVKIKSQD73MXEN2WC6TS2BOM5ANCNFSM4WCYDLQQ>.
|
Hi Maria, In my last comment, I forgot to address your question about the SR field, (Is it possible that medaka has switched the number of reads supporting the reference and the alteration?). Yes, indeed, something does seem to be wrong here, I will look into this further and update the issue. Mike |
Dear Mike,
Thank you for your help. I need some time to have a look at this and will get back to you.
Kind regards.
Maria
…--------------------------------------#NoVAT4Science----------------------------------------
Maria João Mendes de Carvalho, PhD
Genome Sequencing and Analysis Lab - iBiGEN
Department of Medical Sciences & Institute for Biomedicine – DCM & iBiMED
University of Aveiro
Agras do Crasto
3810-193 Aveiro
Portugal
https://www.ua.pt/dcm/
http://www.ua.pt/ibimed/<https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.ua.pt%2Fibimed%2F&data=01%7C01%7CCarvalhoM%40cardiff.ac.uk%7Ca33a87b4c26442ca698e08d6c8cda90c%7Cbdb74b3095684856bdbf06759778fcbc%7C1&sdata=hXIxQXdITqSdGL%2F%2BA%2FyoWurwzMx0CFBj9ow540ty%2FDk%3D&reserved=0>
Phone number: +351234370200 (xt. 22126)
Email address: mjcarvalho@ua.pt<mailto:mjcarvalho@ua.pt>
Science should be a fundamental activity in every country. Research in Portugal pays 23%VAT challenging the advances in RTD and the competitiveness in an already EU widening country. #NoVAT4Science
From: mwykes <notifications@github.com>
Reply to: nanoporetech/medaka <reply@reply.github.com>
Date: Tuesday, 19 January 2021 at 18:29
To: nanoporetech/medaka <medaka@noreply.github.com>
Cc: Maria João Carvalho <mjcarvalho@ua.pt>, Author <author@noreply.github.com>
Subject: Re: [nanoporetech/medaka] Medaka variant calling, annotation, classification: zigosity determination; supporting reads; (#244)
Hi Maria,
In my last comment, I forgot to address your question about the SR field, (Is it possible that medaka has switched the number of reads supporting the reference and the alteration?). Yes, indeed, something does seem to be wrong here, I will look into this further and update the issue.
Mike
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#244 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ASO6EVOWKPBMIOMMLP4TIPTS2XFRDANCNFSM4WCYDLQQ>.
|
Hi Maria, I have checked the SR field. It seems the problem is related to the fact that there seem to be multiple ways in which reads are aligning around that SNP - with one set of reads having multiple ~50 bp insertions within 200 bp of the SNP and another set having a single larger 462 bp insertion at ~3130. The 47 bp insertion within the 25 bp padding of the SNP seems to trip up the annotation tool and results in a different alignment of the trimmed read sequences around the SNP, and hence affects the SR counts. It it possible that this could be an artifact of the read correction - I would recommend repeating the analysis on the uncorrected reads. It may also be preferable not to error-correct reads when dealing with a diploid genome; unless you are using ploidy-aware methods, you may run the risk of correcting "errors" in reads which are actually heterozygous variants. The impact of the 47-base insertion can be avoided by setting
Having seen the limited use of these SR annotations, we have already made the read-realignment part of the annotation optional in the latest pre-release of medaka and are considering removing it from the next release, such that only depth-related annotations are added. Let me know if you have any other questions about this. Mike |
Dear Mike,
Thank you for your help and patience.
I have re-run the analysis without reads’ correction with canu before read mapping against the reference genome using minimap2, and the there’s no longer a variant called at the location we have been discussing (3038).
Below are the print screens of IGV showing the bam and phased.bam files.
In fact, with Canu corrected reads, medaka called 95167 variants and 110716 variants were called with uncorrected reads. This is, of course, raw data without applying the filters I used in the upstream analysis, which gave 53512 SNPs for the analysis with Canu corrected reads and 78228 SNPs with uncorrected reads. For this organism, we expect to obtain around 60,000 SNPs.
I am not sure what happened here. Perhaps using Canu does not result in lower accuracy, in fact it reduces the number of variants called which may indicate there’s higher accuracy?
Also, regarding the heterozygous/homozygous genotype issue, the majority of SNPs for this organism are heterozygous, which was not seen using medaka with Canu corrected reads, if we consider the supporting reads for alteration vs reference. In fact, it is the other way around: the big majority of SNPs called were homozygous.
In conclusion, I am not sure what to think.
bam file (filtering for secondary and supplementary alignments)
[Graphical user interface Description automatically generated with low confidence]
phased.bam file (filtering for secondary and supplementary alignments)
[Graphical user interface, chart Description automatically generated]
Thanks.
Kind regards.
Maria
|
Dear Mike I have Candida albicans isolates ONT and Illumina sequenced. I would like to call variants on these genomes, hence I used medaka. I would like to improve the previous analysis I have done and be able to interpret medaka outputs to filter out high quality called SNPs. Thanks |
Hi Maria, Apologies for the slow reply. I don't think I can offer any advice other than that I have already given, namely how we would recommend you use medaka, and that DeepVariant genotyping of medaka or PEPPER variants can improve accuracy. It's worth noting that Medaka is supported as a "Research Release" only. Research releases are provided as technology demonstrators to provide early access to features or stimulate Community development of tools. Support for this software is minimal and is only provided directly by the developers. Kind regards, |
I am calling variants on C. albicans strains genomes using the following workflow:
Read correction with Canu (v2.0)
Minimap2 for mapping reads against the reference (v2.17)
Variant calling (medaka_variant -f <REFERENCE.fasta> -i <reads.bam>), variant annotation (medaka tools annotate variants.vcf reference.fasta alignments.bam variants.annot.vcf), variant classification with Medaka (medaka tools classify_variants variants.annot.vcf) Medaka v1.2.1
Variant annotation and functional effect prediction with SnpEff (v4.3)
I came across the following situation:
Ca22chr3A_C_albicans_SC5314 3038 . G A 26.3085 PASS AR=0,1;DP=189;DPS=121,68;DPSP=199;SC=14307,8962,13935,8754;SR=116,67,8,7;pos1=3038;pos2=3038;q1=26.819;q2=25.798;type=snp;ANN=A|upstream_gene_variant|MODIFIER|C3_00020W_A|C3_00020W_A|transcript|C3_00020W_A-T|protein_coding||c.-961G>A|||||961|,A|upstream_gene_variant|MODIFIER|C3_00040W_A|C3_00040W_A|transcript|C3_00040W_A-T|protein_coding||c.-2933G>A|||||2933|,A|downstream_gene_variant|MODIFIER|C3_00010C_A|C3_00010C_A|transcript|C3_00010C_A-T|protein_coding||c.*790C>T|||||790|,A|downstream_gene_variant|MODIFIER|C3_00030C_A|C3_00030C_A|transcript|C3_00030C_A-T|protein_coding||c.*1865C>T|||||1865|,A|intergenic_region|MODIFIER|CHR_START-C3_00010C_A|CHR_START-C3_00010C_A|intergenic_region|CHR_START-C3_00010C_A|||n.3038G>A|||||| GT:GQ 1|1:26
1. DPSP = Depth of reads spanning pos +-25; I now understand that these reads may not necessarily cover the variant position.
In this case, we have 199 reads which span position 3038 with 25bp upstream and downstream the position. I understand this is useful to phase the haplotypes; I don’t understand how we can assume these reads support the alteration.
2. We have 183 reads supporting the reference (SR fwd+rev; 116+67) and 15 reads supporting the alteration (SR fwd + rev; 8+7); in total, 198. I suppose these are only reads spanning position +-25bp at position (DPSP); one read was ambiguous (AR). However the GT of the alteration is 1|1, which means it’s homozygous; I don’t understand how the program classifies this variant as homozygous.
Additionally, looking at this position in IGV (print screen attached), it showed total count 290: 183 reads for the alteration (A) and 107 reads for the reference (G). This is not in agreement with the vcf (I understand this is the quality filtering in medaka, but I’d like to understand more about the quality filtering done throughout medaka) and does not support an homozygous alteration.
Would you, please, let me know how medaka calculates the zygosity of the alteration? Is it possible that medaka has switched the number of reads supporting the reference and the alteration?
Thank you.
Kind regards.
Maria
The text was updated successfully, but these errors were encountered: